Overspending on Redshift Spectrum? How to Monitor and Optimize Redshift Spectrum Cost

Rajesh Saluja
6 min read · Apr 8, 2020

What is Amazon Redshift Spectrum?

Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very quickly against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without making copies of the data for each cluster. You need a Redshift cluster to use Redshift Spectrum. With Redshift Spectrum, users pay for the amount of data scanned by the queries they run, in addition to the cost of the Redshift instances and S3 storage.

What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

What is Amazon S3?

Amazon Simple Storage Service (S3) is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize the benefits of scale and to pass those benefits on to developers.

What are we trying to solve?

How much are we spending on Redshift Spectrum? Who is spending it? How much is spent per day? What is the spend trend? What are the ten most expensive queries? Accessing data through Spectrum is not a fixed cost, and the bill may skyrocket if it is not tracked and controlled. This blog walks through methods to generate the usage data, dynamically forecast cost based on each day's spend, build visualization reports, create monitoring alerts, determine action plans, and more.

You can’t reduce till you start measuring!

Assumptions

Several options are available to explore data residing on S3, including EMR, Redshift Spectrum, and Athena. This blog focuses on exploring data on S3 using Redshift Spectrum through external schemas that point to a Hive metastore, as sketched below.
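A minimal sketch of such an external schema follows; the metastore URI, port, IAM role ARN, and table name are placeholders, not values from this post.

```sql
-- Hypothetical external schema backed by a Hive metastore; replace the
-- URI, port, and IAM role with your own values.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_hive
FROM HIVE METASTORE
DATABASE 'default'
URI 'hive-metastore.example.internal' PORT 9083
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

-- Spectrum queries then read S3-backed tables through this schema.
SELECT COUNT(*) FROM spectrum_hive.some_external_table;
```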

Methodology

If I can’t measure it, I can’t reduce or improve it. In 2003, swimming champion Michael Phelps swam the 200-meter butterfly in 1:53.93 to break his own world record of 1:54.58 set in 2001, becoming the first man to swim it in under 1:54.00.

The key here is measurement: we started measuring the cost of each Redshift Spectrum query and populated a spectrum usage history table on a daily basis from a spectrum usage view. The usage data can be persisted in a database or S3 location of your choice, but we chose Redshift as the repository. Collecting the data over a period of time revealed the usage patterns of the tables, and the possibilities for acting on it became endless.

A few examples of data exploration from the spectrum usage view and the spectrum usage history table can answer the questions below.

Which users are spending more than $x on a single query?

What is the peak time for data extraction through Spectrum?

How does daily spend on weekdays compare with weekends?

What is the monthly/yearly cost trend?

What is the standard deviation of spend for a specific day?

What does daily cost per user look like?

Who are the top X consumers? (see the sketch after this list)

How can we detect anomalies and identify runaway queries for optimization?

What does the forecast look like for setting the budget?
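As an illustration for the top-consumers question, here is a minimal sketch that assumes the spectrum_usage_history table built in the next section (the table and column names are this post's conventions, not Redshift system objects).

```sql
-- Top 10 Spectrum consumers by cost over the last 30 days.
SELECT username,
       SUM(cost_usd) AS total_cost_usd
FROM spectrum_usage_history
WHERE starttime >= DATEADD(day, -30, GETDATE())
GROUP BY username
ORDER BY total_cost_usd DESC
LIMIT 10;
```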

Collecting the Spectrum Usage Metrics

The spectrum usage view was built by extracting data from the svl_s3query_summary system view, which provides s3_scanned_bytes, the query ID, and the start and end times; the view can be joined with other objects such as pg_user to extract the username and svl_qlog to determine whether the query was aborted. Once you know s3_scanned_bytes, you can determine the cost by multiplying by the Spectrum price ($5 per TB scanned, and if a query scans less than 10 MB you round it up to 10 MB). The data was persisted in the spectrum usage history table because data in the underlying system objects above is short-lived, retained for only a few days.
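A minimal sketch of the view and the daily load follows. Object names such as spectrum_usage_v and spectrum_usage_history are this post's conventions, and the constants implement the $5/TB rate with the 10 MB per-query minimum described above.

```sql
-- Spectrum usage view: per-query scanned bytes and estimated cost.
CREATE OR REPLACE VIEW spectrum_usage_v AS
SELECT pu.usename                               AS username,
       s.query,
       MIN(s.starttime)                         AS starttime,
       MAX(s.endtime)                           AS endtime,
       MAX(q.aborted)                           AS aborted,
       SUM(s.s3_scanned_bytes)                  AS s3_scanned_bytes,
       -- $5 per TB scanned, with a 10 MB per-query minimum
       GREATEST(SUM(s.s3_scanned_bytes), 10 * 1024 * 1024)
         / (1024.0 * 1024 * 1024 * 1024) * 5.0  AS cost_usd
FROM svl_s3query_summary s
JOIN pg_user pu ON pu.usesysid = s.userid
LEFT JOIN svl_qlog q ON q.query = s.query
GROUP BY pu.usename, s.query;

-- Daily load into a persistent history table (assumed to mirror the
-- view's columns), since the system views above retain data for only
-- a few days.
INSERT INTO spectrum_usage_history
SELECT *
FROM spectrum_usage_v
WHERE starttime >= DATEADD(day, -1, TRUNC(GETDATE()))
  AND starttime <  TRUNC(GETDATE());
```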

Visualization

Amazon QuickSight and VMware Wavefront were used for data visualization: QuickSight dashboards for leadership cost budgeting, and Wavefront dashboards giving all users in the organization visibility into daily cost trends, the most expensive queries, and the cost of the queries they are running. QuickSight can extract the spectrum usage history data directly from Redshift, and various dashboards can be built and emailed on a daily basis (a few sample dashboards are attached further down in the blog). A Lambda function was created to extract the data from the Redshift table, process it, and send it in the required format as Wavefront metrics so the Wavefront dashboards could publish the data in near real time.

Architecture Diagram

Sample QuickSight Screenshots — Spectrum Usage History

Daily cost trend on Spectrum for the last 30 days

Rolling trend of query execution counts for the last 30 days

Moving forecast using the Random Cut Forest machine learning algorithm (the yellow shading denotes the forecast)

Conclusion

Answering fundamental cost-control questions is often difficult, but once you start measuring you can track and control. The cost pattern shows which objects are heavily used and yields the list of objects consuming most of the cost. If needed, you can decide to build processes that move the heavily scanned objects from S3 into Redshift on a defined interval to reduce Spectrum cost. You can also generate alerts on cost usage to keep the budget under control.

You can add a custom rule on the “spectrum_scan_size_mb” metric under Workload Management (WLM) on Redshift to stop runaway queries from chewing up the budget, as sketched below. It would be great if Redshift offered an option to estimate the cost before running the query itself, perhaps as part of the explain plan, or if AWS provided a method to control Spectrum cost at a broader level.
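A sketch of such a query monitoring rule in the wlm_json_configuration parameter format; the rule name, queue settings, and the 100 GB threshold are illustrative, not values from this post.

```json
[
  {
    "query_concurrency": 5,
    "rules": [
      {
        "rule_name": "spectrum_scan_limit",
        "predicate": [
          {
            "metric_name": "spectrum_scan_size_mb",
            "operator": ">",
            "value": 102400
          }
        ],
        "action": "abort"
      }
    ]
  }
]
```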

Author: Rajesh Saluja

Contributors: KusumKumari, Dilip Rachamalla

Reviewers: Giriraj Bagdi, Anil Madan

The blog content is a team effort, and I would like to thank the contributors, reviewers, and several other cross-functional team members.
