Head to Head: Amazon Redshift vs. Apache Spark — Which Is Preferred for Big Data?

Rajesh Saluja
5 min read · Apr 2, 2020

What Is Big Data Processing?

Big data processing is the practice of handling very large volumes of data. The three V's (Volume, Variety, and Velocity) are the defining properties of big data: volume refers to the amount of data, variety to the number of types of data, and velocity to the speed at which data arrives and is processed.

What Are We Trying to Solve?

How should I process big data? What are the choices, and how do I decide between them? Some organizations want to set up big data infrastructure from scratch; others want to move from one MPP platform to another, or find new ways to do processing and exploration. This blog walks through the pros and cons along several dimensions, assuming the infrastructure runs on AWS.

Use of Big Data

Big data has been used across industry to provide customer insights, analyzing and predicting customer behavior from data derived from many sources. Analytics on big data helps organizations identify new opportunities and make intelligent business moves at the right time, staying ahead of competitors and in return earning higher profits and a more satisfied customer experience. The world's most valuable resource is no longer oil, but data.

While there is no doubt that all of this is extremely important to maintaining a competitive edge, none of it would be possible without systems that capture, stream, and store all of the required transactional and non-transactional data, and that can scale to support peak transaction volumes, peak data-arrival rates, and peak ingestion rates. In addition, to maximize the value of the new insights produced in big data analytical environments, these analytical systems need to be integrated back into core operational transaction processing systems, so that prescriptive insights are available to everyone who needs them to continuously optimize operations and maximize operational effectiveness.

Therefore, although much of the focus in the world of big data is on analytics, transaction processing systems remain mission critical to big data success. Transaction systems have always been mission critical, so speed, availability, and scalability are central to their operation. Transaction processing systems, along with the non-transactional data systems that accompany them (such as clickstream web logs), provide the data needed by both traditional and big data analytical systems. They also make the insights produced by analytical systems available to the right people at the right time, in the context of everyday operational activities. What this implies is that, even with the new focus on big data analytics, the basic business requirement is still the same: organizations need, more than ever, to close the loop between analytical systems and their core operational transaction processing systems to maximize success.

With that background, the assumption here is that transactional and non-transactional data lands in a data lake on Amazon S3. This post won't take a deep dive into how to pull or push data from source transactional systems into the data lake. Instead, it covers the analytics side: the available choices, capabilities, and platforms for data marts, where mammoth amounts of data are structured and aggregated at high volume to build the key KPI dashboards that measure business performance.

Choices to Explore

Several options are available to process data and build aggregates in a data mart. This post focuses on the pros and cons of using AWS Redshift, Redshift Spectrum, and Apache Spark to process and explore data. You can lean on it whether you are moving from one MPP platform to another or are in the initial phase of choosing technologies for your big data journey.

  • Redshift: Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.
  • Redshift Spectrum (Spectrum): Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.
  • Apache Spark: Apache Spark is an open source, general-purpose distributed computing engine used for processing and analyzing large amounts of data. Just like Hadoop MapReduce, it works with the cluster to distribute data and process it in parallel.
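To make the Spectrum pattern concrete, here is a minimal sketch of querying S3 data in place from Redshift. The schema, table, column, role ARN, and S3 path names are illustrative assumptions, not from the original post:

```sql
-- Illustrative only: schema, table, IAM role, and S3 path are assumed names.
-- Register an external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Define an external table over Parquet files that stay in S3.
CREATE EXTERNAL TABLE lake.orders (
    order_id   BIGINT,
    amount     DECIMAL(12,2),
    order_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/orders/';

-- Query the S3 data directly; the heavy scanning runs in the Spectrum layer,
-- and the data is never loaded into Redshift tables.
SELECT order_date, SUM(amount) AS daily_revenue
FROM lake.orders
GROUP BY order_date;
```

Because the data stays in S3, several Redshift clusters can define external tables over the same files and query them concurrently without copies.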
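For comparison, the same data-lake aggregation can be expressed in Spark SQL (for example, from the spark-sql shell or a notebook), with Spark distributing the scan and aggregation across the cluster. Again, the table name and S3 path are illustrative assumptions:

```sql
-- Illustrative only: table name and S3 path are assumed.
-- Register an external table over the same Parquet files in S3.
CREATE TABLE orders
USING PARQUET
LOCATION 's3a://my-data-lake/orders/';

-- Spark partitions the files across executors and aggregates in parallel.
SELECT order_date, SUM(amount) AS daily_revenue
FROM orders
GROUP BY order_date;
```

The SQL is nearly identical to the Spectrum version; the difference is operational: you run and tune the Spark cluster yourself, in exchange for a general-purpose engine that also supports Scala, Python, and streaming workloads.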

Summary

  • Redshift: fully managed, with fast onboarding and centralized management; data is loaded into cluster tables and scales from a few hundred gigabytes to a petabyte or more.
  • Spectrum: queries structured and semistructured data in place on Amazon S3 with massive parallelism; multiple clusters can share the same datasets without making copies.
  • Spark: open source and general purpose; best suited to tech-savvy developers, analysts, and data scientists who need flexible, parallel processing.

Conclusion

There is no silver bullet, but the summary above can guide technology decisions based on what matters most for your organization's big data goals. If you are looking for faster onboarding with centralized management, you may want to go the Redshift route; if your developers, analysts, and data scientists are tech savvy, you may prefer the Spark route. You can also mix and match Redshift and Spark to cater to the simplest through the most complex business use cases on your big data platform.

Author: Rajesh Saluja

Contributor: Dilip Rachamalla

Reviewers: Giriraj Bagdi, Anil Madan

The blog content is a team effort, and I would like to thank the contributors, reviewers, and several other cross-functional team members.


Rajesh Saluja

Principal Big Data Engineer at Small Business and Self Employed Group, Intuit. https://www.linkedin.com/in/rajeshsaluja/