An Insider’s Guide to Apache Spark is a useful new resource directed toward enterprise thought leaders who wish to gain strategic insights into this exciting new computing framework. As one of the most exciting and widely adopted open-source projects, Apache Spark, with its in-memory clusters, is driving new opportunities for application development as well as increased intake of IT infrastructure. This article is the second in a series that takes a high-level view of how and why many companies are deploying Apache Spark as a solution for their big data technology requirements. The complete An Insider’s Guide to Apache Spark is available for download from the insideAI News White Paper Library.
Why is Spark So Hot?
The amount of data generated around the globe each day is 2.5 exabytes (Adepta, March 2015), and the big data market reached $27.4 billion in 2014 (Wikibon, March 2015). Spark is clearly a computing architecture expressly designed for this level of growth. This notion is supported by IBM’s announcement in June 2015 that it will educate more than 1 million data scientists and engineers on Spark. From a technological standpoint, there are a number of reasons for Spark’s continued upward trajectory in the big data industry:
Revisiting MapReduce
Spark, when compared to MapReduce, offers greater flexibility. MapReduce only offers two operations: Map and Reduce, whereas Spark offers more than 80 high-level operations. MapReduce’s inefficient handling of iterative algorithms as well as interactive analytic tools served as the motivation for developing alternatives. Spark excels at programming models involving iterations, interactivity, streaming and more.
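The contrast in programming models can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use Spark or Hadoop, and the names map_phase, reduce_phase and Pipeline are hypothetical. The chained methods loosely mirror the style of Spark operations such as flatMap, filter and distinct.

```python
from collections import defaultdict

lines = ["spark is fast", "spark is flexible", "mapreduce is rigid"]

# MapReduce style: the whole job must be phrased as one map phase
# followed by one reduce phase.
def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(pairs)

# Chained style: many small operations composed into one pipeline.
# Pipeline is a hypothetical, single-machine stand-in, not a Spark class.
class Pipeline:
    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting sequences.
        return Pipeline(x for item in self.items for x in fn(item))

    def filter(self, pred):
        # Keep only the items for which pred is true.
        return Pipeline(x for x in self.items if pred(x))

    def distinct(self):
        # Drop duplicates while preserving first-seen order.
        return Pipeline(dict.fromkeys(self.items))

    def collect(self):
        # Return the results as a plain list.
        return list(self.items)

long_words = (
    Pipeline(lines)
    .flat_map(str.split)
    .filter(lambda w: len(w) > 4)
    .distinct()
    .collect()
)
```

In the first half, any extra step (filtering, de-duplicating) would have to be folded into the two phases or run as a separate job. In the second half it is simply one more call in the chain.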
Spark’s Use of HDFS
Spark can make use of the Hadoop Distributed File System (HDFS). At the same time, Spark does not require HDFS.
Spark’s Use of YARN
Spark can make use of YARN, Hadoop’s cluster resource manager.
Spark Provides for Analytics Workflows
From its library for machine learning (MLlib) and API for graph analytics (GraphX), to its support for SQL-based queries and streaming applications, Spark delivers a convergent analytics platform. This means, for example, that you can write your own code in Java, Scala or Python that makes use of one or more of these components to craft an analytics workflow.
Spark’s Efficient Use of Memory
In contrast to Hadoop’s two-stage disk-based MapReduce model, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms. In benchmarking studies involving in-memory storage, i.e., a diskless HDFS instance, Spark outperformed Hadoop by a factor of 20. Spark even beats Hadoop by a factor of 10 when memory is unavailable and it has to use disks. Efficiencies aside, Spark must be doing something very differently.
Spark owes its high-performance reputation to Resilient Distributed Datasets (RDDs). RDDs are a relatively new abstraction for in-memory computing resulting from a research project in the AMPLab at UC Berkeley. As the name implies, RDDs are fault-tolerant, parallel data structures ideally suited to in-memory cluster computing. Consistent with the Hadoop paradigm, RDDs can be persisted and partitioned across a big data infrastructure, ensuring that data is optimally placed. Ultimately, RDDs comprise the primary justification for the escalating interest in Spark.
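A toy model can make the two ideas above more concrete: keeping data in memory so that repeated queries avoid recomputation, and rebuilding lost data from a recorded recipe rather than from a stored copy. The sketch below is plain Python on a single machine. ToyRDD and its evict method are hypothetical names, and nothing here reflects how Spark implements RDDs internally, where data is partitioned across a cluster. The method names map, filter, cache and collect are borrowed from Spark's RDD API only for familiarity.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how to compute its data
    (its lineage) instead of storing the data eagerly."""

    def __init__(self, compute):
        self._compute = compute    # function that rebuilds this dataset
        self._use_cache = False    # whether to keep results in memory
        self._cached = None        # in-memory copy, filled on first use
        self.computations = 0      # how many times the data was rebuilt

    @classmethod
    def from_list(cls, data):
        source = list(data)
        return cls(lambda: list(source))

    def map(self, fn):
        # Lazy: nothing runs until collect() is called on the result.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._use_cache = True
        return self

    def evict(self):
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cached = None

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return list(result)


squares = ToyRDD.from_list(range(1, 6)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
big = squares.filter(lambda x: x > 10)

evens_result = evens.collect()   # squares is computed and kept in memory
big_result = big.collect()       # squares is served from memory
computations_with_cache = squares.computations

squares.evict()                  # the in-memory copy is lost
recovered = evens.collect()      # squares is rebuilt from its lineage
computations_after_loss = squares.computations
```

The second query reuses the cached squares without recomputing them, which is the pattern that benefits iterative workloads such as machine learning. After the simulated loss, the same answer is recovered by replaying the recorded steps.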
The recent Databricks Spark Survey Report 2015 includes several important takeaways that speak to the state of the Apache Spark ecosystem. In addition, the figure below shows how widely Spark has been adopted across ten top industries. Notice that performance is the number one motivation for using Spark: 91% see performance as the most important aspect.
- Spark adoption is growing rapidly – with over 600 Spark contributors in the last 12 months, Spark is the most active Apache Open Source project in big data. Spark is not only being used to solve an increasing variety of data problems but also an increasing complexity of data problems. Spark adoption is growing quickly as users find it easy to use, reliably fast, and aligned to growth in real-time analytics.
- Spark is growing far beyond Hadoop – the rapid acceleration of Spark adoption across new and diverse data problems is impressive. The fact that this growth is propelling Spark beyond Hadoop is astounding. Spark is the focus of a growing group of innovators that are driving tomorrow’s data culture.
- Spark is increasing access to big data – Spark is creating opportunities for big data exploration by making it easier for a wide range of people to solve a growing variety of data problems. It’s not just distributed data engineers who want to work with Spark but also a growing constituency of data scientists.
If you prefer, the complete An Insider’s Guide to Apache Spark is available for download as a PDF from the insideAI News White Paper Library, courtesy of TIBCO. Click HERE to take in a webinar event recorded on November 17, 2015.