Spark Summit event report: IBM unveiled big plans for Apache Spark this week, but it’s not alone in banking on this open-source analytics platform. Here’s why this still-green technology is quickly gaining adoption.
Why is Apache Spark, an open-source project that only reached its 1.0 release a little more than a year ago, getting so much attention?
IBM, for one, announced a big commitment to the platform at this week’s Spark Summit in San Francisco. And as IBM execs told analysts at the company’s new Spark Technology Center here, it’s an all-in bet to integrate nearly everything in the analytics portfolio with Spark. Other tech vendors betting on Spark range from Amazon to Zoomdata, even as real-world deployments number in the hundreds and are mostly experiments and proof-of-concept projects.
Describing Spark as “an operating system for analytics,” IBM execs cited Spark strengths including:
- Abstracted data management. Spark lets data scientists focus on the analysis, not data movement, as data pipelines can access data where it lies, whether that’s in Hadoop, in a database, or on Spark’s own cluster.
- Rich data-processing functionality. The Spark Core provides a flexible data-processing platform supporting distributed task management, scheduling, and basic I/O functionality as well as transformations such as map, filter, reduce and join.
- In-memory performance. The Spark Core delivers up to 100 times faster performance than Hadoop MapReduce. In iterative machine learning and other predictive analyses, you can squeeze in that many more processing cycles against all of the data, not just samples of data.
- Plentiful analytics options. IBM execs didn’t dwell on this strength, but Spark offers multiple analytic libraries that run on top of the core, including MLlib (machine learning), Spark SQL, GraphX, Spark Streaming and, released last week, SparkR. IBM plans to run its own software on top of the platform, including SPSS, IBM Streams, and (soon to be open sourced) SystemML, all of which are being ported to run on Spark.
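To make the transformations named above (map, filter, reduce and join) concrete, here is a minimal sketch in plain Python. It deliberately does not use Spark: it runs on small in-memory lists rather than a distributed dataset, and the sample data, the shipping fee and every variable name are hypothetical. Spark's RDD API exposes operations with the same names, but this sketch only illustrates what each one does to a collection of records.

```python
from functools import reduce

# Hypothetical sample data: (customer_id, purchase_in_cents) and (customer_id, region)
purchases = [(1, 2000), (2, 500), (1, 750), (3, 1200)]
regions = [(1, "west"), (2, "east"), (3, "west")]

# map: apply a function to every record (here, add a flat 100-cent shipping fee)
with_fee = list(map(lambda rec: (rec[0], rec[1] + 100), purchases))

# filter: keep only the records that satisfy a predicate (1,000 cents or more)
large = list(filter(lambda rec: rec[1] >= 1000, with_fee))

# reduce: combine all values into a single result (total cents across large orders)
total = reduce(lambda a, b: a + b, (amount for _, amount in large))

# join: pair up records from two datasets that share the same key (customer_id)
region_by_id = dict(regions)
joined = [
    (cid, (amount, region_by_id[cid]))
    for cid, amount in large
    if cid in region_by_id
]

print(total)
print(joined)
```

On a Spark cluster the same chain of steps would be spread across many machines, with intermediate results able to stay in memory between steps; the single-process sketch above shows only the logic of the four operations.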
In contrast to IBM, analytics rival SAS views Spark as a competitor. For more than three years, SAS has been working on its own big-data-capable, in-memory analytics platform, the SAS LASR Analytic Server, which runs SAS Visual Analytics (VA) and SAS Visual Statistics (VS) as well as SAS analytics libraries. The SAS LASR Analytic Server can be deployed on a single server, on a dedicated cluster, or on top of Hadoop (see my just-published report, “The Era of Self-Service Analytics Emerges: Inside SAS Visual Analytics and Visual Statistics”).
When I asked SAS about its views on Spark earlier this week, product manager Mike Ames offered the following (lukewarm) statement: “While Spark is currently an immature technology, it shows promise with rapid adoption as a result of its data processing capabilities. SAS and Spark are very capable of coexisting, with products such as SAS Data Loader for Hadoop, which can push transform logic to Spark.”
SAS is right about Spark’s immaturity. I’ve talked to practitioners and integrators who acknowledge that the technology is still green. Like a lot of 1.X open-source software projects, Spark is still buggy and it doesn’t have all the table-stakes systems-management, security and high-availability features that many enterprises would insist upon before running mission-critical workloads.
That’s not to say that Spark is incapable of running production workloads reliably or at scale. Hundreds of companies are doing just that, but they tend to be pioneers with an appetite for innovation and strong engineering teams that are willing to fix bugs and develop best practices where none exist.
Plenty of early adopters presented at Spark Summit, including Airbnb, Baidu, Edmunds.com, NASA, NBC Universal, Netflix, Shopify, Toyota Motor Sales, and Under Armour. Keynoter James Peng, principal architect at Baidu, the Google of China, described that company’s 1,000-plus-node, petabyte-scale Spark deployment, which is delivering 50 times faster performance than the conventional MapReduce processing it previously relied upon. Baidu is also pioneering the use of Spark SQL and the Tachyon caching layer.
MyPOV on Spark
Yes, it’s early days for Spark, but there’s good reason why IBM described it as “potentially the most significant open-source project of the next decade.” SAS acknowledged Spark’s data-processing capabilities, but that’s just the starting point. Even IBM’s characterization of Spark as “an operating system for analytics” seems like a left-handed compliment.
With all those libraries on top of the Spark Core (machine learning, SQL, graph, streaming and R), Databricks and the Spark community are trying to build out an all-purpose analytics platform capable of supporting many forms of analysis and blended analyses. By blending machine learning and streaming, for example, you could create a real-time risk-management app. What’s more, Spark supports development in Scala, Java, Python and R, which is another reason the community is growing so quickly.
At Spark Summit, Amazon Web Services announced a free Spark service running on Amazon Elastic MapReduce, and IBM announced plans for Spark services on Bluemix (currently in private beta) and SoftLayer. These cloud services will open the floodgates to developers, and IBM’s contributions will surely help to harden the Spark Core for enterprise adoption.
In short, it’s hard to see any open-source project matching Spark on depth and breadth of analysis and development flexibility (despite a prominent tout of Apache Flink at last week’s Hadoop Summit). And that’s why you’re hearing so much about Spark.