On Analytics, Data Platforms and Smart Applications
Spark adopters including Bloomberg, Comcast, Capital One and eBay share compelling use cases. Data processing, streaming and analytics use-case scenarios multiply.
What’s the business case for Apache Spark? After the opening (general-session) day of Spark Summit East 2016 in New York, I was thinking that Spark promoter and Summit host Databricks needed to do a better job of telling that story. Then on the final day, executives from Capital One, Synchronoss and eBay offered compelling, business-use-case focused keynotes that knocked it out of the park.
Chris D’Agostino, VP of Technology at Capital One, a top-ten bank and cutting-edge user of data and analytics, outlined the bank’s use of Spark to generate tabular and graph representations of data at high scale. These views of data are used in low-latency fraud-detection, anti-money-laundering and know-your-customer analyses.
Suren Nathan, senior director of Big Data Platforms and Analytics Frameworks at Synchronoss, a cloud services company, outlined four eras of data pipeline maturation. It started with the “V1” traditional ETL/data warehouse era and then moved on to the V2 appliance-assisted era (think Exadata and Netezza, with comparatively high costs yet lacking support for multi-structured data). In the V3 early Hadoop era we suffered with slow batch processing in MapReduce. In today’s V4 era, you get scale on Hadoop but also superior performance with support for both batch and streaming workloads thanks to Spark.
Seshu Adunuthula of eBay described how “the other e-commerce giant” is making use of Spark within its evolved enterprise data platform. eBay’s move into selling more new items at fixed prices (as opposed to used items through auctions) has led it to take a catalog-oriented, rather than search-oriented, approach. That has necessitated a more structured yet dynamic data modelling approach. eBay remains one of the largest users of Teradata in the world, but Adunuthula said the company is moving many workloads into Hadoop. Spark was added to support fast batch analysis and streaming-data services. “Once we introduced Spark, we saw rapid adoption, and now we’re seeing more and more use cases as adoption grows,” Adunuthula concluded.
The opening-day keynotes were offered by vendor execs who had mixed success in painting a big picture. IBM’s Anjul Bhambhri, VP, Big Data, came close, explaining that “the beauty of Spark is that all components [Spark SQL, Spark Streaming, SparkR, MLlib, etc.] work together in a seamless way. You don’t need half a dozen products; you need just one, foundational platform.”
Matei Zaharia, CTO of Databricks, stuck to detailing Apache Spark accomplishments in 2015 and reviewing new capabilities coming in Spark 2.0, set for release in late April or early May. Highlights of the coming release include whole-stage code generation aimed at improving Spark performance and throughput by as much as 10X. Spark Streaming improvements in 2.0 will support the common scenario of mixed streaming and batch workloads, as when you want to track the state of a stream while also firing off SQL queries. Finally, Spark’s DataFrame and Dataset APIs will be merged in 2.0, simplifying matters for developers by presenting fewer libraries and concepts to worry about.
Ali Ghodsi, Databricks co-founder and recently promoted CEO, confined his remarks to Databricks’ own services, highlighting the appeal of using Spark through the company’s cloud-based commercial offering. He also introduced a free (sandbox-scale) cloud-based Databricks Community Edition. This led to a demo by Databricks’ Michael Armbrust, who touted Databricks’ cloud offerings as a great way to avoid the pain of deploying Spark software.
MyPOV: With its cloud-based platform, Databricks competes, in a way, against the open-source software it helps to create and promote. Commercial companies behind open source software more typically support on-premises software deployments. Is it a coincidence that Spark customers and third parties including Cloudera and IBM seem to be carrying the load of making the case for enterprise Spark adoption? Plenty of Databricks execs offered technical presentations on Apache Spark software capabilities and roadmaps, but they were preaching to the choir. At Spark Summit East, Databricks missed a chance to be more of a champion of Apache Spark in the enterprise.
Thinking back, it took a partnership with O’Reilly to turn Hadoop World into Strata + Hadoop World, an event with a broader audience and a higher-level purpose. I’ll grant that much of what Databricks does to promote and contribute to a healthy Apache Spark ecosystem goes on behind the scenes. And you can’t argue with the project’s success. But it seems to me that Spark is ready for a bigger stage.
PS: Interesting tech seen at Spark Summit included H2O.ai, which supports distributed machine learning on Hadoop or Spark, and DataRobot, which automatically generates predictive models (using leading open-source algorithms in R, Python and Spark) and tests, validates and selects the most accurate ones, speeding predictive modeling and easing the data science talent shortage. Also interesting was SnappyData, a company recently spun out of Pivotal that has ported the open source Geode (formerly GemFire) in-memory, distributed database to run on Spark. It offers SQL querying in a persistent store that is part of/runs on Spark rather than requiring separate infrastructure.
The most talked about topic at Spark Summit NOT on the agenda or announced at the event was Apache Arrow, a new project which promises an in-memory, columnar data layer that can be shared by multiple open source projects, eliminating redundant infrastructure, cost and copies of data while enabling fast analytics across many workloads. The project launched with support from a whopping 13 open source projects, including Cassandra, Drill, Hadoop, HBase, Impala, Parquet, Pandas and Spark.