On Analytics, Data Platforms and Smart Applications
MapR, Pivotal and Teradata were among the who’s who of big data vendors making announcements at Strata + Hadoop World 2015. Here’s a quick analysis.
If Strata + Hadoop World 2014 seemed to be all about Hadoop, Hadoop, Hadoop, the 2015 buzzword was Spark, Spark, Spark. Initial impressions aside, there was more going on than adoption of that notable open-source, in-memory data processing and data analysis framework. Here’s a quick rundown of a few of the bigger announcements, along with my analysis of the developments.
MapR Adds In-Hadoop Document Database
MapR announced here that it has added JSON-handling document-database features to the MapR DB component of its Hadoop Distribution. MapR DB is the vendor’s version of HBase, which architecturally differs from the open-source, high-scale NoSQL database in order to deliver five- to seven-times faster performance, according to MapR.
MapR says that adding an in-Hadoop document database to MapR DB will save developers time, eliminate redundancies in data and infrastructure, and eliminate the time and trouble of moving and copying data to handle both transactional and analytical needs.
MyPOV: This combination makes sense, and it will surely appeal to existing MapR customers who are looking to do as much as possible with their MapR deployments. Will it change the dynamics of the Hadoop or NoSQL database markets? I suspect not, as organizations and developers seeking a NoSQL database will not look to make a sweeping choice of the Hadoop platform at the same time. MapR points out that you can deploy MapR DB independently, but without co-location and sharing of the data in a Hadoop cluster, the advantages largely evaporate. Think of the document database feature on MapR DB as a nice add for existing customers and one more selling point for customers looking for a Hadoop distribution and support company.
Pivotal Takes HAWQ Open Source
Pivotal this week announced contributions of many of its data engines to open source. That move started with Pivotal’s GemFire in-memory database, which became Apache Geode in April. At Strata + Hadoop World, Pivotal announced that its HAWQ SQL-on-Hadoop tool is now Apache HAWQ (incubating) and the MADlib machine learning library is now Apache MADlib (incubating). Soon, the Pivotal Greenplum database and a query optimizer shared by Greenplum and HAWQ will also be contributed to open source.
HAWQ, which is based on Greenplum, was one of the earliest SQL-on-Hadoop options based on a relational database. MADlib, which began as an open-source project in 2011, is a collection of scale-out, parallel machine learning algorithms that runs in HAWQ and Greenplum.
MyPOV: Being early to the market in 2013 didn’t appear to help HAWQ win a landslide of new customers. Several databases since ported to run on Hadoop – like Actian Vortex and HP Vertica – also offer extensive SQL compliance and fast query performance, yet they, too, haven’t taken the Hadoop market by storm.
Will an Apache open-source license make a big difference for HAWQ? I suspect the big data community will continue to associate HAWQ with Pivotal, even if it’s now billed by the company as a “Hadoop Native” product. Pivotal’s most compelling big data attractions are the breadth of its analysis options and its flexible, subscription-based approach, which lets you mix, match and switch between engines without cost implications.
Spark Gains Yet More Support
There were plenty of nods to Apache Spark at Strata + Hadoop World, starting with Cloudera’s “one platform” pledge to make Spark an enterprise-class data-processing and data-analysis choice on top of Hadoop.
There was also a spate of announcements around Spark as a data-transformation and processing engine within data integration products. SnapLogic, for example, announced Spark-based big-data capabilities through the Fall release of its SnapLogic Elastic Integration Platform. In the same vein, Syncsort and Talend have also announced Spark-based data-processing options. And in an analyst briefing held by Oracle on Monday, the vendor explained that it’s been working with Spark developer Databricks for more than 18 months to take advantage of the framework’s data-processing and data-analysis capabilities. Expect related announcements at Oracle OpenWorld.
MyPOV: “Spark inside” was a common claim seen at Strata + Hadoop World, and it’s clear this framework is seeing broad vendor support. This is a theme we’ve seen all year, though it does not mean that the Spark core and all its components can be described as mature or production ready. Rather than take on the risk yourself, it’s best to work with certified vendors or Databricks itself if you hope to eventually take advantage of Spark’s fast, in-memory processing and analysis.
Teradata Embraces Python
Python is an increasingly popular language for big data analytics work. As evidence, an entire workshop track was dedicated to “PyData” at Strata + Hadoop World. Responding to this interest, Teradata this week introduced the Teradata Module for Python, which it’s pitching as a boon to DevOps-enabled applications.
The Teradata Module for Python makes it quicker and easier for developers to embed SQL queries that invoke Teradata sources into their applications. Operations types like DBAs, meanwhile, gain granular visibility into Web and mobile apps and new versions of those apps that query against Teradata.
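For readers who want a feel for what “embedding SQL queries” in a Python application looks like, here is a minimal sketch. It is not the Teradata module’s API: it uses Python’s built-in sqlite3 library as a stand-in database so the example runs anywhere, and the orders table, the demo_connection helper and the top_customers function are all hypothetical names invented for illustration. Against Teradata, the connection would come from the vendor’s module instead.

```python
import sqlite3


def demo_connection():
    """Build an in-memory stand-in database (an assumption, not Teradata)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("acme", 120.0), ("acme", 80.0), ("globex", 50.0)],
    )
    return conn


def top_customers(conn, min_total):
    """Return (customer, total) rows whose summed orders reach min_total."""
    sql = (
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer "
        "HAVING SUM(amount) >= ? "
        "ORDER BY total DESC"
    )
    # The threshold is passed as a bound parameter rather than pasted
    # into the SQL string, which keeps the embedded query safe to reuse.
    return conn.execute(sql, (min_total,)).fetchall()


if __name__ == "__main__":
    conn = demo_connection()
    for customer, total in top_customers(conn, 100):
        print(customer, total)
```

The point of the pattern is that the query lives in one named function inside the application, which is the kind of embedded SQL that developers write and that DBAs would want visibility into.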
MyPOV: Developers were already embedding SQL queries into apps and operations teams were already dealing with Web and mobile apps invoking, and sometimes impacting the performance of, Teradata. This module should make life easier for developers and DBAs. It’s easy to guess that developers will add R and other languages to the wish list.
Hot Startups Seen At Strata
After a dozen briefings at Strata, I’ve developed a short list for deeper research. I’ll close by noting two startups that caught my attention. Startup AtScale has the focused mission to help organizations use their existing BI systems and tools with Hadoop. It exposes the data inside Hadoop as flat, virtual tables to SQL-based tools and as virtual OLAP cubes to tools that use MDX.
Another startup that impressed was DataTorrent, which is working on a fast streaming and low-latency batch processing platform with plentiful connectors and an easy-to-use, drag-and-drop app-development interface. The company has contributed its platform to Apache as Project Apex, and it claims faster streaming performance than both Spark and Storm.