On Analytics, Data Platforms and Smart Applications
Cloudera executives can’t talk about IPO or cloud-services rumors. Here’s what’s on the record from the Cloudera Analyst Conference.
There were a few elephants in the room at the March 21-22 Cloudera Analyst Conference in San Francisco. But between a blanket “no comment” about IPO rumors and non-disclosure demands around cloud plans — even whether such plans exist, or not — Cloudera execs managed to dance around two of those elephants.
The third elephant was, of course, Hadoop, which seems to be going through the proverbial trough of disillusionment. Some are stoking fear, uncertainty and doubt about the future of Hadoop. Signs of the herd shifting the focus off Hadoop include Cloudera and O’Reilly changing the name of Strata + Hadoop World to Strata Data. Even open-source zealot Hortonworks has rebranded its Hadoop Summit as DataWorks Summit, reflecting that company’s diversification into streaming data with its Apache NiFi-based Hortonworks DataFlow platform.
At the Cloudera Analyst Conference, Chief Strategy Officer Mike Olson said that he couldn’t wait for the day when people would stop describing his company as “a Hadoop software distributor” mentioned in the same breath with Hortonworks and MapR. Instead, Olson positioned the company as a major vendor of enterprise data platforms based on open-source innovation.
MapReduce (which is fading away), HDFS and other Hadoop components are outnumbered by other next-generation, open-source data management technologies, Olson said, and he noted that there are some customers who are just using Cloudera’s distributed and supported Apache Spark on top of Amazon S3, without using any components of Hadoop.
Cloudera has recast its messaging accordingly. Where years ago the company’s platform diagrams detailed the many open source components inside (currently about 26), Cloudera now presents a simplified diagram of three use-case-focused deployment options (shown below), all of which are built on the same “unified” platform.
Cloudera-developed Apache Impala is a centerpiece of the Analytic DB offering, and it competes with everything from Netezza and Greenplum to cloud-only high-scale analytic databases like Amazon Redshift and Snowflake. HBase is the centerpiece of the Operational DB offering, a high-scale alternative to DB2 and Oracle Database on the one hand and Cassandra, MapR and MemSQL on the other. The Data Science & Engineering option handles data transformation at scale as well as advanced, predictive analysis and machine learning.
Many companies start out with these lower-cost, focused deployment options, which were introduced last year. But 70% to 75% of customers opt for Cloudera’s all-inclusive Enterprise Data Hub license, according to CEO Tom Reilly. You can expect that if and when Cloudera introduces its own cloud services, it will offer focused deployment options that can be launched, quickly scaled and just as quickly turned off, taking advantage of cloud economies and elasticity.
Navigating around the non-disclosure requests, here are a few illuminating factoids and updates from the analyst conference:
Cloudera Data Science Workbench: Announced March 14, this offering for data scientists brings Cloudera into the analytic tools market, expanding its addressable market but also setting up competition with the likes of IBM, Databricks, Domino Data, Alpine Data Labs and Dataiku, as well as a bit of coopetition with partners like SAS. Based on last year’s Sense acquisition, Data Science Workbench will enable data scientists to use R, Python and Scala with open source frameworks and libraries while directly and securely accessing data on Hadoop clusters with Spark and Impala. IT provides access to the data within the confines of Hadoop security, including Kerberos.
Apache Kudu: Made generally available in January, this Cloudera-developed columnar, relational data store provides real-time update capabilities not supported by the Hadoop Distributed File System. Kudu went through extensive beta use with customers, and Cloudera says it’s seeing a split of deployment in conjunction with Spark, for streaming data applications, and with Impala, for SQL-centric analysis and real-time dashboard monitoring scenarios.
MyTake On Cloudera Positioning and Moves
Yes, there’s much more to Cloudera’s platform than Hadoop, but given that the vast majority of customers store their data in what can only be described as Hadoop clusters, I expect the association to stick. Nonetheless, I don’t see any reason to demur about selling Hadoop. Cloudera isn’t saying a word about business results these days, likely because of the rumored IPO. But consider the erstwhile competitors. In February Hortonworks, which has been public for two years, reported a 39% increase in fourth-quarter revenue and a 51% increase in full-year revenue (setting aside the topic of profitability). MapR, which is private, last year claimed (at a December analyst event) an even higher growth rate than Hortonworks.
Assuming Cloudera is seeing similar results, it’s experiencing far healthier growth than any of the traditional data-management vendors. Whether you call it Hadoop and Spark or use a markety euphemism like next-generation data platform, the upside customers want is open source innovation, distributed scalability and lower cost than traditional commercial software.
As for the complexity of deploying and running such a platform on premises, there’s no getting around the fact that it’s challenging – despite all the things that Cloudera does to knit together all those open-source components. I see the latest additions to the distribution, Kudu and the Data Science Workbench, as very positive developments that add yet more utility and value to the platform. But they also contribute to total system complexity and sprawl. We don’t seem to be seeing any components being deprecated to simplify the total platform.
Deploying Cloudera’s software in the cloud at least gives you agility and infrastructure flexibility. That’s the big reason why cloud deployment is the fastest-growing part of Cloudera’s business. If and when Cloudera starts offering its own cloud services, it would be able to offer hybrid deployment options that cloud-only providers like Amazon (EMR) and Google (Dataproc) can’t offer. And almost every software vendor embracing the cloud path also talks up cross-cloud support and avoidance of lock-in as differentiators compared to cloud-only options.
I have no doubt that Cloudera can live up to its name and succeed in the cloud. But as we’ve also seen many times, the shift to the cloud can be disruptive to a company’s on-premises offerings. I suspect that’s why we’re currently seeing introductions like the Data Science Workbench. It’s a safe bet. If and when Cloudera truly goes cloud, and if and when it becomes a public company, things will change and change quickly.