Hadoop security, data management, data governance and analysis options remain works in progress, but a rich ecosystem is emerging to fill gaps and democratize the platform.
Apache Hadoop marks its 10th anniversary as an open source project this year, a fitting milestone at which to review its betwixt-and-between state as an enterprise computing platform.
Inspired by a Google white paper, born at Yahoo and embraced in its early years almost exclusively by Internet giants, Apache Hadoop is today accepted as a de facto standard platform for any enterprise interested in taking advantage of big data. Over the last five years, the top three Hadoop software distributors, Cloudera, Hortonworks and MapR, have cracked all major vertical industry categories and have collectively gained more than 3,000 paying customers for their supported enterprise editions. Tens of thousands more firms are self-supporting free community distributions of Hadoop, though the largest share of these deployments is no doubt devoted to experimentation rather than production use.
Equally significant – and now the fastest-growing part of the Hadoop user community, by most accounts – are the thousands of organizations using cloud-based Hadoop services, such as Amazon Elastic MapReduce, Microsoft Azure HDInsight, Altiscale, Qubole and various managed Hadoop service offerings.
Looking beyond these sheer numbers, I heard plenty of fresh evidence of proven industry use cases at recent Cloudera and Hortonworks analyst events. Cloudera detailed an impressive list of vertical industry use cases at its event while Hortonworks cited unnamed customers at “55 out of the top 100 financial services firms, 75 out of the top 100 retailers, eight out of the top nine telecommunications companies in North America, and eight of the world’s top 20 automotive companies.”
So there’s plenty of reason for confidence in this platform, and we continue to see steady maturation. But Hadoop still has weaknesses and gaps, and plenty of experiments have failed. Even the hand-picked customers attending the Cloudera and Hortonworks events, who shared mostly success stories, had to admit to ongoing challenges:
- “Better data governance is the number-one priority on our [Hadoop] wish list,” said a VP of platforms and data architecture attending the Hortonworks event. His employer, a digital marketing company, has been on an acquisitions tear, and segregating, securing and otherwise governing specific data sets has proven difficult as the company has consolidated separate Hadoop deployments.
- “It was messy on the data-lineage end,” confessed a director of analytics attending the Cloudera event. This warehousing and logistics firm pulls data from dozens of legacy database applications to calculate how to jam more products into its distribution centers. But before it could begin the optimization work, the firm “spent months working out the details for data ingestion.”
- “We have three people working with Hadoop, but we have more than 150 business users who need access to the data,” said a BI solutions architect at an aerospace firm that is using Hortonworks’ distribution. “I’d like to see better ease of use for business users,” he said, noting the wonky, coding-intensive nature of many Hadoop components and data-management tools.
- “The sooner we can have an all-purpose [tool] for getting data into Hadoop, the better,” said an IT executive of a data services company using Cloudera. “We use a lot of RDF-linked data, but there’s not a lot of support for that in Cloudera.”
MyPOV on Hadoop Maturity
So is the glass half empty or half full? In my view you should be optimistic but realistic about this 10-year-old platform. I relate it to my experience as a parent. We never left my son home alone when he was 10 years old, but now that he’s 14, I trust that he’ll be safe and will even get his homework done if we get home late from work. In much the same spirit, an executive at a major e-retailer shared in a recent briefing that his firm isn’t ready to open up wide access to its Hadoop cluster until data-access, governance and security controls are more mature. Maybe if PCI data weren’t involved he’d feel differently? Just as a parent has to know the child, you have to understand your data, your users and your risks. Maturity and trust will come.
Fortunately, we’re seeing a rich ecosystem emerging around Hadoop that will help make data access, data management, data governance and data analysis easier, less coding-intensive, more repeatable and, in many cases, more accessible to business users. Some of these capabilities will undoubtedly be duplicated within open source tools. But we’ll also see data-management and governance capabilities that will extend beyond Hadoop, supporting data pipelines and data-driven applications that span multiple platforms.
Next week I’ll be discussing the possibilities and positive developments in the educational webinar, “Democratizing the Data Lake: The State of Big Data Management in the Enterprise.” Set for Tuesday, April 26 at 1pm ET/10am PT, this webinar will delve into data access, data cataloging and metadata management options for Hadoop as well as big data integration and data-prep options. We’ll also discuss Apache Spark and its role in data processing, stream processing and data analysis in the context of Hadoop. Click on the link above to register for the event.