Doug Henschen on Analytics, Big Data & Smart Apps
Hortonworks integrates Hortonworks Data Platform (Hadoop) and Hortonworks DataFlow (streaming data) platforms to offer a cohesive approach to analyzing data in motion and data at rest. Here’s how they fit together.
The “Connected Data Platforms” that Hortonworks introduced on March 1 are its well-known Hortonworks Data Platform (HDP) Hadoop distribution and its Hortonworks DataFlow (HDF) platform aimed at collecting, curating and routing real-time data from any source to any destination. HDP and HDF can be used independently, but here’s how they fit together to become a cohesive platform for managing and analyzing streaming and historical data.
Interest in streaming data analysis has been growing steadily in recent years, but the emergence of Internet of Things (IoT) opportunities has sent interest soaring. The thing is, streaming-data use cases such as connected cars, smart oil fields, smart utilities and precision medicine often require analysis of historical data, which brings context to the real-time insights. That’s why HDF and HDP need to be connected.
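To make the "historical context" point concrete, here is a minimal Python sketch of the idea. It is not Hortonworks code and uses no HDP or HDF APIs; the sensor names, readings, function names and the three-standard-deviation threshold are all hypothetical, chosen only to show how a baseline computed from data at rest can give meaning to readings arriving in motion.

```python
from statistics import mean, pstdev


def build_baselines(history):
    """Summarize historical readings per sensor as (mean, std dev)."""
    baselines = {}
    for sensor_id, values in history.items():
        baselines[sensor_id] = (mean(values), pstdev(values))
    return baselines


def flag_anomalies(stream, baselines, threshold=3.0):
    """Yield streaming readings that stray far from their historical baseline."""
    for sensor_id, value in stream:
        if sensor_id not in baselines:
            continue  # no history for this sensor, so no context to judge it by
        avg, std = baselines[sensor_id]
        if std > 0 and abs(value - avg) > threshold * std:
            yield sensor_id, value


# Hypothetical historical data (data at rest) and incoming readings (data in motion)
history = {
    "pump-1": [70.0, 71.0, 69.0, 70.0],
    "pump-2": [40.0, 42.0, 41.0, 41.0],
}
stream = [("pump-1", 70.5), ("pump-2", 55.0), ("pump-1", 90.0)]

alerts = list(flag_anomalies(stream, build_baselines(history)))
print(alerts)
```

In a real deployment the baseline would presumably come from a query against the historical store and the stream from a message queue, but the division of labor is the same: the history defines what is normal, and the stream is judged against it.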
This week Hortonworks introduced HDP 2.4. Notable upgrades include support for and bundling of Apache Spark 1.6 software as well as improved system management and remote optimization capabilities through Apache Ambari 2.2 and SmartSense 2.2. Ambari, the open source management software, gained an Express Upgrade feature that lets you quickly stop jobs, update software and restart the cluster and running jobs, all within one hour, even on large systems. SmartSense is a “phone home” capability that relays system-performance parameters to Hortonworks, which can diagnose problems and offer more than 250 recommendations on optimizing system performance and availability.
The biggest development with HDP 2.4 is a new distribution strategy with two separate release cadences. Core Apache Hadoop components including HDFS, MapReduce and YARN as well as Apache ZooKeeper will be updated annually, in line with other members of the ODPi consortium. Hortonworks is expediting other, newer capabilities through new “Extended Services” releases, which will be offered as quickly as they can be made available. One example of an Extended Service is support for Spark 1.6. Other candidates for this release approach will include Hive, HBase, Ambari “and more,” says Hortonworks.
MyPOV on HDP 2.4: I like this two-pronged strategy, with the stable, slower-moving core complemented throughout the year by extended services. Hortonworks has lagged behind Cloudera in the past in adding certain new capabilities that customers have been eager to use. This is a good approach to fast-tracking capabilities that are in demand (although they presumably can’t require changes to Hadoop core components). The approach also simplifies matters for other distributors of ODPi-based distributions.
Hortonworks DataFlow 1.2
HDF is Hortonworks’ streaming data platform based on Apache NiFi and adapted from last year’s Onyara acquisition. Upgrades with the move to HDF 1.2, which will be available later this month, include the integration of Apache Kafka and Apache Storm streaming analytics engines. The release also gains support for Kerberos for centralized authentication across applications. On the near-term roadmap is support for Spark Streaming, which should be available by early summer, according to Hortonworks.
MyPOV on HDF: There’s much to like in Hortonworks DataFlow, including a drag-and-drop approach for developing the routing, transformation and mediation within dataflows. It also offers built-in data-security and data-provenance capabilities. One exec described it as “a FedEx for streaming data,” providing the digital equivalent of a logistics system for routing streaming data and tracking sources and changes to digital information along the way. The ecosystem seems strong, with support for more than 130 processors for systems including Kafka, Couchbase, Microsoft Azure Event Hub and Splunk.
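For readers who want a feel for the "FedEx for streaming data" analogy, the following Python sketch illustrates the general pattern of rule-based routing with a provenance trail. It is an illustration only and does not use Apache NiFi or HDF APIs; the FlowRecord class, the route function, the rule list and the destination names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRecord:
    """A piece of data in flight plus the history of what has happened to it."""
    payload: dict
    provenance: list = field(default_factory=list)


def route(record, rules, default="unmatched"):
    """Return the destination of the first matching rule and log the hop."""
    for destination, predicate in rules:
        if predicate(record.payload):
            record.provenance.append(f"routed to {destination}")
            return destination
    record.provenance.append(f"routed to {default}")
    return default


# Hypothetical routing rules, evaluated in order
rules = [
    ("alerts", lambda p: p.get("temp", 0) > 100),
    ("archive", lambda p: "temp" in p),
]

hot = FlowRecord({"temp": 120})
hot.provenance.append("received from sensor-gateway")
dest = route(hot, rules)

print(dest)
print(hot.provenance)
```

The point of the pattern is that every record carries its own tracking history, so a downstream consumer can ask where the data came from and which path it took, much as a shipper can trace a package.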
How HDP and HDF are Connected
Hortonworks wants to be a multi-product company, so it has stressed that HDP and HDF will be sold and can be used independently. HDF can route data to (and draw from) other Hadoop distributions, databases such as Cassandra and cloud-based sources, such as Amazon S3.
When use cases span data in motion and data at rest, HDP and HDF have commonalities that make them easier to use together. For example, HDP and HDF share more than 70 data processors and both use Ambari for system deployment and management. What’s more, Hortonworks is promising that SmartSense and the Ranger and Atlas security and governance projects will also support both platforms.
MyPOV on Connected Platforms: The need for the combination of streaming and historical data analysis is popping up in many quarters. It was touted as a benefit of Spark Streaming 2.0 at the recent Spark Summit East event, and MapR also has a strategy to address both forms of data in one platform.
Hype around streaming data opportunities is nothing new. More than a decade ago, complex event processing systems were touted as “ready to go mainstream.” At long last, I think we’re seeing signs that streaming data analysis is emerging. The mobile, social, cloud and big data trends set the stage and maybe, just maybe, the promise of IoT possibilities is pushing it over the top.
PS: Hortonworks also spotlighted two promising Spark-related developments this week. First, it’s shipping a preview of Apache Zeppelin with HDP 2.4, providing a coding-free UI for visualization and a notebook-style approach to working on Spark. This is a usability improvement and democratization tool that Spark sorely needs. Second, in a partnership with HP Enterprise Labs, Hortonworks will bring to open source an optimized shuffle engine for Spark that HP Enterprise says will offer 5X to 15X performance improvements as well as optimized use of memory. This tech doesn’t have project status yet, let alone acceptance from the Spark community, but Hortonworks says it will ship the software with HDP later this year.