Self-Service Data Prep Options Proliferate

SnapLogic and Logi Analytics are the latest to embrace the self-service data-prep trend. But should your choice be an integration vendor, a BI/analytics player or a stand-alone offering?

It seems 2015 is shaping up as the breakout year for self-service data-preparation capabilities. Thus far we’ve seen announcements from Informatica (Informatica Rev), Qlik (Smart Data Load) and independents including Paxata, Trifacta and Tamr. The latest to jump on the bandwagon are SnapLogic and Logi Analytics. As options proliferate, the question for buyers is: which type of vendor is your best choice for self-service data preparation?

The three vendor types bringing self-service data prep to market are BI and analytics vendors (like Qlik and Logi), integration vendors (like Informatica and SnapLogic) and stand-alone vendors (like Paxata and Trifacta) that are most often associated with big data work.

SnapLogic hides the messy details of integration inside reusable “Snaps” and sub-pipelines that non-experts can assemble into new data pipelines. A sub-pipeline preview feature lets admins inspect the steps inside a larger data pipeline.

BI and analytics vendors have long been purveyors of data-integration modules, most typically conventional extract-transform-load (ETL) technologies serving basic data-integration needs. The idea was to be a one-stop shop for customers seeking to bring data warehouses and other sources into the BI and analytics environment. These same vendors have seen their customers demanding self-service data exploration and data visualization capabilities in recent years, so the rise in self-service data prep should be no shock.
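For readers less familiar with the pattern, the extract-transform-load approach these modules implement boils down to three steps. Here's a toy sketch in Python; the record fields and function names are illustrative, not any vendor's API.

```python
# Toy illustration of the classic ETL pattern: extract rows from a
# source, transform them into a clean shape, then load them into a
# target store. All names here are hypothetical.

def extract(source):
    """Pull raw records from a source (here, just an in-memory list)."""
    return list(source)

def transform(records):
    """Normalize records: tidy up names and convert amounts to cents."""
    return [
        {"name": r["name"].strip().title(),
         "cents": int(round(r["amount"] * 100))}
        for r in records
    ]

def load(records, target):
    """Append cleaned records to the target store; return the count."""
    target.extend(records)
    return len(records)

source = [{"name": "  alice ", "amount": 12.5},
          {"name": "BOB", "amount": 3.0}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
print(loaded)  # 2
```

Real ETL tools add connectors, scheduling and error handling around this skeleton, but the extract/transform/load division of labor is the same.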

Logi Analytics fits the pattern. It responded to the self-service data exploration and data visualization craze by introducing Logi Vision in early 2014. Last week it added Logi DataHub, designed to give data professionals as well as analyst types a logical data view for self-service data prep, data access and data enrichment. The short list of integration-ready sources includes cloud-app sources like Salesforce and Marketo, cloud-platform sources from the likes of Amazon and Google, and on-premises sources including HP Vertica, PostgreSQL, and ODBC-standard databases.

Dedicated data-integration vendors have always prided themselves on broader, deeper and generally more sophisticated capabilities than what you’d typically get in an optional module from a BI or analytics vendor. These specialists typically span service-bus-style application integration and ETL-style data integration, and many have developed robust capabilities for integrating cloud-based apps and data sources.

SnapLogic joined the self-service trend last week with its Summer 2015 release of the cloud-based SnapLogic Elastic Integration Platform. (The majority of SnapLogic’s business is in cloud-based app integration, but its software also runs on Hadoop and can be deployed on-premises for data-integration-focused needs.) SnapLogic takes a componentized approach in which the data experts handle the messy details of data access, transformation and processing by preconfiguring reusable Snaps and sub-pipelines. The non-experts can then assemble new data pipelines by snapping together these components. New features supporting self-service include a sub-pipeline preview, which lets admins and non-integration experts drill down and see the Snaps and processing steps within a reusable sub-pipeline.
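The componentized idea is easy to picture as function composition: experts publish small, named steps, and non-experts chain them into pipelines whose inner steps remain inspectable. The sketch below is purely illustrative and does not reflect SnapLogic's actual API; all names are hypothetical.

```python
# Sketch of reusable, composable pipeline steps ("Snaps"): experts
# preconfigure named steps, non-experts chain them, and a pipeline
# can report the steps inside it (akin to a sub-pipeline preview).
from functools import reduce

def make_snap(name, fn):
    """Attach a name to a step so assembled pipelines can be inspected."""
    fn.snap_name = name
    return fn

# Steps preconfigured by the data experts
drop_missing = make_snap("drop_missing",
    lambda rows: [r for r in rows if r.get("email")])
lowercase_emails = make_snap("lowercase_emails",
    lambda rows: [{**r, "email": r["email"].lower()} for r in rows])

def build_pipeline(*snaps):
    """Compose snaps left-to-right into a single callable pipeline."""
    def pipeline(rows):
        return reduce(lambda acc, snap: snap(acc), snaps, rows)
    pipeline.steps = [s.snap_name for s in snaps]  # inspectable step list
    return pipeline

clean = build_pipeline(drop_missing, lowercase_emails)
print(clean.steps)  # ['drop_missing', 'lowercase_emails']
print(clean([{"email": "A@X.COM"}, {"email": None}]))  # [{'email': 'a@x.com'}]
```

The point of the pattern: the non-expert never touches the transformation logic, but can still see (and reorder) the named steps a pipeline is built from.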

Safeguards built into SnapLogic’s self-service approach include a Lifecycle Management feature that lets the experts create, compare and test Snaps and sub-pipelines before sharing them with business users. They can also test pipelines developed by the non-experts before moving them into production. A scheduled-task view is designed to help admins coordinate integration workloads, spot potential performance bottlenecks and schedule lower-priority tasks for off-peak periods.

MyPOV On Self-Service Data Prep

When it comes to choosing one of these vendors, I expect the prevailing selection patterns will live on. So companies focused on BI and analytics will first turn to those vendors to meet their data-integration needs. If needs span application integration and data integration, customers very likely already work with a dedicated integration vendor (or a suite from the likes of IBM or Oracle). There’s good reason to leverage these products and the expertise of your data-integration experts. Beyond this general rule of thumb, I’d investigate the relative user-friendliness of the candidates you’re considering; some are “easy to use” for business users while others are really geared to data-savvy analyst types.

I like the self-service offerings that embed the data-integration experts into self-service capabilities in an oversight capacity. The danger in ungoverned self-service is that not-so-data-savvy users will mash up and interpret data in inconsistent ways. Look for features whereby IT can ensure consistent data definitions and data modeling, providing guard rails around the use of data so that the non-experts don’t end up going off track.
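One concrete form those guard rails can take is a canonical set of column definitions, published by IT, that every self-service mashup is validated against before it can be shared. This is a minimal sketch of the idea, with hypothetical names throughout; no specific product works exactly this way.

```python
# Minimal "guard rails" sketch: IT publishes canonical column
# definitions, and user-built data sets are checked against them.
# All names are hypothetical.

CANONICAL = {
    "customer_id": int,    # agreed definition: integer surrogate key
    "revenue_usd": float,  # always USD, never local currency
}

def validate(rows):
    """Return a list of violations against the canonical definitions."""
    problems = []
    for i, row in enumerate(rows):
        for col, typ in CANONICAL.items():
            if col not in row:
                problems.append(f"row {i}: missing '{col}'")
            elif not isinstance(row[col], typ):
                problems.append(f"row {i}: '{col}' should be {typ.__name__}")
    return problems

mashup = [{"customer_id": 42, "revenue_usd": 99.0},
          {"customer_id": "42", "revenue_usd": 10.0}]
print(validate(mashup))  # ["row 1: 'customer_id' should be int"]
```

A validator like this catches the classic self-service failure mode (the same field typed or defined differently by different users) before the inconsistent data spreads.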

As for that third set of vendors — the stand-alone self-service players focused on big data — they’ve been getting a lot of attention this year, with Trifacta and Paxata in particular getting a lot of buzz. For practitioners who are in the thick of big data analysis, these specialists are filling a void the platform players have yet to address.

As the big data world matures, I suspect niche players will be good candidates for acquisition. Alteryx and Datameer are two vendors I can think of that are often tapped for self-service data prep, but these features are part of larger analytics offerings. It strikes me that self-service data prep is a feature we’re going to see inside many products and platforms.

Related resources:

Informatica Sets New Goals For Its Growing Data Platform
Qlik Morphs Into A Platform Vendor
Alteryx Pioneers Self-Service Data-Prep and Analytics

1 Comment

  1. This is very inspiring and very useful now that data prep is becoming a hot topic and tools are proliferating, Doug. Thanks for that. I especially like your insights on the last part.
    As you mention, more and more people across organizations are spending too much time finding and adjusting the available data sets to address their information needs.
    It may be users of BI or data discovery tools. In this respect, it is interesting to note that the mentioned product from Qlik is named not data prep but Smart Data Load. And after looking at the demo, I feel its purpose is mostly to inject data into Qlik’s associative engine, a capability that is very specific to Qlik’s in-memory engine. Although this is very useful in the context of Qlik, I’m unsure it aims to compete with “traditional” data profiling and transformation functions found in other tools.
    It may be users of advanced analytics, like data scientists, which explains why advanced analytics providers like Alteryx or Datameer are introducing data-prep features inside their toolsets.
    It may be users of integration tools, as in your SnapLogic example. At Talend, we have introduced self-service capabilities in our iPaaS platform, because cloud-based applications tend to make it easier for non-experts to collaborate on design and development tasks that used to be pure IT tasks.

    But, overall, I think data prep should be considered not just an embedded feature in data-centric products and platforms, but rather a service that an organization has to set up and provide to its business users. Data is everywhere and more and more people need it for their daily work. While some of them may have tools like the ones mentioned above, many others use personal tools, especially Excel, and spend a great deal of their time there. And because it is not under control, we hear more and more horror stories related to leaks: for example, WikiLeaks has a Sony section that includes a search engine over very sensitive leaked data from Excel files.

    So, we at Talend feel that data prep should go beyond providing a productivity tool for individual users. Data is a shared asset, and the good news is that tools now allow us to consume it as a self-service. But in most organizations, where information maturity is still in its early phases, achieving this should be a collaborative effort, where data experts have a responsibility to organize information reuse and guide the business users on their road to autonomy. Successfully providing data as a self-service requires an organization empowered by collaborative tools that can share reusable data catalogs and data-preparation tasks across the aforementioned use cases.

