ChatGPT, ZeroETL, and Other Data Engineering Disruptors

The modern data stack reigns supreme because it supports use cases and unlocks value from data in ways that were previously, if not impossible, then certainly very difficult. Machine learning moved from buzzword to revenue generator. Analytics and experimentation can go deeper to support bigger decisions.

The same will be true for each of the trends below. There will be pros and cons, but what will drive adoption is how they, or the dark horse idea we haven’t yet discovered, unlock new ways to leverage data. Let’s look closer at each.

Zero-ETL

What it is: A misnomer, for one thing; the data pipeline still exists.

Today, data is often generated by a service and written into a transactional database. An automatic pipeline is deployed, which not only moves the raw data to the analytical data warehouse but modifies it slightly along the way.

For example, APIs will export data in JSON format, and the ingestion pipeline will need to not only transport the data but apply light transformation to ensure it is in a table format that can be loaded into the data warehouse. Other common light transformations done within the ingestion phase are data formatting and deduplication.
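As a rough illustration of what "light transformation" can mean here, the sketch below flattens nested JSON records into flat rows, normalizes one field, and deduplicates on a key. The record shape, the `id` key, and the `user.email` field are assumptions made up for the example, not taken from any particular API or ingestion tool.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, using dotted column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


def ingest(raw_json_lines, key="id"):
    """Parse JSON lines, flatten them, format one field, and deduplicate.

    When the same key appears more than once, the last record wins.
    """
    latest = {}
    for line in raw_json_lines:
        row = flatten(json.loads(line))
        # Light formatting: trim and lowercase the (hypothetical) email column.
        if "user.email" in row:
            row["user.email"] = row["user.email"].strip().lower()
        latest[row[key]] = row
    return list(latest.values())


# Hypothetical API export: record 1 arrives twice.
raw = [
    '{"id": 1, "user": {"email": " Ana@Example.com ", "plan": "free"}}',
    '{"id": 2, "user": {"email": "bo@example.com", "plan": "pro"}}',
    '{"id": 1, "user": {"email": "ana@example.com", "plan": "pro"}}',
]
rows = ingest(raw)
for row in rows:
    print(row)
```

Each resulting row has the same flat set of columns, which is the shape a warehouse loader expects. Anything heavier than this (joins, business logic, modeling) is usually left for after the load.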

While you can do heavier transformations by hard-coding pipelines in Python, and some have advocated for doing just that to deliver data pre-modeled to the warehouse, most data teams choose not to do so for expediency and visibility/quality reasons.

Zero-ETL changes this ingestion process by having the transactional database do the data cleaning and normalization prior to automatically loading it into the data warehouse. It’s important to note the data is still in a relatively raw state.

At the moment, this tight integration is possible because most zero-ETL architectures require both the transactional database and data warehouse to be from the same cloud provider.

Pros: Reduced latency. No duplicate data storage. One fewer point of failure.

Cons: Less ability to customize how the data is treated during the ingestion phase. Some vendor lock-in.

Who’s driving it: AWS is the driver behind the buzzword (Aurora to Redshift), but GCP (Bigtable to BigQuery) and Snowflake (Unistore) offer similar capabilities. Snowflake (Secure Data Sharing) and Databricks (Delta Sharing) are also pursuing what they call “no copy data sharing.” This process actually doesn’t involve ETL and instead provides expanded access to the data where it’s stored.

Practicality and value unlock potential: On one hand, with the tech giants behind it and ready-to-go capabilities, zero-ETL seems like it’s only a matter of time. On the other, I’ve observed data teams decoupling rather than more tightly integrating their operational and analytical databases to prevent unexpected schema changes from crashing the entire operation.

This innovation could further lower the visibility and accountability of software engineers toward the data their services produce. Why should they care about the schema when the data is already on its way to the warehouse shortly after the code is committed?

With data streaming and micro-batch approaches seeming to serve most demands for “real-time” data at the moment, I see the primary business driver for this type of innovation as infrastructure simplification. And while that’s nothing to scoff at, the potential for no copy data sharing to remove obstacles such as lengthy security reviews may result in greater adoption in the long run (although, to be clear, it’s not an either/or).

One Big Table and Large Language Models

What it is: Currently, business stakeholders need to express their requirements, metrics, and logic to data professionals, who then translate it all into a SQL query and maybe even a dashboard. That process takes time, even when all the data already exists within the data warehouse. Not to mention that, on the data team’s list of favorite activities, ad hoc data requests rank somewhere between a root canal and documentation.

There is a bevy of startups aiming to use the power of large language models like GPT-4 to automate that process by letting consumers “query” the data in their natural language in a slick interface.
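To make the idea concrete, here is a minimal sketch of how such a natural-language "query" might flow: the table schema and the question go into a prompt, a model returns SQL, and the SQL runs against the database. Everything here is an assumption for illustration: the `orders` table, the prompt wording, and especially `fake_complete`, a hypothetical stand-in for a real model call that returns a canned answer so the example runs offline. It is not how any particular product works.

```python
import sqlite3


def build_prompt(schema_ddl, question):
    """Combine the table schema and the user's question into one prompt."""
    return (
        "Translate the question into a single SQLite SELECT statement.\n"
        f"Schema:\n{schema_ddl}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def answer(conn, schema_ddl, question, complete):
    """Ask `complete` (any prompt -> text callable) for SQL, then run it."""
    sql = complete(build_prompt(schema_ddl, question)).strip()
    # Basic guardrail: refuse anything that is not a read-only query.
    if not sql.lower().startswith("select"):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()


def fake_complete(prompt):
    """Hypothetical stand-in for a language model call (canned response)."""
    return "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"


SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)"

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 5.0), (3, "EU", 2.5)],
)

result = answer(conn, SCHEMA, "What is total revenue by region?", fake_complete)
print(result)
```

In a real system, `fake_complete` would be replaced by a call to a hosted model, and the generated SQL would need far more validation than the single `SELECT` check shown here.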