What is Materialize?
Materialize is a streaming SQL materialized view engine.
If that jargon-filled sentence doesn’t answer all of your questions, we’ll cover some important conceptual topics related to Materialize:
- What it helps teams accomplish
- What it actually is
- How it works
Materialize’s design objective
Materialize lets teams very quickly get answers to questions they routinely ask.
It’s useful to think of these questions as the kind of queries you might use to power business intelligence dashboards. You might have a ton of data that comes in all day, and each team in your org wants to answer some questions using that data.
Traditionally, to get answers, some batch processing system performs a query over all of the data every night. This approach makes intuitive sense: the working set is often larger than any machine’s memory and you need to process all of it. These kind of computations are therefore relatively expensive, so it’s not feasible to execute them on the fly. Instead, you want to compute the answer once and then cache it until you derive a new answer.
This once-a-day update cycle works, but it’s inopportune to continually recompute an answer that you once knew. It’s also slow and doesn’t allow teams to nimbly make decisions.
In contrast to the traditional approach, Materialize let teams continually update the answers to their queries as new data comes in. This means that you no longer need to rely on day-old data to inform your decisions. You can now simply look at a real-time answer.
Understanding the jargon
“A streaming SQL materialized view engine” is difficult to wade through, so let’s take examine the terms piece-by-piece.
SQL & Views
One way to consider the answer to one of the aforementioned analytical queries is as a “view” of your data. There is some defined query that you want to execute, that results in giving you an answer that contains some subsets and computations of your data.
In the language of SQL, a view is a pre-defined query and its results; every time you query a view, the underlying query is executed and you get the results back.
However, continually re-executing queries is expensive, so another implementation of views instead stores the view’s results as a physical table, i.e. it caches them. Because this involves storing the results, they’re referred to as materialized views.
Materialized views are often maintained with some periodicity and don’t reflect the underlying data with total fidelity. This essentially makes them snapshots of the results, meaning any updates that have happened since the snapshot was generated will not be reflected in the materialized view. It’s easy to liken this to the batch processing strategy discussed above; you occasionally perform some expensive computation and cache the results.
However, it’s also possible that if you were to describe the changes that happened in the underlying database that you would be able to reflect those changes in materialized views without needing to periodically recompute the answer to the query in total. Instead, each successful operation could be passed to an engine that could then parse the operation and, if necessary, update the materialized view. This is known as an incrementally updated materialized view.
If we understand the implementation of incrementally updated materialized views as “the need to see a stream of updates that are occurring to the underlying database”, it could be modeled simply as “change data capture” (CDC). Using CDC, whenever a change occurs to the underlying database, a structure describing the change is propagated to some destination.
One such destination you can use for CDC is a stream like Apache Kafka. As your database changes, you can describe those changes in a published feed, and services that care about those changes can subscribe to it.
To maintain materialized views using a stream of data, you need an engine to subscribe to the CDC streams, and then perform the computations for and maintenance of the materialized view.
One approach to this is using a dataflow engine, which is a set of computations that work over streams of data; they take a stream as their input, transform it, and output their own stream. If you think of dataflows in similar terms as functional programming, you can see that they can complete arbitrarily complex tasks.
By modeling a SQL query as a dataflow, you can take in a CDC stream, apply a set of transformations to it, and then observe the results as final operator’s output set.
Putting it all together
To put all of this together, Materialize lets you:
- Describe queries you want to know the answer to as materialized views, which is a concept implemented in SQL.
- Use CDC streams to feed our dataflow engine to create and maintain the answers to those queries.
Ultimately, this lets you refresh answers to your queries very quickly because Materialize continually updates their answers as new data streams in.
How does Materialize work?
While we’ve covered the high-level details above, this section gives some of the more specific details about Materialize itself.
Materialize ingests streams of data from Kafka, which are publishing a CDC stream from an upstream relational database (e.g. MySQL or PostgreSQL). In its current iteration, the only CDC tool known to work with Materialize is Debezium.
Using a familiar SQL interface, you can define which tables from Kafka you want Materialize to monitor (known as “sources”), as well as which queries you want answer (known as “views”). Materialize then creates a dataflow for your views using a Differential dataflow.
Like we mentioned above, a dataflow is a set of connected operators that each both consume and produce streams of data, creating a network of computation that can respond to streams of input data that sources supply. Differential dataflows are special, though, in that they can easily perform incremental updates at each step of the dataflow.
As data streams in from your sources, your dataflows determine which data are relevant and update their result sets only if it needs to, e.g. there are new rows or values they care about have changed.
When you query one of your views, Materialize can simply return the dataflow’s result set from memory, which should always be faster than computing the answer from scratch.
Materialize vs. Other Methodologies
Batch processing relies on perform large, expensive computations infrequently. This means that you can never achieve real-time results, and it often does not build upon its prior computations.
In contrast, Materialize continually updates queries as data comes in, which means it performs small, efficient computations constantly. This enables real-time results and builds upon all of your prior computations.
Materialized views in relational databases
To maintain materialized views, most RDBMSes occasionally re-run the view’s underlying query. This results in a potential impact to the database’s performance while updating the view, as well as data that is only infrequently up-to-date.