Materialize Logo

Architecture Overview

Everything in Materialize is currently handled by the materialized process (pronounced materialize-dee; the “d” is for daemon), which interacts with the outside world by interfacing with:


Materialize deployment diagram

Above: Materialize deployed with multiple Kafka feeds as sources.

Below: Zooming in on Materialize’s internal structure in the above deployment.

Materialize internal diagram

SQL shell: interacting with clients

Right now, Materialize provides its interactive interface through psql running locally on a client machine; this uses the PostgreSQL wire protocol (pgwire) to communicate with materialized. We have a client called mzcli that we recommend using, but it’s just a modified wrapper around pgcli.

Because this is a SQL shell, Materialize lets you interact with your node through SQL statements sent over pgwire to an internal queue, where they are dequeued by a sql thread that parses the statement.

Broadly, there are three classes of statements in Materialize:

Creating sources

When Materialize receives a CREATE SOURCE... statement, it connects to some destination to read data. In the case of streaming sources, it attempts to connect to a Kafka stream, which it plumbs into its local instance of Differential. You can find more information about how that works in the Sources section.

Reading data

Like any SQL API, you read data from Materialize using SELECT statements. When the sql thread parses some arbitrary SELECT statement, it generates a plan––plans in Materialize are dataflows, which can be executed by Differential. This plan gets passed to the dataflow package, which works as the glue between Materialize and its internal Differential engine.

Differential then passes this new dataflow to all of its workers, which begin processing. Once Differential determines that the computation’s complete, the results are passed back to the client, and the dataflow is terminated.

Unfortunately, if the user passes the same query to Materialize again, it must repeat the entire process––creating a new dataflow, waiting for its execution, etc. The inefficiency of this is actually Materialize’s raison d'être, and leads us to the thing you actually want to do with the software: creating views.

Creating views

If you know that you are routinely interested in knowing the answer to a specific query (how many widgets were sold in Oklahoma today?), you can do something much smarter than repeatedly ask Materialize to tabulate the answer from a blank slate––instead, you can create a view of the query, which Materialize will persist and continually keep up to date.

When users define views (i.e. CREATE MATERIALIZED VIEW some_view AS SELECT...), the internal SELECT statement is parsed––just as it is for ad hoc queries––but instead of only executing a single time, the generated dataflow persists. Then, as data comes in from Kafka, Differential workers collaborate to maintain the dataflow and its attendant view.

To read data from views (as opposed to ad hoc queries), users target the view with SELECT * FROM some_view; from here, Materialize can simply return the result from the already-up-to-date view. No substantive processing necessary.

Reading data vs. creating views

As a quick summary: the difference between simply reading data and creating a view is in terms of how long the generated dataflow persists.

The only “wrinkle” in the above explanation is when you perform reads on views: no dataflow gets created, and Materialize instead serves the result from an existing dataflow.’

Sources: Ingesting data

For Materialize to ingest data, it must read it from a source, of which there are two varieties:

File sources are more straightforward, so we’ll focus on streaming sources.

When using a streaming source, Materialize subscribes to Kafka topics and monitors the stream for data it should ingest.

As this data streams in, all Differential workers receive updates and determine which––if any––of their dataflows should process this new data. This works because each Differential worker determines the partitions it’s responsible for; the isolation this self-election process provides prevents contention. Phrased another way, you don’t have to worry about two workers both trying to process the same piece of data.

The actual processing of Differential workers maintaining materialized views is also very interesting and one day we hope to explain it to you here. In the meantime, more curious readers can take the first step towards enlightenment themselves.

Implicit in this design are a few key points:

Learn more

Check out: