How to Join MySQL and Postgres in a Live Materialized View

When you're working on a project that consists of a lot of microservices, it'll likely also include multiple databases.

For example, you might have a MySQL database and a PostgreSQL database, both running on separate servers.

Usually, to join the data from the two databases, you would have to introduce a new microservice that would join the data together. But this would increase the complexity of the system.

In this tutorial, we will be using Materialize to join MySQL and Postgres in a live materialized view. We'll then be able to query that directly and get results back from both databases in real-time using standard SQL.

Materialize is a source-available streaming database written in Rust that maintains the results of a SQL query (a materialized view) in memory as the data changes.

The tutorial includes a demo project which you can start using docker-compose.

The demo project that we are going to use will monitor the orders on our mock website. It'll generate events that could, later on, be used to send notifications when a cart has been abandoned for a long time.

The architecture of the demo project is as follows:

Prerequisites

All of the services that we will be using in the demo will run inside Docker containers, that way you will not have to install any additional services on your laptop or server rather than Docker and Docker Compose.

In case that you don't have Docker and Docker Compose already installed, you can follow the official instructions on how to do that here:

Overview

As shown in the diagram above, we will have the following components:

A mock service to continually generate orders.
The orders will be stored in a MySQL database.
As the database writes occur, Debezium streams the changes out of MySQL to a Redpanda topic.
We'll also have a Postgres database where we can get our users.
We'll then ingest this Redpanda topic into Materialize directly along with the users from the Postgres database.
In Materialize we'll join our orders and users together, do some filtering, and create a materialized view that shows the abandoned cart information.
We will then create a sink to send the abandoned cart data out to a new Redpanda topic.
At the end we will use Metabase to visualize the data.
You could, later on, use the information from that new topic to send out notifications to your users and remind them that they have an abandoned cart.

As a side note here, you would be perfectly fine using Kafka instead of Redpanda. I just like the simplicity that Redpanda brings to the table, as you can run a single Redpanda instance instead of all of the Kafka components.

How to Run the Demo

First, start by cloning the repository:

git clone https://github.com/bobbyiliev/materialize-tutorials.git

After that you can access the directory:

cd materialize-tutorials/mz-join-mysql-and-postgresql

Let's start by first running the Redpanda container:

docker-compose up -d redpanda

Build the images:

docker-compose build

Finally, start all of the services:

docker-compose up -d

In order to Launch the Materialize CLI, you can run the following command:

docker-compose run mzcli

This is just a shortcut to a Docker container with postgres-client pre-installed. If you already have psql you could run psql -U materialize -h localhost -p 6875 materialize instead.

How to Create a Materialize Kafka Source

Now that you're in the Materialize CLI, let's define the orders tables in the mysql.shop database as Redpanda sources:

CREATE SOURCE orders
FROM KAFKA BROKER 'redpanda:9092' TOPIC 'mysql.shop.orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://redpanda:8081'
ENVELOPE DEBEZIUM;

If you were to check the available columns from the orders source by running the following statement:

SHOW COLUMNS FROM orders;

You would be able to see that, as Materialize is pulling the message schema data from the Redpanda registry, it knows the column types to use for each attribute:

    name      | nullable |   type
--------------+----------+-----------
 id           | f        | bigint
 user_id      | t        | bigint
 order_status | t        | integer
 price        | t        | numeric
 created_at   | f        | text
 updated_at   | t        | timestamp

How to Create Materialized Views

Next, we will create our first Materialized View, to get all of the data from the orders Redpanda source:

CREATE MATERIALIZED VIEW orders_view AS
SELECT * FROM orders;

CREATE MATERIALIZED VIEW abandoned_orders AS
    SELECT
        user_id,
        order_status,
        SUM(price) as revenue,
        COUNT(id) AS total
    FROM orders_view
    WHERE order_status=0
    GROUP BY 1,2;

You can now use SELECT * FROM abandoned_orders; to see the results:

SELECT * FROM abandoned_orders;

For more information on creating materialized views, check out the Materialized Views section of the Materialize documentation.

How to Create a Postgres Source

There are two ways to create a Postgres source in Materialize:

Using Debezium just like we did with the MySQL source.
Using the Postgres Materialize Source, which allows you to connect Materialize direct to Postgres so you don't have to use Debezium.

For this demo, we will use the Postgres Materialize Source just as a demonstration on how to use it, but feel free to use Debezium instead.

To create a Postgres Materialize Source run the following statement:

CREATE MATERIALIZED SOURCE "mz_source" FROM POSTGRES
CONNECTION 'user=postgres port=5432 host=postgres dbname=postgres password=postgres'
PUBLICATION 'mz_source';

A quick rundown of the above statement:

MATERIALIZED: Materializes the PostgreSQL source’s data. All of the data is retained in memory and makes sources directly selectable.
mz_source: The name for the PostgreSQL source.
CONNECTION: The PostgreSQL connection parameters.
PUBLICATION: The PostgreSQL publication, containing the tables to be streamed to Materialize.

Once we've created the PostgreSQL source, in order to be able to query the PostgreSQL tables, we would need to create views that represent the upstream publication’s original tables.

In our case, we only have one table called users so the statement that we would need to run is:

CREATE VIEWS FROM SOURCE mz_source (users);

To see the available views execute the following statement:

SHOW FULL VIEWS;

Once that is done, you can query the new views directly:

SELECT * FROM users;

Next, let's go ahead and create a few more views.

How to Create a Kafka Sink

Sinks let you send data from Materialize to an external source.

For this Demo, we will be using Redpanda.

Redpanda is Kafka API-compatible and Materialize can process data from it just as it would process data from a Kafka source.

Let's create a materialized view, that will hold all of the high volume unpaid orders:

 CREATE MATERIALIZED VIEW high_value_orders AS
      SELECT
        users.id,
        users.email,
        abandoned_orders.revenue,
        abandoned_orders.total
      FROM users
      JOIN abandoned_orders ON abandoned_orders.user_id = users.id
      GROUP BY 1,2,3,4
      HAVING revenue > 2000;

As you can see, here we are actually joining the users view which is ingesting the data directly from our Postgres source, and the abandond_orders view which is ingesting the data from the Redpanda topic, together.

Let's create a Sink where we will send the data of the above materialized view:

CREATE SINK high_value_orders_sink
    FROM high_value_orders
    INTO KAFKA BROKER 'redpanda:9092' TOPIC 'high-value-orders-sink'
    FORMAT AVRO USING
    CONFLUENT SCHEMA REGISTRY 'http://redpanda:8081';

Now if you were to connect to the Redpanda container and use the rpk topic consume command, you will be able to read the records from the topic.

However, as of the time being, we won’t be able to preview the results with rpk because it’s AVRO formatted. Redpanda would most likely implement this in the future, but for the moment, we can actually stream the topic back into Materialize to confirm the format.

First, get the name of the topic that has been automatically generated:

SELECT topic FROM mz_kafka_sinks;

Output:

                              topic
-----------------------------------------------------------------
 high-volume-orders-sink-u12-1637586945-13670686352905873426

For more information on how the topic names are generated check out the documentation here.

Then create a new Materialized Source from this Redpanda topic:

CREATE MATERIALIZED SOURCE high_volume_orders_test
FROM KAFKA BROKER 'redpanda:9092' TOPIC ' high-volume-orders-sink-u12-1637586945-13670686352905873426'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://redpanda:8081';

Make sure to change the topic name accordingly!

Finally, query this new materialized view:

SELECT * FROM high_volume_orders_test LIMIT 2;

Now that you have the data in the topic, you can have other services connect to it and consume it and then trigger emails or alerts for example.

How to Connect Metabase

In order to access the Metabase instance visit http://localhost:3030 if you are running the demo locally or http://your_server_ip:3030 if you are running the demo on a server. Then follow the steps to complete the Metabase setup.

Make sure to select Materialize as the source of the data.

Once ready you will be able to visualize your data just as you would with a standard PostgreSQL database.

How to Stop the Demo

To stop all of the services, run the following command:

docker-compose down

Conclusion

As you can see, this is a very simple example of how to use Materialize. You can use Materialize to ingest data from a variety of sources and then stream it to a variety of destinations.