OpenTofu, the open source fork of Terraform™, became generally available last month. I wanted to create a project using Tofu to try it out and see if there were any issues or differences compared to Terraform.
I had a couple of things in mind doing this. Aside from trying OpenTofu for the first time, I wanted to make sure Tofu could work with Packer-built AMIs, launch templates and the Ansible provider.
The quick answer is that all 3 continue to work with Tofu. I didn't experience a single issue.
I’m working a lot with Flink right now and enjoying it, so I created a Tofu version of a Flink cluster on AWS: https://github.com/gordonmurray/tofu_aws_apache_flink
To start off, Packer is used to create an AMI: a base image with Java, Flink and some JARs installed.
Tofu can then use the resulting AMI to create one or more EC2 instances. For the Flink Job Manager I used a standard aws_instance resource. For the Task Managers I used aws_launch_template resources instead of directly creating EC2 instances, so that any failing Task Manager would be replaced automatically from the base image and any workloads running on Flink could continue uninterrupted.
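As a rough sketch (the resource names, variables and values here are illustrative, not lifted from the repo), the Task Manager side looks something like this:

# Launch template for Flink Task Managers, based on the Packer-built AMI
resource "aws_launch_template" "task_manager" {
  name_prefix   = "flink-task-manager-"
  image_id      = var.flink_ami_id # AMI produced by Packer
  instance_type = "m7g.large"

  instance_market_options {
    market_type = "spot" # lower cost, may be reclaimed at any time
  }
}

# The auto scaling group replaces any Task Manager that disappears
resource "aws_autoscaling_group" "task_managers" {
  min_size            = 1
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.task_manager.id
    version = "$Latest"
  }
}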
I could use the same launch template approach for the Job Managers too. That would involve running 2 or 3 Job Managers, with ZooKeeper getting involved to hold things together. I might add that to the repo in the near future.
Using a launch template opens up the opportunity to use Spot instances. Those are EC2 instances that are lower cost to run, but can disappear at any time. This is where the launch template comes in: it will replace any instance that disappears.
I created 2 independent launch templates and 2 auto scaling groups. Both use Spot instances, but each defines a different instance type: an m7g.large and an m7g.xlarge. This helps to make sure that at least one type of instance is running if the other is unavailable as a Spot instance. It also allows both groups to be scaled up or down independently. One group might better handle sourcing data from Kafka, for example, and the other might be better suited to sinking data to an s3 bucket.
While using Spot instances helps to keep the cost down, it also provides an opportunity to use a larger instance type than one might ordinarily consider, due to the expense. Flink loves to use more memory!
User data is a way to run commands on an EC2 instance when it starts up. Tofu applies the user data to update the Flink config on a Task Manager as it starts up, such as setting the Job Manager IP address so that a new Task Manager can join the cluster.
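As an illustrative fragment (the sed expression, the config path and the aws_instance.flink_job_manager reference are assumptions, not taken from the repo), the user data in the launch template could look like this:

user_data = base64encode(<<-EOF
  #!/bin/bash
  # Set the Job Manager's private IP in flink-conf.yaml, then start the Task Manager
  sed -i "s/^jobmanager.rpc.address:.*/jobmanager.rpc.address: ${aws_instance.flink_job_manager.private_ip}/" /opt/flink/conf/flink-conf.yaml
  /opt/flink/bin/taskmanager.sh start
EOF
)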
With a Flink cluster up and running, the next step is to give it some work to do. Flink can take in work in the form of Java applications compiled as JARs, and also in the form of SQL using Flink SQL. I used the Ansible provider for Terraform to get Tofu to call an Ansible playbook that submits Flink SQL work to the cluster.
Using an Ansible provider to submit work to Flink might be an unusual step, though it has advantages. Details of resources created by Tofu, such as a database, cache or s3 bucket, can be passed to Ansible as variables, and Ansible can use that information in its Flink SQL jobs.
The job to submit to Flink is in the form of an Ansible role. A task reads in a template file that contains the Flink SQL and can perform any interpolation needed, such as Kafka broker addresses, registry addresses, sink addresses and so on. The directory structure is as follows, with a sketch of the task file after it. More roles can be added over time to add more work to the Flink cluster.
flink_jobs
├── tasks
│ └── main.yml
└── templates
└── queries.sql.j2
3 directories, 2 files
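A minimal sketch of what tasks/main.yml could contain (standard Ansible modules; the Flink path and destination are assumptions):

# tasks/main.yml - a sketch; paths are illustrative
- name: Render the Flink SQL from the template
  ansible.builtin.template:
    src: queries.sql.j2
    dest: /tmp/queries.sql

- name: Submit the SQL file to the Flink SQL client
  ansible.builtin.command: /opt/flink/bin/sql-client.sh -f /tmp/queries.sql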
Overall, Tofu worked well out of the box. It felt faster to apply changes compared to Terraform, though that's probably because it's a small project.
Some improvements to this project could include:
The main questions I had in mind for this were (a) whether I could get it to work and (b) whether it would exclude the field(s) in the initial snapshot phase that Debezium performs (whereby all existing data in a table is copied), in the ongoing CDC phase (whereby ongoing changes are copied), or both.
The good news is that I got it working well and that a field is excluded in both the snapshot and the ongoing CDC phase, which is great. I was worried it might only work in one of those phases.
I lost some time down a rabbit hole at the start. I had read somewhere that the connector field to use to exclude a column was column.blacklist, and no matter what values I used in it, I couldn't get it to work. After some Googling I found that it had been renamed to column.exclude.list.
I created a project on Github with the files. It uses Docker Compose to create a MariaDB instance, populate it with some sample data, and then add a connector config to CDC from a table into Kafka, with some fields excluded.
The sample data I used is a users table with some generated data:
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50),
surname VARCHAR(50),
email VARCHAR(100),
date_of_birth DATE,
signed_up DATETIME,
user_type ENUM('user', 'admin')
);
INSERT INTO users (first_name, surname, email, date_of_birth, signed_up, user_type) VALUES
('John', 'Doe', 'john.doe@example.com', '1990-01-01', '2022-01-15 08:30:00', 'user'),
('Jane', 'Smith', 'jane.smith@example.com', '1985-05-20', '2022-02-10 09:45:00', 'admin'),
('Alice', 'Johnson', 'alice.johnson@example.com', '1992-07-11', '2022-03-05 10:00:00', 'user'),
('Bob', 'Brown', 'bob.brown@example.com', '1988-09-30', '2022-04-20 11:20:00', 'user'),
('Charlie', 'Davis', 'charlie.davis@example.com', '1995-11-15', '2022-05-25 13:45:00', 'admin');
I then created a connector for Debezium that included the following value to exclude the surname, email and date of birth fields from being copied to Kafka.
"column.exclude.list": "mydatabase.users.surname, mydatabase.users.email, mydatabase.users.date_of_birth",
Once I started the connector, I was able to query the messages in the resulting topic to see if the fields were present.
➜ ~ ./kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testing.mydatabase.users --from-beginning | jq
One of the records now looks like the following example. There's not much left after the sensitive fields have been removed, but it's working well.
"payload": {
"id": 5,
"first_name": "Charlie",
"signed_up": 1653486300000,
"user_type": "admin"
}
To test the CDC part, I added and updated a record or two in the source database, to see what the data might look like:
INSERT INTO users (first_name, surname, email, date_of_birth, signed_up, user_type) VALUES
('Bobby', 'Tables', 'bobby.tables@example.com', '1995-07-01', '2023-03-15 08:30:00', 'user');
update users set email = 'bobby.tables.new@example.com' where first_name = 'Bobby' and surname = 'Tables';
The topic content was still good, no sign of the sensitive fields being carried over:
"payload": {
"id": 6,
"first_name": "Bobby",
"signed_up": 1678869000000,
"user_type": "user"
}
So it works well. A one-line change in a source connector can remove any unwanted or sensitive fields, so there's no excuse for sensitive data getting into your data lake!
The files are available on https://github.com/gordonmurray/debezium_exclude_columns
I created a small project to source data from a relational database using Debezium to populate some Kafka topics, and then added the HTTP connector to send data from one of the topics to an HTTP endpoint.
It works well, and there were a couple of pleasant surprises, such as the connector auto-creating topics in the Kafka cluster to store a record of errors and successes of the HTTP calls. That's great for debugging delivery issues. You can control the topic names in the connector config too.
I created a Docker Compose file that creates a MariaDB database with a tiny products table, and a Debezium container. The first connector is added to source the data from the database and populate Kafka topics. The second connector is the HTTP connector, to send events from one of the topics to an HTTP endpoint.
I used WarpStream as my Kafka cluster and Ngrok as a temporary endpoint to send data to, so I could inspect the data received.
The source connector to read data from a table in the database is located at files/connector_mariadb.json and looks like the following. It is a minimal connector, with no serialization of the data:
{
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.history.kafka.bootstrap.servers": "api-xxxxxxxxxx.warpstream.com:9092",
"database.history.kafka.topic": "history",
"database.hostname": "mariadb",
"database.password": "rootpassword",
"database.port": "3306",
"database.server.id": "12",
"database.server.name": "myconnector",
"database.user": "root",
"database.whitelist": "mydatabase",
"schema.history.internal.kafka.bootstrap.servers": "api-xxxxxxxxxx.warpstream.com:9092",
"schema.history.internal.kafka.topic": "schema-changes.mydatabase",
"table.whitelist": "mydatabase.products",
"tasks.max": "1",
"topic.prefix": "testing",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms": "unwrap"
}
The fields to pay attention to in the JSON config are:
- database.history.kafka.bootstrap.servers & schema.history.internal.kafka.bootstrap.servers - your Kafka brokers; in my case I'm using WarpStream
- database.* fields - the connection details for the database
- table.whitelist - the table you want to pull data from, written in the format of schema.tablename

Debezium will create a number of topics in the Kafka cluster. In this case it will create a topic called testing.mydatabase.products, based on the prefix and the table.whitelist contents.
Once the topic is in place with some records, I can then add the HTTP sink connector, located at files/connector_http.json. The connector config looks like the following:
{
"connector.class": "io.confluent.connect.http.HttpSinkConnector",
"tasks.max": "1",
"topics": "testing.mydatabase.products",
"http.api.url": "https://xxxxxxxxxx.ngrok-free.app",
"request.method": "POST",
"headers": "Content-Type:application/json",
"batch.size": "3",
"max.retries": "3",
"retry.backoff.ms": "3000",
"connection.timeout.ms": "2000",
"request.timeout.ms": "5000",
"confluent.topic.bootstrap.servers": "api-xxxxxxxxxx.warpstream.com:9092",
"reporter.bootstrap.servers": "api-xxxxxxxxxx.warpstream.com:9092",
"reporter.error.topic.name": "error-topic",
"reporter.error.topic.replication.factor": "1",
"reporter.result.topic.name": "success-topic",
"reporter.result.topic.replication.factor": "1"
}
The fields to pay attention to in the JSON config are:
- topics - the topic in Kafka you want to respond to; in my case it's the testing.mydatabase.products topic created earlier
- http.api.url - the URL to send the events to; in my case an ngrok endpoint for testing
- confluent.topic.bootstrap.servers & reporter.bootstrap.servers - your Kafka brokers; in my case I'm using WarpStream

For an API endpoint I used ngrok. I started it locally using the following command and it gave me a URL to add to my HTTP connector:
ngrok http 80
Once the connectors are updated with any changes you want to make you can start the containers.
Run docker-compose up -d
and connect to the Debezium container:
docker exec -it debezium /bin/bash
Docker Compose will upload the connectors into the Debezium container. You can start the connectors using the following commands:
Create the source connector:
curl -X PUT http://localhost:8083/connectors/my_database/config -H "Content-Type: application/json" -d @connector_mariadb.json
Give it a moment to create the topics and then create the HTTP sink connector:
curl -X PUT http://localhost:8083/connectors/http/config -H "Content-Type: application/json" -d @connector_http.json
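Kafka Connect's REST API is handy for confirming that both connectors are up:

# List the registered connectors
curl http://localhost:8083/connectors

# Check the status of the HTTP sink connector
curl http://localhost:8083/connectors/http/status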
You should see at least 3 calls to the API endpoint, one for each of the 3 records that are in the database by default. Adding or updating records in the table will trigger more calls to the endpoint.
The events look like the following in Ngrok
Struct{id=3,name=Product C,price=39.99}
It doesn't trigger an HTTP event for schema changes, but any events after a schema change will contain the data for any new columns that are added. I added a notes column and the data appeared in the next HTTP call without any changes.
Struct{id=5,name=product E,price=33.00,notes=great product}
Schema changes are recorded in another topic called schema-changes.mydatabase, so another HTTP sink could be used to monitor that topic if you wanted to watch for schema changes.
Overall, a quick and successful test. I'll be able to use this connector in future to send content from topics off to API endpoints, handy for keeping systems in sync.
The files I used for this are available here in Github: https://github.com/gordonmurray/debezium_warpstream_http_sink
To verify the results produced by the Flink jobs, I was able to query the source database and the destination Redis instance that Flink sends its results to. As things scale up, though, it isn't practical for me to query the database and Redis for every customer that's processed.
There is a feature of Vector.dev I had read about in the past and wanted to try out: the ability to convert logs to metrics that Prometheus can scrape. I wasn't sure how this would work, but I took some time to try it out. I'm glad I spent the time on it; it gives me a clear visual in a Grafana dashboard showing whether the results are progressing or are fully in sync.
I wrote a bash script that queries the source database and destination Redis periodically for a number of customers and writes a brief log line each time. Ideally the source and destination counts would be the same, resulting in a difference of 0 and a "Pass". If the counts were different, it would log the difference and label it as a "Fail".
The Pass and Fail are mainly there for me to see at a glance; they are not needed for the process to work. The logs look something like the following:
datetime, customerID, source count, destination count, difference, pass/fail
2023-12-18 09:33:22, 12345, 1000, 1000, 0, PASS
2023-12-18 10:33:12, 67890, 500, 200, 300, FAIL
The bash script runs every so often and appends results to the log file each time.
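The script is essentially a loop like the following sketch (the table name, Redis key layout and hosts are illustrative assumptions):

#!/bin/bash
# Compare source and destination counts per customer and append a log line
for customer in 12345 67890; do
    src=$(mysql -h db-host -N -e "SELECT COUNT(*) FROM results WHERE customer_id = ${customer}" mydb)
    dst=$(redis-cli -h redis-host GET "customer:${customer}:count")
    dst=${dst:-0}
    diff=$((src - dst))
    status="PASS"
    [ "${diff}" -ne 0 ] && status="FAIL"
    echo "$(date '+%Y-%m-%d %H:%M:%S'), ${customer}, ${src}, ${dst}, ${diff}, ${status}" >> /path/to/results.csv
done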
This is where Vector.dev comes in. Vector uses a config file that defines a source, a transformation and a destination. In my case the source was a log file and the destination was Prometheus metrics. Vector Remap Language (VRL) is used to define the transformation to apply to the log lines.
The full config file for Vector looks like this:
# A source section pointing to my results log file
[sources.my_source]
type = "file"
include = ["/path/to/results.csv"]
# A transformation written using VRL to parse the log lines
[transforms.parse_logs]
type = "remap"
inputs = ["my_source"]
source = '''
parsed = parse_csv!(.message)
.log_timestamp = parsed[0]
.customer = parsed[1]
.source_count = parsed[2]
.destination_count = parsed[3]
.difference = to_int!(parsed[4])
'''
# structure the output
[transforms.to_metrics]
type = "log_to_metric"
inputs = ["parse_logs"]
metrics = [
{ type = "gauge", field = "difference", tags = { "customer" = "" }, name = "log_difference" }
]
# A sink to expose the transformed data for Prometheus
[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["to_metrics"]
address = "0.0.0.0:9598" # Prometheus scrape URL
When Vector starts up, it exposes an endpoint at http://localhost:9598 that Prometheus can scrape.
The output produced by Vector looks something like the following, with a label for each customer, the resulting difference and a timestamp.
log_difference{customer="12345"} 0 1703793649000
log_difference{customer="67890"} 300 1703793649000
A quick update to Prometheus makes it aware of the new endpoint on port 9598:
scrape_configs:
- job_name: 'custom_metrics'
static_configs:
- targets: ['xxx.xxx.xxx.xxx:9598']
I was then able to create a new visualization in a Grafana dashboard alongside other Flink info. The following image shows the data is way off in the beginning and becomes consistent (the difference between source and destination reaches 0) as the Flink job completes processing the data.
Having this visualization has been a great addition. I can see the numbers go out of sync at times when Flink is processing large amounts of new data coming in from Kafka and then it syncs up again.
However, if you have a large volume of logs, or if you hope to retain the logs for a long time, the costs can add up quickly.
Using a combination of Vector.dev (a logging agent), WarpStream (a Kafka-compatible data streaming platform) and Apache Flink (a distributed processing engine) can provide a hugely cost-effective and powerful logging solution worth looking at.
Using some generated access log data, we can create a working example of these tools working together.
The following diagram shows the end result: logs flow from a source to WarpStream, WarpStream stores the data in s3, and Flink reads the data from WarpStream, allowing you to run queries on it and send it back to s3 in a format suitable for long-term storage and querying.
First, create a WarpStream account at https://warpstream.com and create your first virtual cluster. Take note of the cluster ID and create an API key.
Install the WarpStream agent using the steps here: https://docs.warpstream.com/warpstream/install-the-warpstream-agent
Once installed use the following command to start the virtual cluster:
warpstream agent -agentPoolName apn_[YOUR CLUSTER] -bucketURL s3://[YOUR S3 BUCKET] -apiKey aks_[YOUR API KEY] -defaultVirtualClusterID vci_[YOUR CLUSTER ID] -httpPort 8090
Vector.dev is quick to install and use. There is a handy quick start guide on their site https://vector.dev/docs/setup/quickstart/
Next, add a config file to tell Vector to generate some sample logs for us to use, and send those logs to WarpStream:
#/etc/vector/vector.toml
data_dir = "/var/lib/vector"
[api]
enabled = true
[sinks.my_sink_id.encoding]
codec = "json"
[sources.my_source_id]
type = "demo_logs"
count = 1000000
format = "json"
interval = 1
lines = [ "line1" ]
[transforms.parse_apache_logs]
type = "remap"
inputs = ["my_source_id"]
source = '''
parsed_json = parse_json!(.message)
.host = parsed_json.host
.user_identifier = parsed_json."user-identifier"
.datetime = parsed_json.datetime
.method = parsed_json.method
.request = parsed_json.request
.protocol = parsed_json.protocol
.status = parsed_json.status
.bytes = parsed_json.bytes
.referer = parsed_json.referer
'''
[sinks.my_sink_id]
type = "kafka"
inputs = [ "parse_apache_logs" ]
bootstrap_servers = "localhost:9092"
topic = "logs"
This config consists of a source which generates sample data, a transform which extracts the fields we need (host, identifier, datetime and so on) from the nested message JSON, and a sink which sends the parsed logs on to our WarpStream virtual cluster.
The bootstrap_servers value points to localhost:9092, which is the endpoint of your WarpStream agent running locally.
You can start Vector with your config file using:
vector -c /etc/vector/vector.toml
With Vector and WarpStream running, you should be receiving some data into your s3 bucket.
If you have the AWS CLI installed, you can view the contents of your s3 bucket using the following command to see if there is any new data:
aws s3 ls s3://my-test-bucket/
Or if you have the Kafka CLI, you can list the topics and fetch some messages:
# list topics in the cluster
./bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# list messages in a topic called logs
./bin/kafka-console-consumer.sh --topic logs --from-beginning --bootstrap-server localhost:9092
You can start a Flink cluster using Docker compose from this repo https://github.com/gordonmurray/apache_flink_and_warpstream_for_logs
Now we can get to the fun part of querying the log data using Flink. First we can create a table to hold the logs and perform some quick queries
Connect to the Flink SQL console using:
./bin/sql-client.sh
And create a table to hold the logs. Make sure to add your Magic URL to the bootstrap.servers field.
CREATE TABLE apache_logs (
`bytes` INT,
`datetime` STRING,
`host` STRING,
`message` STRING,
`method` STRING,
`protocol` STRING,
`referer` STRING,
`request` STRING,
`service` STRING,
`source_type` STRING,
`status` STRING,
`mytimestamp` TIMESTAMP(3) METADATA FROM 'timestamp', -- assuming timestamp is in standard format
`user_identifier` STRING,
WATERMARK FOR mytimestamp AS mytimestamp - INTERVAL '5' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'logs',
'properties.bootstrap.servers' = 'api-xxxxxxxxxxxxxxxxxx.warpstream.com:9092',
'properties.group.id' = 'flink',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json',
'json.fail-on-missing-field' = 'false',
'json.ignore-parse-errors' = 'true'
);
Once the table is created you can query it like a regular table in a relational database:
select * from apache_logs;
We can use Flink to keep an eye on our logs for some key patterns. As an example, here is a query to spot the same IP address appearing with varying user identifiers within a one-minute window:
SELECT
host AS ip,
COUNT(DISTINCT user_identifier) AS user_identifier_count,
TUMBLE_START(mytimestamp, INTERVAL '1' MINUTE) as window_start,
TUMBLE_END(mytimestamp, INTERVAL '1' MINUTE) as window_end
FROM apache_logs
GROUP BY
host,
TUMBLE(mytimestamp, INTERVAL '1' MINUTE)
HAVING
COUNT(DISTINCT user_identifier) > 2;
Here's some sample output from a query like this, if it found some results that needed attention:
| ip | user_identifier_count | window_start | window_end |
|-------------|-----------------------|---------------------|---------------------|
| 192.168.1.1 | 3 | 2023-11-18 10:00:00 | 2023-11-18 10:01:00 |
| 172.16.0.2 | 4 | 2023-11-18 10:01:00 | 2023-11-18 10:02:00 |
| 10.0.0.3 | 5 | 2023-11-18 10:02:00 | 2023-11-18 10:03:00 |
| 192.168.1.1 | 3 | 2023-11-18 10:03:00 | 2023-11-18 10:04:00 |
| 10.0.0.3 | 6 | 2023-11-18 10:04:00 | 2023-11-18 10:05:00 |
In this table:
- ip is the IP address of the host.
- user_identifier_count is the count of distinct user identifiers accessing the host in that time window.
- window_start and window_end define the one-minute interval over which this aggregation is calculated.

This output suggests that between 10:00 and 10:01, the host at IP 192.168.1.1 was accessed by 3 distinct user identifiers. The fact that these counts are greater than 2 (as per the HAVING clause) indicates a relatively high level of activity, or possibly different user agents interacting with the same host within that minute, which might be an interesting pattern to investigate further. Or it could just be members of a family accessing a popular web site.
While it's great to query the logs here, the table definition disappears once you close the Flink console.
By default, Kafka, or in this case WarpStream, won't store the log data long term either; that's by design. You can set up infinite retention in Kafka, though the disk space costs will add up on Kafka nodes.
Flink can help out here too. It can read the data from the logs topic and save it to s3 using a table format from Apache Iceberg. Apache Iceberg is "a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data".
WarpStream will hold the newest logs as they come in, and Flink can copy that data to s3 in a format that is efficient and available to query at any time for any historic queries you might like to run.
In Flink you can create a new catalog, which facilitates the storage to s3 for you. Then you can create a table in the catalog and send data to it.
Here's the full job for Flink to read the data and send it to s3:
USE CATALOG default_catalog;
CREATE CATALOG iceberg_catalog WITH (
'type' = 'iceberg',
'catalog-type' = 'hadoop',
'warehouse' = 's3a://my-test-bucket-gordon/iceberg',
's3a.access-key' = 'xxxxx',
's3a.secret-key' = 'xxxxx',
'property-version' = '1'
);
USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS logs_database;
USE logs_database;
create temporary table apache_logs (
`bytes` INT,
`datetime` STRING,
`host` STRING,
`message` STRING,
`method` STRING,
`protocol` STRING,
`referer` STRING,
`request` STRING,
`service` STRING,
`source_type` STRING,
`status` STRING,
`mytimestamp` TIMESTAMP(3) METADATA FROM 'timestamp', -- assuming timestamp is in standard format
`user_identifier` STRING,
WATERMARK FOR mytimestamp AS mytimestamp - INTERVAL '5' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'logs',
'properties.bootstrap.servers' = 'api-xxxxxxxxxxxxxxxxxx.warpstream.com:9092',
'properties.group.id' = 'flink',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json',
'json.fail-on-missing-field' = 'false',
'json.ignore-parse-errors' = 'true'
);
CREATE TABLE IF NOT EXISTS apache_logs_archive (
`bytes` INT,
`datetime` STRING,
`host` STRING,
`message` STRING,
`method` STRING,
`protocol` STRING,
`referer` STRING,
`request` STRING,
`service` STRING,
`source_type` STRING,
`status` STRING,
`mytimestamp` TIMESTAMP(3),
`user_identifier` STRING
) WITH (
'format-version'='1'
);
SET 'execution.checkpointing.interval' = '60 s';
INSERT INTO apache_logs_archive (bytes, datetime, host, method, protocol, referer, request, service) SELECT bytes, datetime, host, method, protocol, referer, request, service FROM apache_logs;
You now have some (sample) data logging to WarpStream and s3. Flink is pulling in the data and storing it long term in Iceberg format on s3, and is ready to work on your data to spot patterns such as errors or malicious use and send it wherever you need.
Apache Superset can add the final step: a data exploration and data visualization platform "able to handle data at petabyte scale".
Using Flink's SQL Gateway and JDBC connector, Superset can read from Flink, but that's a post for another day!
The code snippets used here are available on Github at https://github.com/gordonmurray/apache_flink_and_warpstream_for_logs
While running Flink in containers, we'll need to use cAdvisor to gather the metrics from the containers so that Prometheus can scrape them.
One quick note: if you are running Docker Compose on an AWS EC2 instance, the cAdvisor container won't work on ARM-based instances like the t4g or r6g, as I haven't found a cAdvisor build for ARM.
We can add cAdvisor to the Docker Compose file using the following block. It will run a container called cadvisor that collects the metrics from the running Flink containers. When it's running, cAdvisor has a UI at http://localhost:8080 and the metrics can be scraped from http://localhost:8080/metrics.
Add the following to your docker-compose.yml file:
cadvisor:
image: gcr.io/cadvisor/cadvisor
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
In our Flink containers, we need to add some configuration to enable the Prometheus metrics reporter:
metrics.reporters: prom
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9091
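With the official Flink Docker images, one way to pass these settings is through the FLINK_PROPERTIES environment variable in docker-compose.yml. A sketch:

jobmanager:
  image: flink:1.17.1
  environment:
    - |
      FLINK_PROPERTIES=
      metrics.reporters: prom
      metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
      metrics.reporter.prom.port: 9091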
To run the containers, use the following command. If you are running Docker Compose for the first time, it may take a minute to download the images and run them.
docker compose up -d
Once the containers are up and running you can see a list of them using:
docker ps
To confirm that each container is exposing its Prometheus metrics, you can connect to each container and curl localhost:9091 as follows:
# first connect to a task manager
docker exec -it apache_flink_and_docker_compose-taskmanager-1 /bin/bash
# then curl the metrics
root@1aa3fb3bd007:/opt/flink# curl localhost:9091
You should see a longish output similar to the following:
# HELP flink_taskmanager_Status_JVM_Memory_NonHeap_Used Used (scope: taskmanager_Status_JVM_Memory_NonHeap)
# TYPE flink_taskmanager_Status_JVM_Memory_NonHeap_Used gauge
flink_taskmanager_Status_JVM_Memory_NonHeap_Used{host="172_18_0_7",tm_id="172_18_0_7:38235_53686e",} 5.5132808E7
# HELP flink_taskmanager_Status_JVM_CPU_Load Load (scope: taskmanager_Status_JVM_CPU)
# TYPE flink_taskmanager_Status_JVM_CPU_Load gauge
flink_taskmanager_Status_JVM_CPU_Load{host="172_18_0_7",tm_id="172_18_0_7:38235_53686e",} 0.0
# HELP flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time Time (scope: taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation)
# TYPE flink_taskmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time gauge
Assuming you have Prometheus running on another server, you won't be able to get Prometheus to connect directly to each container running on the remote server. This is where cAdvisor comes in: it gathers the metrics from each container and exposes them so that Prometheus can scrape them remotely.
With your cAdvisor container running, you can curl localhost:8080/metrics to see the combined metrics from the running containers:
curl localhost:8080/metrics
You should see a similar, but longer, version of the metrics list from when you curled the metrics inside a container.
A guide on how to set up Prometheus would need a blog post in itself. However, if you already have Prometheus running, you'll need to add a job to your Prometheus config file to let it know about the Flink instances, so it can poll the metrics to power a Grafana dashboard or any alerts. You'll need to add a block like this one, pointing to the IP address of your Flink server:
- job_name: 'flink_instance'
static_configs:
- targets: ['100.xxx.xxx.xxx:8080']
Or, if you are using service discovery in Prometheus you can use the following job to look for ec2 instances using a particular Tag and value instead:
- job_name: 'flink-job-managers'
ec2_sd_configs:
- region: 'us-east-1'
profile: 'MY-IAM-PROFILE'
port: 8080
filters:
- name: tag:Name
values:
- 'Flink Job Manager'
relabel_configs:
Some of the metrics I'd recommend monitoring to help debug Flink jobs are below. You won't see these metric names show up until you have some jobs running in Flink first.
The following image is a visualisation of the flink_taskmanager_job_task_backPressuredTimeMsPerSecond metric, showing a new job that's busy for a few minutes.
The changes mentioned here to enable the metrics have been added to the Docker Compose file in an earlier project that shows how to perform some basic CDC from MariaDB to Redis using Flink:
https://github.com/gordonmurray/apache_flink_and_docker_compose
Paimon was straightforward to get up and running within Flink, and it stored data in s3 in ORC format. ORC is not a format I've worked with before, so I wasn't too excited by it, and unfortunately I didn't see an option to change the format.
After trying Paimon, I was reminded of Apache Iceberg. Paimon is still in the Incubating stage at Apache, while Iceberg graduated from the Incubator in 2020, so Iceberg might have been a more mature solution to try out first.
Having tried Iceberg, the data it produces in an s3 bucket after CDCing from a test database seems more usable to me compared to the ORC files produced by Paimon. The data is stored in Parquet format and its snapshots are stored in Avro format, which I have some experience with. It adds metadata files too, in JSON format.
The folder structure it creates on s3 is below. It's a folder structure with the name of the database and a folder per table. Inside each "table" is a data folder with the Parquet files and a "metadata" folder with snapshots in Avro format.
my-test-bucket/iceberg/
└── my_database
└── my_products
├── data
│ └── 00000-0-97c48300-6a94-485e-ae0d-d103aec5731f-00001.parquet
└── metadata
├── 8bf311cc-aa78-4fe3-b6ce-c8c1191c9591-m0.avro
├── snap-818359114004327704-1-681279e0-1f89-428e-8ca4-350082edd535.avro
├── v22.metadata.json
└── version-hint.text
I was able to insert more records into the test database and Flink picked up on the changes. The new records were added to the Iceberg data in s3 without issue.
When I altered the table to add a new column, however, I didn't see that reflected in the data in s3. The newer metadata files created after adding a column still show a structure with the 3 original fields, not 4 as expected. So there's more to learn there.
Even though I'm working with only a dozen records or so for testing, the sink task in the Flink job continued to be busy after sending all the data, which is a bit concerning for such a small number of records. It could be the checkpointing though, as I had that set to checkpoint every 10 seconds, which is a bit much and adds some overhead to its workload as far as I know.
With the data now in s3, I was able to start a new Flink job and query the existing data, which is great. Overall, Iceberg could be a great option for long-term storage of data on s3 in a structured format that Flink, or other tools like Apache Drill, can readily query when needed.
The source to replicate this is on Github at https://github.com/gordonmurray/apache_flink_and_iceberg. The main part of getting this running was the SQL command in Flink to create a catalog, nearly identical to the process for Paimon:
CREATE CATALOG s3_catalog WITH (
'type' = 'iceberg',
'catalog-type' = 'hadoop',
'warehouse' = 's3a://my-test-bucket-gordon/iceberg',
's3a.access-key' = 'xxxxxx',
's3a.secret-key' = 'xxxxx',
'property-version' = '1'
);
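The database and table setup went along these lines (a sketch; the column list and the source table name are illustrative, the source table itself is defined elsewhere in the job):

USE CATALOG s3_catalog;
CREATE DATABASE IF NOT EXISTS my_database;
USE my_database;

CREATE TABLE IF NOT EXISTS my_products (
    id INT,
    name STRING,
    price DECIMAL(10, 2)
);

-- 'products' stands in for the CDC source table
INSERT INTO my_products SELECT id, name, price FROM products;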
Having created the database and table and sent some data into it, I was able to start another Flink job, define the catalog again, and query the data on s3 just like I would in a relational database:
use catalog s3_catalog;
use my_database;
select * from my_products;
So all I have to do now is figure out how Iceberg handles schema changes!
There are a number of options available to send data from Flink to other storage locations, such as sinking to a relational database or to s3. While I was reading about Flink Catalogs after some recent Catalog use, I found out about Apache Paimon:
Streaming data lake platform with high-speed data ingestion, changelog tracking and efficient real-time analytics.
In its Getting Started section, it had a guide to working with Flink, so I tried it out.
I used Docker Compose to get Flink up and running. I added a database to stream some data from, and added the Paimon and s3 JARs.
The docker compose file is:
version: '3.7'
services:
mariadb:
image: mariadb:10.6.14
environment:
MYSQL_ROOT_PASSWORD: rootpassword
volumes:
- ./sql/mariadb.cnf:/etc/mysql/mariadb.conf.d/mariadb.cnf
- ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql
ports:
- "3306:3306"
jobmanager:
image: flink:1.17.1
container_name: jobmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
ports:
- "8081:8081"
command: jobmanager
volumes:
- ./jars/flink-sql-connector-mysql-cdc-2.4.1.jar:/opt/flink/lib/flink-sql-connector-mysql-cdc-2.4.1.jar
- ./jars/flink-connector-jdbc-3.1.0-1.17.jar:/opt/flink/lib/flink-connector-jdbc-3.1.0-1.17.jar
- ./jars/paimon-flink-1.17-0.6-20231030.002108-52.jar:/opt/flink/lib/paimon-flink-1.17-0.6-20231030.002108-52.jar
- ./jars/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
- ./jars/flink-s3-fs-hadoop-1.17.1.jar:/opt/flink/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.17.1.jar
- ./jars/paimon-s3-0.6-20231030.002108-57.jar:/opt/flink/lib/paimon-s3-0.6-20231030.002108-57.jar
- ./jobs/job.sql:/opt/flink/job.sql
deploy:
replicas: 1
taskmanager:
image: flink:1.17.1
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
depends_on:
- jobmanager
command: taskmanager
volumes:
- ./jars/flink-sql-connector-mysql-cdc-2.4.1.jar:/opt/flink/lib/flink-sql-connector-mysql-cdc-2.4.1.jar
- ./jars/flink-connector-jdbc-3.1.0-1.17.jar:/opt/flink/lib/flink-connector-jdbc-3.1.0-1.17.jar
- ./jars/paimon-flink-1.17-0.6-20231030.002108-52.jar:/opt/flink/lib/paimon-flink-1.17-0.6-20231030.002108-52.jar
- ./jars/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
- ./jars/flink-s3-fs-hadoop-1.17.1.jar:/opt/flink/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.17.1.jar
- ./jars/paimon-s3-0.6-20231030.002108-57.jar:/opt/flink/lib/paimon-s3-0.6-20231030.002108-57.jar
deploy:
replicas: 2
I started the mini Flink cluster using:
docker compose up -d
With the cluster running, I added some SQL commands to try out Paimon via a catalog.
I created the following SQL to CDC from the database and send the data to a table in Paimon.
USE CATALOG default_catalog;
CREATE CATALOG s3_catalog WITH (
'type' = 'paimon',
'warehouse' = 's3://my-test-bucket/paimon',
's3.access-key' = '',
's3.secret-key' = ''
);
USE CATALOG s3_catalog;
CREATE DATABASE my_database;
USE my_database;
CREATE TABLE myproducts (
id INT PRIMARY KEY NOT ENFORCED,
name VARCHAR,
price DECIMAL(10, 2)
);
create temporary table products (
id INT,
name VARCHAR,
price DECIMAL(10, 2),
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'connection.pool.size' = '10',
'hostname' = 'mariadb',
'port' = '3306',
'username' = 'root',
'password' = 'rootpassword',
'database-name' = 'mydatabase',
'table-name' = 'products'
);
SET 'execution.checkpointing.interval' = '10 s';
INSERT INTO myproducts (id,name) SELECT id, name FROM products;
The SQL creates a catalog called "s3_catalog", and inside it creates a database "my_database" and a table "myproducts".
Using Paimon was as easy as creating the catalog with suitable s3 credentials and then creating and querying tables as normal to populate data in s3:
CREATE CATALOG s3_catalog WITH (
'type' = 'paimon',
'warehouse' = 's3://my-test-bucket/paimon',
's3.access-key' = '',
's3.secret-key' = ''
);
I submitted the job to Flink:
docker exec -it jobmanager /opt/flink/bin/sql-client.sh embedded -f job.sql
With the Job up and running, I checked on s3 using the AWS CLI to list the contents of my s3 test bucket:
aws s3 ls my-test-bucket/paimon/my_database.db/myproducts/
The job had created the following folders in the bucket:
PRE bucket-0/
PRE manifest/
PRE schema/
PRE snapshot
The schema it stored for the products table on s3 was in JSON format:
{
"id" : 0,
"fields" : [ {
"id" : 0,
"name" : "id",
"type" : "INT NOT NULL"
}, {
"id" : 1,
"name" : "name",
"type" : "STRING"
}, {
"id" : 2,
"name" : "price",
"type" : "DECIMAL(10, 2)"
} ],
"highestFieldId" : 2,
"partitionKeys" : [ ],
"primaryKeys" : [ "id" ],
"options" : { },
"timeMillis" : 1696694538055
}
And in a folder called bucket-0 there was the data from my test database, in ORC format, which stands for Optimized Row Columnar.
2023-11-05 21:11:27 1279 data-19c71b4d-91c2-45fa-b9f2-b7403e2269e4-0.orc
2023-11-05 20:46:56 1279 data-dec5ca81-ad69-4619-9180-99267f6c60f5-0.orc
ORC is comparable to the Parquet file format, and AWS have a quick comparison of their respective strengths here: https://docs.aws.amazon.com/athena/latest/ug/columnar-storage.html
In the end it was quick and easy to get Paimon running to help Flink send data to s3 in a structured format.
However, when I first tried this a few days ago I didn't save my data, and when I went back to try it again I couldn't for the life of me get it to write data to s3 again.
After trying different JARs for Paimon and s3, I even submitted an issue to the Paimon Github repo, only to close it a few minutes later when I re-read the docs and found the all-important line I had missed (complete with a comment showing how important it is).
Once I added that, data was writing to the s3 bucket again.
All I need to do now is make sure I can read the ORC data using Flink. Hopefully Flink can pull long-term data back in from s3 quickly and easily for longer-term queries, rather than keeping it all in Kafka.
The files used for this are on Github at https://github.com/gordonmurray/apache_flink_and_paimon
You can start out using a Docker Compose file, then upload and run an SQL file that contains the jobs you want to run.
This approach of running workloads on Flink uses Flink SQL; it's one of several ways to run workloads. Writing Java apps, compiling them as JAR files and uploading them to run is probably the more common way.
Here is a minimal Docker Compose file to run a Flink job manager and 2 task managers:
version: '3.7'
services:
jobmanager:
image: flink:1.17.1
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
ports:
- "8081:8081"
command: jobmanager
taskmanager:
image: flink:1.17.1
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
depends_on:
- jobmanager
command: taskmanager
deploy:
replicas: 2
Assuming docker compose is installed you can start the containers using the following command in the same folder as the docker compose file:
docker compose up -d
This will start the containers in the background and you can check that the containers are running using
docker ps
You should see something like the following, showing a job manager and 2 task managers.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eb87408560be flink:1.17.1 "/docker-entrypoint.…" 35 minutes ago Up 35 minutes 6123/tcp, 8081/tcp apache_flink_and_docker_compose-taskmanager-2
565fd52d250a flink:1.17.1 "/docker-entrypoint.…" 35 minutes ago Up 35 minutes 6123/tcp, 8081/tcp apache_flink_and_docker_compose-taskmanager-1
0b3e3eaa5c06 flink:1.17.1 "/docker-entrypoint.…" 35 minutes ago Up 35 minutes 6123/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp jobmanager
To start adding some work to Flink you can access the Flink SQL console and from there try out various jobs like creating tables. Use the docker-compose.yml file in this repo to create Flink, MariaDB and Redis containers instead of the minimal example provided earlier, then connect to the console:
docker exec -it jobmanager /opt/flink/bin/sql-client.sh
To get a feel for using Flink, create a table that will read data from a database table running in another container:
-- read in the data from the table in mariadb
CREATE TABLE sales_records_table (
sale_id INT,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10, 2),
PRIMARY KEY (sale_id) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'mariadb',
'port' = '3306',
'username' = 'root',
'password' = 'rootpassword',
'database-name' = 'sales_database',
'table-name' = 'sales_records'
);
Then view the data using:
select * from sales_records_table;
If you want to take things up a notch, you can write your SQL commands to a file, then submit the file to Flink for it to run in the background.
The following file has a few commands to read from a fictional log of sales in a table in a source database, perform ongoing change data capture (CDC), sum all of the sales, and then sink the resulting total into Redis.
-- read in the data from the table in mariadb
CREATE TABLE sales_records_table (
sale_id INT,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10, 2),
PRIMARY KEY (sale_id) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'mariadb',
'port' = '3306',
'username' = 'root',
'password' = 'rootpassword',
'database-name' = 'sales_database',
'table-name' = 'sales_records'
);
-- create a view that aggregates the sales records
CREATE TEMPORARY VIEW total_sales AS
SELECT
SUM(sale_amount) AS total_sales_amount
FROM
sales_records_table;
-- create a redis sink table
CREATE TABLE redis_sink (
key_name STRING,
total DECIMAL(10, 2),
PRIMARY KEY (key_name) NOT ENFORCED
) WITH (
'connector' = 'redis',
'redis-mode' = 'single',
'host' = 'redis',
'port' = '6379',
'database' = '0',
'command' = 'SET'
);
-- insert the aggregated sales records into the redis sink table
INSERT INTO
redis_sink
SELECT
'total_sales',
total_sales_amount
FROM
total_sales;
This job.sql file will already be available in the container, ready to run:
docker exec -it jobmanager /opt/flink/bin/sql-client.sh embedded -f job.sql
While this is a made-up example, it's a good illustration of what Flink can do on its own.
Once the Job is running, check the Flink UI and you’ll see your running Job by going to http://localhost:8081/#/overview
You can check redis to see if the value is in there:
redis-cli -h localhost
get total_sales
# "5500.00"
You can expand on this by adding Checkpoints, which can be handy to help Flink jobs tolerate restarts. I wrote about checkpoints recently here: https://gordonmurray.com/data/2023/10/25/using-checkpoints-in-apache-flink-jobs.html
Once you're done, you can run docker compose down to stop the containers.
By default, Flink gives SQL jobs generated names like insert-into_default_catalog.default_database.sink_name. If you're pulling records from multiple sources and sinking them to the same place, such as a Redis cache, it can be hard to tell which one is which if a job needs some attention. As far as I can tell, you can only provide names when submitting jobs via Java.
I was looking into Catalogs to see what they could do. I wanted to store CDC data somewhere to avoid re-snapshotting data from source databases, or to share data between jobs. When I created a couple of new jobs using a Catalog, I noticed the jobs had different naming.
The jobs now have a Catalog name and a database name that help tell them apart.
I know this isn't a proper use of Catalogs in Flink. Catalogs can do more than just help with the naming of jobs, though for me it's definitely effective for labelling dozens of jobs sinking to a central place.
I used the following to create a Catalog and a database, and use them when creating any tables in a job. It's an in-memory Catalog, so it doesn't help with my initial hope of storing the CDC data to avoid performing snapshots, though it gives me some useful naming at a glance.
There's a Docker Compose file and related files in Github here to try this out: https://github.com/gordonmurray/apache_flink_catalog_misuse
USE CATALOG default_catalog;
CREATE CATALOG myproject WITH ('type'='generic_in_memory');
USE CATALOG myproject;
CREATE DATABASE mydatabase;
USE mydatabase;
CREATE TABLE [...]
I've been using Flink with Kafka since then to help take the pressure off the databases, which works well.
For a proper use of Catalogs, I tried out Apache Paimon briefly for storing data on s3 in ORC format and plan to revisit it again soon. There's definitely more to learn about Catalogs.