Creating a Kafka Connect Docker Image
If you have been using Kafka, or learning about Kafka, you have most probably came across Kafka Connect. The idea of Kafka Connect is to simplify the process of getting data into and out of a Kafka Cluster. The idea is that Kafka Connect abstracts all the logic of resilience, scalability and reliability out of your way to let you focus on writing value-added services. Kafka Connect has a lot of plugins, some commercial and some open source. Its parallel with Kafka Streams makes it a good complement for your Kafka based applications. In this article, I will show you how to build your own Kafka Connect Docker image with your plugins, no matter if they are open source or your own, custom-developed plugins.
Introduction
Kafka is mostly distributed as a cluster of Kafka Brokers to which your clients connect to. On top of this low-level cluster, we have two APIs that serve two different use cases: Kafka Connect and Kafka Streams.
Kafka Streams is an API that enables the development of stream applications, that is, applications that process data from Kafka Streams in a real-time manner. This means that we can easily create applications that are scalable, reliable and fast, focusing only on the details of our application and leaving the scalability and reliability part to the platform. I will not be focusing on Kafka Streams in the post because I feel it is a technology that is more widespread than Kafka Connect. If you would like me to write about Kafka Streams, send a message or reply to this article.
Kafka Connect is the other API that the Kafka Open Source distribution provides. Its use case is different: the goal of Kafka Connect is to simplify and streamline the process of getting and pushing data into a Kafka Cluster. Kafka Connect abstracts the process of transforming data, scaling and reliability. This is achieved through the concept of Connect Plugins. We can develop plugins for Kafka Connect or use one of the many open-source plugins available. Each plugin has its purpose: some are source plugins, which means they obtain data from external systems; others are sink plugins, which means they send data to external systems; others are still transformers or predicate plugins, which allow data to be transformed or filtered, respectively.
Kafka Connect is not as widely documented as Kafka Streams, but that does not mean that it is not ready for prime time. It is part of Kafka’s official distribution, and there are many projects that rely on it. For instance, Mirror Maker 2, Kafka’s tool to replicate data across different Kafka clusters, is built with Kafka Connect. This should give you a sense of how Kafka Connect is crucial for Kafka itself.
Deploying Kafka Connect
Like Kafka Streams Applications, Kafka Connect deployments can be performed in a distributed environment. The idea is simple: several instances of Kafka Connect collaborate in sharing the load of the tasks being performed. Like Kafka Connect, this allows for both scalability and reliability without having to change your application. The problem with deploying Kafka Connect is: how do we make a Docker image for our services? We could leverage one of the many paid options of running Kafka Connect in the cloud, but if we want to have control and better cost controls, we can deploy our own system.
This means we need to have a Docker image with Kafka Connect and the plugins that we will be using, no matter if they are open or closed source. The simplest way I found to do this is to have a small Java wrapper that starts Kafka Connect and maps environment variables to Kafka Connect properties. This is similar to what Confluent does in their commercial Docker images. Here, we are in control of what we put in the Docker image. There is no proprietary code calling home, but there is one drawback: we will not have the hub to install plugins; we will be installing them ourselves when the Docker image is created.
Wrapping Kafka Connect
The idea to wrap Kafka Connect came from this project. The goal is to create a small Java wrapper that maps environment variables, starting with CONNECT_, into Kafka Connect parameters. This will make our Java wrapper code small and enable us to control the parameters by changing the environment variables of the container, be it in a Docker, Docker Composer, or a Kubernetes environment.
public class Main {
private static final Logger LOGGER = LogManager.getLogger();
static final String CONNECT_ENV_PREFIX = "CONNECT_";
static Supplier<Map<String, String>> envSupplier = System::getenv;
public static void main(final String[] args) {
LOGGER.info("Starting Kafka Connect wrapper");
try {
ConnectDistributed.main(new String[]{createConnectProperties(envSupplier.get()).getAbsolutePath()});
} catch (Exception e) {
LOGGER.error("Error starting wrapper {}", ConnectDistributed.class.getSimpleName(), e);
System.exit(1);
}
}
static File createConnectProperties(Map<String, String> env) throws IOException {
final File workerPropFile =
File.createTempFile("tmp-connect-distributed", ".properties");
workerPropFile.deleteOnExit();
try (PrintWriter pw = new PrintWriter(new FileOutputStream(workerPropFile))) {
LOGGER.trace("Writing Kafka Connect worker properties to '{}'.", workerPropFile.getAbsolutePath());
env.entrySet()
.stream()
.filter(entry -> entry.getKey()
.startsWith(CONNECT_ENV_PREFIX))
.map(e -> new AbstractMap.SimpleEntry<>(connectEnvVarToProp(e.getKey()), e.getValue()))
.forEach(e -> {
final String k = e.getKey();
final String v = e.getValue();
LOGGER.info("Property defined: {}={}", k, v);
pw.printf("%s=%s%n", k, v);
});
LOGGER.trace("Kafka Connect worker properties written.");
}
return workerPropFile;
}
private static String connectEnvVarToProp(String k) {
return k.toLowerCase()
.substring(CONNECT_ENV_PREFIX.length())
.replace('_', '.');
}
}The project is quite straightforward. We are leveraging Maven to get Kafka Connect, its runtime and any plugins we want to include in our Docker image.
<!-- ... -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-api</artifactId>
<version>${kafka.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-runtime</artifactId>
<version>${kafka.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-transforms</artifactId>
<version>${kafka.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-json</artifactId>
<version>${kafka.version}</version>
<scope>runtime</scope>
</dependency>
<!-- ... -->Building a Docker Image
We then leverage Google’s Jib project to have a Docker image that automatically starts the project.
<plugin>
<groupId>com.google.cloud.tools</groupId>
<artifactId>jib-maven-plugin</artifactId>
<version>3.4.1</version>
<configuration>
<to>
<image>kafkaconnect</image>
</to>
</configuration>
</plugin>Running the Docker Image
To run this Docker image, we just need to do:
docker run --rm -ti --env-file envfile kafkaconnectDon’t forget to provide the needed Kafka Connect environment variables in the envfile file.
# Kafka Connect Settings
CONNECT_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_GROUP_ID=kafka-connect
CONNECT_OFFSET_STORAGE_TOPIC=kafka-connect-offset-storage-topic
CONNECT_CONFIG_STORAGE_TOPIC=kafka-connect-config-storage-topic
CONNECT_STATUS_STORAGE_TOPIC=kafka-connect-status-storage-topic
CONNECT_BOOTSTRAP_CONTROLLERS=kafka-cluster:9092
# ...Wrapping Up
In this article, we have seen how simple it is to build a docker image with docker compose and some of the Kafka Connect plug-ins that are part of the standard distribution. Adding other open source plug-ins or your own is quite easy: you just need to add a dependency to the project. If you have difficulties, send a message and I’ll write a small article on the subject.
I hope this article has helped you. If it did, leave some feedback. The source code for the project is on GitHub.