by Amit Rathi
How to know if Apache Kafka is right for you
Apache Kafka has grown a lot in functionality and reach in the last couple of years. It’s used in production by one third of the Fortune 500, including seven of the top 10 global banks, eight of the top 10 insurance companies, and nine of the top 10 U.S. telecom companies [source].
This article gives you a quick tour of the core functionality offered by Kafka. I will present a lot of examples to help you understand common usage patterns. Hopefully you’ll find some correlation with your own workflows so you can start leveraging the power of Kafka. Let’s start by looking at two core functionalities offered by Kafka.
1. Kafka as a Messaging System
Messaging is widely used in two ways:
- Queuing (SQS, celery, and so on): Queue consumers act as a worker group. Each message goes to only one of the worker processes, effectively dividing the work.
- Publish-Subscribe (SNS, PubNub, and so on): Subscribers are typically independent of each other. Each subscriber gets a copy of each message. It acts like a notification system.
Both of these are useful paradigms. Queuing divides up the work, and is great for fault tolerance and scale. Publish-Subscribe allows multiple subscribers, which lets you decouple your systems. The beauty of Kafka is that it combines both the queuing and publish-subscribe paradigms into a single robust messaging system.
I highly recommend reading the documentation, which explains the underlying design and how this combination is achieved with the help of topics, partitions, and consumer groups. To be fair, this functionality can also be achieved with RabbitMQ or an SNS-SQS combination.
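To make the combination concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not a Kafka client: the `Topic` class, the `assign_partitions` and `consume` helpers, the CRC32 hashing, and the round-robin assignment are all assumptions made for illustration, and real Kafka uses its own partitioner and assignment strategies.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Messages with the same key always land in the same partition.
        # CRC32 is a stand-in here, not Kafka's actual hash function.
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))


def assign_partitions(num_partitions, members):
    """Give each partition to exactly one member of a group (round-robin)."""
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        assignment[members[partition % len(members)]].append(partition)
    return dict(assignment)


def consume(topic, group_members):
    """Return {member: [messages]} for one consumer group."""
    assignment = assign_partitions(len(topic.partitions), group_members)
    return {
        member: [msg for p in parts for msg in topic.partitions[p]]
        for member, parts in assignment.items()
    }


topic = Topic(num_partitions=4)
for i in range(8):
    topic.publish(f"hotel-{i}", f"price update {i}")

# Two independent groups: each group sees every message (publish-subscribe),
# while members inside a group split the partitions between them (queuing).
alerts = consume(topic, ["alerts-1", "alerts-2"])
analytics = consume(topic, ["analytics-1"])
```

Each group receives all eight messages, but inside the `alerts` group no message is delivered to both members.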
2. Kafka for Stream Processing
Once you have a robust, scalable messaging system, all you need is an easy way to process the stream of messages. The Streams API provides just that. It’s a Java client library (now Scala, too) that provides a higher-level abstraction than the producer and consumer APIs.
It makes it easy to perform:
- stateless operations, such as filtering and transforming stream messages
- stateful operations, such as join and aggregation over a time window
The Streams API handles the serialization and deserialization of messages and maintains the state required for stateful operations.
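The difference between the two kinds of operation can be sketched in plain Python. This is not the Streams API: the `stateless` and `windowed_join` functions, the event tuples, and the 30-second window are hypothetical, and the join runs over finite lists rather than an unbounded stream.

```python
from collections import defaultdict


def stateless(events):
    """Filter and transform: each event is handled on its own, no memory needed."""
    for ts, user, action in events:
        if action != "heartbeat":            # filter
            yield ts, user, action.upper()   # transform


def windowed_join(left, right, window_seconds):
    """Pair left and right events that share a key and lie within the window.

    The per-key buffer of right-side events is the state that must be kept.
    """
    right_by_key = defaultdict(list)
    for ts, key, value in right:
        right_by_key[key].append((ts, value))

    joined = []
    for ts, key, value in left:
        for other_ts, other_value in right_by_key[key]:
            if abs(ts - other_ts) <= window_seconds:
                joined.append((key, value, other_value))
    return joined


# Hypothetical events: (timestamp_seconds, user, action)
searches = [(0, "u1", "search"), (5, "u1", "heartbeat"), (12, "u2", "search")]
bookings = [(8, "u1", "booked"), (100, "u2", "booked")]

cleaned = list(stateless(searches))
joined = windowed_join(cleaned, bookings, window_seconds=30)
```

Only `u1` produces a joined record, because the booking for `u2` falls outside the 30-second window.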
Show me some code
Here is a Streams API example that reads plain text from an input stream, counts the occurrences of each word, and writes the counts to an output stream. See the full version here.
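The same word-count logic can also be modelled without Kafka at all. The following is a rough plain-Python sketch, not the Streams API: the `word_count` generator is a hypothetical helper, the input is an in-memory list of lines, and the `Counter` plays the role of the state that a stream processor would have to maintain.

```python
from collections import Counter


def word_count(lines):
    """Yield (word, running_count) after every word, like a changelog of counts."""
    counts = Counter()  # the state a stream processor would have to persist
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


updates = list(word_count(["hello kafka", "hello streams"]))
latest = dict(updates)  # the last update per word wins
```

Every incoming word produces an updated count, and keeping only the latest update per word gives the current totals.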
With windowing it’s easy to aggregate over a time range and keep track of things like top-N words that day (not demonstrated here).
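As a rough illustration of the top-N idea, here is a plain-Python sketch using one-day tumbling windows. It is not the Streams API: `top_words_per_day`, the `(unix_timestamp, word)` event shape, and the sample data are assumptions for the example.

```python
from collections import Counter, defaultdict

SECONDS_PER_DAY = 86_400


def top_words_per_day(events, n):
    """events: iterable of (unix_timestamp, word).

    Returns {day_index: [(word, count), ...]} with the n most common words per day.
    """
    windows = defaultdict(Counter)
    for timestamp, word in events:
        # Tumbling one-day window: every event belongs to exactly one day bucket.
        windows[timestamp // SECONDS_PER_DAY][word] += 1
    return {day: counter.most_common(n) for day, counter in windows.items()}


events = [(10, "kafka"), (20, "kafka"), (30, "queue"), (90_000, "stream")]
result = top_words_per_day(events, n=1)
```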
Typical use cases of Kafka (examples)
- Imagine you run a travel website. The price of hotels and flights keeps changing all the time. A few components of your system (price alerts, analytics) need to be informed of these changes. You post the changes on Kafka topics, and each component that needs to be notified acts as a subscriber. All nodes of a single subscriber system form a single consumer group. A given message is sent to only one node in each consumer group. This way each component gets its own copy of the message, and the work is divided among the nodes inside each component.
- Website activity (page views, searches, or other actions users may take) can be tracked and analysed through Kafka. In fact, this was the original use case for which Kafka was invented at LinkedIn. Website activities are published to central topics with one topic per activity type. The feed can be processed in real time to gain insights into user engagement, drop-offs, page flows, and so on.
- Imagine you have location data coming in from GPS beacons or smartphone devices, and you want to process it in real time to show vehicle path, distance travelled, and so on. Incoming data can be published on Kafka topics and processed with the Streams API. Stateful processing with windowing comes in handy when you need to extract and process all location data of a given user for a certain period of time.
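The location use case can be sketched in plain Python as a stateful, windowed aggregation. This is a model under assumptions, not Kafka code: `distance_per_vehicle`, the `(timestamp, vehicle_id, lat, lon)` ping format, and the use of the haversine formula are choices made for the example, and pings are assumed to arrive in time order per vehicle.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def distance_per_vehicle(pings, start, end):
    """Sum the distance each vehicle travelled within the window [start, end)."""
    last_seen = {}                # state: last known position per vehicle
    totals = defaultdict(float)   # state: running distance per vehicle
    for ts, vehicle, lat, lon in pings:
        if not (start <= ts < end):
            continue              # outside the time window
        if vehicle in last_seen:
            totals[vehicle] += haversine_km(last_seen[vehicle], (lat, lon))
        last_seen[vehicle] = (lat, lon)
    return dict(totals)


pings = [
    (0, "bus-1", 0.0, 0.0),
    (60, "bus-1", 0.0, 1.0),
    (30, "car-7", 10.0, 10.0),   # a single ping, so no distance yet
    (5000, "bus-1", 0.0, 2.0),   # outside the one-hour window below
]
result = distance_per_vehicle(pings, start=0, end=3600)
```

The two dictionaries are exactly the kind of per-key state that a stream processing library would otherwise manage for you.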
When not to use Kafka
- If you can’t or don’t want to move to Java/Scala for services talking to the Kafka cluster, then you are going to miss out on all the higher-level abstractions provided by Kafka Streams. The Streams API is essentially a client library talking to the Kafka cluster. Confluent, the company founded by Kafka’s original creators, is focused on Java at the moment. Clients for popular languages like Python have had an open issue requesting streaming support for over 1.5 years now.
- If all you need is a task queue, consider RabbitMQ instead. With Kafka, each partition can be consumed by only one consumer within a consumer group, and the partition is decided at the time the task is put on the queue. So a flood of tasks on a given partition can cause starvation, and adding consumers doesn’t help, because the extra consumers cannot share that partition.
- If you are only processing a few thousand messages each day, then Kafka is probably overkill. Kafka is really built for handling large-scale stream processing, so setting it up and maintaining it is not worth the effort if you don’t have or anticipate that scale.
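The starvation problem can be illustrated with a small plain-Python model. It is a sketch under assumptions, not Kafka's behaviour in detail: `partition_for` uses CRC32 as a stand-in for a key-based partitioner, `backlog_per_consumer` assigns partitions round-robin, and the tenant keys are made up.

```python
from collections import Counter
from zlib import crc32

NUM_PARTITIONS = 3


def partition_for(key):
    """Key-based partitioning: the same key always maps to the same partition."""
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS


def backlog_per_consumer(task_keys, num_consumers):
    """Within one group, each partition goes to one consumer; extras sit idle."""
    per_partition = Counter(partition_for(k) for k in task_keys)
    backlog = [0] * num_consumers
    for partition in range(NUM_PARTITIONS):
        backlog[partition % num_consumers] += per_partition[partition]
    return backlog


# One noisy key floods a single partition with 1000 tasks.
tasks = ["tenant-big"] * 1000 + [f"tenant-{i}" for i in range(10)]

with_three = backlog_per_consumer(tasks, num_consumers=3)
with_six = backlog_per_consumer(tasks, num_consumers=6)
```

Doubling the consumers leaves the largest backlog unchanged, and the three extra consumers receive nothing because there are only three partitions to hand out.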
That’s all, folks. This covers the important things you need to know about Apache Kafka. If you enjoyed reading it, follow my blog. Let me know if you would like to see an overview of any other tool.
Originally published at blog.amirathi.com on March 3, 2018.