What you can do with Apache Kafka
This article describes Apache Kafka simply and looks at some of the use cases that benefit from it.
Kafka in a nutshell
Kafka describes itself as a “distributed streaming platform”. In essence, it acts as a central hub to which applications publish streams of records and from which other applications subscribe to read them.
What can Kafka be used for?
Website activity tracking
As a website owner, you’d be interested in tracking the actions your users perform on your site, such as searches made, pages visited, and content clicked. Activity tracking is often very high volume, since each page view can generate many activity messages.
This data can be published to Kafka, with one topic per activity category. Each category feed can then be loaded into Hadoop or an offline data warehousing system for offline processing and reporting.
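As a rough sketch, here is how a site might publish activity events using the Kafka Java producer API. The broker address, the topic names (page-views, searches, clicks), and the record keys and values are all illustrative assumptions, not part of any particular deployment.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ActivityTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One topic per activity category; topic names and payloads are illustrative.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/pricing"));
            producer.send(new ProducerRecord<>("searches",   "user-42", "kafka tutorial"));
            producer.send(new ProducerRecord<>("clicks",     "user-42", "signup-button"));
        }
    }
}
```

Keying each record by user id means all of a given user’s events land in the same partition, which keeps them in order for downstream consumers.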
Messaging
Kafka works well as a replacement for a more traditional message broker. A message broker decouples the producer and consumer of a message: the producer does not have to worry about whether the consumer has received the message, and the consumer does not have to communicate with the producer to get messages. Compared to most messaging systems, Kafka offers better throughput plus built-in partitioning, replication, and fault tolerance, which makes it a good fit for large-scale message processing applications.
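To illustrate the consumer side of this decoupling, here is a minimal sketch using the Kafka Java consumer API. The orders topic, the order-processors group id, and the broker address are hypothetical; the point is that the consumer only needs to know the topic name, never anything about the producer.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // assumed broker address
        props.put("group.id", "order-processors");            // consumers in the same group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));  // illustrative topic name
            while (true) {
                // Poll for new records; the producer may be long gone by the time these are read.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```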
Stream processing
Many Kafka users build processing pipelines with multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final stage might attempt to recommend this content to users. Such pipelines create graphs of real-time data flows based on the individual topics.
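A single stage of such a pipeline might look like the following Kafka Streams sketch, which reads the hypothetical articles topic, applies a trivial normalization step, and writes the result to a new articles-cleansed topic. The topic names, application id, and broker address are assumptions for illustration only.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ArticleCleaner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleaner");     // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> articles = builder.stream("articles");

        // A trivial "normalization" stage: trim whitespace and lowercase the content,
        // then write the cleansed records to a downstream topic for the next stage.
        articles
                .mapValues(content -> content.trim().toLowerCase())
                .to("articles-cleansed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The next stage of the pipeline would simply consume articles-cleansed, so each stage stays independent and can be scaled or redeployed on its own.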
Metrics
Say you own a set of independent but related websites, each generating its own performance data. Rather than processing each site’s data separately, you can use Kafka to combine multiple streams of data into a single stream for centralized processing.
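One way to combine the feeds is with the Kafka Streams merge operation. The sketch below assumes three illustrative per-site metrics topics and merges them into a single all-site-metrics topic for downstream analytics; all names are hypothetical.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class MetricsMerger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-merger");      // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Each site publishes its performance metrics to its own topic (names are illustrative).
        KStream<String, String> siteA = builder.stream("site-a-metrics");
        KStream<String, String> siteB = builder.stream("site-b-metrics");
        KStream<String, String> siteC = builder.stream("site-c-metrics");

        // Merge the independent streams into one combined topic for downstream analytics.
        siteA.merge(siteB).merge(siteC).to("all-site-metrics");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```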