Apache Kafka
High-throughput messaging system as a distributed commit log
- publish-subscribe messaging designed as a distributed commit log
- High volume of events/activities to be recorded (100k+/sec)
- messages/events to be delivered at least once
- mix of online and batch consumers
- consumers consuming at different paces
- Multiple consumers per message; a typical message queue forces one queue per consumer, thereby duplicating all the data
- A heavily loaded message queue drags down overall throughput
- In a data-driven company, events like user activities drive the business, and events are becoming first-class citizens
- no real-time data processing tool is complete without Kafka integration
- A typical message queue's performance dips severely whenever it is backed up beyond what can be kept in memory
Decoupled data pipeline
- It is quite a simple abstraction
- Producers send messages to the Kafka cluster
- The Kafka cluster holds these messages for a specified time
- Consumers read content from the cluster at their own pace
- The producer need not know anything about the downstream pipeline
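The abstraction above can be sketched in plain Python (a toy stand-in, not a real Kafka client; all names are illustrative): a shared append-only log that producers write to, and that independent consumers read from at whatever position and pace they choose.

```python
class CommitLog:
    """Toy stand-in for a Kafka topic: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        # Producer side: just append; no knowledge of downstream consumers.
        self.messages.append(msg)

    def read_from(self, offset, max_count=10):
        # Consumer side: pull a chunk starting at any offset.
        return self.messages[offset:offset + max_count]


log = CommitLog()
for i in range(5):
    log.append(f"event-{i}")

# Two independent consumers at different positions in the same log.
fast = log.read_from(0)   # online consumer, fully caught up
slow = log.read_from(3)   # batch consumer, still lagging behind
```

Note that the producer never interacts with the consumers; decoupling falls out of the log abstraction itself.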
How is Kafka different?
- Explicitly distributed
- Partitioned
- Replicated
- Dumb pipelines
- Central commit log: all published messages are retained for a configurable period of time
- Distributed: Kafka is a cluster of machines called brokers
- Kafka assumes that producers, brokers and consumers are all spread across multiple machines
- Producers, brokers and consumers run as a logical group with the help of ZooKeeper
- Partitioned: partitioning lets the system scale easily
- Replicated: replication makes the system highly available and predictable
- Dumb pipelines: state about what has been consumed lives with the consumer, not in the data pipeline
- Break apart your entire infrastructure
- All your systems can dump data into it, because it is cheap, easy to scale, and predictable
Brokers
- A topic is the virtual category/feed; it is the key abstraction
- For each topic, Kafka maintains a partitioned log
- A Kafka topic is an append-only (write-ahead) log
- So ordering is guaranteed at the partition level
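The per-partition ordering guarantee can be sketched as follows (illustrative only, not the broker's actual implementation): each partition is its own append-only list, so offsets are sequential within a partition but carry no ordering across partitions.

```python
# A toy topic with two partitions, each an independent append-only log.
topic = {0: [], 1: []}

def append(partition, record):
    """Append to one partition and return the record's offset there."""
    log = topic[partition]
    log.append(record)
    return len(log) - 1  # offsets are per-partition, monotonically increasing

o_a = append(0, "a")  # first record in partition 0
o_b = append(1, "b")  # first record in partition 1 -- also offset 0
o_c = append(0, "c")  # second record in partition 0
```

Records "a" and "c" are strictly ordered (same partition); "b" has no ordering relationship to either.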
Partition leader
- Each partition has one broker which acts as the "leader" and zero or more servers which act as "followers", which replicate the leader
- If the leader fails, one of the followers will automatically become the new leader.
- Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
- The producer sends data directly to the broker that is the partition leader
- How does the producer know which broker that is? Any Kafka node will return metadata identifying which servers are alive and which broker leads each partition
Partitioned
- partitions are log files on the disk
- Partitioning allows the log to scale beyond a size that will fit on a single server
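Partition assignment can be sketched as a stable hash of the record key modulo the partition count (Kafka's default partitioner does something in this spirit; this toy version uses Python's hashlib purely for determinism):

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    """Map a record key to a partition with a stable hash."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
# Same key -> same partition, so all of a key's records stay ordered
# relative to each other, while different keys spread across servers.
```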
Overview
- So, is Kafka a queueing system or a publish-subscribe/broadcast system?
- It has a single consumer abstraction that generalizes both: the consumer group
- Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
- If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
- If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
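The queue-vs-broadcast behaviour above can be sketched as: every message goes to exactly one member of each subscribing group (here a simple round-robin stands in for Kafka's actual partition assignment):

```python
from collections import defaultdict
from itertools import cycle

groups = {
    "billing": ["billing-1", "billing-2"],  # same group -> load-balanced queue
    "audit":   ["audit-1"],                 # its own group -> sees every message
}

received = defaultdict(list)
# One round-robin picker per group (illustrative stand-in for assignment).
pickers = {name: cycle(members) for name, members in groups.items()}

for msg in ["m1", "m2", "m3", "m4"]:
    for name, picker in pickers.items():
        # Exactly one consumer instance per group receives each message.
        received[next(picker)].append(msg)
```

The "billing" instances split the stream between them, while "audit-1" receives the full broadcast.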
Topic and Partitions
- Every record is a key-value pair; the key determines the partition
- The key may be computed randomly, or the producer may specify it explicitly
- Every record has an offset number; the consumer uses this offset to track its position
- Messages are simply byte arrays and developers can use them to store any object in any format, with String, JSON, and Avro the most common
- The consumer specifies an offset in the log with each request and receives back a chunk of log beginning from that position: better control, and it can re-consume
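Offset-driven consumption can be sketched like this (a toy fetch function, not the real protocol): the consumer owns its position and can rewind it at will to re-consume.

```python
# Messages are just byte arrays, identified by their position in the log.
log = [b"r0", b"r1", b"r2", b"r3"]

def fetch(offset, max_records=2):
    """Return a chunk of the log starting at the requested offset."""
    return log[offset:offset + max_records]

first = fetch(0)            # consumer starts at the beginning
position = len(first)       # consumer, not the broker, tracks its position
second = fetch(position)    # continue from where it left off
replayed = fetch(0)         # rewinding the offset re-consumes old records
```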
Producers and Consumers
- Producers
- lots of other customization options
- Asynchronous/Synchronous send
- Batching
- Compression
- Consumers
- How is it different from other messaging queues? In a typical messaging system, the queue pushes messages to consumers and maintains all the associated metadata
- Each consumer process belongs to a consumer group
- Each message is delivered to exactly one process within every subscribing consumer group
- In an ideal world, there will be multiple logical consumer groups, each consisting of a cluster of consuming machines
- Because data is shared across groups, no matter how many consumers a topic has, a message is stored only a single time
- Here, consumers pull messages
Where to use?
- Real time event/log aggregations
- Speed layer in the Lambda architecture
- Real time news feeds/metrics/alerts/monitoring
- Data loading for data processing systems
- Event sourcing
- Commit logs
- Best suited when there are multiple consumers, as that is what it is most optimized for
- commit logs - database state capture
- Because Kafka topics are very cheap from a performance and overhead standpoint, it's possible to create as many queues as we want, scaled to the performance we want, optimizing resource utilization across the system; because they can be created dynamically, we can make our business rules very flexible
- Event sourcing: a style of application design where state changes are logged as a time-ordered sequence of records
- Kafka is fast becoming the centre of gravity for data logging
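The event-sourcing use case above can be sketched in a few lines (a toy example with made-up account events): current state is never stored directly; it is rebuilt by replaying the time-ordered log.

```python
# A time-ordered log of state changes (the "source of truth").
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def replay(log):
    """Rebuild current state by folding over the event log in order."""
    balance = 0
    for event in log:
        if event["type"] == "deposit":
            balance += event["amount"]
        else:
            balance -= event["amount"]
    return balance

balance = replay(events)  # 100 - 30 + 5 = 75
```

Because Kafka retains the log for a configurable period, any new consumer can replay it from the start to derive its own view of the state.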