Apache Kafka
High-throughput messaging system as a distributed commit log
- publish-subscribe messaging designed as a distributed commit log
- High volume of events/activities to be recorded (100k+/sec)
- messages/events to be delivered at least once
- mix of online and batch consumers
- consumers consuming at different paces
- Multiple consumers per message; a typical message queue forces one queue per consumer, thereby duplicating all the data
- A heavily loaded message queue drags down overall throughput
- In a data-driven company, events like user activities drive the business, and events are becoming first-class citizens
- no real-time data processing tool is complete without Kafka integration
- A typical message queue's performance dips severely whenever it is backed up beyond what can be kept in memory
Decoupled data pipeline
- It is quite a simple abstraction
- Producers send messages to the Kafka cluster
- The Kafka cluster holds these messages for a specified time
- Consumers read content from the cluster at their own pace
- The producer need not know anything about the downstream pipeline
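The abstraction above can be sketched in plain Python (a toy stand-in, not a real Kafka client; all names are illustrative): a shared append-only log that producers write to, and that independent consumers read from at whatever position and pace they choose.

```python
class CommitLog:
    """Toy stand-in for a Kafka topic: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        # Producer side: just append; no knowledge of downstream consumers.
        self.messages.append(msg)

    def read_from(self, offset, max_count=10):
        # Consumer side: pull a chunk starting at any offset.
        return self.messages[offset:offset + max_count]


log = CommitLog()
for i in range(5):
    log.append(f"event-{i}")

# Two independent consumers at different positions in the same log.
fast = log.read_from(0)   # online consumer, fully caught up
slow = log.read_from(3)   # batch consumer, still lagging behind
```

Note that the producer never interacts with the consumers; decoupling falls out of the log abstraction itself.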
How is Kafka different?
- Explicitly distributed
- Partitioned
- Replicated
- Dumb pipelines
- Central commit log: all published messages are retained for a configurable period of time
- Distributed: Kafka is a cluster of machines called brokers
- Kafka assumes that producers, brokers and consumers are all spread across multiple machines
- Producers, brokers and consumers run as a logical group with the help of ZooKeeper
- Partitioned: partitioning lets the system scale easily
- Replicated: replication makes the system highly available and predictable
- Dumb pipelines: state about what has been consumed lives with the consumer, not in the data pipeline
- Break apart your entire infrastructure
- All your systems can dump data into it, because it is cheap, easy to scale, and predictable
Brokers
- A topic is the virtual category/feed; it is the key abstraction
- For each topic, Kafka maintains a partitioned log
- A Kafka topic is an append-only (write-ahead) log
- So ordering is guaranteed at the partition level
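The per-partition ordering guarantee can be sketched as follows (illustrative only, not the broker's actual implementation): each partition is its own append-only list, so offsets are sequential within a partition but carry no ordering across partitions.

```python
# A toy topic with two partitions, each an independent append-only log.
topic = {0: [], 1: []}

def append(partition, record):
    """Append to one partition and return the record's offset there."""
    log = topic[partition]
    log.append(record)
    return len(log) - 1  # offsets are per-partition, monotonically increasing

o_a = append(0, "a")  # first record in partition 0
o_b = append(1, "b")  # first record in partition 1 -- also offset 0
o_c = append(0, "c")  # second record in partition 0
```

Records "a" and "c" are strictly ordered (same partition); "b" has no ordering relationship to either.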
Partition leader
- Each partition has one broker which acts as the "leader" and zero or more servers which act as "followers", which replicate the leader
- If the leader fails, one of the followers will automatically become the new leader.
- Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
- The producer sends data directly to the broker that is the partition leader
- How does the producer know which broker that is? Any Kafka node will return metadata identifying which servers are alive and which broker leads each partition
Partitioned
- partitions are log files on the disk
- Partitioning allows the log to scale beyond a size that will fit on a single server
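Partition assignment can be sketched as a stable hash of the record key modulo the partition count (Kafka's default partitioner does something in this spirit; this toy version uses Python's hashlib purely for determinism):

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    """Map a record key to a partition with a stable hash."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
# Same key -> same partition, so all of a key's records stay ordered
# relative to each other, while different keys spread across servers.
```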
Overview
- So, is Kafka a queueing system or a publish-subscribe/broadcast system?
- It has a single consumer abstraction that generalizes both: the consumer group
- Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
- If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
- If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
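The queue-vs-broadcast behaviour above can be sketched as: every message goes to exactly one member of each subscribing group (here a simple round-robin stands in for Kafka's actual partition assignment):

```python
from collections import defaultdict
from itertools import cycle

groups = {
    "billing": ["billing-1", "billing-2"],  # same group -> load-balanced queue
    "audit":   ["audit-1"],                 # its own group -> sees every message
}

received = defaultdict(list)
# One round-robin picker per group (illustrative stand-in for assignment).
pickers = {name: cycle(members) for name, members in groups.items()}

for msg in ["m1", "m2", "m3", "m4"]:
    for name, picker in pickers.items():
        # Exactly one consumer instance per group receives each message.
        received[next(picker)].append(msg)
```

The "billing" instances split the stream between them, while "audit-1" receives the full broadcast.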
Topic and Partitions
- Every record is a key-value pair; the key determines the partition
- The key may be computed randomly, or the producer may specify it explicitly
- Every record has an offset number; the consumer uses this offset to track its position
- Messages are simply byte arrays and developers can use them to store any object in any format, with String, JSON, and Avro the most common
- The consumer specifies an offset in the log with each request and receives back a chunk of log beginning from that position: better control, and it can re-consume
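Offset-driven consumption can be sketched like this (a toy fetch function, not the real protocol): the consumer owns its position and can rewind it at will to re-consume.

```python
# Messages are just byte arrays, identified by their position in the log.
log = [b"r0", b"r1", b"r2", b"r3"]

def fetch(offset, max_records=2):
    """Return a chunk of the log starting at the requested offset."""
    return log[offset:offset + max_records]

first = fetch(0)            # consumer starts at the beginning
position = len(first)       # consumer, not the broker, tracks its position
second = fetch(position)    # continue from where it left off
replayed = fetch(0)         # rewinding the offset re-consumes old records
```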
Producers and Consumers
- Producers
- lots of other customization options
- Asynchronous/Synchronous send
- Batching
- Compression
- Consumers
- How is it different from other messaging queues? In a typical messaging system, the queue pushes messages to consumers and maintains all the associated metadata
- Each consumer process belongs to a consumer group
- Each message is delivered to exactly one process within every subscribing consumer group
- In an ideal world, there will be multiple logical consumer groups, each consisting of a cluster of consuming machines
- Because data is shared across groups, no matter how many consumers a topic has, a message is stored only a single time
- Here, consumers pull messages
Where to use?
- Real time event/log aggregations
- Speed layer in the Lambda architecture
- Real time news feeds/metrics/alerts/monitoring
- Data loading for data processing systems
- Event sourcing
- Commit logs
- Best suited when there are multiple consumers, as that is what it is most optimized for
- commit logs - database state capture
- Because Kafka topics are very cheap from a performance and overhead standpoint, it's possible to create as many queues as we want, scaled to the performance we want, optimizing resource utilization across the system; because they can be created dynamically, we can make our business rules very flexible
- Event sourcing: a style of application design where state changes are logged as a time-ordered sequence of records
- Kafka is fast becoming the centre of gravity for data logging
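The event-sourcing use case above can be sketched in a few lines (a toy example with made-up account events): current state is never stored directly; it is rebuilt by replaying the time-ordered log.

```python
# A time-ordered log of state changes (the "source of truth").
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def replay(log):
    """Rebuild current state by folding over the event log in order."""
    balance = 0
    for event in log:
        if event["type"] == "deposit":
            balance += event["amount"]
        else:
            balance -= event["amount"]
    return balance

balance = replay(events)  # 100 - 30 + 5 = 75
```

Because Kafka retains the log for a configurable period, any new consumer can replay it from the start to derive its own view of the state.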