kafka_talk

A 5-minute talk about Apache Kafka

On GitHub: kiruthikasamapathy / kafka_talk

Apache Kafka

High-throughput messaging system as a distributed commit log

  • Publish-subscribe messaging designed as a distributed commit log
  • High volume of events/activities to be recorded (100k+/sec)
  • Messages/events need to be delivered at least once
  • A mix of online and batch consumers
  • Consumers reading at different paces
  • Multiple consumers per message, whereas a typical message queue forces a queue per consumer, thereby duplicating all the data
  • A heavily loaded message queue hurts overall throughput
  • In a data-driven company, events such as user activities drive the business, and events are becoming first-class citizens
  • No real-time data processing tool is complete without Kafka integration
  • A typical message queue's performance dips severely whenever it is backed up beyond what can be kept in memory

Decoupled data pipeline

  • It is quite a simple abstraction
  • Producers send messages to the Kafka cluster
  • The Kafka cluster holds these messages for a specified time
  • Consumers read content from the cluster at their own pace
  • A producer need not know anything about the downstream pipeline, as the sketch below illustrates
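A minimal sketch of this decoupling with the standard Java client, assuming a local broker at localhost:9092 and a hypothetical topic named user-activity (both names are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DecoupledPipelineSketch {
        public static void main(String[] args) {
            // Producer side: publishes and forgets; it knows nothing about consumers.
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:home"));
            }

            // Consumer side: reads at its own pace, independently of the producer.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "activity-readers");  // hypothetical consumer group
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("user-activity"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }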

How is Kafka different?

  • Explicitly distributed
  • Partitioned
  • Replicated
  • Dumb pipelines
  • A central commit log: all published messages are retained for a configurable period of time
  • Distributed: a Kafka cluster is a set of machines called brokers
  • Kafka assumes that producers, brokers and consumers are all spread across multiple machines
  • Producers, brokers and consumers run as a logical group with the help of ZooKeeper
  • Partitioning helps the system scale easily
  • Replication makes the system highly available and predictable
  • State information about what has been consumed is kept by the consumer, not the data pipeline
  • It lets you break apart your entire infrastructure
  • All your systems can dump data into it, because it is cheap, easy to scale and predictable

Brokers

  • A topic is a virtual category/feed; it is the key abstraction
  • For each topic, Kafka maintains a partitioned log
  • A Kafka topic is an append-only (write-ahead) log
  • So ordering is guaranteed at the partition level; see the topic-creation sketch after this list
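A minimal sketch of creating such a partitioned, replicated topic with the Java AdminClient, assuming a broker at localhost:9092 and a hypothetical topic name page-views:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // The topic is split into 3 partitions (3 independent append-only logs),
                // each replicated onto 2 brokers; ordering holds within each partition.
                NewTopic topic = new NewTopic("page-views", 3, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }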

Replicated

Partition leader

  • Each partition has one broker which acts as the "leader" and zero or more servers which act as "followers" that replicate the leader
  • If the leader fails, one of the followers will automatically become the new leader
  • Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster
  • The producer sends data directly to the broker that is the partition leader
  • How does the producer know which broker that is? Any Kafka node can return metadata saying which servers are alive and which broker leads each partition; see the sketch after this list
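A minimal sketch of inspecting that metadata from the producer side, assuming a broker at localhost:9092 and the hypothetical page-views topic from above:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.PartitionInfo;

    public class LeaderMetadataSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The client fetches cluster metadata, learns which broker leads each partition,
                // and then sends every record directly to that partition's leader.
                for (PartitionInfo info : producer.partitionsFor("page-views")) {
                    System.out.printf("partition %d -> leader %s, %d replicas%n",
                            info.partition(), info.leader(), info.replicas().length);
                }
            }
        }
    }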

Partitioned

  • Partitions are log files on disk
  • Partitioning allows the log to scale beyond a size that would fit on a single server

Overview

  • So, is Kafka a queueing system or a publish-subscribe/broadcast system?
  • It has a single consumer abstraction that covers both: the consumer group
  • Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group
  • If all the consumer instances have the same consumer group, this works just like a traditional queue balancing load over the consumers
  • If all the consumer instances have different consumer groups, this works like publish-subscribe and all messages are broadcast to all consumers; a sketch of both setups follows this list
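A minimal sketch of both setups, assuming a broker at localhost:9092 and hypothetical topic and group names; all that changes between queueing and broadcast is the group.id:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerGroupSketch {
        // Builds a consumer subscribed to "page-views" under the given group.
        static KafkaConsumer<String, String> consumerInGroup(String groupId) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", groupId);
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("page-views"));
            return consumer;
        }

        public static void main(String[] args) {
            // Queue semantics: both instances share one group, so each message is
            // delivered to only one of them and the load is balanced between them.
            KafkaConsumer<String, String> worker1 = consumerInGroup("billing");
            KafkaConsumer<String, String> worker2 = consumerInGroup("billing");

            // Publish-subscribe semantics: a different group independently receives
            // its own copy of every message published to the topic.
            KafkaConsumer<String, String> auditor = consumerInGroup("audit");
        }
    }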

Topic and Partitions

  • Every record is a key-value pair; the key determines the partition
  • The key may be assigned randomly, or the producer may specify it explicitly
  • Every record has an offset number; a consumer uses this offset to track its position
  • Messages are simply byte arrays, and developers can use them to store any object in any format, with String, JSON and Avro the most common
  • A consumer specifies an offset in the log with each request and receives back a chunk of the log beginning at that position, which gives better control and makes it possible to re-consume; see the sketch after this list
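A minimal sketch of that offset control, assuming a broker at localhost:9092, the hypothetical page-views topic and a hypothetical replayer group; the consumer rewinds to offset 0 and re-consumes records it has already read:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetControlSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", "replayer");                 // hypothetical group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // The consumer, not the broker, owns its position: assign a partition
                // and jump back to offset 0 to re-consume from the start of the log.
                TopicPartition partition = new TopicPartition("page-views", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, 0L);
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }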

Producers and Consumers

  • Producers
  • Lots of other customization options
  • Asynchronous/synchronous send
  • Batching
  • Compression
  • Consumers
  • How is it different from other messaging queues? In a typical messaging system, the queue pushes messages to consumers and maintains all the associated metadata
  • Each consumer process belongs to a consumer group
  • Each message is delivered to exactly one process within every consumer group
  • In an ideal world, there will be multiple logical consumer groups, each consisting of a cluster of consuming machines
  • Even with large volumes of data, no matter how many consumers a topic has, a message is stored only a single time
  • Here, consumers pull messages; a producer-tuning sketch follows this list
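A minimal sketch of those producer options (asynchronous and synchronous send, batching, compression), assuming a broker at localhost:9092 and the hypothetical page-views topic; the specific values are illustrative, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class ProducerTuningSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("compression.type", "gzip");  // compress batches on the wire
            props.put("batch.size", 32768);         // batch up to 32 KB per partition
            props.put("linger.ms", 20);             // wait up to 20 ms to fill a batch

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("page-views", "user-42", "viewed:pricing");

                // Asynchronous send: the callback fires once the broker acknowledges the batch.
                producer.send(record, (RecordMetadata meta, Exception e) ->
                        System.out.println(e == null ? "acked at offset " + meta.offset() : "failed: " + e));

                // Synchronous send: block until the broker acknowledges this record.
                RecordMetadata meta = producer.send(record).get();
                System.out.println("synchronously acked at offset " + meta.offset());
            }
        }
    }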


Where to use?

  • Real-time event/log aggregation
  • Speed layer in the Lambda architecture
  • Real-time news feeds/metrics/alerts/monitoring
  • Data loading for data processing systems
  • Event sourcing
  • Commit logs
  • No real-time data processing tool is complete without Kafka integration
  • Best suited when there are multiple consumers, as that is what it is optimized for
  • Commit logs: database state capture
  • Because Kafka topics are very cheap from a performance and overhead standpoint, it is possible to create as many queues as we want, scaled to the performance we need, while optimizing resource utilization across the system; and because they can be created dynamically, business rules can stay very flexible
  • Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records, and Kafka is fast becoming the centre of gravity for this kind of data logging; see the sketch after this list
  • All your systems can dump data into it, because it is cheap, easy to scale and predictable
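A minimal sketch of the event-sourcing/commit-log pattern, assuming a broker at localhost:9092 and a hypothetical account-events topic: the consumer replays the log from the beginning and rebuilds current state purely from the recorded changes:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class EventSourcingReplaySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", "state-rebuilder");          // hypothetical group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Map<String, String> state = new HashMap<>();  // state rebuilt purely from the log
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("account-events", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seekToBeginning(Collections.singletonList(partition));

                // Replay the time-ordered sequence of state changes to reconstruct current state.
                for (ConsumerRecord<String, String> event : consumer.poll(Duration.ofSeconds(1))) {
                    state.put(event.key(), event.value());  // last write wins per key
                }
            }
            System.out.println("rebuilt state: " + state);
        }
    }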