Richard Searle

home

Injecting pigs into Kafka

30 May 2015

Kafka is designed to move a stream of data, with an implicit assumption that the stream is continuous.

There are use cases where boundaries need to be defined within the stream. For example:

Many of these use cases could be implemented by:

Both of these require thread safe designs since the consumer will either be referencing data owned by another thread or will be called by more than one thread. In either case, performance is adversely impacted.

Alternatively, the boundary can be defined by a message produced into the existing Kafka stream. The concept is something like the pig used to seperate products in a pipeline.

The implementation requires at least one (and ideally only one) pig message be delivered to each consumer. This requires a pig message for each partition, ideally with one consumer per partition.

Fortunately, the new Java producer allows explicit specification of the partition for a message.

Note that mirrormaker does not currently provide the precise control over partitioning required for this implementation.