Goals:
- Learn how to implement a dummy distributed application that uses Kafka for communication between components
- Investigate the throughput of a Kafka-based solution for different numbers of producers, consumers, partitions, and replicas
Implementation:
- Kafka environment
- Producer service
- Consumer service
- Reporting service
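A minimal sketch of what the producer and consumer services could look like, assuming Python and the kafka-python client (the topic name, group id, and broker address are illustrative; the actual repository may use a different client or language):

```python
import csv
import json

TOPIC = "reddit_comments"  # hypothetical topic name

def encode(record: dict) -> bytes:
    """Serialize one dataset row as the Kafka message value (JSON bytes)."""
    return json.dumps(record).encode("utf-8")

def run_producer(csv_path: str, bootstrap: str = "localhost:9092") -> None:
    """Producer service: publish each comment from the dataset to the topic."""
    from kafka import KafkaProducer  # kafka-python; imported lazily so the
    # pure helper above can be used without the client installed
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # assumes a header row (subreddit, body)
            producer.send(TOPIC, encode(row))
    producer.flush()

def run_consumer(bootstrap: str = "localhost:9092") -> None:
    """Consumer service: instances sharing a group_id split the topic's
    partitions between them, which is what the scaling tests rely on."""
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=bootstrap,
        group_id="reddit_consumers",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        record = json.loads(msg.value)
        # ... hand record to the reporting service, e.g. count per subreddit
```

Because consumers join the same group, running more consumer instances than partitions leaves the extra instances idle, which is exactly what the partition/consumer scenarios below probe.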
The dataset is 1 million Reddit comments from 40 subreddits, from which the last two columns were removed by running:
cut -d, -f1-2 kaggle_RC_2019-05.csv > reddit_ds.csv
The resulting working dataset reddit_ds.csv contains the following columns:
- subreddit (categorical): the subreddit on which the comment was posted
- body (str): the comment content
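One caveat about the cut step: cut splits on every comma, so quoted comment bodies that themselves contain commas will be truncated. A CSV-aware alternative, sketched in Python (file names as above, `keep_first_columns` is a hypothetical helper):

```python
import csv

def keep_first_columns(src_path: str, dst_path: str, n: int = 2) -> int:
    """CSV-aware equivalent of `cut -d, -f1-n`: keeps the first n columns
    without breaking quoted fields that contain commas. Returns row count."""
    rows = 0
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row[:n])
            rows += 1
    return rows

# keep_first_columns("kaggle_RC_2019-05.csv", "reddit_ds.csv")
```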
Set up
git clone https://github.com/conduktor/kafka-stack-docker-compose.git
cd kafka-stack-docker-compose
docker compose -f full-stack.yml up -d
cd ..
git clone https://github.com/artemiuss/kafka_throughput_investigation.git
cd kafka_throughput_investigation
docker compose build
Run tests
./run_tests.sh
Stop and clean-up
docker compose down --rmi all -v --remove-orphans
cd kafka-stack-docker-compose
docker compose -f full-stack.yml down --rmi all -v --remove-orphans
Test scenarios
- 1 producer, a topic with 1 partition, 1 consumer
- 1 producer, a topic with 1 partition, 2 consumers
- 1 producer, a topic with 2 partitions, 2 consumers
- 1 producer, a topic with 5 partitions, 5 consumers
- 1 producer, a topic with 10 partitions, 1 consumer
- 1 producer, a topic with 10 partitions, 5 consumers
- 1 producer, a topic with 10 partitions, 10 consumers
- 2 producers (the input data is split into two parts, one per producer), a topic with 10 partitions, 10 consumers
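One way to split the input for the two-producer scenario is a round-robin split of the dataset into one file per producer. A sketch (the function name and output file names are illustrative, not from the repository):

```python
import csv

def split_dataset(src_path: str, out_paths: list[str]) -> list[int]:
    """Split the dataset round-robin into len(out_paths) parts, one file
    per producer. Returns the number of rows written to each part."""
    outs = [open(p, "w", newline="") for p in out_paths]
    writers = [csv.writer(f) for f in outs]
    counts = [0] * len(outs)
    with open(src_path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            k = i % len(writers)  # alternate rows between the parts
            writers[k].writerow(row)
            counts[k] += 1
    for f in outs:
        f.close()
    return counts

# split_dataset("reddit_ds.csv", ["reddit_ds_part1.csv", "reddit_ds_part2.csv"])
```

A round-robin split keeps the two parts nearly equal in row count; an alternative is splitting by subreddit, but that can leave the producers with uneven load.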