The aiven_kafka_demo from alxxthegeek

Exercise
========

Your task is to implement a system that generates operating system metrics and
passes the events through Aiven Kafka instance to Aiven PostgreSQL database.
For this, you need a Kafka producer which sends data to a Kafka topic, and a
Kafka consumer storing the data to Aiven PostgreSQL database. For practical
reasons, these components may run in the same machine (or container or whatever
system you choose), but in production use similar components would run in
different systems.

You can choose what metric or metrics you collect and send. You can implement
metrics collection by yourself or use any available tools or libraries.

Even though this is a small concept program, returned homework should include
tests and proper packaging. If your tests require Kafka and PostgreSQL
services, for simplicity your tests can assume those are already running,
instead of integrating Aiven service creation and deleting.




Installation
============

git clone https://github.com/alxxthegeek/aiven_kafka_demo.git
cd aiven_kafka_demo
pip3 install -r requirements.txt


How to use
==========

open two windows
1. python3 metrics_producer.py
2. python3 metrics_consumer.py

PostgreSQL schema
================
Database kaf_demo
Table system_metrics

CREATE TABLE public.system_metrics (
    id SERIAL NOT NULL,
    datetime timestamp without time zone,
    hostname character varying(64) NOT NULL,
    metric character varying(64) NOT NULL,
    value real NOT NULL,
    CONSTRAINT sys_metrics_prikey PRIMARY KEY (id)
);


select * from system_metrics ORDER BY datetime DESC LIMIT 40;
+----+--------------------------+--------+---------------------+----------+
|id  |datetime                  |hostname|metric               |value     |
+----+--------------------------+--------+---------------------+----------+
|1800|2020-02-10 15:00:09.644000|kaf     |total_cpu            |3.2       |
|1796|2020-02-10 14:59:59.394000|kaf     |percent_memory_use   |65.9      |
|1795|2020-02-10 14:59:59.394000|kaf     |available_memory     |685416450 |
|1797|2020-02-10 14:59:59.394000|kaf     |network_traffic_in   |366120896 |
|1793|2020-02-10 14:59:59.394000|kaf     |current_cpu_frequency|3092.839  |
|1794|2020-02-10 14:59:59.394000|kaf     |memory_use           |2009927680|
|1792|2020-02-10 14:59:59.394000|kaf     |total_cpu            |2.9       |
|1798|2020-02-10 14:59:59.394000|kaf     |network_traffic_out  |366120896 |
|1799|2020-02-10 14:59:59.394000|kaf     |total_network_traffic|904035200 |
|1789|2020-02-10 14:59:49.130000|kaf     |network_traffic_in   |366112992 |
|1788|2020-02-10 14:59:49.130000|kaf     |percent_memory_use   |65.9      |
|1787|2020-02-10 14:59:49.130000|kaf     |available_memory     |685395970 |
|1791|2020-02-10 14:59:49.130000|kaf     |total_network_traffic|904019520 |
|1790|2020-02-10 14:59:49.130000|kaf     |network_traffic_out  |366112992 |
|1784|2020-02-10 14:59:49.130000|kaf     |total_cpu            |2.1       |
|1786|2020-02-10 14:59:49.130000|kaf     |memory_use           |2009927680|
|1785|2020-02-10 14:59:49.130000|kaf     |current_cpu_frequency|3092.839  |
|1780|2020-02-10 14:59:38.885000|kaf     |percent_memory_use   |65.9      |
|1781|2020-02-10 14:59:38.885000|kaf     |network_traffic_in   |366103136 |
|1779|2020-02-10 14:59:38.885000|kaf     |available_memory     |685379580 |
|1782|2020-02-10 14:59:38.885000|kaf     |network_traffic_out  |366103136 |
|1777|2020-02-10 14:59:38.885000|kaf     |current_cpu_frequency|3092.839  |
|1783|2020-02-10 14:59:38.885000|kaf     |total_network_traffic|904000190 |
|1776|2020-02-10 14:59:38.885000|kaf     |total_cpu            |1.8       |
|1778|2020-02-10 14:59:38.885000|kaf     |memory_use           |2009927680|
|1774|2020-02-10 14:59:28.641000|kaf     |network_traffic_out  |366095168 |
|1770|2020-02-10 14:59:28.641000|kaf     |memory_use           |2009927680|
|1769|2020-02-10 14:59:28.641000|kaf     |current_cpu_frequency|3092.839  |
|1771|2020-02-10 14:59:28.641000|kaf     |available_memory     |685363200 |
|1772|2020-02-10 14:59:28.641000|kaf     |percent_memory_use   |65.9      |
|1775|2020-02-10 14:59:28.641000|kaf     |total_network_traffic|903984130 |
|1773|2020-02-10 14:59:28.641000|kaf     |network_traffic_in   |366095168 |
|1768|2020-02-10 14:59:28.641000|kaf     |total_cpu            |2.6       |
|1765|2020-02-10 14:59:18.399000|kaf     |network_traffic_in   |366087264 |
|1766|2020-02-10 14:59:18.399000|kaf     |network_traffic_out  |366087264 |
|1763|2020-02-10 14:59:18.399000|kaf     |available_memory     |685355010 |
|1762|2020-02-10 14:59:18.399000|kaf     |memory_use           |2009927680|
|1764|2020-02-10 14:59:18.399000|kaf     |percent_memory_use   |65.9      |
|1760|2020-02-10 14:59:18.399000|kaf     |total_cpu            |2.6       |
|1767|2020-02-10 14:59:18.399000|kaf     |total_network_traffic|903968770 |
+----+--------------------------+--------+---------------------+----------+



Discussion
==========

Logs are stored into
../kaf_demo/logs

Please note the logging is currently set to debug which will produce an excessive amount of log entries but is useful when learning how a
system works(as I've never used kafka before).
Logging code is from previous projects of mine(pretty simple).

Implementation is using a single producer and single consumer.
Performance could be improved by  implementing a separate producer for each metric type and a separate consumer using threads.
A threaded alternative to metrics_producer.py is metrics.prod.py. In this implementation it does not provide any benefits.
The corresponding metri

The metrics producer uses python-kafka serializer as I saw it in an example(medium article) and decided to use it.

Only eight metrics were implemented - total cpu %, current cpu frequency (GHz), memory use(Total memory usage), percent memory use(%),
network traffic in (bytes received), network traffic out(bytes sent) and total network traffic(bytes received and sent).
Used psutils as it works on all platforms, I have used it before and it is easy to get it up and running.

I used the pyscopg2 library for the postgres access.

I tried to use the pyscopg2.extra.execute_values but was getting tuple errors that I could not fix in the time I set myself
so reverted to using a loop and saving each metric from the message separately. Not optimal/in-efficient.

No recovery from errors is implemented.If an error or exception occurs the system will exit.

The table schema is simple(naive) for ease/speed of implementation and is fairly wasteful. It does not record the metric type or
type of units. A more optimal solution would be a table per metric or to use a time series database like InfluxDB.

The solution is manually deployed.
A better solution would have been to setup the project in a virtual env and then deploy the virtual env via pip.

All the logging, get_config() and other utility functions, could have been moved to util.py for a much cleaner implementation.
In a production version I would remove all the print statements and minimize the logging to log only what was required.
Having good logs is important but need to keep them clean and to the point otherwise log diving can be very tedious.

I first install kafka and postgres on a centos vm.
Got my code running.
Then setup the services on Aiven.
Then got the code running against the services.

metrics_producer and metrics_consumer work.

metrics_con.py - doesn't work its a threaded version of the metrics_consumer but gets a json error.
kaf_demo.py - does not work - It starts the consumer and producer but is not fully working. Consumer thread does not get started.
Tests do not work and I ran out of time.




References:
https://aiven.io/blog/data-streaming-made-simple-with-apache-kafka/
https://help.aiven.io/en/articles/489573-getting-started-with-aiven-postgresql
https://aiven.io/blog/getting-started-with-aiven-kafka/
https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-centos-7#step-5-%E2%80%94-testing-the-installation
http://kafka-python.readthedocs.io/
https://kafka.apache.org/quickstart
https://www.psycopg.org/docs/usage.html
https://timber.io/blog/hello-world-in-kafka-using-python/

Code/example to use the python-kafka deserializer
https://medium.com/@mukeshkumar_46704/consume-json-messages-from-kafka-using-kafka-pythons-deserializer-859f5d39e02c
https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1
alxxthegeek / aiven_kafka_demo Goto Github PK

aiven_kafka_demo's Introduction

aiven_kafka_demo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent