Git Product home page Git Product logo

aiven_kafka_demo's Introduction

Exercise
========

Your task is to implement a system that generates operating system metrics and
passes the events through Aiven Kafka instance to Aiven PostgreSQL database.
For this, you need a Kafka producer which sends data to a Kafka topic, and a
Kafka consumer storing the data to Aiven PostgreSQL database. For practical
reasons, these components may run in the same machine (or container or whatever
system you choose), but in production use similar components would run in
different systems.

You can choose what metric or metrics you collect and send. You can implement
metrics collection by yourself or use any available tools or libraries.

Even though this is a small concept program, returned homework should include
tests and proper packaging. If your tests require Kafka and PostgreSQL
services, for simplicity your tests can assume those are already running,
instead of integrating Aiven service creation and deleting.




Installation
============

git clone https://github.com/alxxthegeek/aiven_kafka_demo.git
cd aiven_kafka_demo
pip3 install -r requirements.txt


How to use
==========

open two windows
1. python3 metrics_producer.py
2. python3 metrics_consumer.py

PostgreSQL schema
================
Database kaf_demo
Table system_metrics

CREATE TABLE public.system_metrics (
    id SERIAL NOT NULL,
    datetime timestamp without time zone,
    hostname character varying(64) NOT NULL,
    metric character varying(64) NOT NULL,
    value real NOT NULL,
    CONSTRAINT sys_metrics_prikey PRIMARY KEY (id)
);


select * from system_metrics ORDER BY datetime DESC LIMIT 40;
+----+--------------------------+--------+---------------------+----------+
|id  |datetime                  |hostname|metric               |value     |
+----+--------------------------+--------+---------------------+----------+
|1800|2020-02-10 15:00:09.644000|kaf     |total_cpu            |3.2       |
|1796|2020-02-10 14:59:59.394000|kaf     |percent_memory_use   |65.9      |
|1795|2020-02-10 14:59:59.394000|kaf     |available_memory     |685416450 |
|1797|2020-02-10 14:59:59.394000|kaf     |network_traffic_in   |366120896 |
|1793|2020-02-10 14:59:59.394000|kaf     |current_cpu_frequency|3092.839  |
|1794|2020-02-10 14:59:59.394000|kaf     |memory_use           |2009927680|
|1792|2020-02-10 14:59:59.394000|kaf     |total_cpu            |2.9       |
|1798|2020-02-10 14:59:59.394000|kaf     |network_traffic_out  |366120896 |
|1799|2020-02-10 14:59:59.394000|kaf     |total_network_traffic|904035200 |
|1789|2020-02-10 14:59:49.130000|kaf     |network_traffic_in   |366112992 |
|1788|2020-02-10 14:59:49.130000|kaf     |percent_memory_use   |65.9      |
|1787|2020-02-10 14:59:49.130000|kaf     |available_memory     |685395970 |
|1791|2020-02-10 14:59:49.130000|kaf     |total_network_traffic|904019520 |
|1790|2020-02-10 14:59:49.130000|kaf     |network_traffic_out  |366112992 |
|1784|2020-02-10 14:59:49.130000|kaf     |total_cpu            |2.1       |
|1786|2020-02-10 14:59:49.130000|kaf     |memory_use           |2009927680|
|1785|2020-02-10 14:59:49.130000|kaf     |current_cpu_frequency|3092.839  |
|1780|2020-02-10 14:59:38.885000|kaf     |percent_memory_use   |65.9      |
|1781|2020-02-10 14:59:38.885000|kaf     |network_traffic_in   |366103136 |
|1779|2020-02-10 14:59:38.885000|kaf     |available_memory     |685379580 |
|1782|2020-02-10 14:59:38.885000|kaf     |network_traffic_out  |366103136 |
|1777|2020-02-10 14:59:38.885000|kaf     |current_cpu_frequency|3092.839  |
|1783|2020-02-10 14:59:38.885000|kaf     |total_network_traffic|904000190 |
|1776|2020-02-10 14:59:38.885000|kaf     |total_cpu            |1.8       |
|1778|2020-02-10 14:59:38.885000|kaf     |memory_use           |2009927680|
|1774|2020-02-10 14:59:28.641000|kaf     |network_traffic_out  |366095168 |
|1770|2020-02-10 14:59:28.641000|kaf     |memory_use           |2009927680|
|1769|2020-02-10 14:59:28.641000|kaf     |current_cpu_frequency|3092.839  |
|1771|2020-02-10 14:59:28.641000|kaf     |available_memory     |685363200 |
|1772|2020-02-10 14:59:28.641000|kaf     |percent_memory_use   |65.9      |
|1775|2020-02-10 14:59:28.641000|kaf     |total_network_traffic|903984130 |
|1773|2020-02-10 14:59:28.641000|kaf     |network_traffic_in   |366095168 |
|1768|2020-02-10 14:59:28.641000|kaf     |total_cpu            |2.6       |
|1765|2020-02-10 14:59:18.399000|kaf     |network_traffic_in   |366087264 |
|1766|2020-02-10 14:59:18.399000|kaf     |network_traffic_out  |366087264 |
|1763|2020-02-10 14:59:18.399000|kaf     |available_memory     |685355010 |
|1762|2020-02-10 14:59:18.399000|kaf     |memory_use           |2009927680|
|1764|2020-02-10 14:59:18.399000|kaf     |percent_memory_use   |65.9      |
|1760|2020-02-10 14:59:18.399000|kaf     |total_cpu            |2.6       |
|1767|2020-02-10 14:59:18.399000|kaf     |total_network_traffic|903968770 |
+----+--------------------------+--------+---------------------+----------+



Discussion
==========

Logs are stored into
../kaf_demo/logs

Please note the logging is currently set to debug which will produce an excessive amount of log entries but is useful when learning how a
system works(as I've never used kafka before).
Logging code is from previous projects of mine(pretty simple).

Implementation is using a single producer and single consumer.
Performance could be improved by  implementing a separate producer for each metric type and a separate consumer using threads.
A threaded alternative to metrics_producer.py is metrics.prod.py. In this implementation it does not provide any benefits.
The corresponding metri

The metrics producer uses python-kafka serializer as I saw it in an example(medium article) and decided to use it.

Only eight metrics were implemented - total cpu %, current cpu frequency (GHz), memory use(Total memory usage), percent memory use(%),
network traffic in (bytes received), network traffic out(bytes sent) and total network traffic(bytes received and sent).
Used psutils as it works on all platforms, I have used it before and it is easy to get it up and running.

I used the pyscopg2 library for the postgres access.

I tried to use the pyscopg2.extra.execute_values but was getting tuple errors that I could not fix in the time I set myself
so reverted to using a loop and saving each metric from the message separately. Not optimal/in-efficient.

No recovery from errors is implemented.If an error or exception occurs the system will exit.

The table schema is simple(naive) for ease/speed of implementation and is fairly wasteful. It does not record the metric type or
type of units. A more optimal solution would be a table per metric or to use a time series database like InfluxDB.

The solution is manually deployed.
A better solution would have been to setup the project in a virtual env and then deploy the virtual env via pip.

All the logging, get_config() and other utility functions, could have been moved to util.py for a much cleaner implementation.
In a production version I would remove all the print statements and minimize the logging to log only what was required.
Having good logs is important but need to keep them clean and to the point otherwise log diving can be very tedious.

I first install kafka and postgres on a centos vm.
Got my code running.
Then setup the services on Aiven.
Then got the code running against the services.

metrics_producer and metrics_consumer work.

metrics_con.py - doesn't work its a threaded version of the metrics_consumer but gets a json error.
kaf_demo.py - does not work - It starts the consumer and producer but is not fully working. Consumer thread does not get started.
Tests do not work and I ran out of time.




References:
https://aiven.io/blog/data-streaming-made-simple-with-apache-kafka/
https://help.aiven.io/en/articles/489573-getting-started-with-aiven-postgresql
https://aiven.io/blog/getting-started-with-aiven-kafka/
https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-centos-7#step-5-%E2%80%94-testing-the-installation
http://kafka-python.readthedocs.io/
https://kafka.apache.org/quickstart
https://www.psycopg.org/docs/usage.html
https://timber.io/blog/hello-world-in-kafka-using-python/

Code/example to use the python-kafka deserializer
https://medium.com/@mukeshkumar_46704/consume-json-messages-from-kafka-using-kafka-pythons-deserializer-859f5d39e02c
https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1

















aiven_kafka_demo's People

Contributors

alxxthegeek avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.