bubuku's People

Contributors

a1exsh, adyach, antban, drummerwolli, ferbncode, kunal-jha, lmontrieux, rcillo, samizzy, tor-vs-floki, v-stepanov, vlad-ro, zrobgar

bubuku's Issues

Optimize rebalance

Right now the rebalance is performed step by step, with a step size of 1 partition.
It might be better to transfer more partitions at a time (5, 10, or a number derived from the leader count), which would make the rebalance faster.
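A minimal sketch of what a batched step could look like (the helper names here are assumptions for illustration, not the actual bubuku rebalance code):

    # Sketch: move a small batch of partitions per rebalance step instead of one.
    BATCH_SIZE = 10  # could also be derived from the leader count

    def run_rebalance_step(pending_moves, apply_reassignment):
        batch = pending_moves[:BATCH_SIZE]
        if not batch:
            return False  # nothing left to move, the rebalance is done
        apply_reassignment(batch)  # hypothetical helper that writes the reassignment to zk
        return True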

Create time limit for leader election

Right now, if leader election cannot proceed (for some strange reason), the broker won't start at all. It would probably be a good idea to start anyway after waiting for leader election for some large amount of time (10 minutes, for example).
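A rough sketch of such a bounded wait (leader_elected() is a hypothetical predicate, not an existing bubuku function):

    import time

    LEADER_ELECTION_TIMEOUT = 600  # seconds, i.e. 10 minutes

    def wait_for_leader_election(leader_elected, poll_interval=5):
        # Poll until leader election succeeds or the deadline is reached.
        deadline = time.time() + LEADER_ELECTION_TIMEOUT
        while time.time() < deadline:
            if leader_elected():
                return True
            time.sleep(poll_interval)
        return False  # deadline reached: proceed with the start anyway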

graceful_terminate is not working in some cases

graceful_terminate works with the initial version of the controller, but it may happen that the controller has already been replaced and it no longer works, while the shutdown hook is still installed...

Prepare dev infrastructure

It would be good to have some kind of local environment to test new features or reproduce bugs. Currently we have no way to test what we did.

Wrong change name representation

The change is RestartBrokerChange, but its string representation is RestartOnZkChange; the file name is also quite confusing: restart_on_zk_change.py

Exception when trying to call `migrate`

Hi,

we have a Kafka cluster and I want to move partitions to other nodes. As far as I understand, this is what bubuku-cli migrate is for.

However, when I call it like this:

$ bubuku-cli stats
Broker Id  Free kb     Used kb  
52397540   464380884   42792420 
52399021   4930335540  26419188 
52400677   4801094116  155660612
52402004   478177812   28995492 
52404200   486099964   21073340
$ bubuku-cli migrate --from "52399021,52400677" --to "52402004,52404200,52397540"

I get the following error:

INFO:kazoo.client:Zookeeper session lost, state: CLOSED
Traceback (most recent call last):
  File "/usr/local/bin/bubuku-cli", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/bubuku/cli.py", line 94, in migrate_broker
    RemoteCommandExecutorCheck.register_migration(zookeeper, from_.split(','), to.split(','), shrink, broker_id)
AttributeError: 'NoneType' object has no attribute 'split'

It seems like either `--from` or `--to` is not picked up correctly by click? Should I use a different format?
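For reference, this is roughly the kind of option declaration that would fail fast instead of passing None into the command (just a sketch assuming click, with parameter names taken from the traceback; it may not match the real cli.py):

    import click

    @click.command()
    @click.option('--from', 'from_', required=True,
                  help='Comma-separated broker ids to move partitions from')
    @click.option('--to', required=True,
                  help='Comma-separated broker ids to move partitions to')
    def migrate(from_, to):
        # With required=True, click aborts with a usage error if a value is missing,
        # so from_ and to can never be None here.
        from_brokers = from_.split(',')
        to_brokers = to.split(',')
        click.echo('Migrating from {} to {}'.format(from_brokers, to_brokers))

    if __name__ == '__main__':
        migrate()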

Thanks!

Extend logging for CheckBrokerChange

We have had situations where the broker was restarted without any clear reason.
The change checks whether the broker is running and registered in zk. My assumption is that the broker was running normally, but the connection to zk was lost, which triggered a restart of the broker because it was not possible to check whether it was registered or not.
So I suggest extending the logging for this check in order to identify the reason for the restarts.
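A rough sketch of the kind of logging that would help (the helper methods on broker are hypothetical, not the real CheckBrokerChange code):

    import logging

    _LOG = logging.getLogger('bubuku.features.restart_if_dead')

    def check(broker):
        # Log each branch separately so a zk connectivity problem can be told
        # apart from a genuinely dead broker.
        process_running = broker.is_running()          # hypothetical helper
        try:
            registered = broker.is_registered_in_zk()  # hypothetical helper
        except Exception as e:
            _LOG.warning('Could not verify zk registration (%s), not triggering a restart', e)
            return None
        if process_running and registered:
            return None
        _LOG.warning('Broker check failed: process_running=%s, registered_in_zk=%s',
                     process_running, registered)
        return 'restart'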

Issue with rebalancing

Sometimes leadership fails to become balanced.
The reason is that, according to the weight distribution, it is possible to try to move partitions that are no longer present on the broker.

Optimize work with zookeeper

If there are no changes in the queue, then there is no need to take the lock for processing. Right now bubuku tries to take the lock on every step (every 5 seconds), which means that zk currently cannot be deployed on t2 instances because of CPU credit usage.
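A minimal sketch of the intended behaviour, assuming kazoo and an illustrative lock path (not the actual bubuku layout):

    from kazoo.client import KazooClient

    def process_step(zk: KazooClient, local_queue):
        if not local_queue:
            return  # nothing to do, so no zk round-trips at all
        lock = zk.Lock('/bubuku/global_lock')
        if not lock.acquire(blocking=False):
            return  # another instance is processing right now
        try:
            change = local_queue.pop(0)
            change.run()
        finally:
            lock.release()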

Kafka restart can be improved

Currently the kafka restart is not very safe, IMHO. With buku we had a situation where the new kafka started before the old one finished shutting down. We also do what buku does:
process.terminate()
process.wait()
So that does not guarantee that the old kafka has stopped.
I know that here we additionally check that the node no longer exists in ZK.
But I think that, to be on the safe side, we could also check that the Java process has ended (even in a hacky way).
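A possible (admittedly hacky) sketch of such a check, assuming psutil is available and that the broker process can be recognized by 'kafka.Kafka' in its command line:

    import time
    import psutil

    def wait_for_kafka_process_to_die(timeout=120, poll_interval=2):
        # After terminate()/wait(), also wait until no Java process running
        # kafka.Kafka is left before starting the new broker.
        deadline = time.time() + timeout
        while time.time() < deadline:
            still_running = any(
                p.info['name'] == 'java'
                and any('kafka.Kafka' in arg for arg in (p.info['cmdline'] or []))
                for p in psutil.process_iter(['name', 'cmdline'])
            )
            if not still_running:
                return True
            time.sleep(poll_interval)
        return False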

Bubuku dies on any exception and terminates kafka instance

The problem is that if bubuku runs in docker, then the kafka instance will be terminated immediately (without writing its data).

In our case it was like this:

Aug 19 10:58:07 ip-172-31-139-127 docker/26f64e7b1654[888]: WARNING:kazoo.client:Connection dropped: socket connection error: None
Aug 19 10:58:07 ip-172-31-139-127 docker/26f64e7b1654[888]: INFO:kazoo.client:Connecting to 172.31.172.94:2181
Aug 19 10:58:07 ip-172-31-139-127 docker/26f64e7b1654[888]: WARNING:kazoo.client:Connection dropped: socket connection broken
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: INFO Opening socket connection to server ip-172-31-174-143.eu-west-1.compute.internal/172.31.174.143:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: Traceback (most recent call last):
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/kazoo/retry.py", line 123, in __call__
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return func(*args, **kwargs)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/kazoo/client.py", line 1026, in get
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self.get_async(path, watch).get()
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/kazoo/handlers/utils.py", line 72, in get
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     raise self._exception
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: kazoo.exceptions.SessionExpiredError
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: 
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: During handling of the above exception, another exception occurred:
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: 
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: Traceback (most recent call last):
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/bin/bubuku", line 11, in <module>
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     sys.exit(main())
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/daemon.py", line 81, in main
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     controller.loop()
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/controller.py", line 125, in loop
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     self.make_step(ip)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/controller.py", line 143, in make_step
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     self._add_change_to_queue(check.check_if_time())
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/controller.py", line 37, in check_if_time
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self.check()
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/features/restart_if_dead.py", line 58, in check
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     if self.broker.is_running_and_registered():
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/broker.py", line 29, in is_running_and_registered
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self.id_manager.is_registered()
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/id_generator.py", line 62, in is_registered
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self.zk.is_broker_registered(self.broker_id)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/zookeeper/__init__.py", line 166, in is_broker_registered
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     _, stat = self.exhibitor.get('/brokers/ids/{}'.format(broker_id))
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/bubuku/zookeeper/__init__.py", line 110, in get
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self.client.retry(self.client.get, *params)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/kazoo/client.py", line 273, in _retry
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     return self._retry.copy()(*args, **kwargs)
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:   File "/usr/local/lib/python3.5/dist-packages/kazoo/retry.py", line 136, in __call__
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]:     raise RetryFailedError("Exceeded retry deadline")
Aug 19 10:58:08 ip-172-31-139-127 docker/26f64e7b1654[888]: kazoo.retry.RetryFailedError: Exceeded retry deadline

In case of any unpredictable error, bubuku should reinitialize itself without terminating kafka and continue to work.
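A rough sketch of the intended behaviour (build_controller is a hypothetical factory that recreates the zk client, checks and so on; it is not existing bubuku code):

    import logging
    import time

    _LOG = logging.getLogger('bubuku.daemon')

    def run_forever(build_controller):
        # Keep the supervised kafka process alive across unexpected failures in
        # bubuku's own loop by rebuilding bubuku's internal state instead of exiting.
        while True:
            try:
                controller = build_controller()
                controller.loop()
            except Exception:
                _LOG.exception('Unexpected error in the control loop, reinitializing without touching kafka')
                time.sleep(10)  # back off a bit before rebuilding everything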

Refresh exhibitor list in a more reliable way

Currently, if the exhibitor list cannot be refreshed, AWSExhibitorAddressProvider will fall back to the data from master_exhibitors. It would be much better to query AWS to get a fresh exhibitor list by load balancer name.

And of course it is better not to start kafka instances at all if the exhibitor list is empty.
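A minimal sketch of the AWS-based refresh, assuming a classic ELB and boto3 (the function name and the "empty list" contract are illustrative, not the actual AWSExhibitorAddressProvider code):

    import boto3

    def get_exhibitor_addresses(lb_name, region):
        # Resolve the current exhibitor instances by load balancer name instead
        # of relying on a possibly stale cached list.
        elb = boto3.client('elb', region_name=region)
        ec2 = boto3.client('ec2', region_name=region)

        states = elb.describe_instance_health(LoadBalancerName=lb_name)['InstanceStates']
        instance_ids = [s['InstanceId'] for s in states if s['State'] == 'InService']
        if not instance_ids:
            return []  # empty list: the caller should refuse to start kafka

        reservations = ec2.describe_instances(InstanceIds=instance_ids)['Reservations']
        return [i['PrivateIpAddress'] for r in reservations for i in r['Instances']]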

Optimize rebalancing

Right now the rebalance process treats the distributions 1,2,3 and 1,3,2 as different and tries to move partitions (sometimes to [4,5,6]). Copying the data makes no sense, so it would be better not to distinguish between 1,2,3 and 1,3,2 and to use (leader_count, overall_partition_count) as the optimization criterion.
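A small sketch of the comparison this implies (the data structures are assumptions for illustration):

    def assignments_equal(current, target):
        # current/target: lists of broker ids, leader first, e.g. [1, 2, 3].
        # Same leader and same set of replicas -> no data needs to be copied.
        return current[0] == target[0] and set(current) == set(target)

    def broker_weight(broker_state):
        # Optimization criterion per broker: balance leadership first,
        # then the overall number of partitions hosted.
        return (broker_state.leader_count, broker_state.overall_partition_count)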

Swap partitions feature is not working correctly

In order to work correctly, the free space per node needs to be refreshed more frequently than partitions are swapped. The problem is that partition sizes change dynamically right after a rebalance, which means that if there are several swap-data events in the queue, they will do bad things one after another, because the brokers' free space is not updated after the rebalance process.

use pre-generated broker.id from metadata.properties

On the very first start, kafka generates a file "metadata.properties" that holds the broker.id to be used in all subsequent starts.
During restarts kafka won't use the broker.id from server.properties; it will use the one from metadata.properties. bubuku should respect this.
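A minimal sketch of that behaviour (the file name follows the issue text; resolve_broker_id is illustrative, not existing bubuku code):

    import os

    def resolve_broker_id(log_dir, configured_id):
        # Prefer the broker.id that kafka itself persisted in the data directory
        # over the one coming from server.properties / bubuku configuration.
        meta_path = os.path.join(log_dir, 'metadata.properties')
        if os.path.exists(meta_path):
            with open(meta_path) as f:
                for line in f:
                    key, _, value = line.strip().partition('=')
                    if key == 'broker.id':
                        return value
        return configured_id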

Support different configuration modes

Right now bubuku supports configuration only via environment properties.
It must be possible to configure it by other means (e.g. file-based configuration).
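A possible sketch of combining the two sources, with environment variables still taking precedence (the key names are illustrative, not the actual bubuku settings):

    import json
    import os

    def load_config(config_file=None):
        config = {}
        # Optional file-based configuration.
        if config_file and os.path.exists(config_file):
            with open(config_file) as f:
                config.update(json.load(f))
        # Environment variables still win, so existing deployments keep working.
        for key in ('KAFKA_DIR', 'ZOOKEEPER_STACK_NAME', 'HEALTH_PORT'):
            if key in os.environ:
                config[key] = os.environ[key]
        return config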

Kafka start timeout should be configurable

Right now on kafka start bubuku tries to do the following steps:

  1. Start the kafka process
  2. Wait up to WAIT_TIMEOUT seconds for the broker id to become available in zk
  3. If it is not there within the timeout:
  • increase WAIT_TIMEOUT by 60 seconds
  • forcibly stop the running kafka process
  • start from 1

The initial wait timeout is 300 seconds, but for some environments it should be significantly bigger (30 minutes, for example), or the increase step should be progressive.
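A small sketch of how this could be made configurable (the environment variable names are suggestions, not existing settings):

    import os

    START_TIMEOUT = int(os.environ.get('KAFKA_START_TIMEOUT', '300'))
    START_TIMEOUT_STEP = int(os.environ.get('KAFKA_START_TIMEOUT_STEP', '60'))

    def next_timeout(current, attempt, progressive=True):
        # Either a fixed step, or a step that grows with every failed attempt.
        step = START_TIMEOUT_STEP * (attempt if progressive else 1)
        return current + step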

In case of a network error it may happen that bubuku won't start

If an unhandled exception occurs in the run_daemon_loop cycle, the bubuku process will try to restart its own work cycle and recreate all internal entities (taking the existing kafka process into account).
But if in that case AmazonEnvProvder.get_region() returns None, it won't be able to refresh the exhibitor address list, and that will start an endless cycle of exceptions (see AWSExhibitorAddressProvider.get_addresses_by_lb_name).

It takes too much time to read data from zk

Every time before kafka starts, bubuku makes a lot of requests to zookeeper. The start process is delayed by 5 minutes (with 2000 topics and a cross-region zk setup), and a rebalance is delayed by 3 minutes. We can't work with that...
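One possible direction is to pipeline the reads with kazoo's async API instead of issuing them one by one; a minimal sketch (the paths follow the standard kafka layout in zk and ignore any chroot prefix):

    from kazoo.client import KazooClient

    def read_all_topic_configs(zk: KazooClient):
        # Fire all requests first, then collect the results, so the cross-region
        # round-trip latency is paid roughly once instead of once per topic.
        topics = zk.get_children('/brokers/topics')
        pending = {topic: zk.get_async('/brokers/topics/' + topic) for topic in topics}
        return {topic: result.get()[0] for topic, result in pending.items()}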

Rebalance is not processed after broker restart

When a broker instance dies, a rebalance is added to the actions queue, but the 'start' action prevents the rebalance from being processed (the started rebalance is removed because a start is in progress), and ongoing rebalances are removed during another (or this) broker's restart.
The root cause of this bug: the 'start' and 'rebalance' actions are added to the queue at the same time (first rebalance, then start); then in run() the rebalance says it will stop because there is a start in progress, and the start just starts.

Create possibility to force rebalance process

Now a rebalance is triggered only by certain events, like a broker list change or a new instance start. It would be good to have the possibility to send commands to bubuku remotely (for example, create special nodes in zk with the actions to run, and periodically check for them).
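A minimal sketch of such a zk-based command channel (the node layout under /bubuku/actions is illustrative, not an existing convention):

    import json
    from kazoo.client import KazooClient

    def request_action(zk: KazooClient, broker_id: str, name: str):
        # Called remotely, e.g. by a CLI, to ask a specific broker to run an action.
        zk.create('/bubuku/actions/{}/{}'.format(broker_id, name),
                  json.dumps({'name': name}).encode('utf-8'),
                  makepath=True)

    def poll_actions(zk: KazooClient, broker_id: str):
        # Called periodically by the daemon; returns and removes pending actions.
        path = '/bubuku/actions/{}'.format(broker_id)
        if not zk.exists(path):
            return []
        actions = []
        for child in zk.get_children(path):
            data, _ = zk.get('{}/{}'.format(path, child))
            actions.append(json.loads(data.decode('utf-8')))
            zk.delete('{}/{}'.format(path, child))
        return actions

For example, request_action(zk, '52402004', 'rebalance') would make that broker pick up a rebalance on its next polling cycle.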
