bmon

A Bitcoin network monitor

Provides log aggregation, a Grafana dashboard, automated alerting, and a framework for doing realtime analysis (via logs and RPC) on a collection of bitcoind nodes.

Local dev

  1. Ensure you have Python 3.10+, Docker, and docker-compose on your host.
    • pip install docker-compose
  2. Install the local infrastructure tools:
    • pip install -e ./infra

Easy way

  1. Bring everything up with bmon-config && ./dev reup

Manual way

  1. Build local config tree: bmon-config
  2. Run the database migrations: ./dev managepy migrate
  3. Bring docker-compose up: docker-compose up [-d]

Then browse to http://localhost:3000 to access Grafana; use the default admin credentials admin/admin. You should see a nice little sample dashboard displaying bitcoind logs etc.

Running tests

  1. ./dev test
  2. Try generating a block locally:
    • docker-compose up -d
    • In one terminal: ./dev watchlogs
    • In another: ./dev generateblock

Important tools for investigation

Full grep of all node logs

bmon-infra rg <query>

Query redis contents

ssh some-bmon-host
cd bmon/
./dev shell
from bmon.server_tasks import redisdb
from bmon.mempool import full_scan
full_scan(redisdb, '*<some txid>*')

Adding alerts

Modify ./etc/prom-alerts.yml and redeploy to the server with bmon-infra -f bmon deploy.

Onboarding a new bitcoind host

  1. Add an entry to ./infra/hosts_prod.yml corresponding to the desired bitcoind settings. You might want to specify ssh_hostname: and become_method:.
  2. Run bmon-infra bootstrap with the required arguments. If for some reason the script doesn't or can't run to completion, just do the stuff that's in there manually - it shouldn't be hard to figure out. This will output a wireguard pubkey that you should use in subsequent steps.
  3. Modify wg-bmon wireguard configuration on the serverside (the bmon administrator has to do this) using the bitcoind wg pubkey.
  4. Update the bmon secrets store with sudo_password for host.
  5. Test deployment to the new host: bmon-infra -f new-hostname deploy
  6. If that succeeds, update the server's monitoring configs etc.: bmon-infra -t server deploy

And the new host should be fully online.

Design

Bmon consists of two machine types: one server and many nodes. The nodes run bitcoind, and provide information to the server, which collects and synthesizes all the data necessary. The server also provides views on the data, including log exploration, metric presentation, and other high-level insights (TBD).

The bmon server runs

  • loki, for log aggregation
  • alertmanager, for alerts
  • grafana, for presenting logs and metrics
  • prometheus, for aggregating metrics
  • [tbd] bmon_collector, which aggregates insights

Each bmon node (the analogue of a bitcoind node) runs

  • bitcoind, which runs bitcoin
  • promtail, which pushes logs into loki (on the server)
  • node_exporter, which offers system metrics for scraping by prometheus
  • bmon_exporter, which pushes interesting high-level data into

flowchart TD
  subgraph node
      node_exporter
      bmon_exporter
  end
  subgraph server
      loki
      grafana
      alertmanager
      prometheus
      loki --> grafana
      prometheus --> grafana
      bmon_exporter --> bmon_collector
  end
  subgraph node
    promtail
    promtail --> loki
    bitcoind --> /bmon/logs/bitcoin.log
    /bmon/logs/bitcoin.log --> promtail
    bitcoind --> bmon_exporter
    node_exporter --> prometheus
    prometheus --> alertmanager
  end

For simplification, all servers participate in a single wireguard network.

How are hosts configured?

All known participants in bmon are listed in ./infra/hosts.yml. This file is parsed by ./infra/bmon_infra/infra.py (which gets installed as the bmon-infra command), which then configures each host over SSH. It does so using fscm, which itself uses mitogen, a Python library that facilitates remote execution of Python code over an SSH connection.

During provisioning, a copy of the bmon repo is cloned on each host at ~/bmon, and then bmon-config (./infra/bmon_infra/config.py) is run to generate a .env file with all configuration and secrets based on the host's entry in hosts.yml.
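
As a rough illustration of that last step, the sketch below turns one host entry plus its secrets into `.env` text. The `render_env` helper and the upper-cased key naming are hypothetical; the real logic lives in ./infra/bmon_infra/config.py and may well differ.

```python
def render_env(host_entry: dict, secrets: dict) -> str:
    """Render KEY=value lines from a host entry merged with its secrets."""
    merged = {**host_entry, **secrets}
    lines = [f"{key.upper()}={merged[key]}" for key in sorted(merged)]
    return "\n".join(lines) + "\n"


# Example values only; the field names mirror ones mentioned in this README.
entry = {"bitcoin_version": "v23.0", "outbound_wireguard": "wg-switzerland-01"}
env_text = render_env(entry, {"sudo_password": "example-only"})
print(env_text, end="")
```

Sorting the keys keeps the generated file stable between runs, which makes diffs of the `.env` file easier to read.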

The .env file is read in by docker-compose and used to set various parameters of the container runtimes. The docker-compose lifecycle is managed by systemd on each host; a user-level systemd unit is installed by the bmon-infra command.

How is wireguard used?

Since monitored hosts will live on different networks, wireguard is used to create a flat networking topology so that all hosts can be easily reached by the central bmon server, which aggregates measurements across each host.

To add a host, file an issue here and I'll give a wireguard config to use.

Wireguard is also used to simulate geographical dispersion of the monitored nodes. A VPN provider gives us Wireguard configurations for diverse networks, which we then use on certain monitored bitcoind hosts.

Node versions

  • One for each major release
  • One for current RC
  • Maintain 3 rotating versions of master, staggered backwards by
    • 1 week
    • 4 weeks
    • 16 weeks
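
The staggered master builds can be pinned to cutoff dates. This sketch computes those dates only. The helper name is hypothetical, and choosing the commit for each date (for example, the newest master commit at or before it) is an assumption this README does not confirm.

```python
from datetime import date, timedelta

# Stagger offsets taken from the list above.
STAGGER_WEEKS = (1, 4, 16)


def master_cutoff_dates(today: date) -> dict[int, date]:
    """Map each stagger (in weeks) to the date that many weeks before `today`."""
    return {weeks: today - timedelta(weeks=weeks) for weeks in STAGGER_WEEKS}


cutoffs = master_cutoff_dates(date(2022, 10, 7))
for weeks, cutoff in cutoffs.items():
    print(f"master-{weeks}w: {cutoff.isoformat()}")
```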

Uses

  • For a given block, determine when it was seen by each node. Present variance. Alert on anomalous variance.

  • For a given transaction, determine when it was seen by each node. Present variance. Alert on anomalous variance.

  • "Selfish mining" detector: alert on multiple blocks in rapid succession that cause a reorg.
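
A minimal sketch of the "when was it seen by each node" comparison, assuming reception times have already been collected per node. The function names, the median-based rule, and the sample data are illustrative and not taken from the bmon codebase.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def reception_spread(seen_at: dict[str, datetime]) -> timedelta:
    """Time between the first and the last node seeing the same block or txn."""
    return max(seen_at.values()) - min(seen_at.values())


def late_nodes(seen_at: dict[str, datetime], threshold: timedelta) -> list[str]:
    """Nodes that saw the item more than `threshold` after the median node."""
    first = min(seen_at.values())
    offsets = {node: (t - first).total_seconds() for node, t in seen_at.items()}
    typical = median(offsets.values())
    limit = threshold.total_seconds()
    return sorted(node for node, off in offsets.items() if off - typical > limit)


# Made-up reception times for one block on three hypothetical nodes.
base = datetime(2022, 10, 7, 14, 39, 57, tzinfo=timezone.utc)
seen = {
    "bitcoin-01": base,
    "bitcoin-02": base + timedelta(milliseconds=200),
    "bitcoin-03": base + timedelta(seconds=5),
}
spread = reception_spread(seen)
slow = late_nodes(seen, threshold=timedelta(seconds=2))
```

Comparing against the median rather than the earliest node keeps a single unusually fast node from flagging every other node as late.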

Notify on

  • mempool empty
  • inflation (rolling sum of UTXO amounts + (block_created_amt - block_destroyed_amt) > supply_at_height)
  • tip older than 90 minutes
  • transactions rejected from mempool
  • bad blocks
  • reorgs
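
Two of these checks are easy to sketch. The code assumes `supply_at_height` means the maximum issuance permitted by Bitcoin's subsidy schedule (50 BTC per block, halving every 210,000 blocks). The function names are hypothetical, and the UTXO total is assumed to come from elsewhere.

```python
from datetime import datetime, timedelta, timezone

COIN = 100_000_000          # satoshis per bitcoin
HALVING_INTERVAL = 210_000  # blocks between subsidy halvings


def block_subsidy(height: int) -> int:
    """Subsidy in satoshis for the block at `height`."""
    halvings = height // HALVING_INTERVAL
    if halvings >= 64:
        return 0
    return (50 * COIN) >> halvings


def max_supply_at_height(height: int) -> int:
    """Upper bound on issued satoshis for blocks 0..height inclusive."""
    total = 0
    era_start = 0
    while era_start <= height:
        subsidy = block_subsidy(era_start)
        if subsidy == 0:
            break
        era_end = min(height, era_start + HALVING_INTERVAL - 1)
        total += subsidy * (era_end - era_start + 1)
        era_start += HALVING_INTERVAL
    return total


def is_inflated(utxo_total_sats: int, height: int) -> bool:
    """True if the UTXO total exceeds what the schedule allows at `height`."""
    return utxo_total_sats > max_supply_at_height(height)


def tip_is_stale(tip_time: datetime, now: datetime,
                 max_age: timedelta = timedelta(minutes=90)) -> bool:
    """True if the chain tip is older than `max_age`."""
    return now - tip_time > max_age
```

The real UTXO total is normally below this bound (unclaimed subsidies, unspendable outputs), so the check only fires on a genuine excess.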

Measurements

  • block reception time per node
  • txn reception time per node
  • reorg count (number of unused tips?)
  • usual system metrics: memory usage, disk usage, CPU load, etc.

Comparison across nodes

  • mempool contents
  • getblocktemplate contents (do they differ at all?)
  • block processing time (per logs)
  • block reception time diff
  • txn reception time diff
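
For the mempool comparison, a set difference per node is one simple starting point. This sketch assumes each node's mempool is available as a set of txids; the helper name and the sample txids are made up.

```python
def missing_per_node(mempools: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each node, the txids that some other node has but this node lacks."""
    union = set().union(*mempools.values())
    return {node: union - txids for node, txids in mempools.items()}


# Shortened, made-up txids on two hypothetical nodes.
pools = {
    "bitcoin-01": {"aa", "bb", "cc"},
    "bitcoin-02": {"aa", "cc", "dd"},
}
missing = missing_per_node(pools)
```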

Features

  • logs sent to a centralized log explorer (Loki-Grafana)

Misc.

Resizing existing vagrant disk

sudo cfdisk /dev/sda
sudo resize2fs -p -F /dev/sda1

Contributors

jamesob, josibake

bmon's Issues

Fix outbound wireguard connections

We use wireguard configurations with (essentially) AllowedIPs = 0.0.0.0/0 (minus the bmon subnet) in order to simulate having geographically dispersed nodes; e.g.

  bitcoin-01: 
    tags: [bitcoind]
    check_host_keys: accept
    outbound_wireguard: wg-switzerland-01   <-- like this
    ...
    bitcoin_version: v23.0

However, for whatever reason when activating the wireguard connection (in addition to the wg-bmon connection), the node is unable to reach the internet. Investigate why this might be and fix it.

Could be an issue with the VPN provider?
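
For reference, an "everything except the bmon subnet" AllowedIPs list can be derived with Python's standard `ipaddress` module. The subnet below is a placeholder, since the real bmon subnet is not stated here, and the sketch does not diagnose the connectivity problem itself.

```python
import ipaddress


def allowed_ips_excluding(subnet: str) -> list[str]:
    """Networks that cover all of IPv4 except `subnet`."""
    everything = ipaddress.ip_network("0.0.0.0/0")
    excluded = ipaddress.ip_network(subnet)
    return [str(net) for net in sorted(everything.address_exclude(excluded))]


# Placeholder subnet standing in for the bmon wireguard network.
allowed = allowed_ips_excluding("10.33.0.0/24")
print("AllowedIPs = " + ", ".join(allowed))
```

Comparing a list like this against the provider's configuration may help confirm whether the bmon subnet is really being excluded from the outbound tunnel.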

backfill process for old debug.log files

it would be really handy to be able to run old log files through bmon for cases where historical data is needed (think old mempool logs). these can be scripts and not a part of the actual bmon infra, but i think it makes sense to have them in this repo at the very least. it's also a good way to test bmon's log parsing over different versions of core.

a few considerations:

  • if we can just load a log file and have bmon parse it, it might be good to separate it into its own queue so it doesn't affect the live data collection
  • it might also make sense to have a bmon-cli which allows us to access log parsers and write our own logic for where to put the data (either in the postgres database or some offline storage like GCP); this way we ensure that all of the logic to parse the data is consistent across time

Web/RSS list of interesting events and alerts

Have a persistent list of interesting events, ideally with an attached RSS feed. Things like

  • saw a bad block
  • orphaned block/reorg
  • saw a bad transaction
  • large mempool eviction
  • etc.

This will be somewhat distinct from alertmanager in that alertmanager surfaces a lot of stuff that isn't "interesting" for end consumers of bmon; e.g. node is low on disk space or is down.

specify hive partitions when writing to GCP

if it's easy to do on the bmon side, we should partition the data when storing it in GCP. I need to do some benchmarking to figure out what the ideal partition for a bigquery external table is, but I'm thinking something like:

gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/

alternatively, we can keep bmon as is and have some process on the GCP side move data from the "sink" into partitions for better querying and long-term storage.

right now this isn't an issue, but if we start having months, years, etc of data, we will want to have query partitions.
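
A minimal sketch of building the daily partition prefix shown above, assuming partitions are keyed on the UTC date of each record. The helper name and the bucket and table names are placeholders.

```python
from datetime import datetime, timezone


def partition_prefix(bucket: str, table: str, ts: datetime) -> str:
    """Hive-style prefix (dt=YYYY-MM-DD) for a timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    day = ts.astimezone(timezone.utc).date().isoformat()
    return f"gs://{bucket}/{table}/dt={day}/"


prefix = partition_prefix(
    "bucket", "table", datetime(2020, 4, 6, 23, 59, tzinfo=timezone.utc)
)
print(prefix)
```

Normalizing to UTC before taking the date avoids the same record landing in different partitions depending on the writer's local timezone.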

change avro schema to datetime for timestamp field

in the MempoolAccept model, we define timestamp as a DateTime:

timestamp = models.DateTimeField()

but in the avro schema we define it as a string:

{'name': 'timestamp', 'type': 'string'},

as of avro 1.8.1, timestamp is a logical type (https://avro.apache.org/docs/1.8.1/spec.html#Logical+Types), so this should be pretty easy but we would need to confirm that the python fastavro library supports it and set up some tests.

why ?

given that avro has a self-describing schema, we should be as accurate as possible in describing the underlying data. this also makes using the data in GCP much easier, as whatever is querying the data can infer the schema directly and we don't have to worry about casting, etc.
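
For illustration, this is roughly what the field could look like with the `timestamp-micros` logical type from the Avro spec, which annotates a `long` counting microseconds since the Unix epoch. The conversion is done by hand here, because whether fastavro performs it automatically is exactly what this issue says still needs confirming. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Candidate replacement for {'name': 'timestamp', 'type': 'string'}.
TIMESTAMP_FIELD = {
    "name": "timestamp",
    "type": {"type": "long", "logicalType": "timestamp-micros"},
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_micros(dt: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(microseconds=1)


micros = to_timestamp_micros(datetime(1970, 1, 2, tzinfo=timezone.utc))
```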

Determine why logwatcher is reading truncated lines

Occasionally, the bitcoind-watcher process surfaces errors that look like

bitcoind-watcher_1      | 2022-10-07 06:08:24,228 bmon.logparse ERROR    Failed to process line "2022-10-07T06:08:24.227608Z [] UpdateTip: new best=00000000000000000000042697715d779b50cc0a7d1afdf3e5d5a49f3dae4e62 height=757481 version=0x26a6c000 log2_work=93.770429 tx=770271020 date='2022-10-07T06:08:00Z' progress=1.000000 cache=38."
bitcoind-watcher_1      | Traceback (most recent call last):
bitcoind-watcher_1      |   File "/src/bmon/logparse.py", line 31, in watch_logs
bitcoind-watcher_1      |     got = cb_listener.process_line(line)
bitcoind-watcher_1      |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
bitcoind-watcher_1      |   File "/src/bmon/logparse.py", line 274, in process_line
bitcoind-watcher_1      |     cachesize_txo=float(matchgroups['cachesize_txo']),
bitcoind-watcher_1      |                         ~~~~~~~~~~~^^^^^^^^^^^^^^^^^
bitcoind-watcher_1      | KeyError: 'cachesize_txo'

and it subsequently misses ConnectBlock* events.

Apparently, read_logfile_forever is somehow yielding incomplete lines. Figure out what's going on here and fix.

It is worth noting that the corresponding bitcoind debug.log line looks complete:

(.venv) vagrant@b3:~/bmon$ docker-compose logs bitcoind | grep -A 4 height=757538

bitcoind_1              | 2022-10-07T14:39:57.300744Z [] UpdateTip: new best=00000000000000000002fdcbe93ff28b3040f6a349d5f86088589ef9a2e9bde1 height=757538 version=0x20000000 log2_work=93.771085 tx=770375653 date='2022-10-07T14:39:49Z' progress=1.000000 cache=80.0MiB(593550txo) warning='87 of last 100 blocks have unexpected version'
bitcoind_1              | 2022-10-07T14:39:57.301429Z []   - Connect postprocess: 9.51ms [5.70s (48.70ms/blk)]
bitcoind_1              | 2022-10-07T14:39:57.301807Z [] - Connect block: 33.46ms [16.58s (141.68ms/blk)]
bitcoind_1              | 2022-10-07T14:39:57.307974Z [] received: headers (82 bytes) peer=24
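
One possible cause, offered as a hypothesis rather than a diagnosis: a reader that follows a file being written can receive the first part of a line before the writer has flushed the rest. The sketch below shows a generic guard that buffers text until a newline arrives. `LineAssembler` is a hypothetical helper, not part of bmon, and how `read_logfile_forever` actually reads is not shown here.

```python
class LineAssembler:
    """Collect chunks of text and release only newline-terminated lines."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, data: str) -> list[str]:
        """Add `data`; return the lines completed so far, without newlines."""
        self._pending += data
        # Everything before the last "\n" is complete; the tail stays buffered.
        *complete, self._pending = self._pending.split("\n")
        return complete


assembler = LineAssembler()
first = assembler.feed("first half, ")          # no newline yet
second = assembler.feed("second half\nnext")    # completes one line
third = assembler.feed(" line\n")               # completes the buffered tail
```

With a guard like this, a parser only ever sees whole lines, at the cost of a short delay until the newline arrives.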

Decouple `bmon-infra deploy` from infrastructure restart

Certain bmon-server infrastructure often doesn't need to be restarted during deploys (e.g. db, redis). Specifically in the case of redis, we don't want to routinely restart the container because of the ephemeral nature of redis' storage.

Split out deployment for app containers (e.g. server-task-worker) from the more infra-based containers. This probably amounts to having separate systemd unit files for each "type" of process, perhaps:

  • Infra containers
    • db
    • redis
  • Grafana containers
    • grafana
    • prom
    • alertman
    • loki
  • bmon containers
    • web
    • server-task-worker
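
The proposed grouping could be expressed as a simple mapping that a deploy step consults before restarting anything. This is only a sketch of the idea. The group keys and the `containers_for` helper are hypothetical, and no such option exists in bmon-infra as far as this issue states.

```python
# Container groups as proposed in the list above.
CONTAINER_GROUPS = {
    "infra": ["db", "redis"],
    "grafana": ["grafana", "prom", "alertman", "loki"],
    "bmon": ["web", "server-task-worker"],
}


def containers_for(groups: list[str]) -> list[str]:
    """Containers belonging to the requested groups, in the order given."""
    unknown = [g for g in groups if g not in CONTAINER_GROUPS]
    if unknown:
        raise ValueError(f"unknown container groups: {unknown}")
    return [name for g in groups for name in CONTAINER_GROUPS[g]]


# A routine app deploy would touch only the bmon group, leaving redis alone.
to_restart = containers_for(["bmon"])
```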
