
cht-sync's Introduction

CHT Sync

CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. Read more detail on the CHT docs site.

Using

See the CHT docs site!

Testing

See testing.

cht-sync's People

Contributors

witash, andrablaj, njuguna-n, dianabarsan, samuelimoisili, lorerod, aodhan-domhnaill, njogz, mrjones-plip, jkuester, m5r, 1yuv

Stargazers

Leomar Espinal, Sugat Bajracharya, Jin Yao

Watchers

Derick Lung'aho, Josh Nesbit, Reagan, James Cloos, Craig Landry, femi oni, Gareth Bowen, Henok, Rafa, Elijah Karari, Paul Kitutu, Hareet, Irene AM, Romuald DANSOU, paulpascal, Bill Wambua, Niraj, falmata, Kica Ronald Okello, Justin Diomaye NDIAYE

cht-sync's Issues

Docker Compose to Kubernetes migration

From @Hareet: We requested that cht-sync be deployed / migrated to EKS a while back, as we have fewer people for the automation and maintenance required on isolated EC2 instances. We've solved that in EKS, and really want to push toward utilizing it.

This seems like a perfect continuation of the effort to merge several components of the current cht-sync/cht-pipeline/dataemon setup.

Additionally, document the steps required for deploying a cht-sync instance to k3s.

BRAC - CHP Follow Up - Connect instance to RDBMS

We need to create an RDBMS instance from which Klipfolio will get the formatted data. Right now the only way to access the database hosted on the instance is through an SSH tunnel. While Klipfolio has a way to connect to a DB through SSH, I am not sure how I would generate a public key to add to the server so that the Klipfolio app has access to the instance.
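For reference, a minimal SSH tunnel to a remote Postgres looks like the sketch below; the hostnames, user, key path, and ports are placeholders rather than the real BRAC values:

# Forward local port 5433 to Postgres (5432) on the remote instance.
ssh -i ~/.ssh/klipfolio_key -N -L 5433:localhost:5432 tunnel-user@db-host.example.org
# A client can then treat localhost:5433 as the remote database.
psql "host=localhost port=5433 user=readonly_user dbname=brac"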

This ticket tracks the creation of the DB by the Infra team

Sync specific databases

This issue requires setting up a way to sync only the medic-users-meta database to Postgres. Add functionality to cht-sync to make this possible.

Create Postgres instance using EKS

Create a Postgres instance for use with the BRAC work, as suggested here. The current RDS instance is full and requires the support of the infra team to debug and get logs.

running "make production": no such service: postgrest

I'm trying to run CHT Sync against a "production" CHT instance, which is really just an instance of CHT Docker Helper. This instance has a URL, port, login, and password external to CHT Sync. Here are the entries from my .env file:

COUCHDB_USER=medic
COUCHDB_PASSWORD=password
COUCHDB_DB=medic-users-meta
COUCHDB_HOST=192-168-0-37.local-ip.medicmobile.org
COUCHDB_PORT=10447
COUCHDB_SECURE=true

When I run make prod I get the following output:

➜  cht-sync git:(main) make prod
rm -rf ./data/json_docs
docker-compose -f docker-compose.couchdb.yml -f docker-compose.postgres.yml -f docker-compose.yml \
	down -v --remove-orphans
docker-compose build 
[+] Building 0.4s (8/8) FINISHED                                                                                                                                                                                   
 => [internal] load build definition from Dockerfile                                                                                                                                                          0.0s
 => => transferring dockerfile: 212B                                                                                                                                                                          0.0s
 => [internal] load .dockerignore                                                                                                                                                                             0.0s
 => => transferring context: 2B                                                                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/logstash:7.14.1                                                                                                                                            0.4s
 => [1/3] FROM docker.io/library/logstash:7.14.1@sha256:8f1969b7dfe7fffc13f55d72b6001804e45c1f655b8169bc84a6dfeda481c2b9                                                                                      0.0s
 => [internal] load build context                                                                                                                                                                             0.0s
 => => transferring context: 106B                                                                                                                                                                             0.0s
 => CACHED [2/3] COPY ./pipeline /usr/share/logstash/pipeline                                                                                                                                                 0.0s
 => CACHED [3/3] COPY ./entrypoint.sh /usr/local/bin/entrypoint.sh                                                                                                                                            0.0s
 => exporting to image                                                                                                                                                                                        0.0s
 => => exporting layers                                                                                                                                                                                       0.0s
 => => writing image sha256:999e357ca35ffbb05bd7838c566ba69ef47644ab92c029c9e4f20a9d4597e239                                                                                                                  0.0s
 => => naming to docker.io/medicmobile/cht-logstash-couchdb:latest                                                                                                                                            0.0s
docker-compose up logstash postgrest dbt
no such service: postgrest
make: *** [Makefile:19: prod] Error 1

Steps to reproduce:

  1. start local CHT instance in docker helper
  2. clone this repo locally: git clone https://github.com/medic/cht-sync
  3. change directory: cd cht-sync
  4. copy env file: cp env.template .env
  5. edit .env to use the CouchDB credentials from the first step
  6. start local instance: make prod

expected: cht sync starts
actual: cht sync fails to start and shows error

Enhance integration tests to ensure accurate data transfer

During integration testing, verify data accuracy from CouchDB to Postgres to ensure the model reflects the source data. Additionally, rename existing End-to-End tests to Integration tests to reflect their purpose.

What else to test?

  • Initial Data Sync: Verify that data from CouchDB is correctly transferred to PostgreSQL during the initial sync; the data in PostgreSQL should exactly match the data in CouchDB (see the sketch after this list).
  • Continuous Data Sync: Ensure continuous syncing works correctly when data is updated or added in CouchDB, and that incremental changes are synced without reprocessing the entire dataset.
  • Conflict Resolution: Ensure that conflicts (e.g., the same document updated differently in multiple places) are handled properly during the sync process. (Not sure yet how this could happen.)
  • Large Data Volume: Test the system's ability to handle large volumes of data.
  • Error Handling: Validate the application's ability to handle and recover from errors during the sync process.

Pending tests:

  • Deletion Sync: Confirm that document deletions in CouchDB are correctly propagated to PostgreSQL.
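A minimal sketch of the initial-sync accuracy check, assuming a local setup with default credentials and the v1.couchdb schema/table used elsewhere in this repo (all values are placeholders):

# Write a doc to CouchDB.
curl -s -X PUT "http://medic:password@localhost:5984/medic/test-doc-1" \
  -H 'Content-Type: application/json' \
  -d '{"type":"data_record","form":"test"}'
# Give the pipeline time to pick the change up, then check it landed in Postgres.
sleep 30
psql "postgresql://postgres:postgres@localhost:5432/cht-sync" \
  -c "SELECT doc->>'form' FROM v1.couchdb WHERE _id = 'test-doc-1';"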

Investigate and document required config for postgrest

When setting up the BRAC test instance, Logstash was getting 401 errors from postgrest, maybe because PGRST_DB_ANON_ROLE was not set. For this ticket we need to investigate the configuration required to have postgrest running well and in a secure manner, and document that in the readme file.
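As a starting point, a hedged sketch of the PostgREST settings involved (the connection string, schema, and role names are assumptions, not the BRAC values):

PGRST_DB_URI=postgres://authenticator:password@postgres:5432/cht-sync
PGRST_DB_SCHEMA=v1
PGRST_DB_ANON_ROLE=anon

If the anonymous role is missing or lacks privileges on the target schema, unauthenticated requests from Logstash could plausibly surface as 401s.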

CHT Sync monitoring

Set up and document a monitoring solution for CHT Sync (CHT Watchdog), together with relevant metrics and alerts.

Use AuthSession cookie for login instead of basic auth

Describe the issue
We're discussing increasing pbkdf2 iterations for CouchDB.
This will mean that computing the hashed password every time basic auth is used becomes a costly server-side operation.

Describe the improvement you'd like
Instead of using basic auth for every request, make one initial _session request, and use the provided AuthSession cookie to authenticate all following requests within one conf run.
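A sketch of that flow against a local CouchDB (host and credentials are placeholders):

# One _session request computes the password hash once...
curl -s -c cookies.txt -X POST http://localhost:5984/_session \
  -H 'Content-Type: application/json' \
  -d '{"name":"medic","password":"password"}'
# ...and the returned AuthSession cookie authenticates the following requests.
curl -s -b cookies.txt 'http://localhost:5984/medic/_changes?limit=10'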

Describe alternatives you've considered
None.

running "make local" results in Database Error in model raw_contacts

Hi there! When I follow the steps in the readme, I repeatedly get the following error in the dbt container: Database Error in model raw_contacts (models/contact_views/raw_contacts.sql).

Here are the complete logs from docker logs pipeline-dbt-1:

22:52:47  Found 5 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
22:52:47  
22:52:47  Concurrency: 1 threads (target='dev')
22:52:47  
22:52:47  1 of 5 START sql view model dbt.couchdb ........................................ [RUN]
22:52:47  1 of 5 OK created sql view model dbt.couchdb ................................... [CREATE VIEW in 0.13s]
22:52:47  2 of 5 START sql incremental model dbt.data_record ............................. [RUN]
22:52:47  2 of 5 OK created sql incremental model dbt.data_record ........................ [INSERT 0 1 in 0.13s]
22:52:47  3 of 5 START sql incremental model dbt.person .................................. [RUN]
22:52:48  3 of 5 OK created sql incremental model dbt.person ............................. [INSERT 0 8 in 0.08s]
22:52:48  4 of 5 START sql view model dbt.raw_contacts ................................... [RUN]
22:52:48  4 of 5 ERROR creating sql view model dbt.raw_contacts .......................... [ERROR in 0.03s]
22:52:48  5 of 5 SKIP relation dbt.contactview_metadata .................................. [SKIP]
22:52:48  
22:52:48  Finished running 3 view models, 2 incremental models in 0 hours 0 minutes and 0.52 seconds (0.52s).
22:52:48  
22:52:48  Completed with 1 error and 0 warnings:
22:52:48  
22:52:48  Database Error in model raw_contacts (models/contact_views/raw_contacts.sql)
22:52:48    syntax error at or near ";"
22:52:48    LINE 10: ...ct_hospital'::text, 'health_center'::text, 'person'::text]);
22:52:48                                                                           ^
22:52:48    compiled Code at target/run/pipeline/models/contact_views/raw_contacts.sql
22:52:48  
22:52:48  Done. PASS=3 WARN=0 ERROR=1 SKIP=1 TOTAL=5
22:52:58  Running with dbt=1.5.0

Steps to reproduce:

  1. clone this repo locally: git clone https://github.com/medic/cht-sync
  2. change directory: cd cht-sync
  3. copy env file: cp env.template .env
  4. start local instance: make local

expected: no errors
actual: errors

related: checking the SQL instance, I do see rows being added successfully at regular intervals

Handle Multiple DBs in Redis worker

In order to move fast, the current worker hard-codes the postgrest URL, which means we are only able to replicate the medic database in CouchDB. Refactor the code so that the URL is built using the values set in the COUCHDB_DBS environment variable.
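A hedged sketch of the idea in shell form (the endpoint shape and table naming are assumptions for illustration):

COUCHDB_DBS="medic medic-users-meta"
for db in $COUCHDB_DBS; do
  table=$(echo "$db" | tr '-' '_')        # e.g. medic-users-meta -> medic_users_meta
  echo "http://postgrest:3000/${table}"   # one postgrest endpoint per replicated database
done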

Deal with missing values in redis worker

Logstash sometimes sends requests where the body doesn't include the full doc.
This causes problems for the redis worker because either postgrest refuses these requests or, if we add default values, inserts them with null values, which is not useful.

Where are these partial requests from Logstash coming from?
Why are they only affecting the redis worker; shouldn't postgrest have rejected them straight from Logstash?
How best to handle them in the redis worker?

Specify `platform` in Docker Compose for M1 MacBooks

Currently, the Docker Compose files do not specify the platform as linux/arm64, which is necessary for proper functioning on M1 MacBooks. As a result, Docker Compose may encounter compatibility issues or fail to run on these devices. Add a platform specification to the compose files to ensure seamless operation on M1 MacBooks.
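One possible workaround, sketched under the assumption that the Compose version in use supports the service-level platform key and that the service is named logstash: a compose override file pinning the platform.

cat > docker-compose.override.yml <<'EOF'
services:
  logstash:
    platform: linux/arm64
EOF
docker-compose up -d   # compose merges the override file automatically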

Check and update default env variables. Not connecting to Postgres container on specific environment.

This issue was first detected when doing acceptance testing on cht-pipeline default models issue.

We should be using the postgres container by default. Check and update default env variables for the postgres container.

Environment: macOS 13.1 (22C65); Docker Desktop 4.15.0 (93002); Docker Engine 20.10.21
CHT 4.1.0: Local using docker compose files cht-core.yml and cht-couchdb.yml
Obs:

  • In cht-couchdb.yml I had to open port 5984 for cht-sync to be able to connect to CouchDB.

For cht-sync environment variables I used:

export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=postgres
export POSTGRES_DB=cht-sync
export POSTGRES_TABLE=couchdb
export POSTGRES_SCHEMA=v2
export POSTGRES_PORT=5432
export POSTGRES_HOST=localhost
export DBT_POSTGRES_USER=postgres
export DBT_POSTGRES_PASSWORD=postgres
export DBT_POSTGRES_SCHEMA=dbt
export COUCHDB_USER=medic
export COUCHDB_PASSWORD=password
export COUCHDB_DB=medic
export COUCHDB_HOST=****.my.local-ip.co
export COUCHDB_PORT=5984
export COUCHDB_SECURE=false

Postgres container wasn't able to run. The log:

2023-01-25 18:43:30 psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
2023-01-25 18:43:30     Is the server running on that host and accepting TCP/IP connections?
2023-01-25 18:43:30 connection to server at "localhost" (::1), port 5432 failed: Cannot assign requested address
2023-01-25 18:43:30     Is the server running on that host and accepting TCP/IP connections?

PostgREST log:

2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (127.0.0.1), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (::1), port 5432 failed: Address not available\n\tIs the server running on that host and accepting TCP/IP connections?\n","hint":null,"message":"Database connection error. Retrying the connection."}
2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: Attempting to reconnect to the database in 32 seconds...
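Note that inside a container, localhost refers to the container itself, so postgrest cannot reach a Postgres published on the host that way. A hedged fix, assuming the compose service is named postgres:

export POSTGRES_HOST=postgres   # the compose service name, reachable on the compose network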

Remove host dependency on node

The docs currently refer to running the project with node on the host.
The node script mostly issues docker compose commands with different arguments based on the environment, plus some other things which are either no longer relevant or only for dev (creating the logstash file, which is now in source control; zipping/unzipping the data directory for local CouchDB).

This node script could be replaced with a bash script which sets the arguments for docker-compose, similar to https://github.com/medic/cht-interoperability/blob/main/startup.sh.
The bash script is only for convenience; if the dev and other stuff is removed, it's possible to run the project using the docker compose commands directly (see the sketch below).

This would remove the requirement that node be installed on the host. Maybe not a big requirement, but node is not very lightweight, and fewer dependencies is always better. Adding new host requirements undermines the portability benefits of docker-compose a little; although it is unlikely that this script would behave differently on different versions of node/npm, it is still possible.
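A minimal sketch of such a wrapper, reusing the compose files and service names from the existing Makefile (the mode argument is an assumption):

#!/bin/sh
# Choose compose files by environment instead of using a node script.
if [ "$1" = "local" ]; then
  docker-compose -f docker-compose.couchdb.yml -f docker-compose.postgres.yml -f docker-compose.yml up
else
  docker-compose up logstash postgrest dbt
fi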

Improve environment variables documentation.

Describe the issue
It would be helpful if the configuration setup for cht-sync and cht-pipeline could be made more understandable.

Describe the improvement you'd like
We already have a template for the environment variables that can be used as an example of how to set up an instance.
I will explain the meaning and purpose of each group of environment variables.

Context: This came up because of the confusion when defining values for the environment variables.

Deploy on Gamma

This ticket aims to deploy CHT Sync using the existing DBT model on the Gamma environment.

Description:

  • Deploy CHT Sync on the Gamma environment.
  • Utilize the current DBT model for data transformation.
  • Ensure seamless integration and compatibility between CHT Sync and the DBT model.
  • Perform necessary configuration and testing to ensure successful deployment.
  • Document the deployment process and any relevant instructions or dependencies.

Help @medic/site-reliability-engineering

cht sync logstash downtime error

Logstash seems to have an error that resolves itself but can involve downtime of over an hour, during which no data is copied from CouchDB to Postgres.

Objects getting dropped by CHTSync during refresh

When CHT Sync is doing a refresh, it drops objects and recreates them. This causes issues because if the database is queried while a refresh is in progress, the functions will not return any data.

Views
A full BRAC dbt config has 42 views (screenshot omitted). Here is a screenshot from the midst of a CHT Sync run showing only 18 views (screenshot omitted).

Running `SELECT * FROM pg_catalog.pg_tables WHERE schemaname = 'dbt_clone';` suggests that tables are not being dropped, but they are now taking very long to update.
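A quick way to observe the behavior is to count the views mid-run (schema name from this issue; connection details are placeholders):

psql "postgresql://postgres:postgres@localhost:5432/cht-sync" \
  -c "SELECT count(*) FROM pg_catalog.pg_views WHERE schemaname = 'dbt_clone';"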

cc @njogz @derickl

Remove .env File from Git Commit and Add Environment Variable Template

I noticed that the .env file is currently being committed to our Git repository, which poses a security risk by exposing sensitive information. I think we should do the following:

  1. Remove .env file from Git:
    Stop tracking the file so it is no longer committed; see the sketch after this list.

  2. Create an Environment Variable Template:
    Add a template file (e.g., .env.template) to the repository. The template should include placeholders and instructions for developers to fill in their own values. This will ensure a consistent approach to managing environment variables.
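A sketch of the usual cleanup for step 1 (rewriting old history, if the exposed secrets need purging, is a separate step):

git rm --cached .env            # stop tracking the file without deleting it locally
echo '.env' >> .gitignore
git commit -m "Stop tracking .env; use env.template instead"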

@njogz your feedback and suggestions on this matter are welcome.

Implement E2E test

Describe the issue
Currently, the project doesn't have a set of e2e tests.

Describe the improvement you'd like
Implement an e2e test for a first scenario.

Describe alternatives you've considered
None.

Tasks:

  • Implement e2e LTFU flow
  • Add e2e to CI pipeline

Collect aggregate data and submit back to a central repo

There are two main classes of data that we want to submit to a central repo...

  1. Impact data. Knowing the impact the CHT is having helps focus our efforts where they have the best impact. Additionally, it can be used to justify grant applications to continue developing the CHT. Currently this data is collected manually, which is labor intensive and prone to delays and bugs.
  2. Usage data. Being able to query user telemetry across projects and understand how the CHT is being used is critical for informing potential improvements and assessing the success of delivered improvements.

Data collection must be opt-in, and any data collected must be completely free of PII (aggregated and/or sanitized) so we don't risk exposing patient information.

Deploy to BRAC dev

Please deploy the containers from https://github.com/medic/cht-sync/blob/first_release/docker-compose.yml to point at the BRAC dev CouchDB and a new BRAC Postgres database/schema.

Config

There are env vars mapped in the linked docker-compose.yml that need to be set, like:

POSTGRES_USER=<User with write access to the schema below>
POSTGRES_PASSWORD=<user password>
POSTGRES_DB=<Name of new database>
POSTGRES_TABLE=couchdb
POSTGRES_SCHEMA=v1

COUCHDB_USER=<read access user>
COUCHDB_PASSWORD=<password>
COUCHDB_DB=<main database with all records>
COUCHDB_HOST=<hostname for couchdb server>

New DB Schema

This should be configured by running the snippet below (cht-sync/init-db.sh, lines 14 to 25 at commit 636e330) using the above env vars:

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE SCHEMA IF NOT EXISTS $POSTGRES_SCHEMA AUTHORIZATION $POSTGRES_USER;
    CREATE TABLE $POSTGRES_SCHEMA.$POSTGRES_TABLE (
        "@version" TEXT,
        "@timestamp" TIMESTAMP,
        "_id" TEXT,
        "_rev" TEXT,
        doc jsonb,
        doc_as_upsert BOOLEAN,
        UNIQUE ("_id", "_rev")
    );
EOSQL

Patch Logstash Docker Image to fix SSL issue

The current image we use to build the Logstash container uses a CouchDB plugin that was last released in 2019. There is a bug where the secure option does not use the https protocol. Additionally, it seems that Let's Encrypt certs do not work, possibly because the dated code does not have updated root CA certs. The proposed solution is to patch the plugin while building the Docker container.

CHT DATA PIPELINE Issues with large data sets

When hosting this we will need to work through how we sync large data sets, for both self-hosted and Medic-hosted projects. The current suggested options are:

  1. We run the sync during maintenance windows and run large batches to catch up.
  2. When we query CouchDB, we run smaller batches so as not to impact performance, but are prepared to wait a long time until all the data is synced (see the sketch after this list).
  3. Start from the sequence last reported by couch2pg (we already have the past data, so do we really need to sync it afresh?).
  4. Clone the production CouchDB and pull data from that until the pipeline catches up, then point it at production.

^^ @henokgetachew, @Hareet, @aodhan-domhnaill
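For options 2 and 3, batching boils down to reading the changes feed from a known sequence in bounded pages. A sketch with placeholder host, credentials, and batch size (LAST_SEQ stands for the stored checkpoint):

# Resume from the last processed sequence and fetch one bounded batch.
curl -s "http://medic:password@localhost:5984/medic/_changes?since=${LAST_SEQ}&limit=1000&include_docs=true"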
