
cht-sync's Introduction

CHT Sync

CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. Read more detail on the CHT docs site.

Using

See the CHT docs site!

Testing

See testing.

cht-sync's People

Contributors

witash, andrablaj, njuguna-n, dianabarsan, samuelimoisili, lorerod, aodhan-domhnaill, njogz, mrjones-plip, jkuester, m5r, 1yuv

Stargazers

Leomar Espinal, Sugat Bajracharya, Jin Yao

Watchers

Derick Lung'aho, Josh Nesbit, Reagan, James Cloos, Craig Landry, femi oni, Gareth Bowen, Henok, Rafa, Elijah Karari, Paul Kitutu, Hareet, Irene AM, Romuald DANSOU, paulpascal, Bill Wambua, Niraj, falmata, Kica Ronald Okello, Justin Diomaye NDIAYE

cht-sync's Issues

Docker Compose to Kubernetes migration

From @Hareet: We requested that cht-sync be deployed / migrated to EKS a while back, as we have fewer people for the automation and maintenance required on isolated EC2 instances. We've solved that in EKS, and really want to push toward utilizing it.

This seems like a perfect continuation of the effort to merge several components of the current cht-sync/cht-pipeline/dataemon setup.

Additionally, document the steps required for deploying a cht-sync instance to k3s.

BRAC - CHP Follow Up - Connect instance to RDBMS

We need to create an RDBMS instance from which Klipfolio will get the formatted data. Right now the only way to access the database hosted on the instance is through an SSH tunnel. While Klipfolio has a way to connect to a DB through SSH, I am not sure how I would generate a public key to add to the server so that the Klipfolio app has access to the instance.
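For reference, a minimal SSH tunnel to a remote Postgres looks like the sketch below; the hostnames, user, key path, and ports are placeholders rather than the real BRAC values:

# Forward local port 5433 to Postgres (5432) on the remote instance.
ssh -i ~/.ssh/klipfolio_key -N -L 5433:localhost:5432 tunnel-user@db-host.example.org
# A client can then treat localhost:5433 as the remote database.
psql "host=localhost port=5433 user=readonly_user dbname=brac"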

This ticket tracks the creation of the DB by the Infra team

Sync specific databases

This issue requires setting up a way to sync only the medic-users-meta database to Postgres. Add functionality to cht-sync to make this possible.

Create Postgres instance using EKS

Create a Postgres instance for use with the BRAC work, as suggested here. The current RDS instance is full and requires the support of the infra team to debug and get logs.

running "make production": no such service: postgrest

I'm trying to run CHT Sync against a "production" CHT instance, which is really just an instance of CHT Docker Helper. This instance has a URL, port, login, and password external to CHT Sync. Here are the entries from my .env file:

COUCHDB_USER=medic
COUCHDB_PASSWORD=password
COUCHDB_DB=medic-users-meta
COUCHDB_HOST=192-168-0-37.local-ip.medicmobile.org
COUCHDB_PORT=10447
COUCHDB_SECURE=true

When I run make prod I get the following output:

➜  cht-sync git:(main) make prod
rm -rf ./data/json_docs
docker-compose -f docker-compose.couchdb.yml -f docker-compose.postgres.yml -f docker-compose.yml \
	down -v --remove-orphans
docker-compose build 
[+] Building 0.4s (8/8) FINISHED                                                                                                                                                                                   
 => [internal] load build definition from Dockerfile                                                                                                                                                          0.0s
 => => transferring dockerfile: 212B                                                                                                                                                                          0.0s
 => [internal] load .dockerignore                                                                                                                                                                             0.0s
 => => transferring context: 2B                                                                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/logstash:7.14.1                                                                                                                                            0.4s
 => [1/3] FROM docker.io/library/logstash:7.14.1@sha256:8f1969b7dfe7fffc13f55d72b6001804e45c1f655b8169bc84a6dfeda481c2b9                                                                                      0.0s
 => [internal] load build context                                                                                                                                                                             0.0s
 => => transferring context: 106B                                                                                                                                                                             0.0s
 => CACHED [2/3] COPY ./pipeline /usr/share/logstash/pipeline                                                                                                                                                 0.0s
 => CACHED [3/3] COPY ./entrypoint.sh /usr/local/bin/entrypoint.sh                                                                                                                                            0.0s
 => exporting to image                                                                                                                                                                                        0.0s
 => => exporting layers                                                                                                                                                                                       0.0s
 => => writing image sha256:999e357ca35ffbb05bd7838c566ba69ef47644ab92c029c9e4f20a9d4597e239                                                                                                                  0.0s
 => => naming to docker.io/medicmobile/cht-logstash-couchdb:latest                                                                                                                                            0.0s
docker-compose up logstash postgrest dbt
no such service: postgrest
make: *** [Makefile:19: prod] Error 1

Steps to reproduce:

  1. start local CHT instance in docker helper
  2. clone this repo locally: git clone https://github.com/medic/cht-sync
  3. change directory: cd cht-sync
  4. copy env file: cp env.template .env
  5. edit .env to use the CouchDB credentials from the first step
  6. start local instance: make prod

expected: cht sync starts
actual: cht sync fails to start and shows error

Enhance integration tests to ensure accurate data transfer

During integration testing, verify data accuracy from CouchDB to Postgres to ensure the model reflects the source data. Additionally, rename existing End-to-End tests to Integration tests to reflect their purpose.

What else to test?

  • Initial Data Sync: Verify that data from CouchDB is correctly transferred to PostgreSQL during the initial sync; the data in PostgreSQL should exactly match the data in CouchDB (see the sketch after this list).
  • Continuous Data Sync: Ensure continuous syncing works correctly when data is updated or added in CouchDB, and that incremental changes are synced without reprocessing the entire dataset.
  • Conflict Resolution: Ensure that conflicts (e.g., the same document updated differently in multiple places) are handled properly during the sync process. (Not sure yet how this could happen.)
  • Large Data Volume: Test the system's ability to handle large volumes of data.
  • Error Handling: Validate the application's ability to handle and recover from errors during the sync process.

Pending tests:

  • Deletion Sync: Confirm that document deletions in CouchDB are correctly propagated to PostgreSQL.
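A minimal sketch of the initial-sync accuracy check, assuming a local setup with default credentials and the v1.couchdb schema/table used elsewhere in this repo (all values are placeholders):

# Write a doc to CouchDB.
curl -s -X PUT "http://medic:password@localhost:5984/medic/test-doc-1" \
  -H 'Content-Type: application/json' \
  -d '{"type":"data_record","form":"test"}'
# Give the pipeline time to pick the change up, then check it landed in Postgres.
sleep 30
psql "postgresql://postgres:postgres@localhost:5432/cht-sync" \
  -c "SELECT doc->>'form' FROM v1.couchdb WHERE _id = 'test-doc-1';"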

Investigate and document required config for postgrest

When setting up the BRAC test instance, Logstash was getting 401 errors from postgrest, maybe because PGRST_DB_ANON_ROLE was not set. For this ticket we need to investigate the configuration required to have postgrest running well and in a secure manner, and document that in the readme file.
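As a starting point, a hedged sketch of the PostgREST settings involved (the connection string, schema, and role names are assumptions, not the BRAC values):

PGRST_DB_URI=postgres://authenticator:password@postgres:5432/cht-sync
PGRST_DB_SCHEMA=v1
PGRST_DB_ANON_ROLE=anon

If the anonymous role is missing or lacks privileges on the target schema, unauthenticated requests from Logstash could plausibly surface as 401s.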

CHT Sync monitoring

Set up and document a monitoring solution for CHT Sync (CHT Watchdog), together with relevant metrics and alerts.

Use AuthSession cookie for login instead of basic auth

Describe the issue
We're discussing increasing pbkdf2 iterations for CouchDB.
This will mean that computing the hashed password every time basic auth is used becomes a costly server-side operation.

Describe the improvement you'd like
Instead of using basic auth for every request, make one initial _session request, and use the provided AuthSession cookie to authenticate all following requests within one conf run.
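A sketch of that flow against a local CouchDB (host and credentials are placeholders):

# One _session request computes the password hash once...
curl -s -c cookies.txt -X POST http://localhost:5984/_session \
  -H 'Content-Type: application/json' \
  -d '{"name":"medic","password":"password"}'
# ...and the returned AuthSession cookie authenticates the following requests.
curl -s -b cookies.txt 'http://localhost:5984/medic/_changes?limit=10'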

Describe alternatives you've considered
None.

running "make local" results in Database Error in model raw_contacts

Hi there! When I follow the steps in the readme, I repeatedly get the following error in the dbt container: Database Error in model raw_contacts (models/contact_views/raw_contacts.sql).

Here are the complete logs from docker logs pipeline-dbt-1:

22:52:47  Found 5 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
22:52:47  
22:52:47  Concurrency: 1 threads (target='dev')
22:52:47  
22:52:47  1 of 5 START sql view model dbt.couchdb ........................................ [RUN]
22:52:47  1 of 5 OK created sql view model dbt.couchdb ................................... [CREATE VIEW in 0.13s]
22:52:47  2 of 5 START sql incremental model dbt.data_record ............................. [RUN]
22:52:47  2 of 5 OK created sql incremental model dbt.data_record ........................ [INSERT 0 1 in 0.13s]
22:52:47  3 of 5 START sql incremental model dbt.person .................................. [RUN]
22:52:48  3 of 5 OK created sql incremental model dbt.person ............................. [INSERT 0 8 in 0.08s]
22:52:48  4 of 5 START sql view model dbt.raw_contacts ................................... [RUN]
22:52:48  4 of 5 ERROR creating sql view model dbt.raw_contacts .......................... [ERROR in 0.03s]
22:52:48  5 of 5 SKIP relation dbt.contactview_metadata .................................. [SKIP]
22:52:48  
22:52:48  Finished running 3 view models, 2 incremental models in 0 hours 0 minutes and 0.52 seconds (0.52s).
22:52:48  
22:52:48  Completed with 1 error and 0 warnings:
22:52:48  
22:52:48  Database Error in model raw_contacts (models/contact_views/raw_contacts.sql)
22:52:48    syntax error at or near ";"
22:52:48    LINE 10: ...ct_hospital'::text, 'health_center'::text, 'person'::text]);
22:52:48                                                                           ^
22:52:48    compiled Code at target/run/pipeline/models/contact_views/raw_contacts.sql
22:52:48  
22:52:48  Done. PASS=3 WARN=0 ERROR=1 SKIP=1 TOTAL=5
22:52:58  Running with dbt=1.5.0

Steps to reproduce:

  1. clone this repo locally: git clone https://github.com/medic/cht-sync
  2. change directory: cd cht-sync
  3. copy env file: cp env.template .env
  4. start local instance: make local

expected: no errors
actual: errors

related: checking the SQL instance, I do see rows being added successfully at regular intervals

Handle Multiple DBs in Redis worker

In order to move fast, the current worker hard-codes the postgrest URL, which means we are only able to replicate the medic database in CouchDB. Refactor the code so that the URL is built using the values set in the COUCHDB_DBS environment variable.
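A hedged sketch of the idea in shell form (the endpoint shape and table naming are assumptions for illustration):

COUCHDB_DBS="medic medic-users-meta"
for db in $COUCHDB_DBS; do
  table=$(echo "$db" | tr '-' '_')        # e.g. medic-users-meta -> medic_users_meta
  echo "http://postgrest:3000/${table}"   # one postgrest endpoint per replicated database
done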

Deal with missing values in redis worker

Logstash sometimes sends requests where the body doesn't include the full doc.
This causes problems for the redis worker because either postgrest refuses these requests or, if we add default values, inserts them with null values, which is not useful.

Where are these partial requests from Logstash coming from?
Why are they only affecting the redis worker; shouldn't postgrest have rejected them straight from Logstash?
How best to handle them in the redis worker?

Specify `platform` in Docker Compose for M1 MacBooks

Currently, the Docker Compose files do not specify the platform as linux/arm64, which is necessary for proper functioning on M1 MacBooks. As a result, Docker Compose may encounter compatibility issues or fail to run on these devices. Add a platform specification to the compose files to ensure seamless operation on M1 MacBooks.
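One possible workaround, sketched under the assumption that the Compose version in use supports the service-level platform key and that the service is named logstash: a compose override file pinning the platform.

cat > docker-compose.override.yml <<'EOF'
services:
  logstash:
    platform: linux/arm64
EOF
docker-compose up -d   # compose merges the override file automatically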

Check and update default env variables. Not connecting to Postgres container on specific environment.

This issue was first detected when doing acceptance testing on cht-pipeline default models issue.

We should be using the postgres container by default. Check and update default env variables for the postgres container.

Environment: macOS 13.1 (22C65); Docker Desktop 4.15.0 (93002); Docker Engine 20.10.21
CHT 4.1.0: Local using docker compose files cht-core.yml and cht-couchdb.yml
Obs:

  • In cht-couchdb.yml I had to open port 5984 for cht-sync to be able to connect to CouchDB.

For cht-sync environment variables I used:

export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=postgres
export POSTGRES_DB=cht-sync
export POSTGRES_TABLE=couchdb
export POSTGRES_SCHEMA=v2
export POSTGRES_PORT=5432
export POSTGRES_HOST=localhost
export DBT_POSTGRES_USER=postgres
export DBT_POSTGRES_PASSWORD=postgres
export DBT_POSTGRES_SCHEMA=dbt
export COUCHDB_USER=medic
export COUCHDB_PASSWORD=password
export COUCHDB_DB=medic
export COUCHDB_HOST=****.my.local-ip.co
export COUCHDB_PORT=5984
export COUCHDB_SECURE=false

Postgres container wasn't able to run. The log:

2023-01-25 18:43:30 psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
2023-01-25 18:43:30     Is the server running on that host and accepting TCP/IP connections?
2023-01-25 18:43:30 connection to server at "localhost" (::1), port 5432 failed: Cannot assign requested address
2023-01-25 18:43:30     Is the server running on that host and accepting TCP/IP connections?

PostgREST log:

2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (127.0.0.1), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (::1), port 5432 failed: Address not available\n\tIs the server running on that host and accepting TCP/IP connections?\n","hint":null,"message":"Database connection error. Retrying the connection."}
2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: Attempting to reconnect to the database in 32 seconds...
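Note that inside a container, localhost refers to the container itself, so postgrest cannot reach a Postgres published on the host that way. A hedged fix, assuming the compose service is named postgres:

export POSTGRES_HOST=postgres   # the compose service name, reachable on the compose network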

Remove host dependency on node

The docs currently refer to running the project with node on the host.
The node script mostly issues docker compose commands with different arguments based on the environment, plus some other things which are either no longer relevant or only for dev (creating the logstash file, which is now in source control; zipping/unzipping the data directory for local CouchDB).

This node script could be replaced with a bash script which sets the arguments for docker-compose, similar to https://github.com/medic/cht-interoperability/blob/main/startup.sh.
The bash script is only for convenience; if the dev and other stuff is removed, it's possible to run the project using the docker compose commands directly (see the sketch below).

This would remove the requirement that node be installed on the host. Maybe not a big requirement, but node is not very lightweight, and fewer dependencies is always better. Adding new host requirements undermines the portability benefits of docker-compose a little; although it is unlikely that this script would behave differently on different versions of node/npm, it is still possible.
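A minimal sketch of such a wrapper, reusing the compose files and service names from the existing Makefile (the mode argument is an assumption):

#!/bin/sh
# Choose compose files by environment instead of using a node script.
if [ "$1" = "local" ]; then
  docker-compose -f docker-compose.couchdb.yml -f docker-compose.postgres.yml -f docker-compose.yml up
else
  docker-compose up logstash postgrest dbt
fi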

Improve environment variables documentation.

Describe the issue
It would be helpful if the configuration setup for cht-sync and cht-pipeline could be made more understandable.

Describe the improvement you'd like
We already have a template for the environment variables that can be used as an example of how to set up an instance.
I will explain the meaning and purpose of each group of environment variables.

Context: This came up because of the confusion when defining values for the environment variables.

Deploy on Gamma

This ticket aims to deploy CHT Sync using the existing DBT model on the Gamma environment.

Description:

  • Deploy CHT Sync on the Gamma environment.
  • Utilize the current DBT model for data transformation.
  • Ensure seamless integration and compatibility between CHT Sync and the DBT model.
  • Perform necessary configuration and testing to ensure successful deployment.
  • Document the deployment process and any relevant instructions or dependencies.

Help @medic/site-reliability-engineering

cht sync logstash downtime error

Logstash seems to have an error that resolves itself but can involve downtime of over an hour, during which no data is copied from CouchDB to Postgres.

Objects getting dropped by CHTSync during refresh

When CHT Sync is doing a refresh, it drops objects and recreates them. This causes issues because if the database is queried while a refresh is in progress, the functions will not return any data.

Views
A full BRAC dbt config has 42 views (screenshot omitted). Here is a screenshot from the midst of a CHT Sync run showing only 18 views (screenshot omitted).

Running `SELECT * FROM pg_catalog.pg_tables WHERE schemaname = 'dbt_clone';` suggests that tables are not being dropped, but they are now taking very long to update.
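A quick way to observe the behavior is to count the views mid-run (schema name from this issue; connection details are placeholders):

psql "postgresql://postgres:postgres@localhost:5432/cht-sync" \
  -c "SELECT count(*) FROM pg_catalog.pg_views WHERE schemaname = 'dbt_clone';"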

cc @njogz @derickl

Remove .env File from Git Commit and Add Environment Variable Template

I noticed that the .env file is currently being committed to our Git repository, which poses a security risk by exposing sensitive information. I think we should do the following:

  1. Remove .env file from Git:
    Stop tracking the file so it is no longer committed; see the sketch after this list.

  2. Create an Environment Variable Template:
    Add a template file (e.g., .env.template) to the repository. The template should include placeholders and instructions for developers to fill in their own values. This will ensure a consistent approach to managing environment variables.
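A sketch of the usual cleanup for step 1 (rewriting old history, if the exposed secrets need purging, is a separate step):

git rm --cached .env            # stop tracking the file without deleting it locally
echo '.env' >> .gitignore
git commit -m "Stop tracking .env; use env.template instead"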

@njogz your feedback and suggestions on this matter are welcome.

Implement E2E test

Describe the issue
Currently, the project doesn't have a set of e2e tests.

Describe the improvement you'd like
Implement an e2e test for a first scenario.

Describe alternatives you've considered
None.

Tasks:

  • Implement e2e LTFU flow
  • Add e2e to CI pipeline

Collect aggregate data and submit back to a central repo

There are two main classes of data that we want to submit to a central repo...

  1. Impact data. Knowing the impact the CHT is having helps focus our efforts where they have the best impact. Additionally, it can be used to justify grant applications to continue developing the CHT. Currently this data is collected manually, which is labor intensive and prone to delays and bugs.
  2. Usage data. Being able to query user telemetry across projects and understand how the CHT is being used is critical for informing potential improvements and assessing the success of delivered improvements.

Data collection must be opt-in, and any data collected must be completely free of PII (aggregated and/or sanitized) so we don't risk exposing patient information.

Deploy to BRAC dev

Please deploy the containers from https://github.com/medic/cht-sync/blob/first_release/docker-compose.yml to point at the BRAC dev CouchDB and a new BRAC Postgres database/schema.

Config

There are env vars mapped in the linked docker-compose.yml that need to be set, like:

POSTGRES_USER=<User with write access to the schema below>
POSTGRES_PASSWORD=<user password>
POSTGRES_DB=<Name of new database>
POSTGRES_TABLE=couchdb
POSTGRES_SCHEMA=v1

COUCHDB_USER=<read access user>
COUCHDB_PASSWORD=<password>
COUCHDB_DB=<main database with all records>
COUCHDB_HOST=<hostname for couchdb server>

New DB Schema

This should be configured by running the snippet below (cht-sync/init-db.sh, lines 14 to 25 at commit 636e330) using the above env vars:

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE SCHEMA IF NOT EXISTS $POSTGRES_SCHEMA AUTHORIZATION $POSTGRES_USER;
    CREATE TABLE $POSTGRES_SCHEMA.$POSTGRES_TABLE (
        "@version" TEXT,
        "@timestamp" TIMESTAMP,
        "_id" TEXT,
        "_rev" TEXT,
        doc jsonb,
        doc_as_upsert BOOLEAN,
        UNIQUE ("_id", "_rev")
    );
EOSQL

Patch Logstash Docker Image to fix SSL issue

The current image we use to build the Logstash container uses a CouchDB plugin that was last released in 2019. There is a bug where the secure option does not use the https protocol. Additionally, it seems that Let's Encrypt certs do not work, possibly because the dated code does not have updated root CA certs. The proposed solution is to patch the plugin while building the Docker container.

CHT DATA PIPELINE Issues with large data sets

When hosting this we will need to work through how we sync large data sets, for both self-hosted and Medic-hosted projects. The current suggested options are:

  1. We run the sync during maintenance windows and run large batches to catch up.
  2. When we query CouchDB, we run smaller batches so as not to impact performance, but are prepared to wait a long time until all the data is synced (see the sketch after this list).
  3. Start from the sequence last reported by couch2pg (we already have the past data, so do we really need to sync it afresh?).
  4. Clone the production CouchDB and pull data from that until the pipeline catches up, then point it at production.

^^ @henokgetachew, @Hareet, @aodhan-domhnaill
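For options 2 and 3, batching boils down to reading the changes feed from a known sequence in bounded pages. A sketch with placeholder host, credentials, and batch size (LAST_SEQ stands for the stored checkpoint):

# Resume from the last processed sequence and fetch one bounded batch.
curl -s "http://medic:password@localhost:5984/medic/_changes?since=${LAST_SEQ}&limit=1000&include_docs=true"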
