CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. Read more detail on the CHT docs site.
See the CHT docs site!
See testing.
Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
License: GNU General Public License v3.0
Installation instructions and how to run it in production
How to define models
Superset instructions
Given the decision not to include superset in the cht-sync repo, we need to clean up the code and remove the Superset container and any tests regarding Superset.
From @Hareet: We requested cht-sync to be deployed/migrated to EKS a while back, as we have fewer people to handle the automation and maintenance required on isolated EC2 instances. We've solved that in EKS, and really want to push toward utilizing it.
This seems like a perfect continuation of the effort to merge several components of the current cht-sync/cht-pipeline/dataemon setup.
Additionally, document the steps required for deploying a cht-sync
instance to k3s.
We need to create an RDBMS instance from which Klipfolio will get the formatted data. Right now the only way to access the database hosted on the instance is through an SSH tunnel. While Klipfolio has a way to connect to a db through SSH, I am not sure how I would generate a public key to add to the server so that the Klipfolio app has access to the instance.
This ticket tracks the creation of the DB by the Infra team
This issue requires setting up a way to sync only the medic-users-meta database to Postgres. Add functionality to cht-sync to make this possible.
Create a postgres instance for use with the BRAC work as suggested here. The current RDS instance is full and requires the support of the infra team to debug and get logs.
I'm trying to run CHT Sync against a "production" CHT instance which is really just an instance of CHT Docker Helper. It has a URL, port, login, and password external to CHT Sync. Here are the entries from my .env file:
COUCHDB_USER=medic
COUCHDB_PASSWORD=password
COUCHDB_DB=medic-users-meta
COUCHDB_HOST=192-168-0-37.local-ip.medicmobile.org
COUCHDB_PORT=10447
COUCHDB_SECURE=true
When I run make prod, I get the following output:
➜ cht-sync git:(main) make prod
rm -rf ./data/json_docs
docker-compose -f docker-compose.couchdb.yml -f docker-compose.postgres.yml -f docker-compose.yml \
down -v --remove-orphans
docker-compose build
[+] Building 0.4s (8/8) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 212B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/logstash:7.14.1 0.4s
=> [1/3] FROM docker.io/library/logstash:7.14.1@sha256:8f1969b7dfe7fffc13f55d72b6001804e45c1f655b8169bc84a6dfeda481c2b9 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 106B 0.0s
=> CACHED [2/3] COPY ./pipeline /usr/share/logstash/pipeline 0.0s
=> CACHED [3/3] COPY ./entrypoint.sh /usr/local/bin/entrypoint.sh 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:999e357ca35ffbb05bd7838c566ba69ef47644ab92c029c9e4f20a9d4597e239 0.0s
=> => naming to docker.io/medicmobile/cht-logstash-couchdb:latest 0.0s
docker-compose up logstash postgrest dbt
no such service: postgrest
make: *** [Makefile:19: prod] Error 1
Steps to reproduce:
git clone https://github.com/medic/cht-sync
cd cht-sync
cp env.template .env
edit .env to have couchdb credentials from zeroeth step
make prod
expected: cht sync starts
actual: cht sync fails to start and shows error
During integration testing, verify data accuracy from CouchDB to Postgres to ensure the model reflects the source data. Additionally, rename existing End-to-End tests to Integration tests to reflect their purpose.
Initial Data Sync: Verify that data from CouchDB is correctly transferred to PostgreSQL during initial sync (Data in PostgreSQL should exactly match the data in CouchDB).
Continuous Data Sync: Ensure continuous syncing works correctly when data is updated or added in CouchDB. Incremental changes are synced properly without reprocessing the entire dataset.
Conflict Resolution: Ensure that conflicts (e.g., same document updated differently in multiple places) are handled properly during the sync process. (Not sure yet how this could happen)
Large Data Volume: Test the system’s ability to handle large volumes of data.
Error Handling: Validate the application's ability to handle and recover from errors during the sync process.
Pending tests:
Deletion Sync: Confirm that document deletions in CouchDB are correctly propagated to PostgreSQL.
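The data-accuracy checks above all reduce to comparing the document set in CouchDB with the rows landed in PostgreSQL. A minimal sketch of that comparison as a pure function (names and data shapes are illustrative, not the repo's actual test code; fetching from CouchDB and Postgres is left out):

```python
def diff_synced_docs(couch_docs, pg_docs):
    """Compare two snapshots of the same dataset.

    couch_docs / pg_docs: dicts mapping _id -> document body (plain dicts).
    Returns (missing_in_pg, extra_in_pg, mismatched_ids), each sorted,
    so an integration test can assert all three are empty.
    """
    missing_in_pg = sorted(set(couch_docs) - set(pg_docs))
    extra_in_pg = sorted(set(pg_docs) - set(couch_docs))
    mismatched_ids = sorted(
        _id for _id in set(couch_docs) & set(pg_docs)
        if couch_docs[_id] != pg_docs[_id]
    )
    return missing_in_pg, extra_in_pg, mismatched_ids
```

The same function covers the deletion case: a doc removed from CouchDB but still present in Postgres shows up in `extra_in_pg`.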
When setting up the BRAC test instance, logstash was getting 401 errors from postgrest, possibly because PGRST_DB_ANON_ROLE was not being used. For this ticket we need to investigate the configuration required to have postgrest running well and securely, and document that in the readme file.
Add a file containing the repo-level SonarCloud configuration for cht-sync.
Similar to medic/cht-core#8466
Set up and document a monitoring solution for CHT Sync (CHT Watchdog), together with relevant metrics and alerts.
Describe the issue
We're discussing increasing pbkdf2 iterations for CouchDb.
This will mean that computing the hashed password every time basic auth is used will become a costly server-side operation.
Describe the improvement you'd like
Instead of using basic auth for every request, make one initial _session request, and use the provided AuthSession cookie to authenticate all following requests within one conf run.
Describe alternatives you've considered
None.
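The cookie flow itself is a standard CouchDB feature: POST once to /_session, then reuse the returned AuthSession cookie for the rest of the run. A stdlib-only Python sketch (the URL and credentials are placeholders, and extract_auth_session is a hypothetical helper for code that inspects the header directly):

```python
import http.cookiejar
import json
import urllib.request

def open_couch_session(base_url, user, password):
    """POST /_session once; the returned opener then carries the
    AuthSession cookie on every subsequent request, so the server
    does not recompute the pbkdf2 hash per request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = json.dumps({"name": user, "password": password}).encode()
    req = urllib.request.Request(
        f"{base_url}/_session", data=body,
        headers={"Content-Type": "application/json"})
    opener.open(req)  # CouchDB replies with Set-Cookie: AuthSession=...
    return opener

def extract_auth_session(set_cookie_header):
    """Pull the AuthSession token out of a raw Set-Cookie header value."""
    for part in set_cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == "AuthSession":
            return value
    return None
```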
pipeline-postgrest-1 exited with code 137
Related issues:
supabase/cli#33
Probable solution: here
Document how to run cht-sync e2e tests locally.
These dashboards will be activity dashboards based on user interactions, such as sync rate.
TBD.
Hi there! When I follow the steps on the readme, I get the following error repeatedly in the dbt container: Database Error in model raw_contacts (models/contact_views/raw_contacts.sql).
Here's the complete logs from docker logs pipeline-dbt-1:
22:52:47 Found 5 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
22:52:47
22:52:47 Concurrency: 1 threads (target='dev')
22:52:47
22:52:47 1 of 5 START sql view model dbt.couchdb ........................................ [RUN]
22:52:47 1 of 5 OK created sql view model dbt.couchdb ................................... [CREATE VIEW in 0.13s]
22:52:47 2 of 5 START sql incremental model dbt.data_record ............................. [RUN]
22:52:47 2 of 5 OK created sql incremental model dbt.data_record ........................ [INSERT 0 1 in 0.13s]
22:52:47 3 of 5 START sql incremental model dbt.person .................................. [RUN]
22:52:48 3 of 5 OK created sql incremental model dbt.person ............................. [INSERT 0 8 in 0.08s]
22:52:48 4 of 5 START sql view model dbt.raw_contacts ................................... [RUN]
22:52:48 4 of 5 ERROR creating sql view model dbt.raw_contacts .......................... [ERROR in 0.03s]
22:52:48 5 of 5 SKIP relation dbt.contactview_metadata .................................. [SKIP]
22:52:48
22:52:48 Finished running 3 view models, 2 incremental models in 0 hours 0 minutes and 0.52 seconds (0.52s).
22:52:48
22:52:48 Completed with 1 error and 0 warnings:
22:52:48
22:52:48 Database Error in model raw_contacts (models/contact_views/raw_contacts.sql)
22:52:48 syntax error at or near ";"
22:52:48 LINE 10: ...ct_hospital'::text, 'health_center'::text, 'person'::text]);
22:52:48 ^
22:52:48 compiled Code at target/run/pipeline/models/contact_views/raw_contacts.sql
22:52:48
22:52:48 Done. PASS=3 WARN=0 ERROR=1 SKIP=1 TOTAL=5
22:52:58 Running with dbt=1.5.0
Steps to reproduce:
git clone https://github.com/medic/cht-sync
cd cht-sync
cp env.template .env
make local
expected: no errors
actual: errors
related: checking the SQL instance, I do see rows being added successfully at regular intervals
There's a broken link in Readme.md to the resource on schema changes.
Add a superset docker image to the docker compose file.
Add TLS to the Postgres service in cht-sync to add an extra layer of security
Add cht-pipeline to the docker-compose file
In order to move fast, the current worker hard-codes the postgrest URL, which means we are only able to replicate the medic database in CouchDB. Refactor the code so that the URL is built using values set by the COUCHDB_DBS environment variable.
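A sketch of such a builder, assuming COUCHDB_DBS is a comma-separated list of database names (the variable format and the defaults below are assumptions for illustration, not the repo's actual contract):

```python
import os

def couchdb_urls(env=None):
    """Build one CouchDB base URL per database named in COUCHDB_DBS,
    instead of hard-coding the single 'medic' database."""
    env = os.environ if env is None else env
    secure = env.get("COUCHDB_SECURE", "false").lower() == "true"
    scheme = "https" if secure else "http"
    host = env["COUCHDB_HOST"]
    port = env.get("COUCHDB_PORT", "5984")
    dbs = [d.strip() for d in env.get("COUCHDB_DBS", "medic").split(",") if d.strip()]
    return [f"{scheme}://{host}:{port}/{db}" for db in dbs]
```

With COUCHDB_DBS unset, the builder falls back to the current single-database behaviour, so the refactor stays backward compatible.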
Create a default superset dashboard for the default data model.
TODO: Flesh out the scope of the dashboard.
Logstash sometimes sends requests where the body doesn't include the full doc.
This causes problems with the redis worker because either postgrest refuses these requests, or, if we add default values, inserts them with null values, which is not useful.
Where are these partial requests from logstash coming from?
Why are they only affecting the redis worker, shouldn't postgrest have rejected them straight from logstash?
How best to handle them in the redis worker?
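One pragmatic answer to the third question is to validate bodies in the worker before inserting, and drop anything that lacks a complete document. A sketch (the field names and payload shape are illustrative; the real logstash event format may differ):

```python
# Core CouchDB fields every complete document carries; illustrative minimum.
REQUIRED_FIELDS = ("_id", "_rev")

def is_full_doc(body):
    """True only if the event carries a complete document under 'doc';
    partial bodies are dropped rather than inserted as rows of nulls."""
    if not isinstance(body, dict):
        return False
    doc = body.get("doc")
    return isinstance(doc, dict) and all(f in doc for f in REQUIRED_FIELDS)
```

Dropped events could be counted and logged, which would also help answer the first question (how often, and for which docs, logstash emits partial bodies).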
Replace postgres with apache druid in cht-sync and document the findings.
Set up a tool (CHT Watchdog? DOT?) to help understand the health of data.
Currently, Docker Compose does not provide a way to specify the platform as linux/arm64
in the compose file, which is necessary for proper functioning on M1 MacBooks. As a result, Docker Compose may encounter compatibility issues or fail to run on these devices. It is essential to add support for platform specification in the compose file to ensure seamless operation on M1 MacBooks.
This issue was first detected when doing acceptance testing on cht-pipeline default models issue.
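For reference, Compose releases that implement the Compose Specification do accept a per-service platform key, so one candidate fix is pinning it in the compose file. A minimal sketch (the service and image names are illustrative):

```yaml
services:
  logstash:
    image: medicmobile/cht-logstash-couchdb:latest
    platform: linux/arm64
```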
We should be using the postgres container by default. Check and update default env variables for the postgres container.
Environment: MacOS 13.1 (22C65); Docker Desktop 4.15.0 (93002); Docker Engine 20.10.21
CHT 4.1.0: Local using docker compose files cht-core.yml and cht-couchdb.yml
Observations:
For cht-sync environment variables I used:
export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=postgres
export POSTGRES_DB=cht-sync
export POSTGRES_TABLE=couchdb
export POSTGRES_SCHEMA=v2
export POSTGRES_PORT=5432
export POSTGRES_HOST=localhost
export DBT_POSTGRES_USER=postgres
export DBT_POSTGRES_PASSWORD=postgres
export DBT_POSTGRES_SCHEMA=dbt
export COUCHDB_USER=medic
export COUCHDB_PASSWORD=password
export COUCHDB_DB=medic
export COUCHDB_HOST=****.my.local-ip.co
export COUCHDB_PORT=5984
export COUCHDB_SECURE=false
Postgres container wasn't able to run. The log:
2023-01-25 18:43:30 psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
2023-01-25 18:43:30 Is the server running on that host and accepting TCP/IP connections?
2023-01-25 18:43:30 connection to server at "localhost" (::1), port 5432 failed: Cannot assign requested address
2023-01-25 18:43:30 Is the server running on that host and accepting TCP/IP connections?
PostgREST log:
2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: {"code":"PGRST000","details":"connection to server at \"localhost\" (127.0.0.1), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (::1), port 5432 failed: Address not available\n\tIs the server running on that host and accepting TCP/IP connections?\n","hint":null,"message":"Database connection error. Retrying the connection."}
2023-01-25 18:44:14 25/Jan/2023:21:44:14 +0000: Attempting to reconnect to the database in 32 seconds...
Define a simple CI workflow and implement it.
Check if cht-sync works with CouchDB 3.
The docs currently refer to running the project with node on the host.
The node script mostly issues docker compose commands with different arguments based on the environment,
plus some other things which are either no longer relevant or only for dev (creating the logstash file, which is now in source control; zipping/unzipping the data directory for local couchdb).
This node script could be replaced with a bash script which sets the arguments for docker-compose. Similar to https://github.com/medic/cht-interoperability/blob/main/startup.sh.
And the bash script is only for convenience; if the dev-only and other extraneous steps are removed, it's possible to run the project using the docker compose commands directly.
This would remove the requirement that node be installed on the host. Maybe not a big requirement, but node is not very lightweight, and fewer dependencies is always better. Adding new host requirements undermines the portability benefits of docker-compose a little; although it is unlikely that this script would run differently on different versions of node/npm, it is still possible.
There is currently no easy way to know how far along cht-sync is in syncing. Adding a way to get the amount of data yet to be synced would come in very handy, especially for projects migrating from couch2pg.
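One cheap progress signal is comparing CouchDB's update_seq with the last sequence cht-sync has processed. A sketch, assuming the CouchDB 2+/3 convention that sequence strings carry an integer prefix before a dash (function names are illustrative):

```python
def seq_number(update_seq):
    """'41168339-g1AAAA...' -> 41168339; bare integers pass through."""
    return int(str(update_seq).split("-", 1)[0])

def backlog(couch_update_seq, last_synced_seq):
    """Rough count of changes not yet synced. The prefix is an
    estimate, not an exact doc count, but it is good enough for a
    progress bar or a migration-monitoring metric."""
    return max(0, seq_number(couch_update_seq) - seq_number(last_synced_seq))
```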
Document server requirements & needed setup to run CHT Sync successfully.
Describe the issue
It would be helpful if the configuration setup for cht-sync and cht-pipeline could be made more understandable.
Describe the improvement you'd like
We already have a template for the environment variables that can be used as an example of how to set up an instance.
I will explain the meaning and purpose of each group of environment variables.
Context: This came up because of the confusion when defining values for the environment variables.
This ticket aims to deploy CHT Sync using the existing DBT model on the Gamma environment.
Description:
Help @medic/site-reliability-engineering
Logstash seems to have an error that resolves itself but can cause downtime of over an hour, meaning no data is copied over from CouchDB to Postgres.
The instance created here is no longer accessible, so there is a need to create a new instance where cht-sync will be hosted.
SSH tunneling is required for the MoH UG project. To do that in this repo, we have to update the base image to version 2.1.2 and then enable the SSH tunneling feature flag.
CHT Sync is currently running against the https://brac-clone-for-phil.dev.medicmobile.org instance, but syncing of documents is currently very slow. When I checked a few hours ago, it had a backlog of 41,168,339 documents to be synced (see screenshot below).
When CHT Sync is doing a refresh, it drops objects and recreates them. This causes issues because if the database is queried while a refresh is going on, the functions will not return any data.
Views
A full Brac dbt config has 42 views
Here is a screenshot from the midst of a CHT Sync run showing only 18 views.
Judging by SELECT * FROM pg_catalog.pg_tables WHERE schemaname = 'dbt_clone';, it seems tables are not being dropped, but they are now taking very long to update.
Document a recommended pattern for using cht-sync against multiple CouchDB instances all linked to one PostgreSQL server.
As an example, there are MoHs hosted across several instances (mostly by county) and all data is synced to 1 PostgreSQL database.
This ticket can be moved to cht-docs if there are no required code changes. TBD.
I noticed that the .env file is currently being committed to our Git repository, which poses a security risk by exposing sensitive information. I think we should do the following:
Remove .env file from Git Commit:
Create an Environment Variable Template:
Add a template file (e.g., .env.template) to the repository. The template should include placeholders and instructions for developers to fill in their own values. This will ensure a consistent approach to managing environment variables.
@njogz your feedback and suggestions on this matter are welcome.
Describe the issue
Currently, the project doesn't have a set of e2e tests.
Describe the improvement you'd like
Implement an e2e test for a first scenario.
Describe alternatives you've considered
None.
Tasks:
There are two main classes of data that we want to submit to a central repo...
Data collection must be opt-in, and any data collected must be completely free of PII (aggregated and/or sanitized) so we don't risk exposing patient information.
Please deploy the containers from https://github.com/medic/cht-sync/blob/first_release/docker-compose.yml to point at BRAC dev couchdb and a new BRAC postgres database/schema
There are env vars mapped in the linked docker-compose.yml
that need to be set like,
POSTGRES_USER=<User with write access to the schema below>
POSTGRES_PASSWORD=<user password>
POSTGRES_DB=<Name of new database>
POSTGRES_TABLE=couchdb
POSTGRES_SCHEMA=v1
COUCHDB_USER=<read access user>
COUCHDB_PASSWORD=<password>
COUCHDB_DB=<main database with all records>
COUCHDB_HOST=<hostname for couchdb server>
Should be configured by running the below using the above env vars,
Lines 14 to 25 in 636e330
The current image we use to build the Logstash container uses a couchdb plugin that was last released in 2019. There is a bug where the secure option does not use the https protocol. Additionally, it seems that Let's Encrypt certs do not work, possibly because the dated code does not have updated root CA certs. The proposed solution is to patch the plugin while building the Docker container.
When hosting this we will need to work through how we sync large data sets for both self-hosted and medic hosted projects. The current suggested options are
These reusable schemas will be used to build activity dashboards.