greenelab / connectivity-search-backend

Django backend for hetnet connectivity search

Home Page: https://search-api.het.io

License: BSD 3-Clause "New" or "Revised" License

Python 94.32% Shell 5.68%

hetmech hetnets hetio django database hetnet-connectivity-search backend

connectivity-search-backend's Introduction

connectivity search backend

This Django application powers the API available at https://search-api.het.io/.

Environment

This repository uses conda to manage its environment as specified in environment.yml. Install the environment with:

conda env create --file=environment.yml

Then use conda activate hetmech-backend and conda deactivate to activate or deactivate the environment.

Secrets

Users must supply dj_hetmech/secrets.yml with the database connection information and two optional parameters for Django settings. See dj_hetmech/secrets-template.yml for the fields that should be defined. These secrets determine whether Django connects to a local or a remote database, as well as other Django security settings.

Notebooks

Use the following command to launch Jupyter Notebook in your browser for interactive development:

python manage.py shell_plus --notebook

Server

A local development server can be started with the command:

python manage.py runserver

This exposes the API at http://localhost:8000/v1/.
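As a rough illustration of calling the local API, the sketch below builds a node search URL with the standard library. The helper name nodes_search_url is hypothetical and not part of this repository; the search, limit, and count-metapaths-to parameters are the ones that appear in the example request quoted in the issues further down this page.

```python
from urllib.parse import urlencode

API_ROOT = "http://localhost:8000/v1/"  # local development server


def nodes_search_url(search, limit=None, count_metapaths_to=None, api_root=API_ROOT):
    """Build a nodes/ search URL (hypothetical helper, not part of the repo)."""
    params = {"search": search}
    if limit is not None:
        params["limit"] = limit
    if count_metapaths_to is not None:
        params["count-metapaths-to"] = count_metapaths_to
    return f"{api_root}nodes/?{urlencode(params)}"


print(nodes_search_url("Alitretinoin", limit=100, count_metapaths_to=20115))
```

Passing api_root="https://search-api.het.io/v1/" would target the deployed API instead.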

Database

This project uses a PostgreSQL database. The deployed version of this application uses a remote database. Public read-only access is available with the following configuration:

name: connectivity_db
user: read_only_user
password: tm8ut9uzqx7628swwkb9
host: search-db.het.io
port: 5432
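For scripts that need these settings as a single string, the following minimal sketch formats them as a postgresql:// connection URI. The helper postgres_url is hypothetical; the values are copied from the configuration above.

```python
from urllib.parse import quote

# Public read-only settings copied from the configuration above.
READ_ONLY_DB = {
    "name": "connectivity_db",
    "user": "read_only_user",
    "password": "tm8ut9uzqx7628swwkb9",
    "host": "search-db.het.io",
    "port": 5432,
}


def postgres_url(config):
    """Format the settings as a postgresql:// URI (hypothetical helper)."""
    user = quote(config["user"], safe="")
    password = quote(config["password"], safe="")  # escape any reserved characters
    return f"postgresql://{user}:{password}@{config['host']}:{config['port']}/{config['name']}"


print(postgres_url(READ_ONLY_DB))
```

A URI of this form can be passed to psql as its database argument, or to a client library that accepts libpq connection strings.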

To create a new database locally for development, run:

# https://docs.docker.com/samples/library/postgres/
docker run \
  --name connectivity_db \
  --env POSTGRES_DB=connectivity_db \
  --env POSTGRES_USER=dj_hetmech \
  --env POSTGRES_PASSWORD=not_secure \
  --volume "$(pwd)"/database:/var/lib/postgresql/data \
  --publish 5432:5432 \
  --detach \
  postgres:12.4

Populating the database

To populate the database from scratch, use the populate_database management command (source). Here is an example workflow:

# migrate database to the current Django models
python manage.py makemigrations
python manage.py migrate --run-syncdb
# view the populate_database usage docs
python manage.py populate_database --help
# wipe the existing database (populate_database assumes empty tables)
python manage.py flush --no-input
# populate the database (will take a long time)
python manage.py populate_database --max-metapath-length=3 --reduced-metapaths --batch-size=12000
# output database information and table summaries
python manage.py database_info

Another option to load the database is to import it from the connectivity-search-pg_dump.sql.gz database dump, which will save time if you are interested in loading the full database (i.e. without --reduced-metapaths). This 5 GB file is available on Zenodo (TODO: update latest database dump to Zenodo).

To load connectivity-search-pg_dump.sql.gz into a new database, modify the following command:

zcat connectivity-search-pg_dump.sql.gz | psql --username=dj_hetmech --dbname=connectivity_db --host=HOST

connectivity-search-pg_dump.sql.gz was exported from the development Docker database with the command:

docker exec connectivity_db \
  pg_dump \
  --host=localhost --username=dj_hetmech --dbname=connectivity_db \
  --create --clean \
  --compress=8 \
  > connectivity-search-pg_dump.sql.gz

connectivity-search-backend's Issues

count-metapaths-to not returning metapath_count

https://search-api.het.io/v1/nodes/?search=Alitretinoin&limit=100&count-metapaths-to=20115 returns

{
    "count": 1,
    "next": null,
    "previous": null,
    "results": [
        {
            "id": 21150,
            "identifier": "DB00523",
            "identifier_type": "str",
            "name": "Alitretinoin",
            "properties": {
                "url": "http://www.drugbank.ca/drugs/DB00523",
                "inchi": "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8-,16-14+",
                "source": "DrugBank",
                "license": "CC BY-NC 4.0",
                "inchikey": "InChIKey=SHGAZHPCJJPHSC-ZVCIMWCZSA-N"
            },
            "metanode": "Compound"
        }
    ]
}

Note that metapath_count is missing, but there should be results:

Add metatype filter to all applicable queries

We have a handy collection of metatype filter buttons at the top of the page on the front end. Currently, the only query they affect is the one issued when the user types in a search string. They have no effect on count-metapaths-to or anything else.

We can sit down and discuss where it would be appropriate to include this parameter. It might just be all the queries involved with the source/target search bars, as making the filter buttons have an effect on something much further down the page might be confusing to the user.

Automate Nginx Deployment

This Django app is called by Nginx, which behaves as a reverse proxy to client requests. Here is the nginx configuration file:

# HTTP configuration 
# Default HTTP server: always redirect to HTTPS
server {
    listen 80 default_server;
    server_name _;
    return 301 https://$host$request_uri;
}

# HTTPS server
server {
    listen 443;
    server_name search-api.het.io;

    if ( $http_host !~* ^(search-api\.het\.io)$ ) {
        return 444;
    }

    ssl on;
    ssl_certificate /etc/letsencrypt/live/search-api.het.io/fullchain.pem;   # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/search-api.het.io/privkey.pem; # managed by Certbot

    charset utf-8;

    # max upload size (adjust to taste)
    client_max_body_size 10M;

    location / {
        return 301 $scheme://$host/v1;
    }

    location /static {
        alias /home/ubuntu/hetmech-backend/dj_hetmech/static;
    }

    location /v1 {
        proxy_pass http://127.0.0.1:8001/v1;
        proxy_set_header X-Forwarded-Host $server_name;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        add_header P3P 'CP="ALL DSP COR PSAa PSDa OUR NOR ONL UNI COM NAV"';
    }

}

The directory /home/ubuntu/hetmech-backend/dj_hetmech/static (which holds static files for the API view) should be populated by the following management command:

python manage.py collectstatic

Gunicorn is started at boot time by supervisord, with the following configuration:

command=/home/ubuntu/miniconda3/envs/hetmech-backend/bin/gunicorn dj_hetmech.wsgi:application --bind 127.0.0.1:8001 --error-logfile /tmp/hetmech-gunicorn.log -w 3
directory=/home/ubuntu/hetmech-backend/
user=nobody
group=nobody
autostart=true
autorestart=true
priority=991
stopsignal=KILL

Improve ordering of matches for nodes endpoint

Currently, searching ep will return nicotine dependence before epilepsy.

We are also thinking of including number of metapaths in database functionality.
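One possible way to approach the ordering problem described above is to rank names that start with the query ahead of names that merely contain it. The sketch below is only an illustration in plain Python, not how the nodes endpoint is implemented; rank_matches and its three tiers are hypothetical.

```python
def rank_matches(names, query):
    """Order matching names: prefix matches, then word-start matches, then the rest."""
    q = query.lower()

    def sort_key(name):
        lowered = name.lower()
        if lowered.startswith(q):
            tier = 0  # the whole name starts with the query
        elif any(word.startswith(q) for word in lowered.split()):
            tier = 1  # a later word starts with the query
        else:
            tier = 2  # the query only occurs inside a word
        # Shorter names win within a tier; the name itself breaks remaining ties.
        return (tier, len(name), lowered)

    return sorted((n for n in names if q in n.lower()), key=sort_key)


print(rank_matches(["nicotine dependence", "epilepsy"], "ep"))
```

With this key, epilepsy sorts before nicotine dependence for the query ep, because only epilepsy begins with the query.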

Refactor endpoint names/urls

While coding the frontend, I often get mixed up about which query does which thing; mainly for the node search queries.

Here's how it is now:

// lookup specific node by id
https://search-api.het.io/v1/nodes/[NODE_ID]

// search node based on search string
https://search-api.het.io/v1/nodes/?search=[SEARCH]

// search node based on search string, with other node specified so we can sort by # of metapaths to other node
https://search-api.het.io/v1/nodes/?search=[SEARCH_STRING]&count-metapaths-to=[NODE_ID]

// return a list of all nodes connected to specified node, sorted by # of metapaths
https://search-api.het.io/v1/count-metapaths-to/[NODE_ID]

// get random node pair
https://search-api.het.io/v1/random-node-pair/

// search metapaths
https://search-api.het.io/v1/query-metapaths/

// search paths
https://search-api.het.io/v1/query-paths/

I propose we change it to this:

https://search-api.het.io/v1/nodes/[NODE_ID]
https://search-api.het.io/v1/nodes/?search=[SEARCH_STRING]
https://search-api.het.io/v1/nodes/?search=[SEARCH_STRING]&other-node=[OTHER_NODE_ID]
https://search-api.het.io/v1/nodes/?other-node=[OTHER_NODE_ID]
https://search-api.het.io/v1/nodes/random-pair
https://search-api.het.io/v1/metapaths/
https://search-api.het.io/v1/paths/

I think this would be a clearer organization of the queries. A few things this does:

putting all the node-related queries under nodes/
making "count-metapaths-to" a parameter rather than a separate query URL, which I think is closer to what is actually going on
letting the metatype filter parameter be added to everything under nodes/, consistently
removing the word "query" from metapaths and paths (I think this is redundant because all of our endpoints are some kind of query to the backend)

Let's talk about this in person tomorrow.

Harmonize database node ids between postgres and neo4j

Currently, we have two internal database ids for nodes: one used by the Hetionet neo4j database and one used by this repo's postgres db. I am thinking we should consolidate and use the neo4j ids in this database.

Zenodo urllib.request.urlretrieve downloads raise ContentTooShortError

When downloading https://zenodo.org/record/1435834/files/dwpcs_length-2_damping-0.0.zip from https://zenodo.org/record/1435834, I got:

python manage.py populate_database --max-metapath-length=3  --reduced-metapaths --batch-size=12000
_download_hetionet_hetmat(self=<dj_hetmech_app.management.commands.populate_database.Command object at 0x7fc5f9b3e670>) ran in 0:00:00
_populate_metanode_table() ran in 0:00:00
_populate_node_table() ran in 0:00:13
_populate_metapath_table() ran in 0:00:00
_download_path_counts(length=1) ran in 0:01:17
_populate_degree_grouped_permutation_table(length=1) ran in 0:00:00
Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/django/core/management/__init__.py", line 401, in execute_from_command_line
    utility.execute()
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/django/core/management/__init__.py", line 395, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/django/core/management/base.py", line 328, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/django/core/management/base.py", line 369, in execute
    output = self.handle(*args, **options)
  File "/home/dhimmel/Documents/repos/connectivity-search-backend/dj_hetmech_app/management/commands/populate_database.py", line 350, in handle
    timed(self._download_path_counts)(length)
  File "/home/dhimmel/Documents/repos/connectivity-search-backend/dj_hetmech_app/utils/__init__.py", line 16, in wrapper
    result = func(*args, **kwargs)
  File "/home/dhimmel/Documents/repos/connectivity-search-backend/dj_hetmech_app/management/commands/populate_database.py", line 274, in _download_path_counts
    path = self.zenodo_download('1435834', archive)
  File "/home/dhimmel/Documents/repos/connectivity-search-backend/dj_hetmech_app/management/commands/populate_database.py", line 365, in zenodo_download
    urlretrieve(url, path)
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/urllib/request.py", line 286, in urlretrieve
    raise ContentTooShortError(
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 1320992799 out of 3186294789 bytes>
Exception ignored in: <function Driver.__del__ at 0x7fc572508160>
Traceback (most recent call last):
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/neo4j/__init__.py", line 277, in __del__
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/neo4j/__init__.py", line 307, in close
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/neo4j/io/__init__.py", line 488, in close
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/neo4j/io/__init__.py", line 477, in remove
  File "/home/dhimmel/miniconda3/envs/hetmech-backend/lib/python3.8/site-packages/neo4j/io/_bolt3.py", line 390, in close
AttributeError: 'NoneType' object has no attribute 'debug'
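A possible workaround for the truncated download in the traceback above is to retry when urlretrieve raises ContentTooShortError. The sketch below is an assumption about how that could look, not the code in populate_database.py; download_with_retries and its parameters are hypothetical, and each retry restarts the download from the beginning rather than resuming it.

```python
import time
from urllib.error import ContentTooShortError
from urllib.request import urlretrieve


def download_with_retries(url, path, attempts=3, wait_seconds=5, retrieve=urlretrieve):
    """Call retrieve(url, path), retrying when the transfer is cut short."""
    for attempt in range(1, attempts + 1):
        try:
            return retrieve(url, path)
        except ContentTooShortError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)  # brief pause before starting over
```

The retrieve argument exists only so the function can be exercised without network access; by default it is urllib.request.urlretrieve.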

DegreeGroupedPermutation table contains NaN values

from dj_hetmech_app.models import DegreeGroupedPermutation
instance = DegreeGroupedPermutation.objects.get(pk=23633958)
instance.__dict__

returns:

{'_state': <django.db.models.base.ModelState at 0x7f39aeb78550>,
 'id': 23633958,
 'metapath_id': 'CrCuGaD',
 'source_degree': 0,
 'target_degree': 540,
 'n_dwpcs': 54200,
 'n_nonzero_dwpcs': 0,
 'nonzero_mean': nan,
 'nonzero_sd': nan}

Notice how nonzero_mean and nonzero_sd are nan. I was expecting missing values to be None here, which would help with JSON encoding.

We double checked this occurs in the database:

SELECT * FROM dj_hetmech_app_degreegroupedpermutation WHERE id=23633958;

    id    | source_degree | target_degree | n_dwpcs | n_nonzero_dwpcs | nonzero_mean | nonzero_sd | metapath_id 
----------+---------------+---------------+---------+-----------------+--------------+------------+-------------
 23633958 |             0 |           540 |   54200 |               0 |          NaN |        NaN | CrCuGaD
(1 row)

Related to the following API call http://localhost:8000/v1/query-paths/?target=17054&source=6602&metapath=DaGuCrC
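To illustrate the None behavior expected above, here is a minimal sketch that converts NaN to None before JSON encoding. The helper nan_to_none is hypothetical, and where such a conversion would belong (at population time or at serialization time) is left open.

```python
import json
import math


def nan_to_none(value):
    """Return None for float NaN and leave every other value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


row = {"n_nonzero_dwpcs": 0, "nonzero_mean": float("nan"), "nonzero_sd": float("nan")}
cleaned = {key: nan_to_none(value) for key, value in row.items()}
print(json.dumps(cleaned))  # NaN fields are encoded as null
```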

Flush of AWS database not working

I am attempting to wipe the prototype database and re-populate it using the new hetmatpy version added in #15.

However, the following command seems to run indefinitely without returning or erroring:

python manage.py flush --no-input

When the database was a local postgres instance in a Docker container, this command took at most a few seconds. @dongbohu any ideas?

search-api.het.io SSH access and main branch rename

I renamed the default branch to main and updated source in 3c4e558.

Looking at the CI Logs, I'm not sure the auto-deploy is working for the main branch:

Already on 'master'
Your branch is up to date with 'origin/master'.
Fetching origin
From https://github.com/greenelab/hetmech-backend
 * [new branch]      main       -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.

Tried SSHing into the instance using ssh [email protected], but got "Permission denied (publickey)". @dongbohu can you add my public SSH keys from here.

Also I think you were planning to migrate to AWS, similar to hetio/hetionet#35. I imagine the migration will take care of any issues with the new default branch name? I suspect the instance is still on the master branch rather than main.

Return path count for non-precomputed rows

Currently, when we compute DWPC and path information on the fly, we have not been calculating or setting path_count:

https://github.com/greenelab/hetmech-backend/blob/6b00ffe58664db9941e3bf6ce89596e9bc7f9404/dj_hetmech_app/utils/paths.py#L97

The reason for this limitation was that hetnetpy.neo4j.construct_dwpc_query only returned percent_of_DWPC. This is due to a possible Cypher/neo4j limitation where we can't return intermediate values separately from the resulting table.

One solution would be to change the Cypher to repeat PC and DWPC for every path row like:

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-(n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.identifier = 'DB01156'
AND n4.identifier = 'DOID:0050742'
AND n1 <> n3
WITH
[
size((n0)-[:BINDS_CbG]-()),
size(()-[:BINDS_CbG]-(n1)),
size((n1)-[:PARTICIPATES_GpPW]-()),
size(()-[:PARTICIPATES_GpPW]-(n2)),
size((n2)-[:PARTICIPATES_GpPW]-()),
size(()-[:PARTICIPATES_GpPW]-(n3)),
size((n3)-[:ASSOCIATES_DaG]-()),
size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, path
WITH path, reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.5) AS PDP
WITH collect({paths: path, PDPs: PDP}) AS data_maps, count(path) AS PC, sum(PDP) AS DWPC
UNWIND data_maps AS data_map
WITH data_map.paths AS path, data_map.PDPs AS PDP, PC, DWPC
RETURN
  substring(reduce(s = '', node IN nodes(path)| s + '–' + node.name), 1) AS path,
  PDP,
  100 * (PDP / DWPC) AS percent_of_DWPC,
  PC, DWPC
ORDER BY percent_of_DWPC DESC

Returning a table like:
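The arithmetic performed by the Cypher query above can be sketched in plain Python: each path's PDP is the product of its degrees raised to -0.5, DWPC is the sum of the PDPs, PC is the number of paths, and both totals are repeated on every row. The function names are hypothetical and the input degrees in the usage line are made-up values for illustration.

```python
def path_degree_product(degrees, damping=0.5):
    """Multiply degree ** -damping over the degrees along one path."""
    pdp = 1.0
    for degree in degrees:
        pdp *= degree ** -damping
    return pdp


def summarize_paths(degrees_per_path, damping=0.5):
    """Return one row per path, repeating PC and DWPC on every row."""
    pdps = [path_degree_product(degrees, damping) for degrees in degrees_per_path]
    pc = len(pdps)   # path count
    dwpc = sum(pdps)  # degree-weighted path count
    rows = [
        {"PDP": pdp, "percent_of_DWPC": 100 * pdp / dwpc, "PC": pc, "DWPC": dwpc}
        for pdp in pdps
    ]
    return sorted(rows, key=lambda row: row["percent_of_DWPC"], reverse=True)


for row in summarize_paths([[4, 4], [1, 4]]):
    print(row)
```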

https://search-api.het.io/ is currently down or inaccessible

ping https://search-api.het.io/
ping: https://search-api.het.io/: Name or service not known

And on https://het.io/search/, looking at the console for search API requests, we're getting ERR_CONNECTION_TIMED_OUT.

@dongbohu can you look into this?

Change `max-paths` to `limit`, and add `limit` for all queries that return lists

Make the parameter name consistent. I recommend limit or result-limit, because it makes no assumptions about what type of data or object the query will return (in case we ever change the data structures or names of things), and it pretty clearly indicates that it's just a # limit on whatever will be returned by the query.

Also, it would be good to have this parameter available for all queries that return a list of things. It seems we'll need to start having some kind of result limit available for every part of the app, now that the metapaths table will be getting substantially longer.
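A shared helper is one way to keep a limit parameter consistent across endpoints. The sketch below parses a raw query-string value into a bounded integer; parse_limit and its default and maximum values are hypothetical and not taken from this repository.

```python
def parse_limit(raw_value, default=100, maximum=1000):
    """Turn a raw limit query-string value into a bounded positive integer."""
    if raw_value is None or raw_value == "":
        return default  # parameter not supplied
    try:
        limit = int(raw_value)
    except ValueError:
        return default  # not a number, fall back to the default
    if limit < 1:
        return default
    return min(limit, maximum)  # cap very large requests


print(parse_limit("25"), parse_limit(None), parse_limit("999999"))
```

Falling back to the default on bad input is a design choice for this sketch; an API could equally reject such requests with an error.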

dj_hetmech_app_node relation does not exist

Hi, I followed the steps in the README and tried to connect the front end with the back end. Although I could open http://localhost:8000/v1/, I got an error when trying to open http://localhost:8000/v1/nodes/
The error message is:

ProgrammingError at /v1/nodes/
relation "dj_hetmech_app_node" does not exist
LINE 1: SELECT COUNT(*) AS "__count" FROM "dj_hetmech_app_node"

I set up Docker and the Postgres database (POSTGRES_DB) locally. Is there a step I missed?

Migrate hosting to GCP from AWS

From @dongbohu:

Daniel, I created a new database and a new virtual machine on Google Cloud Platform to host the connectivity-search API. Please take a look at it:
http://35.229.106.21/v1/
If it looks okay to you, please register the IP address "35.229.106.21" with "search-api.het.io". Once the registration takes effect, I will install an HTTPS certificate and enable HTTPS. The migration will then be done.

nodes and query-paths endpoints return node data in different formats

@vincerubinetti mentioned that this is an annoyance for the frontend.

From https://search-api.het.io/v1/nodes/

        {
            "id": 0,
            "identifier": "128239",
            "identifier_type": "int",
            "name": "IQGAP3",
            "data": {
                "url": "http://identifiers.org/ncbigene/128239",
                "source": "Entrez Gene",
                "license": "CC0 1.0",
                "chromosome": "1",
                "description": "IQ motif containing GTPase activating protein 3"
            },
            "metanode": "Gene"
        },

From https://search-api.het.io/v1/query-paths/?source=11545&target=33324&metapath=SEcCrCrC:

    "nodes": {
        "11545": {
            "neo4j_id": 11545,
            "node_label": "SideEffect",
            "data": {
                "name": "Nasal itching",
                "source": "UMLS via SIDER 4.1",
                "identifier": "C0850060",
                "url": "http://identifiers.org/umls/C0850060",
                "license": "CC BY-NC-SA 4.0"
            },
            "metanode": "Side Effect"
        },
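Until the two endpoints agree, a frontend or backend helper could reshape a query-paths node so that name and identifier sit at the top level, as they do in the nodes response. The sketch below is hypothetical. It keeps neo4j_id under its own key rather than mapping it to id, since the two responses may use different internal ids.

```python
def path_node_to_node_format(path_node):
    """Reshape a query-paths node so name and identifier are top-level keys."""
    data = dict(path_node["data"])  # copy so the input is not modified
    return {
        "neo4j_id": path_node["neo4j_id"],
        "identifier": data.pop("identifier"),
        "name": data.pop("name"),
        "data": data,  # remaining properties such as url, source, license
        "metanode": path_node["metanode"],
    }


example = {
    "neo4j_id": 11545,
    "node_label": "SideEffect",
    "data": {
        "name": "Nasal itching",
        "source": "UMLS via SIDER 4.1",
        "identifier": "C0850060",
        "url": "http://identifiers.org/umls/C0850060",
        "license": "CC BY-NC-SA 4.0",
    },
    "metanode": "Side Effect",
}
print(path_node_to_node_format(example))
```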

Use a clean git repo in the production search-api.het.io instance

Following #78 & #80, I noticed the following unstaged change in the git repo on search-api.het.io:

diff --git a/dj_hetmech/settings.py b/dj_hetmech/settings.py
index 9b9bdd9..691968f 100644
--- a/dj_hetmech/settings.py
+++ b/dj_hetmech/settings.py
@@ -29,7 +29,7 @@ with open(path) as read_file:
 SECRET_KEY = 'secret_not_yet_set'
 
 # SECURITY WARNING: don't run with debug turned on in production!
-DEBUG = True
+DEBUG = False
 
 ALLOWED_HOSTS = ['localhost', 'search-api.het.io', ]

This seems problematic to me. If we change settings.py it could cause the git pull --ff-only to fail. There must be a more robust way of setting DEBUG false? For example, by way of an environment variable or other option passed to django? @dongbohu what do you think?
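One way to avoid editing settings.py on the server is to read the flag from the environment, as suggested above. This is a minimal sketch; env_flag and the variable name DJ_HETMECH_DEBUG are hypothetical and not defined anywhere in this repository.

```python
import os


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default  # variable not set
    return value.strip().lower() in {"1", "true", "yes", "on"}


# In settings.py this could replace the hard-coded value.
# DJ_HETMECH_DEBUG is a made-up variable name for illustration.
DEBUG = env_flag("DJ_HETMECH_DEBUG", default=False)
```

Defaulting to False means a production instance that never sets the variable runs with debug turned off, and the working tree stays clean for git pull --ff-only.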

Improve performance/lag for benefit of frontend

Performance of the frontend can be a bit laggy, which makes it unpredictable and confusing to the user when there is a queue of long computations in sequence. Some of this lag can be improved on the frontend, which I am in the process of doing. Some of this lag can surely be improved with modifications to the backend.

I'm not sure what the improvements would look like. One could possibly be a timeout parameter available for each query. After taking too long, it would simply give up on the query and return an error or something, preventing too many queries from building up and "clogging" the backend.

This is something we should discuss and iterate on over time.

Postgres Database Optimization

@dhimmel and @vincerubinetti: I added a few indexes on the identifier and name fields of the Node table on a clone of the current backend DB. These indexes are supposed to make prefix searches on identifier and substring/trigram searches on name up to ten times faster (from a few hundred milliseconds to ~10 milliseconds). I am running the optimized DB on a test AWS EC2 instance:
http://35.175.113.38/v1/nodes/?search=xxx

Please replace xxx with whatever string you want to search and compare its performance with the production server:
https://search-api.het.io/v1/nodes/?search=xxx
and tell me whether you feel any difference. If you do, I will apply these indexes on the production DB. Thanks.

Continuous deployment fails to update conda environment

There are two issues. First, when ac47bc4 updated the environment, the CI deployment failed with error:

/home/ubuntu/hetmech-backend/.circleci/deploy.sh: line 13: conda: command not found

Second, the conda env update command (which is currently incorrect as conda update) does not update the pip dependencies (see conda/conda#7774 / conda/conda#8541 / conda/conda#8542). I am thinking we should move to wipe and reinstall the conda env.

Selecting a cloud provider for this project

@dongbohu @cgreene and I were discussing cloud providers to host this webapp. Generally the lab uses AWS. However, I personally find AWS too difficult to use, and at times expensive.

It looks like setting up a Postgres instance on Google Cloud is straightforward. @dongbohu I am inclined to suggest Google Cloud as the first option, and if we find ourselves struggling we can switch to AWS. I feel that Google Cloud will make the hosting more accessible to me, so that I can help with the infrastructure to a larger extent.

populate_database.py: "data" should be "properties"

Line 216 in populate_database.py is wrong.
It should be changed from data=data to properties=data.
Please check and confirm.
Thank you~

Implement elastic search or similar

I believe we're currently using simple substring matching for our node search. Throughout my testing of the frontend, I've come across many cases where I've had to type much more of a gene/compound/disease/etc than I would expect (compared to something like Google) to get it to show up near the top of the list.

Here is one example I found just now, though it's certainly not the best example of the issue:

In this example, I'd expect epilepsy to come before nicotine dependence. There are other examples (that I can't remember right now) where the thing I'm actually searching for is far far down the list and not visible. The criteria of how to rank results might be difficult to decide on, but I think we could come up with something that performs better than substring matching.

I'd put this feature request near the top of the list of things that would enhance the user experience of using this app, along with #49

Provide query to frontend to retrieve path/graph data from neo4j

The front-end will need to query the neo4j database to get the data needed to draw a graph representation of metapaths/paths. Apparently this query is fairly complex and also would be more appropriately generated by the backend.

The backend could either give the frontend the query, or perhaps do the query and return the results to the frontend.

Info on the neo4j data format:
https://github.com/eisman/neo4jd3#neo4j-data-format

Improvements to the database accessibility

User feedback in #74 highlighted some issues with the accessibility of our database. I updated the README in a6d5588, but there are still several improvements I'd like to make.

publicly archiving hetmech-pg_dump.sql.gz. I will look into whether we can upload this to Zenodo.
adding instructions to the readme on how to load hetmech-pg_dump.sql.gz. @dongbohu can you do this?
creating a read-only database user and making the database URL and this user and password public knowledge. This will make it much easier for us to share queries since we can give anyone read access to the database. Also there is code in Hetmech that directly queries the db, and these notebooks are not reproducible without access.

The database doesn't contain any sensitive information, so I think the risk of abuse is low. We can always disable the read-only user if there is some unintended consequence like excessive cloud costs. @dongbohu, we might already have a read-only user, but can you look into this and provide the details?

query-paths fails due to duplicate rows for symmetric metapaths

I noticed the frontend failed at https://search.het.io/?source=8224&target=32460 when I tried to select a metapath. It looks like a backend error. Specifically, https://search-api.het.io/v1/query-paths/?source=8224&target=32460&metapath=GpBPpG fails at:

https://github.com/greenelab/hetmech-backend/blob/df4cc76248c3e68cdcefee255b449f59fe9945d9/dj_hetmech_app/utils/paths.py#L29
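To help spot which metapaths could be affected, the sketch below checks whether a metapath abbreviation reads the same in both directions, as GpBPpG does. It assumes abbreviations alternate uppercase metanode tokens with lowercase metaedge tokens and does not handle directed edges written with < or >. Both helpers are hypothetical and are not taken from paths.py.

```python
import re


def split_metapath(abbreviation):
    """Split an abbreviation such as 'GpBPpG' into ['G', 'p', 'BP', 'p', 'G']."""
    return re.findall(r"[A-Z]+|[a-z]+", abbreviation)


def is_symmetric(abbreviation):
    """Return True when the token sequence is identical when reversed."""
    tokens = split_metapath(abbreviation)
    return tokens == tokens[::-1]


print(is_symmetric("GpBPpG"), is_symmetric("CrCuGaD"))
```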