Phenopacket-scraper-core

Extracts information from life-science websites and texts, generating phenopackets with the extracted information and correct external ontology references.

Running PhenopacketScraper Tool

Setup

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/monarch-initiative/phenopacket-scraper-core.git

First, you need to create a virtual environment and activate it.

$ [sudo] pip install virtualenv
$ virtualenv -p python3 venv
$ source venv/bin/activate
(venv)$
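
If you want to confirm from Python that the virtual environment is really active, a minimal sketch follows. The function name in_virtualenv is hypothetical; the check relies only on standard sys attributes (older virtualenv releases set sys.real_prefix, while venv and newer virtualenv releases make sys.base_prefix differ from sys.prefix).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases keep the original prefix in sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtualenv active:", in_virtualenv())
```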

Next, install all the dependencies in the environment.

(venv)$ venv/bin/pip install -r requirements.txt

Clone the phenopacket-python repository:

$ git clone https://github.com/phenopackets/phenopacket-python.git

Install it in your virtual environment:

(venv)$ cd phenopacket-python
(venv)$ python setup.py install

Append the following export line (without the leading $ prompt) to the end of your ~/.profile or ~/.bash_profile file so that the phenopacket-python directory is on your PYTHONPATH:

$ export PYTHONPATH=$PYTHONPATH:[path of phenopacket-python directory]

For example:

$ export PYTHONPATH=$PYTHONPATH:/Users/Gauss/Home/phenopacket-python
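
To check that the export took effect in a new shell, something like the following sketch can help. The helper name on_pythonpath is hypothetical, and the directory is the example path used above; substitute your own.

```python
import os


def on_pythonpath(directory, pythonpath=None):
    """Return True if `directory` is one of the entries in PYTHONPATH."""
    if pythonpath is None:
        # Fall back to the current environment when no value is passed in.
        pythonpath = os.environ.get("PYTHONPATH", "")
    target = os.path.normpath(directory)
    # PYTHONPATH entries are separated by os.pathsep (":" on Unix-like systems).
    entries = [os.path.normpath(p) for p in pythonpath.split(os.pathsep) if p]
    return target in entries


print(on_pythonpath("/Users/Gauss/Home/phenopacket-python"))
```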

Now, install the application into the virtual environment.

(venv)$ cd phenopacket-scraper-core
(venv)$ python setup.py install

Usage

(venv)$ pps --help
(venv)$ pps scrape -u (url)

Example:

(venv)$ pps scrape -u http://molecularcasestudies.cshlp.org/content/early/2016/02/09/mcs.a000786.abstract
(venv)$ pps -q scrape -u (Url)

Title: Mutations in the substrate ...

Abstract:
We describe a large Lebanese fa...

HPO Terms:
Diffuse cerebellar atrophy
Generalized hypotoni...

To use a file containing a list of URLs as input:

(venv)$ pps scrape -f (Filename)

Example:

(venv)$ pps scrape -f testurls.txt

testurls.txt:

http://molecularcasestudies.cshlp.org/content/early/2016/02/09/mcs.a000786.abstract
http://molecularcasestudies.cshlp.org/content/2/2/a000703.abstract
http://molecularcasestudies.cshlp.org/content/2/2/a000620.abstract
http://molecularcasestudies.cshlp.org/content/2/1/a000661.abstract
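
Before handing such a file to the tool, it can be worth checking that it holds one well-formed URL per line. The sketch below is not part of pps; the helper name read_url_list and the one-URL-per-line rule are assumptions based on the sample file above.

```python
from urllib.parse import urlparse

# Contents of the sample testurls.txt shown above.
SAMPLE = """\
http://molecularcasestudies.cshlp.org/content/early/2016/02/09/mcs.a000786.abstract
http://molecularcasestudies.cshlp.org/content/2/2/a000703.abstract
http://molecularcasestudies.cshlp.org/content/2/2/a000620.abstract
http://molecularcasestudies.cshlp.org/content/2/1/a000661.abstract
"""


def read_url_list(text):
    """Return the URLs in `text`, one per non-blank line, rejecting malformed lines."""
    urls = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parsed = urlparse(line)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("line %d is not an http(s) URL: %r" % (number, line))
        urls.append(line)
    return urls


for url in read_url_list(SAMPLE):
    print(url)
```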

To scrape the required data from an HTML file:

(venv)$ pps scrape -d (Filename)
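
When calling the tool from a script, the three input modes can be wrapped in a small helper that assembles the argument list. This is only a sketch: build_scrape_command is a hypothetical name, it uses just the flags shown above (-u, -f, -d and the global -q), and the rule that exactly one input source is given is an assumption rather than documented pps behaviour.

```python
def build_scrape_command(url=None, url_file=None, html_file=None, quiet=False):
    """Build the argv list for `pps scrape` with exactly one input source."""
    sources = [("-u", url), ("-f", url_file), ("-d", html_file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) != 1:
        # Assumption: the sources are treated as mutually exclusive.
        raise ValueError("give exactly one of url, url_file, html_file")
    command = ["pps"]
    if quiet:
        command.append("-q")  # global option, placed before the subcommand
    command.append("scrape")
    command.extend(chosen[0])
    return command


print(build_scrape_command(url_file="testurls.txt"))
```

The resulting list can be passed to subprocess.run once pps is installed in the active environment.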

To store the output in a file:

(venv)$ pps scrape -u (Url) -o (Filename)
(venv)$ pps scrape -d (Filename) -o (Filename)
(venv)$ pps scrape -f (Input_filename) -o (Output_filename)

For now, this creates two files: (Filename)_abstract.txt contains the abstract, and (Filename)_hpo_terms.txt contains the HPO terms.
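
The two output files can then be read back for further processing. The sketch below assumes the naming scheme just described and, as a further assumption, that the HPO terms file holds one term per line; read_scrape_output is a hypothetical helper, not part of pps.

```python
from pathlib import Path


def read_scrape_output(prefix):
    """Read the abstract and HPO terms written for the output name `prefix`."""
    prefix = str(prefix)
    abstract_path = Path(prefix + "_abstract.txt")
    terms_path = Path(prefix + "_hpo_terms.txt")
    abstract = abstract_path.read_text(encoding="utf-8").strip()
    # Assumption: one HPO term per line; blank lines are ignored.
    terms = [
        line.strip()
        for line in terms_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return abstract, terms
```

For example, after running pps scrape -u (Url) -o result, calling read_scrape_output("result") would return the abstract text and the list of terms.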

SciGraph Annotation:

(venv)$ pps annotate -u (url)

[{u'start': 4, u'token': {u'terms': [u'TORC1 complex'], u'id': u'GO:0031931', u'categories': [u'cellular component']}, u'end': 10}, {u'start': 11, u'token': {u'terms': [u'inhibitor'], u'id': u'CHEBI:35222', u'categories': [u'chemical role']}, u'end': 20}, {u'start': 72, u'token': {u'terms': [u'multiple'], u'id': u'PATO:0002118', u'categories': [u'qua......

HPO Terms:
Neoplasm
Breast carcinoma
Carcinoma
increased carcinoma incidence
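
The annotation output is a list of records, each with start and end offsets and a token carrying terms, an id, and categories. As a rough sketch of how such records could be post-processed once loaded into Python, the snippet below groups terms by ontology prefix. It uses only the first two records from the sample above (the rest is truncated there); group_by_ontology is a hypothetical helper.

```python
from collections import defaultdict

# First two records copied from the sample output above.
annotations = [
    {"start": 4, "end": 10,
     "token": {"terms": ["TORC1 complex"], "id": "GO:0031931",
               "categories": ["cellular component"]}},
    {"start": 11, "end": 20,
     "token": {"terms": ["inhibitor"], "id": "CHEBI:35222",
               "categories": ["chemical role"]}},
]


def group_by_ontology(records):
    """Group (id, term) pairs by the ontology prefix of the id, e.g. 'GO'."""
    grouped = defaultdict(list)
    for record in records:
        token = record["token"]
        prefix = token["id"].split(":", 1)[0]
        for term in token["terms"]:
            grouped[prefix].append((token["id"], term))
    return dict(grouped)


for prefix, hits in sorted(group_by_ontology(annotations).items()):
    print(prefix, hits)
```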

Phenopacket Generation:

(venv)$ pps phenopacket -u (url)
(venv)$ pps phenopacket -d (html_filename)

{
"entities": [
  {
    "id": "http://molecularcasestudies.cshlp.org/content/2/1/a000661.abstract",
    "type": "paper"
  }
],
"id": "gauss-packet",
"phenotype_profile": [
  {.....
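
The generated phenopacket is JSON, so it can be inspected with the standard json module. The sketch below uses only the fields visible in the truncated sample above; the contents of phenotype_profile are elided there, so the list is left empty here. summarize_packet is a hypothetical helper.

```python
import json

# Only the fields visible in the sample above; "phenotype_profile" is elided
# in the source, so it is left empty in this illustration.
SAMPLE_PACKET = """
{
  "entities": [
    {
      "id": "http://molecularcasestudies.cshlp.org/content/2/1/a000661.abstract",
      "type": "paper"
    }
  ],
  "id": "gauss-packet",
  "phenotype_profile": []
}
"""


def summarize_packet(text):
    """Return the packet id, its entities, and the number of phenotype entries."""
    packet = json.loads(text)
    return {
        "packet_id": packet.get("id"),
        "entities": [(e.get("id"), e.get("type")) for e in packet.get("entities", [])],
        "n_phenotype_entries": len(packet.get("phenotype_profile", [])),
    }


summary = summarize_packet(SAMPLE_PACKET)
print(summary["packet_id"], summary["entities"])
```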

To store the output in a file:

(venv)$ pps annotate -u (Url) -o (Filename)
(venv)$ pps phenopacket -u (Url) -o (Filename)

Cleaning Up

Finally, when done, deactivate your virtual environment:

(venv)$ deactivate
$

phenopacket-scraper-core's People

Contributors

satwik77, doctorbud, cmungall

phenopacket-scraper-core's Issues

Add a .travis.yml file

We should have Travis CI run on every commit and for every pull request.

As a first pass, I suggest just a very simple test.
