mayoclinic-scrapper

This project no longer works. It also infringes Mayo Clinic's Terms of Service.

Scraping disease information from Mayo Clinic and saving it in Neo4j

Setup | Usage | Data format | Possible improvements

Setup

This project was developed with Python 3 (Python 2 may work). It depends on Scrapy and the Neo4j Bolt Driver for Python. To install all the required dependencies, run the following command in the project's root directory (using a virtual environment is recommended):

pip install -r requirements

A running Neo4j instance is needed. For development purposes, the easiest way of starting an instance is running the official Neo4j Docker image using this command:

docker run --publish=7474:7474 --publish=7687:7687 --env=NEO4J_AUTH=none neo4j

Usage

There are two scripts: scraper.py and neo4j_importer.py. The first takes no parameters; it extracts disease data from Mayo Clinic's diseases and conditions index and generates a JSON file. The second receives this file as a parameter and imports the data into the Neo4j instance at localhost:7687 (the Bolt port). If you did not use the command in the previous section to start Neo4j, adjust the connection settings in the second script accordingly.

Finally, open the Neo4j dashboard and start playing! For example, the following GIF shows how to retrieve the causes that are related to more than 3 diseases:

demo
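
The exact query used in the GIF is not reproduced in this README. As a rough sketch, a Cypher query along the lines of `CYPHER_SHARED_CAUSES` below should return the same kind of result, and the Python function performs the equivalent count directly on the JSON file produced by the first script. The constant name, the function name `shared_causes`, and its `more_than` parameter are illustrative assumptions, not part of the project.

```python
import json
import sys
from collections import defaultdict

# Assumed Cypher equivalent (not taken from the project): causes linked to
# more than 3 diseases, based on the schema described in "Data format".
CYPHER_SHARED_CAUSES = """
MATCH (d:Disease)-[:CAUSED_BY]->(c:Cause)
WITH c, count(DISTINCT d) AS diseases
WHERE diseases > 3
RETURN c.name AS cause, diseases
ORDER BY diseases DESC
"""


def shared_causes(diseases, more_than=3):
    """Map cause name -> number of diseases, for causes linked to more
    than `more_than` distinct diseases."""
    linked = defaultdict(set)  # cause_id -> set of disease_ids
    names = {}                 # cause_id -> cause_name
    for disease in diseases:
        for cause in disease.get("causes", []):
            linked[cause["cause_id"]].add(disease["disease_id"])
            names[cause["cause_id"]] = cause["cause_name"]
    return {
        names[cause_id]: len(disease_ids)
        for cause_id, disease_ids in linked.items()
        if len(disease_ids) > more_than
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the JSON file generated by the scraper.
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name, total in shared_causes(json.load(handle)).items():
            print(f"{name}: {total}")
```

Counting on the JSON file is handy for a quick sanity check before importing; once the data is in Neo4j, the Cypher form is the more natural choice.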

Data format

The file generated by the first script is a JSON array containing the extracted diseases. An example of a disease is:

{
   "disease_id": 0,
   "disease_name": "Sweet's syndrome",
   "causes": [
      {
         "cause_id": 0,
         "cause_name": "Sex"
      },
      {
         "cause_id": 1,
         "cause_name": "Age"
      },
      {
         "cause_id": 2,
         "cause_name": "Cancer"
      }
   ],
   "risk_factors": [
      {
         "risk_id": 0,
         "risk_name": "Sex"
      },
      {
         "risk_id": 1,
         "risk_name": "Age"
      },
      {
         "risk_id": 2,
         "risk_name": "Cancer"
      }
   ]
}

The data from this file is inserted in Neo4j with the following schema:

(d:Disease { id, name })-[:CAUSED_BY]->(:Cause { id, name })
(d:Disease { id, name })-[:HAS_RISK]->(:RiskFactor { id, name })
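
The source of neo4j_importer.py is not shown here, so the following is only a minimal sketch of how the JSON records could be mapped onto this schema. It assumes the `$parameter` Cypher syntax and uses `MERGE` so that a cause or risk factor shared by several diseases becomes a single node. The names `CAUSE_QUERY`, `RISK_QUERY`, and `build_statements` are hypothetical.

```python
# Assumed Cypher statements; MERGE avoids duplicating nodes and relationships.
CAUSE_QUERY = (
    "MERGE (d:Disease {id: $disease_id}) "
    "MERGE (c:Cause {id: $cause_id}) "
    "MERGE (d)-[:CAUSED_BY]->(c) "
    "SET d.name = $disease_name, c.name = $cause_name"
)
RISK_QUERY = (
    "MERGE (d:Disease {id: $disease_id}) "
    "MERGE (r:RiskFactor {id: $risk_id}) "
    "MERGE (d)-[:HAS_RISK]->(r) "
    "SET d.name = $disease_name, r.name = $risk_name"
)


def build_statements(diseases):
    """Yield (query, parameters) pairs, one per disease/cause and
    disease/risk-factor link found in the scraped JSON.

    Note: a disease with no causes and no risk factors yields nothing
    in this sketch, so it would not be created in the graph.
    """
    for disease in diseases:
        base = {
            "disease_id": disease["disease_id"],
            "disease_name": disease["disease_name"],
        }
        for cause in disease.get("causes", []):
            yield CAUSE_QUERY, {**base, **cause}
        for risk in disease.get("risk_factors", []):
            yield RISK_QUERY, {**base, **risk}
```

With the official Python driver, each pair could then be passed to `session.run(query, parameters)` inside an open session. Keeping the statement building separate from the driver calls makes the mapping easy to inspect without a running Neo4j instance.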

Possible improvements

The scraping is imperfect. Some causes appear under different names and should be treated as the same one. For example, Smoking can also appear as Smokin or You smoke. It would be nice to extract entities from the text and perform some fuzzy matching, perhaps using NLTK.

The same idea could be applied to extracting symptoms, since on the web page the symptoms are given as free text, whereas causes and risk factors are bullet points.
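
As a rough illustration of the fuzzy-matching idea, the sketch below groups near-identical cause names using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in for the NLTK-based approach suggested above, not an implementation of it. The function names and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def group_similar(names, threshold=0.85):
    """Map every name to a canonical representative.

    The first name seen in each group becomes its representative; a later
    name joins a group when its similarity ratio to the representative
    reaches `threshold` (an arbitrary value for this sketch).
    """
    representatives = []
    mapping = {}
    for name in names:
        key = normalize(name)
        for rep in representatives:
            if SequenceMatcher(None, key, normalize(rep)).ratio() >= threshold:
                mapping[name] = rep
                break
        else:
            # No close match found: this name starts a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping


if __name__ == "__main__":
    for original, canonical in group_similar(
        ["Smoking", "Smokin", "You smoke", "Age"]
    ).items():
        print(f"{original!r} -> {canonical!r}")
```

Character-level similarity catches truncations such as Smokin, but a rephrasing such as You smoke is too different to be merged at this threshold. Cases like that are where entity extraction would be needed.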
