
project-sim-dishamisal's Introduction

The Problem: Classify Wikipedia Disease Articles

We provide a sample of articles taken from Wikipedia. There are many kinds of articles, and one kind describes a disease. The data are HTML dumps of Wikipedia articles. We give you a labelled set of disease articles (positives) and non-disease articles (negatives).

  • Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
  • Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease. E.g. information to consider are symptoms, causes, prognosis, prevention, treatment, relevant drugs, human/non-human susceptibility.
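One simple starting point for the name extraction in Part 2 is the page title. The sketch below is an illustration only, not the approach used in this repository: it assumes each dump has a `<title>` of the form "Malaria - Wikipedia, the free encyclopedia", and the names `TitleParser` and `extract_article_name` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is read; later ones are ignored.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_article_name(html):
    """Return the article name from the page title, minus the site suffix."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Assumption: the title ends with " - Wikipedia..." (hyphen or dash).
    name = re.split(r"\s+[-\u2013\u2014]\s+Wikipedia", parser.title, maxsplit=1)[0]
    # Drop a trailing disambiguator such as "(disease)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    return name.strip()
```

The title gives only the name; symptoms, causes, and treatments would need section-level parsing of the article body.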

The data: The directory contains Wikipedia article HTML dumps. These are direct wgets of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive dataset is 3,693 articles about diseases, and the negative dataset is 10,000 articles.
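Given the positive/ and negative/ layout described above, loading the data could look roughly like this. It is a minimal sketch using only the standard library; `TextExtractor`, `html_to_text`, and `load_dataset` are hypothetical names, and the files are assumed to be UTF-8 HTML.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def load_dataset(root):
    """Return (texts, labels): label 1 for positive/, 0 for negative/."""
    texts, labels = [], []
    for dirname, label in (("positive", 1), ("negative", 0)):
        for path in sorted(Path(root, dirname).iterdir()):
            if path.is_file():
                raw = path.read_text(encoding="utf-8", errors="replace")
                texts.append(html_to_text(raw))
                labels.append(label)
    return texts, labels
```

With 3,693 positives against 10,000 negatives the classes are imbalanced, so a stratified train/test split and per-class metrics are worth considering.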

Here is the data set: wikipedia_diseases.zip

Edge cases:

  1. Your classifier might misclassify drug articles as disease articles, e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin. Check if that is the case, and optionally try to fix that misclassification.
  2. Exclude broad articles that talk about generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: when you exclude these, what does your classifier do for cancer?
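The drug-article check in the first edge case can be sketched as a small helper that runs a classifier over a set of known non-disease articles and reports which ones it flags. Here `predict` is a stand-in for whatever trained model is in use (any callable returning 1 for disease and 0 otherwise); both function names are hypothetical.

```python
def flagged_as_disease(predict, articles):
    """Return the names of articles that `predict` labels as disease (1).

    `articles` maps an article name to its plain text.
    """
    return sorted(name for name, text in articles.items() if predict(text) == 1)


def false_positive_rate(predict, articles):
    """Fraction of known non-disease articles labelled as disease."""
    if not articles:
        return 0.0
    return len(flagged_as_disease(predict, articles)) / len(articles)
```

Running it on the drug articles listed above (Penicillin, Paracetamol, L-DOPA, Erythromycin) shows whether the misclassification occurs. The same helper applies to the broad-category articles in the second edge case.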

Solution

Components:

  • HTML Parsing: Parse each Wikipedia article into sentences.
  • Baseline model: Logistic regression classifier
  • DNN model: Keras Sequential models, one for Part 1 and a separate one for Part 2
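To make the baseline concrete, here is a minimal bag-of-words logistic regression trained with stochastic gradient descent. It is a sketch under assumptions, not the repository's implementation: the README does not say which features or library the baseline uses, and this version sticks to the standard library so it runs anywhere. The names `tokenize`, `train_logistic_regression`, and `predict_proba` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def _sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def _score(counts, weights, bias):
    return bias + sum(weights.get(tok, 0.0) * c for tok, c in counts.items())


def train_logistic_regression(texts, labels, epochs=200, lr=0.5):
    """Fit token weights and a bias with per-example gradient steps."""
    weights, bias = {}, 0.0
    features = [Counter(tokenize(t)) for t in texts]
    for _ in range(epochs):
        for counts, y in zip(features, labels):
            # Gradient of the log loss with respect to the score z is (p - y).
            error = _sigmoid(_score(counts, weights, bias)) - y
            bias -= lr * error
            for tok, c in counts.items():
                weights[tok] = weights.get(tok, 0.0) - lr * error * c
    return weights, bias


def predict_proba(text, weights, bias):
    """Return the estimated probability that `text` describes a disease."""
    return _sigmoid(_score(Counter(tokenize(text)), weights, bias))
```

On the real data set a library implementation with regularization and TF-IDF features would likely be a better fit; the sketch only shows the mechanics.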

Instructions to run

A runner module is provided, and the DNN model is pre-trained on a small subset of the training data (in the absence of adequate compute).

$ python runner.py
(enter the text to be classified)
(if the text is classified as a disease, the runner also tries to identify the disease)

Conclusions

The model performs satisfactorily, but with caveats. Future work includes:

  • Experimentation with word embeddings and GloVe bag-of-words
  • Convolutional Neural Networks and Deep-NLP

project-sim-dishamisal's People

Contributors

dishamisal, shkr
