The Problem: Classify Wikipedia Disease Articles
We provide a sample of articles taken from Wikipedia. The sample contains many kinds of articles; one flavor is articles that describe a disease. The data are HTML dumps of Wikipedia articles. We give you a labelled set of disease articles (positives) and non-disease articles (negatives).
- Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
- Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease, e.g. symptoms, causes, prognosis, prevention, treatment, relevant drugs, and human/non-human susceptibility.
The data: The directory contains Wikipedia article HTML dumps, direct wgets of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive set contains 3,693 disease articles; the negative set contains 10,000 articles.
Here is the data set: wikipedia_diseases.zip
Edge cases:
- Your classifier might misclassify drug articles (e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin) as disease articles. Check whether that is the case, and optionally try to fix the misclassification.
- Exclude broad articles that describe generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: once you exclude these, what does your classifier do with cancer?
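One way to handle the second edge case is to filter articles by title before training. A minimal sketch, assuming titles are available for each dump; the GENERIC set below is illustrative and would need to be extended:

```python
# Hedged sketch: drop overly generic class-of-disease articles by title.
# The GENERIC set is illustrative, not exhaustive.
GENERIC = {"genetic disorder", "infection", "bacteria", "virus", "mutation"}

def keep_article(title: str) -> bool:
    """Return False for broad class-of-disease articles."""
    return title.strip().lower() not in GENERIC

titles = ["Malaria", "Infection", "Virus", "Parkinson's disease"]
filtered = [t for t in titles if keep_article(t)]
# filtered == ["Malaria", "Parkinson's disease"]
```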
Solution
Components:
- HTML parsing: extract the sentences of article text from the Wikipedia HTML dump.
- Baseline model: Logistic Regression Classifier
- DNN model: a Keras Sequential model is used for both Part 1 (classification) and Part 2 (disease-name extraction), with a separate model for each part.
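The first two components can be sketched end to end: strip the HTML down to text, then fit the logistic-regression baseline on TF-IDF features. This is a minimal illustration, assuming scikit-learn; the toy documents stand in for the positive/ and negative/ dumps and are not from the actual dataset:

```python
# Sketch of HTML parsing + the logistic-regression baseline.
# Toy documents below are illustrative, not from the dataset.
import re
from html.parser import HTMLParser

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Flatten an HTML dump to whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Toy training set standing in for the positive/ and negative/ directories.
docs = [
    "<p>Malaria is a mosquito-borne infectious disease causing fever.</p>",
    "<p>Autism is a condition affecting social communication.</p>",
    "<p>The Eiffel Tower is a wrought-iron lattice tower in Paris.</p>",
    "<p>Python is a general-purpose programming language.</p>",
]
labels = [1, 1, 0, 0]  # 1 = disease, 0 = non-disease

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit([html_to_text(d) for d in docs], labels)
```

On the real data the same pipeline would be fit on the parsed contents of all 13,693 files, with a held-out split for evaluation.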
Instructions to run
A runner module is provided; the DNN models are pre-trained on a small subset of the training data (in the absence of adequate compute).
$ python runner.py
(enter text to be classified)
(if the text is classified as a disease, the runner also tries to identify the disease)
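The two-stage flow of the runner can be sketched as follows. This is a hypothetical shape, not the actual runner.py: `disease_clf` and `disease_ner` are assumed names for the two trained models, stubbed here with trivial functions so the control flow is clear:

```python
# Hypothetical sketch of the runner's two-stage flow:
# classify first, then extract the disease name only on positives.
def classify(text: str, disease_clf, disease_ner):
    """Return (is_disease, disease_name_or_None)."""
    if disease_clf(text):
        return True, disease_ner(text)
    return False, None

# Trivial stubs standing in for the two trained Keras models.
is_disease = lambda t: "disease" in t.lower() or "fever" in t.lower()
name_of = lambda t: t.split()[0]  # naive: take the first token

print(classify("Malaria causes fever.", is_disease, name_of))
# (True, 'Malaria')
```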
Conclusions
The model performs reasonably well, but with caveats. Future work includes:
- experimenting with pre-trained word embeddings such as GloVe
- convolutional neural networks and deeper NLP architectures