The Problem: Classify Wikipedia Disease Articles
We provide a sample of articles taken from Wikipedia. The sample contains many kinds of articles; one flavor is articles that describe a disease. The data are HTML dumps of Wikipedia articles. We give you a labelled set of disease articles (positives) and non-disease articles (negatives).
- Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
- Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease, e.g. symptoms, causes, prognosis, prevention, treatment, relevant drugs, and human/non-human susceptibility.
The data: The directory contains Wikipedia article HTML dumps, direct wgets of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive set contains 3,693 disease articles; the negative set contains 10,000 articles.
Here is the data set: wikipedia_diseases.zip
Edge cases:
- Your classifier might misclassify drug articles (e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin) as disease articles. Check whether that is the case, and optionally try to fix the misclassification.
- Exclude broad articles that describe generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: once you exclude these, what does your classifier do with cancer?
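One way to handle the second edge case is to filter articles by title before training. A minimal sketch, assuming titles are available for each dump; the GENERIC set below is illustrative and would need to be extended:

```python
# Hedged sketch: drop overly generic class-of-disease articles by title.
# The GENERIC set is illustrative, not exhaustive.
GENERIC = {"genetic disorder", "infection", "bacteria", "virus", "mutation"}

def keep_article(title: str) -> bool:
    """Return False for broad class-of-disease articles."""
    return title.strip().lower() not in GENERIC

titles = ["Malaria", "Infection", "Virus", "Parkinson's disease"]
filtered = [t for t in titles if keep_article(t)]
# filtered == ["Malaria", "Parkinson's disease"]
```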
Solution
Components:
- HTML parsing: extract the sentences of article text from the Wikipedia HTML dump.
- Baseline model: Logistic Regression Classifier
- DNN model: a Keras Sequential model is used for both Part 1 (classification) and Part 2 (disease-name extraction), with a separate model for each part.
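The first two components can be sketched end to end: strip the HTML down to text, then fit the logistic-regression baseline on TF-IDF features. This is a minimal illustration, assuming scikit-learn; the toy documents stand in for the positive/ and negative/ dumps and are not from the actual dataset:

```python
# Sketch of HTML parsing + the logistic-regression baseline.
# Toy documents below are illustrative, not from the dataset.
import re
from html.parser import HTMLParser

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Flatten an HTML dump to whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Toy training set standing in for the positive/ and negative/ directories.
docs = [
    "<p>Malaria is a mosquito-borne infectious disease causing fever.</p>",
    "<p>Autism is a condition affecting social communication.</p>",
    "<p>The Eiffel Tower is a wrought-iron lattice tower in Paris.</p>",
    "<p>Python is a general-purpose programming language.</p>",
]
labels = [1, 1, 0, 0]  # 1 = disease, 0 = non-disease

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit([html_to_text(d) for d in docs], labels)
```

On the real data the same pipeline would be fit on the parsed contents of all 13,693 files, with a held-out split for evaluation.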
Instructions to run
A runner module is provided; the DNN models are pre-trained on a small subset of the training data (in the absence of adequate compute).
$ python runner.py
(enter text to be classified)
(if the text is classified as a disease, the runner also tries to identify the disease)
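The two-stage flow of the runner can be sketched as follows. This is a hypothetical shape, not the actual runner.py: `disease_clf` and `disease_ner` are assumed names for the two trained models, stubbed here with trivial functions so the control flow is clear:

```python
# Hypothetical sketch of the runner's two-stage flow:
# classify first, then extract the disease name only on positives.
def classify(text: str, disease_clf, disease_ner):
    """Return (is_disease, disease_name_or_None)."""
    if disease_clf(text):
        return True, disease_ner(text)
    return False, None

# Trivial stubs standing in for the two trained Keras models.
is_disease = lambda t: "disease" in t.lower() or "fever" in t.lower()
name_of = lambda t: t.split()[0]  # naive: take the first token

print(classify("Malaria causes fever.", is_disease, name_of))
# (True, 'Malaria')
```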
Conclusions
The model performs reasonably well, but with caveats. Future work includes:
- experimenting with pre-trained word embeddings such as GloVe
- convolutional neural networks and deeper NLP architectures