We provide a sample of articles taken from Wikipedia. There are lots of different kinds of articles, and one flavor is those that describe a disease. The data are html dumps of wikipedia articles. We give you a labelled set of disease articles (positives), and non-diseases articles (negatives).
- Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
- Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease. E.g. information to consider are symptoms, causes, prognosis, prevention, treatment, relevant drugs, human/non-human susceptibility.
The data: The directory contains wikipedia article html dumps. There are direct wgets of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive dataset is 3,693 articles about diseases, and the negative dataset is 10,000 articles.
Here is the data set: wikipedia_diseases.zip
Edge cases:
- Your classifier might misclassify drug articles as disease articles, e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin. Check if that is the case, and optionally try to fix that mis-classification.
- Exclude broad articles that talk about generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: when you exclude these, what does your classifier do for cancer?
How long will this take?
That is yours to commit. Just get back within 7 days from the time this repository is shared with you.
What language should I use?
Whatever you want.
Can I use X?
Unless the tool directly answers the question "are these disease articles?". You can use whatever you want.
How should I solve it? There are many solutions to non-convex problems. We are seeking a deep learning engineer, so deep learning models are preferrable.
Through this we want to understand the core skills and methods you are developing as a problem solver.
What should I deliver?
Your code, results, and brief instructions on how to build and run your solution.
What is this again? This is a simulation of a project that you might execute. Our goal is that by collaborating on this repo we will both learn from experience.
Is there an example solution?
- Naive Bayes Classifier - https://github.com/shkr/wikiclassifier/blob/master/wikiclassifier.pdf