The Problem: Classify Wikipedia Disease Articles

We provide a sample of articles taken from Wikipedia. There are lots of different kinds of articles, and one flavor is those that describe a disease. The data are html dumps of wikipedia articles. We give you a labelled set of disease articles (positives), and non-diseases articles (negatives).

Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease. E.g. information to consider are symptoms, causes, prognosis, prevention, treatment, relevant drugs, human/non-human susceptibility.

The data: The directory contains wikipedia article html dumps. There are direct wgets of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive dataset is 3,693 articles about diseases, and the negative dataset is 10,000 articles.

Here is the data set: wikipedia_diseases.zip

Edge cases:

Your classifier might misclassify drug articles as disease articles, e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin. Check if that is the case, and optionally try to fix that mis-classification.
Exclude broad articles that talk about generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: when you exclude these, what does your classifier do for cancer?

FAQs

How long will this take?
That is yours to commit. Just get back within 7 days from the time this repository is shared with you.

What language should I use?
Whatever you want.

Can I use X?
Unless the tool directly answers the question "are these disease articles?". You can use whatever you want.

How should I solve it? There are many solutions to non-convex problems. We are seeking a deep learning engineer, so deep learning models are preferrable.

Through this we want to understand the core skills and methods you are developing as a problem solver.

What should I deliver?
Your code, results, and brief instructions on how to build and run your solution.

What is this again? This is a simulation of a project that you might execute. Our goal is that by collaborating on this repo we will both learn from experience.

Is there an example solution?

Naive Bayes Classifier - https://github.com/shkr/wikiclassifier/blob/master/wikiclassifier.pdf

shkr / project-sim-ram-kumar Goto Github PK

project-sim-ram-kumar's Introduction

The Problem: Classify Wikipedia Disease Articles

FAQs

project-sim-ram-kumar's People

Contributors

Watchers

project-sim-ram-kumar's Issues

Unable to download the dataset

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent