project-sim-ram-kumar's Introduction

The Problem: Classify Wikipedia Disease Articles

We provide a sample of articles taken from Wikipedia. There are many kinds of articles, and one kind describes a disease. The data are HTML dumps of Wikipedia articles. We give you a labelled set of disease articles (positives) and non-disease articles (negatives).

  • Part 1: Use this data set to create a classifier that can accurately predict whether an article describes a disease.
  • Part 2: Extract the name of the disease. Optionally, include as much information as you can about the disease, e.g. symptoms, causes, prognosis, prevention, treatment, relevant drugs, and human/non-human susceptibility.
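One possible starting point for Part 2 is to read the disease name from the page's `<title>` element. The sketch below uses only the Python standard library. It assumes the dumps keep Wikipedia's usual title format ("Malaria - Wikipedia"); the names `TitleParser` and `extract_article_name` are illustrative and not part of this repository.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_article_name(html):
    """Return the article name, or "" if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    # Assumption: titles end with a dash followed by "Wikipedia...".
    # Adjust the pattern if the dumps use a different suffix.
    return re.sub("\\s+[-\u2013\u2014]\\s+Wikipedia.*$", "", title)
```

For example, `extract_article_name("<title>Malaria - Wikipedia</title>")` returns `"Malaria"`. The optional fields (symptoms, causes, and so on) need more than the title and are not covered here.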

The data: The directory contains Wikipedia article HTML dumps. These are direct wget downloads of the articles, e.g. malaria, autism, Parkinson's disease. They are organized into two directories: positive/ and negative/. The positive dataset contains 3,693 articles about diseases, and the negative dataset contains 10,000 articles.
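A minimal sketch of loading the two directories into labelled lists might look like this. It assumes the archive unpacks into a root folder containing `positive/` and `negative/` with one file per article, and that the files are mostly UTF-8. The function names are illustrative.

```python
from pathlib import Path


def load_dataset(root):
    """Return (paths, labels); label 1 = disease article, 0 = other."""
    paths, labels = [], []
    for dirname, label in (("positive", 1), ("negative", 0)):
        # sorted() makes the order reproducible across runs.
        for path in sorted(Path(root, dirname).iterdir()):
            if path.is_file():
                paths.append(path)
                labels.append(label)
    return paths, labels


def read_html(path):
    """Read one dump as text."""
    # Assumption: UTF-8 encoding. errors="replace" avoids a crash on
    # any bytes that are not valid UTF-8.
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

Keeping paths and labels separate from the file contents lets you read the HTML lazily, which helps with a dataset of roughly 13,700 files. Note that the classes are imbalanced (3,693 positives against 10,000 negatives), so accuracy alone can be misleading.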

Here is the data set: wikipedia_diseases.zip

Edge cases:

  1. Your classifier might misclassify drug articles as disease articles, e.g. Penicillin, Paracetamol, L-DOPA, Erythromycin. Check if that is the case, and optionally try to fix that misclassification.
  2. Exclude broad articles that talk about generic classes of diseases, e.g. genetic disorders, infection, bacteria, virus, mutation. Fun: when you exclude these, what does your classifier do for cancer?
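One way to check the first edge case is to run the classifier over a hand-picked set of known non-disease articles and count how many it flags. The sketch below assumes your classifier can be wrapped as a function from text to a boolean. The helper name is hypothetical and the article texts in the usage example are placeholders, not real Wikipedia content.

```python
def false_positive_rate(predict, articles):
    """Measure how often known non-disease articles are flagged as diseases.

    predict:  callable taking article text and returning True for "disease".
    articles: dict mapping article name -> article text.
    Returns (rate, names_of_flagged_articles).
    """
    flagged = [name for name, text in articles.items() if predict(text)]
    rate = len(flagged) / len(articles) if articles else 0.0
    return rate, flagged


# Usage with a toy keyword rule standing in for a real classifier.
drug_articles = {
    "Penicillin": "placeholder text mentioning infection",
    "Paracetamol": "placeholder text mentioning pain relief",
}
rate, flagged = false_positive_rate(lambda t: "infection" in t, drug_articles)
print(rate, flagged)
```

The same helper can be pointed at the broad articles from the second edge case (genetic disorders, infection, and so on) to see which of them the classifier still accepts.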

FAQs

How long will this take?
That is up to you. Just get back to us within 7 days from the time this repository is shared with you.

What language should I use?
Whatever you want.

Can I use X?
You can use whatever you want, unless the tool directly answers the question "are these disease articles?".

How should I solve it?
There are many solutions to non-convex problems. We are seeking a deep learning engineer, so deep learning models are preferable.

Through this we want to understand the core skills and methods you are developing as a problem solver.

What should I deliver?
Your code, results, and brief instructions on how to build and run your solution.

What is this again?
This is a simulation of a project that you might execute. Our goal is that by collaborating on this repo we will both learn from experience.

Is there an example solution?

  1. Naive Bayes Classifier - https://github.com/shkr/wikiclassifier/blob/master/wikiclassifier.pdf
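For orientation, here is a minimal multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is not the implementation from the linked document, only an illustration of the general technique. It assumes the HTML has already been reduced to plain text, and the tokenizer is deliberately crude.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            counts = self.word_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Words that never appeared in training are ignored at prediction time. Since the brief prefers deep learning models, a classifier like this is probably most useful as a baseline to compare against.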

project-sim-ram-kumar's People

Contributors

billaram, shkr
