
krowdd's Introduction

Expert estimates for feature relevance are imperfect

Our paper shows that expert estimates for feature relevance are imperfect. This is why we developed KrowDD: a crowdsourcing approach to estimate feature relevance before obtaining data.

Keywords

Feature Ranking, Feature Selection, Crowdsourcing, Wisdom of Crowds, Machine Learning.

Functional Prototype

A working prototype is available at http://mbuehler.ch/krowdd. For more details, see the section "Web Application" below, or download the paper from the UZH website.

Datasets (Preprocessed)

The preprocessed datasets are stored as .csv here: paper_plots-and-data/datasets

AUC for Each Condition

This data was used to create Tables 2, 3, and 4, as well as Figures 2, 3, and 4. The calculated AUC scores (Naive Bayes) for each condition are stored as .json here: paper_plots-and-data/datasets. The files are structured as follows: {CONDITION: {NUMBER_OF_FEATURES: [AUC1, AUC2, AUC3, ...], ...}, ...}, where CONDITION is one of the following:

  • 'Data Scientists': data scientists recruited via Upwork. There is one AUC score per number of features for each expert.
  • 'Domain Experts': manually recruited domain experts. There is one AUC score per number of features for each expert.
  • 'KrowDD': AUC scores for the approach proposed here.
  • 'Laypeople': AUC scores for rankings from Amazon Mechanical Turk.
  • 'Random': AUC scores for a random selection of features.
  • 'Actual': AUC scores when using the actual Information Gain ranking from the dataset.
  • 'Best': best possible* AUC score for the chosen classifier (Naive Bayes).
  • 'Worst': worst possible* AUC score for the chosen classifier (Naive Bayes).

*Please note that each AUC score is the mean of a 10-fold cross-validation. It is therefore possible that this score is slightly higher or lower than the true best (or worst) score.

NUMBER_OF_FEATURES denotes the number of features used to train the classifier. For example, the AUC scores for NUMBER_OF_FEATURES=3 have been calculated using the best three features according to the given condition.
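To make this concrete, here is a minimal Python sketch of the evaluation described above: take the top-k features of a ranking, train a Naive Bayes classifier, and report the mean AUC over a 10-fold cross-validation. The dataset path, column names, the ranking, the .json file name, and the Gaussian Naive Bayes variant are assumptions for illustration, not taken from the repository.

```python
import json
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical inputs: a preprocessed dataset and a feature ranking
# (best feature first) produced by one of the conditions above.
df = pd.read_csv("paper_plots-and-data/datasets/example.csv")  # placeholder path
ranking = ["feature_a", "feature_b", "feature_c", "feature_d"]  # placeholder ranking
target = "label"  # placeholder target column

k = 3  # NUMBER_OF_FEATURES
X = df[ranking[:k]]
y = df[target]

# Mean AUC over a 10-fold cross-validation, as described in the footnote above.
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="roc_auc")
print(f"NUMBER_OF_FEATURES={k}: mean AUC = {scores.mean():.3f}")

# Reading the published scores back (file name is a placeholder). Note that
# JSON object keys are strings, so NUMBER_OF_FEATURES is looked up as "3".
with open("paper_plots-and-data/datasets/aucs.json") as f:
    aucs = json.load(f)
print(aucs["KrowDD"]["3"])  # list of AUC scores for the top-3 KrowDD features
```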

Figure 5: Crowd Estimate Errors

Data for creating Figure 5 can be found as .json here. The file is structured as follows: {NUMBER_OF_ANSWERS: {INDEX_0: DELTA_0, INDEX_1: DELTA_1, ...}, ...}. NUMBER_OF_ANSWERS denotes the number of estimates sampled from all acquired crowd estimates (without replacement). DELTA_X denotes the average absolute difference between the means calculated from the actual dataset and the aggregated crowd estimates. INDEX_X is an internal index.
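The following Python sketch illustrates how one such DELTA value can be computed: sample NUMBER_OF_ANSWERS estimates without replacement, aggregate them by averaging, and measure the absolute error against the mean from the actual dataset, averaged over many repetitions. The estimates, the true mean, and the repetition count are made up for illustration.

```python
import random

crowd_estimates = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50]  # placeholder crowd estimates
true_mean = 0.45  # placeholder mean computed from the actual dataset

def delta(n_answers: int, repetitions: int = 1000) -> float:
    """Average absolute error of the aggregated crowd estimate for n_answers samples."""
    errors = []
    for _ in range(repetitions):
        sample = random.sample(crowd_estimates, n_answers)  # without replacement
        aggregated = sum(sample) / n_answers
        errors.append(abs(aggregated - true_mean))
    return sum(errors) / repetitions

# Error as a function of NUMBER_OF_ANSWERS:
print({n: round(delta(n), 3) for n in range(1, len(crowd_estimates) + 1)})
```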

Web Application

Try KrowDD online! At http://mbuehler.ch/krowdd you can upload your own data and obtain a relevance estimate for each feature. You only need the following:

  • Job title: a title for your job.
  • E-mail: your e-mail address.
  • AMT access key ID and AMT secret access key: your Amazon Mechanical Turk access key ID and the corresponding secret access key. These credentials are required to collect the crowd estimates on Amazon Mechanical Turk.
  • CSV file: a CSV file with seven columns: Feature, Question P (X|Y = 0), Question P (X|Y = 1), Question P (X), P (X|Y = 0), P (X|Y = 1), P (X). For each feature, you can either provide a description of the (conditional) mean (a Question column) or directly enter a value (the corresponding value column); at least one field of each sibling pair (e.g. Question P (X|Y = 1) and P (X|Y = 1)) must be filled. KrowDD only queries the crowd for feature means for which no value has been provided. A minimal example file is sketched after this list.
  • Target mean (optional): you can either define a target mean yourself (e.g. if the target variable is known to be balanced, a target mean of 0.5) or have it queried from the crowd.
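Here is a minimal sketch of such an input file, written with Python's csv module. Only the column headers follow the format described above; the feature names, questions, and values are hypothetical.

```python
import csv

header = ["Feature", "Question P (X|Y = 0)", "Question P (X|Y = 1)",
          "Question P (X)", "P (X|Y = 0)", "P (X|Y = 1)", "P (X)"]
rows = [
    # 'age>50': descriptions only -> KrowDD queries all three means from the crowd.
    ["age>50",
     "What fraction of healthy patients are older than 50?",
     "What fraction of sick patients are older than 50?",
     "What fraction of all patients are older than 50?",
     "", "", ""],
    # 'smoker': values already known -> nothing is queried for this feature.
    ["smoker", "", "", "", "0.2", "0.6", "0.3"],
]

with open("krowdd_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```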

The estimates acquired from the crowd and the feature data (Information Gain and probability estimates) can be downloaded as CSV files.

Questions / Feedback

For questions or feedback, don't hesitate to contact me.

Screenshots

  • New Job View
  • Job Status View
  • Job Result View


krowdd's Issues

Validation of Job

Validate the job (specifically the input CSV) and return a helpful error message to the user. So far, an invalid CSV results in a server error (500).
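A minimal, framework-agnostic sketch of such a validation step, assuming the seven-column format described under "Web Application"; the function name and message wording are illustrative only.

```python
import csv

REQUIRED_COLUMNS = ["Feature", "Question P (X|Y = 0)", "Question P (X|Y = 1)",
                    "Question P (X)", "P (X|Y = 0)", "P (X|Y = 1)", "P (X)"]

def validate_job_csv(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the CSV is valid."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"Missing columns: {', '.join(missing)}"]
        for i, row in enumerate(reader, start=2):  # start=2: header is line 1
            # Each sibling pair (question/value) needs at least one filled field.
            for question, value in [("Question P (X|Y = 0)", "P (X|Y = 0)"),
                                    ("Question P (X|Y = 1)", "P (X|Y = 1)"),
                                    ("Question P (X)", "P (X)")]:
                if not (row[question] or "").strip() and not (row[value] or "").strip():
                    problems.append(f"Line {i}: fill either '{question}' or '{value}'.")
    return problems
```

With a check like this, the web application could reject an invalid upload with a descriptive 4xx response instead of failing with a 500.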
