This project forked from thomsonreuters/tr-datachallenge1

Thomson Reuters is challenging you today to leverage machine learning and natural language processing to build an algorithm that can automatically classify news into different categories. If you are as obsessed as we are with deep learning, you are encouraged to create a headline summarizer which helps you earn extra points.

License: MIT License

Data Challenge I: Reuters News Classification & Summarization

Introduction

As a multinational mass media and information firm, Thomson Reuters delivers quality news and the latest stories to the world, driven by intelligent technologies. With a large volume of news stories published each day, categorizing them manually is resource-intensive. As a leader in the information technology field, TR places strong emphasis on Artificial Intelligence to harness the world’s content.

Thomson Reuters is challenging you today to leverage machine learning and natural language processing to build an algorithm that can automatically classify news into different categories. To earn more points, we encourage you to take on the bonus problem: building a model that generates a news headline from the news body, which might require deep learning.

Problem 1: News Category Classification

Problem1-Description

Given news headlines, build a model to classify them into one of the three news categories.

Problem1-Data

For Problem 1, please use only train1.csv and test1.csv in the 1-Title_Classification folder for model building and prediction. You are not allowed to use any other training or testing dataset.

(I) Training dataset: 7916 rows, 3 columns

ID - unique identifier for each news item
TITLE - news headline
TOPIC - one of the three topics (0/1/2) (our y label)

(II) Testing dataset: 3392 rows, 3 columns

ID - unique identifier for each news item
TITLE - news headline
TOPIC - your predicted result (None)
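
One possible starting point for Problem 1 is a bag-of-words multinomial Naive Bayes classifier, sketched below using only the Python standard library. This is an illustrative baseline, not a method prescribed by the challenge. The function names (`tokenize`, `train_naive_bayes`, `predict_topic`) are hypothetical, and the toy rows reuse the two example headlines from the submission section plus two made-up ones; real rows would be read from train1.csv.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a headline and split it on whitespace."""
    return title.lower().split()


def train_naive_bayes(rows):
    """Fit a multinomial Naive Bayes model on (title, topic) pairs."""
    topic_counts = Counter()            # number of headlines per topic
    word_counts = defaultdict(Counter)  # per-topic word frequencies
    vocabulary = set()
    for title, topic in rows:
        topic_counts[topic] += 1
        for word in tokenize(title):
            word_counts[topic][word] += 1
            vocabulary.add(word)
    return topic_counts, word_counts, vocabulary


def predict_topic(model, title):
    """Return the single most probable topic for one headline."""
    topic_counts, word_counts, vocabulary = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, doc_count in topic_counts.items():
        # Log prior plus add-one (Laplace) smoothed log likelihoods.
        score = math.log(doc_count / total_docs)
        denom = sum(word_counts[topic].values()) + len(vocabulary)
        for word in tokenize(title):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[topic][word] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Toy rows for illustration only; the 2nd and 4th headlines are made up.
toy_rows = [
    ("INDONESIAN COFFEE PRODUCTION MAY FALL THIS YEAR", 2),
    ("COFFEE EXPORTS RISE", 2),
    ("INTERNATIONAL BUSINESS MACHINE CORP", 0),
    ("MACHINE CORP QUARTERLY NET", 0),
]
model = train_naive_bayes(toy_rows)
print(predict_topic(model, "COFFEE PRODUCTION FALLS"))  # 2
```

Because the model always returns exactly one topic per headline, its output fits the one-prediction-per-row requirement of the result file.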

Problem 2: News Headline Summarization (Optional, Bonus Points)

Problem2-Description

Given news bodies, build a model to generate their titles.

Problem2-Data

For Problem 2, please use only train2.csv and test2.csv in the 2-Title_Summarization folder for model building and prediction. You are not allowed to use any other training or testing dataset.

(I) Training dataset: 17142 rows, 3 columns

ID - unique identifier for each news item
BODY - news content
TITLE - news headline (our y label)

(II) Testing dataset: 1904 rows, 3 columns

ID - unique identifier for each news item
BODY - news content
TITLE - your predicted summary (None)
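
Before investing in a deep learning model for Problem 2, it can help to have a trivial reference point. The sketch below is a "lead" baseline that reuses the first sentence of the body, truncated to a few words, as the headline. It is not the approach the challenge asks for, only a floor to compare a trained model against. The function name `lead_headline`, the `max_words` default, and the sample body are all hypothetical.

```python
import re


def lead_headline(body, max_words=12):
    """Use the first sentence of the body, truncated, as a headline."""
    text = " ".join(body.split())  # collapse runs of whitespace
    # Split once, at the first sentence-ending punctuation mark
    # that is followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    words = first_sentence.rstrip(".!?").split()
    return " ".join(words[:max_words])


# Made-up news body for illustration only.
sample_body = "Coffee output may fall this year. Officials cited dry weather."
print(lead_headline(sample_body))  # Coffee output may fall this year
```

The sentence split is deliberately naive: abbreviations such as "U.S." followed by a space will end the sentence early.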

Submission

(I) You may use any one of five programming languages (Python, R, Java, C, C++) for this competition.

(II) Zip your code and predicted result file in the following format and send it to [email protected] with a title of firstname-lastname-challenge2 (such as 'bill-smith-challenge2')

  • For Problem 1, fill in the TOPIC column of test1.csv (in the 1-Title_Classification folder) with your predictions. Name this result CSV file firstname-lastname-result1.csv (such as bill-smith-result1.csv). Assign the single best topic to each headline; only one predicted topic per row is allowed. Example: bill-smith-result1.csv

    ID   TITLE                                            TOPIC
    0    INDONESIAN COFFEE PRODUCTION MAY FALL THIS YEAR  2
    1    INTERNATIONAL BUSINESS MACHINE CORP              0
    ...  .........                                        ...

  • For Problem 1, put all your code into a folder named firstname-lastname-code1
  • For Problem 2, fill in the TITLE column of test2.csv (in the 2-Title_Summarization folder) with your predictions. Name this result CSV file firstname-lastname-result2.csv (such as bill-smith-result2.csv). Example: bill-smith-result2.csv

    ID   BODY                                             TITLE
    0    Jill Considine, New York State.................  headline generated by my machine
    ...  .........                                        ................................

  • For Problem 2, put all your code into a folder named firstname-lastname-code2
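
Producing the Problem 1 result file described above could look like the following sketch, which copies each test row and fills in the TOPIC column. It assumes the CSV files are comma-delimited with an `ID,TITLE,TOPIC` header row, which the description implies but does not state. The function name `fill_topic_column` and the placeholder predictor are hypothetical, and in-memory buffers stand in for the real files.

```python
import csv
import io


def fill_topic_column(test_file, result_file, predict):
    """Copy test rows to result_file, filling TOPIC with predict(title)."""
    reader = csv.DictReader(test_file)
    writer = csv.DictWriter(result_file, fieldnames=["ID", "TITLE", "TOPIC"])
    writer.writeheader()
    for row in reader:
        # Exactly one predicted topic is written per row.
        writer.writerow({
            "ID": row["ID"],
            "TITLE": row["TITLE"],
            "TOPIC": predict(row["TITLE"]),
        })


# In-memory stand-ins for test1.csv and the result file (assumed layout).
test_csv = io.StringIO(
    "ID,TITLE,TOPIC\n"
    "0,INDONESIAN COFFEE PRODUCTION MAY FALL THIS YEAR,\n"
    "1,INTERNATIONAL BUSINESS MACHINE CORP,\n"
)
result_csv = io.StringIO()

# Placeholder predictor for illustration; a trained model would go here.
fill_topic_column(test_csv, result_csv,
                  lambda title: 2 if "COFFEE" in title else 0)
print(result_csv.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends, for example `open("bill-smith-result1.csv", "w", newline="")`.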
Evaluation and Score

(I) Problem 1 (100 points)

You can earn up to 100 points, based entirely on the accuracy rate of your submission CSV (firstname-lastname-result1.csv). The formula is:

Accuracy rate = (number of correctly predicted headlines / total number of test headlines) × 100%
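
To estimate this score locally, for instance on a held-out slice of the training data, the accuracy rate can be computed as in the sketch below. The function name `accuracy_rate` and the sample labels are hypothetical; the organizers' actual scoring script is not published here.

```python
def accuracy_rate(predicted, actual):
    """Percentage of rows whose predicted topic equals the true topic."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100.0


# Made-up labels for illustration: 3 of 4 predictions match.
print(accuracy_rate([2, 0, 1, 1], [2, 0, 1, 0]))  # 75.0
```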

We will review your code to check for plagiarism. If we find a high degree of similarity between your code and other participants' code, or code published online (for example on GitHub or Kaggle), you will not earn points.

(II) Problem 2 (50 points)

This is an optional bonus problem. You can earn up to 50 points based on a review of your code and results. The review committee at Thomson Reuters will determine the score for Problem 2.

If we find a high degree of similarity between your code and other participants' code, or code published online (for example on GitHub or Kaggle), you will not earn points.

Rules

  • You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition.
  • A participant may not use more than one user account. You may not resubmit or resend your results to our email address.
  • Hand-labelling of the testing dataset is not allowed.
  • Competition Timeline. Start Date: Sep-10-2018. End Date: Sep-14-2018 11:59 PM CST.
  • You agree to only use our provided training dataset for model building. Using other sources is prohibited.
  • Thomson Reuters reserves the right of final decision on the interpretation of these Terms and Conditions.
Contributors: katherine-shiqi
