news-title-classification's Introduction

News title classification

In this project I dig into the headlines of the IsraelHayom and Haaretz Hebrew websites. I try to build a machine learning model that, given a headline from one of the two, classifies it to its source with high accuracy, hopefully at least 80%.

Data probe

First I searched for data sources, since I wanted to get as many samples as I could. I found out that both newspapers' websites offer an RSS page that I can retrieve with an HTTP GET request. Haaretz's RSS page is formatted as needed, but IsraelHayom's RSS page has a lot of encoding errors. It took me some time to parse their XML, but I found the pattern of their errors. Now the website probe function can get a normal RSS page and extract the data from it.
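As a rough illustration of the parsing step, here is a minimal sketch that pulls headline titles out of an RSS 2.0 document using only the standard library. The function names `fetch_rss` and `extract_titles` and the sample feed are assumptions made for this example; the real script's probe function, feed URLs, and its repair of IsraelHayom's encoding errors are not shown in this README.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_rss(url: str) -> bytes:
    """Download a feed with a plain HTTP GET (hypothetical helper, not called below)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def extract_titles(rss_xml) -> list:
    """Return the stripped text of every <item><title> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    titles = []
    for item in root.iter("item"):
        title = item.findtext("title")
        # Skip items whose title is missing or empty.
        if title and title.strip():
            titles.append(title.strip())
    return titles


# Made-up feed used only to exercise the parser.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First headline</title></item>
    <item><title>  Second headline  </title></item>
    <item><title></title></item>
  </channel>
</rss>"""

print(extract_titles(SAMPLE_RSS))
```

The channel's own `<title>` is skipped because only `<item>` elements are searched.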

Secondly, I wanted to collect all the data published from the beginning of my work until I hand in the script. So I wrote a batch file and set the Windows Task Scheduler to run it every hour. Now I have more than 1,000 samples from both newspapers. I save the samples as a JSON file with the classification (0 for Haaretz and 1 for IsraelHayom).
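Since an hourly job will see the same headlines many times, the stored file presumably needs some form of de-duplication. The sketch below shows one way to do it; the file layout (a list of `{"title", "label"}` objects) and the helper name `append_samples` are assumptions for illustration, not the script's actual format. Only the 0/1 label convention comes from the text above.

```python
import json
from pathlib import Path

HAARETZ, ISRAEL_HAYOM = 0, 1  # label convention described in the text


def append_samples(path, titles, label):
    """Merge new (title, label) samples into a JSON file, skipping known titles."""
    path = Path(path)
    if path.exists():
        samples = json.loads(path.read_text(encoding="utf-8"))
    else:
        samples = []
    seen = {sample["title"] for sample in samples}
    for title in titles:
        if title not in seen:
            samples.append({"title": title, "label": label})
            seen.add(title)
    # ensure_ascii=False keeps Hebrew text readable in the file.
    path.write_text(
        json.dumps(samples, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(samples)
```

Calling `append_samples("dataset.json", titles, ISRAEL_HAYOM)` after each fetch would grow the file only when new headlines appear.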

Preprocessing

Since the problem is classifying text, once I had some data I had to convert it into a form that Machine Learning algorithms can work with. I used sklearn's CountVectorizer, which assigns each word in the training set a number and then returns a matrix where, for each sample, column i holds the number of times the word numbered i appears in that sample (for short headlines this is almost always 0 or 1). Then I used sklearn's TfidfTransformer to reweight and normalize the cells. Since I wanted to keep some samples aside to evaluate real predictions later, I split the samples into train and test sets.
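A minimal sketch of this preprocessing chain follows. The English headlines and labels are made up for the example (the real data is Hebrew), and the split size and `random_state` are arbitrary assumptions rather than the values used in the script.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Made-up placeholder headlines; 0 and 1 stand for the two sources.
titles = [
    "government approves new budget",
    "team wins the cup final",
    "budget talks continue in parliament",
    "cup final draws record crowd",
    "parliament votes on new law",
    "record heat expected this week",
]
labels = [0, 1, 0, 1, 0, 1]

# Hold out two samples, one per class, for the final evaluation.
train_titles, test_titles, y_train, y_test = train_test_split(
    titles, labels, test_size=2, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()  # word -> column index, cells hold counts
tfidf = TfidfTransformer()      # reweights the counts and normalizes each row

# Fit on the training titles only, then reuse the fitted objects on the test titles.
X_train = tfidf.fit_transform(vectorizer.fit_transform(train_titles))
X_test = tfidf.transform(vectorizer.transform(test_titles))

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, one per word in the fitted vocabulary.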

Feature selection

I also tried some feature selection, but it looks like there are no features worth removing. I think that is because each feature is a word, and if I remove a word from the vocabulary, the model has more trouble when it encounters that word while predicting the test set.

Note: the CountVectorizer vocabulary is fitted on the train set only, not on the test set, in order to stay as close as possible to real test cases, where there may be words that do not appear in the vocabulary.
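To illustrate what happens to a word that is missing from the vocabulary: a fitted CountVectorizer simply ignores it at transform time. The short sketch below uses made-up English phrases as stand-ins for real headlines.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["budget vote delayed", "cup final tonight"])

# "surprise" was never seen during fit, so it gets no column.
row = vectorizer.transform(["budget surprise tonight"])

# Map the non-zero column indices back to their words.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}
known = sorted(index_to_word[index] for index in row.nonzero()[1])
print(known)
```

Only the words already in the vocabulary contribute to the feature row; the unseen word leaves no trace.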

Training the model

Since this is a classification problem, I chose a few classification models to work with. For each model I used cross validation over the training set to choose its best hyperparameter: for each hyperparameter value I run cross validation and take the mean of the scores, in order to evaluate that value in the average case. After choosing the best hyperparameter for each model, I train the model again on the whole training set and calculate its score on the test set (which I set aside earlier in the preprocessing stage). In order to avoid overfitting, I choose the model that has the highest score on the test set and not on the train set. I save the model using the pickle library to allow fast predictions in the future without training the model all over again.
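The selection procedure described above can be sketched roughly as follows. The two candidate models, their hyperparameter grids, the 5-fold setting, and the synthetic data standing in for the tf-idf matrix are all assumptions for this example; the README does not say which models or values the script actually tries.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and the 0/1 source labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical candidates: a model factory and the hyperparameter values to try.
candidates = {
    "logreg": (lambda v: LogisticRegression(C=v, max_iter=1000), [0.01, 0.1, 1, 10]),
    "knn": (lambda v: KNeighborsClassifier(n_neighbors=v), [1, 3, 5, 7]),
}

results = {}
for name, (make_model, values) in candidates.items():
    # Mean cross-validation score on the training set for each value.
    mean_scores = [
        cross_val_score(make_model(v), X_train, y_train, cv=5).mean()
        for v in values
    ]
    best_value = values[int(np.argmax(mean_scores))]
    # Refit on the whole training set, then score once on the held-out test set.
    model = make_model(best_value).fit(X_train, y_train)
    results[name] = (model, best_value, model.score(X_test, y_test))

# Keep the model with the highest test-set score.
best_name = max(results, key=lambda n: results[n][2])
best_model = results[best_name][0]

# Round-trip through pickle, as a saved model would be reloaded for prediction.
restored = pickle.loads(pickle.dumps(best_model))
print(best_name, results[best_name][1], results[best_name][2])
```

Because the test set is used to pick the winning model, its score is a slightly optimistic estimate; a dataset the model has never touched gives a cleaner check.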

Using the script

The script can be run in four configurations:

  1. Update data set - Lets the user update the dataset from the current website RSS. Run the script with 'update' as the second argument of the program.
  2. Load test - Lets the user download from the websites a separate dataset, to test the model on data that it has never seen before. Run the script with 'load_test' as the second argument of the program.
  3. Fit - Lets the user fit the model using the ML methods mentioned above. Run the script with 'fit' as the second argument of the program. Note: there should be a dataset with samples in the directory mentioned at the top of the script (the model needs to learn from something :)).
  4. Predict - Lets the user use a trained model to classify newspaper titles as Haaretz or IsraelHayom. Run the script with 'predict' as the second argument of the program, followed by the path to the data to predict as the third argument.
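The four configurations above could be dispatched with something like the following sketch. The helper name `parse_mode`, the error messages, and the file name `titles.json` are hypothetical; the sketch also assumes that "second argument" means the first value after the script name, i.e. `sys.argv[1]`.

```python
MODES = {"update", "load_test", "fit", "predict"}


def parse_mode(argv):
    """Return (mode, path) from an argv-style list; raise ValueError on misuse."""
    if len(argv) < 2 or argv[1] not in MODES:
        raise ValueError("usage: script.py {update|load_test|fit|predict} [path]")
    mode = argv[1]
    if mode == "predict":
        # Only 'predict' takes an extra argument: the path to the data.
        if len(argv) < 3:
            raise ValueError("predict needs the path to the data to predict")
        return mode, argv[2]
    return mode, None


print(parse_mode(["script.py", "fit"]))
print(parse_mode(["script.py", "predict", "titles.json"]))
```

In a real script the list passed in would be `sys.argv`.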

I had a very nice time writing the script and I hope you will like it. Thanks

Contributors

oezrachi, oferezr
