n-gram based text categorization


Description

A PHP implementation of the technique described in the research paper "N-Gram-Based Text Categorization".

It assigns a news article to one of the following categories:

  • business
  • entertainment
  • politics
  • sport
  • tech

Create a profile

<?php
use FilippoFinke\Categorizer;
require __DIR__ . '/vendor/autoload.php';

// Raw data used for training
$rawData = 'fileName.txt';
// N-Grams to keep
$grams = 500;
// Create a profile based on the training data
$category = Categorizer::createProfile($rawData, 'profile_name', $grams);
// Save the profile to disk
$category->save('destination/name.profile');
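
For intuition about what the $grams value above controls, here is a minimal sketch of how a ranked n-gram profile can be built. It is not the library's code: buildNgramProfile is a hypothetical helper, and the lowercase normalization, the '_' word padding, and the n-gram lengths 1 to 5 are all assumptions made for illustration.

```php
<?php
// Hypothetical helper for illustration only; not part of FilippoFinke\Categorizer.
// Returns the $keep most frequent character n-grams, most frequent first.
function buildNgramProfile(string $text, int $keep, int $maxN = 5): array
{
    // Assumed normalization: lowercase, keep ASCII letters only.
    $text = preg_replace('/[^a-z]+/', ' ', strtolower($text));

    $counts = [];
    foreach (explode(' ', trim($text)) as $word) {
        if ($word === '') {
            continue;
        }
        // Assumed padding so that word boundaries show up in the n-grams.
        $padded = '_' . $word . '_';
        $length = strlen($padded);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    // Rank by descending count; break ties alphabetically so the order is stable.
    $grams = array_keys($counts);
    usort($grams, fn($a, $b) => [$counts[$b], $a] <=> [$counts[$a], $b]);

    // Keep only the top $keep n-grams (the role $grams plays above).
    return array_slice($grams, 0, $keep);
}

// Usage: the ten highest-ranked n-grams of a short sentence.
echo implode(' ', buildNgramProfile('the cat sat on the mat', 10)) . PHP_EOL;
```

A smaller $keep gives a more compact profile that is dominated by very common n-grams; a larger one retains more topic-specific n-grams at the cost of a bigger profile.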

Categorize news

<?php
use FilippoFinke\Profile;
use FilippoFinke\Categorizer;
require __DIR__ . '/vendor/autoload.php';

// Load the previously saved category profiles
$categories = array();
$categories[] = Profile::load('profiles/cat1.profile');
$categories[] = Profile::load('profiles/cat2.profile');
$categories[] = Profile::load('profiles/cat3.profile');

// File to categorize
$file = 'news.txt';
// Get the category
$result = Categorizer::categorize($file, $categories);
// Print the category name
echo $result->getName().PHP_EOL;
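
One common way to compare ranked n-gram profiles is an "out-of-place" distance: for every n-gram in the document profile, add how far its rank differs from its rank in the category profile, and add a fixed penalty when it is missing. The sketch below illustrates that idea under stated assumptions; outOfPlaceDistance and closestCategory are hypothetical helpers, profiles are assumed to be plain lists of n-grams ordered from most to least frequent, and this is not necessarily how Categorizer::categorize works internally.

```php
<?php
// Hypothetical helpers for illustration only; not the library's implementation.

// Sum of rank differences between a document profile and a category profile.
function outOfPlaceDistance(array $documentProfile, array $categoryProfile): int
{
    // Map each n-gram in the category profile to its rank (0 = most frequent).
    $ranks = array_flip(array_values($categoryProfile));
    // Assumed penalty for an n-gram that the category profile does not contain.
    $maxPenalty = count($categoryProfile);

    $distance = 0;
    foreach (array_values($documentProfile) as $rank => $gram) {
        $distance += isset($ranks[$gram]) ? abs($rank - $ranks[$gram]) : $maxPenalty;
    }
    return $distance;
}

// Name of the category whose profile is closest to the document profile.
function closestCategory(array $documentProfile, array $categoryProfiles): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($categoryProfiles as $name => $profile) {
        $distance = outOfPlaceDistance($documentProfile, $profile);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (string) $name;
        }
    }
    return $best;
}

// Usage with toy profiles (made-up data).
$document = ['a', 'b', 'c'];
$profiles = [
    'sport' => ['c', 'b', 'a'],
    'tech'  => ['a', 'b', 'c'],
];
echo closestCategory($document, $profiles) . PHP_EOL;
```

The category with the smallest total distance is taken as the best match, so a profile that ranks the same n-grams in the same order scores 0.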

Example of accuracy

➡️ Loaded category business from file business.txt!
➡️ Loaded category entertainment from file entertainment.txt!
➡️ Loaded category politics from file politics.txt!
➡️ Loaded category sport from file sport.txt!
➡️ Loaded category tech from file tech.txt!
➡️ Loaded 5 models!
➡️ Testing files for category business: 60!
➡️ Testing files for category entertainment: 66!
➡️ Testing files for category politics: 47!
➡️ Testing files for category sport: 61!
➡️ Testing files for category tech: 51!
Total files: 285, Right guesses: 275, Wrong guesses: 10
Accuracy: 96%
⏱️ Took 12256 milliseconds
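
The totals in the log above can be cross-checked with a few lines of arithmetic. The per-category counts and the number of right guesses are copied from the log; the variable names are only illustrative.

```php
<?php
// Test files per category, copied from the log above.
$perCategory = [
    'business'      => 60,
    'entertainment' => 66,
    'politics'      => 47,
    'sport'         => 61,
    'tech'          => 51,
];
$right = 275; // "Right guesses" from the log

$total    = array_sum($perCategory);   // 285
$wrong    = $total - $right;           // 10
$accuracy = $right / $total * 100;     // about 96.49

printf("Total files: %d, Wrong guesses: %d, Accuracy: %.2f%%\n", $total, $wrong, $accuracy);
// prints "Total files: 285, Wrong guesses: 10, Accuracy: 96.49%"
```

The unrounded value is about 96.49%, which the log reports as 96%.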

Dataset used

https://www.bbc.co.uk/blogs/bbcbackstage/dataset
