
ProgLanguageClassifier

A Naive Bayes Classifier for classifying documents by programming language.
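
For readers unfamiliar with the technique, here is a minimal sketch of a multinomial Naive Bayes text classifier. It is not the code in NBCMain: the class name NaiveBayesSketch, the whitespace tokenizer, and the add-one smoothing are all assumptions chosen for illustration, and the project's actual features and smoothing may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the project's NBCMain implementation. */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Assumption: tokens are whitespace-separated. */
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Adds one labelled document to the counts. */
    void train(String label, String text) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log posterior, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        String[] tokens = tokenize(text);
        for (String label : docsPerClass.keySet()) {
            // log prior: fraction of training documents carrying this label
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int total = tokensPerClass.getOrDefault(label, 0);
            for (String token : tokens) {
                int c = counts.getOrDefault(token, 0);
                // add-one smoothing; the extra +1 reserves a slot for unseen tokens
                score += Math.log((c + 1.0) / (total + vocabulary.size() + 1));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Python", "def main ( ) : print ( x )");
        nb.train("Java", "public static void main ( String [ ] args ) { }");
        System.out.println(nb.classify("def f ( ) : print ( y )")); // prints Python
    }
}
```

Scores are summed in log space so that long documents do not underflow to zero probability.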

COMPILING: In the src directory, run either 'make' or 'javac NBCMain.java'.

RUNNING: The program takes four arguments: [train|test|one-vs-all-but-one|random-skip-loop|likelihood-details] [documentDir] [loadFile] [saveFile]

  • Not every operation uses all four arguments, but all four must always be specified. For an unused argument, any placeholder string will suffice (e.g. “none”).

  • The first argument is the type of operation to be done. Each operation is explained further below.

  • The second argument is the directory from which training data is read. The layout should look like:

documentDir
|-> Class1
    |-> file1.txt
    |-> file2.txt
|-> Class2
|-> Class3
|-> etc.

Each subfolder is named for the class it contains and holds training files in *.txt format.

For our training data, use “ExampleFiles”.

When testing, this should instead be a one-level directory of *.txt files, for example “ExampleFiles/Python”.

  • The third argument is the location of a serialized classifier to be loaded. Again, use a non-file string to initialize a new classifier (“none”).

  • The fourth and last argument is the location to which the trained classifier should be serialized (e.g. “classifier.ser”).
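
The two-level layout expected for the second argument can be read with a few lines of java.nio. This is a sketch under assumptions, not the project's loader: the class TrainingDirSketch and the method listTrainingFiles are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of reading documentDir/Class/file.txt. */
class TrainingDirSketch {
    /** Maps each class sub-directory name to the *.txt files directly inside it. */
    static Map<String, List<Path>> listTrainingFiles(Path documentDir) throws IOException {
        Map<String, List<Path>> filesByClass = new TreeMap<>();
        try (DirectoryStream<Path> classDirs = Files.newDirectoryStream(documentDir)) {
            for (Path classDir : classDirs) {
                if (!Files.isDirectory(classDir)) {
                    continue; // stray files at the top level are ignored
                }
                List<Path> files = new ArrayList<>();
                try (DirectoryStream<Path> txtFiles = Files.newDirectoryStream(classDir, "*.txt")) {
                    for (Path file : txtFiles) {
                        files.add(file);
                    }
                }
                // the sub-directory name doubles as the class label
                filesByClass.put(classDir.getFileName().toString(), files);
            }
        }
        return filesByClass;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "../ExampleFiles");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        for (Map.Entry<String, List<Path>> e : listTrainingFiles(dir).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " file(s)");
        }
    }
}
```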

When running the program from the command line, it seems it must be run from the directory containing the compiled class files. Running "elnux3> java ./src/NBCMain" failed with "Error: Could not find or load main class ..src.NBCMain", so I had to navigate into ./src/ and run it from there, which works fine.

*** Either run *** 'make runProg'

* This will run with the following defaults:
	- it will run 'one-vs-all-but-one'
	- save/load to src's parent directory as: ProgLanguageClassifier/serial.files
	- document dir is ../ExampleFiles

* The default document dir is our "ExampleFiles" directory, which should be kept for every running method EXCEPT test. For test, use a class directory within ExampleFiles, such as 'ExampleFiles/JavaScript'. The other running methods train and test on all class directories anyway, so you might not end up using test at all.

* To change arguments, run:
	make runProg method="random-skip-loop" load="../serialfile.file" save="../serialfile.file" directory="../ExampleFiles" 

*** Or run *** 'java NBCMain train ../ExampleFiles ../serialfile.file ../serialfile.file' with the arg choices listed above
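
The load and save arguments in the commands above can be pictured as a load-or-create step followed by a save. The sketch below uses standard Java serialization; the Model class and the rule "start fresh when the load path is not an existing file" are assumptions made for illustration, and NBCMain may decide this differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical illustration of the load/save arguments. */
class SerializationSketch {
    /** Stand-in for the classifier state; the real class has more fields. */
    static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        int documentsSeen = 0;
    }

    /** Assumption: any path that is not an existing file (e.g. "none") means "start fresh". */
    static Model loadOrCreate(String loadFile) throws IOException, ClassNotFoundException {
        File file = new File(loadFile);
        if (!file.isFile()) {
            return new Model();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Model) in.readObject();
        }
    }

    /** Writes the model to saveFile, replacing any previous contents. */
    static void save(Model model, String saveFile) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(saveFile))) {
            out.writeObject(model);
        }
    }

    public static void main(String[] args) throws Exception {
        String loadFile = args.length > 0 ? args[0] : "none";
        String saveFile = args.length > 1
                ? args[1]
                : File.createTempFile("classifier", ".ser").getPath();
        Model model = loadOrCreate(loadFile);
        model.documentsSeen++; // pretend one more document was trained on
        save(model, saveFile);
        System.out.println("Saved model to " + saveFile);
    }
}
```

Passing the same path for both arguments, as in the 'java NBCMain train' example, would then continue training an existing classifier in place.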

OPERATIONS:

  • train: trains a new or existing classifier over a two-level directory, each sub-directory named for its class and containing training documents. The classifier is serialized to disk.

  • test: tests an existing classifier (deserialized from disk) over a one-level directory containing test documents, and displays the results.

  • one-vs-all-but-one: for each document d, trains a new classifier over all documents excluding d, tests on d, and displays the results.

  • random-skip-loop: for each probability 5%, 10%, ..., 95%, trains on a random portion of the documents selected using that probability and tests on the remaining documents. This is performed 100 times per probability. The results for each probability are saved to a csv file, and the mean and variance for each probability are displayed after all 1900 tests have run.

  • likelihood-details: for each class in documentDir, trains the classifier, and prints the top 100 features for that class, according to their relative likelihood and in descending order.
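
The two evaluation loops above (one-vs-all-but-one and random-skip-loop) can be sketched as follows. The names EvaluationSketch, Doc, and Learner are hypothetical, and two details are assumptions: the probability is treated as the chance that a document lands in the training set, and the variance is the population variance. The project may define either differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of the evaluation loops. */
class EvaluationSketch {
    /** One labelled document. */
    static final class Doc {
        final String label;
        final String text;
        Doc(String label, String text) { this.label = label; this.text = text; }
    }

    /** Trains on the given documents and predicts a label for text. */
    interface Learner {
        String predict(List<Doc> training, String text);
    }

    /** Holds out each document in turn, trains on the rest, and tests on the held-out one. */
    static double leaveOneOutAccuracy(List<Doc> docs, Learner learner) {
        int correct = 0;
        for (int i = 0; i < docs.size(); i++) {
            List<Doc> training = new ArrayList<>(docs);
            Doc held = training.remove(i);
            if (held.label.equals(learner.predict(training, held.text))) correct++;
        }
        return docs.isEmpty() ? 0.0 : correct / (double) docs.size();
    }

    /** Assumption: each document joins the training set with probability trainProbability. */
    static double randomSplitAccuracy(List<Doc> docs, double trainProbability,
                                      Random rng, Learner learner) {
        List<Doc> training = new ArrayList<>();
        List<Doc> test = new ArrayList<>();
        for (Doc d : docs) {
            (rng.nextDouble() < trainProbability ? training : test).add(d);
        }
        if (test.isEmpty()) return 0.0; // assumption: an empty test set scores zero
        int correct = 0;
        for (Doc d : test) {
            if (d.label.equals(learner.predict(training, d.text))) correct++;
        }
        return correct / (double) test.size();
    }

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Population variance (divides by n). */
    static double variance(double[] xs) {
        double m = mean(xs);
        double sum = 0.0;
        for (double x : xs) sum += (x - m) * (x - m);
        return sum / xs.length;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc("Python", "def f"), new Doc("Python", "print x"),
                new Doc("Java", "public class A"));
        Learner alwaysPython = (training, text) -> "Python"; // trivial stand-in learner
        System.out.println("leave-one-out accuracy: " + leaveOneOutAccuracy(docs, alwaysPython));
        Random rng = new Random(0);
        for (int pct = 5; pct <= 95; pct += 5) { // 19 probabilities x 100 runs = 1900 tests
            double[] acc = new double[100];
            for (int run = 0; run < acc.length; run++) {
                acc[run] = randomSplitAccuracy(docs, pct / 100.0, rng, alwaysPython);
            }
            System.out.printf("%d%% mean=%.3f variance=%.3f%n", pct, mean(acc), variance(acc));
        }
    }
}
```

Any real classifier can be plugged in through the Learner interface; the constant learner here only keeps the example short.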

Contributors: jellyyfish, jmhummel, jonsaj
