
k-means-text-classifier's Introduction

This is a text classifier program implemented in Java that uses unsupervised learning to classify unstructured text data. This project was completed in the Big Data Sciences course at NYU with Professor Anasse Bari. 

This program takes in text files, uses NLP techniques and the Stanford NLP simple library to preprocess the files, finds the top keywords, creates a tf-idf word-document matrix, and clusters each article based on its similarity to the others using K-means. F-measure, precision, recall, and a confusion matrix were used as performance metrics. Unknown documents are also assigned to their most likely cluster through an implementation of the KNN algorithm.
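To make the tf-idf and similarity steps above concrete, here is a minimal sketch. It is not the code from this project: the class and method names (TfIdfSketch, vocabulary, tfIdf, cosine) are hypothetical, the documents are assumed to be already tokenized, and the weighting assumed here is raw term count times ln(N / df), which may differ from the variant used in Bds.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; names are hypothetical and not taken from this project.
class TfIdfSketch {

    // Collects the distinct terms of all documents in sorted order.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    // Builds a documents x vocabulary matrix of tf-idf weights.
    // Assumed weighting: tf = raw count in the document, idf = ln(N / df).
    static double[][] tfIdf(List<List<String>> docs, List<String> vocab) {
        int n = docs.size();
        double[][] matrix = new double[n][vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            String term = vocab.get(t);
            int df = 0; // number of documents containing the term
            for (List<String> doc : docs) {
                if (doc.contains(term)) {
                    df++;
                }
            }
            double idf = df == 0 ? 0.0 : Math.log((double) n / df);
            for (int d = 0; d < n; d++) {
                int tf = Collections.frequency(docs.get(d), term);
                matrix[d][t] = tf * idf;
            }
        }
        return matrix;
    }

    // Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("stock", "market", "stock"),
            Arrays.asList("market", "election"));
        List<String> vocab = vocabulary(docs);
        double[][] matrix = tfIdf(docs, vocab);
        System.out.println(vocab);
        System.out.println(Arrays.deepToString(matrix));
        System.out.println(cosine(matrix[0], matrix[1]));
    }
}
```

A term that appears in every document gets an idf of 0 under this weighting, so it contributes nothing to the similarity between documents.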

*** Bds.java is my main class ***
To run the program:

1. Unzip the file.
**** My program requires Stanford NLP, which provides Sentence and lemmatization ****
First try running with the dependencies you already have. If that doesn't work:
2. Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html
	Choose English and download version 3.9.2.
3. The specific referenced libraries used in my program are:
	xom.jar
	protobuf.jar
	stanford-corenlp-3.9.2-models.jar
	stanford-corenlp-3.9.2.jar
These jars must all be on the classpath. A screenshot of my Eclipse environment is included.
4. If using Eclipse, the project can run as is. Otherwise, the jars should be exported as a runnable jar file saved as HW4/BDSHW/myjars.jar
5. From the command line, make sure you are in HW4/BDSHW/ and run:
javac -cp myjars.jar Bds.java
6. Next run: java -cp myjars.jar Bds
		
This will show the output from my program. However, k cannot be updated by editing the program in Sublime; it has to be updated through Eclipse.


Settings/Good to know:

1. Default KNN k is 3. To change the KNN k value, manually edit UNK_K in Bds.java line 15
line15	public static final int UNK_K = 8;    ///update this for KNN

2. All file paths are set in Bds.java; none should need to be updated
line10	public static String datafilename = "data.txt";
line11	public static String stopwordsfilename = "stopwords.txt"; 
line12	public static String unknownsfilename = "datahw/";

3.  To change the method of similarity comparison, edit Bds.java line 37
-true for cosine, false for Euclidean
-default is cosine
line37        Similarity.kmeans(myMatrix, K, true);
	
4.  K-means++ initialization is implemented in Similarity as the default. To change to standard k-means, comment out line 205 in Similarity.java and uncomment line 206
line205		means = init_kpp_clusters(myMatrix);	//k++ means
line206		//means = init_clusters(myMatrix);	//k-means
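As a rough illustration of settings 3 and 4 above, the sketch below shows one common way to seed centers with k-means++ under either a cosine or a Euclidean distance. It is not the code in Similarity.java: the class KppSeedingSketch and its methods distance and seedIndices are hypothetical names, and the choice of 1 minus cosine similarity as the cosine distance is an assumption.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; not the implementation in Similarity.java.
class KppSeedingSketch {

    // Distance between two vectors: (1 - cosine similarity) if useCosine, else Euclidean.
    static double distance(double[] a, double[] b, boolean useCosine) {
        if (useCosine) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            if (normA == 0.0 || normB == 0.0) {
                return 1.0; // treat an all-zero vector as maximally dissimilar
            }
            return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // k-means++ seeding: the first center is a uniformly random row; each later
    // center is drawn with probability proportional to its squared distance
    // to the nearest center chosen so far. Returns the row indices of the centers.
    static int[] seedIndices(double[][] rows, int k, boolean useCosine, Random rng) {
        int n = rows.length;
        int[] chosen = new int[k];
        double[] nearestSq = new double[n];
        Arrays.fill(nearestSq, Double.MAX_VALUE);
        chosen[0] = rng.nextInt(n);
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                // Only the most recently added center can lower the nearest distance.
                double d = distance(rows[i], rows[chosen[c - 1]], useCosine);
                nearestSq[i] = Math.min(nearestSq[i], d * d);
                total += nearestSq[i];
            }
            double threshold = rng.nextDouble() * total;
            double cumulative = 0.0;
            int pick = n - 1; // fallback if all weights are zero or rounding interferes
            for (int i = 0; i < n; i++) {
                cumulative += nearestSq[i];
                if (cumulative > threshold) {
                    pick = i;
                    break;
                }
            }
            chosen[c] = pick;
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[][] rows = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        int[] seeds = seedIndices(rows, 2, false, new Random(7));
        System.out.println(Arrays.toString(seeds));
    }
}
```

Compared with picking all initial centers uniformly at random, this weighting tends to spread the initial centers apart, which is the usual motivation for k-means++.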



Notes:
1. Word output is ordered by frequency, with the most frequent word at the top
	To see the word count too, update the commented-out string in Word.java line 64


k-means-text-classifier's People

Contributors

vangul01
