kdd-lab-5's Introduction
AUTHORS:
    Nicholas Hansen - [email protected]
    Kaanan Kharwa - [email protected]

PYTHON VERSION:
    Python 3.8.8 - Should be run on this version or newer. May face issues if run on an older version of Python.

REQUIRED PACKAGES:
    itertools
    numpy
    pandas
    concurrent
    json
    nltk

FILE NAMES OF BEST RUNS:
    knnAuthorship.py:
        - RAW: "classified_cos_10.csv"
        - EVALUATION:
            - STATS: "knn_stats_cos_10.txt"
            - MATRIX: "knn_matrix_cos_10.csv"
    RFAuthorship.py:
        - RAW: "classified_995_20_1750.txt"
        - EVALUATION:
            - STATS: "rf_stats_995_20_1750.txt"
            - MATRIX: "rf_matrix_995_20_1750.csv"

USAGE:
    Text Vectorizer:
        Usage: python3 textVectorizer.py <dataset_path> <output_name>
        Parameters:
            - <dataset_path> must be a directory containing the C50test and C50train directories
            - <output_name> is the path of the csv file to be created as ground truth

    KNN Authorship Attribution:
        Usage: python3 knnAuthorship.py <doc_vector_path> <word_counts_path> <sim_metric> <k>
        Parameters:
            - <doc_vector_path> path to file generated from textVectorizer.py - /vectorized/doc_vectors.txt
            - <word_counts_path> path to file generated from textVectorizer.py - /vectorized/word_counts.txt
            - <sim_metric> either 'cos' or 'okapi'
            - <k> integer
        Notes:
            - Will generate an output file: /KNNOutput/classified_<sim_metric>_<k>.csv
            - MAKE SURE /KNNOutput/ EXISTS AS A DIRECTORY

    RF Authorship Attribution:
        Usage: python3 RFAuthorship.py <doc_vector_path> <word_counts_path> <num_trees> <num_attr> <num_data_points>
        Parameters:
            - <doc_vector_path> path to file generated from textVectorizer.py - /vectorized/doc_vectors.txt
            - <word_counts_path> path to file generated from textVectorizer.py - /vectorized/word_counts.txt
            - <num_trees> integer
            - <num_attr> integer
            - <num_data_points> integer
        Notes:
            - Will generate an output file: /RFOutput/classified_<num_trees>_<num_attr>_<num_data_points>.csv
            - MAKE SURE /RFOutput/ EXISTS AS A DIRECTORY

    Classifier Evaluation:
        Usage: python3 classifierEvaluation.py <input_file_from_classifier> <ground_truth_path>
        Parameters:
            - <input_file_from_classifier> path to the classified_*.csv file generated by either RFAuthorship.py or knnAuthorship.py
                Note:
                    - Files can be from either /KNNOutput/ or /RFOutput/
            - <ground_truth_path> path to the ground truth .csv generated by textVectorizer.py

OUTPUTS:
    textVectorizer.py: /vectorized/
        Will only contain doc_vectors.txt and word_counts.txt
    knnAuthorship.py: /KNNOutput/
    RFAuthorship.py: /RFOutput/
    classifierEvaluation.py: /eval_outputs/
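The usage steps above can be chained into one run. The following is only a sketch, not part of the lab's code: the helper names `build_commands` and `run_pipeline` are hypothetical, the output directories are assumed to be relative to the directory the scripts are run from, and the default parameter values are read off the best-run file names (`cos`, `10` and `995`, `20`, `1750`). Creating `vectorized` and `eval_outputs` in advance is an extra precaution; the README only requires the KNN and RF output directories to exist.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(dataset_dir, ground_truth, sim_metric="cos", k=10,
                   num_trees=995, num_attr=20, num_data_points=1750):
    """Return the README's four steps, in order, as argument lists."""
    # File locations as described in the README, assumed relative to the working directory.
    vectors = "vectorized/doc_vectors.txt"
    counts = "vectorized/word_counts.txt"
    knn_out = f"KNNOutput/classified_{sim_metric}_{k}.csv"
    rf_out = f"RFOutput/classified_{num_trees}_{num_attr}_{num_data_points}.csv"
    py = sys.executable  # the interpreter running this script
    return [
        [py, "textVectorizer.py", dataset_dir, ground_truth],
        [py, "knnAuthorship.py", vectors, counts, sim_metric, str(k)],
        [py, "RFAuthorship.py", vectors, counts,
         str(num_trees), str(num_attr), str(num_data_points)],
        [py, "classifierEvaluation.py", knn_out, ground_truth],
        [py, "classifierEvaluation.py", rf_out, ground_truth],
    ]


def run_pipeline(dataset_dir, ground_truth, root="."):
    """Create the output directories, then run every step, stopping on the first failure."""
    for name in ("vectorized", "KNNOutput", "RFOutput", "eval_outputs"):
        (Path(root) / name).mkdir(exist_ok=True)
    for cmd in build_commands(dataset_dir, ground_truth):
        subprocess.run(cmd, check=True, cwd=root)
```

With the scripts in the current directory, a call such as `run_pipeline("C50", "ground_truth.csv")` would run the whole sequence; both argument values here are placeholders for the real dataset directory and ground truth file name.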