kdd-lab-5's Introduction

AUTHORS:
    Nicholas Hansen
    Kaanan Kharwa

PYTHON VERSION:
    Python 3.8.8 or newer. Older versions of Python may cause issues.

REQUIRED PACKAGES:
    Third-party (must be installed):
        numpy
        pandas
        nltk
    Standard library (included with Python):
        itertools
        concurrent
        json

FILE NAMES OF BEST RUNS:
    knnAuthorship.py:
        - RAW: "classified_cos_10.csv"
        - EVALUATION: 
            - STATS: "knn_stats_cos_10.txt"
            - MATRIX: "knn_matrix_cos_10.csv"
    RFAuthorship.py: 
        - RAW: "classified_995_20_1750.txt"
        - EVALUATION: 
            - STATS: "rf_stats_995_20_1750.txt"
            - MATRIX: "rf_matrix_995_20_1750.csv"

USAGE:
    Text Vectorizer: 
        Usage: 
            python3 textVectorizer.py <dataset_path> <output_name>
        Parameters:
            - <dataset_path> must be a directory containing the C50test and C50train directories
            - <output_name> is the path of the CSV file to be created as ground truth
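        Example (sketch):
            A minimal sketch of how a ground truth file could be built from the dataset directory. It is not the code in textVectorizer.py. It assumes that C50train and C50test each hold one sub-directory per author with that author's .txt documents inside; the function names and the "file"/"author" column names are hypothetical.

```python
import csv
from pathlib import Path


def collect_ground_truth(dataset_path):
    """Return sorted (file_name, author) pairs found under C50train and C50test.

    Assumes each split directory holds one sub-directory per author,
    with that author's .txt documents inside it.
    """
    rows = []
    for split in ("C50train", "C50test"):
        split_dir = Path(dataset_path) / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing directory: {split_dir}")
        # The parent directory name of each document is taken as its author.
        for doc in split_dir.glob("*/*.txt"):
            rows.append((doc.name, doc.parent.name))
    return sorted(rows)


def write_ground_truth(rows, output_name):
    """Write the pairs to a CSV file with a hypothetical two-column header."""
    with open(output_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "author"])  # column names are assumptions
        writer.writerows(rows)
```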

    KNN Authorship Attribution: 
        Usage: 
            python3 knnAuthorship.py <doc_vector_path> <word_counts_path> <sim_metric> <k>
        Parameters:
            - <doc_vector_path> path to the document vectors file generated by textVectorizer.py (/vectorized/doc_vectors.txt)
            - <word_counts_path> path to the word counts file generated by textVectorizer.py (/vectorized/word_counts.txt)
            - <sim_metric> similarity metric, either 'cos' (cosine) or 'okapi'
            - <k> integer, the number of nearest neighbors
        Notes:
            - Will generate an output file: /KNNOutput/classified_<sim_metric>_<k>.csv
            - MAKE SURE /KNNOutput/ EXISTS AS A DIRECTORY
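        Example (sketch):
            To illustrate what <sim_metric> = 'cos' and <k> control, here is a minimal sketch of k-nearest-neighbor voting with cosine similarity. It is not the code in knnAuthorship.py; the function names are hypothetical, and documents are assumed to be dense numeric vectors of equal length.

```python
from collections import Counter

import numpy as np


def cosine_similarities(query, vectors):
    """Cosine similarity between one query vector and each row of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    norms[norms == 0] = 1.0  # an all-zero vector gets similarity 0, not NaN
    return (vectors @ query) / norms


def knn_predict(query, vectors, labels, k):
    """Majority label among the k rows of `vectors` most similar to `query`."""
    sims = cosine_similarities(query, vectors)
    nearest = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

            A larger k smooths out noisy neighbors but lets frequent authors outvote close matches.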

    RF Authorship Attribution:
        Usage: 
            python3 RFAuthorship.py <doc_vector_path> <word_counts_path> <num_trees> <num_attr> <num_data_points>
        Parameters:
            - <doc_vector_path> path to the document vectors file generated by textVectorizer.py (/vectorized/doc_vectors.txt)
            - <word_counts_path> path to the word counts file generated by textVectorizer.py (/vectorized/word_counts.txt)
            - <num_trees> integer, the number of trees
            - <num_attr> integer, the number of attributes
            - <num_data_points> integer, the number of data points
        Notes:
            - Will generate an output file: /RFOutput/classified_<num_trees>_<num_attr>_<num_data_points>.csv
            - MAKE SURE /RFOutput/ EXISTS AS A DIRECTORY
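        Example (sketch):
            A minimal sketch of how the three integer parameters could interact in a random forest. It assumes (this README does not confirm it) that each tree is trained on <num_attr> randomly chosen attributes and <num_data_points> rows sampled with replacement. The function name and the toy data are hypothetical, and no tree is actually built here.

```python
import numpy as np


def sample_for_tree(data, num_attr, num_data_points, rng):
    """Pick a random subset of columns and a bootstrap sample of rows.

    Returns (subset, row_idx, col_idx) so predictions can be mapped back.
    """
    n_rows, n_cols = data.shape
    col_idx = rng.choice(n_cols, size=num_attr, replace=False)  # distinct attributes
    row_idx = rng.choice(n_rows, size=num_data_points, replace=True)  # with replacement
    return data[np.ix_(row_idx, col_idx)], row_idx, col_idx


rng = np.random.default_rng(0)
data = np.arange(40).reshape(8, 5)  # toy matrix: 8 documents, 5 attributes
# One independent sample per tree; here num_trees = 3.
samples = [sample_for_tree(data, num_attr=2, num_data_points=6, rng=rng) for _ in range(3)]
```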

    Classifier Evaluation:
        Usage: 
            python3 classifierEvaluation.py <input_file_from_classifier> <ground_truth_path>
        Parameters:
            - <input_file_from_classifier> path to a classified_*.csv file generated by either RFAuthorship.py or knnAuthorship.py
                Note:
                    - Files can be from either /KNNOutput/ or /RFOutput/
            - <ground_truth_path> path to the ground truth .csv generated by textVectorizer.py
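        Example (sketch):
            The best-run file names above mention a stats file and a matrix file. A minimal sketch of one way such figures could be computed, assuming the evaluation compares predicted and true author labels; it is not the code in classifierEvaluation.py, and the function name and label layout are hypothetical.

```python
import pandas as pd


def evaluate(predicted, actual):
    """Return (accuracy, confusion matrix) for two equal-length label sequences."""
    predicted = pd.Series(list(predicted), name="predicted")
    actual = pd.Series(list(actual), name="actual")
    accuracy = float((predicted == actual).mean())
    # Rows are the true authors, columns are the predicted authors.
    matrix = pd.crosstab(actual, predicted)
    return accuracy, matrix
```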

OUTPUTS:
    textVectorizer.py: /vectorized/
        Will only contain doc_vectors.txt and word_counts.txt
    knnAuthorship.py: /KNNOutput/
    RFAuthorship.py: /RFOutput/
    classifierEvaluation.py: /eval_outputs/
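    Example (sketch):
        Since the classifiers expect their output directories to exist already, a small helper like the one below could create them before a run. It assumes the directories listed above are relative to the project root; the helper name and the `root` parameter are hypothetical.

```python
from pathlib import Path

# Directory names taken from the OUTPUTS list above.
OUTPUT_DIRS = ("vectorized", "KNNOutput", "RFOutput", "eval_outputs")


def ensure_output_dirs(root="."):
    """Create each output directory under `root` if it does not exist yet."""
    created = []
    for name in OUTPUT_DIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created
```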
