textcomplexity's Introduction

Requirements

Clone the repository

git clone https://github.com/michaelcapizzi/TextComplexity.git

How to Run

The system expects the text to classify as a .txt file, with paragraphs delimited by a blank line:

She moved away from the door, stepping as softly as if she were afraid of awakening some one.  She was glad that there 
was grass under her feet and that her steps made no sounds.  She walked under one of the fairy-like gray arches between 
the trees and looked up at the sprays and tendrils which formed them.  "I wonder if they are all quite dead," she said.  
"Is it all a quite dead garden? I wish it wasn't."

If she had been Ben Weatherstaff she could have told whether the wood was alive by looking at it, but she could only 
see that there were only gray or brown sprays and branches and none showed any signs of even a tiny leaf-bud anywhere.

Note: You can import a .txt file whose paragraphs are not delimited by a blank line, but computation time may explode, because the feature extractor depends on a discourse parse, which then handles the entire document as a single instance to parse.
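To check that a file follows this layout before running the system, a small script can split it on blank lines and count the paragraphs. This is a minimal sketch, not part of the repository; the function name `split_paragraphs` is hypothetical, and it assumes a "blank line" may also contain stray spaces or tabs.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split raw text into paragraphs separated by one or more blank lines.

    Hypothetical helper for inspecting input files; not part of TextComplexity.
    """
    # A blank line may still hold spaces or tabs, so allow whitespace between newlines.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Re-join hard-wrapped lines inside a paragraph into a single line.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            paragraphs.append(joined)
    return paragraphs


if __name__ == "__main__":
    sample = "First paragraph, wrapped \nonto two lines.\n\nSecond paragraph.\n"
    for i, paragraph in enumerate(split_paragraphs(sample), start=1):
        print(i, paragraph)
```

A result of exactly one paragraph for a long document suggests the blank-line delimiters are missing, which is the slow case described in the note above.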

The system can handle two different classification structures:

Classify into 6 distinct classes:

K-1     2-3     4-5     6-8     9-10    11-12
"0001"  "0203"  "0405"  "0608"  "0910"  "1112"

or

Classify into 3 distinct classes:

K-5     6-8     9-12
"0005"  "0608"  "0912"
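The label codes appear to encode the first and last grade of each band as two 2-digit numbers, with kindergarten written as 00. The sketch below illustrates that reading; the dictionaries and the `grade_range` function are hypothetical names, and the 6-8 code is assumed to be "0608" in both structures, following the same pattern as the other codes.

```python
# Hypothetical lookup tables built from the label lists above.
SIX_CLASS = {
    "0001": "K-1",
    "0203": "2-3",
    "0405": "4-5",
    "0608": "6-8",  # assumed to follow the start/end pattern of the other codes
    "0910": "9-10",
    "1112": "11-12",
}
THREE_CLASS = {"0005": "K-5", "0608": "6-8", "0912": "9-12"}


def grade_range(label: str) -> tuple[int, int]:
    """Decode a 4-character label into (first_grade, last_grade); kindergarten is 0."""
    if len(label) != 4 or not label.isdigit():
        raise ValueError(f"not a grade-band label: {label!r}")
    return int(label[:2]), int(label[2:])


if __name__ == "__main__":
    for code, band in THREE_CLASS.items():
        print(code, band, grade_range(code))
```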

Best Performing

To run the best-performing model for each label structure*, use this command: run-main Complexity.Demo [file to analyze] [number of classes]. The output shows the generated feature values, the predicted grade-level band, and the confidence for each of the other classes for comparison.

run-main Complexity.Demo "document.txt" "3"

or

run-main Complexity.Demo "document.txt" "6"

*Current best-performing configurations:

  • 6-class structure: random forest classifier using lexical features only
  • 3-class structure: linear SVM classifier using lexical and paragraph features

Other Configurations

If you'd like to further investigate the predicted output generated by different configurations, you can run the Predict main class. This requires more arguments: run-main Complexity.Predict [file to analyze] [number of classes] [model to use] [full path to dataset to use] [feature types to include+].

The model choices are: randomForest, perceptron, logisticRegression, or svm.

The datasets can all be found in the /resources/savedFeatureMatrices folder of the repository. They are saved with an .svmLight file type, but they are just plain text. The number of classes and the feature sets used to generate each matrix should be easily identifiable from the file name.
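Because the matrices are plain text, they can be inspected without the system. The sketch below reads one line in the standard SVMlight layout (`label index:value index:value ... # comment`). It assumes the saved matrices follow that standard layout, which the file extension suggests but this README does not confirm; the function name is hypothetical, and the label is kept as a string because its exact encoding in these files is not documented here.

```python
def parse_svmlight_line(line: str) -> tuple[str, dict[int, float]]:
    """Parse one SVMlight-style line into (label, {feature_index: value}).

    Hypothetical helper; assumes the standard 'label idx:val ... # comment' layout.
    """
    # Drop an optional trailing comment, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    if not tokens:
        raise ValueError("empty line")
    label, pairs = tokens[0], tokens[1:]
    features = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        if index == "qid":
            # Query ids belong to ranking data and are not feature values.
            continue
        features[int(index)] = float(value)
    return label, features


if __name__ == "__main__":
    print(parse_svmlight_line("2 1:0.5 4:1.25 # example row"))
```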

Note: The choice of the dataset must match both the number of classes and the feature sets to use.

run-main Complexity.Predict "document.txt" "6" "randomForest" "/path/to/resources/savedFeatureMatrices/lex_par-6.svmLight" "lexical" "paragraph"

Each feature set to include should be its own argument, separated by a space. The choices are: lexical, syntactic, paragraph, or all. For example, passing lexical paragraph uses both the lexical and paragraph features, and passing all uses all three feature sets.
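Since Predict takes several positional arguments, it can help to assemble and validate the command before pasting it. The following is a minimal sketch under stated assumptions: `predict_command` is a hypothetical helper, not part of the repository, and it only checks the argument values listed above. It cannot verify that the chosen dataset actually matches the number of classes and feature sets.

```python
MODELS = {"randomForest", "perceptron", "logisticRegression", "svm"}
FEATURE_SETS = {"lexical", "syntactic", "paragraph", "all"}


def predict_command(text_file, num_classes, model, dataset, features):
    """Build a 'run-main Complexity.Predict ...' command string (hypothetical helper)."""
    if num_classes not in (3, 6):
        raise ValueError("number of classes must be 3 or 6")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if not features:
        raise ValueError("at least one feature set is required")
    unknown = [f for f in features if f not in FEATURE_SETS]
    if unknown:
        raise ValueError(f"unknown feature sets: {unknown}")
    # Argument order follows the usage line: file, classes, model, dataset, features.
    args = [text_file, str(num_classes), model, dataset, *features]
    return "run-main Complexity.Predict " + " ".join(f'"{a}"' for a in args)


if __name__ == "__main__":
    print(predict_command(
        "document.txt", 6, "randomForest",
        "/path/to/resources/savedFeatureMatrices/lex_par-6.svmLight",
        ["lexical", "paragraph"],
    ))
```

Run as a script, this prints the same command as the example above.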

Contributors

michaelcapizzi
