- Java 8
- sbt
git clone https://github.com/michaelcapizzi/TextComplexity.git
The system expects the text you want to classify to be a `.txt` file with paragraphs delimited by a blank line:
She moved away from the door, stepping as softly as if she were afraid of awakening some one. She was glad that there
was grass under her feet and that her steps made no sounds. She walked under one of the fairy-like gray arches between
the trees and looked up at the sprays and tendrils which formed them. "I wonder if they are all quite dead," she said.
"Is it all a quite dead garden? I wish it wasn't."

If she had been Ben Weatherstaff she could have told whether the wood was alive by looking at it, but she could only
see that there were only gray or brown sprays and branches and none showed any signs of even a tiny leaf-bud anywhere.
**Note:** You can import a `.txt` file whose paragraphs are not delimited by a blank line, but computation time may increase dramatically, because the feature extractor depends on a discourse parse that treats the entire document as a single instance to parse.
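As a rough pre-flight check, the sketch below splits a text on blank lines so you can count its paragraphs before handing the file to the system. The function name `split_paragraphs` and the regular expression are illustrative assumptions and are not part of this repository.

```python
import re


def split_paragraphs(text):
    """Split raw text into paragraphs wherever one or more blank lines occur."""
    # Normalize Windows and old Mac line endings so blank lines look the same.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline, optional whitespace, then another newline marks a paragraph gap.
    chunks = re.split(r"\n\s*\n", normalized)
    # Drop empty chunks produced by leading or trailing blank lines.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph."
print(len(split_paragraphs(sample)))  # 2
```

If a long document comes back as a single paragraph, it probably lacks blank-line delimiters and may be slow to process.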
The system can handle two different classification structures:
Classify into 6 distinct classes:
K-1 | 2-3 | 4-5 | 6-8 | 9-10 | 11-12 |
---|---|---|---|---|---|
"0001" | "0203" | "0405" | "0608" | "0910" | "1112" |
or
Classify into 3 distinct classes:
K-5 | 6-8 | 9-12 |
---|---|---|
"0005" | "0608" | "0912" |
If you'd like to simply run the best-performing model for each label structure\*, run this command: `run-main Complexity.Demo [file to analyze] [number of classes]`. You will see the feature values generated, the predicted grade-level band, and the confidence scores for the other classes for comparison.
run-main Complexity.Demo "document.txt" "3"
or
run-main Complexity.Demo "document.txt" "6"
\*Current best-performing configurations:
- 6-class structure: `random forest` classifier using `lexical` features only
- 3-class structure: `linear SVM` classifier using `lexical` and `paragraph` features
If you'd like to further investigate the predicted output generated by different configurations, you can run the `Predict` main class. This requires more arguments: `run-main Complexity.Predict [file to analyze] [number of classes] [model to use] [full path to dataset to use] [feature types to include+]`.
The model choices are: `randomForest`, `perceptron`, `logisticRegression`, or `svm`.
The datasets can all be found in the `/resources/savedFeatureMatrices` folder of the repository. They are saved with an `.svmLight` file extension, but they are just plain text. The number of classes and the feature sets used to generate each matrix should be easily identifiable from the file name.
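Because the matrices are plain text, you can inspect them with a few lines of code. The sketch below assumes the files follow the common SVMlight sparse layout (`label index:value index:value ... # optional comment`); this has not been checked against the files in this repository, and `parse_svmlight_line` is a hypothetical helper name. The label is kept as a string because the label encoding used in the files is not described here.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight-style line into (label, {feature_index: value})."""
    # Everything after '#' is treated as a trailing comment and ignored.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None  # blank or comment-only line
    label, *pairs = body.split()
    features = {}
    for pair in pairs:
        # Each feature is written as index:value; absent indices are zeros.
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return label, features


print(parse_svmlight_line("3 1:0.5 4:12 # example row"))
```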
**Note:** The dataset you choose must match both the number of classes and the feature sets you pass as arguments.
run-main Complexity.Predict "document.txt" "6" "randomForest" "/path/to/resources/savedFeatureMatrices/lex_par-6.svmLight" "lexical" "paragraph"
Each feature set to include should be its own argument, separated by a space. The choices are: `lexical`, `syntactic`, `paragraph`, or `all`. For example, `lexical paragraph` would use both lexical and paragraph features. Using `all` will use all three feature sets.
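If you script many runs, it can help to validate the arguments before launching sbt. The sketch below assembles a `Predict` command string from the choices listed above. The helper names `resolve_feature_sets` and `predict_command` are hypothetical; they illustrate the argument rules described here and are not code from this repository. It does not check that the dataset file matches the chosen classes and feature sets.

```python
MODELS = ("randomForest", "perceptron", "logisticRegression", "svm")
FEATURE_SETS = ("lexical", "syntactic", "paragraph")


def resolve_feature_sets(args):
    """Expand 'all' and reject unknown feature-set names."""
    if not args:
        raise ValueError("at least one feature set is required")
    if "all" in args:
        return list(FEATURE_SETS)
    unknown = [name for name in args if name not in FEATURE_SETS]
    if unknown:
        raise ValueError("unknown feature sets: {}".format(unknown))
    # Return each requested set once, in a fixed order.
    return [name for name in FEATURE_SETS if name in args]


def predict_command(document, num_classes, model, dataset, feature_args):
    """Build the run-main command string for the Predict main class."""
    if num_classes not in (3, 6):
        raise ValueError("number of classes must be 3 or 6")
    if model not in MODELS:
        raise ValueError("unknown model: {}".format(model))
    parts = [document, str(num_classes), model, dataset]
    parts += resolve_feature_sets(feature_args)
    # Quote every argument, matching the style of the examples in this README.
    return "run-main Complexity.Predict " + " ".join('"{}"'.format(p) for p in parts)


print(predict_command("document.txt", 3, "svm", "/path/to/dataset.svmLight", ["all"]))
```

Here `/path/to/dataset.svmLight` is a placeholder, not a real file name from the repository.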