Git Product home page Git Product logo

limco's Introduction

Linguistic measure collection

A collection of the stylometric measures for authorship detection that are suggested to have relationships to the attitudes and psychological tendencies of authors. This library can calculate 12 types of stylometrics, based on Japanese text metrics organised in Asaishi (2017).

Compatible natural languages:

  • Japanese
  • English

Installation

NOTE: pip-installable packaging is work-in-progress; sorry for the inconvenience

Requirements

  • Python 3.6+
    • f-string (PEP 536):
    • Type Hints (PEP 484):
  • (Your kind patience)

Dependencies

If you are using pipenv, pipenv install will create a virtual environment for this library. Otherwise, pip install -r requirements.txt will suffice.

This library also depends on the following external commands for Japanese processing:

  • MeCab (for Japanese text processing):
  • CaboCha (for analyse_parseddoc.py):

Resources

Furthermore, for Japanese metrics, the following linguistic resources are required.

  • AWD-J EX as the file name AWD-J_EX.txt
  • JIWC as the file name 2017-11-JIWC.csv
  • Japanese stopwords as the file name stopwords_jp.txt

Put the above in the data/ folder.

I will make them optional in future. Also, the location of resources will be made configurable.

Usage

IMPORTANT: the user is responsible for text normalisation

measure_lang.py

You can generate table data from a text file by executing:

python measure_lang.py [csv/excel file path] [target column name]

When you input your_data.csv, your_data.measured.csv will be created in the same location of the input file.

Recommend that the input text is formatted like one sentence per line to calculate the proper sentence count.

You can also use each measure by importing like:

import measure_lang as ml

def your_fancy_func(string):
    # ...
    res = ml.calc_ttrs(string)
    # ...

All functions read a string (a long passages with new lines are OK.). If you want to apply them to a list of strings, you need to iterate them over the list.

The detailed usage instruction for each measure will be available in its docstring (work-in-progress).

analyse_parseddoc.py

(the documentation here is work-in-progress)

Prepare:

  • one sentence per line without blank lines
  • documents are split by one blank line

E.g.

import re
import pandas as pd

import split_sentence  # find a great library to split sentences

col = "column_name"

with open('formatted_text.txt', 'w') as f:
    f.writelines(
        [
            re.sub("^$", "", re.sub("\n+", "\n", re.sub("\s", " ", t))) + "\n\n"
            for t in df[col].to_list()
        ]
    )

then, execute sentence-splitter formatted_text.txt > formatted_text-sentsplit.txt or something like that, where sentence-splitter splits input text into the sentence-per-line format while preserving blank lines (you must find a good real tool; in English, nltk provides one).

Run:

cabocha -f1 your_text.txt > your_text.cabocha.txt
python analyse_parseddoc.py your_text.cabocha.txt [your_csv.measured.csv]

If you add measured csv file produced by measure_lang.py as the second argument, analyse_parseddoc.py concatenate the result column-wise (like GNU paste(1), but aware of escaped multi-line text in the csv).

Measures

import measure_lang as ml
import analyse_parseddoc as ap
  • Percentages of character types (ml.count_charcat): The ratio of hiragana, katakana, and kanji (Chinese characters) to the characters in text, respectively.

  • Type Token Ratio (TTR) (ml.calc_ttrs): The ratio of the different words to the total number of words in text.

  • Percentage of content words (ml.measure_pos): The ratio of content words (i.e., nouns, verbs, adjectives, and adverbs) to the total number of words in text.

  • Modifying words and Verb Ratio (MVR) (ml.measure_pos): The ratio of verbs to adjectives, adverbs, and conjunctions for the words in text. It has been used as one of the indicators of author estimation.

  • Percentage of proper nouns (ml.measure_pos): The ratio of proper nouns (named entities) to all words in text.

  • Word abstraction (ml.measure_abst): The abstraction degrees of the words in text. We specifically used the maximum value of the most abstract word, and the average of the top five abstract words. The abstraction degrees were obtained from the Japanese word-abstraction dictionary AWD-J EX.

  • Ratios of emotional words (ml.calc_jiwc): The ratios, to all the words in text, of the words that are associated with each of the seven kinds of emotions: sadness, anxiety, anger, disgust, trust, surprise, and happiness. The seven values are transformed to meet the property of probability (each value spans between 0 and 1; the sum of all values is to be 1). The degree of association with emotion was determined according to the Japanese emotional-word dictionary JIWC.

  • Number of sentences (ml.measure_sents): The total number of sentences that make up text.

  • Length of sentences (ml.measure_sents): Descriptive statistics (mean, standard deviation, interquartile, minimum, and maximum) for the number of characters in each sentence that constitutes text. In particular, the average sentence length has been suggested to be linked to the writer’s creative attitude and personality .

  • Percentage of conversational sentences (ml.count_conversations): Percentage of the total number of conversational sentences contained in text.

  • Depth of syntax tree (ap.analyse_dep): Descriptive statistics calculated for the depth of the dependency tree for each sentence in text.

  • Mean of the number of chunks per sentence (ap.analyse_dep): Descriptive statistics calculated for the average values of the number of chunks for each sentence in text.

  • Mean of the words per chunk (ap.analyse_dep): Descriptive statistics calculated for the average values of the number of words per chunk in text.

Summary table

Stylometric Sub-measures (value format)
Percentages of character types Hiragana, katakana, and kanji (Chinese characters) (%)
Type Token Ration (TTR) (%)
Percentages of content words (%)
Modifying words and Verb Ratio (MVR) (%)
Percentage of proper nouns (%)
Word abstraction The maximum, and the average of the top five abstract words (real number)
Ratios of emotional words sadness, anxiety, anger, disgust, trust, surprise, and happiness (%)
Number of sentences (integer)
Length of sentences mean, standard deviation, interquartile, minimum, and maximum (real number)
Percentage of conversational sentences (%)
Depth of syntax tree mean, standard deviation, interquartile, minimum, and maximum (real number)
Mean of the number of chunks per sentence mean, standard deviation, interquartile, minimum, and maximum (real number)
Mean of the words per chunk mean, standard deviation, interquartile, minimum, and maximum (real number)

References

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.