c-atf-feature-extractor

A Preprocessor for C-ATF files

The extractor is pure Python: it uses only four standard-library modules (itertools, re, json and unittest), so it is as light as it gets. It takes a C-ATF text and produces a large, heavily nested Python dictionary. An API that makes it easier for humans to use is planned for the future.

The preprocessor is in its alpha stage; do not use it for scholarly work yet. Once the Version 0.1 Beta milestone is reached, it can be used for scholarly work. Major releases, such as 0.1b or 0.1, will also be contributed to the cltk repository.

The resulting dictionary contains the following information about the text:

On Text Level:

  • text_textPartCount: Number of parts in the text, such as @obverse, @reverse, etc.

  • text_totalSignOccuranceCount: Total count of sign occurrences. This is distinct from the number of different signs used in a document: a text with 30 signs, each occurring 3 times, has a total occurrence count of 90. The number of distinct signs will also be provided later on.

  • text_RelSignPositions: Relative Sign Positions with respect to text level. The signs are counted throughout the text without considering changes of line or part. If line 1 has 20 sign occurrences, the counter for line 2 continues from where line 1 left off.

  • text_RelWordPositions: Relative Word Positions with respect to text level. Same as signs but done for words.

  • text_language: Gives the language specified in the protocol section of the C-ATF document, ex. #atf: lang akk.

  • text_totalLineCount: Gives the total number of lines in the text.

  • text_objectType: Gives the object type stated with @, ex. @tablet, etc.

  • text_id: Gives the text id, ex. P480793, etc.

  • text_RelLinePositions: Relative Line Positions with respect to text level. Same as signs but done for lines.

  • text_textParts: A list containing the parts of the text. Parts are stored as dictionaries.
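
The difference between the occurrence count and the number of distinct signs, and the way text-level positions run on across lines, can be illustrated with a minimal sketch. This is not the extractor's code: the naive whitespace/hyphen tokenisation, the zero-based counter and the function name text_level_positions are assumptions made for illustration only.

```python
def text_level_positions(lines):
    """Number sign occurrences across a whole text, ignoring line breaks.

    Signs are split naively on whitespace and hyphens, which is a
    simplification of real C-ATF (assumption for this sketch).
    """
    positions = []  # (position, sign) pairs; zero-based counting is assumed
    counter = 0
    for line in lines:
        for word in line.split():
            for sign in word.split("-"):
                positions.append((counter, sign))
                counter += 1  # never reset between lines
    return positions


lines = ["a-na be-li2", "be-li2 a-na"]
rel_positions = text_level_positions(lines)

# Every occurrence counts once here.
total_occurrences = len(rel_positions)
# Each different sign counts only once here.
distinct_signs = len({sign for _, sign in rel_positions})

print(total_occurrences, distinct_signs)
print(rel_positions[4])  # first sign of line 2 continues the line 1 counter
```

In this toy text the four signs each occur twice, so the occurrence count is twice the number of distinct signs.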


On Part Level:

  • part_partTitle: Gives the title of the part indicated with @, ex. @obverse, @reverse, etc.

  • part_partString: Gives the text of the part as a string.

  • part_AL_occurances: Gives the list of Another Language (AL) Occurrences. These are indicated with _ in C-ATF; they can span multiple lines and can contain a language switch. Some of the features contained in this list are explained below.

  • part_RelSignPositions: Relative Sign Positions with respect to part level. Same thing as text level, but this time the counter starts from zero for each part.

  • part_RelWordPositions: Relative Word Positions with respect to part level. Same thing as signs but done for words.

  • part_partLines: A list of dictionaries. Each dictionary holds information about an individual line.
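
As a rough illustration of part-level positions, the sketch below restarts the counter for every part. The input layout (a mapping from part title to lines), the naive tokenisation and the name part_level_positions are hypothetical; the extractor itself may organise this differently.

```python
def part_level_positions(parts):
    """Map each part title to its own zero-based (position, sign) list."""
    result = {}
    for title, lines in parts.items():
        counter = 0  # restarts for every part
        positions = []
        for line in lines:
            for word in line.split():
                for sign in word.split("-"):
                    positions.append((counter, sign))
                    counter += 1
        result[title] = positions
    return result


parts = {
    "@obverse": ["a-na be-li2", "qi2-bi2-ma"],
    "@reverse": ["um-ma"],
}
by_part = part_level_positions(parts)
print(by_part["@reverse"])  # counting starts again from zero
```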

AL Occurrences:

Another Language occurrences can span multiple lines, so they are handled at part level rather than at line level.

The key part_AL_occurances gives a list of lists of dicts. Each list of dicts represents an AL occurrence. Each dict represents a word. The structure of the dict is as follows:

  • alWord_AlOc: A string representation of the entire Another Language occurrence. Ex. for an AL occurrence of _an kur_u2, the dictionary might belong to the word kur_u2, but this key would still contain the whole _an kur_u2.

  • alWord_AlOc_Position: Contains a dictionary with the following keys: alWord_Position and totalWords_AlOc. The first one indicates the position of the AL word inside the AL occurrence, the second one gives the total number of AL words inside the AL occurrence.

  • alWord_LineNumber: Stores the number of the line which contains the AL word.

  • alWord_LinePosition: Contains a dictionary with the following keys: alWord_Position and totalWords_Line. The first one indicates the position of the AL word inside the line, the second one gives the total number of words in the line.

  • alWord_language: Stores the language of the AL occurrence if stated, ex. if there is something like %hit right after the underscore, the value of this key would be hit.

  • alWord_textLanguage: Stores the language of the text, as indicated in the protocol.

  • alWord_word: Stores the AL word, so for an AL occurrence of _an kur_u2, this key would contain only kur_u2.
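
The shape of this structure can be sketched for a single line as follows. This is only an illustration under stated assumptions: it handles one line instead of a whole part, counts positions from zero, does not parse a %language switch, and the function name al_occurrences is made up.

```python
def al_occurrences(line, line_number, text_language):
    """Build a list of lists of dicts, one inner list per AL occurrence."""
    words = line.split()
    occurrences = []  # finished occurrences, each a list of (pos, word)
    current = []      # the occurrence that is currently open
    inside = False
    for pos, word in enumerate(words):
        underscores = word.count("_")
        if inside or underscores:
            current.append((pos, word))
        if underscores % 2 == 1:  # an odd count opens or closes an occurrence
            inside = not inside
        if current and not inside:
            occurrences.append(current)
            current = []

    result = []
    for occ in occurrences:
        al_string = " ".join(word for _, word in occ)
        result.append([
            {
                "alWord_AlOc": al_string,
                "alWord_AlOc_Position": {"alWord_Position": i,
                                         "totalWords_AlOc": len(occ)},
                "alWord_LineNumber": line_number,
                "alWord_LinePosition": {"alWord_Position": pos,
                                        "totalWords_Line": len(words)},
                "alWord_language": None,  # %code switches are not parsed here
                "alWord_textLanguage": text_language,
                "alWord_word": word,
            }
            for i, (pos, word) in enumerate(occ)
        ])
    return result


al_list = al_occurrences("a-na _an kur_u2 i-na", 3, "akk")
second_word = al_list[0][1]
print(second_word["alWord_word"], "|", second_word["alWord_AlOc"])
```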


On Line Level:

  • lineText: Contains the string representation of the line.
  • lineNumber: Contains the line number.
  • isLineStructure: Stores a boolean value. Indicates whether the line is a comment about the structure indicated with $ in C-ATF.
  • isLineContent: Stores a boolean value. Indicates whether the line is a comment about the content of another line indicated with # in C-ATF.
  • lineWordCount: Stores the total word count of the line.
  • lineWordPos: Line Word Positions: Stores the words with their positions in the line. Different from lineWords.
  • lineWords: Stores a list of dictionaries. Each dictionary represents a unique word in the line.
  • line_RelSignPositions: Relative Sign Positions with respect to line level. Same thing as in part level, but this time the counter resets to 0 at the beginning of each line.
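
The two boolean line flags can be pictured with a small sketch that looks only at the first character of the line. The function name classify_line is hypothetical, and the sketch deliberately ignores the fact that protocol lines such as #atf: lang akk also start with #, which a real parser would have to treat separately.

```python
def classify_line(line):
    """Return the line text together with the two comment flags."""
    stripped = line.strip()
    return {
        "lineText": line,
        "isLineStructure": stripped.startswith("$"),  # structure comment
        "isLineContent": stripped.startswith("#"),    # content comment
    }


structure_line = classify_line("$ rest broken")
content_line = classify_line("# reading uncertain")
text_line = classify_line("1. a-na be-li2")
print(structure_line["isLineStructure"], content_line["isLineContent"])
```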

On Word Level:

The word dictionary includes some features from the sign dictionary to facilitate later processing.

  • word_hasComplement: Boolean value. True if the word has a complement indicated with +.
  • word_Signs: Stores a list of dictionaries. Each dictionary represents a unique sign.
  • word_hasUnknownReading: Boolean value. True if the word has an uppercase reading outside of a compound sign.
  • word_hasNutillu: Boolean value. True if the word has a sign with nutillu modifier.
  • word_hasRotated: Boolean value. True if the word has a sign with rotation modifier.
  • word_hasAllograph: Boolean value. True if the word has a sign with allograph indicator ~.
  • word_hasDamage: Boolean value. True if the word has a sign with #.
  • word_punctuationDict: Stores a dictionary which includes information about the punctuation. Keys are punctuation_punctElement, and punctuation_punctGrapheme: First one contains the value of the punctuation without the qualifying grapheme, ex. *, /; the second one contains the grapheme qualifying the punctuation, ex. the disz in *(disz).
  • word_hasSpecification: Boolean value. True if the word has parentheses in it.
  • word_wordSignCount: Stores the sign occurrence count for a word.
  • word_hasGunu: Boolean value. True if the word has a sign with gunu modifier.
  • word_isSpecifiedWordDivider: boolean value. True if the word is composed of the following structure: /(GRAPHEME).
  • word_signRelations: Stores a list of dictionaries. Each dictionary represents a relation indicated with an operator. Contained relation types are, sign-sign, sign-group, group-sign, group-group.
  • word_hasKabatenu: Boolean value. True if the word has a sign with kabatenu modifier.
  • word_isDColon: Boolean value. True if the word is punctuation of Double Colon '::'.
  • word_hasJoining: Boolean value. True if the word has the joining operator in it.
  • word_hasFlat: Boolean value. True if the word has a sign with flat modifier.
  • word_hasCrossing: Boolean value. True if the word has the crossing operator.
  • word_hasCollation: Boolean value. True if the word has a collation indicated with '*'.
  • word_hasComposite: Boolean value. True if the word has a compound sign in it.
  • word_hasVertReflected: Boolean value. True if the word has a sign with the vertically reflected modifier.
  • word_hasQuery: Boolean value. True if the word has a sign with ?.
  • word_hasFormVariant: Boolean. True if the word has .
  • word_hasSpecialAllograph: Boolean value. True if the word has a special allograph.
  • word_determinatives: Stores a list of tuples of dicts. Each dict represents a sign in a determinative. And each tuple represents a determinative. Keys of this determinative sign dictionary will be explained below.
  • word_numberDict: Stores a dict with following keys. number_repetitionCount, number_grapheme: First one indicates the repetition count of a number, ex. n+1, n, 4, etc. The second one indicates the grapheme of the number, ex. asz in 4(asz), etc.
  • word_hasCurved: Boolean value. True if the word has a sign with curved modifier.
  • word_hasContaining: Boolean value. True if the word has the containing operator.
  • word_hasCorrection: Boolean value. True if the word has a sign with !.
  • word_word: String representation of the white-space delimited word.
  • word_hasSheshig: Boolean value. True if the word has a sign with sheshig modifier.
  • word_isColon: Boolean value. True if the word is a punctuation of the type, :.
  • word_isBulletSpecified: Boolean value. True if the word is a punctuation of the type *(GRAPHEME).
  • word_wordSignsPos: List of tuples. Stores the signs with their relative positions inside the word.
  • word_isNumber: Boolean value. True if the word belongs to one of the three number types specified by Grapheme Description Language of ORACC.
  • word_wordLang: Stores the information of the language of the word. This can be different from the text language if the word is inside an AL occurrence.
  • word_hasVariant: Boolean value. True if the word has a sign with variant modifier.
  • word_hasAbove: Boolean value. True if the word has the above operator.
  • word_hasTenu: Boolean value. True if the word has a sign with tenu modifier.
  • word_isColonDQ: Boolean value. True if the word is a punctuation of the type, :".
  • word_hasHorReflected: Boolean value. True if the word has a sign with Horizontally Reflected modifier.
  • word_isColonRQ: Boolean value. True if the word is a punctuation of the type, :' or MZL592~b.
  • word_hasBeside: Boolean value. True if the word has the beside operator.
  • word_isBullet: Boolean value. True if the word is a punctuation of the type, *.
  • word_hasContainingGroup: Boolean value. True if the word has containing operator with parentheses.
  • word_isWordDivider: Boolean value. True if the word is an unspecified word divider.
  • word_hasZidatenu: Boolean value. True if the word has a sign with zidatenu modifier.
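
To make word_numberDict and word_punctuationDict more concrete, here is a minimal sketch that fills both from a word string with regular expressions. The patterns and the function names parse_number and parse_specified_punctuation are assumptions, not the extractor's own code, and they only cover the simple shapes quoted above (for instance, fractions are not handled).

```python
import re

# Repetition count (n+1, n, or digits) followed by a grapheme in parentheses.
NUMBER_RE = re.compile(r"^(n\+\d+|n|\d+)\(([^()]+)\)$")
# A bullet or word divider qualified by a grapheme, ex. *(disz) or /(P2).
PUNCT_RE = re.compile(r"^([*/])\(([^()]+)\)$")


def parse_number(word):
    """Return a word_numberDict-like dict, or None if the word is no number."""
    match = NUMBER_RE.match(word)
    if match is None:
        return None
    return {
        # Kept as a string because counts such as "n" are not numeric.
        "number_repetitionCount": match.group(1),
        "number_grapheme": match.group(2),
    }


def parse_specified_punctuation(word):
    """Return a word_punctuationDict-like dict, or None if it does not apply."""
    match = PUNCT_RE.match(word)
    if match is None:
        return None
    return {
        "punctuation_punctElement": match.group(1),
        "punctuation_punctGrapheme": match.group(2),
    }


print(parse_number("4(asz)"))
print(parse_number("n+1(disz)"))
print(parse_specified_punctuation("*(disz)"))
```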

Determinatives:

Each determinative of the word is a tuple, which contains dictionaries representing signs.

The structure of the dictionary is as follows:

  • detSign_DetPosition: stores a dictionary with following keys. detSign_position, totalSigns_determinative. First one contains the position of the sign inside the determinative. Second one contains total number of signs inside the determinative.
  • detSign_WordPosition: stores a dictionary with following keys. detSign_position, totalSigns_word. First one stores the position of the sign inside the word. Second one contains total number of signs inside the word.
  • detSign_det: Stores the string representation of the determinative.
  • detSign_detMark: Stores a string representation. It can have three values: inpos, postpos, prepos. Prepos is for determinatives at the beginning of a word. Postpos is for determinatives at the end of a word. Inpos is for determinatives that are neither at the beginning nor at the end of the word; it may be used, for example, for determinatives that follow other determinatives inside a word.
  • detSign_detSign: Stores the string representation of the determinative sign that the dictionary describes.
  • detSign_det_WordPos: Stores a tuple which contains the beginning and the end of the sign range of the determinative which contains the above mentioned sign. Ex. for a made up word like {gesz}{an-il-hal}sza-pa-ra-ku2-me-{mesz}, the dictionary concerning il of {an-il-hal} would contain (1,3), since gesz is in the 0 position.
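
The made-up word above can be run through a minimal sketch that produces dictionaries of this shape. The regular expression, the decision to store the determinative with its braces in detSign_det, and the function name determinatives are assumptions for illustration; the extractor's own bracket handling may differ.

```python
import re


def determinatives(word):
    """Return one tuple of sign dicts per determinative in the word."""
    segments = []  # (is_determinative, text, signs) in word order
    for match in re.finditer(r"\{([^{}]*)\}|([^{}]+)", word):
        inner, plain = match.group(1), match.group(2)
        is_det = inner is not None
        text = inner if is_det else plain
        signs = [s for s in text.split("-") if s]
        if signs:
            segments.append((is_det, text, signs))

    total_signs = sum(len(signs) for _, _, signs in segments)
    result = []
    word_pos = 0  # zero-based position of the next sign inside the word
    for seg_index, (is_det, text, signs) in enumerate(segments):
        if is_det:
            if seg_index == 0:
                mark = "prepos"
            elif seg_index == len(segments) - 1:
                mark = "postpos"
            else:
                mark = "inpos"
            span = (word_pos, word_pos + len(signs) - 1)  # inclusive range
            result.append(tuple(
                {
                    "detSign_DetPosition": {
                        "detSign_position": i,
                        "totalSigns_determinative": len(signs)},
                    "detSign_WordPosition": {
                        "detSign_position": word_pos + i,
                        "totalSigns_word": total_signs},
                    "detSign_det": "{" + text + "}",  # braces kept (assumed)
                    "detSign_detMark": mark,
                    "detSign_detSign": sign,
                    "detSign_det_WordPos": span,
                }
                for i, sign in enumerate(signs)
            ))
        word_pos += len(signs)
    return result


dets = determinatives("{gesz}{an-il-hal}sza-pa-ra-ku2-me-{mesz}")
il_dict = dets[1][1]
print(il_dict["detSign_detSign"], il_dict["detSign_det_WordPos"])
```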

On Sign Level:

The sign dictionary contains the 'is' versions of the sign features listed at word level (ex. isDamaged instead of hasDamage), plus the following keys:

  • sign_isPartOfCompound: Boolean value. True if the sign is part of a compound sign, ex. KA in KAxIR2.
  • sign_nestLevel: Stores the nest level of the sign if the sign is contained in a compound sign involving groups, ex. 1 for KA in IR3x(AN.KA). The complete compound sign is considered level 0, and each balanced pair of parentheses counts as one further level of nesting.
  • sign_relatedSigns: Stores a dictionary. Its keys will be explained below.
  • sign_sign: Stores the string representation of the sign.
  • sign_compoundSign: Stores the string representation of the compound sign if the sign is a part of a compound sign.
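
The nest level rule can be sketched by tracking parenthesis depth while walking through the compound sign. The tokenising pattern (uppercase sign names with optional digits) and the function name sign_nest_levels are assumptions for this illustration; modifiers and other GDL details are ignored.

```python
import re


def sign_nest_levels(compound):
    """Return (sign, nest_level) pairs for a compound such as IR3x(AN.KA)."""
    levels = []
    depth = 0  # the compound sign as a whole is level 0
    for token in re.findall(r"[A-Z][A-Z0-9]*|[()]", compound):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        else:
            levels.append((token, depth))
    return levels


nest_levels = sign_nest_levels("IR3x(AN.KA)")
print(nest_levels)
```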

Sign Relations:

These are recorded at two levels: the word level and the sign level. The word-level representation contains group-group relations, which cannot be expressed at the sign level. A more elegant solution would be to implement a compoundSignHandler class for this purpose; this will most probably be done in the future. In any case, the sign relation dictionaries at both levels have the same keys.

The structure of the sign relation dictionary is as follows:

  • SR_operator: String representation of the operator, ex: +, x, %, etc.
  • SR_operator_antec: String representation of the characters before the operator.
  • SR_operator_subsq: String representation of the characters after the operator.
  • SR_nest_level: Indicates the nest level of sign relation occurrence.
  • SR_nest_content: Stores the string representation of the text content in which a sign relation occurs.
  • SR_compoundSign: Stores the string representation of the compound sign in which the nested sign relation occurrence is observed.
  • SR_nest_range: Stores the character range of the nest in which the sign relation is observed.
  • SR_relation_type: Stores a dictionary with the following keys. operator_antecedent, operator_subsequent. First one indicates whether a group or a sign comes before the operator, the second one indicates whether a group or sign comes after the operator.
  • SR_operator_position: Stores the character position of the operator. Position is with regard to the compound sign's character range.
  • SR_operator_type: Stores the operator type, ex. crossing, above, joining, etc.
  • SR_operator_antec_range: Stores the character range of the elements that come before the operator.
  • SR_operator_subseq_range: Stores the character range of the elements that come after the operator.
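
A simplified sketch of how such dictionaries could be derived from a compound sign is given below. It is not the extractor's implementation: the operator table follows my reading of the ORACC Grapheme Description Language, the character ranges are half-open Python slices, the input is assumed to have balanced parentheses and uppercase sign names, and only a subset of the keys is filled in.

```python
# Operator symbols and their assumed type names.
OPERATORS = {".": "beside", "x": "containing", "%": "crossing",
             "&": "above", "+": "joining"}
STOP = set(OPERATORS) | {"(", ")"}


def _operand_before(body, op_index):
    """Return (text, (start, end), kind) of the operand left of the operator."""
    end = op_index
    start = end - 1
    if body[start] == ")":  # a group: walk back to the matching "("
        depth = 0
        while True:
            if body[start] == ")":
                depth += 1
            elif body[start] == "(":
                depth -= 1
            if depth == 0:
                break
            start -= 1
        return body[start:end], (start, end), "group"
    while start > 0 and body[start - 1] not in STOP:
        start -= 1
    return body[start:end], (start, end), "sign"


def _operand_after(body, op_index):
    """Return (text, (start, end), kind) of the operand right of the operator."""
    start = op_index + 1
    end = start
    if body[start] == "(":  # a group: walk forward to the matching ")"
        depth = 0
        while True:
            if body[end] == "(":
                depth += 1
            elif body[end] == ")":
                depth -= 1
            end += 1
            if depth == 0:
                break
        return body[start:end], (start, end), "group"
    while end < len(body) and body[end] not in STOP:
        end += 1
    return body[start:end], (start, end), "sign"


def sign_relations(compound_sign):
    """Return one relation dict per operator found in the compound sign."""
    body = compound_sign.strip("|")
    relations = []
    for i in range(1, len(body) - 1):
        if body[i] not in OPERATORS:
            continue
        antec, antec_range, antec_kind = _operand_before(body, i)
        subsq, subsq_range, subsq_kind = _operand_after(body, i)
        relations.append({
            "SR_operator": body[i],
            "SR_operator_type": OPERATORS[body[i]],
            "SR_operator_position": i,
            "SR_operator_antec": antec,
            "SR_operator_subsq": subsq,
            "SR_operator_antec_range": antec_range,
            "SR_operator_subseq_range": subsq_range,
            "SR_nest_level": body[:i].count("(") - body[:i].count(")"),
            "SR_relation_type": {"operator_antecedent": antec_kind,
                                 "operator_subsequent": subsq_kind},
            "SR_compoundSign": compound_sign,
        })
    return relations


relations = sign_relations("|IR3x(AN.KA)|")
for relation in relations:
    print(relation["SR_operator_antec"], relation["SR_operator"],
          relation["SR_operator_subsq"], relation["SR_relation_type"])
```

For IR3x(AN.KA) the sketch reports a sign-group relation at nest level 0 and a sign-sign relation at nest level 1.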

Usage Example:

For now the parser is designed for documents containing a single text.

For now I am more concerned with fine-tuning the parser than with supporting multiple texts per document, because that support is quite easy to add: I would just need to add a couple of lines to the initial section getter.

To use the feature extractor on a raw text, first import cAtfFeatExtractor.py from the extractor module, then use the following:

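# Assumes cAtfTextBuilder has already been imported from cAtfFeatExtractor.py;
# the exact import path depends on where the module sits in your checkout.

# Read the raw C-ATF text.
with open("Archival view of P462811.txt", "r", encoding="utf-8", newline="\n") as cAtfFile:
    test_file = cAtfFile.read()

# Build the nested feature dictionary using the second pass (SP).
test_textClass = cAtfTextBuilder(test_file)
test_text = test_textClass.buildTextDict_SP()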

If you use FP (First Pass) instead of SP (Second Pass) at the end of buildTextDict_, the dictionary will not contain the relative positions of the signs or the total counts at the different levels. Use it if you don't need those.
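
Once the dictionary is built, it can be walked level by level with the keys documented above. The sketch below uses a small hand-made stand-in instead of real extractor output, so the sample values (including the text id) are invented and the helper name iter_words is hypothetical.

```python
def iter_words(text_dict):
    """Yield (part title, line number, word dict) for every word in the text."""
    for part in text_dict["text_textParts"]:
        for line in part["part_partLines"]:
            for word in line["lineWords"]:
                yield part["part_partTitle"], line["lineNumber"], word


# Hand-made stand-in for the extractor's output; values are illustrative only.
sample_text = {
    "text_id": "P000000",  # made-up id
    "text_textParts": [
        {
            "part_partTitle": "@obverse",
            "part_partLines": [
                {
                    "lineNumber": 1,
                    "lineWords": [
                        {"word_word": "a-na", "word_hasDamage": False},
                        {"word_word": "be-li2#", "word_hasDamage": True},
                    ],
                },
            ],
        },
    ],
}

damaged = [
    (title, number, word["word_word"])
    for title, number, word in iter_words(sample_text)
    if word["word_hasDamage"]
]
print(damaged)
```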

c-atf-feature-extractor's People

Contributors

d-k-e

c-atf-feature-extractor's Issues

Outlier Text Checker

Before closing the 0.1 beta milestone, include an outlier text checker to make sure that the text to be processed falls within the parameters of a reasonable text.

No support for lexical texts

If anyone finds documentation on how lexical texts are implemented, feel free to modify the code base as you see fit. If you don't want to deal with the implementation, at least post it here so that I can take a look at it myself.

Sign Divisions

  • Very rarely, in Achaemenid royal inscriptions, some complex signs such as ANSZE are split across multiple lines. They could be represented as a specified compound sign, something like ANSZE(|AN.\newline SZE|). This is most likely an outlier for other languages, and thus does not fall within the scope of 0.1b.

Mark Sign and Words

Signs and words that belong to an Another Language Occurrence or a determinative are not marked in the sign and word dictionaries.

Nested parts

@column 2
1.
2.

Nested parts like the above are not handled well. They will probably be flattened to something like:
@surface a - @column 1
1.
2.
@surface a - @column 2
etc.
Flattening was a bad idea, because it risks losing information that may be useful later on. Instead, I'll add "isPartOf" and "relationType" keys to the dictionaries and use a controlled vocabulary for their values.
Some of the values are:
text_part
part_part
part_AL_Oc
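
One way the planned keys might look is sketched below. Everything here is hypothetical: the feature does not exist yet, the rule that a @column belongs to the most recent non-column part is an assumption, and so are the function name link_parts and the "text" placeholder used as the parent of top-level parts.

```python
def link_parts(part_titles):
    """Attach isPartOf / relationType keys instead of flattening the titles."""
    parts = []
    current_parent = None  # most recent part that is not a column
    for title in part_titles:
        part = {"part_partTitle": title}
        if title.startswith("@column") and current_parent is not None:
            part["isPartOf"] = current_parent
            part["relationType"] = "part_part"
        else:
            current_parent = title
            part["isPartOf"] = "text"  # placeholder for the enclosing text
            part["relationType"] = "text_part"
        parts.append(part)
    return parts


linked = link_parts(["@surface a", "@column 1", "@column 2"])
for part in linked:
    print(part)
```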

Known Issues

The Extractor is in its alpha stage: DO NOT attempt to use it for scholarly work. The closing of this issue will mark the beta stage; only then should you attempt to use it for scholarly work.

Determinative problem

Some determinatives have quirks: {d}en, for example, should have been parsed as {d}-en. I currently do not know why this occurs, because the bracket separator methods should be handling it.

Word division

Words and determinatives are handled at the line level, whereas some words are most probably split across multiple lines. Use the algorithm that I've described for oracc.

Write tests

Tests will cover the following classes:
cAtfLineGetter
cAtfLineDictBuilder
cAtfALHandler
cAtfWordDictBuilder
cAtfSignDictBuilder
cAtfTextBuilder

I probably won't write tests for the tester classes, such as cAtfLineTester.

Current progress: at cAtfTextBuilder.
