The iip-word-lists from ashleychampagne

Introduction

The code in this repository is intended for use in the Inscriptions of Israel / Palestine project. It uses Python and LXML to generate word lists from EpiDoc files and includes a simple web interface.
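As a rough illustration of what generating a word list from an EpiDoc file involves, here is a minimal sketch. It is not the project's wordlist.py: it uses the standard library's xml.etree.ElementTree rather than LXML, the sample document is a made-up and heavily simplified fragment, and the function name word_counts is hypothetical. It assumes the transcription sits in a div with type="edition" in the TEI namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified EpiDoc fragment (not a real IIP file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="la">
      <p>pax vobis <lb/>pax domini</p>
    </div>
  </body></text>
</TEI>"""


def word_counts(epidoc_xml):
    """Count whitespace-separated tokens in each edition div (assumed layout)."""
    root = ET.fromstring(epidoc_xml)
    counts = Counter()
    for div in root.iterfind(".//tei:div[@type='edition']", TEI_NS):
        # itertext() yields the text of the div and of all its descendants.
        text = "".join(div.itertext())
        counts.update(text.split())
    return counts


if __name__ == "__main__":
    for word, count in word_counts(SAMPLE).most_common():
        print(word, count)
```

Splitting on whitespace is the simplest possible tokenizer; the real script has to deal with abbreviations, gaps, and line breaks, as discussed under Problems Encountered.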

Setup

Clone or download the repository.
Enter the project directory with cd iip-word-lists
Create a virtual environment with the appropriate dependencies by running virtualenv -p python3 environment. If you do not have virtualenv installed, install it using your system's package manager, or with pip by running pip install virtualenv.
Activate the virtual environment by running source environment/bin/activate. The virtual environment must be active whenever you run the Python code or rebuild the site; when it is active, (environment) appears before the prompt in your terminal.
Install the necessary dependencies by running pip install -r requirements.txt

To run the site locally

Enter the docs directory with cd docs
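Start an HTTP server by running python -m http.server 8000 (the SimpleHTTPServer module exists only in Python 2; under the Python 3 environment created above, the equivalent module is http.server).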
Open localhost:8000 in your web browser

(You can view the files without running the server, but some links will not work.)

To build the site

From the root project directory, run ./build_site.sh. Add -nu if you are updating the site and do not wish to download the XML files.

Project structure

docs contains the files for the github pages site.
- texts contains the files representing individual inscriptions.
  - xml contains these files in their original XML form.
  - plain contains plain text representations of the inscriptions
  - plain_lemma contains the same as above, but using lemmas of each word instead of the actual text as it appeared in the inscription.
- each language has its own directory containing its data in html format.
- doubletreejs contains code for the DoubleTreeJS visualization library.
src contains the list creation script and the html and css templates for the site.
- python contains the python scripts for processing the data
  - wordlist.py is the Python script that generates word lists. The basic usage is ./wordlist.py <epidoc files to process>. By default, the list is printed to the terminal; other output formats can be specified with flags. Run ./wordlist.py --help for information on usage.
- web contains the CSS, JavaScript, and HTML templates used to build the site.
.gitignore lists files that should not be included in the repository, such as lock files, etc.
README.md lists information about the project.
build_site.sh is a bash script that rebuilds the site, outputting to the docs directory. It can be run by typing ./build_site.sh in the terminal from the root project directory. To rebuild the site without re-downloading the EpiDoc files, run ./build_site.sh --use-existing. To rebuild the site without updating the word lists (for example, when working on the frontend), run ./build_site.sh --no-update. For help, run ./build_site.sh --help.

Functionality

Lemmatization

A word's lemma is its "basic" form as it might appear in a dictionary. For instance, the lemma of "rethinking" is "rethink". The process of getting a lemma from a word is called "lemmatization." Lemmatization allows this project to recognize different strings as instances of the same word, which is useful for studying the usage and distribution of specific words.

Lemmatization is currently done only for Latin and Greek, as provided by CLTK.
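To make the benefit concrete, the sketch below groups surface forms under a shared lemma. The lookup table TOY_LEMMAS is a hand-written stand-in for a real lemmatizer such as CLTK, and group_by_lemma is a hypothetical helper, not part of this repository.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real lemmatizer such as CLTK.
TOY_LEMMAS = {
    "pacem": "pax",
    "pacis": "pax",
    "pax": "pax",
    "domini": "dominus",
    "dominus": "dominus",
}


def group_by_lemma(words, lemmas=TOY_LEMMAS):
    """Map each lemma to the sorted list of surface forms seen for it."""
    groups = defaultdict(set)
    for word in words:
        key = word.lower()
        # Fall back to the word itself when no lemma is known.
        groups[lemmas.get(key, key)].add(key)
    return {lemma: sorted(forms) for lemma, forms in groups.items()}


if __name__ == "__main__":
    print(group_by_lemma(["Pacem", "pax", "domini", "amen"]))
```

With such a grouping, "pacem" and "pax" are counted as occurrences of the same word, while a form with no known lemma (here "amen") simply stands for itself.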

Problems Encountered

Line breaks following certain tags indicate the start of a new word. These tags are currently listed in the global variable include_trailing_linebreak. However, this list is not comprehensive; a complete list based on the EpiDoc spec should be added.
How should gaps be handled?
Graffiti: some transcriptions, such as masa09390.xml, are of graffiti and do not contain complete words but just jumbles of characters. Currently these are added to the word list as if they were words, leading to some strange results. However, if we ignored all files marked as containing graffiti, we could potentially lose some words.
Should <num> elements always indicate the start of a new word?
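The line-break question can be illustrated with a small sketch. It assumes the EpiDoc convention that <lb break="no"/> marks a line break falling inside a word, while a plain <lb/> separates words. It does not reproduce the include_trailing_linebreak logic in wordlist.py; the function names are hypothetical, the sample fragments are invented, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def _flatten(elem, pieces):
    """Append the text of elem and its descendants to pieces, in document order."""
    if elem.tag == "lb" and elem.get("break") != "no":
        # An ordinary line break separates two words.
        pieces.append(" ")
    pieces.append(elem.text or "")
    for child in elem:
        _flatten(child, pieces)
        # A child's tail is the text that follows its closing tag.
        pieces.append(child.tail or "")


def words_across_linebreaks(fragment):
    """Split an un-namespaced XML fragment into words, honouring break="no"."""
    pieces = []
    _flatten(ET.fromstring(fragment), pieces)
    return "".join(pieces).split()


if __name__ == "__main__":
    print(words_across_linebreaks('<ab>in no<lb break="no"/>mine<lb/>domini</ab>'))
```

Here "no" and "mine" are rejoined into "nomine", whereas "mine" and "domini" are kept apart even though no whitespace separates them in the source.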

Probable Mistakes found in iip-texts

caes0004.xml, line 187: <lb> should have attribute break="no"
jeru0003.xml, line 127: <lb> should have attribute break="no"
zoor0013.xml, line 136: 'expan="ἔτους">' appears outside of tag
zoor0136.xml, line 156: 'expan="ἡμέρᾳ">' appears outside of tag
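Stray attribute fragments like the last two are still well-formed XML, because > and " are legal in character data, so a parser will not flag them. One way to look for them is sketched below; the regular expression and the function name find_stray_attributes are assumptions made for illustration and are not part of this repository.

```python
import re
import xml.etree.ElementTree as ET

# Matches attribute-like debris such as  expan="...">  inside running text.
STRAY_ATTRIBUTE = re.compile(r'\w+="[^"]*">')


def find_stray_attributes(xml_text):
    """Return snippets of text that look like attribute fragments outside a tag."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        # Check both the text inside the element and the text following it.
        for chunk in (elem.text, elem.tail):
            if chunk:
                hits.extend(STRAY_ATTRIBUTE.findall(chunk))
    return hits


if __name__ == "__main__":
    # Hypothetical fragment imitating the kind of mistake listed above.
    print(find_stray_attributes('<ab><abbr>x</abbr> expan="etous"> y</ab>'))
```

A check like this can produce false positives on text that legitimately contains such a pattern, so its output is best treated as a list of places to inspect by hand.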

Todo

Misc

Thank you to the Unicode Consortium for keeping us on our toes by including all these as separate characters: · ‧ ⋅ • ∙.
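One way to cope with these look-alikes is to map them all to a single character before splitting text into words. The sketch below does that with str.translate; treating the first character as the canonical form is an arbitrary assumption, and normalize_dots is a hypothetical helper rather than the project's own code.

```python
# The five dot-like characters listed above; the first is used as the canonical form.
DOTS = "·‧⋅•∙"
CANONICAL = DOTS[0]

# Translation table mapping every variant to the canonical dot.
_TABLE = str.maketrans({dot: CANONICAL for dot in DOTS})


def normalize_dots(text):
    """Replace every dot-like separator in text with the canonical one."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_dots("pax" + DOTS[3] + "vobis" + DOTS[4] + "amen"))
```

After normalization, a single split on CANONICAL is enough to separate words divided by any of the variants.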
