Introduction

The code in this repository is intended for use in the Inscriptions of Israel / Palestine project. It uses Python and lxml to generate word lists from EpiDoc files and includes a simple web interface.
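
As a rough illustration of what generating a word list from an EpiDoc file involves, here is a minimal sketch. It is not the project's wordlist.py: it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without dependencies, and the sample document, the word_counts function, and the choice to read only div elements with type="edition" are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by EpiDoc documents.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A made-up, heavily simplified EpiDoc-like document (illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="la">
      <p>pax vobis et pax omnibus</p>
    </div>
  </body></text>
</TEI>"""


def word_counts(xml_string):
    """Count the words found inside every <div type="edition">."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for div in root.iter(TEI_NS + "div"):
        if div.get("type") != "edition":
            continue
        # Join all text nodes below the div, ignoring the markup itself.
        text = "".join(div.itertext())
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


counts = word_counts(SAMPLE)
for word, n in counts.most_common():
    print(word, n)
```

Real inscriptions need more care than this (abbreviations, gaps, and words split across lines, for example), which is where most of the issues listed under "Problems Encountered" come from.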

Setup

  1. Clone or download the repository.
  2. Enter the project directory with cd iip-word-lists.
  3. Create a Python 3 virtual environment by running virtualenv -p python3 environment. If you do not have virtualenv installed, install it using your system's package manager, or with pip by running pip install virtualenv.
  4. Activate the virtual environment by running source environment/bin/activate. (The virtual environment must be active whenever you run the Python code or rebuild the site. If it is active, you should see (environment) before the prompt in your terminal.)
  5. Install the necessary dependencies by running pip install -r requirements.txt.

To run the site locally

  1. Enter the docs directory with cd docs.
  2. Start an HTTP server by running python -m http.server 8000.
  3. Open localhost:8000 in your web browser.

(You can view the files without running the server, but some links will not work.)

To build the site

  1. From the root project directory, run ./build_site.sh. Add -nu if you are updating the site and do not wish to download the XML files.

Project structure

  • docs contains the files for the GitHub Pages site.
    • texts contains the files representing individual inscriptions.
      • xml contains these files in their original XML form.
      • plain contains plain text representations of the inscriptions.
      • plain_lemma contains the same as above, but using the lemma of each word instead of the actual text as it appeared in the inscription.
    • Each language has its own directory containing its data in HTML format.
    • doubletreejs contains code for the DoubleTreeJS visualization library.
  • src contains the list creation script and the HTML and CSS templates for the site.
    • python contains the Python scripts for processing the data.
      • wordlist.py is the Python script that generates word lists. The basic usage is ./wordlist.py <epidoc files to process>. By default, the list is printed to the terminal; other output formats can be specified with flags. Run ./wordlist.py --help for information on usage.
    • web contains the CSS, JavaScript, and HTML templates used to build the site.
  • .gitignore lists files that should not be included in the repository, such as lock files.
  • README.md lists information about the project.
  • build_site.sh is a bash script that rebuilds the site, outputting to the docs directory. It can be run by typing ./build_site.sh in the terminal from the root project directory. To rebuild the site without re-downloading the EpiDoc files, run ./build_site.sh --use-existing. To rebuild the site without updating the word lists (for example, when working on the frontend), run ./build_site.sh --no-update. For help, run ./build_site.sh --help.

Functionality

Lemmatization

A word's lemma is its "basic" form as it might appear in a dictionary. For instance, the lemma of "thinking" is "think". The process of getting a lemma from a word is called "lemmatization." Lemmatization allows this project to recognize different strings as instances of the same word, which makes it possible to study the usage and distribution of specific words.

Lemmatization is currently done only for Latin and Greek, using the lemmatizers provided by CLTK.
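
The following sketch shows why lemmas matter for counting: several inflected forms collapse into a single entry. The LEMMAS table and both functions are hypothetical stand-ins written for this example; the project obtains lemmas from CLTK, whose API is not shown here.

```python
from collections import Counter

# Hypothetical lookup table, for illustration only. The project gets its
# lemmas from CLTK rather than from a hand-written dictionary.
LEMMAS = {
    "rex": "rex",
    "regis": "rex",
    "regem": "rex",
    "pax": "pax",
    "pacem": "pax",
}


def lemmatize(word):
    """Return the lemma of a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)


def count_by_lemma(words):
    """Count occurrences per lemma rather than per surface form."""
    return Counter(lemmatize(w) for w in words)


tokens = ["Rex", "regis", "pacem", "regem", "pax", "et"]
lemma_counts = count_by_lemma(tokens)
print(lemma_counts)
```

Counting the raw strings would give six separate entries here; counting by lemma gives three, with "rex" and "pax" each gathering their inflected forms.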

Problems Encountered

  • Line breaks following certain tags indicate the start of a new word. These tags are currently listed in the global variable include_trailing_linebreak. However, this list is not comprehensive. A complete list based on the EpiDoc spec should be added.
  • How should gaps be handled?
  • Graffiti: some transcriptions, such as masa09390.xml, are of graffiti and do not contain complete words but just jumbles of characters. Currently these are added to the word list as if they were words, leading to some strange results. However, if we ignored all files marked as containing graffiti, we could potentially lose some words.
  • Should <num> elements always indicate the start of a new word?

Probable Mistakes found in iip-texts

  • caes0004.xml, line 187: <lb> should have attribute break="no"
  • jeru0003.xml, line 127: <lb> should have attribute break="no"
  • zoor0013.xml, line 136: 'expan="ἔτους">' appears outside of tag
  • zoor0136.xml, line 156: 'expan="ἡμέρᾳ">' appears outside of tag
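
The first two items concern <lb> elements that fall in the middle of a word. A minimal sketch of why the attribute matters when splitting text into words is given below. It assumes that an <lb> without break="no" marks a word boundary, and it uses a made-up, namespace-free fragment with the standard library's xml.etree.ElementTree; the flatten function is hypothetical and is not taken from wordlist.py.

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Return an element's text, treating <lb/> as a word boundary
    unless it carries break="no"."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "lb":
            # A plain line break separates words; break="no" means the
            # word continues on the next line, so nothing is inserted.
            if child.get("break") != "no":
                parts.append(" ")
        else:
            parts.append(flatten(child))
        # Text that follows the child element inside the parent.
        parts.append(child.tail or "")
    return "".join(parts)


# Made-up fragment: "vobis" is split across two lines of the inscription.
SAMPLE = '<ab>pax vo<lb break="no"/>bis<lb/>amen</ab>'
words = flatten(ET.fromstring(SAMPLE)).split()
print(words)
```

If the first <lb> lacked break="no", "vo" and "bis" would be counted as two separate words, which is the kind of error the items above describe. Real files also contain whitespace around <lb> elements that this sketch does not strip.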

Todo

  • Geographic Word Counts
  • Filter by Region
  • Word in Context
  • Default to sort by occurrence
  • Word Frequency by Geography
  • Part of Speech Tagging
  • Display for <corr> <reg>
  • Stems
  • Fuzzy Search
  • Wildcard Search
  • Named Entity Recognition?
  • Frontend Redesign
  • Add Alphabet links to frontend

Misc

Thank you to the Unicode Consortium for keeping us on our toes by including all these as separate characters: · ‧ ⋅ • ∙.
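
One way to cope with these look-alikes is to map them all to a single character before splitting text into words. The sketch below assumes that the dots act as word dividers in the transcriptions, which the note above does not state; the names DOT_LIKE and split_on_dots are made up for the example.

```python
# The five characters listed above: middle dot, hyphenation point,
# dot operator, bullet, and bullet operator.
DOT_LIKE = "\u00b7\u2027\u22c5\u2022\u2219"

# Translation table that turns every dot-like character into a space.
TO_SPACE = str.maketrans({ch: " " for ch in DOT_LIKE})


def split_on_dots(text):
    """Split text into words, treating dot-like characters as separators."""
    return text.translate(TO_SPACE).split()


print(split_on_dots("pax\u00b7vobis\u2022amen"))
```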
