
loanpydatahub / loanpy

15 stars · 4 watchers · 1 fork · 32.67 MB

LoanPy is a linguistic toolkit for the rule-based prediction and evaluation of loanword adaptation and historical reconstruction. It can be used to search for old loanwords.

Home Page: https://twitter.com/martino_vik

License: MIT License

Languages: Python 99.62%, Makefile 0.38%
Topics: language-contact, linguistics, loanwords, sound-change-applier

loanpy's People

Contributors

martino-vic


Forkers

antipodite

loanpy's Issues

integrate workflow into CLDF's CLI

Currently loanpy does the file reading on its own with etym_obj = Etym(forms_csv="forms.csv", source_language="English", target_language="Bislama"), but this could be handled with something like:

from cldfbench_tryonbislama import Dataset as BSLM
from pycldf import Dataset  # pycldf's Dataset provides from_metadata()

bslm = BSLM()
ds = Dataset.from_metadata(bslm.cldf_dir.joinpath("cldf-metadata.json"))
for val in ds.objects("FormTable"):
    pass  # execute the loanpy workflow here

add try-except for KeyError in eval_sca.eval_one

If cross-validation leads to a sound that is not in scdict AND that can't be caught by the heuristics (e.g. 'ʊ͡ə'), adapt throws a KeyError. This makes eval_one stop when it calls adapt. Instead, append False to the results and continue the loop.
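
A minimal sketch of the proposed fix; the signature of eval_one, the word_pairs iterable, and the call to adapt are illustrative stand-ins for the real code:

def eval_one(word_pairs, adapt):
    """Illustrative stand-in for eval_sca.eval_one with the fix applied."""
    results = []
    for source_word, target_word in word_pairs:
        try:
            prediction = adapt(source_word)
        except KeyError:
            # sound missing from scdict and not caught by the heuristics,
            # e.g. 'ʊ͡ə': record a failed prediction and keep evaluating
            results.append(False)
            continue
        results.append(prediction == target_word)
    return results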

LingPy's align method can be used more efficiently

https://github.com/martino-vic/loanpy/blob/b073e367c31ef87c3b3e694b6f2c2adfa7abd2fa/qfysc.py#L405

Currently you convert the alignment of two strings to a string and then pull the information back out. But Pairwise works like this:

from lingpy import Pairwise

pw = Pairwise(["a", "b", "c"], ["b", "c", "a"])  # with segmented data there is no need to handle merging
pw.align(distance=True)
almA, almB, dist = pw.alignments[0]

You see? The alignment is accessible via the alignments attribute. Since Pairwise handles more than one pair at a time, it sits at position 0 when only one pair has been passed.

add AI-module

Use the script from the Google Colab notebook.
Consider doing this only in Colab, since training on one's own machine is wasteful.

rearrange order within modules

Put the class on top, then the main methods, then the helper methods/functions for __init__.

Rearrange the order of the tests too, so it reflects the order in the modules; this ensures that every function and method has its own test.

Write documentation

including examples in an ipynb, inline comments, a YouTube tutorial, and a methods chapter.
Also document the tests.

Update tests

Check that:

a) every function has a unit and an integration test,
b) redundant tests are deleted,
c) every function called inside a unit test is patched (see the sketch below).
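
A minimal sketch of what (c) could look like in practice with unittest.mock; the patch target loanpy.qfysc.read_scdict and its return value are hypothetical:

from unittest.mock import patch

# the helper that the tested unit calls is patched away, so the unit
# test exercises only the unit itself (the target name is made up)
@patch("loanpy.qfysc.read_scdict")
def test_some_unit(read_scdict_mock):
    read_scdict_mock.return_value = {"k": ["g"]}
    ...  # call the unit under test and assert on its output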

Is IPAtok really needed?

You use ipatok for a segmentation task, but ipatok is only needed if you have unsegmented text. With CLDF we now have the convention of working only with pre-segmented data, and this is beneficial: you can then concentrate on the coding aspect of your library rather than on data wrangling. So I suggest dropping the dependency and instead maybe going for an internal representation of the form similar to LingPy's, e.g. using lingpy.basictypes.strings or lingpy.basictypes.lists (which is more complete):

Test this out:

from lingpy import basictypes

tst = basictypes.lists("t h i s + i s + a + t e s t")
tst.n[1]  # .n splits at the "+" markers; index 1 is the second morpheme, "i s"

Refactor qfysc.Etym.__init__

A couple of simplifications were already made in the commits from today (Jul 13, 2022). Still open:

  • remove the function + tests + param, and maybe even the attribute, for read_dst/distance_measure
  • remove read_scdictbase
  • remove cldf2pd, i.e. ditch the long-table format
  • remove align()
  • rearrange the order of functions in adrc.py: class on top

bugfix (?) lingpy's prosodic_string func

Vowels with tie bars, i.e. diphthongs, are transcribed as a single vowel, i.e. prosodic_string("a͡e", _output="cv") == "V". This might generally be the preferred way to transcribe them. For my own use case, the following seems to have fixed the issue for now:

from lingpy.sequence.sound_classes import prosodic_string

lexseg = []  # collect segments, splitting tie-bar diphthongs in two
for i in ipa_in:  # ipa_in: list of IPA segments from the surrounding function
    # chr(865) is the combining tie bar, as in "a͡e"
    if chr(865) in i and prosodic_string([i], _output='cv') == "V":
        lexseg += i.split(chr(865))  # split the diphthong into its vowels
    else:
        lexseg.append(i)

Where is the `setup.py`

I was curious: how do you deploy the data on PyPI? I could not find the setup.py script; is it in another repository?

fix vowel harmony

[Image: Hungarian vowel harmony chart from https://en.wikipedia.org/wiki/Hungarian_phonology]

In the current version, front and back vowels as defined by ASJP can't occur in one word. That's wrong: it's the phonemes from the chart above that can't co-occur within a word (adapt and reconstruct in any case first reconstruct one of those vowels, so there's no need to fix ANY vowel; it's possible to fix only those specific ones).
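
A minimal sketch of the intended check, with placeholder phoneme sets (the real sets are the ones from the chart above, not ASJP's classes):

# placeholder sets; fill in the phonemes from the chart above
FRONT_ONLY = {"y", "yː", "ø", "øː"}
BACK_ONLY = {"u", "uː", "o", "oː"}

def violates_harmony(segments):
    """True if a word mixes vowels from the two incompatible sets."""
    return (any(s in FRONT_ONLY for s in segments)
            and any(s in BACK_ONLY for s in segments))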

front-back vowel quality

Just realised there's another part that seems like it should move to preprocessing as well: turning front vowels into an "F" and back vowels into a "B". This is for later, optionally repairing and filtering words that violate the constraint of front-back vowel harmony, i.e. the assumption that words can contain either only front vowels or only back vowels. This was just an idea I had in the beginning; ever since I started evaluating my results with sanity.py, it has turned out that the results are always worse when I use this feature, so I'm not actively using it anymore, but I think it's still cool to have the option built in.

My question before I start fiddling around with this: is there maybe a function in lingpy that can do this (similar to prosodic_string)? And if not, would it make sense to build it in there? Then I could ditch it from loanpy as well; it's kind of bloating the whole thing unnecessarily, I feel.
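
For reference, a minimal sketch of the F/B mapping described above (the vowel inventories are illustrative, not loanpy's actual classes):

FRONT, BACK = set("eiyøɛœ"), set("aouɔɑ")

def fb_string(segments):
    """Turn front vowels into "F" and back vowels into "B", keep the rest."""
    return ["F" if s in FRONT else "B" if s in BACK else s for s in segments]

fb_string(["t", "e", "r", "e", "m"])  # ['t', 'F', 'r', 'F', 'm']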

Fix tests

Earlier I was unknowingly always installing the latest stable version from PyPI and running the tests against that locally, instead of installing the latest commit from GitHub. This was fixed in #21. However, now I can see that the first minor change I had announced in #18 (comment), i.e. c44ecfc, in fact broke quite a few tests. These need to be fixed now.

uncomment functions in loanfinder.py

Uncomment the functions likeliestphonmatch and postprocessing, including their tests. This has to be done in combination with the lexibank script, in which the results have to be re-tokenised. Will have to be part of issue #15.

circleCI: install loanpy from latest git commit

The current YAML file tells CircleCI to pip install loanpy, i.e. the latest stable version on PyPI. I can't make a new stable release for every minor change I make, but I do want to run all the tests for every minor change. Therefore I must change the YAML file so that it pip installs from the latest commit. A quick web search for "pip install from github" led me to: https://pythoninoffice.com/python-pip-install-from-github/

but if I run

python3.9 -m pip install git+https://github.com/martino-vic/loanpy.git@2.0-beta

I get:

Collecting git+https://github.com/martino-vic/loanpy.git@2.0-beta
  Cloning https://github.com/martino-vic/loanpy.git (to revision 2.0-beta) to /tmp/pip-req-build-euxg2hai
  Running command git clone --filter=blob:none --quiet https://github.com/martino-vic/loanpy.git /tmp/pip-req-build-euxg2hai

  Resolved https://github.com/martino-vic/loanpy.git to commit c345c9912aec7c742d5c7778e1e6534a0cb2478b
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      /home/viktor/Documents/cldfvenv3.9/lib/python3.9/site-packages/setuptools/dist.py:717: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
        warnings.warn(
      running egg_info
      creating /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info
      writing /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/PKG-INFO
      writing dependency_links to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/dependency_links.txt
      writing requirements to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/requires.txt
      writing top-level names to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/top_level.txt
      writing manifest file '/tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/SOURCES.txt'
      error: package directory 'loanpy' does not exist
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
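
For comparison, installing from the latest commit on the default branch would be (assuming that branch, unlike the 2.0-beta tag, actually contains the loanpy package directory):

python3.9 -m pip install git+https://github.com/martino-vic/loanpy.git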

ditch gensim, switch to spark-nlp

Spark NLP seems more state-of-the-art, works more out of the box, and offers good vectors for German (and many more languages), so no unnecessary, error-prone translations to English are needed.

move alignment into preprocessing from qfysc

so the alignments can be improved with EDICTOR. The evaluation module can then read alignments from the alignment column instead of from the tokenised IPA transcriptions.

Workflow sketch:

  1. improve the alignments with EDICTOR
  2. download them to the etc/ folder
  3. include them through lexibank
  4. make a folder loanpy
  5. rewrite qfysc.py so it works with aligned inputs only
  6. write the sound correspondence files to the folder loanpy
  7. think about the best format to store them in
  8. create etymology profiles for each word pair. 1 profile = 1 table, linked by ID to the big table? In a separate folder?
  9. run sanity.py and write the results to a new folder sanity?
  10. include entropy in forms.csv via the lexibank script?

phonotactic profiles

At the moment I'm calculating the phonotactic profiles of words from within loanpy and was wondering whether that isn't also a data-wrangling task that should be moved to preprocessing. If so, I could fix this step simultaneously with issue #18. @LinguList
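
For context, the CV-skeleton sense of a phonotactic profile can be illustrated with lingpy's prosodic_string, which is already discussed elsewhere in these issues (the example word is made up):

from lingpy.sequence.sound_classes import prosodic_string

# consonants become "C", vowels "V": the phonotactic profile of the word
prosodic_string("t e r e m".split(), _output="cv")  # 'CVCVC'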

calculate AUC

from sklearn.metrics import roc_auc_score

# first argument: true binary labels, second: predicted scores
roc_auc_score(nparray_x, nparray_y)
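
A runnable toy example, assuming the first array holds the gold binary labels and the second the predicted scores:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])             # gold labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted scores
roc_auc_score(y_true, y_score)              # 0.75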

misc

  • rename private functions so their names start with "_"
  • switch from JSON to TOML for the sound-correspondence files, to make them human-editable (TOML even allows comments)
  • add a pyproject.toml and move some functionality from setup.py into it
