
loanpydatahub / loanpy

15 stars · 4 watchers · 1 fork · 32.67 MB

LoanPy is a linguistic toolkit for the rule-based prediction and evaluation of loanword adaptation and historical reconstruction. It can be used to search for old loanwords.

Home Page: https://twitter.com/martino_vik

License: MIT License

Languages: Python 99.62%, Makefile 0.38%
Topics: language-contact, linguistics, loanwords, sound-change-applier

loanpy's People

Contributors

martino-vic


Forkers

antipodite

loanpy's Issues

integrate workflow into CLDF's CLI

Currently loanpy does the file reading on its own with etym_obj = Etym(forms_csv="forms.csv", source_language="English", target_language="Bislama"), but this could be handled with something like:

from cldfbench_tryonbislama import Dataset as BSLM
from pycldf import Dataset  # pycldf's Dataset provides from_metadata()

bslm = BSLM()
ds = Dataset.from_metadata(bslm.cldf_dir.joinpath("cldf-metadata.json"))
for val in ds.objects("FormTable"):
    pass  # execute the loanpy workflow here

add try-except for KeyError in eval_sca.eval_one

If cross-validation leads to a sound that is not in scdict AND that can't be caught by the heuristics (e.g. 'ʊ͡ə'), adapt throws a KeyError. This makes eval_one stop when it calls adapt. Instead, append False to the results and continue the loop.
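
A minimal sketch of the proposed fix; the signature of eval_one, the word_pairs iterable, and the call to adapt are illustrative stand-ins for the real code:

def eval_one(word_pairs, adapt):
    """Illustrative stand-in for eval_sca.eval_one with the fix applied."""
    results = []
    for source_word, target_word in word_pairs:
        try:
            prediction = adapt(source_word)
        except KeyError:
            # sound missing from scdict and not caught by the heuristics,
            # e.g. 'ʊ͡ə': record a failed prediction and keep evaluating
            results.append(False)
            continue
        results.append(prediction == target_word)
    return results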

LingPy's align method can be used more efficiently

https://github.com/martino-vic/loanpy/blob/b073e367c31ef87c3b3e694b6f2c2adfa7abd2fa/qfysc.py#L405

Currently you convert the alignment of two strings to a string and then pull the information back out. But Pairwise works like this:

from lingpy import Pairwise

pw = Pairwise(["a", "b", "c"], ["b", "c", "a"])  # with segmented data there is no need to handle merging
pw.align(distance=True)
almA, almB, dist = pw.alignments[0]

You see? The alignment is accessible via the alignments attribute. Since Pairwise handles more than one pair at a time, it sits at position 0 when only one pair has been passed.

add AI-module

Use the script from the Google Colab notebook.
Consider doing this only in Colab, since training on one's own machine is wasteful.

rearrange order within modules

Put the class on top, then the main methods, then the helper methods/functions for __init__.

Rearrange the order of the tests too, so it reflects the order in the modules; this ensures that every function and method has its own test.

Write documentation

including examples in an ipynb, inline comments, a YouTube tutorial, and a methods chapter.
Also document the tests.

Update tests

Check that:

a) every function has a unit and an integration test,
b) redundant tests are deleted,
c) every function called inside a unit test is patched (see the sketch below).
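
A minimal sketch of what (c) could look like in practice with unittest.mock; the patch target loanpy.qfysc.read_scdict and its return value are hypothetical:

from unittest.mock import patch

# the helper that the tested unit calls is patched away, so the unit
# test exercises only the unit itself (the target name is made up)
@patch("loanpy.qfysc.read_scdict")
def test_some_unit(read_scdict_mock):
    read_scdict_mock.return_value = {"k": ["g"]}
    ...  # call the unit under test and assert on its output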

Is IPAtok really needed?

You use ipatok for a segmentation task, but ipatok is only needed if you have unsegmented text. With CLDF we now have the convention of working only with pre-segmented data, and this is beneficial: you can then concentrate on the coding aspect of your library rather than on data wrangling. So I suggest dropping the dependency and instead maybe going for an internal representation of the form similar to LingPy's, e.g. using lingpy.basictypes.strings or lingpy.basictypes.lists (which is more complete):

Test this out:

from lingpy import basictypes

tst = basictypes.lists("t h i s + i s + a + t e s t")
tst.n[1]  # .n splits at the "+" markers; index 1 is the second morpheme, "i s"

Refactor qfysc.Etym.__init__

A couple of simplifications were already made in the commits from today (Jul 13, 2022). Still open:

  • remove the function + tests + param, and maybe even the attribute, for read_dst/distance_measure
  • remove read_scdictbase
  • remove cldf2pd, i.e. ditch the long-table format
  • remove align()
  • rearrange the order of functions in adrc.py: class on top

bugfix (?) lingpy's prosodic_string func

Vowels with tie bars, i.e. diphthongs, are transcribed as a single vowel, i.e. prosodic_string("a͡e", _output="cv") == "V". This might generally be the preferred way to transcribe them. For my own use case, the following seems to have fixed the issue for now:

from lingpy.sequence.sound_classes import prosodic_string

lexseg = []  # collect segments, splitting tie-bar diphthongs in two
for i in ipa_in:  # ipa_in: list of IPA segments from the surrounding function
    # chr(865) is the combining tie bar, as in "a͡e"
    if chr(865) in i and prosodic_string([i], _output='cv') == "V":
        lexseg += i.split(chr(865))  # split the diphthong into its vowels
    else:
        lexseg.append(i)

Where is the `setup.py`

I was curious: how do you deploy the data on PyPI? I could not find the setup.py script; is it in another repository?

fix vowel harmony

[Image: Hungarian vowel harmony chart from https://en.wikipedia.org/wiki/Hungarian_phonology]

In the current version, front and back vowels as defined by ASJP can't occur in one word. That's wrong: it's the phonemes from the chart above that can't co-occur within a word (adapt and reconstruct in any case first reconstruct one of those vowels, so there's no need to fix ANY vowel; it's possible to fix only those specific ones).
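
A minimal sketch of the intended check, with placeholder phoneme sets (the real sets are the ones from the chart above, not ASJP's classes):

# placeholder sets; fill in the phonemes from the chart above
FRONT_ONLY = {"y", "yː", "ø", "øː"}
BACK_ONLY = {"u", "uː", "o", "oː"}

def violates_harmony(segments):
    """True if a word mixes vowels from the two incompatible sets."""
    return (any(s in FRONT_ONLY for s in segments)
            and any(s in BACK_ONLY for s in segments))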

front-back vowel quality

Just realised there's another part that seems like it should move to preprocessing as well: turning front vowels into an "F" and back vowels into a "B". This is for later, optionally repairing and filtering words that violate the constraint of front-back vowel harmony, i.e. the assumption that words can contain either only front vowels or only back vowels. This was just an idea I had in the beginning; ever since I started evaluating my results with sanity.py, it has turned out that the results are always worse when I use this feature, so I'm not actively using it anymore, but I think it's still cool to have the option built in.

My question before I start fiddling around with this: is there maybe a function in lingpy that can do this (similar to prosodic_string)? And if not, would it make sense to build it in there? Then I could ditch it from loanpy as well; it's kind of bloating the whole thing unnecessarily, I feel.
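
For reference, a minimal sketch of the F/B mapping described above (the vowel inventories are illustrative, not loanpy's actual classes):

FRONT, BACK = set("eiyøɛœ"), set("aouɔɑ")

def fb_string(segments):
    """Turn front vowels into "F" and back vowels into "B", keep the rest."""
    return ["F" if s in FRONT else "B" if s in BACK else s for s in segments]

fb_string(["t", "e", "r", "e", "m"])  # ['t', 'F', 'r', 'F', 'm']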

Fix tests

Earlier I was unknowingly always installing the latest stable version from PyPI and running the tests against that locally, instead of installing the latest commit from GitHub. This was fixed in #21. However, now I can see that the first minor change I had announced in #18 (comment), i.e. c44ecfc, in fact broke quite a few tests. These need to be fixed now.

uncomment functions in loanfinder.py

Uncomment the functions likeliestphonmatch and postprocessing, including their tests. This has to be done in combination with the lexibank script, in which the results have to be re-tokenised. Will have to be part of issue #15.

circleCI: install loanpy from latest git commit

The current YAML file tells CircleCI to pip install loanpy, i.e. the latest stable version on PyPI. I can't make a new stable release for every minor change I make, but I do want to run all the tests for every minor change. Therefore I must change the YAML file so that it pip installs from the latest commit. A quick web search for "pip install from github" led me to: https://pythoninoffice.com/python-pip-install-from-github/

but if I run

python3.9 -m pip install git+https://github.com/martino-vic/loanpy.git@2.0-beta

I get:

Collecting git+https://github.com/martino-vic/loanpy.git@2.0-beta
  Cloning https://github.com/martino-vic/loanpy.git (to revision 2.0-beta) to /tmp/pip-req-build-euxg2hai
  Running command git clone --filter=blob:none --quiet https://github.com/martino-vic/loanpy.git /tmp/pip-req-build-euxg2hai

  Resolved https://github.com/martino-vic/loanpy.git to commit c345c9912aec7c742d5c7778e1e6534a0cb2478b
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      /home/viktor/Documents/cldfvenv3.9/lib/python3.9/site-packages/setuptools/dist.py:717: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
        warnings.warn(
      running egg_info
      creating /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info
      writing /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/PKG-INFO
      writing dependency_links to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/dependency_links.txt
      writing requirements to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/requires.txt
      writing top-level names to /tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/top_level.txt
      writing manifest file '/tmp/pip-pip-egg-info-snob2_1c/loanpy.egg-info/SOURCES.txt'
      error: package directory 'loanpy' does not exist
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
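
For comparison, installing from the latest commit on the default branch would be (assuming that branch, unlike the 2.0-beta tag, actually contains the loanpy package directory):

python3.9 -m pip install git+https://github.com/martino-vic/loanpy.git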

ditch gensim, switch to spark-nlp

Spark NLP seems more state-of-the-art, works more out of the box, and offers good vectors for German (and many more languages), so no unnecessary, error-prone translations to English are needed.

move alignment into preprocessing from qfysc

so the alignments can be improved with EDICTOR. The evaluation module can then read alignments from the alignment column instead of from the tokenised IPA transcriptions.

Workflow sketch:

  1. improve the alignments with EDICTOR
  2. download them to the etc/ folder
  3. include them through lexibank
  4. make a folder loanpy
  5. rewrite qfysc.py so it works with aligned inputs only
  6. write the sound correspondence files to the folder loanpy
  7. think about the best format to store them in
  8. create etymology profiles for each word pair. 1 profile = 1 table, linked by ID to the big table? In a separate folder?
  9. run sanity.py and write the results to a new folder sanity?
  10. include entropy in forms.csv via the lexibank script?

phonotactic profiles

At the moment I'm calculating the phonotactic profiles of words from within loanpy and was wondering whether that isn't also a data-wrangling task that should be moved to preprocessing. If so, I could fix this step simultaneously with issue #18. @LinguList
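
For context, the CV-skeleton sense of a phonotactic profile can be illustrated with lingpy's prosodic_string, which is already discussed elsewhere in these issues (the example word is made up):

from lingpy.sequence.sound_classes import prosodic_string

# consonants become "C", vowels "V": the phonotactic profile of the word
prosodic_string("t e r e m".split(), _output="cv")  # 'CVCVC'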

calculate AUC

from sklearn.metrics import roc_auc_score

# first argument: true binary labels, second: predicted scores
roc_auc_score(nparray_x, nparray_y)
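
A runnable toy example, assuming the first array holds the gold binary labels and the second the predicted scores:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])             # gold labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted scores
roc_auc_score(y_true, y_score)              # 0.75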

misc

  • rename private functions so their names start with "_"
  • switch from JSON to TOML for the sound-correspondence files, to make them human-editable (TOML even allows comments)
  • add a pyproject.toml and move some functionality from setup.py into it
