pyconcepticon's People

Contributors

chrzyki, lingulist, martino-vic, simongreenhill, xrotwang

pyconcepticon's Issues

Validation in Conceptlist with `partial` and `valid_int` yields `None` values

>>> from pyconcepticon.api import Concepticon
>>> from pyconcepticon.util import lowercase
>>> from pyconcepticon.models import Conceptlist
>>> con = Concepticon()
>>> Conceptlist(api=con, **lowercase(con.conceptlists_dicts[0]))
Conceptlist(_api=<pyconcepticon.api.Concepticon object at 0x7f6465f35f10>, id='Savelyev-2020-254', author='Savelyev, Alexander and Robbeets, Martine', year=None, list_suffix='', items=None, tags=['basic'], source_language=['english'], target_language='Turkic languages', url='https://zenodo.org/record/3555174/files/turkic_alignment.tsv?download=1', refs=['Savelyev2020'], pdf=[], note='This concept list, which extends the classical [Swadesh list of 200 items](:ref:Swadesh-1952-200) by the [Leipzig-Jakarta list](:ref:Tadmor-2009-100) was used for a phylogenetic study on Turkic languages.', pages='', alias=[], local=False)

As the output shows, both year and items are None.

I found that changing the attr.ib converter in line 194 from

    year = attr.ib(converter=partial(valid_int, 'YEAR'))

to

    year = attr.ib(converter=int)

fixes this bug. The question is: do we need valid_int at all? We would prefer an error to be raised if items or year are not integers, right?
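To make the "raise an error" option concrete, here is a minimal sketch of a strict converter. It does not reproduce valid_int, whose implementation is not shown in this issue; strict_int and year_converter are hypothetical names, and only the partial(..., 'YEAR') binding mirrors the quoted line.

```python
import functools


def strict_int(name, value):
    """Hypothetical converter: return int(value) or raise an error naming the field."""
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError('{0} must be an integer, got {1!r}'.format(name, value))


# Bind the field name up front, as the quoted attr.ib line does with partial().
year_converter = functools.partial(strict_int, 'YEAR')

print(year_converter('2020'))
```

A callable like year_converter could presumably be passed as converter= to attr.ib, so a missing or malformed year would fail loudly instead of silently becoming None.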

add wheel to requirements?

When running pip install pyconcepticon in a fresh virtual environment, I get an error (see below). I fixed this by first running pip install wheel, but I wonder whether it would be more practical to list wheel as a dependency in requirements.txt.

  Building wheel for pylatexenc (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/viktor/Documents/GitHub/streitberggothic/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3nahcf/pylatexenc/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3nahcf/pylatexenc/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-0doo41n8
       cwd: /tmp/pip-install-xf3nahcf/pylatexenc/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for pylatexenc
  Running setup.py clean for pylatexenc
  Building wheel for docopt (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/viktor/Documents/GitHub/streitberggothic/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3nahcf/docopt/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3nahcf/docopt/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4kotq992
       cwd: /tmp/pip-install-xf3nahcf/docopt/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for docopt

Open issue per editor

As discussed with @xrotwang, it would be great to have an overview of open issues per editor so that editors can check open tasks or follow up on discussions.

Compute intersection of local conceptlist

Hello everybody,

Is there an implemented way of running the intersection command against a local conceptlist that is not part of CONCEPTICON but has been successfully mapped?

I am currently trying to run the following command:
concepticon intersection Swadesh-1955-100 Pano-Takana/blumpanotacana/etc/Conceptlists/Alviano-1957-448.tsv

Alviano-1957-448 is a conceptlist with the following columns:

ID NUMBER WORD CONCEPTICON_ID CONCEPTICON_GLOSS

However, I receive the following error message:

(concepticon) blum@lingn18 Projects % concepticon intersection Swadesh-1955-100 Pano-Takana/blumpanotacana/etc/Conceptlists/Alviano-1957-448.tsv
INFO    concepticon/concepticon-data at /Users/blum/Library/Application Support/cldf/concepticon
Traceback (most recent call last):
  File "/usr/local/bin/concepticon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/__main__.py", line 62, in main
    return args.main(args) or 0
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/commands/intersection.py", line 22, in run
    format_set_operation(args.repos.intersection(*get_conceptlist(args)))
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/api.py", line 543, in intersection
    return list(self._set_operation('intersection', *clids, **kw))
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/api.py", line 524, in _set_operation
    for c, lists in compare_conceptlists(self, *clids, **kw):
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 315, in compare_conceptlists
    for c in clist.concepts.values():
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 255, in concepts
    for item in self.metadata:
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 234, in metadata
    return self.tg.tables[0]
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 226, in tg
    tg = TableGroup.from_file(md)
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1039, in from_file
    res = cls.fromvalue(data or get_json(fname))
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 430, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.TableGroup>", line 39, in __init__
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1402, in converter_tables
    res.append(Table.fromvalue(vv) if isinstance(vv, dict) else vv)
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 430, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.Table>", line 46, in __init__
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1166, in __attrs_post_init__
    raise ValueError('url property is required for Tables')
ValueError: url property is required for Tables

What works:
concepticon intersection Swadesh-1955-100 Swadesh-1952-200

How can I fix my command or the conceptlist so I can compute the intersection?
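As a stopgap, and not a fix for the command itself, a plain intersection on the CONCEPTICON_ID column can be computed without pyconcepticon. This is only a sketch: it assumes both lists are tab-separated files with the columns named above, it ignores concept relations (which the intersection command takes into account), and the helper name concepticon_ids as well as the sample rows are made up for illustration.

```python
import csv
import io


def concepticon_ids(tsv_file):
    """Collect the non-empty CONCEPTICON_ID values from a tab-separated concept list."""
    reader = csv.DictReader(tsv_file, delimiter='\t')
    return {row['CONCEPTICON_ID'] for row in reader if row.get('CONCEPTICON_ID')}


# Two tiny in-memory lists stand in for real files; IDs and glosses are placeholders.
header = 'ID\tNUMBER\tWORD\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n'
list_a = io.StringIO(header + 'a-1\t1\tx\t1\tGLOSS_ONE\n' + 'a-2\t2\ty\t2\tGLOSS_TWO\n')
list_b = io.StringIO(header + 'b-1\t1\tz\t2\tGLOSS_TWO\n' + 'b-2\t2\tw\t\t\n')

# Unmapped rows (empty CONCEPTICON_ID) are skipped before intersecting.
shared = concepticon_ids(list_a) & concepticon_ids(list_b)
print(sorted(shared))
```

With real data, the StringIO objects would be replaced by files opened with open(path, encoding='utf-8', newline='').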

handling multiple relations for the same concept sets

As came up during the integration of new concept relations, the current version of concepticon test cannot handle concepts that have more than one relation.

The dictionary there currently has the structure:

{
    "concept1": {
        "concept2": "relation-unique"
    }
}

We'll probably need a set/list there:

{
    "concept1": {
        "concept2": ["relation-A", "relation-B"]
    }
}
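A minimal sketch of how such a structure could be built, assuming the relations are available as (source, relation, target) triples. The function name build_relations and the input format are assumptions; the current test code is not shown here.

```python
import collections


def build_relations(triples):
    """Map source -> target -> set of relation names, so duplicates pairs are allowed."""
    relations = collections.defaultdict(lambda: collections.defaultdict(set))
    for source, relation, target in triples:
        relations[source][target].add(relation)
    return relations


rels = build_relations([
    ('concept1', 'relation-A', 'concept2'),
    ('concept1', 'relation-B', 'concept2'),
])
print(sorted(rels['concept1']['concept2']))
```

Using a set rather than a list avoids storing the same relation twice for one pair of concepts.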

path-issues

In my current installation, I have path issues: mapping only works when I am inside the concepticon folder. Is there, or could there be, a way to specify where the concepticon repository resides?

Validator for Conceptlist.author

To prevent issues like the one fixed here, Conceptlist.author should have an associated validator, making sure that

  • if the string contains more than one comma, it must contain an "and"
  • no substring, when the string is split on "and", exceeds a reasonable maximum length (200 chars might be suitable, taking into account clld/clld#214 and the fact that contributor names are turned into IDs and thus become part of URLs).
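The two rules could be sketched as a plain function along the following lines. The name validate_author and the max_length parameter are hypothetical, and how this would be wired into the attrs-based Conceptlist model is left open.

```python
def validate_author(author, max_length=200):
    """Raise ValueError if the author string breaks either of the two rules above."""
    # Rule 1: several commas without " and " suggests names were joined by commas.
    if author.count(',') > 1 and ' and ' not in author:
        raise ValueError('multiple commas but no " and ": {0!r}'.format(author))
    # Rule 2: every single name must stay below the maximum length.
    for name in author.split(' and '):
        if len(name) > max_length:
            raise ValueError('author name longer than {0} chars'.format(max_length))
    return author


print(validate_author('Savelyev, Alexander and Robbeets, Martine'))
```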

fix tests for concept sets which are aliased

The aliases have changed somewhat recently. Ideally, they should only be touched in exceptional situations involving the retirement of concept sets.

The test should:

  1. check if an aliased concept set ID is higher than its alias (this is prohibited and should raise an error)
  2. check if aliased concept sets are used in any of the concept lists (this should raise an error)
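A rough sketch of the two checks, under loudly stated assumptions: aliases is taken to be a mapping from each aliased concept set ID to the ID of its alias, IDs are integers, and used_ids is the set of concept set IDs referenced in any concept list. None of these names or structures come from the actual test suite.

```python
def check_aliases(aliases, used_ids):
    """Return a list of error messages for the two alias rules (hypothetical helper)."""
    errors = []
    for aliased, alias in sorted(aliases.items()):
        # Check 1: an aliased ID must not be higher than its alias.
        if aliased > alias:
            errors.append('aliased ID {0} is higher than its alias {1}'.format(aliased, alias))
        # Check 2: aliased concept sets must not be used in concept lists.
        if aliased in used_ids:
            errors.append('aliased ID {0} is used in a concept list'.format(aliased))
    return errors


# Made-up IDs for illustration only.
problems = check_aliases({10: 20, 30: 5}, used_ids={10, 99})
print(problems)
```

In a real test, a non-empty result would be turned into a failure, for example with assert not problems.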

Better mapping

The way automatic concept mapping works can be improved, especially from an end-user perspective. @LinguList suggests a number of ways in #398. Amongst those:

  • an interactive way of working with concept mapping suggestions (e.g. in the form of a small web app)
  • POS information and definitions
  • better (more approachable? understandable?) similarity levels
  • better handling of malformed input in the Gloss() class

map_concepts fails on Windows machines, cp1251 issue

Printing the results of automatic mapping

if out is None:
    print(writer.read().decode('utf-8'))

for results with UTF-8 elements fails with a UnicodeDecodeError for the respective character (note that this happens even when redirecting the output with >, since the characters are still printed internally). This can be remedied temporarily either with

set PYTHONIOENCODING=utf-8

in the Windows command prompt (before running the mapping), by using a Unicode-aware terminal, by changing the way map() prints, or by providing an alternative way of writing (not 'printing') directly to a file.
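The last option, writing instead of printing, might look roughly like this. It is a sketch only: emit is a hypothetical helper, not part of pyconcepticon, and it assumes the mapping result is available as UTF-8 encoded bytes, as the writer.read().decode('utf-8') call above suggests.

```python
import sys


def emit(data, out=None):
    """Write UTF-8 encoded bytes to a file, or to stdout without re-encoding them."""
    if out is None:
        # Bypass the text layer of stdout, so the console code page is never
        # used to encode the characters (they may still display incorrectly).
        sys.stdout.buffer.write(data)
        sys.stdout.buffer.flush()
    else:
        with open(out, 'wb') as fp:
            fp.write(data)


if __name__ == '__main__':
    emit('mapped: \u0449\n'.encode('utf-8'))
```

Because the bytes are written as they are, redirecting with > should produce a valid UTF-8 file regardless of the terminal encoding.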

make tests for authors

Authors are bibtex-styled, so we expect multiple authors to be separated by " and ", with each name given as familyname, firstname or firstname familyname.

Given that pybtex is also not strong in this regard, I assume we had best add an Author class that does the parsing here? It should raise an error if there are multiple commas and no and or AND, for example.

I am also fine with a regex in the attr.ib evaluation.
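A minimal sketch of what such parsing could look like, using a namedtuple rather than a full class. Author and parse_authors are hypothetical names, and treating the last token as the family name in the firstname familyname form is a naive assumption that breaks for multi-word family names.

```python
import collections
import re

Author = collections.namedtuple('Author', ['family', 'given'])

# Separator between authors: "and" or "AND" surrounded by whitespace.
AND = re.compile(r'\s+and\s+', flags=re.IGNORECASE)


def parse_authors(text):
    """Split a bibtex-style author string into Author(family, given) tuples."""
    if text.count(',') > 1 and not AND.search(text):
        raise ValueError('multiple commas but no "and": {0!r}'.format(text))
    authors = []
    for name in AND.split(text.strip()):
        if ',' in name:
            # "familyname, firstname"
            family, given = [part.strip() for part in name.split(',', 1)]
        else:
            # "firstname familyname": naively take the last token as family name
            given, _, family = name.strip().rpartition(' ')
        authors.append(Author(family, given))
    return authors


print(parse_authors('Savelyev, Alexander and Martine Robbeets'))
```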

add new functionalities into the CLI

We should add new functionality to the CLI. The basic idea is:

  1. a command concepticon mergers conceptlist which spits out the concepts which are merged.
  2. a command concepticon check conceptlist which tests for general conformity and will spit out problems, like non-unique IDs etc. Current functionality is lacking, as it only gives us the first non-unique key, which is tedious if you have 100 non-unique ones, etc.
  3. the possibility to have user-defined, potentially new concept sets while working on a project. We currently define them by leaving the concepticon_id empty and adding a "!" in front of the concepticon gloss. All this applies to local concept lists, not to our data in Concepticon. But a command concepticon newconcepts conceptlist or similar would ideally provide new lines with free concepticon IDs where users can fill in the missing definitions etc. I used my custom workarounds when working on larger datasets myself, but I think that, with more people working on this, we should try to make it more comfortable for our users.

@chrzyki will first start looking into this. I added some hints with @todo marks to the testing-branch, and I hope we can have a first PR with some of these new ideas by the end of the week.
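For item 2, reporting all non-unique IDs at once instead of only the first could be sketched as follows. The function name non_unique_ids and the row format (one dict per concept list row) are assumptions; this is not the existing check.

```python
import collections


def non_unique_ids(rows, key='ID'):
    """Return every value of `key` that occurs more than once, with its count."""
    counts = collections.Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows for illustration.
rows = [{'ID': 'a'}, {'ID': 'b'}, {'ID': 'a'}, {'ID': 'b'}, {'ID': 'c'}]
print(non_unique_ids(rows))
```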

Remove obsolete code

The union/intersection commands will be largely superseded by cldfbench concepticon.intersection once Concepticon 3.9 (and the CLDF data) is released.
So we should remove

def compare_conceptlists(api, *conceptlists, **kw):
    """
    Function compares multiple conceptlists and extracts common concepts.
    Note
    ----
    The method takes concept relations into account.
    """
    search_depth = kw.pop('search_depth', 3)
    commons = collections.defaultdict(set)
    # store all concepts along with their broader concepts
    for arg in conceptlists:
        if isinstance(arg, Conceptlist):
            clist = arg
        elif arg not in api.conceptlists:
            clist = Conceptlist.from_file(arg)
        else:
            clist = api.conceptlists[arg]
        clid = getattr(arg, 'id', arg)
        for c in clist.concepts.values():
            if c.concepticon_id:
                commons[c.concepticon_id].add((
                    clid, 0, c.concepticon_id, c.concepticon_gloss))
                for rel, depth in [
                    ('broader', functools.partial(operator.add, 0)),
                    ('narrower', functools.partial(operator.sub, 0))
                ]:
                    for cn, d in api.relations.iter_related(
                            c.concepticon_id, rel, max_degree_of_separation=search_depth):
                        commons[cn].add((
                            clid, depth(d), c.concepticon_id, c.concepticon_gloss))
    # store proper concepts (the ones purely underived), as we need to check in
    # a second run, whether a narrower concept occurs (don't find another
    # solution for this)
    proper_concepts = set()
    for c, lists in commons.items():
        if len(set([x[0] for x in lists])) > 1 and [d for _, d, i, g in lists if d == 0]:
            proper_concepts.add(c)
    # get a list of concepts that should be split into subsets (so they should
    # not be retained, such as arm/hand if arm and hand occur in certain lists
    # the blacklist is needed to make sure that narrower concepts which are
    # combined by adding a broader concept are not added additionally
    split_concepts = set([])
    blacklist = set([])
    for cid, lists in commons.items():
        if len(lists) > 1:
            # if one list makes MORE distinctions than the other, yield the
            # more refined list
            listcheck = collections.defaultdict(list)
            for a, b, c, d in lists:
                if b >= 0:
                    listcheck[a] += [(a, b, c, d)]
            for _, concepts in listcheck.items():
                if len([x for x in concepts if x[1] > 0]) > 1:
                    split_concepts.add(cid)
                    break
            if cid not in split_concepts:
                if len([l_ for l_ in lists if l_[1] > 0]) == len(lists):
                    if len(set([l_[2] for l_ in lists])) > 1:
                        for l_ in lists:
                            blacklist.add(l_[2])
    for cid, lists in sorted(
            commons.items(), key=lambda x: api.conceptsets[x[0]].gloss):
        sorted_lists = sorted(lists, key=lambda x: str(x))
        depths = [x[1] for x in sorted_lists]
        reflexes = [x[2] for x in sorted_lists]
        if cid not in split_concepts:
            # yield unique concepts directly
            if len(lists) == 1:
                if next(iter(lists))[1] == 0 and cid not in blacklist:
                    yield (cid, lists)
            # check broader or narrower concept collections
            elif 0 not in depths:
                concepts = dict([(c, (a, b)) for a, b, c, d in sorted_lists])
                # if all concepts are narrower, dont' retain them
                retain = bool([x for x in depths if x > 0])
                for concept in concepts:
                    if concept in proper_concepts:
                        retain = False
                        break
                if retain:
                    yield (cid, lists)
            else:
                # if one list makes MORE distinctions than the other, yield the
                # more refined list
                if [x for x in depths if x < 0]:
                    dont_yield = False
                    for d, c in zip(depths, reflexes):
                        if d < 0 and c not in split_concepts:
                            dont_yield = True
                    if not dont_yield:
                        yield (cid, lists)
                else:
                    yield (cid, lists)

and code using this.

consider moving the mapping code to `pysen`

I started drafting the mapping code by establishing a new evaluation of similarities with respect to concepticon mappings in pysen. The results are not perfect, but automated mapping has already been ported, with 100% test coverage.

The principle is similar to what we decided to do in linse: in order to have access to part of the data in clts, we make a dump in zipped form that can be included in the libraries, and RELEASING.md states that the most recent dump needs to be made when releasing.

The dump command is now available from pyconcepticon as well, and pysen also uses the dumped data for mapping.
