concepticon / pyconcepticon

Programmatic access to Concepticon data.

Home Page: https://concepticon.clld.org

License: Apache License 2.0

Python 99.80% TeX 0.20%

pyconcepticon's Introduction

concepticon app

This repository provides the source code for the clld app serving https://concepticon.clld.org


pyconcepticon's Issues

Remove reference to clldutils.dsv

With the latest clldutils release, dsv has finally been removed from clldutils, so

pyconcepticon/src/pyconcepticon/util.py

Line 8 in af3f732

from clldutils import dsv

should use csvw.dsv instead.

Validation in Conceptlist with `partial` and `valid_int` yields `None` values

>>> from pyconcepticon.api import Concepticon
>>> from pyconcepticon.util import lowercase
>>> from pyconcepticon.models import Conceptlist
>>> con = Concepticon()
>>> Conceptlist(api=con, **lowercase(con.conceptlists_dicts[0]))
Conceptlist(_api=<pyconcepticon.api.Concepticon object at 0x7f6465f35f10>, id='Savelyev-2020-254', author='Savelyev, Alexander and Robbeets, Martine', year=None, list_suffix='', items=None, tags=['basic'], source_language=['english'], target_language='Turkic languages', url='https://zenodo.org/record/3555174/files/turkic_alignment.tsv?download=1', refs=['Savelyev2020'], pdf=[], note='This concept list, which extends the classical [Swadesh list of 200 items](:ref:Swadesh-1952-200) by the [Leipzig-Jakarta list](:ref:Tadmor-2009-100) was used for a phylogenetic study on Turkic languages.', pages='', alias=[], local=False)

So we can see that year and items both give None.

I realized that changing the attr.ib converter in line 194 from

    year = attr.ib(converter=partial(valid_int, 'YEAR'))

to

    year = attr.ib(converter=int)

fixes this bug. The question is: do we need valid_int at all? We would prefer an error to be raised if items or year are not integers, right?
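A minimal sketch of a converter that raises instead of silently yielding None. Note that strict_int is a hypothetical helper written for illustration; it is not pyconcepticon's valid_int, whose implementation is not shown in this issue.

```python
from functools import partial


def strict_int(name, value):
    """Convert value to int; raise a descriptive error instead of returning None.

    Hypothetical helper for illustration, not pyconcepticon's valid_int.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError('{0} must be an integer, got {1!r}'.format(name, value))


# Bind the field name up front, mirroring partial(valid_int, 'YEAR').
year_converter = partial(strict_int, 'YEAR')
```

With this, year_converter('2020') gives the integer 2020, while None or an empty string raises a ValueError that names the offending field. The bound callable takes a single argument, so it could be passed as converter= to attr.ib in the same way as the partial above.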

add wheel to requirements?

When running pip install pyconcepticon in a fresh virtual environment I get an error (see below). I fixed this by first running pip install wheel but was wondering if it wouldn't be more practical to have wheel as a dependency in requirements.txt

  Building wheel for pylatexenc (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/viktor/Documents/GitHub/streitberggothic/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3nahcf/pylatexenc/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3nahcf/pylatexenc/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-0doo41n8
       cwd: /tmp/pip-install-xf3nahcf/pylatexenc/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for pylatexenc
  Running setup.py clean for pylatexenc
  Building wheel for docopt (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/viktor/Documents/GitHub/streitberggothic/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3nahcf/docopt/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3nahcf/docopt/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4kotq992
       cwd: /tmp/pip-install-xf3nahcf/docopt/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for docopt

Open issue per editor

As discussed with @xrotwang, it would be great to have an overview of open issues per editor so that editors can check open tasks or follow up on discussions.

Compute intersection of local conceptlist

Hello everybody,

is there an implemented way of running the intersection command against a local conceptlist that is not part of CONCEPTICON, but successfully mapped?

Currently, I try to run the following command:
concepticon intersection Swadesh-1955-100 Pano-Takana/blumpanotacana/etc/Conceptlists/Alviano-1957-448.tsv

Alviano-1957-448 is a conceptlist with the following columns:

ID NUMBER WORD CONCEPTICON_ID CONCEPTICON_GLOSS

However, I receive the following error message:

(concepticon) blum@lingn18 Projects % concepticon intersection Swadesh-1955-100 Pano-Takana/blumpanotacana/etc/Conceptlists/Alviano-1957-448.tsv
INFO    concepticon/concepticon-data at /Users/blum/Library/Application Support/cldf/concepticon
Traceback (most recent call last):
  File "/usr/local/bin/concepticon", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/__main__.py", line 62, in main
    return args.main(args) or 0
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/commands/intersection.py", line 22, in run
    format_set_operation(args.repos.intersection(*get_conceptlist(args)))
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/api.py", line 543, in intersection
    return list(self._set_operation('intersection', *clids, **kw))
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/api.py", line 524, in _set_operation
    for c, lists in compare_conceptlists(self, *clids, **kw):
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 315, in compare_conceptlists
    for c in clist.concepts.values():
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 255, in concepts
    for item in self.metadata:
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 234, in metadata
    return self.tg.tables[0]
  File "/usr/local/lib/python3.9/site-packages/clldutils/misc.py", line 197, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/usr/local/lib/python3.9/site-packages/pyconcepticon/models.py", line 226, in tg
    tg = TableGroup.from_file(md)
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1039, in from_file
    res = cls.fromvalue(data or get_json(fname))
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 430, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.TableGroup>", line 39, in __init__
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1402, in converter_tables
    res.append(Table.fromvalue(vv) if isinstance(vv, dict) else vv)
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 430, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.Table>", line 46, in __init__
  File "/usr/local/lib/python3.9/site-packages/csvw/metadata.py", line 1166, in __attrs_post_init__
    raise ValueError('url property is required for Tables')
ValueError: url property is required for Tables

What works:
concepticon intersection Swadesh-1955-100 Swadesh-1952-200

How can I fix my command or the conceptlist so I can compute the intersection?
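As a stopgap, and independent of the concepticon CLI, a plain intersection can be computed directly from the CONCEPTICON_ID columns. This is only a sketch under stated assumptions: both lists are available as TSV files with a CONCEPTICON_ID column, and, unlike the intersection command, concept relations (broader/narrower) are ignored. The function names are made up for this example.

```python
import csv
import io


def concepticon_ids(tsv_file):
    """Return the set of non-empty CONCEPTICON_ID values from an open TSV file."""
    reader = csv.DictReader(tsv_file, delimiter='\t')
    return {row['CONCEPTICON_ID'] for row in reader if row.get('CONCEPTICON_ID')}


def plain_intersection(*tsv_files):
    """Intersect the CONCEPTICON_ID sets of several concept lists."""
    id_sets = [concepticon_ids(f) for f in tsv_files]
    return set.intersection(*id_sets) if id_sets else set()


# In-memory stand-ins for two concept list files; all rows are invented placeholders.
list_a = io.StringIO(
    'ID\tNUMBER\tWORD\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n'
    'A-1\t1\tword-a\t101\tGLOSS-A\n'
    'A-2\t2\tword-b\t\t\n')
list_b = io.StringIO(
    'ID\tNUMBER\tWORD\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n'
    'B-1\t1\tword-c\t101\tGLOSS-A\n')
shared = plain_intersection(list_a, list_b)
```

For real files, the StringIO objects would be replaced by files opened with open(path, encoding='utf-8', newline='').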

handling multiple relations for the same concept sets

As came up during the integration of new concept relations, the current version of concepticon test cannot handle concepts that have more than one relation.

The dictionary there currently has the structure:

{
    "concept1": {
        "concept2": "relation-unique"
    }
}

We'll probably need a set/list there:

{
    "concept1": {
        "concept2": ["relation-A", "relation-B"]
    }
}
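One way such a structure could be built, sketched with nested defaultdicts and sets so that repeated relations between the same pair are kept without duplicates. The triples are invented placeholders, not actual concept relations.

```python
import collections

# Hypothetical (source, relation, target) triples, for illustration only.
triples = [
    ('concept1', 'relation-A', 'concept2'),
    ('concept1', 'relation-B', 'concept2'),
    ('concept1', 'relation-A', 'concept3'),
]

# source -> target -> set of relation names
relations = collections.defaultdict(lambda: collections.defaultdict(set))
for source, relation, target in triples:
    relations[source][target].add(relation)

both = sorted(relations['concept1']['concept2'])
```

Here both ends up as ['relation-A', 'relation-B']. A set avoids duplicate entries; it would need converting to a sorted list before serializing to JSON.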

path-issues

In my current installation, I have path issues: mapping only works when I am inside the concepticon folder. Is there, or could there be, a way to specify where the concepticon repository resides?

make_app doesn't have args.args from Concepticon.app_wrapper

Calling concepticon make_app ...

pyconcepticon/src/pyconcepticon/commands/make_app.py

Line 14 in 9d1c1c4

@Concepticon.app_wrapper

makes use of the app_wrapper ...

https://github.com/clld/clldutils/blob/65ac0344526cdc9b12bd6910113331515d4bca61/src/clldutils/apilib.py#L97

which doesn't have args.args populated, as far as I can tell. Or am I misremembering how to use/call make_app?

concepticon rename generates wrong url for concept list

Using concepticon rename provides the wrong URL for the list itself in the metadata, i.e. Buck-1949-1110.tsv-metadata.json instead of Buck-1949-1110.tsv. Observed here: concepticon/concepticon-data#781 & thanks for noticing @Schweikhard.

Validator for Conceptlist.author

To prevent issues like the one fixed here, Conceptlist.author should have an associated validator, making sure that

if the string contains more than one comma, it must contain an and
no substring - if split on and - exceeds a reasonable max length (200 chars might be suitable, taking into account clld/clld#214 and the fact that contributor names are turned into IDs and thus become part of URLs).
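A rough sketch of the two rules as a plain function. The name validate_author and the exact splitting regex are assumptions; an attrs validator takes (instance, attribute, value), so a thin wrapper would be needed to plug this into attr.ib.

```python
import re

MAX_NAME_LENGTH = 200  # the limit suggested above; adjust as needed


def validate_author(value, max_length=MAX_NAME_LENGTH):
    """Raise ValueError if an author string violates the two proposed rules."""
    # Split on "and" surrounded by whitespace, so names like "Anderson" stay intact.
    names = re.split(r'\s+and\s+', value)
    if value.count(',') > 1 and len(names) < 2:
        raise ValueError('more than one comma but no "and": {0!r}'.format(value))
    for name in names:
        if len(name) > max_length:
            raise ValueError(
                'author name exceeds {0} characters: {1!r}...'.format(max_length, name[:30]))
    return value
```

For example, 'Savelyev, Alexander and Robbeets, Martine' passes, while 'Savelyev, Alexander, Robbeets, Martine' is rejected.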

complete concepticon release command

Currently, this command does not specify the upload type, thus it defaults to "software".

Better error reporting

See concepticon/concepticon-data#1240 (comment)

fix tests for concept sets which are aliased

The aliases have changed a bit over time. Ideally, they should only be touched in exceptional situations involving the retirement of concept sets.

The test should:

check if an aliased concept set ID is higher than its alias (this is prohibited and should raise an error)
check if aliased concept sets are used in any of the concept lists (this should raise an error)

`test` command with a single list runs check on everything

Despite what it says in the help, concepticon test xxx does not just test list xxx but all the lists.

Better mapping

The way automatic concept mapping works can be improved, especially from an end-user perspective. @LinguList suggests a number of ways in #398. Amongst those:

an interactive way of working with concept mapping suggestions (e.g. in the form of a small web app)
POS information and definitions
better (more approachable? understandable?) similarity levels
better handling of malformed input in the Gloss() class

map_concepts fails on Windows machines, cp1251 issue

Printing the results of automatic mapping

pyconcepticon/src/pyconcepticon/api.py

Lines 240 to 241 in 70120b1

if out is None:
    print(writer.read().decode('utf-8'))

for results containing UTF-8 characters fails with a UnicodeDecodeError on the respective character. Note that this happens even when redirecting the output with >, because internally the characters are still being printed. A temporary remedy is to run

set PYTHONIOENCODING=utf-8

in the Windows command prompt before running the mapping, or to use a Unicode-aware terminal. Alternatively, the way map() prints could be changed, or an alternative way of writing (not printing) directly to a file could be provided.
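A sketch of the "write, don't print" idea: the UTF-8 bytes are written as-is, either to a file or to the binary layer of stdout, so the console code page is never involved in encoding. The function name emit and its signature are made up for this example and are not part of pyconcepticon.

```python
import sys


def emit(data, out=None):
    """Write already UTF-8 encoded bytes to a file path, or to stdout as raw bytes."""
    if out is not None:
        # Binary mode: no platform-dependent encoding is applied.
        with open(out, 'wb') as fp:
            fp.write(data)
        return
    buffer = getattr(sys.stdout, 'buffer', None)
    if buffer is not None:
        # Bypass the text layer and its console encoding.
        buffer.write(data)
        buffer.flush()
    else:
        # stdout was replaced by a text-only stream; fall back to decoding.
        sys.stdout.write(data.decode('utf-8'))
```

Called as emit(writer.read(), out='mapped.tsv'), this would never pass through the console encoding; whether the terminal then displays the bytes correctly is a separate question.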

Check language columns for mappings are well specified

In particular, the ISO 639-2 code used as part of the filenames mappings/map-<ISO-639-2>.tsv must be unique and correct. This code is specified in concepticondata/concepticon.json.

Integrate cldfcatalog

Based on cldfcatalog, discovering the location of concepticon-data should be simplified.
Here's what this could look like in pyclts: https://github.com/cldf-clts/pyclts#install

replace pyconcepticon.metadata with corresponding functionality in clldutils 3.4

Command to move concept list from dev repos to production

see concepticon/concepticon-data#408

make tests for authors

Authors are bibtex-styled, so we expect: multiple authors split by " and ", and then we have familyname, firstname or firstname familyname.

I assume, given that pybtex is also not strong in this regard, that we best add an Author class doing the parsing here? It should raise an error if there are multiple commas and no and or AND, for example.

I am also fine with a regex in the attr.ib evaluation.
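A possible shape for such a parser, as a sketch only. The Author tuple and parse_authors are hypothetical names, and treating the last token of "firstname familyname" as the family name is a simplifying assumption that will be wrong for some names.

```python
import collections
import re

Author = collections.namedtuple('Author', ['family', 'given'])


def parse_authors(value):
    """Split a bibtex-style author string into Author tuples."""
    # Accept both " and " and " AND " as separators.
    names = re.split(r'\s+and\s+', value.strip(), flags=re.IGNORECASE)
    authors = []
    for name in names:
        if name.count(',') > 1:
            # Several commas within one name usually mean a missing "and".
            raise ValueError('multiple commas in a single name: {0!r}'.format(name))
        if ',' in name:
            family, given = name.split(',')
        else:
            # "firstname familyname": assume the last token is the family name.
            given, _, family = name.rpartition(' ')
        authors.append(Author(family.strip(), given.strip()))
    return authors
```

A string such as 'A, B, C, D' contains no separator, is treated as one name with three commas, and therefore raises.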

add new functionalities into the CLI

add new functionalities into the CLI. The basic idea is:

a command concepticon mergers conceptlist which spits out the concepts which are merged.
a command concepticon check conceptlist which tests for general conformity and will spit out problems, like non-unique IDs etc. Current functionality is lacking, as it only gives us the first non-unique key, which is tedious if you have 100 non-unique ones, etc.
the possibility to have user-defined, potentially new concept sets while working on a project. We currently define them by leaving the concepticon_id empty and adding a "!" in front of the concepticon gloss. All this applies to local concept lists, not to our data in Concepticon. But a command concepticon newconcepts conceptlist or similar would ideally provide new lines with free concepticon IDs where users can fill in the missing definitions etc. I used my custom workarounds when working on larger datasets myself, but I think we should, with more people working on this, try to make it more comfortable for our users.
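For the check idea, reporting every non-unique ID at once rather than only the first could be sketched as follows. The function name duplicate_ids and the assumption that rows are dicts keyed by column name are made up for this example.

```python
import collections


def duplicate_ids(rows, key='ID'):
    """Return a dict mapping every non-unique value of `key` to its number of occurrences."""
    counts = collections.Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Invented rows standing in for a local concept list.
rows = [{'ID': 'X-1'}, {'ID': 'X-2'}, {'ID': 'X-1'}, {'ID': 'X-3'}, {'ID': 'X-2'}, {'ID': 'X-1'}]
problems = duplicate_ids(rows)
```

Here problems is {'X-1': 3, 'X-2': 2}, so all offending keys can be listed in a single run.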

@chrzyki will start looking into this first. I added some hints with @todo marks to the testing branch, and I hope we can have a first PR with some of these new ideas by the end of the week.

Remove obsolete code

The union/intersection commands are largely superseded by cldfbench concepticon.intersection, once Concepticon 3.9 (and the CLDF data) is released.
So remove

pyconcepticon/src/pyconcepticon/models.py

Lines 306 to 404 in 15d9a64

def compare_conceptlists(api, *conceptlists, **kw):
    """
    Function compares multiple conceptlists and extracts common concepts.

    Note
    ----
    The method takes concept relations into account.
    """
    search_depth = kw.pop('search_depth', 3)
    commons = collections.defaultdict(set)

    # store all concepts along with their broader concepts
    for arg in conceptlists:
        if isinstance(arg, Conceptlist):
            clist = arg
        elif arg not in api.conceptlists:
            clist = Conceptlist.from_file(arg)
        else:
            clist = api.conceptlists[arg]
        clid = getattr(arg, 'id', arg)
        for c in clist.concepts.values():
            if c.concepticon_id:
                commons[c.concepticon_id].add((
                    clid, 0, c.concepticon_id, c.concepticon_gloss))
                for rel, depth in [
                    ('broader', functools.partial(operator.add, 0)),
                    ('narrower', functools.partial(operator.sub, 0))
                ]:
                    for cn, d in api.relations.iter_related(
                            c.concepticon_id, rel, max_degree_of_separation=search_depth):
                        commons[cn].add((
                            clid, depth(d), c.concepticon_id, c.concepticon_gloss))

    # store proper concepts (the ones purely underived), as we need to check in
    # a second run, whether a narrower concept occurs (don't find another
    # solution for this)
    proper_concepts = set()
    for c, lists in commons.items():
        if len(set([x[0] for x in lists])) > 1 and [d for _, d, i, g in lists if d == 0]:
            proper_concepts.add(c)

    # get a list of concepts that should be split into subsets (so they should
    # not be retained, such as arm/hand if arm and hand occur in certain lists
    # the blacklist is needed to make sure that narrower concepts which are
    # combined by adding a broader concept are not added additionally
    split_concepts = set([])
    blacklist = set([])
    for cid, lists in commons.items():
        if len(lists) > 1:
            # if one list makes MORE distinctions than the other, yield the
            # more refined list
            listcheck = collections.defaultdict(list)
            for a, b, c, d in lists:
                if b >= 0:
                    listcheck[a] += [(a, b, c, d)]
            for _, concepts in listcheck.items():
                if len([x for x in concepts if x[1] > 0]) > 1:
                    split_concepts.add(cid)
                    break
            if cid not in split_concepts:
                if len([l_ for l_ in lists if l_[1] > 0]) == len(lists):
                    if len(set([l_[2] for l_ in lists])) > 1:
                        for l_ in lists:
                            blacklist.add(l_[2])

    for cid, lists in sorted(
            commons.items(), key=lambda x: api.conceptsets[x[0]].gloss):
        sorted_lists = sorted(lists, key=lambda x: str(x))
        depths = [x[1] for x in sorted_lists]
        reflexes = [x[2] for x in sorted_lists]

        if cid not in split_concepts:
            # yield unique concepts directly
            if len(lists) == 1:
                if next(iter(lists))[1] == 0 and cid not in blacklist:
                    yield (cid, lists)
            # check broader or narrower concept collections
            elif 0 not in depths:
                concepts = dict([(c, (a, b)) for a, b, c, d in sorted_lists])
                # if all concepts are narrower, dont' retain them
                retain = bool([x for x in depths if x > 0])
                for concept in concepts:
                    if concept in proper_concepts:
                        retain = False
                        break
                if retain:
                    yield (cid, lists)
            else:
                # if one list makes MORE distinctions than the other, yield the
                # more refined list
                if [x for x in depths if x < 0]:
                    dont_yield = False
                    for d, c in zip(depths, reflexes):
                        if d < 0 and c not in split_concepts:
                            dont_yield = True
                    if not dont_yield:
                        yield (cid, lists)
                else:
                    yield (cid, lists)

and code using this.

consider moving the mapping code to `pysen`

I started drafting the mapping code by establishing a new evaluation of similarities with respect to concepticon mappings in pysen. They are not perfect, but automated mapping has already been ported and has 100% test coverage.

The principle is similar to what we decided to do in linse: in order to have access to part of the data in clts, we make a dump in zipped form that can be included in the libraries, and RELEASING.md states that the most recent dump needs to be made at that point.

The dump command is now available from pyconcepticon as well, and pysen also uses the dumped data for mapping.

Backwards compatible support of repo layout Concepticon 3.0

pyconcepticon should support concepticon data as of Concepticon 3.0 in a backwards-compatible way.

Check consistency of concept set metadata

See e.g. concepticon/concepticon-data#600
