lexibank / abvd

CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020.

Home Page: https://abvd.eva.mpg.de

License: Creative Commons Attribution 4.0 International

TeX 99.14% Python 0.86%

austronesian cldf lexibank1 vocabulary-database

abvd's Issues

How to handle inconsistent cognateset IDs

The word for "Twenty" in Palembang Malay is assigned to cognate set 3.6. This may mean 36, or it may mean 3 and 6 (i.e. "3,6").
It is certainly a good example of why we shouldn't slugify identifiers, but how should we correct it? Fix it in the source?
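
To illustrate the ambiguity, here is a minimal sketch. The `naive_slug` function is a hypothetical stand-in (it is not claimed to be the slug function the dataset actually uses) that simply drops every character that is not an ASCII letter or digit:

```python
import re

def naive_slug(identifier: str) -> str:
    """Hypothetical slug function: keep only ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", identifier)

# Three distinct raw cognate set labels ...
raw_ids = ["3.6", "3,6", "36"]
# ... all collapse into the same slug, so the original distinction is lost.
slugs = {raw: naive_slug(raw) for raw in raw_ids}
print(slugs)
```

Under this assumption, once the identifier has been slugified there is no way to tell whether the source meant one class or two, which is why the fix probably has to happen before that step.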

11 -> 1 for 1193-92_toopenuncover-1 ?

I suspect that the form below should be in cognacy class 1 rather than 11. The same is perhaps also true of 983-92_toopenuncover-1.

ID	Local_ID	Language_ID	Parameter_ID	Value	Form	Segments	Comment	Source	Cognacy
1193-92_toopenuncover-1	287978	1193	92_toopenuncover	mukaʔ	mukaʔ			67307	11

For brief context

Other members of 11:

duuki
suuki
suugiɪ
suughi
sukej
bukaʔ
sugeg
shoeegu

Other members of 1:

buka
puke
bika
munukə
mo-muʔo
muka
mbukak
wuʔa
bukaʔ
buŋgăhu
buka
bukáʔ
buka
fuka
buka
bukuəʔ
mukaʔ
uk
bukkaʔ
céŋka
wuka
ma-móha
puk
vuke
boka
mbukaʔ
buge
məmukaʔ
boke
hoka
buka
fukac
puke
-puki
wuka
buka
mukā
mamuka
buka
mambuka
bo'ahaè
buka
bukaq
mabbukka
buka
ŋəbukaʔ
nguka
buka
wu
membuka
bukaʔ
bukaʔ
mambuka
*buka
*buka
*buka
mbikak
mukɘ
buka
bunggero
bhugera
buwa
bugera
buke
bungga
bungga
bowa
bukaʔ
pʷak
úka
káuka
membuka
bukaan
bukaʔ
movoɣ
m+buka'+en
buka
puk
buka
muhka
buke
buka
buka'
bukis
mawuka
wuka'an
mbuka
buke
bukɑʔ
bukɑʔ
bukɑʔ
bukɑʔ
bukɑʔ
bukɜʔ
bukɑ
bukɑ
ŋɑbukɑʔ
bukɑʔ
ŋɑbukɑʔ
ŋɑmbukɑ
ŋubukɑ
ŋəbukɑʔ
ŋɑbukɑʔ
bukɑʔ
bukɑ
buke
bukɑʔ
bukɑʔ
dibukɑʔ
momuko
mukǝ
muka
dibukɨ
mukǝ
mukǝ
muka
buka
buka
buka
buka
bukak
baək
búka
vu31 ha31
*bukaʔ
pŏk
pŏk
mabuká
*bukaʔ
buká
buka
*buka
*bukas
buka
ipki
bukaʔ
bukaʔ
bukaʔ
bukaʔ
bukaʔ
bukaʔ
boeka
boeka
mi-vòha
bukaʔ
bukaʔ
vuka
bukaʔ
məmbukaʔ
bukáʔ
boksán
boksí
boka
buka

gita class 1 for Kubokota and Tabar

I'd like to suggest that Kubokota 508-185_we-2 and Tabar 129-185_we-1, which both have the form gita, should probably be in cognacy class 1 rather than 2 and 2?, respectively.

Some other members of 1: hita, gita, gida, 'gita, kita.
Some other members of 2: gami, gia, yami

Kubokota and Tabar are both Oceanic, and the reconstructed form in ABVD for Proto-Oceanic is *kita, which is tagged as cognacy 1.

I'd be more convinced if there were a comment saying that the forms are specifically inclusive for Kubokota and Tabar, but I still think it's a good idea to review whether they should both be classed as 1 rather than 2/2?.

Typo in 853-201_five-1 (Wetan 'five'): Should be <wolima>, not <wolina>

The form given for 'five' in Wetan [luan1263] should be "wolima"; there appears to be a typo (with medial *n instead of "m"). See de Josselin de Jong (1987: 179, 272, 294).

de Josselin de Jong, Jan Petrus Benjamin. 1987. Wetan fieldnotes: Some eastern Indonesian texts with linguistic notes and a vocabulary (Verhandelingen van het Koninklijk Instituut voor Taal-, Land- en Volkenkunde 130). Dordrecht: Foris Publications.

Upgrade to pylexibank 2.x

Consider split or new Glottocode for 'Angkola / Mandailing' [lang id 863]

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L1986
current Glottocode bata1290 refers to Batak Angkola only

either new:

Angkola / Mandailing angk1248 - https://glottolog.org/resource/languoid/id/angk1248

or split into

Batak Angkola bata1290 - https://glottolog.org/resource/languoid/id/bata1290
Batak Mandailing bata1291 - https://glottolog.org/resource/languoid/id/bata1291

if the data are the same for both varieties

update

Typo in 186-1_hand-1 (Sekar 'hand'): Should be <nima-n>, not <nina-n>

The form given for 'hand' in Sekar [seka1247] should be "nima-n"; there appears to be a typo (with medial *n instead of "m"). See George Grace's fieldnotes:

https://digital.library.manoa.hawaii.edu/static/grace/media/50.pdf

The top left of page 22 (of the pdf) has forms for "tangan" (Indonesian for 'hand') shown with prenominal possessive marking.

The righthand column of page 27 (of the pdf) has compound forms like "niman ˈbukin" for "sikut" (= 'elbow'), "niman ˈtagan" for "djari tangan" (= 'finger'), and "(niman) ˈkisin" for "kuku" (= 'nail').

Normalize contributors

In order to be able to create a clld abvd app from the CLDF data, we would need to normalize contributor names at some point. I think ideally, the "good" data should already go into the CLDF dataset, so we'd need to normalize here.

forms.csv Cognacy column has floats/strings, lingpy expects ints

lingpy/basictypes.py in <lambda>(x)
     29         list.__setitem__(self, index, self._type(item))
     30 
---> 31 integer = lambda x: int(x) if x else 0
     32 strings = partial(_strings, str)
     33 ints = partial(_strings, int)

ValueError: invalid literal for int() with base 10: '1,64'

ABVD can't be imported with lingpy because the cognacy column has invalid values, amongst them: 29?, 1,83, etc. Thanks to @KonstantinHoffmann for pointing this out.
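
Until the values are cleaned up, one possible workaround is to parse the column before handing it to lingpy. The following is only a sketch under stated assumptions: `parse_cognacy` is a hypothetical helper (not part of lingpy or pylexibank), a comma is assumed to separate multiple classes, and a trailing `?` is assumed to mark a doubtful judgement. Anything else is rejected rather than guessed.

```python
from typing import List, Tuple

def parse_cognacy(raw: str) -> List[Tuple[int, bool]]:
    """Split a raw Cognacy value into (class_id, doubtful) pairs.

    Assumptions: ',' separates classes and a trailing '?' marks doubt.
    Any other content (e.g. '3.6') raises ValueError instead of being guessed.
    """
    result = []
    for part in (raw or "").split(","):
        part = part.strip()
        if not part:
            continue
        doubtful = part.endswith("?")
        digits = part.rstrip("?").strip()
        if not (digits.isascii() and digits.isdigit()):
            raise ValueError(f"cannot interpret cognacy value {raw!r}")
        result.append((int(digits), doubtful))
    return result

print(parse_cognacy("1,64"))  # [(1, False), (64, False)]
print(parse_cognacy("29?"))   # [(29, True)]
```

Raising on unexpected input is a deliberate choice here: it surfaces the problematic rows so they can be reported upstream instead of being silently coerced.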

subcognacy question

I have a question about the interpretation of the data in cases where more than one cognacy class is listed.

I thought that if a word had more than one cognacy class listed, then one of them (probably the second) represents a subcognacy class. For example, "wahine" in Hawaiian is "1, 116, 106", which I took to mean that all forms that get 116 also get 1 (but not all forms that have 1 get 116).

For water, I found that some words have cognacy "1,2", some have "1", and some have only "2".

Does my original assumption hold and these are a type of error, or is my assumption wrong?
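
The assumption can be tested mechanically. The sketch below is illustrative only: `subset_violations` is a hypothetical helper, and it assumes each Cognacy value is a comma-separated list of class IDs. It returns the values that list the supposed subclass without the supposed parent class.

```python
def subset_violations(cognacy_values, parent, child):
    """Return the raw values that contain `child` but not `parent`.

    Assumption: each value is a comma-separated list of class IDs.
    """
    violations = []
    for raw in cognacy_values:
        classes = {c.strip() for c in raw.split(",") if c.strip()}
        if child in classes and parent not in classes:
            violations.append(raw)
    return violations

# Toy values mirroring the 'water' situation described above.
water = ["1,2", "1", "2"]
print(subset_violations(water, parent="1", child="2"))  # ['2']
```

If 2 really were a subclass of 1, the result would be empty; a non-empty result means either the assumption is wrong or those rows are errors.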

change glottocode for Dadu'a to dadu1237

https://github.com/lexibank/abvd/blob/master/cldf/languages.csv#L2861

The classification of Dadu'a may still be under discussion, but since Glottolog has a code for it, it would be better to use it. For now, Dadu'a is most likely Wetarese.

see:

https://iso639-3.sil.org/sites/iso639-3/files/change_requests/2019/2019-053.pdf
Taylor-Leech 2009 'The language situation in Timor-Leste'
Taylor-Leech 2007 'The Ecology of Language Planning in Timor-Leste: A Study of Language Policy, Planning and Practices in Identity Construction'
and others

https://glottolog.org/resource/languoid/id/dadu1237

makecldf fails

... on this entry: word 57 = "?" here.

...which means that the following is passed to add_form:

{'Language_ID': '661', 'Parameter_ID': '57', 'Value': '?', 'Source': ['15258'], 'Cognacy': None, 'Comment': "er-jai 'be married (of woman)' p.78", 'Loan': False, 'Local_ID': '173141', 'Form': None}

... and then we fail with

Traceback (most recent call last):
  File "/Users/simon/projects/lexibank2018/env/bin/lexibank", line 11, in <module>
    load_entry_point('pylexibank', 'console_scripts', 'lexibank')()
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/__main__.py", line 139, in main
    sys.exit(parser.main())
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 110, in main
    catch_all=catch_all, parsed_args=args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 82, in main
    self.commands[args.command](args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 35, in __call__
    return self.func(args)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/misc.py", line 149, in makecldf
    with_dataset(args, Dataset._install)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/util.py", line 28, in with_dataset
    func(get_dataset(args, dataset.id), **vars(args))
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/dataset.py", line 437, in _install
    if self.cmd_install(**kw) == NOOP:
  File "/Users/simon/projects/lexibank2018/abvd/lexibank_abvd.py", line 39, in cmd_install
    source=[b for b in bibs if b.id in refs.get(wl.id, [])]
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/providers/abvd.py", line 231, in to_cldf
    Local_ID=entry.id,
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 222, in add_lexemes
    lexemes = self.add_forms_from_value(split_value=split_value, **kw)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 208, in add_forms_from_value
    kw_ = self.add_form(with_morphemes=with_morphemes, **kw_)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 157, in add_form
    raise ValueError('language, concept, value, and form '
ValueError: language, concept, value, and form must be supplied

What's the best way to fix this? Should add_form catch this? or should this be caught before getting to add_form?
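
One option is to filter such entries out before they reach add_form. The sketch below is not pylexibank code: `has_usable_value` is a hypothetical helper, the set of placeholder strings is an assumption, and the second entry is made up purely for contrast.

```python
def has_usable_value(entry: dict) -> bool:
    """Return True if the entry carries a real value rather than a placeholder.

    Assumption: '?', '-' and empty strings stand for missing data.
    """
    value = (entry.get("Value") or "").strip()
    return value not in {"", "?", "-"}

entries = [
    # Abbreviated version of the failing entry quoted above.
    {"Language_ID": "661", "Parameter_ID": "57", "Value": "?", "Form": None},
    # Made-up entry for contrast (not taken from the dataset).
    {"Language_ID": "661", "Parameter_ID": "58", "Value": "buka", "Form": "buka"},
]
usable = [e for e in entries if has_usable_value(e)]
print(len(usable))  # 1
```

Filtering early keeps add_form strict, so genuinely malformed input still fails loudly, while known placeholders are skipped on purpose.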

Dayak Ngaju /salawi/ is 'Twenty-Five', not 'Twenty'

For Dayak Ngaju [ngaj1237], the form /salawi/ is glossed as 'Twenty', but it should be glossed as 'Twenty-Five' (Suryanyahu 2013: 130). (A loan from Javanese?) By all accounts, '20' in Dayak Ngaju is a reflex of *duha *puluq.

Wrong glottocode for 399 Megiar

Should be megi1245, not mele1255, I think

Typo in 1387-201_five-1 (Pak 'five'): Should be <nuron>, not <muron>

The form given for 'five' in Pak [pakt1239] should be "nuron"; there appears to be a typo (with initial *m instead of "n"). See Smythe & Z'graggen (1975: 185).

Also, speaking of this source, <Z'graggen’> should be spelled with a lowercase in the Source/Author and Notes sections.

Oh, and Simon, you mentioned fixing typos upstream ... is there a more convenient place than here for me to register these typos when I see them?

Missing glottocodes

These should definitely be checked by people who know more about these languages than I do, but, in case this might be helpful, here are my best guesses of what the glottocodes should be for these Austronesian languages that currently seem to lack them (or, in some cases where Glottolog might not have an entry for the lect, what the closest glottocode might be):

Alavas 1 > [mpot1241] / [mvt]
Alavas 2 > [mpot1241] / [mvt]
Alavas-Wowo (Wowo 1) > [mpot1241] / [mvt]
Alavas-Wowo (Wowo 2) > [mpot1241] / [mvt]
Badeng > [main1275] / [xkl]
Baliledo > [anak1240] / [akg]
Kayan > [bara1370] / [kys]
Mandri (Faru) 162-100 > [axam1237] / [ahb]
Mandri (Farun) 162-91 > [nasv1234]
Najit > [malu1245] / [mll]
Siviti (Beterbu, Jericho) > [malu1245] / [mll]
Siviti (Womol) > [malu1245] / [mll]
ßatarxobu (Benut) > [malu1245] / [mll]
ßatarxobu (Gunwar) > [malu1245] / [mll]
ßatarxobu (Limsak) > [malu1245] / [mll]
ßatarxobu (Lipitav) > [malu1245] / [mll]
Novol (Bangir) > [lete1241] / [nms]
Riwo > [geda1237] / [gdd]
Tesmbol (Melaklak) > (?) [aulu1238] / [aul] (or something related)
Tesmbol (Usus) > (?) [aulu1238] / [aul] (or something related)

Wrong glottocode for 1686 (Betawi Malay (Tengahan dialect))

Hello, language ID 1686 (Betawi Malay (Tengahan dialect)) currently has the glottocode lame1259 (for Lamenu-Lewo), which doesn't seem right. It should be something more along the lines of beta1252 (for Betawi) ... although the terminology and classifications surrounding "Betawi" and various other Malayic varieties spoken in and around Jakarta are a mess.

Is there a list somewhere of who the cognacy experts are?

Is there anywhere one can see which expert made a given cognacy judgement?

Lengo glottocode

I think the languoid with the ABVD ID 520 should have the glottocode "pari1257", not "pari1237".

reality check cognacy - possible extra cognates filled in relatively easily!

I checked the FormTable for two kinds of issues that could be found easily and presented to a human reviewer for improvement of ABVD cognacy. Some forms may be identical but shouldn't belong to the same cognacy class, and vice versa, so human reviewing is necessary. I'm presenting these instances for the ABVD-team to consider.

forms without cognacy that are identical to other forms for the same concept which have assigned cognacy

example:

Concept	Form_1	Cognacy_1	Form_2
hand	tangan	18	tangan
left	karuk	83	karuk
legfoot	au	80	au
legfoot	kuku	46	kuku

Forms that can be filled in like this make up 8% of the entire dataset (24903 forms).

EDIT: I didn't cut down duplicate matches appropriately, the number is smaller (2-9,000) and @xrotwang and I get different numbers. If the ABVD-team wants to investigate this, I can spend more time fine-tuning the pattern-finding.

abvd_possible_matches.csv

same form, same concept but different cognacy classes.

example:

Language_ID	Form	Cognacy
134	sunu	1?
163	sunu	106
1353	sunu	4
1368	sunu	1

There are 661 concept-form matchings ('144_toburn' - 'sunu') where there are identical forms assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50') there are 4910 of this kind.

abvd_different_cognate_same_form_excl_multiple.csv

Rscript for finding these instances:

library(tidyverse)
library(cluster)
library(reshape2)
library(stringdist)

forms <- read_csv("https://github.com/lexibank/abvd/raw/ccff2bc86c30b102cd5b95174fafb378ddc0d3eb/cldf/forms.csv", show_col_types = F)

unknown_cognacy <-  forms %>% 
  filter(is.na(Cognacy)) %>% 
  dplyr::select(ID_Var1= ID, Var1 = Form)  

known_cognacy <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  dplyr::select(Var2 = Form, Var2_Cognacy = Cognacy)  

percentage_unknown <- round(nrow(unknown_cognacy)/ (nrow(known_cognacy) + nrow(unknown_cognacy) ), digits = 2)

known_cognacy <- known_cognacy %>% distinct()
unknown_cognacy <- unknown_cognacy %>% distinct()

#make df to join to in loop
dist_full <- matrix(nrow = 0, ncol = 7) %>% 
  as.data.frame() %>% 
  rename("Var1" = V1, "Var2" = V2, "lv_dist"= V3, "Form_1" = V4, "Cognacy_1" = V5, "Form_2" = V6, "Cognacy_2" = V7) %>% 
  mutate_if(.predicate = is.logical, as.character) %>% 
  mutate(lv_dist = as.numeric(lv_dist))

##
#df to join info on the side of dists df
left <- forms %>%
dplyr::select(Var1 = ID, Form_1 = Form, Cognacy_1 = Cognacy)
right <- forms %>%
dplyr::select(Var2 = ID,Form_2 = Form, Cognacy_2 = Cognacy)

#vector of unique concepts to loop over
Parameters_ID_unique_vector <- forms$Parameter_ID %>% unique()

#index to start loop at
index <- 0

#for loop, calculating the lv dist each time for all words within each concept
for(Parameter in Parameters_ID_unique_vector){

index <- index + 1

cat(paste0("I'm on ", Parameters_ID_unique_vector[index], ". Which is index ", index, " out of ", length(Parameters_ID_unique_vector), ".\n"))

forms_spec <- forms %>%
filter(Parameter_ID == Parameters_ID_unique_vector[index])
#filter(Parameter_ID == "122_water")

form_vec <- as.vector(forms_spec$Form)

names(form_vec) <- forms_spec$ID

dists <- stringdistmatrix(a = form_vec, b = form_vec, method = "lv",  useNames = "names")

dists[upper.tri(dists, diag = T)] <- NA

dists_long <- dists %>%
reshape2::melt() %>%
filter(!is.na(value)) %>%
filter(value <= 2)  %>%
distinct() %>%
mutate(Var1 = as.character(Var1)) %>%
mutate(Var2 = as.character(Var2)) %>%
rename(lv_dist = value) %>%
left_join(left, by = "Var1") %>%
left_join(right, by = "Var2") %>%
distinct()

dist_full <- full_join(dist_full, dists_long, by = c("Var1", "Var2", "lv_dist", "Form_1", "Cognacy_1", "Form_2", "Cognacy_2"))  %>%
  distinct()

}

#different cognate same form
different_cognate_same_form_incl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) 

different_cognate_same_form_excl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  filter(!str_detect(Cognacy, ",")) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) %>% 
  arrange(desc(n))

different_cognate_same_form_excl_multiple %>% 
  write_csv("output/abvd_different_cognate_same_form_excl_multiple.csv", na = "")
  
cat("There are ", nrow(different_cognate_same_form_excl_multiple), " concept-form matchings ('144_toburn' - 'sunu') where there are identical forms assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50'), there are ", nrow(different_cognate_same_form_incl_multiple), " of this kind.\n", sep = "")

#forms with missing cognacy that could probably be filled in easily
possible_matches <- dist_full %>% 
  filter(is.na(Cognacy_2)) %>% 
  filter(!is.na(Cognacy_1)) %>% 
  filter(lv_dist <= 0)

possible_matches %>% 
write_csv("output/abvd_possible_matches.csv", na = "")

cat("There are ", nrow(possible_matches 
), " words where you could easily fill in the cognacy because they are identical to other words which are already filled in for cognacy. For example, 'tangan' for the concept hand is assigned cognacy class 18 in some languages but no cognacy in others. The amount that can be filled in like this are ",round(nrow(possible_matches 
) / nrow(forms), 2) *100, "% of the entire dataset.\n", sep = "")

Include classification in LanguageTable?

As far as I can see, the ABVD classification isn't included in the CLDF dataset yet. If we want to re-implement the php app in clld and load data from the CLDF, this might be necessary.

Some glottocodes / ISO codes to check

Hello! I think the following languages might have the wrong glottocodes / ISO codes (with my suggestions for what I think they should be):

Houaïlou --> [ajie1238] / [aji]
Axamb (Avok) --> [avok1244] / []
Proto-Tsouic --> [tsou1250] / []
Bontok, Eastern --> [fina1242] / [bkb]

And here are some with just ISO codes that look off:

Saipan Carolinian Tanapag --> [tpv]
Lamenu (Filakara) --> [lww]
Dadu'a --> [] (or [ilu], for language-level)

Best,
Russell

. -> , in cognacy field

I think that for the three forms below, the cognacy should contain a comma rather than a period to signify subcognacy/compound.

276-26_hair-5	164076	276	26_hair	but-sek	but-sek		(Chulo-hsien)	Ferrell69	1.65	false
1371-12_skin-1	318779	1371	12_skin	huij	huij		skin	317042	1.96	false
1425-188_what-1	327001	1425	188_what	somo	somo			schlossberg2012	1.75	false

Here are the other members of 1, 65 for 26_hair

ID	Local_ID	Language_ID	Parameter_ID	Value	Form	Segments	Comment	Source	Cognacy	Loan
405-26_hair-1	108575	405	26_hair	buek	buek			Daroya-405-2006	1,65	false
416-26_hair-1	110798	416	26_hair	bu'ʔuk	bu'ʔuk			136553	1,65	false
494-26_hair-1	130688	494	26_hair	buek	buek		(final) (no glottal between in most subdialects)	Davis-494-2007	1,65	false
718-26_hair-1	184291	718	26_hair	*buSék	*buSék			109780	1,65	false

1,96 for 12_skin

ID	Local_ID	Language_ID	Parameter_ID	Value	Form	Segments	Comment	Source	Cognacy	Loan
57-12_skin-1	3984	57	12_skin	wiriji	wiriji			GraceNB	1,96	false
1382-12_skin-1	319681	1382	12_skin	geldu-n	geldu-n		skin	317042	1,96	false
1386-12_skin-1	320109	1386	12_skin	galaʔatjuː-m	galaʔatjuː-m		skin	317042	1,96	false
1388-12_skin-1	320317	1388	12_skin	gulidjoˈi-	gulidjoˈi-		skin	317042	1,96	false

1,75 for 188_what

ID	Local_ID	Language_ID	Parameter_ID	Value	Form	Segments	Comment	Source	Cognacy	Loan
1346-188_what-1	315887	1346	188_what	som	som		what	LincolnND	1,75	false
1355-188_what-1	315590	1355	188_what	som	som		what	LincolnND	1,75	false

Inconsistent use of slash / solidus

It seems that the forward slash is mainly used to indicate alternate forms for a given concept, but it also crops up in other places, perhaps to indicate morpheme boundaries (?), as in 'twenty' and 'fifty' in Malagasy (Sakalava) [1184] and Malagasy (Tandroy) [1186]: <roa/pòlo> and <lima/m/pòlo>. A few words for 'vomit' also seem to have slashes, e.g., Rarotongan <rua/ki>. This of course results in the problematic interpretation that <rua> and <ki> are both forms for 'vomit', whereas there's really just one form /ruaki/.
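
A minimal sketch of how this misreading can arise, assuming (hypothetically) that forms are obtained by splitting each value on every slash. `split_alternatives` is an illustrative helper, not the function the dataset actually uses:

```python
def split_alternatives(value: str):
    """Naive splitter: treat every '/' as separating alternative forms."""
    return [part.strip() for part in value.split("/") if part.strip()]

print(split_alternatives("rua/ki"))       # ['rua', 'ki']
print(split_alternatives("lima/m/pòlo"))  # ['lima', 'm', 'pòlo']
```

With a splitter like this, a slash that marks a morpheme boundary yields fragments that look like independent forms, so the two uses of the slash would need to be distinguished in the source data.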