acqdiv / acqdiv

Pipeline for the ACQDIV Corpus Database

License: Other

Python 99.89% R 0.11%

language-acquisition linguistics-databases corpora databases linguistics child-language typology corpus-linguistics cross-linguistic-data

acqdiv's Introduction

ACQDIV

This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.

Publication

If you use the database in your research, please cite as follows:

Jancso, Anna, Steven Moran, and Sabine Stoll.
"The ACQDIV Corpus Database and Aggregation Pipeline."
Proceedings of The 12th Language Resources and Evaluation Conference. 2020.

Link to Paper

Resources

Download the ACQDIV database (only public corpora):

To request access to the full database including the private corpora (for research purposes only!), please contact Sabine Stoll. For technical questions, please open an issue on this repository.

Corpora

Our full database consists of the following corpora:

Corpus	ISO	Public	# Words
Chintang Language Corpus	ctn	no	987'673
Cree Child Language Acquisition Study (CCLAS) Corpus	cre	yes	44'751
English Manchester Corpus	eng	yes	2'016'043
MPI-EVA Jakarta Child Language Database	ind	yes	2'489'329
Allen Inuktitut Child Language Corpus	ike	no	71'191
MiiPro Japanese Corpus	jpn	yes	1'011'670
Miyata Japanese Corpus	jpn	yes	373'021
Ku Waru Child Language Socialization Study	mux	yes	65'723
Sarvasy Nungon Corpus	yuw	yes	19'659
Qaqet Child Language Documentation	byx	no	56'239
Stoll Russian Corpus	rus	no	2'029'704
Demuth Sesotho Corpus	sot	yes	177'963
Tuatschin Corpus	roh	no	118'310
Koç University Longitudinal Language Development Database	tur	no	1'120'077
Pfeiler Yucatec Child Language Corpus	yua	no	262'382
Total			10'843'735

Running the pipeline

For Windows users, follow the installation/run instructions here: https://github.com/acqdiv/acqdiv/wiki/Installation-Run-instructions-for-Windows

For Mac and Linux users, continue here to run the pipeline yourself:

Install the package

Create a virtual environment [optional]:

python3 -m venv venv
source venv/bin/activate

You can install the package from PyPI or directly from source:

PyPI

pip install acqdiv

From source

# Clone Repository
git clone git@github.com:acqdiv/acqdiv.git
cd acqdiv

# Install package (for users!)
pip install .

# Developer mode (for developers!)
pip install -r requirements.txt

Get the corpora

Run the following script to download the public corpora:

python util/download_public_corpora.py

After the download, the corpora are in the folder corpora.

For the private corpora, either place the session files in corpora/<corpus_name>/{cha|toolbox}/ and the metadata files (only Toolbox corpora) in corpora/<corpus_name>/imdi/ or edit the paths to those files in the config.ini (also see below).
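The default layout for a private corpus can be sketched as follows. This is only an illustration of the paths named above: the helper private_corpus_dirs and the corpus name MyCorpus are hypothetical and not part of the acqdiv package.

```python
from pathlib import Path


def private_corpus_dirs(corpora_dir, corpus_name, fmt):
    """Return the directories expected for one private corpus.

    fmt is "cha" (CHAT corpora) or "toolbox" (Toolbox corpora).
    Only Toolbox corpora get an additional imdi/ metadata directory.
    """
    if fmt not in ("cha", "toolbox"):
        raise ValueError(f"unknown format: {fmt!r}")
    base = Path(corpora_dir) / corpus_name
    dirs = {"sessions": base / fmt}
    if fmt == "toolbox":
        dirs["metadata"] = base / "imdi"
    return dirs


# Hypothetical corpus name, used for illustration only.
for name, path in private_corpus_dirs("corpora", "MyCorpus", "toolbox").items():
    print(name, path)
```

If the files live elsewhere, the paths in config.ini have to be edited instead.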

Generate the database

Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):

[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...

Optionally adapt the paths for the individual corpora (sessions and metadata_dir).
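Before running the pipeline, the two global paths can be sanity-checked. The following is a minimal sketch, assuming the configuration can be read by Python's configparser; check_global_paths is a hypothetical helper and not part of the acqdiv CLI.

```python
import configparser
from pathlib import PurePosixPath


def check_global_paths(config_text):
    """Return a list of problems found in the [.global] section."""
    # Interpolation is switched off so that "%" in paths is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for key in ("corpora_dir", "db_dir"):
        value = parser.get(".global", key, fallback=None)
        if value is None:
            problems.append(f"{key} is missing")
        # POSIX rules are assumed here (Mac and Linux).
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key} is not an absolute path: {value}")
        elif len(value) > 1 and value.endswith("/"):
            problems.append(f"{key} has a trailing slash: {value}")
    return problems


sample = """
[.global]
corpora_dir = /absolute/path/to/corpora/dir
db_dir = /absolute/path/to/database/dir/
"""
print(check_global_paths(sample))
```

In the sample, db_dir ends in a slash on purpose, so the check reports it.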

Run the pipeline, specifying the absolute path to the configuration file:

acqdiv load -c /absolute/path/to/config.ini

Generate the R object

Install dependencies

$ R
> install.packages("RSQLite")
> install.packages("rlang")

Navigate to src/acqdiv/database and run:

Rscript sqlite_to_r.R /absolute/path/to/sqlite-DB

Run tests

Run the unit tests:

pytest tests/unittests

Run the integrity tests on the database:

pytest tests/systemtests

acqdiv's People

Contributors

Stargazers

Watchers

Forkers

bambooforest jancso sumitram

acqdiv's Issues

Release on PyPI

Fix CI

CircleCI is still connected to the old repository.

Automatic Deployment to PyPI

https://circleci.com/blog/continuously-deploying-python-packages-to-pypi-with-circleci/

Sound-to-noise integration

Integrate the sound-to-noise ratio computed by Omnia for Chintang.

Cleanup ini files

have a global config file
pass global config file to CLI
remove unnecessary sessions_dir variable for toolbox corpora
update READMEs

Add syntactic parses

Add syntactic dependencies for:

Manchester
Russian (Nick has parsed the dependency structures of Russian utterances)

Example (in Manchester):
%gra: 1|2|PRED 2|0|ROOT 3|2|PRED 4|2|PUNCT
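As a rough sketch of how such a tier could be read before the dependencies are stored, the following splits the example above into (position, head, relation) triples, where head 0 marks the root. The names Dependency and parse_gra are made up for this illustration and are not part of the pipeline.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    position: int   # position of the word in the utterance (1-based)
    head: int       # position of the head word, 0 for the root
    relation: str   # grammatical relation label


def parse_gra(tier):
    """Parse a CHAT %gra tier into a list of Dependency tuples."""
    prefix = "%gra:"
    if tier.startswith(prefix):
        tier = tier[len(prefix):]
    deps = []
    for item in tier.split():
        position, head, relation = item.split("|")
        deps.append(Dependency(int(position), int(head), relation))
    return deps


deps = parse_gra("%gra: 1|2|PRED 2|0|ROOT 3|2|PRED 4|2|PUNCT")
roots = [d.position for d in deps if d.head == 0]
print(roots)  # [2]
```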

Update `session_durations.csv`

Due to some corpus updates, file names have changed, so the file names in session_durations.csv are outdated. This in turn leads to missing durations.

We might want to fetch the most recent media files from CHILDES and re-generate session_durations.csv.

Both Japanese corpora, Nungon, and Sesotho are affected. Basically, we need to:

download the updated media
rerun the get session durations script
create the csv
reload the database with the session durations

Perhaps useful:

https://github.com/uzling/acqdiv-misc/blob/master/scripts/web/get_files_from_web.sh

https://media.talkbank.org/childes/Other/Sesotho/Demuth/Hlobohang/
https://media.talkbank.org/childes/Other/Sesotho/Demuth/Litlhare/
https://media.talkbank.org/childes/Other/Sesotho/Demuth/TseboNeuoe/
https://media.talkbank.org/childes/Japanese/Miyata/Tai/
https://media.talkbank.org/childes/Japanese/MiiPro/
https://media.talkbank.org/childes/Other/Nungon/Sarvasy/

Create script for downloading corpora from CHILDES TalkBank

Handle morphemes coding glosses

In most of our corpora with CHAT-style glosses, glosses are sometimes coded as part of morphemes, or the other way round.

Right now we can deal with the most frequent cases, i.e. glosses coding morphemes. For instance, the Turkish corpus has a lot of "glosses" such as GER:INCA. This is supposed to mean 'gerund ending in -IncA', so we replace the gloss by CVB and update the morpheme from ??? to IncA.

But there are also cases where morphemes code glosses. For instance, Japanese MiiPro has a "morpheme" da&PRES with "gloss" ???. This is the present tense form of the copula da, so the morpheme should be da and the gloss should be PRES (or ideally COP.PRES). We cannot deal with this case yet.
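The two cases can be sketched as follows. This is not the pipeline's actual code: the lookup table, the function fix_pair, and the rule that an uppercase part after "&" is a gloss are assumptions derived only from the two examples above.

```python
# Hypothetical lookup: "glosses" that actually encode a morpheme form.
# Maps the original gloss to (standard gloss, morpheme form).
GLOSS_CODES_MORPHEME = {
    "GER:INCA": ("CVB", "IncA"),
}

UNKNOWN = "???"


def fix_pair(morpheme, gloss):
    """Return a corrected (morpheme, gloss) pair for the two cases above."""
    # Case 1: the gloss codes the morpheme.
    if morpheme == UNKNOWN and gloss in GLOSS_CODES_MORPHEME:
        new_gloss, new_morpheme = GLOSS_CODES_MORPHEME[gloss]
        return new_morpheme, new_gloss
    # Case 2: the morpheme codes the gloss, e.g. "da&PRES" with gloss "???".
    # Assumption: the part after "&" is a gloss if it is all uppercase.
    if gloss == UNKNOWN and "&" in morpheme:
        form, _, label = morpheme.partition("&")
        if form and label.isupper():
            return form, label
    # Everything else is left untouched.
    return morpheme, gloss


print(fix_pair("???", "GER:INCA"))  # ('IncA', 'CVB')
print(fix_pair("da&PRES", "???"))   # ('da', 'PRES')
```

A real solution for the second case would probably need per-corpus rules, since "&" may be used differently across corpora.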

Add citation source

@Jancso please add the citation and the link to the paper to this repo -- http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.20.pdf

Database Schema changes

Email protocol:

corpora.name and acronym should be UNIQUE, right?
yes

missing language table (corpora, words, morphemes) with name, iso, glottocode?
yes, would be nice

make all (or most columns) in corpora NOT NULL?
yes

maybe make corpora.owner a join table?
yes, would be nice

sessions.corpus NOT NULL?
Yes

do you want an ord column on utterance (unique(session_id_fk, ord))?
Yes, I already suggested this as well but you said it's fine if we infer it from the ID

utterance.session_id_fk NOT NULL (also: what about speaker_id_fk)?
session_id_fk yes, speaker_id_fk probably no

same for other foreign keys to parent elements, they are NOT NULL, right?
not necessarily

are there no fields mandatory in utterance (e.g. utterance)?
yes, source ID

same for words, morphemes, and uniquespeakers
words.word maybe

turn speakers.languages_spoken into a join table?
yes, would be nice

missing ord (or pos, word_index etc.) on word table (w/ uniqueness constraint)
ditto

same for missing ord on morpheme table (position inside word)
ditto

can we drop the fk from morpheme to utterance (every morpheme being linked via word)?
nope

utterance.*_raw fields are denormalized, right? do we need them or is there a consistency check?
Not sure what a normalized variant would look like. However, I see that having cleaned and raw fields violates the third normal form (3NF).

morpheme.type make this an enum (check constraint)?
yes

should morpheme.lemma_id be a foreign key (join table)?
probably overkill

better naming would probably be participants (of a session) and speaker instead of speakers and uniquespeakers
I agree

enum (check constraint) for speakers.role and macrorole?
yes, would be nice

missing unique(speakers.session_id_fk, speakers.unique_speaker_id_fk), NOT NULLABLE as well
yes

any mandatory fields in speakers?
no

uniquespeakers speaker_label NOT NULL and UNIQUE?
unfortunately no

gender enum (check constraint), NOT NULL anything? :)
enum yes, not null no

does uniquespeakers.corpus add any information or is it just denormalized (also does not comply with the _fk naming convention used otherwise), would tend towards dropping it
already answered in previous mail

maybe use a numeric primary key for corpora as well (e.g. rename id to name)
I used string primary key for convenience reasons
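A few of the constraints agreed on above (NOT NULL foreign key, unique ord per session, enum via check constraint) can be tried out with an in-memory SQLite database. This is a minimal sketch and not the actual ACQDIV schema: the tables are reduced to a handful of columns, and the gender values 'Female' and 'Male' are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sessions (
    id INTEGER PRIMARY KEY,
    corpus TEXT NOT NULL
);
CREATE TABLE utterances (
    id INTEGER PRIMARY KEY,
    session_id_fk INTEGER NOT NULL REFERENCES sessions(id),
    ord INTEGER NOT NULL,
    utterance TEXT,
    UNIQUE (session_id_fk, ord)
);
CREATE TABLE uniquespeakers (
    id INTEGER PRIMARY KEY,
    speaker_label TEXT,
    -- enum via check constraint; NULL stays allowed
    gender TEXT CHECK (gender IN ('Female', 'Male'))
);
"""


def violates(conn, sql, params=()):
    """Return True if the statement is rejected by a constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO sessions (id, corpus) VALUES (1, 'Example')")
conn.commit()

insert_utt = "INSERT INTO utterances (session_id_fk, ord, utterance) VALUES (?, ?, ?)"
insert_spk = "INSERT INTO uniquespeakers (gender) VALUES (?)"
print(violates(conn, insert_utt, (1, 1, "first")))      # False
print(violates(conn, insert_utt, (1, 1, "duplicate")))  # True
print(violates(conn, insert_spk, ("unknown",)))         # True
print(violates(conn, insert_spk, (None,)))              # False
```

The last line shows why "enum yes, not null no" works with a plain check constraint: a NULL value passes the check.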

Add license

PyPI sticker

The sticker here:

https://github.com/acqdiv/acqdiv

doesn't actually go to the PyPI page for ACQDIV.

Look into morphemes and their stems and affixes

In Turkish, Japanese, and Yucatec, morphemes mostly contain stems and not affixes.

Add metadata categories to inis

@Jancso - I think I'll add a couple metadata categories to the corpora table:

License (is the corpus private or CC)
InputFormat (CHAT, Toolbox)

And maybe some information on the language's genealogical and geographic location. That way we can easily dump the information into tables for publications.

What do you think?

Revisit word_language mapping

At some point we may have decided not to infer a word's language from its morphemes because some words have morphemes with different language values. We should perhaps revisit this, since such words are linguistically interesting (and typically thought to be improbable or rare, but see: https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00082/full ). We could, for example, copy the morpheme language to word_language when all morphemes of a word share one language, and put a term like "mixed" into word_language when the morphemes of a word have different languages. It would then be interesting to see what percentage of word forms, if any, are mixed.
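The proposed rule could look roughly like this. It is only a sketch of the idea: the function names and the placeholder language labels are hypothetical, and how missing morpheme languages should be treated is an assumption (here they are ignored).

```python
def infer_word_language(morpheme_languages):
    """Infer a word's language from the languages of its morphemes.

    Returns the single language if all morphemes agree, "mixed" if
    they differ, and None if no morpheme has a language value.
    """
    languages = {lang for lang in morpheme_languages if lang}
    if not languages:
        return None
    if len(languages) == 1:
        return next(iter(languages))
    return "mixed"


def mixed_share(words):
    """Return the percentage of words whose morphemes mix languages.

    Words without any morpheme language are left out of the count.
    """
    labels = [infer_word_language(langs) for langs in words]
    known = [label for label in labels if label is not None]
    if not known:
        return 0.0
    return 100.0 * known.count("mixed") / len(known)


# Placeholder data: one list of morpheme languages per word.
words = [["lang_a", "lang_a"], ["lang_b"], ["lang_a", "lang_b"], [None]]
print([infer_word_language(w) for w in words])
print(mixed_share(words))
```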