ACQDIV

This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.

Publication

If you use the database in your research, please cite as follows:

Jancso, Anna, Steven Moran, and Sabine Stoll.
"The ACQDIV Corpus Database and Aggregation Pipeline."
Proceedings of The 12th Language Resources and Evaluation Conference. 2020.

Link to Paper

Resources

Download the ACQDIV database (only public corpora):

DOI

To request access to the full database including the private corpora (for research purposes only!), please contact Sabine Stoll. For technical questions, please open an issue on this repository.


Corpora

Our full database consists of the following corpora:

Corpus                                                     ISO  Public     # Words
Chintang Language Corpus                                   ctn  no         987'673
Cree Child Language Acquisition Study (CCLAS) Corpus       cre  yes         44'751
English Manchester Corpus                                  eng  yes      2'016'043
MPI-EVA Jakarta Child Language Database                    ind  yes      2'489'329
Allen Inuktitut Child Language Corpus                      ike  no          71'191
MiiPro Japanese Corpus                                     jpn  yes      1'011'670
Miyata Japanese Corpus                                     jpn  yes        373'021
Ku Waru Child Language Socialization Study                 mux  yes         65'723
Sarvasy Nungon Corpus                                      yuw  yes         19'659
Qaqet Child Language Documentation                         byx  no          56'239
Stoll Russian Corpus                                       rus  no       2'029'704
Demuth Sesotho Corpus                                      sot  yes        177'963
Tuatschin Corpus                                           roh  no         118'310
Koç University Longitudinal Language Development Database  tur  no       1'120'077
Pfeiler Yucatec Child Language Corpus                      yua  no         262'382
Total                                                                   10'843'735

Running the pipeline

For Windows users, follow the installation/run instructions here: https://github.com/acqdiv/acqdiv/wiki/Installation-Run-instructions-for-Windows

For Mac and Linux users, continue below to run the pipeline yourself:

Install the package

Create a virtual environment [optional]:

python3 -m venv venv
source venv/bin/activate

You can install the package from PyPI or directly from source:

PyPI

pip install acqdiv

From source

# Clone Repository
git clone git@github.com:acqdiv/acqdiv.git
cd acqdiv

# Install package (for users!)
pip install .

# Developer mode (for developers!)
pip install -r requirements.txt

Get the corpora

Run the following script to download the public corpora:

python util/download_public_corpora.py

The downloaded corpora are placed in the folder corpora.

For the private corpora, either place the session files in corpora/<corpus_name>/{cha|toolbox}/ and the metadata files (only Toolbox corpora) in corpora/<corpus_name>/imdi/ or edit the paths to those files in the config.ini (also see below).
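
The default layout described above can be sketched in a few lines of Python. This is only an illustration of the folder convention, not part of the acqdiv package: the helper name expected_private_dirs and the corpus folder name in the usage note are hypothetical.

```python
from pathlib import Path


def expected_private_dirs(corpora_dir, corpus_name, fmt):
    """Return the default directories for one private corpus.

    fmt is "cha" or "toolbox". Only Toolbox corpora have a separate
    imdi/ folder for metadata files.
    """
    if fmt not in ("cha", "toolbox"):
        raise ValueError(f"unknown format: {fmt!r}")
    base = Path(corpora_dir) / corpus_name
    dirs = {"sessions": base / fmt}
    if fmt == "toolbox":
        dirs["metadata"] = base / "imdi"
    return dirs
```

For example, expected_private_dirs("corpora", "MyCorpus", "toolbox") returns the session folder corpora/MyCorpus/toolbox and the metadata folder corpora/MyCorpus/imdi, where "MyCorpus" stands in for the real corpus folder name.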

Generate the database

Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):

[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...

Optionally adapt the paths for the individual corpora (sessions and metadata_dir).
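
Before running the pipeline, it can help to check that the two global paths are absolute and have no trailing slash. The following is a minimal sketch using Python's standard configparser. It assumes the edited file parses as a standard INI file, and the function name check_global_paths is hypothetical (it is not part of acqdiv).

```python
import configparser
from pathlib import PurePosixPath


def check_global_paths(config_text):
    """Return a list of problems found in the [.global] path settings."""
    # interpolation=None keeps any '%' characters in the file literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for key in ("corpora_dir", "db_dir"):
        value = parser.get(".global", key, fallback=None)
        if value is None:
            problems.append(f"{key}: missing")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key}: not an absolute path")
        elif len(value) > 1 and value.endswith("/"):
            problems.append(f"{key}: has a trailing slash")
    return problems
```

Calling check_global_paths(open("/absolute/path/to/config.ini").read()) returns an empty list when both settings look fine.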

Run the pipeline specifying the absolute path to the configuration file:
acqdiv load -c /absolute/path/to/config.ini
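
Once the pipeline has finished, a quick sanity check is to list the tables in the generated database and count their rows. The sketch below assumes the output is an SQLite file (as the sqlite_to_r.R step suggests) and makes no assumption about the schema: it reads the table names from SQLite's own sqlite_master catalogue. The function name table_row_counts is hypothetical.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself; double quotes
        # keep unusual names valid as identifiers.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

Pass the path of the database file written to db_dir, for example table_row_counts("/absolute/path/to/sqlite-DB"). Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result usually means the path is wrong.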

Generate the R object

Install dependencies

$ R
> install.packages("RSQLite")
> install.packages("rlang")

Navigate to src/acqdiv/database and run:

Rscript sqlite_to_r.R /absolute/path/to/sqlite-DB

Run tests

Run the unit tests:
pytest tests/unittests

Run the integrity tests on the database:
pytest tests/systemtests
