unimorph's Introduction

These are the raw data that were developed as part of the project reported in the paper:

Richard Sproat, Bruno Cartoni, HyunJeong Choe, David Huynh, Linne Ha, Ravindran Rajakumar, Evelyn Wenzel-Grondie. 2014. A Database for Measuring Linguistic Information Content. Language Resources and Evaluation Conference, Reykjavík, Iceland.

http://www.lrec-conf.org/proceedings/lrec2014/pdf/47_Paper.pdf


Abstract

Which languages convey the most information in a given amount of space? This is a question often asked of linguists, especially by engineers who often have some information-theoretic measure of "information" in mind, but rarely define exactly how they would measure that information. The question is, in fact, remarkably hard to answer, and many linguists consider it unanswerable. But it is a question that seems as if it ought to have an answer. If one had a database of close translations between a set of typologically diverse languages, with detailed marking of morphosyntactic and morphosemantic features, one could hope to quantify the differences between how these different languages convey information. Since no appropriate database exists, we decided to construct one. The purpose of this paper is to present our work on the database, along with some preliminary results. We plan to release the dataset once complete.


Per the paper

Our data are taken from a few domains of interest to Google including driving directions and answers generated from structured data for Google Now™. (Note that no Google user data is included in our data collection.) Obviously such examples are but a subset of the ways in which language is used to communicate: The reason for picking data from this circumscribed set of domains is that for part of the data at least, the text corresponds to, and in a real application would be automatically generated from, data in a defined format (see below for an example). Therefore the basic intended meaning of a message is to a large extent given, thus obviating the need to do semantic annotation. By producing parallel target sentences in various languages, and making sure that the translations are as close as possible, while still being stylistically and socially acceptable, we can minimize differences in information content that might arise for irrelevant reasons, such as liberal choices of wording taken by the translators. We are therefore focusing as much as possible on what the languages must convey, rather than on what they may convey.

Our initial dataset consists of 85 sentences from a mix of domains for the following languages: English, French, Italian, German, Russian, Arabic, Korean and Mandarin Chinese. These languages were chosen from among languages for which we have very good resources, to be somewhat typologically balanced, representing languages of the “isolating” or quasi-isolating type (English, Mandarin), “inflectional” (French, Italian, German, Russian, Arabic) and “agglutinative” (Korean). We are also interested in languages with rich case systems (German, Russian, Korean), gender systems (French, Italian, German, Russian, Arabic), and a variety of language families (four in this case). For the current dataset, translators were given the English original in a spreadsheet, and were given the following instructions:

This is a request for natural sounding and socially appropriate translations which should be inserted directly into the provided spreadsheet in the column for your language. Important: There is no character restriction for these translations. However, we want translations that are as succinct as possible, natural sounding, and socially appropriate.
