FinSemEvl

(Copyright) Viljami Venekoski 2017

These are the resources produced for the NoDaLiDa 2017 conference short paper "Finnish resources for evaluating language model semantics" by V. Venekoski and J. Vankka.
The full text can be found here: http://www.ep.liu.se/ecp/article.asp?issue=131&article=028.

This repository contains three sets of data, one for each evaluation task:

(1) Similarity judgments
The directory has two files:

  • 'FinnSim_judgment_scores', containing the mean and standard deviation of the similarity judgments for each of the 300 word pairs, as answered by the 55 respondents
  • 'SimLex-TranslationsScores', containing the original SimLex-999 scores for the 300 word pairs that were translated into Finnish. The file includes the additional information from the original data set, but note that the concreteness scores should probably not be applied cross-lingually.
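
A common way to score a model against similarity judgments such as these is to rank-correlate the model's similarity for each word pair with the mean human judgment. The sketch below is a minimal illustration of that idea and not the authors' code: it computes a Spearman correlation with the standard library only, and the names `human_means` and `model_scores` and their values are hypothetical toy data. Loading the actual files is left out, since their column layout is not described here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one entry per word pair, in the same order.
human_means = [9.1, 7.4, 3.2, 0.8]       # mean human judgment per pair
model_scores = [0.83, 0.61, 0.40, 0.05]  # e.g. cosine similarity from a model

rho = spearman(human_means, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A higher correlation means the model orders the word pairs more like the respondents did. Word pairs that are missing from a model's vocabulary need to be handled separately, for example by skipping them and reporting how many were skipped.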

(2) Word intrusion
The directory has multiple files, each containing a list of words. The lists were retrieved from the Finnish-language Wikipedia in December 2016. The word intrusion task randomly samples 5 words from one list and 1 word from another, forming a six-word intrusion set. This should be repeated a specified number of times for each ordered pair of lists. The lists used in the paper and available here are the following:

  • animals
  • chemical_elements
  • cities
  • colors
  • countries
  • fruit
  • illnesses
  • mathematical_symbols
  • minerals
  • mythical_creatures
  • philosophers
  • pieces_of_clothing
  • poker_terms
  • professions
  • religions
  • sports
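
The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the paper: the function name `build_intrusion_sets`, the dictionary layout, and the English toy words are all assumptions, and reading the list files from disk is omitted.

```python
import random
from itertools import permutations


def build_intrusion_sets(word_lists, repeats, seed=0):
    """Build six-word intrusion sets for every ordered pair of lists.

    word_lists maps a list name to its words (hypothetical layout).
    For each ordered pair (source, other), draw 5 words from `source`
    and 1 intruder from `other`, `repeats` times.
    """
    rng = random.Random(seed)  # fixed seed so the sets are reproducible
    sets = []
    for source, other in permutations(word_lists, 2):
        for _ in range(repeats):
            inliers = rng.sample(word_lists[source], 5)
            intruder = rng.choice(word_lists[other])
            words = inliers + [intruder]
            rng.shuffle(words)  # hide the intruder's position
            sets.append({
                "source": source,
                "other": other,
                "words": words,
                "intruder": intruder,
            })
    return sets


# Hypothetical toy lists; the real lists are in Finnish and much longer.
toy_lists = {
    "colors": ["red", "blue", "green", "yellow", "black", "white"],
    "fruit": ["apple", "pear", "plum", "banana", "cherry", "lemon"],
    "sports": ["football", "tennis", "golf", "rugby", "hockey", "judo"],
}

intrusion_sets = build_intrusion_sets(toy_lists, repeats=2)
print(len(intrusion_sets), "sets, for example:", intrusion_sets[0]["words"])
```

With 3 lists there are 6 ordered pairs, so 2 repeats give 12 sets. The 16 lists above give 240 ordered pairs.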

(3) Analogies
The analogy test set contains one file per type of semantic relation. For instance, the file 'ana_gender' contains gendered analogies, analogizing female-male and male-female pairs. Each file contains all permutations of analogizable word pairs. Included relations are the following:

  • antonymic adjectives, e.g. 'WET is to DRY what RICH is to POOR' (in Finnish)
  • capital-country, 'BANGKOK is to THAILAND what TALLINN is to ESTONIA'
  • country-currency, 'RUSSIA is to RUBLE what JAPAN is to YEN'
  • female-male, 'MOTHER is to FATHER what GIRL is to BOY'
  • orthogonal directions, 'UP is to DOWN what LEFT is to RIGHT'
  • hockey team-city, 'HIFK is to HELSINKI what TAPPARA is to TAMPERE'
  • cardinal-ordinal number, 'ONE is to FIRST what FIVE is to FIFTH'
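
To illustrate what "all permutations of analogizable word pairs" could look like, the sketch below builds every ordered combination of two distinct word pairs from one relation. It is an assumption about the construction, not the authors' code: the function name `build_analogies` is hypothetical, the first two pairs come from the female-male example above, and the third pair is an invented English placeholder.

```python
from itertools import permutations


def build_analogies(pairs):
    """Return (a, b, c, d) tuples meaning 'a is to b what c is to d'.

    Every ordered combination of two distinct word pairs is used, so
    n pairs yield n * (n - 1) analogies.
    """
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


# Toy female-male pairs; the third pair is a hypothetical addition.
female_male = [("mother", "father"), ("girl", "boy"), ("queen", "king")]

analogies = build_analogies(female_male)
for a, b, c, d in analogies:
    print(f"{a.upper()} is to {b.upper()} what {c.upper()} is to {d.upper()}")

# Swapping each pair gives the opposite direction (male-female).
male_female = [(b, a) for a, b in female_male]
reverse_analogies = build_analogies(male_female)
```

Three pairs give six analogies per direction.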

Unfortunately, we are unable to provide the code used to calculate a language model's score in each of the above tasks, due to the policy of the Finnish Defence Forces, where this research was conducted. However, one should be able to replicate the procedure using the methods described in the paper. If you need assistance in using these resources in the described manner, do not hesitate to contact the first author at venekoski[ät]gmail.com or via social media.

If you wish to use the data in your research, please cite the following:

@inproceedings{venekoski2017finnish,
author = {Venekoski, Viljami and Vankka, Jouko},
institution = {National Defence University, Helsinki, Finland},
booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden},
number = {131},
pages = {231-236},
publisher = {Linköping University Electronic Press, Linköpings universitet},
title = {Finnish resources for evaluating language model semantics},
year = {2017},
issn = {1650-3740}
}
