See https://github.com/p

Add language-code-extensions.csv about language-codes HOT 7 CLOSED

datasets commented on July 26, 2024

Add language-code-extensions.csv

from language-codes.

Comments (7)

rufuspollock commented on July 26, 2024

@ppKrauss - first, welcome, and thanks for proposing an enhancement. Could you briefly summarize what change you'd like to see

/cc @Yannael - our managing core datasets curator!

Yannael commented on July 26, 2024

Nice work!

At first sight, it does not seem straightforward to merge these data with the existing language-codes-full.csv. So the best option is probably to include the file language-code-extensions.csv in the package and update the description.

A few more thoughts:

Rename language-codes-full.csv to iso-639.csv
Rename language-code-extensions.csv to ietf-language-tags.csv
The files language-codes-3b2.csv and language-codes.csv are redundant with language-codes-full.csv and could be removed.

What do you think?

@ppKrauss @ewheeler @rgrp

ppKrauss commented on July 26, 2024

@rgrp and @Yannael, thanks (!). I am glad to have this chance to discuss and collaborate.

As a user of locales (the standards for i18n and l10n) and of metadata such as xml:lang, I believe that root language codes alone are not enough. We need to combine them with country codes to obtain meaningful codes. Day to day I use pt-BR, and I notice an impact when it is changed to pt-PT, or when a bare pt is treated as pt-PT, in both contexts of use (locale or metadata).
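
To make the language-plus-country idea concrete, here is a minimal sketch of splitting a tag such as pt-BR into its language and region parts. The helper name `split_tag` is hypothetical, and the parsing is deliberately simplified: it ignores most of the full language-tag grammar and only recognizes a two-letter or three-digit region subtag.

```python
def split_tag(tag):
    """Split a simple language tag such as 'pt-BR' into (language, region).

    Simplified on purpose: scripts, variants and extensions are skipped,
    and region is None when the tag carries no region subtag.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    region = None
    for part in parts[1:]:
        if len(part) == 1:
            # A single-character subtag starts an extension; stop looking.
            break
        is_alpha2 = len(part) == 2 and part.isalpha()
        is_digit3 = len(part) == 3 and part.isdigit()
        if is_alpha2 or is_digit3:
            region = part.upper()
            break
    return language, region


print(split_tag("pt-BR"))  # ('pt', 'BR')
print(split_tag("pt"))     # ('pt', None)
```

A bare pt comes back with no region, which is exactly the ambiguous case described above: the consumer has to decide whether it means pt-BR or pt-PT.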

As a programmer I see that there are many possible language×country combinations,
~190 × ~250 ≈ 47500, so it makes sense to list at datasets/language-codes the ~700 valid combinations (~1.5%). The goal of the Datasets project is not very clear to me, but as a datasets user and enthusiast, I would rather list "the officially valid combinations" here than use http://i18ndata.appspot.com/cldr/main
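
The estimate can be checked in a few lines. This is only a sketch using the rounded counts quoted in the comment (about 190 languages, 250 territories, and 700 valid combinations), not exact figures from CLDR.

```python
# Rounded counts taken from the comment; they are approximations.
languages = 190
territories = 250
valid_combinations = 700

possible = languages * territories
share = valid_combinations / possible

print(possible)          # 47500
print(f"{share:.1%}")    # 1.5%
```

With these inputs, only a small fraction of the theoretical language×country grid is actually in use, which is the argument for publishing the valid list rather than leaving users to build it.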

Synopsis: the objective of this (CSV) proposal is to generate a summary of unicode.org/Public/cldr core for datasets/language-codes.

PS: the name "language-tag-extensions" for this official list was somewhat arbitrary; I took it from the URL http://www.iana.org/assignments/language-tag-extensions-registry/language-tag-extensions-registry

ppKrauss commented on July 26, 2024

@Yannael, sorry, you were fast :-)

I think your suggested way forward is perfect!

PS: about the "Preparation" section of this project: I am not a Python expert, but I can help translate the PHP to Python.

Yannael commented on July 26, 2024

@ppKrauss Thanks :)

I agree with you, language regional codes should also be included in the core packages.

The dataset you provided seems very nice for this (i.e. using version 27 of http://www.unicode.org/Public/cldr/)

Since we already have the language-codes package, the best is to merge it with this package. Do you agree?

If you do, the best way to proceed is:

Fork datasets/language-codes
Rename language-code-extensions.csv to ietf-language-tags.csv, and add it to the fork
Update the Readme section so that the information you wrote in your repository https://github.com/ppKrauss/language-tag-extensions/blob/master/README.md is also included there (see the guidelines at http://data.okfn.org/doc/publish-faq#readme)
And then send back the link here

If you think we should do it differently, let me know.

Since the main goal of the datasets project is to ensure easy sharing of datasets, we set out a few guidelines at http://data.okfn.org/doc/publish-faq. Can you have a look?

Looking forward to your feedback!

cc/ @rgrp

ppKrauss commented on July 26, 2024

Thanks! I followed your recipe (!)... let's see if it works ;-)

About the fields: I described them in datapackage.json. I have not harmonized the names; perhaps langType could be renamed to iso639-1-alpha2 and territory to ISO3166-1-alpha2...

About "merge these data with the existing language-codes-full.csv": I can do it, but perhaps users prefer the normalized form. Then again, normalization does not help here: I do not know of a join mechanism for CSV, and I do not see one in tabular-data-model either.
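
On the join question: CSV itself has no join mechanism, but one can be done on the consumer side. The sketch below uses Python's standard csv module and joins on the two-letter language code. The column names (`lang`, `langType`, `territory`, `alpha3-b`, `alpha2`, `English`) and the inline sample rows are assumptions for illustration, not a copy of the published files, and `join_tags` is a hypothetical helper.

```python
import csv
import io

# Illustrative stand-ins for the two files; column names are assumed.
ISO_639_CSV = """alpha3-b,alpha2,English
por,pt,Portuguese
eng,en,English
"""

TAGS_CSV = """lang,langType,territory
pt-BR,pt,BR
pt-PT,pt,PT
en-GB,en,GB
"""


def join_tags(tags_file, iso_file):
    """Left-join tag rows to language names on the two-letter code."""
    names = {
        row["alpha2"]: row["English"]
        for row in csv.DictReader(iso_file)
        if row["alpha2"]  # skip languages without a two-letter code
    }
    joined = []
    for row in csv.DictReader(tags_file):
        merged = dict(row)
        # Unknown codes get an empty name instead of raising an error.
        merged["English"] = names.get(row["langType"], "")
        joined.append(merged)
    return joined


rows = join_tags(io.StringIO(TAGS_CSV), io.StringIO(ISO_639_CSV))
for row in rows:
    print(row["lang"], row["English"])
```

Keeping the two files separate and joining on demand like this preserves the normalized form while still letting users get a merged view when they need one.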

Yannael commented on July 26, 2024

@ppKrauss @rgrp
Great!
Merged.
Note: I made a few edits to add this additional resource to the title and data sources.
Regarding renaming langType and territory, I think it is not necessary since the ISO info is in the description.

Thanks!
