
Comments (8)

thadguidry avatar thadguidry commented on April 26, 2024

@vrandezo Thanks Denny. I also agree with you: since Pinyin is an input format, not an output format, it's not reversible. (I just wanted to make sure there wasn't something I was missing conceptually from the AbstractText effort regarding transliteration handling. Thanks for explaining!)

Regarding the Chinese example: talk, sleep, and tax are all pronounced the same in Chinese and are told apart by sentence context, while water and who are pronounced differently. All 5 are typed as "shui" in input systems, where the user is usually given a popup choice for which Chinese lexeme they mean from the English Pinyin input.

Closing this issue now, since we have the use case and our agreed probable handling for it as a reference.

from abstracttext.

vrandezo avatar vrandezo commented on April 26, 2024

Thanks for opening the issue (and yay, Issue #1!!)

And I have to admit, I am not sure I understand the issue. This is because I really don't understand how Chinese works, so my answer might be entirely beside the point.

So I will rephrase how I understand the question and then answer that question. Please don't let me get away with it if I entirely missed the point.

Serbian Wikipedia, for example, uses two scripts (so do a few others, such as Uzbek, Tatar, etc.). And the question is how would Abstract Wikipedia support both of those scripts?

In Serbian, the situation is particularly simple: the Latin transliteration can be generated easily from Cyrillic input. So it is possible to simply generate Cyrillic output and then, at the very end, run a transliteration function over the resulting string that converts it to Latin.
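As a rough illustration of that final pass, here is a minimal Python sketch. The function name `to_latin` and the table `CYR_TO_LAT` are made up for this example and are not part of any Abstract Wikipedia code; the table follows the standard Serbian Cyrillic to Latin letter correspondence.

```python
# Hypothetical final-pass transliteration: Serbian Cyrillic -> Latin.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text):
    """Map each Cyrillic letter to its Latin form; leave everything else alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # punctuation, digits, spaces, non-Cyrillic letters
            continue
        latin = CYR_TO_LAT[lower]
        # An uppercase input letter becomes a capitalized output (Њ -> Nj).
        out.append(latin.capitalize() if ch != lower else latin)
    return "".join(out)


print(to_latin("Београд"))
```

Because every Cyrillic letter has exactly one Latin counterpart, this direction needs no context and can run on the finished string.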

This does not always work: for example, the reverse wouldn't be as trivial, because Њ transliterates to nj, but the two letters n and j transliterate to н and ј respectively. In that case we would need to retain the information whether these are the two letters n and j which happen to be next to each other or whether it is the digraph nj.

This can be done either by creating a slightly abstract output that retains this information with a special token, followed by a final pass over the result that removes these tokens and replaces them with the concrete letters, or by rewriting the functions so that they take the script as a parameter, pushing this knowledge deeper into the function stack.

Either solution would be possible, and the respective language community can decide which one makes more sense for their particular language (in fact, this could go a long way toward resolving the differences between standard Croatian and Serbian).
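To make the first option concrete, here is a small Python sketch of the token idea for the Latin to Cyrillic direction. Everything in it is an assumption for illustration: the marker `SEP`, the names `render_latin` and `render_cyrillic`, and the choice to handle lowercase only.

```python
# Hypothetical marker meaning "these two letters are NOT a digraph".
SEP = "|"

DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLES = {
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ", "e": "е",
    "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к", "l": "л", "m": "м",
    "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "ć": "ћ",
    "u": "у", "f": "ф", "h": "х", "c": "ц", "č": "ч", "š": "ш",
}


def render_latin(abstract):
    """Final pass for Latin output: just drop the markers."""
    return abstract.replace(SEP, "")


def render_cyrillic(abstract):
    """Final pass for Cyrillic output: digraphs merge unless a marker splits them."""
    out = []
    i = 0
    while i < len(abstract):
        if abstract[i] == SEP:
            i += 1  # the marker has done its job of separating two letters
        elif abstract[i:i + 2] in DIGRAPHS:
            out.append(DIGRAPHS[abstract[i:i + 2]])
            i += 2
        else:
            out.append(SINGLES.get(abstract[i], abstract[i]))
            i += 1
    return "".join(out)


word = "in" + SEP + "jekcija"  # n and j are separate letters here
print(render_latin(word), render_cyrillic(word), render_cyrillic("konj"))
```

Without the marker, "injekcija" would be rendered with њ, which is exactly the ambiguity described above. The second option (passing the script as a parameter) would avoid the marker at the cost of touching more functions.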

So, I hope that this answer somehow applies to your question. If it doesn't, please let me know and give me a bit more background. Thank you!


thadguidry avatar thadguidry commented on April 26, 2024

Yes, it answers it partially.
I completely understand that functions could read information from lots of places.
The question is WHERE the information is (best) stored.

So my only question is about Wikidata Lexemes themselves storing the transliteration maps, and how best to store that information so that Abstract Text functions can read it properly.

  1. Where in the Lexeme ecosystem would the transliteration mapping be applied that functions could read from? Would it be on the ZH entities? or the EN entities? or both? or somewhere else?

  2. Would P1721 "pinyin transliteration" always be used as a qualifier within the translation statement? Ex: https://www.wikidata.org/wiki/Lexeme:L3302
    (screenshot)

Or would P1721 "pinyin transliteration" be used as a direct statement on the ZH entity (which mimics how input systems work)? Ex: https://www.wikidata.org/wiki/Lexeme:L8219
(screenshot)


vrandezo avatar vrandezo commented on April 26, 2024

As I said, I really am not sufficiently knowledgeable about Chinese.

If I understand it correctly, and the transliteration is always the same for a given Chinese lexeme, and does not differ based on Sense or Form, then I would think that it makes more sense as a statement on the Chinese lexeme (as in your last screenshot).

If it is on the translation of the English lexeme for water, it looks like it is a denormalization - that date should not be a qualifier on that translation, as in your first screenshot, that doesn't look right to me. This would lead to a lot of duplication.


thadguidry avatar thadguidry commented on April 26, 2024

date?


thadguidry avatar thadguidry commented on April 26, 2024

Here's a transliteration map from OpenVanilla.org.
Hopefully this clarifies the question enough for you to offer good advice... and then we can close this issue out after you respond.

shui 水  - water
shui 说  - talk
shui 谁  - who
shui 睡  - sleep
shui 税  - tax


vrandezo avatar vrandezo commented on April 26, 2024

"date" - mistyped, I meant datum, snak, or piece of information.


vrandezo avatar vrandezo commented on April 26, 2024

Regarding the example you showed:

It looks there as if every Chinese character has only a single Pinyin transliteration, but the result is not reversible, i.e. the same string in Latin script is ambiguous when converted back to Chinese characters.

That would indicate that it would make sense to have the render function for Chinese create Chinese characters, and if a transliteration into pinyin is desired, a function can run on top of that.
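A tiny Python sketch of why the direction matters. The table is copied as-is from the "shui" example posted in this thread (toneless input-method spellings, not checked against a dictionary), and the names `PINYIN_INPUT`, `to_pinyin`, and `candidates` are made up for illustration.

```python
from collections import defaultdict

# Character -> Pinyin input string, taken verbatim from the thread's example.
PINYIN_INPUT = {
    "水": "shui",  # water
    "说": "shui",  # talk
    "谁": "shui",  # who
    "睡": "shui",  # sleep
    "税": "shui",  # tax
}


def to_pinyin(rendered):
    """Run on top of rendered Chinese text: one lookup per character."""
    return " ".join(PINYIN_INPUT.get(ch, ch) for ch in rendered)


def candidates(pinyin):
    """Going back is one-to-many: collect every character with that spelling."""
    inverse = defaultdict(list)
    for ch, spelling in PINYIN_INPUT.items():
        inverse[spelling].append(ch)
    return inverse[pinyin]


print(to_pinyin("水"))
print(candidates("shui"))
```

The forward lookup always yields one answer, while the reverse lookup returns a list that only a human or sentence context can narrow down, which is the popup choice an input system shows.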

So I still think that my last comment holds: it looks like the pinyin form should be on the lexeme representing the Chinese character, not on the statement offering a translation coming from the English (or any other) noun.

Also, I think that this discussion would probably make more sense on Wikidata itself. I wouldn't want the modelling of Wikidata to be affected by a possible future implementation of a project proposal. That seems premature :)

Feel free to close this if this satisfies your question.

