
taibun's Introduction


Andrei Harbachov


👋 Hello, everyone! I'm Andrei, a passionate Computer Science graduate with a specialisation in Artificial Intelligence. My daily routine revolves around experimentation, learning, and coding, fuelled by copious amounts of tea. Let's connect, collaborate, and embark on this thrilling AI adventure together! 🚀



  • 🎓 Graduated with a Bachelor's degree in Computer Science, specialising in Artificial Intelligence and Visual and Interactive Computing at Simon Fraser University.

  • 👨‍💻 Currently enhancing the representation of Taiwanese Hokkien by leveraging a Neural Machine Translator and developing Programming Toolboxes that simplify the creation of applications for the language.

  • 📚 Learning Natural Language Processing, Computer Vision, and Machine Learning.

  • ❤️ Passionate about exploring Computer Science, studying foreign languages, and coding.

  • 💬 Feel free to reach out on LinkedIn or by Mail.


Languages and Tools

Python Java C C++ Rust Haskell

PyTorch NumPy TensorFlow Keras MATLAB Pandas

HTML CSS JavaScript TypeScript Angular React

Git Unity C# Android Firebase MySQL

taibun's People

Contributors

andreihar

Forkers

zhengwayne

taibun's Issues

Test cases

Describe the feature
The current test cases are sufficient to check the correctness of the Tailo and Zhuyin systems, but they do not cover the other systems. Similar test cases should be added for the other transliteration systems.

Expected behavior
Test cases are added for POJ, TLPA, Pingyim, and Tongiong transliteration systems.

Relevant files

  • tests

Style Improvement

Is your feature request related to a problem? Please describe.
The code is cluttered and could be organised better.

What is the ' in the IPA conversion output?

This is not a bug report but just a question.
I noticed in the output of the IPA conversion from traditional Chinese,
some IPA symbols have a ' following them.
I tried to dig into the conversion script but couldn't figure out what it was.
Could you tell me what it represents?

Better zh_TW and zh_CN conversion

Thank you for making this! As a native Hokkien speaker I find it very professionally done.

However, when converting between zh_TW and zh_* (to_traditional & to_simplified), the context (the word and the sentence a character is in) should be considered. Simple char-to-char mapping can be problematic in some cases.

https://github.com/BYVoid/OpenCC This library seems to be better at handling the subtleties of conversion.
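
To illustrate why context matters, here is a minimal sketch of phrase-first conversion: known multi-character phrases are matched (longest first) before falling back to a per-character table. The tables `PHRASE_MAP` and `CHAR_MAP` and the function `to_traditional_sketch` are hypothetical and contain only a few illustrative entries; this is neither the library's implementation nor OpenCC's data or API.

```python
# Toy tables, for illustration only (assumed, not taken from any library).
# "发" is ambiguous: it becomes "發" (to develop) or "髮" (hair) depending on the word.
CHAR_MAP = {"头": "頭", "发": "發"}
PHRASE_MAP = {"头发": "頭髮", "理发": "理髮"}


def to_traditional_sketch(text: str) -> str:
    """Convert using longest phrase match first, then a char-by-char fallback."""
    max_len = max(len(phrase) for phrase in PHRASE_MAP)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i (length >= 2).
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in PHRASE_MAP:
                out.append(PHRASE_MAP[chunk])
                i += size
                break
        else:
            # No phrase matched: fall back to the single-character table.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional_sketch("头发"))  # phrase table picks 髮
print(to_traditional_sketch("发展"))  # char fallback picks 發
```

A pure char-to-char table would have to pick a single target for 发 and would get one of these two words wrong.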

Transliteration conversion

Describe the feature
Currently, it is possible to convert to different transliteration systems only by converting from Chinese characters. The library should add the possibility of converting from Tailo to other transliteration systems, bypassing Chinese characters.

Expected behavior
There exists a method/class that converts from Tailo to other transliteration systems.
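
As a rough sketch of what such a converter could look like, the snippet below rewrites lowercase, numeric-tone Tailo into POJ spelling using a handful of well-known correspondences (ts→ch, ua→oa, ue→oe, ing→eng, ik→ek, oo→o͘, final nn→ⁿ). The function names are hypothetical and not part of the library. It assumes lowercase input with tone digits, and it does not handle diacritic tone marks, capitalisation, or the nasal-plus-h finals.

```python
import re

O_DOT = "o\u0358"      # POJ "o͘" (o + combining dot above right)
NASAL = "\u207f"       # POJ superscript "ⁿ"


def _body_to_poj(body: str) -> str:
    """Convert one Tailo syllable body (letters only, no tone digit) to POJ."""
    body = body.replace("ts", "ch")      # ts -> ch, tsh -> chh
    body = body.replace("ua", "oa")      # also covers uai -> oai, uan -> oan
    body = body.replace("ue", "oe")
    if body.endswith("ing"):
        body = body[:-3] + "eng"
    elif body.endswith("ik"):
        body = body[:-2] + "ek"
    body = body.replace("oo", O_DOT)
    # Nasalisation: only a final "nn" that follows a vowel (so "nng" is untouched).
    body = re.sub(r"(?<=[aeiou])nn$", NASAL, body)
    return body


def tailo_to_poj(text: str) -> str:
    """Convert every lowercase numeric-tone syllable in `text`; keep everything else."""
    return re.sub(
        r"([a-z]+)(\d?)",
        lambda m: _body_to_poj(m.group(1)) + m.group(2),
        text,
    )


print(tailo_to_poj("tai5-uan5"))
print(tailo_to_poj("tshing1 sann1"))
```

Working on numeric tones sidesteps the question of where each system places its tone diacritic, which a full implementation would need to address separately.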

Dataset improvement

Describe the feature
The current dataset is missing roughly 10k words that were not processed from the Taiwanese-Chinese Online Dictionary, as well as a large part of the iTaigi Chinese-Taiwanese Comparison Dictionary. Additionally, many proper nouns in the dataset can be derived from other entries, and keeping them yields unsatisfactory tokenisation results (e.g. 中國文化大學 should be tokenised as 中國 文化 大學, not 中國文化大學).

Expected behavior
More words are added and unnecessary proper nouns whose romanisation can be derived using other entries in the data are removed.

Relevant files

  • data/words.json
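
One possible way to find such redundant entries is a word-break check: an entry is a removal candidate if it can be split entirely into other entries. The sketch below assumes the vocabulary is available as a plain set of words (the real layout of data/words.json is not assumed here) and uses the hypothetical helper `is_derivable`. It checks segmentation only; a real clean-up would also need to confirm that the combined romanisation of the pieces matches the entry's own.

```python
def is_derivable(word: str, vocab: set) -> bool:
    """Return True if `word` can be segmented entirely into OTHER vocabulary entries."""
    others = vocab - {word}
    n = len(word)
    # reachable[k] is True when word[:k] can be built from entries in `others`.
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in others
            for start in range(end)
        )
    return reachable[n]


# Tiny illustrative vocabulary (assumed, not the real dataset).
vocab = {"中國", "文化", "大學", "中國文化大學", "台灣"}
redundant = sorted(w for w in vocab if is_derivable(w, vocab))
print(redundant)
```

If the vocabulary contains single-character entries, almost every word becomes derivable, so a practical version would probably restrict the pieces to multi-character entries or to proper nouns only.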

Proper Sandhi rules

Describe the feature
Currently, the sandhi flag applies sandhi rules locally within a word, i.e. it changes the tone of every syllable in the word except for the first one. This is different from real sandhi rules, where changes are applied to every single syllable of the sentence, not just within single words. The library's sandhi rules should be modified to reflect the actual sandhi rules of the Taiwanese language.

Expected behavior
Properly applies sandhi rules to words depending on their position in the sentence.

Relevant methods

  • __tone_sandhi

Additional context
Taiwanese Sandhi rules
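
For reference, a minimal sketch of the commonly described tone sandhi pattern, in which every syllable of a tone group except the last changes tone (1→7, 7→3, 3→2, 2→1, 5→7 or 5→3 depending on dialect, 4→8 or 4→2, 8→4 or 8→3 depending on the final stop). It works on numeric-tone Tailo syllables and treats the whole input list as one tone group, which is a simplifying assumption; finding real tone group boundaries depends on syntax and is not attempted. The function names are hypothetical and unrelated to the library's __tone_sandhi.

```python
# Non-checked tones; tone 5 is handled separately because it varies by dialect.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1"}


def sandhi_syllable(syllable: str, tone5_to: str = "7") -> str:
    """Return the sandhi form of one numeric-tone syllable such as 'tai5'."""
    base, tone = syllable[:-1], syllable[-1]
    if tone == "5":
        return base + tone5_to               # "7" or "3" depending on dialect
    if tone == "4":                          # checked tone: -h vs -p/-t/-k
        return base + ("2" if base.endswith("h") else "8")
    if tone == "8":
        return base + ("3" if base.endswith("h") else "4")
    return base + SANDHI.get(tone, tone)


def apply_sandhi(tone_group: list, tone5_to: str = "7") -> list:
    """Change every syllable in the group except the last one."""
    changed = [sandhi_syllable(s, tone5_to) for s in tone_group[:-1]]
    return changed + tone_group[-1:]


print(apply_sandhi(["tai5", "uan5", "lang5"]))
```

The sketch keeps the final -h in the spelling even though it is typically dropped in pronunciation after sandhi.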

Incorrect Pingyim

Describe the bug
The converter throws an error when converting words with "m" as the nucleus (e.g. 毋). The transliteration of initial "n-" is incorrect (e.g. 耐 as nâi instead of lnâi). The conversion of final "-t", "-k", "-p" is incorrect (converts to "-d", "-g", "-b").

To Reproduce
Steps to reproduce the behavior:

  1. Initialise Converter with 'Pingyim' as the system
  2. Prompt '毋'
  3. See error
  4. Prompt '耐'
  5. See incorrect transliteration of initial 'n-'
  6. Prompt '法', '色', '硦'
  7. See incorrect transliteration of final '-t', '-k', '-p'

Expected behavior
Initial "n-" should convert to "ln-", final "-t", "-k", "-p" should convert to same values instead of "-d", "-g", "-b". Conversion of characters with "m" as the nucleus shouldn't throw an error.

Efficient Zhuyin conversion

Describe the feature
Conversion to Zhuyin currently requires an 800-word-long CSV file listing all possible syllables. A much smaller conversion table is preferred, so that it can be moved into a dictionary within the code.

Expected behavior
A Converter with the Zhuyin system produces the same results as it does currently, but the chart within the zhuyin.json file is reduced in size and moved into an in-code Python dictionary.

Relevant methods

  • __tailo_to_zhuyin
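
One way the table could shrink is to store initials and finals separately and compose each syllable from the two, instead of listing every full syllable. The sketch below shows the idea with a deliberately tiny, partial table (only a few initials and finals whose symbols match standard Bopomofo, tones omitted). The names `INITIALS`, `FINALS`, and `tailo_to_zhuyin_sketch` are hypothetical and this is not the library's __tailo_to_zhuyin.

```python
# Partial illustrative tables (assumed subset); a real table needs all initials,
# finals, and tone marks of the target Zhuyin scheme.
INITIALS = {"p": "ㄅ", "ph": "ㄆ", "t": "ㄉ", "th": "ㄊ", "k": "ㄍ", "kh": "ㄎ"}
FINALS = {"a": "ㄚ", "i": "ㄧ", "u": "ㄨ", "ai": "ㄞ", "au": "ㄠ"}


def split_syllable(syllable: str) -> tuple:
    """Split a toneless Tailo syllable into (initial, final), longest initial first."""
    for initial in sorted(INITIALS, key=len, reverse=True):
        rest = syllable[len(initial):]
        if syllable.startswith(initial) and rest in FINALS:
            return initial, rest
    if syllable in FINALS:               # zero-initial syllable
        return "", syllable
    raise ValueError(f"syllable not covered by the sketch tables: {syllable!r}")


def tailo_to_zhuyin_sketch(syllable: str) -> str:
    """Compose the Zhuyin spelling from the separate initial and final tables."""
    initial, final = split_syllable(syllable)
    return INITIALS.get(initial, "") + FINALS[final]


print(tailo_to_zhuyin_sketch("phau"))
print(tailo_to_zhuyin_sketch("kai"))
```

With this layout the table grows with the number of initials plus the number of finals rather than with their product.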
