
Unicode to ASCII transliteration - C Elixir Go Java JS Julia PHP Python Ruby Rust Shell .NET

Home Page: https://anyascii.com

License: ISC License


AnyAscii

Unicode to ASCII transliteration

Description

Converts Unicode characters to their best ASCII representation

AnyAscii provides ASCII-only replacement strings for practically all Unicode characters. Text is converted character-by-character without considering the context. The mappings for each script are based on popular existing romanization systems. Symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters and some known characters are replaced with an empty string, which removes them from the output.
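The behavior described above can be sketched in a few lines of Python. This is only an illustration of the character-by-character idea, not AnyAscii's implementation: `TOY_TABLE` and `toy_transliterate` are hypothetical names, and the table holds just a handful of entries taken from the examples in this README.

```python
# Hypothetical miniature mapping table for illustration only;
# the real library ships far larger data files.
TOY_TABLE = {
    "é": "e",
    "ö": "o",
    "ß": "ss",
    "æ": "ae",
    "深": "Shen",
    "圳": "Zhen",
}


def toy_transliterate(text: str) -> str:
    """Convert text one character at a time, ignoring context."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII input is passed through unchanged.
            out.append(ch)
        else:
            # Characters missing from the table become an empty string.
            out.append(TOY_TABLE.get(ch, ""))
    return "".join(out)


print(toy_transliterate("Blöße"))  # Blosse
print(toy_transliterate("深圳"))    # ShenZhen
print(toy_transliterate("a→b"))    # ab
```

Because each character is looked up on its own, the result for a string is simply the concatenation of the results for its characters.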

Examples

Representative examples for different languages comparing the AnyAscii output to the conventional romanization:

Language (Script) | Input | Output | Conventional
French (Latin) | René François Lacôte | Rene Francois Lacote | Rene Francois Lacote
German (Latin) | Blöße | Blosse | Bloesse
Vietnamese (Latin) | Trần Hưng Đạo | Tran Hung Dao | Tran Hung Dao
Norwegian (Latin) | Nærøy | Naeroy | Naroy
Ancient Greek (Greek) | Φειδιππίδης | Feidippidis | Pheidippides
Modern Greek (Greek) | Δημήτρης Φωτόπουλος | Dimitris Fotopoylos | Dimitris Fotopoulos
Russian (Cyrillic) | Борис Николаевич Ельцин | Boris Nikolaevich El'tsin | Boris Nikolayevich Yeltsin
Ukrainian (Cyrillic) | Володимир Горбулін | Volodimir Gorbulin | Volodymyr Horbulin
Bulgarian (Cyrillic) | Търговище | T'rgovishche | Targovishte
Mandarin Chinese (Han) | 深圳 | ShenZhen | Shenzhen
Cantonese Chinese (Han) | 深水埗 | ShenShuiBu | Sham Shui Po
Korean (Hangul) | 화성시 | HwaSeongSi | Hwaseong-si
Korean (Han) | 華城市 | HuaChengShi | Hwaseong-si
Japanese (Hiragana) | さいたま | saitama | Saitama
Japanese (Han) | 埼玉県 | QiYuXian | Saitama-ken
Amharic (Ethiopic) | ደብረ ዘይት | debre zeyt | Debre Zeyit
Tigrinya (Ethiopic) | ደቀምሓረ | dek'emhare | Dekemhare
Arabic | دمنهور | dmnhwr | Damanhur
Armenian | Աբովյան | Abovyan | Abovyan
Georgian | სამტრედია | samt'redia | Samtredia
Hebrew | אברהם הלוי פרנקל | 'vrhm hlvy frnkl | Abraham Halevi Fraenkel
Unified English Braille (Braille) | ⠠⠎⠁⠽⠀⠭⠀⠁⠛ | +say x ag | Say it again
Bengali | ময়মনসিংহ | mymnsimh | Mymensingh
Burmese (Myanmar) | ထန်တလန် | thntln | Thantlang
Gujarati | પોરબંદર | porbmdr | Porbandar
Hindi (Devanagari) | महासमुंद | mhasmumd | Mahasamund
Kannada | ಬೆಂಗಳೂರು | bemgluru | Bengaluru
Khmer | សៀមរាប | siemrab | Siem Reap
Lao | ສະຫວັນນະເຂດ | sahvannaekhd | Savannakhet
Malayalam | കളമശ്ശേരി | klmsseri | Kalamassery
Odia | ଗଜପତି | gjpti | Gajapati
Punjabi (Gurmukhi) | ਜਲੰਧਰ | jlmdhr | Jalandhar
Sinhala | රත්නපුර | rtnpur | Ratnapura
Tamil | கன்னியாகுமரி | knniyakumri | Kanniyakumari
Telugu | శ్రీకాకుళం | srikakulm | Srikakulam
Thai | สงขลา | sngkhla | Songkhla

Symbols | Input | Output
Emojis | 👑 🌴 | :crown: :palm_tree:
Misc. | ☆ ♯ ♰ ⚄ ⛌ | * # + 5 X
Letterlike | № ℳ ⅋ ⅍ | No M & A/S

Implementations

AnyAscii is implemented across multiple programming languages with the same behavior and versioning

C

https://raw.githubusercontent.com/anyascii/anyascii/master/impl/c/anyascii.h https://raw.githubusercontent.com/anyascii/anyascii/master/impl/c/anyascii.c

Elixir

https://hex.pm/packages/any_ascii

iex> AnyAscii.transliterate("άνθρωποι") |> IO.iodata_to_binary()
"anthropoi"

Go

https://pkg.go.dev/github.com/anyascii/go

import "github.com/anyascii/go"

s := anyascii.Transliterate("άνθρωποι")
// anthropoi

Go 1.10+ compatible

Java

https://mvnrepository.com/artifact/com.anyascii/anyascii

String s = AnyAscii.transliterate("άνθρωποι");
assert s.equals("anthropoi");

Java 6+ compatible

<dependency>
    <groupId>com.anyascii</groupId>
    <artifactId>anyascii</artifactId>
    <version>LATEST</version>
</dependency>

JavaScript

https://npmjs.com/package/any-ascii

import anyAscii from 'any-ascii';

const s = anyAscii('άνθρωποι');
// anthropoi

npm install any-ascii

Julia

https://juliahub.com/ui/Packages/AnyAscii/wYZIV

julia> using AnyAscii
julia> anyascii("άνθρωποι")
"anthropoi"

Julia 1.0+ compatible

pkg> add AnyAscii

PHP

https://packagist.org/packages/anyascii/anyascii

$s = AnyAscii::transliterate('άνθρωποι');
// anthropoi

PHP 5.3+ compatible

composer require anyascii/anyascii

Python

https://pypi.org/project/anyascii

from anyascii import anyascii

s = anyascii('άνθρωποι')
assert s == 'anthropoi'

Python 3.3+ compatible

pip install anyascii

Ruby

https://rubygems.org/gems/any_ascii

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

gem install any_ascii

Rust

https://crates.io/crates/any_ascii

use any_ascii::any_ascii;

let s = any_ascii("άνθρωποι");
// anthropoi

Rust 1.42+ compatible

cargo add any_ascii

Install executable: cargo install any_ascii

$ anyascii άνθρωποι
anthropoi

$ echo άνθρωποι | anyascii
anthropoi

Shell

https://raw.githubusercontent.com/anyascii/anyascii/master/impl/sh/anyascii

$ anyascii άνθρωποι
anthropoi

$ echo άνθρωποι | anyascii
anthropoi

POSIX-compliant

.NET

https://nuget.org/packages/AnyAscii

// C#
using AnyAscii;

string s = "άνθρωποι".Transliterate();
// anthropoi

.NET Core 3.0+ and .NET 5.0+ compatible

Background

Unicode is the universal character encoding. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols. [Unicode's scope] covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuations, and many other characters used in writing text. *

Unicode provides a unique numeric value for each character and uses UTF-8 to encode sequences of characters into bytes. UTF-8 uses a variable number of bytes for each character and is backwards compatible with ASCII. UTF-16 and UTF-32 are also specified but not common. There is a name and various properties for each character along with algorithms for casing, collation, equivalence, line breaking, segmentation, text direction, and more.
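As a small illustration of the variable-length encoding described above, the following Python sketch prints how many bytes UTF-8 uses for a few characters and checks that pure ASCII text encodes to the same bytes either way. The helper name `utf8_len` is made up for this example.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given character."""
    return len(ch.encode("utf-8"))


# Code points written as escapes: A, é, 深, 👑
for ch in ["A", "\u00e9", "\u6df1", "\U0001F451"]:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s) in UTF-8")

# ASCII text produces identical bytes under both encodings,
# which is what backwards compatibility means here.
print("abc".encode("utf-8") == "abc".encode("ascii"))
```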

ASCII is the lowest common denominator character encoding, established in 1967 and using 7 bits for 128 characters. The printable characters are English letters, digits, and punctuation, with the remaining being control characters. The characters found on a standard US keyboard are from ASCII. Most legacy 8-bit encodings were backwards compatible with ASCII.
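The 7-bit layout can be checked directly. The sketch below splits the 128 code points into printable and control characters; it counts the space character (0x20) as printable, which is a common convention but an assumption of this example.

```python
ASCII_SIZE = 2 ** 7  # 7 bits give 128 code points

# 0x20 (space) through 0x7E (~) are the printable characters.
printable = [chr(i) for i in range(ASCII_SIZE) if 0x20 <= i <= 0x7E]
# 0x00-0x1F and 0x7F (DEL) are control characters.
control = [chr(i) for i in range(ASCII_SIZE) if i < 0x20 or i == 0x7F]

print(ASCII_SIZE, len(printable), len(control))
print("".join(printable))
```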

... expressed only in the original non-control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control *

A language is written using characters from a script. Some languages use multiple scripts and some scripts are used by multiple languages. English uses the Latin script which is based on the alphabet the Romans used for writing Latin. Other languages using the Latin script may require additional letters and diacritics.

The Unicode Standard encodes scripts rather than languages. When writing systems for more than one language share sets of graphical symbols that have historically related derivations, the union of all of those graphical symbols ... is identified as a single script. *

When converting text between languages there are multiple properties that can be preserved:

Original | Transliteration (Spelling) | Transcription (Sound) | Translation (Meaning)
ευαγγέλιο | euaggelio | evangelio | gospel

Romanization is the conversion into the Latin script using transliteration and transcription. It is most commonly used when representing the names of people and places. Some nations have an official romanization standard for their language. Several organizations publish romanization standards for multiple languages.

Geographical names are Romanized to help foreigners find the place they intend to go to and help them remember cities, villages and mountains they visited and climbed. But it is Koreans who make up the Roman transcription of their proper names to print on their business cards and draw up maps for international tourists. Sometimes, they write the lyrics of a Korean song in Roman letters to help foreigners join in a singing session or write part of a public address (in Korean) in Roman letters for a visiting foreign VIP. In this sense, it is for both foreigners and the local public. The Romanization system must not be a code only for the native English-speaking community here but an important tool for international communication between Korean society, foreign residents in the country and the entire external world. *

Stats

Supports Unicode 15.0 (2022). Covers 100k of the 149k total Unicode characters, missing 47k very rare CJK characters and 2k other rare characters.

Bundled data files total 200-500 KB depending on the implementation

Unidecode

AnyAscii is an alternative to (and inspired by) Unidecode and its many ports. Unidecode only supports a subset of the Basic Multilingual Plane. AnyAscii gives better results, supports more than twice as many characters, and often has a smaller file size. To compare the mappings see table.tsv and unidecode/unidecode.tsv.
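For context, the Basic Multilingual Plane is the range U+0000 to U+FFFF. The short Python sketch below checks which of a few characters from the examples in this README fall inside it; `in_bmp` is a hypothetical helper written for this illustration and is not part of either library.

```python
BMP_LAST = 0xFFFF  # last code point of the Basic Multilingual Plane (plane 0)


def in_bmp(ch: str) -> bool:
    """Return True if the single character ch lies in the BMP."""
    return ord(ch) <= BMP_LAST


# Code points written as escapes: Φ, 深, 👑, 🌴
for ch in ["\u03a6", "\u6df1", "\U0001F451", "\U0001F334"]:
    where = "BMP" if in_bmp(ch) else "outside BMP"
    print(f"U+{ord(ch):04X}: {where}")
```

The two emoji lie outside the BMP, so a tool limited to that plane cannot have mappings for them.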

Sources

ALA-LC, BGN/PCGN, Discord, ISO, KNAB, NFKD, UNGEGN, Unihan, national standards, and more

anyascii's Issues

Use CommonJS instead of ES Modules?

Thank you for this awesome library.

A question: ES modules are great but still not exactly widely supported. They cause all kinds of issues with node and related tooling, such as Jest and TypeScript. Would it be possible to make the block.js import in the JS version use a require() instead?

Exception logic?

Would it be possible to make a way to tell anyascii to transliterate all characters except certain specified ones? I'm writing output to a display that needs most Unicode characters transliterated, but can handle certain ones, like the ° symbol.
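One way this could be handled on the caller's side, without any change to the library, is a small wrapper. The sketch below is an assumption about how such a wrapper might look, not an existing AnyAscii feature: `transliterate_except` is a hypothetical helper, and `fake_transliterate` is a stand-in that simply drops non-ASCII characters so the example runs without the package installed. It relies on the conversion being character-by-character, as stated in the Description.

```python
from typing import Callable, Iterable


def transliterate_except(
    text: str,
    keep: Iterable[str],
    transliterate: Callable[[str], str],
) -> str:
    """Transliterate every character except those listed in keep."""
    keep_set = set(keep)
    return "".join(
        ch if ch in keep_set else transliterate(ch)
        for ch in text
    )


def fake_transliterate(s: str) -> str:
    """Stand-in for demonstration: drops anything that is not ASCII."""
    return s.encode("ascii", "ignore").decode("ascii")


# The degree sign is kept, the accented letter goes through the stand-in.
print(transliterate_except("25\u00b0C caf\u00e9", "\u00b0", fake_transliterate))
```

With the Python package installed, the `anyascii` function shown in the Python section could be passed as the `transliterate` argument instead of the stand-in.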

.NET - Potentially large memory allocation at each call

The code contains the following:

private static ReadOnlySpan<byte> Bank => new byte[64436]
{
    83, 104, 99, 104, 39, 101, 117, 101, 117, 101,
    // .. 62 Kb more data
};

Probably the expectation here was that the array would be created once. But in fact it will be created every time the property is accessed, because the expression-bodied property is just syntactic sugar for get { return new ... }

To avoid potential memory problems you should replace this with:

private static ReadOnlyMemory<byte> Bank { get; } = new byte[]
{
...

German replacement is weird

As a native German with an Umlaut in my name, I'm really surprised by the choice to, as shown in the README, replace an Umlaut with the same base character. The "conventional" replacement is indeed a convention dating back to typewriters which did not originally contain Umlaut characters, so all Germans will recognize it as having the same meaning. Simply dropping the dots, however, changes pronunciation and potentially meaning.

Take the German word for bear, "Bär" as an example. I'm not saying it's the best example, but it's the first that comes to mind. Capitalization makes it somewhat distinguishable from "bar" (free from/of), except at the beginning of a sentence: here, the words are indistinguishable in your replacement scheme, and would be read as the latter word, but then conventional "Baer" would still retain its original sense.

I don't know what reasoning led to adopting this behaviour, but from a German point of view, it seems quite weird.

I also understand that from a non-German point of view, "Baer" is very difficult to know how to pronounce - but the spelling of "Bar" wouldn't result in an understandable pronunciation, either. My own name trips people up because it ends up with three vowels in a row. I do get the issue here, I just don't see how the adopted solution helps.

Hope that makes sense, and happy to discuss.

Python package missing py.typed marker

Currently, running mypy against a library that uses anyascii results in the following error:

$ mypy mypackage
mypackage/mymodule.py:6: error: Skipping analyzing "anyascii": module is installed, but missing library stubs or py.typed marker  [import]
mypackage/mymodule.py:6: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

As noted in PEP 561 (https://peps.python.org/pep-0561/#packaging-type-information) and the mypy documentation linked above, the presence of a py.typed file is necessary for mypy to use the type annotations that are already present in anyascii.

Targeting Framework not Supported

Hello,

We are trying to add the NuGet package AnyAscii to our C# .NET Standard 2.1 class library.

We noticed that the package wasn't compatible even though we are able to clone your repo and take a dependency on the any-ascii csproj inside the repository.

Option for replacement char

Instead of removing characters that can't be transliterated, it'd be nice to have an option to replace them with a character.

For some languages (like Python) this could be added as a new argument with a default value, like replace="". For others (like Go) this would have to be a new function.
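A rough sketch of the idea as a caller-side wrapper is shown below. It is not an existing AnyAscii option: `transliterate_with_replacement` and `toy` are hypothetical names, and `toy` is a stand-in transliterator with a single mapping so the example runs on its own. One caveat is that the Description says some known characters also map to an empty string, and a wrapper like this cannot tell those apart from unknown characters.

```python
from typing import Callable


def transliterate_with_replacement(
    text: str,
    transliterate: Callable[[str], str],
    replace: str = "",
) -> str:
    """Substitute replace wherever a character transliterates to nothing."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII is left unchanged, so no lookup is needed.
            out.append(ch)
            continue
        converted = transliterate(ch)
        out.append(converted if converted else replace)
    return "".join(out)


def toy(ch: str) -> str:
    """Stand-in transliterator that knows a single mapping."""
    return {"\u00e9": "e"}.get(ch, "")


print(transliterate_with_replacement("caf\u00e9 \u2603", toy, replace="?"))  # cafe ?
print(transliterate_with_replacement("caf\u00e9 \u2603", toy))               # cafe 
```

With the default `replace=""` the wrapper behaves like plain transliteration, which matches the default value proposed above.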

Deprecation warning in Python 3.11


  /Users/xxx/Workspace/Backend/venv/lib/python3.11/site-packages/anyascii/__init__.py:29: DeprecationWarning: read_binary is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
    b = read_binary('anyascii._data', '%03x' % blocknum)

read_binary will eventually be removed in a later version of Python.

They recommend using .open('rb')

https://docs.python.org/3/library/importlib.resources.html#importlib.resources.read_binary
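A minimal sketch of the `files()` style of access follows. It uses `read_bytes()` on the returned object rather than `.open('rb')`, and it needs Python 3.9 or newer, so a library that supports older versions would need a fallback. `read_resource` is a hypothetical helper, and the standard library package `json` is used as the target only so the example runs without anyascii installed.

```python
from importlib.resources import files  # available in Python 3.9+


def read_resource(package: str, name: str) -> bytes:
    """Read a file bundled inside an importable package as bytes."""
    return files(package).joinpath(name).read_bytes()


# Demonstration against a standard library package.
data = read_resource("json", "__init__.py")
print(type(data).__name__, len(data) > 0)
```

Applied to the line from the warning, the call would presumably become `read_resource('anyascii._data', '%03x' % blocknum)`, assuming the data package layout stays the same.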
