Diacritics.NET

Diacritics are used across many languages to change the sound values of the letters to which they are added. In software development, diacritics often have to be replaced with non-diacritic characters, e.g. to make user input easier to search and compare. Diacritics.NET is a basic mapper between diacritic characters and non-diacritic characters.

Download and Install Diacritics

This library is available on NuGet: https://www.nuget.org/packages/Diacritics/

Use the following command to install Diacritics from the NuGet Package Manager Console:

PM> Install-Package Diacritics

You can use this library in any .NET project that is compatible with PCL (e.g. Xamarin Android, iOS, Windows Phone, Windows Store, Universal Apps, etc.).

API Usage

Replace diacritic characters

The most common use case of this library is to find and replace diacritic characters in a given string. RemoveDiacritics is a string extension method which returns a diacritics-free string.

// Arrange
const string InputString = "Je veux aller à Saint-Étienne";

// Act
string removeDiacritics = InputString.RemoveDiacritics();

// Assert
removeDiacritics.Should().Be("Je veux aller a Saint-Etienne");

Find diacritic characters

If you only want to check whether a string contains diacritics, without replacing them, use the string extension method HasDiacritics.

// Arrange
const string InputString = "Je veux aller à Saint-Étienne";

// Act
bool hasDiacritics = InputString.HasDiacritics();

// Assert
hasDiacritics.Should().BeTrue();

Using Diacritics with IoC

The examples shown above use extension methods, which rely on a default implementation of IDiacriticsMapper, namely DefaultDiacriticsMapper. If you're using an IoC container, you can register IDiacriticsMapper either with the provided DefaultDiacriticsMapper or with your own implementation of IDiacriticsMapper.
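As a rough sketch of the registration described above, the following uses Microsoft.Extensions.DependencyInjection as the container. That choice is an assumption; any IoC container should work the same way. The CityNameNormalizer class is hypothetical and exists only to show constructor injection. The sketch also assumes that DefaultDiacriticsMapper has a public parameterless constructor and that IDiacriticsMapper exposes RemoveDiacritics(string).

```csharp
using System;
using Diacritics;
using Microsoft.Extensions.DependencyInjection;

// Register the default mapper behind its interface.
// Assumption: DefaultDiacriticsMapper has a public parameterless constructor.
var services = new ServiceCollection();
services.AddSingleton<IDiacriticsMapper>(_ => new DefaultDiacriticsMapper());
services.AddTransient<CityNameNormalizer>();

using var provider = services.BuildServiceProvider();
var normalizer = provider.GetRequiredService<CityNameNormalizer>();

Console.WriteLine(normalizer.Normalize("Je veux aller à Saint-Étienne"));

// Hypothetical consumer that receives the mapper through constructor injection.
public sealed class CityNameNormalizer
{
    private readonly IDiacriticsMapper mapper;

    public CityNameNormalizer(IDiacriticsMapper mapper)
    {
        this.mapper = mapper;
    }

    // Assumption: IDiacriticsMapper exposes RemoveDiacritics(string).
    public string Normalize(string input) => this.mapper.RemoveDiacritics(input);
}
```

Because the extension methods use the same default mapper, the injected instance should produce the same result as the RemoveDiacritics example above. To use your own implementation, change only the registration line.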

Add custom diacritics mappings

Diacritics is extensible. You can add support for your own language by implementing IAccentMapping (or by deriving from the AccentMapping base class). DiacriticsMapper accepts any IAccentMapping type at construction time. Contributions to this library are very welcome: create a fork, commit your changes, and open a pull request.
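A custom mapping could look roughly like the sketch below. It assumes that IAccentMapping consists of a single Map property of type IDictionary<char, string> and that it lives in the Diacritics.AccentMappings namespace. Both details may differ between library versions, so check the interface in the version you have installed. GermanTransliterationMapping is a hypothetical name used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Diacritics;
using Diacritics.AccentMappings;

// Pass the custom mapping to DiacriticsMapper at construction time.
var mapper = new DiacriticsMapper(new GermanTransliterationMapping());
string result = mapper.RemoveDiacritics("Grüße");
Console.WriteLine(result);

// Hypothetical mapping that transliterates German umlauts to two letters.
// Assumption: IAccentMapping only requires a Map property of this type.
public sealed class GermanTransliterationMapping : IAccentMapping
{
    public IDictionary<char, string> Map { get; } = new Dictionary<char, string>
    {
        { 'Ä', "Ae" },
        { 'Ö', "Oe" },
        { 'Ü', "Ue" },
        { 'ä', "ae" },
        { 'ö', "oe" },
        { 'ü', "ue" },
        { 'ß', "ss" },
    };
}
```

Several mappings can be combined by passing more than one instance to the DiacriticsMapper constructor.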

TODO: Add/Remove methods for adding/removing accents at runtime.

Benchmark Tests

Tested Version
https://www.nuget.org/packages/Diacritics/2.1.19291.8-pre

Benchmark Environment
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.885 (1803/April2018Update/Redstone4)
Intel Core i7-7600U CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
Frequency=2835933 Hz, Resolution=352.6176 ns, Timer=TSC
.NET Core SDK=3.0.100
  [Host]   : .NET Core 2.2.4 (CoreCLR 4.6.27521.02, CoreFX 4.6.27521.01), 64bit RyuJIT
  ShortRun : .NET Core 2.2.4 (CoreCLR 4.6.27521.02, CoreFX 4.6.27521.01), 64bit RyuJIT

Job=ShortRun IterationCount=3 LaunchCount=1 WarmupCount=3

Benchmark Results

| Method                                  |        Mean |       Error |    StdDev |
|-----------------------------------------|------------:|------------:|----------:|
| RemoveDiacritics (9 latin chars)        |    230.5 ns |    476.2 ns |  26.10 ns |
| RemoveDiacritics (23 diacritic chars)   |    651.5 ns |    843.4 ns |  46.23 ns |
| RemoveDiacritics (408 latin chars)      |  8,697.1 ns |  9,938.1 ns | 544.74 ns |
| RemoveDiacritics (729 diacritic chars)  | 15,045.0 ns | 12,893.0 ns | 706.71 ns |

Legend
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Rank : Relative position of current benchmark mean among all benchmarks (Arabic style)
1 ns : 1 Nanosecond (0.000000001 sec)

License

This project is Copyright © 2019 Thomas Galliker. Free for non-commercial use. For commercial use please contact the author.

diacritics.net's People

Contributors

darthramone, jerry2007, julien-vandenbussche, ltduy, monoman, much4cho, ryanoneill1970, steevequadra, thomasgalliker, zolrath

diacritics.net's Issues

FrenchAccentsMapping: œ should map to oe, not o

Hi,

In French, we have some special words like "œuf" (egg). For these, RemoveDiacritics should return "oeuf" and not "ouf".
I tracked the mapping down to the file FrenchAccentsMapping. On line 22, { 'œ', "o" } should be replaced by { 'œ', "oe" }.

Would it be possible to change this?
To make it simple, I submitted a pull request.
Regards
Steeve

Add a LICENSE.MD file

The NuGet license info says Apache 2.0, while README.MD doesn't specify a concrete license (apart from a non-commercial clause, even though Apache 2.0 allows commercial use).

A LICENSE.MD or LICENSE.TXT file with a proper license would make this project useful for more people.

Lower-case Latin variants hide upper diacritics

When a character has a lower-case Latin equivalent, the diacritic is not correctly removed.
Take for example the Turkish word "İngiltere" (England). When RemoveDiacritics is invoked, the input is converted to lowercase before IndexOfAny is called. At this point the input has become "ingiltere", so the İ diacritic is not detected and the original string is returned.

License

What does "commercial use" mean?
"For commercial use please contact the author."
Would developing an information system count as commercial use?

Add overloads to RemoveDiacritics and HasDiacritics extension methods to pass an IDiacriticsMapper

If you want to use the extension methods of the library, you have to register a global default diacritics mapper.
This is not very pure, and it does not allow different mappings for different strings without switching the global mapper.

StaticDiacritics.SetDefaultMapper(() =>
    new DiacriticsMapper(
        new MyGermanAccentMapping(),
        new GermanAccentsMapping(),
        new ItalianAccentsMapping(),
        new ArabicAccentsMapping()
    )
);

"Thöni".RemoveDiacritics() // "Thoeni"

The current "pure" approach is to instantiate a DiacriticsMapper with accent mappings and use the methods from this instance.

var myMapper = new DiacriticsMapper(
    new MyGermanAccentMapping(),
    new GermanAccentsMapping(),
    new ItalianAccentsMapping(),
    new ArabicAccentsMapping());

myMapper.RemoveDiacritics("Thöni") // "Thoeni"

This is fine.
But it would be convenient to have an overload for the extension methods where the mapper (or single accent mappings) could be passed:

"Thöni".RemoveDiacritics(myMapper) // "Thoeni"

or simply as a params array:

"Thöni".RemoveDiacritics(new MyGermanAccentMapping()) // "Thoeni"

Add .NET Standard support

When installing the NuGet package in a .NET Core project, I see warnings at compile time: "Package 'Diacritics 1.0.6' was restored using '.NETFramework,Version=v4.6.1' instead of the project target framework '.NETStandard,Version=v2.0'. This package may not be fully compatible with your project."

It would be nice if it could be compiled under .NET Standard.

ñ => n

Spanish letter n with tilde

ß-> ss

Could you please change the code so that ß is translated to ss?

ơ => o

Hi.

Could you please add this mapping for the Vietnamese letter "ơ"?

Thanks.

German umlaut mapping not correct

I came across this project while searching for an umlaut replacement library. I like the approach to solving the general problem. Unfortunately, I found that something is wrong with the German mapping.

It is correct that the character ß is replaced by ss. The umlauts, however, are not handled correctly. In German, Ä is replaced by Ae, Ö by Oe, and Ü by Ue. The same applies to the lower-case letters, of course. This library replaces each umlaut with only one letter instead of two, which is not correct.

Umlaut   Replacement
Ä        Ae
Ü        Ue
Ö        Oe
ä        ae
ü        ue
ö        oe

https://github.com/thomasgalliker/Diacritics.NET/blob/develop/Diacritics/AccentMappings/GermanAccentsMapping.cs#L10.L12

What about œ (e.g. in "cœur")

Hi Thomas, while looking for a solution to normalize diacritics and other digrams, I came across your implementation. I like the way you separated every language into its own set of rules.

However, I don't feel comfortable using it, since it produces unexpected conversions. Say you feed it with "cœur" in French. You'd expect to get "coeur" as an output, but since you map "œ" to "o" you finally get "cour" instead.

Same thing for German words, where "Grüße" might be more appropriately mapped to "Gruesse" (i.e. map ü → ue and ß → ss).

Can you explain why you chose your approach of a one-to-one mapping?

Release Note Availability

Thank you for your work on this project! My team is working on a project that uses your library, and we are looking at updating from 2.0.19240.3 to 3.3.18. Do you have release notes available anywhere that we could review for possible breaking changes? I didn't see any release information on GitHub that I could read through.
