
ogonek's Introduction

Ogonek

Ogonek is mostly the result of me playing around with Unicode. Currently the library is still in alpha stages, so I don't recommend using it for anything serious, mainly because not all APIs are stabilised. You are welcome to play around with it for any non-serious purposes, though.

Status

The latest version, 0.5.0, implements most of the important stuff. The next version, 0.6.0, will be a thorough refactoring of the code in order to enable more type-safety and faster development.

Setup

Ogonek is mostly header-only. The only part that needs compilation is the data in the Unicode Character Database. Currently I am translating the database to C++ source as static initializers, but I may change this in the future.

You can compile that data by running scons dist from the command line (requires SCons to be installed, for obvious reasons). This will create a zip file in the dist/ directory with both the headers and the library files necessary to use ogonek. By default, a static library is built. To build a shared library (DLL), add lib=shared to the command line when building.


ogonek's Issues

Text insertion

Implement functionality to insert text into the middle of a text instance.

Normalization queries

Implement functionality to test whether a string is in a given normalization form; see the sketch after this list.

  • is_nfd
  • is_nfc
  • is_nfkd
  • is_nfkc
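
A minimal sketch of what these queries might look like, with std::u32string standing in for any code point range; none of these signatures exist yet and are only illustrative:

    // Hypothetical declarations for the planned normalization queries.
    #include <string>

    namespace ogonek {
        // Each query tests whether the sequence is already in the
        // corresponding normalization form, without normalizing it.
        bool is_nfd(std::u32string const& s);
        bool is_nfc(std::u32string const& s);
        bool is_nfkd(std::u32string const& s);
        bool is_nfkc(std::u32string const& s);
    }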

Text erasure

Implement functionality to remove text from the middle of a text instance.

ICU interoperation

icu::UnicodeString should be supported as a UnicodeSequence, allowing its use with all ogonek functionality.
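
A hedged sketch of what this interoperation could look like in practice, with is_nfc standing in for any ogonek algorithm that accepts a UnicodeSequence; its signature here is an assumption:

    #include <unicode/unistr.h>   // icu::UnicodeString

    namespace ogonek {
        // Stand-in for any ogonek algorithm taking a UnicodeSequence.
        template <typename UnicodeSequence>
        bool is_nfc(UnicodeSequence const& sequence);
    }

    void example() {
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("ogonek");
        bool normalized = ogonek::is_nfc(s);   // UnicodeString used directly
        (void) normalized;
    }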

End-user property queries

Implement queries useful to end users that are not present in the UCD, like "is functionally lowercase?". Additionally, bring all end-user-useful direct queries from namespace ogonek::ucd into namespace ogonek. A rough sketch follows the list below.

  • Name
  • Numeric/digit/decimal value
  • Casing properties
  • Classification properties
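
As a rough illustration, the queries brought into namespace ogonek might look like the following; the names and signatures are assumptions, not the final API:

    #include <string>

    namespace ogonek {
        std::string name(char32_t u);        // character name from the UCD
        bool is_lowercase(char32_t u);       // "is functionally lowercase?"
        bool is_alphabetic(char32_t u);      // classification property
        double numeric_value(char32_t u);    // numeric/digit/decimal value
    }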

Hashing

Implement hash function objects whose hashing matches the equality relations given by canonical_equivalence and compatibility_equivalence; a usage sketch follows the list below.

  • std::hash<text<C, D>> which defaults to canonical_hash
  • canonical_hash: polymorphic hash function; canonically equivalent ranges yield the same hash
  • compatibility_hash: polymorphic hash function; compatibility equivalent ranges yield the same hash
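
A sketch of how such hashes might be used, with std::u32string standing in for text and the functor interfaces assumed:

    #include <cstddef>
    #include <string>
    #include <unordered_map>

    namespace ogonek {
        struct canonical_hash {
            // Canonically equivalent ranges must yield the same value.
            std::size_t operator()(std::u32string const& s) const;
        };
        struct canonical_equivalence {
            bool operator()(std::u32string const& a, std::u32string const& b) const;
        };
    }

    // A map keyed on canonical equivalence rather than raw code units:
    using canonical_map =
        std::unordered_map<std::u32string, int,
                           ogonek::canonical_hash, ogonek::canonical_equivalence>;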

SCSU

Implement the Standard Compression Scheme for Unicode as per UTS #6.

Text equivalence

Implement text::operator==, and text::operator!= with canonical equivalence semantics.
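
To illustrate the intended semantics with a self-contained example (the comparison described in the final comment assumes the planned operator==):

    #include <cassert>
    #include <string>

    int main() {
        std::u32string precomposed = U"\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        std::u32string decomposed  = U"e\u0301";   // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
        assert(precomposed != decomposed);         // raw code-unit comparison sees them as different
        // text::operator== with canonical equivalence semantics would treat
        // these two sequences as equal.
        return 0;
    }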

BOCU-1

Implement the Binary Ordered Compression for Unicode as per UTN #6.

Unicode sequence traits

Ogonek algorithms and classes should work with any object that can be seen as Unicode data. This includes text instances, char32_t and char16_t literals (because the encoding can be assumed), and any range of char32_t (this includes results from algorithms). Additionally, icu::UnicodeString is also obviously a range of Unicode data.

To support this variety of data sources, tools for manipulating them uniformly need to be created and applied to the existing algorithms. These include is_unicode_sequence and UnicodeSequenceIterator traits, and an as_code_point_range function that provides a common way to view different Unicode sequences.
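
A hedged sketch of the shape these tools might take; the trait and function names come from the issue, while the specializations and return types here are placeholders:

    #include <string>
    #include <type_traits>

    namespace ogonek {
        template <typename T>
        struct is_unicode_sequence : std::false_type {};

        // Any range of char32_t can be treated as a code point sequence.
        template <>
        struct is_unicode_sequence<std::u32string> : std::true_type {};

        // as_code_point_range provides a uniform code point view over any
        // supported source (text instances, literals, icu::UnicodeString, ...).
        // The real return type would be a lazily decoded range of char32_t;
        // std::u32string is only a placeholder here.
        template <typename Sequence>
        std::u32string as_code_point_range(Sequence const& sequence);
    }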

UTF-32

Implement the UTF-32 encoding form.

Generic encoding scheme

Implement a generic template for creating encoding schemes from encoding forms and byte orders. Provide appropriate aliases for UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
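
One possible shape for such a template; the byte-order tags and encoding form types below are stand-ins for the real ones:

    namespace ogonek {
        struct little_endian {};
        struct big_endian {};

        struct utf16 {};   // stand-ins for the encoding form types
        struct utf32 {};

        // An encoding scheme pairs an encoding form with a byte order and
        // (de)serializes its code units to and from bytes.
        template <typename EncodingForm, typename ByteOrder>
        struct encoding_scheme;

        using utf16le = encoding_scheme<utf16, little_endian>;
        using utf16be = encoding_scheme<utf16, big_endian>;
        using utf32le = encoding_scheme<utf32, little_endian>;
        using utf32be = encoding_scheme<utf32, big_endian>;
    }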

Abstract away encodings

Implement ogonek::text as a template that abstracts away encoding and makes conversions automatic and compile-time checked.
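
A rough sketch of how an encoding-parameterized text could make conversions explicit in the types; the constructors shown are assumptions about the planned interface:

    #include <string>

    namespace ogonek {
        struct utf8 {};
        struct utf16 {};

        template <typename Encoding>
        class text {
        public:
            explicit text(std::u32string const& code_points);
            // Converting constructor: changing encodings goes through the
            // type system, so mismatches are caught at compile time.
            template <typename OtherEncoding>
            explicit text(text<OtherEncoding> const& other);
        };
    }

    void example() {
        ogonek::text<ogonek::utf8>  narrow(std::u32string(U"ogonek"));
        ogonek::text<ogonek::utf16> wide(narrow);   // explicit, checked conversion
    }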

Enable carrying validation and normalization info on types

Right now an instance of text carries information about the encoding in the type, in order to enforce conversions at the appropriate places.

It would be neat if text could optionally also enforce some normalization form. This would allow for optimisations in the equivalence and hashing functions, which would reflect in improved performance when using text as a key in maps, for example.

To allow similar optimisation, make all ranges based on normalizing iterators carry information about their normal form in the type.

text types also carry information about the validity of the sequence. This property is already used in various places to optimise away redundant validation. There are, however, more ranges that can have the same characteristic and enable the same optimisations. Make all ranges based on decoding or normalizing iterators be considered validated.
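
A sketch of what carrying a normalization form in the type might look like; the tag names and template shape are assumptions:

    namespace ogonek {
        struct nfc {};         // guaranteed to be in NFC
        struct any_form {};    // no normalization guaranteed

        template <typename Encoding, typename NormalForm = any_form>
        class text;

        // Two texts tagged with the same normal form could be compared or
        // hashed without renormalizing, because the type guarantees the form;
        // likewise, ranges produced by decoding or normalizing iterators
        // could be tagged as already validated.
    }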

The code does not compile

I attempted to call the compiler directly through javac.exe, and it rejected this code. Fix immediately or I'll sue you for everything you own.

Text ordering

Implement text::operator< and the other relational operators with default collation semantics.
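
With such operators in place, standard algorithms would order text by default collation rather than by code point value; a small usage sketch (the text interface here is assumed):

    #include <algorithm>
    #include <vector>

    namespace ogonek {
        struct utf8 {};
        template <typename Encoding>
        class text {
        public:
            // operator< with default collation semantics (declaration only).
            bool operator<(text const& other) const;
        };
    }

    void sort_lines(std::vector<ogonek::text<ogonek::utf8>>& lines) {
        // Sorts in default collation order rather than code point order.
        std::sort(lines.begin(), lines.end());
    }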

UTF-16

Implement the UTF-16 encoding form.

Type-erased text variant

Implement ogonek::any_text as a variant of ogonek::text that erases the type of the encoding and of the underlying storage.
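
A sketch of the kind of interface a type-erased text could expose; everything here is an assumption drawn from the issue description:

    #include <string>

    namespace ogonek {
        template <typename Encoding> class text;

        class any_text {
        public:
            // Constructible from any text<Encoding>, hiding the encoding
            // and the underlying storage behind one interface.
            template <typename Encoding>
            any_text(text<Encoding> const& t);

            // Placeholder accessor: a decoded view of the contents.
            std::u32string code_points() const;
        };
    }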

UCD

Implement direct UCD queries.

Text replacement

Implement functionality to replace a subrange of an instance of text.

UTF-8

Implement the UTF-8 encoding form.

Text concatenation

Implement ogonek::concat and ogonek::text::append to allow concatenation of code point ranges; a sketch of possible declarations follows.
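
A hedged sketch of how these might be declared; the signatures are guesses based on the issue:

    #include <string>

    namespace ogonek {
        template <typename Encoding>
        class text {
        public:
            // Append any code point range to this text.
            template <typename CodePointRange>
            void append(CodePointRange const& range);
        };

        // Concatenate two code point ranges into a new text.
        template <typename Encoding, typename Range1, typename Range2>
        text<Encoding> concat(Range1 const& a, Range2 const& b);
    }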

Replacement validation does not follow best practice

The replacement validation policy does not follow the best practice delineated in Section 5.22 of the Unicode Standard. For example, the sequence <F3 85> is currently replaced by two replacement characters, but it should produce only one, since it is a maximal subpart.
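
A hedged illustration of the desired behaviour; decode_utf8_with_replacement is a made-up name standing in for decoding with the replacement validation policy:

    #include <cstdint>
    #include <vector>

    // Hypothetical: decode UTF-8, substituting U+FFFD per the replacement policy.
    std::vector<char32_t> decode_utf8_with_replacement(std::vector<std::uint8_t> const& bytes);

    void example() {
        // F3 starts a four-byte sequence and 85 is a valid continuation byte,
        // so <F3 85> is one maximal subpart and should yield exactly one U+FFFD.
        auto decoded = decode_utf8_with_replacement({0xF3, 0x85});
        // expected: decoded == {U'\uFFFD'}
    }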
