
ogonek's Introduction

Ogonek

Ogonek is mostly the result of me playing around with Unicode. Currently the library is still in alpha stages, so I don't recommend using it for anything serious, mainly because not all APIs are stabilised. You are welcome to play around with it for any non-serious purposes, though.

Status

The latest version, 0.5.0, implements most of the important stuff. The next version, 0.6.0, will be a thorough refactoring of the code in order to enable more type-safety and faster development.

Setup

Ogonek is mostly header-only. The only part that needs compilation is the data in the Unicode Character Database. Currently I am translating the database to C++ source as static initializers, but I may change this in the future.

You can compile that data by running scons dist from the command line (requires SCons to be installed, for obvious reasons). This will create a zip file in the dist/ directory with both the headers and the library files necessary to use ogonek. By default, a static library is built. To build a shared library (DLL), add lib=shared to the command line when building.


ogonek's Issues

Text insertion

Implement functionality to insert text into the middle of a text instance.

Normalization queries

Implement functionality to test whether a string is in a given normalization form; see the sketch after this list.

  • is_nfd
  • is_nfc
  • is_nfkd
  • is_nfkc
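
A minimal sketch of what these queries might look like, with std::u32string standing in for any code point range; none of these signatures exist yet and are only illustrative:

    // Hypothetical declarations for the planned normalization queries.
    #include <string>

    namespace ogonek {
        // Each query tests whether the sequence is already in the
        // corresponding normalization form, without normalizing it.
        bool is_nfd(std::u32string const& s);
        bool is_nfc(std::u32string const& s);
        bool is_nfkd(std::u32string const& s);
        bool is_nfkc(std::u32string const& s);
    }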

Text erasure

Implement functionality to remove text from the middle of a text instance.

ICU interoperation

icu::UnicodeString should be supported as a UnicodeSequence, allowing its use with all ogonek functionality.
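
A hedged sketch of what this interoperation could look like in practice, with is_nfc standing in for any ogonek algorithm that accepts a UnicodeSequence; its signature here is an assumption:

    #include <unicode/unistr.h>   // icu::UnicodeString

    namespace ogonek {
        // Stand-in for any ogonek algorithm taking a UnicodeSequence.
        template <typename UnicodeSequence>
        bool is_nfc(UnicodeSequence const& sequence);
    }

    void example() {
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("ogonek");
        bool normalized = ogonek::is_nfc(s);   // UnicodeString used directly
        (void) normalized;
    }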

End-user property queries

Implement queries useful to end users that are not present in the UCD, like "is functionally lowercase?". Additionally, bring all end-user-useful direct queries from namespace ogonek::ucd into namespace ogonek. A rough sketch follows the list below.

  • Name
  • Numeric/digit/decimal value
  • Casing properties
  • Classification properties
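
As a rough illustration, the queries brought into namespace ogonek might look like the following; the names and signatures are assumptions, not the final API:

    #include <string>

    namespace ogonek {
        std::string name(char32_t u);        // character name from the UCD
        bool is_lowercase(char32_t u);       // "is functionally lowercase?"
        bool is_alphabetic(char32_t u);      // classification property
        double numeric_value(char32_t u);    // numeric/digit/decimal value
    }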

Hashing

Implement hash function objects whose hashing matches the equality relations given by canonical_equivalence and compatibility_equivalence; a usage sketch follows the list below.

  • std::hash<text<C, D>> which defaults to canonical_hash
  • canonical_hash: polymorphic hash function; canonically equivalent ranges yield the same hash
  • compatibility_hash: polymorphic hash function; compatibility equivalent ranges yield the same hash
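
A sketch of how such hashes might be used, with std::u32string standing in for text and the functor interfaces assumed:

    #include <cstddef>
    #include <string>
    #include <unordered_map>

    namespace ogonek {
        struct canonical_hash {
            // Canonically equivalent ranges must yield the same value.
            std::size_t operator()(std::u32string const& s) const;
        };
        struct canonical_equivalence {
            bool operator()(std::u32string const& a, std::u32string const& b) const;
        };
    }

    // A map keyed on canonical equivalence rather than raw code units:
    using canonical_map =
        std::unordered_map<std::u32string, int,
                           ogonek::canonical_hash, ogonek::canonical_equivalence>;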

SCSU

Implement the Standard Compression Scheme for Unicode as per UTS #6.

Text equivalence

Implement text::operator==, and text::operator!= with canonical equivalence semantics.
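
To illustrate the intended semantics with a self-contained example (the comparison described in the final comment assumes the planned operator==):

    #include <cassert>
    #include <string>

    int main() {
        std::u32string precomposed = U"\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        std::u32string decomposed  = U"e\u0301";   // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
        assert(precomposed != decomposed);         // raw code-unit comparison sees them as different
        // text::operator== with canonical equivalence semantics would treat
        // these two sequences as equal.
        return 0;
    }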

BOCU-1

Implement the Binary Ordered Compression for Unicode as per UTN #6.

Unicode sequence traits

Ogonek algorithms and classes should work with any object that can be seen as Unicode data. This includes text instances, char32_t and char16_t literals (because the encoding can be assumed), and any range of char32_t (this includes results from algorithms). Additionally, icu::UnicodeString is also obviously a range of Unicode data.

To support this variety of data sources, tools for manipulating them uniformly need to be created and applied to the existing algorithms. These include is_unicode_sequence and UnicodeSequenceIterator traits, and an as_code_point_range function that provides a common way to view different Unicode sequences.
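
A hedged sketch of the shape these tools might take; the trait and function names come from the issue, while the specializations and return types here are placeholders:

    #include <string>
    #include <type_traits>

    namespace ogonek {
        template <typename T>
        struct is_unicode_sequence : std::false_type {};

        // Any range of char32_t can be treated as a code point sequence.
        template <>
        struct is_unicode_sequence<std::u32string> : std::true_type {};

        // as_code_point_range provides a uniform code point view over any
        // supported source (text instances, literals, icu::UnicodeString, ...).
        // The real return type would be a lazily decoded range of char32_t;
        // std::u32string is only a placeholder here.
        template <typename Sequence>
        std::u32string as_code_point_range(Sequence const& sequence);
    }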

UTF-32

Implement the UTF-32 encoding form.

Generic encoding scheme

Implement a generic template for creating encoding schemes from encoding forms and byte orders. Provide appropriate aliases for UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
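
One possible shape for such a template; the byte-order tags and encoding form types below are stand-ins for the real ones:

    namespace ogonek {
        struct little_endian {};
        struct big_endian {};

        struct utf16 {};   // stand-ins for the encoding form types
        struct utf32 {};

        // An encoding scheme pairs an encoding form with a byte order and
        // (de)serializes its code units to and from bytes.
        template <typename EncodingForm, typename ByteOrder>
        struct encoding_scheme;

        using utf16le = encoding_scheme<utf16, little_endian>;
        using utf16be = encoding_scheme<utf16, big_endian>;
        using utf32le = encoding_scheme<utf32, little_endian>;
        using utf32be = encoding_scheme<utf32, big_endian>;
    }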

Abstract away encodings

Implement ogonek::text as a template that abstracts away encoding and makes conversions automatic and compile-time checked.
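
A rough sketch of how an encoding-parameterized text could make conversions explicit in the types; the constructors shown are assumptions about the planned interface:

    #include <string>

    namespace ogonek {
        struct utf8 {};
        struct utf16 {};

        template <typename Encoding>
        class text {
        public:
            explicit text(std::u32string const& code_points);
            // Converting constructor: changing encodings goes through the
            // type system, so mismatches are caught at compile time.
            template <typename OtherEncoding>
            explicit text(text<OtherEncoding> const& other);
        };
    }

    void example() {
        ogonek::text<ogonek::utf8>  narrow(std::u32string(U"ogonek"));
        ogonek::text<ogonek::utf16> wide(narrow);   // explicit, checked conversion
    }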

Enable carrying validation and normalization info on types

Right now an instance of text carries information about the encoding in the type, in order to enforce conversions at the appropriate places.

It would be neat if text could optionally also enforce some normalization form. This would allow for optimisations in the equivalence and hashing functions, which would reflect in improved performance when using text as a key in maps, for example.

To allow similar optimisation, make all ranges based on normalizing iterators carry information about their normal form in the type.

text types also carry information about the validity of the sequence. This property is already used in various places to optimise away redundant validation. There are, however, more ranges that can have the same characteristic and enable the same optimisations. Make all ranges based on decoding or normalizing iterators be considered validated.
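
A sketch of what carrying a normalization form in the type might look like; the tag names and template shape are assumptions:

    namespace ogonek {
        struct nfc {};         // guaranteed to be in NFC
        struct any_form {};    // no normalization guaranteed

        template <typename Encoding, typename NormalForm = any_form>
        class text;

        // Two texts tagged with the same normal form could be compared or
        // hashed without renormalizing, because the type guarantees the form;
        // likewise, ranges produced by decoding or normalizing iterators
        // could be tagged as already validated.
    }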

The code does not compile

I attempted to call the compiler directly through javac.exe, and it rejected this code. Fix immediately or I'll sue you for everything you own.

Text ordering

Implement text::operator< and the other relational operators with default collation semantics.
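
With such operators in place, standard algorithms would order text by default collation rather than by code point value; a small usage sketch (the text interface here is assumed):

    #include <algorithm>
    #include <vector>

    namespace ogonek {
        struct utf8 {};
        template <typename Encoding>
        class text {
        public:
            // operator< with default collation semantics (declaration only).
            bool operator<(text const& other) const;
        };
    }

    void sort_lines(std::vector<ogonek::text<ogonek::utf8>>& lines) {
        // Sorts in default collation order rather than code point order.
        std::sort(lines.begin(), lines.end());
    }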

UTF-16

Implement the UTF-16 encoding form.

Type-erased text variant

Implement ogonek::any_text as a variant of ogonek::text that erases the type of the encoding and of the underlying storage.
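
A sketch of the kind of interface a type-erased text could expose; everything here is an assumption drawn from the issue description:

    #include <string>

    namespace ogonek {
        template <typename Encoding> class text;

        class any_text {
        public:
            // Constructible from any text<Encoding>, hiding the encoding
            // and the underlying storage behind one interface.
            template <typename Encoding>
            any_text(text<Encoding> const& t);

            // Placeholder accessor: a decoded view of the contents.
            std::u32string code_points() const;
        };
    }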

UCD

Implement direct UCD queries.

Text replacement

Implement functionality to replace a subrange of an instance of text.

UTF-8

Implement the UTF-8 encoding form.

Text concatenation

Implement ogonek::concat and ogonek::text::append to allow concatenation of code point ranges; a sketch of possible declarations follows.
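
A hedged sketch of how these might be declared; the signatures are guesses based on the issue:

    #include <string>

    namespace ogonek {
        template <typename Encoding>
        class text {
        public:
            // Append any code point range to this text.
            template <typename CodePointRange>
            void append(CodePointRange const& range);
        };

        // Concatenate two code point ranges into a new text.
        template <typename Encoding, typename Range1, typename Range2>
        text<Encoding> concat(Range1 const& a, Range2 const& b);
    }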

Replacement validation does not follow best practice

The replacement validation policy does not follow the best practice delineated in Section 5.22 of the Unicode Standard. For example, the sequence <F3 85> is currently replaced by two replacement characters, but it should produce only one, since it is a maximal subpart.
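
A hedged illustration of the desired behaviour; decode_utf8_with_replacement is a made-up name standing in for decoding with the replacement validation policy:

    #include <cstdint>
    #include <vector>

    // Hypothetical: decode UTF-8, substituting U+FFFD per the replacement policy.
    std::vector<char32_t> decode_utf8_with_replacement(std::vector<std::uint8_t> const& bytes);

    void example() {
        // F3 starts a four-byte sequence and 85 is a valid continuation byte,
        // so <F3 85> is one maximal subpart and should yield exactly one U+FFFD.
        auto decoded = decode_utf8_with_replacement({0xF3, 0x85});
        // expected: decoded == {U'\uFFFD'}
    }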
