Git Product home page Git Product logo

utf's Introduction

Build status Build Status GitHub License platform platform platform

UTF Iterators for modern C++

Introduction

UTF is a header only library providing simple unicode iterator adapters to for converting between various unicode formats and differing endianness. Similar to those available with boost, but without the boost dependency.

Utility functions that build on top of these iterators are available to provide standard conversions and byte order mark detection. Additional support is provided for replacing invalid byte sequences with a user provide replacement character, or the unicode default (U+FFFD �) or throwing an exception.

Iterators

Each iterator adapter models the catagory of the underlying base_iterator. To ensure iterators do not iterate past the end when consuming multiple elements from the base_iterator, different constructors are used for certain iterators depending on the catagory of base_iterator.

|utf8_to_utf32_iterator<base_iterator,policies...>

constructors:

base_iterator::iterator_category is bidrectional_iterator or random_access_iterator
utf8_to_utf32_iterator(start_iterator,range_begin_iterator,range_end_iterator)

base_iterator::iterator_category is forward_iterator or input_iterator
utf8_to_utf32_iterator(start_iterator,range_end_iterator)

policies:

to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|utf8_to_utf16_iterator<base_iterator,policies...>

constructors:

base_iterator::iterator_category is bidrectional_iterator
utf8_to_utf16_iterator(start_iterator,range_begin_iterator,range_end_iterator)

base_iterator::iterator_category is forward_iterator or input_iterator
utf8_to_utf16_iterator(start_iterator,range_end_iterator)

policies:

to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|utf32_to_utf8_iterator<base_iterator,policies...>

constructors:

utf32_to_utf8_iterator(base_iterator)

policies:

from<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|ut32_to_utf16_iterator<base_iterator,policies...>

constructors:

utf32_to_utf16_iterator(base_iterator)

policies:

from<big_endian|little_endian>
to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>

Utility Functions

The utfX_to_utfY() utility functions provide an easy way to convert between different unicode types. the base_iterator dereference operator should yeild a type convertible to uintX_t.

Each of utfX_to_utfY() use a default return value of uYstring, which can be changed by providing any type that has begin() end() and value_type as one of the tempalte parameters. for example,

std::string u8text = u8"text";
uint8_to_uint16<std::vector<char16_t>>(text);

Two overloads exist for each utfX_to_utfY() conversion function, one whos arguments are a start and end iterator range to convert, the other an iterable object. The same set of policies that would be used with the respective iterator can be passed as template parameters to these functions. The order is unimportant. However polices are not checked and unused policies will be silently ignored (for example a to<big_endian> in a to_utf8 conversion has no meaning).

uint8_to_uint16<onerror<replace_with_fffd>,std::vector<char16_t>>(text);
// the same as
uint8_to_uint16<std::vector<char16_t>,onerror<replace_with_fffd>>(text);

|u16string utf8_to_utf16<result_type,policies...>(start_iterator,end_iterator)

|u16string utf8_to_utf16<result_type,policies...>(iteratble)

policies:

to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|u16string utf8_to_utf32<result_type,policies...>(start_iterator,end_iterator)

|u16string utf8_to_utf32<result_type,policies...>(iteratble)

policies:

to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|u8string utf16_to_utf8<result_type,policies...>(start_iterator,end_iterator)

|u8string utf16_to_utf8<result_type,policies...>(iteratble)

policies:

from<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|u16string utf16_to_utf32<result_type,policies...>(start_iterator,end_iterator)

|u16string utf16_to_utf32<result_type,policies...>(iteratble)

policies:

from<big_endian|little_endian>
to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|u8string utf32_to_utf16<result_type,policies...>(start_iterator,end_iterator)

|u8string utf32_to_utf16<result_type,policies...>(iteratble)

policies:

from<big_endian|little_endian>
to<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|u8string utf32_to_utf8<result_type,policies...>(start_iterator,end_iterator)

|u8string utf32_to_utf8<result_type,policies...>(iteratble)

policies:

from<big_endian|little_endian>
onerror<replace_with<char32_t>|throw_exception>


|bom_type detect_bom(start_iterator,end_iterator)

Detects a bom from a byte sequence. reads upto 4 bytes to detect the bom.

This is nesseccary to distinguish between utf32_little_endian and utf16_little_endian where the first two bytes are identical.

Note: there is potential for ambiguity when the first two bytes after the bom in a utf16_little_endian byte sequence are 0x0.

returns one of,

none,
utf8,
utf16_little_endian,
utf16_big_endian,
utf32_little_endian,
utf32_big_endian,

Constants

constexpr uint8_t utf8_bom[]
constexpr uint8_t utf16_little_endian_bom[]
constexpr uint8_t utf16_big_endian_bom[]
constexpr uint8_t utf32_little_endian_bom[]
constexpr uint8_t utf32_little_endian_bom[]
constexpr uint8_t utf32_big_endian_bom[]

Cmake

This project can be used as an external project cmake's find_package(),

# CMakeLists.txt
find_package(utf REQUIRED)
...
add_library(somelib ...)
...
target_link_libraries(somelib PRIVATE utf::utf)

or placed in a thirdparty folder and used through add_subdirectory,

# Disable building tests
set(UTF_BUILD_TESTS OFF CACHE INTERNAL "")
..
add_subdirectory(thirdparty/utf)
...
add_library(somelib ...)
...
target_link_libraries(somelib PRIVATE utf::utf)

Examples

  • Convert existing utf16 string to utf8 with utility functions.
#include <ifstream>
#include <iterator>
#include <utf/utf.h>

int main()
{
	using namespace utf;
    
    u16string u16_text = U"ɦΈ˪˪ʘ";
    //use default platform endianness, default onerror policy.
    u8string u8_text = utf16_to_utf8(u16_text.begin(),u16_text.end());
}
  • Convert from an existing utf32 string to utf8 with iterators.
#include <utf/utf.h>

int main()
{
    using namespace utf;

    u32string u32_text = U"ɦΈ˪˪ʘ";

    //use the bidirectional_iterator constructor, default platform endianness, default error policy.
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());

    u8string u8_text(pos, end);
}
  • Reading from a UTF-32 little endian file, checking for a byte order mark, and converting to UTF-8.
#include <fstream>
#include <iterator>
#include <cassert>
#include <utf/utf.h>

int main()
{
    using namespace utf;

    // open the file
    std::ifstream uc_file("unicode_file.txt", std::ios::binary);

    //The iterators to use for the stream.
    using base_iterator = std::istreambuf_iterator<char>;

    base_iterator base_pos(uc_file);
    base_iterator base_end{};

    //detect the if the file contains a byte order mark and that it is UTF-32 little endian encoded file.
    bom_type uc_type = detect_bom(base_pos, base_end);
    assert( uc_type == bom_type::utf32_little_endian );

    // detect_bom will read upto 4 bytes to detect the bom, even if the bom is two bytes (utf16) or three bytes (utf8). For input_iterators,
    // bytes may already have been read that are part of the unicode text and cannot be re-read. In this case the bom is 4 bytes. However when
    // detecting utf16 or utf8 boms, the stream should be reset to the begining and the iterator advanced by exactly the size of the bom.
    // Although this is not necessary here, as sizeof(utf32_little_endian_bom) == 4

    uc_file.clear();
    uc_file.seekg(0, std::ios::beg);
    std::advance(base_pos, sizeof(utf32_little_endian_bom));

    //read from the stream 4 bytes at a time for the utf32_to_utf8_iterator
    using stridel_iterator = stride_long_iterator<base_iterator>;

    // using the input iterator constructor
    stridel_iterator b2l_pos(base_pos, base_end);
    stridel_iterator b2l_end(base_end, base_end);

    //specify the source endianness. The default will be the platform endianness. using the default onerror<throw_exception> policy.
    using utf_iterator = utf32_to_utf8_iterator<stridel_iterator, from<little_endian>>;


    utf_iterator utf_pos(b2l_pos);
    utf_iterator utf_end(b2l_end);

    // construct a new std::basic_string<char> from the iterators.
    u8string result_u8(utf_pos, utf_end);

    return 0;
}

License

The files in this repository are licensed under the Boost Software License 1.0. A copy of the license is available in the root of the repository.

utf's People

Contributors

rmawatson avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

codacy-badger

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.