compact_enc_det's Introduction

Introduction

Compact Encoding Detection (CED for short) is a library written in C++ that scans the given raw bytes and detects the most likely text encoding.

Basic usage:

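#include <cstring>  // strlen

#include "compact_enc_det/compact_enc_det.h"

const char* text = "Input text";
bool is_reliable;    // output: whether the result is considered reliable
int bytes_consumed;  // output: number of input bytes examined

Encoding encoding = CompactEncDet::DetectEncoding(
        text, strlen(text),
        nullptr, nullptr, nullptr,  // url_hint, http_charset_hint, meta_charset_hint
        UNKNOWN_ENCODING,           // encoding_hint
        UNKNOWN_LANGUAGE,           // language_hint
        CompactEncDet::WEB_CORPUS,  // corpus_type
        false,                      // ignore_7bit_mail_encodings
        &bytes_consumed,
        &is_reliable);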

How to build

You need CMake to build the package. After unzipping the source code, run autogen.sh to build everything automatically. The script also downloads the Google Test framework needed to build the unit tests.

$ cd compact_enc_det
$ ./autogen.sh
...
$ bin/ced_unittest

On Windows, run cmake . to download the test framework and generate project files for Visual Studio.

D:\packages\compact_enc_det> cmake .

3rd party bindings for other languages

Have you created bindings for another language? Open a PR and add it to the list!

compact_enc_det's People

Contributors

baklap4, bamtor, j4ns-r, jinsukkim, krystof-k, mirabilos, nico, randomascii, sgraham, sonicdoe, tedlyngmo

compact_enc_det's Issues

Add possibility to detect encoding from input streams

Right now the detector seems to work only with input arrays. It would be very useful to support input streams as well, because loading very large input files entirely into memory is not memory efficient.
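
Until stream input is supported, one possible workaround is to copy a bounded prefix of the stream into a buffer and run detection on that buffer only. The sketch below assumes that a prefix is representative of the whole input, which may not hold for every file. ReadPrefix and ExamplePrefix are hypothetical helpers, not part of CED.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of CED): reads at most max_bytes from the
// stream and returns them as one contiguous buffer. Fewer bytes are returned
// if the stream ends early.
std::string ReadPrefix(std::istream& in, std::size_t max_bytes) {
  std::string buffer(max_bytes, '\0');
  in.read(&buffer[0], static_cast<std::streamsize>(max_bytes));
  // gcount() reports how many bytes the last read actually extracted.
  buffer.resize(static_cast<std::size_t>(in.gcount()));
  return buffer;
}

// Usage example: sample the first 5 bytes of an in-memory stream.
std::string ExamplePrefix() {
  std::istringstream in("hello world");
  return ReadPrefix(in, 5);
}
```

The returned buffer can then be passed to CompactEncDet::DetectEncoding as data() and size(), in the same way as the array in the basic usage example. Memory use stays bounded by max_bytes regardless of the file size.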

update CMakeLists.txt to accommodate the needs of the community

I am interested in using this project inside zxing-cpp (see zxing-cpp/zxing-cpp#336). I'd very much like to incorporate it via CMake's FetchContent mechanism. This is currently not possible due to the way gtest is set up. A quick look at the network graph revealed that quite a few people felt the need to disable that as well.

Is there any chance that the CMakeLists.txt could be updated with

  • the option of disabling the gtest infrastructure (preferably by default)
  • making sure it builds with cmake without the platform-specific shell script autogen.sh
  • an update of the required minimum cmake version to something more adequately representing the current landscape (fixing deprecation warnings)?

can't use it to build a shared library

I use this command:

g++ -fPIC -std=c++11 -I /path/to/compact_enc_det -I ${JAVA_HOME}/include -I ${JAVA_HOME}/include/linux -shared -o libmycode.so mycode.c -Wl,-whole-archive libced.a  -Wl,-no-whole-archive

and the output is:

/usr/bin/ld: libced.a(compact_enc_det.cc.o): relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
libced.a(compact_enc_det.cc.o): error adding symbols: Bad value
collect2: error: ld returned 1 exit status

Introduce HTML5 mode

(originally reported by [email protected])

Among 7-bit encodings, Blink supports only ISO-2022-JP. When CED is used in conjunction with Chromium, the detection of other 7-bit encodings and other non-HTML5 encodings should therefore be disabled. We can handle this by introducing an HTML5 mode.
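
As a rough illustration of the idea, the sketch below applies a filter to the name of a detected encoding after detection has run. It is a hypothetical post-filter, not CED's API: the function name, the list of disabled encoding names, and the choice of US-ASCII as the fallback are all assumptions made for this example. A real HTML5 mode would more likely suppress these candidates inside the detector itself.

```cpp
#include <string>

// Hypothetical post-filter (not part of CED): keeps ISO-2022-JP, but reports
// other 7-bit encodings as plain ASCII. The names in kDisabled and the
// US-ASCII fallback are illustrative assumptions.
std::string ApplyHtml5Filter(const std::string& detected_name) {
  static const char* const kDisabled[] = {
      "ISO-2022-KR", "ISO-2022-CN", "HZ-GB-2312", "UTF-7"};
  for (const char* name : kDisabled) {
    if (detected_name == name) {
      return "US-ASCII";
    }
  }
  // Everything else, including ISO-2022-JP, passes through unchanged.
  return detected_name;
}
```

For example, ApplyHtml5Filter("UTF-7") yields "US-ASCII", while ApplyHtml5Filter("ISO-2022-JP") is returned as-is.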

Standardize the encoding name

(originally reported by [email protected])

Encoding names in CED differ from those in the HTML5/WHATWG encoding specs. Blink has encoding aliases to make up for those differences, but we can standardize the encoding names in CED (at least when it's built with 'HTML5' defined, which we need to introduce to CED).

Failing to detect ISO-8859 encodings

I get good results detecting Unicode encodings and Asian code pages, but really poor results with files in common European languages saved in the ISO-8859 family. Such files are very common, and this problem makes compact_enc_det unusable for me.
For these encodings, the encoding is always detected as ASCII (and is_reliable is set to true).
ISO-8859-6 for Arabic is OK.
Am I the only one?
Thanks for letting me know, so I can check whether there is a problem on my side or just look for an alternative.

logic

Hello there, I wonder if there is a document somewhere explaining the logic of this encoding detection?
Thank you!

Provide a bridge for other languages (Java, node.js, etc)

I don't know if this feature request belongs here, but it would be nice to be able to use CED from other programming languages (e.g. Java or node.js).
That's what @mzsanford has actually done for (a fork of) the Compact Language Detector library: https://github.com/mzsanford/cld.

My personal use case would be Java — I suppose technologies like SWIG or a JNI-based wrapper could be used.

What's your opinion on this feature request? Thanks.

Encoding issue

Unsure if it can help, but the character › was sometimes interpreted as â€º on my website.

Context

The charset was not specified properly in the HTML document (typo). There was no charset in the CSS file where the character was used, either.

Data

›   -> (UTF8)     E2   80   BA
â€º -> (UTF16)  00E2 20AC 00BA

›   -> 1110 0010                     1000 0000 1011 1010
â€º -> 1110 0010 0010 0000 1010 1100 0000 0000 1011 1010

Again, I'm not sure whether this causes the wrong charset detection or misleads the algorithm, since the binary doesn't match 1:1. But since 0x80 seems to be a special value in the algorithm, I thought it could be an edge case that could be fixed.

Cheers,
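
The byte values in this report are consistent with the UTF-8 bytes E2 80 BA being read one byte at a time as Windows-1252. That is an assumption about what happened on the site, not something the report confirms. The sketch below decodes the same three bytes both ways to show where 00E2 20AC 00BA comes from. Both function names are hypothetical, and the UTF-8 decoder does no validation.

```cpp
#include <cstdint>

// Decodes one 3-byte UTF-8 sequence (lead byte 1110xxxx) into a code point.
// No validation is performed; this only illustrates the bytes in the report.
std::uint32_t DecodeUtf8ThreeByte(std::uint8_t b0, std::uint8_t b1,
                                  std::uint8_t b2) {
  return (static_cast<std::uint32_t>(b0 & 0x0F) << 12) |
         (static_cast<std::uint32_t>(b1 & 0x3F) << 6) |
         static_cast<std::uint32_t>(b2 & 0x3F);
}

// Maps one Windows-1252 byte to a Unicode code point. Only 0x80..0x9F differ
// from Latin-1; bytes unassigned in Windows-1252 keep their numeric value.
std::uint32_t DecodeWindows1252Byte(std::uint8_t b) {
  static const std::uint16_t kHigh[32] = {
      0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
      0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
      0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
      0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
  if (b >= 0x80 && b <= 0x9F) {
    return kHigh[b - 0x80];
  }
  return b;
}
```

Read as UTF-8, the three bytes form the single code point U+203A (›). Read byte by byte as Windows-1252, they become U+00E2, U+20AC and U+00BA (â€º), where 0x80 is the byte that maps to the euro sign.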
