google / compact_enc_det
compact_enc_det - Compact Encoding Detection
License: Apache License 2.0
I am interested in using this project inside zxing-cpp (see zxing-cpp/zxing-cpp#336). I'd very much like to incorporate it via CMake's FetchContent mechanism. This is currently not possible due to the way gtest is set up. A quick look at the network graph revealed that quite a few people felt the need to disable that as well.
Is there any chance that the CMakeLists.txt could be updated with an option to disable the gtest infrastructure (preferably by default)?
autogen.sh
Can we have the build script fetch a known-good version of the googletest repository rather than head-of-line?
This will stop people from running into issues like this one when building the code here.
Unsure if it can help, but the character › was sometimes interpreted as â€º on my website.
The charset was not specified properly in the HTML document (typo). No charset in the CSS file either, where the character was used.
› -> (UTF8) E2 80 BA
â€º -> (UTF16) 00E2 20AC 00BA
› -> 1110 0010 1000 0000 1011 1010
â€º -> 1110 0010 0010 0000 1010 1100 0000 0000 1011 1010
Again, not sure if › is causing the wrong charset detection or luring the algorithm, since the binary isn't matching 1:1, but since 0x80 seems to be a special value in the algorithm I thought it could be an edge case that could be fixed.
Cheers,
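The byte values quoted in this report (E2 80 BA read back as 00E2 20AC 00BA) match what happens when UTF-8 bytes are decoded as Windows-1252. The report does not say which charset the browser fell back to, so that is an assumption. The sketch below is independent of CED. DecodeAsCp1252 is a hypothetical helper name, and it only illustrates the byte-to-code-point mapping.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Code points that Windows-1252 assigns to bytes 0x80..0x9F.
// 0 marks the five bytes the code page leaves undefined.
static const std::uint32_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178};

// Hypothetical helper: decodes raw bytes as if they were Windows-1252
// and returns one Unicode code point per input byte.
std::vector<std::uint32_t> DecodeAsCp1252(const std::string& bytes) {
  std::vector<std::uint32_t> out;
  for (unsigned char b : bytes) {
    if (b >= 0x80 && b <= 0x9F) {
      std::uint32_t cp = kCp1252High[b - 0x80];
      out.push_back(cp != 0 ? cp : 0xFFFD);  // U+FFFD for undefined bytes
    } else {
      out.push_back(b);  // 0x00..0x7F and 0xA0..0xFF keep their value
    }
  }
  return out;
}
```

Feeding it the three UTF-8 bytes of › yields three code points, one per byte, which is why a single character shows up as three on the page.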
I use this cmd:
g++ -fPIC -std=c++11 -I /path/to/compact_enc_det -I ${JAVA_HOME}/include -I ${JAVA_HOME}/include/linux -shared -o libmycode.so mycode.c -Wl,-whole-archive libced.a -Wl,-no-whole-archive
and the output is:
/usr/bin/ld: libced.a(compact_enc_det.cc.o): relocation R_X86_64_32S against `.rodata' can not be used when making a shared object; recompile with -fPIC
libced.a(compact_enc_det.cc.o): error adding symbols: Bad value
collect2: error: ld returned 1 exit status
(originally reported by [email protected])
Encoding names are different from HTML5/WHATWG encoding specs. Blink has encoding aliases to make up for those differences, but we can standardize the encoding name in CED (at least when it's built with 'HTML5' defined, which we need to introduce to CED).
I have good results detecting Unicode encodings and Asian codepages, but really poor results with files in common European languages saved in the ISO-8859 family. Such files are very common, and this problem makes compact_enc_det unusable for me.
Encoding is always detected as ASCII (and reliable is set to true) for these encodings.
ISO-8859-6 for Arabic is OK.
Am I the only one?
Thanks for letting me know, so I can check if there is a problem or just look for an alternative.
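One way to narrow down a report like this is to confirm that the test files really contain bytes at or above 0x80, since a result of ASCII can only be correct for pure 7-bit input. The following is a minimal sketch that does not use CED at all. FirstNonAsciiByte is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first byte >= 0x80,
// or std::string::npos when every byte is 7-bit ASCII.
std::size_t FirstNonAsciiByte(const std::string& bytes) {
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    if (static_cast<unsigned char>(bytes[i]) >= 0x80) {
      return i;
    }
  }
  return std::string::npos;
}
```

If this returns an offset for a file that the detector labels as ASCII, the file and that offset make a compact reproduction case.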
(originally reported by [email protected])
When used in conjunction with Chromium, Blink only supports ISO-2022-JP. The detection of other 7-bit encodings and other non-HTML5 encodings should be disabled in CED. We can handle it by introducing HTML5 mode.
Please update the LICENSE file with the copyright year(s) and owner(s), by filling in the fields between the square brackets.
Hello there, I wonder if there is a document somewhere explaining the logic of this encoding detection?
Thank you!
Right now it seems the detector is only able to work with input arrays. It would be very useful to allow input streams as well. This makes sense when the input files are very large and it's not memory efficient to load them entirely into memory.
I don't know if this feature request belongs here, but it would be nice to be able to use CED from other programming languages (e.g. Java or node.js).
That's what @mzsanford has actually done for (a fork of) the Compact Language Detector library: https://github.com/mzsanford/cld.
My personal use case would be Java — I suppose technologies like SWIG or a JNI-based wrapper could be used.
What's your opinion on this feature request? Thanks.