silnrsi / teckit

A Text Encoding Conversion toolkit

License: Other

teckit's Issues

teckit-2.5.9.tar.gz bundles zlib-1.2.3 without a zlib license

teckit-2.5.9.tar.gz archive contains zlib-1.2.3 directory where many files contain this license declaration:

/* adler32.c -- compute the Adler-32 checksum of a data stream
 * Copyright (C) 1995-2004 Mark Adler
 * For conditions of distribution and use, see copyright notice in zlib.h
 */

But there is no zlib.h file. When I look at original zlib-1.2.3 sources, the zlib.h reads:

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

That means the teckit archive violates the zlib license, because the copyright notice was removed from the source distribution. Please add zlib.h, or at least the portion with the copyright notice, back to the teckit release archive. I can see it exists in your git repository.

SFconv.cpp fails looking for ancient (not in tarball) expat/xmlparse/xmlparse.h

On macOS 10.14.6 with expat-2.4.1 installed locally:

g++ -DHAVE_CONFIG_H -I. -I..  -I../source/Public-headers -DXML_DTD  -I/sw/include -std=c++11 -g -O2 -DNDEBUG -MT ../SFconv/sfconv-SFconv.o -MD -MP -MF ../SFconv/.deps/sfconv-SFconv.Tpo -c -o ../SFconv/sfconv-SFconv.o `test -f '../SFconv/SFconv.cpp' || echo './'`../SFconv/SFconv.cpp
../SFconv/SFconv.cpp:49:10: fatal error: 'expat/xmlparse/xmlparse.h' file not found
#include "expat/xmlparse/xmlparse.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

In SFconv.cpp, the choice between ancient expat/xmlparse/xmlparse.h and current expat.h is set by HAVE_LIBEXPAT, and that's defined in config.h: #define HAVE_LIBEXPAT 1

The problem is config.h is only included in SFconv.cpp in certain cases:

teckit/SFconv/SFconv.cpp

Lines 35 to 40 in eda0d44

#endif
#ifndef platformUTF16
#include "config.h"
#if WORDS_BIGENDIAN
#define platformUTF16 kForm_UTF16BE

It should be included much earlier like in source/Engine.cpp

Also, expat/xmlparse/xmlparse.h is not included in the tarball if anyone was trying to build without system-expat.

-u 3 option for using unmapped character as the replacement character (self-replacement)

After investing a fair bit of time (and not reading all documentation first), I came to realize txtconv won't simply ignore characters with no mapping specified. Hence if my source doc is an unstructured mix of Unicode and legacy text, all of the Unicode text gets replaced with the default/specified replacement character.

Since txtconv has to replace characters missing a mapping with the replacement character anyways, is there any reason not to allow replacing the unmapped character with itself? I understand that txtconv seeks to guarantee the final output is uniformly of the designated encoding, but giving an option to bypass this constraint would make the tool even more flexible (and useful for many unstructured, mixed encoding scenarios).

Adding a "-u 3" option could be a way to implement this. If it is passed to txtconv, then txtconv replaces each unmapped character with that same character instead of dropping or substituting it. A warning could be printed if any self-replaced characters do not match the desired final encoding.
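The requested behaviour can be sketched as follows. This is only an illustration of "self-replacement", not TECkit's engine: real TECkit mappings are rule-based rather than a flat lookup table, and `Mapping` and `convertWithSelfReplacement` are hypothetical names invented for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical flat mapping table: source code point -> target code point.
using Mapping = std::unordered_map<std::uint32_t, std::uint32_t>;

// Convert `input` through `map`. A code point with no mapping is copied to
// the output unchanged (the proposed self-replacement) and counted in
// `passedThrough`, so the caller can print a warning if it is non-zero.
std::vector<std::uint32_t> convertWithSelfReplacement(
        const std::vector<std::uint32_t>& input,
        const Mapping& map,
        std::size_t& passedThrough) {
    std::vector<std::uint32_t> output;
    output.reserve(input.size());
    passedThrough = 0;
    for (std::uint32_t cp : input) {
        const auto it = map.find(cp);
        if (it != map.end()) {
            output.push_back(it->second);   // mapped: emit the target
        } else {
            output.push_back(cp);           // unmapped: emit the character itself
            ++passedThrough;
        }
    }
    return output;
}
```

Returning the pass-through count keeps the warning decision with the caller, which matches the suggestion that a warning be optional rather than a hard failure.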

Mixing of FLAGS (expat_CFLAGS="$CFLAGS")

This issue was reported as bug #338110 for the Gentoo Linux app-text/teckit-2.5.1 package. I don't know if it was ever reported upstream to you, but I thought it would be a good idea to make a note of it here and see what you think.

The report was:

I found a mixing of FLAGS:

/bin/sh ../libtool --tag=CXX   --mode=link x86_64-pc-linux-gnu-g++  -O2 -pipe -march=core2 -frecord-gcc-switches -mssse3 -mcx16 -mmmx -g -Wmissing-include-dirs -Wenum-compare -DNDEBUG  -Wl,-O1 -Wl,--as-needed -Wl,-O1,--hash-style=gnu,--sort-common -o txtconv TxtConv.o ../lib/lib
TECkit.la -lz
libtool: link: x86_64-pc-linux-gnu-g++ -O2 -pipe -march=core2 -frecord-gcc-switches -mssse3 -mcx16 -mmmx -g -Wmissing-include-dirs -Wenum-compare -DNDEBUG -Wl,-O1 -Wl,-O1 -Wl,--hash-style=gnu -Wl,--sort-common -o .libs/txtconv TxtConv.o  -Wl,--as-needed ../lib/.libs/libTECkit.so
 -lz
cc1plus: warning: command line option "-Wimplicit-function-declaration" is valid for C/ObjC but not for C++

The fix is this patch:

No need to pass CFLAGS twice, esp. if they are used to feed g++
Bug #338110

Index: TECkit_2_5_1/configure.ac
===================================================================
--- TECkit_2_5_1.orig/configure.ac
+++ TECkit_2_5_1/configure.ac
@@ -76,7 +76,7 @@ noexpat_CFLAGS="$CFLAGS"
 noexpat_LIBS="$LIBS"
 AC_CHECK_LIB(expat, XML_ExpatVersion)
 AM_CONDITIONAL(SYSTEM_EXPAT, test x$ac_cv_lib_expat_XML_ExpatVersion = xyes)
-expat_CFLAGS="$CFLAGS"
+expat_CFLAGS=""
 expat_LIBS="$LIBS"
 CFLAGS="$noexpat_CFLAGS"
 LIBS="$noexpat_LIBS"

The patch is still being applied in Gentoo's current package, app-text/teckit-2.5.6.

Remove README?

During #13, I discovered that there is both a README.md file and a README file. I think it would be a good idea to remove one, probably the README, since it is not nicely formatted by GitHub's markdown renderer. The information in the README should then be transferred to README.md, to NEWS, or to both. I think the information about changes would be most appropriate in NEWS.

Fix Travis-CI for Windows

Travis-CI is failing in the cross-compilation of Windows binaries. I believe the error is here:

make[2]: Entering directory '/home/travis/build/spl/teckit/windows-build32/lib'
  CXX      ../source/Compiler.lo
[...]
i686-w64-mingw32-windres   -o Compiler_ver.o ../../source/Compiler_ver.rc
  CXXLD    TECkit_Compiler_x86.la
/usr/bin/ld: unrecognized option '--add-stdcall-alias'
/usr/bin/ld: use the --help option for usage information
collect2: error: ld returned 1 exit status
Makefile:687: recipe for target 'TECkit_Compiler_x86.la' failed
[...]
The command "./build-windows-binaries.sh" exited with 2.

I looked into the ld error on Travis-CI:

$ /usr/bin/ld --version
GNU ld (GNU Binutils for Ubuntu) 2.26.1
$ /usr/bin/ld --help
Usage: /usr/bin/ld [options] file...
Options:
[...]
  --add-stdcall-alias                Export symbols with and without @nn
[...]

I'm not sure what the error means, since it appears that ld does have the --add-stdcall-alias flag.

But is this the right ld executable here? Or is there a different one from the MinGW packages that should be used?

In the process of looking into this, I found this warning from ../configure:

configure: WARNING: using cross tools not prefixed with host triplet

Searching for the warning led me to a page in the autoconf manual on Specifying target triplets. I wonder if that could be related to the ld error.

There are a number of posts around the internet with similar ld error reports, but the solution never seems to have anything to do with the ld flag --add-stdcall-alias itself. Plus, I'm assuming the TECkit developers use this script locally, so it must work in some places.

Remove Changelog?

During #13, I discovered that there is both a ChangeLog file (empty) and a NEWS file. Is the former left over from before git? Can it be removed?

Build with system expat?

Is it possible to build with the system expat library? Or is the bundled library always included?

Doesn't seem to work for characters in plane 1

I had a font with characters assigned to the PUA in plane 0. Later the script was accepted into the Unicode standard and is now in plane 1. I made a teckit map (by hand) to do the conversion, but couldn't get it to work in Ubuntu. So I booted into Windows and installed SIL Converters. When I opened the teckit editing program there, I noticed that the font glyph preview windows only let you preview characters in plane 0. I copied the map I had made into the editor and tried to use it in the test area. It seemed to convert the original PUA characters to some other codepoints in plane 0, but not to the plane 1 codepoints specified in the map.

Here is the map I made:

EncodingName            "SIL-Hispa-2018"
DescriptiveName         "Hispa.ttf font makes use of the Private Use Space of unicode to represent the Toto characters. Now they have been accepted into Unicode proper."
Version                 "0"
Contact                 "mailto:[email protected]"
RegistrationAuthority   "SIL International"
RegistrationName        "Hispa-2018"

RHSFlags		(ExpectsNFC)	;NFC means that when going from Unicode back to legacy, the incoming data will be NFC-normalized before the mapping rules are applied. You can't normalize the LHS legacy data.

;these lines should be included in all normal TECkit maps, for handling
;characters below 32.
ByteClass [CTL] = (   0x00 .. 0x1f   )
UniClass  [CTL] = ( U+0000 .. U+001f )
[CTL]	<>	[CTL]

pass(Unicode)

U+e600			<>	U+01e290			; 𞊐
U+e601			<>	U+01e291			; 𞊑
U+e602			<>	U+01e296			; 𞊖
U+e603			<>	U+01e292			; 𞊒
U+e604			<>	U+01e293			; 𞊓
U+e605			<>	U+01e297			; 𞊗
U+e606			<>	U+01e294			; 𞊔
U+e607			<>	U+01e295			; 𞊕
U+e608			<>	U+01e298			; 𞊘
U+e609			<>	U+01e299			; 𞊙
U+e60a			<>	U+01e29c			; 𞊜
U+e60b			<>	U+01e29f			; 𞊟
U+e60c			<>	U+01e29a			; 𞊚
U+e60d			<>	U+01e29d			; 𞊝
U+e60e			<>	U+01e2a0			; 𞊠
U+e60f			<>	U+01e29b			; 𞊛
U+e610			<>	U+01e29e			; 𞊞
U+e611			<>	U+01e2ae			; ◌𞊮
U+e612			<>	U+01e2a1			; 𞊡
U+e613			<>	U+01e2a2			; 𞊢
U+e614			<>	U+01e2a3			; 𞊣
U+e615			<>	U+01e2a5			; 𞊥
U+e616			<>	U+01e2a6			; 𞊦
U+e617			<>	U+01e2a7			; 𞊧
U+e618			<>	U+01e2a8			; 𞊨
U+e619			<>	U+01e2aa			; 𞊪
U+e61a			<>	U+01e2ab			; 𞊫
U+e61b			<>	U+01e2ac			; 𞊬
U+e61c			<>	U+01e2aa U+01e2ae	; 𞊪𞊮
U+e61d			<>	U+01e2ad			; 𞊭
U+e61e			<>	U+01e2ab U+01e29b	; 𞊫𞊛
U+e61f			<>	U+01e2a6 U+01e298	; 𞊦𞊘
U+e620			<>	U+01e2a9			; 𞊩
U+e622			<>	U+0027				; quotesingle
U+e623			<>	U+01e2a4			; 𞊤
U+e612 U+e621	<>	U+01e2a2			; 𞊢
U+e614 U+e621	<>	U+01e2a4			; 𞊤
U+e616 U+e621	<>	U+01e2a7			; 𞊧
U+e618 U+e621	<>	U+01e2a9			; 𞊩
U+e61a U+e621	<>	U+01e2ac			; 𞊬
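A plane 1 code point does not fit in a single 16-bit unit; in UTF-16 it is written as a surrogate pair of two code units in the range 0xD800 to 0xDFFF. The sketch below shows that arithmetic, which may help when checking whether the "other codepoints" seen in the test area are really the two halves of such a pair. Whether surrogates actually explain the behaviour reported here is an assumption, and `toSurrogates` is a hypothetical helper, not part of TECkit.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 high (first) and low (second) surrogate code units.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);   // BMP code points have no pair
    const std::uint32_t v = cp - 0x10000;      // 20-bit offset into planes 1..16
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),      // top 10 bits
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };  // low 10 bits
}
```

For U+1E290, the first target in the map, this gives the code units 0xD838 0xDE90.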

How to use converter/mapping in a web interface?

I have written a converter for Devanagari New (font) to Unicode. It is working well for our purposes but requires the conversion tools. Is there a way to use the mapping file such that I can call it from a simple web interface? I'd like to paste some Devanagari New text into a text box and get back Unicode text in another text box within a browser.

License problem with included SFconv/ConvertUTF.[ch] files

Hello!

Thanks for your really helpful TECkit package!

The license conditions say that the package is licensed under the CPL or the GNU LGPL. However, Debian has recently noticed that the included ConvertUTF.c and ConvertUTF.h files from Unicode, Inc. have different license conditions, ones which Debian has decided do not fulfil the Debian Free Software Guidelines. See Debian bug #823100 for a discussion of this. It turns out that this code is embedded within TeX Live too, as TECkit is used by XeTeX! And it is also embedded in a variety of other pieces of software, too.

Would it be feasible to either ask Unicode, Inc. to relicense this code, or to write replacement code which is licensed under the conditions of the rest of this package?

With many thanks!

P.S. This issue is now also being tracked in Debian for the experimental TECkit package at Bug #850438.

Compilation issues with v2.5.3

Building the TECkit-2.5.3 package as part of the xetex build fails during compilation:

In file included from ../../../source/libs/teckit/TECkit-2.5.3/source/Engine.cpp:120:0:
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69786’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]
 };
 ^
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69788’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69803’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]

I am using gcc 7.2.0
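The three rejected values (69786, 69788 and 69803) are all larger than 65535, the maximum a 16-bit unsigned integer can hold. Since C++11, a brace-enclosed initializer may not narrow a constant that does not fit the target type. The sketch below reproduces that range check in isolation. `fitsInUInt16` and `kWide` are hypothetical names, and storing the values in a 32-bit array is shown only to illustrate the constraint, not as the change upstream made.

```cpp
#include <cstdint>
#include <limits>

// True when `value` can be stored in a 16-bit unsigned integer without loss.
constexpr bool fitsInUInt16(long value) {
    return value >= 0 && value <= std::numeric_limits<std::uint16_t>::max();
}

// The values from the compiler output are all out of range for 16 bits.
static_assert(!fitsInUInt16(69786), "69786 needs more than 16 bits");
static_assert(!fitsInUInt16(69788), "69788 needs more than 16 bits");
static_assert(!fitsInUInt16(69803), "69803 needs more than 16 bits");

// A 32-bit element type holds them. Brace-initialising a std::uint16_t
// array with the same constants is what the compiler rejects as narrowing.
constexpr std::uint32_t kWide[] = { 69786, 69788, 69803 };
```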
