silnrsi / teckit

A Text Encoding Conversion toolkit

License: Other

Makefile 0.59% C 50.97% C++ 36.34% HTML 0.69% Shell 4.54% M4 0.09% Perl 0.62% XS 0.11% Batchfile 0.02% Java 0.07% SAS 0.04% Pascal 1.62% Ada 1.92% Assembly 0.35% C# 1.20% DIGITAL Command Language 0.59% Module Management System 0.03% Roff 0.17% Rich Text Format 0.05% Raku 0.01%

teckit's People

Contributors

neilmayhew hughp ygemici jlinton spl dalavancloud ppisar ujjwalsh tim-eves devosb

teckit's Issues

teckit-2.5.9.tar.gz bundles zlib-1.2.3 without a zlib license

teckit-2.5.9.tar.gz archive contains zlib-1.2.3 directory where many files contain this license declaration:

/* adler32.c -- compute the Adler-32 checksum of a data stream
 * Copyright (C) 1995-2004 Mark Adler
 * For conditions of distribution and use, see copyright notice in zlib.h
 */

But there is no zlib.h file. When I look at original zlib-1.2.3 sources, the zlib.h reads:

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

That means the teckit archive violates the zlib license, because the copyright notice was removed from the source distribution. Please add zlib.h, or at least the portion with the copyright notice, back to the teckit release archive. I can see that it exists in your git repository.

SFconv.cpp fails looking for ancient (not in tarball) expat/xmlparse/xmlparse.h

On macOS 10.14.6 with expat-2.4.1 installed locally:

g++ -DHAVE_CONFIG_H -I. -I..  -I../source/Public-headers -DXML_DTD  -I/sw/include -std=c++11 -g -O2 -DNDEBUG -MT ../SFconv/sfconv-SFconv.o -MD -MP -MF ../SFconv/.deps/sfconv-SFconv.Tpo -c -o ../SFconv/sfconv-SFconv.o `test -f '../SFconv/SFconv.cpp' || echo './'`../SFconv/SFconv.cpp
../SFconv/SFconv.cpp:49:10: fatal error: 'expat/xmlparse/xmlparse.h' file not found
#include "expat/xmlparse/xmlparse.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

In SFconv.cpp, the choice between ancient expat/xmlparse/xmlparse.h and current expat.h is set by HAVE_LIBEXPAT, and that's defined in config.h: #define HAVE_LIBEXPAT 1

The problem is config.h is only included in SFconv.cpp in certain cases:

teckit/SFconv/SFconv.cpp

Lines 35 to 40 in eda0d44

 #endif

 #ifndef platformUTF16
 #include "config.h"
 #if WORDS_BIGENDIAN
 #define platformUTF16 kForm_UTF16BE

It should be included much earlier, as it is in source/Engine.cpp.

Also, expat/xmlparse/xmlparse.h is not included in the tarball if anyone was trying to build without system-expat.
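The ordering problem can be shown in isolation. The sketch below is not taken from SFconv.cpp: the `#define` stands in for `#include "config.h"`, the check is assumed to be a plain `#ifdef`, and `xmlHeaderChoice` is a hypothetical helper that only reports which branch the preprocessor took.

```cpp
#include <string>

// Stand-in for `#include "config.h"`. In the real build, configure writes
// this macro into config.h, so config.h has to be included before the
// check below for the macro to have any effect.
#define HAVE_LIBEXPAT 1

#ifdef HAVE_LIBEXPAT
// System expat was found: the real source would include expat.h here.
static const char* const kXmlHeaderChoice = "expat.h";
#else
// Macro not visible at this point: falls back to the bundled header path.
static const char* const kXmlHeaderChoice = "expat/xmlparse/xmlparse.h";
#endif

// Hypothetical helper: reports which header the preprocessor selected.
std::string xmlHeaderChoice() { return kXmlHeaderChoice; }
```

If the `#define` line is moved below the `#ifdef` block, the fallback branch is taken even though the macro is defined later in the file, which matches the behaviour described above.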

-u 3 option for using unmapped character as the replacement character (self-replacement)

After investing a fair bit of time (and not reading all documentation first), I came to realize txtconv won't simply ignore characters with no mapping specified. Hence if my source doc is an unstructured mix of Unicode and legacy text, all of the Unicode text gets replaced with the default/specified replacement character.

Since txtconv has to replace characters missing a mapping with the replacement character anyways, is there any reason not to allow replacing the unmapped character with itself? I understand that txtconv seeks to guarantee the final output is uniformly of the designated encoding, but giving an option to bypass this constraint would make the tool even more flexible (and useful for many unstructured, mixed encoding scenarios).

Adding a "-u 3" option could be a way to implement this. If it is passed to txtconv, then txtconv leaves each unmapped character in place, in effect replacing it with itself. A warning could be printed if a self-replaced character does not match the desired final encoding.
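A minimal sketch of the requested behaviour follows. It does not use the TECkit engine or its API: the lookup table, the `UnmappedPolicy` enum, and `convertWithPolicy` are all hypothetical names, and the default replacement of U+FFFD is an assumption made for the example.

```cpp
#include <map>
#include <string>

// Hypothetical policy names; they are not part of txtconv or the TECkit API.
enum class UnmappedPolicy { UseReplacement, PassThrough };

// Converts `input` code point by code point using `table`.
// Code points without a table entry are either replaced by `replacement`
// or copied through unchanged, depending on `policy`.
std::u32string convertWithPolicy(const std::u32string& input,
                                 const std::map<char32_t, std::u32string>& table,
                                 UnmappedPolicy policy,
                                 char32_t replacement = U'\uFFFD') {
    std::u32string out;
    for (char32_t ch : input) {
        auto it = table.find(ch);
        if (it != table.end()) {
            out += it->second;          // mapped: emit the mapping
        } else if (policy == UnmappedPolicy::PassThrough) {
            out += ch;                  // unmapped: self-replacement
        } else {
            out += replacement;         // unmapped: replacement character
        }
    }
    return out;
}
```

With a table that maps only `a` to `A`, the input `ab` becomes `Ab` under `PassThrough` and `A` followed by U+FFFD under `UseReplacement`.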

Mixing of FLAGS (expat_CFLAGS="$CFLAGS")

This issue was reported as bug #338110 for the Gentoo Linux app-text/teckit-2.5.1 package. I don't know if it was ever reported upstream to you, but I thought it would be a good idea to make a note of it here and see what you think.

The report was:

I found a mixing of FLAGS:

/bin/sh ../libtool --tag=CXX   --mode=link x86_64-pc-linux-gnu-g++  -O2 -pipe -march=core2 -frecord-gcc-switches -mssse3 -mcx16 -mmmx -g -Wmissing-include-dirs -Wenum-compare -DNDEBUG  -Wl,-O1 -Wl,--as-needed -Wl,-O1,--hash-style=gnu,--sort-common -o txtconv TxtConv.o ../lib/libTECkit.la -lz
libtool: link: x86_64-pc-linux-gnu-g++ -O2 -pipe -march=core2 -frecord-gcc-switches -mssse3 -mcx16 -mmmx -g -Wmissing-include-dirs -Wenum-compare -DNDEBUG -Wl,-O1 -Wl,-O1 -Wl,--hash-style=gnu -Wl,--sort-common -o .libs/txtconv TxtConv.o  -Wl,--as-needed ../lib/.libs/libTECkit.so -lz
cc1plus: warning: command line option "-Wimplicit-function-declaration" is valid for C/ObjC but not for C++

The fix is this patch:

No need to pass CFLAGS twice, esp. if they are used to feed g++
Bug #338110

Index: TECkit_2_5_1/configure.ac
===================================================================
--- TECkit_2_5_1.orig/configure.ac
+++ TECkit_2_5_1/configure.ac
@@ -76,7 +76,7 @@ noexpat_CFLAGS="$CFLAGS"
 noexpat_LIBS="$LIBS"
 AC_CHECK_LIB(expat, XML_ExpatVersion)
 AM_CONDITIONAL(SYSTEM_EXPAT, test x$ac_cv_lib_expat_XML_ExpatVersion = xyes)
-expat_CFLAGS="$CFLAGS"
+expat_CFLAGS=""
 expat_LIBS="$LIBS"
 CFLAGS="$noexpat_CFLAGS"
 LIBS="$noexpat_LIBS"

The patch is still being applied in Gentoo's current package, app-text/teckit-2.5.6.

Remove README?

During #13, I discovered that there is both a README.md file and a README file. I think it would be a good idea to remove one, probably the README, since GitHub's markdown renderer does not format it nicely. The information in the README should then be transferred to README.md, to NEWS, or both. I think the information about changes would be most appropriate in NEWS.

Fix Travis-CI for Windows

Travis-CI is failing in the cross-compilation of Windows binaries. I believe the error is here:

make[2]: Entering directory '/home/travis/build/spl/teckit/windows-build32/lib'
  CXX      ../source/Compiler.lo
[...]
i686-w64-mingw32-windres   -o Compiler_ver.o ../../source/Compiler_ver.rc
  CXXLD    TECkit_Compiler_x86.la
/usr/bin/ld: unrecognized option '--add-stdcall-alias'
/usr/bin/ld: use the --help option for usage information
collect2: error: ld returned 1 exit status
Makefile:687: recipe for target 'TECkit_Compiler_x86.la' failed
[...]
The command "./build-windows-binaries.sh" exited with 2.

I looked into the ld error on Travis-CI:

$ /usr/bin/ld --version
GNU ld (GNU Binutils for Ubuntu) 2.26.1

$ /usr/bin/ld --help
Usage: /usr/bin/ld [options] file...
Options:
[...]
  --add-stdcall-alias                Export symbols with and without @nn
[...]

I'm not sure what the error means, since it appears that ld does have the --add-stdcall-alias flag.

But is this the right ld executable here? Or is there a different one from the MinGW packages that should be used?

In the process of looking into this, I found this warning from ../configure:

configure: WARNING: using cross tools not prefixed with host triplet

Searching for the warning led me to a page in the autoconf manual on Specifying target triplets. I wonder if that could be related to the ld error.

There are a number of posts around the internet with similar ld error reports, but the solution never seems to have anything to do with the ld flag --add-stdcall-alias itself. Plus, I'm assuming the TECkit developers use this script locally, so it must work in some places.

syntax error near unexpected token `config.h'

I get an error when attempting to run ./configure from a new download of TECkit 2.5.10 (from https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=TECkitDownloads) on macOS Mojave version 10.14.5:

./configure: line 2118: syntax error near unexpected token `config.h'
./configure: line 2118: `AM_CONFIG_HEADER(config.h)'

Any help would be appreciated, please let me know if I should provide any additional information.

Remove Changelog?

During #13, I discovered that there is both a ChangeLog file (empty) and a NEWS file. Is the former left over from before git? Can it be removed?

Build with system expat?

Is it possible to build with the system expat library? Or is the bundled library always included?

Compiler warnings from TeX Live

TeX Live uses part of TECkit and has patches at
https://www.tug.org/svn/texlive/trunk/Build/source/libs/teckit/TLpatches/
to eliminate compiler warnings.

A low priority issue would be to update the TECkit source so most of these patches are not needed.
For now, TeX Live is OK with the patches.

Doesn't seem to work for characters in plane 1

I had a font with characters assigned to the PUA in plane 0. Later the script was accepted into the Unicode standard and is now in plane 1. I made a TECkit map (by hand) to do the conversion, but couldn't get it to work in Ubuntu. So I booted into Windows and installed SIL Converters. When I opened the TECkit editing program there, I noticed that the font glyph preview windows only let you preview characters in plane 0. I copied the map I had made into the editor and tried to use it in the test area. It seemed to convert the original PUA characters to some other codepoints in plane 0, but not to the plane 1 codepoints specified in the map.

Here is the map I made:

EncodingName            "SIL-Hispa-2018"
DescriptiveName         "Hispa.ttf font makes use of the Private Use Space of unicode to represent the Toto characters. Now they have been accepted into Unicode proper."
Version                 "0"
Contact                 "mailto:[email protected]"
RegistrationAuthority   "SIL International"
RegistrationName        "Hispa-2018"

RHSFlags		(ExpectsNFC)	;NFC means that when going from Unicode back to legacy, the incoming data will be NFC-normalized before the mapping rules are applied. You can't normalize the LHS legacy data.

;these lines should be included in all normal TECkit maps, for handling
;characters below 32.
ByteClass [CTL] = (   0x00 .. 0x1f   )
UniClass  [CTL] = ( U+0000 .. U+001f )
[CTL]	<>	[CTL]

pass(Unicode)

U+e600			<>	U+01e290			; 𞊐
U+e601			<>	U+01e291			; 𞊑
U+e602			<>	U+01e296			; 𞊖
U+e603			<>	U+01e292			; 𞊒
U+e604			<>	U+01e293			; 𞊓
U+e605			<>	U+01e297			; 𞊗
U+e606			<>	U+01e294			; 𞊔
U+e607			<>	U+01e295			; 𞊕
U+e608			<>	U+01e298			; 𞊘
U+e609			<>	U+01e299			; 𞊙
U+e60a			<>	U+01e29c			; 𞊜
U+e60b			<>	U+01e29f			; 𞊟
U+e60c			<>	U+01e29a			; 𞊚
U+e60d			<>	U+01e29d			; 𞊝
U+e60e			<>	U+01e2a0			; 𞊠
U+e60f			<>	U+01e29b			; 𞊛
U+e610			<>	U+01e29e			; 𞊞
U+e611			<>	U+01e2ae			; ◌𞊮
U+e612			<>	U+01e2a1			; 𞊡
U+e613			<>	U+01e2a2			; 𞊢
U+e614			<>	U+01e2a3			; 𞊣
U+e615			<>	U+01e2a5			; 𞊥
U+e616			<>	U+01e2a6			; 𞊦
U+e617			<>	U+01e2a7			; 𞊧
U+e618			<>	U+01e2a8			; 𞊨
U+e619			<>	U+01e2aa			; 𞊪
U+e61a			<>	U+01e2ab			; 𞊫
U+e61b			<>	U+01e2ac			; 𞊬
U+e61c			<>	U+01e2aa U+01e2ae	; 𞊪𞊮
U+e61d			<>	U+01e2ad			; 𞊭
U+e61e			<>	U+01e2ab U+01e29b	; 𞊫𞊛
U+e61f			<>	U+01e2a6 U+01e298	; 𞊦𞊘
U+e620			<>	U+01e2a9			; 𞊩
U+e622			<>	U+0027				; quotesingle
U+e623			<>	U+01e2a4			; 𞊤
U+e612 U+e621	<>	U+01e2a2			; 𞊢
U+e614 U+e621	<>	U+01e2a4			; 𞊤
U+e616 U+e621	<>	U+01e2a7			; 𞊧
U+e618 U+e621	<>	U+01e2a9			; 𞊩
U+e61a U+e621	<>	U+01e2ac			; 𞊬
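When checking the converted output by hand, it may help to know what the plane 1 targets look like in UTF-16, where each one is a surrogate pair, not a single 16-bit unit. The sketch below only does the standard UTF-16 arithmetic; `toSurrogates` is a hypothetical helper, not part of TECkit, and it says nothing about where the reported problem actually lies.

```cpp
#include <cstdint>
#include <utility>

// Returns the {high, low} UTF-16 surrogates for a supplementary-plane
// code point. Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset into planes 1..16
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // top 10 bits
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) }; // low 10 bits
}
```

For the first target in the map, U+01e290, this gives the pair 0xD838 0xDE90. Output that contains only one 16-bit value per converted character would therefore indicate that the upper bits were lost somewhere along the way.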

installed teckit from 18.04 crashes

The attached text file and .tec file cause txtconv to segfault.
Big_mouth_frog_story-kali.txt
kali-mymr.zip

Old FSF address

The license/License_LGPLv21.txt file quotes a postal address for the Free Software Foundation that is no longer valid. The current one can be found at https://www.gnu.org/licenses/old-licenses/lgpl-2.1.txt. Please update the license wording to provide an up-to-date address to your users.

How to use converter/mapping in a web interface?

I have written a converter for Devanagari New (font) to Unicode. It is working well for our purposes but requires the conversion tools. Is there a way to use the mapping file such that I can call it from a simple web interface? I'd like to paste some Devanagari New text into a text box and get back Unicode text in another text box within a browser.

License problem with included SFconv/ConvertUTF.[ch] files

Hello!

Thanks for your really helpful TECkit package!

The license conditions say that the package is licensed under the CPL or the GNU LGPL. However, Debian has recently noticed that the included ConvertUTF.c and ConvertUTF.h files from Unicode, Inc. have different license conditions, ones which Debian has decided do not fulfil the Debian Free Software Guidelines. See Debian bug #823100 for a discussion of this. It turns out that this code is embedded within TeX Live too, as TECkit is used by XeTeX! And it is also embedded in a variety of other pieces of software, too.

Would it be feasible to either ask Unicode, Inc. to relicense this code, or to write replacement code which is licensed under the conditions of the rest of this package?

With many thanks!

P.S. This issue is now also being tracked in Debian for the experimental TECkit package at Bug #850438.

Compilation issues with v2.5.3

Building the TECkit-2.5.3 package as part of the xetex build fails during compilation:

In file included from ../../../source/libs/teckit/TECkit-2.5.3/source/Engine.cpp:120:0:
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69786’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]
 };
 ^
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69788’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]
../../../source/libs/teckit/TECkit-2.5.3/source/NormalizationData.c:2575:1: error: narrowing conversion of ‘69803’ from ‘int’ to ‘UInt16 {aka short unsigned int}’ inside { } [-Wnarrowing]

I am using gcc 7.2.0
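The values in the error messages are all above 65535, so they cannot be stored in a 16-bit unsigned element, and C++11 list-initialization treats that as an error instead of silently truncating. The sketch below reproduces the situation outside TECkit. `UInt16` is assumed to match the alias named in the error message, while `UInt32`, `fitsInUInt16`, and the array `ok` are hypothetical and are not a proposed change to NormalizationData.c.

```cpp
#include <cstdint>
#include <limits>

typedef std::uint16_t UInt16;  // assumed to match the alias in the error message
typedef std::uint32_t UInt32;  // hypothetical wider element type

// True if `value` can be stored in a UInt16 without narrowing.
constexpr bool fitsInUInt16(long value) {
    return value >= 0 && value <= std::numeric_limits<UInt16>::max();
}

static_assert(!fitsInUInt16(69786), "69786 is above 65535");

// UInt16 bad[] = { 69786 };  // ill-formed under C++11: narrowing inside { }
const UInt32 ok[] = { 69786, 69788, 69803 };  // a wider type holds the values
```

Whether the right fix in the real table is a wider element type, different data, or a compiler flag is a separate question that this sketch does not answer.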
