
html2xhtml's Introduction

Html2xhtml

Html2xhtml is a command-line tool that converts HTML files to XHTML files. The path of the HTML input file can be provided as a command-line argument. If no path is given, the input is read from stdin.

Html2xhtml always tries to generate valid XHTML files. It can correct many common errors in input HTML files without loss of information. However, for some errors, html2xhtml may discard some information in order to generate valid XHTML output. This can be avoided with the -e option, which allows html2xhtml to generate non-valid output in these cases.

Html2xhtml can generate XHTML output compliant with one of the following document types: XHTML 1.0 (Transitional, Strict and Frameset), XHTML 1.1, XHTML Basic and XHTML Mobile Profile.

HOW TO RUN THE PROGRAM

For full information about how to run the program see doc/html2xhtml.txt in the source code distribution, the html2xhtml.txt file in the Windows binaries ZIP file or the html2xhtml manpage. Some examples are shown below.

  • By default, the program reads the input file from its standard input and dumps the output file to its standard output:
cat input.html | html2xhtml
  • The input can also be specified as a command line argument:
html2xhtml input.html
  • In order to save the output to a file, redirect the standard output:
html2xhtml input.html > output.html
  • Alternatively, you can specify the output file name with the -o option:
html2xhtml input.html -o output.html
  • Select the document type of the output with -t:
html2xhtml input.html -t 1.1 -o output.html

The available values are:

  • transitional: XHTML 1.0 Transitional
  • frameset: XHTML 1.0 Frameset
  • strict: XHTML 1.0 Strict
  • 1.1: XHTML 1.1
  • basic-1.0: XHTML Basic 1.0
  • basic-1.1: XHTML Basic 1.1
  • mp: XHTML Mobile Profile
  • print-1.0: XHTML Print 1.0

Use "transitional" if you just want to tidy up the markup.

Choose an output character encoding (by default, the program uses the character encoding detected in the input):

html2xhtml input.html --ocs utf-8 -o output.html

Get the list of available character sets:

./src/html2xhtml --lcs

HOW TO COMPILE AND INSTALL THE PROGRAM FROM THE SOURCE TARBALL

Enter the main directory of the source distribution and type:

$ ./configure
$ make

You can run the test battery in order to check that the program is working as expected:

$ cd tests
$ ./test.sh
$ cd ..

If you want to install the program on your system, then type (it may require root privileges):

$ make install 

See ./INSTALL for more information.

The program has been tested to compile on GNU/Linux and MinGW in Windows. In MinGW the actual EXE file to use is the one the compiler creates inside src\.libs instead of the one in src. It depends on the libiconv-2.dll file, which is distributed with MinGW (inside the bin\ subdirectory of the main MinGW installation directory).

HOW TO COMPILE AND INSTALL THE PROGRAM FROM THE GIT SOURCES

The source code in the Git repository does not include the files generated by the autotools. In order to build the ./configure script, run the following commands from the main directory of the sources:

$ aclocal
$ libtoolize
$ touch config.rpath
$ autoheader
$ automake --add-missing
$ autoconf

On OS X you need to use the glibtoolize command instead of libtoolize.

After that, you should get the ./configure script and proceed as explained above:

$ ./configure
$ make

html2xhtml's People

Contributors

abhijitkr, crygin, jfisteus

html2xhtml's Issues

Unable to compile on mac

make
/Applications/Xcode.app/Contents/Developer/usr/bin/make all-recursive
Making all in src
gcc -DHAVE_CONFIG_H -I. -I.. -std=c99 -g -O2 -MT html.o -MD -MP -MF .deps/html.Tpo -c -o html.o html.c
html.l:137:19: error: expected '}'
#line 138 "html.l"
^
html.c:1170:2: note: to match this '{'
{ /* beginning of action switch */
^
html.l:137:19: error: expected '}'
#line 138 "html.l"
^
html.c:1124:3: note: to match this '{'
{
^
html.l:137:19: error: expected '}'
#line 138 "html.l"
^
html.c:1086:1: note: to match this '{'
{
^
3 errors generated.
make[2]: *** [html.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Compiled in AIX - Issue found

Hello,
I was able to compile html2xhtml in AIX 7.1. Took a few tries.
I can comment on how I did it if needed.
Issue I found:
I kept getting this error during execution:
iconv_open: invalid argument aix
After a ton of Google searches, I finally started looking at the code.
It turns out the case of the values in the charset_aliases struct was the issue. They do not match the case returned by "iconv -l", which I believe causes the error.
To correct, I changed iso-5589-1 to ISO5589-1 and the conversion worked!!
So please update the values of charset_aliases.
Thank you,
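
One way to tolerate the case mismatch described above is to retry iconv_open with an upper-cased charset name when the first attempt fails. The following is a minimal sketch, assuming a POSIX iconv; upper_copy and open_charset are hypothetical helpers written for illustration and are not functions from the html2xhtml sources. It only handles differences in letter case, not spelling variants such as a missing hyphen.

```c
#include <ctype.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: copy src into dst (capacity cap), upper-casing
   letters. Returns 0 on success, -1 if src does not fit. */
int upper_copy(char *dst, size_t cap, const char *src)
{
    size_t i;

    if (cap == 0)
        return -1;
    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= cap)
            return -1;
        dst[i] = (char)toupper((unsigned char)src[i]);
    }
    dst[i] = '\0';
    return 0;
}

/* Hypothetical helper: open a converter from `from` to `to`, first with
   the name as given and then, if that fails, with an upper-cased copy.
   Returns (iconv_t)-1 when both attempts fail. */
iconv_t open_charset(const char *to, const char *from)
{
    char upper[64];
    iconv_t cd = iconv_open(to, from);

    if (cd != (iconv_t)-1)
        return cd;
    if (upper_copy(upper, sizeof upper, from) != 0)
        return (iconv_t)-1;
    return iconv_open(to, upper);
}
```

A caller would use open_charset in place of a direct iconv_open call and release the descriptor with iconv_close when done.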

issue : Too small preload buffer

Hi,
I have been using this tool for about a year, and I have recently started getting the error "Too small preload buffer". I call it from a PowerShell script with html2xhtml $file.FullName -o $tempFileName . The problem also occurs on a different machine and when I call the tool directly from the command line instead of from the PowerShell script. I noticed that it leaves the input file at 0 KB.

Thank you!
-swetha

node.js example of using HTTP API

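var http = require('http');

// Send HTML to the html2xhtml CGI service and pass the converted output to callback
function html2xhtml(data,callback){
    var options = {
        host: 'www.it.uc3m.es',
        port: 80,
        path: '/jaf/cgi-bin/html2xhtml.cgi',
        method: 'POST'
    };

    options.headers = {
        'Content-Type': 'text/html',
        // Use the byte length so that strings with multi-byte characters are sized correctly
        'Content-Length': Buffer.byteLength(data)
    };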

    var req = http.request(options, function(res) {
        res.setEncoding('utf8');
        var body = '';
        res.on('data', function (chunk) {
            body = body + chunk;
        });
        res.on('error',function(err){
            console.log(err);
        });
        res.on('end',function(){
            if(callback){
                callback(body);
            }
        });
    });
    req.on('error',function(err){
        console.log(err);
    });
    req.write(data);
    req.end();
}

Usage:

var data = require('fs').readFileSync('index.html');
html2xhtml(data,function(fixed){
    require('fs').writeFileSync('index2.html',fixed);
});

Maybe you would like to publish this code as a separate node.js module in the npm registry? Let me know; I can and want to help.

Build fails on Fedora 28

At first the build failed because I didn't have flex installed, so I did dnf install flex. Now I get this:

[robin@laptop html2xhtml-html2xhtml-1.3]$ make
make  all-recursive
make[1]: Entering directory '/home/robin/opened/html2xhtml-html2xhtml-1.3'
Making all in src
make[2]: Entering directory '/home/robin/opened/html2xhtml-html2xhtml-1.3/src'
gcc -DHAVE_CONFIG_H -I. -I..    -std=c99 -g -O2 -MT dtd.o -MD -MP -MF .deps/dtd.Tpo -c -o dtd.o dtd.c
mv -f .deps/dtd.Tpo .deps/dtd.Po
gcc -DHAVE_CONFIG_H -I. -I..    -std=c99 -g -O2 -MT dtd_names.o -MD -MP -MF .deps/dtd_names.Tpo -c -o dtd_names.o dtd_names.c
mv -f .deps/dtd_names.Tpo .deps/dtd_names.Po
gcc -DHAVE_CONFIG_H -I. -I..    -std=c99 -g -O2 -MT dtd_util.o -MD -MP -MF .deps/dtd_util.Tpo -c -o dtd_util.o dtd_util.c
mv -f .deps/dtd_util.Tpo .deps/dtd_util.Po
gcc -DHAVE_CONFIG_H -I. -I..    -std=c99 -g -O2 -MT htmlgr.o -MD -MP -MF .deps/htmlgr.Tpo -c -o htmlgr.o htmlgr.c
mv -f .deps/htmlgr.Tpo .deps/htmlgr.Po
/bin/sh ../ylwrap html.l .c html.c -- /bin/sh /home/robin/opened/html2xhtml-html2xhtml-1.3/missing flex  
lex.yyhtml.c is unchanged
gcc -DHAVE_CONFIG_H -I. -I..    -std=c99 -g -O2 -MT html.o -MD -MP -MF .deps/html.Tpo -c -o html.o html.c
gcc: error: html.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
make[2]: *** [Makefile:472: html.o] Error 1
make[2]: Leaving directory '/home/robin/opened/html2xhtml-html2xhtml-1.3/src'
make[1]: *** [Makefile:407: all-recursive] Error 1
make[1]: Leaving directory '/home/robin/opened/html2xhtml-html2xhtml-1.3'
make: *** [Makefile:339: all] Error 2

Numerous warnings and an error when compiling with gcc under Mac OS X

Using ./configure, then make, I get numerous warnings that look like they might be unintentional usages of '=' (assignment) instead of '==' (logical equals). There is also a fatal error. Diagnostic output shown below. gcc --version yields the following

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin12.5.0
Thread model: posix

make all-recursive
Making all in src
gcc -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT charset.o -MD -MP -MF .deps/charset.Tpo -c -o charset.o charset.c
charset.c:143:40: warning: incompatible pointer types passing 'int *' to parameter of type 'size_t *' (aka 'unsigned long *') [-Wincompatible-pointer-types]
iconv (cd, NULL, NULL, &bufferpos, &avail);
^~~~~~
/usr/include/iconv.h:74:69: note: passing argument to parameter here
char ** __restrict /*outbuf*/, size_t * __restrict /*outbytesleft*/);
^
^
charset.c:147:17: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
if (wrote < 0) {
~~~~~ ^ ~
charset.c:194:37: warning: incompatible pointer types passing 'int *' to parameter of type 'size_t *' (aka 'unsigned long *') [-Wincompatible-pointer-types]
nconv = iconv(cd, &bufferpos, &avail, &outbuf, &outbuf_max);
^~~~~~
/usr/include/iconv.h:73:68: note: passing argument to parameter here
char ** __restrict /*inbuf*/, size_t * __restrict /*inbytesleft*/,
^
^
charset.c:267:5: error: non-void function 'charset_write' should return a value [-Wreturn-type]
return;
^
charset.c:274:48: warning: incompatible pointer types passing 'int *' to parameter of type 'size_t *' (aka 'unsigned long *') [-Wincompatible-pointer-types]
nconv = iconv(cd, &bufpos, &n, &bufferpos, &avail);
^~~~~~
/usr/include/iconv.h:74:69: note: passing argument to parameter here
char ** __restrict /*outbuf*/, size_t * __restrict /*outbytesleft*/);
^
^
charset.c:446:23: warning: using the result of an assignment as a condition without parentheses [-Wparentheses]
} else if (buf[0] = 0x3c) {
~~~~~~~^~~~~~
charset.c:446:23: note: place parentheses around the assignment to silence this warning
} else if (buf[0] = 0x3c) {
^
( )
charset.c:446:23: note: use '==' to turn this assignment into an equality comparison
} else if (buf[0] = 0x3c) {
^
==
charset.c:456:23: warning: using the result of an assignment as a condition without parentheses [-Wparentheses]
} else if (buf[0] = 0x4c && buf[1] && buf[2] && buf[3]) {
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
charset.c:456:23: note: place parentheses around the assignment to silence this warning
} else if (buf[0] = 0x4c && buf[1] && buf[2] && buf[3]) {
^
( )
charset.c:456:23: note: use '==' to turn this assignment into an equality comparison
} else if (buf[0] = 0x4c && buf[1] && buf[2] && buf[3]) {
^
==
charset.c:544:16: warning: implicit declaration of function 'tolower' is invalid in C99 [-Wimplicit-function-declaration]
buf[len] = tolower(buffer[i]);
^
7 warnings and 1 error generated.
make[2]: *** [charset.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
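
The -Wparentheses warnings in the log above concern conditions such as buf[0] = 0x3c, which assign instead of compare. Because && binds more tightly than =, a condition like buf[0] = 0x4c && buf[1] && buf[2] && buf[3] stores the 0 or 1 result of the whole && chain into buf[0]. The sketch below contrasts the two forms; both function names are hypothetical, and it assumes, as the compiler note suggests, that an equality test was intended.

```c
#include <stddef.h>

/* Hypothetical check: does the buffer start with '<' (0x3c)?
   Uses '==' so the buffer is only read, never modified. */
int starts_with_lt(const unsigned char *buf, size_t len)
{
    return len >= 1 && buf[0] == 0x3c;
}

/* Reproduces the flagged form to show its effect: '=' has lower
   precedence than '&&', so buf[0] is overwritten with the 0/1 result
   of (0x4c && buf[1] && buf[2] && buf[3]). The extra parentheses only
   silence the warning; they do not change the behaviour. */
int flagged_form(unsigned char *buf)
{
    return (buf[0] = 0x4c && buf[1] && buf[2] && buf[3]);
}
```

The first function leaves its input untouched, while the second one silently rewrites the first byte of the buffer it was meant to inspect.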

Hello

Thanks for this program. It is one of the rare tools that I have seen working well. I was wondering whether it is possible to convert multiple documents at the same time.

Thanks again.

How to perform syntax correction only and not deletion?

Hello,

I have the following HTML file:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Cover
  </title>
  <link href="springer_epub.css" rel="styleSheet" type="text/css">
 </head>
 <body>
  <mjx-container class="MathJax" jax="CHTML" style="font-size: 113.1%; position: relative;"><mjx-math class=" MJX-TEX" aria-hidden="true"><mjx-mo class="mjx-n"><mjx-c class="mjx-c2225"></mjx-c></mjx-mo><mjx-TeXAtom texclass="ORD"><mjx-mi class="mjx-cal mjx-i"><mjx-c class="mjx-c50 TEX-C"></mjx-c></mjx-mi></mjx-TeXAtom><mjx-mo class="mjx-n"><mjx-c class="mjx-c2225"></mjx-c></mjx-mo></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo data-mjx-texclass="ORD"></mo><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">P</mi></mrow><mo data-mjx-texclass="ORD"></mo></math></mjx-assistive-mml></mjx-container>
       of a partition (or of a tagged partition)
       <mjx-container class="MathJax" jax="CHTML" style="font-size: 113.1%; position: relative;"><mjx-math class=" MJX-TEX" aria-hidden="true"><mjx-TeXAtom texclass="ORD"><mjx-mi class="mjx-cal mjx-i"><mjx-c class="mjx-c50 TEX-C"></mjx-c></mjx-mi></mjx-TeXAtom></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">P</mi></mrow></math></mjx-assistive-mml></mjx-container>
       is defined by
       <mjx-container class="MathJax" jax="CHTML" style="font-size: 113.1%; position: relative;"><mjx-math class=" MJX-TEX" aria-hidden="true"><mjx-mo class="mjx-n"><mjx-c class="mjx-c2225"></mjx-c></mjx-mo><mjx-TeXAtom texclass="ORD"><mjx-mi class="mjx-cal mjx-i"><mjx-c class="mjx-c50 TEX-C"></mjx-c></mjx-mi></mjx-TeXAtom><mjx-mo class="mjx-n"><mjx-c class="mjx-c2225"></mjx-c></mjx-mo><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-munder space="4" limits="false"><mjx-mo class="mjx-n"><mjx-c class="mjx-c6D"></mjx-c><mjx-c class="mjx-c61"></mjx-c><mjx-c class="mjx-c78"></mjx-c></mjx-mo><mjx-script style="vertical-align: -0.15em;"><mjx-TeXAtom size="s" texclass="ORD"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D456 TEX-I"></mjx-c></mjx-mi></mjx-TeXAtom></mjx-script></mjx-munder><mjx-mo class="mjx-n"><mjx-c class="mjx-c28"></mjx-c></mjx-mo><mjx-msub><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: -0.15em;"><mjx-TeXAtom size="s" texclass="ORD"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D456 TEX-I"></mjx-c></mjx-mi></mjx-TeXAtom></mjx-script></mjx-msub><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-msub space="3"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: -0.15em;"><mjx-TeXAtom size="s" texclass="ORD"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D456 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-mn class="mjx-n"><mjx-c class="mjx-c31"></mjx-c></mjx-mn></mjx-TeXAtom></mjx-script></mjx-msub><mjx-mo class="mjx-n"><mjx-c class="mjx-c29"></mjx-c></mjx-mo></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo data-mjx-texclass="ORD"></mo><mrow data-mjx-texclass="ORD"><mi data-mjx-variant="-tex-calligraphic" mathvariant="script">P</mi></mrow><mo data-mjx-texclass="ORD"></mo><mo>=</mo><munder><mo 
data-mjx-texclass="OP" movablelimits="true">max</mo><mrow data-mjx-texclass="ORD"><mi>i</mi></mrow></munder><mo stretchy="false">(</mo><msub><mi>a</mi><mrow data-mjx-texclass="ORD"><mi>i</mi></mrow></msub><mo></mo><msub><mi>a</mi><mrow data-mjx-texclass="ORD"><mi>i</mi><mo></mo><mn>1</mn></mrow></msub><mo stretchy="false">)</mo></math></mjx-assistive-mml></mjx-container>
 </body>
</html>

Applying html2xhtml on this document gives the following:

<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Cover
    </title>
    <link href="springer_epub.css" rel="styleSheet" type="text/css" />
  </head>
  <body>
    ∥P∥ of a partition (or of a tagged partition) P is defined by ∥P∥=maxi(ai−ai−1)
  </body>
</html>

We can see that html2xhtml has added a slash (/) to close the link tag, which is a good thing. However, it also removed all the markup in <mjx-container>...</mjx-container> (in addition to converting the HTML code to its displayed characters).

I would like to keep <mjx-container>...</mjx-container> unchanged. Could you please tell me if that's possible?

Thank you in advance for your help!

Out of Bounds Read In static void elm_close(tree_node_t *nodo)

Hi there!

Great work on html2xhtml, I find myself using it quite often. While I was using the tool I created some fuzz tests to run in the background. A couple of test cases led to a segfault when using the '-t frameset' option, which led me to further investigate the crash.

Valgrind

I started with Valgrind, which reported an invalid read of size 4 in each of the test cases:

==1040381== Memcheck, a memory error detector
==1040381== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1040381== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1040381== Command: ./src/html2xhtml -t frameset report/vuln/id:000000,sig:11,src:001386+001369,time:12081510,execs:2336913,op:splice,rep:16
==1040381== 
==1040381== Invalid read of size 4
==1040381==    at 0x40E911: elm_close (procesador.c:944)
==1040381==    by 0x410617: err_html_struct (procesador.c:1889)
==1040381==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040381==    by 0x40F20A: elm_close (procesador.c:959)
==1040381==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040381==    by 0x40DF7A: main (html2xhtml.c:117)
==1040381==  Address 0x6f20d4 is not stack'd, malloc'd or (recently) free'd
==1040381== 
==1040381== 
==1040381== Process terminating with default action of signal 11 (SIGSEGV)
==1040381==  Access not within mapped region at address 0x6F20D4
==1040381==    at 0x40E911: elm_close (procesador.c:944)
==1040381==    by 0x410617: err_html_struct (procesador.c:1889)
==1040381==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040381==    by 0x40F20A: elm_close (procesador.c:959)
==1040381==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040381==    by 0x40DF7A: main (html2xhtml.c:117)
==1040381==  If you believe this happened as a result of a stack
==1040381==  overflow in your program's main thread (unlikely but
==1040381==  possible), you can try to increase the size of the
==1040381==  main thread stack using the --main-stacksize= flag.
==1040381==  The main thread stack size used in this run was 8388608.
==1040381== 
==1040381== HEAP SUMMARY:
==1040381==     in use at exit: 88,190 bytes in 13 blocks
==1040381==   total heap usage: 22 allocs, 9 frees, 2,218,413 bytes allocated
==1040381== 
==1040381== LEAK SUMMARY:
==1040381==    definitely lost: 0 bytes in 0 blocks
==1040381==    indirectly lost: 0 bytes in 0 blocks
==1040381==      possibly lost: 0 bytes in 0 blocks
==1040381==    still reachable: 88,190 bytes in 13 blocks
==1040381==         suppressed: 0 bytes in 0 blocks
==1040381== Rerun with --leak-check=full to see details of leaked memory
==1040381== 
==1040381== For lists of detected and suppressed errors, rerun with: -s
==1040381== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==1040419== Memcheck, a memory error detector
==1040419== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1040419== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1040419== Command: ./src/html2xhtml -t frameset report/vuln/id:000001,sig:11,src:001386+001369,time:12316330,execs:2651995,op:splice,rep:16
==1040419== 
==1040419== Invalid read of size 4
==1040419==    at 0x40E911: elm_close (procesador.c:944)
==1040419==    by 0x410617: err_html_struct (procesador.c:1889)
==1040419==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040419==    by 0x40F20A: elm_close (procesador.c:959)
==1040419==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040419==    by 0x40DF7A: main (html2xhtml.c:117)
==1040419==  Address 0x6efc84 is not stack'd, malloc'd or (recently) free'd
==1040419== 
==1040419== 
==1040419== Process terminating with default action of signal 11 (SIGSEGV)
==1040419==  Access not within mapped region at address 0x6EFC84
==1040419==    at 0x40E911: elm_close (procesador.c:944)
==1040419==    by 0x410617: err_html_struct (procesador.c:1889)
==1040419==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040419==    by 0x40F20A: elm_close (procesador.c:959)
==1040419==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040419==    by 0x40DF7A: main (html2xhtml.c:117)
==1040419==  If you believe this happened as a result of a stack
==1040419==  overflow in your program's main thread (unlikely but
==1040419==  possible), you can try to increase the size of the
==1040419==  main thread stack using the --main-stacksize= flag.
==1040419==  The main thread stack size used in this run was 8388608.
==1040419== 
==1040419== HEAP SUMMARY:
==1040419==     in use at exit: 88,190 bytes in 13 blocks
==1040419==   total heap usage: 22 allocs, 9 frees, 2,218,413 bytes allocated
==1040419== 
==1040419== LEAK SUMMARY:
==1040419==    definitely lost: 0 bytes in 0 blocks
==1040419==    indirectly lost: 0 bytes in 0 blocks
==1040419==      possibly lost: 0 bytes in 0 blocks
==1040419==    still reachable: 88,190 bytes in 13 blocks
==1040419==         suppressed: 0 bytes in 0 blocks
==1040419== Rerun with --leak-check=full to see details of leaked memory
==1040419== 
==1040419== For lists of detected and suppressed errors, rerun with: -s
==1040419== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==1040433== Memcheck, a memory error detector
==1040433== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1040433== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1040433== Command: ./src/html2xhtml -t frameset report/vuln/id:000002,sig:11,src:001386+001369,time:43142960,execs:4309058,op:splice,rep:8
==1040433== 
==1040433== Invalid read of size 4
==1040433==    at 0x40E911: elm_close (procesador.c:944)
==1040433==    by 0x410617: err_html_struct (procesador.c:1889)
==1040433==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040433==    by 0x40F20A: elm_close (procesador.c:959)
==1040433==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040433==    by 0x40DF7A: main (html2xhtml.c:117)
==1040433==  Address 0x6efc84 is not stack'd, malloc'd or (recently) free'd
==1040433== 
==1040433== 
==1040433== Process terminating with default action of signal 11 (SIGSEGV)
==1040433==  Access not within mapped region at address 0x6EFC84
==1040433==    at 0x40E911: elm_close (procesador.c:944)
==1040433==    by 0x410617: err_html_struct (procesador.c:1889)
==1040433==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040433==    by 0x40F20A: elm_close (procesador.c:959)
==1040433==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040433==    by 0x40DF7A: main (html2xhtml.c:117)
==1040433==  If you believe this happened as a result of a stack
==1040433==  overflow in your program's main thread (unlikely but
==1040433==  possible), you can try to increase the size of the
==1040433==  main thread stack using the --main-stacksize= flag.
==1040433==  The main thread stack size used in this run was 8388608.
==1040433== 
==1040433== HEAP SUMMARY:
==1040433==     in use at exit: 92,286 bytes in 14 blocks
==1040433==   total heap usage: 23 allocs, 9 frees, 2,222,509 bytes allocated
==1040433== 
==1040433== LEAK SUMMARY:
==1040433==    definitely lost: 0 bytes in 0 blocks
==1040433==    indirectly lost: 0 bytes in 0 blocks
==1040433==      possibly lost: 0 bytes in 0 blocks
==1040433==    still reachable: 92,286 bytes in 14 blocks
==1040433==         suppressed: 0 bytes in 0 blocks
==1040433== Rerun with --leak-check=full to see details of leaked memory
==1040433== 
==1040433== For lists of detected and suppressed errors, rerun with: -s
==1040433== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==1040439== Memcheck, a memory error detector
==1040439== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1040439== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1040439== Command: ./src/html2xhtml -t frameset report/vuln/id:000003,sig:11,src:001386+001369,time:43143048,execs:4309129,op:splice,rep:8
==1040439== 
==1040439== Invalid read of size 4
==1040439==    at 0x40E911: elm_close (procesador.c:944)
==1040439==    by 0x410617: err_html_struct (procesador.c:1889)
==1040439==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040439==    by 0x40F20A: elm_close (procesador.c:959)
==1040439==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040439==    by 0x40DF7A: main (html2xhtml.c:117)
==1040439==  Address 0x6e7074 is not stack'd, malloc'd or (recently) free'd
==1040439== 
==1040439== 
==1040439== Process terminating with default action of signal 11 (SIGSEGV)
==1040439==  Access not within mapped region at address 0x6E7074
==1040439==    at 0x40E911: elm_close (procesador.c:944)
==1040439==    by 0x410617: err_html_struct (procesador.c:1889)
==1040439==    by 0x40F20A: err_content_invalid (procesador.c:0)
==1040439==    by 0x40F20A: elm_close (procesador.c:959)
==1040439==    by 0x40E7C4: saxEndDocument (procesador.c:233)
==1040439==    by 0x40DF7A: main (html2xhtml.c:117)
==1040439==  If you believe this happened as a result of a stack
==1040439==  overflow in your program's main thread (unlikely but
==1040439==  possible), you can try to increase the size of the
==1040439==  main thread stack using the --main-stacksize= flag.
==1040439==  The main thread stack size used in this run was 8388608.
==1040439== 
==1040439== HEAP SUMMARY:
==1040439==     in use at exit: 92,286 bytes in 14 blocks
==1040439==   total heap usage: 23 allocs, 9 frees, 2,222,509 bytes allocated
==1040439== 
==1040439== LEAK SUMMARY:
==1040439==    definitely lost: 0 bytes in 0 blocks
==1040439==    indirectly lost: 0 bytes in 0 blocks
==1040439==      possibly lost: 0 bytes in 0 blocks
==1040439==    still reachable: 92,286 bytes in 14 blocks
==1040439==         suppressed: 0 bytes in 0 blocks
==1040439== Rerun with --leak-check=full to see details of leaked memory
==1040439== 
==1040439== For lists of detected and suppressed errors, rerun with: -s
==1040439== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

GDB Backtrace and Source Code

I attached gdb to html2xhtml in an attempt to find where the Out of Bounds Read was taking place:

[Screenshots: backtrace_1, backtrace_2]

Taking a look at the segfault in GDB led me to the following function:

static void elm_close(tree_node_t *nodo)

A user could provide a malformed document with an invalid 'ELM_PTR(nodo).contenttype[doctype]', resulting in the following comparison in assembly:

cmp    dword ptr [rbp + rax*4 + 0xc], 4

An attacker could leverage this to read memory locations that they should not have access to. I have attached multiple crash files to help reproduce the issue.

Thanks again!

crashes.zip
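
The invalid read reported above comes from indexing a table with values derived from the input document. A common defensive pattern is to validate every index before the lookup. The following is a minimal sketch of that pattern; elm_info_t, elm_table, ELEMENT_COUNT, DOCTYPE_COUNT and content_type_for are hypothetical names and values chosen for illustration, not html2xhtml's actual data structures.

```c
#define ELEMENT_COUNT 3   /* hypothetical number of known elements */
#define DOCTYPE_COUNT 2   /* hypothetical number of document types */

/* Hypothetical per-element record: one content type per document type. */
typedef struct {
    int contenttype[DOCTYPE_COUNT];
} elm_info_t;

static const elm_info_t elm_table[ELEMENT_COUNT] = {
    { {1, 1} }, { {2, 4} }, { {4, 4} }
};

/* Returns the content type, or -1 when either index is out of range,
   so the caller can report an error instead of reading past the table. */
int content_type_for(int elm_id, int doctype)
{
    if (elm_id < 0 || elm_id >= ELEMENT_COUNT)
        return -1;
    if (doctype < 0 || doctype >= DOCTYPE_COUNT)
        return -1;
    return elm_table[elm_id].contenttype[doctype];
}
```

With a check of this kind, a malformed document produces an error code that the caller must handle, rather than a read from an unmapped address.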

Showstopper: HTTP response code was not 200 OK - Ubuntu Server 11.10

davidparks21@ubuntuserver:~/apc/tmp$ html2xhtml --version
HTTP response code was not 200 OK. (Set $opts{ignore_http_response_code} to ignore this error.) at /usr/bin/html2xhtml line 12

I tried both the Ubuntu package and a build from source on Ubuntu Server 11.10. In both cases I get this error on every execution of html2xhtml, as shown above, even with just the --version parameter.

stack-buffer-overflow in static int doctype_scan(const xchar *data)

Hi there!

I fuzzed the tool and found some problems.

Crash Out Phase

Run:

CFLAGS="-g" LDFLAGS="-g" CC=afl-clang-fast ./configure --disable-shared
AFL_USE_ASAN=1 make -j12
make install
./html2xhtml poc.html

Receive the output:

=================================================================
==209704==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fd4c5900760 at pc 0x55e5ebaa1d80 bp 0x7ffee77f5d00 sp 0x7ffee77f54c0
READ of size 513 at 0x7fd4c5900760 thread T0
    #0 0x55e5ebaa1d7f in StrstrCheck(void*, char*, char const*, char const*) asan_interceptors.cpp.o
    #1 0x55e5ebaa1a1c in strstr (/root/Desktop/workdir/html2xhtml-1.3/fuzz/bin/html2xhtml+0x55a1c) (BuildId: aae9894f09b69ea8bef28f68d73e1991760dd031)
    #2 0x55e5ebb89242 in doctype_scan /root/Desktop/workdir/html2xhtml-1.3/src/procesador.c:718:9
    #3 0x55e5ebb89242 in saxDoctype /root/Desktop/workdir/html2xhtml-1.3/src/procesador.c:564:14
    #4 0x55e5ebb67a40 in yyparse /root/Desktop/workdir/html2xhtml-1.3/src/htmlgr.y:72:3
    #5 0x55e5ebb7f915 in main /root/Desktop/workdir/html2xhtml-1.3/src/html2xhtml.c:110:20
    #6 0x7fd4c7d00082 in __libc_start_main /build/glibc-LcI20x/glibc-2.31/csu/../csu/libc-start.c:308:16
    #7 0x55e5eba8848d in _start (/root/Desktop/workdir/html2xhtml-1.3/fuzz/bin/html2xhtml+0x3c48d) (BuildId: aae9894f09b69ea8bef28f68d73e1991760dd031)

Address 0x7fd4c5900760 is located in stack of thread T0 at offset 864 in frame
    #0 0x55e5ebb8908f in saxDoctype /root/Desktop/workdir/html2xhtml-1.3/src/procesador.c:550

  This frame has 2 object(s):
    [32, 288) 'msg.i' (line 745)
    [352, 864) 'buffer.i' (line 710) <== Memory access at offset 864 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow asan_interceptors.cpp.o in StrstrCheck(void*, char*, char const*, char const*)
Shadow bytes around the buggy address:
  0x7fd4c5900480: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  0x7fd4c5900500: f8 f8 f8 f8 f2 f2 f2 f2 f2 f2 f2 f2 00 00 00 00
  0x7fd4c5900580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7fd4c5900700: 00 00 00 00 00 00 00 00 00 00 00 00[f3]f3 f3 f3
  0x7fd4c5900780: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fd4c5900980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==209704==ABORTING

Root Cause Analysis

The string passed to strstr does not end with a null byte, so strstr reads past the end of the buffer, resulting in a stack buffer overflow. The buffer should be null-terminated before it is searched.

I have attached crash files to help reproduce the issue.
poc.zip
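
strstr keeps reading until it finds a terminating null byte, which is consistent with the 513-byte read of a 512-byte buffer in the report above. A usual remedy is to bound the copy and always write the terminator before searching. The following is a minimal sketch of that remedy, not the project's actual fix; doctype_contains and SCAN_BUF_SIZE are hypothetical names, and the 512-byte size is taken from the ASan frame description.

```c
#include <stddef.h>
#include <string.h>

#define SCAN_BUF_SIZE 512  /* size of the buffer named in the ASan report */

/* Hypothetical helper: copy at most SCAN_BUF_SIZE - 1 bytes of data
   (length len, not necessarily null-terminated) into a local buffer,
   terminate it, then search. Returns 1 if needle occurs, else 0. */
int doctype_contains(const char *data, size_t len, const char *needle)
{
    char buffer[SCAN_BUF_SIZE];
    size_t n = len < sizeof buffer - 1 ? len : sizeof buffer - 1;

    memcpy(buffer, data, n);
    buffer[n] = '\0';   /* guarantees that strstr stops inside buffer */
    return strstr(buffer, needle) != NULL;
}
```

Input longer than 511 bytes is truncated in this sketch, so a match that starts beyond that point is not found; whether truncation or an error is the better behaviour depends on the caller.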

Unable to read HTML Document

Environment: Windows 7
When I run html2xhtml from the command prompt on an HTML file, the tool reads the file and converts it correctly, but when I give it an HTML Document it replies:
fopen: No such file or directory
Error [line 1]: Could not open the input file for reading

