Git Product home page Git Product logo

Comments (13)

cfis avatar cfis commented on August 30, 2024

Hi nkriege,

Thanks for the report and test case - will take a look.

Charlie

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

I have not found the root cause, but here is a hint. It appears that I can work around the segfault by adding a call to Encoding.find('ISO-8859-1') prior to parsing the document.

Nathan

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

Here is the stack trace as reported by gdb:

#0  0x0000000000000000 in ?? ()
#1  0x00007ffff79bf1c0 in rb_enc_precise_mbclen (p=0xd3c9b8 "Result", e=0xd3c9be "", enc=0x723fc0) at encoding.c:834
#2  0x00007ffff7abab3b in coderange_scan (p=0xd3c9b8 "Result", len=6, enc=0x723fc0) at string.c:213
#3  0x00007ffff7abb217 in rb_enc_str_coderange (str=13879720) at string.c:327
#4  0x00007ffff7ac2042 in rb_str_hash (str=13879720) at string.c:2051
#5  0x00007ffff7af925a in cdhash_hash (a=13879720) at compile.c:1245
#6  0x00007ffff7ab4604 in st_lookup (table=0x7665e0, key=13879720, value=0x7fffffffd5e8) at st.c:341
#7  0x00007ffff7b18870 in vm_exec_core (th=0x604da0, initial=0) at insns.def:1265
#8  0x00007ffff7b25412 in vm_exec (th=0x604da0) at vm.c:1147
#9  0x00007ffff7b25e49 in rb_iseq_eval_main (iseqval=6526880) at vm.c:1388
#10 0x00007ffff79f383b in ruby_exec_internal (n=0x6397a0) at eval.c:214
#11 0x00007ffff79f394e in ruby_exec_node (n=0x6397a0) at eval.c:261
#12 0x00007ffff79f3921 in ruby_run_node (n=0x6397a0) at eval.c:254
#13 0x00000000004009f1 in main (argc=2, argv=0x7fffffffe188) at main.c:35

Here are the contents of the rb_encoding object at the time of the crash:

(gdb) p *enc
$1 = {
  precise_mbc_enc_len = 0,
  name = 0x13e8fb0 "ISO-8859-1",
  max_enc_len = 0,
  min_enc_len = 0,
  is_mbc_newline = 0,
  mbc_to_code = 0,
  code_to_mbclen = 0,
  code_to_mbc = 0,
  mbc_case_fold = 0,
  apply_all_case_fold = 0,
  get_case_fold_codes_by_str = 0,
  property_name_to_ctype = 0,
  is_code_ctype = 0,
  get_ctype_code_range = 0,
  left_adjust_char_head = 0,
  is_allowed_reverse_match = 0,
  ruby_encoding_index = 13
}

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

Here is another error that I believe is related. You can trigger a floating point exception just by running the following line.

puts LibXML::XML::Encoding::to_s(LibXML::XML::Encoding::ISO_8859_1)

Here is the stacktrace:

#1  0x00007ffff7a1d2c1 in rb_io_puts (argc=1, argv=0x7ffff7ea8048, out=6658640) at io.c:6151
#2  0x00007ffff7b114fa in call_cfunc (func=0x7ffff7a1d194 <rb_io_puts>, recv=6658640, len=-1, argc=1, argv=0x7ffff7ea8048) at vm_insnhelper.c:315
#3  0x00007ffff7b1ee84 in vm_call0 (th=0x604da0, recv=6658640, id=6096, argc=1, argv=0x7ffff7ea8048, me=0x6d3680) at vm_eval.c:79
#4  0x00007ffff7b1f508 in rb_call0 (recv=6658640, mid=6096, argc=1, argv=0x7ffff7ea8048, scope=CALL_FCALL, self=6) at vm_eval.c:235
#5  0x00007ffff7b1fc3a in rb_call (recv=6658640, mid=6096, argc=1, argv=0x7ffff7ea8048, scope=CALL_FCALL) at vm_eval.c:438
#6  0x00007ffff7b20486 in rb_funcall2 (recv=6658640, mid=6096, argc=1, argv=0x7ffff7ea8048) at vm_eval.c:651
#7  0x00007ffff7a1d377 in rb_f_puts (argc=1, argv=0x7ffff7ea8048, recv=6708200) at io.c:6174
#8  0x00007ffff7b114fa in call_cfunc (func=0x7ffff7a1d2f5 <rb_f_puts>, recv=6708200, len=-1, argc=1, argv=0x7ffff7ea8048) at vm_insnhelper.c:315
#9  0x00007ffff7b11e39 in vm_call_cfunc (th=0x604da0, reg_cfp=0x7ffff7fa7f08, num=1, recv=6708200, blockptr=0x0, me=0x6d0d60) at vm_insnhelper.c:402
#10 0x00007ffff7b124d4 in vm_call_method (th=0x604da0, cfp=0x7ffff7fa7f08, num=1, blockptr=0x0, flag=8, id=6096, me=0x6d0d60, recv=6708200) at vm_insnhelper.c:524
#11 0x00007ffff7b17cd2 in vm_exec_core (th=0x604da0, initial=0) at insns.def:1006
#12 0x00007ffff7b25412 in vm_exec (th=0x604da0) at vm.c:1147
#13 0x00007ffff7b25e49 in rb_iseq_eval_main (iseqval=6524480) at vm.c:1388
#14 0x00007ffff79f383b in ruby_exec_internal (n=0x638e40) at eval.c:214
#15 0x00007ffff79f394e in ruby_exec_node (n=0x638e40) at eval.c:261
#16 0x00007ffff79f3921 in ruby_run_node (n=0x638e40) at eval.c:254
#17 0x00000000004009f1 in main (argc=2, argv=0x7fffffffe198) at main.c:35

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

I have one proposed fix for this in my fork:
nkriege@fbdd29a

I have not tested exhaustively and am not familiar enough with the code to know for sure if this is correct. Can you please review and let me know if you think this is the right direction?

from libxml-ruby.

cfis avatar cfis commented on August 30, 2024

Thanks for the additional info and patch. However, I can't reproduce either error. The reader tests already have an example parsing a "ISO-8859-1" string which passes. I also added in both of your tests with not issue.

So what version of libxml are you using? Are you doing any of this in threads? Anything unusual? What platform? Etc.

The reason I didn't do the encoding lookup the way you have in the patch is that Ruby and LibXML spell encodings differently. Since libxml only supports a small subset, seemed easier to just have a switch statement. And if all else fails, return ASCII_8BIT.

I'm suspicious there is something else going on here, and the patch is just changing the code enough to hide it but not actually solve it.

Thanks -Charlie

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

I am running Ubuntu 64bit Lucid with the following libxml packages installed:

ii  libxml2                                    2.7.6.dfsg-1ubuntu1.1             GNOME XML library
ii  libxml2-dev                                2.7.6.dfsg-1ubuntu1.1             Development files for the GNOME XML library

I am running ruby 1.9.2p180 with the libxml-ruby version 2.0.5 gem.

I can reproduce the error simply by running the script exactly as pasted in the bug report, without threads or anything out of the ordinary.

Here is the simplest repro case I have that demonstrates a problem:

$ ruby1.9.2 -e "require 'libxml'; puts LibXML::XML::Encoding::to_s(LibXML::XML::Encoding::ISO_8859_1)"
ISO-8859-1Floating point exception

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

Let me clarify that the segfault happens for me when the code from the bug description is run in isolation. I believe that this is because the Encoding support for ISO-8859-1 has not been initialized yet. I can avoid the segfault by preceding this code with something that causes the ISO-8859-1 support to be initialized. For example, the segfault goes away if I add either of the following lines to the beginning:

Encoding.find('ISO-8859-1')

or

"any string".force_encoding(Encoding::ISO8859_1)

These calls have the side effect of triggering a call to load_encoding (see encoding.c) which is what loads enc/iso_8859_1.so. After that has happened once, the segfault will not occur.

The existing tc_reader.rb test passes for me. I also tried add my code as a new test case to that file, like this:

  def test_iso8859
    data = '<?xml version="1.0" encoding="ISO-8859-1" ?><Result></Result>'

    got_result = false
    reader = LibXML::XML::Reader.string(data)

    while reader.read
      if reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT

        # The following line will segfault in ruby 1.9.2
        case reader.name
        when "Result"
          got_result = true
        end
      end
    end

    assert(got_result)
  end

If I then run the whole file (ruby tc_reader.rb), it still passes. However if I run just my test case (ruby tc_reader.rb --name=test_iso8859), I get the segfault. This is because another test case in the file is doing something to trigger a call to load_encoding. I believe this is happening in the call to String#force_encoding in test_encoding. I can see this by running in gdb and breaking on calls to load_encoding in encoding.c.

Are you saying that the segfault does not repro for you even if you run the code in isolation? I wonder if your are somehow automatically loading some encodings that I am not.

from libxml-ruby.

cfis avatar cfis commented on August 30, 2024

If I then run the whole file (ruby tc_reader.rb), it still passes. However if I run just my test case (ruby tc_reader.rb --name=test_iso8859), I get the segfault. This is because another test case in the file is doing something to trigger a call to load_encoding. I believe this is happening in the call to String#force_encoding in test_encoding. I can see this by running in gdb and breaking on calls to load_encoding in encoding.c.

Interesting. Both cases work fine for me. Don't have Ubuntu around,
but will check on my Fedora x64_86 box.

Are you saying that the segfault does not repro for you even if you run the code in isolation?

Yup, works fine here on Windows 7 using mingw compiled libraries or VC++
compiled libraries. Will try Fedora next...

Charlie

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

Thanks for trying to repro.

The reason I didn't do the encoding lookup the way you have in the patch is that Ruby and LibXML spell encodings differently.

I am not sure I understand the subtleties of this, but I tried to address this concern in the version of the file at nkriege/libxml-ruby@ec91e47. Does that look more appropriate to you? It would have been nice to be able to rely on xmlGetCharEncodingName, but I have gone back to a lookup table based on your feedback. My primary goal is to eliminate the call to rb_const_get(rb_cEncoding, encoding_name) that used to exist in ruby_xml_encoding.c and replace it with a call to rb_enc_find. So the new lookup table returns ruby encoding names rather than names of constants in the Encoding namespace.

from libxml-ruby.

cfis avatar cfis commented on August 30, 2024

Ok,

I can reproduce this on Linux using Fedora.

The change in version nkriege/libxml-ruby@ec91e47 looks fine.

So let me recap. The current code just uses a Ruby constant lookup, which seems to sometimes not work. By using rb_enc_find instead the encodig machinery is correctly initialized?

To minimize the changes, and avoid having two methods, could we just do this at the end of rxml_xml_encoding_to_rb_encoding:

rb_encoding* encoding = rb_enc_find(encodingName);
return rb_enc_from_encoding(encoding);

Or is there some advantage splitting them out? Is all this encoding lookup going to be a performance hit?

Thanks - Charlie

from libxml-ruby.

nkriege avatar nkriege commented on August 30, 2024

I can reproduce this on Linux using Fedora.

Great

So let me recap. The current code just uses a Ruby constant lookup, which seems to sometimes not work. By using rb_enc_find instead the encodig machinery is correctly initialized?

That sounds right to me.

To minimize the changes, and avoid having two methods, could we just do this at the end of rxml_xml_encoding_to_rb_encoding:

rb_encoding* encoding = rb_enc_find(encodingName);
return rb_enc_from_encoding(encoding);

Or is there some advantage splitting them out? Is all this encoding lookup going to be a performance hit?

Yes, that is a good suggestion. I had added initially added the extra method just to try to improve performance by eliminating the extra call to rb_enc_from_encoding. Looking at that method more closely, though, it looks to me like it should be fast and basically just boil down to an array access. So I agree with you that this is a good tradeoff to avoid unnecessary code churn. I have implemented your suggestion in nkriege/libxml-ruby@72dd3a4. This eliminates the need for any changes to libxml.c.

I think we are converging on an agreed solution, so I will submit a pull request to give you another opportunity for review.

from libxml-ruby.

cfis avatar cfis commented on August 30, 2024

Thanks again for helping figure this out and the patch!

Charlie

from libxml-ruby.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.