Comments (10)

flavorjones commented on July 19, 2024

I'm afraid I don't completely understand your explanation. This code was originally added as part of commit a2c669d, and allows Nokogiri to do encoding detection.

Detecting the encoding is not trivial, and if not provided by the Nokogiri caller, is done by libxml based on a combination of headers and character set.

So, now, armed with the knowledge that this code exists for a reason, could you please explain more? Specifically,

  • Example code (and html document!) that reproduces the observed behavior.
  • An explanation of what the expected behavior is.

Thank you!

from mechanize.

dima4p commented on July 19, 2024

It seems that Nokogiri does not detect the encoding. Add some tags containing text in windows-1255 to test/htdocs/tc_charset.html and you will see that it is not converted to UTF-8.
In my case the page is http://www.kinopoisk.ru/

flavorjones commented on July 19, 2024

Hi!

If you provide runnable code that reproduces the observed behavior, and perhaps explain what you expect the behavior to be, it will save the developers time, and the issue will be much more likely to be addressed.

A failing test case, or a fork of mechanize with the test case written, would be perfect.

Thanks!

dima4p commented on July 19, 2024

I will prepare the test later, maybe tomorrow. In short:
Mechanize.new.get('http://www.kinopoisk.ru/').links.last.text.should == "Реклама на КиноПоиске!"
But it gives "Ðåêëàìà íà ÊèíîÏîèñêå!"
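
The garbled string is what appears when windows-1251 bytes are read as ISO-8859-1. A minimal sketch in plain Ruby (1.9 or later, no Mechanize or Nokogiri involved; the variable names are illustrative only) that reproduces the effect:

```ruby
expected = "Реклама на КиноПоиске!"

# Bytes as a server using windows-1251 would send them.
raw = expected.encode("Windows-1251")

# Wrong guess: treat the same bytes as ISO-8859-1, then convert to UTF-8.
misread = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Right guess: label the bytes as windows-1251 before converting.
recovered = raw.dup.force_encoding("Windows-1251").encode("UTF-8")

puts misread
puts recovered
```

This only illustrates the symptom; it does not show where in Mechanize or libxml the wrong guess is made.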

kitamomonga commented on July 19, 2024

If

agent.get('http://www.kinopoisk.ru/')
p agent.page.parser.errors.map{|e| e.to_s}.grep('Input is not proper UTF-8, indicate encoding !')

returns ["Input is not proper UTF-8, indicate encoding !"],
Nokogiri failed parsing multibyte characters.

agent.get('http://www.kinopoisk.ru/')
# bypass Page#encoding= bug ( http://github.com/tenderlove/mechanize/issues#issue/43 )
agent.page.instance_variable_set(:@parser,nil)
agent.page.encoding = "windows-1251"
p(agent.page.links.last.text == "Реклама на КиноПоиске!")

may return true.

Also, an outdated libxml2/libxslt often causes errors. Please use a newer version.

p Nokogiri::VERSION_INFO

=> {"warnings"=>[], "ruby"=>{"engine"=>"mri", "version"=>"1.8.7", "platform"=>"i386-mswin32"}, "libxml"=>{"loaded"=>"2.7.7", "binding"=>"extension", "compiled"=>"2.7.7"}, "nokogiri"=>"1.4.3.1"}

dima4p commented on July 19, 2024

I have installed a new version of libxml. It helped, but not completely:

>> p Nokogiri::VERSION_INFO
{"warnings"=>[], "ruby"=>{"engine"=>"mri", "version"=>"1.8.7", "platform"=>"i486-linux"}, "libxml"=>{"loaded"=>"2.7.7", "binding"=>"extension", "compiled"=>"2.7.7"}, "nokogiri"=>"1.4.3.1"}
=> nil
>> (page = Mechanize.new.get('http://www.kinopoisk.ru/')).title
=> "��\270ноПоиск.ru. Все фильмы планеты"
>> page.parser.errors.map{|e| e.to_s}.grep('Input is not proper UTF-8, indicate encoding !')
=> ["Input is not proper UTF-8, indicate encoding !"]

dima4p commented on July 19, 2024

Additionally, I've run into a situation where the server sends the proper encoding while the meta tag has a wrong one. What about an option to use the server info in case of a conflict?
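
A rough sketch of what such a preference could look like, in plain Ruby (2.0 or later). The helper names `charset_from_header`, `charset_from_meta`, and `pick_charset` are hypothetical and are not part of Mechanize; the regular expressions are deliberately simple and would not cover every real-world page:

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
def charset_from_header(content_type)
  content_type.to_s[/charset\s*=\s*["']?([^\s;"']+)/i, 1]
end

# Hypothetical helper: pull the charset out of a <meta> tag.
# The body is scanned as binary so that bytes in an unknown encoding do not raise.
def charset_from_meta(html)
  html.to_s.b[/<meta[^>]+charset\s*=\s*["']?([^\s;"'>\/]+)/i, 1]
end

# Prefer what the server declared; fall back to the meta tag.
def pick_charset(content_type, html)
  charset_from_header(content_type) || charset_from_meta(html)
end

html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
puts pick_charset("text/html; charset=windows-1251", html)
puts pick_charset(nil, html)
```

The first call prints the header's value because the header wins when both sources are present; the second falls back to the meta tag.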

drbrain commented on July 19, 2024

This should be fixed

dima4p commented on July 19, 2024

Indeed more problems appeared. So my monkey patch now is https://gist.github.com/911742

drbrain commented on July 19, 2024

The current code for detecting the charset seems superior as it tries all possible values for encoding and stops when it finds one without errors.
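
The try-each-candidate idea can be illustrated in plain Ruby as below. This is not the Mechanize code: the helper `first_clean_encoding` is hypothetical, and `String#valid_encoding?` stands in for "parsed without errors", which is a much weaker check than running the parser.

```ruby
# Hypothetical helper: return the first candidate under which the bytes are valid.
# Candidates must be encoding names that Ruby knows, otherwise force_encoding raises.
def first_clean_encoding(body, candidates)
  candidates.find do |name|
    body.dup.force_encoding(name).valid_encoding?
  end
end

candidates = %w[UTF-8 Windows-1251]

cp1251_body = "Реклама на КиноПоиске!".encode("Windows-1251")
utf8_body   = "Реклама на КиноПоиске!"

# These windows-1251 bytes are not valid UTF-8, so the second candidate is chosen.
puts first_clean_encoding(cp1251_body, candidates)
puts first_clean_encoding(utf8_body, candidates)
```

Order matters in a sketch like this: a single-byte encoding such as Windows-1251 will accept these bytes whatever they really are, so stricter encodings like UTF-8 should be tried first.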

Your monkey patch is also out of date. Stop supplying monkey patches and supply diffs at a minimum.
