pismo's People

Contributors

bborn, cheald, dparis, glaszig, peterc, sleeptillseven, sutto, xxx

pismo's Issues

getting page links

Would it be a good idea to add some sort of "doc.links" method?
Its purpose would be to return the list of links inside the document.

Thanks,
Dimas
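
A minimal sketch of what such a method might return, assuming the page HTML is already available as a string. The helper name extract_links is hypothetical and not part of Pismo, and the regex only keeps the sketch dependency-free; an implementation inside Pismo would more likely walk the parsed document.

```ruby
require "uri"

# Hypothetical helper, not part of Pismo: collect the href of every <a> tag
# in an HTML string and resolve it against the page URL.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten

  resolved = hrefs.filter_map do |href|
    begin
      URI.join(base_url, href.strip).to_s
    rescue URI::Error
      nil # skip hrefs that cannot be parsed as URIs
    end
  end

  resolved.uniq
end

html = <<~HTML
  <p>See <a href="/about">about</a> and
  <a class="ext" href="http://example.org/page">an external page</a>.</p>
HTML

links = extract_links(html, "http://example.com/blog/post.html")
puts links
```

Relative links are resolved against the page URL so the caller always gets absolute URLs back.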

UTF-8 Characters Replaced with ?

Pismo outputs ? rather than the actual character.

Example site: http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

Output:

Computers store every piece of text using a "character encoding," which gives a number to each character. For example, the byte stands for 'a' and stands for 'b' in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, could mean any of ¡, ?, ?, ?, ', ", or parts of thousands of characters, from æ to ?. If you brought a file from one computer to another, it could come out as gobbledygook.Unicode was invented to solve that problem: to encode all human languages, from Chinese (??) to Russian (???????) to Arabic (???????), and even emoji symbols like or ; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn't even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted—little did anyone realize at the time they would be so important for each other. Today, people can easily share documents on the web, no matter what their language. Every January, we look at the percentage of the webpages in our index that are in different encodings. Here's what our data looks like with the latest figures*: *Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data. As you can see, Unicode has experienced an 800 percent increase in "market share" since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). 
The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you're surfing the web. We've long used Unicode as the internal format for all the text Google searches and process: any other encoding is first converted to Unicode. Version 6.1 just released with over 110,000 characters; soon we'll be updating to that version and to Unicode's locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index it would be nearly impossible—it'd be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
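
One way Ruby can produce output like the above is a lossy transcoding step that replaces every character the target encoding cannot represent. The sketch below only illustrates that mechanism; whether this is what happens inside Pismo is an assumption, not something the report confirms.

```ruby
# Assumption: the "?" characters come from a lossy transcoding step somewhere
# in the pipeline. This is not confirmed to be what Pismo does.
text = "from Chinese (中文) to Russian (Русский)"

# Transcoding to ASCII with replacement turns every non-ASCII character into "?".
lossy = text.encode("US-ASCII", invalid: :replace, undef: :replace)
puts lossy

# Keeping the string in UTF-8 end to end preserves the characters.
kept = text.encode("UTF-8")
puts kept
```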

Default reader gets wrong content

Ran into an issue with Pismo's default reader returning the wrong section of an HTML document for its body/html_body fields. It does work, however, with the cluster reader. This might be a good addition to the test corpus for the default reader.

http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story

The default reader seems to pull content from <div id="navbar"> rather than <div id="content">.

>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story")
>> doc.body
=> "* The T\n* Casinos\n* News by neighborhood\n* Crime\n* Fires\n* Boston Store\n* Photos\n* Boston English\n* Restrooms\n* Blogs"

>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story", :reader => :cluster)
>> doc.body
=> "Sour grapes at the Herald? With bonus gratuitous quote from some lawyer making accusations with no apparent facts behind them:\nIf he was a reporter on deadline and he's distracted and making phone calls and texting, then that's something that adds to his fault. You're not supposed to be distracted in a cab, you're supposed to focus fully on your job,\" said Douglas Sheff, a Boston personal injury lawyer and president-elect of the Massachusetts Bar Association.\nDoes the esquire have any proof the reporter was on deadline and making phone calls and texting right before the crash? If so, he and the Herald failed to produce it."

(Originally reported in feedbin/support#35)

HTTPS support

Is there any possibility of supporting https links?
Example: doc = Pismo::Document.new('https://www.google.com/')
Thanks :)

EDIT:
Closed, reason: #21 (comment)
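
For readers who hit the same limitation, a possible workaround is to fetch the page yourself over TLS. The sketch below uses only Ruby's standard library; http_client_for and fetch_html are hypothetical helper names, not Pismo methods.

```ruby
require "net/http"
require "uri"

# Hypothetical helper, not part of Pismo: build a Net::HTTP client that
# switches TLS on when the URL scheme is https.
def http_client_for(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http
end

# Fetch the page body; this performs a real network request when called.
def fetch_html(url)
  uri = URI.parse(url)
  http_client_for(url).request(Net::HTTP::Get.new(uri.request_uri)).body
end

client = http_client_for("https://www.google.com/")
puts client.use_ssl? # TLS is enabled for https URLs
puts client.port
```

The fetched string could then be handed to Pismo::Document.new in place of the URL, assuming the installed version accepts raw HTML as its first argument.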

Slow Startup/Require

I've been benchmarking the require time for the libraries in an application I'm working on to reduce the startup time (which is over 30 seconds). It turns out Pismo, which we use only rarely, takes over 1.2 seconds on my relatively speedy 13" MacBook Air.

Here's a perftools.rb output of where that time is spent (running the following script):

require 'rubygems'
require 'pismo'

Perftools Output
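
A rough way to reproduce this kind of measurement, and one possible mitigation, is sketched below. "json" stands in for "pismo" so the snippet runs without the gem installed; time_require and lazy_json are hypothetical names used only for illustration.

```ruby
# Time how long a single `require` takes, in seconds.
# A library that is already loaded returns almost immediately.
def time_require(library)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  require library
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# "json" stands in for "pismo" here so the sketch runs without the gem.
elapsed = time_require("json")
puts format("require took %.3fs", elapsed)

# Deferring the require until first use keeps it off the startup path.
# `lazy_json` is a hypothetical wrapper around the rarely used library.
def lazy_json
  require "json" # no-op after the first call
  JSON
end
```

Deferring the require does not make loading any faster; it only moves the cost from application startup to the first call that needs the library.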

pismo fails on many pages with encoding issues

burl="http://www.momfluential.net"
=> "http://www.momfluential.net"
ruby-1.9.2-p0 > pismo = Pismo[burl]
ArgumentError: invalid byte sequence in UTF-8
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `gsub!'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `clean_html'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:36:in `load'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:16:in `initialize'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `new'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `[]'
from (irb):58
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `block in require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `block in load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:596:in `new_constants_in'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
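
The trace above shows gsub! failing on bytes that are not valid UTF-8. A minimal sketch of one way to guard against that is to make the string valid before any regex touches it. The helper name to_valid_utf8 is hypothetical, this is not Pismo's actual code, and String#scrub needs Ruby 2.1 or later, so it would not run on the 1.9.2 interpreter in the trace.

```ruby
# Hypothetical pre-processing step, not Pismo's actual code: make sure a
# fetched page is valid UTF-8 before any regex or gsub! touches it.
def to_valid_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Replace each invalid byte with U+FFFD (String#scrub needs Ruby 2.1+).
  text.scrub("\uFFFD")
end

# "\xE9" is "é" in ISO-8859-1 but an invalid sequence in UTF-8.
raw = "caf\xE9 au lait".b
clean = to_valid_utf8(raw)

puts clean.valid_encoding?
puts clean.gsub(/\s+/, " ")
```

Scrubbing loses the original character. If the page's real encoding is known (for example from the Content-Type header), transcoding from that encoding would preserve it.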

Support HTML DEL and INS elements.

Noticed this when I was using the Pismo-powered ‘entry text extraction’ on Feedbin.

>> Pismo['http://hsivonen.iki.fi/accept-charset/'].lede
=> "Accept-Charset Is No More. Now that Firefox 10 has been released, the Accept-Charset HTTP header. During the Firefox 4 development cycle, I noticed that IE and Safari were not sending the Accept-Charset HTTP header in their HTTP requests. This meant that the Web had to work even without browser sending that header." 

The first sentence given by Pismo:

Now that Firefox 10 has been released, the Accept-Charset HTTP header.

Comes from the following HTML:

<p>Now that Firefox 10 has been released, <del>none of the major browsers send</del> <ins>only Chrome sends</ins> the <code>Accept-Charset</code> HTTP header.</p>

If anything I would have expected Pismo to drop the DEL elements but keep the INS elements like so:

Now that Firefox 10 has been released, only Chrome sends the Accept-Charset HTTP header.

Even the html_body does not return these tags. This means potentially important parts of a document can go missing. See this return value, edited to show only the first paragraph:

>> Pismo['http://hsivonen.iki.fi/accept-charset/'].html_body
 => "Accept-Charset Is No More<p>Now that Firefox 10 has been released,   the <code>Accept-Charset</code> HTTP header.</p>\n\n"

Please support the DEL and INS elements by:

  1. Dropping DEL and its content from lede and body, but keeping the content of INS in both.
  2. Keeping the DEL and INS elements and their content in html_body.
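
The behaviour requested in point 1 can be sketched as follows. The helper names are hypothetical and not Pismo's API, and regexes are used only to keep the sketch free of dependencies; a real fix would more likely operate on the parsed document.

```ruby
# Hypothetical helpers, not Pismo's API. Regexes keep the sketch free of
# dependencies; a real fix would more likely operate on the parsed document.

# Drop <del> elements with their content and unwrap <ins>, keeping its content.
def apply_edits(html)
  html.gsub(%r{<del\b[^>]*>.*?</del>}mi, "")
      .gsub(%r{</?ins\b[^>]*>}i, "")
      .gsub(/[ \t]{2,}/, " ")
end

# Strip the remaining tags to get plain text, as lede and body would need.
def plain_text(html)
  apply_edits(html).gsub(/<[^>]+>/, "").strip
end

html = "<p>Now that Firefox 10 has been released, <del>none of the major " \
       "browsers send</del> <ins>only Chrome sends</ins> the " \
       "<code>Accept-Charset</code> HTTP header.</p>"

puts plain_text(html)
```

For the paragraph quoted in this report, the sketch prints the sentence the reporter expected, with the INS content kept and the DEL content gone.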

Image fetch is not working

The "doc.images" call returns nil every time, even when the HTML page contains valid images with absolute URLs. The reader_doc.images array is always empty.

(undefined method `first' for nil:NilClass) when there is no top_content_candidate

When I tried to get the images from the following page, I got the exception below:
http://verkoren.wordpress.com/2013/04/12/you-cant-skate-you-old/

<NoMethodError: undefined method `first' for nil:NilClass>

/app/vendor/pismo/pismo/reader/tree.rb:149:in `content_at'
/app/vendor/pismo/pismo/images/image_extractor.rb:28:in `initialize'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `new'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `images'
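
Until the underlying nil case is handled, callers can guard against both this exception and the nil result described in the previous issue. The sketch below is a caller-side workaround, not a fix inside Pismo. safe_images is a hypothetical helper, and the two stand-in classes exist only to exercise it.

```ruby
# Caller-side workaround, not a fix inside Pismo: treat both a nil result and
# the NoMethodError from the trace above as "no images found".
def safe_images(doc)
  doc.images || []
rescue NoMethodError
  []
end

# Stand-ins for a Pismo document, used only to exercise the helper.
NilImagesDoc = Struct.new(:images)

class RaisingDoc
  def images
    nil.first # reproduces "undefined method `first' for nil"
  end
end

p safe_images(NilImagesDoc.new(nil))
p safe_images(RaisingDoc.new)
p safe_images(NilImagesDoc.new(["http://example.com/a.png"]))
```

Rescuing NoMethodError this broadly can hide unrelated bugs, so it is best treated as a stopgap.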

Problem with (CoffeeScript) Code

Hello,

I was just trying to use Pismo on a webpage that has this code in the article:

sys:   require 'sys'
http:  require 'http'
client: http.createClient 8080, 'localhost'
# start an active chain gang queue with 3 workers by default.
chain: require('chain-gang').create()

# downloads a web page, runs the callback when it's done.
get_path: (path, cb) ->
  req: client.request('GET', path, {host: 'localhost'})
  req.end()
  req.addListener 'response', (resp) ->
    resp.addListener 'data', (chunk) ->
      sys.puts chunk
    resp.addListener 'end', cb

# returns a chain gang job that downloads a web page and finishes the worker.
job: (timeout, name) ->
  (worker) ->
    get_path "/$timeout/$name", ->
      worker.finish()

But it came out as:

'sys'
'http'
'localhost'
# start an active chain gang queue with 3 workers by default.
'chain-gang'
# downloads a web page, runs the callback when it's done.
'GET'
'localhost'
'response'
'data'
'end'
# returns a chain gang job that downloads a web page and finishes the worker.
"/$timeout/$name"

'foo'
# queues the job with the unique name "foo_request"
'foo'
'foo_request'

When I ran html_body. :(
