peterc / pismo

Extracts machine-readable metadata and content from Web pages

Home Page: http://coder.io/

License: Other

Ruby 100.00%

pismo's Issues

case sensitive matcher and default favicon location.

Maybe you want to add a case-sensitive matcher for looking up the favicon:

 ['link[@rel="Shortcut Icon"]', lambda { |el| el.attr('href') }],

https://github.com/fluxsaas/pismo/blob/master/lib/pismo/internal_attributes.rb#L36

Also, it might be nice to add the default favicon location, e.g.:

example.com/favicon.ico

seems to be widely used...

thanks for the library!
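
A minimal sketch of both suggestions, using only Ruby's standard library. The helper names icon_rel? and default_favicon_url are hypothetical and not part of Pismo; the sketch only illustrates comparing the rel value without regard to case and falling back to /favicon.ico at the site root.

```ruby
require 'uri'

# Hypothetical helper (not Pismo API): true when a <link rel="..."> value
# names an icon, regardless of letter case ("Shortcut Icon", "shortcut icon").
def icon_rel?(rel)
  rel.to_s.downcase.split.include?('icon')
end

# Hypothetical helper: the conventional fallback location at the site root.
def default_favicon_url(page_url)
  URI.join(page_url, '/favicon.ico').to_s
end

puts icon_rel?('Shortcut Icon')
puts default_favicon_url('http://example.com/blog/post.html')
```

The fallback is only a guess at a location; whether a file actually exists there would still need an HTTP request.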

Can't match Chinese keywords

You can try this page:
http://huacnlee.com/blog/move-rails-project-from-linux-file-system-to-mongodb-gridfs

Only English keywords are returned.
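
As an illustration only (this is not a claim about how Pismo tokenizes text), the sketch below shows how an ASCII-only word pattern skips Chinese characters while a pattern using Ruby's \p{Han} script property picks them up. The sample string is made up for the example.

```ruby
text = '把 Rails 项目迁移到 MongoDB GridFS'

# An ASCII-only pattern sees just the Latin-script words.
latin_only = text.scan(/[a-z]+/i)

# \p{Han} is the Unicode script property for Chinese characters.
with_han = text.scan(/\p{Han}+|[a-z]+/i)

p latin_only
p with_han
```

Chinese text is not space-delimited, so the Han runs found here are not necessarily single words; real keyword extraction would need a segmenter on top of this.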

I18n: work with non english websites

Is it possible to add support for different languages?
Maybe some kind of API or setting for it?

getting page links

Would it be good to add some sort of "doc.links" method?
The purpose is to get list of links inside the document.

Thanks,
Dimas
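
A rough sketch of what such a method might do, assuming the raw HTML and the page URL are available. extract_links is a hypothetical helper, not an existing Pismo method, and the regex is used only to keep the example free of dependencies; an HTML parser would be more robust.

```ruby
require 'uri'

# Hypothetical helper (not Pismo API): collect href values from <a> tags and
# resolve them against the page URL, dropping duplicates.
def extract_links(html, base_url)
  html.scan(/<a\b[^>]*\bhref=["']([^"']+)["']/i).flatten.map do |href|
    URI.join(base_url, href).to_s
  end.uniq
end

html = '<p><a href="/about">About</a> <a class="x" href="http://example.org/">Ext</a></p>'
p extract_links(html, 'http://example.com/blog/post')
```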

UTF-8 Characters Replaced with ?

Pismo outputs ? rather than the actual character.

Example site: http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

Output:

Computers store every piece of text using a "character encoding," which gives a number to each character. For example, the byte stands for 'a' and stands for 'b' in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, could mean any of ¡, ?, ?, ?, ', ", or parts of thousands of characters, from æ to ?. If you brought a file from one computer to another, it could come out as gobbledygook.Unicode was invented to solve that problem: to encode all human languages, from Chinese (??) to Russian (???????) to Arabic (???????), and even emoji symbols like or ; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn't even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted—little did anyone realize at the time they would be so important for each other. Today, people can easily share documents on the web, no matter what their language. Every January, we look at the percentage of the webpages in our index that are in different encodings. Here's what our data looks like with the latest figures*: *Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data. As you can see, Unicode has experienced an 800 percent increase in "market share" since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). 
The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you're surfing the web. We've long used Unicode as the internal format for all the text Google searches and process: any other encoding is first converted to Unicode. Version 6.1 just released with over 110,000 characters; soon we'll be updating to that version and to Unicode's locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index it would be nearly impossible—it'd be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.

Default reader gets wrong content

Ran into an issue with Pismo's default reader returning the wrong section of an HTML document for its body/html_body fields. It does work, however, with the cluster reader. This might be a good addition to the test corpus for the default reader.

http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story

The default reader seems to pull content from <div id="navbar"> rather than <div id="content">.

>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story")
>> doc.body
=> "* The T\n* Casinos\n* News by neighborhood\n* Crime\n* Fires\n* Boston Store\n* Photos\n* Boston English\n* Restrooms\n* Blogs"

>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story", :reader => :cluster)
>> doc.body
=> "Sour grapes at the Herald? With bonus gratuitous quote from some lawyer making accusations with no apparent facts behind them:\nIf he was a reporter on deadline and he's distracted and making phone calls and texting, then that's something that adds to his fault. You're not supposed to be distracted in a cab, you're supposed to focus fully on your job,\" said Douglas Sheff, a Boston personal injury lawyer and president-elect of the Massachusetts Bar Association.\nDoes the esquire have any proof the reporter was on deadline and making phone calls and texting right before the crash? If so, he and the Herald failed to produce it."

(Originally reported in feedbin/support#35)

segfaults when retrieving keywords from pismo document

Environment:
ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-darwin10.3.1]
pismo (0.6.2)
nokogiri (1.4.2)

irb session showing error:
http://gist.github.com/482550

I get this behavior repeatedly with lots of different URLs. Very strange...

Https support

Is there any possibility of supporting https links?
Example: doc = Pismo::Document.new('https://www.google.com/')
Thanks :)

EDIT:
Closed, reason: #21 (comment)

Slow Startup/Require

I've been benchmarking the require time for the libraries in an application I'm working on to reduce the startup time (which is over 30 seconds). It turns out Pismo, which we use only rarely, takes over 1.2 seconds on my relatively speedy 13" MacBook Air.

Here's a perftools.rb output of where that time is spent (running the following script):

require 'rubygems'
require 'pismo'
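
For comparison, a small sketch of measuring require time with the standard Benchmark module. time_require is a hypothetical helper, and 'json' stands in for 'pismo' so the example runs without extra gems.

```ruby
require 'benchmark'

# Hypothetical helper: seconds spent in the first `require` of a library.
# A library that is already loaded returns almost immediately.
def time_require(lib)
  Benchmark.realtime { require lib }
end

# 'json' stands in for 'pismo' so the sketch runs without extra gems.
puts format('%.3f s', time_require('json'))
```

For a rarely used library, one common workaround is to move the require call into the method that actually needs it, so the cost is paid on first use rather than at startup.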

pismo fails on many pages with encoding issues

burl="http://www.momfluential.net"
=> "http://www.momfluential.net"
ruby-1.9.2-p0 > pismo = Pismo[burl]
ArgumentError: invalid byte sequence in UTF-8
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `gsub!'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `clean_html'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:36:in `load'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:16:in `initialize'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `new'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `[]'
from (irb):58
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `block in require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `block in load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:596:in `new_constants_in'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
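
The ArgumentError above is what Ruby raises when a regex operation such as gsub! runs on a string tagged as UTF-8 that contains bytes that are not valid UTF-8. A minimal sketch of two generic ways to sanitize such a string first is shown below; it is not Pismo's actual fix, the sample bytes are made up, and String#scrub requires Ruby 2.1 or later (newer than the 1.9.2 in the report).

```ruby
# Bytes from a Latin-1 page mislabeled as UTF-8: "\xE9" alone is not valid UTF-8.
raw = "caf\xE9 page".dup.force_encoding(Encoding::UTF_8)
puts raw.valid_encoding?

# Option 1: replace the invalid bytes (String#scrub, Ruby 2.1+).
scrubbed = raw.scrub('?')

# Option 2: if the real source encoding is known, transcode instead.
transcoded = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Regex operations are safe on either result.
puts scrubbed.gsub(/\s+/, ' ')
puts transcoded.gsub(/\s+/, ' ')
```

Scrubbing loses the original character, while transcoding preserves it but depends on knowing (or correctly guessing) the page's real encoding.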

Rails + Pismo

Is there a way to use this within a model?

Support HTML DEL and INS elements.

Noticed this when I was using the Pismo-powered ‘entry text extraction’ on Feedbin.

>> Pismo['http://hsivonen.iki.fi/accept-charset/'].lede
=> "Accept-Charset Is No More. Now that Firefox 10 has been released, the Accept-Charset HTTP header. During the Firefox 4 development cycle, I noticed that IE and Safari were not sending the Accept-Charset HTTP header in their HTTP requests. This meant that the Web had to work even without browser sending that header."

The first sentence given by Pismo:

Now that Firefox 10 has been released, the Accept-Charset HTTP header.

Comes from the following HTML:

<p>Now that Firefox 10 has been released, <del>none of the major browsers send</del> <ins>only Chrome sends</ins> the <code>Accept-Charset</code> HTTP header.</p>

If anything I would have expected Pismo to drop the DEL elements but keep the INS elements like so:

Now that Firefox 10 has been released, only Chrome sends the Accept-Charset HTTP header.

Even html_body does not return these tags, which means that potentially important parts of a document can go missing. See this return value, edited to show only the first paragraph:

>> Pismo['http://hsivonen.iki.fi/accept-charset/'].html_body
 => "Accept-Charset Is No More<p>Now that Firefox 10 has been released,   the <code>Accept-Charset</code> HTTP header.</p>\n\n"

Please support the DEL and INS elements by:

Dropping DEL and its content from lede and body, but keeping the content of INS in both.
Keeping the DEL and INS elements and their content in html_body.
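
The first request can be sketched on the paragraph quoted above. text_with_revisions_applied is a hypothetical helper, not Pismo code, and it uses regexes only to stay dependency-free; a real implementation would more likely work on the parsed document.

```ruby
html = '<p>Now that Firefox 10 has been released, <del>none of the major ' \
       'browsers send</del> <ins>only Chrome sends</ins> the ' \
       '<code>Accept-Charset</code> HTTP header.</p>'

# Hypothetical helper (not Pismo API): plain text with DEL content dropped
# and INS content kept.
def text_with_revisions_applied(html)
  without_del = html.gsub(%r{<del\b[^>]*>.*?</del>}mi, '') # drop deletions and their content
  text = without_del.gsub(/<[^>]+>/, '')                   # strip remaining tags, keep their text
  text.squeeze(' ').strip                                  # tidy the gap left by the removed DEL
end

puts text_with_revisions_applied(html)
```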

Pismo doesn't allow redirection to https

Pismo Redirection Not Allowed

Is a fix planned for allowing redirects? Thanks!

"redirection forbidden: http://www.bettiepageclothing.com -> https://www.bettiepageclothing.com/"
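
Until redirects are handled inside the library, one possible workaround is to resolve the final URL first and pass that to Pismo. The sketch below uses Net::HTTP from the standard library; resolve_redirects is a hypothetical helper, and the fetch parameter exists only so the logic can be exercised without network access.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not Pismo API): follow HTTP redirects by hand,
# including http -> https, and return the final URL as a string.
def resolve_redirects(url, limit: 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url)
  limit.times do
    response = fetch.call(uri)
    return uri.to_s unless response.is_a?(Net::HTTPRedirection)

    # The Location header may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, response['location'])
  end
  raise ArgumentError, "too many redirects for #{url}"
end

# Usage (performs real requests, so it is left commented out):
#   final_url = resolve_redirects('http://www.bettiepageclothing.com')
#   doc = Pismo::Document.new(final_url)
```

This costs one extra request per hop, since the document is fetched again once the final URL is known.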

Image fetch is not working

"doc.images" call returns "nil" every time, even if there are valid images with absolute urls in the html page. The reader_doc.images array is empty every time.

(undefined method `first' for nil:NilClass ) when there is no top_content_candidate

When I tried to get images from this page, I got this exception:
http://verkoren.wordpress.com/2013/04/12/you-cant-skate-you-old/

<NoMethodError: undefined method `first' for nil:NilClass>

/app/vendor/pismo/pismo/reader/tree.rb:149:in `content_at'
/app/vendor/pismo/pismo/images/image_extractor.rb:28:in `initialize'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `new'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `images'
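
The error means first was called on nil. The actual code at tree.rb:149 is not shown here, so the following is only a generic sketch of a nil guard; first_candidate and its argument are hypothetical stand-ins.

```ruby
# Hypothetical stand-in for the failing call: `candidates` may be nil when
# the reader finds no top content candidate.
def first_candidate(candidates)
  Array(candidates).first # Array(nil) is [], so this returns nil instead of raising
end

p first_candidate(nil)
p first_candidate(%w[a b])
```

The caller still has to handle a nil result, for example by returning an empty image list.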

Problem with (CoffeeScript) Code

Hello,

I was just trying to use Pismo on a webpage with this code snippet in the article:

sys:   require 'sys'
http:  require 'http'
client: http.createClient 8080, 'localhost'
# start an active chain gang queue with 3 workers by default.
chain: require('chain-gang').create()

# downloads a web page, runs the callback when it's done.
get_path: (path, cb) ->
  req: client.request('GET', path, {host: 'localhost'})
  req.end()
  req.addListener 'response', (resp) ->
    resp.addListener 'data', (chunk) ->
      sys.puts chunk
    resp.addListener 'end', cb

# returns a chain gang job that downloads a web page and finishes the worker.
job: (timeout, name) ->
  (worker) ->
    get_path "/$timeout/$name", ->
      worker.finish()

But it came out as:

'sys'
'http'
'localhost'
# start an active chain gang queue with 3 workers by default.
'chain-gang'
# downloads a web page, runs the callback when it's done.
'GET'
'localhost'
'response'
'data'
'end'
# returns a chain gang job that downloads a web page and finishes the worker.
"/$timeout/$name"

'foo'
# queues the job with the unique name "foo_request"
'foo'
'foo_request'

When I ran html_body. :(