peterc / pismo
Extracts machine-readable metadata and content from Web pages
Home Page: http://coder.io/
License: Other
Maybe you want to add a case-sensitive matcher for looking up the favicon:
['link[@rel="Shortcut Icon"]', lambda { |el| el.attr('href') }],
https://github.com/fluxsaas/pismo/blob/master/lib/pismo/internal_attributes.rb#L36
Also, it might be nice to add the default favicon location, e.g.:
example.com/favicon.ico
It seems to be widely used...
Thanks for the library!
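A rough sketch of the idea, using only the standard library. The helper name `favicon_url` is hypothetical and not part of Pismo, and scanning tags with a regular expression is a simplification of what a real HTML parser would do. It compares `rel` case-insensitively and falls back to the conventional /favicon.ico location:

```ruby
require 'uri'

# Hypothetical helper, not part of Pismo: find a favicon URL in raw HTML.
def favicon_url(html, page_url)
  # Look at every <link ...> tag and compare its rel value case-insensitively.
  html.scan(/<link\b[^>]*>/i).each do |tag|
    rel  = tag[/\brel\s*=\s*["']([^"']*)["']/i, 1]
    href = tag[/\bhref\s*=\s*["']([^"']*)["']/i, 1]
    next unless rel && href
    return URI.join(page_url, href).to_s if rel.downcase.split.include?('icon')
  end
  # No matching <link> tag: fall back to the default location on the same host.
  URI.join(page_url, '/favicon.ico').to_s
end
```

With this approach "Shortcut Icon", "shortcut icon" and "icon" are all treated the same, so no extra matcher per capitalization is needed.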
coder.io is not accessible
You can try this page:
http://huacnlee.com/blog/move-rails-project-from-linux-file-system-to-mongodb-gridfs
It only gets keywords in English.
Is it possible to add support for different languages?
Maybe some kind of API / setting for it?
Would it be good to add some sort of "doc.links" method?
The purpose is to get a list of the links inside the document.
Thanks,
Dimas
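As an illustration of what such a method might return, here is a minimal standard-library sketch. `extract_links` is a hypothetical name, not an existing Pismo method, and the regular expression is only an approximation of proper HTML parsing:

```ruby
require 'uri'

# Hypothetical helper, not part of Pismo's API: list absolute link URLs.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten
  hrefs.map do |href|
    begin
      # Resolve relative links against the page URL.
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil # skip hrefs that cannot be parsed as URIs
    end
  end.compact.uniq
end
```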
Pismo outputs ? rather than the actual character.
Example site: http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
Output:
Computers store every piece of text using a "character encoding," which gives a number to each character. For example, the byte stands for 'a' and stands for 'b' in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, could mean any of ¡, ?, ?, ?, ', ", or parts of thousands of characters, from æ to ?. If you brought a file from one computer to another, it could come out as gobbledygook.Unicode was invented to solve that problem: to encode all human languages, from Chinese (??) to Russian (???????) to Arabic (???????), and even emoji symbols like or ; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn't even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted—little did anyone realize at the time they would be so important for each other. Today, people can easily share documents on the web, no matter what their language. Every January, we look at the percentage of the webpages in our index that are in different encodings. Here's what our data looks like with the latest figures*: *Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data. As you can see, Unicode has experienced an 800 percent increase in "market share" since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). 
The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you're surfing the web. We've long used Unicode as the internal format for all the text Google searches and process: any other encoding is first converted to Unicode. Version 6.1 just released with over 110,000 characters; soon we'll be updating to that version and to Unicode's locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index it would be nearly impossible—it'd be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
Ran into an issue with Pismo's default reader returning the wrong section of an HTML document for its body/html_body fields. It does work, however, with the cluster reader. This might be a good addition to the test corpus for the default reader.
http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story
The default reader seems to pull content from <div id="navbar"> rather than <div id="content">.
>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story")
>> doc.body
=> "* The T\n* Casinos\n* News by neighborhood\n* Crime\n* Fires\n* Boston Store\n* Photos\n* Boston English\n* Restrooms\n* Blogs"
>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story", :reader => :cluster)
>> doc.body
=> "Sour grapes at the Herald? With bonus gratuitous quote from some lawyer making accusations with no apparent facts behind them:\nIf he was a reporter on deadline and he's distracted and making phone calls and texting, then that's something that adds to his fault. You're not supposed to be distracted in a cab, you're supposed to focus fully on your job,\" said Douglas Sheff, a Boston personal injury lawyer and president-elect of the Massachusetts Bar Association.\nDoes the esquire have any proof the reporter was on deadline and making phone calls and texting right before the crash? If so, he and the Herald failed to produce it."
(Originally reported in feedbin/support#35)
Environment:
ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-darwin10.3.1]
pismo (0.6.2)
nokogiri (1.4.2)
irb session showing error:
http://gist.github.com/482550
I get this behavior repeatedly with lots of different URLs. Very strange...
Is it possible to support https links?
Example: doc = Pismo::Document.new('https://www.google.com/')
Thanks :)
EDIT:
Closed, reason: #21 (comment)
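One possible workaround for https pages is to fetch the HTML yourself with Net::HTTP and hand the result on for extraction. This is only a sketch: `fetch_following_redirects` is a hypothetical helper, not part of Pismo, and the redirect limit of 5 is an arbitrary assumption. It also follows redirects, including ones from http to https.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not part of Pismo: GET a URL and follow redirects.
# An optional block can replace the real HTTP request (handy for testing).
def fetch_following_redirects(url, limit = 5, &requester)
  raise ArgumentError, 'too many redirects' if limit <= 0
  # Net::HTTP.get_response switches to TLS when the URI scheme is https.
  requester ||= lambda { |uri| Net::HTTP.get_response(uri) }
  response = requester.call(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1, &requester)
  else
    response.body
  end
end
```

Usage would look like html = fetch_following_redirects('https://www.google.com/').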
I've been benchmarking the require time for the libraries in an application I'm working on to reduce the startup time (which is over 30 seconds). It turns out Pismo, which we use only rarely, takes over 1.2 seconds on my relatively speedy 13" Macbook Air.
Here's a perftools.rb output of where that time is spent (running the following script):
require 'rubygems'
require 'pismo'
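For anyone who wants to reproduce the measurement without perftools.rb, a minimal sketch with the standard Benchmark module. `require_time` is a hypothetical helper name, and the result will differ per machine:

```ruby
require 'benchmark'

# Hypothetical helper: wall-clock seconds spent loading one library.
# A library that is already loaded returns almost immediately.
def require_time(library)
  Benchmark.realtime { require library }
end

# Example (assumes the pismo gem is installed):
#   puts require_time('pismo')
```

If the library is only needed rarely, moving its require into the method that actually uses it keeps the cost out of application startup.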
burl="http://www.momfluential.net"
=> "http://www.momfluential.net"
ruby-1.9.2-p0 > pismo = Pismo[burl]
ArgumentError: invalid byte sequence in UTF-8
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `gsub!'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `clean_html'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:36:in `load'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:16:in `initialize'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `new'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `[]'
from (irb):58
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `block in require'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `block in load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:596:in `new_constants_in'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:225:in `load_dependency'
from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.3/lib/active_support/dependencies.rb:239:in `require'
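A possible workaround until this is handled inside the library: make sure the fetched bytes are valid UTF-8 before running any regular expression over them. The sketch below is an assumption about how one might do that, not Pismo's own code. `to_valid_utf8` is a hypothetical helper, and treating undecodable pages as ISO-8859-1 is a guess that will not suit every site:

```ruby
# Hypothetical helper, not part of Pismo: return a valid UTF-8 copy of raw bytes.
def to_valid_utf8(raw, fallback = 'ISO-8859-1')
  text = raw.dup.force_encoding('UTF-8')
  return text if text.valid_encoding?
  # Assumption: bytes that are not valid UTF-8 are in the fallback encoding.
  raw.dup.force_encoding(fallback).encode('UTF-8', :invalid => :replace, :undef => :replace)
end
```

After this conversion, calls such as gsub! no longer raise "invalid byte sequence in UTF-8", although characters may be wrong if the fallback encoding was guessed incorrectly.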
Is there a way to use this within a model?
Noticed this when I was using the Pismo powered ‘entry text extraction’ on Feedbin.
>> Pismo['http://hsivonen.iki.fi/accept-charset/'].lede
=> "Accept-Charset Is No More. Now that Firefox 10 has been released, the Accept-Charset HTTP header. During the Firefox 4 development cycle, I noticed that IE and Safari were not sending the Accept-Charset HTTP header in their HTTP requests. This meant that the Web had to work even without browser sending that header."
The first sentence given by Pismo:
Now that Firefox 10 has been released, the Accept-Charset HTTP header.
Comes from the following HTML:
<p>Now that Firefox 10 has been released, <del>none of the major browsers send</del> <ins>only Chrome sends</ins> the <code>Accept-Charset</code> HTTP header.</p>
If anything I would have expected Pismo to drop the DEL elements but keep the INS elements like so:
Now that Firefox 10 has been released, only Chrome sends the Accept-Charset HTTP header.
Even the html_body does not return these tags. This means possibly important parts of a document can go missing. See this return, edited to only show the first paragraph:
>> Pismo['http://hsivonen.iki.fi/accept-charset/'].html_body
=> "Accept-Charset Is No More<p>Now that Firefox 10 has been released, the <code>Accept-Charset</code> HTTP header.</p>\n\n"
Please support the DEL and INS elements by:
- Dropping DEL and its content from lede and body, but keeping the content of INS in both.
- Keeping DEL and INS elements and their content in html_body.
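To make the requested behaviour for lede and body concrete, here is a small sketch. `apply_edit_marks` is a hypothetical name, and the regular expressions are a simplification that assumes DEL and INS elements are not nested. This is not how Pismo currently works:

```ruby
# Hypothetical helper, not part of Pismo: apply DEL/INS edit marks to HTML.
def apply_edit_marks(html)
  html
    .gsub(%r{<del\b[^>]*>.*?</del>\s*}im, '') # drop DEL together with its content
    .gsub(%r{</?ins\b[^>]*>}i, '')            # unwrap INS but keep its content
end
```

Applied to the paragraph quoted above, this yields "Now that Firefox 10 has been released, only Chrome sends the <code>Accept-Charset</code> HTTP header." inside the P element.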
Any fix planned for allowing redirects? Thanks!
"redirection forbidden: http://www.bettiepageclothing.com -> https://www.bettiepageclothing.com/"
"doc.images" call returns "nil" every time, even if there are valid images with absolute urls in the html page. The reader_doc.images array is empty every time.
When I tried to get the images from this website, I got this exception:
http://verkoren.wordpress.com/2013/04/12/you-cant-skate-you-old/
/app/vendor/pismo/pismo/reader/tree.rb:149:in `content_at'
/app/vendor/pismo/pismo/images/image_extractor.rb:28:in `initialize'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `new'
/app/vendor/pismo/pismo/internal_attributes.rb:181:in `images'
Hello,
I was just trying to use Pismo on a webpage with this code snippet in the article:
sys: require 'sys'
http: require 'http'
client: http.createClient 8080, 'localhost'
# start an active chain gang queue with 3 workers by default.
chain: require('chain-gang').create()
# downloads a web page, runs the callback when it's done.
get_path: (path, cb) ->
  req: client.request('GET', path, {host: 'localhost'})
  req.end()
  req.addListener 'response', (resp) ->
    resp.addListener 'data', (chunk) ->
      sys.puts chunk
    resp.addListener 'end', cb
# returns a chain gang job that downloads a web page and finishes the worker.
job: (timeout, name) ->
  (worker) ->
    get_path "/$timeout/$name", ->
      worker.finish()
But it came out as:
'sys'
'http'
'localhost'
# start an active chain gang queue with 3 workers by default.
'chain-gang'
# downloads a web page, runs the callback when it's done.
'GET'
'localhost'
'response'
'data'
'end'
# returns a chain gang job that downloads a web page and finishes the worker.
"/$timeout/$name"
'foo'
# queues the job with the unique name "foo_request"
'foo'
'foo_request'
When I ran html_body. :(