nikkou's Introduction

Nikkou

Extract useful data from HTML and XML with ease!

Description

Nikkou adds methods to Nokogiri that make it easier to extract commonly used data from HTML and XML. It lets you turn HTML into structured data quickly, and it integrates with Mechanize.

Installation

Add Nikkou to your Gemfile:

gem 'nikkou'

Method Overview

Here's a summary of the methods Nikkou provides (see "Methods" for details):

Formatting

parse_text - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet

time(options={}) - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a time_zone option

url(attribute='href') - Converts the href (or other specified attribute) into an absolute URL using the document's URI; <a href="/p/1">Link</a> yields http://mysite.com/p/1

Searching

attr_equals(attribute, string) - Finds nodes where the attribute equals the string

attr_includes(attribute, string) - Finds nodes where the attribute includes the string

attr_matches(attribute, pattern) - Finds nodes where the attribute matches the pattern

drill(*methods) - Nil-safe method chaining

find(path) - Same as search but returns the first matched node

text_equals(string) - Finds nodes where the text equals the string

text_includes(string) - Finds nodes where the text includes the string

text_matches(pattern) - Finds nodes where the text matches the pattern

Methods

Formatting

time(options={})

Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time

Options

attribute

The attribute to parse:

# <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
doc.search('a').first.time(attribute: 'data-published-at')

time_zone

The document's time zone (the time will be converted from that to UTC):

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time(time_zone: 'America/New_York')
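The conversion described above (document time zone to UTC) can be sketched with Ruby's standard library alone. This is only an illustration of the idea, not Nikkou's implementation. It assumes a fixed UTC-04:00 offset (New York during daylight saving time) instead of a named zone, and the variable names are made up for the example.

```ruby
require 'time'

# Assumption: the page shows New York wall-clock time during daylight saving
# time, so a fixed -04:00 offset is used. A named zone such as
# 'America/New_York' would also account for DST changes.
local_text = '2013-05-22 02:42:34'

# Attach the offset to the text, parse it, then convert to UTC.
local_time = Time.strptime("#{local_text} -0400", '%Y-%m-%d %H:%M:%S %z')
utc_time = local_time.getutc

puts utc_time.iso8601 # 2013-05-22T06:42:34Z
```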

url(attribute='href')

Returns an absolute URL; useful for parsing relative hrefs. The document's uri needs to be set for Nikkou to know what domain to add to relative paths.

# <a href="/p/1">My link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"

If Mechanize is being used, the uri doesn't need to be manually set.

Options

attribute

The attribute to parse:

# <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
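For reference, the same kind of relative-to-absolute resolution can be reproduced with Ruby's standard URI library. This is a sketch of the general technique using the URLs from the examples above. Whether Nikkou uses URI.join internally is an assumption, not something this README states.

```ruby
require 'uri'

# The document's URI, as set via doc.uri in the examples above.
base = 'http://mysite.com/mypage'

# A root-relative path replaces the base path entirely.
link = URI.join(base, '/p/1').to_s

# Fragments are preserved.
comments = URI.join(base, '/p/1#comments').to_s

# A bare relative path replaces the last segment of the base path.
sibling = URI.join(base, 'other').to_s

puts link     # http://mysite.com/p/1
puts comments # http://mysite.com/p/1#comments
puts sibling  # http://mysite.com/other
```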

Searching

attr_equals(attribute, string)

Selects nodes where the specified attribute equals the string.

# <div data-type="news">My Text</div>
doc.attr_equals('data-type', 'news').first.text # "My Text"

attr_includes(attribute, string)

Selects nodes where the specified attribute includes the string.

# <div data-type="major-news">My Text</div>
doc.attr_includes('data-type', 'news').first.text # "My Text"

attr_matches(attribute, pattern)

Selects nodes with an attribute matching a pattern. The pattern's matches are available in Node#matches.

# <span data-tooltip="3 Comments">My Text</span>
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
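The shape of the matches array above corresponds to what plain Ruby returns from MatchData#to_a: the full match first, then each capture group. The sketch below uses only core Ruby on the attribute value from the example. That Node#matches is built this way is an assumption based on the output shown.

```ruby
# The attribute value from the example above.
tooltip = '3 Comments'

# Case-insensitive match with one capture group for the number.
match = tooltip.match(/(\d+) comments/i)

captures = match.to_a # full match, then capture groups
count = match[1].to_i # first capture group as an integer

puts captures.inspect # ["3 Comments", "3"]
puts count            # 3
```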

drill(*methods)

Nil-safe method chaining. Replaces this:

node = doc.find('.count')
if node
  attribute = node.attr('data-count')
  if attribute
    return attribute.to_i
  end
end

With this:

return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
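To make the idea of nil-safe chaining concrete, here is a minimal sketch in plain Ruby. The helper drill_sketch is hypothetical and written only for this illustration. It is not Nikkou's source and may differ from the real drill. It runs on a nested Hash rather than a Nokogiri document so that it needs no gems.

```ruby
# Hypothetical helper for illustration only; Nikkou's own drill may differ.
# Applies each step in order and returns nil as soon as a step yields nil.
def drill_sketch(receiver, *steps)
  steps.each do |step|
    return nil if receiver.nil?

    # A step is either a bare method name or [method_name, *arguments].
    method_name, *args = Array(step)
    receiver = receiver.public_send(method_name, *args)
  end
  receiver
end

page = { 'count' => { 'data-count' => '12' } }

found = drill_sketch(page, [:dig, 'count'], [:dig, 'data-count'], :to_i)
missing = drill_sketch(page, [:dig, 'absent'], [:dig, 'data-count'], :to_i)

puts found.inspect   # 12
puts missing.inspect # nil
```

Without the nil guard, the second call would raise a NoMethodError when dig is sent to nil, which is the failure the nested if statements above protect against.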

find(path)

Same as search, but returns the first matched node. Replaces this:

nodes = node.search('h4')
if nodes
  return nodes.first
end

With this:

return node.find('h4')

text_includes(string)

Selects nodes where the text includes the string.

# <div data-type="news">My Text</div>
doc.text_includes('Text').first.text # "My Text"

text_matches(pattern)

Selects nodes with text matching a pattern. The pattern's matches are available in Node#matches.

# <a href="/p/1">3 Comments</a>
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]

License

Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.

nikkou's People

Contributors

asiniy, tombenner

nikkou's Issues

Unpublished changes

Hi, you fixed requiring ActiveSupport in b84812f, but this change never made it to RubyGems.

Can you bump the version and publish the current code?

Need `require 'pathname'`

Hi

my irb session

$ irb
2.1.2 :001 > require 'nikkou'
NameError: uninitialized constant Pathname
        from /usr/local/rvm/gems/ruby-2.1.2/gems/nikkou-0.0.3/lib/nikkou.rb:8:in `<top (required)>'
        from /usr/local/rvm/rubies/ruby-2.1.2/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:135:in `require'
        from /usr/local/rvm/rubies/ruby-2.1.2/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:135:in `rescue in require'
        from /usr/local/rvm/rubies/ruby-2.1.2/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:144:in `require'
        from (irb):1
        from /usr/local/rvm/rubies/ruby-2.1.2/bin/irb:11:in `<main>'

and, new session

$ irb
2.1.2 :001 > require 'pathname'
 => true
2.1.2 :002 > require 'nikkou'
 => true

It looks like require 'pathname' is missing from the gem.

Info:
  • nikkou 0.0.3
  • ruby 2.1.2p95
  • irb 0.9.6
  • linux kubuntu 14.04, kernel 3.13.0 x86_64

Error: uninitialized constant ActiveSupport::Autoload

After just requiring the gem

require 'nikkou'
p 'error'

I am getting this error:

/home/ubuntu/.rvm/gems/ruby-2.3.1/gems/activesupport-4.2.6/lib/active_support/number_helper.rb:3:in `<module:NumberHelper>': uninitialized constant ActiveSupport::Autoload (NameError)

System info:

$ ruby --version
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]

$ lsb_release -irc
Distributor ID: Ubuntu
Release:        16.04
Codename:       xenial

$ gem list activesupport nokogiri
activesupport (4.2.6, 4.2.5.2, 4.2.5.1, 3.2.22.2)
nokogiri (1.6.8, 1.6.7.2)
