
saxerator's Introduction

Saxerator

Saxerator is a streaming xml-to-hash parser designed for working with very large xml files by giving you Enumerable access to manageable chunks of the document.

Each xml chunk is parsed into a JSON-like Ruby Hash structure for consumption.

You can parse any valid xml in 3 simple steps.

  1. Initialize the parser
  2. Specify which tag you care about using a simple DSL
  3. Perform your work in an each block, or using any Enumerable method

Installation

  1. gem install saxerator
  2. Choose an xml parser
    • (default) Use ruby's built-in REXML parser - no other dependencies necessary
    • gem install nokogiri
    • gem install ox
  3. If not using the default, specify your adapter in the Saxerator configuration
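Step 3 might look like the sketch below. It assumes the ox gem is installed and that your Saxerator release supports the adapter setting listed under Configuration (older releases may not). The XML string is invented for illustration.

```ruby
require 'saxerator'

# Hypothetical input; the README passes both Strings and File objects to the parser.
xml = '<items><item>one</item><item>two</item></items>'

parser = Saxerator.parser(xml) do |config|
  config.adapter = :ox # or :nokogiri; leave this line out to keep the default
end

parser.for_tag(:item).each { |item| puts item }
```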

The DSL

The DSL consists of predicates that may be combined to describe which elements the parser should enumerate over. Saxerator will only enumerate over chunks of xml that match all of the combined predicates (see Examples section for added clarity).

Predicate | Explanation
all | Returns the entire document parsed into a hash. Cannot combine with other predicates
for_tag(name) | Elements whose name matches the given name
for_tags(names) | Elements whose name is in the names Array
at_depth(n) | Elements n levels deep inside the root of an xml document. The root element itself is n = 0
within(name) | Elements nested anywhere within an element with the given name
child_of(name) | Elements that are direct children of an element with the given name
with_attribute(name, value) | Elements that have an attribute with a given name and value. If no value is given, matches any element with the specified attribute name present
with_attributes(attrs) | Similar to with_attribute except takes an Array or Hash indicating the attributes to match

On any parsing error, Saxerator raises a Saxerator::ParseException with a message describing what is wrong with the XML document. Warning: REXML won't raise an error if the root element wasn't closed (this will be fixed in Ruby 2.5).

Examples

parser = Saxerator.parser(File.new("rss.xml"))

parser.for_tag(:item).each do |item|
  # where the xml contains <item><title>...</title><author>...</author></item>
  # item will look like {'title' => '...', 'author' => '...'}
  puts "#{item['title']}: #{item['author']}"
end

# a String is returned here since the given element contains only character data
puts "First title: #{parser.for_tag(:title).first}"

Attributes are stored as a part of the Hash or String object they relate to

# author is a String here, but also responds to .attributes
primary_authors = parser.for_tag(:author).select { |author| author.attributes['type'] == 'primary' }

You can combine predicates to isolate just the tags you want.

require 'saxerator'

parser = Saxerator.parser(bookshelf_xml)

# You can chain predicates
parser.for_tag(:name).within(:book).each { |book_name| puts book_name }

# You can re-use intermediary predicates
bookshelf_contents = parser.within(:bookshelf)

books = bookshelf_contents.for_tag(:book)
magazines = bookshelf_contents.for_tag(:magazine)

books.each do |book|
  # ...
end

magazines.each do |magazine|
  # ...
end

Configuration

Certain options are available via a configuration block at parser initialization.

Saxerator.parser(xml) do |config|
  config.output_type = :xml
end

Setting | Default | Values | Description
adapter | :nokogiri | :nokogiri, :oga, :ox, :rexml | The XML parser used by Saxerator
output_type | :hash | :hash, :xml | The type of object generated by Saxerator's parsing. :hash generates a Ruby Hash, :xml generates a REXML::Document
symbolize_keys! | n/a | n/a | Call this method if you want the hash keys to be symbols rather than strings
ignore_namespaces! | n/a | n/a | Call this method if you want to treat the XML document as if it has no namespace information. It differs slightly from strip_namespaces! since it deals with how the XML is processed rather than how it is output
strip_namespaces! | n/a | user-specified | Called with no arguments this strips all namespaces, or you may specify an arbitrary number of namespaces to strip, i.e. config.strip_namespaces! :rss, :soapenv
put_attributes_in_hash! | n/a | n/a | Call this method if you want xml attributes included as elements of the output hash - only valid with output_type = :hash

Known Issues

  • JRuby closes the file stream at the end of parsing; therefore, to perform multiple operations that each parse a file, you will need to instantiate a new parser with a new File object.

Other Documentation

FAQ

Why the name 'Saxerator'?

It's a combination of SAX + Enumerator.

Why use Saxerator over regular SAX parsing?

Much of the SAX parsing code I've written over the years has fallen into a pattern that Saxerator encapsulates: marshall a chunk of an XML document into an object, operate on that object, then move on to the next chunk. Saxerator alleviates the pain of marshalling and allows you to focus solely on operating on the document chunk.

Why not DOM parsing?

DOM parsers load the entire document into memory. Saxerator only holds a single chunk in memory at a time. If your document is very large, this can be an important consideration.

When I fetch a tag that has one or more elements, sometimes I get an Array, and other times I get a Hash or String. Is there a way I can treat these consistently?

You can treat objects consistently as arrays using Ruby's built-in array conversion method in the form Array(element_or_array)
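A small sketch of that conversion, using plain Ruby values as stand-ins for parsed output (an assumption: Saxerator's own element classes may behave differently depending on the version). The wrap helper is hypothetical and only there to show the Hash caveat.

```ruby
# Plain Ruby values stand in for parsed elements here.
single   = 'my_item'
multiple = ['my_item 1', 'my_item 2']

wrapped_single   = Array(single)   # => ["my_item"]
wrapped_multiple = Array(multiple) # => ["my_item 1", "my_item 2"]

# Caveat: for a plain Hash, Kernel#Array returns key/value pairs,
# so wrap explicitly when the value may be a Hash.
def wrap(value)
  value.is_a?(Array) ? value : [value]
end

pairs        = Array({ 'title' => 'x' }) # => [["title", "x"]]
wrapped_hash = wrap({ 'title' => 'x' })  # => [{"title"=>"x"}]
```

Either way, the calling code can then iterate with each without checking whether one or many elements were present.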

Why does Active Record fail when I pass a String value to a query?

Saxerator doesn't return a plain Array, Hash or String to you, but you can convert a value to the type you need by calling the usual .to_<type> method.
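The sketch below illustrates the general effect with Ruby's standard SimpleDelegator. FakeStringElement is a hypothetical stand-in, not Saxerator's actual class, and whether a given Saxerator release is built this way is an assumption. It shows why code that checks types can reject a wrapper that otherwise behaves like a String.

```ruby
require 'delegate'

# Hypothetical stand-in: wraps a String and forwards calls to it,
# but is not itself a String.
class FakeStringElement < SimpleDelegator; end

element = FakeStringElement.new('somestring')

equal_by_value = (element == 'somestring') # comparison is delegated
is_string      = element.is_a?(String)     # type check sees the wrapper
plain          = element.to_s              # explicit conversion
plain_is_string = plain.instance_of?(String)
```

Calling to_s (or to_h, to_a) at the boundary where a value leaves your parsing code keeps the rest of the application working with core types only.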

Contribution

For running tests for all parsers run rake spec:adapters

Acknowledgements

Saxerator was inspired by - but not affiliated with - nori and Gregory Brown's Practicing Ruby

Legal Stuff

Copyright © 2012-2020 Bradley Schaefer. MIT License (see LICENSE file).

saxerator's People

Contributors

doomspork, doublestranded, exaspark, fanantoxa, quoideneuf, soulcutter


saxerator's Issues

Output encoding

Ox can use the document encoding: <?xml version="1.0" encoding="Windows-1251" ?>.
I want to convert this encoding like this: string&.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
But my parsed item is a Saxerator::Builder::HashElement.
I can't do it this way: item.transform_values! { |value| value&.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?') }
because item can contain elements of types Saxerator::Builder::ArrayElement and Saxerator::Builder::HashElement recursively.
Moreover, I must also convert the encoding of the attributes.

So there are 2 ways to resolve this problem.

  1. Worst way: Add method deep_encode like https://apidock.com/rails/Hash/deep_merge
  2. Best way: Add output_encoding to saxerator configuration and convert on parsing
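For reference, the first option could be sketched roughly as follows. This is an illustration on plain Hash, Array and String values only; deep_encode is a hypothetical helper, and Saxerator's element classes and their attributes might need extra handling that is not shown here.

```ruby
# Recursively re-encode every String found in nested Hash/Array structures.
# Assumes plain Ruby collections; attributes stored outside the hash
# would need a separate pass.
def deep_encode(value, encoding = 'UTF-8')
  case value
  when Hash
    value.each_with_object({}) { |(k, v), out| out[k] = deep_encode(v, encoding) }
  when Array
    value.map { |v| deep_encode(v, encoding) }
  when String
    value.encode(encoding, invalid: :replace, undef: :replace, replace: '?')
  else
    value
  end
end

# Sample data: Cyrillic text stored as Windows-1251 bytes.
greeting = "\u041F\u0440\u0438\u0432\u0435\u0442"
source = {
  'name' => greeting.encode('Windows-1251'),
  'tags' => ["\u0442\u0435\u0441\u0442".encode('Windows-1251')]
}

converted = deep_encode(source)
```

The helper returns new objects instead of mutating the input, which avoids surprises if the parsed item is reused elsewhere.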

P.S. There is another question: why are we using these types? Can we simplify things like this: https://github.com/savonrb/gyoku

XML attribute not copied.

When saxing with xml output, the attributes of the current root element are not accessible, even though they are printed by element.to_s:

parser.for_tag(:product).each do |offer|
  puts offer.attributes('id').value
end

doesn't work, but:

parser.for_tag(:product).each do |offer|
  offer = Nokogiri::XML(offer.to_s).elements[0]
  puts offer.attributes('id').value
end

does.

xml:

<products>
  <product id="1"><name>test1</name></product>
  <product id="2"><name>test2</name></product>
  <product id="3"><name>test3</name></product>
</products>

dependencies fails

@soulcutter Since we have nokogiri and ox only as dev deps:

s.add_development_dependency 'nokogiri', '>= 1.4.0'
s.add_development_dependency 'ox'

So we either have to make them regular (not only development) dependencies, or add a warning to Readme.md that one of them (or both) has to be added to the project manually.

How can I get a count of elements?

parser = Saxerator.parser(File.new(file))
parser.for_tag('RECORD').each_with_index do |item, index|
end

How can I get a count of all parser.for_tag('RECORD') elements?
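Since the README describes the result of for_tag as Enumerable, calling count on it (for instance parser.for_tag('RECORD').count) would presumably answer this, at the cost of one full pass over the document. The sketch below shows the idea on a hypothetical stand-in class, FakeRecords, so that it runs without the gem; it is not Saxerator's implementation.

```ruby
# Hypothetical stand-in for the object returned by for_tag:
# it defines #each and includes Enumerable, nothing more.
class FakeRecords
  include Enumerable

  def initialize(size)
    @size = size
  end

  def each
    return to_enum(:each) unless block_given?

    @size.times { |i| yield({ 'id' => i.to_s }) }
  end
end

records = FakeRecords.new(3)

total       = records.count                      # walks every element once
with_id_one = records.count { |r| r['id'] == '1' } # count with a condition
```

Note that count never builds an array of all elements, so memory use stays at one chunk at a time. On JRuby, the Known Issues section suggests a fresh parser and File object would be needed for any further pass.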

Rubocop config

@soulcutter Do you use Rubocop?
Maybe we can create a config file and CI integration for it, so that we all follow the same code style?

How to ignore encoding error?

For example, I get this error:

error on line 1890 at column 14: Encoding error

It is an undefined symbol.

How can I replace it with a blank, ignore it, or skip to the next tag?
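One possible workaround, sketched below, is to clean the input with Ruby's String#scrub before handing it to the parser. This assumes the document fits in memory as a String, which gives up the streaming benefit for very large files, and whether a given adapter then parses the result without complaint is not verified here.

```ruby
# A byte sequence that is not valid UTF-8 (0xFF never occurs in UTF-8).
raw = "<title>caf\xFF</title>".dup.force_encoding(Encoding::UTF_8)

valid_before = raw.valid_encoding? # false

blanked  = raw.scrub('')  # drop the invalid bytes
replaced = raw.scrub('?') # or substitute a placeholder character
```

The cleaned String can then be passed to Saxerator.parser in place of the original input.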

Implement Ox support

It would be nice to support parsers other than Nokogiri. Ox in particular is supposed to have great performance, and so would be a good first candidate.

Check Configuration class for useless functionality

@soulcutter Check whether we still need the strip_namespaces! method and what we can do with hash_key_normalizer. Lambdas might slow the gem down.

Let me know your decision. If you have no time to update the code, just write down your suggestions here and I'll update it.

Bug with put_attributes_in_hash and repeated elements

Hey @soulcutter, I think I've stumbled onto a bug with put_attributes_in_hash!, check out the following XML:

<Listing>
    <Amenity AmenityID="49" AmenityName="Elevator in Building" />
    <Amenity AmenityID="42" AmenityName="Exercise Facility" />
    <Amenity AmenityID="39" AmenityName="Other">1 Sauna</Amenity>
    <Amenity AmenityID="38" AmenityName="Parking" />
    <Amenity AmenityID="41" AmenityName="Pool">2 Pools</Amenity>
    <Amenity AmenityID="37" AmenityName="Unfurnished" />
</Listing>

Which when run through Saxerator outputs this Hash:

 :Amenity=>
  [{:AmenityID=>"49", :AmenityName=>"Elevator in Building"},
   {:AmenityID=>"42", :AmenityName=>"Exercise Facility"},
   "1 Sauna",
   {:AmenityID=>"38", :AmenityName=>"Parking"},
   "2 Pools",
   {:AmenityID=>"37", :AmenityName=>"Unfurnished"}],

The attributes are missing from the elements that contain both attributes and a value. Is this intended behavior?

Thanks!

it won't parse attributes when there's content inside tag

Hi
When parsing that piece

<offers>
<offer available="true" original_id="893169" type="vendor.model" id="7602443">
<delivery>true</delivery>
[...skip...]
<vendorCode>10131034  500</vendorCode>
<param name="Color">blue</param>
<param name="Collection">Winter 2014/2015</param>
<param name="Season">Winter</param>
<param name="Country">China</param>
<param name="Unit=INT">46/48</param>
<param name="Sex">Female</param>
<param name="Age">Adult</param>
<categoryId>1908</categoryId>
[...skip...]
</offer>

with that code:

parser = Saxerator.parser(xml) do |config|
  config.output_type = :hash
  config.symbolize_keys!
  config.put_attributes_in_hash!
end
items = parser.for_tag(:offer)

I get

puts items.first
{:delivery=>"true",
  [...skip...],
  :manufacturer_warranty=>"true", 
  :param=>["blue", "winter 2014/2015", "Winter", "China", "46/48", "Female", "Adult"], 
  :categoryId=>"1908",
  [...skip...]
}

instead of, for ex.

{ :param => [ { name:  "Color", "": "blue" }, 
  { name: "Collection", "": "Winter  2014/2015" },
  { name: "Season", "": "Winter"  },
  { name: "Sex", "": "Female" } ...], ...

etc.
Also that piece
<category id="1783" parentId="1781">Jeans</category>
is parsed simply as "Jeans" ignoring :id and :parentId

I really don't know exactly how that situation should be handled, but information is certainly lost, and this is unwanted.

Release new version.

@soulcutter I think we're ready to release a new version.
I've also updated the docs a bit to cover the new changes.
Since the gem has become self-sufficient, produces the same output for different adapters, has all the bugs we know of fixed, and contains huge breaking changes, maybe we can finally release a major version like 1.0.0 🎂?

What do you think about it?

Add namespace support

Also, 'config.symbolize_keys!' will generate a weird-looking symbol if the parsed tag has a namespace, e.g. :"ns:tag"
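The quoting comes from plain Ruby rather than from Saxerator: a symbol containing a colon cannot be written as a bare literal. A minimal sketch follows, where local_name_symbol is a hypothetical helper. The README's strip_namespaces! and ignore_namespaces! options are the built-in route and are not exercised here.

```ruby
key = 'ns:tag'

# The colon forces the quoted form when the symbol is inspected.
symbol = key.to_sym # => :"ns:tag"

# Hypothetical helper: keep only the local name before symbolizing.
def local_name_symbol(name)
  name.split(':').last.to_sym
end

local     = local_name_symbol(key)   # => :tag
unchanged = local_name_symbol('tag') # => :tag
```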

json output

Just wondering why saxerator doesn't feature a :json output option?

Attributes hashes should respect symbolize keys

e.g.

xml = <<-XML
<params>
  <param name="Color">blue</param>
  <param name="Collection">Winter 2014/2015</param>
</params>
XML
parser = Saxerator.parser(xml) do |config|
  config.output_type = :hash
  config.symbolize_keys!
end

parser.for_tag(:param).each do |param|
  param.attributes # => { "name" => "Color" } SHOULD BE { :name => "Color" }
end

Slow performance of lambdas

@soulcutter Lambdas are actually very slow and can't be optimized by the Ruby VM.
What would you think if I replaced the lambdas used for normalization with some abstract class?

==================================
  Mode: cpu(1000)
  Samples: 901 (0.00% miss rate)
  GC: 126 (13.98%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       107  (11.9%)         107  (11.9%)     Saxerator::Configuration#hash_key_normalizer

Does it help to parse large size xml?

Hi,
parser = Saxerator.parser(File.new("rss.xml"))
parser.for_tag(:item).each do |item|
  puts "#{item['title']}: #{item['author']}"
end

Can this code parse a 300MB xml file?

Optional errors

I need to parse some invalid files. For now I hack it this way:

module Saxerator
  module Adapters
    class Ox
      def error(message, _, _)
        # raise Saxerator::ParseException, message
      end
    end
  end
end

Can you add an ignore_errors option?

def error(message, _, _)
  return if ignore_errors
  raise Saxerator::ParseException, message
end

Slice does not work on HashElements

I expect slice to work like on a normal Ruby hash.
This now fails with the error:
ArgumentError: wrong number of arguments (given 0, expected 2)

If this is not possible and I should just convert it to a normal Ruby hash, that's fine, how do I do that?

Update XmlBuilder to use REXML class

We need a single type of object for all parsers when output_type is :xml.
So we need to update the code to return REXML instead of Nokogiri::XML, and update the code to use it for the ox parser too.
REXML: http://ruby-doc.org/stdlib-2.2.3/libdoc/rexml/rdoc/REXML/Document.html

Problem when passing String element to activeRecord.find_by

el = Saxerator::Builder::StringElement.new("somestring")
SomeModel.find_by!(some_tag: el)
# It'll fail and not pass the string
# but if
SomeModel.find_by!(some_tag: el.to_s)
# It'll work fine

So the problem is here:
https://github.com/rails/rails/blob/92703a9ea5d8b96f30e0b706b801c9185ef14f0e/activerecord/lib/active_record/relation/where_clause_factory.rb

Since we use delegation, we can't pass the element to where directly.
@soulcutter What do you think about it?

Should we do something to fix it, or just update the docs to say that StringElement != String?

can't convert Saxerator::Builder::StringElement to Array

According to the README:

You can treat objects consistently as arrays using Ruby's built-in array conversion method in the form Array(element_or_array)

Snippet:

require 'saxerator'

p RUBY_VERSION # => "2.3.1"
p Saxerator::VERSION # => "0.9.8"

xml = <<-XML
<root>
  <product>
    <items>
      <item>foo</item>
      <item>bar</item>
    </items>
  </product>
  <product>
    <items>
      <item>lonely</item>
    </items>
  </product>
</root>
XML

parser = Saxerator.parser(xml)

parser.for_tag("product").each do |node|
  p Array(node['items']['item'])
end

gives

`Array': can't convert Saxerator::Builder::StringElement to Array (Saxerator::Builder::StringElement#to_a gives Saxerator::Builder::ArrayElement) (TypeError)

Implement REXML support

REXML support would be nice since it is in the std lib, and so would reduce Saxerator's dependencies.

Can't specify adapter in config block

Trying the following:

Saxerator.parser("<title>Hello</title>") do |config|
  config.adapter = :ox
  config.output_type = :hash
end

I get hit with:

NoMethodError: undefined method `adapter=' for #<Saxerator::Configuration:0x007f8004c236f8>
  from (irb):5:in `block in irb_binding'
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/saxerator-0.9.5/lib/saxerator.rb:32:in `parser'
  from (irb):5
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/railties-5.0.0/lib/rails/commands/console.rb:65:in `start'
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/railties-5.0.0/lib/rails/commands/console_helper.rb:9:in `start'
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/railties-5.0.0/lib/rails/commands/commands_tasks.rb:78:in `console'
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/railties-5.0.0/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
  from /Users/home/.rvm/gems/ruby-2.3.1/gems/railties-5.0.0/lib/rails/commands.rb:18:in `<top (required)>'
  from bin/rails:4:in `require'
  from bin/rails:4:in `<main>'

Am I missing something or is this a bug?

Check performance

@soulcutter I've worked with ox for some time, and it performed better than what we currently see in the benchmark.
Usually it parses xml 2 to 10 times faster, so I think we may have some performance issues here.

You can check another benchmark here: https://gist.github.com/danneu/3977120 .
It's a bit old, but the numbers really should be like this.
I'll try to update it to use our xml example and compare.

HashElement vs StringElement

Suppose you read an XML file with some blank nodes:

require 'saxerator'

data = <<-DATA
  <products>
    <product>
      <name>iPhone 5S</name>
      <ean>1234567890</ean>
    </product>
    <product>
      <name>XBOX 360</name>
      <ean />
    </product>
  </products>
DATA

parser = Saxerator.parser(data)

parser.for_tag(:product).each do |item|
  p item['ean']
end

# => "1234567890"
# => {}

Wouldn't it make more sense to return a StringElement with the value of nil instead of returning an empty HashElement? It won't work well with the put_attributes_in_hash! option but anyways.

Nested DocumentFragments

Hi, I have a question:
Can we somehow change the DocumentFragment#each method to stop parsing a source?
What do you think about a new behavior that allows us to use it like this:

parser.for_tag(:item).each do |item|
  # where the xml contains <item><author><name>...</name></author></item>
  item.for_tag(:author).each do |author|
    puts author.to_h["name"]
  end
end

Maybe it's hard to implement, and of course these changes break compatibility. But I personally prefer to keep using Saxerator's objects while iterating instead of dealing with nested hashes until I really want them. What do you think? :)

Attributes are lost on single child

It might be me doing something wrong, but when a node can have zero or more children and there is only one child, the parsed object becomes an array without its attributes being present.

For example, in case of the following xml the children of the first node are returned separately as objects inspected as {:N=>[{}, {}, {}]}, with .attributes['F'] returning 189 and 190 respectively. In the second case an array is returned like [:N, [{}, {}, {}]] and the F attribute is no longer accessible (undefined method 'attributes' for Array).

<Root>
    <N F="1" T="t">
        <N K="p" T="n" V="4"/>
        <N K="t" T="t">
            <N F="189" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
            <N F="190" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
        </N>
    </N>
    <N F="2" T="t">
        <N K="p" T="n" V="8"/>
        <N K="t" T="t">
            <N F="195" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
        </N>
    </N>
</Root>

Doc improvement

Change:

Saxerator.parse(xml) do |config|
  # config object has accessors for various settings
end

to an example, because it was unclear how to change settings.

Saxerator.parser(xml) do |config|
  config.output_type = :xml
end

Also note that the current docs state Saxerator.parse instead of Saxerator.parser

Question about parsing HTML elements

Hi, Bradley!

Great job on this. I was doing some similar XML parsing myself at work and decided to try your gem. It works great; there is just one minor thing that is confusing me. Let's say I have a collection of summary elements that I am traversing and processing. Each summary is an HTML element, like this real one:

<summary id=1071025>
                <p>Even the most jaded visitor may thrill in the Chinese's famous forecourt, where generations of screen legends have left their imprints in cement: feet, hands, dreadlocks (Whoopi Goldberg), and even magic wands (the young stars of the <em>Harry Potter</em> films). Actors dressed as Superman, Marilyn Monroe and the like pose for photos (for tips), and you may be offered free tickets to TV shows.</p>
                <p>The theater is on the <strong>Hollywood Walk of Fame</strong>, which honors over 2000 celebrities with stars embedded in the sidewalk. Other historic theaters include the flashy <a href="/pois/1130703/lang/en" class="poi inline"><name>El Capitan Theater</name></a> and the 1922 <a href="/pois/379895/lang/en" class="poi inline"><name>Egyptian Theater</name></a>, home to American Cinematheque, which presents arty retros and Q&amp;As with directors, writers and actors.</p>
              </summary>

Let's suppose what I want to do is update a database with the HTML content of each summary. When I am iterating through my document with @doc.for_tag(:summary).each do |summary|, the summaries are Saxerator::HashWithAttributes with the tree structure of the parsed HTML.

That was the setup. The question is: is there an easy way to get back the HTML (something like summary.inner_html in Nokogiri), or to tell saxerator not to parse what is inside summary and to treat it as a string?

Other than that, Saxerator did an excellent job parsing some gnarly XML files!

Thanks again,

Javier

XML list into array for single element

This is more of a question than an issue.

Consider this XML:

<root>
  <items>
    <item>my_item 1</item>
    <item>my_item 2</item>
  </items>
</root>

This will be turned into the following hash using saxerator

{
  :items=> {
    :item=> ["my_item 1", "my_item 2"]
  }
}

That's totally fine and works for me.

However: If you reduce the items to one element the problem occurs.

Consider this:

<root>
  <items>
    <item>my_item</item>
  </items>
</root>

The accordingly converted hash looks like this:

{
  :items=> {
    :item=> "my_item"
  }
}

See any differences?

The value of :item is not wrapped in an array. This can lead to problems when using this data structure generically.

So my question is:

Is there a possibility to turn on wrapping in arrays or something similar?

Ox SAX Parser not Raising Errors

Hi There,

I was debugging some faulty XML (invalid byte sequence for UTF-8) and noticed that Saxerator just ignored the error and stopped processing. It would process all nodes successfully up to that point, and then stop finding more elements.

At first I thought that was Saxerator having an issue, but upon closer inspection it turns out that is how the Ox SAX parser is configured by default. In short, unless an error method is defined for the Ox SAX parser it will silently close the remaining tags. You can check out (ohler55/ox#166) to get more details.

How do people feel about adding the error method and raising an exception when an error is found by the Ox parser?
