
hadoop-xmlinputformatwithmultipletags's Introduction

Description

XMLInputFormat is an input format that reads records delimited by a specific begin/end tag pair. It is useful for giving the mapper logical records based on a single start tag and end tag, irrespective of the physical splits of the file.

Consider the following XML:

<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book>
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book> .....
</catalog>

Set the start tag to <book> and the end tag to </book> for the XMLInputFormat:

Configuration conf = new Configuration();
conf.set(XmlInputFormat.START_TAG_KEY, "<book>");
conf.set(XmlInputFormat.END_TAG_KEY, "</book>");
....
Job job = new Job(conf, "XML Processing");
job.setInputFormatClass(XmlInputFormat.class);
....

This ensures that every record passed to the mapper is one complete logical entity (here, one <book> element), irrespective of the physical split of the file on HDFS. The XML file on HDFS must, of course, be well-formed.
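How a record can stay whole when a physical split cuts through it can be sketched in plain Java. This is only an illustration under assumptions, not the library's actual record reader: the class `SplitRecordSketch` and the method `readSplit` are hypothetical names, and the sketch works on a `String` with character offsets instead of an HDFS byte stream. The assumed rule is that a record belongs to the split in which its start tag begins, and the reader is allowed to read past the split end to reach the matching end tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the library's actual record reader. */
class SplitRecordSketch {

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record may extend past splitEnd; it is still returned whole.
     */
    static List<String> readSplit(String data, String startTag, String endTag,
                                  int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = data.indexOf(startTag, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // no further record starts inside this split
            }
            int close = data.indexOf(endTag, begin + startTag.length());
            if (close < 0) {
                break; // unterminated record: input is not well-formed
            }
            int finish = close + endTag.length();
            records.add(data.substring(begin, finish));
            pos = finish;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book>A</book><book>B</book></catalog>";
        // Place the split boundary inside the first record's end tag.
        int boundary = xml.indexOf("</book>") + 2;
        System.out.println(readSplit(xml, "<book>", "</book>", 0, boundary));
        // prints [<book>A</book>]
        System.out.println(readSplit(xml, "<book>", "</book>", boundary, xml.length()));
        // prints [<book>B</book>]
    }
}
```

Under this rule each record is emitted exactly once: the first split completes the record it started, and the second split skips the tail of that record because no start tag begins there.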

However, this is not useful when you want to extract records based on multiple start and end tags.

Consider the following XML:

<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book>
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book> 
   <article>
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </article>
   <paper>
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </paper>
   <article>
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </article>
   <paper>
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </paper>
   
   .....
</catalog>

To grab all the entities within <book></book>, <article></article>, and <paper></paper>, use XMLInputFormatWithMultipleTags as follows:

Configuration conf = new Configuration();
conf.set(XmlInputFormatWithMultipleTags.START_TAG_KEYS, "<book>,<article>,<paper>");
conf.set(XmlInputFormatWithMultipleTags.END_TAG_KEYS, "</book>,</article>,</paper>");

Note that multiple tags are delimited by a comma (,).

Job job = new Job(conf, "XML Processing");
job.setInputFormatClass(XmlInputFormatWithMultipleTags.class);
....
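The effect of the comma-separated tag lists can be sketched in plain Java. This is a rough illustration under assumptions, not the library's implementation: the class `MultiTagSketch` and the method `extract` are hypothetical, and the sketch assumes that the i-th start tag pairs with the i-th end tag and that no tag contains a comma.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the library's actual implementation. */
class MultiTagSketch {

    /** Extracts, in document order, every record delimited by any of the tag pairs. */
    static List<String> extract(String data, String startTags, String endTags) {
        String[] starts = startTags.split(",");
        String[] ends = endTags.split(",");
        if (starts.length != ends.length) {
            throw new IllegalArgumentException("each start tag needs one end tag");
        }
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Find the start tag that occurs earliest at or after pos.
            int begin = -1;
            int pair = -1;
            for (int i = 0; i < starts.length; i++) {
                int at = data.indexOf(starts[i].trim(), pos);
                if (at >= 0 && (begin < 0 || at < begin)) {
                    begin = at;
                    pair = i;
                }
            }
            if (begin < 0) {
                break; // no configured start tag remains
            }
            String startTag = starts[pair].trim();
            String endTag = ends[pair].trim();
            int close = data.indexOf(endTag, begin + startTag.length());
            if (close < 0) {
                break; // unterminated record: input is not well-formed
            }
            int finish = close + endTag.length();
            records.add(data.substring(begin, finish));
            pos = finish;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book>A</book><article>B</article>"
                + "<paper>C</paper><note>D</note></catalog>";
        System.out.println(extract(xml, "<book>,<article>,<paper>",
                "</book>,</article>,</paper>"));
        // prints [<book>A</book>, <article>B</article>, <paper>C</paper>]
    }
}
```

The `<note>` element is skipped because its tag is not in the configured list, which mirrors the intent of selecting only the listed entities.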

hadoop-xmlinputformatwithmultipletags's People

Contributors

mohammed-siddiq

hadoop-xmlinputformatwithmultipletags's Issues

org.apache.hadoop.mapreduce.TaskAttemptContext

TaskAttemptContext has changed since Hadoop 2: it is now an interface, which prevents the program from running.
The following error is reported:

[Screenshot, 2020-10-22, 2:52:43 PM]

This refers to the variable 'tac' on line 39:

[Screenshot, 2020-10-22, 2:53:22 PM]

Note: I translated the code to Scala, but the error occurs in Java as well.
