brndnmtthws / hive-json-serde
This project is forked from rcongiu/hive-json-serde
Read - Write JSON SerDe for Apache Hive.
License: Other
JsonSerde - a read/write SerDe for JSON Data
AUTHOR: Roberto Congiu <[email protected]>

Serialization/Deserialization module for Apache Hadoop Hive

This module allows Hive to read and write data in JSON format (see http://json.org for more info).

Features:
* Read data stored in JSON format
* Convert data to JSON format when INSERT INTO table
* Arrays and maps are supported
* Nested data structures are also supported

COMPILE

Use maven to compile the serde.

$ mvn package

If you want to compile the serde against a different version of the cloudera libs, use -D:

$ mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package

EXAMPLES

Example scripts with simple sample data are in src/test/scripts. Here are some excerpts:

* Query with complex fields like arrays

CREATE TABLE json_test1 (
    one boolean,
    three array<string>,
    two double,
    four string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'data.txt' OVERWRITE INTO TABLE json_test1 ;

hive> select three[1] from json_test1;
gold
yellow

* Nested structures

You can also define nested structures:

add jar ../../../target/json-serde-1.0-SNAPSHOT-jar-with-dependencies.jar;

CREATE TABLE json_nested_test (
    country string,
    languages array<string>,
    religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

-- data : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}

LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_nested_test ;

select * from json_nested_test;
-- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]}

select languages[0] from json_nested_test;
-- result: German

select religions['catholic'][0] from json_nested_test;
-- result: 10

* MALFORMED DATA

The default behavior on malformed data is to throw an exception.
For example, for malformed JSON like

{"country":"Italy","languages" "Italian","religions":{"catholic":"90"}}

you get:

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1]

This may not be desirable if you have a few bad lines you wish to ignore. If so, you can do:

ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");

With this property set, the query will not fail, and the above record will be returned as

NULL null null

* MAPPING HIVE KEYWORDS

Sometimes JSON data has attributes named like reserved words in Hive. For instance, you may have a JSON attribute named 'timestamp', which is a reserved word in Hive, and Hive will fail when issuing a CREATE TABLE. This SerDe can map Hive columns over attributes with different names, using SerDe properties. For instance:

CREATE TABLE mytable (
    myfield string,
    ts string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
STORED AS TEXTFILE;

Notice the "mapping.ts" property. It means: take the column 'ts' and read into it the JSON attribute named "timestamp".

# ARCHITECTURE

For the JSON encoding/decoding, I am using a modified version of Douglas Crockford's JSON library: https://github.com/douglascrockford/JSON-java which is included in the distribution. I had to make some minor changes to it, so I included it in my distribution and moved it to another package (since it's also included in Hive!).

The SerDe builds a series of wrappers around JSONObject. Since serialization and deserialization are executed for every record (possibly billions of them), we want to minimize object creation. So instead of serializing/deserializing to an ArrayList, I kept the JSONObject and built a cached ObjectInspector around it. When deserializing, Hive gets a JSONObject and a JSONStructObjectInspector to read from it.
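
To make the behavior described above concrete, here is a small Python model of it. It is only a sketch and not the SerDe's Java code: the names JsonRow and deserialize are hypothetical, a plain dict stands in for JSONObject, and the keyword arguments stand in for the "mapping.<column>" and "ignore.malformed.json" properties. It shows a row view that reads columns straight from the parsed object instead of copying them into a list first.

```python
import json


class JsonRow:
    """Read-only view over one parsed JSON object.

    Hypothetical stand-in for the JSONObject + struct inspector pair:
    the parsed object is kept as-is and fields are looked up on demand.
    """

    def __init__(self, parsed, columns, mapping):
        self._parsed = parsed      # the parsed JSON object, not copied
        self._columns = columns    # declared table columns, in order
        self._mapping = mapping    # column name -> JSON attribute name

    def get(self, column):
        # A column without a mapping reads the attribute of the same name.
        attr = self._mapping.get(column, column)
        # Attributes missing from the JSON read as None (NULL).
        return self._parsed.get(attr)

    def as_list(self):
        # Materialize all columns only when a caller asks for them.
        return [self.get(c) for c in self._columns]


def deserialize(line, columns, mapping=None, ignore_malformed=False):
    """Parse one text line into a JsonRow.

    With ignore_malformed=False a bad line raises ValueError;
    with ignore_malformed=True it yields a row of all-None columns.
    """
    try:
        parsed = json.loads(line)
        if not isinstance(parsed, dict):
            raise ValueError("Row is not a valid JSON Object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        if not ignore_malformed:
            raise
        parsed = {}
    return JsonRow(parsed, columns, mapping or {})
```

In this model, deserialize(line, ["myfield", "ts"], {"ts": "timestamp"}) reads the JSON attribute "timestamp" through the column 'ts', and passing ignore_malformed=True turns the malformed Italy record into a row whose columns are all None.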

Hive has Structs, Maps, Arrays and primitives, while JSON has Objects, Arrays and primitives. Hive Maps and Structs are both represented as JSON Objects, which are less restrictive than Hive maps: a JSON Object could be a mix of keys and values of different types, while Hive expects you to declare the type of the map (example: map<string,string>). The user is responsible for making the JSON data structure match the Hive table declaration.

A more detailed explanation is on my blog: http://www.congiu.com/articles/json_serde

# CONTRIBUTING

I am using gitflow for the release cycle.

* THANKS

Thanks to Douglas Crockford for the liberal license for his JSON library, and thanks to my employer OpenX and my boss Michael Lum for letting me open source the code.

Versions:
1.0: initial release
1.1: fixed some string issues
1.1.1 (2012/07/03): fixed Map Adapter (get and put would call themselves...ooops)
1.1.2 (2012/07/26): Fixed issue with columns that are not mapped into JSON, reported by Michael Phung
1.1.4 (2012/10/04): Fixed issue #13, problem with floats, reported by Chuck Connell
1.1.6 (2013/07/10): Fixed issue #28, error after 'alter table add columns'