nazgul33 / impala-get-json-object-udf
A UDF for Cloudera Impala ( hive get_json_object equivalent )
When testing with Impala cdh5-2.6.0_5.8.0 on Debian 7 (wheezy) x64, I get a segfault on most calls:
> select json_get_object('{"name":"steven"}', '$.name');
Query: select json_get_object('{"name":"steven"}', '$.name')
Error communicating with impalad: TSocket read 0 bytes
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000811535, pid=8740, tid=140217639298816
#
# JRE version: Java(TM) SE Runtime Environment (7.0_80-b15) (build 1.7.0_80-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.80-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [impalad+0x411535] rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>::Malloc(unsigned long)+0x15
...
However, the following does not fail:
> select json_get_object('42', '$');
Query: select json_get_object('42', '$')
+------------------------------------+
| default.json_get_object('42', '$') |
+------------------------------------+
| 42 |
+------------------------------------+
A similar segfault was believed to be caused by multiple versions of rapidjson being present. Impala includes the old rapidjson version 0.11, while impala-get-json-object-udf seems to ship version 1.0.2. If this really is the root cause, I wonder why I did not have such issues with Impala 2.2.0, which also ships rapidjson 0.11.
The table data looks like this:
user_id | real_name | auth_status | extend_info
---|---|---|---
20005140 | d3 | 3 | {"kill": false, "memberType": 1}
20004911 | d34 | 3 | {"kill": false, "memberType": 1}
20005136 | d44 | 3 | {"kill": false, "killTime": "2018-02-10 10:10:54", "memberType": 3, "memberExpireTime": "2024-02-28 00:00:00"}
20004905 | autotest | 3 | {"kill": false, "killTime": "2018-03-23 00:00:00", "memberType": 1}
20005133 | autotest2 | 3 | {"kill": false, "memberType": 1}
This SQL runs correctly:
select c1.username,c1.real_name,nvl2(c2.username,'0','1') as total,c2.user_id,c2.nn from consignor c1
left outer join
(select user_id,username, json_get_object(extend_info,'$.kill') as nn from consignor
) c2
on c1.user_id=c2.user_id where c2.username is NULL;
This SQL fails. When I run it, the Impala daemon crashes:
select c1.username,c1.real_name,nvl2(c2.username,'0','1') as total,c2.user_id,c2.nn from consignor c1
left outer join
(select user_id,username, json_get_object(extend_info,'$.kill') as nn from consignor
where json_get_object(extend_info,'$.kill')='false' ) c2
on c1.user_id=c2.user_id where c2.username is NULL;
Error message: Could not connect to AvatarTest2:21050 (code THRIFTTRANSPORT): TTransportException('Could not connect to AvatarTest2:21050',)
AvatarTest2 is my computer's hostname.
It seems the JSON function cannot be used in a WHERE condition?
Based on:
CDH 14.2
HUE 3.9
IMPALA 2.11.0
Impala version 2.8
UDF breaks connection upon trying to deal with nested arrays
example JSON:
{"customer_info":[{"field_name":"family_names","field_value":"Gonzalez"},{"field_name":"given_names","field_value":"Pablo"}],"phone":null}
This works:
select json_get_object('{"customer_info":[{"field_name":"family_names","field_value":"Gonzalez"},{"field_name":"given_names","field_value":"Pablo"}],"phone":null}','$.customer_info') ;
But this breaks Impala:
select json_get_object('{"customer_info":[{"field_name":"family_names","field_value":"Gonzalez"},{"field_name":"given_names","field_value":"Pablo"}],"phone":null}','$.customer_info.field_name') ;
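To make the failing case concrete, here is a minimal Python sketch of how a dotted path such as `$.customer_info.field_name` could be evaluated defensively. It is not the UDF's actual C++ code. The function name only mirrors the SQL function for readability, and returning NULL (`None`) when a member step lands on an array is an assumption about one reasonable behavior, not a statement of what Hive or this UDF does.

```python
import json


def get_json_object(json_text, path):
    """Hypothetical evaluator for paths of the form '$', '$.a', '$.a.b'.

    Returns a string, or None (SQL NULL) when the path cannot be resolved.
    """
    if not path.startswith("$"):
        return None
    rest = path[1:]
    if rest and not rest.startswith("."):
        return None  # only dotted member access is handled in this sketch
    try:
        node = json.loads(json_text)
    except (TypeError, ValueError):
        return None  # NULL or malformed input yields NULL instead of a crash

    keys = rest.split(".")[1:] if rest else []
    for key in keys:
        # A member lookup is only valid on a JSON object. If the current
        # node is an array or a scalar, give up with NULL (assumed behavior).
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]

    if node is None:
        return None
    if isinstance(node, str):
        return node
    return json.dumps(node)  # objects, arrays, numbers, booleans as JSON text


doc = ('{"customer_info":[{"field_name":"family_names","field_value":"Gonzalez"},'
       '{"field_name":"given_names","field_value":"Pablo"}],"phone":null}')

whole_array = get_json_object(doc, "$.customer_info")            # JSON text of the array
member_of_array = get_json_object(doc, "$.customer_info.field_name")  # None in this sketch
```

The point of the sketch is the type check inside the loop: the path step `field_name` is applied to an array, so an evaluator that assumes an object at that position has nothing valid to look up.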