amirziai / flatten

Flatten JSON in Python

Home Page: https://pypi.python.org/pypi/flatten_json

License: MIT License

Python 100.00%

python flattened-objects pandas unflatten flattens json

flatten's Introduction

flatten_json

Flattens JSON objects in Python. flatten_json flattens the hierarchy in your object which can be useful if you want to force your objects into a table.

Installation

pip install flatten_json

flatten

Usage

Let's say you have the following object:

dic = {
    "a": 1,
    "b": 2,
    "c": [{"d": [2, 3, 4], "e": [{"f": 1, "g": 2}]}]
}

which you want to flatten. Just apply flatten:

from flatten_json import flatten
flatten(dic)

Results:

{'a': 1,
 'b': 2,
 'c_0_d_0': 2,
 'c_0_d_1': 3,
 'c_0_d_2': 4,
 'c_0_e_0_f': 1,
 'c_0_e_0_g': 2}
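
For intuition, the recursive idea behind this output can be sketched in a few lines. This is an illustrative re-implementation written for this example, not the library's actual source; the name flatten_sketch is hypothetical.

```python
def flatten_sketch(obj, separator="_"):
    """Illustrative flattening of nested dicts and lists (not the library's code)."""
    out = {}

    def walk(value, prefix):
        if isinstance(value, dict) and value:
            # Descend into each key, extending the key path.
            for key, child in value.items():
                walk(child, f"{prefix}{separator}{key}" if prefix else str(key))
        elif isinstance(value, list) and value:
            # List positions become integer components of the key.
            for index, child in enumerate(value):
                walk(child, f"{prefix}{separator}{index}" if prefix else str(index))
        else:
            # Scalars (and empty containers) are stored under the joined key path.
            out[prefix] = value

    walk(obj, "")
    return out


dic = {"a": 1, "b": 2, "c": [{"d": [2, 3, 4], "e": [{"f": 1, "g": 2}]}]}
result = flatten_sketch(dic)
```

Each nested level contributes one component to the final key, which is why list elements show up as c_0_d_0, c_0_d_1, and so on.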

Usage with Pandas

For the following object:

dic = [
    {"a": 1, "b": 2, "c": {"d": 3, "e": 4}},
    {"a": 0.5, "c": {"d": 3.2}},
    {"a": 0.8, "b": 1.8},
]

We can apply flatten to each element in the array and then use pandas to capture the output as a dataframe:

dic_flattened = [flatten(d) for d in dic]

which creates an array of flattened objects:

[{'a': 1, 'b': 2, 'c_d': 3, 'c_e': 4},
 {'a': 0.5, 'c_d': 3.2},
 {'a': 0.8, 'b': 1.8}]

Finally you can use pd.DataFrame to capture the flattened array:

import pandas as pd
df = pd.DataFrame(dic_flattened)

The final result as a Pandas dataframe:

	a	b	c_d	c_e
0	1	2	3	4
1	0.5	NaN	3.2	NaN
2	0.8	1.8	NaN	NaN

Custom separator

By default _ is used to separate nested elements. You can change this by passing the desired separator:

flatten({"a": [1]}, '|')

returns:

{'a|0': 1}

Ignore root keys

By default flatten goes through all the keys in the object. If you are not interested in output from a set of keys you can pass this set as an argument to root_keys_to_ignore:

dic = {
    'a': {'a': [1, 2, 3]},
    'b': {'b': 'foo', 'c': 'bar'},
    'c': {'c': [{'foo': 5, 'bar': 6, 'baz': [1, 2, 3]}]}
}
flatten(dic, root_keys_to_ignore={'b', 'c'})

returns:

{
    'a_a_0': 1,
    'a_a_1': 2,
    'a_a_2': 3
}

This feature can prevent unnecessary processing which is a concern with deeply nested objects.

unflatten

Reverses the flattening process. Example usage:

from flatten_json import unflatten

dic = {
    'a': 1,
    'b_a': 2,
    'b_b': 3,
    'c_a_b': 5
}
unflatten(dic)

returns:

{
    'a': 1,
    'b': {'a': 2, 'b': 3},
    'c': {'a': {'b': 5}}
}

Unflatten with lists

flatten encodes list positions as integer indices in the keys, which makes reversing the process ambiguous. Consider this flattened dictionary:

a = {'a': 1, 'b_0': 5}

Both {'a': 1, 'b': [5]} and {'a': 1, 'b': {0: 5}} are legitimate answers.

When you call unflatten_list, the dictionary is first unflattened, and then a post-processing step looks for a list pattern (zero-indexed consecutive integer keys) and transforms the matched values into a list.

Here's an example:

from flatten_json import unflatten_list
dic = {
    'a': 1,
    'b_0': 1,
    'b_1': 2,
    'c_a': 'a',
    'c_b_0': 1,
    'c_b_1': 2,
    'c_b_2': 3
}
unflatten_list(dic)

returns:

{
    'a': 1,
    'b': [1, 2],
    'c': {'a': 'a', 'b': [1, 2, 3]}
}
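
The list-detection step described above can be sketched roughly as follows. This is a simplified illustration under the assumption that the unflattened dictionary uses string keys such as '0' and '1'; the helper name dicts_to_lists is hypothetical and this is not the library's implementation.

```python
def dicts_to_lists(node):
    """Recursively turn dicts keyed '0', '1', ... (consecutive) into lists."""
    if not isinstance(node, dict):
        return node
    converted = {key: dicts_to_lists(value) for key, value in node.items()}
    if converted and all(isinstance(k, str) and k.isdecimal() for k in converted):
        by_index = {int(k): v for k, v in converted.items()}
        # Only a zero-based run without gaps counts as a list pattern.
        if sorted(by_index) == list(range(len(converted))):
            return [by_index[i] for i in range(len(by_index))]
    return converted


nested = {
    "a": 1,
    "b": {"0": 1, "1": 2},
    "c": {"a": "a", "b": {"0": 1, "1": 2, "2": 3}},
}
as_lists = dicts_to_lists(nested)
```

A dictionary such as {'1': 5} is left alone because its keys do not start at zero, which is one way to resolve the ambiguity conservatively.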

Command line invocation

>>> echo '{"a": {"b": 1}}' | flatten_json
{"a_b": 1}

>>> echo '{"a": {"b": 1}}' > test.json
>>> cat test.json | flatten_json
{"a_b": 1}

flatten's Issues

unflatten(flatten(data)) not the inverse

from flatten_json import flatten
from flatten_json import unflatten

data = {
    "a": 1,
    "b": 2,
    "c": [{"d": [2, 3, 4], "e": [{"f": 1, "g": 2}]}]
}

print(data)
flatData = flatten(data)
print(flatData)
unflat = unflatten(flatData)
print(unflat)

output...

{'a': 1, 'b': 2, 'c': [{'d': [2, 3, 4], 'e': [{'f': 1, 'g': 2}]}]}
{'a': 1, 'b': 2, 'c_0_d_0': 2, 'c_0_d_1': 3, 'c_0_d_2': 4, 'c_0_e_0_f': 1, 'c_0_e_0_g': 2}
{'a': 1, 'b': 2, 'c': {'0': {'d': {'0': 2, '1': 3, '2': 4}, 'e': {'0': {'f': 1, 'g': 2}}}}}

Order preserving

unflatten_list (from flatten_json import unflatten_list) does not preserve key order because it returns a plain dict. Is there any way to preserve the order?
I am passing an OrderedDict as the function argument.

Outputting only specific keys in sub dictionary

If you want to add a whole new level of sophistication (and probably complexity) to the library, you could allow pulling out a subset of keys within a dict, e.g.:
[
    {
        "id": "12345",
        "organization": {
            "description": "Informing humanitarians worldwide",
            "created": "2016-11-11T17:30:16.140515",
            "title": "ReliefWeb",
            "name": "reliefweb",
            "id": "de410fc7-6116-4283-9c26-67287aaa2634",
            "approval_status": "approved"
        }
    },
    ...
In the above, I might want only the organization "title" and "id", not all the rest (i.e. not approval_status etc.).

Just an idea I had - feel free to ignore if too complicated.

flatten_preserve_lists fails to flatten JSON data that includes an array with more than 10 rows

I have JSON data in which orderItems has more than 10 items. It looks like this:
{
"orderHeader":{
"orderEvent":"STATUS_UPDATE_IN",
"orderNumber":"099432727"
},
"orderItems":[
{
"itemLineNumber":1,
"itemSku":3822,
"itemQuantity":468
},
{
"itemLineNumber":2,
"itemSku":8805,
"itemQuantity":414
},
{
"itemLineNumber":3,
"itemSku":10045,
"itemQuantity":24
},
{
"itemLineNumber":4,
"itemSku":10150,
"itemQuantity":24
},
{
"itemLineNumber":5,
"itemSku":10212,
"itemQuantity":36
},
{
"itemLineNumber":6,
"itemSku":10218,
"itemQuantity":24
},
{
"itemLineNumber":7,
"itemSku":10224,
"itemQuantity":84
},
{
"itemLineNumber":8,
"itemSku":10226,
"itemQuantity":60
},
{
"itemLineNumber":9,
"itemSku":10227,
"itemQuantity":42
},
{
"itemLineNumber":10,
"itemSku":10242,
"itemQuantity":84
},
{
"itemLineNumber":11,
"itemSku":10444,
"itemQuantity":12
},
{
"itemLineNumber":12,
"itemSku":10507,
"itemQuantity":12
},
{
"itemLineNumber":13,
"itemSku":10583,
"itemQuantity":6
},
{
"itemLineNumber":14,
"itemSku":11661,
"itemQuantity":396
},
{
"itemLineNumber":15,
"itemSku":11693,
"itemQuantity":48
},
{
"itemLineNumber":16,
"itemSku":11776,
"itemQuantity":24
},
{
"itemLineNumber":17,
"itemSku":11811,
"itemQuantity":42
},
{
"itemLineNumber":18,
"itemSku":11927,
"itemQuantity":24
},
{
"itemLineNumber":19,
"itemSku":12195,
"itemQuantity":732
},
{
"itemLineNumber":20,
"itemSku":12334,
"itemQuantity":24
}
]
}

rows = flatten_preserve_lists(json_data, separator='.', max_depth=5, max_list_index=100)
len(rows)
Out[13]: 11

After flattening, there should be 20 rows. I did some troubleshooting and found a bug in the code:

  global_max_record = int(max(list(
      list_prebuilt_flattened_dict.keys())))

global_max_record is always 9 once the keys of list_prebuilt_flattened_dict reach '10', so you can never generate more than 11 rows.
The keys are the strings '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'. max() compares strings lexicographically, so it always returns '9' for these keys.

I changed it to:

  global_max_record = int(max(list(
      list_prebuilt_flattened_dict.keys()), key=int))

and then it works.

rows = flatten_preserve_lists(json_data, separator='.', max_depth=5, max_list_index=100)
len(rows)
Out[16]: 20
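
The difference between the two max() calls can be reproduced in isolation with only the standard library. The variable names below are illustrative and not taken from the library's source.

```python
# String keys as they would appear for a list with 11 elements.
keys = [str(i) for i in range(11)]  # '0', '1', ..., '10'

# Default comparison is lexicographic: '9' sorts after '10'.
lexicographic_max = max(keys)

# Comparing by integer value gives the intended result.
numeric_max = max(keys, key=int)

print(lexicographic_max, numeric_max)  # 9 10
```

Passing key=int only changes how elements are compared; the returned value is still the original string, which is why the surrounding int(...) call in the fix is kept.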

flatten option for root keys to ignore keeps some data from the root keys

I'm working with data returned from the twitter API for tweets, which returns data that is very nested for certain elements. One of these elements, the entities, contains annotations, hashtags, mentions and urls. I only care about the urls, so I passed the following statement:

        data = json_response['data']
        dict_flattened = (flatten(record, '.', root_keys_to_ignore={'mentions', 'hashtags', 'annotations'}) for record in data)

Doing so did remove some columns that were being returned, but I am still receiving data back from within those root keys. Am I just calling this incorrectly? Attaching output that was sent to a csv.

json_response_flatten.zip

Here is the output without specifying root keys to ignore:
json_response_flatten include all.zip

Flatten and unflatten with list format a[0] instead of a.0

Hello,

Good library. However, for lists I prefer the format {a[0]:1,a[1]:1,a[2]:2} over the format {a.0:1,a.1:1,a.2:2}.

Could you take a look at the link below for a list format that feels more natural than the one this library uses? The answer from 'blhsing' there is also a good solution.
https://stackoverflow.com/questions/53017956/unflatten-a-dict-obtained-from-json-using-python

I hope you can support this format in the library.

ModuleNotFoundError for v0.1.8

I'm getting import errors after upgrading to v0.1.8:

$ pip3 install flatten_json==0.1.8 && python3 -c "import flatten_json; print('ok')"
Collecting flatten_json==0.1.8
  Using cached https://files.pythonhosted.org/packages/a0/d3/a5d5d3ed553e059e2b844177087beb5d32cd2d19959a466b1e9b3c795995/flatten_json-0.1.8-py3-none-any.whl
Requirement already satisfied: six in ./venv/dev/lib/python3.8/site-packages (from flatten_json==0.1.8) (1.14.0)
Installing collected packages: flatten-json
  Found existing installation: flatten-json 0.1.7
    Uninstalling flatten-json-0.1.7:
      Successfully uninstalled flatten-json-0.1.7
Successfully installed flatten-json-0.1.8
WARNING: You are using pip version 19.2.3, however version 20.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flatten_json'

Note that the previous one (v0.1.7) works fine:

$ pip3 install flatten_json==0.1.7 && python3 -c "import flatten_json; print('ok')"
Collecting flatten_json==0.1.7
  Using cached https://files.pythonhosted.org/packages/eb/a9/1e35abfc4726065f9692decb3c57cf379e5d5329befc6fa5a1ab835fffb8/flatten_json-0.1.7-py3-none-any.whl
Installing collected packages: flatten-json
  Found existing installation: flatten-json 0.1.8
    Uninstalling flatten-json-0.1.8:
      Successfully uninstalled flatten-json-0.1.8
Successfully installed flatten-json-0.1.7
WARNING: You are using pip version 19.2.3, however version 20.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
ok

I can see the package metadata in my virtual env, but no package source or wheel.

Perhaps something went wrong during the last release process?

Python: 3.8.1
OS: macOS 10.15.4
venv: python3 -m venv venv

unable to convert back

d = { "Required": { "a": "1", "b": [ "1", "2", "3" ], "c": { "d": { "e": [ [{ "s1": 1 }, { "s2": 2 } ],[{ "s3": 1 }, { "s4": 2 } ]] } }, "f": [ "1", "2" ] }, "Optional": { "x": "1", "y": [ "1", "2", "3" ] } }

In [14]: fd = flatten(d, separator=".")
In [15]: fd
Out[15]:
{'Optional.x': '1',
'Optional.y.0': '1',
'Optional.y.1': '2',
'Optional.y.2': '3',
'Required.a': '1',
'Required.b.0': '1',
'Required.b.1': '2',
'Required.b.2': '3',
'Required.c.d.e.0.0.s1': 1,
'Required.c.d.e.0.1.s2': 2,
'Required.c.d.e.1.0.s3': 1,
'Required.c.d.e.1.1.s4': 2,
'Required.f.0': '1',
'Required.f.1': '2'}

In [21]: unflatten_list(fd, separator=".")
Out[21]:
{'Optional.x': '1',
'Optional.y.0': '1',
'Optional.y.1': '2',
'Optional.y.2': '3',
'Required.a': '1',
'Required.b.0': '1',
'Required.b.1': '2',
'Required.b.2': '3',
'Required.c.d.e.0.0.s1': 1,
'Required.c.d.e.0.1.s2': 2,
'Required.c.d.e.1.0.s3': 1,
'Required.c.d.e.1.1.s4': 2,
'Required.f.0': '1',
'Required.f.1': '2'}

As the unflatten_list output shows, the result is not the original JSON data.

Unable to use flatten + unflatten when original dict key names contain separator char

Hi, I like this library but it has 1 flaw preventing me from using it. I am unable to flatten and unflatten dictionaries that include the separating character. I only noticed this because I use underscores in my key names a lot and that is the default separating char for this library.
Example:

from flatten_json import flatten, unflatten_list
import pprint

starter_dict = {
    'normal': 'kskdaskad',
    'nested': {'dict': 'asdasd'},
    'array': [1, 2, 3],
    'deeparray': [
        'string',
        {'key': 'val'},
        ['yet', 'another', 'list', ['moar']]
    ],
    'MY_KEY_WHICH_INCLUDES_UNDERSCORES': 'a single value'
}
print('-------------- original dict --------------')
pprint.pprint(starter_dict)
flat = flatten(starter_dict)
print('-------------- flat packed --------------')
pprint.pprint(flat)
unflat = unflatten_list(flat)
print('-------------- unpacked --------------')
pprint.pprint(unflat)

Output:

-------------- original dict --------------
{'MY_KEY_WHICH_INCLUDES_UNDERSCORES': 'a single value',
 'array': [1, 2, 3],
 'deeparray': ['string', {'key': 'val'}, ['yet', 'another', 'list', ['moar']]],
 'nested': {'dict': 'asdasd'},
 'normal': 'kskdaskad'}
-------------- flat packed --------------
{'MY_KEY_WHICH_INCLUDES_UNDERSCORES': 'a single value',
 'array_0': 1,
 'array_1': 2,
 'array_2': 3,
 'deeparray_0': 'string',
 'deeparray_1_key': 'val',
 'deeparray_2_0': 'yet',
 'deeparray_2_1': 'another',
 'deeparray_2_2': 'list',
 'deeparray_2_3_0': 'moar',
 'nested_dict': 'asdasd',
 'normal': 'kskdaskad'}
-------------- unpacked --------------
{'MY': {'KEY': {'WHICH': {'INCLUDES': {'UNDERSCORES': 'a single value'}}}},
 'array': [1, 2, 3],
 'deeparray': ['string', {'key': 'val'}, ['yet', 'another', 'list', ['moar']]],
 'nested': {'dict': 'asdasd'},
 'normal': 'kskdaskad'}

I understand I could use the separator param and make it something really weird, but I am always running the risk of ruining my original dict structure if (heaven forbid) any of my key names include that same string pattern. Maybe quote encapsulation could help here?

Fixing this issue would make the library more plug-and-play since I am sure I am not the only one that likes to use underscores in their JSON key names.
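
One possible workaround outside the library is to keep each key path as a tuple instead of joining it into a string, so no separator is needed at all. The sketch below is a hypothetical illustration (flatten_paths and unflatten_paths are not part of flatten_json), and for brevity it leaves lists untouched as values.

```python
def flatten_paths(obj, path=()):
    """Flatten nested dicts into {tuple_path: value}; lists stay as values."""
    if isinstance(obj, dict) and obj:
        out = {}
        for key, value in obj.items():
            out.update(flatten_paths(value, path + (key,)))
        return out
    return {path: obj}


def unflatten_paths(flat):
    """Rebuild the nested dict from tuple paths."""
    root = {}
    for path, value in flat.items():
        if not path:
            # An empty path means the input was not a non-empty dict.
            return value
        node = root
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return root


starter_dict = {
    "normal": "kskdaskad",
    "nested": {"dict": "asdasd"},
    "array": [1, 2, 3],
    "MY_KEY_WHICH_INCLUDES_UNDERSCORES": "a single value",
}
flat = flatten_paths(starter_dict)
restored = unflatten_paths(flat)
```

The trade-off is that tuple keys cannot be used directly as JSON object keys or as simple column names, so this only helps when the flat form stays inside Python.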

Option to ignore fields and specify certain fields to flatten

It would be great to be able to specify fields not to flatten as sometimes there can be one or two fields that when expanded produce a huge number of columns.

It would also be great to be able to do the opposite: ie. only specify certain fields to flatten, the rest being left as is.

Related would be an option for lists to flatten up to n elements in the list and ignore the rest (or put rest unflattened in one column perhaps).

Thx for a great utility.

Two different install instructions on PyPI

It is a little confusing that two different install instructions are given on PyPI: one using - and one using _. Will both work? Which is preferred?

More complex flattening?

Wondering if you could provide an example of a very deep nested JSON?

Something like:

{
    "results": [
        {
            "project_name": "Test",
            "start_date": "2020-03-25",
            "associated_references": [
                {
                    "full_reference": "Reference A",
                    "year": 2010
                },
                {
                    "full_reference": "Reference B",
                    "year": 2010
                }
            ],
            "hierarchy": [
                {
                    "hierarchy_id": "top level event",
                    "event": [
                        {
                            "event name": "event 1"
                        },
                        {
                            "event name": "event 2"
                        },
                        {
                            "event name": "event 3"
                        }
                    ]
                },
                {
                    "hierarchy_id": "second level event",
                    "event": [
                        {
                            "event name": "event 1"
                        },
                        {
                            "event name": "event 2"
                        }
                    ]
                }
            ]
        }
    ]
}

The header structure to be returned here would be something like:

Project_name	start_date	associated_references.full_reference	associated_references.year	hierarchy.hierarchy_id	event.event_name

In this case, the row values for each column are replicated (e.g., project_name and start_date) up to the number of instances of the sub-queries.

new function flatten_preserve_lists to limit flattening and preserve list structure

I propose adding a new function called flatten_preserve_lists.

It would create a new record for every element in a list, akin to a left join between two tables in sql. It adds options max_list_index, which controls how many list indices are processed, and max_depth, which controls how many recursions are permitted. Given that the result of this function may be very large, these options help reduce output size and can be used for quick data investigation.

This function requires import of copy, re, and math libraries.

See below for the pull request. It works, but it still contains some debug statements, which I will remove if it is accepted.

Key Value Pairs in JSON

We have a situation where most of the data is in "normal" JSON formats, and then we have a "catch all" that is a list of key-value pairs.

{
"objects":[
          {
          "objectId": "one",
          "Tags": [
                {
                    "Key": "key1",
                    "Value": "value1"
                },
                {
                    "Key": "key2",
                    "Value": "value2"
                },
                {
                    "Key": "key3",
                    "Value": "value3"
                }
            ]
          },
          {
          "objectId": "two",
          "Tags": [
                          {
                              "Key": "key1",
                              "Value": "value4"
                          },
                          {
                              "Key": "key3",
                              "Value": "value5"
                          },
                          {
                              "Key": "key4",
                              "Value": "value6"
                          }
                  ]
              }
          ]
}

My anticipated output would look something like the following.

ObjectId     Key1     Key2       Key3        Key4
one             value1   value2    value3     NaN
two             value4   NaN       value5     value6

Just starting to think about a way to do this by augmenting the flatten_json code, but was curious if anyone came up with a solution to this problem.
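
One way to get that shape without modifying flatten_json is to pivot the Key/Value pairs into a plain dictionary per object before building the table. This is only a sketch: the helper tags_to_columns is hypothetical, and the field names are taken from the example above.

```python
def tags_to_columns(obj, id_field="objectId", tags_field="Tags"):
    """Turn a list of {"Key": ..., "Value": ...} pairs into one flat row."""
    row = {id_field: obj[id_field]}
    for tag in obj.get(tags_field, []):
        row[tag["Key"]] = tag["Value"]
    return row


data = {
    "objects": [
        {"objectId": "one", "Tags": [
            {"Key": "key1", "Value": "value1"},
            {"Key": "key2", "Value": "value2"},
            {"Key": "key3", "Value": "value3"},
        ]},
        {"objectId": "two", "Tags": [
            {"Key": "key1", "Value": "value4"},
            {"Key": "key3", "Value": "value5"},
            {"Key": "key4", "Value": "value6"},
        ]},
    ]
}

rows = [tags_to_columns(obj) for obj in data["objects"]]
```

Passing rows to pd.DataFrame, as in the README's pandas example, should then fill the keys that are missing for an object with NaN.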

Several issues with flatten_preserve_lists

I found the following issues with flatten_preserve_lists:

flatten_preserve_lists({'K': ['abc', [1, 2, 3]]}, separator="+") results in the error re.error: nothing to repeat at position 1
flatten_preserve_lists({'K': ['abc', [1, 2, 3]]}, separator="=+") (note that the 'separator' is '=+' and not '+' as in the previous case)

output: [{'K=+0=+1=+0=+1=+2': None, 'K=+0=+1=+0=+1': None, 'K=+0=+1=+0': None, 'K=+0': None, 'K': 'abc'}, {'K=+0=+1=+0=+1=+2': None, 'K=+0=+1=+0=+1': None, 'K=+0=+1=+0': None, 'K=+0': None, 'K': 1}, {'K=+0=+1=+0=+1=+2': None, 'K=+0=+1=+0=+1': None, 'K=+0=+1=+0': None, 'K=+0': None, 'K': 2}, {'K=+0=+1=+0=+1=+2': None, 'K=+0=+1=+0=+1': None, 'K=+0=+1=+0': None, 'K=+0': None, 'K': 3}]
expected: {'K': ['abc', [1, 2, 3]]} as the input is not nested

flatten_preserve_lists({'K': ['abc', [[1, 2, 3]]]}, separator="=+")

output: [{'K=+0=+1=+0=+0=+1': None, 'K=+0=+1=+0=+0=+1=+2': None, 'K=+0=+1=+0=+0': None, 'K=+0': None, 'K': 'abc'}, {'K=+0=+1=+0=+0=+1': None, 'K=+0=+1=+0=+0=+1=+2': None, 'K=+0=+1=+0=+0': None, 'K=+0': None, 'K': [1, 2, 3]}]
expected: {'K': ['abc', [[1, 2, 3]]]} as the input is not nested

flatten_preserve_lists({'K': ['abc', [1, 2, 3]]})

output: [{'K': 'abc'}, {'K': 1}, {'K': 2}, {'K': 3}]
expected: {'K': ['abc', [1, 2, 3]]} as the input is not nested

flatten_preserve_lists({'K': ['abc', [[1, 2, 3]]]})

output: [{'K': 'abc'}, {'K': [1, 2, 3]}]
expected: {'K': ['abc', [[1, 2, 3]]]} as the input is not nested

I am working with the latest version of flatten_json, which is currently 0.1.13

Unflatten: run with partial flatten dict

>>> from flatten_json import flatten,unflatten
>>> unflatten({'a.b':[1,2]}, '.')

AssertionError: provided dictionary is not flat

I am not sure why the assertion is so strict; I just want the result {'a':{'b':[1,2]}}.

For now I use this: https://stackoverflow.com/questions/6037503/python-unflatten-dict/6037657#6037657 , which works fine for me.

New Release to PyPI?

Hi! Are there any plans to make a new release to PyPI with the new updates (the last published update is from 2017)?

BTW this is a great project that has saved me and others countless hours! Props to this project!!

Wheel on pypi?

Greetings,

It would be great to ship a wheel to pypi and test the project on newer python versions to ensure it continues to function with python updates. I opened a PR to enable this.

unflatten from pandas dataframe

My use case is the following: I convert a JSON string of the form below to a pandas dataframe. Pandas will automatically fill columns with NaN if they are not defined for a particular record (as in the example below).

import pandas as pd, json
from flatten_json import flatten, unflatten

input = json.loads(""" 

[{
    "cola": {"colb": null, "colc": 10},          
    "cold": "hi"    
},
{
    "cola": null, 
    "cold": "hi"    
}]
""")


flat_json = [flatten(record, '.') for record in input]
df_flat = pd.DataFrame(flat_json)

When I try to unflatten the dataframe records, unflatten throws an error.

[unflatten(record, '.') for record in df_flat.to_dict('records')]

TypeError: 'float' object does not support item assignment

Hope you would agree that this is a fairly common use case, so would warrant a fix.

My proposal:

in flatten_json.py, function unflatten, I would change:

for item in flat_dict:
       _unflatten(unflattened_dict, item.split(separator), flat_dict[item])

with

list_keys = sorted(flat_dict.keys())
for i, item in enumerate(list_keys):
	if i != len(list_keys)-1:
		if not list_keys[i+1].startswith(list_keys[i]):
			_unflatten(unflattened_dict, item.split(separator), flat_dict[item])
		else:
			pass  # if key contained in next key, json will be invalid.
	else:
		#  last element
		_unflatten(unflattened_dict, item.split(separator), flat_dict[item])
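
An alternative that does not require changing the library is to drop the NaN placeholders that pandas introduced before handing each record to unflatten. The sketch below shows only the filtering step with a hypothetical record; the helper name drop_nan is an assumption, and whether this fully avoids the error depends on the data (for example, a None stored under 'cola' would still conflict with 'cola.colc').

```python
import math


def drop_nan(record):
    """Remove entries whose value is a float NaN (pandas' missing marker)."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


# Hypothetical record as it might come out of df_flat.to_dict('records').
record = {"cola.colc": 10.0, "cold": "hi", "cola": float("nan")}
cleaned = drop_nan(record)
```

The cleaned records could then be passed to unflatten(record, '.') as in the original snippet.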

Data is lost when a root key contains another root key

Related to #48 but the solution to this issue is different.

Input

{
    'a': 1,
    'a_b': 2,
    'a_c.d': 3,
    'a_c.e': 4
}

Expected

{
    'a': 1,
    'a_b': 2,
    'a_c': {'d': 3, 'e': 4}
}

Actual ('a' key is missing)

{
    'a_b': 2,
    'a_c': {'d': 3, 'e': 4}
}

cannot flatten arrays of numbers

print(flatten_json.unflatten(flatten_json.flatten({'data': [2]})))
# prints: {'data': {'0': 2}}

unflatten does not work when importing JSON from file

When using the code below with the attached file data.json.zip, unflatten does not work. It fails with the error 'provided dictionary is not flat'.

import json
from flatten_json import unflatten

# Load the JSON document from disk, then try to unflatten it
with open('data.json') as json_file:
    json_data = json.load(json_file)
unflatten(json_data)

Exploding non-nested lists rather than unpacking into separate columns

I noticed the flatten_json function unpacks a list even when it contains only integers for example:
list = [1,2,3]

Rather than exploding like so:
list
1
2
3

It places them into individual columns
list_0 list_1 list_2
1 2 3

Is there a way it can accommodate lists which aren't nested and just need to be exploded rather than unpacked like a list of dictionaries ?
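
The "explode" behaviour asked for here can be sketched independently of flatten_json. The helper below is hypothetical and handles a single list-valued key in a flat record; pandas users may find that DataFrame.explode covers the same need.

```python
def explode(record, key):
    """Return one copy of the record per element of record[key]."""
    values = record.get(key)
    if not isinstance(values, list):
        # Nothing to explode: return the record unchanged as a single row.
        return [dict(record)]
    return [{**record, key: value} for value in values]


rows = explode({"id": 7, "list": [1, 2, 3]}, "list")
```

Note that in this sketch an empty list produces no rows at all, which may or may not be the desired behaviour.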

Still maintained?

Does anyone know if this project is still being maintained? I've run into issue #62 and read that it was fixed in the v0.1.8 release which was released and then yanked from pypi due to a build issue.

If the project is still being maintained, is it likely to be deployed at all?
If the project is unmaintained, is there any interest in forking and fixing outstanding issues?

Option to leave some structure intact

It would be great to be able to specify that some fields at root level are not to be flattened but will still be outputted.

Here is an example: json
In the above, if you look under the key "solr_additions", there are many values in the list (one for each country in Africa). It would be great to output this, because it is useful, but not flatten it as it results in more than 50 keys ie. output it unflattened.

The point of flatten to me is that the resulting object has no structure (no iterables other than strings are present as values of the flat dictionary). This way you can easily pass the dictionary to Pandas.

The unflattened structure would need to be converted into a long string and in the above case, put under a single key called "solr_additions". json_normalize in Pandas behaves this way (but obviously doesn't have all the other flattening features of this library).

I don't know how this could be achieved outside flatten.

The util namespace

I am unable to use flatten_json in a project that has its own util:

$ (virtualenv --python=python3 flatten && cd flatten && source bin/activate && pip install flatten-json six && touch util.py && python -c "import flatten_json")
[...]
Installing collected packages: flatten-json, six
Successfully installed flatten-json-0.1.7 six-1.15.0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/am/flatten/lib/python3.7/site-packages/flatten_json.py", line 13, in <module>
    from util import check_if_numbers_are_consecutive
ImportError: cannot import name 'check_if_numbers_are_consecutive' from 'util' (/Users/am/flatten/util.py)

(remove touch util.py from the shell line and it works without error)

I think the name util is a bit too generic to be reserved by this package. Is it possible to change that, maybe by using a relative import (from .util import check_if_numbers_are_consecutive)?

support better sorting of flattened columnames

Hi,
nice work, by the way.
I would like to request a format string for building new key names, because sorting the columns in Pandas currently produces orderings like this:
key_10_name, key_11_name, key_1_name, key_2_name ....
With

 for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, '{:02d}'.format(index)))

the format string could be a parameter or a configuration attribute.

The keys would then look like this:
key_01_name, key_02_name, ..., key_10_name, key_11_name

regards
djdeejay
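
Until something like this exists, the ordering problem can also be handled on the caller's side with a "natural" sort key that compares the numeric parts of each column name as integers. This is a standard-library sketch; natural_key is a hypothetical helper and not part of flatten_json.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so numbers compare as ints."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions hold the digit runs.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


columns = ["key_10_name", "key_11_name", "key_1_name", "key_2_name"]
ordered = sorted(columns, key=natural_key)

# The zero-padding proposed above, shown on its own:
padded = "{:02d}".format(1)
```

Zero-padding changes the key names themselves, whereas a natural sort leaves them untouched and only affects the order in which columns are arranged.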

Data is lost if unflatten column name equals to another original column

Seems like a corner case, so I think at least we should raise an error when such situation happens:

echo '{"id": [0], "id_0": 2}' | flatten_json
>>>{"id_0": 2}

As you see, "id": [0] is lost. One might use another separator, but the idea is that one can never be sure which separator will definitely work before he actually checks the data.

Could be related to #17
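
Raising an error on such collisions could look roughly like the sketch below. It is a simplified stand-alone flattener written for illustration (flatten_strict is a hypothetical name), not a patch against the library's code.

```python
def flatten_strict(obj, separator="_"):
    """Flatten nested dicts/lists, refusing to overwrite an existing key."""
    out = {}

    def walk(value, prefix):
        if isinstance(value, dict) and value:
            items = value.items()
        elif isinstance(value, list) and value:
            items = enumerate(value)
        else:
            if prefix in out:
                raise ValueError(f"duplicate flattened key: {prefix!r}")
            out[prefix] = value
            return
        for key, child in items:
            walk(child, f"{prefix}{separator}{key}" if prefix else str(key))

    walk(obj, "")
    return out


try:
    flatten_strict({"id": [0], "id_0": 2})
    collision_detected = False
except ValueError:
    collision_detected = True

safe = flatten_strict({"id": [0], "id_0": 2}, separator="|")
```

With a different separator the same input flattens without loss, which matches the observation that the safe choice of separator depends on the data.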

The issue with the performance of unflattening process

This issue concerns performance on large datasets.
The unflatten_list function was used for the unflattening process.
The dataset has 80750 rows and 1051 columns.
Unflattening took 6 hours and 5 minutes.

Have you run into this issue? How might the unflattening process be optimized?

double quotes vs single quotes

This package is great. Only one minor thing: instead of single quotes, can we change to double quotes so that the output is a valid JSON object?
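
For context, the single quotes come from Python's own representation of a dict rather than from the flattening itself. Serializing the flattened dictionary with the standard json module produces double-quoted, valid JSON, as this small sketch with a hypothetical flat dictionary shows.

```python
import json

flat = {"a_b": 1, "c": "text"}

python_repr = repr(flat)      # Python literal syntax, single quotes
json_text = json.dumps(flat)  # valid JSON, double quotes

print(python_repr)
print(json_text)
```

The JSON text can be parsed back with json.loads, whereas the repr form is not valid JSON.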

Python 3?

Hi.

What's the story with flatten_json on Python 3.x? You probably already know that Python 2.x will not be supported much longer. I'm asking mostly because caniusepython3 identified flatten_json as one of the projects that need to be looked at.

If the compatibility is currently unknown, is there a test suite I can run with a Python 3.x interpreter to see how things look?

Thanks!

flatten_preserve_lists fails to flatten data that has an array inside an array

Test data looks like this:
data = { "header": "header a",
"item": [{"itemLineNumber": 1}, {"itemLineNumber": 2}],
"containerInfo":
[{ "containerId":"A",
"containerItems":[{"itemSKU":1}]},
{"containerId":"B",
"containerItems":[{"itemSKU":2}]},
{"containerId":"C",
"containerItems":[{"itemSKU":3}]},
{"containerId":"D",
"containerItems":[{"itemSKU":4}]},
{"containerId":"E",
"containerItems":[{"itemSKU":5}]}]}

In [5]: flatten_json.flatten_preserve_lists(data, max_list_index=10, max_depth=10)
Out[8]:
[{'header': 'header a',
'containerInfo_containerItems': 1,
'containerInfo_containerId': 'E',
'item': 1},
{'header': 'header a',
'containerInfo_containerItems': 1,
'containerInfo_containerId': 'E',
'item': 2},
{'header': 'header a',
'containerInfo_containerItems': 2,
'containerInfo_containerId': 'E',
'item': 1},
{'header': 'header a',
'containerInfo_containerItems': 2,
'containerInfo_containerId': 'E',
'item': 2},
{'header': 'header a',
'containerInfo_containerItems': 3,
'containerInfo_containerId': 'E',
'item': 1},
{'header': 'header a',
'containerInfo_containerItems': 3,
'containerInfo_containerId': 'E',
'item': 2},
{'header': 'header a',
'containerInfo_containerItems': 4,
'containerInfo_containerId': 'E',
'item': 1},
{'header': 'header a',
'containerInfo_containerItems': 4,
'containerInfo_containerId': 'E',
'item': 2},
{'header': 'header a',
'containerInfo_containerItems': 5,
'containerInfo_containerId': 'E',
'item': 1},
{'header': 'header a',
'containerInfo_containerItems': 5,
'containerInfo_containerId': 'E',
'item': 2}]

The header field (parent level), the containerItems field (grandchild level) and the item field (child level) are fine. But containerId (child level) is always overwritten by the value from the last array item, "containerId":"E".

Limit processing in `flatten`

Abandon flattening past a certain depth.
Abandon flattening past an element index number across all lists. This is useful for very large lists. For example we can set max_list_elements=100 and only the top 100 elements in all lists are processed.
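
Both limits can be sketched in a small stand-alone flattener. The function name flatten_limited and the parameter max_depth are hypothetical (max_list_elements follows the name used above), and the sketch assumes that a value beyond the depth limit is kept as-is rather than dropped.

```python
def flatten_limited(obj, separator="_", max_depth=None, max_list_elements=None):
    """Flatten nested dicts/lists, stopping at max_depth and truncating lists."""
    out = {}

    def walk(value, prefix, depth):
        too_deep = max_depth is not None and depth >= max_depth
        if isinstance(value, dict) and value and not too_deep:
            items = value.items()
        elif isinstance(value, list) and value and not too_deep:
            # Slicing with None keeps the whole list.
            items = enumerate(value[:max_list_elements])
        else:
            # Scalars, empty containers, and anything past the depth limit.
            out[prefix] = value
            return
        for key, child in items:
            walk(child, f"{prefix}{separator}{key}" if prefix else str(key), depth + 1)

    walk(obj, "", 0)
    return out


data = {"a": {"b": {"c": 1}}, "l": [1, 2, 3]}
shallow = flatten_limited(data, max_depth=1)
limited = flatten_limited(data, max_depth=2, max_list_elements=2)
```

Here max_depth counts levels from the root, so max_depth=1 expands only the top-level keys and leaves their values untouched.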

add type hints or stub file

At the moment mypy won't work for code that uses flatten_json:
Skipping analyzing "flatten_json": found module but no type hints or library stubs [import]

Python None becomes string "None" not JSON literal null

Hello!

I am using unflatten_list() and I am running into a case where the Python None object is being serialized into the string "None" rather than the JSON literal null. I am using this in a script that extracts data from etcd3 to generate a JSON file for internal processes.

True Flattening

This method does flatten nested data in a sense, but if you look at the academic literature, flattening refers more to this:

[{
'a': 1,
'b': 2,
'c_d': 2,
'c_e_f': 1,
'c_e_g': 2
}, {
'a': 1,
'b': 2,
'c_d': 3
}, {
'a': 1,
'b': 2,
'c_d': 4
}]

In the case of being a feature for your package, this could be called "true flatten" or "full flatten" or something of that nature. What do you think?

flatten_preserve_lists returns a list instead of dict

Hi guys,

flatten_preserve_lists(data) appears to return a list, whereas the definition says it returns a flattened dictionary.

Cheers,
Rory

Regression 0.1.6 -> 0.1.7 for numpy arrays

Such a useful module - thank you.

I've noticed that when a numpy array is present as a dictionary value, flatten() throws the following:

  File "myfile.py", line 452, in myfunc
    flattened = flatten(x)
  File "/anaconda2/lib/python2.7/site-packages/flatten_json.py", line 82, in flatten
    _flatten(nested_dict, None)
  File "/anaconda2/lib/python2.7/site-packages/flatten_json.py", line 74, in _flatten
    object_key))
  File "/anaconda2/lib/python2.7/site-packages/flatten_json.py", line 66, in _flatten
    if not object_:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This worked in 0.1.6. Any ideas what happened? Has some default changed? Is there a workaround? Thanks
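The traceback points at the truthiness check `if not object_:`, which NumPy refuses to evaluate for arrays with more than one element. A possible workaround, sketched below, is to convert arrays to plain lists before calling flatten. `to_plain` is a hypothetical helper; it relies on the `tolist()` method that NumPy arrays provide, and `FakeArray` is only a stand-in so the example runs without NumPy installed.

```python
def to_plain(obj):
    """Recursively convert array-like values (anything with tolist()) to lists."""
    if hasattr(obj, "tolist"):
        return to_plain(obj.tolist())
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj


class FakeArray:
    """Stand-in for numpy.ndarray, used only for this illustration."""

    def __init__(self, data):
        self._data = list(data)

    def tolist(self):
        return list(self._data)


x = {"id": 7, "values": FakeArray([1, 2, 3]), "meta": {"w": (0.5, FakeArray([4]))}}
plain = to_plain(x)
# flatten(plain) then only sees dicts, lists, and scalars.
```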

Thank you

This is not a bug or anything, it's just to say thank you. Your library is really GREAT. We're now using it in our open source PYBOSSA software for improving our CSV exporters with Pandas.

Again, thanks for sharing this work!

Maybe unflatten as option?

First things first: Great module :)!
But I have a question: could you add an unflatten function that also works recursively, so that we could get the original data structure back? (Yes, one needs to use a separator that does not occur in the key names, and that should be checked and reported as an error, but other than that it should work :))

Thanks a lot!

Error if a key is substring of another key

Example:

unflatten(flatten({'a': '1', 'ab': 2, 'c': {'d': 4} }))
{'ab': 2, 'c': {'d': 4}}
expected result: {'a': '1', 'ab': 2, 'c': {'d': 4} }

Assistance needed

Sorry to bother you. This is not really an issue, just an "almost there".

I am wondering if someone has a moment to guide me.

I have this data:

in []:

dfc=dfb['entities']
dfc

out []:

0    [{'name': 'Shop', 'type': 'LOCATION', 'salienc...
1    [{'name': 'Shop', 'type': 'LOCATION', 'salienc...
Name: entities, dtype: object

in []:

dfTodict = dfc.to_dict()
dfTodict

out []:

{0: [{'name': 'Shop',
   'type': 'LOCATION',
   'salience': 0.034899305552244186,
   'wikipedia_url': '-'},
  {'name': 'Brand',
   'type': 'ORGANIZATION',
   'salience': 0.019303809851408005,
   'wikipedia_url': 'https://en.wikipedia.org/wiki/Brand'},
......
 {'name': '57', 'type': 'NUMBER', 'salience': 0.0, 'wikipedia_url': '-'}],
 1: [{'name': 'Shop',
   'type': 'LOCATION',
   'salience': 0.054236818104982376,
   'wikipedia_url': '-'},
  {'name': 'Brand',
   'type': 'ORGANIZATION',
   'salience': 0.023990564048290253,
   'wikipedia_url': 'https://en.wikipedia.org/wiki/Brand'},
......

I want to perform the second flatten pattern from the README (usage with pandas):

dic_flattened = (flatten(d) for d in dfTodict)

but then I get a generator object, and I don't understand how to turn it into a dataframe.

<generator object <genexpr> at 0x0000022301C025E8>

Thank you !
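Two things seem to be going on here. Parentheses create a lazy generator, whereas square brackets (or `list(...)`) produce the list the README example uses. Also, iterating over a dictionary yields its keys (here 0 and 1), not the lists of entities. A minimal sketch with shortened, made-up data in the same shape; the `row` column name is an assumption added for illustration:

```python
# Shortened stand-in for dfTodict: row index -> list of entity dicts.
df_to_dict = {
    0: [{"name": "Shop", "type": "LOCATION"}, {"name": "Brand", "type": "ORGANIZATION"}],
    1: [{"name": "Shop", "type": "LOCATION"}],
}

# Iterating the dict directly only gives the keys.
keys_only = [d for d in df_to_dict]

# One output row per entity, remembering which input row it came from.
rows = [
    dict(entity, row=index)
    for index, entities in df_to_dict.items()
    for entity in entities
]
# The entity dicts shown above are already flat, so pd.DataFrame(rows)
# should work directly (assuming pandas is imported as pd). If they were
# nested, flatten(entity) could be applied to each one first.
```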

Data is lost if unflatten column name begins with name

import flatten_json

records = {
    'lives.name':'person',
    'lives.name_mother':'mother'
}

flatten_json.unflatten(records, ".")

expected:

{
   'lives': {
      'name': 'person',
      'name_mother': 'mother'
   }
}

but result in:

{
   'lives': {
      'name_mother': 'mother'
   }
}
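For comparison, a minimal sketch of an unflatten that splits keys only on the separator and never compares key prefixes produces the expected result for this input. This is not the library's code; `unflatten_strict` is a hypothetical name, and the sketch does not handle a scalar and a nested object sharing the same path.

```python
def unflatten_strict(flat, separator="."):
    """Rebuild nesting by splitting each key on the separator (hypothetical helper)."""
    result = {}
    for key, value in flat.items():
        *parents, leaf = key.split(separator)
        node = result
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


records = {"lives.name": "person", "lives.name_mother": "mother"}
nested = unflatten_strict(records, ".")
```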

Handle keys that look like integers

I had to flatten and then unflatten JSON that had arrays and also some keys of the form "0", "1" (strings containing numbers). I hacked on the code a bit to disambiguate arrays from these keys by adding a "@", but the way I did it breaks backward compatibility. Here it is for reference:

yzadik@3302f14

unflatten_list() does not progress deeper once a list is found, missing lists inside lists

I am encountering issues where some two item lists are getting missed by unflatten_list(), generating instead a dictionary of the format:

      "server": [
        {
          "0": "redacted",
          "1": "redacted1"
        },
        {
          "0": "redacted",
          "1": "redacted1"
        } 
      ]

What should be generated is:

      "server": [
        [
          "redacted",
          "redacted1"
        ],
        [
          "redacted",
          "redacted1"
        ]
      ]

Looks like it has to do with nested lists.
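As a possible workaround, the result can be post-processed with a conversion that keeps descending into lists. The sketch below is not the library's `_convert_dict_to_list`; `dicts_to_lists` is a hypothetical helper. It converts every dictionary whose keys are exactly "0", "1", ..., "n-1" into a list, so it would also convert dictionaries that legitimately use such keys.

```python
def dicts_to_lists(obj):
    """Recursively turn {"0": a, "1": b, ...} dictionaries into [a, b, ...]."""
    if isinstance(obj, list):
        # Keep descending: lists may contain index-keyed dictionaries.
        return [dicts_to_lists(v) for v in obj]
    if not isinstance(obj, dict):
        return obj
    converted = {k: dicts_to_lists(v) for k, v in obj.items()}
    keys = list(converted)
    if keys and all(isinstance(k, str) and k.isascii() and k.isdigit() for k in keys):
        by_index = {int(k): v for k, v in converted.items()}
        # Only convert when the indices form the run 0..n-1 without gaps.
        if sorted(by_index) == list(range(len(keys))):
            return [by_index[i] for i in range(len(keys))]
    return converted


broken = {
    "server": [
        {"0": "redacted", "1": "redacted1"},
        {"0": "redacted", "1": "redacted1"},
    ]
}
fixed = dicts_to_lists(broken)
```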

Nested lists throwing error

Currently I have the code:

# Data is a list with dictionary elements which in turn have nested lists, so we turn the first list layer into a dictionary.
jira_data = { i : jira_data[i] for i in range(0, len(jira_data) ) }
jira_flatten = (flatten(d) for d in jira_data)
jira_df = pd.DataFrame(jira_flatten)

the last line errors with the following:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-75a3a8fe964c> in <module>()
----> 1 jira_df = pd.DataFrame(jira_flatten)

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    362         elif isinstance(data, (list, types.GeneratorType)):
    363             if isinstance(data, types.GeneratorType):
--> 364                 data = list(data)
    365             if len(data) > 0:
    366                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:

<ipython-input-16-1cf0d677c973> in <genexpr>(.0)
----> 1 flatten_jira = (flatten(d) for d in jira_data)

~/anaconda3/envs/python3/lib/python3.6/site-packages/flatten_json.py in flatten(nested_dict, separator, root_keys_to_ignore)
     32     :return: flattened dictionary
     33     """
---> 34     assert isinstance(nested_dict, dict), "flatten requires a dictionary input"
     35     assert isinstance(separator, str), "separator must be a string"
     36 

AssertionError: flatten requires a dictionary input
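The assertion in the traceback fires because iterating over a dictionary yields its keys. After the dict comprehension, `for d in jira_data` produces the integers 0, 1, ..., and each of those fails the `isinstance(nested_dict, dict)` check. A small sketch with made-up records (the field names are assumptions for illustration):

```python
# Made-up stand-in for the original list of issue dictionaries.
jira_data = [
    {"key": "PROJ-1", "fields": {"labels": ["a", "b"]}},
    {"key": "PROJ-2", "fields": {"labels": []}},
]

# The conversion from the report: index -> record.
indexed = {i: jira_data[i] for i in range(0, len(jira_data))}

# Iterating the dict gives the integer keys, which flatten rejects.
iterated = [d for d in indexed]

# Iterating the values (or simply the original list) gives the dictionaries.
records = list(indexed.values())
all_dicts = all(isinstance(d, dict) for d in records)
# So [flatten(d) for d in jira_data] on the original list should be enough;
# the dict comprehension is not needed.
```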

Handling of Escaped Double Quotes

If the value of any attribute contains an escaped double quote, the flattened output is written as a corrupt JSON file. The escape character should be preserved in the output.

e.g.
"title": "JTC 24,5\" (...)",

becomes :
"lineItems_0_title": "JTC 24,5" (...)
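Once the input is parsed, the Python string holds a bare `"` character; the backslash only exists in the JSON text. Whether the corrupt file comes from the library's command line output or from how the result is written is not clear from this report. As a sketch, assuming the flattened dictionary is available in Python, serializing it with the standard `json` module restores the escape:

```python
import json

# The flattened value as Python sees it: a plain double quote, no backslash.
flat = {"lineItems_0_title": 'JTC 24,5" (...)'}

# json.dumps re-escapes the quote, so the output is valid JSON again.
encoded = json.dumps(flat)
round_tripped = json.loads(encoded)
```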

no output, tests pass

Trying this for the first time, installed via pip for python3. Does not seem to work, without giving any errors.

Running example from the docs:

0 $ echo '{"a": {"b": 1}}' | python3 -m flatten_json
0 $

The 0 is the exit status.

Running the test suite also passes:

0 $ python3 /home/zuzuzu/.local/lib/python3.6/site-packages/test_flatten.py 
............
----------------------------------------------------------------------
Ran 12 tests in 0.001s

OK
0 $

Any ideas?

Issue with unflatten

When there is a flattened field with a "duplicated" inner part there's an error:
AttributeError: 'str' object has no attribute 'setdefault'
The problematic json:

{
     "field": "",
     "field.inner.part": "123"
}

While this would work:

{
     "field.inner.part": "123"
}

And this would also work:

{
     "field": "",
     "field.inner": "", 
     "field.inner.part": "123"
}

The desired output for all cases above should be:

{"field": {"inner": {"part": "123"}}}

The function used is:

unflatten_list(example, separator=".")

I don't have a way of removing the "duplicated" ones as this is the way some users input their flattened json.
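A sketch of a more tolerant unflatten that gives the desired output for all three inputs above. It is not the library's behavior: `unflatten_lenient` is a hypothetical helper, and it silently drops a scalar value whenever a longer key shares its path, which may or may not be acceptable.

```python
def unflatten_lenient(flat, separator="."):
    """Unflatten, letting nested keys win over scalars on the same path."""
    result = {}
    for key, value in flat.items():
        parts = key.split(separator)
        node = result
        for part in parts[:-1]:
            # Replace a scalar placeholder (such as "") with a dictionary.
            if not isinstance(node.get(part), dict):
                node[part] = {}
            node = node[part]
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            continue  # a nested object already lives here; ignore the scalar
        node[leaf] = value
    return result


example = {"field": "", "field.inner.part": "123"}
unflattened = unflatten_lenient(example, separator=".")
```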

unflatten_list does not work with more complex cases

I am running the following test case on python 3.5.2:

    def test_unflatten_with_list_custom_separator(self):
        """Dictionary with lists"""
        self.maxDiff = None
        dic = {
            'a:b': 'str0',
            'c:0:d:0:e': 'str1',
            'c:0:f': 'str2',
            'c:0:g': 'str3',
            'c:1:d:0:e': 'str4',
            'c:1:f': 'str5',
            'c:1:g': 'str6',
            'h:d:0:e': 'str7',
            'h:i:0:f': 'str8',
            'h:i:0:g': 'str9'
        }
        expected = {
            'a': {'b': 'str0'},
            'c': [
                {
                    'd': [{'e': 'str1'}],
                    'f': 'str2',
                    'g': 'str3'
                }, {
                    'd': [{'e': 'str4'}],
                    'f': 'str5',
                    'g': 'str6'
                }
            ],
            'h': {
                'd': [{'e': 'str7'}],
                'i': [{'f': 'str8', 'g': 'str9'}]
            }
        }
        actual = unflatten_list(dic, ':')
        self.assertEqual(actual, expected)

First it fails because the separator is not passed to unflatten (fixed in #7)

When this is fixed, I get the following error:

flatten_json.py:119: in unflatten_list
    _convert_dict_to_list(unflattened_dict, None, None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

object_ = {'a': {'b': 'str0'}, 'c': {'0': {'d': {'0': {'e': 'str1'}}, 'f': 'str2', 'g': 'str3'}, '1': {'d': {'0': {'e': 'str4'}}, 'f': 'str5', 'g': 'str6'}}, 'h': {'d': {'0': {'e': 'str7'}}, 'i': {'0': {'f': 'str8', 'g': 'str9'}}}}, parent_object = None
parent_object_key = None

    def _convert_dict_to_list(object_, parent_object, parent_object_key):
        if isinstance(object_, dict):
            try:
                keys = [int(key) for key in sorted(object_) if
                        not isinstance(object_[key], Iterable) or isinstance(object_[key], str)]
                keys_len = len(keys)
>               if (sum(keys) == int(((keys_len - 1) * keys_len) / 2) and keys[0] == 0 and keys[-1] == keys_len - 1 and
                        check_if_numbers_are_consecutive(keys)):
E                       IndexError: list index out of range
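Reading the traceback, the top-level dictionary has only dictionary values, so the filtered `keys` list is empty. The arithmetic check still passes for an empty list (the sum is 0 and the expected sum is also 0), and `keys[0]` then raises the IndexError. A minimal sketch of a guard follows; `is_zero_based_run` is a hypothetical helper, not the fix that was applied in the library.

```python
def is_zero_based_run(keys):
    """True only for a non-empty list that is exactly 0, 1, ..., len(keys) - 1."""
    if not keys:
        return False  # nothing to index, so this cannot be a list
    return sorted(keys) == list(range(len(keys)))


# The original sum check, evaluated for an empty key list.
keys = []
keys_len = len(keys)
sum_check_passes = sum(keys) == int(((keys_len - 1) * keys_len) / 2)

guarded = is_zero_based_run(keys)
```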
