cl3wis / jsonpydexer

A (python) indexer for large collections of json files

Python 100.00%

jsonpydexer's Issues

Reduce duplicate code in recursive and non-recursive sections of `index` method.

Investigate other JSON decoders

It's very slow now

If a JSON file is missing the key, the indexer throws a KeyError and dies

Error:

Traceback (most recent call last):
  File "jp.py", line 4, in <module>
    jp.index(["citation", "doi"], r=True)
  File ...lib\site-packages\JsonPydexer.py", line 61, in index
    file_key = reduce(getitem, key, j)
KeyError: 'doi'

jp.py:

from JsonPydexer import JsonPydexer
jp = JsonPydexer("data")
jp.index(["citation", "doi"], r=True)

(not every file has the doi key under citation)
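One possible way to survive files that lack the key is to catch the lookup failure per file and record the file as skipped instead of letting the exception end the run. The sketch below assumes the documents are already parsed into dicts; `build_index`, the filenames, and the DOI value are hypothetical and not part of JsonPydexer.

```python
from functools import reduce
from operator import getitem


def build_index(documents, key_path):
    """Map each resolved key value to its filename; collect files missing the key."""
    index, skipped = {}, []
    for filename, doc in documents.items():
        try:
            value = reduce(getitem, key_path, doc)
        except (KeyError, TypeError):
            # KeyError: a key along the path is absent.
            # TypeError: an intermediate value is not a dict.
            skipped.append(filename)
            continue
        index[value] = filename
    return index, skipped


# Invented sample data: only the first document has citation.doi.
documents = {
    "a.json": {"citation": {"doi": "10.1000/example"}},
    "b.json": {"citation": {"title": "No DOI here"}},
}
index, skipped = build_index(documents, ["citation", "doi"])
print(index)
print(skipped)
```

Returning the skipped filenames keeps the decision about what to do with them (warn, log, or index under a placeholder) with the caller.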

Nested keys

Allow a list of strings to be passed to JsonPydexer.index() as the first positional argument.
E.g., ["personInfo", "contact", "phone"] would access "555-555-5555" in:

{
    "personInfo": {
        "contact": {
            "phone": "555-555-5555"
        }
    }
}

Add or identify test data
Write the tests
Pass the tests
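The nested lookup described in this issue can be sketched with `functools.reduce` and `operator.getitem`, the same pair that appears in the traceback of the missing-key issue. The helper name `resolve` is an assumption made for illustration.

```python
import json
from functools import reduce
from operator import getitem


def resolve(document, key_path):
    """Follow key_path through nested dicts, one key per level."""
    return reduce(getitem, key_path, document)


document = json.loads('{"personInfo": {"contact": {"phone": "555-555-5555"}}}')
phone = resolve(document, ["personInfo", "contact", "phone"])
print(phone)
```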

Write API for loading index

For now it's as simple as

import pickle
with open("key.pickle", "rb") as f:
    index = pickle.load(f)

but with other features ~~coming~~ here it likely won't be as simple. Hoping for something like

from JsonPydexer import JsonPydexer
jp = JsonPydexer("test_data/3")
jp.index(["name", "status", "id"])

print(jp.get_file(["id"], "id1"))
print()
for f in jp.get_files(["name"], "Alice"):
    print(f)

1.json

3.json
1.json
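A rough sketch of how such a read API might sit on top of in-memory indexes follows. `IndexReader` is a hypothetical class, not the library's interface; keying each index by the tuple form of the key path is an assumption, and the sample entries are invented to mirror the output above. Results from `get_files` are sorted here, so their order may differ from the listing above.

```python
class IndexReader:
    """Hypothetical read-side wrapper around two in-memory indexes."""

    def __init__(self, unique_indices, group_indices):
        self.unique_indices = unique_indices
        self.group_indices = group_indices

    def get_file(self, key_path, value):
        """Return the single filename stored for value under a unique key."""
        return self.unique_indices[tuple(key_path)][value]

    def get_files(self, key_path, value):
        """Return every filename stored for value under a non-unique key."""
        return sorted(self.group_indices[tuple(key_path)].get(value, set()))


# Invented sample data.
reader = IndexReader(
    unique_indices={("id",): {"id1": "1.json", "id3": "3.json"}},
    group_indices={("name",): {"Alice": {"1.json", "3.json"}}},
)
print(reader.get_file(["id"], "id1"))
for filename in reader.get_files(["name"], "Alice"):
    print(filename)
```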

Update all path operations to use pathlib

Persistence formatting

Regardless of which language binding is used, index files created by the library in one language must also work in the others.

Many upcoming issues (#28, #24, etc.) require some changes to the way indexes are stored. The index for a directory should be contained in one file, a serialized index object. The index object should have:

an attribute called files, which contains a dict of filenames and some indication of when they were last seen by our indexer (hash or datetime, TBD): {"filename": indication} (#24)
an attribute called unique_indices, which is a dict of dicts (the outer dict key being the key field we indexed on), each inner dict being an index as we previously treated them: {"key_field": {"key_value": "filename"}}
an attribute called group_indices, which is a dict of dicts, much as above, but the inner dicts' values are sets: {"key_field": {"key_value": {filename1, filename2, ...}}} (#28)

A more concrete example:

files = {
    # The value may be a datetime object, a hash, or some other indication.
    # For illustration it is simply a string containing a mm-dd-yyyy date.
    "0001.json": "01-01-2018",
    "0002.json": "01-01-2018",
    "0003.json": "01-03-2018",
}

unique_indices = {
    "id": {
        "z1": "0001.json",
        "z2": "0002.json",
        "z3": "0003.json",
    },
    "name": {
        "Alice": "0001.json",
        "Bob": "0002.json",
        "Eve": "0003.json",
    },
}

group_indices = {
    "status": {
        # Inner values are sets of filenames.
        "active": {"0001.json", "0002.json"},
        "disabled": {"0003.json"},
    }
}
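Since the index file is meant to be readable from other languages, one option is to serialize the three attributes as JSON rather than pickle, storing each set as a sorted list. This is only a sketch under that assumption; the issue does not settle on a format, and `dump_index` and `load_index` are hypothetical names.

```python
import json


def dump_index(files, unique_indices, group_indices):
    """Serialize the three index attributes to one JSON string."""
    payload = {
        "files": files,
        "unique_indices": unique_indices,
        # JSON has no set type, so each set is stored as a sorted list.
        "group_indices": {
            key_field: {value: sorted(names) for value, names in groups.items()}
            for key_field, groups in group_indices.items()
        },
    }
    return json.dumps(payload, sort_keys=True)


def load_index(text):
    """Inverse of dump_index: lists under group_indices become sets again."""
    payload = json.loads(text)
    group_indices = {
        key_field: {value: set(names) for value, names in groups.items()}
        for key_field, groups in payload["group_indices"].items()
    }
    return payload["files"], payload["unique_indices"], group_indices


files = {"0001.json": "01-01-2018", "0002.json": "01-01-2018", "0003.json": "01-03-2018"}
unique_indices = {"id": {"z1": "0001.json", "z2": "0002.json", "z3": "0003.json"}}
group_indices = {"status": {"active": {"0001.json", "0002.json"}, "disabled": {"0003.json"}}}

text = dump_index(files, unique_indices, group_indices)
restored = load_index(text)
```

Note that this only round-trips cleanly when key values are strings, because JSON object keys are always strings.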

Add support for predicate functions to determine if a file should be included

For example, I want some hash list (no longer an index, I suppose, since it's specific) of files that have values that meet some condition.
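A minimal sketch of the idea, assuming the documents are already loaded into dicts: the caller supplies any function that takes a parsed document and returns a truthy value for files to include. `files_matching` and the sample fields are hypothetical.

```python
def files_matching(documents, predicate):
    """Return the set of filenames whose parsed document satisfies predicate."""
    return {filename for filename, doc in documents.items() if predicate(doc)}


# Invented sample data.
documents = {
    "0001.json": {"status": "active", "age": 41},
    "0002.json": {"status": "active", "age": 17},
    "0003.json": {"status": "disabled", "age": 65},
}
adults = files_matching(documents, lambda doc: doc.get("age", 0) >= 18)
print(sorted(adults))
```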

.jp.pkl created in run directory, not index directory
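One likely fix is to build the index file path from the indexed directory rather than relying on the current working directory. A small sketch with `pathlib` follows; the helper name `index_path` is an assumption.

```python
from pathlib import Path


def index_path(directory, filename=".jp.pkl"):
    """Place the index file inside the indexed directory, not the working directory."""
    return Path(directory) / filename


path = index_path("data")
print(path)
```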

Write tests for removing zombie files. depends on issue #24

Adding keynames

Adding keynames after opening an existing index does not work. As this is not currently documented as a feature, I'm categorizing this as an enhancement and not a bug.

check if filename already exists

JsonPydexer/JsonPydexer.py

Line 51 in ce72e8c

#TODO check if filename already exists,

Behavior for now should be to throw an error. In future versions we can either open it, implement some kind of "update changes since last indexing" behavior, or do something else entirely. Haven't decided yet.

KeyError when missing key

This should be caught

Directory check should raise exception only, not print to stderr
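A sketch of a check that raises and leaves reporting to the caller is shown below. The choice of `NotADirectoryError` and the helper name `require_directory` are assumptions; the issue does not say which exception type the project prefers.

```python
import tempfile
from pathlib import Path


def require_directory(path):
    """Return path as a Path, raising if it is not an existing directory."""
    directory = Path(path)
    if not directory.is_dir():
        # No print to stderr here: the caller decides how to report the failure.
        raise NotADirectoryError(f"not a directory: {directory}")
    return directory


with tempfile.TemporaryDirectory() as tmp:
    checked = require_directory(tmp)
    try:
        require_directory(Path(tmp) / "does-not-exist")
        raised = False
    except NotADirectoryError:
        raised = True
```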

After un-pickling and attempting to print the resulting dict, Unicode errors happen

error:

  File "open.py", line 4, in <module>
    print(f)
  File "....lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 248: character maps to <undefined>

open.py

import pickle
with open("citationtitle.pickle", "rb") as f:
    f = pickle.load(f)
    print(f)
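The error comes from the console encoding (cp1252 here), not from the pickle itself. One possible workaround, sketched below, is to escape characters the target encoding cannot represent before printing. The helper name `printable` is an assumption.

```python
import sys


def printable(text, encoding=None):
    """Escape characters that the target encoding cannot represent."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# U+2606 is the character from the traceback; cp1252 cannot encode it.
safe = printable("rating: \u2606", encoding="cp1252")
print(safe)
```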

Update readme

Set up Travis CI with the unittests

Don't allow users to pass keynames that resolve to a list/dict
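One way to enforce this is to check the resolved value before it is used as an index key, as in the sketch below. `check_indexable` is a hypothetical helper, and raising `TypeError` is an assumption about the desired behavior.

```python
def check_indexable(value, key_path):
    """Reject values that cannot serve as a single index key."""
    if isinstance(value, (list, dict)):
        raise TypeError(
            f"key {key_path!r} resolves to a {type(value).__name__}, not a scalar"
        )
    return value


ok = check_indexable("555-555-5555", ["personInfo", "contact", "phone"])
try:
    check_indexable({"phone": "555-555-5555"}, ["personInfo", "contact"])
    rejected = False
except TypeError:
    rejected = True
```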

add a `categorize` method or similar?

Basically, if there is a keyname that the user expects to exist in some, but not all, of the JSON files, the categorize method (I think there is a better name somewhere) should give some way to access the "exists" set and the "doesn't exist" set. Right now, I think the way this is handled is a key:value pair in group_indices with the key of "None" and the value of (set of documents), and a key:value pair in group_indices for each other key. tl;dr: I want a concise way to access the second group as a single set.
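A sketch of what such a method could return, assuming the missing-key bucket is stored under Python's `None` (the issue writes "None", so the real key may differ). The function operates on one inner dict of group_indices; its name and the sample data are hypothetical.

```python
def categorize(group_index):
    """Split one group index into (files with the key, files without it)."""
    missing = set(group_index.get(None, set()))
    present = set().union(
        *(names for value, names in group_index.items() if value is not None)
    )
    return present, missing


# Invented sample data: 0004.json has no status key.
status_index = {
    "active": {"0001.json", "0002.json"},
    "disabled": {"0003.json"},
    None: {"0004.json"},
}
present, missing = categorize(status_index)
```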

Write additional tests

test_constructor_no_perms
Define the JsonPydexer constructor's behavior when the targeted directory lacks permissions we require, and test for this behavior.
test_index_filename
Test for passing an explicit filename kwarg to JsonPydexer.index(). I think explicit filenames should be disallowed, for reasons I'll put in an issue soon.

Recursive directories

Recursion into directories in JsonPydexer.index().

Setting the kwarg to True should make this happen, otherwise it shouldn't.
Add testdata and tests for both cases.
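With `pathlib`, the two cases could differ only in the choice between `glob` and `rglob`. The sketch below builds a throwaway directory to show both; `json_files` and its `recursive` parameter are illustrative names, not the library's signature.

```python
import tempfile
from pathlib import Path


def json_files(directory, recursive=False):
    """List .json files under directory, descending into subdirectories on request."""
    root = Path(directory)
    matches = root.rglob("*.json") if recursive else root.glob("*.json")
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.json").write_text("{}")
    (root / "sub" / "b.json").write_text("{}")
    flat = json_files(root)
    deep = json_files(root, recursive=True)
```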

Allow indexing on multiple fields at once

Setup.py should pull version, author, etc from JsonPydexer.py

Use a file hash instead of a timestamp in the Index object
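A content hash could be computed with `hashlib`, reading the file in chunks so large JSON files are not loaded whole. SHA-256 and the helper name `file_digest` are assumptions; the issue does not name an algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "0001.json"
    sample.write_bytes(b'{"id": "z1"}')
    computed = file_digest(sample)

expected = hashlib.sha256(b'{"id": "z1"}').hexdigest()
```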

remove unused imports

Write tests for adding new files. depends on issue #24

Fix broken build

I think adding a requirements.txt will fix this

Enhance test_index_no_filename

The test PydexerIndex.test_index_no_filename() is asserting the bare minimum. Update this test to assert against the expected value from the testdata.

Allow indexing on non-unique key (ie date)
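For a non-unique key, each value has to map to a set of filenames rather than a single one. A minimal sketch with `collections.defaultdict` follows; `build_group_index` and the sample dates are invented, and files lacking the key are grouped under `None` as an assumption.

```python
from collections import defaultdict


def build_group_index(documents, key):
    """Group filenames by the value each document holds for key."""
    groups = defaultdict(set)
    for filename, doc in documents.items():
        # Documents without the key are grouped under None.
        groups[doc.get(key)].add(filename)
    return dict(groups)


# Invented sample data.
documents = {
    "0001.json": {"date": "01-01-2018"},
    "0002.json": {"date": "01-01-2018"},
    "0003.json": {"date": "01-03-2018"},
    "0004.json": {},
}
by_date = build_group_index(documents, "date")
```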

Update inline docstrings

Add ability to "update" an index

Include a hash of each file in the index

~~Check for deleted files when updating an existing index~~

Check for modified files when updating an existing index

~~Add new files when updating an existing index~~

Edit: I think this is mostly completed in PR #38. At least, deleted and new files are taken care of. Modified files/hashing still needs to be done, and issues #35 and #34 can be done to test the behavior introduced in PR #38.
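The three checks above can be sketched as a comparison between the stored `{filename: hash}` mapping and a freshly computed one. This is not the code from PR #38; `diff_files` and the sample hashes are hypothetical.

```python
def diff_files(previous, current):
    """Compare two {filename: hash} mappings and report what changed."""
    new = set(current.keys() - previous.keys())
    deleted = set(previous.keys() - current.keys())
    modified = {
        name
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    }
    return new, deleted, modified


# Invented sample data: short strings stand in for real file hashes.
previous = {"0001.json": "aaa", "0002.json": "bbb", "0003.json": "ccc"}
current = {"0001.json": "aaa", "0002.json": "xyz", "0004.json": "ddd"}
new, deleted, modified = diff_files(previous, current)
```

Only the files in `new` and `modified` would need to be re-read, while entries for `deleted` files can be dropped from every index.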
