jsonpydexer's Issues

If a JSON file is missing the key, the indexer throws a KeyError and dies

Error:

Traceback (most recent call last):
  File "jp.py", line 4, in <module>
    jp.index(["citation", "doi"], r=True)
  File ...lib\site-packages\JsonPydexer.py", line 61, in index
    file_key = reduce(getitem, key, j)
KeyError: 'doi'

jp.py:

from JsonPydexer import JsonPydexer
jp = JsonPydexer("data")
jp.index(["citation", "doi"], r=True)

(not every file has the doi key under citation)
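One possible way to tolerate this, as a rough sketch only. The traceback shows the lookup is `reduce(getitem, key, j)`; the helper name `lookup` and the `MISSING` sentinel are hypothetical and not part of JsonPydexer.

```python
from functools import reduce
from operator import getitem

# Hypothetical sentinel meaning "this document has no value at that key path".
MISSING = object()


def lookup(document, key_path, default=MISSING):
    """Follow key_path through nested dicts; return default if any step is absent."""
    try:
        return reduce(getitem, key_path, document)
    except (KeyError, TypeError):
        # KeyError: a key is absent.
        # TypeError: an intermediate value is not a mapping (e.g. None).
        return default


with_doi = {"citation": {"doi": "10.1000/example"}}
without_doi = {"citation": {"title": "No DOI here"}}

found = lookup(with_doi, ["citation", "doi"])
absent = lookup(without_doi, ["citation", "doi"])
```

With something like this the indexer could skip (or separately record) files where the result is `MISSING` instead of dying on the first one.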

Nested keys

Allow a list of strings to be passed as the first positional argument of JsonPydexer.index().
E.g., ["personInfo", "contact", "phone"] would access "555-555-5555" in:

{
    "personInfo": {
        "contact": {
            "phone": "555-555-5555"
        }
    }
}
  • Add or identify test data
  • Write the tests
  • Pass the tests
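The nested lookup described in this issue can be sketched with `functools.reduce`. Treating a bare string as a one-element path is an assumption made here for illustration, and `normalize_key` is a hypothetical helper, not an existing JsonPydexer function.

```python
from functools import reduce
from operator import getitem


def normalize_key(key):
    """Hypothetical helper: turn "id" into ["id"]; copy a list of strings as-is."""
    return [key] if isinstance(key, str) else list(key)


document = {"personInfo": {"contact": {"phone": "555-555-5555"}}}

# Each element of the path indexes one level deeper into the document.
phone = reduce(getitem, normalize_key(["personInfo", "contact", "phone"]), document)
person = reduce(getitem, normalize_key("personInfo"), document)
```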

Write API for loading index

For now it's as simple as

import pickle

with open("key.pickle", "rb") as f:
    index = pickle.load(f)

but with other features coming here it likely won't be as simple. Hoping for something like

from JsonPydexer import JsonPydexer
jp = JsonPydexer("test_data/3")
jp.index(["name", "status", "id"])

print(jp.get_file(["id"], "id1"))
print()
for f in jp.get_files(["name"], "Alice"):
    print(f)

which would print:

1.json

3.json
1.json
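A rough sketch of what such a loading API might look like. Everything here is an assumption: the `IndexReader` name, keying the index by the key path as a tuple, and the on-disk layout (a pickled pair of dicts). Only `get_file` and `get_files` come from the wish above.

```python
import pickle


class IndexReader:
    """Hypothetical wrapper; JsonPydexer's eventual loading API may differ."""

    def __init__(self, unique, groups):
        # unique: {key_path_tuple: {value: filename}}
        # groups: {key_path_tuple: {value: set of filenames}}
        self.unique = unique
        self.groups = groups

    @classmethod
    def load(cls, path):
        """Read a pickled (unique, groups) pair from path."""
        with open(path, "rb") as f:
            unique, groups = pickle.load(f)
        return cls(unique, groups)

    def get_file(self, key_path, value):
        """Return the single filename indexed under value."""
        return self.unique[tuple(key_path)][value]

    def get_files(self, key_path, value):
        """Return all filenames indexed under value, sorted for stable output."""
        return sorted(self.groups[tuple(key_path)][value])


reader = IndexReader(
    unique={("id",): {"id1": "1.json"}},
    groups={("name",): {"Alice": {"1.json", "3.json"}}},
)
one = reader.get_file(["id"], "id1")
many = reader.get_files(["name"], "Alice")
```

The point of hiding `pickle.load` behind a `load` classmethod is that the storage format can change later without touching callers.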

Persistence formatting

Irrespective of language binding, index files created with the library in one language must also work with the library in other languages.
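Pickle is Python-specific, so one option (an assumption, nothing has been decided in this issue) is to persist the index as JSON. The sketch below borrows the `unique_indices` / `group_indices` dict shapes from the index-object issue in this tracker; the function names are hypothetical. Sets have no JSON equivalent, so they are written as sorted lists and converted back on load.

```python
import json


def index_to_json(unique_indices, group_indices):
    """Serialize index dicts to a JSON string; sets become sorted lists."""
    payload = {
        "unique_indices": unique_indices,
        "group_indices": {
            field: {value: sorted(files) for value, files in by_value.items()}
            for field, by_value in group_indices.items()
        },
    }
    return json.dumps(payload, sort_keys=True)


def index_from_json(text):
    """Inverse of index_to_json; lists are turned back into sets."""
    payload = json.loads(text)
    group_indices = {
        field: {value: set(files) for value, files in by_value.items()}
        for field, by_value in payload["group_indices"].items()
    }
    return payload["unique_indices"], group_indices


unique = {"id": {"z1": "0001.json"}}
groups = {"status": {"active": {"0001.json", "0002.json"}}}
restored_unique, restored_groups = index_from_json(index_to_json(unique, groups))
```

One caveat of this approach: JSON object keys are always strings, so non-string key values would need an explicit encoding.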

introduce index object

Many upcoming issues (#28, #24, etc.) require some changes to the way indexes are stored. The index for a directory should be contained in one file: a serialized index object. The index object should have:

  • an attribute called files, a dict mapping filenames to some indication of when they were last seen by our indexer (hash or datetime, TBD): {"filename": indication} (#24)
  • an attribute called unique_indices, a dict of dicts (the outer dict key being the key field we indexed on), each inner dict being an index as we previously treated them: {"key_field": {"key_value": "filename"}}
  • an attribute called group_indices, a dict of dicts much as above, but with sets as the inner dicts' values: {"key_field": {"key_value": {"filename1", "filename2", ...}}} (#28)

A more concrete example:

files = {
    # The value here may be a datetime object, a hash, or some other indication.
    # For illustration it is simply a string containing a mm-dd-yyyy date.
    "0001.json": "01-01-2018",
    "0002.json": "01-01-2018",
    "0003.json": "01-03-2018",
}

unique_indices = {
    "id": {
        "z1": "0001.json",
        "z2": "0002.json",
        "z3": "0003.json",
    },
    "name": {
        "Alice": "0001.json",
        "Bob": "0002.json",
        "Eve": "0003.json",
    },
}

group_indices = {
    "status": {
        "active": {"0001.json", "0002.json"},
        "disabled": {"0003.json"},
    }
}
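As a starting point for discussion, the three attributes could live on a small class like the one below. This is only a sketch: the class name `Index`, the `add_unique` / `add_group` methods, and the choice to raise on a duplicate unique value are all assumptions, not existing JsonPydexer code.

```python
class Index:
    """Hypothetical container for the three attributes proposed in this issue."""

    def __init__(self, files=None):
        self.files = dict(files or {})  # {"filename": indication}
        self.unique_indices = {}        # {"key_field": {"key_value": "filename"}}
        self.group_indices = {}         # {"key_field": {"key_value": {filenames}}}

    def add_unique(self, key_field, key_value, filename):
        """Record filename as the only file with key_value under key_field."""
        by_value = self.unique_indices.setdefault(key_field, {})
        # Assumption: a repeated value under a unique key is treated as an error.
        if by_value.get(key_value, filename) != filename:
            raise ValueError(f"duplicate value {key_value!r} for unique key {key_field!r}")
        by_value[key_value] = filename

    def add_group(self, key_field, key_value, filename):
        """Add filename to the set of files sharing key_value under key_field."""
        self.group_indices.setdefault(key_field, {}).setdefault(key_value, set()).add(filename)


index = Index(files={"0001.json": "01-01-2018", "0002.json": "01-01-2018"})
index.add_unique("id", "z1", "0001.json")
index.add_unique("id", "z2", "0002.json")
index.add_group("status", "active", "0001.json")
index.add_group("status", "active", "0002.json")
```

Because everything hangs off one object, serializing it yields the single index file per directory that this issue asks for.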
    

Adding keynames

Adding keynames after opening an existing index does not work. As this is not currently documented as a feature, I'm categorizing this as an enhancement and not a bug.

check if filename already exists

#TODO check if filename already exists

The behavior for now should be to throw an error. In future versions we can either open it, implement some kind of "update changes since last indexing" behavior, or do something else entirely. Haven't decided yet.
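A minimal sketch of the "throw an error" behavior, assuming the index is still written with pickle. The `write_index` name is hypothetical. It relies on the built-in exclusive-creation mode `"x"` of `open()`, which raises `FileExistsError` when the target already exists, so no separate existence check is needed.

```python
import os
import pickle
import tempfile


def write_index(path, index):
    """Pickle index to path, refusing to overwrite an existing file."""
    # "xb" = exclusive creation, binary: fails if path already exists.
    with open(path, "xb") as f:
        pickle.dump(index, f)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "key.pickle")
    write_index(target, {"id1": "1.json"})  # first write succeeds
    try:
        write_index(target, {"id1": "1.json"})  # second write hits the existing file
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True
```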

After un-pickling and attempting to print the resulting dict, Unicode errors happen

error:

  File "open.py", line 4, in <module>
    print(f)
  File "....lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 248: character maps to <undefined>

open.py:

import pickle
with open("citationtitle.pickle", "rb") as f:
    f = pickle.load(f)
    print(f)
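The traceback suggests the console encoding (cp1252) simply cannot represent U+2606, so the pickle itself is probably fine. A possible workaround, sketched with the standard library only (the `printable` helper is hypothetical), is to escape whatever the target encoding cannot encode before printing:

```python
import sys


def printable(obj, encoding=None):
    """Return repr(obj) with characters the target encoding cannot represent escaped."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # "backslashreplace" swaps unencodable characters for \uXXXX-style escapes.
    return repr(obj).encode(encoding, errors="backslashreplace").decode(encoding)


index = {"title": "A \u2606 star"}
escaped = printable(index, encoding="cp1252")
print(escaped)
```

Other options that avoid touching the code at all: set the `PYTHONIOENCODING` environment variable to `utf-8`, or on Python 3.7+ call `sys.stdout.reconfigure(encoding="utf-8")` before printing.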

add a `categorize` method or similar?

Basically, if there is a keyname that the user expects to exist in some, but not all, of the JSON files, the categorize method (I think there is a better name somewhere) should give some way to access the "exists" set and the "doesn't exist" set. Right now, I think this is handled by a key:value pair in group_indices with the key "None" and a set of documents as the value, plus one key:value pair in group_indices for each other key. tl;dr: I want a concise way to access the second group as a single set.
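A sketch of what this could look like, assuming the layout described above: one entry of `group_indices` maps each key value to a set of filenames, with files lacking the key collected under a "missing" key. Whether that key is `None` or the string `"None"` is not settled here, so it is a parameter; the function signature is hypothetical.

```python
def categorize(group_index, missing_key=None):
    """Split one group index into (files that have the key, files that lack it)."""
    missing = set(group_index.get(missing_key, set()))
    present = set()
    for value, filenames in group_index.items():
        if value != missing_key:
            # Union every "real" value's files into a single set.
            present |= set(filenames)
    return present, missing


status_index = {
    "active": {"0001.json", "0002.json"},
    "disabled": {"0003.json"},
    None: {"0004.json"},
}
has_status, no_status = categorize(status_index)
```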

Write additional tests

  • test_constructor_no_perms
    Define the JsonPydexer constructor's behavior when the targeted directory lacks the permissions we require, and test for this behavior.
  • test_index_filename
    Test for passing an explicit filename kwarg to JsonPydexer.index().
    I think explicit filenames should be disallowed, for reasons I'll put in an issue soon.

Recursive directories

Recursion into directories in JsonPydexer.index().

  • Setting the kwarg to True should make this happen, otherwise it shouldn't.

  • Add testdata and tests for both cases.
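The file discovery for both cases could be sketched with `pathlib`. The `r` kwarg name matches the `jp.index(..., r=True)` call seen in another issue; the `find_json_files` helper and the `.json` filter are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def find_json_files(directory, r=False):
    """Return sorted relative paths of .json files; descend into subdirectories only if r is True."""
    root = Path(directory)
    matches = root.rglob("*.json") if r else root.glob("*.json")
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


# Throwaway test data: one file at the top level, one in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "1.json").write_text("{}")
    (Path(tmp) / "sub" / "2.json").write_text("{}")
    flat = find_json_files(tmp)
    recursive = find_json_files(tmp, r=True)
```

The same two-level layout would work as test data for both the `r=True` and the default case.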

Enhance test_index_no_filename

The test PydexerIndex.test_index_no_filename() is asserting the bare minimum. Update this test to assert against the expected value from the testdata.

Add ability to "update" an index

  • Include a hash of each file in the index

  • Check for deleted files when updating an existing index

  • Check for modified files when updating an existing index

  • Add new files when updating an existing index

Edit: I think this is mostly completed in PR #38. At least, deleted and new files are taken care of. Modified files/hashing still needs to be done, and issues #35 and #34 can be done to test the behavior introduced in PR #38.
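For the remaining hashing piece, here is a sketch of how new, deleted, and modified files could be told apart once a hash per file is stored. It is not the code from PR #38; the helper names are hypothetical and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def file_hash(data):
    """Hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def diff_files(indexed, current):
    """Compare {filename: hash} from the stored index with the directory's current state."""
    new = set(current) - set(indexed)
    deleted = set(indexed) - set(current)
    modified = {
        name for name in set(indexed) & set(current)
        if indexed[name] != current[name]
    }
    return new, deleted, modified


indexed = {"1.json": file_hash(b'{"id": 1}'), "2.json": file_hash(b'{"id": 2}')}
current = {"1.json": file_hash(b'{"id": 1, "x": 0}'), "3.json": file_hash(b'{"id": 3}')}
new, deleted, modified = diff_files(indexed, current)
```

An update would then re-index `new` and `modified`, and drop entries for `deleted`.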
