cl3wis / jsonpydexer
A (python) indexer for large collections of json files
It's very slow now
Error:
Traceback (most recent call last):
  File "jp.py", line 4, in <module>
    jp.index(["citation", "doi"], r=True)
  File ...lib\site-packages\JsonPydexer.py", line 61, in index
    file_key = reduce(getitem, key, j)
KeyError: 'doi'
jp.py:
from JsonPydexer import JsonPydexer
jp = JsonPydexer("data")
jp.index(["citation", "doi"], r=True)
(Not every file has the doi key under citation.)
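One way to avoid the KeyError is to treat a missing key path as "no value" and skip that file. The following is a minimal sketch, assuming the lookup is the `reduce(getitem, key, j)` call shown in the traceback. The names `lookup` and `MISSING` are hypothetical and not part of JsonPydexer, and the documents are made-up stand-ins for parsed json files.

```python
from functools import reduce
from operator import getitem

MISSING = object()  # hypothetical sentinel meaning "key path not present"


def lookup(doc, key_path, default=MISSING):
    """Follow key_path through nested containers; return default if a step is absent."""
    try:
        return reduce(getitem, key_path, doc)
    except (KeyError, IndexError, TypeError):
        return default


# Made-up documents standing in for parsed json files.
docs = {
    "a.json": {"citation": {"doi": "10.1000/xyz"}},
    "b.json": {"citation": {"title": "no doi here"}},
}

index = {}
skipped = []
for filename, doc in docs.items():
    value = lookup(doc, ["citation", "doi"])
    if value is MISSING:
        skipped.append(filename)  # record the file instead of crashing
        continue
    index[value] = filename
```

Whether skipped files should be ignored, reported, or grouped under a placeholder key is a design decision this sketch leaves open.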
Allow a list of strings to be passed to JsonPydexer.index() in the first positional argument.
E.g. ["personInfo", "contact", "phone"] would access "555-555-5555" in:
{
    "personInfo": {
        "contact": {
            "phone": "555-555-5555"
        }
    }
}
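As an illustration of the nested access described here, the key path can be folded over the document one key at a time. This is a sketch using only the standard library; how JsonPydexer names or stores an index built from a key path is not specified in this issue, so the tuple conversion is an assumption.

```python
from functools import reduce
from operator import getitem

doc = {"personInfo": {"contact": {"phone": "555-555-5555"}}}
key_path = ["personInfo", "contact", "phone"]

# Equivalent to doc["personInfo"]["contact"]["phone"].
phone = reduce(getitem, key_path, doc)

# Lists are not hashable, so a key path used as a dict key (for example,
# to label the index it produced) has to be converted to a tuple first.
index_label = tuple(key_path)
indexes = {index_label: {phone: "person.json"}}  # "person.json" is a made-up filename
```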
For now it's as simple as
import pickle
with open("key.pickle", "rb") as f:
    index = pickle.load(f)
but with other features coming, it likely won't stay that simple. Hoping for something like:
from JsonPydexer import JsonPydexer
jp = JsonPydexer("test_data/3")
jp.index(["name", "status", "id"])
print(jp.get_file(["id"], "id1"))
print()
for f in jp.get_files(["name"], "Alice"):
    print(f)
1.json
3.json
1.json
Irrespective of the language bindings for the library, index files created using the library in one language must work in other languages as well.
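Pickle files are specific to Python, so a language-neutral format would be needed for this. One possible approach, sketched here as an assumption rather than a decided design, is to write the index as JSON and convert sets to sorted lists. The function name `index_to_json` is hypothetical.

```python
import json


def index_to_json(index):
    """Serialize an index to JSON text, turning sets into sorted lists."""

    def encode(obj):
        # json.dumps calls this only for objects it cannot serialize itself.
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        raise TypeError(f"not JSON serializable: {type(obj).__name__}")

    return json.dumps(index, default=encode, sort_keys=True)


index = {
    "status": {
        "active": {"0001.json", "0002.json"},
        "disabled": {"0003.json"},
    }
}
text = index_to_json(index)
restored = json.loads(text)  # sets come back as lists
```

Note that JSON object keys are always strings, so non-string key values would need their own encoding convention.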
Many upcoming issues (#28, #24, etc.) require some changes to the way indexes are stored. The index for a directory should be contained in one file, a serialized index object. The index object should have:
files, which contains a dict of filenames and some indication of when they were last seen by our indexer (hash or datetime, tbd): {"filename": indication} (#24)
unique_indices, which is a dict of dicts (with the outer dict key being the key field we indexed on), each inner dict being an index as we previously treated them: {"key_field": {"key_value": "filename"}}
group_indices, which is a dict of dicts, much as above, but the inner dicts' values are sets: {"key_field": {"key_value": {filename1, filename2, ...}}} (#28)
A more concrete example:
files = {
    # The value here may be a datetime object, a hash, or some other indication.
    # For illustration it's simply a string containing a mm-dd-yyyy date.
    "0001.json": "01-01-2018",
    "0002.json": "01-01-2018",
    "0003.json": "01-03-2018",
}
unique_indices = {
    "id": {
        "z1": "0001.json",
        "z2": "0002.json",
        "z3": "0003.json",
    },
    "name": {
        "Alice": "0001.json",
        "Bob": "0002.json",
        "Eve": "0003.json",
    },
}
group_indices = {
    "status": {
        "active": {"0001.json", "0002.json"},
        "disabled": {"0003.json"},
    },
}
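To make the proposed layout concrete, here is a sketch of how the three parts could be built from already-parsed documents. The function `build_index`, its parameters, and the `seen` placeholder are hypothetical, and the "last seen" indication is a plain string because the issue leaves that choice open.

```python
from collections import defaultdict


def build_index(docs, unique_fields, group_fields, seen="01-01-2018"):
    """Build files / unique_indices / group_indices from {filename: parsed_doc}."""
    files = {}
    unique_indices = {field: {} for field in unique_fields}
    group_indices = {field: defaultdict(set) for field in group_fields}

    for filename, doc in docs.items():
        files[filename] = seen  # placeholder for a hash or datetime
        for field in unique_fields:
            if field in doc:
                unique_indices[field][doc[field]] = filename
        for field in group_fields:
            if field in doc:
                group_indices[field][doc[field]].add(filename)

    return {
        "files": files,
        "unique_indices": unique_indices,
        # Convert defaultdicts to plain dicts so lookups never create keys.
        "group_indices": {f: dict(v) for f, v in group_indices.items()},
    }


docs = {
    "0001.json": {"id": "z1", "name": "Alice", "status": "active"},
    "0002.json": {"id": "z2", "name": "Bob", "status": "active"},
    "0003.json": {"id": "z3", "name": "Eve", "status": "disabled"},
}
result = build_index(docs, ["id", "name"], ["status"])
```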
For example, I want some hash list (no longer an index, I suppose, since it's specific) of files that have values that meet some condition.
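A rough sketch of what this could look like, assuming an index is a dict that maps key values to either one filename or a set of filenames. The helper `files_where` is a hypothetical name, not an existing JsonPydexer method.

```python
def files_where(index, predicate):
    """Return the set of filenames whose indexed value satisfies predicate."""
    matches = set()
    for value, filenames in index.items():
        if predicate(value):
            if isinstance(filenames, str):
                matches.add(filenames)      # unique index: a single filename
            else:
                matches.update(filenames)   # group index: a set of filenames
    return matches


status_index = {
    "active": {"0001.json", "0002.json"},
    "disabled": {"0003.json"},
}
not_disabled = files_where(status_index, lambda value: value != "disabled")
```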
Adding keynames after opening an existing index does not work. As this is not currently documented as a feature, I'm categorizing this as an enhancement and not a bug.
Line 51 in ce72e8c
Behavior for now should be to throw an error. In future versions we can either open it, implement some kind of "update changes since last indexing" behavior, or do something else entirely. Haven't decided yet.
This should be caught
error:
  File "open.py", line 4, in <module>
    print(f)
  File "....lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 248: character maps to <undefined>
open.py:
import pickle
with open("citationtitle.pickle", "rb") as f:
    f = pickle.load(f)
print(f)
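The error comes from printing a character the console encoding (cp1252 here) cannot represent. One possible workaround on the caller's side, shown as a sketch rather than a fix inside the library, is to escape unencodable characters before printing. `console_safe` is a hypothetical helper name.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by \\uXXXX escapes."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# U+2606 is the character from the traceback; cp1252 cannot encode it.
print(console_safe("rated \u2606", "cp1252"))
```

For the script above, that would mean `print(console_safe(str(f)))` instead of `print(f)`.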
Basically, if there is a keyname that the user expects to exist in some, but not all, of the json files, the categorize (I think there is a better name somewhere) should give some way to access the "exists" set and the "doesn't exist" set. Right now, I think the way this is handled is a key:value pair in group_indices with the key of "None" and the value of (set of documents), and a key:value pair in group_indices for each other key. tl;dr I want a concise way to access the second group as a single set.
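A sketch of the requested access, assuming one group index is a dict from key value to a set of filenames and that files without the key are stored under a placeholder key. Whether that placeholder is the Python object `None` or the string "None" is not clear from the description, so it is a parameter here. `split_by_presence` is a hypothetical name.

```python
def split_by_presence(group_index, missing_key=None):
    """Return (files that have the key, files that lack it) for one group index."""
    missing = set(group_index.get(missing_key, ()))
    present = set()
    for value, filenames in group_index.items():
        if value != missing_key:
            present |= set(filenames)
    return present, missing


doi_index = {
    "10.1000/a": {"1.json"},
    "10.1000/b": {"2.json", "3.json"},
    None: {"4.json"},
}
has_doi, no_doi = split_by_presence(doi_index)
```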
JsonPydexer constructor's behavior when the targeted directory lacks permissions we require, and test for this behavior.
JsonPydexer.index()
Recursion into directories in JsonPydexer.index(). Setting the kwarg to True should make this happen, otherwise it shouldn't.
Add testdata and tests for both cases.
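The two cases could be distinguished along these lines. This is a standard-library sketch, not the library's implementation; `find_json_files` and its `recursive` parameter are hypothetical names.

```python
import os


def find_json_files(root, recursive=False):
    """List .json files under root as paths relative to root, sorted."""
    if recursive:
        found = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".json"):
                    full_path = os.path.join(dirpath, name)
                    found.append(os.path.relpath(full_path, root))
        return sorted(found)
    # Non-recursive: only regular files directly inside root.
    return sorted(
        name
        for name in os.listdir(root)
        if name.endswith(".json") and os.path.isfile(os.path.join(root, name))
    )
```

A test for both cases would need a directory holding at least one json file at the top level and one in a subdirectory, then compare the two results.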
I think adding a requirements.txt will fix this
The test PydexerIndex.test_index_no_filename() is asserting the bare minimum. Update this test to assert against the expected value from the testdata.
Include a hash of each file in the index
Check for deleted files when updating an existing index
Check for modified files when updating an existing index
Add new files when updating an existing index
Edit: I think this is mostly completed in PR #38. At least, deleted and new files are taken care of. Modified files/hashing still needs to be done, and issues #35 and #34 can be done to test the behavior introduced in PR #38.
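For the remaining hashing work, one possible shape is to store a content hash per file and compare the stored mapping with a fresh one. This is a sketch under assumptions: SHA-256 is an arbitrary choice, and `file_hash` and `diff_files` are hypothetical names rather than anything introduced in PR #38.

```python
import hashlib


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_files(previous, current):
    """Compare two {filename: hash} dicts; return (added, deleted, modified) sets."""
    prev_names, cur_names = set(previous), set(current)
    added = cur_names - prev_names
    deleted = prev_names - cur_names
    modified = {n for n in prev_names & cur_names if previous[n] != current[n]}
    return added, deleted, modified


# Made-up hashes for illustration.
previous = {"a.json": "h1", "b.json": "h2", "c.json": "h3"}
current = {"a.json": "h1", "b.json": "changed", "d.json": "h4"}
added, deleted, modified = diff_files(previous, current)
```

An update pass would then reindex the added and modified files and drop entries for the deleted ones.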