jaymon / datatypes

My personal standard library
License: MIT License
Both `Headers` and `Environ` could live in this module since they seem generic enough.
It would be great to pull some of the image-handling code from testdata to create pure-Python image handlers that could give basic information about images, like height and width, or whether a gif is animated, without having to install a more substantial external library.
Something like:

```python
t = HTMLTokenizer(s, "a")
for atag in t:
    pout.v(atag)
```

It would be nice to have some light parsing functionality without having to pull in Beautiful Soup. I think `string.HTMLCleaner` could be modified to do this.
This way, on smaller projects I can just install datatypes and do `from datatypes.compat import *` in my files instead of creating my own compat.py, or use it as a placeholder in other projects' compat.py modules:

```python
# <NEWPROJECT>/compat.py
from datatypes.compat import *
# customize for <NEWPROJECT>
```
Would it be worth making this a method on the String object?

```python
from xml.sax.saxutils import escape

def xmlescape(data):
    # escape &, <, > by default, plus both quote styles via the entities mapping
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;",
    })
```
probably into a `copy.py` module. Change the name also.
https://github.com/Jaymon/prom/blob/master/prom/query.py#L1432

Might be worth importing `classproperty` and converting these methods (ie, `path_class`, `file_class`, etc) into dynamic class properties.
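`classproperty` isn't in the standard library; a minimal read-only descriptor sketch (the example class is hypothetical) could look like:

```python
class classproperty:
    """minimal read-only class-level property descriptor"""
    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner=None):
        # bind to the class whether accessed on the class or an instance
        return self.fget(owner)


class Dirpath(object):
    # hypothetical example: resolve the class dynamically instead of
    # hardcoding it in every subclass
    @classproperty
    def path_class(cls):
        return cls
```

Because `__get__` receives the owner class, subclasses automatically get their own value without redefining the method.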
Here are the various path versions I have and some of the stuff I might still want to port over from them: `datatypes.path`. There are some things I didn't port over: `Dirpath.create_files`, and none of the modules stuff (eg, `Modulepath` and the `Dirpath` module methods) was ported over. I didn't bring over `copy_into`/`put_into` because you can just switch target and dest and get the same behavior with the current `copy_to` functionality. Also not ported: `writelines`, `contains`, and `delete_lines`. If I find a need for these methods then they should be moved over; if not, then if/when I update stockton I should just have stockton's classes extend datatypes' classes and layer on the methods that stockton uses. I did move over and expand the `Sentinal` stuff.

So if there is no path then create a tempfile that the CSV can write to?
If the fieldnames have unicode characters it chokes.
Similar to `defaultdict`, you could do something like this:

```python
d = SchemaDict(foo=dict, bar=0, che="", boos=list)
d["foo"]  # {}
d["bar"]  # 0
d["che"]  # ""
d["boos"] # []
```

So it's basically a default dict where you can have multiple keys.
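One minimal way the idea could work (the `__missing__` hook and the callables-as-factories convention are my assumptions, not an existing implementation):

```python
class SchemaDict(dict):
    """a defaultdict-alike where each key gets its own default"""
    def __init__(self, **schema):
        super().__init__()
        self.schema = schema

    def __missing__(self, key):
        default = self.schema[key]  # keys outside the schema still raise KeyError
        # treat callables (dict, list, ...) as factories, anything else as a value
        value = default() if callable(default) else default
        self[key] = value
        return value
```

Caching the value in `__missing__` mirrors `defaultdict`, so mutating `d["boos"]` persists between lookups.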
that basically just wraps the standard library. Python 2:

```python
import cgi
cgi.escape("<STRING>")
```

and python 3:

```python
import html
html.escape("<STRING>")
```
Found this in some old application code, could probably be moved into Request core:

```python
import re

def parse_user_agent(self, user_agent):
    """parses any user agent string to the best of its ability and tries
    not to error out"""
    d = {}
    regex = r"^([^/]+)"     # 1 - get everything to first slash
    regex += "/"            # ignore the slash
    regex += r"(\d[\d.]*)"  # 2 - capture the numeric version or build
    regex += r"\s+\("       # ignore whitespace before parens group
    regex += r"([^\)]+)"    # 3 - capture the full paren body
    regex += r"\)\s*"       # ignore the paren and any space if it is there
    regex += "(.*)$"        # 4 - everything else (most common in browsers)

    m = re.match(regex, user_agent)
    if m:
        application = m.group(1)
        version = m.group(2)
        system = m.group(3)
        system_bits = re.split(r"\s*;\s*", system)
        tail = m.group(4)

        # common
        d['client_application'] = application
        d['client_version'] = version
        d['client_device'] = system_bits[0]

        if application.startswith("Mozilla"):
            for browser in ["Chrome", "Safari", "Firefox"]:
                browser_m = re.search(r"{}/(\d[\d.]*)".format(browser), tail)
                if browser_m:
                    d['client_application'] = browser
                    d['client_version'] = browser_m.group(1)
                    break

    return d
```
and the test:

```python
def test_user_agent(self):
    user_agents = [
        (
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
            {
                'client_application': "Chrome",
                'client_version': "44.0.2403.157",
                'client_device': "Windows NT 6.3"
            }
        ),
        (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
            {
                'client_application': "Chrome",
                'client_version': "44.0.2403.157",
                'client_device': "Macintosh"
            }
        ),
        (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/40.0",
            {
                'client_application': "Firefox",
                'client_version': "40.0",
                'client_device': "Macintosh"
            }
        ),
        (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12", # Safari
            {
                'client_application': "Safari",
                'client_version': "600.7.12",
                'client_device': "Macintosh"
            }
        ),
        (
            "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8x zlib/1.2.5",
            {
                'client_application': "curl",
                'client_version': "7.24.0",
                'client_device': "x86_64-apple-darwin12.0"
            }
        )
    ]

    for user_agent in user_agents:
        d = self.user_agent(user_agent[0])
        self.assertDictContainsSubset(user_agent[1], d)
```
I got what looked like some funky behavior; it didn't look like:

```python
String("\nFOO\n").indent(1)
```

was indenting as expected, so I should add some tests to make sure it is doing what is expected. I would expect something like (period represents spaces):

```
....\n
....FOO\n
```

and I think it might be stripping that last `\n` because it is the last character.
in a `reflect.py` module. I would've done it already, but I couldn't decide on what interface I wanted, so I just left it in prom for right now.
If I needed a better check on whether the file is binary, look here: the gist seems to be to open the file and check for the NULL byte (`b'\0'`).

I did something like this and it worked for what I needed, but at some point I might want to make this a `Filepath` method and flesh it out:

```python
import mimetypes

def is_binary(ext):
    # guess_type returns a (type, encoding) tuple, eg ("text/plain", None);
    # the original checked membership in the tuple itself, which never matched
    t = mimetypes.guess_type(ext)[0] or ""
    return not ("plain" in t or "text" in t)
```
a case-insensitive set
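A sketch of what that could look like (the class name and the `casefold` choice are my assumptions), folding values on the way in so membership and dedup ignore case:

```python
class CaseInsensitiveSet(set):
    """set of strings where membership and dedup ignore case"""
    @staticmethod
    def _fold(value):
        # casefold handles more aggressive unicode folding than lower
        return value.casefold() if isinstance(value, str) else value

    def __init__(self, iterable=()):
        super().__init__(self._fold(v) for v in iterable)

    def add(self, value):
        super().add(self._fold(value))

    def __contains__(self, value):
        return super().__contains__(self._fold(value))
```

A fuller version would also override `discard`, `remove`, and the set operators, but this covers the common path.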
Current `datetime.timedelta` doesn't take months or years; it would be nice if it could.
I had this in another library's String class:

```python
if not encoding:
    # ??? use chardet to figure out what encoding val is?
    # https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii/6988354#6988354
    encoding = sys.getdefaultencoding()
```

I'm just saving this because I switched the library over to use datatypes's String class, so this was going to get deleted.
`Url` takes `*paths` when it should take `*parts` to be consistent with `Path`.

`Url` and `Path` use `.create()` for creating a new instance; all of this should be renamed to `create_instance`, which would free up `.create` to be used by children for whatever. Right now it's just a tad too confusing.

Discussed here:
Escape Sequence: `\N{name}`, meaning "Character named name in the Unicode database".
Move Pout's `pout.utils.String.indent` method into the String class:

```python
def indent(self, indent_count):
    '''add whitespace to the beginning of each line of self

    link -- http://code.activestate.com/recipes/66055-changing-the-indentation-of-a-multi-line-string/

    indent_count -- integer -- how much whitespace we want in front of each line
    return -- string -- self with more whitespace
    '''
    if indent_count < 1:
        return self

    s = ((environ.INDENT_STRING * indent_count) + line for line in self.splitlines(False))
    s = "\n".join(s)
    return type(self)(s)
```
They clobber each other. I just did:

```python
from datatypes import Environ
```

thinking I was importing `environ.Environ`, and instead got `headers.Environ`. I think I should rename `headers.Environ` since I think that will be the less common one. Maybe `HTTPEnviron`?
I was seeing some interesting behavior when python2 had only unicode ucs2 support:

```
$ python
Python 2.7.18 (default, Sep 1 2020, 16:08:16)
>>> s = u'\uD859\uDFCC'
>>> s
u'\U000267cc'
>>> u'\uD859\uDFCC'.encode("UTF-32").decode("UTF-32")
u'\U000267cc'
```

It was taking the utf-16 hex codes (`\uD859` and `\uDFCC`) and converting them to the utf-32 hex code (`\U000267cc`) behind the scenes. I have methods like `repr_string` and `repr_bytes`, and I might want to add some utf-8 (bytes), utf-16 (the `\u` values) and utf-32 (the `\U` values) methods just so you can get more information about the character. To see how all these come together, you can use fileformat.info; these are some pages I had open:
search:
```python
from datetime import date

d = date.today()
dt = Datetime()
d <= dt  # TypeError: can't compare Datetime to datetime.date
```

I think there are methods I can override to make this compare possible.
Would something like this work?

```python
class Foobar(object):
    @alias("bar")
    def foo(self):
        return "foo"

fb = Foobar()
fb.foo() # foo
fb.bar() # foo
```
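It can work, but a plain decorator can't add names to the class by itself; one sketch pairs the decorator with a metaclass that binds the recorded aliases at class-creation time (both `alias` and `AliasedMeta` are assumed names, not existing datatypes API):

```python
def alias(*names):
    """record alias names on the decorated function"""
    def decorator(func):
        func._aliases = names
        return func
    return decorator


class AliasedMeta(type):
    """bind every recorded alias to its function when the class is created"""
    def __new__(mcs, name, bases, ns):
        # snapshot first so we can add to ns while looping
        for attr in list(ns.values()):
            for alias_name in getattr(attr, "_aliases", ()):
                ns[alias_name] = attr
        return super().__new__(mcs, name, bases, ns)


class Foobar(metaclass=AliasedMeta):
    @alias("bar")
    def foo(self):
        return "foo"
```

An `__init_subclass__` hook on a base class would work just as well and avoids the metaclass.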
This was in the caches comments; a version of this would go great here because I was surprised I didn't already have it:
Bang utils has a bunch of utility classes that might be nice:
Looks like query should be called url search params.

You should be able to set a strict value to True; if strict is True, then missing a fieldname in the dict to add, or having extra fieldnames, should throw an error.
We have an `rglob` and `reglob`, and it would be nice to have an `rcontains` and `recontains` also, which would search for matching files and then also match on what the file contains, basically doing something like this:

```python
for p in basedir.rglob("<PATTERN>"):
    if "<STRING>" in p.read_text():
        # do something if it matches
```

Instead, you could do something like:

```python
for p in basedir.rcontains("<STRING>", "<PATTERN>"):
    # do something if it matches
```
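A standalone sketch of the idea over `pathlib` (the real `Dirpath` API will differ, and the `rcontains` signature here is a guess):

```python
import pathlib

def rcontains(basedir, needle, pattern="*"):
    """yield files under basedir matching pattern whose text contains needle"""
    for p in pathlib.Path(basedir).rglob(pattern):
        if not p.is_file():
            continue
        try:
            if needle in p.read_text():
                yield p
        except UnicodeDecodeError:
            # skip binary files rather than erroring out
            continue
```

A `recontains` variant would just swap the `in` check for `re.search`.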
I've been trying to figure out a good interface for SFTP, and these types of classes would keep the same path interface but extend it to SFTP, so the idea would be you do this:

```python
d = SFTPDirpath("/foo/bar")
for f in d:
    f.cp("local/filepath")
```

A big issue with this interface is when it connects and disconnects. It could use the io interface of `.open` and `.close` with a context manager, so something like:

```python
d = SFTPDirpath("/foo/bar")
with d.open(username, password):
    for f in d:
        f.cp("local/filepath")
```
Sometimes you just want to know how many tokens you have; it would be nice to be able to do something like this:

```python
s = String("foo bar che")
s.tokenize().count() # 3
```

This could be done by saving the position, calling `.readall()`, and then resetting the position.
So `create` is just too loaded a name to use raw. I think I should rename `create` to `create_instance`, `create_file` to `create_file_instance`, etc. I could keep `create_as` the same, but I could also rename it to `create_as_instance`.
You can stream files from requests like this:

```python
# https://stackoverflow.com/a/16696317/5006
r = requests.get(url, stream=True)
with open("some/path", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
```

I'd love for `http.HTTPClient` to have this functionality also, but maybe if I ever needed it, it would be better to just install Requests.
works the same way as `ritems`; it will find the first matching key no matter how far buried in the dict it is
Seems like it would be nice to have it outside of testdata sometimes
It would be nice to have a count method, because sometimes you just want to count how many of something you have:

```python
s = String("foo bar foo")
s.regex(r"foo").count()
```
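Until then, the count is a one-liner over the stdlib; `.regex(...).count()` could be backed by something like this (helper name is mine):

```python
import re

def regex_count(s, pattern):
    """count non-overlapping matches of pattern in s"""
    return sum(1 for _ in re.finditer(pattern, s))
```

Using `re.finditer` avoids materializing a match list the way `re.findall` would.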
It would be great to turn this into a supported feature:

```python
d = Datetime()
# month offset fixed from the original transcription (which assumed python2
# division and could produce month 0); still blows up when d.day doesn't
# exist in the target month
six_months_ago = Datetime(d.year + ((d.month - 7) // 12), ((d.month - 7) % 12) + 1, d.day)
```

via How do I calculate the date six months from the current date using the datetime Python module?
search:
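The year/month arithmetic generalizes to any offset; a hedged sketch of a months helper on plain `datetime.date` (the name `add_months` is mine), clamping the day so e.g. Jan 31 plus one month lands on the last day of February instead of erroring:

```python
import calendar
import datetime

def add_months(d, months):
    """return d shifted by months, clamping the day to the target month's length"""
    # work in 0-based month indices so divmod handles year rollover in both directions
    year, month = divmod(d.year * 12 + (d.month - 1) + months, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return d.replace(year=year, month=month, day=day)
```

With this, "six months ago" is just `add_months(d, -6)` with no edge cases to remember.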
Right now, `Datetime` creates a naive datetime with UTC time. It should probably create a UTC-pegged datetime that has the `tzinfo` set to `datetime.timezone.utc`.

My guess is this would be more annoying than I think, because if you've got a tz set then you can't compare against naive datetimes and stuff, but it would be worth looking into making this work.
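For reference, the naive/aware comparison annoyance is easy to reproduce with the stdlib alone:

```python
import datetime

aware = datetime.datetime.now(datetime.timezone.utc)  # tz-aware, pegged to UTC
naive = datetime.datetime.utcnow()                    # naive, but holds UTC time

# mixing them in a comparison raises TypeError
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

assert not comparable

# pegging the naive value makes the comparison legal again
pegged = naive.replace(tzinfo=datetime.timezone.utc)
assert pegged <= datetime.datetime.now(datetime.timezone.utc)
```

So a tz-aware `Datetime` would probably need to peg incoming naive values the same way before comparing.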
It would be great to add a wrap method that will split on word boundaries, then use that method in Captain (Jaymon/captain#54). Basically, something like:

```python
s = String("foo bar che")
s.wrap(5) # foo\nbar\nche
s.wrap(2) # ValueError: cannot wrap lines with a max of 2 characters per line
```
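The stdlib's `textwrap` gets most of the way there; a sketch of the proposed behavior as a standalone function (the ValueError policy for unwrappable words is the new part):

```python
import textwrap

def wrap(s, width):
    """split s on word boundaries so no line exceeds width characters"""
    # if any single word can't fit, word-boundary wrapping is impossible
    if any(len(word) > width for word in s.split()):
        raise ValueError(
            "cannot wrap lines with a max of {} characters per line".format(width)
        )
    return "\n".join(textwrap.wrap(s, width))
```

Without the guard, `textwrap.wrap` would silently break long words mid-word (or leave them over-width with `break_long_words=False`), which is why the explicit error seems friendlier here.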
You could do this in an `rpop` method, but the idea would be you could do something like this:

```python
d = {
    "foo": {
        "bar": 1,
        "che": 2
    }
}

d[["foo", "bar"]] # 1
d.pop(["foo", "bar"]) # 1
```

The reason why this would work is because, by default:

```
>>> d = {}
>>> d[["foo", "bar"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
```
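A sketch of how the list-key lookup could be implemented (the class name `PathDict` is my invention; the real datatypes class may differ). Since lists are unhashable, the `isinstance` check can safely claim them as path keys:

```python
class PathDict(dict):
    """dict where a list key drills into nested mappings"""
    def __getitem__(self, key):
        if isinstance(key, list):
            value = self
            for k in key:
                value = value[k]  # each step is a normal lookup
            return value
        return super().__getitem__(key)

    def pop(self, key, *default):
        if isinstance(key, list):
            *parents, last = key
            # drill to the parent container, then pop the leaf from it
            return self[parents].pop(last) if parents else super().pop(last, *default)
        return super().pop(key, *default)
```

`pop` with a path removes only the leaf, leaving sibling keys in place.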
Endpoints' `Url` class would probably be useful outside of endpoints because there is a similar class in `bang.utils`.
```python
import os

def has(name):
    k = environ.key(name)
    return k in os.environ
```
I just need to audit this code and make sure I understand, and document, the codepaths that result in `Path` inferring. I know `.create_as()` is the codepath. I might want to add a `.create_inferred_instance()` or something like that also.
This failed:

```python
fs = OrderedList(key=lambda f: f.integer_value)
for f in foo:
    fs.append(fs) # AttributeError: 'OrderedList' object has no attribute 'integer_value'
```

with this stacktrace:

```
File "filename.py", line N, in method
    fs.append(fs)
File "/.../site-packages/datatypes/collections.py", line 445, in append
    k = self.key(x)
File "filename.py", line N, in <lambda>
    fs = OrderedList(key=lambda f: f.fstat.st_mtime)
AttributeError: 'OrderedList' object has no attribute 'fstat'
```

Interestingly, this failed also:

```python
fs = OrderedList(key=lambda self, f: f.integer_value)
for f in foo:
    fs.append(fs) # TypeError: <lambda>() takes exactly 2 arguments (1 given)
```

So it is inconsistent. I need to add tests and fix this issue so that key can be set or overridden.
I do a lot of things like this:

```python
def foo(self):
    foo = getattr(self, "_foo", None)
    if not foo:
        foo = 5
        self._foo = foo
    return foo
```

It would be great to have a `cachedmethod` similar to `property` so I can do something like:

```python
@cachedmethod(_cached="_foo")
def foo(self):
    return 5
```

There is `functools.cache`, but I'd like a bit of control over what property gets set.
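A sketch of what `cachedmethod` could look like (the `_cached` keyword mirrors the pattern above; only no-argument methods are handled, and `None` is the "not computed yet" sentinel, which is an assumption):

```python
import functools

def cachedmethod(_cached):
    """cache a no-arg method's return value on the instance under _cached"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self):
            value = getattr(self, _cached, None)
            if value is None:
                value = func(self)
                # the attribute name is explicit, unlike functools.cache
                setattr(self, _cached, value)
            return value
        return wrapper
    return decorator


class Foo(object):
    @cachedmethod(_cached="_foo")
    def foo(self):
        return 5
```

Because the cache lives in a named instance attribute, it can be inspected, preset, or deleted to force a recompute.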
So you could pass in a callback and it will run that for every File/Directory it finds. What should be the name:
Basically returns a vanilla datetime instance:

```python
def datetime(self):
    return datetime(
        self.year,
        self.month,
        self.day,
        self.hour,
        self.minute,
        self.second,
        self.microsecond,
    )
```