
datatypes's People

Contributors

jaymon

Forkers

richlysakowski

datatypes's Issues

Lightweight image handler

It would be great to pull some of the image handling code from testdata to create pure-Python image handlers. They could give basic information about images, like height and width or whether a gif is animated, without having to install a more substantial external library.
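As a rough illustration of how little code the basic case needs, here is a minimal sketch that reads the width and height from a PNG header using only the standard library. The function name `png_size` is hypothetical and is not taken from testdata or datatypes; it only handles PNG and assumes the IHDR chunk comes first, as the PNG format requires.

```python
import struct

# every PNG file starts with this 8 byte signature
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the first 24 bytes of a PNG file.

    Hypothetical helper for illustration only.
    """
    # bytes 8-12 are the chunk length, bytes 12-16 are the chunk type
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("data does not start with a PNG header")

    # width and height are big-endian unsigned 32 bit integers
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

With a real file you would pass something like `open(path, "rb").read(24)`.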

html Tokenizer

Something like:

t = HTMLTokenizer(s, "a")
for atag in t:
    pout.v(atag)

It would be nice to have some light parsing functionality without having to pull in Beautiful Soup. I think the string.HTMLCleaner could be modified to do this.
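One possible starting point, sketched with the standard library's `html.parser.HTMLParser`. The names `TagCollector` and `iter_tags` are hypothetical and this is not how string.HTMLCleaner works; it only collects the attributes of matching start tags.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the attributes of every start tag named tagname."""

    def __init__(self, tagname):
        super().__init__()
        self.tagname = tagname
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this
        if tag == self.tagname:
            self.found.append(dict(attrs))


def iter_tags(html, tagname):
    """Iterate over the attribute dicts of matching tags (hypothetical)."""
    collector = TagCollector(tagname)
    collector.feed(html)
    collector.close()
    return iter(collector.found)
```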

datatypes's compat.py module should be comprehensive

This way, on smaller projects I can just install datatypes and do from datatypes.compat import * in my files instead of creating my own compat.py, or use it as a placeholder in other projects' compat.py modules:

# <NEWPROJECT>/compat.py
from datatypes.compat import *

# customize for <NEWPROJECT>

String.xmlescape method?

Would it be worth making this a method on the String object?

from xml.sax.saxutils import escape

def xmlescape(data):
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })

via

Path.*_class methods

Might be worth importing classproperty and converting these methods (ie, path_class, file_class, etc) into dynamic class properties
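For reference, a class property can be sketched as a small descriptor. This is a generic illustration, not the classproperty that datatypes would import, and the `Path` and `Dirpath` classes below are stand-ins rather than the real path classes.

```python
class classproperty:
    """Minimal read-only class property descriptor (illustrative sketch)."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner=None):
        if owner is None:
            owner = type(instance)
        # always call the getter with the class, never the instance
        return self.fget(owner)


class Path:
    # stand-in class, only here to show the descriptor in use
    @classproperty
    def path_class(cls):
        return cls


class Dirpath(Path):
    pass
```

Both `Dirpath.path_class` and `Dirpath().path_class` resolve against the class the attribute is looked up on.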

Path module future additions

Here are the various path versions I have and some of the stuff I might still want to port over from them:

  • testdata.path - this was the primary base for datatypes.path, there are some things I didn't port over: Dirpath.create_files, Dirpath.create_files and none of the modules stuff (eg, Modulepath and the Dirpath module methods) was ported over. I didn't bring over the copy_into/put_into because you can just switch target and dest and get the same behavior with the current copy_to functionality.
  • stockton.path - This has a lot of utility methods that I didn't move over because I wasn't sure how niche they are, for example, writelines, contains, and delete_lines. If I find a need for these methods then they should be moved over. If not, if/when I update stockton I should just have stockton's classes extend datatypes's classes and layer on the methods that stockton uses. I did move over and expand the Sentinal stuff
  • heard.path - I started porting and expanding the zip_to() functionality from this module and realized it was going to be more work than I wanted to do for the first pass, so I abandoned it mid port. I did port and expand the Tempdir from this module and also ported over the ext argument you can pass to Filepath.
  • bang.path - I probably want to bring over the Image code from this module, and also the Directory.create_file(), Directory.has_file(), and Directory.file_contents() methods. I might even want to bring over the DataDirectory class.

SchemaDict

Similar to defaultdict, you could do something like this:

d = SchemaDict(foo=dict, bar=0, che="", boos=list)

d["foo"] # {}
d["bar"] # 0
d["che"] # ""
d["boos"] # []

So it's basically a defaultdict where each key can have its own default value
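A minimal sketch of the idea using `dict.__missing__`. It assumes callable defaults (like `dict` and `list`) should be called to produce a fresh value, and that keys outside the schema should still raise KeyError; neither detail is specified above.

```python
class SchemaDict(dict):
    """Dict with a per-key default, filled in on first access (sketch)."""

    def __init__(self, **schema):
        super().__init__()
        self.schema = schema

    def __missing__(self, key):
        # keys that aren't in the schema behave like a normal dict
        if key not in self.schema:
            raise KeyError(key)

        default = self.schema[key]
        # assumption: callables are factories, everything else is a value
        value = default() if callable(default) else default
        self[key] = value
        return value
```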

Request parse user agent

Found this in some old application code, could probably be moved into Request core:

    def parse_user_agent(self, user_agent):
        """parses any user agent string to the best of its ability and tries not
        to error out"""
        d = {}

        # raw strings keep \d, \s and \( from being treated as string escapes
        regex = r"^([^/]+)" # 1 - get everything to first slash
        regex += r"/" # ignore the slash
        regex += r"(\d[\d.]*)" # 2 - capture the numeric version or build
        regex += r"\s+\(" # ignore whitespace before parens group
        regex += r"([^\)]+)" # 3 - capture the full paren body
        regex += r"\)\s*" # ignore the paren and any space if it is there
        regex += r"(.*)$" # 4 - everything else (most common in browsers)
        m = re.match(regex, user_agent)
        if m:
            application = m.group(1)
            version = m.group(2)
            system = m.group(3)
            system_bits = re.split(r"\s*;\s*", system)
            tail = m.group(4)

            # common
            d['client_application'] = application
            d['client_version'] = version
            d['client_device'] = system_bits[0]

            if application.startswith("Mozilla"):
                for browser in ["Chrome", "Safari", "Firefox"]:
                    browser_m = re.search(r"{}/(\d[\d.]*)".format(browser), tail)
                    if browser_m:
                        d['client_application'] = browser
                        d['client_version'] = browser_m.group(1)
                        break

        return d

and the test:

    def test_user_agent(self):
        user_agents = [
            (
                "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
                {
                    'client_application': "Chrome",
                    'client_version': "44.0.2403.157",
                    'client_device': "Windows NT 6.3"
                }
            ),
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
                {
                    'client_application': "Chrome",
                    'client_version': "44.0.2403.157",
                    'client_device': "Macintosh"
                }
            ),
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/40.0",
                {
                    'client_application': "Firefox",
                    'client_version': "40.0",
                    'client_device': "Macintosh"
                }
            ),
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12", # Safari
                {
                    'client_application': "Safari",
                    'client_version': "600.7.12",
                    'client_device': "Macintosh"
                }
            ),
            (
                "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8x zlib/1.2.5",
                {
                    'client_application': "curl",
                    'client_version': "7.24.0",
                    'client_device': "x86_64-apple-darwin12.0"
                }
            )
        ]

        for user_agent in user_agents:
            d = self.user_agent(user_agent[0])
            self.assertDictContainsSubset(user_agent[1], d)

String.indent

I got what looked like some funky behavior, it didn't look like:

String("\nFOO\n").indent(1)

was indenting as expected, so I should add some tests to confirm the behavior. I would expect something like this (periods represent spaces):

....\n
....FOO\n

and I think it might be stripping that last \n because it is the last character
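For comparison, the standard library's `textwrap.indent` produces the expected output when given a predicate that accepts every line. This is only a reference for what the tests could assert; it is not how String.indent is implemented.

```python
import textwrap

text = "\nFOO\n"

# by default textwrap.indent skips lines that contain only whitespace
default_indent = textwrap.indent(text, "    ")

# a predicate that always returns True indents the blank line too and
# keeps the trailing newline
full_indent = textwrap.indent(text, "    ", predicate=lambda line: True)
```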

Add testing if a file is binary to Filepath

If I needed a better check on if the file is binary, look here:

The gist seems to be to open the file and check for the NULL byte (b'\0').

I did something like this and it worked for what I needed, but at some point I might want to make this a Filepath method and flesh it out:

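import mimetypes

def is_binary(path):
    # guess_type needs a filename or url (not a bare extension) and
    # returns a (type, encoding) tuple, type is None when it can't guess
    mimetype, _ = mimetypes.guess_type(path)
    # anything that isn't a known text/* type is treated as binary
    return not (mimetype or "").startswith("text/")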
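The NULL byte check mentioned above could be sketched like this. The function name `has_null_byte` and the 8192 byte sample size are assumptions made for illustration; only the start of the file is inspected, so this is a heuristic rather than a guarantee.

```python
def has_null_byte(path, sample_size=8192):
    """Return True if the first sample_size bytes contain a NULL byte.

    Hypothetical helper, a heuristic for "this file is probably binary".
    """
    with open(path, "rb") as fp:
        sample = fp.read(sample_size)
    return b"\0" in sample
```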

String encoding thing

I had this in another library's String class:

        if not encoding:
            # ??? use chardet to figure out what encoding val is?
            # https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii/6988354#6988354
            encoding = sys.getdefaultencoding()

I'm just saving this, because I switched the library over to use datatypes's String class so this was going to get deleted.

Conventions cleanup

  • Url takes *paths when it should take *parts to be consistent with Path.
  • Both Url and Path use .create() for creating a new instance. These should be renamed to create_instance, which would free up .create to be used by children for whatever they need. Right now it's just a tad too confusing

String.indent

Move in Pout's pout.utils.String.indent method to the String class:

    def indent(self, indent_count):
        '''
        add whitespace to the beginning of each line of this string

        link -- http://code.activestate.com/recipes/66055-changing-the-indentation-of-a-multi-line-string/

        indent_count -- integer -- how many times environ.INDENT_STRING is
            repeated in front of each line

        return -- string -- this string with more whitespace
        '''
        if indent_count < 1:
            return self

        s = ((environ.INDENT_STRING * indent_count) + line for line in self.splitlines(False))
        s = "\n".join(s)
        return type(self)(s)

There is a headers.Environ and an environ.Environ

They clobber each other, I just did:

from datatypes import Environ

Thinking I was importing environ.Environ and instead got headers.Environ

I think I should rename headers.Environ since I think that will be the less common one. Maybe HTTPEnviron?

Character class narrow unicode

I was seeing some interesting behavior when python2 had only unicode ucs2 support:

$ python
Python 2.7.18 (default, Sep  1 2020, 16:08:16)
>>> s = u'\uD859\uDFCC'
>>> s
u'\U000267cc'
>>> u'\uD859\uDFCC'.encode("UTF-32").decode("UTF-32")
u'\U000267cc'

It was taking the utf-16 hex codes (\uD859 and \uDFCC) and converting them to the utf-32 hex code (\U000267cc) behind the scenes. I have methods like repr_string and repr_bytes and I might want to add some utf-8 (bytes), utf-16 (the \u values) and utf-32 (the \U values) methods just so you can get more information about the character. To see how all these come together, you can use fileformat.info and these are some pages I had open:

search:

  • python utf16 to utf32
  • convert utf16 to utf32
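To make the relationship between the three encodings concrete, here is a small Python 3 sketch for the same character. The helper name `code_units` is hypothetical and is not part of the Character class.

```python
def code_units(ch):
    """Return the big-endian bytes of ch in utf-8, utf-16 and utf-32.

    Hypothetical helper for illustration only.
    """
    return {
        "utf-8": ch.encode("utf-8"),
        # the -be codecs don't prepend a byte order mark
        "utf-16": ch.encode("utf-16-be"),
        "utf-32": ch.encode("utf-32-be"),
    }


units = code_units("\U000267cc")
```

The utf-16 bytes are the surrogate pair \uD859 \uDFCC and the utf-32 bytes are the single code point \U000267cc.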

Datetime can't be compared to Date

from datetime import date

d = date.today()
dt = Datetime()

d <= dt # TypeError: can't compare Datetime to datetime.date

I think there are methods I can override to make this compare possible
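One way this might work, sketched on a plain `datetime.datetime` subclass rather than on datatypes's Datetime. The class name `DateAwareDatetime` is hypothetical, and treating a date as midnight in the instance's own timezone is an assumption.

```python
import datetime


class DateAwareDatetime(datetime.datetime):
    """datetime subclass whose ordering methods accept plain dates."""

    def _promote(self, other):
        is_plain_date = (
            isinstance(other, datetime.date)
            and not isinstance(other, datetime.datetime)
        )
        if is_plain_date:
            # assumption: a date compares as midnight in our timezone
            return datetime.datetime(
                other.year, other.month, other.day, tzinfo=self.tzinfo
            )
        return other

    def __lt__(self, other):
        return super().__lt__(self._promote(other))

    def __le__(self, other):
        return super().__le__(self._promote(other))

    def __gt__(self, other):
        return super().__gt__(self._promote(other))

    def __ge__(self, other):
        return super().__ge__(self._promote(other))
```

Because the subclass is the right-hand operand's type in `d <= dt`, Python tries its reflected method first, so the comparison works in both directions.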

alias decorator

Would something like this work?

class Foobar(object):
    @alias("bar")
    def foo(self):
        return "foo"

fb = Foobar()
fb.foo() # foo
fb.bar() # foo
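It looks workable. A minimal sketch, assuming Python 3.6+ so the decorator can use `__set_name__` to learn which class it was defined on; this is one possible approach, not an existing datatypes decorator.

```python
class alias:
    """Bind the decorated method under one or more extra names (sketch)."""

    def __init__(self, *names):
        self.names = names
        self.func = None

    def __call__(self, func):
        self.func = func
        return self

    def __set_name__(self, owner, name):
        # called once when the class body finishes, replace this helper
        # with the real function and add every alias next to it
        setattr(owner, name, self.func)
        for alias_name in self.names:
            setattr(owner, alias_name, self.func)


class Foobar(object):
    @alias("bar")
    def foo(self):
        return "foo"
```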

bang.utils classes

Bang utils has a bunch of utility classes that might be nice:

  • ContextCache - provides a namespace cache
  • Scanner - Python implementation of Obj-c Scanner
  • UnlinkedTagTokenizer - This will go through an html block of code and return pieces that aren't linked (between <a> and </a>), allowing you to mess with the blocks of plain text that aren't special in some way

CSV strict

You should be able to set a strict value to True. When strict is True, adding a dict that is missing a fieldname, or that has extra fieldnames, should throw an error.
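The standard library gets halfway there: `csv.DictWriter` already raises ValueError on extra keys by default (`extrasaction="raise"`) but silently fills missing ones. A sketch of the strict behavior on top of it, where `StrictDictWriter` is a hypothetical name and only `writerow` is covered:

```python
import csv


class StrictDictWriter(csv.DictWriter):
    """DictWriter that also rejects rows with missing fieldnames."""

    def writerow(self, rowdict):
        missing = [f for f in self.fieldnames if f not in rowdict]
        if missing:
            raise ValueError("row is missing fields: {}".format(missing))
        # extra keys are rejected by DictWriter's default extrasaction
        return super().writerow(rowdict)
```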

Path.grep

We have an rglob and reglob and it would be nice to have an rcontains and recontains also, which would search for matching files and then also match on what the file contains, basically doing something like this:

for p in basedir.rglob("<PATTERN>"):
    if "<STRING>" in p.read_text():
        # do something if it matches

Instead, you could do something like:

for p in basedir.rcontains("<STRING>", "<PATTERN>"):
    # do something if it matches
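A standalone sketch of what rcontains could do, written against `pathlib` instead of datatypes's Path classes. The argument order follows the example above; skipping unreadable or undecodable files is an assumption.

```python
from pathlib import Path


def rcontains(basedir, needle, pattern="*"):
    """Yield files under basedir matching pattern that contain needle."""
    for p in Path(basedir).rglob(pattern):
        if not p.is_file():
            continue

        try:
            text = p.read_text()
        except (UnicodeDecodeError, OSError):
            # assumption: files that can't be read as text never match
            continue

        if needle in text:
            yield p
```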

SFTPDirpath and SFTPFilepath

I've been trying to figure out a good interface for SFTP, and these types of classes would keep the same path interface but extend it to SFTP, so the idea would be you do this:

d = SFTPDirpath("/foo/bar")
for f in d:
    f.cp("local/filepath")

A big issue with this interface is deciding when it connects and disconnects. It could use the io interface of .open and .close with a context manager, so something like:

d = SFTPDirpath("/foo/bar")
with d.open(username, password):
    for f in d:
        f.cp("local/filepath")

StreamTokenizer.count() method

Sometimes you just want to know how many tokens you have, it would be nice to be able to do something like this:

s = String("foo bar che")
s.tokenize().count() # 3

This could be done by saving the position, calling .readall() and then resetting the position
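The save, read and reset steps can be sketched on top of `io.StringIO`. The `WordStream` class is a hypothetical stand-in for StreamTokenizer that splits on whitespace, and its count only covers the tokens from the current position onward.

```python
import io


class WordStream:
    """Tiny whitespace tokenizer over a seekable stream (sketch)."""

    def __init__(self, text):
        self.stream = io.StringIO(text)

    def readall(self):
        # consumes the rest of the stream
        return self.stream.read().split()

    def count(self):
        position = self.stream.tell()
        try:
            return len(self.readall())
        finally:
            # put the stream back so counting doesn't consume tokens
            self.stream.seek(position)
```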

Add streaming to HTTPClient

You can stream files from requests like this:

# https://stackoverflow.com/a/16696317/5006
import requests

r = requests.get(url, stream=True)

with open("some/path", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)

I'd love for http.HTTPClient to have this functionality also, but if I ever needed it, it might be better to just install Requests
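For what it's worth, chunked downloads are also possible with only the standard library. A sketch using `urllib.request` and `shutil.copyfileobj`, where the function name `stream_to_file` is hypothetical and says nothing about how http.HTTPClient is built:

```python
import shutil
import urllib.request


def stream_to_file(url, dest, chunk_size=1024):
    """Copy the response body to dest one chunk at a time (sketch)."""
    with urllib.request.urlopen(url) as response:
        with open(dest, "wb") as fp:
            # copyfileobj reads and writes chunk_size bytes at a time so
            # the whole body is never held in memory
            shutil.copyfileobj(response, fp, chunk_size)
```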

Dict.rget

Works the same way as ritems: it finds the first matching key no matter how deeply it is buried in the dict
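A sketch of the lookup as a standalone function on plain dicts. The depth-first search order and the `default` argument are assumptions; ritems may walk the dict differently.

```python
_MISSING = object()


def rget(d, key, default=None):
    """Return the value of the first matching key at any depth (sketch)."""
    if key in d:
        return d[key]

    for value in d.values():
        if isinstance(value, dict):
            # the sentinel lets a stored None count as a real match
            found = rget(value, key, _MISSING)
            if found is not _MISSING:
                return found

    return default
```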

string.Regex.count() method

It would be nice to have a count method, because sometimes you just want to count how many of something you have:

s = String("foo bar foo")
s.regex(r"foo").count()
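With the standard library this is a one-liner over `re.finditer`, which could be what a count method wraps. The function name `regex_count` is hypothetical.

```python
import re


def regex_count(pattern, text):
    """Count the non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))
```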

Datetime should be timezone aware

Right now, Datetime creates a naive datetime with UTC time. It should probably create a UTC pegged datetime that has the tzinfo set to datetime.timezone.utc.

My guess is this would be more annoying than I think, because if you've got a tz set then you can't compare against naive datetimes and stuff, but it would be worth looking into making this work.

https://docs.python.org/3/library/datetime.html
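The annoyance is easy to reproduce with the standard library alone. The snippet below shows an ordering comparison between naive and aware datetimes failing, and how pegging the naive value to UTC makes them comparable; the variable names are illustrative.

```python
import datetime

naive = datetime.datetime(2024, 1, 1, 12, 0)
aware = datetime.datetime(2024, 1, 1, 12, 0, tzinfo=datetime.timezone.utc)

try:
    naive < aware
    comparable = True
except TypeError:
    # can't compare offset-naive and offset-aware datetimes
    comparable = False

# replace() attaches the tzinfo without shifting the clock time
pegged = naive.replace(tzinfo=datetime.timezone.utc)
```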

String.wrap using String.truncate method

It would be great to add a wrap method that will split on word boundaries, then use that method in Captain Jaymon/captain#54

Basically, something like:

s = String("foo bar che")
s.wrap(5) # foo\nbar\nche
s.wrap(2) # ValueError cannot wrap lines with a max of 2 characters per line
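A sketch of the behavior as a plain function. It does not use String.truncate, and raising ValueError whenever a single word is longer than the limit is an assumption based on the example above.

```python
def wrap(text, width):
    """Wrap text on word boundaries so no line is longer than width."""
    lines = []
    current = ""
    for word in text.split():
        if len(word) > width:
            raise ValueError(
                "cannot wrap lines with a max of {} characters".format(width)
            )

        candidate = current + " " + word if current else word
        if len(candidate) <= width:
            current = candidate
        else:
            # the word doesn't fit, so it starts a new line
            lines.append(current)
            current = word

    if current:
        lines.append(current)
    return "\n".join(lines)
```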

Augment the dict to take a list as the key

You could do this in an rpop method, but the idea would be you could do something like this:

d = {
    "foo": {
        "bar": 1,
        "che": 2
    }
}

d[["foo", "bar"]] # 1
d.pop(["foo", "bar"]) # 1

This wouldn't conflict with any existing behavior because lists are unhashable, so by default a list key just raises an error:

>>> d = {}
>>> d[["foo", "bar"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
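A minimal sketch of the idea as a dict subclass. The class name `PathDict` is hypothetical, and a missing intermediate key raises KeyError even when a default is passed to pop, which may or may not be the desired behavior.

```python
class PathDict(dict):
    """Dict where a list key walks down nested dicts (sketch)."""

    def __getitem__(self, key):
        if isinstance(key, list):
            value = self
            for k in key:
                value = value[k]
            return value
        return super().__getitem__(key)

    def pop(self, key, *default):
        if isinstance(key, list):
            # everything but the last key locates the parent dict
            *parents, last = key
            return self[parents].pop(last, *default)
        return super().pop(key, *default)
```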

OrderedList custom key bug

This failed:

fs = OrderedList(key=lambda f: f.integer_value)
for f in foo:
    fs.append(fs) # AttributeError: 'OrderedList' object has no attribute 'integer_value'

with this stacktrace:

File "filename.py", line N, in method
    fs.append(fs)
  File "/.../site-packages/datatypes/collections.py", line 445, in append
    k = self.key(x)
  File "filename.py", line N, in <lambda>
    fs = OrderedList(key=lambda f: f.fstat.st_mtime)
AttributeError: 'OrderedList' object has no attribute 'fstat'

Interestingly, this failed also:

fs = OrderedList(key=lambda self, f: f.integer_value)
for f in foo:
    fs.append(fs) # TypeError: <lambda>() takes exactly 2 arguments (1 given)

So it is inconsistent, I need to add tests and fix this issue so that key can be set or overridden.

cachedmethod similar to property

I do a lot of things like this:

def foo(self):
    foo = getattr(self, "_foo", None)
    if not foo:
        foo = 5
        self._foo = foo
    return foo

It would be great to have a cachedmethod similar to property so I can do something like:

@cachedmethod(_cached="_foo")
def foo(self):
    return 5

There is functools.cache but I'd like a bit of control over what property gets set
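A sketch of what such a decorator could look like, keeping the `_cached` keyword from the example above. It uses a sentinel instead of `if not foo` so falsy results such as 0 are cached too, which differs slightly from the handwritten pattern.

```python
import functools

_MISSING = object()


def cachedmethod(_cached):
    """Cache a no-argument method's result on the attribute _cached."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self):
            value = getattr(self, _cached, _MISSING)
            if value is _MISSING:
                value = func(self)
                setattr(self, _cached, value)
            return value
        return wrapper
    return decorator
```

Setting or deleting the named attribute by hand then overrides or clears the cached value.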

Path callback glob or iter

You could pass in a callback and it would run for every File/Directory it finds. What should the name be:

  • cbglob?
  • itercb?
