chrisbeaumont / soupy
Easier wrangling of web data.
Home Page: http://soupy.readthedocs.org/
License: MIT License
If a user doesn't provide an input to filter(), it should just drop all falsy values.
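A minimal sketch of that default, using a toy stand-in for Collection rather than soupy's real class. The fallback to bool mirrors the builtin filter(None, iterable); the class body and the val() helper here are assumptions for illustration only.

```python
class Collection:
    """Toy stand-in for soupy's Collection (hypothetical, not the real class)."""

    def __init__(self, items):
        self._items = list(items)

    def filter(self, func=None):
        # Assumption: with no predicate, keep only truthy items,
        # the same way the builtin filter(None, iterable) does.
        predicate = bool if func is None else func
        return Collection(item for item in self._items if predicate(item))

    def val(self):
        return list(self._items)


kept = Collection([0, 1, "", "a", None, [], [2]]).filter().val()
print(kept)
```

In the real library the items are wrappers, so the truthiness test would probably need to look at the wrapped value rather than the wrapper itself.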
These would return Scalar(bool)
Thanks for your package.
After I finished the setup, importing soupy raised this error:
import soupy
Traceback (most recent call last):
File "", line 1, in <module>
File "/Library/Python/2.7/site-packages/soupy.py", line 139, in <module>
@six.python_2_unicode_compatible
AttributeError: 'module' object has no attribute 'python_2_unicode_compatible'
Do you have any ideas on how to fix it?
Thanks for your help.
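One way to narrow this down is to check whether the installed six actually defines the decorator. The helper below is a hypothetical diagnostic, not part of soupy or six, and it is demonstrated on a stand-in module so it runs without six installed.

```python
import types


def missing_attributes(module, names):
    """Return the names from `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


# Stand-in for a six install that lacks the decorator (hypothetical module).
old_six = types.ModuleType("six")
missing = missing_attributes(old_six, ["python_2_unicode_compatible"])
print(missing)
```

Against a real install, `import six` followed by `missing_attributes(six, ["python_2_unicode_compatible"])` and a look at `six.__version__` would show whether the six being imported is too old. If the name is reported missing, upgrading six is the likely fix, though that is an assumption about this particular setup.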
list(NullCollection()) should probably return the empty list instead of raising NullValueError.
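A rough sketch of the proposed behaviour, with toy classes standing in for soupy's own. The idea is only that iteration yields nothing while unwrapping still fails; the val() method and class bodies are assumptions.

```python
class NullValueError(ValueError):
    """Toy stand-in for soupy's NullValueError."""


class NullCollection:
    """Toy stand-in: an empty result that is safe to iterate."""

    def __iter__(self):
        # Iterating a null collection yields no items instead of raising.
        return iter(())

    def __len__(self):
        return 0

    def val(self):
        # Assumption: unwrapping a null value still raises.
        raise NullValueError("collection is null")


items = list(NullCollection())
print(items)
```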
Currently require simply checks if a Node is null. It should accept a function, and assert that the function evaluates to true when mapped on the data.
dom.find('a').require(Q['href'].startswith('https'))
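The proposal could look roughly like the sketch below. Wrapper and RequirementError are hypothetical names for illustration, and a plain lambda stands in for the Q expression.

```python
class RequirementError(ValueError):
    """Hypothetical error raised when a requirement is not met."""


class Wrapper:
    """Toy stand-in for a soupy node or scalar (not the real class)."""

    def __init__(self, value):
        self._value = value

    def require(self, func=None, msg="Requirement violated"):
        # With no function, keep the current behaviour: only reject null.
        if func is None:
            ok = self._value is not None
        else:
            ok = bool(func(self._value))
        if not ok:
            raise RequirementError(msg)
        return self  # returning self keeps the call chainable

    def val(self):
        return self._value


link = Wrapper("https://example.org").require(lambda s: s.startswith("https"))
print(link.val())
```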
Right now things like Scalar don't hash like they ought to:
Scalar(2) in {Scalar(2)} # False
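A sketch of one possible fix: hash the wrapper by its wrapped value and make equality agree with it. This is a toy Scalar, not soupy's. In particular it returns a plain bool from __eq__, which may differ from how the real class overloads comparison.

```python
class Scalar:
    """Toy stand-in for soupy's Scalar (hypothetical)."""

    def __init__(self, value):
        self._value = value

    def __eq__(self, other):
        if isinstance(other, Scalar):
            return self._value == other._value
        return NotImplemented

    def __hash__(self):
        # Equal wrappers must hash equally, so delegate to the value.
        # Unhashable values (lists, dicts) will raise TypeError here.
        return hash(self._value)

    def __repr__(self):
        return "Scalar(%r)" % (self._value,)


found = Scalar(2) in {Scalar(2)}
print(found)
```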
BS turns attribute getting into an alias for find
dom.a.b # == dom.find('a').find('b')
Soupy doesn't do this yet. The terseness is nice, but it has a few downsides:
A few libraries do this (numpy recarrays, pandas), and I always have mixed feelings about it.
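For reference, the alias can be sketched with __getattr__ on a toy tree. Node here is a hypothetical class, not soupy's or BeautifulSoup's, and raising AttributeError for a missing tag is a choice made for this sketch.

```python
class Node:
    """Toy element tree used only to illustrate attribute-as-find."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

    def find(self, tag):
        # Depth-first search for the first descendant with this tag.
        for child in self.children:
            if child.tag == tag:
                return child
            found = child.find(tag)
            if found is not None:
                return found
        return None

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        found = self.find(name)
        if found is None:
            raise AttributeError(name)
        return found


dom = Node("html", [Node("a", [Node("b")])])
same = dom.a.b is dom.find("a").find("b")
print(same)
```

Because real attributes win over __getattr__, a tag named like an existing attribute (for example children in this toy) can never be reached this way.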
each currently takes a single function which it maps over the collection of items. each could take N functions as input, map each one, and pack the result as a Collection of N-tuples. That's more symmetric to what dump() does -- each builds unlabeled tuples, dump builds labeled dicts.
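A sketch of the N-function form on a toy Collection (hypothetical, with plain Python callables in place of Q expressions). Keeping the single-function case unchanged is an assumption about backward compatibility.

```python
class Collection:
    """Toy stand-in for soupy's Collection (not the real class)."""

    def __init__(self, items):
        self._items = list(items)

    def each(self, *funcs):
        if not funcs:
            raise TypeError("each() needs at least one function")
        if len(funcs) == 1:
            # Existing behaviour: map one function over the items.
            return Collection(funcs[0](item) for item in self._items)
        # Proposed behaviour: one tuple per item, one slot per function.
        return Collection(tuple(f(item) for f in funcs) for item in self._items)

    def val(self):
        return list(self._items)


pairs = Collection(["a", "bb"]).each(str.upper, len).val()
print(pairs)
```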
cc @cryzed
Calling dump on a Node currently applies each kwarg to the node, and packs the result into a dict. It could alternatively accept positional args, apply each to the node, and return the result as a tuple:
>>> node.dump(href=Q.attrs['href'], cls=Q.attrs['class'])
Scalar({'href': 'https://www.google.com/imghp?hl=en&tab=wi', 'cls': ['gb1']})
>>> node.dump(Q.attrs['href'], Q.attrs['class'])
Scalar(('https://www.google.com/imghp?hl=en&tab=wi', ['gb1']))
To keep things simple, using both args and kwargs is a ValueError
>>> node.dump(Q.attrs['href'], cls=Q.attrs['class'])
ValueError("Cannot pass both arguments and keywords to dump")
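The dispatch described above can be sketched as a free function over a plain dict of attributes. The function body and the dict-based node are assumptions for illustration, and the error type follows the ValueError named in the text.

```python
def dump(node, *args, **kwargs):
    """Apply functions to `node`: positional -> tuple, keyword -> dict."""
    if args and kwargs:
        raise ValueError("Cannot pass both arguments and keywords to dump")
    if args:
        return tuple(func(node) for func in args)
    return {name: func(node) for name, func in kwargs.items()}


# Toy node: just a dict of attributes (hypothetical stand-in for a Node).
node = {"href": "https://www.google.com/imghp?hl=en&tab=wi", "class": ["gb1"]}

as_dict = dump(node, href=lambda n: n["href"], cls=lambda n: n["class"])
as_tuple = dump(node, lambda n: n["href"], lambda n: n["class"])
print(as_dict)
print(as_tuple)
```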
A common pattern among my old bs scripts is to extract column names from the header of a table, and then repeatedly call dict(zip(names, values)) for each row. A couple of proposals to do that with soupy:
# should work now, cumbersome
cols.each(Q.text).map(lambda vals: dict(zip(names, vals)))
# new method
cols.dictzip(names, Q.text)
cols.each(Q.text).dictzip(names)
# more general zip + mapping
cols.each(Q.text).zip(names).map(reversed).map(dict)
# overload dump -- don't like this
c.each(Q.find_all('td').dump(names, Q.text))
I think I like the idea of adding zip, and then also adding the second version of dictzip, which is implemented using zip.
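A sketch of that preferred combination on a toy Collection: zip pairs items with other sequences, and dictzip is a thin layer over it. Returning a plain dict instead of a wrapped Scalar is a simplification made here.

```python
class Collection:
    """Toy stand-in for soupy's Collection (hypothetical)."""

    def __init__(self, items):
        self._items = list(items)

    def zip(self, *others):
        # Pair each item with the matching entries of the other sequences.
        return Collection(zip(self._items, *others))

    def dictzip(self, keys):
        # Built on zip: pairs come out as (value, key), so swap them.
        return {key: value for value, key in self.zip(keys).val()}

    def val(self):
        return list(self._items)


names = ["gene", "score"]
row = Collection(["BRCA1", "0.9"])
record = row.dictzip(names)
print(record)
```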
If an exception is raised when evaluating a Q expression, the traceback is pretty opaque:
<ipython-input-7-fcabd18eb998> in iter_page(gene, dom)
26 rows = (table.find('tbody')
27 .find_all('tr', recursive=False)[1:] # first row is junk
---> 28 .each(Q.find_all('td').each(Q.text.replace('\xa0', ' ')).dictzip(column_names))
29 )
30
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in each(self, func)
467 """
468 func = _make_callable(func)
--> 469 return Collection(imap(func, self._items))
470
471 def filter(self, func):
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __init__(self, items)
433
434 def __init__(self, items):
--> 435 super(Collection, self).__init__(list(items))
436 self._items = self._value
437 self._assert_items_are_wrappers()
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __eval__(self, val)
1293 def __eval__(self, val):
1294 for item in self._items:
-> 1295 val = item.__eval__(val)
1296 return val
1297
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __eval__(self, val)
1246
1247 def __eval__(self, val):
-> 1248 return val.__call__(*self._args, **self._kwargs)
1249
1250
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in each(self, func)
467 """
468 func = _make_callable(func)
--> 469 return Collection(imap(func, self._items))
470
471 def filter(self, func):
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __init__(self, items)
433
434 def __init__(self, items):
--> 435 super(Collection, self).__init__(list(items))
436 self._items = self._value
437 self._assert_items_are_wrappers()
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __eval__(self, val)
1293 def __eval__(self, val):
1294 for item in self._items:
-> 1295 val = item.__eval__(val)
1296 return val
1297
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __eval__(self, val)
1246
1247 def __eval__(self, val):
-> 1248 return val.__call__(*self._args, **self._kwargs)
1249
1250
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in __call__(self, *args, **kwargs)
355
356 def __call__(self, *args, **kwargs):
--> 357 return self.map(operator.methodcaller('__call__', *args, **kwargs))
358
359 def __eq__(self, other):
/Users/cbeaumont/anaconda/lib/python2.7/site-packages/soupy.pyc in map(self, func)
189 Scalar(6)
190 """
--> 191 return Wrapper.wrap(_make_callable(func)(self._value))
192
193 def apply(self, func):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
It would probably be easier to parse if Q expressions could better repr themselves, and somehow add a hint to the traceback about which step failed.
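One way this could work, sketched with a toy expression builder rather than soupy's real Q: each step records its own source text, repr joins them, and evaluation wraps a failing step in an error that names it. Expr, QError and eval_ are hypothetical names used only in this sketch.

```python
class QError(Exception):
    """Hypothetical error that records which step of an expression failed."""


class Expr:
    """Toy deferred expression: a chain of (source text, function) steps."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        step = ("." + name, lambda value: getattr(value, name))
        return Expr(self._steps + (step,))

    def __call__(self, *args, **kwargs):
        parts = [repr(a) for a in args]
        parts += ["%s=%r" % item for item in sorted(kwargs.items())]
        step = ("(" + ", ".join(parts) + ")", lambda value: value(*args, **kwargs))
        return Expr(self._steps + (step,))

    def __repr__(self):
        return "Q" + "".join(text for text, _ in self._steps)

    def eval_(self, value):
        done = "Q"
        for text, func in self._steps:
            try:
                value = func(value)
            except Exception as exc:
                # Name the failing step and how far evaluation got.
                message = "step %r failed after %s in %r" % (text, done, self)
                raise QError(message) from exc
            done += text
        return value


Q = Expr()
expr = Q.upper().replace("A", "B")
print(repr(expr))
print(expr.eval_("aa"))
```

With this shape, a failure inside a nested expression would surface the text of the offending step next to the original exception, instead of only the internal __eval__ frames.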