maxhumber / gazpacho

🥫 The simple, fast, and modern web scraping library
Home Page: https://www.gazpacho.xyz
License: MIT License
Is your feature request related to a problem? Please describe.
It's great to be able to run find and then find within the initial result, but it seems more readable to be able to find based on CSS selectors.
Describe the solution you'd like
selector = '.foo img.bar'
soup.select(selector) # this would return any img item with the class "bar" inside of an object with the class "foo"
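One hedged sketch of how such a selector might be decomposed into steps for chained `find` calls; the `parse_selector` helper and its `(tag, attrs)` tuple output are hypothetical (and only handle class selectors), not part of gazpacho:

```python
# Hypothetical helper: split a simple descendant selector like
# ".foo img.bar" into (tag, attrs) steps that could each be fed
# to a chained find() call.
def parse_selector(selector):
    steps = []
    for token in selector.split():
        tag, _, cls = token.partition(".")
        steps.append((tag or None, {"class": cls} if cls else {}))
    return steps

print(parse_selector(".foo img.bar"))
# [(None, {'class': 'foo'}), ('img', {'class': 'bar'})]
```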
Describe the bug
It seems that the `lxml` dependency is not installed by this package on installation.
To Reproduce
I just pip installed the project, and this code (which is in the docs) fails.
from gazpacho import get, Soup
import pandas as pd
url = 'https://www.capfriendly.com/browse/active/2020/salary?p=1'
response = get(url)
soup = Soup(response)
df = pd.read_html(str(soup.find('table')))[0]
print(df[['PLAYER', 'TEAM', 'SALARY', 'AGE']].head(3))
This is the error.
venv/lib/python3.7/site-packages/pandas/io/html.py in _parser_dispatch(flavor)
846 else:
847 if not _HAS_LXML:
--> 848 raise ImportError("lxml not found, please install it")
849 return _valid_parsers[flavor]
850
ImportError: lxml not found, please install it
Expected behavior
Everything here is fixed with a manual `pip install lxml`. But if `lxml` is a dependency, then I would expect it to be installed automatically when `gazpacho` is installed.
Environment:
Is your feature request related to a problem? Please describe.
Improve the `.github` issue template
Describe the solution you'd like
I would like a better issue and feature request template in the `.github` folder. I would like the bolded headings to become proper sections, and the help lines below them to become comments.
Describe alternatives you've considered
None
Additional context
What I would like is instead of:
---
name: Bug report
about: Create a report to help gazpacho improve
title: ''
labels: ''
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Code to reproduce the behaviour:
```python
```
**Expected behavior**
A clear and concise description of what you expected to happen.
**Environment:**
- OS: [macOS, Linux, Windows]
- Version: [e.g. 0.8.1]
**Additional context**
Add any other context about the problem here.
It should be something like:
---
name: Bug report
about: Create a report to help gazpacho improve
title: ''
labels: ''
assignees: ''
---
## Describe the bug
<!-- A clear and concise description of what the bug is. -->
## To Reproduce
<!-- Code to reproduce the behaviour: -->
```python
# code
```
## Expected behavior
<!-- A clear and concise description of what you expected to happen. -->
**Environment:**
- OS: [macOS, Linux, Windows]
- Version: [e.g. 0.8.1]
## Additional context
<!-- Add any other context about the problem here. Delete this section if not applicable -->
Or something like this
Describe the bug
Error when running `mypy gazpacho`
This is the error:
max@mbp gazpacho % mypy gazpacho
gazpacho/get.py:35: error: Argument 1 to "update" of "dict" has incompatible type "Optional[Dict[str, Any]]"; expected "Mapping[str, str]"
Found 1 error in 1 file (checked 4 source files)
Expected behavior
No errors 🙈
Environment:
`find` changes the content of `attrs`
When using the `find` method on a `Soup` object, the content of `attrs` is overwritten by the `attrs` parameter passed to `find`.
Try the following:
from gazpacho import Soup
div = Soup("<div id='my_id' />").find("div")
print(div.attrs)
div.find("span", {"id": "invalid_id"})
print(div.attrs)
The expected output would be the following, because we print the attributes of the same `div` twice:
{'id': 'my_id'}
{'id': 'my_id'}
But instead you actually receive:
{'id': 'my_id'}
{'id': 'invalid_id'}
which is wrong.
My current workaround is to save the attributes before I execute `find`.
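The underlying bug class can be shown without gazpacho at all: a method that aliases the caller's dict instead of copying it. A defensive `dict()` copy on both sides is the usual fix. The `Node` class here is illustrative, not gazpacho's actual implementation:

```python
class Node:
    def __init__(self, attrs=None):
        self.attrs = dict(attrs or {})   # copy, don't alias the caller's dict

    def find(self, attrs=None):
        query = dict(attrs or {})        # work on a copy, never on self.attrs
        return query == self.attrs

n = Node({"id": "my_id"})
n.find({"id": "invalid_id"})
print(n.attrs)  # {'id': 'my_id'} (unchanged)
```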
I tried code:
all_urls = [link.attrs['href'] for link in Soup(get(browser_link)).find('a')]
and I got AttributeError: 'Soup' object has no attribute 'decode'. What should I check? Where is the mistake in my code?
Full info:
File "C:\webscraper\lib\site-packages\gazpacho\get.py", line 29, in get
url = sanitize(url)
File "C:\webscraper\lib\site-packages\gazpacho\utils.py", line 128, in sanitize
scheme, netloc, path, query, fragment = urlsplit(url)
File "C:\Program Files\Python39\lib\urllib\parse.py", line 455, in urlsplit
url, scheme, _coerce_result = _coerce_args(url, scheme)
File "C:\Program Files\Python39\lib\urllib\parse.py", line 125, in _coerce_args
return _decode_args(args) + (_encode_result,)
File "C:\Program Files\Python39\lib\urllib\parse.py", line 109, in _decode_args
return tuple(x.decode(encoding, errors) if x else '' for x in args)
File "C:\Program Files\Python39\lib\urllib\parse.py", line 109, in <genexpr>
return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'Soup' object has no attribute 'decode'
Describe the bug
When using the `find` method on the `Soup` class and there are no results found, an exception is thrown if you used `mode="first"`.
To Reproduce
Code to reproduce the behaviour:
from gazpacho import Soup
s = Soup("<p>test</p>")
s.find("a", mode="first")
Expected behavior
It should return `None`.
Environment:
Additional context
None
Describe the bug
I get a UnicodeEncodeError when calling get() with a URL that contains Unicode characters.
To Reproduce
Code to reproduce the behaviour:
from gazpacho import get
url = 'https://worldofwarcraft.com/en-us/character/us/stormrage/drãke'
html = get(url)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 36: ordinal not in range(128)
Expected behavior
get() should succeed without throwing an exception
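Until that's fixed, a stdlib-only workaround is to percent-encode the non-ASCII characters in the path before calling `get`. The `encode_unicode_url` helper below is hypothetical, not part of gazpacho:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_unicode_url(url):
    # Percent-encode non-ASCII characters in the path; safe="/%" keeps
    # slashes and any existing escapes intact.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))

print(encode_unicode_url(
    "https://worldofwarcraft.com/en-us/character/us/stormrage/drãke"))
# ...stormrage/dr%C3%A3ke
```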
Environment:
Using soup.find on particular website(s) returns entire html instead of the matching tag(s)
Look for ul tag with attribute class="cves" (<ul class="cves">) on https://mariadb.com/kb/en/security/
from gazpacho import get, Soup
endpoint = "https://mariadb.com/kb/en/security/"
html_dump = Soup.get(endpoint)
sample = html_dump.find('ul', attrs={'class': 'cves'}, mode='all')
sample contains the entire HTML document.
sample should contain the contents of the tag <ul class="cves">, which in this case would be rows of <li>-s listing the CVEs and the corresponding fixed version in MariaDB, something like:
<ul class="cves">
<li>..</li>
...
<li>..</li>
</ul>
Using BeautifulSoup on the same html_dump did get the job done, although the <li>-tags are weirdly nested together.
from bs4 import BeautifulSoup
# html_dump from above Soup.get(endpoint)
bs_soup = BeautifulSoup(html_dump.html, 'html.parser')
ul_cves = bs_soup.find_all('ul','cves')
ul_cves contains strangely nested <li>-s, from which it was still possible to extract the rows of <li>-s I was looking for.
<ul class="cves">
<li>
<li>
...
</li></li>
</ul>
Need a method that triggers the Javascript on a page to fire (see https://github.com/psf/requests-html, r.html.render()).
Thank you for your nice project!
Please add an `encoding` argument to decode pages that are not UTF-8 encoded.
Line 51 in ecd53af
I tried an EUC-KR encoded page and got an error message.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 95: invalid start byte
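The failure can be reproduced with no network access; the same bytes decode cleanly once the right codec is passed explicitly, which is what the requested `encoding` argument would enable:

```python
# Encode a Korean string as EUC-KR, then show that a hard-coded
# utf-8 decode fails while an explicit codec succeeds.
raw = "한국어 페이지".encode("euc-kr")

try:
    raw.decode("utf-8")          # what a hard-coded utf-8 decode does
except UnicodeDecodeError as e:
    print("utf-8 fails:", e.reason)

print(raw.decode("euc-kr"))      # decodes fine with the right codec
```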
Really just a question, followed by a request: does gazpacho currently support the use of proxies? (If not, it would be great to include them.)
Is your feature request related to a problem? Please describe.
I'm worried that the `format` function is brittle.
Describe the solution you'd like
It should always return HTML and never fail. I could use some help writing more tests for this function (which is run on every repr and str call).
Is your feature request related to a problem? Please describe.
Right now it's hard to reason about the behaviour of the `find` method. If it finds one element it will return a `Soup` object; if it finds more than one it will return a list of `Soup` objects.
Describe the solution you'd like
Separate `find` into a `find` method and a `find_one` method.
Describe alternatives you've considered
Keep it and YOLO?
Additional context
Conversation with Michael Kennedy:
If I were designing the api, i'd have that always return a List[Node] (or whatever the class is). Then add two methods:
- find() -> List[Node]
- find_one() -> Optional[Node]
- one() -> Node (exception if there are zero or two or more nodes)
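That proposal can be sketched over a plain list of matches; `Node` here is a stand-in for whatever element class the library would use:

```python
from typing import List, Optional

class Node:
    def __init__(self, name):
        self.name = name

def find(matches: List[Node]) -> List[Node]:
    # Always a list, possibly empty: safe to iterate unconditionally.
    return list(matches)

def find_one(matches: List[Node]) -> Optional[Node]:
    # First match or None: safe to if-check.
    return matches[0] if matches else None

def one(matches: List[Node]) -> Node:
    # Exactly one match, or an exception.
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 node, got {len(matches)}")
    return matches[0]
```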
URL sanitization is overly aggressive on URLs containing percent-encoded data
Run the following on v1.1:
>>> gazpacho.utils.sanitize('https://en.wikipedia.org/wiki/M%26M%27s')
'https://en.wikipedia.org/wiki/M%2526M%2527s'
Alternatively, on 7cc9488 with a tweak to get the package loadable, and logging str(url):
>>> gazpacho.get('https://en.wikipedia.org/wiki/M%26M%27s')
url http://https://en.wikipedia.org/wiki/M%2526M%2527s
...
(i.e., no change)
This results in certain URLs not being `get`-able. In my case en.wikipedia.org serves a 404 response.
To get 7cc9488 importable I changed
Line 2 in 7cc9488
from .soup2 import Soup
The valid URL I give is unchanged during sanitization, and is fetched successfully.
Python 3.11.3
Seems like occurrences of `%` are getting rewritten as `%25`.
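One hedged way to avoid the double-escaping: only quote a URL when a round-trip through `unquote` shows it contains no escapes already. This is a sketch, not gazpacho's actual `sanitize`:

```python
from urllib.parse import quote, unquote

def sanitize(url):
    # If unquoting changes the URL, it already contains %XX escapes;
    # leave it untouched instead of re-encoding the % signs.
    if unquote(url) != url:
        return url
    return quote(url, safe=":/?&=#")

print(sanitize("https://en.wikipedia.org/wiki/M%26M%27s"))
# https://en.wikipedia.org/wiki/M%26M%27s  (unchanged)
```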
Hi,
There was a pull request (#48) to add whl publishing but it appears to have been lost somewhere in a merge on October 31st, 2020. (v1.1...master). Therefore, no wheels have been published for 1.1.
This causes the installation error on my system that the PR was meant to address.
Install `gazpacho` with a wheel, not a tar.gz. Please re-add the whl publishing.
Describe the bug
When I create a soup object...
To Reproduce
Calling .text returns an empty string:
from gazpacho import Soup
html = """<p>£682m</p>"""
soup = Soup(html)
print(soup.text)
''
Expected behavior
Should output:
print(soup.text)
'£682m'
Environment:
Additional context
Inspired by this S/O question
Describe the bug
When trying to get text from a tag, gazpacho returns empty string
To Reproduce
Code to reproduce the behaviour:
from gazpacho import Soup
html = '<a href="/Sorasful?source=gig_cards&referrer_gig_slug=edit-mixing-and-mastering&ref_ctx_id=42d34014-b499-46fa-a1d3-04318b12fecc" rel="nofollow noopener noreferrer" target="_self"><span>by </span>Sorasful</a>'
soup = Soup(html)
print(soup.text)
# prints nothing
print(soup.find('a').text)
# prints "by"
Expected behavior
Should return "by Sorasful"
Environment:
I would like to be able to find the parent of a node. I think `soup.parent` would be a nice UI.
For example if we have:
<ul>
<li class="my-class"></li>
</ul>
We can get the ul tag with ul_tag = soup.find("li", attrs={"class": "my-class"}).parent
An alternative would be some nested filtering perhaps like:
li = soup.find("li", "my-class")
ul = soup.find("ul", with_child=li)
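Without back-references stored on each node, a parent lookup can be done by walking the tree from the root. This sketch uses plain dicts as a stand-in for `Soup` objects:

```python
def find_parent(root, target):
    # Depth-first search: return the node whose children contain target.
    for child in root.get("children", []):
        if child is target:
            return root
        found = find_parent(child, target)
        if found is not None:
            return found
    return None

li = {"tag": "li", "attrs": {"class": "my-class"}, "children": []}
ul = {"tag": "ul", "children": [li]}
print(find_parent(ul, li)["tag"])  # ul
```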
Thanks for your work on the package!
Describe the bug
Right now `match` has the ability to be strict. This functionality is presently not enabled for `find`.
To Reproduce
Code to reproduce the behaviour:
from gazpacho import Soup, match
match({'foo': 'bar'}, {'foo': 'bar baz'})
# True
match({'foo': 'bar'}, {'foo': 'bar baz'}, strict=True)
# False
Expected behavior
The `find` method should be forgiving (partial match) to protect ease of use and maintain backwards compatibility, but there should be an argument to enable strict/exact matching that piggybacks on `match`.
Environment:
The default auto behavior of `.find()` doesn't work for me, because it means I can't trust my code not to start throwing errors if the page I am scraping adds another matching element, or drops the number of elements down to one (triggering a change in return type).
I know I can do this:
div = soup.find("div", mode="first")
# Or this:
divs = soup.find("div", mode="all")
But having function parameters that change the return type is still a bit weird - not great for code hinting and suchlike.
Changing how `.find()` works would be a backwards-incompatible change, which isn't good now that you're past the 1.0 release. I suggest adding two new methods instead:
div = soup.first("div") # Returns a single element
# Or:
divs = soup.all("div") # Returns a list of elements
This would be consistent with your existing API design (promoting the mode arguments to first class method names) and could be implemented without breaking existing code.
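The two proposed methods could be thin wrappers over the existing mode machinery. This sketch fakes `find` over a list of strings, purely to pin down the return types (it is not gazpacho's real matching logic):

```python
class Soup:
    # Illustrative stand-in: real gazpacho matches tags in HTML;
    # here we just match strings in a list.
    def __init__(self, items):
        self._items = items

    def find(self, tag, mode="auto"):
        matches = [i for i in self._items if i == tag]
        if mode == "all":
            return matches
        if mode == "first":
            return matches[0] if matches else None
        # "auto": list, single element, or None depending on match count
        return matches if len(matches) > 1 else (matches[0] if matches else None)

    def first(self, tag):
        """Always a single element (or None)."""
        return self.find(tag, mode="first")

    def all(self, tag):
        """Always a list, possibly empty."""
        return self.find(tag, mode="all")

soup = Soup(["div", "div", "p"])
print(soup.first("div"), soup.all("div"), soup.all("img"))
# div ['div', 'div'] []
```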
I'd like to use `gazpacho` for testing the HTML output of my Django application. A natural way to do this would be to search for a given HTML chunk. Therefore I'd like to be able to use the `in` operator on a `Soup` to check for that HTML chunk. This probably has utility outside of tests.
I imagine this working something like:
response = self.client.get('/') # django's test client
assert response.status_code == HTTPStatus.OK
body = response.content.decode()
assert '<h1><a href="/">Home page</a></h1>' in Soup(body)
It's possible to reproduce this with `find()` and making assertions on the contents of the found node, but that is much more complicated since it requires assertions for each node in the tree.
Django already supports a similar assertion called `assertInHTML`. However, this relies on normalizing the HTML text and making text assertions, so it's clunky around matching the actual elements.
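A sketch of what `__contains__` could look like, with whitespace between tags normalized so formatting differences don't break the match. This is illustrative only, not gazpacho's API:

```python
import re

class Soup:
    def __init__(self, html):
        self.html = html

    def __contains__(self, chunk):
        # Collapse whitespace between tags before substring matching.
        normalize = lambda s: re.sub(r">\s+<", "><", s.strip())
        return normalize(chunk) in normalize(self.html)

body = '<main>\n  <h1><a href="/">Home page</a></h1>\n</main>'
print('<h1><a href="/">Home page</a></h1>' in Soup(body))  # True
```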
Can't parse some entries: there are 40 entries on every page, but some are not being parsed correctly.
from gazpacho import get, Soup

for i in range(1, 15):
    link = f'https://1337x.to/category-search/aladdin/Movies/{i}/'
    html = get(link)
    soup = Soup(html)
    body = soup.find("tbody")
    # extracting all the entries in the body;
    # there are 40 entries on every page (the last one can have fewer),
    entries = body.find("tr", mode='all')[::-1]
    # but for some pages it can't retrieve all the entries for some reason
    print(f'{len(entries)} entries -> {link}')
See 40 entries for every page
Arch Linux - 5.13.10-arch1-1
Python - 3.9.6
Gazpacho - 1.1
Is your feature request related to a problem? Please describe.
gazpacho should be able to take html that looks like this:
html = """<ul><li>Item</li><li>Item</li></ul>"""
Describe the solution you'd like
And through some kind of magic turn it into this:
<ul>
<li>Item</li>
<li>Item</li>
</ul>
Describe alternatives you've considered
A quick prototype:
from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent=" ")
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty
Describe the bug
Find isn't working properly on tags that don't close
To Reproduce
Code to reproduce the behaviour:
from gazpacho import Soup, get
html = """
<div>
<span>Blah</span>
<p>Blah Blah</p>
<img src='hi.png'>
<br/>
<img src='sup.png'>
</div>
"""
soup = Soup(html)
imgs = soup.find("img")
imgs[0].attrs['src']
Expected behavior
Should yield: 'hi.png'
Right now it errors with: TypeError: 'Soup' object is not subscriptable
Environment:
Is your feature request related to a problem? Please describe.
Right now I manually run:
isort gazpacho
black .
mypy gazpacho
To make sure that the types are appropriate and the code is black.
Describe the solution you'd like
This should be performed automatically on pushes, merges, and releases
Describe the bug
Although gazpacho is now type hinted, trying to use gazpacho types in another package (quote) causes this error:
quote/quote.py:3: error: Skipping analyzing 'gazpacho': found module but no type hints or library stubs
To Reproduce
Code to reproduce the behaviour:
mypy quote
Expected behavior
Shouldn't throw an error!
Environment:
Is your feature request related to a problem? Please describe.
gazpacho uses Portray to publish the documentation at https://gazpacho.xyz/
Describe the solution you'd like
This should happen automatically on new releases (perhaps with TravisCI)
Describe alternatives you've considered
Right now I have to manually run:
portray on_github_pages
To publish...
Running the example in the README resulted in an error like this one (depending on how the import was written):
ImportError: cannot import name 'get' from partially initialized module 'gazpacho' (most likely due to a circular import)
Using windows 10, Python 3.9.1
Is your feature request related to a problem? Please describe.
It might be nice if gazpacho had the ability to rotate/fake a user agent
Describe the solution you'd like
Sort of like this but more primitive. (Importantly gazpacho does not want to take on any dependencies)
Additional context
Right now gazpacho just spoofs the latest Firefox User Agent
Describe the bug
The `find` method gets confused on empty element tags (img, meta, etc...)
To Reproduce
Code to reproduce the behaviour:
from gazpacho import Soup
html = '''
<div class="foo-list">
<a class="foo" href="/foo/1">
<div class="foo-image-container">
<img src="image.jpg">
</div>
</a>
<a class="foo" href="/foo/2">
<div class="foo-image-container">
<img src="image.jpg">
</div>
</a>
</div>
'''
soup = Soup(html)
soup.find('a', {'class': "foo"})
Expected behavior
`find` should be able to "find" a list of two `a` tags. Instead the full blob is getting returned.
Environment:
Describe the bug
Running Flake8 highlighted a couple of unused imports and other little things that aren't serious but can be cleaned up pretty easily.
To Reproduce
Run flake8 on the codebase
Expected behavior
Warnings about unused imports should be removed
Environment:
Additional context
Nothing serious, just a bit of a cleanup. PR raised - #52
Is your feature request related to a problem? Please describe.
I would like to try adding a `.children()` method to the `Soup` object that can list all the child elements of the `Soup` object.
Describe the solution you'd like
I would make a regex pattern to match each inner element and return a list of `Soup()` objects with those elements. I might also try to add an option to recurse or not.
Describe alternatives you've considered
All that I can think of is doing the same thing mentioned above in the scraping code
Additional context
None
Hi there! Just wanted to throw out there that your README.md states that this project is actively maintained, while your last commits were some years ago. I think it might be useful to remove the 'actively maintained' part :p. Cheers!
Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser will misinterpret it as the closing tag, and the parsed text will include some strings from the attributes.
To Reproduce
Code to reproduce the behaviour:
>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'2"}">text'
Expected behavior
>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'
Environment:
Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!
I don't understand the statement "Element attributes are partially matched by default." Does it mean attrs={"id": "foo"} will match attrs={"id": "foob"}?
Better description with examples of what would/would not be matched with partial=True vs. partial=False.
n/a
n/a
Add typehints to code.
Can we add type hints to the code? Maybe then we can run mypy on it?
It may help to fish out hidden bugs.
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Code to reproduce the behaviour:
Expected behavior
A clear and concise description of what you expected to happen.
Environment:
Additional context
Add any other context about the problem here.
$ git tag v0.7.2 && git push --tags
🎉 🎈
I really like this project. I think that adding releases to the repository can help the project grow in popularity. I'd like to see that!
Describe the bug
Soup can handle and format matched tags no problem:
from gazpacho import Soup
html = """<ul><li>Item 1</li><li>Item 2</li></ul>"""
Soup(html)
Which correctly formats to:
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
But it can't handle void tags (like img)...
To Reproduce
For example, this bit of html:
html = """<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">"""
Soup(html)
Will fail to format on print:
<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">
Expected behavior
Ideally Soup formats it as:
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<img src="image.png">
Environment:
Additional context
The problem has to do with the underlying parseString function being unable to handle void tags:
from xml.dom.minidom import parseString as string_to_dom
string_to_dom(html)
Possible solution: turn void tags into self-closing tags on input, and then transform them back to void tags on print.
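That transform can be sketched with a regex over a list of known void tags. Note the caveat: attribute values containing `/` or `>` (such as a `src` with a path) would defeat this simple pattern, so treat it as a starting point only:

```python
import re

VOID_TAGS = ["img", "br", "hr", "meta", "link", "input", "source"]

def close_voids(html):
    # Rewrite <img src="..."> as <img src="..."/> so an XML parser
    # like minidom will accept it. [^>/]* deliberately excludes "/",
    # so tags that are already self-closing are left alone.
    pattern = r"<((?:%s)\b[^>/]*)>" % "|".join(VOID_TAGS)
    return re.sub(pattern, r"<\1/>", html)

print(close_voids('<ul><li>Item</li></ul><img src="image.png">'))
# <ul><li>Item</li></ul><img src="image.png"/>
```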
Introducing a campaign that I'm calling #Stacktoberfest 🥫
If you're a fan of gazpacho
and want to help evangelize the package, this campaign is for you!
I've created a question bank and have already committed ~30 answers myself.
There are several questions in the bank that I still think deserve modern gazpacho
answers.
If you decide to answer any of the questions in the bank (or find another one that you think deserves a gazpacho answer), please submit a PR with a link to your answer!
Importantly, these answers should be high quality (we want to convince users that gazpacho > bs4), respectful, and the opposite of obnoxious.
I found this question by searching for popular [web-scraping], [python]
questions. It has 55k views, 19 upvotes and the original link is dead. Given that it gets a lot of traffic, I thought it deserved a new modern answer... here it is:
The original link posted by OP is dead... but here's how you might scrape table data with gazpacho:
Step 1 - import Soup and download the html:
from gazpacho import Soup
url = "https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists"
soup = Soup.get(url)
Step 2 - Find the table and table rows:
table = soup.find("table", {"class": "wikitable sortable"}, mode="first")
trs = table.find("tr")[1:]
Step 3 - Parse each row with a function to extract desired data:
def parse_tr(tr):
    return {
        "name": tr.find("td")[0].text,
        "country": tr.find("td")[1].text,
        "medals": int(tr.find("td")[-1].text)
    }
data = [parse_tr(tr) for tr in trs]
sorted(data, key=lambda x: x["medals"], reverse=True)
Looking forward to your contributions!
Describe the bug
Error when running mypy:
gazpacho/get.py:68: error: "HTTPError" has no attribute "msg"
Found 1 error in 1 file (checked 5 source files)
To Reproduce
Code to reproduce the behaviour:
mypy gazpacho
Expected behavior
`mypy` should run without error
Environment:
from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho'
import gazpacho
does work.
In VS Code:
from gazpacho import get, Soup
#import gazpacho #works
'''
url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
#soup = soup.get(url)
books = soup.find('div', {'class': 'book-'}, partial=True)
def parse(book):
name = book.find('h4').text
price = float(book.find('p').text[1:].split(' ')[0])
return name, price
[parse(book) for book in books]
'''
Traceback (most recent call last):
File "c:/Passport_G/Rob_justpy/jptutorial/gazpacho.py", line 1, in
from gazpacho import get, Soup
File "c:\Passport_G\Rob_justpy\jptutorial\gazpacho.py", line 1, in
from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho' (c:\Passport_G\Rob_justpy\jptutorial\gazpacho.py)
Windows 10
from command line
(jp) C:\Passport_G\Rob_justpy\jptutorial>pip install -U gazpacho
Processing c:\users\rober\appdata\local\pip\cache\wheels\db\6b\a2\486f272d5e523b56bd19817c14ef35ec1850644dea78f9dd76\gazpacho-1.1-py3-none-any.whl
Installing collected packages: gazpacho
Successfully installed gazpacho-1.1
WARNING: You are using pip version 20.2.4; however, version 20.3.3 is available.
You should consider upgrading via the 'c:\passport_g\rob_justpy\jptutorial\jp\scripts\python.exe -m pip install --upgrade pip' command.
(jp) C:\Passport_G\Rob_justpy\jptutorial>
(jp) C:\Passport_G\Rob_justpy\jptutorial>python
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Warning:
This Python interpreter is in a conda environment, but the environment has
not been activated. Libraries may fail to load. To activate this environment
please see https://conda.io/activation
Type "help", "copyright", "credits" or "license" for more information.
import gazpacho
Traceback (most recent call last):
File "", line 1, in
File "C:\Passport_G\Rob_justpy\jptutorial\gazpacho.py", line 1, in
from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho' (C:\Passport_G\Rob_justpy\jptutorial\gazpacho.py)
I have a 'div' element with class='' and other div elements at the same level with class='whatever'.
I cannot find a way to get all the elements with a class attribute.
soup.find('div', attrs={'class':''}, partial=True, mode='all')
should return a list with all the 'div' elements, but that is not the case.
I tried to get the 'divs' in a 'div', but was not able to find a solution for that either.
Maybe the solution is to do two finds with mode='all' and concatenate the lists.
None
Hi! This library is cool, but I've started using it and immediately stumbled upon one difficulty:
`Soup.find()` returns a `list` of `Soup`s if it finds multiple tags, a single `Soup` object if it finds a single tag, and `None` if it finds no tags.
This makes it impossible to seamlessly use `find()` in `for` loops and comprehensions like the one in the "Books" example.
Imagine I need to parse multiple pages, each one containing an unknown number of books in the 0 to N range.
To make this seamless I need to write a 3-branch `if` expression or somehow catch the `TypeError` with nested `try` blocks.
This is what happens with the example when it finds only one book:
In [7]: from gazpacho import get, Soup
...:
...: url = 'https://scrape.world/books'
...: html = get(url)
...: soup = Soup(html)
...: books = soup.find('div', {'class': 'book-early'})
...:
...: def parse(book):
...: name = book.find('h4').text
...: price = float(book.find('p').text[1:].split(' ')[0])
...: return name, price
...:
...: [parse(book) for book in books]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-16cc6dbabde2> in <module>
11 return name, price
12
---> 13 [parse(book) for book in books]
TypeError: 'Soup' object is not iterable
So, could you please make `find()` return an empty list if no tags are found, and a list with one element if only one tag is found?
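Until the API changes, a small user-side normalizer makes the three possible returns safe to iterate (this helper is a workaround, not part of gazpacho):

```python
def as_list(result):
    # find() can return None, a single Soup, or a list of Soups;
    # fold all three into a list that is always iterable.
    if result is None:
        return []
    if isinstance(result, list):
        return result
    return [result]

# e.g. [parse(book) for book in as_list(soup.find('div', {'class': 'book-early'}))]
print(as_list(None), as_list("x"), as_list(["x", "y"]))
# [] ['x'] ['x', 'y']
```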
Is your feature request related to a problem? Please describe.
Only certain "mode" strings are accepted by `Soup.find()` and, underneath that, `Soup._triage()`. The type annotation `str` lets typos through to become a runtime error.
allows typos through for a runtime error.
Describe the solution you'd like
Use `typing.Literal` to specify the known mode names, so that such bugs can be caught at type-checking time.
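A sketch of the annotation; the runtime check is only a safety net, since the point is that mypy rejects a typo like `mode="frist"` before the code runs. The `find` body here is a placeholder, not gazpacho's implementation:

```python
from typing import Literal

Mode = Literal["auto", "first", "all"]

def find(tag: str, mode: Mode = "auto") -> str:
    # mypy flags find("a", mode="frist") at type-checking time.
    if mode not in ("auto", "first", "all"):
        raise ValueError(f"unknown mode: {mode!r}")
    return f"searching for <{tag}> with mode={mode!r}"

print(find("a", mode="first"))
```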
Describe alternatives you've considered
n/a
Additional context
n/a
Is your feature request related to a problem? Please describe.
Right now gazpacho has to hit various websites in its test suite to make sure everything works, including:
Describe the solution you'd like
These tests should really be mocked.
Additional context
Could use some help on doing this properly!
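One way to start, sketched with the stdlib's `unittest.mock`: inject the network function so a canned HTML string stands in for the real site. The `scrape_heading` helper is hypothetical, just enough to have something to test offline:

```python
from unittest import mock

def scrape_heading(get, url):
    # Naive extraction between <h1> tags, for demonstration only.
    html = get(url)
    return html.split("<h1>")[1].split("</h1>")[0]

fake_get = mock.Mock(return_value="<h1>Hello</h1>")
print(scrape_heading(fake_get, "https://example.com"))  # Hello
fake_get.assert_called_once_with("https://example.com")
```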