jjlee / mechanize Goto Github PK
View Code? Open in Web Editor NEWStateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize .
Home Page: http://wwwsearch.sourceforge.net/mechanize/
Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize .
Home Page: http://wwwsearch.sourceforge.net/mechanize/
This project has moved to https://github.com/python-mechanize/mechanize It's now being maintained by other people, principally Kovid Goyal -- John Lee, March 2017
mtamizi reports that the lack of an equality operator makes it awkward to use pickled cookies in sqlalchemy.
Expect: This is true: mechanize.Cookie(**args) == mechanize.Cookie(**args)
Got: It isn't
Test case using sqlalchemy: (fails with Python 2.7 and sqlalchemy 0.6.3): http://gist.github.com/550319
FileControl assumes that the data to include in the field of the form comes from a file on disk. It should also allow adding a file from a byte array source.
In most environments there's an easy work-around for the current limitation, just write a temporary file. Unfortunate Google AppEngine doesn't allow writing files. (And I want to take data from a db.Blob and upload it to a web form.)
If there's another workaround I haven't thought of please let me know. Maybe a proxy class that works enough like File but is sourced with a byte array?
Issue: Calling br.open(url) enters an infinite refresh loop if the page has a refresh header pointing to itself.
Reasons:
In my case, I don't care about refresh headers, so I simply changed the default arguments at _useragent.py:107.
Possible Solutions:
Thoughts?
(Thanks for mechanize, btw, it's a fantastic piece of software!)
Got: When you request a timeout using the timeout parameter to urlopen (or Browser.open), in order to tell that a timeout occurred, you have to use a poorly-defined interface like HTTPError.reason, using code like this:
import mechanize
import socket
br = mechanize.Browser()
try:
br.open("http://python.org/", timeout=0.001)
except mechanize.URLError, exc:
if isinstance(exc.reason, socket.timeout):
print "timeout occurred"
Expect: There's some clearly defined iinterface for finding out that a timeout imposed by module socket occurred.
There's no .set_timeout() method.
Expect:
browser = mechanize.Browser()
browser.set_timeout(10.)
browser.open("http://example.com")
Got: no .set_timeout method
Python 2.6 supports the CONNECT method for establishing HTTPS connections through a web proxy.
To reproduce: attempt to mechanize.urlopen() an https: URL served by a remote host when the only route to the web from your host is through an HTTP proxy that supports the CONNECT method.
Expect: can fetch page
Got: fetch fails due to failure connect to remote host
If cookie has path attribute set to empty, mechanize thinks that it is incorrect and bypass them.
But all modern browser (ie, firefox, chrome) work correctly with empty path attributes.
I have a quick patch:
diff --git a/mechanize/_clientcookie.py b/mechanize/_clientcookie.py
index 2ed4c87..2af778a 100644
--- a/mechanize/_clientcookie.py
+++ b/mechanize/_clientcookie.py
@@ -1291,6 +1291,9 @@ class CookieJar:
# is a request to discard (old and new) cookie, though.
k = "expires"
v = self._now + v
if k == "path":
if v is None:
v = "/"
if (k in value_attrs) or (k in boolean_attrs):
if (v is None and
k not in ["port", "comment", "commenturl"]):
If you automate an ASP.NET site quite often you have to "emulate" javascript handlers in your python code. I have seen a couple of cases then submit should be done after clicking on A tag at the same time form has a clickable control.
Even if I update required hidden controls (__EVENTTARGET) and do browser.form.submit() without arguments, mechanize "emulates" click on the first clickable control and I got wrong result.
It would be very useful if I can use some special argument(s) value to HTMLForm.click which will result in running HTMLForm._switch_click method even if there clickable controls in the form.
To reproduce:
python -c "import sgmllib; print sgmllib.charref; import mechanize; print sgmllib.charref"
Expect: prints the same both times.
Got: doesn't, since mechanize takes it upon itself to monkey-patch sgmllib to fix http://bugs.python.org/issue803422
In response to this bug:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=456944
The debian mechanize package is carrying this patch:
http://svn.debian.org/viewsvn/pkg-zope/python-mechanize/trunk/debian/patches/mechanize_seek.dpatch?revision=2231&view=markup
It seems to not have been applied to recent versions of mecanize. It would be nice to get rid of that patch one way or another.
Here is simple html:
<title></title>
This is simple python example to reproduce the problem
import mechanize
br = mechanize.Browser()
br.open_local_file('test.html')
br.select_form('f')
this example crash with AttributeError: control 'i' is disabled
on form selection, root cause is the following line (_form.py, line 2336):
if self.value is None: self.value = ""
in SubmitControl constructor.
Using regular or ==dev easy_install "fails" in both cases, same problem with Python 2.4 on Ubuntu
easy_install-2.4 -U mechanize
Searching for mechanize
Reading http://pypi.python.org/simple/mechanize/
Reading http://wwwsearch.sourceforge.net/mechanize/
Best match: mechanize 0.1.11
Downloading http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.1.11.zip
Processing mechanize-0.1.11.zip
Running mechanize-0.1.11/setup.py -q bdist_egg --dist-dir /tmp/easy_install-IwwENn/mechanize-0.1.11/egg-dist-tmp-iOvYrh
no previously-included directories found matching 'docs-in-progress'
File "build/bdist.linux-i686/egg/mechanize/_firefox3cookiejar.py", line 91
yield row
SyntaxError: 'yield' not allowed in a 'try' block with a 'finally' clause
Adding mechanize 0.1.11 to easy-install.pth file
Installed /usr/local/lib/python2.4/site-packages/mechanize-0.1.11-py2.4.egg
Processing dependencies for mechanize
Finished processing dependencies for mechanize
Python libraries for parsing HTML have improved. mechanize doesn't support three of the most popular choices of the current crop.
Expect: can use some mechanize API to request that one of these libraries is used to parse HTML:
lxml.html
BeautifulSoup 3
html5lib
Got: can only use bundled BeautifulSoup v.2 or Python's sgmllib or SGMLParser modules.
The cookies test should be using tempfile.mkstemp instead of tempfile.mktemp.
One example of using the tempfile.mktemp method is TempfileTestMixin in test/test_cookies.py. tempfile.mktemp - as per [0] , it is Deprecated since version 2.3: Use mkstemp() instead.
[0] - http://docs.python.org/library/tempfile.html#tempfile.mktemp
This is with the latest mechanize. Just as an example, ClientForm is unable to parse https://delicious.com/login correctly. It fails to pick up the second form which comes right after the <hr/>
. If you insert any form right after that <hr/>
, it will be omitted from Browser.forms()
. If you remove the <hr/>
, the form gets picked up.
Require Python 2.5 and use the absolute and relative imports features.
The reason for using absolute imports is described in PEP 328.
From Felix Heß
trying to read www.cortalconsors.de with mechanize fails. The problem is
in _http.py in the function http_response (line 197). Calling
ct_hdrs = http_message.getheaders("content-type")
returns [''] sometimes. Then is_html(ct_hdrs, url, self._allow_xhtml) fails.
proposed bugfix:
if '' in ct_hdrs:
ct_hdrs.remove('')
before calling
if is_html(ct_hdrs, url, self._allow_xhtml):
I hope this information helps you to resolve the bug.
Best regards
Felix
If an attribute is very long, FormParser.feed() might get the value of the attribute in multiple chunks. If it happens to be chunked before a newline, handle_data() will strip that newline.
I am attaching patch to fix that.
snippet:
import mechanize
import socket
br = mechanize.Browser()
try:
br.open("http://python.org/", timeout=0.001)
except mechanize.URLError, exc:
if isinstance(exc.reason, socket.timeout):
print "timeout occurred"
Hello there,
I am having issues when trying to open pages when getting Transfer-Encoding: chunked
in responses.
Browser.open()
simply hangs without raising any exception. I don't have stacktrace to show, but here is the debug output of the request:
send: 'GET http://www.tuttosport.com/robots.txt HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.tuttosport.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Length: 28
header: ETag: "5417a9-1c-44692ef7da100"
header: Date: Sat, 01 Oct 2011 01:00:29 GMT
header: Last-Modified: Wed, 20 Feb 2008 08:40:04 GMT
header: Expires: Sat, 01 Oct 2011 01:05:29 GMT
header: Server: Apache
header: Accept-Ranges: bytes
header: Content-Type: text/plain
header: Connection: close
send: 'GET http://www.tuttosport.com/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.tuttosport.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 01 Oct 2011 01:00:30 GMT
header: Expires: Sat, 01 Oct 2011 01:05:30 GMT
header: Server: Apache
header: Accept-Ranges: bytes
header: Content-Type: text/html
header: Transfer-Encoding: chunked
header: Age: 1
header: Connection: close
I had some issues with Browser.retrieve and original filenames:
That's not really mechanize's fault: to extract those header parameters, httplib.HTTPMessage is missing a crucial 'get_filename' or a more generic 'get_param' methods, that are both present in the email.message.Message class.
httplib.HTTPMessage has indeed a 'getparam' method, but unfortunately, it's only used/usable for 'content-type' header parsing.
I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:
import mechanize
from some.module import monkeypatch_http_message
browser = mechanize.Browser()
(tmp_filename, headers) = browser.retrieve(someurl)
# monkeypatch the httplib.HTTPMessage instance
monkeypatch_http_message(headers)
# yeah... my original filename, finally
filename = headers.get_filename()
If you navigate to a page A that redirect to page B, the page you visit is the page B. This is the page that must be added to the history.
When reload()ing the browser, that the page B that must be requested, not page A
This is really problematic if the page A is a submitted form.
The fix is easy, however it break a doctest but, as explain above, this test seems based on a false assumption.
Patch: http://paste.pocoo.org/show/229138/
For testing purpose, here is a simple server and a test script
http://paste.pocoo.org/show/229139/
http://paste.pocoo.org/show/229140/
import socks
import socket
import mechanize
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "76.73.239.33", 27977)
socket.socket = socks.socksocket
br = mechanize.Browser()
br.open("https://www.google.com")
Traceback (most recent call last):
File "tets1.py", line 16, in
br.open("https://www.google.com")
File "build\bdist.win32\egg\mechanize_mechanize.py", line 203, in open
File "build\bdist.win32\egg\mechanize_mechanize.py", line 230, in _mech_open
File "build\bdist.win32\egg\mechanize_opener.py", line 188, in open
File "build\bdist.win32\egg\mechanize_http.py", line 316, in http_request
File "build\bdist.win32\egg\mechanize_http.py", line 242, in read
File "build\bdist.win32\egg\mechanize_mechanize.py", line 203, in open
File "build\bdist.win32\egg\mechanize_mechanize.py", line 230, in _mech_open
File "build\bdist.win32\egg\mechanize_opener.py", line 193, in open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 344, in _open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 332, in _call_chain
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 1170, in https_open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 1115, in do_open
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 866, in request
self._send_request(method, url, body, headers)
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 889, in _send_request
self.endheaders()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 860, in endheaders
self._send_output()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 732, in _send_output
self.send(msg)
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 699, in send
self.connect()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 1134, in connect
sock.connect((self.host, self.port))
File "V:\python\python.v2.54_portable\App\lib\site-packages\socks.py", line 369, in connect
self.__negotiatesocks5(destpair[0],destpair[1])
File "V:\python\python.v2.54_portable\App\lib\site-packages\socks.py", line 236, in __negotiatesocks5
raise Socks5Error(ord(resp[1]),_generalerrors[ord(resp[1])])
TypeError: init() takes exactly 2 arguments (3 given)
Exit code: 1
Expect: All proxy features available in Python 2.6 are available in mechanize.
Got: proxy bypass settings (e.g. no_proxy environment variable) are ignored (and probably bug fixes, and perhaps other changes are missing).
The parser throwing an error (ParseError: expected name token at "<!';\npixiv.context.u") when using follow_link(url_regex='URL_PATTERN').
The "<!" is inside javascript string variable, not for denoting html comment, the full script in here:
....
<script> pixiv.context.illustId = '14245299'; pixiv.context.illustTitle = 'Go to school>///...
Doesn't throw exception if using RobustFactory()
To reproduce:
import mechanize
import mechanize._response
response = mechanize._response.test_response(
"<",
headers=[("Content-type", "text/html; charset=\"bogus\"")])
browser = mechanize.Browser()
browser.set_response(response)
browser.forms()
Expect: no traceback (falls back to default encoding)
Got:
Traceback (most recent call last):
File "/home/john/dev/tst.py", line 93, in
browser.forms()
File "/home/john/dev/mechanize/mechanize/_mechanize.py", line 420, in forms
return self._factory.forms()
File "/home/john/dev/mechanize/mechanize/_html.py", line 549, in forms
self._forms_factory.forms())
File "/home/john/dev/mechanize/mechanize/_html.py", line 229, in forms
_urlunparse=_rfc3986.urlunsplit,
File "/home/john/dev/mechanize/mechanize/_form.py", line 844, in ParseResponseEx
_urlunparse=_urlunparse,
File "/home/john/dev/mechanize/mechanize/_form.py", line 981, in _ParseFileEx
fp.feed(data)
File "/home/john/dev/mechanize/mechanize/_form.py", line 758, in feed
_sgmllib_copy.SGMLParser.feed(self, data)
File "/home/john/dev/mechanize/mechanize/_sgmllib_copy.py", line 110, in feed
self.goahead(0)
File "/home/john/dev/mechanize/mechanize/_sgmllib_copy.py", line 199, in goahead
self.handle_entityref(name)
File "/home/john/dev/mechanize/mechanize/_form.py", line 650, in handle_entityref
'&%s;' % name, self._entitydefs, self._encoding))
File "/home/john/dev/mechanize/mechanize/_form.py", line 143, in unescape
return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)
File "/usr/lib/python2.6/re.py", line 151, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "/home/john/dev/mechanize/mechanize/_form.py", line 135, in replace_entities
repl = repl.encode(encoding)
LookupError: unknown encoding: bogus
The problem is: in RobustFactory, the FormsFactory is in fact still a
default FormsFactory, not the beautifulsoup's.
In _html.py, line 423 (the blue line), when constructing RobustFormsFactory,
the assignment does not work, I printed out the form_parser_class in
FormsFactory's constructor, and it's shown as "None";
So I added the red line to solve it, just a quick fix, hope you can update
it in next version:
class RobustFormsFactory(FormsFactory):
def __init__(self, *args, **kwds):
args = form_parser_args(*args, **kwds)
if args.form_parser_class is None:
args.form_parser_class = RobustFormParser
args.dictionary['form_parser_class'] = RobustFormParser
FormsFactory.__init__(self, **args.dictionary)
To reproduce:
print str(mechanize.ParseError("spam"))
Expect: "spam" printed
Got:
File "/usr/lib/python2.6/HTMLParser.py", line 59, in str
result = self.msg
AttributeError: 'ParseError' object has no attribute 'msg'
I have a form with a couple of submit buttons which look liks this:
<button type="submit" name="action" value="publish">Publish</button>
<button type="submit" name="action" value="preview">Preview</button>
When I click on one of those buttons mechanize submits the form, but does not include an action value in the request data.
Debugging this shows that this goes wrong in ScalarControl._totally_ordered_pairs() for the SubmitButtonControl instance: disabled is set to True, so no pair is returned.
To reproduce:
Expect: ParseError
Got: AttributeError (from Elaine Angelino):
In [46]: from mechanize import Browser
In [47]: br = Browser()
In [48]: br.open('http://www.walgreens.com/marketing/storelocator/find.jsp')
Out[48]: <response_seek_wrapper at 0x1b7a080 whose wrapped object =
<closeable_response at 0x1c8b170 whose fp = <socket._fileobject object at
0x1b846b0>>>
ParseError Traceback (most recent call last)
/Users/elaineangelino/gotdata/Temp/ in ()
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_mechanize.pyc
in forms(self)
424 if not self.viewing_html():
425 raise BrowserStateError("not viewing HTML")
--> 426 return self._factory.forms()
427
428 def global_form(self):
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_html.pyc
in forms(self)
557 try:
558 self._forms_genf = CachingGeneratorFunction(
--> 559 self._forms_factory.forms())
560 except: # XXXX define exception!
561 self.set_response(self._response)
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_html.pyc
in forms(self)
226 )
227 except ClientForm.ParseError, exc:
--> 228 raise ParseError(exc)
229 self.global_form = forms[0]
230 return forms[1:]
<type 'str'>: (<type 'exceptions.AttributeError'>,
AttributeError("'ParseError' object has no attribute 'msg'",))
In [50]:
I was expecting it to return attributes like the form, as a dictionary. I changed a single line of code in _html.py and now it appears to be working...
On line 190 of _html.py changed token.attrs to attrs...
Reading this page:
with this script:
import mechanize
br = mechanize.Browser()
url = r'file:///home/catherine/Music/badform.html'
br.open(url)
br.select_form('login')
br['passwd'] = 'no problem'
br['username'] = 'problem'
I get:
catherine@dellzilla:~/Music$ python mechbug.py
Traceback (most recent call last):
File "mechbug.py", line 7, in
br['username'] = 'problem'
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 2895, in setitem
control = self.find_control(name)
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 3222, in find_control
return self._find_control(name, type, kind, id, label, predicate, nr)
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 3306, in _find_control
raise ControlNotFoundError("no control matching "+description)
ClientForm.ControlNotFoundError: no control matching name 'username'
Examining the form controls shows that the submit
and passwd
controls are present, but the username
field is absent from form.controls
.
Removing <br/>
from the form fixes the problem. In fact, even changing <br/>
to <br />
(inserting a space) fixes the problem. Unfortunately, I can't stop the form authors of the world from sticking <br/>
in their forms!
Just making sure you are aware that this bug was reported on the debian mechanize package:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=555349
Debian tries to prevent duplication of code in it's archive mainly for security reasons.
I have mechanize 0.2.4, python 2.7, zope.interface 3.6.1 and twisted.web2 8.1.0 on Fedora 14.
build log:
http://koji.fedoraproject.org/koji/getfile?taskID=2698590&name=build.log
I had some issues with Browser.retrieve and original filenames, at least in Python 2.6:
That's not really mechanize's fault: to extract those header parameters, httplib.HTTPMessage is missing a crucial 'get_filename' or a more generic 'get_param' methods, that are both present in the email.message.Message class. httplib.HTTPMessage has indeed a 'getparam' method, but unfortunately, it's only used/usable for 'content-type' header parsing.
I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:
import mechanize
from some.module import monkeypatch_http_message
browser = mechanize.Browser()
(tmp_filename, headers) = browser.retrieve(someurl)
# monkeypatch the httplib.HTTPMessage instance
monkeypatch_http_message(headers)
# yeah... my original filename, finally
filename = headers.get_filename()
Once again, that's the situation in Python 2.6. According to http://bugs.python.org/issue4773, httplib.HTTPMessage in Python 3.x is using email.message.Message underneath.
(ps: this is an edited repost of issue 35, that I closed by mistake...)
In UTF-8 the character Ü is represented by two bytes, one of which appears as a key in mechanize._beautifulsoup.BeautifulStoneSoup.MS_CHARS
In Browser.open a subclass of BeautifulStoneSoup called MechanizeBs is used, which overrides BeautifulStoneSoup.PARSER_MASSAGE, so that MS_CHARS is ignored.
In Browser.select_form however, mechanize._form.RobustFormParser is used, which uses BeautifulStoneSoup directly, which uses MS_CHARS for replacements. This leads to one of the bytes of UTF-8 Ü being replaced, which destroys the Ü character. As a consequence controls with labels containing Ü cannot be found by their label anymore, i.e. the following : browser.click( label='Übernehmen' ) fails with a ControlNotFoundError: no control matching kind 'clickable', label 'Übernehmen'.
I currently worked around that using a monkey patch:
import mechanize
mechanize._form.RobustFormParser.PARSER_MASSAGE = mechanize._html.MechanizeBs.PARSER_MASSAGE
A real fix would be appreciated :). Thx!
I have a link with a fragment in it, for example:
<a href="/somepage#header">More info</a>
if I click on such a link using mechanize I always get a 404. The problems appears to be that the fragment is not removed from the URL before a request is created. This should probably be done in Browser.click_link with a simple link.absolute_url.split("#",1)[0] or something similar.
Do not remove file_object after uploading (submit) the file in a form
I have the following code:
<form action="blabla" blabla >
<input 1 type=blah>
<input 2 type=blah2> etc
<noscript>
<textarea name="prda" rows="3" cols="40"></textarea>
</noscript>
I want to fill out that textarea preferrably with mechanize (in Python), however, form["prda"] is always giving me control not found error. A user on StackOverflow has suggested that mechanize cannot parse controls that are within tag, which seems kind of odd for me. Is this true?
I've been trying to select the 2nd form out of 20+ forms in a page. And it happens to be the only form in the page with a name 'send_form'.
I've tried
br.select_form(nr=1)
and
br.select_form(name='send_form')
and
for f in br.forms():
if f.name != None:
br.select_form(name=f.name)
The first results in every single form object from the page. The second returns a no form by that name error. And the third also returns every single form object on the page.
Now there are three fields I'm trying to access name=prospect_email[], name=prospect_name[], name=prospect_telephone[]. Now these input fields also have id's with the same name lacking the []. Now I've successfully input data into fields on other forms so I know how to do it. But when I try to access these I get an error saying the name of ... does not exist. I was figuring it's probably because I don't have the right form selected. I've spent hours on this and I'm racking my brain trying to figure it out. Help will be appreciated.
Hi, I was trying to package this for Gentoo. This is what my test run gives me:
/var/tmp/portage/dev-python/mechanize-0.2.0/work/mechanize-0.2.0/test/test_api.py:6: SyntaxWarning: import * only allowed at module level
def test_import_all(self):
test-tools/testprogram.py:401: UserWarning: Skipping functional tests: Failed to import twisted.web2 and/or zope.interface
warnings.warn("Skipping functional tests: Failed to import "
(After that, all tests are either skipped or pass. Not sure the UserWarning here is good, I'd prefer to just have a bunch of skipped tests.)
setuptools install of mechanize is broken (whereas pip works)
Facts :
$ man virtualenv
$ virtualenv --setuptools test
New python executable in test/bin/python
Installing setuptools............done.
$ cd test/
$ . bin/activate
(test)$ bin/easy_install mechanize
Searching for mechanize
Reading http://pypi.python.org/simple/mechanize/
Reading http://wwwsearch.sourceforge.net/mechanize/
Best match: mechanize 0.2.4
Downloading http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz
Traceback (most recent call last):
File "bin/easy_install", line 8, in <module>
load_entry_point('setuptools==0.6c11', 'console_scripts', 'easy_install')()
…
File "/tmp/test/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/setuptools/package_index.py", line 553, in _download_to
ValueError: invalid literal for int() with base 10: '382727, 382727'
After investigating it seems that setuptools use dowload url on pypi page (whereas pip use archives hosted on pypi). It fails when checking Content-Length sent by http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz which is repeated two times and given to setuptool as a tuple (and setuptool expect an int) :
$ curl -D - http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz -o /tmp/tmp.tar.gz
Server: Apache/2.2.3 (CentOS)
Last-Modified: Thu, 28 Oct 2010 20:57:05 GMT
ETag: "5d707-493b395976a40"
Content-Length: 382727 <-----
Expires: Sat, 02 Apr 2011 07:41:43 GMT
Content-Type: application/x-gzip
Content-Length: 382727 <-----
Date: Thu, 31 Mar 2011 07:41:43 GMT
X-Varnish: 58847923
Age: 0
Via: 1.1 varnish
Connection: keep-alive
Readme points to docs/html/index.html, which doesn't exist. There is no 'html' directory in the docs directory.
No control with label 'asdfasdfasdf' exists. Mechanize still submits the form.
browser.forms().next().click(label='asdfasdfasdf')
The following seems to fix the issueVersion 0.2.0. In _form.py, line 3190 add to the condition:
or (label is not None)
With this fix the above clock() call raises
ControlNotFoundError: no control matching kind 'clickable', label 'asdfasdfasdf'
mechanize does not throw an error when connecting to a website using an invalid SSL certificate. This means that mechanize users are vulnerable to man-in-the-middle attacks even when they think they are protected by SSL.
Here’s a test case:
mechanize.Browser().open('https://scripts-vhosts.mit.edu/').read()
Whene mechanize find a form that contain 2 controls with the same name but different id, the dict of controls in form have the key like name not the id.
In _clientcookie.py:_cookie_attrs() cookie values that do match only \W characters will have there double quotes explicitly escaped.
This changes the value of quoted-cookies when they return to the webapp. For example:
A cookie comes in with the key/value pair: hello => "world"
The quote substitution will make this into: hello => "world"
Now wireshark tells me that the following is send on the next request:
Cookie: hello=\"world\"; $Path="/"; $Domain=".some.testdomain.com"
When I comment out this part of the code:
# quote cookie value if necessary
# (not for Netscape protocol, which already has any quotes
# intact, due to the poorly-specified Netscape Cookie: syntax)
# if ((cookie.value is not None) and
# self.non_word_re.search(cookie.value) and version > 0):
# value = self.quote_re.sub(r"\\\1", cookie.value)
# else:
value = cookie.value
Everything works as expected.
I'm not quite sure what the comments means by 'not for netscape protocol', should there be an extra check in there
to check that it's not a mozilla style mechanized browser ?
I've checked the webapp (not mime) and it appears to not be doing anything that is not understood by any browser. The cookies as given by the webapp work as expected on any browser.
This should result in the first button being clicked:
import mechanize
browser = mechanize.Browser()
browser.set_response(mechanize.make_response(
"""\
<button type="submit" name="action" value="publish">Publish</button>
<button type="submit" name="action" value="preview">Preview</button>
""",
[("Content-Type", "text/html")],
"http://example.com/", 200, "OK"))
form = browser.global_form()
form.click(predicate=lambda control: control.name == "publish")
I notice that if I try to follow an ftp:// type link, there is a crash.
This is because in the _urllib2_fork.py, it imports ftpwrapper from urllib.
This function expects 6 arguments, but throughout the _urllib2_fork.py file, a timeout argument is given, which causes it to choke...
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.