jjlee / mechanize Goto Github PK

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize .

Home Page: http://wwwsearch.sourceforge.net/mechanize/

mechanize's Introduction

This project has moved to https://github.com/python-mechanize/mechanize

It's now being maintained by other people, principally Kovid Goyal

-- John Lee, March 2017

mechanize's People

Contributors

Stargazers

Watchers

Forkers

trcjr alexmipego matin mikluko frankk00 dss09 albertz axil jasonkotenko bjmc tmiyamon redfax vikalp tlevine abielr gugu djon3s dhionel newvem forschnix jamzo maznika wzin rickychang qqism mitchellrj travis-sun cluther purcaro jeffbuttars ipkn tachang streambo daqing15 papajo moscowmule2240 pansuo kacecode poemcao devlato jzr9356 mikewebkist caseydunham on2valhalla idobatter jackyxie jasimmk wesmadrigal dshah133 dishida bcriswell sfall visualrealityink muxuezi bihicheng hugovk b-rich coinpayee hickford therockspush delan-h dellis23 chasesmith95 marynaqa skada tanwanirahul amintos toro68 s-fiebig winnetou fioan89 hurzl pmulhol mehakinderoberoi glasslion phorkyou cainiaocome johnteifel mckenlee simudream danozgriff shareablee karbon0x edencysec iwsaas dkim0718 sundrique spiiin gurudiel barseghyanartur scharsey paveltr tforsberg shiweixingcn elsierey l9c yuanbei vedraiyani ss2nddiv losintikfos

mechanize's Issues

No equality operator on Cookie

mtamizi reports that the lack of an equality operator makes it awkward to use pickled cookies in sqlalchemy.

Expect: This is true: mechanize.Cookie(**args) == mechanize.Cookie(**args)
Got: It isn't

Test case using sqlalchemy: (fails with Python 2.7 and sqlalchemy 0.6.3): http://gist.github.com/550319

No way to give data as bytes to FileControl

FileControl assumes that the data to include in the field of the form comes from a file on disk. It should also allow adding a file from a byte array source.

In most environments there's an easy work-around for the current limitation, just write a temporary file. Unfortunate Google AppEngine doesn't allow writing files. (And I want to take data from a db.Blob and upload it to a web form.)

If there's another workaround I haven't thought of please let me know. Maybe a proxy class that works enough like File but is sourced with a byte array?

Infinite loop on self-refreshing pages

Issue: Calling br.open(url) enters an infinite refresh loop if the page has a refresh header pointing to itself.

Reasons:

The default arguments to HTTPRefreshProcessor follow all refreshes, after waiting the page-requested amount of time.
Since the page is fetched correctly, adding a timeout parameter to open() does nothing. (Upon reflection, this seems like correct behavior to me. However, it is quite counter-intuitive to a library user, especially since there's no indication of why mechanize is hanging.)

In my case, I don't care about refresh headers, so I simply changed the default arguments at _useragent.py:107.

Possible Solutions:

Allow customization of refresh header behavior in the Browser object.
Ignore header refreshes after the browser timeout has passed.

Thoughts?

(Thanks for mechanize, btw, it's a fantastic piece of software!)

No good way to tell when a timeout occurred

Got: When you request a timeout using the timeout parameter to urlopen (or Browser.open), in order to tell that a timeout occurred, you have to use a poorly-defined interface like HTTPError.reason, using code like this:

import mechanize
import socket
br = mechanize.Browser()
try:
br.open("http://python.org/", timeout=0.001)
except mechanize.URLError, exc:
if isinstance(exc.reason, socket.timeout):
print "timeout occurred"

Expect: There's some clearly defined iinterface for finding out that a timeout imposed by module socket occurred.

No .set_timeout() method on mechanize.UserAgent

There's no .set_timeout() method.

Expect:

This should cause the .open() to time out in the same way as providing a timeout argument to .open():

browser = mechanize.Browser()
browser.set_timeout(10.)
browser.open("http://example.com")

The argument to .open() should override the .set_timeout() default.

Got: no .set_timeout method

HTTPS CONNECT proxies not supported

Python 2.6 supports the CONNECT method for establishing HTTPS connections through a web proxy.

To reproduce: attempt to mechanize.urlopen() an https: URL served by a remote host when the only route to the web from your host is through an HTTP proxy that supports the CONNECT method.

Expect: can fetch page

Got: fetch fails due to failure connect to remote host

Mechanize doesn`t work with cookies with an empty "path" attribute

If cookie has path attribute set to empty, mechanize thinks that it is incorrect and bypass them.
But all modern browser (ie, firefox, chrome) work correctly with empty path attributes.
I have a quick patch:

diff --git a/mechanize/_clientcookie.py b/mechanize/_clientcookie.py
index 2ed4c87..2af778a 100644
--- a/mechanize/_clientcookie.py
+++ b/mechanize/_clientcookie.py
@@ -1291,6 +1291,9 @@ class CookieJar:
# is a request to discard (old and new) cookie, though.
k = "expires"
v = self._now + v

```
           if k == "path":
```
```
               if v is None:
```

                   v = "/"
         if (k in value_attrs) or (k in boolean_attrs):
             if (v is None and
                 k not in ["port", "comment", "commenturl"]):

Add an easy way to submit forms without "clicking" on control

If you automate an ASP.NET site quite often you have to "emulate" javascript handlers in your python code. I have seen a couple of cases then submit should be done after clicking on A tag at the same time form has a clickable control.

Even if I update required hidden controls (__EVENTTARGET) and do browser.form.submit() without arguments, mechanize "emulates" click on the first clickable control and I got wrong result.

It would be very useful if I can use some special argument(s) value to HTMLForm.click which will result in running HTMLForm._switch_click method even if there clickable controls in the form.

mechanize monkey-patches sgmllib.charref

To reproduce:

python -c "import sgmllib; print sgmllib.charref; import mechanize; print sgmllib.charref"

Expect: prints the same both times.

Got: doesn't, since mechanize takes it upon itself to monkey-patch sgmllib to fix http://bugs.python.org/issue803422

forgot ``seek(0)'' in mechanize ``RobustFactory.set_response()''?

In response to this bug:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=456944

The debian mechanize package is carrying this patch:

 http://svn.debian.org/viewsvn/pkg-zope/python-mechanize/trunk/debian/patches/mechanize_seek.dpatch?revision=2231&view=markup

It seems to not have been applied to recent versions of mecanize. It would be nice to get rid of that patch one way or another.

Mechanize cannot handle forms with disabled inputs with image type

Here is simple html:

<title></title>

This is simple python example to reproduce the problem

import mechanize
br = mechanize.Browser()
br.open_local_file('test.html')
br.select_form('f')

this example crash with AttributeError: control 'i' is disabled on form selection, root cause is the following line (_form.py, line 2336):

if self.value is None: self.value = ""

in SubmitControl constructor.

easy_install error on 2.4 due to yield in try/finally in _firefox3cookiejar.py

Using regular or ==dev easy_install "fails" in both cases, same problem with Python 2.4 on Ubuntu

easy_install-2.4 -U mechanize
Searching for mechanize
Reading http://pypi.python.org/simple/mechanize/
Reading http://wwwsearch.sourceforge.net/mechanize/
Best match: mechanize 0.1.11
Downloading http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.1.11.zip
Processing mechanize-0.1.11.zip
Running mechanize-0.1.11/setup.py -q bdist_egg --dist-dir /tmp/easy_install-IwwENn/mechanize-0.1.11/egg-dist-tmp-iOvYrh
no previously-included directories found matching 'docs-in-progress'
File "build/bdist.linux-i686/egg/mechanize/_firefox3cookiejar.py", line 91
yield row
SyntaxError: 'yield' not allowed in a 'try' block with a 'finally' clause
Adding mechanize 0.1.11 to easy-install.pth file

Installed /usr/local/lib/python2.4/site-packages/mechanize-0.1.11-py2.4.egg
Processing dependencies for mechanize
Finished processing dependencies for mechanize

Add support for HTML parsing libraries

Python libraries for parsing HTML have improved. mechanize doesn't support three of the most popular choices of the current crop.

Expect: can use some mechanize API to request that one of these libraries is used to parse HTML:

lxml.html
BeautifulSoup 3
html5lib

Got: can only use bundled BeautifulSoup v.2 or Python's sgmllib or SGMLParser modules.

the cookies test should be using tempfile.mkstemp instead of tempfile.mktemp

The cookies test should be using tempfile.mkstemp instead of tempfile.mktemp.
One example of using the tempfile.mktemp method is TempfileTestMixin in test/test_cookies.py. tempfile.mktemp - as per [0] , it is Deprecated since version 2.3: Use mkstemp() instead.

[0] - http://docs.python.org/library/tempfile.html#tempfile.mktemp

Breaks on Delicious login form

This is with the latest mechanize. Just as an example, ClientForm is unable to parse https://delicious.com/login correctly. It fails to pick up the second form which comes right after the <hr/>. If you insert any form right after that <hr/>, it will be omitted from Browser.forms(). If you remove the <hr/>, the form gets picked up.

Use absolute imports

Require Python 2.5 and use the absolute and relative imports features.

The reason for using absolute imports is described in PEP 328.

Traceback with multiple content-type headers

From Felix Heß

trying to read www.cortalconsors.de with mechanize fails. The problem is
in _http.py in the function http_response (line 197). Calling

ct_hdrs = http_message.getheaders("content-type")

returns [''] sometimes. Then is_html(ct_hdrs, url, self._allow_xhtml) fails.

proposed bugfix:

if '' in ct_hdrs:
    ct_hdrs.remove('')

before calling

if is_html(ct_hdrs, url, self._allow_xhtml):

I hope this information helps you to resolve the bug.

Best regards
Felix

Long HTML attributes might get newlines stripped in the middle

If an attribute is very long, FormParser.feed() might get the value of the attribute in multiple chunks. If it happens to be chunked before a newline, handle_data() will strip that newline.
I am attaching patch to fix that.

Test issue to experiment with github tracker's code snippet formetting

snippet:
import mechanize
import socket
br = mechanize.Browser()
try:
br.open("http://python.org/", timeout=0.001)
except mechanize.URLError, exc:
if isinstance(exc.reason, socket.timeout):
print "timeout occurred"

Browser.open() hangs if Transfer-Encoding: chunked

Hello there,

I am having issues when trying to open pages when getting Transfer-Encoding: chunked in responses.

Browser.open() simply hangs without raising any exception. I don't have stacktrace to show, but here is the debug output of the request:

send: 'GET http://www.tuttosport.com/robots.txt HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.tuttosport.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Length: 28
header: ETag: "5417a9-1c-44692ef7da100"
header: Date: Sat, 01 Oct 2011 01:00:29 GMT
header: Last-Modified: Wed, 20 Feb 2008 08:40:04 GMT
header: Expires: Sat, 01 Oct 2011 01:05:29 GMT
header: Server: Apache
header: Accept-Ranges: bytes
header: Content-Type: text/plain
header: Connection: close
send: 'GET http://www.tuttosport.com/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.tuttosport.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sat, 01 Oct 2011 01:00:30 GMT
header: Expires: Sat, 01 Oct 2011 01:05:30 GMT
header: Server: Apache
header: Accept-Ranges: bytes
header: Content-Type: text/html
header: Transfer-Encoding: chunked
header: Age: 1
header: Connection: close

Browser.retrieve, original filename and incomplete httplib.HTTPMessage RFC822 header parsing

I had some issues with Browser.retrieve and original filenames:

Browser.retrieve(someurl) returns a (tmp_filename, httplib.HTTPMessage), with a temporary filename from tempfile.mkstemp;
Browser.retrieve(someurl, filename) returns a (filename, httplib.HTTPMessage);
but there's no way tho get the original filename, even if it's present in the 'Content-disposition: attachment; filename="abcd.xyz"' httplib.HTTPMessage header.

httplib.HTTPMessage has indeed a 'getparam' method, but unfortunately, it's only used/usable for 'content-type' header parsing.

I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:

import mechanize
from some.module import monkeypatch_http_message
browser = mechanize.Browser()
(tmp_filename, headers) = browser.retrieve(someurl) 

# monkeypatch the httplib.HTTPMessage instance
monkeypatch_http_message(headers)

# yeah... my original filename, finally
filename = headers.get_filename()

Redirect request must visit if original request does

If you navigate to a page A that redirect to page B, the page you visit is the page B. This is the page that must be added to the history.
When reload()ing the browser, that the page B that must be requested, not page A

This is really problematic if the page A is a submitted form.

The fix is easy, however it break a doctest but, as explain above, this test seems based on a false assumption.

Patch: http://paste.pocoo.org/show/229138/

For testing purpose, here is a simple server and a test script
http://paste.pocoo.org/show/229139/
http://paste.pocoo.org/show/229140/

mechanize.Browser() not working with socks have port is 27977

import socks
import socket
import mechanize

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "76.73.239.33", 27977)
socket.socket = socks.socksocket

br = mechanize.Browser()

br.open("https://www.google.com")

Traceback (most recent call last):
File "tets1.py", line 16, in
br.open("https://www.google.com")
File "build\bdist.win32\egg\mechanize_mechanize.py", line 203, in open
File "build\bdist.win32\egg\mechanize_mechanize.py", line 230, in _mech_open
File "build\bdist.win32\egg\mechanize_opener.py", line 188, in open
File "build\bdist.win32\egg\mechanize_http.py", line 316, in http_request
File "build\bdist.win32\egg\mechanize_http.py", line 242, in read
File "build\bdist.win32\egg\mechanize_mechanize.py", line 203, in open
File "build\bdist.win32\egg\mechanize_mechanize.py", line 230, in _mech_open
File "build\bdist.win32\egg\mechanize_opener.py", line 193, in open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 344, in _open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 332, in _call_chain
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 1170, in https_open
File "build\bdist.win32\egg\mechanize_urllib2_fork.py", line 1115, in do_open
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 866, in request
self._send_request(method, url, body, headers)
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 889, in _send_request
self.endheaders()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 860, in endheaders
self._send_output()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 732, in _send_output
self.send(msg)
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 699, in send
self.connect()
File "V:\python\python.v2.54_portable\App\lib\httplib.py", line 1134, in connect
sock.connect((self.host, self.port))
File "V:\python\python.v2.54_portable\App\lib\site-packages\socks.py", line 369, in connect
self.__negotiatesocks5(destpair[0],destpair[1])
File "V:\python\python.v2.54_portable\App\lib\site-packages\socks.py", line 236, in __negotiatesocks5
raise Socks5Error(ord(resp[1]),_generalerrors[ord(resp[1])])
TypeError: init() takes exactly 2 arguments (3 given)

Exit code: 1

ProxyHandler is missing features/fixes from Python 2.6

Expect: All proxy features available in Python 2.6 are available in mechanize.

Got: proxy bypass settings (e.g. no_proxy environment variable) are ignored (and probably bug fixes, and perhaps other changes are missing).

Parser error when using Browser.follow_link(url_regex=)

The parser throwing an error (ParseError: expected name token at "<!';\npixiv.context.u") when using follow_link(url_regex='URL_PATTERN').

The "<!" is inside javascript string variable, not for denoting html comment, the full script in here:

....

...

Doesn't throw exception if using RobustFactory()

Traceback on unknown encoding

To reproduce:
import mechanize
import mechanize._response

response = mechanize._response.test_response(
    "&lt;",
    headers=[("Content-type", "text/html; charset=\"bogus\"")])
browser = mechanize.Browser()
browser.set_response(response)
browser.forms()

Expect: no traceback (falls back to default encoding)

Got:
Traceback (most recent call last):
File "/home/john/dev/tst.py", line 93, in
browser.forms()
File "/home/john/dev/mechanize/mechanize/_mechanize.py", line 420, in forms
return self._factory.forms()
File "/home/john/dev/mechanize/mechanize/_html.py", line 549, in forms
self._forms_factory.forms())
File "/home/john/dev/mechanize/mechanize/_html.py", line 229, in forms
_urlunparse=_rfc3986.urlunsplit,
File "/home/john/dev/mechanize/mechanize/_form.py", line 844, in ParseResponseEx
_urlunparse=_urlunparse,
File "/home/john/dev/mechanize/mechanize/_form.py", line 981, in _ParseFileEx
fp.feed(data)
File "/home/john/dev/mechanize/mechanize/_form.py", line 758, in feed
_sgmllib_copy.SGMLParser.feed(self, data)
File "/home/john/dev/mechanize/mechanize/_sgmllib_copy.py", line 110, in feed
self.goahead(0)
File "/home/john/dev/mechanize/mechanize/_sgmllib_copy.py", line 199, in goahead
self.handle_entityref(name)
File "/home/john/dev/mechanize/mechanize/_form.py", line 650, in handle_entityref
'&%s;' % name, self._entitydefs, self._encoding))
File "/home/john/dev/mechanize/mechanize/_form.py", line 143, in unescape
return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)
File "/usr/lib/python2.6/re.py", line 151, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "/home/john/dev/mechanize/mechanize/_form.py", line 135, in replace_entities
repl = repl.encode(encoding)
LookupError: unknown encoding: bogus

RobustFactory fails to use BeautifulSoup to parse forms

The problem is: in RobustFactory, the FormsFactory is in fact still a
default FormsFactory, not the beautifulsoup's.

In _html.py, line 423 (the blue line), when constructing RobustFormsFactory,
the assignment does not work, I printed out the form_parser_class in
FormsFactory's constructor, and it's shown as "None";

So I added the red line to solve it, just a quick fix, hope you can update
it in next version:

class RobustFormsFactory(FormsFactory):
   def __init__(self, *args, **kwds):
   args = form_parser_args(*args, **kwds)
   if args.form_parser_class is None:
       args.form_parser_class = RobustFormParser
       args.dictionary['form_parser_class'] = RobustFormParser
   FormsFactory.__init__(self, **args.dictionary)

str(mechanize.ParseError()) traceback

To reproduce:
print str(mechanize.ParseError("spam"))

Expect: "spam" printed
Got:
File "/usr/lib/python2.6/HTMLParser.py", line 59, in str
result = self.msg
AttributeError: 'ParseError' object has no attribute 'msg'

value for button not submitted to server

I have a form with a couple of submit buttons which look liks this:

<button type="submit" name="action" value="publish">Publish</button>
<button type="submit" name="action" value="preview">Preview</button>

When I click on one of those buttons mechanize submits the form, but does not include an action value in the request data.

Debugging this shows that this goes wrong in ScalarControl._totally_ordered_pairs() for the SubmitButtonControl instance: disabled is set to True, so no pair is returned.

AttributeError raised instead of ParseError

To reproduce:

Call urlopen on a document that causes ParseError to be raised internally

Expect: ParseError

Got: AttributeError (from Elaine Angelino):

In [46]: from mechanize import Browser

In [47]: br = Browser()

In [48]: br.open('http://www.walgreens.com/marketing/storelocator/find.jsp')
Out[48]: <response_seek_wrapper at 0x1b7a080 whose wrapped object =
<closeable_response at 0x1c8b170 whose fp = <socket._fileobject object at
0x1b846b0>>>

In [49]: br.forms()

ParseError Traceback (most recent call last)

/Users/elaineangelino/gotdata/Temp/ in ()

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_mechanize.pyc
in forms(self)
424 if not self.viewing_html():
425 raise BrowserStateError("not viewing HTML")
--> 426 return self._factory.forms()
427
428 def global_form(self):

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_html.pyc
in forms(self)
557 try:
558 self._forms_genf = CachingGeneratorFunction(
--> 559 self._forms_factory.forms())
560 except: # XXXX define exception!
561 self.set_response(self._response)

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mechanize-0.1.11-py2.6.egg/mechanize/_html.pyc
in forms(self)
226 )
227 except ClientForm.ParseError, exc:
--> 228 raise ParseError(exc)
229 self.global_form = forms[0]
230 return forms[1:]

<type 'str'>: (<type 'exceptions.AttributeError'>,
AttributeError("'ParseError' object has no attribute 'msg'",))

In [50]:

Link attributes is returning a list of tuples

I was expecting it to return attributes like the form, as a dictionary. I changed a single line of code in _html.py and now it appears to be working...

On line 190 of _html.py changed token.attrs to attrs...

``<br/>`` in form makes following control invisible

Reading this page:

with this script:

import mechanize
br = mechanize.Browser()
url = r'file:///home/catherine/Music/badform.html'
br.open(url)
br.select_form('login')
br['passwd'] = 'no problem'
br['username'] = 'problem'

I get:

catherine@dellzilla:~/Music$ python mechbug.py
Traceback (most recent call last):
File "mechbug.py", line 7, in
br['username'] = 'problem'
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 2895, in setitem
control = self.find_control(name)
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 3222, in find_control
return self._find_control(name, type, kind, id, label, predicate, nr)
File "/usr/local/lib/python2.6/dist-packages/ClientForm-0.2.10-py2.6.egg/ClientForm.py", line 3306, in _find_control
raise ControlNotFoundError("no control matching "+description)
ClientForm.ControlNotFoundError: no control matching name 'username'

Examining the form controls shows that the submit and passwd controls are present, but the username field is absent from form.controls.

Removing <br/> from the form fixes the problem. In fact, even changing <br/> to <br /> (inserting a space) fixes the problem. Unfortunately, I can't stop the form authors of the world from sticking <br/> in their forms!

Stop bundling beautifulsoup

Just making sure you are aware that this bug was reported on the debian mechanize package:

 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=555349

Debian tries to prevent duplication of code in it's archive mainly for security reasons.

Errors met during testing

I have mechanize 0.2.4, python 2.7, zope.interface 3.6.1 and twisted.web2 8.1.0 on Fedora 14.

build log:
http://koji.fedoraproject.org/koji/getfile?taskID=2698590&name=build.log

Browser.retrieve, original filename and incomplete httplib.HTTPMessage RFC822 header parsing

I had some issues with Browser.retrieve and original filenames, at least in Python 2.6:

Browser.retrieve(someurl) returns a (tmp_filename, httplib.HTTPMessage), with a temporary filename from tempfile.mkstemp;
Browser.retrieve(someurl, filename) returns a (filename, httplib.HTTPMessage);
but there's no way tho get the original filename, even if it's present in the 'Content-disposition: attachment; filename="abcd.xyz"' httplib.HTTPMessage header.

That's not really mechanize's fault: to extract those header parameters, httplib.HTTPMessage is missing a crucial 'get_filename' or a more generic 'get_param' methods, that are both present in the email.message.Message class. httplib.HTTPMessage has indeed a 'getparam' method, but unfortunately, it's only used/usable for 'content-type' header parsing.

I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:

import mechanize 
from some.module import monkeypatch_http_message 

browser = mechanize.Browser() 
(tmp_filename, headers) = browser.retrieve(someurl) 

# monkeypatch the httplib.HTTPMessage instance 
monkeypatch_http_message(headers) 

# yeah... my original filename, finally 
filename = headers.get_filename()

Once again, that's the situation in Python 2.6. According to http://bugs.python.org/issue4773, httplib.HTTPMessage in Python 3.x is using email.message.Message underneath.

(ps: this is an edited repost of issue 35, that I closed by mistake...)

BeautifulStoneSoup in select_form messes up utf-8

In UTF-8 the character Ü is represented by two bytes, one of which appears as a key in mechanize._beautifulsoup.BeautifulStoneSoup.MS_CHARS

In Browser.open a subclass of BeautifulStoneSoup called MechanizeBs is used, which overrides BeautifulStoneSoup.PARSER_MASSAGE, so that MS_CHARS is ignored.

In Browser.select_form however, mechanize._form.RobustFormParser is used, which uses BeautifulStoneSoup directly, which uses MS_CHARS for replacements. This leads to one of the bytes of UTF-8 Ü being replaced, which destroys the Ü character. As a consequence controls with labels containing Ü cannot be found by their label anymore, i.e. the following : browser.click( label='Übernehmen' ) fails with a ControlNotFoundError: no control matching kind 'clickable', label 'Übernehmen'.

I currently worked around that using a monkey patch:

import mechanize
mechanize._form.RobustFormParser.PARSER_MASSAGE = mechanize._html.MechanizeBs.PARSER_MASSAGE

A real fix would be appreciated :). Thx!

URL fragments in links are not handled

I have a link with a fragment in it, for example:

<a href="/somepage#header">More info</a>

if I click on such a link using mechanize I always get a 404. The problems appears to be that the fragment is not removed from the URL before a request is created. This should probably be done in Browser.click_link with a simple link.absolute_url.split("#",1)[0] or something similar.

upload a file in a form

Do not remove file_object after uploading (submit) the file in a form

Cannot select a textarea within a form but within a <noscript> tag

I have the following code:

<form action="blabla" blabla >
<input 1 type=blah>
<input 2 type=blah2> etc
<noscript>
    <textarea name="prda" rows="3" cols="40"></textarea>
</noscript>

I want to fill out that textarea preferrably with mechanize (in Python), however, form["prda"] is always giving me control not found error. A user on StackOverflow has suggested that mechanize cannot parse controls that are within tag, which seems kind of odd for me. Is this true?

Selecting a single form amongst many

I've been trying to select the 2nd form out of 20+ forms in a page. And it happens to be the only form in the page with a name 'send_form'.

I've tried

br.select_form(nr=1)

and

br.select_form(name='send_form')

and

for f in br.forms():
    if f.name != None:
        br.select_form(name=f.name)

The first results in every single form object from the page. The second returns a no form by that name error. And the third also returns every single form object on the page.

Now there are three fields I'm trying to access name=prospect_email[], name=prospect_name[], name=prospect_telephone[]. Now these input fields also have id's with the same name lacking the []. Now I've successfully input data into fields on other forms so I know how to do it. But when I try to access these I get an error saying the name of ... does not exist. I was figuring it's probably because I don't have the right form selected. I've spent hours on this and I'm racking my brain trying to figure it out. Help will be appreciated.

Errors when running the tests

Hi, I was trying to package this for Gentoo. This is what my test run gives me:

/var/tmp/portage/dev-python/mechanize-0.2.0/work/mechanize-0.2.0/test/test_api.py:6: SyntaxWarning: import * only allowed at module level
def test_import_all(self):
test-tools/testprogram.py:401: UserWarning: Skipping functional tests: Failed to import twisted.web2 and/or zope.interface
warnings.warn("Skipping functional tests: Failed to import "

(After that, all tests are either skipped or pass. Not sure the UserWarning here is good, I'd prefer to just have a bunch of skipped tests.)

setuptool install broken

setuptools install of mechanize is broken (whereas pip works)

Facts :

$ man virtualenv 
$ virtualenv --setuptools test
New python executable in test/bin/python
Installing setuptools............done.
$ cd test/
$ . bin/activate
(test)$ bin/easy_install mechanize
Searching for mechanize
Reading http://pypi.python.org/simple/mechanize/
Reading http://wwwsearch.sourceforge.net/mechanize/
Best match: mechanize 0.2.4
Downloading http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz
Traceback (most recent call last):
  File "bin/easy_install", line 8, in <module>
    load_entry_point('setuptools==0.6c11', 'console_scripts', 'easy_install')()
…
  File "/tmp/test/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/setuptools/package_index.py", line 553, in _download_to
ValueError: invalid literal for int() with base 10: '382727, 382727'

After investigating it seems that setuptools use dowload url on pypi page (whereas pip use archives hosted on pypi). It fails when checking Content-Length sent by http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz which is repeated two times and given to setuptool as a tuple (and setuptool expect an int) :

$ curl -D - http://wwwsearch.sourceforge.net/mechanize/src/mechanize-0.2.4.tar.gz -o /tmp/tmp.tar.gz
Server: Apache/2.2.3 (CentOS)
Last-Modified: Thu, 28 Oct 2010 20:57:05 GMT
ETag: "5d707-493b395976a40"
Content-Length: 382727                                             <-----
Expires: Sat, 02 Apr 2011 07:41:43 GMT
Content-Type: application/x-gzip
Content-Length: 382727                                             <-----
Date: Thu, 31 Mar 2011 07:41:43 GMT
X-Varnish: 58847923
Age: 0
Via: 1.1 varnish
Connection: keep-alive

Readme points to docs/html/index.html, which doesn't exist

Readme points to docs/html/index.html, which doesn't exist. There is no 'html' directory in the docs directory.

using non-existent label for click() still submits the form (including FIX)

No control with label 'asdfasdfasdf' exists. Mechanize still submits the form.

browser.forms().next().click(label='asdfasdfasdf')

The following seems to fix the issueVersion 0.2.0. In _form.py, line 3190 add to the condition:

or (label is not None)

With this fix the above clock() call raises

ControlNotFoundError: no control matching kind 'clickable', label 'asdfasdfasdf'

mechanize silently fails to validate SSL certificates over https:

mechanize does not throw an error when connecting to a website using an invalid SSL certificate. This means that mechanize users are vulnerable to man-in-the-middle attacks even when they think they are protected by SSL.

Here’s a test case:
mechanize.Browser().open('https://scripts-vhosts.mit.edu/').read()

Two controls in form same name different id

Whene mechanize find a form that contain 2 controls with the same name but different id, the dict of controls in form have the key like name not the id.

Backport Python 2.6 features / bug fixes

Various features from Python 2.6 urllib2 are not supported by mechanize, including:

issue #5
issue #6

No doubt there are bug fixes that haven't been applied also.

Quoted cookies get wrongly? escaped

In _clientcookie.py:_cookie_attrs() cookie values that do match only \W characters will have there double quotes explicitly escaped.

This changes the value of quoted-cookies when they return to the webapp. For example:

A cookie comes in with the key/value pair: hello => "world"
The quote substitution will make this into: hello => "world"
Now wireshark tells me that the following is send on the next request:

Cookie: hello=\"world\"; $Path="/"; $Domain=".some.testdomain.com"

When I comment out this part of the code:

            # quote cookie value if necessary
            # (not for Netscape protocol, which already has any quotes
            #  intact, due to the poorly-specified Netscape Cookie: syntax)
#            if ((cookie.value is not None) and
#                self.non_word_re.search(cookie.value) and version > 0):
#                value = self.quote_re.sub(r"\\\1", cookie.value)
#            else:
                value = cookie.value

Everything works as expected.

I'm not quite sure what the comments means by 'not for netscape protocol', should there be an extra check in there
to check that it's not a mozilla style mechanized browser ?

I've checked the webapp (not mime) and it appears to not be doing anything that is not understood by any browser. The cookies as given by the webapp work as expected on any browser.

No way to pass arbitrary predicate to .click / .submit methods

This should result in the first button being clicked:

import mechanize
browser = mechanize.Browser()
browser.set_response(mechanize.make_response(
        """\
<button type="submit" name="action" value="publish">Publish</button>
<button type="submit" name="action" value="preview">Preview</button>
""",
        [("Content-Type", "text/html")],
        "http://example.com/", 200, "OK"))
form = browser.global_form()
form.click(predicate=lambda control: control.name == "publish")

bug in ftpwrapper

I notice that if I try to follow an ftp:// type link, there is a crash.
This is because in the _urllib2_fork.py, it imports ftpwrapper from urllib.
This function expects 6 arguments, but throughout the _urllib2_fork.py file, a timeout argument is given, which causes it to choke...