
dryscrape's Introduction

NOTE: This package is not actively maintained. It uses QtWebkit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like Spynner instead.

Overview

Author: Niklas Baumstark

dryscrape is a lightweight web scraping library for Python. It uses a headless Webkit instance to evaluate Javascript on the visited pages. This enables painless scraping of plain web pages as well as Javascript-heavy “Web 2.0” applications like Facebook.

It is built on the shoulders of capybara-webkit's webkit-server. A big thanks goes to thoughtbot, inc. for building this excellent piece of software!

Changelog

  • 1.0: Added Python 3 support, small performance fixes, header names are now properly normalized. Also added the function dryscrape.start_xvfb() to easily start Xvfb.
  • 0.9.1: Changed semantics of the headers function in a backwards-incompatible way: It now returns a list of (key, value) pairs instead of a dictionary.

Supported Platforms

The library has been confirmed to work on the following platforms:

  • Mac OS X 10.9 Mavericks and 10.10 Yosemite
  • Ubuntu Linux
  • Arch Linux

Other unixoid systems should work just fine.

Windows is not officially supported, although dryscrape should work with Cygwin.

A word about Qt 5.6

The 5.6 version of Qt removes the Qt WebKit module in favor of the new module Qt WebEngine. So far webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt <= 5.5 is a requirement.

Installation, Usage, API Docs

Documentation can be found at dryscrape's ReadTheDocs page.

Quick installation instruction for Ubuntu:

# apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
# pip install dryscrape
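
A minimal usage sketch, stitched together from the calls that appear throughout the issues below (the target site and XPath are just illustrations, not part of the official docs):

import dryscrape

# start Xvfb if no X server is running (typical on Linux servers)
dryscrape.start_xvfb()

sess = dryscrape.Session(base_url='https://www.google.com')
sess.set_attribute('auto_load_images', False)  # we don't need images

sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')  # the search box
q.set('dryscrape')
q.form().submit()

sess.render('results.png')  # save a screenshot of the rendered page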

Contact, Bugs, Contributions

If you have any problems with this software, don't hesitate to open an issue on GitHub, open a pull request, or write a mail to niklas baumstark at Gmail.

dryscrape's People

Contributors

brycepg, elbandi, juanriaza, juniorojha, niklasb, trendsetter37, willzhang05


dryscrape's Issues

pip install -r requirements.txt (error)

Following instructions from the docs at:
http://readthedocs.org/docs/dryscrape/en/latest/installation.html

On Ubuntu server 10.04 I receive an error:

Requirement 'git+git://github.com/niklasb/webkit-server.git' looks like a filename, but the file does not exist
Unpacking ./git+git:/github.com/niklasb/webkit-server.git
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/pip.py", line 252, in main
    self.run(options, args)
  File "/usr/lib/python2.6/dist-packages/pip.py", line 408, in run
    requirement_set.install_files(finder, force_root_egg_info=self.bundle)
  File "/usr/lib/python2.6/dist-packages/pip.py", line 1757, in install_files
    self.unpack_url(url, location)
  File "/usr/lib/python2.6/dist-packages/pip.py", line 1817, in unpack_url
    self.unpack_file(source, location, content_type, link)
  File "/usr/lib/python2.6/dist-packages/pip.py", line 1917, in unpack_file
    or tarfile.is_tarfile(filename)
  File "/usr/lib/python2.6/tarfile.py", line 2529, in is_tarfile
    t = open(name)
  File "/usr/lib/python2.6/tarfile.py", line 1653, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python2.6/tarfile.py", line 1715, in gzopen
    fileobj = bltn_open(name, mode + "b")
IOError: [Errno 2] No such file or directory: '/root/dryscrape/git+git:/github.com/niklasb/webkit-server.git'

Is it possible to get the raw HTML from a Session or Node?

Dear Niklas,

I am trying to parse parts of a weirdly formatted website, where .at_xpath() and .at_css() don't help much. Is it somehow possible to retrieve the raw HTML that a Node or Session instance represents?

Kind regards,
Arne
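
One hedged answer, based only on calls that appear in other issues in this tracker: Session exposes body() (the rendered HTML of the whole page) and source() (the page source). For a single part of the page, a workaround is to re-parse body() with lxml, which is already a dryscrape dependency; the URL and selector below are hypothetical:

import dryscrape
import lxml.html

dryscrape.start_xvfb()
sess = dryscrape.Session(base_url='http://example.com')  # hypothetical URL
sess.visit('/')

html = sess.body()                 # rendered HTML of the whole page
doc = lxml.html.fromstring(html)
# serialize just the fragment you care about (hypothetical selector)
nodes = doc.xpath('//div[@id="content"]')
if nodes:
    print(lxml.html.tostring(nodes[0]))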

Dryscrape to login via Facebook.

I'm trying to use dryscrape to log in via Facebook, but I get this error.

Traceback (most recent call last):
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server.py", line 420, in __init__
    self._port = int(re.search(b"port: (\d+)", output).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "facebook_scraper.py", line 40, in <module>
    sess = dryscrape.Session(base_url = 'https://www.facebook.com')
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/dryscrape/session.py", line 22, in __init__
    self.driver = driver or DefaultDriver()
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/dryscrape/driver/webkit.py", line 30, in __init__
    super(Driver, self).__init__(**kw)
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server.py", line 230, in __init__
    self.conn = connection or ServerConnection()
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server.py", line 507, in __init__
    self._sock = (server or get_default_server()).connect()
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server.py", line 450, in get_default_server
    _default_server = Server()
  File "/Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server.py", line 427, in __init__
    raise WebkitServerError("webkit-server failed to start. Output:\n" + err)
webkit_server.WebkitServerError: webkit-server failed to start. Output:
dyld: Library not loaded: @rpath/./libQtWebKit.4.dylib
  Referenced from: /Users/noppanit/.virtualenvs/envpy3/lib/python3.4/site-packages/webkit_server
  Reason: image not found

Here's the code I'm using.

import dryscrape

# make sure you have xvfb installed
dryscrape.start_xvfb()

# set up a web scraping session
sess = dryscrape.Session(base_url = 'https://www.facebook.com')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@id="email"]')
q.set('email')
q = sess.at_xpath('//*[@id="pass"]')
q.set("password")
login_button = sess.at_xpath('//*[@id="u_0_x"]')
login_button.click()

# save a screenshot of the web page
sess.render('facebook.png')
print("Screenshot written to 'facebook.png'")

Fill a form automatically by nodes

Dear Niklas:
Is there a way to retrieve and fill a form automatically?

In your example you use:
q = sess.at_xpath('//*[@name="q"]')

That works when you know the structure of the form.
I tried this:
text = sess.at_xpath('//*[@type="text"]')
but it wasn't successful. What I want is to fill a form automatically, without knowing its structure in advance, extracting the nodes one by one along with their types (text, checkbox, button, etc.).

I hope I was clear.

Thank you, and regards!

Node API could use navigation features

Though the DOM can be navigated with XPath commands, an improved API would allow navigating the tree without resorting to them (an XPath-based workaround is sketched below). E.g.:

  • getchildren() - returns a list of child tag Nodes
  • getattributes() - returns a list of attribute Nodes
  • getparent() - returns the parent Node
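
Until such methods exist, the same navigation can be written with relative XPath on a Node; a hedged sketch, assuming Node.xpath() with relative paths works as the '@*' usage in another issue below suggests (URL is hypothetical):

import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session(base_url='http://example.com')  # hypothetical URL
sess.visit('/')

node = sess.at_xpath('//body')
children = node.xpath('./*')   # child element nodes
attrs = node.xpath('@*')       # attribute nodes
parent = node.xpath('..')      # list whose single element is the parent node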

Tutorial COMPOSE returns None

Looks to be a great tool, and apologies if I'm missing something obvious. Running the second example in the docs (the Gmail one), I get a None object back from the sess.at_xpath('//*[contains(text(), "COMPOSE")]') search. I've had a look at the source and there is definitely text corresponding to COMPOSE. I also entered a direct XPath to the button, which also turned up None. Any thoughts?

Popups

Does anyone know how to handle popups?

Thanks

installing dryscrape / webkit-server on Mac El Capitan fails with C++ linking step

Trying to build dryscrape fails; manually trying to install webkit-server returns the following error (I'm running Python via Anaconda):

$ sudo python setup.py install
running install
running build
cd src/ && /Applications/Xcode.app/Contents/Developer/usr/bin/make -f Makefile.webkit_server
g++ -headerpad_max_install_names -arch x86_64 -Xarch_x86_64 -mmacosx-version-min=10.5 -o webkit_server build/Version.o build/EnableLogging.o build/Authenticate.o build/SetConfirmAction.o build/SetPromptAction.o build/SetPromptText.o build/ClearPromptText.o build/JavascriptAlertMessages.o build/JavascriptConfirmMessages.o build/JavascriptPromptMessages.o build/IgnoreSslErrors.o build/ResizeWindow.o build/CurrentUrl.o build/ConsoleMessages.o build/main.o build/WebPage.o build/Server.o build/Connection.o build/Command.o build/SocketCommand.o build/Visit.o build/Reset.o build/Node.o build/JavascriptInvocation.o build/Evaluate.o build/Execute.o build/FrameFocus.o build/Response.o build/NetworkAccessManager.o build/NetworkCookieJar.o build/Header.o build/Render.o build/body.o build/Status.o build/Headers.o build/UnsupportedContentHandler.o build/SetCookie.o build/ClearCookies.o build/GetCookies.o build/CommandParser.o build/CommandFactory.o build/SetProxy.o build/NullCommand.o build/PageLoadingCommand.o build/SetTimeout.o build/GetTimeout.o build/SetSkipImageLoading.o build/WebPageManager.o build/WindowFocus.o build/GetWindowHandles.o build/GetWindowHandle.o build/TimeoutCommand.o build/SetUrlBlacklist.o build/NoOpReply.o build/JsonSerializer.o build/InvocationResult.o build/ErrorMessage.o build/Title.o build/FindCss.o build/JavascriptCommand.o build/FindXpath.o build/NetworkReplyProxy.o build/IgnoreDebugOutput.o build/Source.o build/SetHtml.o build/SetAttribute.o build/moc_Version.o build/moc_EnableLogging.o build/moc_Authenticate.o build/moc_SetConfirmAction.o build/moc_SetPromptAction.o build/moc_SetPromptText.o build/moc_ClearPromptText.o build/moc_JavascriptAlertMessages.o build/moc_JavascriptConfirmMessages.o build/moc_JavascriptPromptMessages.o build/moc_IgnoreSslErrors.o build/moc_ResizeWindow.o build/moc_CurrentUrl.o build/moc_ConsoleMessages.o build/moc_WebPage.o build/moc_Server.o build/moc_Connection.o build/moc_Command.o build/moc_SocketCommand.o build/moc_Visit.o build/moc_Reset.o build/moc_Node.o build/moc_JavascriptInvocation.o build/moc_Evaluate.o build/moc_Execute.o build/moc_FrameFocus.o build/moc_Response.o build/moc_NetworkAccessManager.o build/moc_NetworkCookieJar.o build/moc_Header.o build/moc_Render.o build/moc_Body.o build/moc_Status.o build/moc_Headers.o build/moc_UnsupportedContentHandler.o build/moc_SetCookie.o build/moc_ClearCookies.o build/moc_GetCookies.o build/moc_CommandParser.o build/moc_CommandFactory.o build/moc_SetProxy.o build/moc_NullCommand.o build/moc_PageLoadingCommand.o build/moc_SetSkipImageLoading.o build/moc_WebPageManager.o build/moc_WindowFocus.o build/moc_GetWindowHandles.o build/moc_GetWindowHandle.o build/moc_GetTimeout.o build/moc_SetTimeout.o build/moc_TimeoutCommand.o build/moc_SetUrlBlacklist.o build/moc_NoOpReply.o build/moc_JsonSerializer.o build/moc_ErrorMessage.o build/moc_Title.o build/moc_FindCss.o build/moc_JavascriptCommand.o build/moc_FindXpath.o build/moc_NetworkReplyProxy.o build/moc_Source.o build/moc_SetHtml.o build/moc_SetAttribute.o build/qrc_webkit_server.o -L/Users/Gully/anaconda/lib -lQtWebKit -lQtGui -L/Users/Gully/anaconda/lib -lQtNetwork -lQtCore
Undefined symbols for architecture x86_64:
"__Unwind_Resume", referenced from:
Version::start() in Version.o
Authenticate::start() in Authenticate.o
SetConfirmAction::start() in SetConfirmAction.o
SetPromptAction::start() in SetPromptAction.o
SetPromptText::start() in SetPromptText.o
ClearPromptText::start() in ClearPromptText.o
JavascriptAlertMessages::start() in JavascriptAlertMessages.o
...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [webkit_server] Error 1
make: *** [sub-src-webkit_server-pro-make_default-ordered] Error 2
error: [Errno 2] No such file or directory: 'src/webkit_server'

sess.visit() sometimes hangs

I have a program which cycles through a list of several thousand urls of different domains, calling sess.visit() for each without creating a new session object. Usually after visiting several hundred of these urls, there will be a visit() that does not return.
Waiting several hours has no effect - the operation has hung on visit().
When the process is interrupted it displays this trace:

File "/home/user1/projects/MyBot/MyScraper.py", line 50, in Scrape
sess.visit(site_url)
File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 35, in visit
return self.driver.visit(self.complete_url(url))
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 211, in visit
self.conn.issue_command("Visit", url)
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 429, in issue_command
return self._read_response()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 433, in _read_response
result = self._readline()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 467, in _readline
c = self._sock.recv(1)

If the url that caused the problem is then visited on its own, visit() returns successfully. So the problem does not seem to be related to the url being visited, but rather to some internal webkit state.
The number of iterations before hanging seems random - sometimes it occurs after less than 100 visits, sometimes after several hundred.

Here's a script that visits the same site 1000 times that will probably demonstrate the problem at some point:

from dryscrape import Session
from dryscrape.driver.webkit import Driver
from webkit_server import InvalidResponseError

link = 'http://insert-some-site-here-that-doesnt-mind-being-hammered.com'
sess = Session(driver = Driver())
sess.set_error_tolerant(True)
for i in range(1, 1000):
    try:
        sess.visit(link)
        sess.wait()
        print 'Success iteration', i
    except InvalidResponseError as e:
        print 'InvalidResponseError:', e

wait on condition example

Hi again,
can someone post an example of how to use the methods wait_for(), wait_for_safe(), and wait_while()?

I can't figure out what object to pass as a condition.

Thanks!
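
A hedged sketch of the usual pattern, matching the wait_for_safe(lambda: ...) call that appears in the Django issue further down: the condition is any zero-argument callable, and waiting stops once it returns something truthy (URL and selectors are hypothetical):

import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session(base_url='http://example.com')  # hypothetical URL
sess.visit('/')

# wait until the element exists; at_xpath() returns None until then,
# so the lambda is falsy while the element is missing
node = sess.wait_for(lambda: sess.at_xpath('//*[@id="content"]'))

# same idea, but errors raised while the page is still loading are swallowed
node = sess.wait_for_safe(lambda: sess.at_css('#content'))

# wait while a condition still holds, e.g. while a spinner is visible
sess.wait_while(lambda: sess.at_css('.spinner'))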

InvalidResponseError due to page load failure of a network URL

session.visit('https://www.kayak.co.in/flights/BLR-DEL/2016-02-10/2016-02-24')
throws an InvalidResponseError because a network URL (https://stags.bluekai.com/site/*) failed to load:

{
  "class":"InvalidResponseError",
  "message":"Unable to load URL: https://www.kayak.co.in/flights/BLR-DEL/2016-02-10/2016-02-24
because of error loading https://stags.bluekai.com/site/undefined?ret=html&phint=Product%3Dflight&
phint=Class%3DEconomy&phint=DepartureCity%3DBLR&phint=Destination%3DDEL&
phint=TravelersChildren%3D0&phint=TravelersSeniors%3D0&phint=TravelersAdults%3D1&
phint=DepartDate%3DFebruary 10 2016&phint=ReturnDate%3DFebruary 24 
2016&phint=Travelers%3D1&phint=IncludesSaturdayNight%3Dtrue&phint=LoggedIn%3Dfalse&
phint=__bk_t%3DBLR\\u00a0to\\u00a0DEL%2C 10%2F2 \\u2013 24%2F2&phint=__bk_k%3D&
phint=__bk_l%3Dhttps%3A%2F%2Fwww.kayak.co.in%2Fflights%2FBLR-
DEL%2F2016-02-10%2F2016-02-24&limit=6&bknms=ver=2.0,ua=7ae58ad3cad58b81fe8fcf6a076b854f,t=1454521436131,m=4b4e4ecaab1f1c93ab
1f1c93ab1f1c93,k=1,lang=e7425b8d215183162151831621518316,sr=800x680x24,tzo=0,hss=true,hls
=true,idb=false,addb=undefined,odb=function,cpu=4b4e4ecaab1f1c93ab1f1c93ab1f1c93,platform=1c1
7637dbf2f8edebf2f8edebf2f8ede,notrack=,plugins=4b4e4ecaab1f1c93ab1f1c93ab1f1c93,cn=e1cf5730
3394381cdd4226b23e9b0240&r=72950631: Bad Request"
}

Is there any way I can ignore this and proceed with what has been loaded?
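
One hedged workaround, using the InvalidResponseError import shown in another issue in this tracker: catch the exception and read whatever part of the page did load.

from webkit_server import InvalidResponseError
import dryscrape

dryscrape.start_xvfb()
session = dryscrape.Session()

try:
    session.visit('https://www.kayak.co.in/flights/BLR-DEL/2016-02-10/2016-02-24')
except InvalidResponseError as e:
    # a subresource failed; the main document may still be usable
    print('Load error:', e)

html = session.body()   # whatever was loaded so far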

subprocess.py raise child_exception OSError: [Errno 2] No such file or directory

[Ubuntu 12.10 64bit, Py2.6, Py2.7]

pip install -e git+https://github.com/niklasb/dryscrape.git#egg=dryscrape
pip install -e git+https://github.com/niklasb/webkit-server.git#egg=webkit-server

    Python 2.6.7 (r267:88850, Aug 11 2011, 12:18:09)
    Type "copyright", "credits" or "license" for more information.

    IPython 0.13.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: import dryscrape

    In [2]: sess = dryscrape.Session(base_url = 'https://google.com')
    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-2-3781bbfdb247> in <module>()
    ----> 1 sess = dryscrape.Session(base_url = 'https://google.com')

    /home/dvs/dev/ENV/src/dryscrape/dryscrape/session.pyc in __init__(self, driver, base_url)
         16                driver = None,
         17                base_url = None):
    ---> 18     self.driver = driver or DefaultDriver()
         19     self.base_url = base_url
         20 

    /home/dvs/dev/ENV/src/dryscrape/dryscrape/driver/webkit.pyc in __init__(self, **kw)
         28   def __init__(self, **kw):
         29     kw.setdefault('node_factory_class', NodeFactory)
    ---> 30     super(Driver, self).__init__(**kw)

    /home/dvs/dev/ENV/src/webkit-server/webkit_server.pyc in __init__(self, connection, node_factory_class)
        204                node_factory_class = NodeFactory):
        205     super(Client, self).__init__()
    --> 206     self.conn = connection or ServerConnection()
        207     self._node_factory = node_factory_class(self)
        208 

    /home/dvs/dev/ENV/src/webkit-server/webkit_server.pyc in __init__(self, server)
        416   def __init__(self, server = None):
        417     super(ServerConnection, self).__init__()
    --> 418     self._sock = (server or get_default_server()).connect()
        419 
        420   def issue_command(self, cmd, *args):

    /home/dvs/dev/ENV/src/webkit-server/webkit_server.pyc in get_default_server()
        395   global default_server
        396   if not default_server:
    --> 397     default_server = Server()
        398   return default_server
        399 

    /home/dvs/dev/ENV/src/webkit-server/webkit_server.pyc in __init__(self, binary)
        368                                     stdin  = subprocess.PIPE,
        369                                     stdout = subprocess.PIPE,
    --> 370                                     stderr = subprocess.PIPE)
        371     output = self._server.stdout.readline()
        372     try:

    /usr/lib/python2.6/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
        621                             p2cread, p2cwrite,
        622                             c2pread, c2pwrite,
    --> 623                             errread, errwrite)
        624 
        625         if mswindows:

    /usr/lib/python2.6/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
       1139                     if fd is not None:
       1140                         os.close(fd)
    -> 1141                 raise child_exception
       1142 
       1143 

    OSError: [Errno 2] No such file or directory

NoX11Error on OS X Yosemite

Hello,

I'm getting a NoX11Error when attempting to use dryscrape on X11. I'm using python 2.7 (with anaconda), and I have XQuartz installed and running. The installations all went fine (Qt, pip, webserver-kit, etc.). Any help would be appreciated.

Johns-MacBook-Pro:dryscrape johngu$ echo $DISPLAY
/private/tmp/com.apple.launchd.57RQ2D0Hcg/org.macosforge.xquartz:0

/Users/john/anaconda/lib/python2.7/site-packages/webkit_server.pyc in get_default_server()
    429   global default_server
    430   if not default_server:
--> 431     default_server = Server()
    432   return default_server
    433

/Users/john/anaconda/lib/python2.7/site-packages/webkit_server.pyc in __init__(self, binary)
    407     self._port = int(re.search("port: (\d+)", output).group(1))
    408   except AttributeError:
--> 409     raise NoX11Error, "Cannot connect to X. You can try running with xvfb-run."
    410
    411   # on program termination, kill the server instance

webkit_server instances are not terminated after execution of test script

OS X LION install:

$ sudo port install qt4-mac-devel py27-lxml py27-pip
$ git clone https://github.com/niklasb/dryscrape.git dryscrape
$ cd dryscrape
$ sudo pip-2.7 install -r requirements.txt
$ sudo python setup.py install

Inserted DEBUG statement webkit_server.py:384 (per your suggestion)

  def kill(self):
    """ Kill the process. """
    import logging
    logging.basicConfig(level=logging.DEBUG, filename='/tmp/webkit_server_debug.log')
    self._server.terminate()
    logging.debug(self._server.__dict__)

Results of /tmp/webkit_server_debug.log:

DEBUG:root:{'_child_created': True, 'returncode': None, 'stdout': <open file '<fdopen>', mode 'rb' at 0x110670300>, 'stdin': <open file '<fdopen>', mode 'wb' at 0x110670270>, 'pid': 33785, 'stderr': <open file '<fdopen>', mode 'rb' at 0x110670390>, 'universal_newlines': False}

Hope that helps, let me know if I can give you any more details.

spawn a dryscrape session?

Dear Niklas,

is it somehow possible to spawn a dryscrape session? I am trying to scrape a website that generates links in javascript and seems to be very picky about referrers, cookies etc.

Best,
Arne

Login to facebook, cookies not enabled

I'm trying to use dryscrape to login to facebook and I'm getting "Cookies Required" (Cookies are not enabled on your browser, please enable them). How can I avoid this?
Here's my code:

    sess = dryscrape.Session(base_url="https://www.facebook.com/")
    sess.visit("/login")
    f = sess.at_xpath('//*[@name="email"]')
    f.set("...")
    f = sess.at_xpath('//*[@name="pass"]')
    f.set("...")
    f.form().submit()
    # sess.visit("/")
    sess.render("test.png")

cygwin error qmake: command not found

Hi,

Is it possible to install under Cygwin?

$ pip3 install dryscrape
Collecting dryscrape
Downloading dryscrape-1.0.tar.gz
Collecting webkit-server>=1.0 (from dryscrape)
Downloading webkit-server-1.0.tar.gz (41kB)
100% |████████████████████████████████| 45kB 32kB/s
Requirement already satisfied (use --upgrade to upgrade): lxml in /usr/lib/python3.4/site-packages (from dryscrape)
Collecting xvfbwrapper (from dryscrape)
Downloading xvfbwrapper-0.2.7.tar.gz
Building wheels for collected packages: dryscrape, webkit-server, xvfbwrapper
Running setup.py bdist_wheel for dryscrape ... done
Stored in directory: /home/mitenmehta/.cache/pip/wheels/c9/4a/4e/669ebb4c8c4a6c88b9446eb5263c20813d7e47609c3a0057c4
Running setup.py bdist_wheel for webkit-server ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-terbdf9w/webkit-server/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpm3038mqxpip-wheel- --python-tag cp34:
running bdist_wheel
running build
sh: qmake: command not found
error: [Errno 2] No such file or directory: 'src/webkit_server'


Failed building wheel for webkit-server
Running setup.py clean for webkit-server
Running setup.py bdist_wheel for xvfbwrapper ... done
Stored in directory: /home/mitenmehta/.cache/pip/wheels/a9/05/4e/30146b2288b3267a2d8675acf87be67fbccba251e44b946b72
Successfully built dryscrape xvfbwrapper
Failed to build webkit-server
Installing collected packages: webkit-server, xvfbwrapper, dryscrape
Running setup.py install for webkit-server ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-terbdf9w/webkit-server/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-wye88dkw-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
sh: qmake: command not found
error: [Errno 2] No such file or directory: 'src/webkit_server'

----------------------------------------

Command "/usr/bin/python3 -u -c "import setuptools, tokenize;file='/tmp/pip-build-terbdf9w/webkit-server/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-wye88dkw-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-terbdf9w/webkit-server

Disable JavaScript

Thanks for creating a great library!

I'm trying to scrape a page that shows a window alert with OK and Cancel buttons. Since I don't know if it's possible to handle this kind of event (session is always "clicking" on OK button and I would prefer it to click Cancel) I was wondering if you could implement "JavascriptEnabled" Webkit attribute? This would allow me to disable JavaScript at some point.

google search results right panel is not in session source file

Running the code below:

search_term = 'tokyo skytree'
sess = dryscrape.Session(base_url = 'https://www.google.co.jp')
sess.set_attribute('auto_load_images', True)
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()

print sess.source()

There is a Google knowledge panel on the right in the rendered png file, but I cannot find it in the session source file.

How can I fix this? Thanks a lot

Richard


multiprocessing with dryscrape

This is not an issue, but I'm sharing how to deal with Xvfb.

I am using celery to run multiple sessions of webkit-server. I needed to remove dryscrape.start_xvfb() from the code and put xvfb-run before the celery command, like this:

xvfb-run -a celery -A googlescrape worker --loglevel=INFO --concurrency=10 -Q googlescrape_task

I spent a couple of hours figuring this out and hope it saves somebody some time.

Django + Dryscrape

Hello,

I am using Django + Dryscrape to provide a web based on demand crawling. The code is as below

from django.shortcuts import render
from django.http import HttpResponse
from django.views.generic import View
from .forms import ScrapeForm

import dryscrape

class Scrape(View):
    def get(self, request):
        form = ScrapeForm()
        return render(request, 'index.html', {'form': form})

    def post(self, request, *args, **kwargs):
        dryscrape.start_xvfb()
        form = ScrapeForm(request.POST)
        if form.is_valid():

            try:
                sess = dryscrape.Session(base_url = form.data['BASE_URL'])
                sess.set_attribute('auto_load_images', True)
                # sess.set_timeout(30)

                sess.visit(form.data['BASE_URL'] + form.data['URL'])

                x = sess.wait_for_safe(lambda: sess.at_xpath(form.data['XPATH']))
                # x = sess.at_xpath(form.data['XPATH'])

                if x:
                    return HttpResponse(x.text())
                else:
                    return HttpResponse('No Element Found with the given xpath')

            except Exception as e:
                if(e.__doc__):
                    print e.__doc__

                if(e.message):
                    print e.message

                if (e.__doc__):
                    return HttpResponse('Scraping of page failed :: \n' + e.__doc__ +'\n'+ e.message)
                else:
                    return HttpResponse('Scraping of page failed');

        return render(request, 'index.html', {'form': form})

The problem I am facing right now is that, if the scraping fails for one url, the webkit server does not seem to restart unless I kill all services and restart again.

Is there a simple way that I can restart the webkit server when it crashes?

The error message

 Raised when the Webkit server closed the connection unexpectedly. 
Unexpected end of file

After the above error, unless I restart the services, the scraping does not work. The web app however is functional, presenting the error message Scraping of page failed.

Any solution would be highly helpful.

Note: This happens only on AWS ubuntu instance. Works fine on my Macbook Pro.

Thanks.

Could not set file input on EC2

I'm running dryscrape with xvfb-run and it works fine until I need to call something like:

value = "/home/user/img.jpg"
field = sess.at_css("#some_file_input_id")
field.set(value)
assert value == field.value() # Value unchanged on EC2

I'm able to set all others input types and the same code is working as expected on my OSX.

Should I provide further information?

Node API needs inspection wrt attributes

The current Node API lacks inspection functionality with respect to attributes.

Node.get_attr() requires one to know the attribute name, but there is no method for determining a node's attribute names. By parsing the source html it may be possible to obtain some attribute names, but finding those that are created dynamically in javascript is a much more difficult prospect.

If the node object is an attribute returned by Node.xpath('@*'), there is no method for finding the name of this attribute. Node.tag_name() returns the name only if the node is a tag, not an attribute. Extending this naming convention, there should be a Node.attr_name() that returns a string only when the node is an attribute node. Alternatively, tag_name() could be replaced with name(), which returns the name of the node regardless of whether it is a tag or attribute type.

Speeding up parsing with threads

I'm trying to parse a series of events, but doing it serially takes upwards of a minute since there are so many parts of each event to parse. I'd like to speed it up with threading; the only problem is that webkit apparently doesn't like that (see #9).

What's the best way to go about it? I can't call node.at_css() on objects in threads. Is it best to save the html using html_body = session.body().encode('ascii', 'ignore') and then use another parser like https://pythonhosted.org/cssselect/ (see the sketch below), or is there a better way using dryscrape?
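
One hedged approach along those lines: snapshot the HTML once with session.body() and hand the parsing to lxml (already a dryscrape dependency). The parsed lxml elements are plain in-process objects that never touch the webkit_server socket, so reading them from threads avoids the problem above (URL and class names are hypothetical):

import threading
import lxml.html
import dryscrape

dryscrape.start_xvfb()
session = dryscrape.Session()
session.visit('https://example.com')   # hypothetical URL

# one round-trip to webkit_server, then parse offline with lxml
doc = lxml.html.fromstring(session.body())

def parse_event(event):
    # read-only access to the parsed tree; no socket involved
    names = event.find_class('name')
    if names:
        print(names[0].text_content())

threads = [threading.Thread(target=parse_event, args=(event,))
           for event in doc.find_class('event')]
for t in threads:
    t.start()
for t in threads:
    t.join()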

Header

def parse_event(event):
    print event.at_css('.name')
    # more parsing
    return

import dryscrape
import threading
url = "https://example.com"
dryscrape.start_xvfb()
session = dryscrape.Session()
session.visit(url)
events = session.css('.event')

Serially, this works fine:

for event in events:
    parse_event(event)

With threads, it fails:

threads = []
for event in events:
    t = threading.Thread(target=parse_event, args=(event, ))
    threads.append(t)
    t.start()

Best practice for handling InvalidResponseError exceptions?

Sometimes sess.visit() throws exceptions similar to this:

Error while loading URL https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles_bubble_internal/rt=j/ver=8ruqBK5Rz68.en_GB./sv=1/am=!Ze6NnRS0VYCICGRMrA/d=1/rs=AItRSTMCkBLPuEGW-K5opwJvfmORrpspJQ: Operation canceled (error code 5)

And this:

Error while loading URL http://www.asite.com/images/poster/mirror-mirror.jpg: Error downloading http://www.asite.com/images/poster/mirror-mirror.jpg - server replied: Not Found (error code 203)

It looks like one or more elements on the page I requested failed to load. The exceptions are thrown before sess.wait() can be called, and are thrown despite having set sess.set_error_tolerant(True).

I might like to 1) retry the page with a specified timeout or 2) use the fetched page anyway, despite the absence of some images or nested frames.
What are the best practices for achieving 1 and 2?
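
A hedged sketch covering both, using only calls that appear elsewhere in this tracker (set_timeout is shown, commented out, in the Django issue below; InvalidResponseError is imported in another issue); the URL is hypothetical:

from webkit_server import InvalidResponseError
import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session()

def visit_with_retries(sess, url, retries=3, timeout=30):
    sess.set_timeout(timeout)
    for _ in range(retries):
        try:
            sess.visit(url)        # 1) retry on failure
            break
        except InvalidResponseError:
            pass
    return sess.body()             # 2) use whatever was loaded anyway

html = visit_with_retries(sess, 'http://example.com')   # hypothetical URL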

dryscrape visit failed

ENV:
  • OS: Debian wheezy
  • Python: 2.7.3

Issue:
dryscrape visits http://weibo.com/ but fails.
Source:

import dryscrape                                                                                                                                                      
url = 'http://weibo.com/'                                                        
sess = dryscrape.Session(base_url=url)                                           
sess.visit('/')                                                                  
sess.render('weibo.png')

I read the docs and googled it, but I can't solve it. Hope to hear from you :) Thank you!

404 error link in the page

Hi,

I am trying to scrape a site, but the site seems to have a link embedded within itself which turns out to be a 404. How should I handle that case?

I tried using set_html, but couldn't actually pinpoint how to use it. Please help.

Both cookies and javascript enabled to login?

Hi there,

I have problems logging into website which requires cookies and javascript enabled.

Check the screenshot below

http://i.imgur.com/dl5wFzl.jpg

Here's my code

import time
import dryscrape

username = '123456'
password = 'mypassword'

# setup a web scraping session
sess = dryscrape.Session(base_url = 'https://www.somewebsitedomain.com/Pages/Login/Login.aspx')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and log in
print "Logging in..."
sess.visit('/')

username_field = sess.at_css("#txtCustomer")
password_field = sess.at_css("#passwd")
btnlogin_field = sess.at_css("#btnLogin")

username_field.set(username)
password_field.set(password)

# username_field.form().submit() is not working here
# btnlogin_field.form().submit() can't work either

btnlogin_field.click() # this works, but see the issue which requires cookies and javascript enabled

print "Taking snapshot"
sess.render('website.png')

Is there any way to make this work? I have tested with Python Selenium + Firefox and it worked, but again, I need to run this on a server, and Python Selenium with PhantomJS hits the same issue. I was wondering whether your solution could help.

I need to run this script from an ubuntu 14.04 LTS server (not desktop).

Any help? Thanks.

Handle multiple sessions ?

Is there any way to handle multiple sessions?
The following code doesn't work:

from dryscrape import Session

sess1 = Session(base_url='http://www.google.com')
sess2 = Session(base_url='http://www.yahoo.com')

sess1.visit('/')
sess2.visit('/')

sess1.render('sess1.png')
sess2.render('sess2.png')

Both images are from the second URL.
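
A hedged workaround pieced together from the tracebacks in other issues here: by default, ServerConnection() falls back to one global webkit_server process shared by every Session, so give each session its own Server/ServerConnection/Driver chain. This is an untested sketch based on those constructor signatures, not documented API:

import dryscrape
import webkit_server
from dryscrape.driver.webkit import Driver

def make_session(base_url):
    server = webkit_server.Server()                      # dedicated browser process
    conn = webkit_server.ServerConnection(server=server)
    return dryscrape.Session(driver=Driver(connection=conn),
                             base_url=base_url)

sess1 = make_session('http://www.google.com')
sess2 = make_session('http://www.yahoo.com')

sess1.visit('/')
sess2.visit('/')

sess1.render('sess1.png')
sess2.render('sess2.png')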

How to use multiple proxies?

How would I make use of multiple proxy servers with dryscrape? The documentation only describes adding a single proxy server, but what if I want to use several?

If dryscrape supports only a single proxy server, what would be the best way to do it?
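
dryscrape holds one proxy at a time, but a hedged sketch is to switch proxies between visits. The set_proxy call and its signature are assumptions inferred from the SetProxy command visible in webkit-server's build output elsewhere on this page; the proxy addresses and URLs are hypothetical:

import itertools
import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session()

# hypothetical proxy pool, cycled round-robin
proxies = itertools.cycle([('10.0.0.1', 8080), ('10.0.0.2', 8080)])
urls = ['http://example.com/a', 'http://example.com/b']   # hypothetical

for url in urls:
    host, port = next(proxies)
    sess.set_proxy(host=host, port=port)   # assumed call, one proxy at a time
    sess.visit(url)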

Lot of Xvfb processes after using start_xvfb()

I am scraping a lot of pages using multiprocessing.
If I use dryscrape.start_xvfb() before initializing the session, I am left with a lot of Xvfb processes which consume a significant chunk of RAM.
Is there a way to close/reuse Xvfb?
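
One hedged option, based on the xvfbwrapper usage visible in other issues here: manage the display explicitly so each worker stops its own Xvfb when done, instead of calling dryscrape.start_xvfb() (which is never stopped). The URL is hypothetical:

import dryscrape
from xvfbwrapper import Xvfb

vdisplay = Xvfb()      # same wrapper dryscrape.start_xvfb() uses
vdisplay.start()
try:
    sess = dryscrape.Session()
    sess.visit('http://example.com')   # hypothetical URL
    print(sess.body()[:80])
finally:
    vdisplay.stop()    # terminates this worker's Xvfb process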

Scrape infinity scrolling pages

Hello,

Thanks for the library.

I am trying to extract content from pages that scroll infinitely. Can you please let me know how I can achieve this? I did not find an API method to scroll after I have reached the end of a page/section.

Thanks.
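
There is no dedicated scroll call visible in this tracker, but a hedged sketch is to drive the scroll from JavaScript. The exec_script call is an assumption inferred from the Execute command in webkit-server's build output elsewhere on this page; the URL is hypothetical:

import time
import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session()
sess.visit('http://example.com/feed')   # hypothetical URL

for _ in range(10):   # load ten more batches of content
    sess.exec_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)     # crude: give the triggered requests time to finish

html = sess.body()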

Crashes when visiting YouTube

There seems to be a bug where visiting http://www.youtube.com/ crashes dryscrape and can freeze the entire computer, requiring a reboot. I'm not sure of the cause, and I don't have any error messages to show because, as I said, it completely freezes my computer and I have to restart it. I've tried other websites such as Google and they all work fine, but for some reason it can't handle YouTube.

Problem in serializing the Node.

I am trying to serialize a Node and store it in redis (I am crawling a website and using redis as a queue), but the Node class (webkit_server.Node) does not contain the functions __getstate__ and __setstate__, so pickling fails. (I learned this from Stack Overflow and also tested it on my own custom class.)

I am using the following code.

import pickle
pickle.dumps(node)

This is more of a query than an issue; since I couldn't find your email address, I am contacting you here.
Please help as to what the __getstate__ and __setstate__ functions should be.

I am also pasting a sample code which worked when __getstate__ and __setstate__ were added.
http://pastebin.com/xK20DRyD

how to scrape inside objects?

How can I use this library inside objects?
I found that whenever I use dryscrape inside xvfbwrapper and there are two instances, a connection refused error occurs.

Example - 1 object

#!/usr/bin/env python
import dryscrape
from xvfbwrapper import Xvfb

class SomeObj(object):
    def __init__(self):
        self.url="http://www.google.com/"
        self.timeout=20

    def fetch_page(self):
        with Xvfb() as xfvb:
            sess = dryscrape.Session(base_url = self.url)
            sess.visit('/')
            print sess.body()

a=SomeObj()
a.fetch_page()

(Outputs html, works OK)

Example 2-two objects

#!/usr/bin/env python
import dryscrape
from xvfbwrapper import Xvfb

class SomeObj(object):
    def __init__(self):
        self.url="http://www.google.com/"
        self.timeout=20

    def fetch_page(self):
        with Xvfb() as xfvb:
            sess = dryscrape.Session(base_url = self.url)
            sess.visit('/')
            print sess.body()

a=SomeObj()
b=SomeObj()
a.fetch_page()
b.fetch_page()

(Outputs html)

Traceback (most recent call last):
  File "objecttest.py", line 19, in <module>
    b.fetch_page()
  File "objecttest.py", line 12, in fetch_page
    sess = dryscrape.Session(base_url = self.url)
  File "/home/carl/workspace/env2/lib/python2.7/site-packages/dryscrape/session.py", line 18, in __init__
    self.driver = driver or DefaultDriver()
  File "/home/carl/workspace/env2/lib/python2.7/site-packages/dryscrape/driver/webkit.py", line 30, in __init__
    super(Driver, self).__init__(**kw)
  File "/home/carl/workspace/env2/lib/python2.7/site-packages/webkit_server.py", line 225, in __init__
    self.conn = connection or ServerConnection()
  File "/home/carl/workspace/env2/lib/python2.7/site-packages/webkit_server.py", line 444, in __init__
    self._sock = (server or get_default_server()).connect()
  File "/home/carl/workspace/env2/lib/python2.7/site-packages/webkit_server.py", line 414, in connect
    sock.connect(("127.0.0.1", self._port))
  File "/usr/lib64/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

How can I solve this issue? Do I need to use subprocess and have the fetch_page function in a separate file? Thanks.

Ubuntu-Server 11.10 (NoX11Error)

Hope to get this working on my Linux server, because it works a treat on my Mac.

Traceback (most recent call last):
  File "drydemo.py", line 6, in <module>
    sess = dryscrape.Session(base_url = 'http://google.com')
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 18, in __init__
    self.driver = driver or DefaultDriver()
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/driver/webkit.py", line 30, in __init__
    super(Driver, self).__init__(**kw)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 206, in __init__
    self.conn = connection or ServerConnection()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 418, in __init__
    self._sock = (server or get_default_server()).connect()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 397, in get_default_server
    default_server = Server()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 375, in __init__
    raise NoX11Error, "Cannot connect to X. You can try running with xvfb-run."
webkit_server.NoX11Error: Cannot connect to X. You can try running with xvfb-run.

Thanks!

Requesting a page, then visiting another causes issues

Sorry, this is a question. There seems to be a problem when requesting a page to scrape a particular link and then choosing to visit that link: the page does not render or parse correctly. It seems I have to create a second session, but xpath is not parsing it correctly.

For example:

sess = dryscrape.Session(base_url = 'host')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and search for a term
sess.visit('/path')

links = sess.xpath('//a[contains .. ]')
link = links[0]["href"]

time.sleep(10)


sess = dryscrape.Session(base_url = 'host')

sess.visit(link)

 sess.xpath("//div[@class='searchitem']")

This is a problem: I have to parse the whole body first, like

tree = fromstring(sess.body())

Unfortunately, clicking on the link does not work; it has to be visited with the visit method.

Is there a special way to reuse the session so xpath works?

How to detect Window.alert

Hi,

I am using this awesome library to scrape a site. However, I was wondering: how do we detect a window.alert box when a form is submitted and the box appears on the next page load as the first element, or when $(document).ready kicks in?

This will be a ton of help,
Regards,
Wick

EndOfStreamError and SocketError when visiting multiple urls during one session

I'm trying to scrape the content from a large list of urls. However, dryscrape seems to throw EndOfStreamErrors ('Unexpected end of file') or SocketErrors ('Connection closed by peer') after going through a number of them (usually around 20-30, but it varies). I tried catching these exceptions and telling it to just continue on with the rest of the list, but all remaining url visits result in more SocketErrors ('Broken pipe').

Here's some sample dumbed down code:

import dryscrape
dryscrape.start_xvfb()
sess = dryscrape.Session()

list_of_urls = ['http://www.abc.com', etc...]

for url in list_of_urls:
    try:
        sess.visit(url)
        print url + ': ' + sess.body()[:50]
    except:
        continue

Is there something I'm doing incorrectly that causes these errors? It seems to happen pretty consistently after around the same number of urls processed, with some urls causing more problems than others.

Thanks in advance!

Catch "save file" dialog box

I'm not sure if I'm asking this correctly. I'm trying to download a file from a javascript page. I've successfully clicked the right series of nodes. The page shows a message like "Building file to download..." and then (in Chrome at least) a "Save to..." dialog box appears. I've got the first part working fine: rendering shows that the message appears and then disappears. However, how do I catch the save file dialog box?

I've looked through the source code and the only thing that looks promising is session.set_attribute('local_storage_enabled'). I assume that is required but I can't figure out how to actually save to the local storage.

most recent version on pip?

I found differences between what is in xvfb.py here on GitHub and what is on PyPI. Any chance of getting the new version onto PyPI as well?

Strange behavior on opening complicated url

I use dryscrape v1.0 as an instrument to download stack traces from Google Play. I downloaded the whole crash report for some period of time and wanted to download the page with the stack traces.

And I ran into strange behavior. When the url is:

https://play.google.com/apps/publish?dev_acc=18149679673077794436#ErrorClusterDetailsPlace:p=com.android&lr=LAST_6_MONTHS&sh=false&s=new_status_desc&ed=1454285015339&et=CRASH&ecn=java.lang.NullPointerException&tf=Uri.java&tc=android.net.Uri$StringUri&tm=%3Cinit%3E

It opens https://play.google.com/apps/publish/?dev_acc=18149679673077794436#AppListPlace instead of url above.

But at the same time, if the url is:

https://play.google.com/apps/publish/?dev_acc=18149679673077794436#ErrorClusterDetailsPlace:p=com.android&lr=LAST_6_MONTHS&sh=false&s=new_status_desc&ed=1454285015339&et=CRASH&ecn=java.lang.NullPointerException&tf=Uri.java&tc=android.net.Uri$StringUri&tm=%3Cinit%3E


(which differs only in the slash before the question mark), everything works normally.

Code is:

def downloadStackTraceByLink(link, session, i):
    # some black magic
    #if link.find("publish/") == -1:
    #   link = link.replace("publish", "publish/")

    session.visit(link)

    # sleep a bit to give the page a chance to load.
    # This is ugly; it would be better to find something
    # on the resulting page that we can wait for
    time.sleep(10)

    if link != session.url():
        print("WTF DUDE! Current link is: " + session.url() + "\n but was " + link)
    else:
        print("Ok " + str(i))

    session.driver.render('screenshot ' + str(i) + '.jpg')

When login code is:

from dryscrape import dryscrape

class SessionGoogle:
    def __init__(self, url_login, login, passwd):
        self.ses = dryscrape.Session()
        self.ses.visit(url_login)

        login = self.ses.at_xpath('//*[@id="Email"]').set(login)
        password = self.ses.at_xpath('//*[@id="Passwd"]').set(passwd)

        login_button = self.ses.at_xpath('//*[@id="signIn"]').click()
        self.ses.driver.render('login_result.png')

    def getSes(self):
        return self.ses

url_login = "https://accounts.google.com/ServiceLogin"

Memory usage of webkit never stops growing

First of all thanks for this great package,

The memory usage of the webkit_server process seems to increase with each call to session.visit().
It happens to me with the following script:

import dryscrape
import time


dryscrape.start_xvfb()
session = dryscrape.Session()
session.set_attribute('auto_load_images', False)

while 1:    
    print "Iterating"
    session.visit("https://www.google.es")
    html_source = session.body()
    time.sleep(5)

I see the memory usage with this command:

ps -eo size,pid,user,command --sort -size | grep webkit | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=1 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'

Maybe I'm not doing something right?

python3 support issue

  1. In dryscrape's __init__.py, line 1:
    from .session import *
    I recommend not using the relative form; use instead:
    from dryscrape.session import *
    import dryscrape.driver
  2. In dryscrape's session.py, line 1:
    import urlparse
    should be changed to:
    from urllib.parse import urlparse, urljoin

    and line 34:
    return urlparse.urljoin(self.base_url, url)
    should be changed to:
    return urljoin(self.base_url, url)
  3. In dryscrape's mixins.py, line 100:
    raise WaitTimeoutError, "wait_for timed out"
    should be changed to:
    raise WaitTimeoutError("wait_for timed out")

This is the corresponding webkit_server issue for Python 3:

niklasb/webkit-server#15

With these changes it now runs under Python 3.

socket.error: [Errno 32] Broken pipe

First of all, thank you for the opportunity to parse javascript-powered pages :)
But I have a problem: some urls crash my application without an exception message.
It just prints 'Killed' and terminates. For example:

>>> sess.at_xpath("id('main')/div/div[2]/div/div[2]/ul/li[2]/a").text()
'Return to a previous version of Yahoo! Mail'
>>> sess.at_xpath("id('main')/div/div[2]/div/div[2]/ul/li[2]/a").get_attr('href')
'http://us.mg6.mail.yahoo.com/neo/optOut?rs=1&ncrumb=EoJFBwCfiHm'
>>> sess.visit('http://us.mg6.mail.yahoo.com/neo/optOut?rs=1&ncrumb=EoJFBwCfiHm')
Killed
~/#

The same happens when calling sess.at_xpath("id('main')/div/div[2]/div/div[2]/ul/li[2]/a").click().

What does 'Killed' mean?
How can I avoid it?

El Capitan + Dryscrape

I've been stuck on this issue for a while and I'm wondering if this is a compatibility issue with El Capitan. Running a script with dryscrape yields this error:

Traceback (most recent call last):
  File "concat.py", line 12, in <module>
    dryscrape.start_xvfb()
  File "/Users/sethkranzler/Development/cornell_project/cornell_scrape/env/lib/python2.7/site-packages/dryscrape/xvfb.py", line 9, in start_xvfb
    xvfb.start()
  File "/Users/sethkranzler/Development/cornell_project/cornell_scrape/env/lib/python2.7/site-packages/xvfbwrapper.py", line 53, in start
    stderr=fnull)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

This script was working perfectly fine before I upgraded, and it works fine on my Raspberry Pi 2.

Any idea what could be causing this?

user-agent switching

Hey,
Just a short question: how do I change the user agent in dryscrape?

Sorry if this is the wrong place to ask, but I didn't find another place :)
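
A hedged sketch: the set_header call is an assumption inferred from the Header/Headers commands in webkit-server's build output on this page and the header-name normalization mentioned in the 1.0 changelog; the base URL and user-agent string are illustrative:

import dryscrape

dryscrape.start_xvfb()
sess = dryscrape.Session(base_url='http://example.com')   # hypothetical URL
# assumed call: send a custom User-Agent header with subsequent requests
sess.set_header('User-Agent',
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36')
sess.visit('/')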

Method Not Allowed (error code 202)

This is my code:

import dryscrape


# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://www.udacity.com/')

# there are some failing HTTP requests, so we need to enter
# a more error-resistant mode (like real browsers do)
sess.set_error_tolerant(True)

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and log in
print "Logging in..."
sess.visit('/')

email_field = sess.at_xpath('//input[@name="email"]')
print email_field
password_field = sess.at_xpath('//input[@name="password"]')
print password_field

email_field.set(USERNAME)
password_field.set(PASSWORD)
email_field.form().submit()

And this is the output:

Logging in...
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']/div[1]/input[1]>
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']/div[1]/input[2]>
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']>
Traceback (most recent call last):
  File "prova.py", line 30, in <module>
    email_field.form().submit()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 97, in submit
    self.client.wait()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 224, in wait
    self.conn.issue_command("Wait")
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 429, in issue_command
    return self._read_response()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 438, in _read_response
    raise InvalidResponseError, self._read_message()
dryscrape.driver.webkit_server.InvalidResponseError: Error while loading URL http://www.udacity.com/: Error downloading http://www.udacity.com/ - server replied: Method Not Allowed (error code 202)

Any suggestions for resolving this problem?
