lorien / grab


Web Scraping Framework

Home Page: https://grab.readthedocs.io

License: MIT License

Python 99.46% Makefile 0.54%

web-scraping http-client framework python pycurl asynchronous network urllib3 spider crawler

grab's Introduction

Grab Framework Project

Status of Project

I have not used Grab myself for many years, and I am not sure anybody uses it at present. Nonetheless, I decided to refactor the project, just for fun. I have annotated the whole code base with mypy type hints (in strict mode). The whole code base also complies with pylint and flake8 requirements. There are a few exceptions: very large methods and classes with too many local attributes and variables. I will refactor them eventually.

The current and the only network backend is urllib3.

I have refactored a few components into external packages: proxylist, procstat, selection, unicodec, user_agent

Feel free to give feedback in Telegram groups: @grablab and @grablab_ru

Things to be done next

Refactor source code to remove all pylint disable comments like:
- too-many-instance-attributes
- too-many-arguments
- too-many-locals
- too-many-public-methods
Reach 100% test coverage (it is about 95% now)
Release a new version to PyPI
Refactor more components into external packages
More abstract interfaces
More data structures and types
Decouple connections between internal components

Installation

This installs the old Grab release from 2018: pip install -U grab

The updated Grab available in the GitHub repository is not compatible at all with spiders and crawlers written for the 2018 release.

Documentation

Updated documentation is here: https://grab.readthedocs.io/en/latest/ Most updates remove content related to features I have removed from Grab since 2018.

Documentation for the old Grab version 0.6.41 (released in 2018) is here: https://grab.readthedocs.io/en/v0.6.41-doc/

grab's People

Contributors

Stargazers

Watchers

Forkers

arunahk vs69 artemzi signaldetect enchantner olp-cs brabadu ogrishman weixu8 arturfis moveax kiss2013 tri0l alexondi keyintegrity mrmichalis pombredanne shamcode larvs prostobv qqalexqq craw1er subeax ilushko lyj830818 moonshard kuznitsin salvator scaurus ascii1011 planeta sergithon kalombos spikevlg s-y unixwars neveter skynic julia-bikova topwebmaster abaelhe kolexiang smant modulexcite gwynnbleidd1984 ddshadoww walter211 dipsec vkolev securextools sergiogarciadev-forks sunyinhui cinderalla codevlabs huiyi1990 cybernetics coreilabs giserh benjamesbabala vanms1989 afthill flyeven hasantayyar kentheteck alihalabyah mcmazur noscripter excoriate hleadery winterlike edinunzio spaceappsxploration blessingd136 jrragan fors3cdream imoapps calm4wei jayuloy raybuhr plucena24 kevinlondon panyzzing zpcooper kosogistan darktel codetasks marslabtron stasonhub fangjintang1989 rblack heyarturo gonchik totalboy goldgraal cbxcube wyrover jlzs justangel mulepiemmason wesleydevlab

grab's Issues

Error in the documentation about the pycurl handle

The docs contain this block:

Working with the pycurl handle
If you need a pycurl feature that Grab has no interface for, you can work with the pycurl handle directly. Example:

from grab import Grab
import pycurl

g = Grab()
g.curl.setopt(pycurl.RANGE, '100-200')
g.go('http://some/url')

http://docs.grablib.org/grab/misc.html#pycurl

But the latest version has no g.curl property; as far as I understand, the handle is now stored in g.transport.curl.

English translation of documentation

The current Grab documentation is in Russian. It should be translated into English.


Spider works incorrectly.

Spider code:
http://dumpz.org/836121/
Test URL: http://genomictechnicalsolutions.com/angazo/index.php/using-joomla/extensions/components/users-component/registration-form

Grab from PyPI works fine. After installing from GitHub, Joomla reports "session has been expired".

The root element doesn't change.

for item in grab.doc.select('//div[@class="item"]'):
    a_name = item.select('//div[@class="nameRus"]/a').text()
    datetime = item.select('//div[@class="date"]').text()

In this example, '//' always searches from the beginning of the HTML document. I think '//' in item.select should search only within the item's own chunk of HTML.
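In standard XPath, a path that starts with '//' is absolute, while './/' is relative to the current node. The following is a minimal sketch of the relative form; it uses the standard library's ElementTree rather than Grab's selector, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mirrors the structure from the report.
HTML = """
<root>
  <div class="item"><div class="nameRus"><a>First</a></div></div>
  <div class="item"><div class="nameRus"><a>Second</a></div></div>
</root>
"""

root = ET.fromstring(HTML)
names = []
for item in root.findall(".//div[@class='item']"):
    # './/' is evaluated relative to `item`, so only this item's link matches.
    link = item.find(".//div[@class='nameRus']/a")
    names.append(link.text)

print(names)
```

With the relative form, each iteration sees only the link inside its own item instead of the first link of the whole document.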

Raise exception if @func_field is called without "()"

Raise an exception in the following case:

@func_field
def some_func(self, sel):
    pass

Should be:

@func_field()
def some_func(self, sel):
    pass
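One way to detect the missing parentheses is to check whether the decorator factory received the decorated function as its only positional argument. The sketch below uses a hypothetical stand-in named func_field; it is not the real implementation, and the keyword-only options are an assumption.

```python
import functools


def func_field(*args, **options):
    # Hypothetical stand-in for the real decorator factory.
    # Used as "@func_field" (no parentheses), Python passes the decorated
    # function itself as the single positional argument.
    if len(args) == 1 and callable(args[0]) and not options:
        raise TypeError("use @func_field() with parentheses, not @func_field")
    if args:
        raise TypeError("func_field accepts keyword options only")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*call_args, **call_kwargs):
            return func(*call_args, **call_kwargs)
        return wrapper

    return decorator
```

The check fires at class definition time, so the mistake surfaces immediately rather than when the field is first accessed.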

One .select is ignored in a for loop

Suppose we have this loop:

for elem in g.doc.select('//table[contains(@width, "100")]')[4].select('//td[contains(@valign, "middle")]'):
    print(elem.text())

and suppose the document has 6 tables with width equal to 100, each containing a td with valign equal to middle.
The code implies that I want the text only from the fifth table (index 4), but I get it from all of them, i.e. the code above is equivalent to this:

for elem in g.doc.select('//td[contains(@valign, "middle")]'):
    print(elem.text())

I don't think it should work this way.

I'm using the latest Grab from this repository.

Redirect send cookie bug

Redirect here means HTTP 301 and 302 redirects and meta refresh redirects.
For example:
a request to www.a.com saves the cookie a=123.

Then www.a.com redirects to www.b.com.

Whether reuse_cookies = True or reuse_cookies = False,

the cookies are sent to www.b.com. These are different domains, yet the cookie of www.a.com is sent to www.b.com.
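The expected behavior can be sketched as a domain check applied before each request, including requests that follow a redirect. The helper names and the cookie dict layout below are hypothetical, and the matching rule is a simplified version of browser domain matching (it ignores IP addresses and public suffixes).

```python
def domain_matches(host, cookie_domain):
    """Return True if a cookie set for cookie_domain may be sent to host."""
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    # Exact match, or host is a subdomain of the cookie's domain.
    return host == cookie_domain or host.endswith("." + cookie_domain)


def cookies_for_host(cookies, host):
    """Select only the cookies whose domain matches the target host."""
    return {
        cookie["name"]: cookie["value"]
        for cookie in cookies
        if domain_matches(host, cookie["domain"])
    }


jar = [{"name": "a", "value": "123", "domain": "www.a.com"}]
print(cookies_for_host(jar, "www.a.com"))
print(cookies_for_host(jar, "www.b.com"))
```

Filtering per request rather than once per session is what keeps the www.a.com cookie from leaking to www.b.com after the redirect.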

I am sorry for my poor English.


TypeError: 'bool' object is not iterable

XPath for an empty element check :

content = doc.select(
    u'boolean(//div[contains(@class, "b-serp-list")]/node())'
)

Trace:

File "parser.py", line 62, in go
    u'boolean(//div[contains(@class, "b-serp-list")]/node())'
  File "/development/env/lib/python2.7/site-packages/grab/ext/doc.py", line 15, in select
    return XpathSelector(self.grab.tree).select(*args, **kwargs)
  File "/development/env/lib/python2.7/site-packages/grab/selector/selector.py", line 160, in select
    selector_list = self.wrap_node_list(self.process_query(query), query)
  File "/development/env/lib/python2.7/site-packages/grab/selector/selector.py", line 167, in wrap_node_list
    for node in nodes:
TypeError: 'bool' object is not iterable
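XPath expressions such as boolean(...), count(...), or string(...) evaluate to a scalar rather than a node set, and lxml returns a plain Python bool, float, or string for them. A minimal sketch of a guard that could run before the result is iterated; the function name is hypothetical and this is not the library's actual fix.

```python
def normalize_xpath_result(result):
    """Always return a list, whether the query yielded nodes or a scalar."""
    # Strings are iterable too, so scalars must be checked before iterating.
    if isinstance(result, (bool, int, float, str)):
        return [result]
    return list(result)
```

Wrapping the scalar in a one-element list keeps the calling loop unchanged.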

The version on PyPI has a small bug

PyPI currently hosts version 0.4.12, which contains a bug.

File "/opt/realty/env/local/lib/python2.7/site-packages/grab/spider/task.py", line 186, in clone
    task.url = url
NameError: global name 'url' is not defined

This error is fixed in the current version.

Are CSS selectors no longer supported?

It is unclear, and a quick look did not reveal any tests either.

Update docs about proxies

Please, update docs about proxies
https://www.evernote.com/shard/s168/sh/73409195-9dae-4f65-85aa-0e7e77703a02/9ecf57575cb6dd33f9c5a11a49d012e3


Add the response body to exceptions

Sometimes exceptions occur while searching for data in the response body. It would be useful to be able to get the response body that was being searched by regexp, xpath, or text. That would make it possible to fix errors quickly. This is especially relevant when parsing sites whose markup changes fairly often.


Error pycurl.error 48 pycurl.COOKIEFILE

Hello!

I am trying a simple example:
from grab import Grab
g = Grab()
resp = g.go('http://livejournal.com/')

and get this error:
File "/usr/local/lib/python2.7/site-packages/grab/transport/curl.py", line 328, in process_config
self.curl.setopt(pycurl.COOKIEFILE, '')
pycurl.error: (48, '')
FreeBSD 9.1 curl 7.33

How to fix this?

Grab fails with AttributeError: 'NoneType' object has no attribute 'unicode_runtime_body'

Code:

from grab import Grab
g = Grab()
print g.tree

...error occurs: AttributeError: 'NoneType' object has no attribute 'unicode_runtime_body'

Grab fails even if I only want to check whether g.tree is None:
g.tree is None
... and the error occurs anyway.

The faulty code is at the bottom of the stack trace:

        if self._lxml_tree is None:
            body = self.response.unicode_runtime_body(
                fix_special_entities=self.config['fix_special_entities']
            ).strip()

You shouldn't call any self.response methods if there has been no response.
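A minimal sketch of the guard being asked for: check for a missing response first and raise a descriptive error instead of an AttributeError. The Client class and NoResponseError are hypothetical stand-ins, not Grab's real classes, and the "parsing" is a placeholder.

```python
class NoResponseError(Exception):
    """Raised when document data is requested before any request was made."""


class Client:
    # Hypothetical stand-in for a Grab-like object.
    def __init__(self):
        self.response = None  # set after a successful request
        self._tree = None     # lazily built parse result

    @property
    def tree(self):
        if self.response is None:
            raise NoResponseError("no response yet: make a request first")
        if self._tree is None:
            # Placeholder for real parsing of the response body.
            self._tree = self.response.strip()
        return self._tree
```

Whether the property should raise or return None is a design choice; raising makes the misuse visible at the point where it happens.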


Replacement for the request_pause parameter

Good day. I noticed that the request_pause parameter was removed, declared deprecated, and is no longer used anywhere. It was very much needed. Please advise how to make the spider work without it so that it sends requests not continuously but with a pause of, say, 3-5 seconds. Otherwise site owners may ban such activity.

Thanks
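Independently of the Spider API, a pause between sequential requests can be sketched as a generator that sleeps for a random interval before yielding each item after the first. This is a generic illustration with hypothetical names, not a replacement built into Grab, and it does not cover concurrent downloads.

```python
import random
import time


def throttled(items, min_pause=3.0, max_pause=5.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval between them."""
    first = True
    for item in items:
        if not first:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_pause, max_pause))
        first = False
        yield item
```

A loop such as `for url in throttled(urls): ...` would then issue one request every 3 to 5 seconds; the sleep function is injectable so the pauses can be skipped in tests.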

Exception if the cookiefile file does not exist

Grab should not raise an exception if the file specified in the cookiefile option does not exist. Grab should ignore a nonexistent file.
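The requested behavior might look like the following sketch, assuming cookies are stored as a JSON list. The function name is hypothetical; treating an empty file as an empty jar is an extra assumption, not something the report asks for.

```python
import json
import os


def load_cookies(path):
    """Load cookies from a JSON file, treating a missing file as empty."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as inp:
        content = inp.read().strip()
    # Assumption: an empty file also means "no cookies yet".
    return json.loads(content) if content else []
```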

Open file leak

With body_inmemory = False and the curl transport, open files are leaked.
The cause is that the handle returned by tempfile.mkstemp is ignored. This handle already refers to an open file, which must be closed.
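tempfile.mkstemp returns a tuple of an already open OS-level file descriptor and the path. A minimal sketch of using that descriptor instead of discarding it; the function name is hypothetical.

```python
import os
import tempfile


def save_body_to_tempfile(body):
    """Write bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    # os.fdopen wraps the descriptor returned by mkstemp; leaving the
    # with-block closes it, so no descriptor is leaked.
    with os.fdopen(fd, "wb") as out:
        out.write(body)
    return path
```

If the file is opened separately by path instead, the original descriptor still has to be released with os.close(fd).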

SpiderMisuseError: Could not resolve relative URL because base_url is not specified

This exception should not be fatal, it should not stop all spider processes.

Errors in the internal documentation

For example, /grab/spider/base.py : 123

    * network_try_limit - How many times try to send request
        again if network error was occuried, use 0 to disable
    * network_try_limit - Limit of tries to execute some task
        this is not the same as network_try_limit
        network try limit limits the number of tries which
        are performed automaticall in case of network timeout
        of some other physical error
        but task_try_limit limits the number of attempts which
        are scheduled manually in the spider business logic

Perhaps the second occurrence should be "task_try_limit"? As written, "network_try_limit - ...is not the same as network_try_limit" :)

Right after that, /grab/spider/pattern.py : 39

def process_next_page(self, grab, task, xpath, resolve_base=False, **kwargs):
    """
    Generate task for next page.
     ...
    Example::
        self.follow_links(grab, 'topic', '//div[@class="topic"]/a/@href')
    """

And so on, although typos in examples are minor!

Selector class is deprecated. What should replace it?

I needed to use select(xpath) on arbitrary HTML. I decided to follow the approach from the Habr article http://habrahabr.ru/post/173509/ , but it turns out that the Selector class is already deprecated (Selector class is deprecated. Please use XpathSelector class instead).

What should I do, and what should I read?

Strange behavior of clean_html

I noticed strange behavior in clean_html:

For example, we have HTML like

Text_part1
<img src="test_img.jpg" width="100%" alt="Test image" />
Text_part2

After applying clean_html(html, safe_attrs=('src', 'href')), the img disappears even though the src attribute is allowed. This probably happens because the other attributes present on the img are not allowed.

Perhaps the tag should not be removed if it has at least one allowed attribute?


charset and document_charset

All the documentation I found says that the character set of a page should be specified manually with Grab's charset parameter. This is misleading: judging by the code, the document_charset parameter is actually responsible for that, while charset is the encoding into which the downloaded page should be converted. I think this should be corrected and explained.

Error: select.error: (22, 'Invalid argument')

Grab has started to raise this error occasionally. It definitely did not happen with older versions, since this script is old.

Traceback (most recent call last):
  File "/root/yelp_spider/test_proxy.py", line 100, in <module>
    main()
  File "/root/yelp_spider/test_proxy.py", line 88, in main
    p.run()
  File "/root/yelp_spider/grab/spider/base.py", line 945, in run
    self.transport.process_handlers()
  File "/root/yelp_spider/grab/spider/transport/multicurl.py", line 68, in process_handlers
    select.select(rlist, wlist, xlist, timeout / 1000.0)
select.error: (22, 'Invalid argument')

Code: http://dumpz.org/1129253/
Grab objects are created explicitly there in order to test each proxy exactly once.

The bug reproduces when the last task fails with a timeout.
The following parameters are passed to select.select:

[24, 25] [] [] -0.001

I suspect this is caused by the negative value. Please check.
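select.select rejects a negative timeout, so the millisecond value needs to be sanitized before the division. The sketch below assumes that a negative value means "no timeout suggested" and substitutes a default; the function name and the 1.0 second default are hypothetical choices.

```python
def select_timeout_seconds(timeout_ms, default=1.0):
    """Convert a millisecond timeout into a value safe for select.select."""
    if timeout_ms < 0:
        # Assumption: negative means no timeout was suggested.
        return default
    return timeout_ms / 1000.0


print(select_timeout_seconds(-1))
print(select_timeout_seconds(250))
```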

POST request bug on Windows x64

With 32-bit systems it is simpler: a kind person built curl for them. Could you do the same for x64? Our whole team is struggling because of this bug. The library is powerful and a lot of work has gone into it, but this bug makes it useless. Alternatively, could it be built not on curl but on sockets, for example?

Handling "Could not resolve host"

How can I handle the error: ERROR:grab.spider.base:Could not resolve host

I create a task for the Spider:

yield Task('check_domain', url=url)

But if the domain does not resolve, the task is not executed. How can I handle such an error?

P.S.
I realize I may have described the problem very abstractly. I can give details, I just don't know which ones :)

Document.copy and AttributeError: 'NoneType' object has no attribute 'config'

resp = g.go('http://google.com')
resp1 = resp.copy()
resp.select('//div')
<grab.selector.selector.SelectorList object at 0x7fed2275c808>
resp1.select('//div')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "../python3.4/site-packages/grab/document.py", line 490, in select
    return XpathSelector(self.tree).select(*args, **kwargs)
  File "../python3.4/site-packages/grab/document.py", line 351, in tree
    if self.grab.config['content_type'] == 'xml':
AttributeError: 'NoneType' object has no attribute 'config'

possible solution

def copy(self):
    return copy.deepcopy(self)

Docs search does not work on website http://docs.grablib.org/

Example:
http://docs.grablib.org/search.html?q=test&check_keywords=yes&area=default
Return:

404 Not Found

nginx/1.2.1

GrabMisuseError: You can not use gzip encoding because pycurl was built without zlib support @ windows

I installed pycurl from the installer on the website (the very one hosted there). Everything worked until some module updated grab, and then this started happening.

https://groups.google.com/forum/#!topic/python-grab/7-jC2lOttWc

Lorien, do you even check what you publish? I wasted 1.5 hours on this.

Storing cookies separated by domain

The current version of Grab stores cookies as a dict of the form name => value.

Please implement cookie storage that is closer to the browser model, keeping information about the path, domain, and so on.

{
    "domain": ".github.com",
    "expirationDate": 1476761637,
    "hostOnly": false,
    "httpOnly": false,
    "name": "_ga",
    "path": "/",
    "secure": false,
    "session": false,
    "storeId": "0",
    "value": "GA1.2.123456.78910",
    "id": 1
}

I ran into a tricky Bitrix-based site that does not return content if cookies are handled differently from how a browser handles them (this is the most likely cause; if anyone knows the actual protection mechanism, please share :) ).

Thanks in advance

Update docs about handling redirects

Add description of how to control the maximal number of redirects

What is this error?

ERROR:grab.spider.base:Error while processing content unencoding: invalid code lengths set

Please tell me how to catch it =(

Error in the debug-log output of POST requests

Traceback (most recent call last):
File "test.py", line 91, in
test_instance.run()
File "/usr/local/lib/python2.7/dist-packages/grab-0.4.13-py2.7.egg/grab/spider/base.py", line 941, in run
for result in self.transport.iterate_results():
File "/usr/local/lib/python2.7/dist-packages/grab-0.4.13-py2.7.egg/grab/spider/transport/multicurl.py", line 105, in iterate_results
grab.process_request_result()
File "/usr/local/lib/python2.7/dist-packages/grab-0.4.13-py2.7.egg/grab/base.py", line 511, in process_request_result
if len(value) > self.config['debug_post_limit']:
TypeError: object of type 'long' has no len()
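The traceback shows len() being applied to a numeric POST value. A minimal sketch of a log helper that converts the value to text before measuring it; the function name and the limit are hypothetical and unrelated to the real debug_post_limit handling.

```python
def shorten_for_log(value, limit=150):
    """Return a log-friendly string no longer than limit (plus an ellipsis)."""
    # Numbers have no len(), so convert non-string values to text first.
    text = value if isinstance(value, str) else str(value)
    if len(text) > limit:
        return text[:limit] + "..."
    return text
```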

UnboundLocalError: local variable 'hammer_timeouts' referenced before assignment

Version from PyPI

Fix "mysql server has gone away" in mysql cache backend

How to repeat:

Start a spider that uses the mysql cache
Pause spider execution (for instance, import pdb; pdb.set_trace()) in some task handler
Wait for some time, for example 5 hours :)
Get the error: mysql has gone away

KeyError in g.form_fields()

File "/srv/myscript.py", line 383, in get_post_params
post_data = self.g.form_fields()
File "/srv/backend/booking/venv/local/lib/python2.7/site-packages/grab/ext/form.py", line 341, in form_fields
if fields[elem.name] is None:
KeyError: 'ChooseSavedCard'

On this piece of html code:

 <input type="radio"  name="ChooseSavedCard"  id="RadioSelectNewCard"  value="RadioSelectNewCard"  data-bind="checked: isSavedCardCheckbox"  checked />

Is it possible to improve html.strip_tags()?

Hi! :)

It would be nice if this utility worked like the PHP one: http://php.net/strip_tags , so that you could specify which tags to keep. In other words, so that you don't have to fiddle with regexps when, for example, you need to remove all links from a text (but keep the anchor text), and so on.
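A rough idea of such a function, sketched with the standard library's html.parser instead of regexps. The signature is hypothetical (modeled loosely on PHP's strip_tags) and this is not Grab's implementation; allowed tags keep their attributes as written.

```python
from html import escape
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = {tag.lower() for tag in allowed}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Keep the tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text.
        self.parts.append(escape(data, quote=False))


def strip_tags(html, allowed=()):
    """Remove all tags except those listed in allowed; keep the text."""
    parser = _TagStripper(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('Read <a href="/x">this <b>link</b></a> now', allowed=("b",)))
```

In the call above the link is removed while its anchor text and the bold tag are kept.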

Typo in the Spider.get_name method?

There is a typo in the Spider.get_name method (file grab/spider/base.py, line 950).
As a result, when I want to use "different" mongo queues, I do this:

bot = CustomSpider()
bot.spider_name = 'custom_spider-%s' % my_data_item.some_unique_id
bot.setup_queue(backend='mongo', database='some_db') # the error occurs here because of the typo
bot.run()

grab (0.4.13)


grab can't load valid cookiefile

Tested with revision 0ca7188

Steps to reproduce:

AirAlk:grab(master) $ rm /tmp/jar   
AirAlk:grab(master) $ touch /tmp/jar
AirAlk:grab(master) $ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from grab import Grab
>>> g=Grab(cookiefile='/tmp/jar')
>>> g.go('http://google.com')
ERROR:root:Call to deprecated function dump_cookies. Use grab.cookies.save_to_file instead.
<grab.response.Response object at 0x104cddbb0>
>>> g=Grab(cookiefile='/tmp/jar')
>>> g.go('http://google.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "grab/base.py", line 368, in go
    return self.request(url=url, **kwargs)
  File "grab/base.py", line 443, in request
    self.prepare_request(**kwargs)
  File "grab/base.py", line 396, in prepare_request
    self.transport.process_config(self)
  File "grab/transport/curl.py", line 306, in process_config
    self.process_cookie_options(grab, request_url)
  File "grab/transport/curl.py", line 381, in process_cookie_options
    encoded, tail))
TypeError: invalid arguments to setopt
>>>

Cookie jar:

AirAlk:grab(master) $ cat /tmp/jar
[{"comment": null, "domain": "#httponly_.google.com", "name": "NID", "expires": 1404847059, "value": "67=RmfhSy6-Q0Eq25G1w491QhHnloBwFp11ed7mBdz211R-uVrvYpPokHiWnz8l055HMzQm7IUfIHSVvF8J4Yk-CqRzpoprkvqtug7v0zbTraZzkA2JXjW3XGsfJFjdihCJ", "version": 0, "rfc2109": false, "discard": true, "path": "/", "port": null, "comment_url": null, "secure": false}, {"comment": null, "domain": "#httponly_.google.lv", "name": "NID", "expires": 1404847060, "value": "67=NXeekVhHigpPbN8lOSGAVG4fIcO5gDNBcyq9nnZxOE1ykS-yG3lE6_a5OpjE4O7cA5rUje_Z-_EhXUPKkFySh1DNdfQjHHak8jQ57b7-lC678o8dNgUK4qd8C4s7z-Fi", "version": 0, "rfc2109": false, "discard": true, "path": "/", "port": null, "comment_url": null, "secure": false}, {"comment": null, "domain": ".google.com", "name": "PREF", "expires": 1452107859, "value": "ID=ba40aa1f1893c6b2:FF=0:TM=1389035859:LM=1389035859:S=Ts4dRqBie77tN7M2", "version": 0, "rfc2109": false, "discard": true, "path": "/", "port": null, "comment_url": null, "secure": false}, {"comment": null, "domain": ".google.lv", "name": "PREF", "expires": 1452107860, "value": "ID=0e03b5592755082e:FF=0:TM=1389035860:LM=1389035860:S=BqshP7eebNoH0JS-", "version": 0, "rfc2109": false, "discard": true, "path": "/", "port": null, "comment_url": null, "secure": false}]

Large files are read into memory.

Because of response.parse in prepare_response in curl.py, large files are read into memory despite the body_inmemory=False option. The likely culprit is the line if isinstance(self.body, unicode): in the parse method of the Response class.

Error in building POST <form method="post" id="info-form" action="#"> <input type="hidden" name="md" value="move"/> <table> <tbody> <tr>

[Errno 56] GnuTLS recv error (-12): A TLS fatal alert has been received.

I periodically get this error on requests; has anyone run into it?
pycurl 7.19.0.2

Stack trace (most recent call last):

  File "api.py", line 156, in request
    x = g1.go(url)
  File "grab/base.py", line 356, in go
    return self.request(url=url, **kwargs)
  File "grab/base.py", line 433, in request
    self.transport.request()
  File "grab/transport/curl.py", line 389, in request
    raise error.GrabNetworkError(ex.args[0], ex.args[1])

process_links and xpath_list

Using process_links produces this error:
ERROR:root:Call to deprecated function xpath_list. Use grab.doc.select() instead.

Apparently process_links and process_next_page need to be rewritten to use select?

I believe this is line 79 of /grab/spider/pattern.py

Replace
for url in grab.xpath_list(xpath):
with
for url in grab.doc.select(xpath).node_list():

Dependency

Hi.

Please build the PyPI package so that pycurl and lxml are installed automatically by pip install grab

Typo in PyquerySelector

In the file selector/selector.py, line 296, pyquery is written instead of query

class PyquerySelector(LxmlNodeBaseSelector):
    __slots__ = ()

    def pyquery_node(self):
        return PyQuery(self.node)

    def process_query(self, query):
        return self.pyquery_node().find(pyquery)

It should be

    def process_query(self, query):
        return self.pyquery_node().find(query)

Downloading pages in order

I noticed that with the code below, pages are downloaded in random order (i.e. not 1, 2, 3 but in whatever order they please: 2, 1, 3 or 3, 2, 1, and so on).
What could cause this, and how can it be fixed?

...
parse_pages = 3

class SpiderExample(Spider):

    def task_generator(self):
        for page_num in xrange(1,parse_pages+1):
            yield Task('pages', url='http://url.com/?page='+str(page_num))

    def task_pages(self, grab, task):
        print task.url
...

  File "modules/grab/grab/ext/form.py", line 311, in submit
    return self.request()
  File "modules/grab/grab/base.py", line 442, in request
    self.prepare_request(**kwargs)
  File "modules/grab/grab/base.py", line 403, in prepare_request
    self.transport.process_config(self)
  File "modules/grab/grab/transport/curl.py", line 308, in process_config
    self.process_cookie_options(grab, request_url)
  File "modules/grab/grab/transport/curl.py", line 381, in process_cookie_options
    tail = b'; domain={}' % cookie_domain
TypeError: unsupported operand type(s) for %: 'bytes' and 'str'

When I use pip3 to install grab, the request works. But if I use branch master of grab as a submodule in my project, it fails : (

See: Adding % formatting to bytes and bytearray -- Request for Pronouncement
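The % operator for bytes was only added in Python 3.5 (PEP 461), and a '{}' placeholder is not a %-style format in any case. A version-independent sketch is to encode the domain and concatenate; the function name is hypothetical and ASCII-only domains are assumed.

```python
def cookie_domain_tail(cookie_domain):
    """Build the b'; domain=...' suffix from a str or bytes domain."""
    if isinstance(cookie_domain, str):
        # Assumption: the domain is plain ASCII (no IDNA handling here).
        cookie_domain = cookie_domain.encode("ascii")
    return b"; domain=" + cookie_domain


print(cookie_domain_tail(".example.com"))
```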

Plan for 0.6 release

Repo

Branch name is v06: https://github.com/lorien/grab/tree/v06

[DONE] Refactoring

Move some features out to separate packages:
- grab.captcha --> http://github.com/lorien/captcha_solver
- grab.djangoui --> NULL
- grab.item --> http://github.com/lorien/item
- grab.selector --> http://github.com/lorien/selection
- grab.kit --> http://github.com/lorien/moskit
- grab.tools (most of grab.tools.*) --> http://github.com/lorien/tools
- grab.tools.account --> NULL
- test.server --> http://github.com/lorien/test-server
Put backward-compatibility import hacks to allow import item, selector and tools objects from grab package. Generate warnings in such cases.

Documentation

Write at least brief English documentation about all Grab/Spider features.

[DONE] Obsoleted things

[DONE] Remove obsoleted transports (all except curl and requests) from grab.transport package.
[DONE] Remove some shit from grab.spider

[DONE] Tests

[DONE] 90% test coverage (if less, then remove some shit)
[DONE] Integrate coveralls.io, put coveralls badge into README

[DONE] Coding style

[DONE] Make code 100% compatible with flake8
[DONE] Connect landscape

Error in building a POST request

An incorrect request is built for the following form.
Form:

<!--<input type="hidden" name="md" value="move"/>-->
<!--<table>-->
    <!--<tbody>-->
    <!--<tr>-->
        <!--<td><input type="checkbox" name="test" value="12##48" checked="checked"/></td>-->
        <!--<td>Some value</td>-->
    <!--</tr>-->
    <!--<tr>-->
        <!--<td><input type="checkbox" name="test" value="15##25" checked="checked"/></td>-->
        <!--<td>Some value</td>-->
    <!--</tr>-->
    <!--<tr>-->
        <!--<th>-->
            <!--<button onclick="this.form.submit(); return false;">Submit</button>-->
        <!--</th>-->
    <!--</tr>-->
    <!--</tbody>-->
<!--</table>-->

grab.network.log output:

[01] GET http://localhost/
[02] POST http://localhost/
POST request
test <CheckboxValues {'12##48', '15##25'} for checkboxes name='test'>
md move

grab_fail.py:

g.go('http://localhost/')
g.choose_form(xpath=".//form[@id='info-form']")
g.submit()

https://gist.github.com/rblack/7730597
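For reference, a browser submits two checked checkboxes that share a name as a repeated key. A minimal sketch of that encoding with the standard library; the field list is copied from the form above and the variable names are illustrative.

```python
from urllib.parse import urlencode

# One (name, value) pair per checked checkbox, plus the hidden field.
fields = [("test", "12##48"), ("test", "15##25"), ("md", "move")]
body = urlencode(fields)
print(body)

# The same body built from a mapping with list values.
body_from_dict = urlencode({"test": ["12##48", "15##25"], "md": "move"}, doseq=True)
```

The log above instead shows a single CheckboxValues object for the key test, which is the discrepancy being reported.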
