
toapi's Introduction

Toapi


Overview

Toapi gives you the ability to make any web site provide an API.

Features

  • Automatically converts HTML web sites into API services.
  • Automatically caches every page of the source site.
  • Automatically caches every request.
  • Supports merging multiple web sites into one API service.

Get Started

Installation

$ pip install toapi

Usage

Create app.py and copy in the following code:

from flask import request
from htmlparsing import Attr, Text
from toapi import Api, Item

api = Api()


@api.site('https://news.ycombinator.com')
@api.list('.athing')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Post(Item):
    url = Attr('.storylink', 'href')
    title = Text('.storylink')


@api.site('https://news.ycombinator.com')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Page(Item):
    next_page = Attr('.morelink', 'href')

    def clean_next_page(self, value):
        return api.convert_string('/' + value, '/news?p={page}', request.host_url.strip('/') + '/posts?page={page}')


api.run(debug=True, host='0.0.0.0', port=5000)

Run python app.py.

Then open your browser and visit http://127.0.0.1:5000/posts?page=1.

You will get a result like:

{
  "Page": {
    "next_page": "http://127.0.0.1:5000/posts?page=2"
  }, 
  "Post": [
    {
      "title": "Mathematicians Crack the Cursed Curve", 
      "url": "https://www.quantamagazine.org/mathematicians-crack-the-cursed-curve-20171207/"
    }, 
    {
      "title": "Stuffing a Tesla Drivetrain into a 1981 Honda Accord", 
      "url": "https://jalopnik.com/this-glorious-madman-stuffed-a-p85-tesla-drivetrain-int-1823461909"
    }
  ]
}

Contributing

Write code, write tests, and open a pull request.

toapi's People

Contributors

bluehatbrit, daniel-at-github, elliotgao2, howie6879, mknippen, mohan3d, wuqiangroy, zhiwei-xu


toapi's Issues

Cache TTL clarification

Good evening,

I have a question regarding the cache and its time to live.

Let's say I want to turn some site into an API and want the results of the very first request to be cached for one hour. How would I specify that in the settings? Is such a setup even possible?

I tried setting ttl: 60 * 60, assuming that would do the trick, but it doesn't seem to work...

Could you please clarify?

Thanks in advance.

Suggestion: accessing element content in items

Toapi currently handles page elements in two ways:

  • If the selector locates exactly one element node, it returns all the text under that node, implemented by iterating node.itertext() and concatenating the text content.
  • If the selector locates multiple element nodes, it returns a list of those element nodes (Element).

Suppose we have the following DOM structure:

<div class="block">
  <p class="p1">
    hello, P1<br/>
    this is new line
  </p>
  <p class="p2">
    <span>this is a line</span>
    <a href="#">this is a link</a>
  </p>
</div>

And the following item:

class Demo(Item):
    block = Css('div.block')
    p = Css('div.block > p')
    p1 = Css('div.block > p.p1')

The results will be:

  • block: hello, P1 this is new line this is a line this is a link
  • p: [<Element p 0x*********>, <Element p 0x*********>]
  • p1: hello, P1 this is new line

Now suppose I want the HTML content of the matched element node rather than all of its text. For example, .p1 contains a line break <br/>; I want the HTML content under that element, i.e. everything including the <br/> tag and the text, so that I can replace <br/> with '\n' myself in a clean_* method. But as the results above show, p1 comes back with its text content already extracted. To get the raw content of p1, I have to select a group of elements via p, iterate to find p1, convert it to a string, and then process it.

This is just one simple use case; real applications will of course involve more than replacing line breaks. So I suggest adding an optional raw attribute to selectors, defaulting to False, i.e. Selector(rule, raw=False). Combined with clean_* methods, this would give developers much more freedom in processing the extracted content.

That's all.


Addendum:

There is also the Regex selector; with it you can get the raw content you want. The other two selectors, Css and XPath, behave as described above.
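The proposed raw= flag could be sketched roughly like this (XPathSel, the sample markup, and the stdlib ElementTree backing are all illustrative assumptions, not toapi's actual implementation):

```python
import xml.etree.ElementTree as ET


class XPathSel:
    """Hypothetical sketch of the proposed raw= flag, using stdlib ElementTree."""

    def __init__(self, rule, raw=False):
        self.rule = rule
        self.raw = raw

    def parse(self, root):
        node = root.find(self.rule)
        if self.raw:
            # raw=True: return the node's markup, keeping inner tags like <br/>
            return ET.tostring(node, encoding="unicode")
        # raw=False: current behaviour, concatenated text only
        return "".join(node.itertext())


doc = ET.fromstring(
    '<div class="block"><p class="p1">hello, P1<br/>this is new line</p></div>')
text_sel = XPathSel(".//p[@class='p1']")
raw_sel = XPathSel(".//p[@class='p1']", raw=True)
```

With raw=True the clean_* method receives the markup (including <br/>) and can do the newline replacement itself.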

Awesome code!!!

Congratulations on creating an awesome library and great code, beautiful work!

Docs clarification on cache update logic

Hi.

On the diagram in the README (nice & clear BTW), I can read:

HTML storage update trigger cache update

Could you please explain somewhere in the docs what that means?
When exactly, and how, does the cache get updated?

I haven't used Python much

Traceback (most recent call last):
File "test.py", line 1, in
from toapi import XPath, Item, Api
File "/Library/Python/2.7/site-packages/toapi/__init__.py", line 2, in
from toapi.item import *
File "/Library/Python/2.7/site-packages/toapi/item.py", line 17
class Item(metaclass=ItemType):
^
SyntaxError: invalid syntax

Errors occur when installing toapi.

When I install toapi with the command "pip install toapi", the following errors occur.

Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 324, in run
    requirement_set.prepare_files(finder)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 380, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 620, in _prepare_file
    session=self.session, hashes=hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 821, in unpack_url
    hashes=hashes
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 659, in unpack_http_url
    hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 882, in _download_http_url
    _download_url(resp, link, content_file, hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 603, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/hashes.py", line 46, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 571, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/ui.py", line 139, in iter
    for x in it:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 560, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 357, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 324, in read
    flush_decoder = True
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 246, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out.

On macOS; the Python version is 2.7 and the pip version is 9.0.1.
Can anyone help me out?

SyntaxError: invalid syntax

Hi,

If I run the command toapi -v I get the issue below.

Kindly check and give the needed suggestions.

Thanks...

Traceback (most recent call last):
  File "/usr/local/bin/toapi", line 9, in <module>
    load_entry_point('toapi==2.1.2', 'console_scripts', 'toapi')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/toapi/__init__.py", line 1, in <module>
    from toapi.api import Api
  File "/usr/local/lib/python2.7/dist-packages/toapi/api.py", line 20
    def __init__(self, site: str = '', browser: str = None) -> None:
                           ^
SyntaxError: invalid syntax

Elements not always present on page

I use:

class ProductPage(Item):
    coupon = Attr('.coupon', 'title')

However, some product pages do not contain the coupon HTML,
so they fail with

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/htmlparsing.py", line 79, in parse
    return element.css(self.selector)[0].attrs[self.attr]
IndexError: list index out of range

What's the best practice to deal with that situation?
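One possible workaround, sketched below, is an Attr-like selector that falls back to a default value instead of indexing an empty match list (OptionalAttr and the FakeElement/FakeHit demo classes are hypothetical; the .css()/.attrs interface mirrors what the traceback above suggests):

```python
class OptionalAttr:
    """Hypothetical workaround sketch (not part of toapi/htmlparsing):
    an Attr-like selector that returns a default instead of raising
    IndexError when the element is missing."""

    def __init__(self, selector, attr, default=None):
        self.selector = selector
        self.attr = attr
        self.default = default

    def parse(self, element):
        matches = element.css(self.selector)
        if not matches:          # no .coupon on this product page
            return self.default
        return matches[0].attrs.get(self.attr, self.default)


class FakeElement:
    """Minimal stand-in element, for demonstration only."""
    def __init__(self, matches):
        self._matches = matches

    def css(self, selector):
        return self._matches


class FakeHit:
    """Stand-in for a matched node carrying an attrs dict."""
    attrs = {'title': '10% off'}


coupon = OptionalAttr('.coupon', 'title', default='')
```

Pages without the coupon markup then yield the default instead of crashing the whole parse.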

Access to RawHTML from selectors

Hello,
I need access to the raw HTML in one of my Item instances.
Currently the XPath and CSS selectors always convert the node to a string, but in my use case, once I select a certain part of the webpage, I need to do some post-processing in my clean_ method.
However, only a string gets passed into it. Is there a way to get the raw HTML passed into my clean_ method for a given key?

Thank you,

How to set expiration on Local Storage

Hello,
I am using v1.0.0.
I wanted to check if there is a way to set an expiration time on the local storage. Since the source website changes frequently, I want to cache the local storage for only 24 hours and then force it to refetch the content.
Is there a way to achieve this, similar to the ttl in the cache settings?

Thanks,

Error: No such command "new".

[root@test python3]# python --version
Python 3.6.2

[root@test python3]# toapi new toapi/toapi-pic
Usage: toapi [OPTIONS] COMMAND [ARGS]...

Error: No such command "new".

help ?

Command line interface.

toapi new [project_name]
toapi run
toapi info
toapi status
toapi clear storage
toapi clear cache

modify routing argument 2

Hi
The site I am scraping has urls like:

http://remote.com/i-love-cats/1
http://remote.com/dogs-are-really-great/2
http://remote.com/pupies_kanGourOUs/3

I want to match them with local urls like those:

http://localhost:5000/1
http://localhost:5000/2
http://localhost:5000/3

Is there some magical way to do this?

Or do I need to proceed as in #107 and also
add custom code with an external two-column db table to match

1 => http://remote.com/i-love-cats/1
...

Sure, as an alternative solution I could add a route like
@api.route('page/{complete_remote_url}', '{complete_remote_url}')
and do:

wget http://localhost:5000/page/http://remote.com/i-love-cats/1

but I want to hide the scraped site's URL so the caller does not see it.
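The two-column mapping table mentioned above could be as small as this sketch (the dict and function names are illustrative; a real setup would keep the table in a database):

```python
# Hypothetical two-column mapping: local numeric id -> remote slug.
# Slugs are taken from the example URLs above.
SLUGS = {
    '1': 'i-love-cats',
    '2': 'dogs-are-really-great',
}


def remote_url(local_id):
    """Resolve a short local id like /1 to the full remote URL,
    without ever exposing the remote slug to the caller."""
    slug = SLUGS.get(local_id)
    if slug is None:
        return None
    return 'http://remote.com/{}/{}'.format(slug, local_id)
```

The local API then only ever shows http://localhost:5000/1 while the lookup happens server-side.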

Different routes load the same item.

class MovieList(Item):
    __base_url = 'http://www.dy2018.com'

    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html'}

Add force-refresh on api

Hello,
I plan to use this awesome tool to fetch content from a dynamic website. The cache expiration works for standard use cases,
but I was wondering if there is a way for the API to perform a complete fetch from the website instead of using the cache or the storage. A use case would be a mobile device where the user refreshes to see the latest content from the website.

Any ideas ?

Thank you for this...

Any way to send HTTP POST requests?

While working with toapi I came across a scenario where the web page had an HTML table that was paginated.

Clicking on "next page" would issue an AJAX POST request to fetch the next set of records in the data set.

Is there any way to accomplish this with toapi?

Route aliases containing Chinese characters fail to parse

Chinese characters typically appear in routes for search queries. Looking at the implementation in toapi/api.py, _alias_re_string does not match Chinese characters; after changing the pattern to '(?P<{}>[A-Za-z0-9_?&/=\u4e00-\u9fa5]+)', Chinese matching works in my tests.

Whitespace characters are not matched either, so a multi-keyword search only matches the first keyword. Changing the pattern to '(?P<{}>[A-Za-z0-9_?&/=\s\u4e00-\u9fa5]+)' matches multiple keywords.

Docs not updated

In the documentation's toapi-pic example, toapi new toapi-pic pulls the toapi/toapi-template project. It should be toapi new toapi/toapi-pic to pull the toapi/toapi-pic project.

Encrypt URL.

We don't want API users to know any information about the source site.

Example:

http://127.0.0.1/adauoai1duaou/

Here `/adauoai1duaou/` is an encrypted URL that maps to `/users/`.

We should encrypt all the relative URLs in all items. Such as

{
    'title':'My Life',
    'url':'/movie/2017/'
}

Convert it to

{
    'title':'My Life',
    'url':'/uo23uodaoi123udo/'
}

We need an encrypt.py file.
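A minimal sketch of what such an encrypt.py could do, assuming an app-level secret (note: XOR with a repeating key is reversible obfuscation, not real encryption; a production version should use a proper cipher):

```python
import base64
import itertools

SECRET = b"change-me"  # assumption: an app-level secret key


def encode_url(path):
    """Obfuscate a relative URL into a URL-safe token (reversible).
    XOR with a repeating key is NOT cryptographically secure; it only
    hides the source path from casual API users."""
    xored = bytes(b ^ k for b, k in zip(path.encode(), itertools.cycle(SECRET)))
    return base64.urlsafe_b64encode(xored).decode().rstrip("=")


def decode_url(token):
    """Reverse encode_url: re-pad the base64 token and undo the XOR."""
    pad = "=" * (-len(token) % 4)
    xored = base64.urlsafe_b64decode(token + pad)
    return bytes(b ^ k for b, k in zip(xored, itertools.cycle(SECRET))).decode()
```

Items would then run every relative url value through encode_url before serialization, and the route handler would decode_url incoming tokens.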

Log Support.

  • Requests sent to the source site.
  • Responses received.
  • Items parsed.
  • HTML stored.
  • Cache hits.
  • Errors.

Example:

[Sent] TIME , URL, LENGTH ,STATUS
[Received] TIME, URL, LENGTH ,STATUS
[Parsed] TIME, URL,ITEM, TOTAL, STATUS
[Cache] TIME, URL,STATUS
[warning] TIME, DETAIL
etc...

This should be a log.py file.
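A minimal sketch of such a log.py using the standard logging module (the function names and the exact line format are assumptions based on the example lines above):

```python
import logging


def make_logger(name="toapi"):
    """Build (or fetch) a logger with a timestamped format."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


logger = make_logger()


def log_sent(url, length, status):
    # [Sent] TIME, URL, LENGTH, STATUS
    logger.info("[Sent] URL=%s LENGTH=%s STATUS=%s", url, length, status)


def log_parsed(url, item, total, status):
    # [Parsed] TIME, URL, ITEM, TOTAL, STATUS
    logger.info("[Parsed] URL=%s ITEM=%s TOTAL=%s STATUS=%s",
                url, item, total, status)
```

Each event category ([Received], [Cache], [Warning], ...) would get a similar one-line helper.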

Multiple site routing problems.

The example:

from toapi import XPath, Item, Api

api = Api()


class Movie(Item):
    __base_url__ = 'http://www.dy2018.com'

    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = '/'


class Post(Item):
    __base_url__ = 'https://news.ycombinator.com/'

    url = XPath('//a[@class="storylink"]/@href')
    title = XPath('//a[@class="storylink"]/text()')

    class Meta:
        source = XPath('//tr[@class="athing"]')
        route = '/'

api.register(Movie)
api.register(Post)

api.serve()

The Regex selector cannot parse content when source is set

If source in Meta is not None, the HTML the selector receives is an etree Element object. The Regex selector's implementation does not check isinstance(html, etree._Element), so an item using the Regex selector cannot extract anything: re.findall(self.rule, html) fails with "expected string or bytes-like object".

Solution:

toapi/selector.py

class Regex(Selector):
    """Regex expression"""

    def parse(self, html):
        if isinstance(html, etree._Element):
            html = etree.tostring(html)
        return re.findall(self.rule, html)

Minor bug when passing url port data to flask

Bug output

$ toapi run
2017/12/17 19:07:28 [Serving ] OK http://127.0.0.1:5000 
Traceback (most recent call last):
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/bin/toapi", line 11, in <module>
    sys.exit(cli())
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/cli.py", line 61, in run
    app.api.serve(ip=ip, port=port)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/api.py", line 42, in serve
    self.app.run(ip, port, debug=False, **options)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/flask/app.py", line 841, in run
    run_simple(host, port, self, **options)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/werkzeug/serving.py", line 733, in run_simple
    raise TypeError('port must be an integer')
TypeError: port must be an integer

Running toapi run reports an error

The Python version is 3.5, the toapi version 0.2.2.

toapi new api
cd api
toapi run

Running toapi run raises the following error:

➜  api toapi run
Traceback (most recent call last):
  File "/usr/local/bin/toapi", line 9, in <module>
    load_entry_point('toapi==0.2.2', 'console_scripts', 'toapi')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/toapi/cli.py", line 81, in run
    app = importlib.import_module('app', base_path)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/home/zz/code/python/toapi_project/api/app.py", line 7, in <module>
    api.register(Page)
  File "/usr/local/lib/python3.5/dist-packages/toapi/api.py", line 31, in register
    item.__pattern__ = re.compile(item.__base_url__ + item.Meta.route)
TypeError: Can't convert 'dict' object to str implicitly

item.__base_url__ is a str; item.Meta.route is a dict.

Fetching data via POST and writing items

  1. What if the page I want to parse can only be fetched with a POST request?

Does toapi provide a way to do this? I see there is an ajax=true parameter when defining settings, but where should I define the data for the AJAX request? I've gone through the docs and issues and found nothing.

  2. On writing items

The built-in XPath method seems to return a processed value rather than an etree element, so if I want to get all the text under an h1 (including child tags), I can't use the string(.) method and have to write a clean_xx method instead.

Also, could bs4 support be added?

Finally, I hope this project keeps getting better.
It's really great!

Production Deployment Instructions

Hello,
I am relatively new to Python web development. While I am mainly working on a mobile app, I found toapi to be a perfect companion for my backend requirements.
I am now almost ready to launch my app, but I am struggling to find a good production hosting environment for the toapi server code.
I am mainly looking at Heroku, AWS, or Google App Engine for hosting.

I was wondering if you could provide some instructions for deploying to a production-quality server. I did go over the deploy link but am still not able to connect its content to the actual toapi codebase.

Any advice on how I can move forward with this?

Thank you again,

Item clean data support.

class MyItem(Item):
    class Meta:
        route = '\.+'

    def clean_data(self, data):
        """Do something here."""
        return cleaned_data

API service cache.

Suppose we have some permanently stored HTML.

Now we should improve the performance of the API service.

  • Use redis, or is there something better?
  • Cache the JSON parsed from the HTML.
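As a baseline before reaching for redis, the JSON cache could be sketched as a small in-process TTL store (class and method names are illustrative, not toapi's actual cache API):

```python
import time


class JsonCache:
    """Minimal in-process TTL cache for parsed JSON; redis or similar
    would replace this in production."""

    def __init__(self, ttl=60):
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stored_at = hit
        if time.time() - stored_at > self.ttl:
            del self._store[key]  # expired: drop and force a re-parse
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time())
```

The API handler would check the cache by request path before touching the HTML storage or the source site.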

Flask logging error

python 3.7
toapi 2.1.1

Traceback (most recent call last):
  File "main.py", line 5, in <module>
    api = Api()
  File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 24, in __init__
    self.__init_server()
  File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 27, in __init_server
    self.app.logger.setLevel(logging.ERROR)
AttributeError: module 'flask.logging' has no attribute 'ERROR'

Default api issues.

/cache Not finished.
/storage Not finished.
/items Result is not as expected.
/status All right.

HTML permanent storage.

The aim is that our API server does not affect the source site.

  • Save results into files.

  • The file's name is hash(path).

  • The file's content is the HTML.

  • Update strategy: what should be updated, and what should be stored permanently?
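The first three bullets can be sketched in a few lines (the directory name and hash choice are assumptions; the demo writes to a temp directory):

```python
import hashlib
import pathlib
import tempfile

# Assumption: a storage directory; a temp dir here just for the demo.
STORAGE_DIR = pathlib.Path(tempfile.mkdtemp())


def path_to_filename(path):
    """File's name is hash(path)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def save_html(path, html):
    """File's content is the HTML."""
    target = STORAGE_DIR / path_to_filename(path)
    target.write_text(html, encoding="utf-8")
    return target


def load_html(path):
    """Return the stored HTML for a path, or None if never fetched."""
    target = STORAGE_DIR / path_to_filename(path)
    return target.read_text(encoding="utf-8") if target.exists() else None
```

The open question from the last bullet (when to refresh a stored file) sits on top of this: e.g. compare a stored timestamp against a per-site max age before serving.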

On the routing of resource paths

Currently a resource URL generally looks like https://yoursite.com/https://targetsite.com/resource/path/

This causes two problems:

  • It's ugly: the resource request path is too long.
  • It directly exposes the source site.

A proposal:

Could an alias be added in Meta as a substitute or identifier for the source site's base_url, and inserted into the route as the first path segment, e.g. https://yoursite.com/<alias>/resource/path/? This would satisfy the need to distinguish multiple sites while solving both problems above.

The example in the official repository (see its source) uses flask's routing to implement custom routes. With multiple sites and multiple request paths, there is one copy of the routes in the items and another copy there, which feels rather mechanical.

Thanks.

Filtering out useless items

When parsing items, some of them are junk, such as low-quality ads.

Is there a way to drop such entries?

My approach is to override methods of the Item class, but the values are nested layer by layer, so this handling is not very clean. My skills are limited, please bear with me.

class MyItem(Item):
    @classmethod
    def parse(cls, html):
        """Parse html to json"""
        if cls.Meta.source is None:
            return cls._parse_item(html)
        else:
            sections = cls.Meta.source.parse(html, is_source=True)
            results = []
            for section in sections:
                res = cls._parse_item(section)
                if res:
                    results.append(res)
            return results

    @classmethod
    def _parse_item(cls, html):
        item = OrderedDict()
        for name, selector in cls.__selectors__.items():
            try:
                item[name] = selector.parse(html)
            except Exception:
                item[name] = ''
            clean_method = getattr(cls, 'clean_%s' % name, None)
            if clean_method is not None:
                res = clean_method(cls, item[name])
                if res is None:
                    return None
                else:
                    item[name] = res
        return item


class HotBook(MyItem):
    title = XPath('//a[@class="xst"]/text()[1]')
    
    def clean_title(self, title):
        if '《' in title:
            return title[title.find('\u300a') + 1:title.find('\u300b')][:10]
        else:
            return None
 
    class Meta:
        source = XPath('//tbody[@class="thread_tbody"]')
        route = {'/hotbook?page=:page': '/forum-171-:page.html'}

Settings check.

Whenever we run the app, we should make sure the settings are correct, just like Django's runserver command does.
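A minimal sketch of such a startup check (the required keys below are placeholders, not toapi's actual setting names):

```python
# Assumption: example setting keys; real ones would come from toapi's settings.
REQUIRED_SETTINGS = {'cache', 'storage', 'web'}


def check_settings(settings):
    """Fail fast at startup if any required setting is missing."""
    missing = REQUIRED_SETTINGS - set(settings)
    if missing:
        raise ValueError('missing settings: ' + ', '.join(sorted(missing)))
    return True
```

The app entry point would call check_settings(settings) before serving, so misconfiguration surfaces immediately instead of at request time.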

Route order problem.

At present, we define routes as follows:

        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
                 '/movies/': '/html/gndy/dyzz/'}

The problem is the ordering: a plain dict does not guarantee iteration order, so a more general pattern can shadow a more specific one.

Use a tuple of pairs or an OrderedDict instead.
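With a tuple of pairs, first-match-wins resolution might look like this sketch (patterns adapted from the example above; resolve is an illustrative name):

```python
import re

# Ordered (pattern, target) pairs: first match wins, so the page=1
# special case is checked before the generic :page pattern.
ROUTES = [
    (r'^/movies/\?page=1$',             '/html/gndy/dyzz/'),
    (r'^/movies/\?page=(?P<page>\d+)$', '/html/gndy/dyzz/index_{page}.html'),
    (r'^/movies/$',                     '/html/gndy/dyzz/'),
]


def resolve(path):
    """Map a local API path to the source-site path, respecting order."""
    for pattern, target in ROUTES:
        match = re.match(pattern, path)
        if match:
            return target.format(**match.groupdict())
    return None
```

Because the sequence is ordered, /movies/?page=1 always hits its special case instead of index_1.html.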

Modify routing argument

class Meta:
    source = None
    route = {'/search/:id': '/search/:id'}

Right now the ID from the host URL is passed directly into the source URL. Is there a way to modify the ID before passing it on?

For example, I need to map the query 127.0.0.1:5000/search/1 to bing.com/search/100,

so I would have to multiply :id by 100 before passing it as an argument. Not sure if that makes sense.
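A hypothetical transform for this example could be as simple as the sketch below (the function name is illustrative; the hook point in toapi's routing is the open question of this issue):

```python
def source_search_url(local_id):
    """Map local /search/<id> to the source's /search/<id * 100>,
    as in the 127.0.0.1:5000/search/1 -> bing.com/search/100 example."""
    return 'https://bing.com/search/{}'.format(int(local_id) * 100)
```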

Installation error on Python 2.7

Traceback (most recent call last):
File "app.py", line 2, in
from htmlparsing import Attr, Text
File "/usr/local/lib/python2.7/dist-packages/htmlparsing-0.1.5-py2.7.egg/htmlparsing.py", line 21
def __init__(self, text: str):
^
SyntaxError: invalid syntax
