web-to-struct

A tool for data structuring, mainly for web data. 将数据格式化的小工具，主要处理web数据。

安装

pip install web-to-struct

环境

>= python3.6

使用

DEMO

import requests
import json
from web_to_struct import Parser

if __name__ == '__main__':
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    }
    r = requests.get("https://copymanga.org/recommend", headers=headers)

    config = {
        "name": "data",
        "map": [
            {"function": "string-to-element"},
            {"function": "css", "kwargs": {"patterns": ["#comic > .row > .exemptComicItem"]}},
        ],
        "children": [{
            "name": "title",
            "map": [
                {"function": "css", "kwargs": {"patterns": ["p[title]"]}},
            ]
        }, {
            "name": "img",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a > img"]}},
                {"function": "attr", "kwargs": {"attr_name": "data-web_to_struct"}},
            ]
        }, {
            "name": "comic_id",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a"]}},
                {"function": "attr", "kwargs": {"attr_name": "href"}},
                {"function": "regex", "kwargs": {"pattern": r"comic/(.*?)$"}},
            ]
        }, {
            "name": "author",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-txt > span.exemptComicItem-txt-span > a[href^=\"/author\"]"]}},
            ],
        }]
    }
    parser = Parser()
    resp = parser.parse(r.text, config)
    print(json.dumps(resp, ensure_ascii=False, indent=2))

returns

{
  "data": [
    {
      "title": "見到你之後該說什麼呢",
      "img": "https://mirror277.mangafuna.xyz:12001/comic/jiandaonizhihougaishuoshenmene/cover/e54e3f14-8425-11eb-869d-00163e0ca5bd.jpg!kb_w_item",
      "comic_id": "jiandaonizhihougaishuoshenmene",
      "author": "ねむようこ"
    } //,...
  ]
}

`Config`参数

{
  "name": "",
  "map": [
    { "function": "", "kwargs": {} } // 内置函数，上一个的输出作为下一个的输入
  ],
  "children": [{}] // optional 子节点，结构同本结构。
}

内置函数 Functions

Function 函数名	Accepted Returns 可接受的上一个函数的返回类型	Extra Args 额外的参数	Returns 返回类型	Description 描述
string-to-element	`Union[str, bytes]`	`feature: str = "html5lib"`	Element	-
css	Element	`patterns: Union[str, List[str]]`	[Element, None]	-
index	`Union[Dict, Tuple, List]`	`pattern: str # eg."[1].x"`	Any	-
text	Element	-	String	get pure strings inside the current elements
html	Element	-	String	get HTML strings inside the current element
attr	Element	`attr_name: str`	str	get attribute value of the current element
regex	str	`pattern: str`	`Union[str, tuple, None]`	regex match result
tuple-to-string	`Tuple`	`pattern: str`	String	use $1,$2,... to replace tuple elements, eg. "hello $1, $2" for tuple ("a", "b") returns "hello a, b"
json-parse	str	-	`Union[Dict, List]`	parse json string to dict

其他行为

返回值如果是list，且有children，则处理为返回值叉乘children

参考

部分内置函数参考了Yealico

bd777 / web-to-struct Goto Github PK

web-to-struct's Introduction

web-to-struct

安装

环境

使用

DEMO

`Config`参数

内置函数 Functions

其他行为

参考

web-to-struct's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

bd777 / web-to-struct Goto Github PK

web-to-struct's Introduction

web-to-struct

安装

环境

使用

DEMO

Config参数

内置函数 Functions

其他行为

参考

web-to-struct's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org

`Config`参数