web-to-struct

A small tool for structuring data, mainly web data.

Installation

pip install web-to-struct

Requirements

Python >= 3.6

Usage

DEMO

import requests
import json
from web_to_struct import Parser

if __name__ == '__main__':
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    }
    r = requests.get("https://copymanga.org/recommend", headers=headers)

    config = {
        "name": "data",
        "map": [
            {"function": "string-to-element"},
            {"function": "css", "kwargs": {"patterns": ["#comic > .row > .exemptComicItem"]}},
        ],
        "children": [{
            "name": "title",
            "map": [
                {"function": "css", "kwargs": {"patterns": ["p[title]"]}},
            ]
        }, {
            "name": "img",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a > img"]}},
                {"function": "attr", "kwargs": {"attr_name": "data-src"}},
            ]
        }, {
            "name": "comic_id",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a"]}},
                {"function": "attr", "kwargs": {"attr_name": "href"}},
                {"function": "regex", "kwargs": {"pattern": r"comic/(.*?)$"}},
            ]
        }, {
            "name": "author",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-txt > span.exemptComicItem-txt-span > a[href^=\"/author\"]"]}},
            ],
        }]
    }
    parser = Parser()
    resp = parser.parse(r.text, config)
    print(json.dumps(resp, ensure_ascii=False, indent=2))

Returns:

{
  "data": [
    {
      "title": "見到你之後該說什麼呢",
      "img": "https://mirror277.mangafuna.xyz:12001/comic/jiandaonizhihougaishuoshenmene/cover/e54e3f14-8425-11eb-869d-00163e0ca5bd.jpg!kb_w_item",
      "comic_id": "jiandaonizhihougaishuoshenmene",
      "author": "ねむようこ"
    } //,...
  ]
}

Config parameters

{
  "name": "",
  "map": [
    { "function": "", "kwargs": {} } // built-in function; the output of each one is the input of the next
  ],
  "children": [{}] // optional; child nodes, each with the same structure as this one
}
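The chaining rule for "map" can be illustrated with a minimal sketch. This is not the library's implementation: the `FUNCTIONS` registry, the `run_map` helper, and the three toy functions (`strip`, `split`, `pick`) are hypothetical names invented for the illustration. Only the shape of each step (`"function"` plus optional `"kwargs"`) comes from the config format above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for built-in functions (not part of web_to_struct).
FUNCTIONS: Dict[str, Callable[..., Any]] = {
    "strip": lambda value: value.strip(),
    "split": lambda value, sep=",": value.split(sep),
    "pick": lambda value, position=0: value[position],
}


def run_map(value: Any, steps: List[Dict[str, Any]]) -> Any:
    """Feed `value` through each step; each output becomes the next input."""
    for step in steps:
        func = FUNCTIONS[step["function"]]
        value = func(value, **step.get("kwargs", {}))
    return value


steps = [
    {"function": "strip"},
    {"function": "split", "kwargs": {"sep": ";"}},
    {"function": "pick", "kwargs": {"position": 1}},
]
print(run_map("  a;b;c  ", steps))  # b
```

An empty "map" leaves the input unchanged in this sketch, which is why a child node can consist of a single step.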

Built-in functions

| Function | Accepted input (return type of the previous function) | Extra args | Returns | Description |
| --- | --- | --- | --- | --- |
| string-to-element | Union[str, bytes] | feature: str = "html5lib" | Element | - |
| css | Element | patterns: Union[str, List[str]] | [Element, None] | - |
| index | Union[Dict, Tuple, List] | pattern: str # e.g. "[1].x" | Any | - |
| text | Element | - | str | get the plain text inside the current element |
| html | Element | - | str | get the HTML string inside the current element |
| attr | Element | attr_name: str | str | get an attribute value of the current element |
| regex | str | pattern: str | Union[str, tuple, None] | regex match result |
| tuple-to-string | Tuple | pattern: str | str | use $1, $2, ... to insert tuple elements, e.g. "hello $1, $2" for the tuple ("a", "b") returns "hello a, b" |
| json-parse | str | - | Union[Dict, List] | parse a JSON string into a dict or list |
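To make the `regex` and `tuple-to-string` rows more concrete, here is a rough standalone sketch of the behavior they describe. It is an illustration, not the library's code: `regex_match` and `tuple_to_string` are hypothetical helper names, and two details are assumptions rather than documented behavior: that one capture group yields a `str` while several yield a `tuple`, and that placeholders with no matching tuple element are left untouched.

```python
import re
from typing import Optional, Tuple, Union


def regex_match(text: str, pattern: str) -> Union[str, Tuple[str, ...], None]:
    """Return None, a single string, or a tuple of groups (assumed semantics)."""
    match = re.search(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    if not groups:
        return match.group(0)  # no capture groups: return the whole match
    if len(groups) == 1:
        return groups[0]
    return groups


def tuple_to_string(values: Tuple[str, ...], pattern: str) -> str:
    """Replace $1, $2, ... in `pattern` with the corresponding tuple element."""
    def substitute(match):
        position = int(match.group(1)) - 1  # $1 refers to values[0]
        if 0 <= position < len(values):
            return str(values[position])
        return match.group(0)  # assumption: unknown placeholders stay as-is

    return re.sub(r"\$(\d+)", substitute, pattern)


print(regex_match("/comic/abc", r"comic/(.*?)$"))    # abc
print(tuple_to_string(("a", "b"), "hello $1, $2"))  # hello a, b
```

Chaining the two mirrors how a "map" could reorder captured groups: `tuple_to_string(regex_match("2021-12", r"(\d+)-(\d+)"), "$2/$1")` gives `"12/2021"` in this sketch.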

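Other behavior

  • If a node's return value is a list and the node has children, the result is the cross product of the return value and the children: every child is evaluated against every element of the list.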
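The list-times-children rule can be sketched as follows. This is a simplified model, not the library's implementation: `evaluate` is a hypothetical helper, "map" holds plain Python callables instead of named built-in functions, and the handling of a non-list value with children is an assumption.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def evaluate(node: Node, value: Any) -> Any:
    """Run the node's map, then fan the result out over its children."""
    for func in node.get("map", []):
        value = func(value)
    children = node.get("children")
    if not children:
        return value
    if isinstance(value, list):
        # List result with children: evaluate every child against every item.
        return [
            {child["name"]: evaluate(child, item) for child in children}
            for item in value
        ]
    # Assumption: a single value with children yields a single dict.
    return {child["name"]: evaluate(child, value) for child in children}


config = {
    "name": "data",
    "map": [lambda text: text.split(";")],
    "children": [
        {"name": "key", "map": [lambda item: item.split("=")[0]]},
        {"name": "value", "map": [lambda item: item.split("=")[1]]},
    ],
}
result = {config["name"]: evaluate(config, "a=1;b=2")}
print(result)
# {'data': [{'key': 'a', 'value': '1'}, {'key': 'b', 'value': '2'}]}
```

This matches the shape of the demo output: one dict per matched item, with one key per child.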

References

  • Some of the built-in functions are modeled on Yealico.
