Parsify

Stop writing a separate parser script for every website. With Parsify, a single short script plus a configuration file is enough to adapt your parser to different websites.

Installation

pip install parsify

Usage

Make sure you have your configuration file (usually handbook.json) ready.

import parsify as pf


# Create the Parsify engine
ngn = pf.Engine(handbook='handbook.json')

# Run a single step, passing the step name as an argument.
# The step must belong to Engine.current_parser, which defaults to the
# first parser in the handbook, and it must not use any
# "dynamic_variables" when run through this method.
step_result = ngn.stepshot(step='get_products')
# print(step_result)

# Parse a single website (must be configured in "handbook.json"),
# passing the scope name as an argument
scope_result = ngn.scopeshot(parser='example.com')
# print(scope_result)

# Run all the parsers that are configured in "handbook.json"
final_result = ngn.parse()
# print(final_result)

Handbook Tutorial

Required Fields

  • The handbook file should start with a "parser" key whose value is the array of parsers.
  • Each parser in the array should have two keys:
    • "scope" - String: Name of the parser. Usually the website name, e.g. "example.com".
    • "steps" - Array: Steps to parse.
  • Each step should have at least the following fields:
    • "name" - String: Unique name of the step. This field makes it possible to access this step's results and dynamic variables in subsequent steps (if needed).
    • "chain_id" - Integer: Steps with the same chain id are executed as a sequence of steps on every iteration.
    • "url" - String: Target URL of the request(s) for the current step.
    • "method" - String: Request method for the current step.
    • "output_path" - String: Path to the result data in the response. Use dots for nested data; for example, if the needed result is in response -> "data" -> "products", "output_path" should be "data.products".
    • "output" - Dictionary:
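
As a rough illustration of the required fields listed above, the following Python sketch builds a minimal handbook, checks that each step carries those fields, and shows how a dotted "output_path" maps onto a nested response. The URL, the "GET" method value, the chain id of 1, and the helper names missing_step_keys and resolve_output_path are assumptions made for this example and are not part of Parsify's API. The "output" dictionary is left empty because its contents are not described here, and Parsify's own path resolution may differ from this helper.

```python
import json

# Required step fields, taken from the list above.
REQUIRED_STEP_KEYS = {"name", "chain_id", "url", "method", "output_path", "output"}

# Minimal handbook: one parser ("scope") with a single step.
handbook = {
    "parser": [
        {
            "scope": "example.com",
            "steps": [
                {
                    "name": "get_products",
                    "chain_id": 1,  # assumed value
                    "url": "https://example.com/api/products",  # hypothetical URL
                    "method": "GET",  # assumed value
                    "output_path": "data.products",
                    "output": {},  # placeholder; contents not described here
                }
            ],
        }
    ]
}


def missing_step_keys(handbook):
    """Return {(scope, step name): sorted missing keys} for incomplete steps."""
    problems = {}
    for parser in handbook["parser"]:
        for step in parser["steps"]:
            missing = REQUIRED_STEP_KEYS - set(step)
            if missing:
                problems[(parser["scope"], step.get("name"))] = sorted(missing)
    return problems


def resolve_output_path(response, output_path):
    """Walk a nested dict along a dot-separated path (illustrative helper)."""
    node = response
    for key in output_path.split("."):
        node = node[key]
    return node


# Made-up response, used only to demonstrate the dotted path.
sample_response = {"data": {"products": [{"id": 1}, {"id": 2}]}}
products = resolve_output_path(sample_response, "data.products")

# Serialized form that could be saved as handbook.json.
handbook_json = json.dumps(handbook, indent=2)
```

Writing handbook_json to a file named handbook.json would produce a configuration in the shape this section describes, with the "output" contents still to be filled in.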

License

Distributed under the MIT License. See the LICENSE file for more information.

Contact

Luka Sosiashvili - @lukasanukvari

Project Link: https://github.com/lukasanukvari/parsify

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
