Super-Simple Scraper

This is a very thin layer on top of Colly that allows configuration from a JSON file. The output is JSONL, ready to be imported into Typesense.
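As a rough illustration of the JSONL output format (one JSON object per line), here is a minimal Go sketch. It is not ssscraper's actual code: the `Document` type and its `url`, `title`, and `body` fields are hypothetical, since the real fields depend on the selectors in your configuration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Document is a hypothetical record shape used only for this sketch.
// The real field names come from your configured selectors.
type Document struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// writeJSONL writes one JSON object per line, which is the JSONL layout.
func writeJSONL(w io.Writer, docs []Document) error {
	enc := json.NewEncoder(w) // Encode appends a newline after each value
	for _, d := range docs {
		if err := enc.Encode(d); err != nil {
			return err
		}
	}
	return nil
}

// toJSONL is a convenience wrapper that returns the JSONL as a string.
func toJSONL(docs []Document) (string, error) {
	var buf bytes.Buffer
	err := writeJSONL(&buf, docs)
	return buf.String(), err
}

func main() {
	docs := []Document{
		{URL: "https://example.com/a", Title: "Page A", Body: "First page"},
		{URL: "https://example.com/b", Title: "Page B", Body: "Second page"},
	}
	if err := writeJSONL(os.Stdout, docs); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

Because each line is a complete JSON object, the output can be streamed or split by line without parsing the whole file first.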

Features

  • Scrape HTML & PDF documents based on the configured selectors
  • Selectors can be CSS selectors or template-based ones, which have sprig functions available.
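To give a feel for what a template-based selector might do, the sketch below renders a Go `text/template` with a small function map. This is an assumption-laden illustration, not ssscraper's implementation: the `trim` and `upper` helpers are standard-library stand-ins for the sprig functions of the same name, and the `renderSelector` function and the `title` data key are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"
	"text/template"
)

// funcs stands in for the sprig function map so that this sketch needs
// only the standard library. The real tool makes sprig functions available.
var funcs = template.FuncMap{
	"trim":  strings.TrimSpace,
	"upper": strings.ToUpper,
}

// renderSelector (hypothetical) parses tmpl and executes it against data.
func renderSelector(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("selector").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// "title" is an assumed key; the data passed to real templates may differ.
	data := map[string]string{"title": "  Go Tripod  "}
	out, err := renderSelector("{{ .title | trim | upper }}", data)
	if err != nil {
		fmt.Fprintln(os.Stderr, "render failed:", err)
		return
	}
	fmt.Println(out)
}
```

The pipeline syntax passes each result to the next function, so a template can clean up or combine scraped values before they are written out.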

Configuration

See the example configuration. Many of these options are copied directly to their Colly equivalents.

Running

We have an image on Docker Hub, so after installing Docker and jq, something like this will work:

docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
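The command above passes the whole configuration as a compact JSON string in the `CONFIG` environment variable. A minimal sketch of how a Go program could read such a variable follows. It is an assumption about the general approach rather than ssscraper's actual loader: `parseConfig` is a hypothetical helper, and it decodes into a generic map because the real schema is defined by the example configuration, not by this sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// parseConfig (hypothetical) decodes a JSON object from raw.
// A generic map is used so that no config field names are assumed.
func parseConfig(raw string) (map[string]any, error) {
	if raw == "" {
		return nil, errors.New("CONFIG is empty")
	}
	var cfg map[string]any
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, fmt.Errorf("invalid CONFIG JSON: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Getenv("CONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Print only the top-level keys that were supplied.
	for key := range cfg {
		fmt.Println("config key:", key)
	}
}
```

Piping the file through `jq -r tostring` keeps the JSON on a single line, which avoids quoting problems when it is passed with `-e`.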

The manual method is:

docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper

# you're now in the docker container

cd src/app
go build
./ssscraper

ssscraper can be called with the --testUrl flag:

./ssscraper --testUrl=https://gotripod.com

It scrapes only that URL without following its links, and outputs the results for that one page.
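The single-page behaviour can be pictured with Go's standard `flag` package, as in the sketch below. This is an illustration under stated assumptions, not the project's source: the `options` struct, the `followLinks` field, and the `parseArgs` helper are hypothetical names, and only the `--testUrl` flag itself comes from the text above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options is a hypothetical container for the parsed command line.
type options struct {
	testURL     string // the single URL to scrape, if any
	followLinks bool   // false when a test URL is given
}

// parseArgs (hypothetical) parses args and disables link following
// whenever --testUrl is set.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("ssscraper", flag.ContinueOnError)
	testURL := fs.String("testUrl", "", "scrape only this URL and do not follow links")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return options{testURL: *testURL, followLinks: *testURL == ""}, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if opts.testURL != "" {
		fmt.Println("test mode: scraping only", opts.testURL)
		return
	}
	fmt.Println("normal mode: following links from the configured start pages")
}
```

Go's `flag` package accepts both `-testUrl` and `--testUrl`, so either spelling works with a parser like this one.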

Developing

Clone the repo and open its directory in VSCode with the Containers extension installed.

Future ideas

  • Nested selectors; i.e. select each item from a list on each page
  • Webhook support - POST the output to a URL on completion
  • Different output formats
  • Custom weighting for selectors
  • Extract the selector/template logic to a common function
  • Add Word doc support

Sponsors

Built by Go Tripod, making the web as easy as one, two, three. Go Tripod build bespoke software solutions, and if you need a custom version of SS Scraper please get in touch.

Contributors

colinramsay