
shot-scraper's Introduction

shot-scraper


A command-line utility for taking automated screenshots of websites

For background on this project see shot-scraper: automated screenshots for documentation, built on Playwright.

Documentation

Get started with GitHub Actions

To get started without installing any software, use the shot-scraper-template template to create your own GitHub repository which takes screenshots of a page using shot-scraper. See Instantly create a GitHub repository to take screenshots of a web page for details.

Quick installation

You can install the shot-scraper CLI tool using pip:

pip install shot-scraper
# Now install the browser it needs:
shot-scraper install

Taking your first screenshot

You can take a screenshot of a web page like this:

shot-scraper https://datasette.io/

This will create a screenshot in a file called datasette-io.png.

Many more options are available; see Taking a screenshot for details.

Examples

shot-scraper's People

Contributors

eddiechapman, humitos, iwootten, joshuadavidthomas, kljohann, kylejohnston, mbafford, mhalle, nedbat, nielthiart, omerr, palewire, pauloxnet, rdmurphy, ryancheley, sesh, simonw, stevenmaude


shot-scraper's Issues

Don't write binary to terminal by default

The following currently dumps a binary PNG to the user's terminal:

shot-scraper https://datasette.io/

It is vanishingly rare for users to want this to happen!

Instead, the default should imitate wget - if no -o filename is specified, make one up based on the URL - and avoid overwriting an existing file with the same name by adding a .1.

If the user really does want it to write binary to standard output they can specify that with -o -.
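
The wget-style naming described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name filename_for_url, the slug rule, and the choice to place the counter before the extension are all assumptions made here.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def filename_for_url(url: str, ext: str = "png", directory: str = ".") -> str:
    """Derive an output filename from a URL without clobbering existing files.

    Hypothetical helper: the name and the exact slug rule are assumptions.
    """
    parsed = urlparse(url)
    # Combine host and path, then collapse every run of
    # non-alphanumeric characters into a single hyphen.
    raw = parsed.netloc + parsed.path
    base = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-")
    candidate = f"{base}.{ext}"
    counter = 0
    # If the name is taken, try base.1.ext, base.2.ext, and so on.
    while (Path(directory) / candidate).exists():
        counter += 1
        candidate = f"{base}.{counter}.{ext}"
    return candidate
```

With an empty working directory, filename_for_url("https://datasette.io/") yields datasette-io.png, which matches the filename mentioned in the README.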

Selecting elements for screenshotting based on tag content

Selecting elements as the target for screenshots based on CSS selectors does not (currently) allow for the selection of elements based on tag content, or relative to DOM elements selected based on tag content.

However, elements can be selected based on tag content using JavaScript. It would be useful to allow for the selection of elements via JavaScript as well as CSS.

Alternatively, support a method that can be called from a JavaScript scraper call to take a screenshot of a JavaScript-selected element.

At the moment, selector-based screenshots seem to be handled in _selector_javascript(selectors) by:

let els = %s.map(s => document.querySelector(s));

As well as passing selectors s via --selector, one approach might be to pass the element(s) el returned from a --js-selector script?

shot-scraper GitHub repository template

A repository template that helps users create a repo that takes screenshots of a page.

Create your own template from the repo and it will give you a YAML file that you can then edit - it writes screenshots to the same repository.

`--quality` option for JPEGs

While building this I realized that for ongoing git scraping projects file size matters, and so an option for lower quality JPEGs would be good:

Support returning output from evaluated JavaScript, including as status code

This is a bit of an out-there idea: what if you could execute custom JavaScript that returned a result, and then write that result to disk?

You could even skip the screenshot entirely and use this as a generic scraping tool at that point.

Bonus: if it can affect the exit code in some way it could be used as part of a CI flow to test something.

shot-scraper GitHub repository template implementation

A repository template that helps users create a repo that takes screenshots of a page.

Create your own template from the repo and it will give you a YAML file that you can then edit - it writes screenshots to the same repository.

Allow taking a snapshot of a local file

For an .html file stored locally I use python -m http.server 80 and then shot-scraper http://localhost/file but it would be handy to be able to just point shot-scraper to the file directly.
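
One way this could work is to turn a local path into a file:// URL before handing it to the browser. The sketch below is illustrative only; the helper name url_or_file_uri and the fallback of prefixing http:// to anything else are assumptions, not the tool's confirmed behavior.

```python
from pathlib import Path


def url_or_file_uri(target: str) -> str:
    """Return a URL the browser can open for either a URL or a local path.

    Hypothetical helper, named here only for illustration.
    """
    # Anything that already carries a scheme is passed through untouched.
    if "://" in target:
        return target
    path = Path(target)
    # An existing local file becomes an absolute file:// URI.
    if path.exists():
        return path.resolve().as_uri()
    # Assumption: treat everything else as a bare hostname.
    return f"http://{target}"
```

With a helper like this, shot-scraper file.html could behave the same as pointing at a local web server, without needing python -m http.server.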

Support taking a shot that boxes multiple selectors

This is a bit of an unconventional need, but I think it's worth exploring.

When taking screenshots for tutorials, I often want to grab an area of the screen that incorporates more than one element - where there's no convenient wrapper element that I can use to get the shot that I want.

Imagine if you could specify multiple selectors and get back a screenshot of the smallest area of the screen that incorporates all of those elements.

The implementation would look at the bounding box of each of those elements, generate a new box that wraps all of them, inject an absolutely positioned box of that size and take the screenshot of that area.

It could even optionally add some padding to that box before taking the shot.
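
The box arithmetic can be sketched on its own, independent of the browser. This is a rough illustration under stated assumptions: each box is a dict with x, y, width and height keys (the shape Playwright uses for bounding boxes), and the function name union_box is hypothetical.

```python
from typing import Dict, Iterable

Box = Dict[str, float]  # keys: x, y, width, height


def union_box(boxes: Iterable[Box], padding: float = 0) -> Box:
    """Return the smallest box containing every input box, plus padding."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    # The union spans from the smallest left/top edge
    # to the largest right/bottom edge.
    left = min(b["x"] for b in boxes) - padding
    top = min(b["y"] for b in boxes) - padding
    right = max(b["x"] + b["width"] for b in boxes) + padding
    bottom = max(b["y"] + b["height"] for b in boxes) + padding
    return {"x": left, "y": top, "width": right - left, "height": bottom - top}
```

The resulting box could then be used either to size the injected wrapper element or as a clip region for the screenshot; which of those is preferable is left open here.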

Use `prefers-reduced-motion`

Suggestion from Ben Pickles: https://twitter.com/benpickles/status/1507391958343471111

@simonw I’m using headless Chrome in an HTTP screenshot service, my favourite feature is being able to force prefers-reduced-motion=reduce mode which can be used as a standards-based hint to prevent animations (useful when taking a screenshot)

Playwright documentation: https://playwright.dev/python/docs/api/class-browser#browser-new-context-option-reduced-motion

browser.new_context(..., reduced_motion="reduce")

Do nothing if `shots.yml` contains no data

So that you can comment out everything in that file and not see this error:

% shot-scraper multi empty.yml
...
    for shot in shots:
TypeError: 'NoneType' object is not iterable
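
The error arises because a YAML file containing only comments parses to None rather than to an empty list. A minimal guard might look like the sketch below; normalize_shots is a hypothetical name, and the function takes the already-parsed value so that it does not depend on any particular YAML library.

```python
from typing import Any, List


def normalize_shots(loaded: Any) -> List[dict]:
    """Turn the parsed contents of a shots file into a list that is safe to loop over."""
    # An empty or fully commented-out YAML file parses to None.
    if loaded is None:
        return []
    # Anything other than a list is a configuration mistake worth reporting.
    if not isinstance(loaded, list):
        raise ValueError("shots file must contain a list of shots")
    return loaded
```

Looping over normalize_shots(None) simply does nothing, so an empty shots.yml would exit quietly instead of raising a TypeError.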

Design and implement selector shots

Goal here is to be able to take a screenshot of a specific component of the page, identified by a CSS selector.

Should work for both multi and single shot mode.

More options for `shot-scraper pdf`

  • --format - one of Letter, Legal, Tabloid, Ledger, A0, A1, A2, A3, A4, A5, A6
  • --print-background - include backgrounds
  • --scale - from 0.1 to 2
  • --width - measurement with units e.g. 10cm
  • --height - ditto

Split from:

Design the CLI interface

I'd like it to be able to do a few different things, so I'm going to break it up into sub-commands. Need to design those first.

Hanging issue with multi-shots

Cool project, thanks

I was able to get DrudgeReport and ZeroHedge, captured both

Then I added LA Times using the sample code provided in the readme, however it hung and needed to be killed (would have been 3 shots)

I rearranged the order and it would still hang after two shots

Removed the first two and it worked, getting 3 shots. But capturing a long list of shots does not seem to work consistently.

Not sure how to debug, but feel free to review any logs
https://github.com/coding-to-music/page-snapshot-github-actions-shot-scraper/actions

Idea: JavaScript macros in the YAML file

This is just a harebrained idea at the moment: what if the YAML could define JavaScript that could be applied to all of the shots in the file, with customizations for each shot?

Something like this:

macros:
- javascript: |
    Array.from(
      document.querySelectorAll(
        ShotScraper.inputs.toRemove
      )
    ).forEach(el => el.style.display='none')
shots:
- url: https://site-one.com/
  output: one.png
  inputs:
    toRemove: "#ad,#footer"
- url: https://site-two.com/
  output: two.png
  inputs:
    toRemove: "#adbanner"

Option for scripted authentication

The auth command could take the standard --javascript option.

It could also take a --headless option - which causes it to open the page in regular headless mode, execute that JavaScript and then close the page again.

This would allow you to automate logins on pages that support it:

shot-scraper auth https://site.local/ auth.json --headless --javascript "
document.getElementById('username').value = 'username';
document.getElementById('password').value = 'password';
document.forms[0].submit();
"

Originally posted by @simonw in #18 (comment)

YAML configuration for PDF shots

I added PDF support here:

But there's currently no way to script PDFs in the YAML syntax with shot-scraper multi. This would be really useful, especially as the PDF generation grows even more options:

Issue on Raspberry PI

Hello,

I got this error:

Traceback (most recent call last):
  File "/usr/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper==0.10', 'console_scripts', 'shot-scraper')())
  File "/usr/bin/shot-scraper", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/lib/python3.9/importlib/metadata.py", line 524, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/lib/python3.9/importlib/metadata.py", line 187, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: shot-scraper

Is it not working on the Pi?

thanks
Carsten

Initial prototype

The initial prototype of this will involve a YAML file that defines multiple screenshots to be taken, and a GitHub Actions workflow that takes those screenshots.

`shot-scraper multi` should be able to ignore errors and keep on running

Describe the bug
When running shot-scraper multi shots.yml if one of the sites times out, the cli stops at that site instead of continuing on

shots.yml

- output: w3c.org.png
  url: https://www.w3.org/
- output: cnn.com.png
  url: http://www.cnn.com/
- output: example.com.png
  url: http://www.example.com/
❯ shot-scraper multi shots.yml
Screenshot of 'https://www.w3.org/' written to 'w3c.org.png'
Traceback (most recent call last):
  File "/Users/ryan/.local/bin/shot-scraper", line 8, in <module>
    sys.exit(cli())
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/shot_scraper/cli.py", line 208, in multi
    take_shot(context, shot)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/shot_scraper/cli.py", line 469, in take_shot
    page.goto(url)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 7345, in goto
    self._sync(
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 88, in _sync
    return task.result()
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_page.py", line 487, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 122, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/ryan/.local/pipx/venvs/shot-scraper/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "http://www.cnn.com/", waiting until "load"
============================================================

To Reproduce
Steps to reproduce the behavior:

  1. Create a file shots.yml
- output: w3c.org.png
  url: https://www.w3.org/
- output: cnn.com.png
  url: http://www.cnn.com/
- output: example.com.png
  url: http://www.example.com/
  2. Run shot-scraper multi shots.yml
  3. See that the second site causes an error and the tool does not move on to the third site

Expected behavior
If a timeout occurs, the tool should report that an error has occurred and move on to the next site to process, something like this:

Screenshot of 'https://www.w3.org/' written to 'w3c.org.png'
Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "http://www.cnn.com/", waiting until "load"
============================================================
Screenshot of 'http://www.cnn.com/' written to 'cnn.com.png'
Screenshot of 'http://www.example.com/' written to 'example.com.png'

Additional context
This may be a duplicate of #48, or perhaps just related
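
The requested behavior amounts to wrapping each shot in its own error handler. The sketch below shows the general shape only; run_all, the take_shot callable passed in, and the report callback are hypothetical stand-ins, not the tool's real internals.

```python
from typing import Callable, Iterable, List, Tuple


def run_all(
    shots: Iterable[dict],
    take_shot: Callable[[dict], None],
    report: Callable[[str], None] = print,
) -> List[Tuple[dict, Exception]]:
    """Attempt every shot, reporting failures instead of stopping at the first one."""
    failures: List[Tuple[dict, Exception]] = []
    for shot in shots:
        try:
            take_shot(shot)
        except Exception as ex:  # e.g. a navigation timeout
            # Report the problem and carry on with the next shot.
            report(f"Failed to capture {shot.get('url')}: {ex}")
            failures.append((shot, ex))
    return failures
```

Returning the list of failures lets the caller decide afterwards whether to exit with a non-zero status, which matters when the command runs in CI.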

Make `--timeout` error message prettier

Following #47:

 % shot-scraper https://www.cnn.com/ -h 800 --timeout 5
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 190, in shot
    shot = take_shot(context, shot, use_existing_page=use_existing_page)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 576, in take_shot
    page.screenshot(**screenshot_args)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 7886, in screenshot
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 611, in screenshot
    encoded_binary = await self._channel.send("screenshot", params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 5ms exceeded.

Mechanism for polling for new scraping updates without restarting Chromium

Had the idea in this Tweet

Yeah, if you want a high frequency you should absolutely run this on its own box

Might be value in supporting that directly in the tool, since it would save it having to launch a brand new Chromium instance each time if it stayed running

If you're scraping a frequently updating resource, it would be neat if you didn't have to instantiate an entirely new Chromium instance every time you ran the scraper.

Extract examples from YAML into shell script

To make these more convenient to run locally:

- name: Generate examples
  run: |
    mkdir -p examples
    shot-scraper https://github.com/ -o examples/github.com.png
    shot-scraper https://simonwillison.net/ -s '#bighead' -o examples/bighead.png
    shot-scraper https://simonwillison.net/ -s '#bighead' \
      --javascript "document.body.style.backgroundColor = 'pink';" \
      -o examples/bighead-pink.png
    shot-scraper https://simonwillison.net/ -w 400 -h 800 -o examples/simon-narrow.png
    shot-scraper https://simonwillison.net/ \
      -h 800 -o examples/simonwillison-quality-80.jpg --quality 80
    shot-scraper 'https://www.owlsnearme.com/?place=127871' \
      --selector 'section.secondary' \
      -o examples/owlsnearme-wait.jpg \
      --wait 2000
    shot-scraper accessibility https://datasette.io/ \
      > examples/datasette-accessibility.json
    shot-scraper accessibility https://simonwillison.net \
      --javascript "document.getElementById('wrapper').style.display='none'" \
      > examples/simonwillison-accessibility-javascript.json
    shot-scraper accessibility https://simonwillison.net \
      --javascript "document.getElementById('wrapper').style.display='none'" \
      --output examples/simonwillison-accessibility-javascript-and-dash-output.json
    shot-scraper pdf https://datasette.io \
      --landscape -o examples/datasette-landscape.pdf
    # And using multi
    echo '
    - output: examples/example.com.png
      url: http://www.example.com/
    - output: examples/w3c.org.png
      url: https://www.w3.org/
    - output: examples/bighead-from-multi.png
      url: https://simonwillison.net/
      selector: "#bighead"
    - output: examples/bighead-pink-from-multi.png
      url: https://simonwillison.net/
      selector: "#bighead"
      javascript: |
        document.body.style.backgroundColor = "pink";
    - output: examples/simon-narrow-from-multi.png
      url: https://simonwillison.net/
      width: 400
      height: 800
    - output: examples/simon-quality-80-from-multi.png
      url: https://simonwillison.net/
      height: 800
      quality: 80
    ' | shot-scraper multi -

`--wait` option to delay before taking the screenshot

While working on #14 I found that all images hadn't yet loaded when a screenshot was taken, so I hacked around it like this:

shot-scraper 'https://www.owlsnearme.com/?place=127871' \
  --selector 'section.secondary' \
  --javascript 'new Promise(resolve => setTimeout(resolve, 2000))' \
  -o owls.jpg

`shot-scraper shot --interactive`

Inspired by how auth works:

It would be cool if you could take a shot where the browser window and suchlike are set on the command-line, but the browser then opens such that you can further interact with the site before the image is taken. Then hit <enter> to take and save the image.
