wayback's Introduction

wayback

Wayback is a Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages).

The Internet Archive maintains an official “internetarchive” Python package, but it does not focus on the Wayback Machine. Instead, it is mainly concerned with the APIs and tools for managing the Internet Archive as a whole: items and collections, which are how e-books, audio recordings, movies, and other content in the Internet Archive are organized. It doesn’t, however, provide particularly good tools for finding or loading historical captures of specific URLs (i.e. the part of the Internet Archive called the “Wayback Machine”). That’s what this package does.

Installation & Basic Usage

Install via pip on the command line:

$ pip install wayback

Then, in a Python script, import it and create a client:

import wayback
client = wayback.WaybackClient()

Finally, search for all the mementos of nasa.gov before 1999 and download them:

from datetime import date

for record in client.search('http://nasa.gov', to_date=date(1999, 1, 1)):
    memento = client.get_memento(record)

Read the full documentation at https://wayback.readthedocs.io/ for a more in-depth tutorial and a complete API reference.

Code of Conduct

This repository falls under EDGI’s Code of Conduct. Please take a moment to review it before commenting on or creating issues and pull requests.

Contributors

Thanks to the following people for their contributions and help on this package! See our contributing guidelines to find out how you can help.

License & Copyright

Copyright (C) 2019-2023 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the 3-Clause BSD License. See the LICENSE file for details.

wayback's Issues

Add custom error for rate limit issues?

During some recent testing, I ran into some HTTP 429 (too many requests) responses in WaybackClient.search() calls. At the moment, we wrap any HTTP error in search in a generic WaybackException:

wayback/wayback/_client.py

Lines 565 to 567 in b502236

    response.raise_for_status()
except requests.exceptions.HTTPError as error:
    raise WaybackException(str(error))

But since 429 responses often come with a Retry-After header, it might be useful to have a custom error for this case. Or at least make sure the response object is included as an attribute on the error.
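
For illustration, such an error might look something like this (the name and exact shape here are just a sketch, not settled API):

class RateLimitError(WaybackException):
    """Hypothetical error for HTTP 429 responses from Wayback."""
    def __init__(self, response):
        # Keep the response so callers can inspect it, per the suggestion above.
        self.response = response
        self.retry_after = response.headers.get('Retry-After')
        super().__init__(f'Rate limit exceeded; retry after: {self.retry_after}')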

Add useful project links to PyPI

I ran across this very useful post from Simon Willison a while back, and keep forgetting to implement it: Adding project links to PyPI

Specifically, I think it would be useful to add links for:

  • Docs
  • Source (even though it’s redundant w/r/t “home page,” linking it with this name provides a clearer idea of what someone is getting since this link tends to go to a different kind of target across various Python packages)
  • Issues
  • Changelog/history
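
In setup.py, those links might look something like this (the docs URL is real; the other URLs are assumptions about where things live):

from setuptools import setup

setup(
    name='wayback',
    # ...existing metadata...
    project_urls={
        'Documentation': 'https://wayback.readthedocs.io/',
        'Source': 'https://github.com/edgi-govdata-archiving/wayback',
        'Issues': 'https://github.com/edgi-govdata-archiving/wayback/issues',
        'Changelog': 'https://wayback.readthedocs.io/en/stable/release-history.html',
    },
)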

Searching with datetime.date instead of datetime.datetime

I noticed that when supplying a from_date and to_date as a datetime.date instead of a datetime.datetime that the search results are not time-boxed correctly. This is because a datetime.date doesn't get formatted correctly when it is sent to the CDX API.
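
A sketch of one possible fix: normalize plain dates to datetimes before formatting for the CDX API (the function name is hypothetical):

from datetime import date, datetime

def format_cdx_timestamp(value):
    # datetime is a subclass of date, so check for datetime first.
    if isinstance(value, date) and not isinstance(value, datetime):
        value = datetime(value.year, value.month, value.day)
    return value.strftime('%Y%m%d%H%M%S')

format_cdx_timestamp(date(1999, 1, 1))  # '19990101000000'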

Use new CircleCI matrix syntax

CircleCI has a new feature called “matrix builds” which lets you use job parameters to list different setup combinations (like N Python versions × M operating systems) to test. See more in this quick blog post: https://circleci.com/blog/circleci-matrix-jobs/

In #28, we added tests against multiple Python versions. We should clean up the configuration to use the new matrix syntax, which should be a little clearer and definitely more concise.

Implement CDX search based on newer `timemap` CDX API

From a conversation on the Internet Archive’s Research Slack today:

kenji
Igor http://spacex.com/robots.txt has Disallow: /includes/ and http://web.archive.org/cdx/search still honors robots.txt exclusion (because it’s served by older wayback machine), while playback ignores robots.txt (served by new wayback machine).

http://web.archive.org/web/timemap/cdx?url=www.spacex.com&matchType=domain&gzip=false&filter=statuscode:200&to=20041229235959 will give you more results, including those under /include/ path. /web/timemap/cdx is served by new wayback.

I’m sorry for the confusing, inconsistent results - we’re trying to migrate all services to new wayback

oh btw, a tip: to=2004 will be interpreted as 20041231235959 (if you’re not excluding day 30 and 31 on purpose 😄) (edited)

Igor
kenji Thank you!

mr0grog
Oh, I did not know about /web/timemap/cdx as opposed to just /cdx/search/cdx. Should I be using the former instead of the latter?

kenji
/web/timemap/cdx is better functionality-wise, but it’s slower than /cdx/search. So I’d suggest /cdx/search as long as it works ok for your purpose.

mr0grog
ah, ok
Will need to consider which is the right path. Is there anything that documents the functional differences? e.g. the robots.txt issue would be a hard one to discover

Do you have a rough sense of how much slower /web/timemap/cdx is?

kenji
I don’t have good benchmark result (it’s nice to have), but I find /web/timemap/cdx 10-20% slower for matchType=exact query. matchType=domain can be much slower.

We need to look into whether we should switch to /web/timemap/cdx.

Give Memento objects a nicer repr

Calling repr() with a Memento object gets you a not-very-useful representation:

<wayback._models.Memento object at 0x102447970>

Instead, this should include some more info (at least the URL and timestamp?) and not include _models since users shouldn’t be importing that.

For example:

<wayback.Memento url="https://www3.epa.gov/" timestamp="20221001000000">
<wayback.Memento url="https://www3.epa.gov/" timestamp="2022-10-01T00:00:00Z">
<wayback.Memento "https://www3.epa.gov/" at 2022-10-01T00:00:00Z>

Or something along those lines.

This should just involve adding a __repr__(self) method to Memento.
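
A minimal sketch, assuming Memento exposes url and timestamp (as a datetime) attributes:

def __repr__(self):
    return (f'<wayback.Memento url={self.url!r} '
            f'timestamp={self.timestamp:%Y-%m-%dT%H:%M:%SZ}>')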

`Memento.history` should only list responses that were actual mementos

This was spawned by an issue in web-monitoring-processing: edgi-govdata-archiving/web-monitoring-processing#565 (review)

In the value returned from WaybackClient.get_memento(), we include a custom history list that looks a lot like the one requests normally creates. (We have to make it custom because we have crazy complex redirect logic.) That history includes every redirect response we followed to get the final memento. However… it might make more sense if this list only included the actual mementos.

Take this URL, for example:

from wayback import WaybackClient

wb = WaybackClient()
memento = wb.get_memento('http://web.archive.org/web/20200327033521id_/https://www.bia.gov/WhoWeAre/BIA/OTS/NaturalResources/FishWildlifeRec/index.htm')

for item in memento.history:
    print(item.url)
print(memento.url)

>>> http://web.archive.org/web/20200327033521id_/https://www.bia.gov/WhoWeAre/BIA/OTS/NaturalResources/FishWildlifeRec/index.htm
>>> http://web.archive.org/web/20200327033521id_/https://www.bia.gov/bia/ots/division-natural-resources/branch-fish-wildlife-recreation
>>> http://web.archive.org/web/20200318032153id_/https://www.bia.gov/bia/ots/division-natural-resources/branch-fish-wildlife-recreation

The first HTTP response is a memento of a redirect to https://www.bia.gov/bia/ots/division-natural-resources/branch-fish-wildlife-recreation.

The second HTTP response, however, is not a memento! It’s the Wayback machine saying “actually, the closest memento of the redirected URL is from timestamp 20200318032153.” Basically, when you load a memento of a redirect, Wayback issues a redirect to the memento’s target at the same time as the memento. However, the target was usually snapshotted slightly later (when Wayback’s crawler was capturing things, the redirect target landed on a crawler’s queue, and the crawler finally got to it seconds, minutes, or hours later).

Then the third response is the actual memento of the original redirect’s target.

I think it might be reasonable to only include the responses that were actually mementos in the history — so in the above example, history would only have one entry, not two. (This fits well with #2, where we no longer want to surface the actual HTTP response objects.)

For diagnostic purposes, we might still want something equivalent to the current value of history (maybe http_history or debug_history?), but it could just be the URLs instead of whole response objects.

Async support

When dealing with web requests, libraries can gain a lot from being able to do other work while waiting on the network.

Basically, the easiest approach would be to split the actual logic and the requests into separate functions, and move the logic to a base class.
From that, we would have one subclass for sync operation and one for async.
They would mostly differ in that the async one has async and await in front of definitions and calls, respectively.

To give an implementation idea:

class WaybackClientLogic(_utils.DepthCountedContext):
    def _calculate_final_query(…):
        ...

    def _postprocess_response(…):
        ...

class WaybackSyncClient(WaybackClientLogic):
    def search(…):
        final_query = self._calculate_final_query(…)
        response = self.session.request('GET', CDX_SEARCH_URL, params=final_query)
        return self._postprocess_response(response)

WaybackClient = WaybackSyncClient  # Keep compatibility with old imports.

class WaybackAsyncClient(WaybackClientLogic):
    async def search(…):
        final_query = self._calculate_final_query(…)
        response = await self.session.request('GET', CDX_SEARCH_URL, params=final_query)
        return self._postprocess_response(response)

Resolve Sphinx mismatch between requirements-dev.txt and readthedocs.org

Yesterday I went ahead and merged #31 to “fix” our docs — but it turns out I just created double colons in practice:

The problem here is that requirements-dev.txt does not specify acceptable version ranges for any of our dev dependencies. That means that if somebody follows the install instructions, they wind up with Sphinx v2.x. However, readthedocs.org will always choose the latest of Sphinx v1.8.x if you don’t specify an acceptable version range (see https://docs.readthedocs.io/en/stable/intro/import-guide.html#building-your-documentation). That means building docs in development does not match how they build with readthedocs.org.

(In general, we should probably be providing version ranges for all our dependencies, and pinned versions for our dev dependencies.)
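
For example, requirements-dev.txt could bound Sphinx to the same major version readthedocs.org uses (the exact bounds here are illustrative):

sphinx >=1.8.0,<2.0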

Memento links should be in same mode as Memento

In #108, I added a link property to Memento objects with parsed data from the Link HTTP headers of mementos. However, the links to other mementos in that data turn out to always be in view mode, regardless of the mode of the memento you requested!

For example:

from wayback import WaybackClient, Mode
client = WaybackClient()

memento = client.get_memento('https://epa.gov/', '20230210003633')

# Memento is in original mode:
memento.mode == Mode.original.value
# But the links are not:
memento.links == {
    'original': {
        'url': 'https://www.epa.gov/',
        'rel': 'original'
    },
    'timemap': {
        'url': 'https://web.archive.org/web/timemap/link/https://www.epa.gov/',
        'rel': 'timemap',
        'type': 'application/link-format'
    },
    'first memento': {
        # This URL is in `view` mode, not `original`!
        'url': 'https://web.archive.org/web/19970418120600/http://www.epa.gov:80/',
        'rel': 'first memento',
        'datetime': 'Fri, 18 Apr 1997 12:06:00 GMT'
    },
    # ...more links cut for brevity...
}

The suggested use for these links is to pass them directly to get_memento(), but that might get you a memento in a different mode than you expect! It’s a footgun.

Some options here:

  1. Drop the links attribute on Memento for now. Users can parse the Link header(s) themselves if they want it, and are responsible for using them appropriately. (In this case, we also need to reopen #57.)

  2. Update the url field on any link that references a memento to match the mode of the Memento object they are attached to.

    Side note: how do we identify which things are mementos? Look for "memento" as a substring in the rel field? Look for url fields that match known memento URL patterns?

  3. Instead of the values in links being dictionaries, make them some more useful data object. References to other mementos might be more like our CdxRecord objects, where the url is the captured URL (e.g. http://www.epa.gov/ instead of the memento URL), the timestamp is a datetime object, etc.

    • This one’s pretty complicated! It’s how I envisioned this feature might evolve, but isn’t obviously worthwhile in the short term.
    • I don’t know the complete universe of possible object types (it’s not just mementos, see the first two entries in the example above) and technically what goes here is pretty arbitrary. How do we future-proof things we haven’t modeled yet?

I think (3) has too many open questions, but we should do (1) or (2) before cutting a 0.4.1 release.
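
For (2), the rewrite might look roughly like this (the helper name and approach are hypothetical; it assumes memento URLs follow the https://web.archive.org/web/<timestamp><mode>/<url> pattern):

import re

def rewrite_link_mode(link, mode):
    match = re.match(r'(https?://web\.archive\.org/web/\d{14})\w*(/.*)$', link['url'])
    if match:
        return dict(link, url=f'{match[1]}{mode}{match[2]}')
    return link  # Not a memento URL (e.g. a timemap link); leave it alone.

This also suggests an answer to the side note under (2): matching known memento URL patterns identifies which links to rewrite.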

Multiple threads

Hello,

I would like to be able to check multiple domains at the same time. Is it okay to use multithreading?

Add type annotations & type checking

This is not high priority, but I think it might be useful to add type annotations here. Since this is a library, users might appreciate having typing information available (especially now that we have our own return types for everything, like Memento, rather than returning values directly from requests).

As we add annotations, we should also add type checking to CI, probably with Mypy or Pyright.
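
For example, annotated signatures might look something like this (the types shown are assumptions, not a settled design):

from datetime import datetime as _datetime
from typing import Iterator, Optional, Union

class WaybackClient:
    def search(self, url: str, **kwargs) -> Iterator['CdxRecord']:
        ...

    def get_memento(self, url: Union[str, 'CdxRecord'],
                    datetime: Optional[Union[_datetime, str]] = None,
                    **kwargs) -> 'Memento':
        ...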

Wayback redirects without scheme + domain don’t work

New bug in v0.3.0a1:

Some Wayback redirects use a Location: header with a scheme and domain, e.g.:

Location: http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs

But others don’t, e.g.:

Location: /web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs

The latter will cause Wayback v0.3.0a1 to fail when trying to parse the headers:

>>> import wayback
>>> c = wayback.WaybackClient()
>>> c.get_memento('http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_client.py", line 724, in get_memento
    headers=Memento.parse_memento_headers(response.headers),
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_models.py", line 285, in parse_memento_headers
    headers['Location'], _, _ = memento_url_data(raw_headers['Location'])
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_utils.py", line 122, in memento_url_data
    raise ValueError(f'"{memento_url}" is not a memento URL')
ValueError: "/web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs" is not a memento URL

Ensure support for Python 3.12

This package mostly works for Python 3.12 (testing with 3.12.0rc2, which Python Core expects to be the final release candidate), but has some tooling issues:

  • Versioneer does not work (#122)
  • Flake8 does not work (they added support in v6.1.0 which is a major release ahead of where we are) (#124)
  • Other?
    • Setuptools no longer bundled with Python (#125)

I aim to get all this cleared up this week and cut a release with support on Monday, September 25, 2023, since the planned final release date for Python 3.12.0 is two weeks from today (Monday, October 2, 2023).

`get_memento()` should have `follow_redirects` parameter

Currently, calling WaybackClient.get_memento() follows redirects in mementos. That is, if the memento was of a redirect, we continue on to find the nearest-in-time memento of whatever URL was redirected to. It’s not possible to get the actual memento of the requested URL in this case.

We should add a follow_redirects boolean parameter to get_memento that controls this behavior. I’d suggest it defaults to True.
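
Proposed usage (this parameter does not exist yet):

# Get the memento of the requested URL itself, even if it was a redirect:
memento = client.get_memento('https://www.epa.gov/', '20230101000000',
                             follow_redirects=False)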

ValueError: time data

I happened to be doing this:

from wayback import WaybackClient

ia = WaybackClient()
for result in ia.search('lapdonline.org', matchType='prefix'):
    print(result)

and noticed that after running for 10 minutes or so it blew up with:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 543, in search
    capture_time = _utils.parse_timestamp(data.timestamp)
  File "/Users/edsummers/Projects/wayback/wayback/_utils.py", line 57, in parse_timestamp
    .strptime(''.join(timestamp_chars), URL_DATE_FORMAT)
  File "/usr/local/Cellar/python@3.10/3.10.6_1/Frameworks/Python.framework/Versi
ons/3.10/lib/python3.10/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/[email protected]/3.10.6_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '20000008241731' does not match format '%Y%m%d%H%M%S'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/./x.py", line 6, in <module>
    for result in ia.search('lapdonline.org', matchType='prefix'):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 547, in search
    raise UnexpectedResponseFormat(
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "org,lapdonline)/community/op_valley_bureau/north_hollywood/map/map.htm 20000008241731 http://www.lapdonline.org:80/community/op_valley_bureau/north_hollywood/map/map.htm text/html 200 2GPKQMU3BLZXOEZ5EWDQEYHPMKWEHNT3 1158" (query: {'url': 'lapdonline.org', 'matchType': 'prefix', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

It looks like the CDX API returned a datetime 20000008241731 which throws an exception during parse because 00 isn't a valid month?

I don't know what the solution is here:

  • ignore the record?
  • see if the new CDX API is better behaved and switch to it?
  • something else?
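
If we go with ignoring the record, the fix might look something like this inside search() (a sketch based on the traceback above, not the actual code):

try:
    capture_time = _utils.parse_timestamp(data.timestamp)
except ValueError:
    # Malformed timestamp from the CDX server (e.g. month '00'); skip the record.
    continue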

Some original headers are getting lost

It looks like something has changed about either Requests or the Wayback Machine, and we are no longer including all the original archived headers in a Memento object’s headers property. For example:

from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('https://robbrackett.com/', datetime='20220315020402')
memento.headers
# {'Content-Type': 'text/html'}

But the value of memento.headers should really be something like:

{'date': 'Tue, 15 Mar 2022 02:04:02 GMT', 'server': 'Apache', 'upgrade': 'h2,h2c', 'connection': 'Upgrade, Keep-Alive', 'last-modified': 'Mon, 30 Nov 2020 22:51:03 GMT', 'accept-ranges': 'bytes', 'content-length': '13182', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=15, max=768', 'Content-Type': 'text/html'}

(Based on https://web.archive.org/web/20220315020402id_/http://robbrackett.com/)

Expose and document CdxRecord

WaybackClient.search() yields instances of CdxRecord, which makes it a public API:

wayback/wayback/_client.py

Lines 101 to 114 in 800608f

CdxRecord = namedtuple('CdxRecord', (
    # Raw CDX values
    'key',
    'timestamp',
    'url',
    'mime_type',
    'status_code',
    'digest',
    'length',
    # Synthesized values
    'date',
    'raw_url',
    'view_url'
))

We should at least document it, and probably (?) also expose it publicly.

I don’t think this is required for the 0.2 release.

Pool connections across threads and/or make WaybackClient thread-safe

This is a current major goal, but we didn’t have an issue tracking it!

This package should be thread-safe, but because it is based on Requests, we can’t guarantee that. Requests’s authors won’t guarantee it either: at various times they have expressed everything from certainty that it’s not thread-safe to cautious optimism that it might be, but overall they don’t plan to go out of their way to verify thread safety (and then maintain that status).

Thread safety is important for EDGI’s use: we need to pull lots of data and so need to make get_memento() calls concurrent. Connections also need to be pooled and shared across threads for stability (when EDGI implemented some hacks to do this, speed and reliability both improved considerably). In any case, other EDGI codebases implement some pretty nutty workarounds to make that relatively safe. That shouldn’t be necessary, and people should just be able to use a single WaybackClient across multiple threads.
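
In other words, the goal is for code like this to be safe (memento_urls here is hypothetical example input):

from concurrent.futures import ThreadPoolExecutor
from wayback import WaybackClient

memento_urls = ['https://web.archive.org/web/20230101000000id_/https://www.epa.gov/']
client = WaybackClient()  # One client, shared by every worker thread.
with ThreadPoolExecutor(max_workers=8) as pool:
    mementos = list(pool.map(client.get_memento, memento_urls))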

Ideally, we should switch to a different HTTP client instead of Requests. Two options:

  1. urllib3 is thread-safe and is what Requests is based on, so it should be a reliable switch. Its API is very different, though, so there’s lots to update and test.

  2. httpx is new, has an API that matches Requests, and has lots of fancy features. It also has async support in case we ever want to make this package async-capable in the future. However, it’s still in beta (should be final before the end of the year), and may possibly have some funny under-the-hood differences we’ll need to account for. We’d at least need to figure out how to re-implement our crazy hack for Wayback’s gzip issues.


Some other approaches worth keeping in the back pocket, but that probably aren’t ideal:

  • A sketched out implementation in #23 makes a funky abstraction that lets it appear as though you are using a single WaybackClient on multiple threads, when in fact you are using several. It’s clever, but also a little hacky and probably has a lot of messy corner cases as a result. I don’t feel great about it.

  • We could implement EDGI’s workaround from web-monitoring-processing under the hood so that all connections across all clients are pooled. This makes things magically seem to work even if you create separate clients on each thread, but it’s probably unexpected behavior. If a user was actually trying to isolate sets of connections, this would get in their way (and do it in a silent way so they might not even know). It also depends on a small part of Requests staying as thread-safe as it currently is, and there are no guarantees there.

Add `MementoOutsideWindow` Exception

Running into and fixing #53 reminded me that we don’t have a specific error for the situation where a memento can be served, but from a time that is beyond the specified target_window parameter. Instead, we currently just raise a fairly generic MementoPlaybackError.

We also have a TODO in the code that references this:

# TODO: split this up into a family of more specific errors? When playback
# failed partway into a redirect chain, when a redirect goes outside
# redirect_target_window, when a memento was circular?
class MementoPlaybackError(WaybackException):
"""
Raised when a Memento can't be 'played back' (loaded) by the Wayback
Machine for some reason. This is a server-side issue, not a problem in
parsing data from Wayback.
"""

I’m thinking MementoOutsideWindow, MementoOutOfWindow, or MementoOutOfRange are good candidates for names here. “Range” feels like more traditional wording, but doesn’t clearly match up with the parameter that causes it, which is “target_window.” We could also change the parameter name to “target_range.” ¯\_(ツ)_/¯

Regardless of the name, it should be a subclass of MementoPlaybackError.
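
Whatever name wins, the class itself should be simple. A minimal sketch:

class MementoOutsideWindow(MementoPlaybackError):
    """
    Raised when the closest available memento is from a time further from the
    requested time than the `target_window` parameter allows.
    """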

Should get_memento() ignore the mode in archive.org URLs?

Currently, get_memento() can be called in a few different ways:

  • get_memento(archived_url) requests a memento using the URL, timestamp, and mode that are baked into the URL. (archived_url means a URL like https://web.archive.org/web/[YYYYMMDDHHmmss][mode]/[url])
  • get_memento(cdx_record, mode=mode) requests a memento with the URL and timestamp from the CDX record object, and the given mode (where mode defaults to original)
  • get_memento(url, timestamp, mode=mode) requests a memento of the given URL at the given timestamp with the given mode (again, mode is optional and defaults to original)

Folks using this library will usually want mode=Mode.original, which is what we typically do by default. BUT since an archive URL has the mode baked in, we obey whatever mode was in the URL.

The problem is that mode as a concept is a little advanced and requires extra thinking about what you’re asking for. Folks are prone to copying a URL from their browser and dropping it in here to try things out, or accidentally using cdx_record.view_url instead of just passing the CDX record directly without realizing that they are changing modes (or what that even means!). For example, #109 uncovered a legitimate issue with view mode, but the user didn’t actually want to be using view mode at all! (Once I explained that, it turned out the actual issue wasn’t even a blocker for him — he switched to original mode and was good to go.)

So: should calling get_memento(archived_url) ignore the mode that’s in the URL and use whatever one is explicitly set as a parameter instead (as in all other cases, defaulting to original)? For example:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/")

Currently gets you a memento in view mode. The change I’m thinking about would mean you’d get original mode instead here. If you wanted view mode, you’d have to ask for it explicitly:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/", mode=Mode.view)

It would also mean all these calls get you the same result, instead of different ones:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000id_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000js_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000cs_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000im_/https://www.epa.gov/")
# Note different mode values ---------------------------------^^^

Custom timeouts

Right now the only way to specify timeouts is through the session property.
WaybackSession.request allows the timeout value to be set per request, but the WaybackClient.search and WaybackClient.get_memento functions do not support this.

Add keyword arguments so that, if wanted, the timeout can be overridden per function call.
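
Proposed usage (these keyword arguments do not exist yet):

client.search('epa.gov', timeout=30)  # Seconds, as in WaybackSession.request.
memento = client.get_memento('https://www.epa.gov/', '20230101000000', timeout=60)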

ClosedConnectionError & rate limiting

I apologize for the slight abuse of the term "Issues", as I don't think the problem I'm encountering is a true issue with your project.

While using wayback, I've run into issues with the connection being closed by the remote host. I've been performing a lot of search requests/pulling mementos, and suspect I'm hitting a rate limit. However, I have put a large delay between queries (5ish seconds).

Is there a best practice on how much we should throttle usage, and are there other things that we should do beyond just looping over all our searches with a time.sleep call to avoid slamming the server?

Consider filtering out repeated results in CDX search

I’m not sure if this is a bug in the Internet Archive’s CDX search right now or if it’s something that was always possible and I just never noticed, but the CDX search endpoint is returning a lot of results where lines are repeated. For example, see the first 4 lines of this query: http://web.archive.org/cdx/search/cdx?url=energystar.gov/&from=20200612

gov,energystar)/ 20200612014007 http://energystar.gov/ text/html 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 432
gov,energystar)/ 20200612014007 http://energystar.gov/ text/html 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 432
gov,energystar)/ 20200612014014 https://www.energystar.gov/ warc/revisit - DBA3N454ZRZ42QEBER5QRT332DJTTLR5 606
gov,energystar)/ 20200612014014 https://www.energystar.gov/ warc/revisit - DBA3N454ZRZ42QEBER5QRT332DJTTLR5 606
gov,energystar)/ 20200612061543 http://energystar.gov/ text/html 301 HLNR6AWVWYCU3YAENY3HYHLIPNWN66X7 433
gov,energystar)/ 20200612061552 https://www.energystar.gov/ warc/revisit - DBA3N454ZRZ42QEBER5QRT332DJTTLR5 605
...

I’ve sent a message over their way to ask what’s up with that, but it seems like it might be worth adding code to filter these out. Even assuming this is accurate (i.e. there actually were two archives made of the same URL in the same second), I’m struggling to think of a scenario where getting both results is really useful.

Care to gut-check me on this, @danielballan?
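
If we do filter, a sketch of what it might look like while iterating CDX results (CdxRecord instances are namedtuples, so simple equality works):

def deduplicated(records):
    previous = None
    for record in records:
        if record == previous:
            continue  # Identical consecutive CDX line; drop the repeat.
        previous = record
        yield record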

Add exception for blocked sites in `search()`

Looking at @edsu’s very awesome COVID-19 notebook, it turns out CDX searches can return a special error for blocked sites, e.g. http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true

Just like we have a custom BlockedByRobotsError, we should have another error for this, rather than just raising a not-so-great HTTP error.

In this case, the response code is 403 and there is a header like:

X-Archive-Wayback-Runtime-Error: org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

(And the same text as the header in the response body.)

We can probably follow Wayback’s naming and call this AdministrativeAccessControlException or BlockedSiteError.

It might even make sense to generalize this for any 4xx/5xx response that has an X-Archive-Wayback-Runtime-Error header.
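
The generalized version might look something like this (the exception name here is hypothetical):

runtime_error = response.headers.get('X-Archive-Wayback-Runtime-Error')
if runtime_error and response.status_code >= 400:
    # e.g. 'org.archive...AdministrativeAccessControlException: Blocked Site Error'
    raise WaybackRuntimeError(runtime_error)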

Add a default `limit` to `WaybackClient.search()`

Search pagination does not always function correctly if the limit parameter is not set, so it might be a good idea to set a [high] default value.

For example, @edsu ran into this with a query like:

# Equivalent to:
# http://web.archive.org/cdx/search/cdx?matchType=prefix&showResumeKey=true&url=twitter.com/realDonaldTrump/status/
client.search('twitter.com/realDonaldTrump/status/', matchType='prefix')

There’s no resume key in the response. But there are definitely more records than are returned: if you add limit=100000 in there, you get a resume key, and can page through results, eventually getting far more than were returned in the query without limit.

# Equivalent to:
# http://web.archive.org/cdx/search/cdx?matchType=prefix&showResumeKey=true&limit=100000&url=twitter.com/realDonaldTrump/status/
client.search('twitter.com/realDonaldTrump/status/', matchType='prefix', limit=100_000)

I thought we tested this when originally writing the search code, so I’m not sure if something has changed/broken or if I’m just misremembering. Either way, the result is really unintuitive, so it might be a good idea to add a large default for limit instead of None (I’m thinking 500_000). The docs could potentially use some clarification here, too.

I’m also trying to get some insight from the Wayback team about whether this behavior is intended or a bug (in which case maybe no change is needed).

Align CdxRecord attribute names with names the server uses

Our CdxRecord objects have names that do not exactly align with the corresponding field names on the CDX server, or what you receive as column headers if asking for CDX search results in JSON format. We should allow access by those names in addition to the names we currently use (and maybe deprecate the current ones?).

  • key should be urlkey
  • mime_type should be mimetype
  • status_code should be statuscode
  • url should be original
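
One way to do that without breaking the existing names (a sketch): subclass the namedtuple and add properties for the server-style aliases:

from collections import namedtuple

_CdxRecordBase = namedtuple('CdxRecord', (
    'key', 'timestamp', 'url', 'mime_type', 'status_code', 'digest', 'length',
    'date', 'raw_url', 'view_url'))

class CdxRecord(_CdxRecordBase):
    # Aliases matching the CDX server's field names:
    urlkey = property(lambda self: self.key)
    mimetype = property(lambda self: self.mime_type)
    statuscode = property(lambda self: self.status_code)
    original = property(lambda self: self.url)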

Rethink rate limiting

WaybackClient.get_memento has left-over rate-limiting behavior from web-monitoring-processing:

wayback/wayback/_client.py

Lines 645 to 648 in f1cdb1d

with _utils.rate_limited(calls_per_second=30, group='get_memento'):
    # Correctly following redirects is actually pretty complicated. In
    # the simplest case, a memento is a simple web page, and that's
    # no problem. However...

From the perspective of this more generic module…

  • Should there be rate limiting at all?
  • If there should be, it should probably be optional and/or configurable: the rate shouldn’t be hard-coded, and should maybe be an option on the session or client constructor.
  • If there should be, should it apply across both get_memento and search (and whatever other methods WaybackClient might gain in the future)? Or maybe it should apply at the level of making a request to Wayback, rather than at the higher-level get_memento method?

/cc @danielballan


Updates

  • We should keep rate limiting.
  • It should be expanded to cover search.
  • search and get_memento should fall under separate rate limits that don’t interfere with each other.
  • The rate limit should be configurable (and should continue to apply across all client/session instances on all threads). The right API for this is not yet clear.
  • The default limit for CDX calls should be 1/second and for Memento calls should be 30/second.
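
Since the right API is not yet clear, here is just one illustrative shape it could take (these constructor parameters do not exist):

session = WaybackSession(search_calls_per_second=1,
                         memento_calls_per_second=30)
client = WaybackClient(session)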

Support other archives with Memento and CDX search APIs

There is a lot of hardcoded stuff in this package that is specific to the Internet Archive (i.e. archive.org), but there are other similar web archives that support the Memento API (standardized) and CDX search (not standardized, but common), such as webarchive.org.uk, or any project using pywb. It would be nice to support those here.

The most pleasant approach would be to be able to create a new WaybackClient with different URLs for the memento and search endpoints, but I think there’s a good chance there might be more complex differences, and there should maybe just be different client classes for the different well-known archives.
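
From the caller’s side, the “different URLs” approach might look like this (parameter names and endpoints are hypothetical):

client = WaybackClient(
    cdx_search_url='https://example-archive.org/cdx/search/cdx',
    memento_url_template='https://example-archive.org/web/{timestamp}{mode}/{url}')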

v0.3.0 Roadmap

The last major item blocking v0.3.0 has a PR open that I will probably merge later this evening: #64 / #67. Since there’s nothing else, I’m planning to publish v0.3.0b1 immediately after merging, and if that runs well in EDGI’s production workflow for a few days, then publish v0.3.0 final from the same commit. I’m currently targeting Friday, March 19th.

If we do uncover other issues, I’ll start noting them here.

Use CI to publish releases to PyPI

Right now, I manually build and publish releases on PyPI after they’ve been tagged. While I haven’t made any horrible mistakes yet, it would generally be better if we had a CI job that automatically built, tested, and published new tags to PyPI for us, so it’s harder to get some spurious uncommitted code in the package or publish something that doesn’t actually pass tests.

Support `datetime.timedelta` for `target_window` in `get_memento()`

The WaybackClient.get_memento() method has a target_window parameter, which is used when a memento does not exist at the requested time. When exact=False, get_memento() returns the nearest-in-time memento, but only if it is within the number of seconds specified by target_window.

wayback/wayback/_client.py

Lines 717 to 719 in c066e04

def get_memento(self, url, datetime=None, mode=Mode.original, *,
                exact=True, exact_redirects=None,
                target_window=24 * 60 * 60, follow_redirects=True):

Since Python has a built-in datetime.timedelta type to represent exactly this kind of concept, we should also accept it instead of an integer for target_window. (Really, I’m not sure why we didn’t do this originally!)
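
The normalization could be as simple as this (a sketch):

from datetime import timedelta

if isinstance(target_window, timedelta):
    target_window = target_window.total_seconds()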

v0.3.0: Requests API still visible via exceptions

Part of the goal of v0.3.0 is to cover over the requests API so it’s easier to swap it out. While requests objects are no longer involved in method calls or return values, we still surface exceptions from requests in a few spots:

  • WaybackSession.request() and WaybackSession.send() can raise exceptions from requests pertaining to network failures. Generally these are actually urllib3 exceptions that requests wraps, and we could do similarly.

  • WaybackClient.search() can raise when it calls WaybackSession.request() as above.

    wayback/wayback/_client.py

    Lines 482 to 483 in 7e9bccc

    response = self.session.request('GET', CDX_SEARCH_URL,
                                    params=final_query)

  • WaybackClient.get_memento() can raise when it calls WaybackSession.request() as above.

    response = self.session.request('GET', url, allow_redirects=False)

    And WaybackSession.send() as above.

    response = self.session.send(response.next, allow_redirects=False)

  • WaybackClient.get_memento() can raise when it encounters a non-memento response with a 400+ status code:

    wayback/wayback/_client.py

    Lines 772 to 790 in 7e9bccc

    if not playable:
        read_and_close(response)
        message = response.headers.get('X-Archive-Wayback-Runtime-Error', '')
        if (
            ('AdministrativeAccessControlException' in message) or
            ('URL has been excluded' in response.text)
        ):
            raise BlockedSiteError(f'{url} is blocked from access')
        elif (
            ('RobotAccessControlException' in message) or
            ('robots.txt' in response.text)
        ):
            raise BlockedByRobotsError(f'{url} is blocked by robots.txt')
        elif message:
            raise MementoPlaybackError(f'Memento at {url} could not be played: {message}')
        elif response.ok:
            raise MementoPlaybackError(f'Memento at {url} could not be played')
        else:
            response.raise_for_status()

The last one is the one I'm most concerned about. It's not great that network errors come from an underlying framework that might change, but that's not the worst thing in the world. However, we should never explicitly raise that kind of error ourselves. Instead, this should be a WaybackException or a MementoPlaybackError.

Additionally, there’s a special case we are not handling above: if the status is 404 on a non-memento, that means the URL has never been archived by the Wayback Machine, and we should raise a special exception for that case.
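
That special case might look something like this (the exception name is hypothetical):

class NoMementoError(WaybackException):
    """Raised when the requested URL has never been archived."""

# In get_memento(), before falling through to a generic error:
if response.status_code == 404:
    raise NoMementoError(f'{url} has never been archived by the Wayback Machine')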

Count method not working properly

I chose this library because I'm interested in seeing whether a certain string appears in a page over time. The count method gives values greater than 0 even though the string is not present in the snapshot.

Python 3.6?

I was trying to use wayback in a Colab Jupyter Notebook. It wouldn't install because wayback specifies 3.7 in its setup.py and Colab is using python v3.6.

I tested wayback under python3.6 and it seemed to work fine, so I was wondering if the version in setup.py could be relaxed a little bit?

Support original URL + date or CdxRecord instances as parameters for `get_memento`?

Currently, get_memento() requires the actual URL of a memento as its main parameter:

wayback_client.get_memento('https://web.archive.org/web/20190513000000/https://www.epa.gov/')

That’s fine for EDGI’s existing use cases because we always query CDX for a list of mementos to retrieve. However, it seems reasonable that people would also want to get the memento for a given web page nearest to a given time:

wayback_client.get_memento('https://www.epa.gov/',
                           date=datetime(2019, 5, 13),
                           type='raw',
                           exact=False)

It might also be a nice convenience to accept CdxRecord instances:

wayback_client.get_memento(some_cdx_record,
                           type='raw',
                           exact=False)

Some questions to solve here:

  • Would these be good to add?

  • Should they all be ways to call get_memento or do we need separate methods for [some] of them?

  • When getting a memento via these new approaches, we need some additional info: the type of memento to return. In memento URLs, you can add the following strings to the end of the timestamp to control the memento type:

    • `` (nothing) gets you a memento where all the subresource URLs are replaced with URLs for mementos (so you can reliably render it in a browser, for example) and that has extra UI from Wayback (custom header, timeline, etc).
    • id_ gets you the raw memento with no changes.
    • js_ gets you JavaScript with some extra comments indicating debug and copyright info.
    • cs_ gets you CSS with some extra comments indicating debug and copyright info.
    • im_ gets you an image, presumably with some kind of extras like the above (otherwise you’d just use id_). Not actually sure on the details here.
    • Only docs I’ve ever been able to find for this feature: http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode

    In the examples above, I’ve mocked this up as type='raw' (indicating the id_ type). Lots of options here as to how we treat this (just use the Wayback strings above, have some constants, use custom names like in my example, etc). For EDGI’s case, we always want id_, and I suspect that’s a reasonable default, but seems like it ought to be an option since I can certainly imagine other use cases.

  • Is exact=True still the right default? If requesting a memento URL (as we currently do) or a CdxRecord instance, that makes sense, but maybe not if you are asking for original URL + datetime.

/cc @danielballan

`Memento.url` property can be wrong if it is SURT-equivalent to the actual URL

If you request a memento URL with a SURT form that is equivalent to the memento’s actual URL, the url property of the resulting memento object is incorrect — it reflects the URL you requested, rather than the actual, captured URL.

For example:

from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('http://robbrackett.com/', datetime='20220315020402')
memento.url
# 'http://robbrackett.com/'
# But the actual capture was from:
# 'https://robbrackett.com/'

# The `link` header has the right info:
memento._raw_headers['link']
# '<https://robbrackett.com/>; rel="original", ...'

The right details are in the link header, and we should be parsing that. We’ve had a feature request to do that for a while (#57), but I hadn’t realized there was a bug like this that makes it necessary.

Should we add tooling for a search + load mementos pipeline?

We have a lot of code we left behind in web-monitoring-processing around knitting together the search and get_memento methods in a high-performance way across many threads. I think it’s good that we didn’t include any of that here to start with, but would it make sense to add some of that back in eventually?

i.e. Some tooling that supports the workflow:

┌──────────────────┐   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ┌─────────────────┐
│   Sequence of    │     ┌─────────────┐   ┌─────────────┐     │   Sequence of   │
│     URLs/URL     │──▶│ │   search    │──▶│ get_memento │ │──▶│    Mementos     │
│     Patterns     │     └─────────────┘   └─────────────┘     │                 │
└──────────────────┘   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   └─────────────────┘

It should be significantly more abstract than the way it is currently implemented in https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/cli.py, of course. 😉
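
As a starting point, the single-threaded core of such a pipeline is tiny (a sketch; the real tooling would add concurrency, batching, and error handling):

from wayback import WaybackClient

def load_mementos(url_patterns, **search_options):
    client = WaybackClient()
    for url in url_patterns:
        for record in client.search(url, **search_options):
            yield client.get_memento(record)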

Change WaybackSession’s user-agent string?

WaybackSession has inherited its default user-agent string from web-monitoring-db:

wayback/wayback/_client.py

Lines 292 to 295 in f1cdb1d

self.headers = {
    'User-Agent': user_agent or f'edgi.web_monitoring.WaybackClient/{__version__}',
    'Accept-Encoding': 'gzip, deflate'
}

That string might not match up so well with this more generic module anymore. Should we change it? If so, I’m thinking something like:

f'python-wayback.WaybackClient/{__version__}'

Any thoughts, @danielballan?

Mementos of redirects in view mode raise "could not be played" error

Hi, I'm getting an error from this code,

from wayback import WaybackClient, WaybackSession

wc = WaybackClient(session=WaybackSession(
    user_agent='agent-218947',
    timeout=10,
))
u = 'https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
memento = wc.get_memento(u, exact=False)

which fails with:

    raise MementoPlaybackError(f'Memento at {url} could not be played')
wayback.exceptions.MementoPlaybackError: Memento at https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/ could not be played

The comment in that section of the WaybackClient code states that this error should only occur if exact is True or if the target URL is outside the target_window. I don't think either of those apply because I'm setting exact to False and the target URL has the same timestamp:

original url / target url (both are 20230212225711)

Anyone know what might cause this?

CDX search results should have time zone information

CdxRecord.timestamp is a datetime object, but we never create them with time zone information:

wayback/wayback/_client.py

Lines 592 to 593 in 5a994c7

capture_time = datetime.strptime(data.timestamp,
                                 URL_DATE_FORMAT)

That’s not bad, exactly, since all communication with Wayback is implicitly in UTC, and Wayback APIs don’t specify time zone info, either. However, it would probably be a lot safer if the timestamps we returned were explicitly in UTC — there would never be any question for a user of this library whether they are looking at a local or universal time.
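
The fix is likely a one-liner (a sketch):

from datetime import timezone

capture_time = datetime.strptime(data.timestamp,
                                 URL_DATE_FORMAT).replace(tzinfo=timezone.utc)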

Add info about media type to `Memento`

In EDGI’s web monitoring tools, we often look at the media type (also often referred to as MIME type or content type) of a Memento. One example is that we need to know how to parse the body in order to extract a title (you’d do very different things for HTML vs. PDF, for example). It might be nice to expose some sort of media type information on the Memento class.

We originally planned to do this in #2, but it wasn’t critical and there were enough open questions and options that it seemed worth waiting on coming up with a better design for:

  • Should this be as simple as just the media type with no parameters?

    a_memento.headers['Content-Type'] == 'text/html; encoding=utf-8; some-param=value'
    a_memento.media_type == 'text/html'
  • Should it be a detailed representation?

    class MediaType:
        # Has attributes for each part of a media type string,
        # Probably a __str__() implementation, etc.
        ...
    
    a_memento.headers['Content-Type'] == 'text/html; encoding=utf-8; some-param=value'
    a_memento.media_type == MediaType(type='text',
                                      subtype='html',
                                      parameters={'encoding': 'utf-8',
                                                  'some-param': 'value'})
  • Should it convert known non-canonical types into the canonical one?

    a_memento.headers['Content-Type'] == 'application/xhtml'
    a_memento.media_type == 'application/xhtml+xml'  # Means the same thing, and is more correct.
  • Should it sniff?

    a_memento.headers['Content-Type'] == None
    a_memento.content == b'%PDF-blahblahblah...'
    a_memento.media_type == 'application/pdf'

Add info about link header relationships to `Memento`

Wayback Mementos carry additional info about related resources in the Link header. For example, here’s the header from https://web.archive.org/web/20171124151315id_/https://www.fws.gov/birds/:

<https://www.fws.gov/birds/>; rel="original",
<http://web.archive.org/web/timemap/link/https://www.fws.gov/birds/>; rel="timemap"; type="application/link-format",
<http://web.archive.org/web/https://www.fws.gov/birds/>; rel="timegate",
<http://web.archive.org/web/20050323155300/http://www.fws.gov:80/birds>; rel="first memento"; datetime="Wed, 23 Mar 2005 15:53:00 GMT",
<http://web.archive.org/web/20170929002712/https://www.fws.gov/birds/>; rel="prev memento"; datetime="Fri, 29 Sep 2017 00:27:12 GMT",
<http://web.archive.org/web/20171124151315/https://www.fws.gov/birds/>; rel="memento"; datetime="Fri, 24 Nov 2017 15:13:15 GMT",
<http://web.archive.org/web/20171228222143/https://www.fws.gov/birds/>; rel="next memento"; datetime="Thu, 28 Dec 2017 22:21:43 GMT",
<http://web.archive.org/web/20201011123440/http://www.fws.gov/birds>; rel="last memento"; datetime="Sun, 11 Oct 2020 12:34:40 GMT"

(Line breaks added for clarity.)

This follows the standard format for the Link header. The most accessible docs are at MDN.

It would probably be nice to surface this information in the Memento class in some useful way. We originally planned to do this in #2, but it wasn’t critical and there were enough open questions and options that it seemed worth waiting on coming up with a better design for.

Some possibilities:

  • Simply parse the data generically, like the Requests package does:

    memento.links = {
        'original': {
            'url': 'https://www.fws.gov/birds/',
            'rel': 'original'
        },
        'timemap': {
            'url': 'http://web.archive.org/web/timemap/link/https://www.fws.gov/birds/',
            'rel': 'timemap',
            'type': 'application/link-format'
        },
        'timegate': {
            'url': 'http://web.archive.org/web/https://www.fws.gov/birds/',
            'rel': 'timegate'
        },
        'first memento': {
            'url': 'http://web.archive.org/web/20050323155300/http://www.fws.gov:80/birds',
            'rel': 'first memento',
            'datetime': datetime(2005, 3, 23, 15, 53, tzinfo=timezone.utc)
        },
        'prev memento': {
            'url': 'http://web.archive.org/web/20170929002712/https://www.fws.gov/birds/',
            'rel': 'prev memento',
            'datetime': datetime(2017, 9, 29, 0, 27, 12, tzinfo=timezone.utc)
        },
        'memento': {
            'url': 'http://web.archive.org/web/20171124151315/https://www.fws.gov/birds/',
            'rel': 'memento',
            'datetime': datetime(2017, 11, 24, 15, 13, 15, tzinfo=timezone.utc)
        },
        'next memento': {
            'url': 'http://web.archive.org/web/20171228222143/https://www.fws.gov/birds/',
            'rel': 'next memento',
            'datetime': datetime(2017, 12, 28, 22, 21, 43, tzinfo=timezone.utc)
        },
        'last memento': {
            'url': 'http://web.archive.org/web/20201011123440/http://www.fws.gov/birds',
            'rel': 'last memento',
            'datetime': datetime(2020, 10, 11, 12, 34, 40, tzinfo=timezone.utc)
        }
    }
  • Since the original and memento relationships are redundant (all that info is already on Memento), we could drop them.

  • We could make these a more special type than a dict so they can be passed to get_memento(), e.g:

    get_memento(memento.links['next memento'])
  • We could add get_next_memento(), etc. as a shortcut to get_memento(), e.g:

    memento.get_next_memento()
    # Same as:
    get_memento(memento.links['next memento']['url'],
                memento.links['next memento']['datetime'],
                memento.mode)
  • Since these are known, predictable links, we could add attributes directly to Memento instead of Memento.links:

    memento.first_memento = {...}
    memento.previous_memento = {...}
    memento.next_memento = {...}
    memento.last_memento = {...}

Another thing to consider is that we don’t currently have any special support for timemaps or timegates, so we can’t do anything special for those links. Basic dict parsing would probably be the lowest common denominator here. It’s not very special, but lets us treat everything the same.

Lots of ways we could go here. ¯\_(ツ)_/¯
