cdxj-indexer's Introduction

Conifer

Collect and revisit web pages.

Conifer provides an integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, for accessing archived content, and for sharing collections.

This repository contains the code for the hosted service running at https://conifer.rhizome.org/, which can also be deployed locally using Docker.

This README refers to the 5.x version of Conifer, released in June 2020. This release includes a new UI and the renaming of Webrecorder.io to Conifer. Other parts of the open source efforts remain at the Webrecorder Project. For more info about this momentous change, read our announcement blog post.

The previous UI is available on the legacy branch.

Frequently asked questions

  • If you have any questions about how to use Conifer, please see our User Guide.

  • If you have a question about your account on the hosted service (conifer.rhizome.org), please contact us via email at [email protected]

  • If you have a previous Conifer installation (version 3.x), see Migration Info for instructions on how to migrate to the latest version.

Using the Conifer Platform

Conifer and related tools are designed to make web archiving more portable and decentralized, as well as to serve users and developers with a broad range of skill levels and requirements. Here are a few ways that Conifer can be used (starting with what probably requires the least technical expertise).

1. Hosted Service

Using our hosted version of Conifer at https://conifer.rhizome.org/, users can sign up for a free account and create their own personal collections of web archives. Captured web content will be available online, either publicly or privately, under each user account, and can be downloaded by the account owner at any time. Downloaded web archives are available as WARC files. (WARC is the ISO standard file format for web archives.) The hosted service can also be used anonymously, and the captured content can be downloaded at the end of a temporary session.

2. Offline Capture and Browsing

The Webrecorder Project is a closely aligned effort that offers macOS/Windows/Linux Electron applications:

  • Webrecorder Player: browse WARCs created by Webrecorder (and other web archiving tools) locally on the desktop.
  • Webrecorder Desktop: a desktop version of the hosted Webrecorder service, providing both capture and replay features.

3. Preconfigured Deployment

To deploy the full version of Conifer with Ansible on a Linux machine, the Conifer Deploy playbook can be used to install this repository and configure nginx and other dependencies, such as SSL certificates (via Let's Encrypt). The playbook is used for the https://conifer.rhizome.org deployment.

4. Full Conifer Local Deployment

The Conifer system in this repository can be deployed directly by following the instructions below. Conifer runs entirely in Docker and also requires Docker Compose.

5. Standalone Python Wayback (pywb) Deployment

Finally, for users interested in the core "replay system" and very basic recording capabilities, deploying pywb could also make sense. Conifer is built on top of pywb (Python Wayback/Python Web Archive Toolkit), and the core recording and replay functionality is provided by pywb as a standalone Python library. pywb comes with a Docker image as well.

pywb can be used to deploy your own web archive access service. See the full pywb reference manual for further information on using and deploying pywb.

Running Locally

Conifer can be run on any system that has Docker and Docker Compose installed. To install manually:

  1. git clone https://github.com/rhizome-conifer/conifer

  2. cd conifer; bash init-default.sh

  3. docker-compose build

  4. docker-compose up -d

(init-default.sh is a convenience script that copies wr_sample.env → wr.env and creates keys for session encryption.)
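A rough Python sketch of what such an init script does; the key names (SECRET_KEY, ENCRYPT_KEY) and the key length here are assumptions for illustration, not the script's actual output:

```python
import secrets
import shutil
from pathlib import Path

def init_env(sample="wr_sample.env", target="wr.env"):
    """Copy the sample env file (if not already copied) and append
    fresh random keys for session encryption."""
    if not Path(target).exists():
        shutil.copy(sample, target)
    with open(target, "a") as f:
        # Hypothetical key names; the real script defines its own.
        for name in ("SECRET_KEY", "ENCRYPT_KEY"):
            f.write(f"{name}={secrets.token_hex(32)}\n")
```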

Point your browser to http://localhost:8089/ to access the locally running Conifer instance.

(Note: you may see a maintenance message briefly while Conifer is starting up. Refresh the page after a few seconds to see the Conifer home page.)

Installing Remote Browsers

Remote Browsers are standard browsers like Google Chrome and Mozilla Firefox, encapsulated in Docker containers. This feature allows Conifer to use fixed versions of browsers for capturing and accessing web archives, with a more direct connection to the live web and to web archives. Remote Browsers can in many cases improve the quality of web archives during capture and access. They can be "remote controlled" by users, are launched as needed, and use about the same computing and memory resources as regular desktop browsers.

Remote Browsers are optional, and can be installed as needed.

Remote Browsers are simply Docker images whose names start with oldweb-today/, published under the oldweb-today organization on GitHub. Installing the browsers can be as simple as running docker pull on each browser image, along with the additional Docker images for the Remote Desktop system.

To install the Remote Desktop System and all of the officially supported Remote Browsers, run install-browsers.sh

Configuration

Conifer reads its configuration from two files: wr.env, and less-commonly changed system settings in wr.yaml.

The wr.env file contains numerous deployment-specific customization options. In particular, the following options may be useful:

Host Names

By default, Conifer assumes it's running on localhost or a single domain, but on different ports for the application (the Conifer user interface) and content (material rendered from web archives). This is a security feature that prevents archived websites from accessing, and possibly changing, Conifer's user interface, among other unwanted interactions.

To run Conifer on different domains, the APP_HOST and CONTENT_HOST environment variables should be set.

For best results, the two domains should be two subdomains, both with https enabled.

The SCHEME env var should also be set to SCHEME=https when deploying via https.

Anonymous Mode

By default, Conifer disallows anonymous recording. To enable this feature, set ANON_DISABLED=false in the wr.env file and restart.

Note: previously, the default was anonymous recording enabled (ANON_DISABLED=false).

Storage

Conifer uses the ./data/ directory for local storage, or an external backend (currently S3 is supported).

The DEFAULT_STORAGE option in wr.env selects the storage backend: DEFAULT_STORAGE=local or DEFAULT_STORAGE=s3.

Conifer uses a temporary storage directory for data that is actively being captured and for temporary collections. Data is moved into 'permanent' storage when capture completes or when a temporary collection is imported into a user account.

The temporary storage directory is WARCS_DIR=./data/warcs.

When using local storage, the permanent storage directory is STORAGE_DIR=./data/storage.

When using s3, the value of STORAGE_DIR is ignored and data is placed under S3_ROOT, an s3:// bucket URL.

Additional s3 auth environment settings must also be set in wr.env or externally.

All data related to Conifer that is not web archive data (WARC and CDXJ) is stored in the Redis instance, which persists data to ./data/dump.rdb. (See Conifer Architecture below.)
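For scripting around a deployment, wr.env can be treated as a plain KEY=value file. A minimal reader (the parsing rules here are an assumption; the real file may use features this sketch ignores):

```python
def read_env(path):
    """Parse a simple KEY=value env file into a dict,
    skipping blank lines and # comments."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings
```

Usage would look like `read_env("wr.env").get("DEFAULT_STORAGE", "local")`.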

Email

Conifer can send confirmation and password recovery emails. By default, a local SMTP server runs in Docker, but Conifer can be configured to use a remote server by changing the EMAIL_SMTP_URL and EMAIL_SMTP_SENDER environment variables.

Frontend Options

The React frontend includes a number of additional options useful for debugging. Setting NODE_ENV=development switches React to development mode with hot reloading on port 8096.

Additional frontend configuration can be found in frontend/src/config.js

Administration tool

The script admin.py provides low-level management of users: adding, modifying, or removing users can be done via the command line.

To interactively create a user:

docker exec -it app python -m webrecorder.admin -c

or programmatically add users by supplying the appropriate positional values:

docker exec -it app  python -m webrecorder.admin \
                -c <email> <username> <passwd> <role> '<full name>'

Other arguments:

  • -m modify a user
  • -d delete a user
  • -i create and send a new invite
  • -l list invited users
  • -b send backlogged invites

See docker exec -it app python -m webrecorder.admin --help for full details.

Restarting Conifer

When making changes to the Conifer backend app, running

docker-compose kill app; docker-compose up -d app

will stop and restart the container.

To integrate changes to the frontend app, either set NODE_ENV=development and use hot reloading, or, if you're running in production mode (NODE_ENV=production), run

docker-compose kill frontend; docker-compose up -d frontend

To fully recreate Conifer, deleting old containers (but not the data!) use the ./recreate.sh script.

Conifer Architecture

This repository contains the Docker Compose setup for Conifer, and is the exact system deployed on https://conifer.rhizome.org. The full setup consists of the following components:

  • /app - The Conifer backend system, which includes the API, recording, and WARC access layers, split into three containers:
    • app -- the API, data model, and rewriting system.
    • recorder -- the WARC writer.
    • warcserver -- WARC loading and lookup.

The backend containers run different tools from pywb, the core web archive replay toolkit library.

  • /frontend - A React-based frontend application running in Node.js. The frontend provides a modern interface for Conifer and uses the backend API. All user access goes through the frontend (behind nginx).

  • /nginx - A custom nginx deployment to provide routing and caching.

  • redis - A Redis instance that stores all of the Conifer state (other than WARC and CDXJ).

  • dat-share - An experimental component for sharing collections via the Dat protocol.

  • shepherd - An instance of OldWebToday Browser Shepherd for managing remote browsers.

  • mailserver - A simple SMTP mail server for sending user account management mail.

  • behaviors - Custom automation behaviors

  • browsertrix - Automated crawling system

Dependencies

Conifer is built with Python (for the backend) and Node.js (for the frontend), using a variety of open source Python and Node libraries.

Conifer relies on a few separate repositories in this organization:

The remote browser system uses https://github.com/oldweb-today/ repositories, including:

Contact

Conifer is a project of Rhizome, made possible with generous past support from the Andrew W. Mellon Foundation.

For more info on using Conifer, you can consult our user guide at: https://guide.conifer.rhizome.org

For any general questions/concerns regarding the project or https://conifer.rhizome.org you can:

License

Conifer is licensed under the Apache 2.0 License. See NOTICE and LICENSE for details.

cdxj-indexer's People

Contributors

edsu, ikreymer, machawk1, nlevitt


cdxj-indexer's Issues

Revisit records for POST requests lack a POST append in their URL key

When using cdxj-indexer on a page that makes multiple different HTTP POST requests with the same response, the indexer only appends the POST data to the URL key of the response record. Revisit records therefore do not get a POST-appended URL key.

When running the cdxj-indexer:

Expected result:

com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-0 20230118150449 {... "mime": "text/html"}
com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-1 20230118150449 {... "mime": "warc/revisit"}

Actual result:

com,example)/inc/postdatastatic.php?__wb_method=post&body=counter0&rnd=result1-0 20230118150449 {... "mime": "text/html"}
com,example)/inc/postdatastatic.php?rnd=result1-1 20230118150449 {... "mime": "warc/revisit"}

fix_postappend-revist.patch
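A minimal sketch of the expected behaviour: both the response and its revisit record would derive their lookup key from the shared request, so POST bodies distinguish otherwise-identical URLs. The helper name and parameter order here are illustrative, not cdxj-indexer's actual code:

```python
def post_append_key(url, method, body):
    """Append a serialized request body to the URL so POST requests
    with different bodies get distinct CDXJ lookup keys."""
    if method.upper() != "POST" or not body:
        return url
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}__wb_method=post&body={body}"
```

Applying the same helper to the revisit record as to the response record would give both the POST-appended key.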

Feature Requests / questions on use --> Pipe, Readme

A few feature requests and/or requests for help using cdxj-indexer!
--> Also, my timing is good based on the reply by @ikreymer in another issue; it seems we're both coming back to our respective projects. Nothing like a global pandemic to make time for hobby projects. haha

One of the first things I tried was piping the output of a command to cdxj-indexer, but that simply does not work. What's the recommended method to get the output of a command run in this fashion? (Forgive me if I'm missing something primitive, still learning.)
--> While simple bash scripts are the most likely culprit for trying to pipe to cdxj-indexer, I have a gzip hardware accelerator (FPGA, real-world throughput over 1 GB per second in either direction), which would work really well if I could pipe my-fast-funzip file.warc.gz | cdxj-indexer. I am working on a Python wrapper for my-fast-funzip though, as this need keeps popping up.

As well, when looking at --help, I see some other flags which I am having trouble finding documentation on, such as --compress and --lines. Is there a more robust readme kicking about somewhere that I simply missed?

Lastly, multiprocessing would be a godsend.
My machine's CPU threads are relatively slow, as it's an old server, but it is a server with 48 cores / 96 threads.
Generally speaking, I am likely not the only one who will find their way here by working with Common Crawl WARCs. I have ~40 TB of warc.gz data to work through, so the gzip FPGA and multiprocessing would reduce the time required for this step by a few orders of magnitude.

I'll likely work on a multiprocessing solution myself. In the past, I've handled multiprocessed writing to one file with the logging library. I believe the cdxj format is fine with an arbitrary line order, as I see sorting functionality here; is that correct?
--> Unless I hear someone volunteer to help a beginner clean up their code, I likely won't make a pull request.

TLDR:

  1. Is there a method to pipe to cdxj-indexer? If not, this is a feature request
  2. Multiprocessing capabilities for those of us with more warc data than time to wait.
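In the meantime, fanning out one worker per WARC from a wrapper script is straightforward. A sketch, with index_one as a placeholder for whatever invokes the indexer on a single file:

```python
from multiprocessing import Pool

def index_one(path):
    """Placeholder: index a single WARC and return its CDXJ lines.
    Real work would shell out to, or import, the indexer."""
    return [f"{path} indexed"]

def index_all(paths, workers=4):
    """Index many WARCs in parallel, then sort the combined output,
    since CDXJ consumers expect lexicographically sorted lines."""
    with Pool(workers) as pool:
        results = pool.map(index_one, paths)
    return sorted(line for lines in results for line in lines)
```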

SURTs are not created for HTTP CONNECT requests in WARC files

Hi, we are using the cdxj-indexer tool and found that while replaying our WACZ files in the ReplayWeb.page player, certain resources were sometimes not found even though they were present in the WARC files.

It turns out our WARC files contain CONNECT requests, and these are not converted to a SURT. For example, the URL distillery.wistia.com:443 remains distillery.wistia.com:443 after the surt.surt(url) method call. The ReplayWeb.page player checks whether index.idx uses SURTs, via useSurt = prefix.indexOf(")/") > 0; in MultiWacz.js. If by chance the last line is a CONNECT entry, this check yields surt = false for the whole CDX. Querying the browser DB using the upperBound method then does not work properly.

Given:

A warc file with:

WARC/1.0
Content-Length: 308
Content-Type: application/http;msgtype=request
WARC-Block-Digest: sha1:XDTRC67IG3EYGKYRBFK7BOYLBRJHW52X
WARC-Date: 2022-09-14T14:45:01Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:d083e59a-e1c5-4079-bb20-cf6115fa342d>
WARC-Target-URI: distillery.wistia.com:443
WARC-Type: request

CONNECT distillery.wistia.com:443 HTTP/1.1
Accept-Encoding: *, compress;q=0, br;q=0
Content-Length: 0
Host: distillery.wistia.com:443
Proxy-Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/105.0.5195.102 Safari/537.36

When running the cdxj_indexer with the following parameters:

main.py -p -o index.idx -c index.cdx.gz -s -d -l 1024 small.warc

Then the result in the index is:

!meta 0 {"format": "cdxj-gzip-1.0", "filename": "c:\\temp\\index.cdx.gz"}
distillery.wistia.com:443 20220914144501 {"offset": 0, "length": 371, "digest": "sha256:8e8d3aa0f13b077615de09a2d349121130ec5fca9783c97d10c07721e1d13585"}

expected:

com,wistia,distillery)/ 20220914144501 {"offset": 0, "length": 377, "digest": "sha256:b75ede157ec02f31a25126270771b287d1ccc42554c9678ebc2c1446249a554d"}
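For reference, a heavily simplified sketch of the SURT transform, showing how a scheme-less CONNECT target like distillery.wistia.com:443 could still be given a ")/"-style key; the real surt library handles many more cases (case folding, default ports, session params, etc.):

```python
from urllib.parse import urlsplit

def simple_surt(url):
    """Very simplified SURT: reverse the host labels, drop the port,
    and keep path/query. Scheme-less CONNECT targets get a scheme
    prepended so they still produce a ')/'-style key."""
    if "://" not in url:
        url = "https://" + url  # e.g. "host:443" from a CONNECT request
    parts = urlsplit(url)
    host = parts.hostname or ""
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key
```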

Recompress and Re-indexing Errors

We've run into two issues while trying to recompress and re-index some of our older ARCs.

1) When running warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz we get:

IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read as a WARC or ARC

Could anyone elaborate on what's going on here or suggest a possible workaround?

2) For some of the ARCs that are successfully recompressed, we get this error after running the cdxj-indexer:

UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 403: ordinal not in range(128)

We've hand-checked a few of these ARCs and it seems that the offending resource is always a binary image. Any suggestions on how to move forward? I can also post the first error in warcio if that's more appropriate.
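As a workaround for this class of error, forcing UTF-8 output (e.g. setting PYTHONIOENCODING=utf-8 before running the indexer) usually helps; alternatively, lines can be round-tripped through the target encoding so unencodable characters are replaced instead of raising. A sketch, not the indexer's actual code:

```python
def safe_line(line, encoding="utf-8"):
    """Round-trip a CDXJ line through the target encoding, replacing
    anything unencodable instead of raising UnicodeEncodeError."""
    return line.encode(encoding, errors="replace").decode(encoding)
```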

AttributeError: 'NoneType' object has no attribute 'protocol'

While using cdxj-indexer to index a backlog of WARC data I ran into this error when using --post-append:

was@was-dev:~$ cdxj-indexer --sort --post-append /web-archiving-stacks/data/collections/jt898xc8096/fq/567/wq/8955/ARCHIVEIT-5425-MONTHLY-JOB292430-20170430083101595-00035.warc.gz > x
Traceback (most recent call last):
  File "/opt/app/was/.local/bin/cdxj-indexer", line 8, in <module>
    sys.exit(main())
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 477, in main
    write_cdx_index(cmd.output, cmd.inputs, vars(cmd))
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 492, in write_cdx_index
    indexer.process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 210, in process_all
    super().process_all()
  File "/opt/app/was/.local/lib/python3.8/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/main.py", line 244, in process_one
    for record in wrap_it:
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 49, in buffering_record_iter
    join_req_resp(req, resp, post_append, url_key_func)
  File "/opt/app/was/.local/lib/python3.8/site-packages/cdxj_indexer/bufferiter.py", line 103, in join_req_resp
    method = req.http_headers.protocol
AttributeError: 'NoneType' object has no attribute 'protocol'

I tracked it down to a request record that seems to lack a body, which seems wrong, but probably shouldn't cause an error. These records came from Archive-It.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://img1.doubanio.com/icon/u3927203-87.jpg
WARC-Date: 2017-04-30T11:39:19Z
WARC-Concurrent-To: <urn:uuid:4ababcf0-a610-4839-9b3e-57e3f1f056e2>
WARC-Record-ID: <urn:uuid:ef603b53-b29b-412e-89c3-2bac194b9224>
Content-Length: 0
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Block-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ


Maybe a guard against req.http_headers being None here would be helpful in (admittedly obscure) cases like this?
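A sketch of such a guard; request_method is a hypothetical helper, with the attribute access mirroring the traceback above:

```python
def request_method(req):
    """Return the request line for a request record, or None when the
    record has no parsed HTTP headers (e.g. an empty body)."""
    if req is None or req.http_headers is None:
        return None
    return req.http_headers.protocol
```

With this, the POST-append step could simply skip records where the method comes back as None rather than crashing.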

Has this been pushed to pypi?

The package does not appear to be available via pip, which the README recommends as an installation method.

$ pip install cdxj-indexer
Collecting cdxj-indexer
  Could not find a version that satisfies the requirement cdxj-indexer (from versions: )
No matching distribution found for cdxj-indexer
$ python --version
Python 2.7.13
$ pip --version
pip 9.0.1 from /usr/local/lib/python2.7/site-packages (python 2.7)

Extracting page titles / URLs from cdxj

Hello!

Apologies if this is a silly question, but I'm wondering whether cdxj-indexer can generate a list of pages (and potentially their titles) from a WARC file? I am thinking of something analogous to the 'Pages' tab in replayweb.page: a list of the captured pages and their titles, rather than a list of all of the many digital objects that make them up. I wondered if there is an HTTP header that cdxj-indexer could use for this, but I don't see anything obvious.

The use case is that it would be great to provide a human-friendly list of pages in our archive catalogue entries. At the moment I've been generating CDXJs with the default settings, but I can see researchers finding this confusing.

Many thanks,

Jake
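The source doesn't confirm cdxj-indexer supports this, but as a starting point one could scan text/html response payloads and pull out each page's <title>. A stdlib-only sketch of the title-extraction part:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the contents of the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True
    def handle_data(self, data):
        if self.in_title:
            self.title = (self.title or "") + data
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

def page_title(html):
    """Return the <title> text of an HTML document, or None."""
    p = TitleParser()
    p.feed(html)
    return p.title.strip() if p.title else None
```

One would iterate the WARC's response records (e.g. with a WARC-reading library), run this on text/html bodies, and pair each title with the record's target URI.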

CDX files generated are not sorted

Similar to the wayback indexer, this indexer doesn't produce a sorted CDX file, so when you use it with pywb, lookups fail to find links correctly. Just wondering whether there was a particular design decision behind why it works this way?

I should add that I am only looking at using CDX files. This is because I want to test out pywb and OpenWayback, and as far as I can tell (from docs/code), OpenWayback 2.3.2 doesn't support CDXJ. I found some mention of CDXJ and OpenWayback in reference to OpenWayback 3.0.0, but as that is a stale branch on GitHub I assume it has been abandoned.
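Until the indexer sorts its own output, sorting (or merging pre-sorted shards) externally is simple, since CDX/CDXJ lookup relies on lexicographic binary search over whole lines. A sketch:

```python
import heapq

def sort_cdxj(lines):
    """Sort CDXJ/CDX lines lexicographically ('SURT-key timestamp ...'),
    as binary-search lookup requires."""
    return sorted(lines)

def merge_sorted_cdxj(*sorted_iters):
    """Merge several already-sorted CDXJ sources lazily,
    without loading everything into memory."""
    return heapq.merge(*sorted_iters)
```

`LC_ALL=C sort` on the command line achieves the same for file-based workflows.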

Error during indexing: No space left on device

I was running the cdxj-indexer utility in parallel on a large number of WARC files and ran into a No space left on device exception where warcio's ensure_digest() is writing to a temp file. When I looked at the filesystem it appeared to have plenty of space, but of course this was after the process had terminated, which may have cleared up any temporary files that had been left open.

I was wondering if the temporary file should be closed at some point so that the space can be reclaimed by the operating system?
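One conventional fix is to keep each spooled payload inside a context manager, so the file is closed (and its disk blocks reclaimed) as soon as the record has been processed; on Unix, an open-but-deleted temp file still occupies space until it is closed. A sketch of the pattern, not warcio's actual code:

```python
import tempfile

def buffer_payload(chunks):
    """Spool a record payload to an anonymous temp file that is
    deleted on close, so space is reclaimed promptly."""
    with tempfile.TemporaryFile() as tmp:
        total = 0
        for chunk in chunks:
            tmp.write(chunk)
            total += len(chunk)
        tmp.seek(0)
        # ... hash/index the payload here, while the file is open ...
        return total
```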

--post-append and memory use

When indexing a large WARC file (89GB) with --post-append the cdxj-indexer process starts using all available memory on a machine with 8GB of memory. I don't know if this is something specific about this WARC file, or if it is a more general problem when indexing large WARC files. I can supply the WARC file if it helps in testing.

Ways of handling problematic WARC records

We've found some weird WARCs, looking like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/
WARC-Date: 2017-09-19T03:35:35Z
WARC-IP-Address: 176.58.112.27
WARC-Payload-Digest: sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU
WARC-Record-ID: <urn:uuid:d15353f7-1bb7-4441-92bf-1f2268639d52>
Content-Type: application/http; msgtype=response
Content-Length: 7026

19/Sep/2017:03:35:35 +0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET /wp-content/uploads/2015/05/i6d2e3jOCVVc-e1432221090328.jpg HTTP/1.1||
19/Sep/2017: 03:35:35 +0000|v1|24.18.58.84|thestar.ie|200|73232|162.13.191.183:80|0.061|0.374|GET /wp-content/uploads/2015/12/video-woman-abusing-mcdonalds-cookies-brandy-wooten-353018.jpg HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|5.62.39.244|markom2020.no|403|0|35.197.196.129:80|0.339|0.339|GET /?author=1 HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.389|0.389|HEAD /about-us/out-in-the-open-ace-back-at-work-hours-after-pittsburgh-defeat/ HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|69.162.124.230|www.adventure-holidays.ie|301|178|35.197.246.117:80|0.022|0.022|GET / HTTP/1.1||
19/Sep/2017: 03:35:36 +0000|v1|180.76.15.136|www.mobyaffiliates.com|200|20225|35.197.249.238:80|0.945|0.945|GET /mobile-advertising-networks/?key-markets=japan+indonesia&targeting=custom+operator HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|51.255.71.100|thestar.ie|200|32351|162.13.191.183:80|0.416|0.416|GET /about-us/sharon-corr-we-dont-judge-age/ HTTP/1.0||
19/Sep/2017: 03:35:37 +0000|v1|5.9.60.241|gullfoss.is|200|32745|35.197.192.76:80|2.819|2.819|GET /shop/?_wpnonce=9c17844d42&add_to_wishlist=3015 HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|188.163.72.15|www.alnouran.com|200|18017|35.189.109.142:80|0.006|0.006|GET /en/corporate-governance/corporate-social-responsibilities/ HTTP/1.1||
19/Sep/2017: 03:35:37 +0000|v1|178.154.200.9|canieatthere.eu|301|178|104.155.26.132:80|0.018|0.018|GET /robots.txt HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|141.8.142.44|canieatthere.co.uk|301|178|104.155.26.132:80|0.016|0.016|GET / HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|54.80.111.161|ravatherm.com|200|201784|104.199.60.90:80|0.071|0.071|GET /files/2016/03/DoP_RAVATHERM_300WB180_SK.pdf HTTP/1.0||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|2117|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/twitter.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|4746|35.189.99.79:80|0.060|0.060|GET /wp-content/uploads/2017/05/google-plus.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|131.253.25.146|www.grandunionhousing.co.uk|200|1746|35.189.99.79:80|0.061|0.061|GET /wp-content/uploads/2017/05/facebook-icon.png HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|45.55.55.18|www.lanzarotesurf.com|200|28740|35.197.214.99:80|1.859|1.859|GET /es/reservas/surf-camp-nivel-intermedio/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|207.46.13.65|www.stickybottle.com|200|11209|134.213.209.62:80|1.007|1.007|GET /latest-news/hotly-contest-shay-elliott-memorial-in-prospect-as-top-men-fine-tune-ras-form/ HTTP/1.1||
19/Sep/2017: 03:35:38 +0000|v1|66.249.85.10|nutroexpertos.com|200|33426|35.189.69.242:80|0.022|0.022|GET /wp-content/uploads/2015/05/Ejercicio-perro-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|157.55.39.239|www.janminihane.co.uk|200|4319|35.197.245.96:80|0.017|0.017|GET /wp-includes/js/jquery/jquery-migrate.min.js HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.189.215.158|www.axbom.se|301|178|35.197.249.238:80|0.005|0.005|GET /feed/axbom-se HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.10|nutroexpertos.com|200|96474|35.189.69.242:80|0.016|0.395|GET /wp-content/uploads/2015/06/Post-c%C3%B3mo-ense%C3%B1ar-a-cachorro-a-hacer-sus-necesidades-sobre-los-peri%C3%B3dicos.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|66.249.85.8|nutroexpertos.com|200|65231|35.189.69.242:80|0.040|0.428|GET /wp-content/uploads/2014/12/garrapatas-en-perros-2-484x330.jpg HTTP/1.1||
19/Sep/2017: 03:35:39 +0000|v1|35.197.192.76|gullfoss.is|200|6275|35.197.192.76:80|0.005|0.007|GET /wp-content/uploads/2016/07/Logo-Gullfoss_website-XI-1.png HTTP/1.0||
19/Sep/2017: 03:35:40 +0000|v1|54.82.184.78|thestar.ie|200|0|162.13.191.183:80|0.424|0.424|HEAD /about-us/1ds-niall-ill-find-the-next-mcilroy/ HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|0|35.189.99.79:80|1.079|1.079|HEAD /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
19/Sep/2017: 03:35:40 +0000|v1|218.90.137.18|laorcare.com|200|2507|35.189.99.79:80|0.006|0.006|GET /wp-json/oembed/1.0/embed?url=http%3A%2F%2Flaorcare.com%2F HTTP/1.1||
Server: nginx
Date: Tue, 19 Sep 2017 03:35:41 GMT
Content-Type: application/rss+xml; charset=UTF-8
Connection: close
X-Cacheable: CacheAlways: feed
Cache-Control: max-age=600, must-revalidate
X-Cache: MISS
X-Cache-Group: bot
X-Pingback: http://www.estiethirionphotography.co.za/xmlrpc.php
Link: <http://www.estiethirionphotography.co.za/wp-json/>; rel="https://api.w.org/"
Link: <http://wp.me/p2ZY6I-Mb>; rel=shortlink
X-Type: feed
ETag: "1f5dd55566f2f1de600da749924ac5fb-gzip"
X-Pass-Why: 
Last-Modified: Fri, 27 Jan 2017 11:12:31 GMT

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>Comments on: Fransua &#038; Anne-Louise wedding</title>
	<atom:link href="http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/</link>
	<description>Photography</description>
	<lastBuildDate>Fri, 27 Jan 2017 11:12:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.8.1</generator>
	<item>
		<title>By: nastassja harvey</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9616</link>
		<dc:creator><![CDATA[nastassja harvey]]></dc:creator>
		<pubDate>Thu, 27 Oct 2011 17:41:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9616</guid>
		<description><![CDATA[sooo mooi estie! :)]]></description>
		<content:encoded><![CDATA[<p>sooo mooi estie! 🙂</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kathryn van Eck</title>
		<link>http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/#comment-9609</link>
		<dc:creator><![CDATA[Kathryn van Eck]]></dc:creator>
		<pubDate>Wed, 26 Oct 2011 09:48:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.estiethirionphotography.co.za/?p=2987#comment-9609</guid>
		<description><![CDATA[Beautiful work! I love the softness of your images and how you captured the couples joy.]]></description>
		<content:encoded><![CDATA[<p>Beautiful work! I love the softness of your images and how you captured the couples joy.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
19/Sep/2017:03:35:41 +0000|v1|194.66.232.93|www.estiethirionphotography.co.za|200|1985|162.13.104.162:80|5.773|5.773|GET /2011/10/fransua-anne-louise-wedding/feed/ HTTP/1.0||

which comes out as a malformed CDX record:

za,co,estiethirionphotography)/2011/10/fransua-anne-louise-wedding/feed 20170919033535 {"url": "http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/", "mime": "application/rss+xml", "status": "+0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET", "digest": "sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU", "length": "2785", "offset": "861793820", "filename": "test.warc.gz"}

But I think it'd be better to skip/drop these records?
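One possible filter: validate the parsed fields before emitting a line, e.g. require a numeric timestamp and status. A sketch following the record above; this is not the indexer's actual logic:

```python
import json

def is_valid_cdxj(line):
    """Reject CDXJ lines whose JSON block carries a non-numeric status,
    as produced by corrupt records like the one above."""
    try:
        key, ts, block = line.split(" ", 2)
        fields = json.loads(block)
    except ValueError:
        return False
    status = fields.get("status", "200")
    return ts.isdigit() and str(status).isdigit()
```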

Problem when URL is malformed

Describe the bug

All processing stops when there is a malformed URL.

Steps to reproduce the bug

For the URL "http://eosims.asf.alaska.edu:12355.edu:80/" the cdxj-indexer returns:

Traceback (most recent call last):
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 469, in main
    minimal=cmd.minimal_cdxj)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 301, in write_multi_cdx_index
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 339, in __call__
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 172, in create_record_iter
    entry['urlkey'] = canonicalize(entry['url'], surt_ordered)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/utils/canonicalize.py", line 48, in canonicalize
    raise UrlCanonicalizeException('Invalid Url: ' + url)
pywb.utils.canonicalize.UrlCanonicalizeException: Invalid Url: http://eosims.asf.alaska.edu:12355.edu:80/

The whole process then stops.

Expected behavior

Wouldn't it be better to process the records one by one? If one record raises an error, continue processing the next record in the same WARC.
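The suggested behaviour can be sketched as a per-record try/except; canonicalize here is a stand-in that mimics pywb's UrlCanonicalizeException with a plain ValueError:

```python
import logging

def canonicalize(url):
    """Stand-in for pywb's canonicalize(): reject URLs whose netloc
    contains more than one ':' (e.g. 'http://host:12355.edu:80/')."""
    netloc = url.split("://", 1)[-1].split("/", 1)[0]
    if netloc.count(":") > 1:
        raise ValueError("Invalid Url: " + url)
    return url.lower()

def index_records(urls):
    """Skip records whose URL cannot be canonicalized
    instead of aborting the whole run."""
    out = []
    for url in urls:
        try:
            out.append(canonicalize(url))
        except ValueError:
            logging.warning("skipping malformed url: %s", url)
    return out
```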
