
cdx_toolkit's Introduction

CoCrawler

Build Status Coverage Status Apache License 2.0

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.

The goal of this project is to create a modular crawler with pluggable components, capable of working well for a wide variety of crawl tasks. The core of the crawler is written in Python 3.7+ using coroutines.
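As a rough illustration of that core idea (not CoCrawler's actual plugin API), here is a minimal coroutine-based fetch loop using asyncio and aiohttp; the function names and the concurrency limit are placeholders:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Each fetch is a coroutine, so many requests can be in flight at once.
        async with session.get(url) as resp:
            return url, resp.status, await resp.read()

    async def crawl(urls, concurrency=10):
        sem = asyncio.Semaphore(concurrency)  # crude politeness / load limit
        async with aiohttp.ClientSession() as session:
            async def bounded(url):
                async with sem:
                    return await fetch(session, url)
            return await asyncio.gather(*(bounded(u) for u in urls))

    results = asyncio.run(crawl(['https://example.com/']))
    for url, status, body in results:
        print(url, status, len(body))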

Status

CoCrawler is pre-release, with major restructuring going on. It is currently able to crawl at around 170 megabits/sec, or about 170 pages/sec, on a 4-core machine.


Installing

We recommend that you use pyenv / virtualenv to separate the python executables and packages used by cocrawler from everything else.

You can install cocrawler from PyPI using "pip install cocrawler".

For a fresher version, clone the repo and install it like this:

git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
pip install . .[test]
make pytest
make test_coverage

The CI for this repo uses the latest versions of everything. To see exactly what worked last, click on the "Build Status" link above. Alternatively, you can look at requirements.txt for a test combination that I probably ran before checking in.

Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.

License

Apache 2.0

cdx_toolkit's People

Contributors

machawk1, marshallduval, wumpus


cdx_toolkit's Issues

404 seen for API call

Hi,

I'm getting the error "ValueError: 404 seen for API call, did you configure the endpoint correctly?" for the following code:

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "https://www.cnn.com/*"
objs = list(cdx.iter(url, from_ts='202001', to='202002', filter=['status:200']))

The same code worked two days ago but fails now; do you know why this would happen? Thanks!

`collinfo.json` URL returning empty list

Hi, thanks for this package and the work you have done.

I'm running:

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

And getting:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File ".../venv/lib/python3.8/site-packages/cdx_toolkit/__init__.py", line 215, in __init__
    self.raw_index_list = get_cc_endpoints(self.cc_mirror)
  File ".../venv/lib/python3.8/site-packages/cdx_toolkit/commoncrawl.py", line 25, in get_cc_endpoints
    raise ValueError('Surprisingly few endpoints for common crawl index')  # pragma: no cover
ValueError: Surprisingly few endpoints for common crawl index

I did some looking, and it appears the code is checking:

https://index.commoncrawl.org/collinfo.json

for indices. No matter how I open or query that link, all I get is an empty list. I'm willing to work around this or work on a patch, but I can't seem to find an alternative to collinfo.json or where a list of indices is kept.
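For anyone debugging the same thing, a quick way to see what the endpoint actually returns (a sketch using requests; the User-Agent string is just a placeholder):

    import requests

    resp = requests.get('https://index.commoncrawl.org/collinfo.json',
                        headers={'User-Agent': 'collinfo-debug'},
                        timeout=30)
    print(resp.status_code)
    collections = resp.json()
    print(len(collections))  # cdx_toolkit expects a long list of index entries here
    for c in collections[:3]:
        print(c.get('id'), c.get('cdx-api'))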

Results not complete for Common Crawl index 2012

This is related to the comment in issue #17.
The following code returns nothing:

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "www.cnn.com/*"
objs = cdx.iter(url, from_ts='201207', to='201212', filter=['status:200', 'mime:text/html'])

I also ran objs = cdx.iter(url, from_ts='201201', to='201206', filter=['status:200', 'mime:text/html']) and it returns some results. However, the results only contain data from before 201207, so it seems the data between 201207 and 201212 is missing?
Similarly, objs = cdx.iter(url, from_ts='201301', to='201306', filter=['status:200', 'mime:text/html']) only returns data after 201304.

myrequests.py gets stuck in a loop if the response status is always 429, 500, 502, 503, 504, 509

OS: Windows 10
Python 3.8.5
cdx_toolkit 0.9.34

I'm having an issue where the CDX toolkit gets stuck in a loop and constantly prints cdx_toolkit.myrequests myrequests.py: 62 : retrying after 1s for 500. I've tracked this down to this line in myrequests.py. If I am reading this correctly, if the response status is always one of 429, 500, 502, 503, 504, 509, you will be stuck in this retry loop.

I suggest that after line 62 we break out of the loop if the number of retries is greater than 5.
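Something along the lines of the following sketch of the suggested behavior (this is not the actual myrequests.py code, just an illustration of a bounded retry loop):

    import time
    import requests

    RETRY_STATUSES = {429, 500, 502, 503, 504, 509}

    def get_with_retries(url, max_retries=5, **kwargs):
        # Retry a bounded number of times, then give up with an error,
        # instead of looping forever on persistent retryable statuses.
        for attempt in range(max_retries + 1):
            resp = requests.get(url, **kwargs)
            if resp.status_code not in RETRY_STATUSES:
                return resp
            time.sleep(2 ** attempt)  # back off a little more each time
        raise RuntimeError('still failing after {} retries (last status {})'.format(
            max_retries, resp.status_code))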

Is it possible to get only one URL of one domain from a TLD?

Hello,

I'm trying to look at some specifics only at certain URLs of domains.

Currently, I'm doing:

    url = '*.co.uk/'
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    objs = list(cdx.iter(url, from_ts='202006', to='202006',
                     limit=1000, filter='=status:200'))

and it finds the 1000 URLs and iterates over them (they might cover a few dozen domains).
Is there a filter I can employ if I am only interested in the front page, for instance?

Thanks,
Chris
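One client-side workaround (a sketch, not a built-in cdx_toolkit feature) is to post-filter the iterator and keep only captures whose path is the site root; it assumes the raw CDX record is available as obj.data with a 'url' field:

    from urllib.parse import urlparse
    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    objs = cdx.iter('*.co.uk/', from_ts='202006', to='202006',
                    limit=1000, filter='=status:200')

    front_pages = []
    for obj in objs:
        # keep only captures of the site root ("front page")
        path = urlparse(obj.data['url']).path
        if path in ('', '/'):
            front_pages.append(obj)
    print(len(front_pages))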

"ValueError('invalid hostname in url '+url) from None" when accessing internet archive CaptureObject.content

It seems to happen only with ia as a source and not cc. It also happens quite seldom, I'd say once every 5000-8000 accesses of a CaptureObject's content attribute.

my code that triggers it:

for obj in cdx.iter(url=url_pattern, 
            from_ts=self.date_range[0], 
            to=self.date_range[1], 
            filter=self.filter):
    with open(f"{html_dir}/{obj.data['digest']}.html", mode="wb") as f_w:
        f_w.write(obj.content)

here is the full traceback:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/SEER/utils/cdx_retriever.py", line 103, in _retrieve_content
    f_w.write(obj.content)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 122, in content
    self._content = self.fetch_warc_record().content_stream().read()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 107, in fetch_warc_record
    self.warc_record = fetch_wb_warc(self.data, wb=self.wb)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/warc.py", line 118, in fetch_wb_warc
    resp = myrequests_get(wb_url, **kwargs)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/myrequests.py", line 63, in myrequests_get
    raise ValueError('invalid hostname in url '+url) from None
ValueError: invalid hostname in url https://web.archive.org/web/20191120071924id_/https%3A//www.placeholder.com/news/articles/2018-04-26/article-news-content
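A possible workaround, continuing the snippet above (a sketch, not a fix inside cdx_toolkit): catch the ValueError and skip the offending capture so the worker does not die.

    for obj in cdx.iter(url=url_pattern,
                from_ts=self.date_range[0],
                to=self.date_range[1],
                filter=self.filter):
        try:
            content = obj.content  # may raise ValueError for the odd bad capture
        except ValueError as e:
            print('skipping capture:', e)
            continue
        with open(f"{html_dir}/{obj.data['digest']}.html", mode="wb") as f_w:
            f_w.write(content)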

new Common Crawl year indices

The cdx_toolkit code gets Common Crawl index dates by parsing the index name, e.g. CC-MAIN-2020-50.

The new (old) indices don't have a week number:

"id": "CC-MAIN-2012",
"id": "CC-MAIN-2009-2010",
"id": "CC-MAIN-2008-2009",

which leads to the bug that 2012 is ignored, and I think 2009 and 2008 are treated as if they have a week number of 20.
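A sketch of index-id parsing that tolerates the year-only names (an illustration of the fix, not the actual cdx_toolkit code):

    import re

    def index_date_hint(index_id):
        # e.g. CC-MAIN-2020-50: year plus ISO week number
        m = re.match(r'CC-MAIN-(\d{4})-(\d{2})$', index_id)
        if m:
            return ('week', int(m.group(1)), int(m.group(2)))
        # e.g. CC-MAIN-2012, CC-MAIN-2009-2010: year or year range, no week
        m = re.match(r'CC-MAIN-(\d{4})(?:-(\d{4}))?$', index_id)
        if m:
            return ('years', int(m.group(1)), int(m.group(2) or m.group(1)))
        raise ValueError('unrecognized index id: ' + index_id)

    for i in ('CC-MAIN-2020-50', 'CC-MAIN-2012', 'CC-MAIN-2009-2010'):
        print(i, index_date_hint(i))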

Filters and url of crawled page in python client

Hi!

I am trying to figure out how to pass filters in the same way as is possible in the CLI; I am looking to filter by language, date, and status code.

The second piece I am missing: when iterating over the results, how does one grab the URL of the crawled page?

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = 'dailymail.co.uk/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

for obj in cdx.iter(url, limit=100):
    print(obj.text) # obj.url would be nice :)

EDIT:

Also, the example in the readme does not print out the content of obj, but rather:
<cdx_toolkit.CaptureObject object at 0x1125f84d0>
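For what it's worth, a sketch of both pieces: filter= accepts a list of field:value strings just like the CLI (status and mime shown here; a language filter would be another such string if the index exposes that field), and the raw CDX record is available as the obj.data dict used in the "invalid hostname" issue above, whose 'url' and 'timestamp' keys are standard CDX fields:

    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    url = 'dailymail.co.uk/*'

    # same field:value filter strings as the CLI; adjust as needed
    filters = ['status:200', 'mime:text/html']

    for obj in cdx.iter(url, from_ts='202001', to='202002',
                        limit=100, filter=filters):
        # obj.data is the raw CDX record as a dict
        print(obj.data['url'], obj.data['timestamp'])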

WET-files

Thank you for cdx_toolkit! This is a great toolkit!
I have one question.
Can I get WET files from commoncrawl.org using cdx_toolkit? Or convert a WARC file to a WET file?

Thanks
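cdx_toolkit returns WARC records rather than WET files, so a rough substitute is to strip the HTML from obj.content yourself. A sketch (not a true WET conversion) using only the standard library's HTMLParser:

    import cdx_toolkit
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # crude text extraction: collect text nodes, skipping script/style
        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip = 0
        def handle_starttag(self, tag, attrs):
            if tag in ('script', 'style'):
                self._skip += 1
        def handle_endtag(self, tag):
            if tag in ('script', 'style') and self._skip:
                self._skip -= 1
        def handle_data(self, data):
            if not self._skip and data.strip():
                self.parts.append(data.strip())

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    for obj in cdx.iter('commoncrawl.org/*', from_ts='202001', to='202002',
                        limit=1, filter=['status:200', 'mime:text/html']):
        parser = TextExtractor()
        parser.feed(obj.content.decode('utf-8', errors='replace'))
        print('\n'.join(parser.parts))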

CommonCrawl index date range code is broken

cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

The above date range should be empty.

[Question]

How can I query just domain names (without /path/to/files), across all TLDs (not just *.com, *.io, ...)? I would like to build a dataset of just unique domains.

next(CDXFetcherIter) hangs when cc request returns 400/404

OS: Windows 10
Python 3.8.5
cdx_toolkit 0.9.29
requests 2.24.0

Recently, iterations on CDXFetcherIter have been hanging on some URLs; below is an example:

url = 'http://www.illinoishomepage.net/news/local-news/man-who-fell-through-ice-dies'
cdx = cdx_toolkit.CDXFetcher(source='cc')
next(cdx.iter(url, from_ts='201301',to='201709'))

And it turns out that on my machine, this request always returns 404:

cc_url = 'https://index.commoncrawl.org/CC-MAIN-2017-34-index'
params = {'to': '20170930235959', 'url': 'http://www.illinoishomepage.net/news/local-news/man-who-fell-through-ice-dies', 'output': 'json', 'page': 0, 'from': '20130101000000'}
headers = {'User-Agent': 'pypi_cdx_toolkit/installed-from-git'}
resp = requests.get(cc_url, params=params, headers=headers,
                    timeout=(30., 30.), allow_redirects=False)

And because of the way 404 is handled, CDXFetcherIter.get_more() will run this request over and over again with an increasing page number, but self.captures remains [], causing the loop to hang.

I don't know if it's a temporary Common Crawl server error or something related to my local connection, but the same error has persisted for 3 days now. So I think adding an option to limit the number of pages or the number of retries of get_more() would be nice. Thank you!
