
cdx_toolkit's Introduction

CoCrawler

Build Status Coverage Status Apache License 2.0

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Crawling the web can be easy or hard, depending upon the details. Mature crawlers like Nutch and Heritrix work great in many situations, and fall short in others. Some of the most demanding crawl situations include open-ended crawling of the whole web.

The goal of this project is to create a modular crawler with pluggable components, capable of working well for a wide variety of crawl tasks. The core of the crawler is written in Python 3.7+ using coroutines.
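As a rough illustration of that core idea (not CoCrawler's actual plugin API), here is a minimal coroutine-based fetch loop using asyncio and aiohttp; the function names and the concurrency limit are placeholders:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Each fetch is a coroutine, so many requests can be in flight at once.
        async with session.get(url) as resp:
            return url, resp.status, await resp.read()

    async def crawl(urls, concurrency=10):
        sem = asyncio.Semaphore(concurrency)  # crude politeness / load limit
        async with aiohttp.ClientSession() as session:
            async def bounded(url):
                async with sem:
                    return await fetch(session, url)
            return await asyncio.gather(*(bounded(u) for u in urls))

    results = asyncio.run(crawl(['https://example.com/']))
    for url, status, body in results:
        print(url, status, len(body))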

Status

CoCrawler is pre-release, with major restructuring going on. It is currently able to crawl at around 170 megabits/sec, or about 170 pages/sec, on a 4-core machine.


Installing

We recommend that you use pyenv / virtualenv to separate the python executables and packages used by cocrawler from everything else.

You can install cocrawler from PyPI using "pip install cocrawler".

For a fresher version, clone the repo and install it like this:

git clone https://github.com/cocrawler/cocrawler.git
cd cocrawler
pip install . .[test]
make pytest
make test_coverage

The CI for this repo uses the latest versions of everything. To see exactly what worked last, click on the "Build Status" link above. Alternatively, you can look at requirements.txt for a test combination that I probably ran before checking in.

Credits

CoCrawler draws on ideas from the Python 3.4 code in "500 Lines or Less", which can be found at https://github.com/aosabook/500lines. It is also heavily influenced by the experiences that Greg acquired while working at blekko and the Internet Archive.

License

Apache 2.0

cdx_toolkit's People

Contributors

machawk1, marshallduval, wumpus


cdx_toolkit's Issues

404 seen for API call

Hi,

I'm getting the error "ValueError: 404 seen for API call, did you configure the endpoint correctly?" for the following code:

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "https://www.cnn.com/*"
objs = list(cdx.iter(url, from_ts='202001', to='202002', filter=['status:200']))

The same code worked two days ago but fails now; do you know why this would happen? Thanks!

`collinfo.json` URL returning empty list

Hi, thanks for this package and the work you have done.

I'm running:

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

And getting:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File ".../venv/lib/python3.8/site-packages/cdx_toolkit/__init__.py", line 215, in __init__
    self.raw_index_list = get_cc_endpoints(self.cc_mirror)
  File ".../venv/lib/python3.8/site-packages/cdx_toolkit/commoncrawl.py", line 25, in get_cc_endpoints
    raise ValueError('Surprisingly few endpoints for common crawl index')  # pragma: no cover
ValueError: Surprisingly few endpoints for common crawl index

I did some looking, and it appears the code is checking:

https://index.commoncrawl.org/collinfo.json

for indices. No matter how I open or query that link, all I get is an empty list. I'm willing to work around this or work on a patch, but I can't seem to find an alternative to collinfo.json or where a list of indices is kept.
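For anyone debugging the same thing, a quick way to see what the endpoint actually returns (a sketch using requests; the User-Agent string is just a placeholder):

    import requests

    resp = requests.get('https://index.commoncrawl.org/collinfo.json',
                        headers={'User-Agent': 'collinfo-debug'},
                        timeout=30)
    print(resp.status_code)
    collections = resp.json()
    print(len(collections))  # cdx_toolkit expects a long list of index entries here
    for c in collections[:3]:
        print(c.get('id'), c.get('cdx-api'))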

Results not complete for Common Crawl index 2012

This is related to the comment in issue #17.
The following code returns nothing:

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "www.cnn.com/*"
objs = cdx.iter(url, from_ts='201207', to='201212', filter=['status:200', 'mime:text/html'])

I also ran objs = cdx.iter(url, from_ts='201201', to='201206', filter=['status:200', 'mime:text/html']) and it returns some results. However, the results only contain data from before 201207, so it seems the data between 201207 and 201212 is missing?
Similarly, objs = cdx.iter(url, from_ts='201301', to='201306', filter=['status:200', 'mime:text/html']) only returns data after 201304.

myrequests.py gets stuck in a loop if the response status is always 429, 500, 502, 503, 504, 509

OS: Windows 10
Python 3.8.5
cdx_toolkit 0.9.34

I'm having an issue where the CDX toolkit gets stuck in a loop and constantly prints cdx_toolkit.myrequests myrequests.py: 62 : retrying after 1s for 500. I've tracked this down to this line in myrequests.py. If I am reading this correctly, if the response status is always one of 429, 500, 502, 503, 504, 509, you will be stuck in this retry loop.

I suggest that after line 62 we break out of the loop if the number of retries is greater than 5.
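Something along the lines of the following sketch of the suggested behavior (this is not the actual myrequests.py code, just an illustration of a bounded retry loop):

    import time
    import requests

    RETRY_STATUSES = {429, 500, 502, 503, 504, 509}

    def get_with_retries(url, max_retries=5, **kwargs):
        # Retry a bounded number of times, then give up with an error,
        # instead of looping forever on persistent retryable statuses.
        for attempt in range(max_retries + 1):
            resp = requests.get(url, **kwargs)
            if resp.status_code not in RETRY_STATUSES:
                return resp
            time.sleep(2 ** attempt)  # back off a little more each time
        raise RuntimeError('still failing after {} retries (last status {})'.format(
            max_retries, resp.status_code))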

Is it possible to get only one URL of one domain from a TLD?

Hello,

I'm trying to look at some specifics only at certain URLs of domains.

Currently, I'm doing:

    url = '*.co.uk/'
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    objs = list(cdx.iter(url, from_ts='202006', to='202006',
                     limit=1000, filter='=status:200'))

and it finds the 1000 URLs and iterates over them (they might cover a few dozen domains).
Is there a filter I can employ if I am only interested in the front page, for instance?

Thanks,
Chris
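One client-side workaround (a sketch, not a built-in cdx_toolkit feature) is to post-filter the iterator and keep only captures whose path is the site root; it assumes the raw CDX record is available as obj.data with a 'url' field:

    from urllib.parse import urlparse
    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    objs = cdx.iter('*.co.uk/', from_ts='202006', to='202006',
                    limit=1000, filter='=status:200')

    front_pages = []
    for obj in objs:
        # keep only captures of the site root ("front page")
        path = urlparse(obj.data['url']).path
        if path in ('', '/'):
            front_pages.append(obj)
    print(len(front_pages))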

"ValueError('invalid hostname in url '+url) from None" when accessing internet archive CaptureObject.content

It seems to happen only with ia as a source and not cc. It also happens quite seldom, I'd say once every 5000-8000 accesses of a CaptureObject's content attribute.

my code that triggers it:

for obj in cdx.iter(url=url_pattern, 
            from_ts=self.date_range[0], 
            to=self.date_range[1], 
            filter=self.filter):
    with open(f"{html_dir}/{obj.data['digest']}.html", mode="wb") as f_w:
        f_w.write(obj.content)

here is the full traceback:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/SEER/utils/cdx_retriever.py", line 103, in _retrieve_content
    f_w.write(obj.content)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 122, in content
    self._content = self.fetch_warc_record().content_stream().read()
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/__init__.py", line 107, in fetch_warc_record
    self.warc_record = fetch_wb_warc(self.data, wb=self.wb)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/warc.py", line 118, in fetch_wb_warc
    resp = myrequests_get(wb_url, **kwargs)
  File "/home/user/anaconda3/envs/SEER-env/lib/python3.7/site-packages/cdx_toolkit/myrequests.py", line 63, in myrequests_get
    raise ValueError('invalid hostname in url '+url) from None
ValueError: invalid hostname in url https://web.archive.org/web/20191120071924id_/https%3A//www.placeholder.com/news/articles/2018-04-26/article-news-content
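A possible workaround, continuing the snippet above (a sketch, not a fix inside cdx_toolkit): catch the ValueError and skip the offending capture so the worker does not die.

    for obj in cdx.iter(url=url_pattern,
                from_ts=self.date_range[0],
                to=self.date_range[1],
                filter=self.filter):
        try:
            content = obj.content  # may raise ValueError for the odd bad capture
        except ValueError as e:
            print('skipping capture:', e)
            continue
        with open(f"{html_dir}/{obj.data['digest']}.html", mode="wb") as f_w:
            f_w.write(content)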

new Common Crawl year indices

The cdx_toolkit code gets Common Crawl index dates by parsing the index name, e.g. CC-MAIN-2020-50.

The new (old) indices don't have a week number:

"id": "CC-MAIN-2012",
"id": "CC-MAIN-2009-2010",
"id": "CC-MAIN-2008-2009",

which leads to the bug that 2012 is ignored, and I think 2009 and 2008 are treated as if they have a week number of 20.
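A sketch of index-id parsing that tolerates the year-only names (an illustration of the fix, not the actual cdx_toolkit code):

    import re

    def index_date_hint(index_id):
        # e.g. CC-MAIN-2020-50: year plus ISO week number
        m = re.match(r'CC-MAIN-(\d{4})-(\d{2})$', index_id)
        if m:
            return ('week', int(m.group(1)), int(m.group(2)))
        # e.g. CC-MAIN-2012, CC-MAIN-2009-2010: year or year range, no week
        m = re.match(r'CC-MAIN-(\d{4})(?:-(\d{4}))?$', index_id)
        if m:
            return ('years', int(m.group(1)), int(m.group(2) or m.group(1)))
        raise ValueError('unrecognized index id: ' + index_id)

    for i in ('CC-MAIN-2020-50', 'CC-MAIN-2012', 'CC-MAIN-2009-2010'):
        print(i, index_date_hint(i))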

Filters and url of crawled page in python client

Hi!

I am trying to figure out how to pass filters in the same way as is possible in the CLI; I am looking to filter by language, date, and status code.

The second piece I am missing: when iterating over the results, how does one grab the URL of the crawled page?

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = 'dailymail.co.uk/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

for obj in cdx.iter(url, limit=100):
    print(obj.text) # obj.url would be nice :)

EDIT:

Also, the example in the readme does not print out the content of obj, but rather:
<cdx_toolkit.CaptureObject object at 0x1125f84d0>
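For what it's worth, a sketch of both pieces: filter= accepts a list of field:value strings just like the CLI (status and mime shown here; a language filter would be another such string if the index exposes that field), and the raw CDX record is available as the obj.data dict used in the "invalid hostname" issue above, whose 'url' and 'timestamp' keys are standard CDX fields:

    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    url = 'dailymail.co.uk/*'

    # same field:value filter strings as the CLI; adjust as needed
    filters = ['status:200', 'mime:text/html']

    for obj in cdx.iter(url, from_ts='202001', to='202002',
                        limit=100, filter=filters):
        # obj.data is the raw CDX record as a dict
        print(obj.data['url'], obj.data['timestamp'])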

WET-files

Thank you for cdx_toolkit! This is a great toolkit!
I have one question.
Can I get WET files from commoncrawl.org using cdx_toolkit? Or convert a WARC file to a WET file?

Thanks
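cdx_toolkit returns WARC records rather than WET files, so a rough substitute is to strip the HTML from obj.content yourself. A sketch (not a true WET conversion) using only the standard library's HTMLParser:

    import cdx_toolkit
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # crude text extraction: collect text nodes, skipping script/style
        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip = 0
        def handle_starttag(self, tag, attrs):
            if tag in ('script', 'style'):
                self._skip += 1
        def handle_endtag(self, tag):
            if tag in ('script', 'style') and self._skip:
                self._skip -= 1
        def handle_data(self, data):
            if not self._skip and data.strip():
                self.parts.append(data.strip())

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    for obj in cdx.iter('commoncrawl.org/*', from_ts='202001', to='202002',
                        limit=1, filter=['status:200', 'mime:text/html']):
        parser = TextExtractor()
        parser.feed(obj.content.decode('utf-8', errors='replace'))
        print('\n'.join(parser.parts))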

CommonCrawl index date range code is broken

cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

The above date range should be empty.

[Question]

How can I query just domain names (without /path/to/files), across all TLDs (not just *.com, *.io, ...)? I would like to build a dataset of just unique domains.

next(CDXFetcherIter) hangs when cc request returns 400/404

OS: Windows 10
Python 3.8.5
cdx_toolkit 0.9.29
requests 2.24.0

Recently, iterations on CDXFetcherIter have been hanging on some URLs; below is an example:

url = 'http://www.illinoishomepage.net/news/local-news/man-who-fell-through-ice-dies'
cdx = cdx_toolkit.CDXFetcher(source='cc')
next(cdx.iter(url, from_ts='201301',to='201709'))

And it turns out that on my machine, this request always returns 404:

cc_url = 'https://index.commoncrawl.org/CC-MAIN-2017-34-index'
params = {'to': '20170930235959', 'url': 'http://www.illinoishomepage.net/news/local-news/man-who-fell-through-ice-dies', 'output': 'json', 'page': 0, 'from': '20130101000000'}
headers = {'User-Agent': 'pypi_cdx_toolkit/installed-from-git'}
resp = requests.get(cc_url, params=params, headers=headers,
                    timeout=(30., 30.), allow_redirects=False)

And because of the way 404 is handled, CDXFetcherIter.get_more() will run this request over and over again with an increasing page number, but self.captures remains [], causing the loop to hang.

I don't know if it's a temporary Common Crawl server error or something related to my local connection, but the same error has persisted for 3 days now. So I think adding an option to limit the number of pages or the number of retries of get_more() would be nice. Thank you!
