Hello, I'm trying to look at some specifics only at certain URLs of

Is it possible to get only one URL of one domain from a TLD? (cdx_toolkit, closed)

cocrawler commented on June 2, 2024

Is it possible to get only one URL of one domain from a TLD?

from cdx_toolkit.

Comments (4)

wumpus commented on June 2, 2024

You didn't show what the value of url is. There's an easy way to get just the front page: don't use url = "commoncrawl.org/*"; leave off the final "*".
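A rough sketch of that suggestion, assuming the README-style `CDXFetcher(source='cc')` and `iter(url, limit=...)` calls. The helper names `exact_url_query` and `front_page_captures` and the `limit` value are made up for illustration, and the second function needs network access to run:

```python
def exact_url_query(pattern):
    """Drop a trailing "*" so the query names one URL (hypothetical helper)."""
    return pattern.rstrip('*')


def front_page_captures(pattern, limit=5):
    """Ask the Common Crawl index for captures of a single URL only."""
    # Imported lazily so exact_url_query works without the package installed.
    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    # No trailing "*": the lookup is for this URL, not for every page below it.
    return list(cdx.iter(exact_url_query(pattern), limit=limit))
```

Calling `front_page_captures('commoncrawl.org/*')` would then query `commoncrawl.org/` rather than the whole site.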


Chris8080 commented on June 2, 2024

True, I forgot that one.
I'd like to get all front pages from one specific TLD.

It seems as if:
url = '*.co.uk/*' and url = '*.co.uk/' and url = '*.co.uk' all do the same thing and get all URLs from one domain in the index.

Conceptually speaking:
cdx = cdx_toolkit.CDXFetcher(source='cc')
is retrieving the index,
and the iter method will still send one request per URL to Amazon?
In that case, I could iterate through my URLs without using the index and just retrieve the HTML from the crawl files, and it would cause the same number of requests?
(I was looking to reduce the number of requests, if that's at all possible.)


wumpus commented on June 2, 2024

Unfortunately that's a hard-to-explain aspect of the cdx index. "*.co.uk" is actually a query for the surt "^co.uk.". You could add an additional regex filter on the url, but then you're descending into the swamp of details of exactly how surts are computed... does "pbm.com/" turn into "com.pbm)" or "com.pbm)/"? Only experimentation will tell you.
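One way to sidestep the surt question is to filter on the client side, after the index has returned its records. This is only a sketch under assumptions: it works on plain URL strings (for cdx_toolkit records the URL is assumed to be available as `obj['url']`), the function names are hypothetical, and it does not reduce the number of index results fetched.

```python
from urllib.parse import urlsplit


def is_front_page(url):
    """True when the URL has an empty or "/" path and no query string."""
    # Bare hosts like "pbm.com/" need a "//" prefix to be parsed as a host.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    parts = urlsplit(url)
    return parts.path in ('', '/') and not parts.query


def front_pages(urls):
    """Keep only the front-page URLs, preserving input order."""
    return [u for u in urls if is_front_page(u)]
```

With a list of URLs taken from a "*.co.uk" query, `front_pages(urls)` would keep entries such as `https://example.co.uk/` and drop `https://example.co.uk/about`.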


Chris8080 commented on June 2, 2024

OK, I see. I'll experiment with that and check the results.
Thank you.
