Comments (2)
Hm, I think the error message should be improved to log the invalid URLs, but I'm not sure the crawl should continue.
It seems like it's probably an error, and the user would want to fix the invalid URLs before proceeding.
from browsertrix-crawler.
Yes, a bit more information in the log files is probably enough.
However, if a set of 100 seeds contains 5 errors, the operator has to start the crawl 5 times, fixing one URL at a time.
If Browsertrix continues with the valid seeds while logging any errors, the main crawl needs just one run. The operator can check the errors later and make a new run with the 5 fixed URLs.
(Heritrix continues a crawl and logs seed errors in reports/seeds-report.txt for later checks.)
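To make the "continue with valid seeds, log the rest" idea concrete, here is a minimal sketch in Python. It is not browsertrix-crawler's actual code: the function name `partition_seeds`, the validity rule (an http/https scheme plus a hostname), and the log wording are all assumptions chosen for illustration.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("seed-check")  # hypothetical logger name


def partition_seeds(seeds):
    """Split seed strings into (valid, invalid).

    `invalid` is a list of (seed, reason) tuples so that every bad
    seed can be reported in one pass instead of one per crawl run.
    """
    valid, invalid = [], []
    for raw in seeds:
        seed = raw.strip()
        if not seed:
            continue  # skip blank lines in a seed file
        try:
            parts = urlsplit(seed)
            if parts.scheme not in ("http", "https"):
                reason = "missing or unsupported scheme"
            elif not parts.hostname:
                reason = "missing hostname"
            else:
                reason = None
        except ValueError as exc:
            # urlsplit raises ValueError for malformed netlocs,
            # for example an unclosed IPv6 bracket
            reason = str(exc)
        if reason is None:
            valid.append(seed)
        else:
            invalid.append((seed, reason))
            logger.warning("Invalid seed skipped: %r (%s)", seed, reason)
    return valid, invalid
```

With this shape, the crawl would start from `valid` as long as it is non-empty, and the operator could collect the entries in `invalid` afterwards and rerun only the corrected URLs.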