jancurn / actor-find-broken-links

Source code of an Apify actor that finds and reports broken links on a website. Unlike other SEO analysis tools, it also reports broken URL #fragments.

Home Page: https://apify.com/jancurn/find-broken-links

License: Apache License 2.0

Dockerfile 5.99% JavaScript 94.01%

seo broken-links

actor-find-broken-links's People

acebytes lhotanok natashalekh siyalab hamzaalwan quicklifesolutions

actor-find-broken-links's Issues

readme update/correct

Better explain how to tell whether a URL is broken. For example, is this a broken link?

{
"url": "https://tvo.org/current-affairs",
"normalizedUrl": "https://tvo.org/current-affairs",
"httpStatus": null,
"errorMessage": null,
"fragment": "",
"fragmentValid": false,
"crawled": false
},

The example dataset is an old example with only a KV store.

Now there is nothing in the KV store as described in the readme; there is only the output dataset:
https://my.apify.com/tasks/hVQYNQVysl3nX4bP2#/runs/oH3FaE4wkumgG6reG

Do you need proxies to run the actor? What is the consumption?

Failed request with .pdf

Here is the kind of error I see:

ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.humancoders.com/formations/securite-des-applications-web.pdf","retryCount":1}
2020-07-03T12:04:43.666Z   Error: net::ERR_ABORTED at https://www.humancoders.com/formations/securite-des-applications-web.pdf
2020-07-03T12:04:43.666Z     at navigate (/home/node/node_modules/puppeteer/lib/FrameManager.js:103:37)
2020-07-03T12:04:43.667Z     at <anonymous>
2020-07-03T12:04:43.667Z     at process._tickCallback (internal/process/next_tick.js:189:7)
2020-07-03T12:04:43.668Z   -- ASYNC --
2020-07-03T12:04:43.668Z     at Frame.<anonymous> (/home/node/node_modules/puppeteer/lib/helper.js:144:27)
2020-07-03T12:04:43.669Z     at Page.goto (/home/node/node_modules/puppeteer/lib/Page.js:587:49)
2020-07-03T12:04:43.669Z     at Page.<anonymous> (/home/node/node_modules/puppeteer/lib/helper.js:145:23)
2020-07-03T12:04:43.669Z     at PuppeteerCrawler.gotoFunction (/home/node/node_modules/apify/build/puppeteer_crawler.js:30:53)
2020-07-03T12:04:43.670Z     at PuppeteerCrawler._handleRequestFunction (/home/node/node_modules/apify/build/puppeteer_crawler.js:308:48)
2020-07-03T12:04:43.670Z     at <anonymous>
2020-07-03T12:04:43.671Z     at process._tickCallback (internal/process/next_tick.js:189:7)

Better handling of URL query parameters

For example, when crawling blog.apify.com, some pages are visited multiple times because their URLs differ only in query parameters:

https://blog.apify.com/contact-information-scraper-7104cb0df25e?source=post_recirc---------1------------------
https://blog.apify.com/contact-information-scraper-7104cb0df25e?source=collection_home---4------8-----------------------

We could add an input option listing URL parameters that should be ignored when deciding whether a URL is unique.
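
One way the proposed option could work is sketched below. This is only an illustration, not the actor's code: the function name `uniqueKeyIgnoringParams` and the `ignoredParams` list are hypothetical, and the sketch relies only on the standard WHATWG `URL` class available in Node.js.

```javascript
// Hypothetical helper: builds a de-duplication key for a URL while ignoring
// the query parameters named in `ignoredParams` (an assumed input option).
function uniqueKeyIgnoringParams(url, ignoredParams = []) {
    const parsed = new URL(url);
    for (const name of ignoredParams) {
        // Removes every occurrence of the parameter, if present.
        parsed.searchParams.delete(name);
    }
    // Sort the remaining parameters so their order does not affect the key.
    parsed.searchParams.sort();
    return parsed.toString();
}

// The two blog URLs from this issue differ only in `source`.
const keyA = uniqueKeyIgnoringParams(
    'https://blog.apify.com/contact-information-scraper-7104cb0df25e?source=post_recirc---------1------------------',
    ['source'],
);
const keyB = uniqueKeyIgnoringParams(
    'https://blog.apify.com/contact-information-scraper-7104cb0df25e?source=collection_home---4------8-----------------------',
    ['source'],
);
console.log(keyA === keyB);
```

With `source` ignored, both URLs map to the same key, so the page would be enqueued only once.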

Add option to report 301 redirects

Sometimes you also want to find links that still work, but have changed over time, so that you can fix those.

Add option to only show errors in report

Or maybe generate two reports?

Add option to also crawl sub-domains

So that if you enter apify.com, it will also crawl blog.apify.com etc.
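
A minimal sketch of the check such an option might use, assuming the base domain is supplied as a plain hostname. The function name `isSameDomainOrSubdomain` is hypothetical and not part of the actor.

```javascript
// Hypothetical helper: returns true when `url` is on `baseDomain` itself
// or on any of its sub-domains.
function isSameDomainOrSubdomain(url, baseDomain) {
    // URL lowercases the hostname of http(s) URLs during parsing.
    const { hostname } = new URL(url);
    const base = baseDomain.toLowerCase();
    // The leading dot prevents "notapify.com" from matching "apify.com".
    return hostname === base || hostname.endsWith(`.${base}`);
}

console.log(isSameDomainOrSubdomain('https://blog.apify.com/post', 'apify.com'));
console.log(isSameDomainOrSubdomain('https://notapify.com/', 'apify.com'));
```

Comparing on a dot boundary rather than with a plain suffix match keeps unrelated domains that merely end in the same letters out of the crawl.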

Skip mailto: and other special links

Subject says it all
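
As a rough sketch of what skipping could look like, the hypothetical helper below keeps only links that resolve to `http:` or `https:` URLs. It is not the actor's implementation and uses only the standard `URL` class.

```javascript
// Hypothetical helper: decides whether a link found on a page should be
// checked at all. Links such as mailto:, tel: or javascript: are skipped.
function isCrawlableLink(href, baseUrl) {
    let parsed;
    try {
        // Resolves relative links against the page they were found on.
        parsed = new URL(href, baseUrl);
    } catch (err) {
        // Unparseable hrefs cannot be requested, so skip them too.
        return false;
    }
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

console.log(isCrawlableLink('mailto:info@example.com', 'https://example.com/'));
console.log(isCrawlableLink('/about', 'https://example.com/'));
```

Allow-listing the two web protocols, instead of block-listing `mailto:` and friends one by one, also covers special schemes nobody thought to list.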

Add option to enter email for notifications

So that people can easily schedule the actor to send them a report every day or week.

Deal with the `www.` prefix in internal page links

When inspecting a website, if you enter its URL with a leading www., like https://www.apify.com/, but the internal links lack the www. prefix, like https://apify.com/about, those links are not enqueued for recursive inspection, even though they are technically nested pages of the main page. We should consider adding logic that handles this, for example by ignoring a leading www. in the hostname when deciding whether to check a link recursively.
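
The comparison suggested above could be sketched roughly as follows. Both function names are hypothetical, and the sketch assumes that hostnames differing only by a leading `www.` always serve the same site, which is common but not guaranteed.

```javascript
// Hypothetical helper: removes a single leading "www." from a hostname.
function stripWww(hostname) {
    return hostname.replace(/^www\./i, '');
}

// Hypothetical helper: true when both URLs share a hostname once a
// leading "www." is ignored on either side.
function isSameSiteIgnoringWww(linkUrl, startUrl) {
    const linkHost = stripWww(new URL(linkUrl).hostname);
    const startHost = stripWww(new URL(startUrl).hostname);
    return linkHost === startHost;
}

console.log(isSameSiteIgnoringWww('https://apify.com/about', 'https://www.apify.com/'));
```

A crawler would still need to decide separately whether the link's path is nested under the start URL; this sketch only covers the hostname part of that decision.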

Errors on first run