w3c / link-checker Goto Github PK

Check links and anchors in Web pages or full Web sites.

Home Page: https://validator.w3.org/checklink

Perl 100.00%

link-checker's Introduction

W3C-LinkChecker

This distribution contains the W3C Link Checker.

The link checker can be run as a CGI script in a web server as well as on the command line. The CGI version provides a HTML interface as seen at http://validator.w3.org/checklink.

Install

To install the distribution for command line use:

git clone https://github.com/w3c/link-checker.git
cd link-checker
#if you have cpanminus installed
cpanm --installdeps .
perl Makefile.PL
make
make test
make install # as root unless you are using local::lib

To install the CGI version, in addition to the above, copy the bin/checklink script into a location in your web server from where execution of CGI scripts is allowed, and make sure that the web server user has execute permissions to the script. The CGI directory is typically named "cgi-bin" somewhere under your web server root directory.

For more information, please consult the POD documentation in the checklink.pod file, typically (in the directory where you unpacked the source):

perldoc ./bin/checklink.pod

...as well as the HTML documentation in docs/checklink.html.

As a Docker container

You may want to use @stupchiy's Dockerfile, which is based on Ubuntu Linux, and follow his instructions:

$ docker build -t link-checker .                                                                        # Build an image
$ docker run -it --rm link-checker                                                                      # Run a container
$ docker run -it --rm link-checker checklink https://foo.bar                                            # Run script directly
$ docker run -it --rm -v "$PWD":/home/checklink link-checker checklink -H https://foo.bar > report.html # Write to HTML file

Useful links

Archived bug list (Decommissioned Bugzilla)
www-validator MailingList used to discuss the W3C Markup Validation Service, Link Checker and Log Validator.

Copyright and License

Written by the following people for the W3C:

Hugo Haas [email protected]
Ville Skyttä [email protected]
W3C QA-dev Team [email protected]

Copyright (C) 1994-2023 World Wide Web Consortium Inc. All Rights Reserved. This work is distributed under the W3C Software License in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

link-checker's People

Contributors

Stargazers

Watchers

link-checker's Issues

is not an issue HOW to check only if URls under domain.com are not broken ? only the main urls e.g. products

i need to scan my
www.domain.com/ and all subdirectoies
which contains urls

but need to scan only the urls not all the links insed related....
thanks

robots.txt

Validator's own robots.txt filters cause errors for link checker itself.
Here's the link: https://validator.w3.org/robots.txt

For example, a webpage with a link like: https://validator.w3.org/checklink?uri=xyz.com

Produces this error: Status: (N/A) Forbidden by robots.txt. The link was not checked due to robots exclusion rules. Check the link manually.

And, to solve this issue, the link checker gives you an advice there: https://validator.w3.org/checklink/docs/checklink.html#bot

Error: 403 Checking non-public IP address disallowed by link checker configuration

I want to check localhost links. How can I do that?

Links with %2A in them are incorrectly reported as 301 redirects

Whenever a link is checked that contains %2A, the checker appears to convert that to a * then reports a 301 redirect even if the URL with %2A in it actually results in a 200 response.

Example: https://validator.w3.org/checklink?uri=https%3A%2F%2Fvideochums.com%2Freview%2Fgal-gun-double-peace

In the above tested page, the link in the source code is https://www.esrb.org/ratings/34456/Gal%2AGun%3A+Double+Peace/ which warrants a 200 response. However, the tool thinks that the link is https://www.esrb.org/ratings/34456/Gal*Gun%3A+Double+Peace/ which 301 redirects.

Update message for code 308

The code 308 has been recognized in popular documents such as MDN & Google Crawling. So should it a right time to update the Link checker message for 308 code? Giving the question mark like below really doesn't do any helpful.

Needs SSLEAY under Strawberry Perl, enh. to check HTML and CSS for each page

I've been using the LinkChecker as installed by CPAN and have two points:

Bug Report:

Using Strawberry Perl on a Win7 x64 machine, executing checklink.bat results in a complaint about taint mode:

    C:\util\html>\Strawberry\perl\site\bin\checklink.bat www.google.com
    "-T" is on the #! line, it must also be used on the command line.

This can be worked around by invoking the script directly from perl, such as:

C:\util\html>perl -wT \Strawberry\perl\site\bin\checklink www.google.com
`

This starts to run, and then a dialog box appears:

The program can't start because SSLEAY32_.dll is missing from your computer. ...

This seems to be because google.com has an https:// link:

    W3C Link Checker version 4.81 (c) 1999-2011 W3C
    GET http://www.google.com/  fetched in 1.44 seconds

    Processing      http://www.google.com/


    Settings used:
    - Accept: text/html, application/xhtml+xml;q=0.9, application/vnd.wap.xhtml+xml;q=0.6, */*;q=0.5
    - Accept-Language: (not sent)
    - Referer: sending
    - Cookies: not used
    - Sleeping 1 second between requests to each server
    Parsing...
     done (6 lines in 0.06 seconds).
    Checking anchors...
     done.

    Checking link http://www.google.com/intl/en/ads/
    HEAD http://www.google.com/intl/en/ads/
    -> HEAD https://www.google.com/intl/en/ads/  fetched in 59.67 seconds

    Checking link https://mail.google.com/mail/?tab=wm
    HEAD https://mail.google.com/mail/?tab=wm  fetched in 1.00 seconds

I have looked through the code and would suggest fixing this by changing line 29 of checklink.pl from:

$ENV{PATH} = undef;
to:

$ENV{PATH} = join($Config{path_sep}, $Config{installbin}, $Config{sitebin});
I have tested that change on Strawberry Perl on Win7 x64 and on Ubuntu 16.04 perl and it works OK. On the PC, a path is often needed for libraries like SSL.

Enhancement:
I am in the process of setting up an internal server cron job to check websites periodically and send an email if they have broken links or bad HTML/CSS. Currently, checklink.pl doesn't do HTML validation, it just points to a separate validation service in its report. I could use a list of links in a script or batch file to call a separate validation program for each link, or I could modify checklink.pl to do that internally. I have tried to modify checklink.pl to do both:

To output list of pages checked, at line 542, insert:


            print "\nPages processed:\n";

            foreach my $k (keys %processed) {

                print "$k : $processed{$k}\n";

            }

To call the nu validator for each page checked, at line 1354, insert:

		# XXXX RB add page validation:
		{
			use Capture::Tiny ':all';
			my $out = capture_merged {
                # system ('/usr/bin/java', '-Xss1024k','-jar','./vnu.jar','--format','text','--asciiquotes','--skip-non-html',$response->{absolute_uri});
                system ('C:\ProgramData\Oracle\Java\javapath\java.exe', '-Xss1024k','-jar','vnu.jar','--format','text','--asciiquotes','--skip-non-html',$response->{absolute_uri});
			};
			if ($? >> 8) {
				print "\nPage Validation: $response->{absolute_uri}\n";
				print "$out \n";
			}
			else {
				print "\nPage Validation: $response->{absolute_uri} - OK\n" unless $Opts{Summary_Only};
			}
		}

Note that I have a line commented out for the system call on linux/unix. Because this program runs under taint mode, a system("whole command line string") call won't work. Since $ENV{PATH} = undef, on Windows, there is no path to search for the java runtime. (The suggested bugfix above will make useful paths for Perl programs, but not Java) On Linux, you could try /usr/local/bin:/usr/bin and hope it works most of the time. An alternative is to pass the path to java in from a command line argument or an entry in the config file. The command line opens security issues that taint mode tries to prevent, as a user could specify any malicious program as "java". Though, taint mode for a non-setuid command line program seems severe.

Putting this in the config file would open up the opportunity for the user to specify a config file with W3C_CHECKLINK_CFG that similarly contains a malicious path. However, this does not seem of much use unless checklink.pl is running with elevated privileges, which it doesn't need.

I would be interested in your thoughts on the bugfix and enhancement ideas.

Include instructions for installing with cpanminus and local::lib

While it might not be the use case for w3c website, if a user (me in this case) wants to install locally link-checker, a mention to installing through cpanm to install dependencies and local::lib would be great.

i've created a simple (https://github.com/w3c/link-checker/pull/8)[pull request] to the readme with this change.

URLs in CSS

I just found this in archived logs. If checklink has been updated since January 2021, the issue may no longer exist.

Request, as part of link checking for a page:

128.30.52.138 - - [16/Jan/2021:13:22:09 -0800] "HEAD /ebooks/second/images/backline.png%22),%20no-repeat%20center%20bottom%20/%206em%202px%20url(%22images/backline.png HTTP/1.1" 200 342 "http://example.com/ebooks/second/" "W3C-checklink/5.0.0"

See all that extraneous stuff with %22 (quotation mark) and %20 (space)? I traced it to this line in CSS:

p.booksellers {padding-top: .5em; padding-bottom: .5em; background: no-repeat center top / 6em 2px url("images/backline.png"), no-repeat center bottom / 6em 2px url("images/backline.png");}

It looks as if checklink doesn't know about multiple background layers. So instead of asking for each url separately, it treated everything between the first quotation mark and the last one as a single vast url.

In this particular case it did no harm, as both layers involve the same image, and the server's PathInfo settings mean the request netted a 200. But the relevant code should probably be tweaked anyway.

Option to report issue once per run as opposed to per page

If every page has the same issue with the same URL, please provide an option to only report that URL as a problem on the first occurrence.

For instance, if both page 1 and page 2 have a link to https://www.example.com/problempage3.htm, which returns 405 Method Not Allowed, only report https://www.example.com/problempage3.htm as an error on page 1. However, if https://www.example.com/problempage4.htm occurred on page 2 and not on page 1, https://www.example.com/problempage4.htm would be reported because it is not a duplicate.

This would help keep the report size smaller and avoid duplicate effort when plowing through the report, especially for links which are duplicated on every page of a large site.

w3c LINK CHECKER didnt work on my website but other Link Checkers run

30May2020: w3c LINK CHECKER didnt work on my website but other Link Checkers run

ERROR MESSAGE
Error: 500 Can't connect to (MYWEBNAME).com:443 (SSL connect attempt failed error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small)

Hacked by Users! Filed Police Cyber Crime&IC3 FEDS CHARGES!

I've ASKED YOU PPL TOO GET TF OFF MY PERSONAL CELLPHONE BUT FOR SOME REASON THESE GROSS ASS INDIAN, ETC SCUMBAGS HAVE STALKED ME GOTTEN INTO MY BANK , STOLLEN MONEY CHANGED MY PHONE ARE 100% STALKING ME! NONE OF YOU HAVE MY PERMISSION TOO BE ON MY PHONE! IVE FILED ILLEGAL HACKING ,STALKING CHARGES, STATE POLICE CYBER CRIMES UNIT ,FBI IC3, CYBERCRIME UNIT! ANYONE FROM GITHUB ON MY PHONE? YOURE BEING TRACKED TRACED AND YES IM GOING TOO SUE TF OUT OF YOU SCUMBAGS! GET THE FUCK OFF MY PHONE ! All THE SHIT YOU DISGUSTING PERVERTS INSTALLED ON MY PHONE? GET IT THE FUCK OFF IMMEDIATELY! IDC WHAT COUNTRIES YOU'RE IN, IM GOING TOO MAKE SURE YOUR GROSS STALKING, MONEY EMBEZZLING THEIVES ASSES GO TOO PRISON FOR A VERY LING FUCKING TIME! LAST TIME GITHUB ! GET UR FUCKING SCUMBAG STALKING PERVERTED LOSERS OFF MY FUCKING CELLPHONE BECAUSE "YOU "- " GITHUB"! WILL BE SUED AS WELL! [email protected]/ [email protected]/ [email protected] STAY TF OFF MY PHONES COMPUTERS YOU DO NOT HAVE MY PERMISSION OR CONSENT UNDERSTAND!!! NOW FUCKOFF!!!!!!!
**** JUNE 9 2024 SUNDAY 1:24 P.M. ****

Supported protocols

In the version 4.81 (of 2011) the offline usage was possible out of the box. In the current (February 26, 2017 zip) version a message is given about a non supported file protocol. Looking on github it looks like this is due to the change on June 23, 2015 (hash: bdf0e3d) where one of the changes, on line 85 of checklink, is:

 -   $self->protocols_forbidden([qw(mailto javascript)]);
 +   $self->protocols_allowed([qw(http https ftp)]);

the reasoning for this is: s/forbidden/allowed/ whitelisting easier than blacklisting

Is there a reason for disallowing the file by default? There are, in my opinion, a few solution possibilities:

patch the checklink by adding file
add an option to checklink that adds other protocols to the protocols_allowed

Time for a release?

The last release (4.81) was in 2011, and there has been considerable development since then. Please could we have a new release? Of course it's possible to simply install from GitHub, but it would be great if a new release made it into distros.

Thanks for link-checker!

Integrate test suite

https://dev.w3.org/cvsweb/2008/link-testsuite/ see also #14

Incorrect "Some of the links to this resource point to broken URI fragments"

I run https://validator.w3.org/checklink?uri=https%3A%2F%2Fmapsaregreat.com%2Flaser%2F&hide_type=all&depth=&check=Check test and while many links are right now indeed broken, the

Line: 76 https://github.com/matkoniecz/lunar_assembler
Status: 200 OK

Some of the links to this resource point to broken URI fragments (such as index.html#fragment).
Broken fragments:
   https://github.com/matkoniecz/lunar_assembler#lunar-assembler (line 76)

seems invalid

https://github.com/matkoniecz/lunar_assembler#lunar-assembler seems to be 100% fine

Check more CSP / SRI cases

#7 already needs some content-security-policy integration; but it would be good if the link checker could also verify other rules imposed by CSP and/or sub-resource integrity.

Enhancement: implement an option to check canonical URLs

A link can result in a 200 response yet it has a different canonical URL.

It would be awesome if there was an option to check each linked page's http headers as well as the rel="canonical" element in its html head to ensure that the link URL is the same as the specified canonical URL and if it isn't, issue a warning.

How to use results programatically

Is there a preferred way to script checklink to test for the presence or absence of errors? E.g. a CLI param to return 0 if no errors found or 2 if checklink completed successfully but found errors on the site.

I'm currently thinking to test for the Error: string, but that feels ... fragile.

What is the latest version?

According to the tag listing in this repo, the latest version is 4.81, but the Download link from the W3C website[1] ends up on metacpan[2] with a version of 5.0.0:

This appears to indicate that this repo is not the latest version of this tool.

I also note that the tag "checklink-4_81" has a date of 16 October 2011, but the latest commit on this repo is 9 months ago.

Documentation for link-checker web server missing

It looks like a documentation for the web server at https://validator.w3.org/checklink is missing.

What does this tool exactly do?

Does it check links from www.mypage/... to all internal pages www.mypage/... and all outgoing links? Or does it just take depth parameter?

And what does the 'line' number in the result mean? Just the line of the html of the original document? What about the linked pages like www.mypage/some?

Does it crawl all the subpages of www.mypage? Or not?

Easy-to-parse output in CGI mode

When run in CGI mode, the output seems to only be available as HTML. But if I want to process the output, HTML can be a hassle to parse.

So I'd like an option for CGI mode to output a csv file or json which would be easy to parse. Or the default output format might be OK, but I don't know what it looks like.

This would make the w3c service more useful, not requiring users to download and install the tool to get non-HTML output.

Providing an example in the documentation of the non-HTML output would also helpful.

Don't check <link rel="preconnect"> and <link rel="dns-prefetch">

The two <link> types "preconnect" and "dns-prefetch" do not link to an actual resource but tell the browser to set up a connection with that site. They usually specify just the scheme and host (without a path) in the href.

Actual behaviour

Link checker tries to access the root of the domain and returns an error if there is nothing there (as there often won't be for sites that just host resources, the main target of preconnect).

Expected behaviour

Ignore <link> elements with rel="preconnect" or rel="dns-prefetch" since they are specifying a site rather than a specific resource.

Example

The following site demonstrates this problem (unfortunately it's not an ideal testcase as it has so many links to check):

https://www.smashingmagazine.com/

Automatic test workflow is failing

due to resolving some of the CPAN dependencies

Set up CI with Travis

eg, to run tests automatically for PRs.

See docs.travis-ci.com/user/languages/perl6.

incorrect handling of picture source sets

Consider the following html5 picture tag with source set:

When this is checked with the link checker, it will look for the non-existing files "200w", "400w" and "800w", which is incorrect behaviour. These values describe image width, not image source. The link checker also checks for the files "picture-small.jpg", "picture-medium.jpg" and "picture-large.jpg", which is correct.

This example is available online at: https://www.geekabit.nl/test/picture-srcset.html

mod-security doesn't like libwww-perl

Hi there,
I found problems here and there because more and more servers don't like calls from libwww-perl

Classic log line:
[client 128.30.52.138] ModSecurity: Access denied with connection close (phase 1). Pattern match "libwww-perl" at REQUEST_HEADERS:User-Agent. [file "remote server"] [line "-1"] [id "913111"] [msg "Malware.Expert - Found User-Agent associated with scripting/generic HTTP client"]

The solution could be:
A proper UA, e.g. something like "link-checker/1.0"
or
an text input in the GUI to let the user set a UA
or
a select with the most common UAs

References within <code> elements

I am unsure if this is a bug or a feature. My personal understanding is, that contents of a code-element (<code>...</code>) or a block of code (<pre><code>...</code></pre>) should not be checked.

The following example from one of my talks (https://www.jonas.me/talks/introduction-to-the-web-content-accessibility-guidelines/#/4/13 (~8 MB!!!)) throws errors.

Excerpt:
<pre> <code> <p>I like HyperText Markup Language (HTML).</p> <video src="video.mp4"> <track srclang="en" src="en.vtt"> <track srclang="de" src="de.vtt"> </video> </code> </pre>

Screenshot: https://imgur.com/qM966ro

Maybe it's a good idea to discuss this topic.

Suppress Data-URI N/A skipped messages?

The link checker at present warns about every instance of data URIs in every page, eg

info Line: 1963 data:image/gif;base64,R0lGODlhBAAEAPAAMQAAAIiIiCwAAAAABAAEAAACBkwAhqgZBQA7
Status: (N/A) Access to 'data' URIs has been disabled

Accessing links with this URI scheme has been disabled in link checker.

This adds a lot of clutter to my reports (I have, for various CSS & dark mode reasons, 6 data URIs inlined on every page on my site, and each one generates a useless warning taking up 10 vertical lines each collectively taking up a good third of the report for most pages).

I've looked at the docs & bug reports, and I can't see any discussion of why link-checker might hypothetically check data URIs, such that the user needs to be warned that it did not in fact check them. Is there some way in which data URIs can be "broken links" within the scope of link-checker? A data URI is pretty much defined as not being a hyperlink elsewhere but literally encoding, inline, a file which could've been hyperlinked elsewhere but is not. So it seems like any 'broken' data URIs would either be broken HTML (and so checked by the other HTML validation tool instead), or would simply encode a bad data payload such as invalid SVGs (which, while possibly useful, is way beyond a link checker's scope given how many things are encoded into arbitrary data URIs).

If there is no plausible use-case, perhaps the warnings can be omitted entirely? Or if there are some use-cases, maybe a toggle for reporting them (defaulting to off or on depending on how common said use-cases are). Any of those would make it easier to read the more meaningful warnings & errors.

W3C validators not getting correct referrer, possibly because of a tilde (~) in url

The W3C validators for html and css are no longer recognizing the correct referrer url from my website (they used to). Possibly, this is because there is a tilde (~) in the url. In particular, (http://users.ugent.be/~jaschmid/) shows the validation for (http://users.ugent.be/) with the recommended code <a href="http://validator.w3.org/check?uri=referer"> (html) and <a href="http://jigsaw.w3.org/css-validator/check/referer"> (css). I am able to get the correct validation with <a href="http://validator.w3.org/check?uri=users.ugent.be/~jaschmid/"> and <a href="http://jigsaw.w3.org/css-validator/validator?uri=users.ugent.be/~jaschmid/">, however.

checklink dies w/ status 255: Modification of non-creatable array value attempted

I get an error like this when I run checklink against some large HTML files:

Modification of non-creatable array value attempted, subscript -2147468066 at /usr/local/bin/checklink line 1443.

line 1443 is in the loop distributing links to avoid 1s delay between HEAD requests to the same host:

1440:    # Distribute them
1441:    my @result;
1442:    while (my @chunk = splice(@all, 0, $num)) {
1443:        @result[@indexes] = @chunk;
1444:        @indexes = map { $_ + 1 } @indexes;
1445:    }

These files are artificially created from links in a database. I am currently building them with max 50000 links ... not sure if the number of links is part of the problem and if so what max I should use.

I'm attaching a test file. Recreate the problem with this command (you may need to add file to the Allowed_Protocols):

checklink -b -e MU-01.html

testhtml.zip

Handle PDF anchor links, esp `#page=N` (currently false positives)

PDF readers support a variety of ways to link inside them similar to HTML pages: a foo.pdf#page=5 will link to page 5 of that PDF, or a defined anchor like a section (common in Arxiv PDFs), or comments etc. See "Parameters for Opening PDF Files". This dates back to sometime around or before 2005, and is supported by many (most?) PDF viewers, although it is not an official standard AFAICT. I find them extremely useful for discussing research papers (>170 unique instances on my site so far), to hyperlink people to the exact page with the relevant figure or table or section (just as linking to parts of HTML pages is useful).

Right now link-checker treats all such arguments as errors and invalid anchors to warn about. This is unfortunate and clutters up my results considerably with false positives - the pages are all correct, but it always treats them as errors.

It should handle PDF parameters more gracefully.

The full set of parameters is probably unnecessary (I have never seen many of them in the wild, and I doubt many readers beyond Adobe Acrobat support the more obscure ones like fdf for population form fields), but page, named anchors, zoom, and view are worth supporting. page could be checked simply by, on links which return a PDF MIME type, checking that the N is a whole number less than or equal to the number of pages in the PDF. (Any PDF tool/library should be able to tell you how many pages are in a PDF.) Alternately, if a link returns a PDF MIME type, one could just skip checking anchors entirely as being too much work to figure out how to analyze arbitrary PDFs to check anchor validity and out of scope.

make test fails for "you may need to install the CSS::DOM module" reason

I followed

#if you have cpanminus installed
cpanm --installdeps .
perl Makefile.PL
make
make test
make install # as root unless you are using local::lib

instructions

but make test failed.

If I read it right, some dependency is missing

PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00compile.t .. Can't locate CSS/DOM.pm in @INC (you may need to install the CSS::DOM module) (@INC contains: /home/mateusz/Desktop/tmp/link-checker/blib/lib /home/mateusz/Desktop/tmp/link-checker/blib/arch /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.30.0 /usr/local/share/perl/5.30.0 /usr/lib/x86_64-linux-gnu/perl5/5.30 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.30 /usr/share/perl/5.30 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at bin/checklink line 217.
BEGIN failed--compilation aborted at bin/checklink line 217.
t/00compile.t .. 1/2 
#   Failed test at t/00compile.t line 4.
# Looks like you failed 1 test of 2.
t/00compile.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests 

Test Summary Report
-------------------
t/00compile.t (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=2,  0 wallclock secs ( 0.01 usr  0.00 sys +  0.07 cusr  0.02 csys =  0.10 CPU)
Result: FAIL
Failed 1/1 test programs. 1/2 subtests failed.
make: *** [Makefile:927: test_dynamic] Error 1

Resource Hints ?

Would you comply with https://www.w3.org/TR/2016/WD-resource-hints-20160225/, it's still a draft, but the w3c link checker marks such links as broken ?

Not recursing across entire website

I was trying to have the link-checker check an entire site, but after running checklink -r https://sos.noaa.gov it reports only having checked 11 documents. I'm sure there are more than 11 pages on the site, am I just not understanding how to use the tool to crawl a site, or misreading the output? I've also tried using the -D flag to set a very high depth, but it still only reports checking 11 documents.

If it matters, I'm actually running this with the Docker container, so what the exact command I'm using is

sudo docker run -it --rm stupchiy/checklink -r https://sos.noaa.gov

Any help would be appreciated.

Flag mixed-content in https pages

Since pretty much all browsers block http embedding in https pages, can I suggest that the W3C link checker could usefully check this condition. (Of course if the source is not via a URL, you'd have to have a check box or some such indicating the source is intended to be served over https, but in the normal case where someone is checking a remote site, you'd already know this). I'm talking about ,

Installation fails on Centos7 and macOS

I was following installation instructions on a Centos7 machine, while at make test the process gave back an error, then runinng the binary checklink gave me also errors.

Perl version is v5.16.3

Any help is appreciated. Looks like it was missing some dependencies?

[root@localhost link-checker]# cpanm --installdeps . 
--> Working on .
Configuring W3C-LinkChecker-4.81 ... OK
==> Found dependencies: Net::IP, CGI, Locale::Country, Config::General, CGI::Carp, CGI::Cookie, Locale::Language

make test result:

[root@localhost link-checker]# make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00compile.t .. syntax error at bin/checklink line 1278, near "$lines{"
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1280.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1281.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1286.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1287.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1292.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1292.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1298.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1298.
Global symbol "%hostlinks" requires explicit package name at bin/checklink line 1305.
Global symbol "%hostlinks" requires explicit package name at bin/checklink line 1306.
Global symbol "%links" requires explicit package name at bin/checklink line 1319.
Global symbol "$result_anchor" requires explicit package name at bin/checklink line 1331.
Global symbol "$check_mixed_content" requires explicit package name at bin/checklink line 1339.
syntax error at bin/checklink line 1339, near "$ulinks{"
bin/checklink has too many errors.
t/00compile.t .. 1/2 
#   Failed test at t/00compile.t line 4.
# Looks like you failed 1 test of 2.
t/00compile.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests 

Test Summary Report
-------------------
t/00compile.t (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=2,  0 wallclock secs ( 0.01 usr  0.00 sys +  0.08 cusr  0.01 csys =  0.10 CPU)
Result: FAIL
Failed 1/1 test programs. 1/2 subtests failed.
make: *** [test_dynamic] Errore 1

make install result

[root@localhost link-checker]# make install
Manifying blib/man3/W3C::LinkChecker.3pm
Installing /usr/local/share/man/man3/W3C::LinkChecker.3pm
Appending installation info to /usr/lib64/perl5/perllocal.pod

checklink output

[root@localhost link-checker]# bin/checklink 
syntax error at bin/checklink line 1278, near "$lines{"
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1280.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1281.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1286.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1287.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1292.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1292.
Global symbol "$line_num" requires explicit package name at bin/checklink line 1298.
Global symbol "@mixedcontent" requires explicit package name at bin/checklink line 1298.
Global symbol "%hostlinks" requires explicit package name at bin/checklink line 1305.
Global symbol "%hostlinks" requires explicit package name at bin/checklink line 1306.
Global symbol "%links" requires explicit package name at bin/checklink line 1319.
Global symbol "$result_anchor" requires explicit package name at bin/checklink line 1331.
Global symbol "$check_mixed_content" requires explicit package name at bin/checklink line 1339.
syntax error at bin/checklink line 1339, near "$ulinks{"
bin/checklink has too many errors.

`#top`/`#` anchors incorrectly flagged as broken/missing anchor errors

If I check a page on my website, such as the index, where I have a link to #top at the bottom of the page for the convenience of readers (along with the standard 'return-to-top' floating widget), the link checker throws an error:

Lines: 2552, 2567, 2581, 3164 https://www.gwern.net/index
Status: 200 OK

Some of the links to this resource point to broken URI fragments (such as index.html#fragment).
Broken fragments:
   https://www.gwern.net/index#top (line 3164)

The link works fine, and as I understand it, #top and # are guaranteed to exist and be valid anchor references defined at runtime by a standards-compliant browser according to MDN:

You can use href="#top" or the empty fragment (href="#") to link to the top of the current page, as defined in the HTML specification.

The specification linked does say

Let fragment be the document's URL's fragment. If fragment is the empty string, then the indicated part of the document is the top of the document; return.

So # is definitely required to exist & be valid by the standard; I'm unsure where top is defined but MDN and everyone else seems to think it's defined exactly the same way.

So both #top and # will always be valid anchor links, and any error by the link-checker is always a false positive. They should be whitelisted and not appear in the output.

Perl error in List of redirects

I get this error at the top of the "List of redirects" section: Use of uninitialized value $tail in join or string at /usr/share/perl/5.24/Text/Wrap.pm line 44.

ILLEGALLY HARASSED STALKED MONEY MISSING!

Address many “TODO” comments (and friends) in `checklink`

Below is a list of the various comments with the TODO, FIXME, XXX, or (sometimes) @@@ markers in checklink.

Specifics of how I got this list

I generated the list with the following command:

grep --ignore-case --line-number --extended-regexp  '#\s*(@@@|todo|fixme|xxx|@@@)'  ./bin/checklink

There are a few results that may not be intended as “todo” items, but were nevertheless marked with a matching pattern. I left these results in, just in case people who know more than I do might interpret/scrutinize them differently.

I added links to the line numbers for convenience.

List of `TODO` (and friends) comments

128: # @@@ Why not just $self->SUPER::simple_request?
146: # @@@ TODO: when an LWP internal robots.txt request gets redirected,
233: # @@@ Needs also W3C::UserAgent but can't use() it here.
470: my $ua = W3C::UserAgent->new($AGENT); # @@@ TODO: admin address
631: # @@@ Ignore invalid depth silently for now.
1794: # @@@ subtract robot delay from the "fetched in" time?
2074: # @@@@ In XHTML, <a name="foo" id="foo"> is mandatory
2118: # @@@TODO: The reason for handling <base href> ourselves is that LWP's
2148: # TODO: HTML 4 spec says applet/@codebase may only point to
2267: # FIXME: are there CSS rules with URLs that aren't mixed content blockable?
2326: # TODO: Check the rules defined in
2402: # TODO: find content types for which fragment are defined
2413: # @@@TODO: this isn't the best thing to do if a decode error occurred
2431: # TODO #top is OK in HTML documents
3343: { # XXX no easy way to check if cookie expired?

Warnings during first phase of installation

I downloaded the zip package today (February 26, 2017) and ran on Cygwin the perl Makefile.PL:

Checking if your kit is complete...
Warning: the following files are missing in your kit:
        lib/W3C/LinkChecker.pm
        README
Please inform the author.
Generating a Unix-style Makefile
Writing Makefile for W3C::LinkChecker
Writing MYMETA.yml and MYMETA.json

Looks like the product does work nonetheless.

Feedback link on website is broken

The anchor with the link text 'Feedback' on the bottom of https://github.com/w3c/link-checker/ is broken. It points to a fragment (http://validator.w3.org/docs/checklink.html#csb) which doesn't seem to exist anymore. The is no ID csb in the same document.

https://i.imgur.com/MTo5RXT.png

Wrong result `Server closed connection without sending any data back`

The status log contains the output below.

Checking link http://www.jenisch-haus.de/apple-touch-icon-120x120.png
HEAD http://www.jenisch-haus.de/apple-touch-icon-120x120.png  fetched in 0.01 seconds

The Web server logs confirm that.

But in the results page an error is listed.

Line: 23 http://www.jenisch-haus.de/apple-touch-icon-120x120.png
Status: 500 Server closed connection without sending any data back
This is a server side problem. Check the URI.