
Comments (9)

andrewvaughan commented on August 17, 2024

What an investigation @andrewvaughan :D

it seems markdown-links-check

Maybe needle has different behaviors depending on environment variables?

Something to check could be to expose a mock service and log the calls within Docker and outside of Docker to see the differences :)

Lol you should see the comment I was half-way through writing...

I have gone the depths of the dependency stack. I am weary and tired, but I bear the fruits of my labor:

Bear with me friends, because this is where my soul started tearing apart. The code was a jungle.

Which brought me to the Node.js core source-code with even LESS documentation...

Ergo, vis-à-vis

...that's about as far as I got

from megalinter.

andrewvaughan commented on August 17, 2024

Narrowed it down:

$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0

# curl --no-alpn -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/1.1 200 OK

# curl -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403

There must be something either about the HTTP/2 protocol or ALPN that is flagging the linter as an unwanted bot.

Although interesting, forcing HTTP/1.1 alone does not suffice:

# curl --http1.1 -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "http
HTTP/1.1 403 Forbidden

Now why these remote hosts are allowing non-ALPN traffic through but blocking ALPN traffic is an interesting question.
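The three manual curl runs above can be wrapped in a small helper so the variants are easy to compare side by side. This is only a sketch: `probe_variants` is a hypothetical name, and it assumes a curl build that supports `--no-alpn`, `--http1.1`, and the `%{http_version}` write-out variable.

```shell
#!/bin/sh
# Hypothetical helper: request the same URL with default settings, without
# ALPN, and with HTTP/1.1 forced, printing protocol version and status code.
# $1 = URL to probe, $2 = User-Agent string to send.
probe_variants() {
  url=$1
  ua=$2
  for flags in "" "--no-alpn" "--http1.1"; do
    # $flags is left unquoted on purpose so the empty variant adds no argument.
    result=$(curl -s -o /dev/null -I -A "$ua" $flags \
      -w '%{http_version} %{http_code}' "$url")
    printf '%-10s -> %s\n' "${flags:-default}" "$result"
  done
}
```

Calling `probe_variants "<url>" "<user-agent>"` inside and outside the container should make any difference between the environments visible at a glance.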

Edit: I've answered this - every response came back with a cf-mitigated: response header. This means a Cloudflare WAF is in place and is putting that "checking to see if you're human" page in place. See tcort/link-check#72 for more details.
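A quick way to confirm the same thing on another host is to look for that header in the response. The sketch below only checks for the presence of a `cf-mitigated` header; the function name is made up for illustration, and the sample headers are hand-written, not captured output.

```shell
#!/bin/sh
# Reads HTTP response headers on stdin and succeeds if a cf-mitigated
# header is present (matched case-insensitively).
has_cf_mitigated() {
  grep -qi '^cf-mitigated:'
}

# Offline demonstration with a hand-written sample response.
sample_headers='HTTP/2 403
cf-mitigated: challenge
content-type: text/html'

if printf '%s\n' "$sample_headers" | has_cf_mitigated; then
  echo "Cloudflare mitigation header present"
else
  echo "no cf-mitigated header"
fi
```

Against a live host, the same check would be `curl -sI -A "<user-agent>" "<url>" | has_cf_mitigated`.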

I will move the remainder of this discovery over to an issue on markdown-link-check, but I would definitely consider adding a unique User-Agent to MegaLinter with this issue - it will help prevent the default Needle/X.X User-Agent that gets applied from getting over-blocked.


andrewvaughan commented on August 17, 2024

So I still recommend this, because I was able to prove this was a blocking case using curl on my machine with and without the User-Agent, and confirmed that setting it resolved my issues when I ran markdown-link-check directly on my machine.
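The configuration itself is not shown in this thread. As a rough sketch of what such a file could look like, the snippet below prints a markdown-link-check config using the tool's `httpHeaders` option to send a custom User-Agent to the two hosts that returned 403. The function name and the URL list are assumptions for illustration, not the actual config used here.

```shell
#!/bin/sh
# Print a markdown-link-check config that sends a custom User-Agent to the
# hosts that returned 403 in this thread. The UA is passed as $1 and is
# assumed to contain no characters that need JSON escaping.
print_link_check_config() {
  cat <<EOF
{
  "httpHeaders": [
    {
      "urls": [
        "https://meta.stackexchange.com",
        "https://stackoverflow.com"
      ],
      "headers": {
        "User-Agent": "$1"
      }
    }
  ]
}
EOF
}

print_link_check_config "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)"
```

Redirecting the output to `.config/linters/.markdown-link-check.json` would match the path used in the commands in this comment.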

But, for some reason, a MegaLinter run with the given configuration still fails, almost as if the config isn't being applied:

❌ Linted [MARKDOWN] files with [markdown-link-check]: Found 3 error(s) - (15.39s)
- Using [markdown-link-check v3.11.2] https://megalinter.io/7.7.0/descriptors/markdown_markdown_link_check
- MegaLinter key: [MARKDOWN_MARKDOWN_LINK_CHECK]
- Rules config: [/.config/linters/.markdown-link-check.json]
- Number of files analyzed: [11]
--Error detail:

  ERROR: 1 dead links found in .github/CONTRIBUTING.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in .github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in _TEMPLATE_CHECKLIST.md !
  [✖] https://stackoverflow.com/questions/32964920/should-i-commit-the-vscode-folder-to-source-control → Status: 403

I took megalinter out of the equation as much as possible and ran the markdown-link-check tool directly on the container to see if anything was different, and it still failed:

$ docker exec -it megalinter markdown-link-check -q -v -c /tmp/lint/.config/linters/.markdown-link-check.json /tmp/lint/.github/SUPPORT.md
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/lint/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

But running the exact same command in the project outside of the container is successful:

$ npx -y markdown-link-check -q -v -c .config/linters/.markdown-link-check.json .github/SUPPORT.md
(node:45774) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)

# (No errors)

I still believe this belongs in MegaLinter, though, because I cannot replicate the issue in any environment other than the MegaLinter Docker container.

And just for proactivity's sake:

$ docker exec -it megalinter markdown-link-check --version
3.11.2

$ npx markdown-link-check --version
(node:43301) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
3.11.2

Tagging @tcort if they have any thoughts.

🤔


andrewvaughan commented on August 17, 2024

More progress - and I think this actually may need to be either moved or duplicated to https://github.com/tcort/markdown-link-check now, because of what I found.

I wanted to completely remove the idea that running in a container itself was the issue, so I followed the https://github.com/tcort/markdown-link-check directions on running markdown-link-check within Docker instead of via npx. I'll be honest, I fully expected this to work fine...

$ docker run -v ${PWD}:/tmp:ro --rm -i ghcr.io/tcort/markdown-link-check:stable -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

...but it didn't.

It also fails in a much more simple environment:

$ docker run -it -v ${PWD}:/tmp:ro --rm node npx -y markdown-link-check -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
(node:19) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

It seems that the user-agent (or something) is not working, but only when run within a Docker container (or there is something specific about how both the MegaLinter and this Docker container are built/configured).

Note - this problem happens whether I call it like I did above, or bash into the container and run markdown-link-check from the command line - I presented it here as single-line commands for ease of reproduction.

Interestingly, curl also fails in a similar manner, making me think this might be an underlying Docker configuration or utility issue:

# Works fine locally...
$ curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 200 
date: Sun, 21 Jan 2024 21:44:16 GMT
# etc...

# Fails on the basic `node` image....
$ docker run -it --rm node curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403 
date: Sun, 21 Jan 2024 21:44:50 GMT
# etc...

# And even fails on the base `alpine` image...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q curl; curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"'
HTTP/2 403 
date: Sun, 21 Jan 2024 21:54:03 GMT
# etc....

I'm going to do some more digging to see if it's an issue with a commonality, Docker, or otherwise, but this is a deeper issue than I expected. Unfortunately, markdown-link-check has practically no debugging information outputted when configured, so getting more information on how the request was formatted (which I need to debug this) is going to be a challenge.

To be clear - don't close this issue. Adding the user-agent above is still a very, very good idea. This is indicative of a secondary problem.


andrewvaughan commented on August 17, 2024

More updates... it works fine with wget... well, at least on the base images. Just not these configurations. I think I may have pinpointed the source of the error (at least in the shell) to BusyBox.

# Local works fine...
$ wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378
--2024-01-21 16:55:55--  https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Alpine fails out of the box
$ docker run -it --rm alpine sh -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

# But Alpine works fine if we reinstall wget...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
--2024-01-21 22:03:56--  https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Ubuntu works fine...
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'

# ...

Connecting to meta.stackexchange.com (meta.stackexchange.com)|104.18.43.226|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Not Megalinter (which is alpine-based)
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (104.18.43.226:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

# Nor markdown-link-check (which is also alpine-based)
$ docker run -it --rm --entrypoint /bin/sh ghcr.io/tcort/markdown-link-check:stable -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

So what about versions?

# Works fine
$ wget
GNU Wget 1.21.4 built on darwin22.4.0.

# Fails
$ docker run -it --rm alpine sh -c 'which wget; wget'
/usr/bin/wget
BusyBox v1.36.1 (2023-11-07 18:53:09 UTC) multi-call binary.

# Works fine
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.4 built on linux-musl.

# Works fine
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.2 built on linux-gnu.

# Fails
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'which wget; wget'  
/usr/bin/wget
BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) multi-call binary.

So that's interesting... it seems that the default https://github.com/mirror/busybox bundle is the common point of failure in these images.

I wonder if this could be solved simply by adding a proper apk add wget to each of your Dockerfiles (of course, presuming this is what markdown-link-check is using in the background... my next deep dive down this rabbit hole).
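To check which wget an image actually ships before changing any Dockerfile, something like the following could help. It is a sketch with hypothetical function names: it classifies a banner line such as the ones printed above, and reads the local banner by relying on the fact that BusyBox prints its banner alongside usage text when given an option it does not recognize.

```shell
#!/bin/sh
# Classify a wget banner line as "busybox", "gnu", or "unknown".
classify_wget_banner() {
  case "$1" in
    BusyBox*)    echo "busybox" ;;
    "GNU Wget"*) echo "gnu" ;;
    *)           echo "unknown" ;;
  esac
}

# Print the first line naming the implementation of the local wget.
# GNU Wget answers --version on stdout; BusyBox rejects the option and
# prints its banner with the usage text on stderr, hence the 2>&1.
local_wget_banner() {
  wget --version 2>&1 | grep -m1 -E '^(GNU Wget|BusyBox)'
}

# Offline demonstration using two banners quoted in this comment.
classify_wget_banner "BusyBox v1.36.1 (2023-11-07 18:53:09 UTC) multi-call binary."
classify_wget_banner "GNU Wget 1.21.4 built on linux-musl."

# Only inspect the local binary if wget is installed at all.
if command -v wget >/dev/null 2>&1; then
  classify_wget_banner "$(local_wget_banner)"
fi
```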


nvuillam commented on August 17, 2024

What an investigation @andrewvaughan :D

it seems markdown-links-check

Maybe needle has different behaviors depending on environment variables?

Something to check could be to expose a mock service and log the calls within Docker and outside of Docker to see the differences :)


echoix commented on August 17, 2024

Wow, what a pleasure to read the whole thinking process.

Near the beginning of the thread, I was thinking of trying a Debian/Debian slim/Ubuntu container too. Sometimes, to make sure that I'm not hitting some particular differences of musl-based packages, it's good to check whether it works without them. In the last couple of years of still being subscribed to notifications on the node-red Docker repo, you'd be surprised by how often weird behaviours show up that don't happen with a Debian-based base image (as they have both).

I'd never have thought of going as deep as you did; you even taught me a new word, ALPN!
Your links to the Node.js source code, in a well-written debugging summary like this, would be good candidates for permalinks, so the good stuff can be reread in a month's time (the issues you opened that reference this might take a while).

As for the user agent, I have three contradicting opinions. On one side, it is reasonable, and your explanations correctly justify the need to have a user agent. On another side, shouldn't it be a user agent for the linter rather than MegaLinter? While you are talking specifically about markdown-link-check, the linter I struggle with a bit more is lychee. That brings me to the third competing opinion: some sites answer completely differently depending on the user agent. Wink wink, SourceForge. Even though I had already found that out on my own before, it was apparent when working with a winget definition for a new software version, where the download URLs work only in specific cases. (Luckily they have an arrangement so their CI works better than locally.) But these differences came back at the beginning of the introduction of the lychee linter, before getting stuff smoothed out. So here, sometimes having the most common generic user agent is the only way to get a (badly) configured website to work at all.

So I can't decide yet what will weigh more in the balance.


andrewvaughan commented on August 17, 2024

Thanks for the kind words!

Per your concerns on the UA - you're 100% on point. That's why I particularly recommended the pattern of UA that I did. There's a link I put above with best-practices on generating UAs.

Most "crawlers" literally put "crawler/2.2.2" which can be problematic, if only because some lazy admins block "everything not standard," which was never intended for UAs. That's where marking it as compatible comes in.

The UA format is <name>/<version> <comment> - very, very simple. Filtering is only supposed to happen on the first two, but platform standard needs did add some filtering in the comment area informally.

As such, nearly all UAs for browsers are:

BrowserName/1.1 (System Information)

With some level of standardization in what the (System Information) entails.

However! That comment can technically be anything - and there actually is a better pattern for applications that meet the "requirements" of a browser standard but make use of it in a different way; for example:

Mozilla/5.0 (compatible; technology/x.x.x; technology/x.x.x; +https://reference)

This is a great pattern, because it both informs the server as to what standard can be managed and allows for fine-tuned bot management by administrators. Maybe someone wants to block all of MegaLinter - maybe just link checkers. Maybe just particular, problematic versions. It's their choice in this format with some simple string-matching.

So you end up with something like the recommendation above, or, for something more simple, the following:

Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io)
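For illustration, the pattern can be assembled mechanically from a reference URL and one or more name/version tokens. The helper name below is hypothetical; it simply concatenates its arguments into the format described above.

```shell
#!/bin/sh
# Build a "compatible" style User-Agent string.
# $1 = reference URL; remaining arguments = name/version tokens, in order.
build_user_agent() {
  ref_url=$1
  shift
  tokens=""
  for token in "$@"; do
    tokens="${tokens}${token}; "
  done
  printf 'Mozilla/5.0 (compatible; %s+%s)\n' "$tokens" "$ref_url"
}

# The simple form:
build_user_agent "https://megalinter.io" "MegaLinter/7.8.0"

# The longer form naming both the linter and MegaLinter:
build_user_agent "https://megalinter.io" "markdown-link-check/3.11.2" "MegaLinter/7.7.0"
```

Keeping the tokens as separate arguments makes it easy to add or drop the individual linter without touching the rest of the string.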

The reference URL at the end is also super helpful - coming from an admin, if I were to start seeing this new UA appear out of nowhere, my first reaction would be to block it. A responsible admin, however, will check the reference to see what its purpose is and determine whether it is nefarious or not for the purposes of the applications. Systems like Web Application Firewalls learn from this, and you might even start to see MegaLinter UAs have fewer and fewer issues with systems like Cloudflare WAF (which ended up being the primary cause of the problem here - it turned out to not be the UA at all... at least, not on its own).

Unfortunately, without any specification, you end up with the default for whatever the linters are - or sometimes no UA at all. For markdown-link-check, I believe it's using the default from the needle library it's using, which is needle/x.x.x.

Now, imagine how many people have probably used the needle dependency to make more ...problematic... bots for servers. How could they tell MegaLinter traffic from those bad bots? There's really no way. Easier to just block the entire batch, and you'll be much more likely to get picked up in a WAF as a nefarious bot.
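To make the admin's side of this concrete, here is a sketch of the kind of simple string-matching a descriptive UA allows. The bucket names and the function are invented for illustration and do not reflect any real WAF rule set.

```shell
#!/bin/sh
# Sort a User-Agent string into a coarse bucket using plain pattern matching.
# A descriptive UA can be matched precisely; a library default cannot.
classify_user_agent() {
  case "$1" in
    *"markdown-link-check/"*) echo "link-checker" ;;
    *"MegaLinter/"*)          echo "megalinter" ;;
    needle/*)                 echo "generic-http-client" ;;
    *)                        echo "other" ;;
  esac
}

classify_user_agent "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)"
classify_user_agent "Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io)"
classify_user_agent "needle/x.x.x"
```

With only `needle/x.x.x` to go on, every tool built on the library lands in the same bucket, which is the over-blocking problem described above.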

So, for me - I think the question is whether the responsibility of setting an appropriate UA is for the tool or the tool container. I lean toward the argument that the UA should always represent the technology closest to the end user (in this case, MegaLinter, being the utility I chose to incorporate into my project, not necessarily the specific linter), so I would prefer my UA to represent Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io) - but this is a personal opinion. There's a strong argument to include the individual linter details in each UA, as well.

This is just me thinking out loud, but that's my $0.00002 on the issue!

Edit: I realized I didn't touch on a concern - there's always the default argument to just "copy/paste" a "known working" UserAgent to mimic a browser entirely... but WAFs caught on to that decades ago, and it's barely worthwhile these days. It has to do with usage patterns - raises AI eyebrows when "iOS Safari" only makes HEAD requests to human-readable endpoints and does about 30 distinct ones within 2 seconds.

That said... you can always offer a configurable override to end-users!

