Comments (13)

unixfox commented on July 29, 2024

New Update!

Thanks to StackOverflow: https://stackoverflow.com/questions/29977086/regex-how-can-i-match-all-numbers-greater-than-954/29977124
Here is the Regex: https://regex101.com/r/wtfXpP/1
I found a regex that matches Chrome 76 and later, and it seems to work perfectly! Moreover, Samsung Browser doesn't get filtered because its user agent reports Chrome 75.

{
    "name": "chrome browser",
    "filters": [
        "Header:User-Agent=(Chrome/([1-9]\\d{2,}|[8-9]\\d|[6-9]{2}))",
        "!Header:Sec-Fetch-Dest",
        "!Header:Sec-Fetch-Mode",
        "!Header:Sec-Fetch-Site",
        "!Header:Sec-Fetch-User"
    ],
    "stop": true,
    "actions": [
        {
            "name": "block",
            "params": {
                "message": "Rate limit exceeded"
            }
        }
    ]
}

I guess it's time to submit a PR, but I'm still waiting for your feedback, @dalf.
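
A quick way to sanity-check a version pattern like the one in the rule above is to run it against a few sample Chrome versions. This is only a sketch: it uses Python's `re` module rather than filtron's own regex engine (the assumption is that both treat this simple pattern the same way), and the user agent string is a shortened example.

```python
import re

# Pattern from the rule above after JSON unescaping ("\\d" in JSON is "\d").
PATTERN = re.compile(r"Chrome/([1-9]\d{2,}|[8-9]\d|[6-9]{2})")


def is_targeted(user_agent: str) -> bool:
    """Return True if the pattern matches the Chrome version in the user agent."""
    return PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    # Print which sample major versions the pattern selects.
    for version in ("66", "69", "70", "75", "76", "81", "100"):
        ua = f"Mozilla/5.0 AppleWebKit/537.36 Chrome/{version}.0.0.0 Safari/537.36"
        print(version, is_targeted(ua))
```

In this sketch the `[6-9]{2}` branch accepts any two digits from 6 to 9, so versions 66 to 69 are selected along with 76 and later, while 70 to 75 are not.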

from searx-docker.

unixfox commented on July 29, 2024

> I just pushed the rules on my own instance and I can already see bots being blocked 😄!

> 👍 just be sure there are bots.

I'm monitoring their requests, and they don't follow normal human behavior.
Moreover, so far every one of their IPs is blacklisted on https://mxtoolbox.com/blacklists.aspx, which is a very good sign that the requests come from bots.

unixfox commented on July 29, 2024

I disabled the rules because more of the blocked requests look like they come from humans than from actual bots.
I guess a lot of users in the searx community use a user agent randomizer.

I'm closing this issue because it's not worth investigating further, and filtron still does a very good job with the current rules.

unixfox commented on July 29, 2024

I was able to work around the Samsung Browser issue by stopping validation when filtron finds SamsungBrowser in the user agent:

{
    "name": "chrome browser",
    "filters": [
        "Header:User-Agent=(Chrome)"
    ],
    "subrules": [
        {
            "name": "contains samsung",
            "stop": true,
            "filters": [
                "Header:User-Agent=(SamsungBrowser)"
            ],
            "actions": [
                {
                    "name": "shell",
                    "params": {
                        "cmd": "/bin/true"
                    }
                }
            ]
        },
        {
            "name": "doesnt contains sec-fetch-x headers",
            "stop": true,
            "filters": [
                "!Header:Sec-Fetch-Dest",
                "!Header:Sec-Fetch-Mode",
                "!Header:Sec-Fetch-Site",
                "!Header:Sec-Fetch-User"
            ],
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Please update Google Chrome to a newer version."
                    }
                }
            ]
        }
    ]
}

However, it prints a message, because I couldn't find a way to avoid logging the request.
Also, the block message differs from the usual one because I'm unable to filter out Google Chrome browsers with a version below 76.

dalf commented on July 29, 2024

@unixfox testing

unixfox commented on July 29, 2024

Oh, it also filters Chrome/66.0.4044.122. I guess my regex is not powerful enough. I'll keep trying to find a better one.
Meanwhile, if you have a better one, let me know :).

EDIT: I deleted my newer comment because the regex in it doesn't match "Chrome/80". Still trying...
Here is the regex that I tried: https://regex101.com/r/wtfXpP/2

dalf commented on July 29, 2024

Am I missing something?

  • use
    https://gist.github.com/dalf/5a823b5aeae06e9e631b5721794ae514
  • curl 'https://a.searx.space/?q=test&category_general=on&time_range=&language=en-US' -H 'authority: a.searx.space' -H 'pragma: no-cache' -H 'cache-control: no-cache' -H 'dnt: 1' -H 'upgrade-insecuests: 1' -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'sec-fetch-modezz: navigate' -H 'sec-fetch-user: ?1' -H 'sec-fetch-dest: document' -H 'accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6' -H 'cookie: categories=general; language=en-US; locale=fr; image_proxy=1; safesearch=0; results_on_new_tab=0; doi_resolver=oadoi.org; oscar-style=logicodev; disabled_plugins=; enabled_plugins=; maintab=on; enginetab=on; method=GET; autocomplete=duckduckgo; theme=oscar; disabled_engines="wikidata__general\054piratebay__videos\054bing__general"; enabled_engines="reddit__social media\054startpage__general\054duckduckgo__general\054ddg definitions__general"; tokens=' --compressed

no blocking.

unixfox commented on July 29, 2024

Ok I found a simpler regex: https://regex101.com/r/EVjgjL/1
Here are the rules to test that: https://paste.ee/p/0COrc
This should not be blocked:

curl 'http://127.0.0.1:4004/' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: null' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  --data 'q=ok&time_range=&language=fr-FR&category_general=on'

This should be blocked:

curl 'http://127.0.0.1:4004/' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: null' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  --data 'q=ok&time_range=&language=fr-FR&category_general=on'

Feel free to try different combinations, keeping in mind that it should block only if the Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site and Sec-Fetch-User headers are all absent at the same time. I don't know how to make it match when any one of the four headers is absent without replicating the rule four times, once per header.

dalf commented on July 29, 2024

[EDIT]

   "filters": [
       "Header:User-Agent=Chrome/(7[6-9]|[8-9][0-9]|[1-9][0-9][0-9])",
       "!Header:Sec-Fetch-Dest",
       "!Header:Sec-Fetch-Mode",
       "!Header:Sec-Fetch-Site",
       "!Header:Sec-Fetch-User"
   ],

But as soon as at least one of the Sec-Fetch-* headers is present, filtron forwards the request to searx.
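
The behaviour described here follows from the filters in one rule being combined with a logical AND. The sketch below is a simplified Python model of that filter list, not filtron code: `rule_matches` is a hypothetical helper, and it assumes that `!Header:X` means "header X is absent" and that every filter must hold for the rule to apply.

```python
import re
from typing import Mapping

# User agent pattern from the filter list above.
UA_PATTERN = re.compile(r"Chrome/(7[6-9]|[8-9][0-9]|[1-9][0-9][0-9])")
SEC_FETCH_HEADERS = (
    "Sec-Fetch-Dest",
    "Sec-Fetch-Mode",
    "Sec-Fetch-Site",
    "Sec-Fetch-User",
)


def rule_matches(headers: Mapping[str, str]) -> bool:
    """Model of the rule: the user agent matches AND all four headers are absent."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    if UA_PATTERN.search(lowered.get("user-agent", "")) is None:
        return False
    return all(name.lower() not in lowered for name in SEC_FETCH_HEADERS)


if __name__ == "__main__":
    ua = "Mozilla/5.0 (X11; Linux x86_64) Chrome/81.0.4044.122 Safari/537.36"
    print(rule_matches({"User-Agent": ua}))
    print(rule_matches({"User-Agent": ua, "sec-fetch-dest": "document"}))
```

Under these assumptions a single Sec-Fetch-* header is enough to make the whole rule fail to match, which is why a request that sends only some of the headers is not blocked.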

unixfox commented on July 29, 2024

This more powerful rule blocks your curl command, but it's not pretty:

{
    "name": "chrome >=76 user agent",
    "filters": [
        "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))"
    ],
    "subrules": [
        {
            "name": "No Sec-Fetch-Dest header",
            "filters": [
                "!Header:Sec-Fetch-Dest"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-Mode header",
            "filters": [
                "!Header:Sec-Fetch-Mode"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-Site header",
            "filters": [
                "!Header:Sec-Fetch-Site"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-User header",
            "filters": [
                "!Header:Sec-Fetch-User"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        }
    ]
}

Complete rules.json file: https://paste.ee/p/zx1rw
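
Since the four subrules differ only in the header name, one way to avoid maintaining them by hand is to generate the rule from a short script. This is a minimal sketch, assuming the field names shown in the rule above; `build_rule` is a hypothetical helper and not part of filtron.

```python
import json

# User agent filter copied from the rule above.
UA_FILTER = "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))"
SEC_FETCH_HEADERS = [
    "Sec-Fetch-Dest",
    "Sec-Fetch-Mode",
    "Sec-Fetch-Site",
    "Sec-Fetch-User",
]


def build_rule(message: str = "Rate limit exceeded") -> dict:
    """Build one rule with a blocking subrule per missing Sec-Fetch-* header."""
    subrules = [
        {
            "name": f"No {header} header",
            "filters": [f"!Header:{header}"],
            "limit": 0,
            "stop": True,
            "actions": [{"name": "block", "params": {"message": message}}],
        }
        for header in SEC_FETCH_HEADERS
    ]
    return {
        "name": "chrome >=76 user agent",
        "filters": [UA_FILTER],
        "subrules": subrules,
    }


if __name__ == "__main__":
    # Print the rule as JSON, ready to paste into a rules file.
    print(json.dumps(build_rule(), indent=4))
```

Dropping a header from `SEC_FETCH_HEADERS` removes the matching subrule, which keeps the JSON in sync if one of the checks turns out to be too strict.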

EDIT: I just pushed the rules on my own instance and I can already see bots being blocked 😄!

dalf commented on July 29, 2024

I confirm it is working as intended.

I don't see a shorter way to write these rules without touching the filtron source code.


A format improvement idea:

{
    "name": "chrome >=76 user agent",
    "filters": {
        "and": {
            "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))": true,
            "or": {
                "Header:Sec-Fetch-Dest": false,
                "Header:Sec-Fetch-Mode": false,
                "Header:Sec-Fetch-Site": false
            }
        }
    },
    ...
}
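
To make the intended semantics of this proposed format concrete, here is a small Python sketch of how such a nested filter tree could be evaluated. Everything in it is an assumption for illustration: filtron does not parse this format, `filter_holds` and `evaluate` are hypothetical helpers, and the boolean next to each filter is read as "this filter should hold" (true) or "should not hold" (false).

```python
import re
from typing import Any, Mapping

# Hypothetical nested format from the idea above; filtron does not parse this.
RULE_FILTERS = {
    "and": {
        "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))": True,
        "or": {
            "Header:Sec-Fetch-Dest": False,
            "Header:Sec-Fetch-Mode": False,
            "Header:Sec-Fetch-Site": False,
        },
    }
}


def filter_holds(expr: str, headers: Mapping[str, str]) -> bool:
    """'Header:Name' tests presence; 'Header:Name=regex' searches the value."""
    if not expr.startswith("Header:"):
        raise ValueError(f"unsupported filter: {expr!r}")
    name, sep, pattern = expr[len("Header:"):].partition("=")
    value = headers.get(name.lower())
    if value is None:
        return False
    return re.search(pattern, value) is not None if sep else True


def evaluate(node: Mapping[str, Any], headers: Mapping[str, str], combine=all) -> bool:
    """Walk the tree; `headers` must use lowercase header names."""
    results = []
    for key, expected in node.items():
        if key == "and":
            results.append(evaluate(expected, headers, all))
        elif key == "or":
            results.append(evaluate(expected, headers, any))
        else:
            # `expected` says whether the filter should hold (True) or not (False).
            results.append(filter_holds(key, headers) == expected)
    return combine(results)
```

With this reading, `evaluate(RULE_FILTERS, headers)` is true for a Chrome 76 or later user agent that lacks at least one of the three listed headers, which is the "any header missing" behaviour that otherwise needs one subrule per header.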

> I just pushed the rules on my own instance and I can already see bots being blocked 😄!

👍 just be sure there are bots.

dalf commented on July 29, 2024

Do you think it would be interesting to add a blacklist check as a filter in filtron?

As I understand it, that is one DNS lookup per IP and per list?
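
For context on the cost being asked about: DNS-based blacklists are conventionally queried by reversing the IPv4 octets and appending the list's zone, so each (IP, list) pair is a separate lookup. The sketch below only builds the query names and performs no network request; the zone names are placeholders, and `dnsbl_query_name` and `planned_lookups` are hypothetical helpers.

```python
import ipaddress
from itertools import product
from typing import Iterable, List


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for one IPv4 address in one blacklist zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def planned_lookups(ips: Iterable[str], zones: Iterable[str]) -> List[str]:
    """One query name per (IP, zone) pair, i.e. len(ips) * len(zones) lookups."""
    return [dnsbl_query_name(ip, zone) for ip, zone in product(ips, zones)]


if __name__ == "__main__":
    # Placeholder zones; real blacklist zones would go here.
    zones = ["dnsbl-a.example.org", "dnsbl-b.example.org"]
    for name in planned_lookups(["192.0.2.1", "198.51.100.7"], zones):
        print(name)
```

Two addresses checked against two lists already produce four query names, so the number of lookups grows with both the number of clients and the number of lists unless results are cached.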

unixfox commented on July 29, 2024

OK, so first observation: not every Google Chrome build or fork sends Sec-Fetch-User, for some reason (which I need to investigate). I removed this rule because I feel it was filtering real humans.
Second observation: some users run a user agent randomizer, so their requests could be blocked if they browse with Firefox while sending a Google Chrome user agent. I think it would be better to customize the message to alert such users, but without revealing the real reason, so that bot owners don't try to circumvent filtron 🤔


> Do you think it would be interesting to add a blacklist check as a filter in filtron?

No. I would prefer to write a separate program for that purpose. filtron is good at header filtering; adding unrelated features would defeat its original purpose and probably make it slower.
Moreover, having your IP in a blacklist doesn't mean that you are a bot.
Some blacklists are badly outdated, so I think this would do more harm than good: a home connection with a dynamic IP can end up on a blacklist through no fault of its current user. Filtering by ASN is a better idea, for example blocking requests that come from the ASNs of VPS providers.
