
isbot's Introduction

isbot 🤖/👨‍🦰

Identify bots, crawlers, and spiders using the user agent string.

Usage

Install

npm i isbot

Straightforward usage

import { isbot } from "isbot";

// Request
isbot(request.headers.get("User-Agent"));

// Nodejs HTTP
isbot(request.getHeader("User-Agent"));

// ExpressJS
isbot(req.get("user-agent"));

// Browser
isbot(navigator.userAgent);

// User Agent string
isbot(
  "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
); // true

isbot(
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
); // false

Using the jsDelivr CDN, you can import isbot in the browser directly.

See specific versions and instructions at https://www.jsdelivr.com/package/npm/isbot

ESM

<script type="module">
  import { isbot } from "https://cdn.jsdelivr.net/npm/isbot@5/+esm";
  isbot(navigator.userAgent);
</script>

UMD

<script src="https://cdn.jsdelivr.net/npm/isbot@5"></script>
<script>
  // isbot is now global
  isbot(navigator.userAgent);
</script>

All named imports

import                Type                              Description
isbot                 (string?): boolean                Check if the user agent is a bot
isbotNaive            (string?): boolean                Check if the user agent is a bot using a naive pattern (less accurate)
getPattern            (): RegExp                        The regular expression used to identify bots
list                  string[]                          List of all individual pattern parts
isbotMatch            (string?): string | null          The substring matched by the regular expression
isbotMatches          (string?): string[]               All substrings matched by the regular expression
isbotPattern          (string?): string | null          The regular expression used to identify the bot substring in the user agent
isbotPatterns         (string?): string[]               All regular expressions used to identify bot substrings in the user agent
createIsbot           (RegExp): (string?): boolean      Create a custom isbot function
createIsbotFromList   (string[]): (string?): boolean    Create a custom isbot function from a list of string representation patterns

Example usages of helper functions

Create a custom isbot that does not consider Chrome Lighthouse user agents to be bots.

import { createIsbotFromList, isbotMatches, list } from "isbot";

const ChromeLighthouseUserAgentStrings: string[] = [
  "mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/94.0.4590.2 safari/537.36 chrome-lighthouse",
  "mozilla/5.0 (linux; android 7.0; moto g (4)) applewebkit/537.36 (khtml, like gecko) chrome/94.0.4590.2 mobile safari/537.36 chrome-lighthouse",
];
const patternsToRemove = new Set<string>(
  ChromeLighthouseUserAgentStrings.map(isbotMatches).flat(),
);
const isbot: (ua: string) => boolean = createIsbotFromList(
  list.filter(
    (record: string): boolean => patternsToRemove.has(record) === false,
  ),
);
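
Continuing the snippet above, a quick hedged check of the resulting function (assuming the Lighthouse patterns were the only ones matching those user agents):

isbot(ChromeLighthouseUserAgentStrings[0]); // expected: false
isbot(
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
); // still true, other patterns are untouched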

Create a custom isbot that treats an additional pattern, not originally included in the package, as a bot.

import { createIsbotFromList, list } from "isbot";

const isbot = createIsbotFromList(list.concat("shmulik"));
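
The read-only helpers can also be used on their own to inspect why a user agent was flagged. A minimal sketch; the matched substrings shown in the comments are illustrative, not guaranteed:

import { getPattern, isbotMatch, isbotMatches } from "isbot";

const ua =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

getPattern() instanceof RegExp; // true
isbotMatch(ua); // e.g. "Googlebot" - the matched substring, or null
isbotMatches(ua); // all matched substrings as an array
isbotMatch(
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
); // null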

Definitions

  • Bot. An autonomous program imitating or replacing some aspect of human behaviour, performing repetitive tasks much faster than human users could.
  • Good bot. Automated programs that visit websites in order to collect useful information. Web crawlers, site scrapers, stress testers, preview builders and other programs are welcome on most websites because they serve mutually beneficial purposes.
  • Bad bot. Programs designed to perform malicious actions that ultimately hurt businesses: testing credential databases, DDoS attacks, spam bots.

Clarifications

What does "isbot" do?

This package aims to identify "good bots": those that voluntarily identify themselves by setting a unique, preferably descriptive, user agent, usually sent in a dedicated request header.

What doesn't "isbot" do?

It does not try to recognise malicious bots or programs disguising themselves as real users.

Why would I want to identify good bots?

Recognising good bots such as web crawlers is useful for multiple purposes. Although it is not recommended to serve different content to web crawlers like Googlebot, you can still elect to

  • Flag pageviews so they can be treated separately in business analytics.
  • Prefer to serve cached content and relieve service load.
  • Omit third party solutions' code (tags, pixels) and reduce costs.

It is not recommended to whitelist requests for any reason based on the user agent header alone. Add other methods of identification, such as a reverse DNS lookup. A minimal middleware sketch follows below.
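
For illustration only, here is an Express sketch along the lines of the list above; the marker header and caching decision are hypothetical placeholders, not part of isbot:

import express from "express";
import { isbot } from "isbot";

const app = express();

// Flag every request so later handlers can act on it
app.use((req, res, next) => {
  res.locals.isBot = isbot(req.get("user-agent"));
  next();
});

app.get("/", (req, res) => {
  if (res.locals.isBot) {
    // e.g. prefer cached content and omit third party tags (pixels, analytics)
    res.set("X-Served-From", "cache"); // hypothetical marker header
  }
  res.send("<!doctype html><title>home</title>");
});

app.listen(3000);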

How isbot maintains accuracy

isbot is most useful when it identifies bots from the user agent string as accurately as possible. It uses expansive and regularly updated lists of user agent strings to build a regular expression that matches bots and only bots.

Fallback

The pattern uses lookbehind assertions, which are not supported in all environments. A less accurate fallback is provided for environments that do not support lookbehind. The test suite allows a margin of false positives and false negatives that is deemed acceptable for the fallback: 1% false positives and 75% bot coverage.
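
isbot selects the fallback automatically. Purely as an illustration of the kind of feature detection involved, the sketch below uses the documented isbotNaive export for engines where lookbehind fails to parse; this is an assumption-laden sketch, not the library's internal code:

import { isbot, isbotNaive } from "isbot";

// Passing the pattern as a string defers parsing to runtime,
// so engines without lookbehind throw a catchable SyntaxError here.
let lookbehindSupported = true;
try {
  new RegExp("(?<!a)b");
} catch (e) {
  lookbehindSupported = false;
}

// Pick the accurate check when possible, otherwise the naive one.
const check = lookbehindSupported ? isbot : isbotNaive;
check("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); // true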

Data sources

We use external data sources on top of our own lists to keep up to date:

Crawlers user agents

Non bot user agents

Missing something? Please open an issue

Major releases breaking changes (full changelog)

Remove the named export "pattern" from the interface; use the "getPattern" method instead.

Remove isbot function default export in favour of a named export.

import { isbot } from "isbot";

Remove testing for node 6 and 8

Change return value for isbot: true instead of the matched string (see the migration sketch after this list).

No functional change
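
For code that relied on the old matched-string return value, the isbotMatch helper from the table above appears to cover the same need. A hedged migration sketch:

import { isbot, isbotMatch } from "isbot";

const ua =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

isbot(ua); // true - boolean only
isbotMatch(ua); // the matched substring (or null), similar to the old return value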

isbot's People

Contributors

0x7f, chinesedfan, deadlypants1973, deedeeh, dependabot-preview[bot], dependabot[bot], drewbard, franzmoro, gorangajic, hijarian, interisti, itayganor, jssullivan, kieranstartup, leaverou, lirantal, maximn, mhluska, migara, neraks, niftylettuce, omrilotan, orangeflame, oskarrisberg, rynmsh, simonecorsi, timbowhite, viraptor, wcho, webinista


isbot's Issues

TypeScript types are broken

Commit 17338c2 should actually fix the TS definitions, but it broke them for me. As soon as I build my app it ends up with something like this:

$ NODE_ENV=production tsc
node_modules/isbot/index.d.ts:19:1 - error TS1036: Statements are not allowed in ambient contexts.

19 testUserAgent.extend = extend;

Here is a codesandbox to reproduce the issue (run the build script):

https://codesandbox.io/s/async-tree-lxpjv?expanddevtools=1&fontsize=14&hidenavigation=1&moduleview=1

If I downgrade to 2.4.0 it's fine for me.

QUESTION: "bot" in list.json seems too permissive?

I noticed that "bot" is one of the keywords and it greedily matches any user agent containing those three characters in a row. I feel like this is great for capturing user agents like:

Slackbot 1.0(+https://api.slack.com/robots)
Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)

However, it feels too permissive. Is it a fair assumption that there are no legitimate user agents that could contain the characters "bot"?

This is more of a question than an issue. Just curious on the thinking there.

I am also assuming that, of the three bot user agent listings used to construct list.json, the chosen regexes are intended to cover all of the user agents listed but are not one-to-one?

Use a map

// list.js
module.exports = { googlebot: true };

// index.js
var list = require("./list");
module.exports = function isBot(bot) {
  return list[bot] === true;
};

// usage
isBot("googlebot"); //=> true
isBot("asdf"); //=> false

No conversion and regex necessary -- much faster.

Pattern not being removed when calling "exclude"

Steps to reproduce

const isBot = require('isbot');

const userAgent = 'Mozilla/5.0 (Linux; Android 10; SNE-LX1 Build/HUAWEISNE-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/94.0.4606.71 Mobile Safari/537.36 hap/1079/huawei com.huawei.fastapp/11.4.1.310 com.bla.quickapp/4.0.17 ({"packageName":"quickSearch","type":"other","extra":"{}"})';

isBot.exclude(['search']);

console.log(isBot(userAgent));
console.log(isBot.find(userAgent));

Expected behaviour

search should have been removed from the pattern list,
and the example userAgent should not have been identified as a bot.

Actual behaviour

search is still kept in the pattern list,
and the userAgent is still identified as a bot.

Wrong detection

The following user agent is not correctly identified:

Mozilla/5.0 (Linux; Android 8.1.0; Mi A1 Build / OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version / 4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 YandexSearch/7.71 YandexSearchBrowser/7.71

thinks it's a bot

Github Actions stopped working

Workflow output:

- Workflows can't be executed on this repository. Please check your payment method or billing status.

Workflows are limited to 2,000 minutes/month. A recent issue suggests this limit may now be affecting open repositories.

I can move the CI process to CircleCI, which supports pipeline as code.

Rethink existing browsers list

Some browsers may need to be reconsidered as bots: downloaders, Google Talk, etc.

This may allow optimising the regex and, as a result, the runtime.

[3.3.1] "const" keyword in index.js

Steps to reproduce

Build an app with webpack using this library.

Expected behaviour

index.js should contain ES5 code for full browser support.

Actual behaviour

index.js contains one const keyword, and therefore webpack generated a bundle/chunk with a const keyword.


Additional details

TypeError: isbot_1.default is not a function

TS Type Definitions appear to have issues. When upgrading to the latest version, I start getting the following error:

TypeError: isbot_1.default is not a function

After upgrading to 2.4.0, the old way of importing the library in TS started throwing errors in the TS compiler:

import * as isBot from 'isbot';

Now generates the following error:

Cannot invoke an expression whose type lacks a call signature. Type 'typeof import("node_modules/isbot/index")' has no compatible call signatures.ts(2349)

To remove the TS compiler error, I had to change the import to:

import isBot from 'isbot';

But this now throws a runtime error (compilation works OK), as follows:

TypeError: isbot_1.default is not a function

I can see that the main change from the previous version I was using (v2.1.2) is the addition of the type defs (.d.ts) file in the package.

I have done some research, and this kind of error usually is linked to invalid TS definition files. Some similar issues found on the web, with packages having similar problems:

microsoft/TypeScript#7554
iamkun/dayjs#475

My current tsconfig.json file is:

{
  "compilerOptions": {
    "target": "es2016",
    "module": "commonjs",
    "strict": false,
    "moduleResolution": "node",
    "emitDecoratorMetadata": true,
    "experimentalDecorators": true,
    "sourceMap": true,
    "types": ["node", "aws-lambda"]
  }
}

And environment:

isbot: 2.4.0
node: 10
typescript: 3.5.2

Bug - user agent incorrectly treated as a bot

Thanks for the lib @gorangajic.

I think I found a bug, or at least it looks to me like it is a bug.

I'm making a request from a Lambda function using chrome-aws-lambda lib.
Below we can see which User-Agent it sets (UA_LAMBDA).

Once I pass this value to isbot, it returns true, but this is obviously not a bot.

var isbot = require("isbot");

const UA_LAMBDA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/79.0.3945.0 Safari/537.36";
const UA_CHROME_MAC =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36";

console.log(isbot(UA_LAMBDA)); // true
console.log(isbot(UA_CHROME_MAC)); // false

Is this a bug?

When I execute the same request from an ordinary Chrome, it all works fine (UA_CHROME_MAC).

Feel free to correct me if I said anything incorrect.

Thank you!

Allow undefined in isbot interface

Steps to reproduce

Using TypeScript and Express, the Request.get method returns string | undefined. This does not match isbot's interface, which accepts only string.

Expected behaviour

Accept undefined to allow simpler usage

Actual behaviour

Currently, when using isbot(req.get('user-agent')), we get a compilation error.

Additional details


Detecting Social Bots?

What about detecting Facebook, Twitter and LinkedIn?

They ping your site when you post links on their platforms, then read the social meta tags that you provide.

isSocial()?

My workaround...

var useragent = require('useragent')
useragent(true)
req.session.userAgent = useragent.lookup(req.headers['user-agent'])
var ua = req.session.userAgent.toString().toLowerCase()
isSocialBot: (ua.indexOf('facebook') > -1 || ua.indexOf('twitter') > -1 || ua.indexOf('linkedin') > -1)

Needed to do this recently to hide/show social html meta data in my header tag...

{{#if isSocialBot}}
<meta property="og:site_name" content="" />
<meta property="og:title" content="" />
<meta property="og:image" content="" />
<meta property="og:url" content="" />
<meta property="og:type" content="website" />
<meta property="og:locale" content="en_US" />
<meta property="og:locale:alternate" content="en_CA" />
<meta property="og:locale:alternate" content="fr_CA" />
{{/if}}

Non-Chromium compatibility broken in 2.5.2

acb6dcf#diff-168726dbe96b3ce427e7fedce31bb0bcR70 throws a SyntaxError: invalid regexp group in non-Chromium browsers like Firefox and Edge.

Wrapping in a try/catch doesn't help here, because it's not something that the browser can even parse, thus the SyntaxError.

It looks like you can detect support for this a little better with something like:

try {
  let regex = new RegExp('(?<! cu)bot');
} catch (e) {
  // SyntaxError is caught here
}

broken

The previous version:

isbot("googleBot asdf") //=> googleBot
isbot("asdfasdf") //=> false

vs current version:

isbot("googleBot asdf") //=> Bot
isbot("asdfasdf") //=> false

It's not useful like this.

Empty user agent strings

Nice project, thanks for sharing it!

What's your take on empty user agent strings? ('')
They currently seem to be considered not a bot; what about considering them a bot instead?

Have more stability for exclusions

Hi again, I have another suggestion regarding improving this package.

Context

I have been using extend, which works great. I recently started using exclude and would like to raise a concern about the way it works. My goal was to consider a known crawler, let's say Chrome Lighthouse, as a human, so I needed to exclude it. Using isBot.exclude(['Lighthouse']); wouldn't work; after double-checking the documentation I understood my mistake and used Chrome-Lighthouse (coming from https://github.com/gorangajic/isbot/blob/master/list.json) instead of Lighthouse, and it now works as expected.

Problem

I think this solution is not really stable, because I cannot know for sure that https://github.com/gorangajic/isbot/blob/master/list.json won't evolve for Chrome-Lighthouse (there are other, more complex examples). If the string gets updated in https://github.com/gorangajic/isbot/blob/master/list.json but not in my application, the exclusion won't work anymore.

Solution

I think the best solution would be enumerations that we can import from this package. So instead of using the 'Chrome-Lighthouse' string from list.json, something like UA.LIGHTHOUSE would be used. Then in my application I can import this enumeration and use it. This way, by not using hardcoded strings, I know they will always stay in sync even when updating the package version.

Please let me know what you think. Thank you in advance!

RegExp lookbehind is causing errors in Safari and Firefox

SyntaxError: invalid regexp group
This error will be thrown in the packed file if isBot is used in the client. This was introduced in Version 2.5.2.
Not sure if this is a common use case, but it can be very hard to track down, as Firefox has a related bug that always reports line 1.

Fail safe - Outages resilience

An outage at myip.ms caused one of the lists to fail to download, which caused an error in testing and was not recoverable.
Consider creating backup files for recovery from list provider outages.

Wrong detection of Cubot devices

Hey,

your script wrongly flags Cubot devices as bots - probably because of the "bot" substring in the device name. I'm sending several user agents for testing:

Mozilla/5.0 (Linux; Android 8.0.0; CUBOT_P20) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 8.1.0; CUBOT_POWER) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
Mozilla/5.0 (Linux; U; Android 8.0.0; CUBOT_X18_Plus Build/O00623; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 OPR/37.6.2254.134291
Mozilla/5.0 (Linux; Android 6.0; CUBOT_NOTE_S) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Mobile Safari/537.36

Error while installing version 3.0.18

Steps to reproduce

Install version 3.0.18 of isbot.

Expected behaviour

Module should be installed successfully.

Actual behaviour

Getting the following error:

npm ERR! code ELIFECYCLE
npm ERR! syscall spawn
npm ERR! file sh
npm ERR! errno ENOENT
npm ERR! isbot@3.0.18 postinstall: `./scripts/download-fixtures.sh`
npm ERR! spawn ENOENT
npm ERR! 
npm ERR! Failed at the isbot@3.0.18 postinstall script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

Wrong detection

Got the user agent 'Hexometer' on a webpage - it is a bot, but it was not detected and passed through the check.

That is the whole UA, with no versions or any other symbols.

License

Good work. I just realised there is no license on the code, so as I see it, it is technically copyrighted and no permission is granted to use it, which means we're not able to use it at our company :-(. Any chance of adding a license like MIT to the repo? Sorry to be a bore!

Add ability to create multiple instances

Problem

We have a handful of reasons to change the behaviour of our site based on whether the user is a bot. However, they have subtly different requirements. For example:

import isBot from 'isbot';

export const shouldTrack = (ua) => !isBot(ua);
export const shouldFlushHeadEarly = (ua) => !isBot(ua); // exclude chrome-lighthouse and webpagetest
export const shouldDetectRegion = (ua) => !isBot(ua); // include foo and bar
// etc.

Possible solution

Would it make sense to expose a constructor (or factory fn) which had its own pattern state?

import { createIsBot } from 'isbot';

export const shouldTrack = createIsBot();
export const shouldFlushHeadEarly = createIsBot({
  exclude: ['chrome-lighthouse', 'PTST/4']
});
export const shouldDetectRegion = createIsBot({
  include: ['foo', 'bar']
});
// etc.

Tests

There are no tests. There should be some.

Chrome was wrongly recognised as a bot

User Agent String

Mozilla/5.0 (Linux; Android 10; SM-G973F Build/QQ3A.200805.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/85.0.4183.101 Mobile Safari/537.36 trill_2021806060 JsSdk/1.0 NetType/WIFI Channel/googleplay AppName/musical_ly app_version/18.6.6 ByteLocale/fr ByteFullLocale/fr Region/FR

This user agent is from the TikTok webview on Android.
It seems that the string "google" is triggering your pattern.

Reproduce

Open TikTok
Open a link in a user bio: you are detected as a bot

Documentation update required on exclude use-case

Problem

With the recent changes made in #72, especially the removal of the "Google Page Speed Insights" rule, the documentation is now incorrect regarding exclude:

isbot('Google Page Speed Insights') always returns true as adding it to exclude does not work anymore because its rule disappeared from list.json.

Solution

In order to stay correct, the documentation could focus on Chrome-Lighthouse only.


Is this library intended to be used in a browser with webpack?

The library is described as a "Node.js module that detects bots/crawlers/spiders via the user agent". Does that imply it is suited to the Node.js environment only?

We were able to successfully use it in the browser with webpack. Was it intended to be used this way too? Do you know of any potential pitfalls? Is it safe to use this way in production?
