
isbot's Introduction

isbot 🤖/👨‍🦰

Identify bots, crawlers, and spiders using the user agent string.

Usage

Install

npm i isbot

Straightforward usage

import { isbot } from "isbot";

// Request
isbot(request.headers.get("User-Agent"));

// Nodejs HTTP
isbot(request.getHeader("User-Agent"));

// ExpressJS
isbot(req.get("user-agent"));

// Browser
isbot(navigator.userAgent);

// User Agent string
isbot(
  "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
); // true

isbot(
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
); // false

Using the jsDelivr CDN, you can import isbot in the browser directly.

See specific versions and instructions at https://www.jsdelivr.com/package/npm/isbot

ESM

<script type="module">
  import { isbot } from "https://cdn.jsdelivr.net/npm/isbot@5/+esm";
  isbot(navigator.userAgent);
</script>

UMD

<script src="https://cdn.jsdelivr.net/npm/isbot@5"></script>
<script>
  // isbot is now global
  isbot(navigator.userAgent);
</script>

All named imports

import                Type                              Description
isbot                 (string?): boolean                Check if the user agent is a bot
isbotNaive            (string?): boolean                Check if the user agent is a bot using a naive pattern (less accurate)
getPattern            (): RegExp                        The regular expression used to identify bots
list                  string[]                          List of all individual pattern parts
isbotMatch            (string?): string | null          The substring matched by the regular expression
isbotMatches          (string?): string[]               All substrings matched by the regular expression
isbotPattern          (string?): string | null          The regular expression used to identify the bot substring in the user agent
isbotPatterns         (string?): string[]               All regular expressions used to identify bot substrings in the user agent
createIsbot           (RegExp): (string?): boolean      Create a custom isbot function
createIsbotFromList   (string[]): (string?): boolean    Create a custom isbot function from a list of string representation patterns

Example usages of helper functions

Create a custom isbot that does not consider Chrome Lighthouse user agents to be bots.

import { createIsbotFromList, isbotMatches, list } from "isbot";

const ChromeLighthouseUserAgentStrings: string[] = [
  "mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/94.0.4590.2 safari/537.36 chrome-lighthouse",
  "mozilla/5.0 (linux; android 7.0; moto g (4)) applewebkit/537.36 (khtml, like gecko) chrome/94.0.4590.2 mobile safari/537.36 chrome-lighthouse",
];
const patternsToRemove = new Set<string>(
  ChromeLighthouseUserAgentStrings.map(isbotMatches).flat(),
);
const isbot: (ua: string) => boolean = createIsbotFromList(
  list.filter(
    (record: string): boolean => patternsToRemove.has(record) === false,
  ),
);
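
Continuing the snippet above, a quick hedged check of the resulting function (assuming the Lighthouse patterns were the only ones matching those user agents):

isbot(ChromeLighthouseUserAgentStrings[0]); // expected: false
isbot(
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
); // still true, other patterns are untouched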

Create a custom isbot that treats an additional pattern, not originally included in the package, as a bot.

import { createIsbotFromList, list } from "isbot";

const isbot = createIsbotFromList(list.concat("shmulik"));
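
The read-only helpers can also be used on their own to inspect why a user agent was flagged. A minimal sketch; the matched substrings shown in the comments are illustrative, not guaranteed:

import { getPattern, isbotMatch, isbotMatches } from "isbot";

const ua =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

getPattern() instanceof RegExp; // true
isbotMatch(ua); // e.g. "Googlebot" - the matched substring, or null
isbotMatches(ua); // all matched substrings as an array
isbotMatch(
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
); // null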

Definitions

  • Bot. An autonomous program imitating or replacing some aspect of human behaviour, performing repetitive tasks much faster than human users could.
  • Good bot. Automated programs that visit websites in order to collect useful information. Web crawlers, site scrapers, stress testers, preview builders and other programs are welcome on most websites because they serve mutually beneficial purposes.
  • Bad bot. Programs designed to perform malicious actions that ultimately hurt businesses: testing credential databases, DDoS attacks, spam bots.

Clarifications

What does "isbot" do?

This package aims to identify "good bots": those that voluntarily identify themselves by setting a unique, preferably descriptive, user agent, usually sent in a dedicated request header.

What doesn't "isbot" do?

It does not try to recognise malicious bots or programs disguising themselves as real users.

Why would I want to identify good bots?

Recognising good bots such as web crawlers is useful for multiple purposes. Although it is not recommended to serve different content to web crawlers like Googlebot, you can still elect to

  • Flag pageviews so they can be treated separately in business analytics.
  • Prefer to serve cached content and relieve service load.
  • Omit third party solutions' code (tags, pixels) and reduce costs.

It is not recommended to whitelist requests for any reason based on the user agent header alone. Add other methods of identification, such as a reverse DNS lookup. A minimal middleware sketch follows below.
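
For illustration only, here is an Express sketch along the lines of the list above; the marker header and caching decision are hypothetical placeholders, not part of isbot:

import express from "express";
import { isbot } from "isbot";

const app = express();

// Flag every request so later handlers can act on it
app.use((req, res, next) => {
  res.locals.isBot = isbot(req.get("user-agent"));
  next();
});

app.get("/", (req, res) => {
  if (res.locals.isBot) {
    // e.g. prefer cached content and omit third party tags (pixels, analytics)
    res.set("X-Served-From", "cache"); // hypothetical marker header
  }
  res.send("<!doctype html><title>home</title>");
});

app.listen(3000);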

How isbot maintains accuracy

isbot is most useful when it identifies bots from the user agent string as accurately as possible. It uses expansive and regularly updated lists of user agent strings to build a regular expression that matches bots and only bots.

Fallback

The pattern uses lookbehind assertions, which are not supported in all environments. A less accurate fallback is provided for environments that do not support lookbehind. The test suite allows a margin of false positives and false negatives that is deemed acceptable for the fallback: 1% false positives and 75% bot coverage.
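
isbot selects the fallback automatically. Purely as an illustration of the kind of feature detection involved, the sketch below uses the documented isbotNaive export for engines where lookbehind fails to parse; this is an assumption-laden sketch, not the library's internal code:

import { isbot, isbotNaive } from "isbot";

// Passing the pattern as a string defers parsing to runtime,
// so engines without lookbehind throw a catchable SyntaxError here.
let lookbehindSupported = true;
try {
  new RegExp("(?<!a)b");
} catch (e) {
  lookbehindSupported = false;
}

// Pick the accurate check when possible, otherwise the naive one.
const check = lookbehindSupported ? isbot : isbotNaive;
check("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); // true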

Data sources

We use external data sources on top of our own lists to keep up to date:

Crawlers user agents

Non bot user agents

Missing something? Please open an issue

Major releases breaking changes (full changelog)

Remove the named export "pattern" from the interface; use the "getPattern" method instead.

Remove isbot function default export in favour of a named export.

import { isbot } from "isbot";

Remove testing for node 6 and 8

Change return value for isbot: true instead of the matched string (see the migration sketch after this list).

No functional change
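
For code that relied on the old matched-string return value, the isbotMatch helper from the table above appears to cover the same need. A hedged migration sketch:

import { isbot, isbotMatch } from "isbot";

const ua =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

isbot(ua); // true - boolean only
isbotMatch(ua); // the matched substring (or null), similar to the old return value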

isbot's People

Contributors

0x7f, chinesedfan, deadlypants1973, deedeeh, dependabot-preview[bot], dependabot[bot], drewbard, franzmoro, gorangajic, hijarian, interisti, itayganor, jssullivan, kieranstartup, leaverou, lirantal, maximn, mhluska, migara, neraks, niftylettuce, omrilotan, orangeflame, oskarrisberg, rynmsh, simonecorsi, timbowhite, viraptor, wcho, webinista


isbot's Issues

TypeScript types are broken

Commit 17338c2 should actually fix the TS definitions, but it broke them for me. As soon as I build my app it ends up with something like this:

$ NODE_ENV=production tsc
node_modules/isbot/index.d.ts:19:1 - error TS1036: Statements are not allowed in ambient contexts.

19 testUserAgent.extend = extend;

Here is a codesandbox to reproduce the issue (run the build script):

https://codesandbox.io/s/async-tree-lxpjv?expanddevtools=1&fontsize=14&hidenavigation=1&moduleview=1

If I downgrade to 2.4.0 it's fine for me.

QUESTION: "bot" in list.json seems too permissive?

I noticed that "bot" is one of the keywords and it greedily matches any user agent containing those three characters in a row. I feel like this is great for capturing user agents like:

Slackbot 1.0(+https://api.slack.com/robots)
Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)

However, it feels too permissive. Is it a fair assumption that there are no legitimate user agents that could contain the characters "bot"?

This is more of a question than an issue. Just curious on the thinking there.

I am also assuming that, of the three bot user agent listings used to construct list.json, the chosen regexes are intended to cover all of the user agents listed but are not one-to-one?

Use a map

// list.js
module.exports = { googlebot: true };

// index.js
var list = require("./list");
module.exports = function isBot(bot) {
  return list[bot] === true;
};

// usage
isBot("googlebot"); //=> true
isBot("asdf"); //=> false

No conversion and regex necessary -- much faster.

Pattern not being removed when calling "exclude"

Steps to reproduce

const isBot = require('isbot');

const userAgent = 'Mozilla/5.0 (Linux; Android 10; SNE-LX1 Build/HUAWEISNE-L21; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/94.0.4606.71 Mobile Safari/537.36 hap/1079/huawei com.huawei.fastapp/11.4.1.310 com.bla.quickapp/4.0.17 ({"packageName":"quickSearch","type":"other","extra":"{}"})';

isBot.exclude(['search']);

console.log(isBot(userAgent));
console.log(isBot.find(userAgent));

Expected behaviour

search should have been removed from the pattern list,
and the example userAgent should not have been identified as a bot.

Actual behaviour

search is still kept in the pattern list,
and the userAgent is still identified as a bot.

Wrong detection

The following user agent is not correctly identified:

Mozilla/5.0 (Linux; Android 8.1.0; Mi A1 Build / OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version / 4.0 Chrome/68.0.3440.91 Mobile Safari/537.36 YandexSearch/7.71 YandexSearchBrowser/7.71

thinks it's a bot

Github Actions stopped working

Workflow output:

- Workflows can't be executed on this repository. Please check your payment method or billing status.

Workflows are limited to 2,000 minutes/month. A recent issue suggests this limit may now be affecting open repositories.

I can move the CI process to CircleCI, which supports pipeline as code.

Rethink existing browsers list

Some browsers may need to be reconsidered as bots: downloaders, Google Talk, etc.

This may allow optimising the regex and, as a result, the runtime.

[3.3.1] "const" keyword in index.js

Steps to reproduce

Build an app with webpack using this library.

Expected behaviour

index.js should contain ES5 code for full browser support.

Actual behaviour

index.js contains one const keyword, and therefore webpack generated a bundle/chunk with a const keyword.


Additional details

TypeError: isbot_1.default is not a function

TS Type Definitions appear to have issues. When upgrading to the latest version, I start getting the following error:

TypeError: isbot_1.default is not a function

After upgrading to 2.4.0, the old way of importing the library in TS started throwing errors in the TS compiler:

import * as isBot from 'isbot';

Now generates the following error:

Cannot invoke an expression whose type lacks a call signature. Type 'typeof import("node_modules/isbot/index")' has no compatible call signatures.ts(2349)

To remove the TS compiler error, I had to change the import to:

import isBot from 'isbot';

But this now throws a runtime error (compilation works OK), as follows:

TypeError: isbot_1.default is not a function

I can see that the main change from the previous version I was using (v2.1.2) is the addition of the type defs (.d.ts) file in the package.

I have done some research, and this kind of error usually is linked to invalid TS definition files. Some similar issues found on the web, with packages having similar problems:

microsoft/TypeScript#7554
iamkun/dayjs#475

My current tsconfig.json file is:

{
  "compilerOptions": {
    "target": "es2016",
    "module": "commonjs",
    "strict": false,
    "moduleResolution": "node",
    "emitDecoratorMetadata": true,
    "experimentalDecorators": true,
    "sourceMap": true,
    "types": ["node", "aws-lambda"]
  }
}

And environment:

isbot: 2.4.0
node: 10
typescript: 3.5.2

Bug - user agent incorrectly treated as a bot

Thanks for the lib @gorangajic.

I think I found a bug, or at least it looks to me like it is a bug.

I'm making a request from a Lambda function using chrome-aws-lambda lib.
Below we can see which User-Agent it sets (UA_LAMBDA).

Once I pass this value to isbot, it returns true, but this is obviously not a bot.

var isbot = require("isbot");

const UA_LAMBDA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/79.0.3945.0 Safari/537.36";
const UA_CHROME_MAC =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36";

console.log(isbot(UA_LAMBDA)); // true
console.log(isbot(UA_CHROME_MAC)); // false

Is this a bug?

When I execute the same request from an ordinary Chrome, it all works fine (UA_CHROME_MAC).

Feel free to correct me if I said anything incorrect.

Thank you!

Allow undefined in isbot interface

Steps to reproduce

Using TypeScript and Express, the Request.get method returns string | undefined. This does not match isbot's interface, which accepts only string.

Expected behaviour

Accept undefined to allow simpler usage

Actual behaviour

Currently, when using isbot(req.get('user-agent')), we get a compilation error.

Additional details


Detecting Social Bots?

What about detecting Facebook, Twitter and LinkedIn?

They ping your site when you post links on their platforms, then read the social meta tags that you provide.

isSocial()?

My workaround...

var useragent = require('useragent')
useragent(true)
req.session.userAgent = useragent.lookup(req.headers['user-agent'])
var ua = req.session.userAgent.toString().toLowerCase()
isSocialBot: (ua.indexOf('facebook') > -1 || ua.indexOf('twitter') > -1 || ua.indexOf('linkedin') > -1)

Needed to do this recently to hide/show social html meta data in my header tag...

{{#if isSocialBot}}
<meta property="og:site_name" content="" />
<meta property="og:title" content="" />
<meta property="og:image" content="" />
<meta property="og:url" content="" />
<meta property="og:type" content="website" />
<meta property="og:locale" content="en_US" />
<meta property="og:locale:alternate" content="en_CA" />
<meta property="og:locale:alternate" content="fr_CA" />
{{/if}}

Non-Chromium compatibility broken in 2.5.2

acb6dcf#diff-168726dbe96b3ce427e7fedce31bb0bcR70 throws a SyntaxError: invalid regexp group in non-Chromium browsers like Firefox and Edge.

Wrapping in a try/catch doesn't help here, because it's not something that the browser can even parse, thus the SyntaxError.

It looks like you can detect support for this a little better with something like:

try {
  let regex = new RegExp('(?<! cu)bot');
} catch (e) {
  // SyntaxError is caught here
}

broken

The previous version:

isbot("googleBot asdf") //=> googleBot
isbot("asdfasdf") //=> false

vs current version:

isbot("googleBot asdf") //=> Bot
isbot("asdfasdf") //=> false

It's not useful like this.

Empty user agent strings

Nice project, thanks for sharing it!

What's your take on empty user agent strings? ('')
They currently seem to be considered not a bot; what about considering them a bot instead?

Have more stability for exclusions

Hi again, I have another suggestion regarding improving this package.

Context

I have been using extend, which works great. I recently started using exclude and would like to raise a concern about the way it works. My goal was to consider a known crawler, let's say Chrome Lighthouse, as a human, so I needed to exclude it. Using isBot.exclude(['Lighthouse']); wouldn't work; after double-checking the documentation I understood my mistake and used Chrome-Lighthouse (coming from https://github.com/gorangajic/isbot/blob/master/list.json) instead of Lighthouse, and it now works as expected.

Problem

I think this solution is not really stable, because I cannot know for sure that https://github.com/gorangajic/isbot/blob/master/list.json won't evolve for Chrome-Lighthouse (there are other, more complex examples). If the string gets updated in https://github.com/gorangajic/isbot/blob/master/list.json but not in my application, the exclusion won't work anymore.

Solution

I think the best solution would be enumerations that we can import from this package. So instead of using the 'Chrome-Lighthouse' string from list.json, something like UA.LIGHTHOUSE would be used. Then in my application I can import this enumeration and use it. This way, by not using hardcoded strings, I know they will always stay in sync even when updating the package version.

Please let me know what you think. Thank you in advance!

RegExp lookbehind is causing errors in Safari and Firefox

SyntaxError: invalid regexp group
This error will be thrown in the packed file if isBot is used in the client. This was introduced in Version 2.5.2.
Not sure if this is a common use case, but it can be very hard to track down, as Firefox has a related bug that always reports line 1.

Fail safe - Outages resilience

An outage at myip.ms caused one of the lists to fail to download, which caused an error in testing and was not recoverable.
Consider creating backup files for recovery from list provider outages.

Wrong detection of Cubot devices

Hey,

your script wrongly flags Cubot devices as bots - probably because of the "bot" substring in the device name. I'm sending several user agents for testing:

Mozilla/5.0 (Linux; Android 8.0.0; CUBOT_P20) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 8.1.0; CUBOT_POWER) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
Mozilla/5.0 (Linux; U; Android 8.0.0; CUBOT_X18_Plus Build/O00623; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.99 Mobile Safari/537.36 OPR/37.6.2254.134291
Mozilla/5.0 (Linux; Android 6.0; CUBOT_NOTE_S) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Mobile Safari/537.36

Error while installing version 3.0.18

Steps to reproduce

Install version 3.0.18 of isbot.

Expected behaviour

Module should be installed successfully.

Actual behaviour

Getting the following error:

npm ERR! code ELIFECYCLE
npm ERR! syscall spawn
npm ERR! file sh
npm ERR! errno ENOENT
npm ERR! isbot@3.0.18 postinstall: `./scripts/download-fixtures.sh`
npm ERR! spawn ENOENT
npm ERR! 
npm ERR! Failed at the isbot@3.0.18 postinstall script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

Wrong detection

Got the user agent 'Hexometer' on a webpage - it is a bot, but it was not detected and passed through the check.

That is the whole UA, with no versions or any other symbols.

License

Good work. I just realised there is no license on the code, so as I see it, it is technically copyrighted and no permission is granted to use it, which means we're not able to use it at our company :-(. Any chance of adding a license like MIT to the repo? Sorry to be a bore!

Add ability to create multiple instances

Problem

We have a handful of reasons to change the behaviour of our site based on whether the user is a bot. However, they have subtly different requirements. For example:

import isBot from 'isbot';

export const shouldTrack = (ua) => !isBot(ua);
export const shouldFlushHeadEarly = (ua) => !isBot(ua); // exclude chrome-lighthouse and webpagetest
export const shouldDetectRegion = (ua) => !isBot(ua); // include foo and bar
// etc.

Possible solution

Would it make sense to expose a constructor (or factory fn) which had its own pattern state?

import { createIsBot } from 'isbot';

export const shouldTrack = createIsBot();
export const shouldFlushHeadEarly = createIsBot({
  exclude: ['chrome-lighthouse', 'PTST/4']
});
export const shouldDetectRegion = createIsBot({
  include: ['foo', 'bar']
});
// etc.

Tests

There are no tests. There should be some.

Chrome was wrongly recognised as a bot

User Agent String

Mozilla/5.0 (Linux; Android 10; SM-G973F Build/QQ3A.200805.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/85.0.4183.101 Mobile Safari/537.36 trill_2021806060 JsSdk/1.0 NetType/WIFI Channel/googleplay AppName/musical_ly app_version/18.6.6 ByteLocale/fr ByteFullLocale/fr Region/FR

This user agent is from the TikTok webview on Android.
It seems that the string "google" is triggering your pattern.

Reproduce

Open TikTok
Open a link in a user bio: you are detected as a bot

Documentation update required on exclude use-case

Problem

With the recent changes made in #72, especially the removal of the "Google Page Speed Insights" rule, the documentation is now incorrect regarding exclude:

isbot('Google Page Speed Insights') always returns true as adding it to exclude does not work anymore because its rule disappeared from list.json.

Solution

In order to stay correct, the documentation could focus on Chrome-Lighthouse only.


Is this library intended to be used in a browser with webpack?

The library is described as a "Node.js module that detects bots/crawlers/spiders via the user agent". Does that imply it is suited to the Node.js environment only?

We were able to successfully use it in the browser with webpack. Was it intended to be used this way too? Do you know of any potential pitfalls? Is it safe to use this way in production?
