datadog / guarddog



:snake: :mag: GuardDog is a CLI tool to identify malicious PyPI and npm packages

Home Page: https://securitylabs.datadoghq.com/articles/guarddog-identify-malicious-pypi-packages/

License: Apache License 2.0

Python 79.40% Makefile 0.42% Dockerfile 0.53% Jupyter Notebook 13.28% JavaScript 5.38% Shell 0.99%

malicious-packages pypi-packages python python-security software-supply-chain-security npm npm-packages

guarddog's Issues

Upload to PyPI

It would be nice if the package were uploaded to PyPI so that it does not need to be installed directly from GitHub.

guarddog @ git+https://github.com/DataDog/[email protected] is a much longer dependency specifier than just guarddog
Installing directly from GitHub means a PyPI mirror cannot be used, which can slow down builds
Another malicious package that pretends to be this one may get uploaded to PyPI

scanning very slow for some packages

e.g. checkurl and apache-superset-db may be worth checking

Show direct link to package files on file matches

https://github.com/pypi/inspector

Allow excluding specific rules

Heuristic: We should detect exec(...(zlib.decompress(xxx))

We can probably rename the base64 rule to also take into account zlib.decompress

Heuristic to catch exec of base64 decoded strings

As discussed, we should probably have a heuristic that matches on:

exec(base64.b64decode("..."))

and more generally anything that looks like:

exec(anyfunction(anyotherfunction(base64.b64decode("..."))))

Sample package: botcity-documents
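A minimal sketch of the idea, using Python's ast module instead of a Semgrep rule (so this is a swapped-in technique, not the project's actual rule). It flags any exec/eval call whose arguments contain a decoder call at any nesting depth. The decoder set and the function names are assumptions made for illustration.

```python
import ast

# Assumed for this sketch: which decoders and sinks count as suspicious.
DECODERS = {"base64.b64decode", "zlib.decompress"}
EXEC_SINKS = {"exec", "eval"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def contains_decoder_call(node):
    """True if any call inside `node` targets one of the DECODERS."""
    return any(
        isinstance(inner, ast.Call) and dotted_name(inner.func) in DECODERS
        for inner in ast.walk(node)
    )


def find_exec_of_decoded(source):
    """Return sorted line numbers of exec/eval calls fed by a decoder call."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and dotted_name(node.func) in EXEC_SINKS
        and any(contains_decoder_call(arg) for arg in node.args)
    )
```

Because the check walks the whole argument subtree, wrapping the decoder in extra function calls does not hide it. Aliased imports such as `from base64 import b64decode` are not covered by this sketch.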

Show progress / output when scanning a requirements.txt file

Run rules tests in GitHub actions

Some examples: https://github.com/DataDog/stratus-red-team/blob/main/.github/workflows/test.yml

Pip install fails because it needs poetry

Typosquatting false positive if package name has a period

If the package name has a period in the name, it generates a typosquatting error all the time.

Example using keyrings.google-artifactregistry-auth (Google made package for using Google Artifact Registry):

guarddog scan keyrings.google-artifactregistry-auth
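One possible explanation is that the name is compared before and after normalization and therefore matches itself. The sketch below applies the PEP 503 name normalization and drops candidates that are the same project as the scanned package. The helper names are hypothetical, and whether this is the actual cause in guarddog is an assumption.

```python
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def drop_self_matches(package, candidates):
    """Remove candidates that are the same project as `package` once normalized."""
    target = normalize(package)
    return [c for c in candidates if normalize(c) != target]
```

With this filter, a package whose name contains a period is never reported as a typosquat of its own normalized form.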

False positive: running python unittests

{"shady-links": {}, "exfiltrate-sensitive-data": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"pyadt-1.0.0/setup.py:52": "        errno = call([\"python\", \"-m\", \"unittest\", \"discover\"])"}, "cmd-overwrite": {}, "typosquatting": []

in pyadt

Install issue when poetry is not installed

$ pip3 install git+https://github.com/DataDog/guarddog.git
...
ModuleNotFoundError: No module named 'poetry'

We may need to update the install docs? Also, it would be good if we didn't require installing poetry to be able to use guarddog. But if we need to, so be it

Heuristic: Flag usage of W4SP Stealer

False positives: exclude common commands from setup.py execution rule

Examples found in the wild:

{"shady-links": {}, "exfiltrate-sensitive-data": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"LabelLib-2020.10.5/setup.py:21": "            out = check_output(['cmake', '--version'])"}, "cmd-overwrite": {}, "typosquatting": []}

(I've seen the cmake case several times)

{"shady-links": {}, "exfiltrate-sensitive-data": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"redditanalysis-1.0.5/setup.py:16": "        os.system(\"pandoc --from=markdown --to=rst --output=README.rst README.md\")"}, "cmd-overwrite": {}, "typosquatting": []}

(pandoc as well)
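A rough sketch of how such an exclusion could work: treat a command as benign only if it runs a single allowlisted program and contains no shell chaining. The allowlist contents and the function name are assumptions taken from the examples above, not guarddog's configuration.

```python
import shlex

# Illustrative allowlist built from the examples above (assumption).
BENIGN_PROGRAMS = {"cmake", "pandoc"}
# Characters that let a shell string chain or substitute other commands.
SHELL_METACHARACTERS = set(";|&$`<>\n")


def is_probably_benign(command):
    """Return True when `command` runs one allowlisted program with no shell chaining.

    `command` is either an argv list (subprocess style) or a shell string
    (os.system style).
    """
    if isinstance(command, str):
        if SHELL_METACHARACTERS & set(command):
            return False
        argv = shlex.split(command)
    else:
        argv = list(command)
    return bool(argv) and argv[0] in BENIGN_PROGRAMS
```

Rejecting shell metacharacters is deliberately conservative: an allowlisted program name at the start of a string should not excuse whatever follows a `;` or a pipe.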

bug in typosquatting detection?

Since dango is at levenshtein distance of 1 from django, shouldn't it trigger a typosquatting alert?
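For reference, a small self-contained Levenshtein implementation confirms the distance. This is a generic textbook version, not necessarily the one guarddog uses.

```python
def levenshtein(a, b):
    """Return the edit distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("dango", "django"))  # 1
```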

Support scanning remote repositories

As discussed, we might want to give the possibility to scan all the Python repositories of an organization, as it's more valuable to just give a remote URL.

Sample usage:

python3 -m pysecurity verify https://github.com/datadog -r xxx -o file.json

Sample tool doing it: https://github.com/tindersec/gh-workflow-auditor

Allow scanning for local .tar.gz files

False positive

Scanning py-riff
{"secrets": {}, "shady-links": {}, "post-systeminfo": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"py-riff-1.7/setup.py": ["    version= subprocess.check_output(['git', 'describe', '--tags']).strip() \\"]}, "cmd-overwrite": {}, "typosquatting": null}

list index out of range

On latest semgrep-rules branch:

$ python -m pysecurity -n ttyyyyyy -v 8.8.8
list index out of range

Format CLI in a more readable format than a JSON

Format the output in the same way as Semgrep, allowing JSON to be an option, but the default is a more human readable format.

Add screenshots / code snippets of new output format and discuss --json flag

Ignore specific false positives! General users often get alert fatigue. To avoid this and make the tool more user friendly, alert silencing should be implemented. However, don’t bake it into the command: adding more flags makes the command very long. Instead, find ways to bake it into the code being scanned. This means:

Allowing specific lines to be ignored in requirements.txt through semantic comments, like Semgrep
Allowing users to create a configuration file on their local machine that specifies a list of files that escape scanning (local scan mode)
For a potential pip install command (i.e. pysecurity pip install), devising a way to override the flags raised by the command. For example, if cryptography is flagged by pysecurity pip install but we want to override the malicious flag, give the user a way to indicate, “I have reviewed cryptography, so please don’t error out or warn me”
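As a sketch of the first point, a requirements.txt line could opt out of scanning through a trailing comment. The marker text `scan: ignore` and the function name are purely hypothetical; no such marker is defined by the tool.

```python
import re

# Hypothetical marker; chosen for illustration only.
IGNORE_MARKER = "scan: ignore"


def packages_to_scan(lines):
    """Return requirement strings that are not silenced by an ignore comment."""
    selected = []
    for line in lines:
        # A comment starts at '#' when it is at line start or preceded by whitespace.
        parts = re.split(r"(?:^|\s+)#", line, maxsplit=1)
        requirement = parts[0].strip()
        comment = parts[1] if len(parts) > 1 else ""
        if not requirement or IGNORE_MARKER in comment:
            continue
        selected.append(requirement)
    return selected
```

Keeping the override next to the dependency it silences means the decision is reviewed together with the requirements file rather than hidden in a long command line.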

suggestion: don't match on "suspicious links" in comments?

e.g.

Scanning pip
{'secrets': {}, 'shady-links': {'/var/folders/_j/rxmxz87j51q5mzmk79qs0qs00000gp/T/tmp0lq4d1hz/pip-22.1.2/pip-22.1.2/src/pip/_internal/network/session.py': ['Detected an unsafe link to SECURE_ORIGINS: List[SecureOrigin] = [\n    # protocol, hostname, port\n    # Taken from Chrome\'s list of secure origins (See: http://bit.ly/1qrySKC)\n    ("https", "*", "*"),\n    ("*", "localhost", "*"),\n    ("*", "127.0.0.0/8", "*"),\n    ("*", "::1/128", "*"),\n    ("file", "*", None),\n    # ssh is always secure.\n    ("ssh", "*", "*"),\n].', 'Detected an unsafe link to [\n    # protocol, hostname, port\n    # Taken from Chrome\'s list of secure origins (See: http://bit.ly/1qrySKC)\n    ("https", "*", "*"),\n    ("*", "localhost", "*"),\n    ("*", "127.0.0.0/8", "*"),\n    ("*", "::1/128", "*"),\n    ("file", "*", None),\n    # ssh is always secure.\n    ("ssh", "*", "*"),\n].'], '/var/folders/_j/rxmxz87j51q5mzmk79qs0qs00000gp/T/tmp0lq4d1hz/pip-22.1.2/pip-22.1.2/src/pip/_vendor/tenacity/wait.py': ['Detected an unsafe link to """Random wait with exponentially widening window.\n\n    An exponential backoff strategy used to mediate contention between multiple\n    uncoordinated processes for a shared resource in distributed systems. This\n    is the sense in which "exponential backoff" is meant in e.g. 
Ethernet\n    networking, and corresponds to the "Full Jitter" algorithm described in\n    this blog post:\n\n    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/\n\n    Each retry occurs at a random time in a geometrically expanding interval.\n    It allows for a custom multiplier and an ability to restrict the upper\n    limit of the random interval to some maximum value.\n\n    Example::\n\n        wait_random_exponential(multiplier=0.5,  # initial window 0.5s\n                                max=60)          # max 60s timeout\n\n    When waiting for an unavailable resource to become available again, as\n    opposed to trying to resolve contention for a shared resource, the\n    wait_exponential strategy (which uses a fixed interval) may be preferable.\n\n    """.']}, 'post-systeminfo': {}, 'download-executable': {}, 'base64-strings': {}, 'code-execution': {}, 'cmd-overwrite': {}, 'typosquatting': None}
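One way to skip comments, sketched with the standard tokenize module rather than a Semgrep rule: only string literal tokens are searched for URLs, so a link in a `#` comment is never reported. The URL regex and function name are assumptions for illustration.

```python
import io
import re
import tokenize

# Simplified URL pattern, good enough for the sketch.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")


def urls_outside_comments(source):
    """Return (line, url) pairs for URLs inside string literals, ignoring comments."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            for url in URL_RE.findall(tok.string):
                found.append((tok.start[0], url))
    return found
```

Docstrings are ordinary string tokens, so the tenacity match in the output above would still be reported by this sketch; excluding docstrings would need an extra check.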

Misleading description "compromised_email"

Heuristic: Identify when globals() or import are used with constant hex strings

Sample:

from builtins import *;OOO0O0OOOOO000oOo0oOoOo0,llIIlIlllllIlIlIlll,Oo000O0OO0oO0oO00oO0oO0O,WXWXXWWXXWXWXWWXXXWXXWX,XWWWWXXXXWWWWWXXWWX=(lambda SS2S222S22SS22S22S:SS2S222S22SS22S22S(__import__('\x7a\x6c\x69\x62'))),(lambda SS2S222S22SS22S22S:globals()['\x65\x76\x61\x6c'](globals()['\x63\x6f\x6d\x70\x69\x6c\x65'](globals()['\x73\x74\x72']
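A possible starting point, as a regex over the raw source text. The parsed AST cannot be used directly here, because the escapes are already decoded by the time a string constant is available. The pattern and function name are assumptions; the sketch only covers literals made up entirely of `\xNN` escapes.

```python
import re

# Matches __import__('\x..') or globals()['\x..'] where the literal is only hex escapes.
HEX_LOOKUP = re.compile(
    r"""(?:__import__\s*\(|globals\s*\(\s*\)\s*\[)\s*(['"])((?:\\x[0-9a-fA-F]{2})+)\1"""
)


def find_hex_lookups(source):
    """Return the decoded names hidden behind hex-escaped lookups in `source`."""
    decoded = []
    for match in HEX_LOOKUP.finditer(source):
        hex_digits = match.group(2).replace("\\x", "")
        decoded.append(bytes.fromhex(hex_digits).decode("latin-1"))
    return decoded
```

Run against the sample above, this kind of check would surface the hidden names (for instance the module imported through `__import__`) instead of the opaque escapes.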

False positive to identify non-existing maintainer e-mails

Sample:

$ guarddog scan platformdirs
{'errors': {'compromised_email': 'Domain ronnypfannschmidt.de does not exist'},

$ whois  ronnypfannschmidt.de

refer:        whois.denic.de

domain:       DE

organisation: DENIC eG
address:      Kaiserstrasse 75-77
address:      Frankfurt am Main  60329
address:      Germany

status:       ACTIVE
remarks:      Registration information: http://www.denic.de/

created:      1986-11-05
changed:      2021-06-01
source:       IANA

Guarddog hanging when it receives a 404 from PyPI

We have a bunch of internal libraries that are not uploaded to PyPI (and never will be) but are instead included in the form

${package} @ file:///tmp/${package}-${version}.tar.gz

Whenever guarddog finds such a package, it tries to load it from PyPI, and then the whole process hangs and doesn't complete execution.
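A sketch of one way to avoid the lookup: separate direct references (`name @ url`) from index-hosted requirements before contacting PyPI. The regex is a simplification of the requirement grammar, and the function name and sample package are hypothetical.

```python
import re

# Simplified match for "name[extras] @ url" direct references.
DIRECT_REFERENCE = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*@\s*(?P<url>\S+)"
)


def split_requirements(lines):
    """Split lines into (index_requirements, direct_references).

    direct_references is a list of (name, url) pairs that should not be
    looked up on PyPI.
    """
    from_index, direct = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = DIRECT_REFERENCE.match(stripped)
        if match:
            direct.append((match.group("name"), match.group("url")))
        else:
            from_index.append(stripped)
    return from_index, direct
```

The direct references could then be reported as skipped, or scanned from the local file they point to.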

Document limitations of taint tracking

Deep Semgrep is needed to propagate values through function calls. An example of this is found in the exfiltrate-sensitive-data tests (ctx). It can also be seen here: https://semgrep.dev/s/enelli:exfiltrate-sensitive-data. The first case is not detected, but the second case that has only one function is caught.

False positive: Typosquatting

Hi there, first of all: awesome tools you guys made! 🎉
Second, I encountered the following output when scanning a requirements.txt file:

Found 2 potentially malicious indicators in ruamel-yaml version 0.17.21

typosquatting: This package closely ressembles the following package names, and might be a typosquatting attempt: ruamel-yaml, ruamel-yaml

code-execution: found 1 source code matches
  * setup.py file executing code at ruamel.yaml-0.17.21/setup.py:955
        subprocess.check_output(cmd)

I do get why the second indicator is found, but the first one confuses me:

The package name (also installed on my machine) is ruamel.yaml. There is no package named ruamel-yaml in either my requirements or on PyPI. Did something go wrong with the dots in the package name? Or is it because this package is listed in your typosquatting list as ruamel-yaml?

Thanks!

Add missing methods to execute code

We have 2 rules where we match for code execution:

exec-base64.yml
code-execution.yml

These two rules should detect the same functions to detect code execution. Currently, code-execution only flags exec and subprocess.X
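To illustrate the "one shared list" idea outside of Semgrep, the sketch below keeps the execution sinks in a single set that any check can reuse. The set contents (including os.system and os.popen) and the helper names are assumptions, not the current rule definitions.

```python
import ast

# Single source of truth for "functions that execute code" (assumed list).
EXECUTION_SINKS = frozenset(
    {
        "exec",
        "eval",
        "os.system",
        "os.popen",
        "subprocess.call",
        "subprocess.run",
        "subprocess.Popen",
        "subprocess.check_call",
        "subprocess.check_output",
    }
)


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_execution_calls(source):
    """Return sorted (line, function) pairs for calls to any execution sink."""
    return sorted(
        (node.lineno, dotted_name(node.func))
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call) and dotted_name(node.func) in EXECUTION_SINKS
    )
```

Names imported directly, such as `from subprocess import call`, would need import resolution and are not handled here.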

Don't display full string when matching malicious packages

Scanning a local package doesn't seem to work

guarddog scan Documents/pypi-malicious/20202-11-03-xolokvhcqvifyf-0.0.0.tar.gz
{'errors': {},
'issues': 0,
'results': {'cmd-overwrite': {},
'code-execution': {},
'download-executable': {},
'exec-base64': {},
'exfiltrate-sensitive-data': {},
'shady-links': {}}}

Use Semgrep join and extract mode to make rules more robust

Join and Extract Mode Documentation

https://semgrep.dev/docs/experiments/join-mode/recursive-joins/
https://semgrep.dev/docs/experiments/extract-mode/

Ideas for Improvements

Join mode could allow us to create a collection of similar commands to reuse in all our rules, kind of like how CodeQL has an all encompassing user-input command. Extract mode could help us detect bash commands hidden in exec/eval/os.system/etc. commands instead of broadly detecting the calling function.

Note: For now, join mode doesn't seem to work with taint tracking: semgrep/semgrep#5062

Support package scanning using various requirement specifiers

For example, scan all packages of requests >= 2.28.0. The list of requirement specifiers is here: https://pip.pypa.io/en/stable/cli/pip_install/#pip-install-examples

False positive for gpg and pip

I've seen that one a few times, maybe we can whitelist "pip freeze" and gpg?

Scanning oedtools
{"secrets": {}, "shady-links": {}, "post-systeminfo": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"oedtools-1.0.2/setup.py": ["        if os.system('pip freeze | grep twine'):", "                os.system('gpg --detach-sign -a {}'.format(p))"]}, "cmd-overwrite": {}, "typosquatting": null}

Docker image

False positive: running pkg-config

{"shady-links": {}, "exfiltrate-sensitive-data": {}, "download-executable": {}, "exec-base64": {}, "code-execution": {"pdfparser-rossum-1.5.3/setup.py:80": "            items = subprocess.check_output(['pkg-config', optional_args, pkg_option, package]).decode('utf8').split()"}, "cmd-overwrite": {}, "typosquatting": []}

Alias "pip install" to "guarddog"

Just a random idea I had:

As a: developer
I want to: automatically run pysecurity on every package I install
and that: the installation fails if the package is dangerous
so that: I don't install malicious packages

The idea would be to document a way to have an alias that runs pysecurity, then pip install, and fails if the package is deemed "risky".

Sample usage:

$ securepip install mypackage
Scanning mypackage with pysecurity...
No malicious behavior found, proceeding with pip install

Implementation: the easiest would be to provide a bash function one could add to their .bashrc
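The issue suggests a bash function; the same flow is sketched here in Python so the decision logic is easy to test. The `scan` callable is a hypothetical interface that returns a list of findings; how it would invoke the scanner and parse its output is left open.

```python
import subprocess
import sys


def pip_install(package):
    """Install `package` with the pip belonging to the current interpreter."""
    subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)


def secure_install(package, scan, install=pip_install):
    """Install `package` only when `scan(package)` returns no findings.

    `scan` is a hypothetical callable returning a list of findings.
    Returns True if the package was installed, False if it was refused.
    """
    findings = scan(package)
    if findings:
        print(f"Refusing to install {package}: {len(findings)} finding(s)")
        return False
    print("No malicious behavior found, proceeding with pip install")
    install(package)
    return True
```

Passing the scanner and installer as arguments keeps the wrapper independent of the exact CLI output format, which may change between versions.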

Potentially incorrect find with cmd-override

I'm a bit confused why guarddog reports pip command override in this case where the reads are clearly happening in the long_description.
Package: pytest-cov version 4.0.0
Link to code: https://github.com/pytest-dev/pytest-cov/blob/master/setup.py#L88

cmd-overwrite: found 1 source code matches
  * Standard pip command overwritten in setup.py at pytest-cov-4.0.0/setup.py:88
        setup(
        name='pytest-cov',
        version='4.0.0',
        license='MIT',
        description='Pytest plugin for measuring coverage.',
        long_description='{}\n{}'.format(read('README.rst'), re.sub(':[a-z]+:`~?(.*?)`', r'``\1``', read('CHANGELOG.r...,
        },
    )

Add a mode to scan a `requirements.txt` file

As a: developer
I need to: be able to run pysecurity on my project's dependencies
So that: I can be alerted if one of my dependencies is malicious

Sample usage (suggestion):

python3 -m pysecurity -n /my/project/requirements.txt # Keep the same argument as a local package, and detect if it's a text file

# or
python3 -m pysecurity --requirements /my/project/requirements.txt

Check that CLI reference is still up to date

Bug in typosquatting detection for pyobjc-framework-MediaLibrary

Scan output:

{
  "shady-links": {},
  "exfiltrate-sensitive-data": {},
  "download-executable": {},
  "exec-base64": {},
  "code-execution": {},
  "cmd-overwrite": {},
  "typosquatting": [
    "pyobjc-framework-medialibrary",
    "pyobjc-framework-medialibrary"
  ]
}

Sensitive data exfiltration: modify rule to work across functions

https://twitter.com/LewisArdern/status/1590439873437401089

false negative when using os.popen

e.g. x-mroy-1052 is doing:


os.popen("cd %s && git init " % TEST_MODULES_ROOT)
        os.popen("cd %s && git remote add origin https://github.com/Qingluan/x-plugins.git"  % TEST_MODULES_ROOT)

        os.popen("chmod +x %s && cp %s /usr/local/bin/x-neid-server " % ("startup.bash", "startup.bash"))
        os.popen("cp %s %s" % ("supervisord.conf", SHOME))
        os.popen("cp %s %s" % ("server.crt", SHOME))
        os.popen("cp %s %s" % ("server.key", SHOME))
        os.popen("cp %s %s" % ("swordnode.ini", DB_PATH_C))

        os.popen("cp %s %s" % ("x-neid.conf", J(SHOME_SERVICES, "x-neid.conf")))
        os.popen("cp %s %s" % ("x-auth.conf", J(SHOME_SERVICES, "x-auth.conf")))
        os.popen("cp %s %s" % ("x-node-test.conf", J(SHOME_SERVICES, "x-node-test.conf")))

but it's not being caught by the current rule

Run semgrep with --metrics off

... otherwise it sends some stats to Semgrep, and slows down the scan
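A minimal sketch of building the Semgrep invocation with metrics disabled. The function name and the way the rules path and target are passed are assumptions about how the caller is structured.

```python
def semgrep_command(rules_path, target):
    """Build the argv for a Semgrep scan with metrics reporting turned off."""
    return [
        "semgrep",
        "--config", rules_path,
        "--metrics=off",  # do not send usage statistics
        "--json",
        target,
    ]
```

The resulting list can be handed to subprocess.run without going through a shell.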

false positive when detecting domain extensions

Sample match (package kg-qa)

"Detected an unsafe link to url = 'https://lov.linkeddata.es/dataset/lov/api/v2/vocabulary/autocomplete?q=%s'%vocab.

It should not have matched since the domain extension is .es (and not .link), although it probably matches because of the lov.link... portion
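A sketch of checking the extension on the parsed hostname rather than on the raw string, which avoids matching `.link` in the middle of a domain. The set of suspicious extensions is a placeholder, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Placeholder set; only ".link" comes from the report above.
SUSPICIOUS_TLDS = {"link"}


def has_suspicious_tld(url):
    """True when the last label of the URL's hostname is a suspicious extension."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld in SUSPICIOUS_TLDS
```

For the URL in the report, the hostname is lov.linkeddata.es, so the last label is `es` and the check does not fire.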

[Issue/Question] Support against a requirements.txt file that contains specific version pins

Hi!

We are currently using pip-tools, which automatically pins every dependency against the version we specify in our requirements.in file. As such, whenever we run guarddog verify requirements.txt, it will result in a 404.

Is there any way to support this? I could write a script that does a bit of regex, but I would like some support from the maintainers if possible.

To reproduce.

install pip-tools
create a requirements.in, and add requests, guarddog, etc...
run pip-compile

# requirements.txt
...
requests==2.28.1
    # via
    #   via -r requirements.in
...

run guarddog verify requirements.txt
Result in Received status code: 404 from PyPI
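As a stopgap, the pins could be extracted with a small script along these lines. The regex only handles `name==version` lines, and the per-version PyPI JSON URL shown is one way to check that a pinned release exists; whether guarddog queries that endpoint is an assumption.

```python
import re

# Matches "name==version" at the start of a line; comment lines do not match.
PINNED = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")


def parse_pins(text):
    """Return (name, version) pairs for every pinned requirement in `text`."""
    pins = []
    for line in text.splitlines():
        match = PINNED.match(line)
        if match:
            pins.append((match.group(1), match.group(2)))
    return pins


def pypi_json_url(name, version):
    """URL of PyPI's JSON metadata for one specific release."""
    return f"https://pypi.org/pypi/{name}/{version}/json"
```

The "# via" annotation lines written by pip-compile start with whitespace and `#`, so they are skipped without special handling.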

Avoid using strict version constraints in pyproject.toml

We are trying to integrate guarddog into our development toolkit, and the fact that the package pins its top-level dependencies makes it hard to integrate into existing environments.

It would be great if these were changed to less restrictive specifiers so that guarddog can be installed without needing its own environment.

Heuristic: False negative on base64 decode

import base64;exec(''.join([y[0] for x in [x for x in base64.b64decode( ('TSUmPCwrKCEvLCQnLypNJ3AnL3IvLCQnLypEJC').encode('ascii') ).decode('ascii')] for y in [[x[0], x[1]] for x in {'\t': 'e', '\n': 'M', ' ': '!', '!': 'u', '@': ':', '~': ')', '`': '#', '#': '9', '$': 'J', '%': '`', '^': 'x', '&': 'b', '*': '2', '(': 'r', ')': ' ', '_': '[', '=': '.', '-': 'R', '+': 'K', '{': 'n', '}': '-', '|': 'm', '\\': 'C', '[': 'Z', ']': 'j', ':': '3', ';': 'z', '"': '~', "'": 'c', ',': 'g', '.': 'D', '/': 'L', '?': '1', '>': '7', '<': '|', '0': 'q', '1': 'G', '2': 'd', '3': 'X', '4': '"', '5': '\t', '6': 'N', '7': '_', '8': '6', '9': 'i', 'a': 'O', 'b': '^', 'c': '/', 'd': '$', 'e': "'", 'f': '0', 'g': 'V', 'h': 'T', 'i': '%', 'j': 'H', 'k': '=', 'l': 'l', 'm': '&', 'n': '?', 'o': ',', 'p': '<', 'q': 'a', 'u': 'F', 'r': '+', 's': '*', 't': '(', 'v': '@', 'w': 'o', 'x': 'p', 'y': 'A', 'z': '4', 'A': 'v', 'B': 'I', 'C': 'f', 'D': 'P', 'E': 'k', 'F': 's', 'G': '5', 'H': '8', 'I': 'U', 'J': ']', 'K': 'h', 'L': 'W', 'M': 'B', 'N': '>', 'O': 'E', 'P': '\\', 'Q': 'y', 'U': 'S', 'R': 't', 'S': '}', 'T': '{', 'V': 'Y', 'W': '\n', 'X': ';', 'Y': 'w', 'Z': 'Q'}.items()] if x == y[1]]))

Test metadata rules, analyzer, and CLI

Source code tests already exist in tests/analyzer/sourcecode. Tests for the metadata rules should exist in tests/analyzer/metadata. Tests should also exist for the analyzer, CLI tool, etc. (mirroring directory structure)

Metadata tests
Analyzer tests
