
google / magika


Detect file content types with deep learning

Home Page: https://google.github.io/magika/

License: Apache License 2.0

Python 45.35% Assembly 0.06% CSS 0.23% HTML 7.39% Shell 2.45% JavaScript 1.62% C 0.08% Rust 23.65% Smali 0.11% Rich Text Format 4.64% PHP 0.02% SCSS 0.16% Vue 3.48% Dockerfile 0.08% TypeScript 10.68%
deep-learning filetype keras-classification-models keras-models mime-types

magika's Introduction

Magika


Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB, enabling precise file identification within milliseconds, even when running on a single CPU.

In an evaluation with over 1M files and over 100 content types (covering both binary and textual file formats), Magika achieves 99%+ precision and recall. Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners.

You can try Magika without installing anything by using our web demo, which runs locally in your browser!

Here is an example of what the Magika command-line output looks like:

For more context, you can read our initial announcement post on Google's OSS blog.

Highlights

  • Available as a Python command line, a Python API, and an experimental TFJS version (which powers our web demo).
  • Trained on a dataset of over 25M files across more than 100 content types.
  • In our evaluation, Magika achieves 99%+ average precision and recall, outperforming existing approaches.
  • More than 100 content types (see full list).
  • After the model is loaded (this is a one-off overhead), the inference time is about 5ms per file.
  • Batching: you can pass multiple files to the command line and API at the same time, and Magika will use batching to speed up inference. You can invoke Magika with thousands of files at once. You can also use -r to scan a directory recursively.
  • Near-constant inference time, independent of file size: Magika only inspects a limited subset of the file's bytes.
  • Magika uses a per-content-type threshold system to determine whether to "trust" the model's prediction or to return a generic label, such as "Generic text document" or "Unknown binary data" (see the sketch after this list).
  • Supports three prediction modes, which tweak the tolerance to errors: high-confidence, medium-confidence, and best-guess.
  • It's open source! (And more is yet to come.)
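To make the threshold idea concrete, here is a minimal, purely illustrative Python sketch; the function, thresholds, and mode handling are assumptions for illustration and are not Magika's actual implementation:

# Illustrative only, not Magika's implementation: hypothetical per-content-type
# thresholds showing how a low-confidence prediction falls back to a generic label.
GENERIC_TEXT = "Generic text document"
GENERIC_BINARY = "Unknown binary data"

def resolve_label(predicted_label, score, thresholds, looks_textual, mode="high-confidence"):
    """Trust the model's label only if its score clears the per-content-type
    threshold for the chosen prediction mode; otherwise return a generic label."""
    if mode == "best-guess":
        return predicted_label                         # always return the model's best guess
    threshold = thresholds.get(predicted_label, 0.95)  # hypothetical default threshold
    if mode == "medium-confidence":
        threshold = min(threshold, 0.5)                # assumed: a lower, fixed bar
    if score >= threshold:
        return predicted_label
    return GENERIC_TEXT if looks_textual else GENERIC_BINARY

# Example: a low-confidence "python" prediction on binary-looking input
print(resolve_label("python", 0.42, {"python": 0.99}, looks_textual=False))
# -> "Unknown binary data"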

For more details, see the documentation for the python package and for the js package (dev docs).

Table of Contents

  1. Getting Started
    1. Installation
    2. Running in Docker
    3. Usage
      1. Python command line
      2. Python API
      3. Experimental TFJS model & npm package
  2. Development Setup
  3. Important Documentation
  4. Known Limitations & Contributing
  5. Frequently Asked Questions
  6. Additional Resources
  7. Citation
  8. License
  9. Disclaimer

Getting Started

Installation

Magika is available as magika on PyPI:

$ pip install magika

Running in Docker

git clone https://github.com/google/magika
cd magika/
docker build -t magika .
docker run -it --rm -v $(pwd):/magika magika -r /magika/tests_data

Usage

Python command line

Examples:

$ magika -r tests_data/
tests_data/README.md: Markdown document (text)
tests_data/basic/code.asm: Assembly (code)
tests_data/basic/code.c: C source (code)
tests_data/basic/code.css: CSS source (code)
tests_data/basic/code.js: JavaScript source (code)
tests_data/basic/code.py: Python source (code)
tests_data/basic/code.rs: Rust source (code)
...
tests_data/mitra/7-zip.7z: 7-zip archive data (archive)
tests_data/mitra/bmp.bmp: BMP image data (image)
tests_data/mitra/bzip2.bz2: bzip2 compressed data (archive)
tests_data/mitra/cab.cab: Microsoft Cabinet archive data (archive)
tests_data/mitra/elf.elf: ELF executable (executable)
tests_data/mitra/flac.flac: FLAC audio bitstream data (audio)
...
$ magika code.py --json
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika -h
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use --list-output-
                                  content-types for the list of supported
                                  output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to [email protected] or via GitHub issues.

See python documentation for detailed documentation.
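If you want to consume the CLI output from a script, one option is to parse the --json output shown above. A minimal sketch, assuming magika is on the PATH and the JSON structure matches the example earlier in this section:

import json
import subprocess

# Run the CLI with --json and parse the result; the keys ("path", "output",
# "ct_label", "score") follow the example output shown above.
proc = subprocess.run(
    ["magika", "--json", "code.py"],
    capture_output=True, text=True, check=True,
)
for entry in json.loads(proc.stdout):
    print(entry["path"], entry["output"]["ct_label"], entry["output"]["score"])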

Python API

Examples:

>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.ct_label)
markdown

See python documentation for detailed documentation.
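To identify files on disk with the same API, one approach is to read each file's bytes and pass them to identify_bytes. A minimal sketch; the directory path is hypothetical, and accessing res.output.score assumes the result object mirrors the CLI's JSON output:

from pathlib import Path
from magika import Magika

m = Magika()
for path in Path("tests_data/basic").iterdir():   # hypothetical directory
    if path.is_file():
        res = m.identify_bytes(path.read_bytes())
        # ct_label as in the example above; score assumed to mirror the JSON output
        print(path, res.output.ct_label, res.output.score)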

Experimental TFJS model & npm package

We also provide Magika as an experimental npm package for people interested in using it in a web app. Note that the Magika JS implementation is significantly slower, and you should expect to spend 100ms+ per file.

See js documentation for the details.

Development Setup

We use poetry for development and packaging:

$ git clone https://github.com/google/magika
$ cd magika/python
$ poetry shell && poetry install
$ magika -r ../tests_data

To run the tests:

$ cd magika/python
$ poetry shell
$ pytest tests/

Important Documentation

Known Limitations & Contributing

Magika significantly improves over the state of the art, but there's always room for improvement! More work can be done to increase detection accuracy, add support for additional content types, provide bindings for more languages, etc.

This initial release is not targeting polyglot detection, and we're looking forward to seeing adversarial examples from the community. We would also love to hear from the community about encountered problems, misdetections, feature requests, the need for support for additional content types, etc.

Check our open GitHub issues to see what is on our roadmap and please report misdetections or feature requests by either opening GitHub issues (preferred) or by emailing us at [email protected].

When reporting misdetections, you may want to use $ magika --generate-report <path> to generate a report with debug information, which you can include in your GitHub issue.

NOTE: Do NOT send reports about files that may contain PII; the report contains a (small) part of the file content!

See CONTRIBUTING.md for details.

Frequently Asked Questions

We have collected a number of FAQs here.

Additional Resources

Citation

If you use this software for your research, please cite it as:

@software{magika,
author = {Fratantonio, Yanick and Invernizzi, Luca and Zhang, Marina and Metitieri, Giancarlo and Kurt, Thomas and Galilee, Francois and Petit-Bianco, Alexandre and Farah, Loua and Albertini, Ange and Bursztein, Elie},
title = {{Magika content-type scanner}},
url = {https://github.com/google/magika}
}

Security vulnerabilities

Please contact us directly at [email protected]

License

Apache 2.0; see LICENSE for details.

Disclaimer

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

magika's People

Contributors

abhin2002, anzerr, bswck, corkamig, dependabot[bot], devilkadabra69, ebursztein, eltociear, faribauc, fr0gger, gaby, giancarlo-metitieri, ia0, invernizzi, jogo-, michaelhinrichs, mohit-gaur, parth-p-shah, philparzer, regadas, reyammer, rhevin, step-security-bot, themythologist, uberguidoz, worrycare


magika's Issues

Magika to detect text file encoding

Hi,
I was wondering if magika could be used in the future to detect the encoding of text files (utf-8, ascii, iso-8859-1, cp-1252, etc.), as this is not an easy task.
Thanks

Consider yanking the 0.1.0 release

If you have Python >= 3.12, you get the 0.1.0 release, which is basically empty. Please consider yanking that release so it doesn't get picked up.

npm package fails to run

import { readFile } from "fs/promises";
import { Magika } from "magika";

const data = await readFile("./package.json");
const magika = new Magika();
await magika.load();
const prediction = await magika.identifyBytes(data);
console.log(prediction);
/node_modules/magika/magika.js:73
  async load({ modelURL, configURL }) {
               ^

TypeError: Cannot destructure property 'modelURL' of 'undefined' as it is undefined.
    at Magika.load (file:///home/mrcool/dev/m/node_modules/magika/magika.js:73:16)
    at file:///home/mrcool/dev/m/index.js:6:14

"magika": "^0.2.5"
Node.js v20.10.0

Incorrect JSON/NDJSON detection

These are pretty minor, but:

  1. A simple JSON example that is recognized as "Generic text document (text)":
    no_whitespace.json. If you add whitespace after ":", it becomes "JSON document (code)".
  2. The same example with multiple newline-delimited JSON objects is recognized as JSON, which is understandable but also incorrect, as an NDJSON document is not valid JSON: ndjson.txt (a quick sketch of the distinction follows below).

Magika version: 0.5.0
Default model: standard_v1
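To illustrate point 2: an NDJSON document as a whole is not valid JSON, even though each individual line is. A minimal sketch with made-up data:

import json

ndjson = '{"a": 1}\n{"a": 2}\n'        # two newline-delimited JSON objects

try:
    json.loads(ndjson)                 # fails: the whole document is not one JSON value
except json.JSONDecodeError:
    print("not a single JSON document")

rows = [json.loads(line) for line in ndjson.splitlines() if line]
print(rows)                            # each line parses fine: [{'a': 1}, {'a': 2}]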

Make application / bindings for Rust

When using magika as a python command line tool, the biggest overhead comes from starting the python interpreter itself and loading all the libraries.

ONNX model loading itself is relatively fast (~30ms) and the content-type inference is ~5ms, but the overall Python CLI takes about ~300ms.

Having a client written in Rust may make the CLI 8x/9x/10x faster.

This is not a concern for large-scale automated pipelines (as they would bootstrap the library and load the model only once), but it is annoying for one-off CLI use cases.

If this works out, then we could make this new rust client the one installed via pip install magika (together with the python API).

We may also want to have a proper way to make magika available to rust applications.

Note for external contributors: this may become the future de-facto magika client, so its design will need extra care. We are reaching out internally as well, so please let us know if you are starting to work on this so that we can coordinate better and/or avoid duplicated work. Thank you!

Remote files and range requests

Are there any plans for making this work with remote files as well? If so, I would be curious whether range requests could be supported, as opposed to plainly downloading the entire file.

Improve documentation

Current docs need improvements:

  • reorganize so that they are not scattered around in different places.
  • improve docs for the Python API (e.g., document the Magika() constructor, the returned objects, etc.)
  • improve the table on supported content types and plans for unsupported ones, aka "what's the status"

Need to make sure the PyPI package has --generate-report

It seems the current PyPI package doesn't have a proper installation?
Also make sure the report contains all the information we need to debug a false positive, plus a proper warning. It should contain:

  • a disclaimer not to report private or sensitive files
  • instructions on how to report -- same as the README + a URL on where to report (short link to a Google form, which must be public for external users + email optional if we have questions)
  • information about the version / model / environment
  • the raw data we need to create the test files (basically the raw bytes we look at)
  • the model verdict

The PyPI version doesn't have this output, so I'm unsure whether all of this is included.

Create FP report template

Create a bug template that allows reporting a misclassification using the same format as the report form, but as a GitHub bug for easier tracking, and add the misclassification label to it.

Add suffix as output

Hi, just perfect timing! I just restored "some" files (40GB) with [1], but PhotoRec's file type detection set some wrong file types. It would be super nice if the program could show me the "suffix" (similar to -i), i.e. ".md, .py, .rst, ...", so I can move each file to a folder.

[1] https://www.cgsecurity.org/wiki/PhotoRec

Current output (-i):

❯ magika -r /opt/SORTED/TXT -i  
/opt/SORTED/TXT/1/100464.txt: text/plain
/opt/SORTED/TXT/1/101476.txt: text/x-c
/opt/SORTED/TXT/1/101485.txt: text/x-c
/opt/SORTED/TXT/1/101565.txt: text/x-c
/opt/SORTED/TXT/1/101700.txt: text/plain
/opt/SORTED/TXT/1/101729.txt: text/plain
/opt/SORTED/TXT/1/101786.txt: text/x-asm
/opt/SORTED/TXT/1/101812.txt: text/x-asm
/opt/SORTED/TXT/1/101941.txt: text/x-asm
/opt/SORTED/TXT/1/105997.txt: text/x-makefile
/opt/SORTED/TXT/1/107439.txt: text/plain
/opt/SORTED/TXT/1/108033.txt: text/plain
/opt/SORTED/TXT/1/109413.txt: text/markdown
/opt/SORTED/TXT/1/111266.txt: application/javascript
/opt/SORTED/TXT/1/114086.txt: text/x-python
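Until something like this is built in, one possible workaround is to map the -i MIME-type output to an extension with Python's mimetypes module. A minimal sketch; the paths come from the example above, the destination layout is hypothetical, and mimetypes may not know every MIME type Magika emits:

import mimetypes
import subprocess
from pathlib import Path

out = subprocess.run(
    ["magika", "-r", "-i", "--no-colors", "/opt/SORTED/TXT"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    path, _, mime = line.rpartition(": ")
    suffix = mimetypes.guess_extension(mime) or ".unknown"
    # Print the planned move; swap in shutil.move() once the mapping looks right.
    print(f"{path} -> /opt/SORTED/BY_TYPE/{suffix.lstrip('.')}/{Path(path).name}")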

Magika ignores the -m commandline option in some instances

magika file.bin
file.bin: Unknown binary data (unknown) [Low-confidence model best-guess: ISO 9660 CD-ROM filesystem data (archive), score=83]

yields the exact same thing as

magika -m high-confidence file.bin
file.bin: Unknown binary data (unknown) [Low-confidence model best-guess: ISO 9660 CD-ROM filesystem data (archive), score=83]

But setting it to medium-confidence gives a slightly different output. Is it a design choice to ignore the high-confidence option when it's likely to fail? Or a bug?

Be Less Confident About "Guesses"

I'm aware that, in its present incarnation, Magika is trained on a very small subset of all possible file types, and as a result, when fed types on which it has not been trained, incorrect responses are to be expected - but, at present, it's throwing out some very bizarre guesses.

As a worst-case experiment, Magika was run across the root directory of a CU Amiga cover CD from 1995 - full of files definitely not in the training dataset, and most likely not in the test corpus either.

[screenshot of the Magika output on the CD's root directory omitted]

Here we see several "unknown" results, which are to be expected, but also several completely-incorrect guesses: Amiga INFO Icon files are misidentified as BMP images, TIFF images, and ISO 9660 ROM images(!) - despite them all being the same format.

As feeding a tool that's expecting an ISO 9660 image an Amiga icon is likely to end poorly, I'd suggest the tool needs to be less confident when encountering something outwith its training data.

Detection of error situations

If I use the cmd command "copy /b ..." to disguise the file, I can deceive the detection classification. The test result is incorrect.
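For context, copy /b a+b out concatenates the files byte-for-byte, so the classifier sees a blend of two formats in one file. A minimal sketch of the same concatenation in Python; the file names are hypothetical:

from pathlib import Path

# Equivalent of `copy /b image.jpg+script.py disguised.bin` on Windows:
# write the raw bytes of both inputs into a single output file.
parts = [Path("image.jpg"), Path("script.py")]     # hypothetical inputs
Path("disguised.bin").write_bytes(b"".join(p.read_bytes() for p in parts))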

Minify production model.json and config.json

https://google.github.io/magika/model/model.json and https://google.github.io/magika/model/config.json are dynamically loaded by the library, but aren’t minified.

| File | .json size | .min.json size | .json.gz size | .min.json.gz size |
| --- | --- | --- | --- | --- |
| config | 9,519 bytes | 5,649 bytes | 782 bytes | 719 bytes |
| model | 71,626 bytes | 48,329 bytes | 3,722 bytes | 3,254 bytes |

Minifying the config results in an 8.06% file size decrease of the gzipped bundle, and minifying the model yields a 12.57% file size decrease of the gzipped bundle.
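For reference, the minification asked for here can be reproduced with the standard library; a minimal sketch, run against local copies of the two hosted files:

import gzip
import json
from pathlib import Path

for name in ["config.json", "model.json"]:          # local copies of the hosted files
    raw = Path(name).read_text()
    minified = json.dumps(json.loads(raw), separators=(",", ":"))
    print(
        name,
        len(raw.encode()), len(minified.encode()),                                 # plain sizes
        len(gzip.compress(raw.encode())), len(gzip.compress(minified.encode())),   # gzipped sizes
    )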

sqlite3 files aren't detected correctly

Hi. Interesting project!

When I run it against one of my directories, I found it has some problems with sqlite3 files.

Simple example:

sqlite3 y.db 'create table y ( int id )' ; magika y.db ; file y.db

y.db: ISO 9660 CD-ROM filesystem data (archive)
y.db: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 2, cookie 0x1, schema 4, UTF-8, version-valid-for 1

In most cases, a sqlite3 file is detected as ISO 9660, but I also had one case of a RedHat Package Manager archive.
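For what it's worth, SQLite 3 database files start with a fixed 16-byte header, so a quick sanity check is possible without a model. A minimal sketch against the y.db file created above:

from pathlib import Path

# SQLite 3 databases begin with the 16-byte header string "SQLite format 3\x00".
header = Path("y.db").read_bytes()[:16]
print(header == b"SQLite format 3\x00")   # True for a real SQLite database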

Hello from the CCCS! 🍁

We at the Assemblyline project perform our own file identification to ensure files are routed correctly to the corresponding file analysis modules. That is why the magika project is very interesting to us.

We have a set of files used for unit testing whose file types we are confident* about. We ran that set against the magika tool and found some discrepancies: see the attached CSV.

All of the SHA256 hashes can be found on VirusTotal, and we would love to collaborate (join our Discord!) to improve magika to the point where we can integrate it into Assemblyline :)

AL_MAGIKA_COMP_revised.csv

Cheers,
🇨🇦

Add typescript type definitions for js lib

Hi,

it would be great if you could add or auto-generate type definitions for the JS lib so it can be used with TypeScript.

Since there isn't a CI job to handle automatic releases and I'm not aware of your release process and plans for the lib, I'd rather not create a PR for this myself yet.

[Feature Request] COBOL File Support

Per the project README, I'd like to open a feature request to support an expanded set of file types. Specifically, I would be interested in support for the COBOL language.

I see per the supported content types list that it is not currently supported:

| 13 | coff | application/x-coff | Intel 80386 COFF |

Would it be possible to support this language?

Thanks!

Binary file identification is basic compared to GNU file

I was curious how this compared to GNU file for detecting if a binary is statically compiled or not.
Magika correctly identifies the file as an ELF or Mach-O executable, but that's it. It does not go into deeper details such as whether it is statically linked, which architecture it is for, whether it has debug symbols or is stripped, etc. - details that file will show.

Can plans be clarified as to whether magika is to be more specific with its identification of files, or if it's intentional to limit scope to high level generic classifications (e.g. simple mach-o, elf), and not get into details like targeted platforms (amd64, arm64), linking (statically linked), runtimes (Go), etc.?

Status Quo file

file on release binary

$ file release/goss-linux-amd64
release/goss-linux-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=NZ3NC0NaQ1A__VXTMP32/C4GwgAOXz8tq1OWGhmJI/jNyZlKXZG8ghsmw3IBam/kO2y_axJzsqQZmtK9R4a, stripped

file on a debug binary

$ file debug/goss-linux-amd64
debug/goss-linux-amd64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=schnmn0cqD0OYUAQLlln/nTtv8RRejTipPl0QtX2L/5rh0P1Upx9VxsKYQJ-pN/eZxsN78-M_3elAQC2Na_, with debug_info, not stripped

Magika

Magika on a release binary

$ docker run --rm -it -v "$(pwd):/src:ro" google/magika /src/release/goss-linux-amd64
/src/release/goss-linux-amd64: ELF executable (executable)

Magika on a debug binary

$ docker run --rm -it -v "$(pwd):/src:ro" google/magika /src/debug/goss-linux-amd64
/src/debug/goss-linux-amd64: ELF executable (executable)
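For context, some of the details file reports (bit width, endianness, target architecture) come straight from the fixed-size ELF header, so they can be read without a model. A minimal sketch; the path is taken from the example above and the machine table is a small subset:

import struct
from pathlib import Path

MACHINES = {0x03: "i386", 0x3E: "x86-64", 0xB7: "aarch64"}   # small subset of e_machine values

hdr = Path("release/goss-linux-amd64").read_bytes()[:20]
assert hdr[:4] == b"\x7fELF", "not an ELF file"
bits = 64 if hdr[4] == 2 else 32                             # EI_CLASS: 1 = 32-bit, 2 = 64-bit
endian = "<" if hdr[5] == 1 else ">"                         # EI_DATA: 1 = little-endian
e_type, e_machine = struct.unpack(endian + "HH", hdr[16:20])
kind = {2: "executable", 3: "shared object / PIE"}.get(e_type, hex(e_type))
print(f"ELF {bits}-bit {MACHINES.get(e_machine, hex(e_machine))} {kind}")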

[NEW FILE TYPE REQUEST]

What type of file would you like magika to detect?
Examples:

  • "Nintendo Binary Revolution RESource (.brres)"
  • "Valve Map Format file (.vmf)"
  • "Blender save file (.blend)"
  • "RPG Maker 2000/2003 Lcf DataBase (.ldb)"
  • "COLLADA file (.dae)"
  • "Unreal Engine Asset (.uasset)"

What software can create/open these files?
Examples:

Where can these files be found?
Examples:

If possible, please provide a specification for this file type.
Examples:

Additional context
Add any other context or screenshots about the feature request here.
