Mime-Detective

Mime-Detective is a blazing-fast, low-memory file type detector for .NET. It uses Magic-Number and Magic-Word signatures to accurately identify over 14,000 different file variants by analyzing a raw stream or array of bytes. It also allows you to easily convert between file extensions and mime types.

How Does it Work?

Mime-Detective is a signature-based detection library that looks for patterns within the raw content of a file. Based on the presence (or absence) of certain bytes, the type of most files can be accurately predicted. For example, every JPEG file starts with the bytes FF D8 FF, and JFIF-style .jpg files typically continue with E0 00 10 4A 46 49 46 (the last four bytes are the ASCII text "JFIF").
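
The idea can be illustrated without the library. The following is a minimal sketch in plain .NET, not Mime-Detective's actual implementation. The class and method names are hypothetical, and the check only covers the three-byte prefix FF D8 FF that JPEG files begin with.

```csharp
// Hypothetical helper for illustration only; not part of Mime-Detective.
public static class MagicNumberSketch {

    // JPEG files begin with the bytes FF D8 FF.
    private static readonly byte[] JpegPrefix = { 0xFF, 0xD8, 0xFF };

    // Returns true when the content starts with the JPEG prefix.
    public static bool LooksLikeJpeg(byte[] content) {
        if (content.Length < JpegPrefix.Length) {
            return false; // Too short to contain the signature.
        }
        for (var i = 0; i < JpegPrefix.Length; i++) {
            if (content[i] != JpegPrefix[i]) {
                return false; // First mismatching byte rules the type out.
            }
        }
        return true;
    }
}
```

A real detector compares the content against many such signatures and scores the candidates, which is why a definition pack with more signatures costs more time and memory.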

Limitations

Binary files work best because they often have an identifiable signature. Some file types, such as .txt files, have no identifying mark and may not be well predicted. Because of this, you may need to choose a fallback type for cases where Mime-Detective cannot predict the file type.
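
A fallback could look something like the sketch below. It assumes that each grouped result returned by ByMimeType() exposes MimeType and Points properties; those property names are assumptions and are not shown elsewhere in this file, so check them against the version you have installed. The sample content and the choice of "application/octet-stream" as the fallback are placeholders.

```csharp
using System.Linq;
using System.Text;
using MimeDetective;

var Inspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.All()
}.Build();

// Placeholder content: plain text usually carries no signature.
byte[] Content = Encoding.UTF8.GetBytes("just some plain text");

var Results = Inspector.Inspect(Content);

// Assumption: grouped results expose MimeType and Points.
var Detected = Results.ByMimeType()
    .OrderByDescending(x => x.Points)
    .FirstOrDefault()?.MimeType;

// Use a generic type when nothing matched.
var MimeType = Detected ?? "application/octet-stream";
```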

Getting Started

There are three main ways you can use Mime-Detective.

  • The Default definition pack which includes a very small set of detection rules.
  • The Condensed definition pack which includes an expanded set of detection rules.
  • The Exhaustive definition pack which includes over 14,000 detection rules.

More information on these definitions is included toward the end of this file.

Installing from Nuget

Installing the Default (Small) Definition Pack

install-package Mime-Detective

Installing the Condensed (Medium) Definition Pack

install-package Mime-Detective
install-package Mime-Detective.Definitions.Condensed

Installing the Exhaustive (Large) Definition Pack

install-package Mime-Detective
install-package Mime-Detective.Definitions.Exhaustive

Create the ContentInspector

Create the Default ContentInspector

using MimeDetective;
var Inspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.All()
}.Build();

Create the Condensed ContentInspector

using MimeDetective;
var Inspector = new ContentInspectorBuilder() {
    Definitions = new MimeDetective.Definitions.CondensedBuilder() {
        UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial //Change this to be your usage type
    }.Build()
}.Build();

Create the Exhaustive ContentInspector

using MimeDetective;
var Inspector = new ContentInspectorBuilder() {
    Definitions = new MimeDetective.Definitions.ExhaustiveBuilder() {
        UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial //Change this to be your usage type
    }.Build()
}.Build();

Inspect Content

Once you have a ContentInspector, you can use it to inspect a stream, a file, or an array of bytes. Use whichever overload matches your input:

//From an array of bytes
var Results = Inspector.Inspect(ContentByteArray);
//From a stream
//var Results = Inspector.Inspect(ContentStream);
//From a file name
//var Results = Inspector.Inspect(ContentFileName);

Group Results by File Extension or Mime Type

var ResultsByFileExtension = Results.ByFileExtension();
var ResultsByMimeType = Results.ByMimeType();
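
As a rough sketch of how the grouped results might be consumed, the example below prints each candidate. It assumes the grouped results expose Extension, MimeType, and Points properties; these names are assumptions, not confirmed by this file. The file path is a hypothetical input taken from the first command-line argument.

```csharp
using System;
using MimeDetective;

var Inspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.All()
}.Build();

// Hypothetical input: the path of the file to inspect.
var Results = Inspector.Inspect(args[0]);

// Assumption: each entry exposes Extension and Points.
foreach (var Match in Results.ByFileExtension()) {
    Console.WriteLine($"{Match.Extension} /// {Match.Points} Points");
}

// Assumption: each entry exposes MimeType and Points.
foreach (var Match in Results.ByMimeType()) {
    Console.WriteLine($"{Match.MimeType} /// {Match.Points} Points");
}
```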

Definition Packs

Definition packs make it easy to expand or limit the number of definitions that the Inspector will use. You can use one of the provided definition packs, create a limited subset of a definition pack, or create entirely new definition packs from scratch.

Default Definitions

The default definitions are included with the Mime-Detective nuget package and are located in the MimeDetective.Definitions.Default static class. You can create a copy of all definitions by calling MimeDetective.Definitions.Default.All() or just a limited subset by calling something like MimeDetective.Definitions.Default.FileTypes.Documents.All().
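
For instance, an inspector limited to the document definitions could be built roughly like this. It is a sketch that only combines calls already mentioned in this file; the variable name is arbitrary.

```csharp
using MimeDetective;

// Only the document definitions from the default pack are loaded,
// so other types (images, audio, archives, ...) will not be reported.
var DocumentInspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.FileTypes.Documents.All()
}.Build();
```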

This pack can be used by anyone for any purpose and requires no additional licensing.

Type            Extensions
Archives        7z bz2 gz rar tar zip
Audio           flac m4a mid midi mp3 ogg wav
Cryptographic   aes pkr skr
Documents       doc docx dwg pdf ppt pptx rtf xls xlsx
Disk Images     bin dmg iso toast vcd
Email Files     eml pst
Executables     dll exe elf coff
Images          bmp gif ico jpeg jpg png psd tiff
Text            txt
Video           3gp flv mov mp4
Xml             xml

Mime-Detective.Definitions.Condensed

install-package Mime-Detective.Definitions.Condensed

This is a condensed library containing the most common file signatures.

It is derived from the publicly available TrID file signatures which may be used for personal/non-commercial use (free) or with a paid commercial license (usually around 300€).

Create a copy of these definitions by using the following code:

var AllDefinitions = new MimeDetective.Definitions.CondensedBuilder() {
    UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial //Change this to be your usage type
}.Build();

Type            Extensions
Audio           aif cda mid midi mp3 mpa ogg wav wma wpl
Video           3g2 3gp avi flv h264 m4v mkv mov mp4 mpg mpeg rm swf vob wmv
Archives        7z arj cab deb pkg rar rpm tar.gz z zip
Disk Images     bin dmg iso toast vcd
Email Files     eml emlx msg oft ost pst vcf
Executables     apk exe com jar msi
Fonts           fnt fon otf ttf
Images          ai bmp cur gif ico icns jpg jpeg png ps psd svg tif tiff
Presentations   key odp pps ppt pptx
Spreadsheets    ods xls xlsm xlsx
Documents       doc docx odt pdf rtf tex wpd

Mime-Detective.Definitions.Exhaustive

install-package Mime-Detective.Definitions.Exhaustive

This library contains the exhaustive set of 14,000+ file signatures.

It is derived from the publicly available TrID file signatures which may be used for personal/non-commercial use (free) or with a paid commercial license (usually around 300€).

Create a copy of these definitions by using the following code:

var AllDefinitions = new MimeDetective.Definitions.ExhaustiveBuilder() {
    UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial //Change this to be your usage type
}.Build();

Custom Definitions

Here is an example showing how to create a custom definition pack. This example uses all of the predefined "MP3" formats as well as a custom ".magic" file type:

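using System.Collections.Generic;
using System.Collections.Immutable;
using MimeDetective;
using MimeDetective.Storage; //Assumed namespace of Definition, Segment, and Category

internal static class CustomContentInspector {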

    public static ContentInspector Instance { get; }

    static CustomContentInspector() {

        var MyDefinitions = new List<Definition>();
                
        //Add a predefined definition
        MyDefinitions.AddRange(MimeDetective.Definitions.Default.FileTypes.Audio.MP3());

        //Add a custom definition
        MyDefinitions.Add(new() {
            File = new() {
                Categories = new[] { Category.Other }.ToImmutableHashSet(),
                Description = "Magic File Type",
                Extensions = new[] { "magic" }.ToImmutableArray(),
                MimeType = "application/octet-stream",
            },
            //All of these rules must match
            Signature = new Segment[] {
                StringSegment.Create("MAGIC"), //anywhere in the file, expect "MAGIC" (exact case)
                PrefixSegment.Create(100, "4d 41 47 49 43") //At offset 100 in the file, expect the bytes "MAGIC".
            }.ToSignature(),
        });

        Instance = new ContentInspectorBuilder() {
            Definitions = MyDefinitions,
            StringSegmentOptions = new() {
                OptimizeFor = MimeDetective.Engine.StringSegmentResourceOptimization.HighSpeed,
            },
        }.Build();
    }

}

Optimizing/Balancing Performance and Memory

The ContentInspector is designed for speed. To get the best performance and the lowest memory usage, keep the following points in mind.

1. Trim the Data You Don't Need

If you are positive that a file is going to be one of a few different types, create a definition set that only contains those definitions and trim out unnecessary fields.

var AllDefinitions = new MimeDetective.Definitions.ExhaustiveBuilder() {
    UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial
}.Build();

var Extensions = new[] {
    "aif", "cda", "mid", "midi", "mp3", "mpa", "ogg", "wav", "wma", "wpl",
}.ToImmutableHashSet(StringComparer.InvariantCultureIgnoreCase);

var ScopedDefinitions = AllDefinitions
    .ScopeExtensions(Extensions) //Limit results to only the extensions provided
    .TrimMeta() //If you don't care about the meta information (definition author, creation date, etc)
    .TrimDescription() //If you don't care about the description
    .TrimMimeType() //If you don't care about the mime type
    .ToImmutableArray();

var Inspector = new ContentInspectorBuilder() {
    Definitions = ScopedDefinitions,
}.Build();

2. Slow Initialization = Fast Execution

When the ContentInspector is first built, it performs optimizations to ensure the fastest execution. This is a cost best paid only once. If you have a list of files to analyze, build the Inspector once and reuse it. Do not create a new Inspector every time you need to detect a single file.
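
A sketch of the build-once pattern, using only the calls shown earlier in this file. The folder path is a hypothetical input taken from the first command-line argument, and the printed summary is just for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using MimeDetective;

// Build once: the one-time optimization cost is paid here.
var Inspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.All()
}.Build();

// Reuse the same Inspector for every file in the folder.
foreach (var FileName in Directory.EnumerateFiles(args[0])) {
    var Results = Inspector.Inspect(FileName);
    Console.WriteLine($"{FileName}: {Results.Count()} matching definition(s)");
}
```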

3. Parallel = True/False

The ContentInspectorBuilder.Parallel option controls whether multiple threads are used to perform detections. If you have many definitions or want to make full use of your CPU, set it to true. If you have few definitions or want more balanced CPU usage, set it to false.
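
For example, with the small default pack you might turn parallelism off. This is a sketch, and whether it helps depends on your workload, so measure before settling on a value.

```csharp
using MimeDetective;

var Inspector = new ContentInspectorBuilder() {
    Definitions = MimeDetective.Definitions.Default.All(),
    Parallel = false, //Small definition set: keep detection on a single thread
}.Build();
```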

4. Read Definitions Once

Materializing definitions causes a new instance of each definition to be created. If you are going to use the same definitions for multiple purposes, load them once and reuse them.

var AllDefinitions = new MimeDetective.Definitions.ExhaustiveBuilder() {
    UsageType = MimeDetective.Definitions.Licensing.UsageType.PersonalNonCommercial
}.Build();

var Inspector = new ContentInspectorBuilder() {
    Definitions = AllDefinitions,
}.Build();

var MimeTypeToFileExtensions = new MimeTypeToFileExtensionLookupBuilder() {
    Definitions = AllDefinitions,
}.Build();

var FileExtensionToMimeTypes = new FileExtensionToMimeTypeLookupBuilder() {
    Definitions = AllDefinitions,
}.Build();

Benchmark

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 2700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.202
  [Host]     : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
  Job-QGXQKV : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT

Platform=X64  Runtime=.NET 6.0  

| Method     | TestFile             | Mean         | Error      | StdDev     | Median       | Ratio  | RatioSD | Gen 0      | Gen 1      | Gen 2     | Allocated |
|----------- |--------------------- |-------------:|-----------:|-----------:|-------------:|-------:|--------:|-----------:|-----------:|----------:|----------:|
| Default    | MindM(...)x.xml [31] | 2.757 ms     | 0.0533 ms  | 0.0692 ms  | 2.752 ms     | 1.00   | 0.00    | 644.5313   | 550.7813   | 515.6250  | 10 MB     |
| Condensed  | MindM(...)x.xml [31] | 106.056 ms   | 2.0622 ms  | 2.3748 ms  | 106.228 ms   | 38.41  | 1.19    | 4000.0000  | 2000.0000  | 1000.0000 | 37 MB     |
| Exhaustive | MindM(...)x.xml [31] | 1,056.068 ms | 18.2700 ms | 17.0898 ms | 1,058.858 ms | 385.27 | 10.59   | 36000.0000 | 15000.0000 | 4000.0000 | 236 MB    |
| Default    | MixedExe.exe         | 9.583 ms     | 0.8167 ms  | 2.4080 ms  | 10.818 ms    | 1.00   | 0.00    | 1109.3750  | 1023.4375  | 976.5625  | 11 MB     |
| Condensed  | MixedExe.exe         | 115.825 ms   | 2.2480 ms  | 3.2951 ms  | 115.130 ms   | 12.74  | 4.48    | 4750.0000  | 2500.0000  | 1250.0000 | 38 MB     |
| Exhaustive | MixedExe.exe         | 1,048.981 ms | 20.4143 ms | 31.1748 ms | 1,054.640 ms | 114.04 | 39.50   | 36000.0000 | 15000.0000 | 4000.0000 | 236 MB    |
| Default    | imagesBy7zip.zip     | 56.191 ms    | 0.3438 ms  | 0.3216 ms  | 56.173 ms    | 1.00   | 0.00    | 900.0000   | 800.0000   | 700.0000  | 25 MB     |
| Condensed  | imagesBy7zip.zip     | 165.268 ms   | 1.9812 ms  | 1.7563 ms  | 165.071 ms   | 2.94   | 0.04    | 6000.0000  | 4000.0000  | 3000.0000 | 52 MB     |
| Exhaustive | imagesBy7zip.zip     | 1,133.915 ms | 14.8815 ms | 13.9202 ms | 1,137.890 ms | 20.18  | 0.27    | 35000.0000 | 14000.0000 | 4000.0000 | 251 MB    |
| Default    | micro(...)f.pdf [23] | 5.161 ms     | 0.1056 ms  | 0.3046 ms  | 5.161 ms     | 1.00   | 0.00    | 1109.3750  | 1023.4375  | 976.5625  | 12 MB     |
| Condensed  | micro(...)f.pdf [23] | 118.030 ms   | 1.9950 ms  | 1.8661 ms  | 118.048 ms   | 23.07  | 1.83    | 4800.0000  | 2400.0000  | 1200.0000 | 39 MB     |
| Exhaustive | micro(...)f.pdf [23] | 1,054.044 ms | 19.6671 ms | 18.3966 ms | 1,059.507 ms | 205.84 | 13.67   | 36000.0000 | 15000.0000 | 4000.0000 | 238 MB    |
| Default    | test.bmp             | 13.465 ms    | 0.3585 ms  | 1.0572 ms  | 13.742 ms    | 1.00   | 0.00    | 1046.8750  | 953.1250   | 921.8750  | 22 MB     |
| Condensed  | test.bmp             | 119.457 ms   | 2.3831 ms  | 5.8904 ms  | 119.411 ms   | 9.12   | 0.96    | 4666.6667  | 2333.3333  | 1333.3333 | 49 MB     |
| Exhaustive | test.bmp             | 1,067.179 ms | 20.8197 ms | 19.4748 ms | 1,069.120 ms | 87.67  | 3.37    | 36000.0000 | 15000.0000 | 4000.0000 | 254 MB    |
| Default    | wavVLC.wav           | 8.037 ms     | 0.2070 ms  | 0.6104 ms  | 8.015 ms     | 1.00   | 0.00    | 765.6250   | 687.5000   | 640.6250  | 15 MB     |
| Condensed  | wavVLC.wav           | 123.661 ms   | 1.4984 ms  | 1.3283 ms  | 123.824 ms   | 15.39  | 0.84    | 5000.0000  | 3000.0000  | 2000.0000 | 45 MB     |
| Exhaustive | wavVLC.wav           | 1,054.265 ms | 20.5831 ms | 20.2153 ms | 1,060.961 ms | 132.62 | 8.27    | 36000.0000 | 15000.0000 | 4000.0000 | 243 MB    |

mime-detective's People

Contributors

tonyvalenti, simader, jeffward01

mime-detective's Issues

The GL Transmission Format MIME Type is incorrect

The binary GL Transmission Format with a glb extension should have model/gltf-binary as its MIME type rather than application/octet-stream. Consequently, the MIME type of the non-binary GL Transmission Format with a gltf extension should be model/gltf+json.

Add StrongNaming

Would it be possible to add a strong name to this package/dll?
We have a .NET 6 WPF application with strong naming enforced.
Best,
Andreas

Upgrade path from the previous nuget owner?

I am quite surprised the owner of the nuget package has changed.

The library and license have changed.

Can you state what changes? And how to update?

If the library is not the same, why not use a different nuget name?

This change looks very suspicious.


Repo history:
Original repo was https://github.com/Muraad/Mime-Detective
Nuget 0.0.6-beta5 pointed to https://github.com/clarkis117/Mime-Detective.git (repo does not exist any more)
clarkis's repo seems to have moved to https://github.com/TonyValenti/Mime-Detective-clarkis117
Is TonyValenti a member of https://github.com/MediatedCommunications?

Nuget license information is incorrect and inconsistent with code itself regarding commercial use.

I just almost added this package, but then found out the code makes a distinction between commercial and non-commercial use, requiring a license for the former.

var Error = "Please change your usage type or visit https://mark0.net/soft-tridnet-e.html to purchase a license.";

Please fix the licensing information so that people are informed about this up front and are not misled.
Would people who use the examples without reading the UsageType in them be violating another license for the TrID definitions? I don't know. Either way, it should probably be addressed in the license info up front.

JSON detection

Thank you for a nice library.

I've been testing different file types and overall things look good but I haven't been able to get Mime Detective to detect JSON files (based on the byte array). Is this something that should/could be supported?

Plain old text files?

I don't seem to be able to get MD to detect "plain old text" files. I see the various magic strings that are looked for in Default.FileTypes.Text.cs, but a plain old text file saved from, say Notepad on Windows, does not have a magic string. I do not have the file extension. What could I do here?

big PPTX file(s) identified as zip/zip/pg

Hi,

We are using Mime-Detective library version 23.10.1, with Exhaustive Definitions. We are happy with the coverage of files recognized by library, but lately we've observed some issue with pptx file(s) recognition.

Example here -> 12 MB.pptx (https://github.com/MediatedCommunications/Mime-Detective/files/13298152/12.MB.pptx)

Here is recognition result ->

zip /// 4027 Points /// Open Packaging Conventions container
zip /// 4000 Points /// ZIP compressed archive
pg /// 1000 Points /// PrintFox/Pagefox bitmap (640x800)

this is our code sample, how we use library ->

_inspector = new ContentInspectorBuilder()
{
    Definitions = new MimeDetective.Definitions.ExhaustiveBuilder()
    {
        UsageType = MimeDetective.Definitions.Licensing.UsageType.CommercialPaid
    }.Build()
}.Build();


...

var results = _inspector.Inspect(stream, true);

For other, smaller files the library recognizes pptx correctly. Is there a workaround we can apply on our side, or is this an issue in the detection library that we need to wait for an update to fix?

BR,
Dawid

Yaml file identified as EML file

The file below gets marked as an EML file because the word "FROM" appears in the text. The word obviously comes well after the first few bytes, so I am not sure why the file is being marked as an EML file.

To reproduce, just download the attached file, change its extension to .yaml, and run the inspection on it with all the default definitions loaded.

NOTE : You might not even need to change the extension but I had to for uploading it to GitHub

recipe.txt

"CommercialFree" is not allowed

If a UsageType "CommercialFree" is supported, it shouldn't cause a runtime exception. Instead, I would have expected a reduced set of supported definitions.

If there is no plan to support any definitions with "CommercialFree", this usage type should be removed. Had there been no "CommercialFree" usage type, I would have known from the start that this library is not an option.

How to install this? Screenshots?

Hi,
I'm looking for a fresh tool to detect File Signatures.
This sounds promising, but I have no idea what kind of output it can generate and I can't find any files to run from here.

  • So how do you install and run this? (I don't have/use nuget.)
  • Any binaries to test with somewhere?
  • What is the output?

Cheers!

Inspecting normal System.IO.Streams

I'm seeing an overload of ContentInspector.Inspect that takes an IEnumerable<byte> or a byte [], but I'm not seeing one that takes a System.IO.Stream. This seems suboptimal.

Is there a way to get a byte enumerable from a stream, or do I need to write one manually?

README issue

The two examples both have the same issue:

using MimeDetective;
var Inspector = new ContentInspectorBuilder() {
    Definitions = Definitions.Default.All()
}.Build();

Definitions.Default.All() needs to be fully qualified as MimeDetective.Definitions.Default.All()

some missing types

Hi, is it possible to add a few less common and new types to some of the lists?

.mpp
.webp
.HEIF
.HEVC

ZIP file is detected as KMZ instead of ZIP

Hello,

I have two different zip files, one is 13MB and one is 48MB. The 13MB is correctly being detected as .zip mime type but for the 48MB, it is being detected as KMZ. They basically have the same contents except the other one is just bigger.

I am using the latest Mime-Detective library.

May I request assistance on this please?

Thank you.

Cheers,
Allaine

Another README issue

Hi, where is the Data class?

var AllDefintions = new Definitions.ExhaustiveBuilder() { 
    UsageType = Data.Licensing.UsageType.PersonalNonCommercial
}.Build();

Detection of .eml file types doesn't find any matches

When using Mime-Detective with the Default pack (which supports detection of eml) files we're always seeing no matches returned when bytes from a legitimate .eml file are supplied.

Is there a way to run any sort of diagnostics etc to help determine why it is failing to match? Or is there something else we need to do when trying to match .eml files or other types of file?

All other file types we've tried match correctly, e.g. docx, tif, gif, jpg, pdf etc.

Using code similar to the following with the default pack of definitions:

var inspector = new ContentInspectorBuilder()
{
    Definitions = Default.All(),
    Parallel = true
};

var contentInspector = inspector.Build();

var matches = contentInspector.Inspect(documentBytesArray);

Add CSV to default Text definitions

CSV is a common mime-type for some use cases, and it's no different than a txt file. Could it be added to the default Text file types?

Should be this I think:

public static ImmutableArray<Definition> CSV_Utf8()
{
    return new List<Definition>() {
        new() {
            File = new() {
                Extensions = new[]{"csv"}.ToImmutableArray(),
                MimeType = "text/csv",
                Categories = new[]{
                    Category.Document,
                }.ToImmutableHashSet(),
            },
            Signature = new Segment[] {
                PrefixSegment.Create(0, "EF BB BF")
            }.ToSignature(),
        },
    }.ToImmutableArray();
}

Alternatively, is it possible to just add the extension and mime-type to the TXT_Utf8 definition?

Storage.Definition.File Property not accessible

Hi,

thank you so much for that library!

But I have an issue with vb.net and the Storage.Definition class.
I want to receive the "guessed" MIME of a byte[].
The code works well with C#.

The same code translated to vb.net cannot access the property(function) File(get_file()) of the Storage.Definition class.

If I look into the object browser, I can find

  • File (Property)
  • setFile (Method)
  • getFile (Method)

in the C# Version.

In the vb.net version I can only see the "File"-property in the object browser as a public Property returning a Storage.FileType, but in code (vb.net AND C#) I cannot access the property.
Only the Method getFile() works in C#, but not in vb.net.

In C# there is a friendly message that I should use the accessor-getFile() instead of File-property.
ok - works, but not for vb.net!

Can you help me here?

Thanks,
Sebastian
