
mcrit's Introduction

MinHash-based Code Relationship & Investigation Toolkit (MCRIT)


MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods that encode properties of disassembled functions, which are then used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by SMDA.
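To illustrate the concept, here is a minimal, self-contained sketch of shingling and MinHash-based similarity estimation in Python. It shows the general technique only; it is not MCRIT's actual implementation:

import hashlib

def shingle(mnemonics, n=3):
    # encode a function as the set of its instruction-mnemonic n-grams
    # (assumes the function has at least n instructions)
    return {" ".join(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}

def minhash(features, num_hashes=64):
    # one minimum per seeded hash function forms the signature
    signature = []
    for seed in range(num_hashes):
        hashes = (
            int.from_bytes(hashlib.sha256(f"{seed}:{feature}".encode()).digest()[:8], "big")
            for feature in features
        )
        signature.append(min(hashes))
    return signature

def estimated_similarity(sig_a, sig_b):
    # the fraction of agreeing signature positions estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)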

Usage

Dockerized Usage

We highly recommend using the fully packaged docker-mcrit for straightforward deployment and usage.
First and foremost, this will ensure that you have fully compatible versions across all components, including a database for persistence and a web frontend for convenient interaction.

Standalone Usage

Installing MCRIT on its own requires a few more steps.
For the following, we assume Ubuntu as the host operating system.

The Python installation requirements are listed in requirements.txt and can be installed using:

# install python and MCRIT dependencies
$ sudo apt install python3 python3-pip
$ pip install -r requirements.txt 

By default, MongoDB 5.0 is used as the backend, which is also the recommended mode of operation as it provides persistent data storage. The following commands outline an example installation on Ubuntu:

# fetch mongodb signing key
$ sudo apt-get install gnupg
$ wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -
# add package repository (Ubuntu 22.04)
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
# OR add package repository (Ubuntu 20.04)
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
# OR add package repository (Ubuntu 18.04)
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list
# install mongodb
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org
# start mongodb as a service
$ sudo systemctl start mongod
# optionally configure to start the service with system startup
$ sudo systemctl enable mongod

When doing a standalone installation, you will likely want to install the MCRIT module from the cloned repository, like so:

$ pip install -e .

After this initial installation, MCRIT can be used without an internet connection if desired.

Operation

The MCRIT backend is generally divided into two components: a server providing an API interface to work with, and one or more workers processing queued jobs. They can be started in separate shells using:

$ python -m mcrit server

and

$ python -m mcrit worker

By default, the REST API server will be listening on http://127.0.0.1:8000/.
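Once the server is running, it can be queried directly over HTTP. A minimal sketch, assuming a /status endpoint corresponding to the CLI's status command shown further below; adjust host and port to your deployment:

import requests

# NOTE: the exact endpoint path is an assumption based on the CLI's
# "client status" command, not verified documentation.
response = requests.get("http://127.0.0.1:8000/status")
print(response.json())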

Interaction

Regardless of your installation choice, you can interact with the MCRIT backend once it is running.

MCRIT Client

We have created a Python client module that is capable of working with all available endpoints of the server.
Documentation for this client module is currently in development.
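Until that documentation lands, here is a hedged usage sketch; the import path, constructor signature, and method name are assumptions based on the repository layout, not verified documentation:

# assumption: the client class lives at mcrit.client.McritClient
from mcrit.client.McritClient import McritClient

# assumption: server URL and apitoken are passed to the constructor
client = McritClient("http://127.0.0.1:8000", apitoken="<your apitoken>")
print(client.getStatus())  # hypothetical method name for the status endpoint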

MCRIT CLI

There is also a CLI based on this client package. Some examples:

# query some stats of the data stored in the backend 
$ python -m mcrit client status
{'status': {'db_state': 187, 'storage_type': 'mongodb', 'num_bands': 20, 'num_samples': 137, 'num_families': 14, 'num_functions': 129110, 'num_pichashes': 25385}}
# submit a malware sample with filename sample_unpacked, using family name "some_family"
$ python -m mcrit client submit sample_unpacked -f some_family
 1.039s -> (architecture: intel.32bit, base_addr: 0x10000000): 634 functions

More extensive documentation of the MCRIT CLI is available here.

MCRIT IDA Plugin

An IDA plugin is also currently under development. To use it, first create your own config.py and make the required changes depending on the deployment of your MCRIT instance:

cp ./plugins/ida/template.config.py ./plugins/ida/config.py
nano ./plugins/ida/config.py

Then simply run the script found at

./plugins/ida/ida_mcrit.py

in IDA.

Reference Data

In July 2023, we started populating a GitHub repository which contains ready-to-use reference data for common compilers and libraries.

Version History

  • 2024-04-17 v1.3.15: Worker type spawningworker will now terminate children after QueueConfig.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT seconds.
  • 2024-04-02 v1.3.14: Experimental: Introduction of new worker type spawningworker - this variant will consume jobs from the queue as usual but defer the actual job execution into a separate (sub)process, which should reduce issues with locked memory allocations.
  • 2024-04-02 v1.3.13: When cleaning up the queue, now also delete all failed jobs @yankovs - THX!!
  • 2024-03-06 v1.3.12: Fixed a bug where protection of recent samples from queue cleanup would lead to key errors as reported by @yankovs - THX!!
  • 2024-02-21 v1.3.10: Bump SMDA to 1.13.16, which covers another 200 instructions in a better escaped category (affects MinHashes).
  • 2024-02-16 v1.3.9: Finished and integrated automated queue cleanup feature (disabled by default) proposed by @yankovs - THX!!
  • 2024-02-15 v1.3.8: Bump SMDA to address issues with version recognition in SmdaFunction, fixed exception prints in IDA plugin's McritInterface (THX to @malwarefrank!!).
  • 2024-02-12 v1.3.5: Recalculating minhashes will now show correct percentages (THX to @malwarefrank!!).
  • 2024-02-02 v1.3.4: Mini fix in the IDA plugin to avoid referencing a potentially uninitialized object (THX to @r0ny123!!).
  • 2024-02-01 v1.3.2: FIX: Non-parallelized matching now outputs the same data format (THX to @dannyquist!!).
  • 2024-01-30 v1.3.1: The connection to MongoDB is now fully configurable (THX to @dannyquist!!).
  • 2024-01-24 v1.3.0: BREAKING: Milestone release with indexing improvements for PicHash and MinHash. To ensure full backward compatibility, recalculation of all hashes is recommended. Check this migration guide.
  • 2024-01-23 v1.2.26: Pinning lief to 0.13.2 in order to ensure that the pinned SMDA remains compatible.
  • 2024-01-09 v1.2.25: Ensure that we can deliver system status regardless of whether there is a db_state and db_timestamp or not.
  • 2024-01-05 v1.2.24: Now supporting "query" argument in CLI, as well as compact MatchingResults (without function match info) to reduce file footprint.
  • 2024-01-03 v1.2.23: Limit maximum export size to protect the system against OOM crashes.
  • 2024-01-02 v1.2.22: Introduced data class for UniqueBlocksResult with convenience functionality.
  • 2023-12-28 v1.2.21: McritClient now doing passthrough for binary query matching.
  • 2023-12-28 v1.2.20: Status now provides timestamp of last DB update.
  • 2023-12-13 v1.2.18: Bounds check versus sample_ids passed to getUniqueBlocks.
  • 2023-12-05 v1.2.15: Added convenience functionality to Job objects, version number aligned with mcritweb.
  • 2023-11-24 v1.2.11: SMDA pinned to version 1.12.7 before we upgrade SMDA and introduce a database migration to recalculate pic + picblock hashes with the improved generalization.
  • 2023-11-17 v1.2.10: Added ability to set an authorization token for the server via header field: apitoken; added ability to filter by job groups; added ability to fail orphaned jobs.
  • 2023-10-17 v1.2.8: Minor fix in job groups.
  • 2023-10-16 v1.2.6: Summarized queue statistics, refined Job classification.
  • 2023-10-13 v1.2.4: Exposed Queue/Job Deletion to REST interface, improved query speed for various queue lookups via indexing and parameterized mongodb queries.
  • 2023-10-13 v1.2.3: Workers will now de-register from in-progress jobs in case they crash (THX to @yankovs for the code template).
  • 2023-10-03 v1.2.2: MatchingResult filtering for min/max num samples (incl. fix).
  • 2023-10-02 v1.2.0: Milestone release for Virus Bulletin 2023.
  • 2023-09-18 v1.1.7: Bugfix: Tasking matching with 0 bands now deactivates minhash matching as it was supposed to be before. Also matching job progress percentage fixed.
  • 2023-09-15 v1.1.6: Bugfix in BlockMatching, convenience functionality for interacting with Job objects.
  • 2023-09-14 v1.1.5: Deactivated gunicorn as default WSGI handler for the time being due to issues with non-returning calls when handling compute-heavy calls.
  • 2023-09-14 v1.1.4: BUGFIX: Added requirements.txt to data_files in setup.py to ensure it's available for the package.
  • 2023-09-13 v1.1.3: Extracted some performance critical constants into parameters configurable in MinHashConfig and StorageConfig, fixed progress reporting for batched matching, BUGFIX: usage of GunicornConfig to proper dataclass.
  • 2023-09-13 v1.1.1: Streamlined requirements / setup, excluded gunicorn for Windows (THX to @yankovs!!).
  • 2023-09-12 v1.1.0: For Linux deployments, MCRIT now uses gunicorn instead of waitress as WSGI server because of much better performance. As gunicorn needs its own config, this required bumping the minor versions (THX to @yankovs!!).
  • 2023-09-08 v1.0.21: All methods of McritClient now forward apitokens/usernames to the backend.
  • 2023-09-05 v1.0.20: Use two's complement to represent addresses in SampleEntry and FunctionEntry when storing in MongoDB, to address BSON limitations (THX to @yankovs).
  • 2023-09-05 v1.0.19: Statistics are now using the internal counters that had been created a while ago (THX to @yankovs).
  • 2023-08-30 v1.0.18: Refined LinkHunt scoring and clustering of results via ICFG relationship.
  • 2023-08-24 v1.0.15: Integrated first attempt at link hunting capability in MatchingResult.
  • 2023-08-24 v1.0.13: Rebuilding the minhash bands will no longer explode RAM usage. Removed redundant path checks (THX to @yankovs).
  • 2023-08-23 v1.0.12: Added the ability to rebuild the minhash bands used for indexing.
  • 2023-08-22 v1.0.11: Fixed a bug where when importing bulk data, the function_name was not also added as a function_label.
  • 2023-08-11 v1.0.10: Fixed a bug where when importing bulk data, the function_id would not be adjusted prior to adding MinHashes to bands, possibly leading to non-existing function_ids.
  • 2023-08-02 v1.0.9: IDA plugin can now filter by block size and minhash score, optimized layout and user experience (THX for the feedback to @r0ny123!!)
  • 2023-07-28 v1.0.8: IDA plugin can now display colored graphs for remote functions and do queries for PicBlockHashes (for basic blocks) for the currently viewed function.
  • 2023-06-06 v1.0.7: Extended filtering capabilities on MatchingResult.
  • 2023-06-02 v1.0.6: IDA plugin can now task matching jobs, show their results and batch import labels. Harmonization of MatchingResult.
  • 2023-05-22 v1.0.3: More robustness for path verification when using MCRIT CLI on Malpedia repo folder.
  • 2023-05-12 v1.0.1: Some progress on label import for the IDA plugin. Reflected API extension of MCRITweb in McritClient.
  • 2023-04-10 v1.0.0: Milestone release for Botconf 2023.
  • 2023-04-10 v0.25.0: IDA plugin can now do function queries for the currently viewed function.
  • 2023-03-24 v0.24.2: McritClient can forward username/apitoken, addJsonReport is now forwardable.
  • 2023-03-21 v0.24.0: FunctionEntries now can store additional FunctionLabelEntries, along submitting user/date.
  • 2023-03-17 v0.23.0: It is now possible to query matches for single SmdaFunctions (synchronously).
  • 2023-03-15 v0.22.0: McritClient now supports apitokens and raw responses for a subset of functionality.
  • 2023-03-14 v0.21.0: Backend support for more fine grained filtering.
  • 2023-03-13 v0.20.6: Backend support for filtering family/sample by score in MatchResult.
  • 2023-02-22 v0.20.4: Bugfix for calculating unique scores and accessing these results.
  • 2023-02-21 v0.20.3: Supporting frontend capabilities with result presentation.
  • 2023-02-17 v0.20.2: Extended match report object to support frontend improvements.
  • 2023-02-14 v0.20.0: Overhauled console client to simplify shell-based interactions with the backend.
  • 2023-01-12 v0.19.4: Additional filtering capabilities for MatchingResults.
  • 2022-12-13 v0.19.1: It is now possible to require specific (higher) amounts of band matches for candidates (i.e. reduce fuzziness of matching).
  • 2022-12-13 v0.18.x: Enable matching of arbitrary function IDs.
  • 2022-11-25 v0.18.9: Accelerated Query matching.
  • 2022-11-18 v0.18.8: Harmonized handling of deletion and modifications, minor fixes.
  • 2022-11-13 v0.18.7: Drastically accelerated sample deletion.
  • 2022-11-13 v0.18.6: Added functionality to modify existing sample and family information.
  • 2022-11-11 v0.18.2: Upgrading matching procedure, should now be able to handle larger binaries more robustly and efficiently.
  • 2022-11-03 v0.18.1: Minor fixes.
  • 2022-11-03 v0.18.0: Unique block isolation now also generates a proposal for a YARA rule, restructured result output.
  • 2022-10-24 v0.17.4: Harmonized setup.py with requirements, improved memory efficiency for processing cross jobs.
  • 2022-10-18 v0.17.3: Added a convenience script to recursively produce SMDA reports from a semi-structured folder.
  • 2022-10-13 v0.17.2: Fixed potential OOM issues during MinHash calculation by processing functions to be hashed in smaller batches.
  • 2022-10-12 v0.17.1: Added a function to schedule a job that will ensure minhashes have been calculated for all samples/functions.
  • 2022-10-11 v0.17.0: Search for unique blocks is now an asynchronous job through the Worker.
  • 2022-10-11 v0.16.0: Samples from MatchQuery jobs will now be stored with their Sample/FunctionEntries to allow better post processing.
  • 2022-10-04 v0.15.4: Server can now display its version.
  • 2022-09-28 v0.15.3: Addressing performance issues for bigger instances, generating escaped instruction sequence for unique blocks.
  • 2022-09-26 v0.15.0: CrossJobs now in backend, started to provide functionality to identify unique basic blocks in samples.
  • 2022-08-29 v0.14.2: Minor fixes for deployment.
  • 2022-08-22 v0.14.0: Jobs can now depend on other jobs (preparation for moving crossjobs to backend), QoL improvements to job handling.
  • 2022-08-17 v0.13.1: Added commandline option for profiling (requires cProfile).
  • 2022-08-09 v0.13.0: Can now do efficient direct queries for PicHash and PicBlockHash matches.
  • 2022-08-09 v0.12.3: Bugfix for FamilyEntry
  • 2022-08-08 v0.12.2: Bugfix for delivery of XCFG data, added missing dependency.
  • 2022-08-08 v0.12.0: Integrated Advanced Search syntax.
  • 2022-08-03 v0.11.0: (BREAKING) Families are now represented with a FamilyEntry.
  • 2022-08-03 v0.10.3: Now leaving function xcfg data by default in DB, exposed access to it via REST API and McritClient.
  • 2022-07-29 v0.10.2: Added ability to delete families - now also keeping XCFG info for all functions by default.
  • 2022-07-12 v0.10.1: Improved performance.
  • 2022-07-12 v0.10.0: (BREAKING) Job handling simplified.
  • 2022-05-13 v0.9.4: Bug fix for receiving submitted files.
  • 2022-05-13 v0.9.3: Further updates to MatchingResults.
  • 2022-05-13 v0.9.2: Added another field and more convenience functions in MatchingResult for better access - those are breaking changes for previously created MatchingResults.
  • 2022-05-05 v0.9.1: Processing of binary submissions, minor fixes for minhash queuing - INITIAL RELEASE.
  • 2022-02-09 v0.9.0: Added PicBlocks to MCRIT.
  • 2022-01-19 v0.8.0: Migrated the client and the examples into the primary MCRIT repository.
  • 2021-12-16 v0.7.0: Initial private release.

Credits & Notes

Thanks to Steffen Enders and Paul Hordiienko for their contributions to the internal research prototype of this project! Thanks to Manuel Blatt for his extensive contributions to and refactorings of this project as well as for the client module!

Pull requests welcome! :)

License

    MinHash-based Code Relationship & Investigation Toolkit (MCRIT)
    Copyright (C) 2022  Daniel Plohmann, Manuel Blatt

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.
    
    Some plug-ins and libraries may have different licenses. 
    If so, a license file is provided in the plug-in's folder.

mcrit's People

Contributors

blattm, danielplohmann, dannyquist, yankovs


mcrit's Issues

Mongo: BSON document might be bigger than 16 MB

MongoDbStorage's insert_many method should probably check the total size of the documents, as well as whether any single document is too big by itself. In some (pretty rare) cases, the size can exceed the 16 MB BSON limit and result in an exception:

mcrit-server                  | 2023-09-12 17:57:10 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 229, in _dbInsertMany
mcrit-server                  |     insert_result = self._database[collection].insert_many([self._toBinary(document) for document in data])
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 757, in insert_many
mcrit-server                  |     blk.execute(write_concern, session=session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 580, in execute
mcrit-server                  |     return self.execute_command(generator, write_concern, session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 447, in execute_command
mcrit-server                  |     client._retry_with_session(self.is_retryable, retryable_bulk, s, self)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                  |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                  |     return func(session, conn, retryable)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 435, in retryable_bulk
mcrit-server                  |     self._execute_command(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 381, in _execute_command
mcrit-server                  |     result, to_send = bwc.execute(cmd, ops, client)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 966, in execute
mcrit-server                  |     request_id, msg, to_send = self.__batch_command(cmd, docs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 956, in __batch_command
mcrit-server                  |     request_id, msg, to_send = _do_batched_op_msg(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 1353, in _do_batched_op_msg
mcrit-server                  |     return _batched_op_msg(operation, command, docs, ack, opts, ctx)
mcrit-server                  | pymongo.errors.DocumentTooLarge: BSON document too large (60427090 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
mcrit-server                  |
mcrit-server                  | During handling of the above exception, another exception occurred:
mcrit-server                  |
mcrit-server                  | Traceback (most recent call last):
mcrit-server                  |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                  |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                  |     func(*args, **kwargs)
mcrit-server                  |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                  |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                  |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                  |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 622, in addSmdaReport
mcrit-server                  |     self._dbInsertMany("functions", function_dicts)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 238, in _dbInsertMany
mcrit-server                  |     raise ValueError("Database insert failed.")
mcrit-server                  | ValueError: Database insert failed.

Unfortunately, I didn't log which samples caused this, so I don't have much context to provide. Overall this is pretty uncommon; it happened 4 times across over 120k files.
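A minimal sketch of the proposed check, using bson.encode (shipped with pymongo) to measure each document and split inserts into batches that stay below the limit; this is an illustration, not MCRIT's actual code:

import bson  # ships with pymongo

MAX_BSON_SIZE = 16 * 1024 * 1024  # server-side BSON document limit

def insert_many_chunked(collection, documents, max_batch_bytes=MAX_BSON_SIZE):
    # insert documents in batches whose encoded size stays under the limit;
    # documents that individually exceed the limit are skipped and reported
    batch, batch_size = [], 0
    for doc in documents:
        doc_size = len(bson.encode(doc))
        if doc_size >= MAX_BSON_SIZE:
            print(f"skipping oversized document ({doc_size} bytes)")
            continue
        if batch and batch_size + doc_size > max_batch_bytes:
            collection.insert_many(batch)
            batch, batch_size = [], 0
        batch.append(doc)
        batch_size += doc_size
    if batch:
        collection.insert_many(batch)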

MongoDB may throw an overflow error

Hey!

A recurring Mongo-related error pops up in the mcrit server logs from time to time when running an indexing process that submits files to mcrit.

I am not sure if this is a Mongo issue or an issue in mcrit, but it seems to be related to the ID generation done in mcrit. Can some field in the metadata saved to Mongo be bigger than the 8-byte integer limit in BSON?

mcrit-server                 | 2023-09-03 04:57:12 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 188, in _dbInsert
mcrit-server                 |     insert_result = self._database[collection].insert_one(self._toBinary(data))
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 671, in insert_one
mcrit-server                 |     self._insert_one(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 611, in _insert_one
mcrit-server                 |     self.__database.client._retryable_write(acknowledged, _insert_command, session)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1568, in _retryable_write
mcrit-server                 |     return self._retry_with_session(retryable, func, s, None)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                 |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                 |     return func(self, *args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                 |     return func(session, conn, retryable)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 599, in _insert_command
mcrit-server                 |     result = conn.command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/helpers.py", line 315, in inner
mcrit-server                 |     return func(*args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 960, in command
mcrit-server                 |     self._raise_connection_failure(error)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 932, in command
mcrit-server                 |     return command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/network.py", line 150, in command
mcrit-server                 |     request_id, msg, size, max_doc_size = message._op_msg(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 765, in _op_msg
mcrit-server                 |     return _op_msg_uncompressed(flags, command, identifier, docs, opts)
mcrit-server                 | OverflowError: MongoDB can only handle up to 8-byte ints
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | During handling of the above exception, another exception occurred:
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | Traceback (most recent call last):
mcrit-server                 |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                 |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                 |     func(*args, **kwargs)
mcrit-server                 |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                 |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                 |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                 |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 585, in addSmdaReport
mcrit-server                 |     self._dbInsert("samples", sample_entry.toDict())
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 197, in _dbInsert
mcrit-server                 |     raise ValueError("Database insert failed.")
mcrit-server                 | ValueError: Database insert failed.
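BSON integers are signed 64-bit values, so unsigned addresses close to 2**64 overflow them. The changelog entry for v1.0.20 notes that MCRIT addresses this with a two's-complement representation; a minimal sketch of that idea:

def to_int64(value):
    # map an unsigned 64-bit value into the signed 64-bit range BSON accepts
    value &= 0xFFFFFFFFFFFFFFFF
    return value - 0x10000000000000000 if value >= 0x8000000000000000 else value

def from_int64(value):
    # recover the original unsigned 64-bit value
    return value & 0xFFFFFFFFFFFFFFFF

assert from_int64(to_int64(0xFFFFFFFFFFFFFFFF)) == 0xFFFFFFFFFFFFFFFF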

Considerations for DbCleanup

  • think of orphaned query samples and functions that don't have a job connected to them
  • think of doing a DB compact afterwards

recalculateMinHashes progress measure is inaccurate

I am running recalculateMinHashes via the web interface, and the Progress: field shows 6187.50% and growing. I think, but am not certain, that the Worker.updateMinHashes function needs to set the total for progress instead of letting the calculateMinHashes function do it, since updateMinHashes currently batches multiple calls to calculateMinHashes.
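A minimal sketch of the suggested fix, with the outer batching function owning the total and the per-batch step only incrementing progress; all names are illustrative, not MCRIT's actual interfaces:

class ProgressReporter:
    def __init__(self):
        self.total = 0
        self.done = 0

    def set_total(self, total):
        self.total = total

    def increment(self, count):
        self.done += count
        print(f"Progress: {100 * self.done / self.total:.2f}%")

def update_minhashes(function_ids, reporter, batch_size=1000):
    reporter.set_total(len(function_ids))  # set once, before batching
    for i in range(0, len(function_ids), batch_size):
        batch = function_ids[i:i + batch_size]
        # the per-batch calculation would no longer touch reporter.total
        reporter.increment(len(batch))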

Reassign unstarted jobs on crashed worker

Hey!
Regarding:

def release_all_jobs(self):
    # release all jobs associated with our consumer id if they are started, locked, but not finished.
    self._getCollection().update_many(
        filter={"locked_by": self.consumer_id, "started_at": {"$ne": None}, "finished_at": {"$eq": None}},
        update={"$set": {"locked_by": None, "locked_at": None}, "$inc": {"attempts_left": -1}}
    )

I'm not 100% sure how dispatching of jobs in MCRIT works, so this question might be irrelevant. Is it possible for jobs to be assigned to a worker while it is still processing some other job, meaning they are locked, but started_at == None? If so, then it might also make sense to release those jobs (a sketch follows the list below) so they can be either:

  • taken by other workers to be processed
  • taken by the same worker in case some worker recovery/restart mechanism is implemented
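A broader release could look like the following sketch; it extends the snippet above to also free locked-but-unstarted jobs, without charging an attempt for jobs that never ran. This is an illustration against the schema visible above, not tested against MCRIT itself:

def release_all_jobs(self):
    # release started-but-unfinished jobs, counting the failed attempt
    self._getCollection().update_many(
        filter={"locked_by": self.consumer_id, "started_at": {"$ne": None}, "finished_at": {"$eq": None}},
        update={"$set": {"locked_by": None, "locked_at": None}, "$inc": {"attempts_left": -1}}
    )
    # additionally release jobs that were locked but never started
    self._getCollection().update_many(
        filter={"locked_by": self.consumer_id, "started_at": {"$eq": None}},
        update={"$set": {"locked_by": None, "locked_at": None}}
    )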

Register Worker IDs to avoid zombie jobs

When workers fetch items from the queue, they register on the job with a dynamically generated worker ID.
If for whatever reason a job terminates/crashes unexpectedly, this job will remain marked in progress with the original worker's ID.
Now, should the worker be restarted, it will have a new ID. The previous ID is then no longer among the live workers, and the job remains marked "in progress" forever while no longer being processed, making it a zombie job.

To address this issue, the following should be done (a sketch follows the list):

  • workers could be started with an additional parameter specifying an ID to be used by them instead of the dynamically generated ID.
  • workers could/should register centrally with their ID in a dedicated database collection, also providing additional information that allows deconflicting their IDs after restarts, and possibly a heartbeat recording when they last processed a job. This would allow cleaning up zombie jobs.
  • Improve resilience of workers to avoid crashes, so that when handling issues gracefully, they have a chance to de-register from the collection.
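A minimal sketch of the central registration with heartbeats described above; the collection and field names are assumptions, not MCRIT's actual schema:

from datetime import datetime, timedelta, timezone

def register_worker(db, worker_id):
    # upsert this worker into a dedicated "workers" collection
    db["workers"].update_one(
        {"worker_id": worker_id},
        {"$set": {"registered_at": datetime.now(timezone.utc)}},
        upsert=True,
    )

def heartbeat(db, worker_id):
    # called whenever the worker picks up or finishes a job
    db["workers"].update_one(
        {"worker_id": worker_id},
        {"$set": {"last_seen": datetime.now(timezone.utc)}},
    )

def find_zombie_jobs(db, stale_after_seconds=3600):
    # jobs locked by workers without a recent heartbeat can be released
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=stale_after_seconds)
    stale = [w["worker_id"] for w in db["workers"].find({"last_seen": {"$lt": cutoff}})]
    return db["jobs"].find({"locked_by": {"$in": stale}, "finished_at": None})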

Result filtering based on benign functions

Hey!

Awesome project! I've been reading the code and wondering, is there currently any way to filter benign functions using mcrit?

Let's say I keep a repository of functions extracted from compiler boilerplate and such (much like mcrit-data). Is there any way, when I index some malware sample, to basically remove these functions from the malware sample? Ideally, I'd like the mcrit DB to index only "interesting" functions from the binary and build a minhash based on that.

Maybe mcrit works in a different manner than what I'm imagining, so my question may not be relevant, but it would be nice to know either way.

Configurable UniqueBlocks queries

Instead of doing the UniqueBlocks analysis based on our best-practice settings from YARA-Signator, make them configurable with at least the following parameters (a config sketch follows the list):

  • min/max instructions per selected block
  • min/max bytes per selected block
  • required blocks as coverage per sample (currently hardcoded to 10)
  • number of them required as rule condition
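A hypothetical configuration sketch covering the parameters above; names and defaults are illustrative, not MCRIT's actual settings:

from dataclasses import dataclass

@dataclass
class UniqueBlocksConfig:
    min_block_instructions: int = 5
    max_block_instructions: int = 100
    min_block_bytes: int = 16
    max_block_bytes: int = 512
    blocks_per_sample: int = 10        # currently hardcoded to 10
    blocks_in_rule_condition: int = 7  # number of blocks required to match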

KeyError on import

I am receiving the following error when attempting to import the MSVC/x86/mcrit/2003_Express_x86.mcrit file from the mcrit-data repo:

INFO:mcrit.index.MinHashIndex:Family remapping created: 1 families, 1 samples.
2024-02-09 23:46:06 [FALCON] [ERROR] POST /import => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
    func(*args, **kwargs)
  File "/opt/mcrit/mcrit/server/StatusResource.py", line 73, in on_post_import
    import_report = self.index.addImportData(import_data)
  File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 220, in addImportData
    sample_entry.family_id = family_id_remapping[sample_entry.family_id]
KeyError: 0

I am running docker-mcrit that is using mcrit v1.3.4. I am initiating the import using the command-line client.

Index-out-of-range may occur in job's `sample_id` property

Hey! :)

I've recently updated to the current MCRIT/web and noticed I get a crash on basically every family I click on in the families view. The traceback is as follows:

[2023-12-12 16:39:05,607] ERROR in app: Exception on /explore/families/5 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/mcritweb/mcritweb/views/utility.py", line 39, in wrapped_view
    return view(**kwargs)
  File "/opt/mcritweb/mcritweb/views/authentication.py", line 204, in wrapped_view
    return view(**kwargs)
  File "/opt/mcritweb/mcritweb/views/explore.py", line 213, in family_by_id
    job_collection.filterToSampleIds([s.sample_id for s in samples])
  File "/usr/local/lib/python3.8/dist-packages/mcrit/queue/JobCollection.py", line 36, in filterToSampleIds
    if job.sample_id in sample_ids:
  File "/usr/local/lib/python3.8/dist-packages/mcrit/queue/LocalQueue.py", line 137, in sample_id
    return int(self.arguments[0][0])
IndexError: list index out of range

After a bit of code reading, it seems the issue is that when retrieving the samples of a family, the job's sample_id is read as if it were a getUniqueBlocks job even though this is not the case.

Edit: the same behavior was observed in two other spots: (1) when clicking Explore -> Samples, (2) Data -> Jobs/Results -> Blocks.

Limit export

On bigger instances, trying to export the entire database will likely lead to an out-of-memory situation.

To avoid this, the maximum possible export should be capped or otherwise limited to keep the server from crashing.

Question: PicHash and MinHash recalculation results

Hey!

I'd ask this in private, but I assume this question applies to more people so it can help others.
We recently did the recalculation actions needed for the new upgraded SMDA, and these are the results:

(two screenshots of the recalculation job results omitted)

In particular, we noticed that only about half of the updatable functions were updated in the PicHash recalculation, and similarly, for PicBlockHashes only a small fraction was actually updated.

Is this behavior normal? How can we make sure that the remaining ones were not skipped because of an error?

Thank you :)

Question: database migrations

Hey!

I've been wondering: if at some point in the future MCRIT's mongo schema changes (e.g. how function matches are indexed, or core changes to the LSH implementation that result in schema changes), will re-indexing of the whole DB be required?

Sample Deletion is incomplete

When using the client's functionality to delete samples by function_id, the respective entries are not removed from the band_* collections. This means that when candidates are generated, there will be dangling entries among them, which will lead to errors as these cannot be resolved by their ID.

Generally, if we have a broken state, we can fix it like this:

all_gap_function_ids = []
previous_id = None
for entry in database["functions"].find():
    if previous_id is not None and entry["function_id"] - previous_id > 1:
        print("we have a gap here!")
        print(previous_id, entry["function_id"])
        for fid in range(previous_id + 1, entry["function_id"]):
            all_gap_function_ids.append(fid)
        break
    previous_id = entry["function_id"]

for band_number in range(0, 20):
    database[f"band_{band_number}"].update_many({}, {"$pull": {"function_ids": {"$in": all_gap_function_ids}}})

As a result, we probably want to reuse the lower part to repair our deletion method in MongoDbStorage so that it removes function_ids from the band_* collections.

Workers consume a lot of ram on query

While doing a query on a sample, the memory usage of a single worker often jumps to tens of GBs, sometimes even more than 60 GB. There seems to be no limit at all, and the workers are greedy with memory usage: if some file requires 200 GB of RAM, the worker will try to get it and will probably crash due to lack of memory. As a result, in a setup with multiple workers, it happens quite often that one worker hoards all the memory and starves the other workers until they crash.

Here is a list of hashes for samples that consistently consume a lot of RAM on a worker (on our MCRIT instance with 20 million functions):
cea60afdae38004df948f1d2c6cb11d2d0a9ab52950c97142d0a417d5de2ff87
d92f6dd996a2f52e86f86d870ef30d8c80840fe36769cb825f3e30109078e339
bab77145165ebe5ab733487915841c23b29be7efec9a4f407a111c6aa79b00ce
97f1ea0a143f371ecf377912cbe4565d1f5c6d60ed63742ffa0b35b51a83afa2
94433566d1cb5a9962de6279c212c3ab6aa5f18dbff59fe489ec76806b09b15f
a5b38fa9a0031e8913e19ef95ac2bd21cb07052e0ef64abb8f5ef03cf11cb4d5
085b68fa717510f527f74025025b6a91de83c229dc1080c58f0f7b13e8a39904
043aac85af1bda77c259b56cd76e4750c8c10c382d7b6ec29be48ee6e40faa00
84ad84a1f730659ac2e227b71528daec5d59b361ace00554824e0fddb4b453cf
1c4bdd70338655f16cd6cf1eb596cd82a1caaf51722d0015726ec95e719f7a27
29bd1ffe07d8820c4d34a7869dbd96c8a4733c496b225b1caf31be2a7d4ff6df
f72bb91a4569fb9ba2aa40db2499f39bb7aba4d20a5cb5f6dd1e2a9a4ce9af98
9119213b617e203fbc44348eb91150a4db009d78a4123a5cbce6dc6421982a91
a614ed116edc46301a4b3995067d5028af14c8949f406165d702496630cb02ce
0c9edded5ff2ac86b06c1b9929117eab3be54ee45d44fcdb0b416664c7183cbf

I am not sure what the correct way to handle this is, but I think there should at least be a way to limit each worker to some amount of memory.
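One option, at least on Linux, is to cap each worker process's address space so a runaway query fails with a MemoryError inside that worker instead of starving the whole host; a minimal sketch (the 8 GB default is an illustrative choice):

import resource

def limit_worker_memory(max_bytes=8 * 1024 ** 3):
    # cap the address space of the current process (Linux/Unix only)
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))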

Improve performance for displaying MatchingResult pages

When working with large data sets in MCRIT, results often grow drastically in size, yielding JSON files of up to several hundred MB.
Oftentimes, the aggregated result is of primary interest to an analyst (e.g. for family identification), while the detailed function matches are only relevant for deeper analysis/inspection.

To improve performance, a "thin" result could be delivered to the front-end that contains only the sample matches and their aggregated results, which would massively reduce the footprint of the result file.
Also, investigate if specialized marshalling libraries (#44) can improve performance.

Consider adding actors field to families

Hey :)!

It would be cool if the family summary included the actors associated with it, somewhat similar to the way Malpedia presents this info (screenshot omitted).

It does change the DB's overall schema, I guess, but since it's only an addition of data on top of existing JSONs, I think it's OK. What do you think?

Add TTL to query_* documents

In an automated system, the data inserted into the query_* collections grows very large very quickly. A couple of weeks or months after a query, the MCRIT DB has probably changed anyway, so the sample would require a re-query; saving old results is therefore not that useful.

It would be nice if there were an option to turn on a TTL and simply remove such query-related data after some user-defined period.
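MongoDB supports this natively via TTL indexes; a minimal sketch, assuming the query documents carry a creation timestamp (the collection and field names are assumptions):

from pymongo import MongoClient

db = MongoClient()["mcrit"]
# expire query-related documents 30 days after their "created_at" timestamp;
# a background monitor deletes matching documents automatically
db["query_results"].create_index("created_at", expireAfterSeconds=30 * 24 * 3600)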

Improve performance of job page

The jobs page is currently very slow when the queue is large.
One reason is certainly that the queue barely uses any indices on its fields, but also that rendering the page carries out multiple full collection scans and data retrievals, as JobResource performs the start/limit and filtering itself instead of having MongoDB do this efficiently.

In order to improve performance for this,

  • the jobs page could be split up into multiple pages per Job type (Matching, Query, Blocks, Other)
  • start/limits could be performed efficiently on the DB (see the sketch after this list)
  • cleanup functionality for the queue should be exposed to the front-end (related to #38)
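As a sketch of the second point, pagination and filtering can be pushed into MongoDB itself; the field names are assumptions about the queue schema:

def get_jobs_page(db, job_type=None, start=0, limit=50):
    # let MongoDB filter, sort, and slice instead of scanning the collection
    query = {"job_type": job_type} if job_type else {}
    cursor = db["jobs"].find(query).sort("created_at", -1).skip(start).limit(limit)
    return list(cursor)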

QoL: Add last index timestamp to server statistics

It would be nice if the statistics included a timestamp of the last time a file was (successfully) indexed in the system.

This is useful when the match reports of MCRIT are stored somewhere other than MCRIT itself. In my case, a short summary of each report is stored on a different system. Since the data already exists on that platform, this information helps decide whether to re-query a file for matches: if the DB didn't change at all, there's no need to even fetch the cached result.

Consider use of JSON marshalling accelerators

By transforming additional DTOs into full python dataclasses, it would likely become possible to use an acceleration library like mashumaro for the (un)marshalling of MatchingResult, which is very expensive for large reports right now.
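A brief sketch of the idea: once a DTO is a plain dataclass, mashumaro can generate fast (de)serialization code for it. The field names here are illustrative, not MCRIT's actual MatchingResult layout:

from dataclasses import dataclass
from mashumaro import DataClassDictMixin

@dataclass
class FunctionMatch(DataClassDictMixin):
    function_id: int
    matched_function_id: int
    score: float

# from_dict/to_dict are generated at class-creation time, avoiding
# per-call reflection overhead
match = FunctionMatch.from_dict({"function_id": 1, "matched_function_id": 2, "score": 0.93})
print(match.to_dict())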

Queue cleaning is unused

Regarding:

time_threshold = datetime.now() - timedelta(seconds=self.cache_time)

Is this intentional? Since cache_time is 10 ** 9 seconds, it will effectively cache for over 30 years, and this cleaning function is essentially a dead code path.

Performance: query on large collections

Hey!

We see MCRIT as a great tool for malware similarity purposes and want to see if it can be integrated into our malware pipeline, with an emphasis on the API it provides. We have a DB with a lot of samples, including families with tens of thousands of files associated with them. Simple testing with a moderate number of files shows MCRIT indeed works great. However, once it grows to 100k+ files, it begins to slow down significantly, and a query for a single file can take more than 10 minutes. Given the number of samples we already have and the daily volume we get from different sources, it is only a matter of time before we reach 100k even if we start small and make a curated set of samples for each family.

Of course, this isn't a trivial thing and it requires further inspection of each step in the process. However, I think that it raises some questions worth discussing:

  • Why was MongoDB chosen for the project? Is it the right fit, if we keep scale in mind?
  • Is the database design optimal, or is there any place to improve in regard to the indexes chosen and the queries performed?
  • How does MCRIT deal with files that contain many functions (we have ones with over 80k!)? Is there any other way to deal with them?
  • Should MCRIT support managed solutions like Amazon's DocumentDB? Those kinds of solutions handle things like sharding the DB for horizontal scaling and are easy to deploy. However, DocumentDB in particular isn't quite 100% MongoDB compatible.
  • Where are places where a bottleneck can occur during a query?

I hope this doesn't come across as a complaint, because we think MCRIT is great and would really love to use it in production :)

Add universal tagging

For various purposes, it might be worthwhile to introduce and support universal tagging on the level of families, samples, and functions, and possibly also matching reports.

Exception on POST /query/function

I am unsure if this indicates an error with the IDA script, the mcrit service, or SMDA. I am using ida_mcrit.py in IDA via File->Script File... while I have a binary open in IDA. Within the mcrit window, I click on the fingerprint icon (Convert this IDB to a SMDA...), then click on the Upload icon (Reparse and upload the SMDA report...). I see a Python exception within the IDA console Output window, and I see an exception in the mcrit logs (using docker-mcrit and docker compose).

The following is the exception I see in the mcrit-server log:

2024-02-14 21:51:31 [FALCON] [ERROR] POST /query/function => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
    func(*args, **kwargs)
  File "/opt/mcrit/mcrit/server/QueryResource.py", line 85, in on_post_query_smda_function
    summary = self.index.getMatchesForSmdaFunction(smda_report, **parameters)
  File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 340, in getMatchesForSmdaFunction
    match_report = matcher.getMatchesForSmdaFunction(smda_report)
  File "/opt/mcrit/mcrit/matchers/MatcherInterface.py", line 53, in wrapper
    result = func(*args, **kwargs)
  File "/opt/mcrit/mcrit/matchers/MatcherQueryFUnction.py", line 35, in getMatchesForSmdaFunction
    function_entry = FunctionEntry(self._sample_entry, smda_function, -1, minhash)
  File "/opt/mcrit/mcrit/storage/FunctionEntry.py", line 59, in __init__
    self.xcfg = smda_function.toDict()
  File "/usr/local/lib/python3.8/dist-packages/smda/common/SmdaFunction.py", line 257, in toDict
    "nesting_depth": self.nesting_depth,
AttributeError: 'SmdaFunction' object has no attribute 'nesting_depth'

I took a quick look at smda/common/SmdaFunction.py, and it looks like the fromDict() function only assigns nesting_depth if a version is given. I do not know the code well enough to be sure this is the cause of the issue, but if so, you might just need to move the else statement on line 235 out one nesting layer so it becomes the else for the "if version and re.match..." statement.
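Alternatively, a defensive fix at serialization time could tolerate the missing attribute; a hedged, illustrative sketch (not SMDA's actual code):

def toDict(self):
    return {
        # ... other fields ...
        # default to None instead of raising AttributeError when the
        # parser never assigned nesting_depth
        "nesting_depth": getattr(self, "nesting_depth", None),
    }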
