filechampion4j's Introduction

About FileChampion4j

FileChampion4j is a powerful and flexible Java library for validating and processing files. The library can be used to check files for a variety of properties, including mime type, magic bytes, header signatures, footer signatures, maximum size, and more. The library can also execute extension plugins that are defined for the file type.

See FileChampion4j Wiki for detailed instructions on configurations and usage.

See FileChampion4j Docs for comprehensive documentation and design diagrams (generated with the help of Doxygen, Graphviz, and PlantUML).

Features

  • Easy to understand and configure for developers, operations, and security engineers.
  • JSON-based configuration, supporting the ability to separate configurations from code.
  • Flexible to support various integrations, including client-defined controls.
  • Support for in-memory and on-disk validation.
  • Validate files for a variety of properties, including mime type, magic bytes, header signatures, footer signatures, maximum size, filename cleanup/encoding, and owner/permissions of file.
  • Custom plugins execution support for extended usability.
  • Comprehensive error handling and reporting.

Benefits

  • Protect your system from malicious files.
  • Ensure that files are of the correct type and size.
  • Save time, effort, and risk of developing custom file validation code.
  • Allow security engineers to define required file controls without code modification.
  • Support easy auditing of controls by compliance officers and auditors.

Releases

Working release versions, including slim/fat JARs, can be found on the release-* branches. A Maven Central package will be added for distribution soon.

Compatibility

FileChampion4j is intended to support Windows and Linux platforms running any active LTS Java runtime version. Builds are tested and packaged for supported environments. Merges and new releases must pass security, functional, and performance tests for supported environments.

If you have any questions about FileChampion4j, please feel free to contact the project team. The project team is available to answer questions and provide support.

If you find an issue or have an idea, please open a relevant issue in this project.

Contributing

If you would like to contribute to FileChampion4j, please feel free to fork the project on GitHub. The project team welcomes contributions of all kinds, including bug fixes, new features, and documentation improvements.

License

FileChampion4j is licensed under the Apache License, Version 2.0. For more information about the license, please see the LICENSE file in the project repository.

filechampion4j's Issues

Support validation by file path

Story

As an implementer of the library, I want to be able to pass a file path instead of only file bytes, so I can validate files without processing them first.

Details

  • Add an optional 'File Path' argument to the 'doValidation' method.
  • 'doValidation' will check whether it received bytes or a path.
  • If a path, read the file bytes from the path for validation.

Nice to have

  • If path, don't use 'temp dirs/files' for related validations.
  • Add option allowing caller to define 'Quarantine' path to which failed validation files will be moved.

Acceptance Criteria

  • Using a path should not degrade validation performance by more than 5%
  • Update usage documentation
  • Update technical doc stack

Support defining checksum hashing algorithm

Story

As an implementer, I want to define the checksum hash algorithm, so I can adjust the output in accordance with integration needs.

Details

Allow defining md5, sha1, sha2, and sha5 for the checksum output.
Support defining the algorithm both in the configuration and in the validation call.

Nice to have

  • Support multiple algorithms, returning multiple checksum values.
  • Consider adding a 'general' section in the json, so some definitions can be defined for all categories.

Acceptance Criteria

  • Configurations support defining the algo
  • Caller can set/get any supported algo
  • Docs updated with options
  • Performance and security impacts updated
  • Test coverage >85%
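
As an illustration of a configurable checksum, the sketch below uses the JDK's `MessageDigest` with the algorithm passed in as a string. The class and method names are hypothetical, and the algorithm strings are the standard JCA names (`MD5`, `SHA-1`, `SHA-256`, `SHA-512`), which is an assumption about how the shorthand in this issue would map onto the JDK.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class and method names are illustrative only.
final class ChecksumSketch {

    /** Hex-encoded digest of the content using the named JCA algorithm. */
    static String checksum(byte[] content, String algorithm) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] hash = digest.digest(content);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Nice-to-have variant: one checksum per requested algorithm, in request order. */
    static Map<String, String> checksums(byte[] content, List<String> algorithms)
            throws NoSuchAlgorithmException {
        Map<String, String> result = new LinkedHashMap<>();
        for (String algorithm : algorithms) {
            result.put(algorithm, checksum(content, algorithm));
        }
        return result;
    }
}
```

An unsupported algorithm name surfaces as `NoSuchAlgorithmException`, which a real implementation would probably want to detect when the configuration is loaded rather than at validation time.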

Add support for user defined CLI validators

As an implementer, I want to define custom CLI executions, so I can integrate additional custom checks for untrusted files.

Acceptance Criteria

  • Test coverage of at least 80%
  • JSON configuration supported for implementers to define custom CLI executions
  • Configurations support defining: executable path, executable arguments, placeholders for filename / file bytes / file checksum, response pass/fail body, and execution timeout plus behaviour on timeout (fail/pass).
  • Comprehensive logging of requests/responses
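
A minimal sketch of how such a CLI validator might be driven is shown below. Everything in it is an assumption for illustration: the placeholder tokens (`${filePath}`, `${fileChecksum}`), the class name, and the decision to judge pass/fail by exit code rather than by response body.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: placeholder syntax and names are not from the library.
final class CliValidatorSketch {

    static final String FILE_PATH = "${filePath}";
    static final String CHECKSUM = "${fileChecksum}";

    /** Expand placeholders in each configured argument; the executable comes first. */
    static List<String> buildCommand(String executable, List<String> argTemplates,
                                     String filePath, String checksum) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        for (String template : argTemplates) {
            command.add(template.replace(FILE_PATH, filePath).replace(CHECKSUM, checksum));
        }
        return command;
    }

    /**
     * Run the command and return true on pass. On timeout the process is killed
     * and the configured timeout behaviour (pass or fail) is returned.
     */
    static boolean run(List<String> command, long timeoutSeconds, boolean passOnTimeout,
                       int passExitCode) throws IOException, InterruptedException {
        // inheritIO avoids blocking on an unread output pipe in this simplified version.
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return passOnTimeout;
        }
        return process.exitValue() == passExitCode;
    }
}
```

Passing arguments as a list to `ProcessBuilder` (instead of building one shell string) keeps untrusted filenames from being interpreted by a shell.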

Remove Requirement # of Validations

Summary

Remove library requirement for multiple validation configurations.

Rationale

Make usage more flexible, such as only performing AV scan/other plugins.

Details

The current version requires configuration of at least a few validation parameters to initialize the class.
The original purpose was to force some minimal validations, at the cost of flexibility and performance.
The proposal is to require only that at least one control is configured, no matter what it is.
This should enable more flexibility in usage of the library.

Current State

The current version requires configuration of at least a few validation parameters to initialize the class.

Proposed Solution

The proposal is to require only that at least one control is configured, no matter what it is.

Benefits

This should enable more flexibility in usage of the library.

Improve Extensions Configurations Loading

Summary

Improve logic of extensions configurations loading to improve performance.

Rationale

Reloading the extension configurations on every validation serves no purpose and impacts performance.

Details

Extension configurations are currently loaded on every validation; they should be loaded into memory once, at class init.
Reloading them on every call serves no purpose and impacts performance.
Fix by refactoring the Extension class into an 'Extensions' class, which loads all configs into hashmaps at FileValidator init, and change doValidation to access only the required objects.

Current State

Extension configurations are currently loaded on every validation; they should be loaded into memory once, at class init.

Proposed Solution

Refactor the Extension class into an 'Extensions' class, which loads all configs into hashmaps at FileValidator init, and change doValidation to access only the required objects.

Benefits

Performance improvement
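
The load-once idea could be sketched as follows. The class shape, the `ExtensionConfig` fields, and the composite key format are all hypothetical; the point is only that the map is built once in the constructor and each validation does a plain lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of an 'Extensions' holder; not the library's real class.
final class ExtensionsSketch {

    /** Hypothetical per-extension settings; field names are illustrative only. */
    static final class ExtensionConfig {
        final String mimeType;
        final long maxSizeKb;

        ExtensionConfig(String mimeType, long maxSizeKb) {
            this.mimeType = mimeType;
            this.maxSizeKb = maxSizeKb;
        }
    }

    private final Map<String, ExtensionConfig> configs;

    /** Runs once at validator init: copy the already-parsed entries into memory. */
    ExtensionsSketch(Map<String, ExtensionConfig> parsedConfigs) {
        this.configs = Collections.unmodifiableMap(new HashMap<>(parsedConfigs));
    }

    /** Composite key used for lookups, e.g. "documents/pdf". */
    static String key(String category, String extension) {
        return category + "/" + extension.toLowerCase(Locale.ROOT);
    }

    /** Called per validation: a map lookup only, no config parsing. Null if absent. */
    ExtensionConfig lookup(String category, String extension) {
        return configs.get(key(category, extension));
    }
}
```

Copying into an unmodifiable map also makes the holder safe to share between threads that validate concurrently.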

Define benchmarking procedure for contributors

Story

As a contributor to the project, I want to be able to benchmark methods after code changes, so I can comply with defined degradation limits.

Details

Develop and document a benchmarking procedure with the following workflow upon completion of changes:

  • Checkout master
  • Run JMH benchmarks against master, save results locally
  • Run JMH benchmarks against new/target branch, compare results to master results
  • If the difference between the results is within the defined performance threshold, a PR can be opened for the changes
  • PR should trigger an action that runs benchmarking flow above on github runner
  • Merge is blocked upon failure
  • Following merge, CI action should push new benchmarks to 'benchmarks' directory for future base comparison and tracking

Nice to have:

  • Auto publish per validation type, relative performance benchmark in formatted markdown, in dedicated WIKI page.

Acceptance Criteria

  • Flow is defined and working as described in details
  • Tests are configured for every validation type
  • Usage and logic is documented and easy to follow
  • Thresholds are clearly defined and documented
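
The comparison step of this workflow might be sketched as below. It assumes the JMH results have already been reduced to a map of benchmark name to average time per operation (lower is better); the class name and the threshold parameter are hypothetical, and the actual threshold value is left to the project to define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "compare to master" gate; not an existing tool.
final class BenchmarkGateSketch {

    /**
     * Returns the benchmarks whose time per operation grew by more than
     * maxDegradation (0.05 means 5%) relative to the baseline. Benchmarks
     * missing from the candidate run are skipped.
     */
    static List<String> regressions(Map<String, Double> baseline,
                                    Map<String, Double> candidate,
                                    double maxDegradation) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            Double after = candidate.get(entry.getKey());
            if (after == null) {
                continue;
            }
            double relativeChange = (after - entry.getValue()) / entry.getValue();
            if (relativeChange > maxDegradation) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

A CI action could fail the PR whenever the returned list is non-empty.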

Add support for user defined api requests

As an implementer, I want to define custom API requests, so I can integrate additional custom checks for untrusted files.

Acceptance Criteria

  • Test coverage of at least 80%
  • JSON configuration supported for implementers to define custom API calls
  • Configurations support defining: URI endpoint, request headers, credential placeholders, request body, placeholders for filename / file bytes / file checksum, response pass/fail headers / body, and request timeout plus behaviour on timeout (fail/pass).
  • Secure credential loading from dedicated file, with best practices for use.
  • Comprehensive logging of requests/responses
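
One way the credential placeholders and the dedicated credentials file could fit together is sketched below. The `${cred.NAME}` placeholder syntax, the use of a `.properties` file, and the class name are all assumptions made for illustration; the issue does not specify any of them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: placeholder syntax and file format are assumptions.
final class ApiRequestConfigSketch {

    /** Load credentials from a dedicated properties file kept outside the main config. */
    static Properties loadCredentials(Path credentialsFile) throws IOException {
        Properties credentials = new Properties();
        try (InputStream in = Files.newInputStream(credentialsFile)) {
            credentials.load(in);
        }
        return credentials;
    }

    /** Replace ${cred.NAME} placeholders in configured header values. */
    static Map<String, String> resolveHeaders(Map<String, String> headerTemplates,
                                              Properties credentials) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : headerTemplates.entrySet()) {
            String value = header.getValue();
            for (String name : credentials.stringPropertyNames()) {
                value = value.replace("${cred." + name + "}", credentials.getProperty(name));
            }
            resolved.put(header.getKey(), value);
        }
        return resolved;
    }
}
```

Keeping secrets in a separate file lets the JSON configuration be shared or audited without exposing them; resolved header values should be masked in request/response logs.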

Support accepting MIME value from caller

Story

As a developer accepting files over the web, I want to add the MIME type declared in the HTTP request to the validation call, so the validation method can identify HTTP-level injections and improve the performance of MIME validation.

Details

  • Add optional 'mimeValue' argument to doValidation
  • If mimeValue is defined by caller, skip FileValidator mime analysis, and only compare MIME value against defined config
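
The branching described above could be sketched like this. The helper names are hypothetical, and the normalization step (dropping parameters such as `; charset=...` and lower-casing) is an added assumption about how a declared HTTP `Content-Type` would be compared with configured values.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: not the library's actual doValidation signature.
final class DeclaredMimeSketch {

    /** Strip parameters (e.g. "; charset=binary") and normalize case. */
    static String normalize(String mime) {
        int semicolon = mime.indexOf(';');
        String bare = semicolon >= 0 ? mime.substring(0, semicolon) : mime;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * If the caller supplied a MIME value, skip detection and only compare it
     * with the configured values; otherwise fall back to the detector.
     */
    static boolean mimeAllowed(String mimeValue, Supplier<String> detector,
                               Set<String> allowedMimes) {
        boolean declared = mimeValue != null && !mimeValue.trim().isEmpty();
        String mime = declared ? mimeValue : detector.get();
        return mime != null && allowedMimes.contains(normalize(mime));
    }
}
```

Passing the detector as a `Supplier` makes the skipped analysis explicit: it is never invoked when a declared value is present.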

Performance impact when processing large files

Describe the bug
When processing large files, performance drops by a factor of 10.

To Reproduce
Execute test with large files, analyse benchmarks of large files.

Expected behavior
Support defining logic of 'fail fast' and 'ignore checksum'.

Additional context
Currently, validations are performed sequentially regardless of the validation result.
In addition, the file checksum, which is very costly performance-wise, is generated in a single thread and is computed on many return paths.

Suggested resolutions
a. Support defining 'fast-fail' so the first failed validation is returned to the caller. (add docs)
b. Support setting 'generate_sha' to false to skip checksum generation. (add docs)
c. Refactor the checksum method to support chunking and thread-pool processing for checksum generation.
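
Resolutions (a) and the chunking part of (c) can be illustrated with the sketch below. The names are hypothetical, SHA-256 is used only as an example algorithm, and thread-pool processing is deliberately left out: a single digest over one stream has to consume its chunks in order, so parallelism would need a different design.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: illustrates fail-fast and chunked hashing only.
final class LargeFileSketch {

    /**
     * Index of the first failed check, or -1 if all pass. With failFast=true
     * the remaining checks are skipped after the first failure.
     */
    static int firstFailure(byte[] content, List<Predicate<byte[]>> checks, boolean failFast) {
        int first = -1;
        for (int i = 0; i < checks.size(); i++) {
            if (!checks.get(i).test(content)) {
                if (first < 0) {
                    first = i;
                }
                if (failFast) {
                    break;
                }
            }
        }
        return first;
    }

    /** Hash a stream in fixed-size chunks so the whole file never sits in memory. */
    static String sha256Chunked(InputStream in, int chunkSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Resolution (b) would then amount to not calling the checksum method at all when 'generate_sha' is false.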
