FastCDC

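This is a Go implementation of the FastCDC algorithm for content-defined chunking (CDC). CDC is a technique used in data deduplication and storage systems to split data into variable-sized chunks at boundaries derived from the content itself rather than at fixed block offsets. Because the boundaries follow the content, a local insertion or deletion typically changes only the neighboring chunks, so the unchanged remainder still produces identical chunks that can be deduplicated.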
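To make the idea concrete, here is a minimal, self-contained sketch of gear-hash based chunking with a minimum and maximum chunk size. It is a simplified illustration and not this package's API or code: the names gearTable, cutPoint, split and sharedChunks, the seeds, the mask and the size parameters are all assumptions made for the example, and the normalized chunking step of FastCDC is left out. It assumes 0 < minSize < maxSize.

```go
package main

import (
	"fmt"
	"math/rand"
)

// gearTable assigns a pseudo-random 64-bit value to every byte value.
// The fixed seed is an arbitrary choice made for this sketch.
var gearTable = func() (t [256]uint64) {
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

// cutPoint returns the length of the first chunk in data. Bytes before
// minSize are skipped without hashing; a boundary is declared where the
// masked rolling hash is zero, or at maxSize at the latest.
func cutPoint(data []byte, minSize, maxSize int, mask uint64) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	if n > maxSize {
		n = maxSize
	}
	var h uint64
	for i := minSize; i < n; i++ {
		h = (h << 1) + gearTable[data[i]]
		if h&mask == 0 {
			return i + 1
		}
	}
	return n
}

// split cuts data into consecutive content-defined chunks.
func split(data []byte, minSize, maxSize int, mask uint64) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := cutPoint(data, minSize, maxSize, mask)
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// sharedChunks counts the chunks of b whose content also occurs in a.
func sharedChunks(a, b [][]byte) int {
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[string(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[string(c)] {
			n++
		}
	}
	return n
}

func main() {
	r := rand.New(rand.NewSource(42))
	data := make([]byte, 64<<10)
	for i := range data {
		data[i] = byte(r.Intn(256))
	}
	// Eight mask bits: a boundary roughly every 256 hashed bytes.
	mask := uint64(0xFF) << 56

	before := split(data, 64, 1024, mask)
	// Insert a few bytes at the front; fixed-size blocks would all shift.
	edited := append([]byte("hello, "), data...)
	after := split(edited, 64, 1024, mask)

	fmt.Printf("%d chunks before, %d after, %d shared\n",
		len(before), len(after), sharedChunks(before, after))
}
```

Running the sketch prints how many chunks survive a small insertion at the start of the input. With fixed-size blocks, the same edit would shift every block boundary.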

Usage

go get -u codeberg.org/mhofmann/fastcdc

Evaluation

The implementation can be used with the same parameters as in the FastCDC paper or with user-provided values for the minimum, average and maximum chunk sizes. For comparison, the following table shows statistics about the number and size of chunks generated by chunking a set of test files with different parameters. The numbers in the chunker names refer to the parameters used. For example, "2k-8k-64k" is a chunker with a minSize of 2 KB, an avgSize of 8 KB and a maxSize of 64 KB. The test corpus had a total uncompressed size of 8182081670 bytes (~7.6 GiB) and consisted of technical manuals and drawings in PDF format and tarballs containing the source code of 5 different versions of the Linux kernel.
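The naming scheme can be illustrated with a small parser. This is a hypothetical helper written for this README, not part of the package; it assumes that the "k" suffix stands for 1024 bytes, which the text above does not state explicitly.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseChunkerName splits a name such as "2k-8k-64k" into the minimum,
// average and maximum chunk size in bytes. Hypothetical helper; it
// assumes that "k" means 1024 bytes.
func parseChunkerName(name string) (minSize, avgSize, maxSize int, err error) {
	parts := strings.Split(name, "-")
	if len(parts) != 3 {
		return 0, 0, 0, fmt.Errorf("want 3 sizes in %q, got %d", name, len(parts))
	}
	var sizes [3]int
	for i, p := range parts {
		if !strings.HasSuffix(p, "k") {
			return 0, 0, 0, fmt.Errorf("size %q in %q lacks the k suffix", p, name)
		}
		n, convErr := strconv.Atoi(strings.TrimSuffix(p, "k"))
		if convErr != nil || n <= 0 {
			return 0, 0, 0, fmt.Errorf("bad size %q in %q", p, name)
		}
		sizes[i] = n * 1024
	}
	if sizes[0] > sizes[1] || sizes[1] > sizes[2] {
		return 0, 0, 0, fmt.Errorf("sizes in %q are not in ascending order", name)
	}
	return sizes[0], sizes[1], sizes[2], nil
}

func main() {
	minSize, avgSize, maxSize, err := parseChunkerName("2k-8k-64k")
	if err != nil {
		panic(err)
	}
	fmt.Println(minSize, avgSize, maxSize) // prints "2048 8192 65536"
}
```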

Results

Chunker                 Num. of chunks   Avg. chunk size (bytes)   Deduplicated size (bytes)   Deduplication ratio
reference (2k-8k-64k)   480942           9831                      4727992736                  1.73
2k-16k-64k              271140           19136                     5188457451                  1.58
2k-32k-64k              150041           37254                     5589600419                  1.46
2k-64k-128k             80946            73123                     5919028334                  1.38
4k-8k-64k               471195           10107                     4762223233                  1.72
4k-16k-64k              266596           19503                     5199438463                  1.57
4k-32k-64k              148487           37669                     5593355619                  1.46
4k-64k-128k             80332            73701                     5920577574                  1.38
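
The deduplication ratios in the table are consistent with dividing the total corpus size by the deduplicated size. That reading is inferred from the numbers and is not stated explicitly, so the following sketch is a plausibility check rather than the evaluation code; dedupRatio is a name chosen for this example.

```go
package main

import "fmt"

// dedupRatio returns the original size divided by the deduplicated size.
func dedupRatio(totalBytes, dedupBytes uint64) float64 {
	return float64(totalBytes) / float64(dedupBytes)
}

func main() {
	// Total uncompressed size of the test corpus in bytes.
	const corpus = 8182081670

	rows := []struct {
		name  string
		dedup uint64 // deduplicated size in bytes, taken from the table
	}{
		{"reference", 4727992736},
		{"2k-64k-128k", 5919028334},
	}
	for _, r := range rows {
		fmt.Printf("%-12s %.2f\n", r.name, dedupRatio(corpus, r.dedup))
	}
	// prints:
	// reference    1.73
	// 2k-64k-128k  1.38
}
```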

In terms of pure deduplication ratio, the reference parameters (2k-8k-64k) yielded the best result on the test dataset (1.73), closely followed by 4k-8k-64k (1.72). For storage systems where chunks are stored compressed, El-Shimi et al. suggest that using larger chunk sizes for CDC can improve the performance of the compression algorithm and thereby reduce the effective storage size. Whether, and to what extent, this applies to the FastCDC algorithm remains to be tested.

License

BSD-2-Clause. See LICENSE for details.

Contributors

mversiotech
