Git Product home page Git Product logo

lokad-flatfiles's Introduction

C#/.NET stand-alone library. Compresses a flat data file (TSV, CSV) using Perfect Hash Function. The library is internally used by Lokad to deliver super-fast parsing of large flat files. There is also a Java port.

Purpose

The vast majority of real-life flat data files are very redundant, as the same identifiers, dates or numeric values tend to appear on multiple lines.

However, the purpose of this software is not storage compression (if you need to keep flat files in a compressed format, GZip is a better choice).

Rather, the objective of perfect hashing is to improve the speed of parsing operations. Consider a file which contains 1000 lines, with the date 2015-01-01 appearing in 100 cells in column 3.

In a "traditional" CSV-parsing application, the sequence of steps would be as follows:

  1. A tokenizer reads the file and allocates a string matrix representing the file (some libraries will instead allow to enumerate string arrays representing lines).
  2. A date parser reads the values in column 3.

Parsing date strings is a costly operation. It is faster to parse that date once than to parse it 100 times, even taking into account the overhead involved in caching the value and recognizing that it has already been parsed. This leads to the "optimized" architecture:

  1. As before, a tokenizer produces strings representing cells.
  2. A memoization layer compares the string in column 3 against a hash table of all strings it has seen so far.
  3. If the string is not yet in the hash table, the date is parsed and the result is added to the hash table.

This is significantly better in terms of performance, but still wasteful. Lokad.FlatFiles moves the cache from step 2 to step 1:

  • The tokenizer produces only unique strings (avoiding 99 unnecessary memory allocations and text encoding operations for 2015-01-01 alone).
  • Subsequent layers (date-parsing, memoization, etc.) use integer identifiers rather than strings: comparison is faster and hash tables become simple arrays.

Behaviour

The original flat file is split into two datasets:

Content

The Content is a list of all distinct byte sequences found in the file cells.

  • The cell separator (\t or ,) and the line separator (any combination of \n and \r) are not included in the cell. Note that the parser attempts to auto-detect the separator character use for the entire file.
  • Cell contents are trimmed (initial and final spaces are not kept).
  • Cells may be quoted (start with ", end with ", any interior quotes are escaped as ""), in which case the unquoted contents are used.
  • If the file is encoded as UTF-16, cells are re-encoded a UTF-8. Otherwise, the original encoding is kept.

Cells

The Cells is a matrix that contains, for each cell in the file, the index of that cell's content in the Content list.

This is a line-based matrix, meaning that the cell at line L, column C is found at index L * Columns + C.

Thus, to retrieve the content of the original file:

Content[Cells[L * Columns + C]]

Layout

  • Program.cs is the entry point.
  • RawFlatFile.cs represents the splitting of a data file into Cells and Content.
  • RawFlatFileSerialization.cs allows serialization of the RawFlatFile.
  • InputBuffer.cs is used internally by RawFlatFile.cs for buffered reading.
  • Trie.cs implements the perfect hash function using a trie with an optimized memory allocation strategy.
  • Example/ contains example files in source (.tsv) and parsed (.rff) formats.

lokad-flatfiles's People

Contributors

vermorel avatar victornicollet avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.