
tsfile's Introduction

TsFile Document

___________    ___________.__.__          
\__    ___/____\_   _____/|__|  |   ____  
  |    | /  ___/|    __)  |  |  | _/ __ \ 
  |    | \___ \ |     \   |  |  |_\  ___/ 
  |____|/____  >\___  /   |__|____/\___  >  version 1.0.0
             \/     \/                 \/  

Introduction

TsFile is a columnar storage file format designed for time series data. It supports efficient compression, high read and write throughput, and compatibility with various frameworks such as Spark and Flink. TsFile is easy to integrate into IoT big data processing frameworks.

Time series data is becoming increasingly important in a wide range of applications, including IoT, intelligent control, finance, log analysis, and monitoring systems.

TsFile is the first standard file format for time series data. Despite the widespread presence and significance of temporal data, standardized file formats for managing it have long been absent. TsFile introduces a unified file format that helps users manage temporal data.

TsFile Features

TsFile offers several distinctive features and benefits:

  • Multi-Language Independent Use: SDKs in multiple languages can read and write TsFile directly, which suits lightweight data reading and writing scenarios.

  • Efficient Writing and Compression: A columnar storage format tailored for time series, organizing data by device and storing the data of each series contiguously to minimize storage space. Compared to CSV, the compression ratio can be increased by more than 90%.

  • High Query Performance: By indexing the device, measurement, and time dimensions, TsFile supports fast filtering and querying of temporal data over specific time ranges. Compared to general file formats, query throughput can be increased by 2-10 times.

  • Open Integration: TsFile is the underlying storage file format of the time series database IoTDB, and can form a pluggable architecture with IoTDB that separates storage from computing. TsFile is compatible with Spark, Flink, and other big data software, establishing seamless ecosystem integration that ensures interoperability across different data processing environments and enables deep analysis of temporal data across ecosystems.

TsFile Basic Concepts

TsFile can manage the time series data of multiple devices. Each device can have different measurements.

Each measurement of each device corresponds to one time series.

The TsFile schema defines a set of measurements for all devices, as shown in the table below (m1~m5).

Time  deviceId  m1  m2  m3  m4  m5
1     device1   1   2   3
2     device1   1   2   3
3     device2   1       3   4   5
4     device2   1       3   4   5
5     device3   1   2   3   4   5

Among them, Time and deviceId are built-in fields that do not need to be defined and can be written directly.
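
The data model above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: SeriesTable, SeriesKey, and Point are hypothetical names, not part of any TsFile SDK, and values are fixed to int32_t for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory model, not the TsFile API: one time series per
// (deviceId, measurement) pair, stored as (time, value) points.
using SeriesKey = std::pair<std::string, std::string>;
using Point = std::pair<int64_t, int32_t>;

class SeriesTable {
public:
    // Append a point to the series identified by device and measurement.
    void write(const std::string& device, const std::string& measurement,
               int64_t time, int32_t value) {
        series_[SeriesKey(device, measurement)].push_back(Point(time, value));
    }

    // Number of distinct time series written so far.
    std::size_t series_count() const { return series_.size(); }

    // Number of points in one series (0 if it was never written).
    std::size_t point_count(const std::string& device,
                            const std::string& measurement) const {
        auto it = series_.find(SeriesKey(device, measurement));
        return it == series_.end() ? 0 : it->second.size();
    }

private:
    std::map<SeriesKey, std::vector<Point>> series_;
};
```

Writing two points to (device1, m1) and one to (device1, m2) yields two series, which mirrors the statement that each measurement of each device is its own time series.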

TsFile Design

File Structure

Like other columnar file formats, TsFile adopts a columnar storage design, primarily to optimize storage efficiency and query performance for time series data. This design fits the nature of time series data, which often involves large volumes of values of the same type recorded over time. What distinguishes TsFile is its structure of pages, chunks, chunk groups, and an index:

  • Page: The basic unit for storing time series data, sorted by time in ascending order with separate columns for timestamps and values.

  • Chunk: Comprising metadata headers and several pages, each chunk belongs to one time series, with variable sizes allowing for different compression and encoding methods.

  • Chunk Group: A set of chunks that hold one or more series of a single device written in the same period, which facilitates efficient query processing.

  • Index: The file metadata at the end of TsFile contains a chunk-level index and file-level statistics for efficient data access.
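
The nesting of pages, chunks, and chunk groups can be sketched as plain C++ structs. This is a simplified in-memory picture, not the on-disk layout: all type and function names here are hypothetical, values are fixed to double, and the sketch assumes timestamps within a page are strictly ascending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory types; the real file layout is more involved.
struct Page {
    std::vector<int64_t> times;   // timestamps in ascending order
    std::vector<double> values;   // values[i] belongs to times[i]
};

struct Chunk {
    std::string measurement;      // a chunk belongs to exactly one series
    std::vector<Page> pages;
};

struct ChunkGroup {
    std::string device;           // all chunks belong to this device
    std::vector<Chunk> chunks;
};

// Append a point to a page; reject it if it would break ascending time order.
bool append_point(Page& page, int64_t time, double value) {
    if (!page.times.empty() && time <= page.times.back()) return false;
    page.times.push_back(time);
    page.values.push_back(value);
    return true;
}

// Count the points stored in a chunk group across all chunks and pages.
std::size_t point_count(const ChunkGroup& group) {
    std::size_t total = 0;
    for (const Chunk& chunk : group.chunks) {
        for (const Page& page : chunk.pages) total += page.times.size();
    }
    return total;
}
```

Keeping timestamps and values in separate vectors mirrors the separate time and value columns of a page.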

[Figure: TsFile Architecture]

Encoding and Compression

TsFile employs encoding and compression techniques tailored to time series data. It uses methods such as run-length encoding (RLE), bit-packing, and Snappy, and encodes timestamp and value columns separately for better data processing. Its encoding algorithms are designed for the characteristics of time series data in IoT scenarios, focusing on regular time intervals, the correlation among series, and the correlation between time attributes and data.
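
To see why regular time intervals compress well, consider the following minimal sketch. It is not the TsFile byte format or any of its encoders: it takes first-order differences of the timestamps and then applies a generic run-length encoding, so that evenly spaced timestamps collapse into a single run. The names deltas, rle_encode, rle_decode, and Run are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Run = std::pair<int64_t, uint32_t>;  // (value, repeat count)

// First-order differences: out[i] = in[i + 1] - in[i].
std::vector<int64_t> deltas(const std::vector<int64_t>& in) {
    std::vector<int64_t> out;
    for (std::size_t i = 1; i < in.size(); ++i) {
        out.push_back(in[i] - in[i - 1]);
    }
    return out;
}

// Generic run-length encoding: collapse consecutive equal values into runs.
std::vector<Run> rle_encode(const std::vector<int64_t>& in) {
    std::vector<Run> runs;
    for (int64_t v : in) {
        if (!runs.empty() && runs.back().first == v) {
            ++runs.back().second;
        } else {
            runs.push_back(Run(v, 1u));
        }
    }
    return runs;
}

// Inverse of rle_encode: expand each run back into repeated values.
std::vector<int64_t> rle_decode(const std::vector<Run>& runs) {
    std::vector<int64_t> out;
    for (const Run& r : runs) out.insert(out.end(), r.second, r.first);
    return out;
}
```

For timestamps sampled every 10 time units, the differences are all 10 and the run-length output is a single (10, 4) pair for five timestamps.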

The table below compares 3 file formats in different dimensions.

TsFile, CSV and Parquet in Comparison

Dimension        TsFile        CSV    Parquet
Data Model       IoT           Plain  Nested
Write Mode       Tablet, Line  Line   Line
Compression      Yes           No     Yes
Read Mode        Query, Scan   Scan   Query
Index on Series  Yes           No     No
Index on Time    Yes           No     No

TsFile's design targets efficient encoding, compression, and access for time series data, providing a foundation for efficient, scalable, and flexible data analytics platforms.

Data Type  Recommended Encoding  Recommended Compression
INT32      TS_2DIFF              LZ4
INT64      TS_2DIFF              LZ4
FLOAT      GORILLA               LZ4
DOUBLE     GORILLA               LZ4
BOOLEAN    RLE                   LZ4
TEXT       DICTIONARY            LZ4
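
The recommendations in the table can be captured as a small lookup. The enums and the function below are hypothetical and only mirror the table; they are not the type names used by the TsFile SDKs. The recommended compression is LZ4 for every listed type, so it needs no lookup.

```cpp
#include <cassert>

// Hypothetical enums that mirror the table above.
enum class DataType { INT32, INT64, FLOAT, DOUBLE, BOOLEAN, TEXT };
enum class Encoding { TS_2DIFF, GORILLA, RLE, DICTIONARY };

// Return the recommended encoding for a data type, following the table.
Encoding recommended_encoding(DataType type) {
    switch (type) {
        case DataType::INT32:
        case DataType::INT64:
            return Encoding::TS_2DIFF;
        case DataType::FLOAT:
        case DataType::DOUBLE:
            return Encoding::GORILLA;
        case DataType::BOOLEAN:
            return Encoding::RLE;
        case DataType::TEXT:
            break;
    }
    return Encoding::DICTIONARY;  // TEXT
}
```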

For more details, see the Docs.

Build and Use TsFile

Java

C++

Python

tsfile's People

Contributors

761417898, alima777, asdf2014, beyyes, choubenson, chrisdutz, cpaulyz, critaswang, dependabot[bot], ericpai, fanhualta, genius-pig, heimingz, hthou, jackietien97, jixuan1989, jt2594838, julianfeinauer, lancelly, leirui, little-emotion, liuminghui233, liutaohua, plutooooooo, qiaojialin, samperson1997, shuwenwei, silvernarcissus, steveyurongsu, thumarklau

tsfile's Issues

[CPP] Memory Usage Analysis about TsFile CPP

Recently I analyzed the memory usage when using CPP to insert data with a Tablet. Based on the current implementation, there is still some work to be done on the memory management of CPP. For specific loads and memory requirements, we need to provide a suitable and accurate calculation method.

Publish tsfile Python package on PyPI

The tsfile Python code is already available in the repository (https://github.com/apache/tsfile/tree/develop/python). Publishing this package to PyPI would significantly enhance the user experience for Python developers. By making the tsfile package easily accessible through PyPI, it would simplify the installation process, reduce setup time, and increase the overall adoption of tsfile among the Python community. This step would undoubtedly contribute to broader usage and community engagement with tsfile.

[CPP] Multi-flush error when data is minimal

When only a small amount of data is written, multiple flush operations may cause exceptions while the file index is being written.

The following test passes (the second batch writes eight records):

TEST_F(TsFileWriterTest, MultiFlush) {
    std::string device_path = "device1";
    std::string measurement_name = "temperature";
    common::TSDataType data_type = common::TSDataType::INT32;
    common::TSEncoding encoding = common::TSEncoding::PLAIN;
    common::CompressionType compression_type =
        common::CompressionType::UNCOMPRESSED;
    ASSERT_EQ(tsfile_writer_->register_timeseries(device_path, measurement_name,
                                                  data_type, encoding,
                                                  compression_type),
              E_OK);
    for (int i = 1; i < 2; i++) {
        TsRecord record(i, device_path);
        DataPoint point(measurement_name, i);
        record.append_data_point(point);
        ASSERT_EQ(tsfile_writer_->write_record(record), E_OK);
    }
    ASSERT_EQ(tsfile_writer_->flush(), E_OK);

    for (int i = 2; i < 10; i++) {
        TsRecord record(i, device_path);
        DataPoint point(measurement_name, i);
        record.append_data_point(point);
        ASSERT_EQ(tsfile_writer_->write_record(record), E_OK);
    }
    ASSERT_EQ(tsfile_writer_->flush(), E_OK);
    ASSERT_EQ(tsfile_writer_->close(), E_OK);
}

The following test fails (the second batch writes a single record):

TEST_F(TsFileWriterTest, MultiFlush) {
    std::string device_path = "device1";
    std::string measurement_name = "temperature";
    common::TSDataType data_type = common::TSDataType::INT32;
    common::TSEncoding encoding = common::TSEncoding::PLAIN;
    common::CompressionType compression_type =
        common::CompressionType::UNCOMPRESSED;
    ASSERT_EQ(tsfile_writer_->register_timeseries(device_path, measurement_name,
                                                  data_type, encoding,
                                                  compression_type),
              E_OK);
    for (int i = 1; i < 2; i++) {
        TsRecord record(i, device_path);
        DataPoint point(measurement_name, i);
        record.append_data_point(point);
        ASSERT_EQ(tsfile_writer_->write_record(record), E_OK);
    }
    ASSERT_EQ(tsfile_writer_->flush(), E_OK);

    for (int i = 2; i < 3; i++) {
        TsRecord record(i, device_path);
        DataPoint point(measurement_name, i);
        record.append_data_point(point);
        ASSERT_EQ(tsfile_writer_->write_record(record), E_OK);
    }
    ASSERT_EQ(tsfile_writer_->flush(), E_OK);
    ASSERT_EQ(tsfile_writer_->close(), E_OK);
}

[CPP] Support new datatypes

We have introduced new datatypes such as DATE, TIMESTAMP, BLOB, and STRING in TsFile-Java, which should also be supported in TsFile-CPP.
TEXT, which is an important type, also appears to be unsupported so far.

[CPP] Add unit test

Does the TsFile_CPP project require unit tests? Can we use third-party libraries such as GoogleTest (gtest) or CppUnit?

[CPP] Fix TS_2Diff

It is reported that the TS_2Diff decoder in TsFile-CPP mistakenly skips one byte, which produces an incorrect decoding result.

Determining time encoding and compression in a TsFile

The time compression and encoding can be specified via config. However, this information is not stored in the TsFile itself, so the time compression and encoding cannot be determined from the TsFile alone.

TsFile V4 has a property map at the end of the file, so we can, and should, record the time compression and encoding in that map.

For TsFile V3 and older, unfortunately, there seems to be no good solution. We could still provide a protection mechanism: if decoding or decompressing a time page fails, the reader automatically tries another encoding or compression method until one succeeds or all methods have failed.
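
The protection mechanism for older files could look roughly like the sketch below. Everything here is hypothetical: TimeDecoder, decode_with_fallback, and the candidate list are illustrative names, not TsFile-CPP APIs. The sketch also assumes that decoded timestamps must be strictly ascending, and uses that as a sanity check because a wrong decoder may "succeed" and return garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;
// Hypothetical decoder signature: returns false if the page cannot be decoded.
using TimeDecoder = std::function<bool(const Bytes&, std::vector<int64_t>&)>;

// Sanity check (assumption): timestamps in a page are strictly ascending.
bool is_strictly_ascending(const std::vector<int64_t>& times) {
    for (std::size_t i = 1; i < times.size(); ++i) {
        if (times[i] <= times[i - 1]) return false;
    }
    return true;
}

// Try each candidate decoder in order. Returns the index of the first one
// that succeeds and yields plausible timestamps, or -1 if all of them fail.
// `out` is only modified on success.
int decode_with_fallback(const Bytes& page,
                         const std::vector<TimeDecoder>& candidates,
                         std::vector<int64_t>& out) {
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::vector<int64_t> attempt;
        if (candidates[i](page, attempt) && is_strictly_ascending(attempt)) {
            out = std::move(attempt);
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Putting the configured method first in the candidate list keeps the common case at a single decode attempt.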

[CPP] Bitmap clear method cannot execute correctly.

TEST_F(BitMapTest, Clear) {
    common::BitMap bitmap;
    bitmap.init(100);
    bitmap.set(10);
    bitmap.set(20);
    bitmap.clear(10);
    EXPECT_FALSE(bitmap.test(10));  // This expectation fails
    EXPECT_TRUE(bitmap.test(20));
}

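To clear a bit correctly, the line
    *start_addr = (*start_addr) ^ (~bit_mask);
should be changed to
    *start_addr = (*start_addr) & ~bit_mask;
XOR with ~bit_mask leaves the target bit unchanged and flips every other bit in the byte, whereas AND with ~bit_mask clears only the target bit.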

Bugfix: #114
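
The difference between the two expressions can be checked on a single byte. The helper names below are hypothetical and operate on a plain uint8_t rather than on the BitMap class.

```cpp
#include <cassert>
#include <cstdint>

// Buggy variant: XOR with the inverted mask keeps the target bit as it was
// and flips every other bit in the byte.
inline uint8_t clear_bit_xor(uint8_t byte, uint8_t bit_mask) {
    return static_cast<uint8_t>(byte ^ ~bit_mask);
}

// Correct variant: AND with the inverted mask clears only the target bit.
inline uint8_t clear_bit_and(uint8_t byte, uint8_t bit_mask) {
    return static_cast<uint8_t>(byte & ~bit_mask);
}
```

With byte 0x05 (bits 0 and 2 set) and mask 0x04, the AND variant yields 0x01, while the XOR variant yields 0xFE: bit 2 is still set and all other bits are inverted.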

[CPP] Inconsistent encoding implementation

I noticed that the DictionaryEncoder in the CPP edition uses a BitPackingEncoder as its inner encoder. In the Java edition, however, the inner encoder is an RleEncoder (with bit-packing), which may cause incompatibility.

All encodings and compression in the CPP edition should be checked to ensure compatibility.

[CPP] Fix SimpleListNode::remove

The "remove" method fails to update the value of "size_", and incorrectly returns "common::E_NOT_EXIST" after finding the element.
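
For illustration, the intended behavior of such a remove method can be sketched on a minimal singly linked list. IntList and ListStatus are hypothetical stand-ins, not the SimpleListNode code from TsFile-CPP: the point is that a successful removal decrements the size and returns the success code.

```cpp
#include <cassert>
#include <cstddef>

enum ListStatus { LIST_OK = 0, LIST_NOT_EXIST = 1 };

// Hypothetical minimal list used only to illustrate the expected contract.
class IntList {
public:
    IntList() : head_(nullptr), size_(0) {}
    ~IntList() {
        while (head_ != nullptr) {
            Node* next = head_->next;
            delete head_;
            head_ = next;
        }
    }
    IntList(const IntList&) = delete;
    IntList& operator=(const IntList&) = delete;

    void push_front(int value) {
        head_ = new Node{value, head_};
        ++size_;
    }

    // Remove the first node holding `value`.
    ListStatus remove(int value) {
        Node** link = &head_;  // pointer to the link that may need rewiring
        while (*link != nullptr) {
            if ((*link)->value == value) {
                Node* victim = *link;
                *link = victim->next;
                delete victim;
                --size_;         // keep the size in sync with the nodes
                return LIST_OK;  // found: report success
            }
            link = &(*link)->next;
        }
        return LIST_NOT_EXIST;
    }

    std::size_t size() const { return size_; }

private:
    struct Node {
        int value;
        Node* next;
    };
    Node* head_;
    std::size_t size_;
};
```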

[CPP] The serialization and deserialization logic of TsFileMeta do not match

The serialization and deserialization logic of TsFileMeta do not match: serialization does not write the bloom filter, but deserialization expects to read it.

    int serialize_to(common::ByteStream &out) {
        int ret = common::E_OK;
        if (RET_FAIL(index_node_->serialize_to(out))) {
        } else if (RET_FAIL(common::SerializationUtil::write_i64(meta_offset_,
                                                                 out))) {
        }
        return ret;
    }

    int deserialize_from(common::ByteStream &in) {
        int ret = common::E_OK;
        void *index_node_buf = page_arena_->alloc(sizeof(MetaIndexNode));
        void *bloom_filter_buf = page_arena_->alloc(sizeof(BloomFilter));
        if (IS_NULL(index_node_buf) || IS_NULL(bloom_filter_buf)) {
            return common::E_OOM;
        }
        index_node_ = new (index_node_buf) MetaIndexNode(page_arena_);
        bloom_filter_ = new (bloom_filter_buf) BloomFilter();

        if (RET_FAIL(index_node_->deserialize_from(in))) {
        } else if (RET_FAIL(
                       common::SerializationUtil::read_i64(meta_offset_, in))) {
        } else if (RET_FAIL(bloom_filter_->deserialize_from(in))) {
        }
        return ret;
    }
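
A symmetric pair of functions reads back exactly the fields that were written, in the same order. The sketch below shows this on a simplified stand-in: FileMeta, Buffer, and the helper functions are hypothetical, the index node is omitted, and the bloom filter is reduced to a length-prefixed byte array. It is not the TsFileMeta wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Hypothetical, simplified stand-in for the file metadata.
struct FileMeta {
    int64_t meta_offset;
    Buffer bloom_bits;
};

// Write a 64-bit integer in little-endian byte order.
void write_i64(Buffer& out, int64_t v) {
    uint64_t u = static_cast<uint64_t>(v);
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<uint8_t>((u >> (8 * i)) & 0xFF));
    }
}

// Read a little-endian 64-bit integer at `pos`; fails on truncated input.
bool read_i64(const Buffer& in, std::size_t& pos, int64_t& v) {
    if (pos > in.size() || in.size() - pos < 8) return false;
    uint64_t u = 0;
    for (int i = 0; i < 8; ++i) {
        u |= static_cast<uint64_t>(in[pos + i]) << (8 * i);
    }
    pos += 8;
    v = static_cast<int64_t>(u);
    return true;
}

// Field order: meta_offset, bloom length, bloom bytes.
void serialize(const FileMeta& meta, Buffer& out) {
    write_i64(out, meta.meta_offset);
    write_i64(out, static_cast<int64_t>(meta.bloom_bits.size()));
    out.insert(out.end(), meta.bloom_bits.begin(), meta.bloom_bits.end());
}

// Reads the same fields in the same order as serialize() wrote them.
bool deserialize(const Buffer& in, FileMeta& meta) {
    std::size_t pos = 0;
    int64_t len = 0;
    if (!read_i64(in, pos, meta.meta_offset)) return false;
    if (!read_i64(in, pos, len)) return false;
    if (len < 0 || static_cast<uint64_t>(len) > in.size() - pos) return false;
    auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    meta.bloom_bits.assign(first, first + static_cast<std::ptrdiff_t>(len));
    return true;
}
```

A round-trip test (serialize, then deserialize, then compare) is a cheap way to catch this kind of mismatch early.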

[CPP] Fix value.h

Additional checks need to be added around delete and assert to prevent the program from crashing unexpectedly.
Here is the MR: #126

[CPP] Fix BitPackDecoder::~BitPackDecoder()

The BitPackDecoder destructor frees the variables current_buffer_, packer_, and tmp_buf, but the constructor performs no corresponding allocation. This causes the program to crash.

    ~BitPackDecoder() { destroy(); }

    void destroy() { /* do nothing for BitpackEncoder */
        delete (packer_);
        delete[] current_buffer_;
        common::mem_free(tmp_buf);
    }
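
One common way to make such a destructor safe is to initialize every owned pointer to nullptr in the constructor and to reset it after release, since delete, delete[], and free are all no-ops on a null pointer. The sketch below assumes the crash comes from releasing uninitialized pointers, which the issue does not state explicitly. BufferOwner is a hypothetical class, not the BitPackDecoder code, and it uses std::malloc and std::free in place of the project's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical owner of two buffers, illustrating a destroy() that is safe
// to call before init() and safe to call more than once.
class BufferOwner {
public:
    BufferOwner() : buffer_(nullptr), scratch_(nullptr) {}
    ~BufferOwner() { destroy(); }
    BufferOwner(const BufferOwner&) = delete;
    BufferOwner& operator=(const BufferOwner&) = delete;

    // Allocate both buffers; n is assumed to be greater than zero.
    bool init(std::size_t n) {
        destroy();
        buffer_ = new int64_t[n];
        scratch_ = static_cast<uint8_t*>(std::malloc(n));
        if (scratch_ == nullptr) {
            destroy();
            return false;
        }
        return true;
    }

    // Release everything and reset the pointers so a second call is harmless.
    void destroy() {
        delete[] buffer_;     // no-op when buffer_ is nullptr
        buffer_ = nullptr;
        std::free(scratch_);  // no-op when scratch_ is nullptr
        scratch_ = nullptr;
    }

    bool initialized() const { return buffer_ != nullptr; }

private:
    int64_t* buffer_;
    uint8_t* scratch_;
};
```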

[CPP] Inconsistent output files under Windows and Unix

The files generated by the programs in example and benchmark differ between Unix and Windows. On Windows, if a file is opened without the O_BINARY flag, the system replaces '\n' with '\r\n' when the file is written.
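
With C++ streams, the counterpart of O_BINARY is the std::ios::binary open mode, which disables newline translation. The helper names below are hypothetical and the sketch uses standard streams rather than the file abstraction of TsFile-CPP.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Write bytes to a file in binary mode so that '\n' is stored as one byte
// on every platform.
bool write_binary(const std::string& path, const std::string& bytes) {
    std::ofstream out(path.c_str(),
                      std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return out.good();
}

// Return the file size in bytes, or -1 if the file cannot be opened.
long file_size(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long>(in.tellg());
}
```

Writing the four bytes "a\nb\n" this way produces a four-byte file on both Windows and Unix, whereas a text-mode stream on Windows would produce six bytes.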
