apache / orc
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
Home Page: https://orc.apache.org/
License: Apache License 2.0
Where can I find sample code for using ORC column encryption when writing ORC files with the Spark data source?
It seems that CyrusSASL is not configured correctly in GitHub Action so far.
-- Could NOT find CyrusSASL (missing: CYRUS_SASL_SHARED_LIB CYRUS_SASL_INCLUDE_DIR)
This is resolved via #1355
Hi, I have a problem. My test data is TPC-DS 1G, Spark 3.2, ORC version 1.6.11.
Test SQL: select count(1) from call_center_orc where cc_call_center_sk > 100;
cc_call_center_sk is the first column in call_center_orc, and predicate pushdown works.
But when I test select count(1) from call_center_orc where cc_company > 100;
cc_company is not the first column, and predicate pushdown does not work.
I debugged the code and found the problem is in SchemaEvolution.ppdSafeConversion: in my case, result.size is 2, but in pickRowGroups the columnIx is the column index in the ORC metadata, which is 19 for cc_company. This causes ORC not to evaluate pushdown filters against the row-group stats, so it cannot skip the row group.
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 9, Debian 10, Debian 11, Ubuntu 16, Ubuntu 18, Ubuntu 20) passed.
Please vote on releasing the following candidate as Apache ORC version 1.7.4.
[ ] +1 Release this package as Apache ORC 1.7.4
[ ] -1 Do not release this package because ...
TAG:
https://github.com/apache/orc/releases/tag/v1.7.4-rc0
RELEASE FILES:
https://dist.apache.org/repos/dist/dev/orc/v1.7.4-rc0
STAGING REPOSITORY:
https://repository.apache.org/content/repositories/orgapacheorc-1056
LIST OF ISSUES:
https://issues.apache.org/jira/projects/ORC/versions/12351349
https://github.com/apache/orc/milestone/7?closed=1
This vote will be open for 72 hours.
Regards,
William
branch-1.6 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, and Java 15 are tested.
AppleClang 12.0.0.12000032 and g++ (CentOS 7, Debian 9, Debian 10, Debian 11, Ubuntu 16, Ubuntu 18, Ubuntu 20) passed except one known issue, testLzoLong.
Failed tests:
[ FAILED ] TestDecompression.testLzoLong (0 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] TestDecompression.testLzoLong
1 FAILED TEST
FAILED debian11
Ubuntu / MacOS / Windows.
$ ./run-all.sh apache branch-1.8
$ tail -n2 logs/*log
==> logs/centos7-test.log <==
Built target test-out
Finished centos7 at Mon Nov 28 22:32:44 PST 2022
==> logs/debian10-test.log <==
Built target test-out
Finished debian10 at Mon Nov 28 22:38:16 PST 2022
==> logs/debian10_jdk=11-test.log <==
Built target test-out
Finished debian10_jdk=11 at Mon Nov 28 22:41:52 PST 2022
==> logs/debian11-test.log <==
Built target test-out
Finished debian11 at Mon Nov 28 22:39:43 PST 2022
==> logs/fedora37-test.log <==
Built target test-out
Finished fedora37 at Mon Nov 28 22:39:08 PST 2022
==> logs/ubuntu18-test.log <==
Built target test-out
Finished ubuntu18 at Mon Nov 28 22:37:51 PST 2022
==> logs/ubuntu20-test.log <==
Built target test-out
Finished ubuntu20 at Mon Nov 28 22:42:01 PST 2022
==> logs/ubuntu20_jdk=11-test.log <==
Built target test-out
Finished ubuntu20_jdk=11 at Mon Nov 28 22:43:10 PST 2022
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Mon Nov 28 22:39:06 PST 2022
==> logs/ubuntu22-test.log <==
Built target test-out
Finished ubuntu22 at Mon Nov 28 22:40:22 PST 2022
AFTER VOTE
For a query engine like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size".
But for different kinds of tables, the row count is difficult to control this way.
For Presto, a normal OLAP query reads only a subset of table columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may become too low.
So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
A similar config has been introduced to cuDF (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261
It seems that we have broken images at the following pages.
This is resolved via #1340
apache/orc-dev was created via https://issues.apache.org/jira/browse/INFRA-23534.
- docker test images to apache/orc-dev
- site build image to apache/orc-dev
- docker script
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 10, Debian 11, Ubuntu 18, Ubuntu 20, Ubuntu 22) passed.
This is resolved via #1354
While building for Arch Linux, I've encountered some test failures (after backporting ffbd341):
[ RUN ] TestPredicateLeaf.testIntNullSafeEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:635: Failure
Expected: TruthValue::YES_NO
Which is: 4-byte object <05-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100), &bf)
Which is: 4-byte object <01-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntNullSafeEqualsBloomFilter (1 ms)
[ RUN ] TestPredicateLeaf.testIntEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:652: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testIntInBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:667: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:670: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntInBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateNullSafeEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:778: Failure
Expected: TruthValue::YES_NO
Which is: 4-byte object <05-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0), &bf)
Which is: 4-byte object <01-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateNullSafeEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:795: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateInBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:812: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:815: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateInBloomFilter (0 ms)
[ RUN ] TestBloomFilter.testBloomFilterBasicOperations
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(288)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(246)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(306)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(228)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(458)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(545)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(717)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(526)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(40)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(480)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(86)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(308)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(335)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(108)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(535)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(279)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(15)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(54)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(680)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(818)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(434)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(232)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:154: Failure
Value of: bloomFilter.testLong(111)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:157: Failure
Value of: bloomFilter.testLong(-1)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:159: Failure
Value of: bloomFilter.testLong(-111)
Actual: false
Expected: true
[ FAILED ] TestBloomFilter.testBloomFilterBasicOperations (0 ms)
[ RUN ] TestBloomFilter.testBloomFilterSerialization
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:262: Failure
Value of: dstBloomFilter->testLong(11)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:263: Failure
Value of: dstBloomFilter->testLong(111)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:267: Failure
Value of: dstBloomFilter->testLong(-11)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:268: Failure
Value of: dstBloomFilter->testLong(-111)
Actual: false
Expected: true
[ FAILED ] TestBloomFilter.testBloomFilterSerialization (0 ms)
And in the second suite:
[ RUN ] TestFileScan.testErrorHandling
/build/apache-orc/src/orc-1.7.3/tools/test/TestFileScan.cc:209: Failure
Expected: (std::string::npos) != (error.find(error_msg)), actual: 18446744073709551615 vs 18446744073709551615
[ FAILED ] TestFileScan.testErrorHandling (27 ms)
We build with these CMake flags:
-DCMAKE_CXX_FLAGS="${CXXFLAGS} -fPIC -ffat-lto-objects" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="/usr" \
-DLZ4_HOME="/usr" \
-DPROTOBUF_HOME="/usr" \
-DSNAPPY_HOME="/usr" \
-DZLIB_HOME="/usr" \
-DZSTD_HOME="/usr" \
-DORC_PREFER_STATIC_ZLIB=OFF \
-DBUILD_LIBHDFSPP=OFF \
-DBUILD_JAVA=OFF \
-DINSTALL_VENDORED_LIBS=OFF
Please tell me what I can provide to assist in resolving these failures.
Greetings,
I'm learning to work with ORC in C++, and I think I'm stuck and don't quite understand how to set an array's offsets. Precisely, the following code, when executed, produces the following exception: "Caught exception in test-file.orc: bad read in nextBuffer"
void write_orc()
{
    using namespace orc;
    ORC_UNIQUE_PTR<OutputStream> outStream = writeLocalFile("test-file.orc");
    ORC_UNIQUE_PTR<Type> schema(
        Type::buildTypeFromString("struct<id:int,list1:array<string>>"));
    WriterOptions options;
    ORC_UNIQUE_PTR<Writer> writer = createWriter(*schema, outStream.get(), options);
    uint64_t batch_size = 1024, row_count = 2048;
    std::unique_ptr<ColumnVectorBatch> batch =
        writer->createRowBatch(row_count);
    StructVectorBatch &root_batch =
        dynamic_cast<StructVectorBatch &>(*batch.get());
    LongVectorBatch &id_batch =
        dynamic_cast<LongVectorBatch &>(*root_batch.fields[0]);
    ListVectorBatch &list_batch =
        dynamic_cast<ListVectorBatch &>(*root_batch.fields[1]);
    StringVectorBatch &str_batch =
        dynamic_cast<StringVectorBatch &>(*list_batch.elements.get());
    std::vector<std::string> vs{"str1", "str2"};
    char **data = str_batch.data.data();
    int64_t *offsets = list_batch.offsets.data();
    uint64_t offset = 0, rows = 0;
    for (size_t i = 0; i < row_count; ++i) {
        offsets[rows] = static_cast<int64_t>(offset);
        id_batch.data[rows] = static_cast<int64_t>(i); // id value (stands in for the original data source)
        for (auto &s : vs)
        {
            data[offset] = &s[0];
            str_batch.length[offset++] = s.size();
        }
        rows++;
        if (rows == batch_size)
        {
            root_batch.numElements = rows;
            id_batch.numElements = rows;
            list_batch.numElements = rows;
            writer->add(*batch);
            rows = 0;
            offset = 0;
        }
    }
    if (rows != 0)
    {
        root_batch.numElements = rows;
        id_batch.numElements = rows;
        list_batch.numElements = rows;
        writer->add(*batch);
        rows = 0;
        offset = 0;
    }
    writer->close();
}
My question is: what exactly am I doing wrong when setting the list's offsets?
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, Ubuntu 22.04, MacOS 11.6, MacOS 12.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 10, Debian 11, Ubuntu 18, Ubuntu 20, Ubuntu 22) passed.
Mac11/12 and Windows are tested; CentOS7, Debian10/11, and Ubuntu18/20/22 passed.
==> logs/centos7-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 116.46 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 30.60 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 903.70 sec
Built target test-out
Finished centos7 at Sun Nov 13 20:16:56 PST 2022
==> logs/debian10-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 94.99 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 19.61 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 788.09 sec
Built target test-out
Finished debian10 at Sun Nov 13 20:19:40 PST 2022
==> logs/debian10_jdk=11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 90.23 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 14.42 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 793.46 sec
Built target test-out
Finished debian10_jdk=11 at Sun Nov 13 20:20:59 PST 2022
==> logs/debian11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 115.32 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 26.52 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 858.74 sec
Built target test-out
Finished debian11 at Sun Nov 13 20:16:44 PST 2022
==> logs/ubuntu18-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 114.11 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 27.41 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 881.91 sec
Built target test-out
Finished ubuntu18 at Sun Nov 13 20:16:48 PST 2022
==> logs/ubuntu20-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 85.41 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 12.85 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 680.50 sec
Built target test-out
Finished ubuntu20 at Sun Nov 13 20:21:22 PST 2022
==> logs/ubuntu20_jdk=11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 82.77 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 11.02 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 613.05 sec
Built target test-out
Finished ubuntu20_jdk=11 at Sun Nov 13 20:22:28 PST 2022
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 114.86 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 26.20 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 925.71 sec
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Sun Nov 13 20:17:08 PST 2022
==> logs/ubuntu22-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 96.34 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 20.23 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 797.71 sec
Built target test-out
Finished ubuntu22 at Sun Nov 13 20:19:38 PST 2022
AFTER VOTE
Padding length and Padding ratio: what do they stand for? If there's a number there, does it affect programs reading this file?
File length: 548692248 bytes
Padding length: 2381182 bytes
Padding ratio: 0.43%
When Spark reads an ORC zlib file, it sometimes gets stuck for more than half an hour, and we don't know whether the file was written incorrectly.
Below is the code to reproduce the issue. It works when removing the empty struct column "col2", or writing a small number of rows, or changing the value to "rand() % 100".
Am I doing anything wrong?
This is on version 1.7.2.
Code:
WriterOptions options;
auto stream = writeLocalFile("orc_file_test");
MemoryPool* pool = getDefaultPool();
std::unique_ptr<Type> type(Type::buildTypeFromString(
"struct<col0:struct<col1:int>,col2:struct<col3:int>>"));
size_t num = 50000;
std::unique_ptr<Writer> writer = createWriter(*type, stream.get(), options);
std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(num);
StructVectorBatch* structBatch =
dynamic_cast<StructVectorBatch*>(batch.get());
StructVectorBatch* structBatch2 =
dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
LongVectorBatch* intBatch =
dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);
StructVectorBatch* structBatch3 =
dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);
LongVectorBatch* intBatch2 =
dynamic_cast<LongVectorBatch*>(structBatch3->fields[0]);
structBatch->numElements = num;
structBatch2->numElements = num;
structBatch3->numElements = num;
structBatch3->hasNulls = true;
for (int64_t i = 0; i < num; ++i) {
intBatch->data.data()[i] = rand() % 150000;
intBatch->notNull[i] = 1;
intBatch2->notNull[i] = 0;
intBatch2->hasNulls = true;
structBatch3->notNull[i] = 0;
}
intBatch->hasNulls = false;
writer->add(*batch);
writer->close();
ReaderOptions readOptions;
readOptions.setMemoryPool(*getDefaultPool());
auto reader = createReader(readLocalFile("orc_file_test"), readOptions);
orc::RowReaderOptions rowOptions;
rowOptions.searchArgument(
SearchArgumentFactory::newBuilder()
->startAnd()
.equals(2, PredicateDataType::LONG, Literal((int64_t)5))
.end()
.build());
std::unique_ptr<RowReader> rowReader = reader->createRowReader(rowOptions);
batch = rowReader->createRowBatch(num);
structBatch = dynamic_cast<StructVectorBatch*>(batch.get());
structBatch2 = dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
intBatch = dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);
structBatch3 = dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);
while (rowReader->next(*batch)) {
for (size_t i = 0; i < batch->numElements; i++) {
}
}
Stack trace:
terminate called after throwing an instance of 'orc::ParseError'
what(): bad read in nextBuffer
*** Aborted at 1666816640 (Unix time, try 'date -d @1666816640') ***
*** Signal 6 (SIGABRT) (0x2035c0002b7ad) received by PID 178093 (pthread TID 0x7ffb12545a80) (linux TID 178093) (maybe from PID 178093, UID 131932) (code: -6), stack trace: ***
@ 0000000000000000 (unknown)
@ 000000000009c9d3 __GI___pthread_kill
@ 00000000000444ec __GI_raise
@ 000000000002c432 __GI_abort
@ 00000000000a3fd4 __gnu_cxx::__verbose_terminate_handler()
@ 00000000000a1b39 __cxxabiv1::__terminate(void (*)())
@ 00000000000a1ba4 std::terminate()
@ 00000000000a1e6f __cxa_throw
@ 0000000001efcd55 __cxa_throw
@ 00000000075b676c orc::BooleanRleDecoderImpl::seek(orc::PositionProvider&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ByteRLE.cc:526
@ 00000000075af711 orc::IntegerColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:120
@ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
@ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
@ 0000000007598179 orc::RowReaderImpl::seekToRowGroup(unsigned int)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:440
@ 000000000759d700 orc::RowReaderImpl::startNextStripe()
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1037
@ 000000000759daf4 orc::RowReaderImpl::next(orc::ColumnVectorBatch&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1055
@ 0000000002fba9bc main
@ 000000000002c656 __libc_start_call_main
@ 000000000002c717 __libc_start_main_alias_2
@ 0000000002fb2780 _start
I created an ORC file by the code as follows.
val data = Seq(
("", "2022-01-32"), // pay attention to this, null
("", "9808-02-30"), // pay attention to this, 9808-02-29
("", "2022-06-31"), // pay attention to this, 2022-06-30
)
val cols = Seq("str", "date_str")
val df=spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
Please note that these three cases are invalid dates.
And I read it via:
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+----------+
| date_str|
+----------+
| null|
|9808-02-29|
|2022-06-30|
+----------+
Why is 2022-01-32 converted to null, while 9808-02-30 is converted to 9808-02-29?
Intuitively, they are all invalid dates, so we should get 3 nulls.
This question is similar to THIS one I asked before on StackOverflow, which after some more trials now works.
Previously there was some issue with the column id, but now I am trying to filter a column of DECIMAL data type, and it always gives me all the data instead of the filtered rows.
Data which the ORC file has in the required columns:
And this is how I am trying to filter the DECIMAL column using orc::SearchArgument:
orc::RowReaderOptions m_RowReaderOpts;
orc::ReaderOptions m_ReaderOpts;
std::unique_ptr<orc::Reader> m_Reader;
std::unique_ptr<orc::RowReader> m_RowReader;
auto builder = orc::SearchArgumentFactory::newBuilder();
const int snapshot_time_col_id = 22;
orc::Literal ss_begin_time{34080000000000, 14, 9};
orc::Literal ss_end_time{34380000000000, 14, 9};
// I HAVE ALSO TRIED, but didn't work.
// orc::Literal ss_begin_time{34080, 5, 0};
// orc::Literal ss_end_time{34380, 5, 0};
builder->between(snapshot_time_col_id, orc::PredicateDataType::DECIMAL, ss_begin_time, ss_end_time);
m_RowReaderOpts.searchArgument(builder->build());
reader = orc::createReader(orc::readFile(a_FilePath.c_str()), m_ReaderOpts);
row_reader = reader->createRowReader(m_RowReaderOpts);
org.apache.iceberg.data.TestMetricsRowGroupFilter > testIsNaN[format = orc] FAILED
java.lang.AssertionError: Should read: NaN counts are not tracked in Parquet metrics
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.apache.iceberg.data.TestMetricsRowGroupFilter.testIsNaN(TestMetricsRowGroupFilter.java:308)
org.apache.iceberg.data.TestMetricsRowGroupFilter > testNotNaN[format = orc] FAILED
java.lang.AssertionError: Should read: NaN counts are not tracked in Parquet metrics
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.apache.iceberg.data.TestMetricsRowGroupFilter.testNotNaN(TestMetricsRowGroupFilter.java:320)
#1055 looks relevant to this issue.
orc/cmake_modules/ThirdpartyToolchain.cmake
Lines 147 to 151 in 1e29620
If ORC_PREFER_STATIC_ZLIB is false, then the static system zlib is used. If ORC_PREFER_STATIC_ZLIB is true and ZLIB_STATIC_LIB is set, then the shared system zlib is used. This contradicts the option's description: 'Prefer static zlib library, if available'.
RELEASES menu
At the moment, the C++ library supports the gcc, clang and msvc compilers with C++11 enabled. To keep up with the pace of modern C++ standards, we could enable C++17 or even C++20 by default. We should be careful with public headers, as various downstream projects depend on them. Elsewhere we can enjoy new language features internally in the library.
However, this requires us to lift the minimum supported versions of the compilers listed below:
Thoughts? @dongjoon-hyun @williamhyun @guiyanakuang @stiga-huang @coderex2522
Ubuntu / MacOS / Windows.
$ ./run-all.sh apache branch-1.8
...
Test start: Sun Jan 8 19:00:36 PST 2023
End: Sun Jan 8 20:26:53 PST 2023
$ tail -n2 logs/*log
==> logs/centos7-test.log <==
Built target test-out
Finished centos7 at Sun Jan 8 20:17:09 PST 2023
==> logs/debian10-test.log <==
Built target test-out
Finished debian10 at Sun Jan 8 20:20:51 PST 2023
==> logs/debian10_jdk=11-test.log <==
Built target test-out
Finished debian10_jdk=11 at Sun Jan 8 20:22:35 PST 2023
==> logs/debian11-test.log <==
Built target test-out
Finished debian11 at Sun Jan 8 20:20:58 PST 2023
==> logs/fedora37-test.log <==
Built target test-out
Finished fedora37 at Sun Jan 8 20:22:26 PST 2023
==> logs/ubuntu18-test.log <==
Built target test-out
Finished ubuntu18 at Sun Jan 8 20:16:50 PST 2023
==> logs/ubuntu20-test.log <==
Built target test-out
Finished ubuntu20 at Sun Jan 8 20:21:25 PST 2023
==> logs/ubuntu20_jdk=11-test.log <==
Built target test-out
Finished ubuntu20_jdk=11 at Sun Jan 8 20:26:53 PST 2023
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Sun Jan 8 20:18:33 PST 2023
==> logs/ubuntu22-test.log <==
Built target test-out
Finished ubuntu22 at Sun Jan 8 20:23:15 PST 2023
This is resolved via #1353 .
When building the C++ library on a platform where char is unsigned by default, byte-to-integer expansion is incorrect in orc::expandBytesToIntegers, as well as in a few unit tests.
This can be reproduced on any CPU architecture when building with gcc by compiling with -funsigned-char.
I notice there is a merge-file function in Java. Is there a plan to support it in C++?
orc version: 1.6.11, sql: select xxx from xxx where str is not null
Recently I found some ORC files written by Trino that didn't have complete statistics in the file metadata (maybe a Presto bug). This causes OrcProto.ColumnStatistics not to be deserialized into any specific ColumnStatisticsImpl such as StringStatisticsImpl; then RecordReaderImpl.getValueRange() returns a ValueRange with a null lower, and RecordReaderImpl.pickRowGroups() skips this row group, which should not be skipped. In normal conditions other than the above, everything is ok. I found that orc-1.5.x can handle the above case thanks to RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
The readme says that the C++ version only writes file version 0.11, but from
Line 47 in cf720d7
it seems that the default has already changed to 0.12.
I would like to run mvn -Dmaven.test.skip=true clean package, but maven-dependency-plugin complains about "Unused declared dependencies" for some libraries used by the test code, which breaks the compilation. Please check the attached logs.
I'm not a Java expert, but I'm guessing the cause of the problem is the misuse of analyze-only in the package phase. According to the documentation, the analyze-only goal is meant to be used during the test-compile phase. In our case, the test classes were not compiled, so the dependency analyzer treated some libraries as unused. I don't know the right way to fix it, though.
The setting of maven-dependency-plugin:
Lines 373 to 388 in 8cf1047
Logs:
$ cd orc/java
$ mvn -Dmaven.test.skip=true clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Apache ORC [pom]
[INFO] ORC Shims [jar]
[INFO] ORC Core [jar]
[INFO] ORC MapReduce [jar]
[INFO] ORC Tools [jar]
[INFO] ORC Examples [jar]
[INFO]
[INFO] -------------------------< org.apache.orc:orc >-------------------------
[INFO] Building Apache ORC 1.9.0-SNAPSHOT [1/6]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc ---
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc ---
[INFO] Executing tasks
[INFO] [mkdir] Created dir: /Users/x/Documents/playground/orc/java/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc ---
[INFO] No site descriptor found: nothing to attach.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc ---
[INFO]
[INFO] ----------------------< org.apache.orc:orc-shims >----------------------
[INFO] Building ORC Shims 1.9.0-SNAPSHOT [2/6]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc-shims ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/shims/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc-shims ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.3.0:add-source (add-source) @ orc-shims ---
[INFO] Source directory: /Users/x/Documents/playground/orc/java/shims/target/generated-sources added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc-shims ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @ orc-shims ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Using 'UTF-8' encoding to copy filtered properties files.
[INFO] skip non existing resourceDirectory /Users/x/Documents/playground/orc/java/shims/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:compile (default-compile) @ orc-shims ---
[INFO] Compiling 13 source files to /Users/x/Documents/playground/orc/java/shims/target/classes
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:testResources (default-testResources) @ orc-shims ---
[INFO] Not copying test resources
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc-shims ---
[INFO] Executing tasks
[INFO] [mkdir] Created dir: /Users/x/Documents/playground/orc/java/shims/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:testCompile (default-testCompile) @ orc-shims ---
[INFO] Not compiling test sources
[INFO]
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ orc-shims ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:3.3.0:jar (default-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc-shims ---
[INFO] Skipping because packaging 'jar' is not pom.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc-shims ---
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-dependency-plugin:3.1.2:analyze-only (default) @ orc-shims ---
[WARNING] Unused declared dependencies found:
[WARNING] org.junit.jupiter:junit-jupiter-api:jar:5.9.0:test
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache ORC 1.9.0-SNAPSHOT:
[INFO]
[INFO] Apache ORC ......................................... SUCCESS [ 4.108 s]
[INFO] ORC Shims .......................................... FAILURE [ 4.817 s]
[INFO] ORC Core ........................................... SKIPPED
[INFO] ORC MapReduce ...................................... SKIPPED
[INFO] ORC Tools .......................................... SKIPPED
[INFO] ORC Examples ....................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.189 s
[INFO] Finished at: 2022-10-13T15:07:06+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:3.1.2:analyze-only (default) on project orc-shims: Dependency problems found -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :orc-shims
RC0 VOTE LINK
[VOTE] Release Apache ORC 1.7.3 (RC0)
TAG
https://github.com/apache/orc/releases/tag/v1.7.3-rc0
RELEASE FILES
https://dist.apache.org/repos/dist/dev/orc/v1.7.3-rc0
STAGING REPOSITORY
https://repository.apache.org/content/repositories/orgapacheorc-1055
LIST OF ISSUES
https://github.com/apache/orc/milestone/4?closed=1
https://issues.apache.org/jira/projects/ORC/versions/12351162
This is created based on the following dev@orc email.
What happens if type.fieldnames_size() is less than type.subtypes_size()? The call type.fieldnames(i) will be an invalid (out-of-bounds) access.
File name: TypeImpl.cc
case proto::Type_Kind_STRUCT: {
  TypeImpl* result = new TypeImpl(STRUCT);
  uint64_t size = static_cast<uint64_t>(type.subtypes_size());
  std::vector<Type*> typeList(size);
  std::vector<std::string> fieldList(size);
  for (int i = 0; i < type.subtypes_size(); ++i) {
    result->addStructField(type.fieldnames(i),
        convertType(footer.types(static_cast<int>(type.subtypes(i))), footer));
  }
  return std::unique_ptr<Type>(result);
}
I reproduced the scenario by modifying c++/test/TestType.cc (refer to the screenshot) and it crashed (refer to the output screenshot).
Hello,
Using the Arrow adapter, I became aware that the memory (RAM) footprint of an export (writing an ORC file) was very large for each field. For instance, exporting a table with 10,000 fields can take up to 30 GB, even if there are only 10 records.
Even for 100 fields, it could take 100 MB+.
The "issue" seems to be coming from here :
Line 59 in 432a7aa
When we create a writer with the "createWriter" (
Lines 681 to 684 in 432a7aa
Is there a reason the BufferedOutputStream initial capacity is that high? I worked around my problem by lowering it to 1 KB (it did not change performance much in my testing, though that may depend on the use case). Could a global (or static) variable be introduced to parametrize this hard-coded value?
Thanks
This is resolved via #1330
While preparing 1.8.0, I realized that we had better pin the annotations library to minimize downstream projects' migration burden.
https://github.com/dongjoon-hyun/spark/pull/4/files#
<dependency>
<groupId>org.jetbrains</groupId>
<artifactId>annotations</artifactId>
<version>23.0.0</version>
</dependency>
Apache ORC homepage provides the following documents. It would be great if we could add a "Using with Python" page with a PyArrow example.
This is created based on the following dev@orc mail on June 11.
https://lists.apache.org/thread/pkp6ffh9pqok7v618zxtox708mv26sz0
branch-1.8
is healthy in GitHub Action
CentOS 7
, Debian 10
, Debian 11
, Fedora37
, Ubuntu 18
, Ubuntu 20
, Ubuntu 22
) passed.
Hello, I am trying to read ORC data from a GCS bucket in Java. Is there any way or an example to read ORC data from GCP directly? Many thanks!
We're seeing some conflicts in the proto files when updating Iceberg in Trino: trinodb/trino#15079 (comment)
I think we should exclude the proto files from the nohive jar.
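One way to do that (a sketch, assuming the nohive jar is assembled with maven-shade-plugin; the actual build setup may differ) is a shade filter that drops the bundled .proto resources:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <!-- Assumption: the .proto files are only needed at build time,
           so they can be excluded from the shaded nohive artifact. -->
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>**/*.proto</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```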