apache / orc
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
Home Page: https://orc.apache.org/
License: Apache License 2.0
Where can I find sample code for using ORC column encryption when writing ORC files with the Spark data source?
It seems that CyrusSASL is not configured correctly in GitHub Action so far.
-- Could NOT find CyrusSASL (missing: CYRUS_SASL_SHARED_LIB CYRUS_SASL_INCLUDE_DIR)
This is resolved via #1355
Hi, I have a problem. My test data is TPC-DS 1G, Spark 3.2, ORC version 1.6.11.
Test SQL: select count(1) from call_center_orc where cc_call_center_sk > 100;
cc_call_center_sk is the first column in call_center_orc, and predicate pushdown works.
But when I test select count(1) from call_center_orc where cc_company > 100;
cc_company is not the first column, and predicate pushdown does not work.
I debugged the code and found the problem is in SchemaEvolution.ppdSafeConversion: in my case, result.size is 2, but in pickRowGroups the columnIx is the column index in the ORC metadata, which is 19 for cc_company. This causes ORC not to evaluate pushdown filters against the row-group stats, so it cannot skip the row group.
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 9, Debian 10, Debian 11, Ubuntu 16, Ubuntu 18, Ubuntu 20) passed.
Please vote on releasing the following candidate as Apache ORC version 1.7.4.
[ ] +1 Release this package as Apache ORC 1.7.4
[ ] -1 Do not release this package because ...
TAG:
https://github.com/apache/orc/releases/tag/v1.7.4-rc0
RELEASE FILES:
https://dist.apache.org/repos/dist/dev/orc/v1.7.4-rc0
STAGING REPOSITORY:
https://repository.apache.org/content/repositories/orgapacheorc-1056
LIST OF ISSUES:
https://issues.apache.org/jira/projects/ORC/versions/12351349
https://github.com/apache/orc/milestone/7?closed=1
This vote will be open for 72 hours.
Regards,
William
branch-1.6 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, and Java 15 are tested.
AppleClang 12.0.0.12000032 and g++ (CentOS 7, Debian 9, Debian 10, Debian 11, Ubuntu 16, Ubuntu 18, Ubuntu 20) passed except one known issue, testLzoLong.
Failed tests:
[ FAILED ] TestDecompression.testLzoLong (0 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] TestDecompression.testLzoLong
1 FAILED TEST
FAILED debian11
Ubuntu / MacOS / Windows.
$ ./run-all.sh apache branch-1.8
$ tail -n2 logs/*log
==> logs/centos7-test.log <==
Built target test-out
Finished centos7 at Mon Nov 28 22:32:44 PST 2022
==> logs/debian10-test.log <==
Built target test-out
Finished debian10 at Mon Nov 28 22:38:16 PST 2022
==> logs/debian10_jdk=11-test.log <==
Built target test-out
Finished debian10_jdk=11 at Mon Nov 28 22:41:52 PST 2022
==> logs/debian11-test.log <==
Built target test-out
Finished debian11 at Mon Nov 28 22:39:43 PST 2022
==> logs/fedora37-test.log <==
Built target test-out
Finished fedora37 at Mon Nov 28 22:39:08 PST 2022
==> logs/ubuntu18-test.log <==
Built target test-out
Finished ubuntu18 at Mon Nov 28 22:37:51 PST 2022
==> logs/ubuntu20-test.log <==
Built target test-out
Finished ubuntu20 at Mon Nov 28 22:42:01 PST 2022
==> logs/ubuntu20_jdk=11-test.log <==
Built target test-out
Finished ubuntu20_jdk=11 at Mon Nov 28 22:43:10 PST 2022
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Mon Nov 28 22:39:06 PST 2022
==> logs/ubuntu22-test.log <==
Built target test-out
Finished ubuntu22 at Mon Nov 28 22:40:22 PST 2022
AFTER VOTE
For a query engine like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size".
But for different kinds of tables, the row count is difficult to control this way.
For Presto, a normal OLAP query reads only a subset of table columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may become too low.
So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
A similar config has been introduced to cuDF (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261
It seems that we have broken images at the following pages.
This is resolved via #1340
apache/orc-dev was created via https://issues.apache.org/jira/browse/INFRA-23534.
- docker test images to apache/orc-dev
- site build image to apache/orc-dev
- docker script
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, MacOS 10.15.7, MacOS 11.6.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 10, Debian 11, Ubuntu 18, Ubuntu 20, Ubuntu 22) passed.
This is resolved via #1354
While building for Arch Linux, I've encountered some test failures (after backporting ffbd341):
[ RUN ] TestPredicateLeaf.testIntNullSafeEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:635: Failure
Expected: TruthValue::YES_NO
Which is: 4-byte object <05-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100), &bf)
Which is: 4-byte object <01-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntNullSafeEqualsBloomFilter (1 ms)
[ RUN ] TestPredicateLeaf.testIntEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:652: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testIntInBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:667: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:670: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createIntStats(10, 100, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testIntInBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateNullSafeEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:778: Failure
Expected: TruthValue::YES_NO
Which is: 4-byte object <05-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0), &bf)
Which is: 4-byte object <01-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateNullSafeEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateEqualsBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:795: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateEqualsBloomFilter (0 ms)
[ RUN ] TestPredicateLeaf.testDateInBloomFilter
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:812: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
/build/apache-orc/src/orc-1.7.3/c++/test/TestPredicateLeaf.cc:815: Failure
Expected: TruthValue::YES_NO_NULL
Which is: 4-byte object <06-00 00-00>
To be equal to: evaluate(pred, createDateStats(10.0, 100.0, true), &bf)
Which is: 4-byte object <04-00 00-00>
[ FAILED ] TestPredicateLeaf.testDateInBloomFilter (0 ms)
[ RUN ] TestBloomFilter.testBloomFilterBasicOperations
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(288)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(246)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(306)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:134: Failure
Value of: bloomFilter.mBitSet->get(228)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(458)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(545)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:138: Failure
Value of: bloomFilter.mBitSet->get(717)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(526)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(40)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(480)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:140: Failure
Value of: bloomFilter.mBitSet->get(86)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(308)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(335)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(108)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:144: Failure
Value of: bloomFilter.mBitSet->get(535)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(279)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(15)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:148: Failure
Value of: bloomFilter.mBitSet->get(54)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(680)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(818)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(434)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:150: Failure
Value of: bloomFilter.mBitSet->get(232)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:154: Failure
Value of: bloomFilter.testLong(111)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:157: Failure
Value of: bloomFilter.testLong(-1)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:159: Failure
Value of: bloomFilter.testLong(-111)
Actual: false
Expected: true
[ FAILED ] TestBloomFilter.testBloomFilterBasicOperations (0 ms)
[ RUN ] TestBloomFilter.testBloomFilterSerialization
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:262: Failure
Value of: dstBloomFilter->testLong(11)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:263: Failure
Value of: dstBloomFilter->testLong(111)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:267: Failure
Value of: dstBloomFilter->testLong(-11)
Actual: false
Expected: true
/build/apache-orc/src/orc-1.7.3/c++/test/TestBloomFilter.cc:268: Failure
Value of: dstBloomFilter->testLong(-111)
Actual: false
Expected: true
[ FAILED ] TestBloomFilter.testBloomFilterSerialization (0 ms)
And in the second suite:
[ RUN ] TestFileScan.testErrorHandling
/build/apache-orc/src/orc-1.7.3/tools/test/TestFileScan.cc:209: Failure
Expected: (std::string::npos) != (error.find(error_msg)), actual: 18446744073709551615 vs 18446744073709551615
[ FAILED ] TestFileScan.testErrorHandling (27 ms)
We build with these CMake flags:
-DCMAKE_CXX_FLAGS="${CXXFLAGS} -fPIC -ffat-lto-objects" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="/usr" \
-DLZ4_HOME="/usr" \
-DPROTOBUF_HOME="/usr" \
-DSNAPPY_HOME="/usr" \
-DZLIB_HOME="/usr" \
-DZSTD_HOME="/usr" \
-DORC_PREFER_STATIC_ZLIB=OFF \
-DBUILD_LIBHDFSPP=OFF \
-DBUILD_JAVA=OFF \
-DINSTALL_VENDORED_LIBS=OFF
Please tell me what I can provide to assist in resolving these failures.
Greetings,
I'm learning to work with ORC in C++, and I think I'm stuck and don't quite understand how to set an array's offsets. Precisely, the following code, when executed, produces the following exception: "Caught exception in test-file.orc: bad read in nextBuffer"
void write_orc()
{
    using namespace orc;
    ORC_UNIQUE_PTR<OutputStream> outStream = writeLocalFile("test-file.orc");
    ORC_UNIQUE_PTR<Type> schema(
        Type::buildTypeFromString("struct<id:int,list1:array<string>>"));
    WriterOptions options;
    ORC_UNIQUE_PTR<Writer> writer = createWriter(*schema, outStream.get(), options);
    uint64_t batch_size = 1024, row_count = 2048;
    std::unique_ptr<ColumnVectorBatch> batch =
        writer->createRowBatch(row_count);
    StructVectorBatch &root_batch =
        dynamic_cast<StructVectorBatch &>(*batch.get());
    LongVectorBatch &id_batch =
        dynamic_cast<LongVectorBatch &>(*root_batch.fields[0]);
    ListVectorBatch &list_batch =
        dynamic_cast<ListVectorBatch &>(*root_batch.fields[1]);
    StringVectorBatch &str_batch =
        dynamic_cast<StringVectorBatch &>(*list_batch.elements.get());
    std::vector<std::string> vs{"str1", "str2"};
    char **data = str_batch.data.data();
    int64_t *offsets = list_batch.offsets.data();
    uint64_t offset = 0, rows = 0;
    for (size_t i = 0; i < row_count; ++i) {
        offsets[rows] = static_cast<int64_t>(offset);
        id_batch.data[rows] = static_cast<int64_t>(i); // id value (stands in for the original data source)
        for (auto &s : vs)
        {
            data[offset] = &s[0];
            str_batch.length[offset++] = s.size();
        }
        rows++;
        if (rows == batch_size)
        {
            root_batch.numElements = rows;
            id_batch.numElements = rows;
            list_batch.numElements = rows;
            writer->add(*batch);
            rows = 0;
            offset = 0;
        }
    }
    if (rows != 0)
    {
        root_batch.numElements = rows;
        id_batch.numElements = rows;
        list_batch.numElements = rows;
        writer->add(*batch);
        rows = 0;
        offset = 0;
    }
    writer->close();
}
My question is: what exactly am I doing wrong when setting the list's offsets?
branch-1.7 is healthy in GitHub Action.
Ubuntu 20.04, Ubuntu 22.04, MacOS 11.6, MacOS 12.5, and Windows are tested.
Java 8, Java 11, Java 17, and Java 18 are tested.
Clang 11.0.0 and g++ (CentOS 7, Debian 10, Debian 11, Ubuntu 18, Ubuntu 20, Ubuntu 22) passed.
Mac11/12 and Windows are tested; CentOS7, Debian10/11, and Ubuntu18/20/22 passed.
==> logs/centos7-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 116.46 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 30.60 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 903.70 sec
Built target test-out
Finished centos7 at Sun Nov 13 20:16:56 PST 2022
==> logs/debian10-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 94.99 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 19.61 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 788.09 sec
Built target test-out
Finished debian10 at Sun Nov 13 20:19:40 PST 2022
==> logs/debian10_jdk=11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 90.23 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 14.42 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 793.46 sec
Built target test-out
Finished debian10_jdk=11 at Sun Nov 13 20:20:59 PST 2022
==> logs/debian11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 115.32 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 26.52 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 858.74 sec
Built target test-out
Finished debian11 at Sun Nov 13 20:16:44 PST 2022
==> logs/ubuntu18-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 114.11 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 27.41 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 881.91 sec
Built target test-out
Finished ubuntu18 at Sun Nov 13 20:16:48 PST 2022
==> logs/ubuntu20-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 85.41 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 12.85 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 680.50 sec
Built target test-out
Finished ubuntu20 at Sun Nov 13 20:21:22 PST 2022
==> logs/ubuntu20_jdk=11-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 82.77 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 11.02 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 613.05 sec
Built target test-out
Finished ubuntu20_jdk=11 at Sun Nov 13 20:22:28 PST 2022
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 114.86 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 26.20 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 925.71 sec
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Sun Nov 13 20:17:08 PST 2022
==> logs/ubuntu22-test.log <==
Start 7: java-bench-spark-test
7/8 Test #7: java-bench-spark-test ............ Passed 96.34 sec
Start 8: tool-test
8/8 Test #8: tool-test ........................ Passed 20.23 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 797.71 sec
Built target test-out
Finished ubuntu22 at Sun Nov 13 20:19:38 PST 2022
AFTER VOTE
Padding length and Padding ratio: what do they stand for? If there's a number there, does it affect programs reading this file?
File length: 548692248 bytes
Padding length: 2381182 bytes
Padding ratio: 0.43%
When Spark reads an ORC zlib file, it sometimes gets stuck for more than half an hour, and we don't know whether the file was written incorrectly.
Below is the code to reproduce the issue. It works when removing the empty struct column "col2", or writing a small number of rows, or changing the value to "rand() % 100".
Am I doing anything wrong?
This is on version 1.7.2.
Code:
WriterOptions options;
auto stream = writeLocalFile("orc_file_test");
MemoryPool* pool = getDefaultPool();
std::unique_ptr<Type> type(Type::buildTypeFromString(
"struct<col0:struct<col1:int>,col2:struct<col3:int>>"));
size_t num = 50000;
std::unique_ptr<Writer> writer = createWriter(*type, stream.get(), options);
std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(num);
StructVectorBatch* structBatch =
dynamic_cast<StructVectorBatch*>(batch.get());
StructVectorBatch* structBatch2 =
dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
LongVectorBatch* intBatch =
dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);
StructVectorBatch* structBatch3 =
dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);
LongVectorBatch* intBatch2 =
dynamic_cast<LongVectorBatch*>(structBatch3->fields[0]);
structBatch->numElements = num;
structBatch2->numElements = num;
structBatch3->numElements = num;
structBatch3->hasNulls = true;
for (int64_t i = 0; i < num; ++i) {
intBatch->data.data()[i] = rand() % 150000;
intBatch->notNull[i] = 1;
intBatch2->notNull[i] = 0;
intBatch2->hasNulls = true;
structBatch3->notNull[i] = 0;
}
intBatch->hasNulls = false;
writer->add(*batch);
writer->close();
ReaderOptions readOptions;
readOptions.setMemoryPool(*getDefaultPool());
auto reader = createReader(readLocalFile("orc_file_test"), readOptions);
orc::RowReaderOptions rowOptions;
rowOptions.searchArgument(
SearchArgumentFactory::newBuilder()
->startAnd()
.equals(2, PredicateDataType::LONG, Literal((int64_t)5))
.end()
.build());
std::unique_ptr<RowReader> rowReader = reader->createRowReader(rowOptions);
batch = rowReader->createRowBatch(num);
structBatch = dynamic_cast<StructVectorBatch*>(batch.get());
structBatch2 = dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
intBatch = dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);
structBatch3 = dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);
while (rowReader->next(*batch)) {
for (size_t i = 0; i < batch->numElements; i++) {
}
}
Stack trace:
terminate called after throwing an instance of 'orc::ParseError'
what(): bad read in nextBuffer
*** Aborted at 1666816640 (Unix time, try 'date -d @1666816640') ***
*** Signal 6 (SIGABRT) (0x2035c0002b7ad) received by PID 178093 (pthread TID 0x7ffb12545a80) (linux TID 178093) (maybe from PID 178093, UID 131932) (code: -6), stack trace: ***
@ 0000000000000000 (unknown)
@ 000000000009c9d3 __GI___pthread_kill
@ 00000000000444ec __GI_raise
@ 000000000002c432 __GI_abort
@ 00000000000a3fd4 __gnu_cxx::__verbose_terminate_handler()
@ 00000000000a1b39 __cxxabiv1::__terminate(void (*)())
@ 00000000000a1ba4 std::terminate()
@ 00000000000a1e6f __cxa_throw
@ 0000000001efcd55 __cxa_throw
@ 00000000075b676c orc::BooleanRleDecoderImpl::seek(orc::PositionProvider&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ByteRLE.cc:526
@ 00000000075af711 orc::IntegerColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:120
@ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
@ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
@ 0000000007598179 orc::RowReaderImpl::seekToRowGroup(unsigned int)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:440
@ 000000000759d700 orc::RowReaderImpl::startNextStripe()
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1037
@ 000000000759daf4 orc::RowReaderImpl::next(orc::ColumnVectorBatch&)
/home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1055
@ 0000000002fba9bc main
@ 000000000002c656 __libc_start_call_main
@ 000000000002c717 __libc_start_main_alias_2
@ 0000000002fb2780 _start
I created an ORC file by the code as follows.
val data = Seq(
("", "2022-01-32"), // pay attention to this, null
("", "9808-02-30"), // pay attention to this, 9808-02-29
("", "2022-06-31"), // pay attention to this, 2022-06-30
)
val cols = Seq("str", "date_str")
val df=spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
Please note that these three cases are invalid dates.
And I read it via:
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+----------+
| date_str|
+----------+
| null|
|9808-02-29|
|2022-06-30|
+----------+
Why is 2022-01-32 converted to null, while 9808-02-30 is converted to 9808-02-29?
Intuitively, they are all invalid dates, so we should get 3 nulls.
This question is similar to THIS one I asked before on StackOverflow, which after some more trials now works.
Previously there was some issue with the column id, but now I am trying to filter a column of DECIMAL data type, and it always gives me all the data instead of the filtered rows.
Data which the ORC file has in the required columns:
And this is how I am trying to filter the DECIMAL column using orc::SearchArgument:
orc::RowReaderOptions m_RowReaderOpts;
orc::ReaderOptions m_ReaderOpts;
std::unique_ptr<orc::Reader> m_Reader;
std::unique_ptr<orc::RowReader> m_RowReader;
auto builder = orc::SearchArgumentFactory::newBuilder();
const int snapshot_time_col_id = 22;
orc::Literal ss_begin_time{34080000000000, 14, 9};
orc::Literal ss_end_time{34380000000000, 14, 9};
// I HAVE ALSO TRIED, but didn't work.
// orc::Literal ss_begin_time{34080, 5, 0};
// orc::Literal ss_end_time{34380, 5, 0};
builder->between(snapshot_time_col_id, orc::PredicateDataType::DECIMAL, ss_begin_time, ss_end_time);
m_RowReaderOpts.searchArgument(builder->build());
reader = orc::createReader(orc::readFile(a_FilePath.c_str()), m_ReaderOpts);
row_reader = reader->createRowReader(m_RowReaderOpts);
org.apache.iceberg.data.TestMetricsRowGroupFilter > testIsNaN[format = orc] FAILED
java.lang.AssertionError: Should read: NaN counts are not tracked in Parquet metrics
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.apache.iceberg.data.TestMetricsRowGroupFilter.testIsNaN(TestMetricsRowGroupFilter.java:308)
org.apache.iceberg.data.TestMetricsRowGroupFilter > testNotNaN[format = orc] FAILED
java.lang.AssertionError: Should read: NaN counts are not tracked in Parquet metrics
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.apache.iceberg.data.TestMetricsRowGroupFilter.testNotNaN(TestMetricsRowGroupFilter.java:320)
#1055 looks relevant to this issue.
orc/cmake_modules/ThirdpartyToolchain.cmake
Lines 147 to 151 in 1e29620
If ORC_PREFER_STATIC_ZLIB is false, then the static system zlib is used. If ORC_PREFER_STATIC_ZLIB is true and ZLIB_STATIC_LIB is set, then the shared system zlib is used. This contradicts the option's description: 'Prefer static zlib library, if available'.
RELEASES menu
At the moment, the C++ library supports the gcc, clang and msvc compilers with C++11 enabled. To keep up with the pace of modern C++ standards, we could enable C++17 or even C++20 by default. We should be careful with public headers, as various downstream projects depend on them. Elsewhere we can enjoy new language features internally in the library.
However, this requires us to lift the minimum supported versions of the compilers listed below:
Thoughts? @dongjoon-hyun @williamhyun @guiyanakuang @stiga-huang @coderex2522
Ubuntu / MacOS / Windows.
$ ./run-all.sh apache branch-1.8
...
Test start: Sun Jan 8 19:00:36 PST 2023
End: Sun Jan 8 20:26:53 PST 2023
$ tail -n2 logs/*log
==> logs/centos7-test.log <==
Built target test-out
Finished centos7 at Sun Jan 8 20:17:09 PST 2023
==> logs/debian10-test.log <==
Built target test-out
Finished debian10 at Sun Jan 8 20:20:51 PST 2023
==> logs/debian10_jdk=11-test.log <==
Built target test-out
Finished debian10_jdk=11 at Sun Jan 8 20:22:35 PST 2023
==> logs/debian11-test.log <==
Built target test-out
Finished debian11 at Sun Jan 8 20:20:58 PST 2023
==> logs/fedora37-test.log <==
Built target test-out
Finished fedora37 at Sun Jan 8 20:22:26 PST 2023
==> logs/ubuntu18-test.log <==
Built target test-out
Finished ubuntu18 at Sun Jan 8 20:16:50 PST 2023
==> logs/ubuntu20-test.log <==
Built target test-out
Finished ubuntu20 at Sun Jan 8 20:21:25 PST 2023
==> logs/ubuntu20_jdk=11-test.log <==
Built target test-out
Finished ubuntu20_jdk=11 at Sun Jan 8 20:26:53 PST 2023
==> logs/ubuntu20_jdk=11_cc=clang-test.log <==
Built target test-out
Finished ubuntu20_jdk=11_cc=clang at Sun Jan 8 20:18:33 PST 2023
==> logs/ubuntu22-test.log <==
Built target test-out
Finished ubuntu22 at Sun Jan 8 20:23:15 PST 2023
This is resolved via #1353 .
When building the C++ library on a platform where char is unsigned by default, byte-to-integer expansion is incorrect in orc::expandBytesToIntegers, as well as in a few unit tests.
This can be reproduced on any CPU architecture when building with gcc by compiling with -funsigned-char.
I notice there is a merge-file function in Java. Is there a plan to support it in C++?
orc version: 1.6.11, sql: select xxx from xxx where str is not null
Recently I found some ORC files written by Trino that didn't have complete statistics in the file metadata (maybe a Presto bug). This causes OrcProto.ColumnStatistics not to be deserialized into any specific ColumnStatisticsImpl such as StringStatisticsImpl; then RecordReaderImpl.getValueRange() returns a ValueRange with a null lower, and RecordReaderImpl.pickRowGroups() skips this row group, which should not be skipped. In normal conditions other than the above, everything is ok. I found that orc-1.5.x can handle the above case thanks to RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
The readme says that the C++ version only writes file version 0.11, but from
Line 47 in cf720d7
it seems that the default has already changed to 0.12.
I would like to run mvn -Dmaven.test.skip=true clean package, but maven-dependency-plugin complains about "Unused declared dependencies" for some libraries used by the test code, which breaks the compilation. Please check the attached logs.
I'm not a Java expert, but I'm guessing the cause of the problem is the misuse of analyze-only in the package phase. According to the documentation, the analyze-only goal is meant to be used during the test-compile phase. In our case, the test classes were not compiled, so the dependency analyzer treated some libraries as unused. I don't know the right way to fix it, though.
The setting of maven-dependency-plugin:
Lines 373 to 388 in 8cf1047
Logs:
$ cd orc/java
$ mvn -Dmaven.test.skip=true clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Apache ORC [pom]
[INFO] ORC Shims [jar]
[INFO] ORC Core [jar]
[INFO] ORC MapReduce [jar]
[INFO] ORC Tools [jar]
[INFO] ORC Examples [jar]
[INFO]
[INFO] -------------------------< org.apache.orc:orc >-------------------------
[INFO] Building Apache ORC 1.9.0-SNAPSHOT [1/6]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc ---
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc ---
[INFO] Executing tasks
[INFO] [mkdir] Created dir: /Users/x/Documents/playground/orc/java/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc ---
[INFO] No site descriptor found: nothing to attach.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc ---
[INFO]
[INFO] ----------------------< org.apache.orc:orc-shims >----------------------
[INFO] Building ORC Shims 1.9.0-SNAPSHOT [2/6]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc-shims ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/shims/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc-shims ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.3.0:add-source (add-source) @ orc-shims ---
[INFO] Source directory: /Users/x/Documents/playground/orc/java/shims/target/generated-sources added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc-shims ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @ orc-shims ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Using 'UTF-8' encoding to copy filtered properties files.
[INFO] skip non existing resourceDirectory /Users/x/Documents/playground/orc/java/shims/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:compile (default-compile) @ orc-shims ---
[INFO] Compiling 13 source files to /Users/x/Documents/playground/orc/java/shims/target/classes
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:testResources (default-testResources) @ orc-shims ---
[INFO] Not copying test resources
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc-shims ---
[INFO] Executing tasks
[INFO] [mkdir] Created dir: /Users/x/Documents/playground/orc/java/shims/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:testCompile (default-testCompile) @ orc-shims ---
[INFO] Not compiling test sources
[INFO]
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ orc-shims ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:3.3.0:jar (default-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc-shims ---
[INFO] Skipping because packaging 'jar' is not pom.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc-shims ---
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-dependency-plugin:3.1.2:analyze-only (default) @ orc-shims ---
[WARNING] Unused declared dependencies found:
[WARNING] org.junit.jupiter:junit-jupiter-api:jar:5.9.0:test
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache ORC 1.9.0-SNAPSHOT:
[INFO]
[INFO] Apache ORC ......................................... SUCCESS [ 4.108 s]
[INFO] ORC Shims .......................................... FAILURE [ 4.817 s]
[INFO] ORC Core ........................................... SKIPPED
[INFO] ORC MapReduce ...................................... SKIPPED
[INFO] ORC Tools .......................................... SKIPPED
[INFO] ORC Examples ....................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.189 s
[INFO] Finished at: 2022-10-13T15:07:06+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:3.1.2:analyze-only (default) on project orc-shims: Dependency problems found -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :orc-shims
RC0 VOTE LINK
[VOTE] Release Apache ORC 1.7.3 (RC0)
TAG
https://github.com/apache/orc/releases/tag/v1.7.3-rc0
RELEASE FILES
https://dist.apache.org/repos/dist/dev/orc/v1.7.3-rc0
STAGING REPOSITORY
https://repository.apache.org/content/repositories/orgapacheorc-1055
LIST OF ISSUES
https://github.com/apache/orc/milestone/4?closed=1
https://issues.apache.org/jira/projects/ORC/versions/12351162
This is created based on the following dev@orc email.
What happens if type.fieldnames_size() is less than type.subtypes_size()? The call type.fieldnames(i) will be an invalid (out-of-bounds) access.
File name: TypeImpl.cc
case proto::Type_Kind_STRUCT: {
  TypeImpl* result = new TypeImpl(STRUCT);
  uint64_t size = static_cast<uint64_t>(type.subtypes_size());
  std::vector<Type*> typeList(size);
  std::vector<std::string> fieldList(size);
  for (int i = 0; i < type.subtypes_size(); ++i) {
    result->addStructField(type.fieldnames(i),
        convertType(footer.types(static_cast<int>(type.subtypes(i))), footer));
  }
  return std::unique_ptr<Type>(result);
}
I reproduced the scenario by modifying c++/test/TestType.cc (refer to the screenshot) and it crashed (refer to the output screenshot).
Hello,
Using the Arrow adapter, I became aware that the memory (RAM) footprint of an export (writing an ORC file) was very large for each field. For instance, exporting a table with 10,000 fields can take up to 30 GB, even if there are only 10 records.
Even for 100 fields, it could take 100 MB+.
The "issue" seems to be coming from here :
Line 59 in 432a7aa
When we create a writer with the "createWriter" (
Lines 681 to 684 in 432a7aa
Is there a reason the BufferedOutputStream initial capacity is that high? I worked around my problem by lowering it to 1 KB (it did not change performance much in my testing, though that may depend on the use case). Could a global (or static) variable be introduced to parametrize this hard-coded value?
Thanks
This is resolved via #1330
While preparing 1.8.0, I realized that we had better pin the annotations library to minimize downstream projects' migration burden.
https://github.com/dongjoon-hyun/spark/pull/4/files#
<dependency>
<groupId>org.jetbrains</groupId>
<artifactId>annotations</artifactId>
<version>23.0.0</version>
</dependency>
Apache ORC homepage provides the following documents. It would be great if we could add a "Using with Python" page with a PyArrow example.
This is created based on the following dev@orc mail on June 11.
https://lists.apache.org/thread/pkp6ffh9pqok7v618zxtox708mv26sz0
branch-1.8
is healthy in GitHub Action
CentOS 7
, Debian 10
, Debian 11
, Fedora37
, Ubuntu 18
, Ubuntu 20
, Ubuntu 22
) passed.
Hello, I am trying to read ORC data from a GCS bucket in Java. Is there any way or an example to read ORC data from GCP directly? Many thanks!
We're seeing some conflicts in the proto files when updating Iceberg in Trino: trinodb/trino#15079 (comment)
I think we should exclude the proto files from the nohive jar.
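One way to do that (a sketch, assuming the nohive jar is assembled with maven-shade-plugin; the actual build setup may differ) is a shade filter that drops the bundled .proto resources:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <!-- Assumption: the .proto files are only needed at build time,
           so they can be excluded from the shaded nohive artifact. -->
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>**/*.proto</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```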