Comments (7)
This is not a reproducible example. A reproducible example with code is required to debug bugs.
from parquet-wasm.
With columns:
> function meminfo() {
... global.gc()
... const memoryUsage = process.memoryUsage();
... console.log('RSS:', memoryUsage.rss / (1024 * 1024), 'MB');
... console.log('Heap Total:', memoryUsage.heapTotal / (1024 * 1024), 'MB');
... console.log('Heap Used:', memoryUsage.heapUsed / (1024 * 1024), 'MB');
... }
undefined
>
undefined
> const {ParquetFile, readParquet, readParquetStream, wasmMemory} = require("parquet-wasm")
undefined
> const {parseTable, parseRecordBatch} = require("arrow-js-ffi")
undefined
> var WASM_MEMORY = wasmMemory();
undefined
>
> meminfo()
RSS: 70.890625 MB
Heap Total: 17.921875 MB
Heap Used: 7.547981262207031 MB
undefined
>
> console.time('pq')
undefined
> var table = await ParquetFile.fromUrl('http://localhost:8000/lineitem.parquet')
undefined
> var numRows = table.metadata().fileMetadata().numRows()
undefined
> var option = {
... columns : ['l_quantity', 'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus', 'l_shipdate'],
... batchSize : 122_880,
... }
undefined
> meminfo()
RSS: 95.42578125 MB
Heap Total: 14.0703125 MB
Heap Used: 10.886177062988281 MB
undefined
>
> var ffiTable = (await table.read(option)).intoFFI();
undefined
> meminfo()
RSS: 641.65234375 MB
Heap Total: 14.0703125 MB
Heap Used: 10.615234375 MB
undefined
>
> var arrowTable = parseTable(
... WASM_MEMORY.buffer,
... ffiTable.arrayAddrs(),
... ffiTable.schemaAddr()
... )
undefined
> console.timeEnd('pq')
pq: 2.231s
undefined
> meminfo()
RSS: 1097.15234375 MB
Heap Total: 17.3203125 MB
Heap Used: 11.910774230957031 MB
undefined
>
> console.log(numRows, table.metadata().fileMetadata().createdBy());
6001215 DuckDB
undefined
> arrowTable.batches[0].data.children[4]
Data {
type: Decimal { typeId: 7, scale: 2, precision: 15, bitWidth: 128 },
children: [],
dictionary: undefined,
offset: 0,
length: 122880,
_nullCount: 0,
stride: 4,
values: Uint32Array(491520) [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99,
... 491420 more items
],
nullBitmap: Uint8Array(0) []
}
Without columns:
> console.time('pq')
undefined
> var table = await ParquetFile.fromUrl('http://localhost:8000/lineitem.parquet')
undefined
> var numRows = table.metadata().fileMetadata().numRows()
undefined
> var option = {
... // columns : ['l_quantity', 'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus', 'l_shipdate'],
... batchSize : 122_880,
... }
undefined
> meminfo()
RSS: 1099.27734375 MB
Heap Total: 14.3203125 MB
Heap Used: 11.8231201171875 MB
undefined
>
> var ffiTable = (await table.read(option)).intoFFI();
undefined
> meminfo()
RSS: 2150.1640625 MB
Heap Total: 15.0703125 MB
Heap Used: 11.835853576660156 MB
undefined
>
> var arrowTable = parseTable(
... WASM_MEMORY.buffer,
... ffiTable.arrayAddrs(),
... ffiTable.schemaAddr()
... )
undefined
> console.timeEnd('pq')
pq: 5.992s
undefined
> meminfo()
RSS: 3131.0390625 MB
Heap Total: 17.8203125 MB
Heap Used: 11.851348876953125 MB
undefined
>
> console.log(numRows, table.metadata().fileMetadata().createdBy());
6001215 DuckDB
undefined
> arrowTable.batches[0].data.children[9]
Data {
type: Utf8 { typeId: 5 },
children: [],
dictionary: undefined,
offset: 0,
length: 122880,
_nullCount: 0,
stride: 1,
valueOffsets: Int32Array(122881) [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99,
... 122781 more items
],
values: Uint8Array(122880) [
79, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70, 70,
70, 79, 70, 70, 70, 70, 79, 79, 79, 79, 79, 79,
79, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70, 79,
79, 79, 79, 79, 79, 79, 79, 79, 79, 70, 70, 70,
79, 79, 79, 79, 79, 79, 79, 70, 70, 79, 79, 70,
70, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79,
79, 79, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70,
70, 70, 70, 70,
... 122780 more items
],
nullBitmap: Uint8Array(0) []
}
from parquet-wasm.
The bug might be this: the options specify `columns`, but in the read step the field schema used is still the one without `columns`.
By default (without `columns`), the field at position 4 is:
Field {
name: 'l_quantity',
type: Decimal { typeId: 7, scale: 2, precision: 15, bitWidth: 128 },
nullable: true,
metadata: Map(0) {}
}
With `columns` specified, the column is at position 4, but its real position in the full arrowTable.schema.fields is 9:
Field {
name: 'l_linestatus',
type: Utf8 { typeId: 5 },
nullable: true,
metadata: Map(0) {}
}
from parquet-wasm.
I'm sorry I still don't understand what your issue is. Are you able to provide a reproducible data file along with your code? Can you remove all the memory reporting that is irrelevant to this issue, and just include the minimum amount of code to show your issue? See:
https://stackoverflow.com/help/minimal-reproducible-example
from parquet-wasm.
Also note that I believe the ordering of the names in `columns` is not taken into account. It only determines which columns are selected; the original order in the dataset is used.
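A minimal sketch of the selection behavior described above, assuming the column list acts purely as a filter over the file's own column order. `projectedOrder` is a hypothetical helper written for illustration, not part of parquet-wasm, and the column list is the standard TPC-H lineitem order (assumed to match the file in this thread).

```javascript
// Hypothetical helper: models the behavior described above, where the
// requested names only select columns and the file's order is preserved.
function projectedOrder(fileColumns, requested) {
  const wanted = new Set(requested);
  return fileColumns.filter((name) => wanted.has(name));
}

// First 11 columns of TPC-H lineitem, in file order (assumption).
const fileColumns = [
  'l_orderkey', 'l_partkey', 'l_suppkey', 'l_linenumber',
  'l_quantity', 'l_extendedprice', 'l_discount', 'l_tax',
  'l_returnflag', 'l_linestatus', 'l_shipdate',
];

// Requested in a different order than the file stores them.
const requested = ['l_shipdate', 'l_quantity', 'l_linestatus'];

// The result follows the file order, not the requested order.
console.log(projectedOrder(fileColumns, requested));
```

Under this assumption, position `i` in the output corresponds to the `i`-th selected column in file order, so positions from the full schema (0~15) no longer apply once a subset is read.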
from parquet-wasm.
Also note that I believe the ordering of the names in `columns` is not taken into account. It only determines which columns are selected; the original order in the dataset is used.
Yes, the default positions are 0~15. The `columns` option picks 4~10 in the same order, and it does not work.
As for the TPC-H lineitem Parquet file, you can generate it with DuckDB.
from parquet-wasm.
Yes, the default positions are 0~15. The `columns` option picks 4~10 in the same order, and it does not work.
I still don't know what you're saying doesn't work. It looks like you're able to access the data correctly. You need to check the schema to find the positions of the columns in the output data.
As for the TPC-H lineitem Parquet file, you can generate it with DuckDB.
I am not going to go out of my way to generate data without a reproducible example. Please supply a minimal, reproducible example or I'll close this issue.
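Checking the schema for a column's position, as suggested above, could look like the sketch below. It relies only on `schema.fields` and `batches[0].data.children`, both of which appear in the output earlier in this thread. `childByName` is a hypothetical helper, and `fakeTable` is a hand-built stand-in for a real Arrow table so that the snippet runs without any dependencies.

```javascript
// Hypothetical helper: find a column's child data by name instead of by a
// hard-coded position, using the table's own schema.
function childByName(arrowTable, name, batchIndex = 0) {
  const index = arrowTable.schema.fields.findIndex((f) => f.name === name);
  if (index === -1) {
    throw new Error(`Column not found in schema: ${name}`);
  }
  return { index, data: arrowTable.batches[batchIndex].data.children[index] };
}

// Stand-in object shaped like the table printed earlier (assumption).
const fakeTable = {
  schema: {
    fields: [{ name: 'l_quantity' }, { name: 'l_returnflag' }, { name: 'l_linestatus' }],
  },
  batches: [
    { data: { children: [{ type: 'Decimal' }, { type: 'Utf8' }, { type: 'Utf8' }] } },
  ],
};

const { index, data } = childByName(fakeTable, 'l_linestatus');
console.log(index, data.type);
```

With a real table returned by `parseTable`, the same lookup would be `childByName(arrowTable, 'l_linestatus')`, which avoids assuming that positions from the full file schema still apply after a subset of columns is read.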
from parquet-wasm.