I'll experiment. Have to see if it works with the slash on 'dataset'
from hdf5.node.
Should work without the slash; just the name
Slash or no slash, I keep getting the same error when I increase the array's length from 10000 to 100000. I'll try to bisect until I find the exact length that triggers this behaviour.
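As a side note, that bisection can be automated; here is a minimal sketch (the predicate below is a stand-in; in the real test it would attempt the makeDataset write at that length and report whether it errored):

```javascript
// Find the smallest length in (lo, hi] for which `fails` returns true,
// assuming failures are monotonic: every length above the threshold fails.
function bisectFailure(lo, hi, fails) {
  while (lo + 1 < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (fails(mid)) {
      hi = mid; // mid fails, so the threshold is at or below mid
    } else {
      lo = mid; // mid succeeds, so the threshold is above mid
    }
  }
  return hi;
}

// Stand-in predicate; a real one would try h5lt.makeDataset at this length.
const threshold = bisectFailure(10000, 100000, (n) => n >= 73902);
console.log(threshold); // smallest failing length under this predicate
```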
It breaks when going from a length of 73901 to 73902.
Also, when I examine the file with h5dump -d /datasetName
, I'm getting the JSON representation of the whole array as the first point in the dataset.
EDIT: I was wrong, apologies. It looks like JSON but it's not JSON.
This is the header for a Uint16 dataset within the same file:
DATASET "/station_id" {
DATATYPE H5T_STD_U16LE
DATASPACE SIMPLE { ( 100 ) / ( 100 ) }
ATTRIBUTE "type" {
DATATYPE H5T_STD_U32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
This is the header for the string dataset:
DATASET "/station_name" {
DATATYPE H5T_ARRAY { [100] H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} }
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
My code is based on the tutorial for variable length strings here: http://hdf-ni.github.io/hdf5.node/tut/dataset-tutorial.html.
What are you filling the Array entries with?
I suppose for a test a random string generator could be used. Or find a text document with over 80,000 lines...
Testing
The following code
var fs = require('fs');
var hdf5 = require('../common/hdf5').hdf5;
var h5lt = require('../common/hdf5').h5lt;
var h5gl = require('../common/hdf5').h5gl;
var path = require('path');
var shortid = require('shortid');
var filePath = path.join(__dirname, 'test-hdf5.h5');
var file = new hdf5.File(filePath, h5gl.Access.ACC_TRUNC);
var length = 10;
var dataset = new Array(length);
for (var i = 0; i < length; i++) {
dataset[i] = shortid.generate();
}
h5lt.makeDataset(file.id, 'test', dataset);
file.close();
produces a file that, when examined via
h5dump -d /test --stride 1 --start 0 --count 1 products/test-hdf5.h5
shows the following:
HDF5 "products/test-hdf5.h5" {
DATASET "/test" {
DATATYPE H5T_ARRAY { [10] H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} }
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
SUBSET {
START ( 0 );
STRIDE ( 1 );
COUNT ( 1 );
BLOCK ( 1 );
DATA {
(0): [ "r1_oscv0", "SkeOjocDR", "Sy-doo5DC", "SJfujjcwR", "Bkmuio9PA", "BJNuoo9D0", "rkBdsjqPA", "ry8OssqvR", "ryv_ii5wR", "ryd_ojqw0" ]
}
}
}
}
This is what I was referring to before - it looks like the entire array of strings is being stored as the first point in the dataset, rather than each string being treated as a separate point.
I got a test case set up by reading in a PDB of the rat liver molecule from https://pdb101.rcsb.org/motm/114
It's close to a million lines and cuts out between 70000 and 80000.
So I'm able to repeat and test.
It might have to do with some handle limit on Linux.
For example, on my Ubuntu machine:
cat /proc/sys/fs/file-max
808097
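For reference, file-max is the system-wide limit on open file handles; the per-process descriptor limit is usually the one a single writer process hits first. Both are easy to check:

```shell
# System-wide limit on open file handles (Linux-specific; skipped elsewhere)
[ -r /proc/sys/fs/file-max ] && cat /proc/sys/fs/file-max

# Per-process soft limit on open file descriptors, which is usually the
# binding constraint for a single process
ulimit -n
```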
I guess there are two sides to this - the cut-out, and the dataset being typed as a single array of strings rather than as separate string points.
Happy to contribute in any way I can. Feel free to send tests my way. I'll check the fs limit as soon as I get back home.
> On 9 Oct 2016 6:28 p.m., rimmartin wrote:
> I got a test case set up by reading in a PDB of the rat liver molecule from https://pdb101.rcsb.org/motm/114
> It's close to a million lines and cuts out between 70000 and 80000.
> So able to repeat and test
filename = '/home/jacopo/data-backend/products/gistemp/gistemp.h5', file descriptor = 12, errno = 14, error message = 'Bad address', buf = 0x55c61fcac378, total write size = 422496, bytes this sub-write = 422496, bytes actually written = 18446744073709551615, offset = 1179648
filename = './roothaan.h5', file descriptor = 9, errno = 14, error message = 'Bad address', buf = 0x487f858, total write size = 98400, bytes this sub-write = 98400, bytes actually written = 18446744073709551615, offset = 1183744
The "bytes actually written" value is crazy in both your test and mine; but the error message in both cases is "Bad address".
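For what it's worth, that "bytes actually written" value isn't garbage: 18446744073709551615 is exactly 2^64 - 1, which is what a signed -1 (the usual write-failure return) looks like when printed as an unsigned 64-bit integer. A quick BigInt check:

```javascript
// 2^64 - 1: a signed 64-bit -1 reinterpreted as an unsigned integer,
// matching the "bytes actually written" value in the error output above.
const uint64Max = (1n << 64n) - 1n;
console.log(uint64Max === 18446744073709551615n); // true
```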
With my test as-is, i.e. using shortid.generate()
, I can go up to a length of 73862. A length of 73863 breaks one every two runs (more or less) and 73864 always breaks.
However, switching to the following filler loop only got me up to 73820, breaking on all runs from 73821 going upward.
for (var i = 0; i < length; i++) {
dataset[i] = 'hello ' + i;
}
Lengthening the string to 'helloworldhelloworld ' + i
still got me up to 73820. Curiously enough, inverting the order to i + ' hello'
got me to a different number, 73746.
There must be a pattern but I can't see it ATM. Perhaps we're hitting some kind of limit on how big an array of strings can be within an array of strings
-typed dataset (even though we shouldn't be getting an array of strings
-typed dataset in the first place).
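One way to hunt for the pattern is to total up the payload bytes at each breaking length and compare across fillers; a sketch for the 'hello ' + i filler (73821 is the first failing length reported above):

```javascript
// Total string bytes the dataset holds for the 'hello ' + i filler.
function totalBytes(length) {
  let total = 0;
  for (let i = 0; i < length; i++) {
    total += ('hello ' + i).length;
  }
  return total;
}

console.log(totalBytes(73821)); // byte total at the first failing length
```

Comparing these totals for the different fillers might show whether the threshold tracks total bytes rather than entry count.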
PS: My file-max is 200676.
PS: Can I store fixed-length strings using node.hdf5?
Yeah, I was testing with
dataset[i] = 'hello ' + '\0';
It feels like some limit is being hit; a heap, a stack, something. I may put the question to the HDF Group after I search their mailing list.
Yes, fixed-width was done for table columns. Let me test some; to make it clean I may add an option:
h5lt.makeDataset(file.id, '/dataset', dataset, {fixed_width: 7});
for example
Will continue to look at large sizes of everything, watching for breaks in the system.
That'd be lovely. Happy to test any solution you come up with.
Fixed width is coming. Need to test and work on the reading back to JavaScript.
For writing there is no need to fix the length of the strings; just know the maximum length of them all. If this is too short for one string entry in the Array, an exception will be thrown from the native side to ensure data doesn't get messed up.
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
should commit this evening
Wonderful, wonderful, wonderful.
Hi, sorry for delay.
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
now saves nearly 1 million lines from a text file for the rat liver pdb chemistry model. The fixed width is 80 in this case.
Need to test reading back to javascript yet
I'm building their C examples and extending them to work with large data. Otherwise I've mirrored these examples in this project. Their docs don't say chunking is necessary, but it may be needed.
Fixed width is now working. Tested on about a million entries and a ~74 MB h5 file:
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
var readArray=h5lt.readDataset(group.id, "Rat Liver");
where the array is filled from a text file read and split on "\n":
const lineArr = ratLiver.trim().split("\n");
var lines = new Array(lineArr.length);
var index = 0;
var maxLength = 0;
/* Loop over every line, copying it and tracking the longest length. */
lineArr.forEach(function (line) {
    if (index < lines.length) {
        lines[index] = line;
        if (maxLength < line.length) maxLength = line.length;
    }
    index++;
});
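The same fill can be written more compactly; a sketch of an equivalent using slice and reduce (the sample input string here is made up for illustration):

```javascript
// Example input standing in for the file contents read from disk.
const ratLiver = "one\ntwo\nthree33\n";

const lineArr = ratLiver.trim().split("\n");

// Copy the lines and find the longest line length in a single pass.
const lines = lineArr.slice();
const maxLength = lineArr.reduce(
  (max, line) => Math.max(max, line.length), 0);

console.log(lines.length, maxLength); // 3 7
```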
Relooking at variable length
Variable-length I/O is now working.