jimbraun / xcdf
XCDF: eXplicitly Compacted Data Format. See documentation at Read the Docs:
Home Page: https://xcdf.readthedocs.io/en/latest/
License: Other
Field aliases should be accessible via the GetXField() methods and should be created via a CreateAlias() method. Aliases should not be iterated over when applying the visitor idiom, but should be viewable with e.g. a GetAliases() method.
Often it is useful to create histograms simply by specifying a number of bins and letting the analysis software figure out the binning. XCDF should support this:
We could probably do this dynamically. We know
the max/min in each data block. So we histogram
that and, if the next block has a larger range, rebin
appropriately. Since the bins would always become
larger, I don't think this would break down (except
we'd probably need to use more than the specified
number of bins, then rebin at the end). This is not a trivial addition, so I'll add it to the
GitHub XCDF issue tracker and implement it when
I can.
Jim
On 12/4/14, 12:44 PM, Segev BenZvi wrote:
OK, then keeping things simple might be the best approach.
Can the function work if we don't know min/max? I.e., suppose we just say that we want the data binned into 50 bins, and then have the function figure out the min/max of the data. Would that break down quickly when looking at more than one XCD file?
Segev
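A minimal sketch of the scheme Jim describes above: keep a fixed number of bins and, when a new data block exceeds the current range, double the bin width and merge adjacent bin pairs. Because bins only ever grow, existing counts always map exactly onto the coarser binning. The class name `GrowableHist` and its interface are illustrative, not XCDF API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed bin count; the range grows upward by doubling the bin width.
// (A real implementation would also grow downward and handle the
// final rebin to the user-requested bin count.)
class GrowableHist {
 public:
  GrowableHist(std::size_t nBins, double min, double max)
      : min_(min), width_((max - min) / nBins), counts_(nBins, 0) { }

  double Max() const { return min_ + width_ * counts_.size(); }
  double BinWidth() const { return width_; }
  unsigned long Count(std::size_t i) const { return counts_[i]; }

  void Fill(double x) {
    if (x < min_) return;           // sketch grows upward only
    while (x >= Max()) Coarsen();   // double the width until x fits
    counts_[static_cast<std::size_t>((x - min_) / width_)] += 1;
  }

 private:
  // Merge adjacent bin pairs: new bin i absorbs old bins 2i and 2i+1.
  void Coarsen() {
    std::vector<unsigned long> merged(counts_.size(), 0);
    for (std::size_t i = 0; i < counts_.size(); ++i)
      merged[i / 2] += counts_[i];
    counts_.swap(merged);
    width_ *= 2.;
  }

  double min_;
  double width_;
  std::vector<unsigned long> counts_;
};
```

Since the block headers already carry per-block max/min, `Fill` could equally be driven block-by-block: call `Coarsen` until the new block's max fits, then fill its entries without further range checks.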
When building XCDF against a homebrewed Python (2.7.9) in Yosemite 10.10.3, the program compiles perfectly but the Python bindings crash when calling "import xcdf" in a Python session.
It seems to be a CMake issue in which FIND_PACKAGE(PythonLibs) grabs the system Python rather than the homebrewed version. Unfortunately the same error also occurs if building against the same version of Python installed with ape.
The CMake issue with FIND_PACKAGE(PythonLibs) is a known problem (Homebrew/legacy-homebrew#25118). The issue with ape is new. There doesn't appear to be an issue with the XCDF code but the build system needs to be fixed.
Hi Jim,
Attached is an XCDF output file (simulated Ne20 from the data challenge) and a script that extracts three fields from the data in a loop. To run it, just download the XCDF trunk, compile it in a build directory, copy these files into the build directory, and run
python nhit.py
This should work without errors, but there may be something weird happening to heap memory.
The problem I was running into was as follows. I created a python iterator called "fields" that can be used to loop over a subset of the fields in an XCDF file from within python. For example, if you look into the script I attached, you'll see
for r in f.fields("phony.Nhit, phony.Energy, phony.pclType"):
...
The python bindings in src/pybindings/PyXCDF.cc take the string "phony.Nhit, phony.Energy, phony.pclType" and pass it to a field visitor (see XCDFFieldsByNameSelector.h). Inside the field visitor class, the string is parsed and used to push the requested field data into a python tuple return value.
The problem appears in the object I use to pass the string from python to the C++ field visitor. The object is an XCDFFieldIterator, defined on line 174 of PyXCDF.cc. If you look at the object, you'll see that the last member is a char buffer called fieldNames_. The XCDFFieldIterator struct is what I use to maintain the state of the "fields" iterator in python, and fieldNames_ stores the data we want to pass to the C++ field visitor.
As long as I use a static char buffer to store fieldNames_, the program works fine, but if I change it to a std::string or a char* allocated with new all hell breaks loose.
What seems to be happening is that some part of the program is attempting an invalid access to heap memory when I use std::string or a char*. I'm not sure if this is the python or in the core library. I am trying to figure this out, but if you're willing to help me with the profiling I'd really appreciate it.
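A plausible explanation for the static-buffer-works/std::string-crashes behavior, shown here as a self-contained stand-in rather than actual PyXCDF.cc code: CPython allocates extension objects with a malloc-style tp_alloc that never runs C++ constructors. A POD char buffer member needs no constructor, but a std::string member is left holding garbage, and its first use or destruction scribbles on the heap. The usual fix is placement new in the object's init/new slot and an explicit destructor call in dealloc.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <string>

typedef std::string FieldString;

// Stand-in for the XCDFFieldIterator struct (names illustrative).
struct IteratorLike {
  long refcountStandIn;     // stands in for PyObject_HEAD (POD: fine)
  FieldString fieldNames_;  // non-POD: its constructor must be run by hand
};

IteratorLike* MakeIterator(const char* names) {
  // Like tp_alloc: raw memory, no constructors run.
  IteratorLike* it = static_cast<IteratorLike*>(std::malloc(sizeof(IteratorLike)));
  // Without this placement new, fieldNames_ holds uninitialized pointers
  // and any use corrupts the heap:
  new (&it->fieldNames_) FieldString(names);
  return it;
}

void FreeIterator(IteratorLike* it) {
  it->fieldNames_.~FieldString();  // run the destructor explicitly
  std::free(it);                   // like tp_free
}
```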
A copy of the XCDFFile reference is kept in NumericalExpression just so we can disable the copy constructor. This should be handled more cleanly.
List of issues landing on this branch will be included in this meta-issue
We can possibly use regex.h (the C++ &lt;regex&gt; tools are only available from C++11). We should do a field-name search first, then a regex match. The thinking here is that "select-fields 'rec.*'" could be useful.
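A sketch of that lookup order, using POSIX regex.h to avoid the C++11-only &lt;regex&gt; (the function name and signature are illustrative): try an exact field-name match first, and only fall back to treating the selector as an anchored extended regex so that e.g. "rec.*" expands to all matching fields.

```cpp
#include <cassert>
#include <regex.h>
#include <string>
#include <vector>

// Exact name match wins; otherwise interpret the selector as a regex.
std::vector<std::string>
SelectFields(const std::vector<std::string>& fields, const std::string& sel) {
  std::vector<std::string> out;
  for (size_t i = 0; i < fields.size(); ++i)
    if (fields[i] == sel) { out.push_back(fields[i]); return out; }

  regex_t re;
  // Anchor the pattern so "rec.*" cannot match in the middle of a name.
  if (regcomp(&re, ("^" + sel + "$").c_str(), REG_EXTENDED | REG_NOSUB) != 0)
    return out;   // not a valid regex: no matches
  for (size_t i = 0; i < fields.size(); ++i)
    if (regexec(&re, fields[i].c_str(), 0, 0, 0) == 0)
      out.push_back(fields[i]);
  regfree(&re);
  return out;
}
```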
In XCDFDefs.h: this enum causes a conflict with an enum in GEANT4 that has the same name.
Is it possible to move the enum into a namespace, or make the enumerators more explicit, e.g. XCDF_NONE?
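Both suggested fixes, sketched with illustrative names (the actual enumerators in XCDFDefs.h may differ): wrapping the enum in a namespace, and prefixing the enumerators so that XCDF_NONE can no longer collide with a GEANT4 enumerator.

```cpp
#include <cassert>

// Fix 1 + 2 combined: the enum lives in a namespace AND the
// enumerators carry an XCDF_ prefix.
namespace xcdf {
  enum FieldType { XCDF_NONE, XCDF_UNSIGNED, XCDF_SIGNED, XCDF_FLOAT };
}

// An unrelated library (e.g. GEANT4) is now free to define its own NONE
// at global scope without any conflict:
enum OtherLibState { NONE, ACTIVE };
```

Pre-C++11 this is the portable option; with C++11 available, `enum class` would scope the enumerators without needing the prefix.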
The remove-comments command of the XCDF utility does not actually accept the -o output option that its help text advertises.
Help shows:
remove-comments {-o outfile} {infiles} Remove all comments from an XCDF file
When it is used:
xcdf-utility remove-comments -o gammas.xcd gammas_test.xcd
Only one input file is allowed for remove-comments. Quitting
[jbraun@dyn-9-98:~/data 2335]$ ~/git/XCDF/build/xcdf-utility add-comment "test2" test3.xcd > test4.xcd
XCDF FATAL ERROR: /Users/jbraun/git/xcdf/src/XCDFFile.cc,ReadFrame:375]: Read failed. Byte offset: 4641
libc++abi.dylib: terminating with uncaught exception of type XCDFException
Abort trap: 6
This way we can completely ignore fields that are not used when reading back the file. Skip both reading the data into memory and attempting to parse field data from the block.
Test that fields are active by checking XCDFDataManager::GetReferenceCount() != 3
https://github.com/jimbraun/XCDF/blob/master/include/xcdf/utility/FieldNodeDefs.h#L52-L55
// Knowing size limits is up to the user
T operator[](unsigned index) const {
const T& datum = field_[index];
return datum;
<<<<<<< HEAD
=======
//return field_[index];
>>>>>>> 8d9d6fcde1000c8369a52eed108c8461af0de450
}
unsigned GetSize() const {return field_.GetSize();}
const std::string& GetName() const {return field_.GetName();}
So the 3.00.01 release will not build.
Maybe we can combine Stash() with the data holding container into a vector. This std::vector would be used to back both Scalar and Vector data, and could eliminate the SSVector class, as we can use std::vector::const_iterator in both places (note: the need for T* as an iterator was the reason for the SSVector class).
In many cases, the data will be such that the user does not want to fill every field for every event. Logically, there would be groups of fields that are filled at any given time. It would make sense to support this concept in XCDF. There would be a default unnamed group, and users could add groups by name. To add fields to those groups, the AddXXXXField methods would take an optional group-name argument.
This would lay the foundation to supporting more structured data, possibly using an object-oriented storage format built on top of xcdf, while the files themselves remain valid XCDF and are flat, allowing one to use all the generic XCDF tools and analysis support.
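A sketch of how the grouping bookkeeping could look (all names hypothetical, not XCDF API): fields added without a group name land in the default unnamed group, and an optional trailing argument assigns them elsewhere, so whole groups can later be filled or skipped together.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry backing the AddXXXXField group-name argument.
class FieldGroups {
 public:
  // Empty group name = the default unnamed group.
  void AddField(const std::string& name, const std::string& group = "") {
    groups_[group].push_back(name);
  }

  std::size_t GroupSize(const std::string& group) const {
    std::map<std::string, std::vector<std::string> >::const_iterator it =
        groups_.find(group);
    return it == groups_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string> > groups_;
};
```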
This is for histograms. On block write, check the block max/min for each field and track the global max/min. Write these into the trailer.
A quick examination of the code reveals mostly proper INCREF and DECREF, but a closer inspection is required. For example, with PyArg_ParseTuple(args, format, &filename, &filemode), are we given these references or are they just loaned? Note that XCDFFieldsByNameSelector and XCDFTupleSetter appear as though they leak, as they have a PyObject as a member, but actually do not because the reference is subsequently given away, not copied, in GetTuple()
XCDFFieldsByNameSelector parses the field specification string for each entry. This is slow, much slower than just reading all the fields. This behavior must be improved. The best option is probably to put an XCDFFieldsByNameSelector* in the XCDFFieldIterator object. We can't create a new PyTuple in operator() (since this is called many times per event), so create a new PyTuple on construction and when GetTuple() is called. Possibly wrap this in another object. XCDFFieldsByNameSelector would need a destructor and have copy/assignment disallowed.
This is relevant now that we have 2D vectors. The vector comparison logic in xcdf/utility/Histogram.h and xcdf/utility/Node.h is fuzzy and should be fixed. We currently support scalar-scalar, scalar-vector, and vector-vector (provided the two vectors are the same length) relations.
The following vector field relations should be supported:
So basically:
Relations are not transitive, so when e.g. making a 2D histogram with a weight expression, both axes need to be compared against the weight expression, and the axes must be compared against each other.
Read performance is lower than expected with this data.
xcdf-utility recover was not recovering any events. This issue appeared somewhere between 2.07 and 2.09.
Here is an example of the difference between the two versions:
./xcdf-utility version
XCDF version 2.9.0
xcdf-utility recover -o test.xcd $HAWCROOT/data/hawc/data/2014/12/run002186/trig_run002186_00002.xcd
XCDF FATAL ERROR: /tmp/xcdf/xcdf-2.09.00/src/XCDFFile.cc,ReadFrame:375]: Read failed. Byte offset: 86745736
Corrupt file: Recovered 0 events.
./xcdf-utility version
XCDF version 2.7.0
./xcdf-utility recover -o test2.xcd $HAWCROOT/data/hawc/data/2014/12/run002186/trig_run002186_00002.xcd
XCDF FATAL ERROR: /data/disk01/home/iwisher/XCDF_old/XCDF-2.07.00/src/XCDFFile.cc,ReadFrame:370]: Read failed. Byte offset: 86745736
Corrupt file: Recovered 142490 events.
9+9 will not evaluate as an expression, since we try to evaluate the longest expression possible (keeping "+" and "-") to catch e.g. +3.14159e+0.
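For contrast, the standard C library's strtod already implements a longest-match numeric scan, but it stops where a literal can no longer continue; the XCDF scanner described above is greedier, keeping '+' and '-' so that "+3.14159e+0" survives as one token, which is exactly why "9+9" trips it up. A stand-alone illustration (not the XCDF parser):

```cpp
#include <cassert>
#include <cstdlib>

// Consume the longest valid numeric literal at the start of s and
// report where the scan stopped.
double ParseLongestNumber(const char* s, const char** rest) {
  char* end = 0;
  double v = std::strtod(s, &end);
  *rest = end;
  return v;
}
```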
See #14. Fix by allowing the user to choose whether to force-load the comments. By default, this should probably be true, but set default to false to avoid needing changes in AERIE as well.
Need pass/fail instead of needing to interpret the results as with the current tests.
getRecord() calls Rewind() after the data is returned, making it impossible to extract data for a given field at a given record without iterating through the entire file.
CentOS 6.8 provides cmake 2.8.12.2 which fails to configure the current XCDF master:
CMake Error at CMakeLists.txt:31 (CMAKE_POLICY):
Policy "CMP0026" is not known to this version of CMake.
CMake Error at CMakeLists.txt:32 (CMAKE_POLICY):
Policy "CMP0042" is not known to this version of CMake.
This is due to the following check in CMakeLists.txt which, I guess, was meant to identify cmake 3:
IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12)
CMAKE_POLICY(SET CMP0026 OLD) # Read LOCATION properties from build targets
CMAKE_POLICY(SET CMP0042 OLD) # Disable @rpath in target install name
ENDIF()
Simple patch:
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -27,7 +27,7 @@ ENDIF ("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_BINARY_DIR}")
IF (COMMAND cmake_policy)
CMAKE_POLICY(SET CMP0009 NEW) # Do not follow symlinks w/ FILE_GLOB_RECURSE
- IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12)
+ IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12.2)
CMAKE_POLICY(SET CMP0026 OLD) # Read LOCATION properties from build targets
CMAKE_POLICY(SET CMP0042 OLD) # Disable @rpath in target install name
ENDIF()
But most likely only the major version of cmake should be checked instead of the full version string.
The HAWC REC data still compresses by ~40% when gzipped. Using the delta field should improve this, but probably not 100%, because rare values well outside the typical range affect the compression. Internal gzip should be an option. Gzip status should be an entry in the block header (needs to be in block header, not file header) to support concatenation, then gzip the block contents if requested. This is a trivial addition.
From Trac:
Segev:
The XCDF library currently has a Histogram python module that can be used to create 1D histograms without loading XCDF data into memory (in contrast with numpy.histogram). People are starting to use it because our data are no longer being written in ROOT format. However, the python Histogram is slow and does not take advantage of the fast variable selection made possible using the field visitor pattern in XCDF.
So, a couple of questions for users:
Should we add a C++ histogram class to XCDF (+ python bindings) with selection routines similar to those in xcdf-utility?
Should this class be added directly to the XCDF library or as a project in AERIE that builds off XCDF?
The advantage of adding the class directly to XCDF is that we can do analysis on REC data without actually needing to load the AERIE environment. The disadvantage is that XCDF could become bloated over time if we start packing it with analysis features. Any thoughts?
Lukas:
A third possibility could be a new xcdf-utility collection that builds on xcdf but won't require the whole aerie framework. This way, we avoid xcdf bloat for uses that only need an efficient storage library. This could also help to avoid contaminating the base xcdf code with non-core functions. The price to pay is yet another package, but ape could provide an xcdf-analysis package to help install xcdf+histogram only.
Jim:
I intended to implement this functionality in XCDF. I'll move this ticket to the XCDF repository:
https://github.com/jimbraun/XCDF/issues
For files written to a stream, we have to read through the file to find the comment block at the end. CommentsBegin()/CommentsEnd() should load the trailer if it isn't yet loaded, but currently they do not.
Additionally, "xcdf-utility comments" would be useful.
Particularly in Symbol.h and Node.h. This prevents using these headers in libraries where they are included from multiple source files.
I can't believe I wrote this. It is a nightmare and should be updated to something more maintainable.
The deltas logically will always span a smaller range than the actual values. We should always compress using deltas. Issues:
This replaces issue #9, which would be a much worse solution.
From memory, this looked to be hard, but it needs to happen.
Replace custom memory management with std::shared_ptr everywhere. Using the compile definition "-std=c++0x" seems to work well enough on RHEL6 and OSX.
[jbraun@i3-dhcp-172-16-223-209:/splitTest 802]$ xcdf-utility select "eventC < 10000" tt.xcd > tt2.xcd
[jbraun@i3-dhcp-172-16-223-209:/splitTest 803]$ xcdf-utility count tt2.xcd
146306
[jbraun@i3-dhcp-172-16-223-209:~/splitTest 809]$ xcdf-utility select "xx < 10000" tt.xcd > tt2.xcd
XCDF FATAL ERROR: /Users/jbraun/software/ApeInstalled/build/xcdf/xcdf-2.04.00/include/xcdf/utility/Expression.h,ParseValue:154]: Unable to parse symbol "xx"
Why isn't an error thrown when "eventC" is used? It is not a field in the file.
This should be reasonable behavior. The problem is that the code checks whether we have an open ostream rather than directly passing whether we need to write the header.
bbaugh [4:41 PM]
Looks like it isn't passed as a python Exception
try:
    xcf = xcdf.XCDFFile('/Users/bbaugh/Dropbox/plots/hit-dropping/frac_flag_sim.txt')
except Exception as inst:
    print type(inst)
Yields:
libc++abi.dylib: terminating with uncaught exception of type XCDFException
Abort trap: 6
bbaugh [4:42 PM]
If it were being passed to python then it would print out the type of the exception.
dlennarz [4:47 PM]
that's unfortunate
bbaugh [5:01 PM]
It seems like it would be a good feature. If the python bindings are supplied by the XCDF project, I would put in a feature request on GitHub.
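The usual pattern for that feature, shown with the CPython API replaced by a stand-in so the example is self-contained (none of these names are real PyXCDF code): every binding entry point catches XCDFException at the C boundary, records an error for the caller (PyErr_SetString in real bindings), and returns NULL instead of letting the C++ exception escape and abort the interpreter.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for XCDFException.
struct XCDFExceptionStandIn : std::runtime_error {
  explicit XCDFExceptionStandIn(const std::string& m)
      : std::runtime_error(m) { }
};

// Stand-in for the thread's Python error indicator.
static std::string lastError;

// Stand-in for a binding entry point such as XCDFFile's constructor.
void* OpenFileBinding(const std::string& path) {
  try {
    if (path.rfind(".xcd") == std::string::npos)
      throw XCDFExceptionStandIn("not an XCDF file: " + path);
    return new int(0);          // stand-in for a real XCDFFile handle
  } catch (const XCDFExceptionStandIn& e) {
    lastError = e.what();       // real code: PyErr_SetString(XcdfError, ...)
    return 0;                   // NULL tells Python an exception is set
  }
}
```

With this in place, the `except Exception as inst:` block in the chat above would catch a proper Python exception instead of the process dying with "Abort trap: 6".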
"(field1, field2)==(1,2)" should be shorthand for "field1==1&&field2==2". "field1==(1,2)" should be shorthand for "field1==1 || field1==2". If we require the RHS to be constants, this is easily implemented within the existing "in" functionality, e.g. "in((field1, field2), (1, 2))" or "in((field1, field2), ((1, 2),(3,4)))" as shorthand for "(field1==1&&field2==2)||(field1==3&&field2==4)".
It is unclear whether it would be useful to also allow variables on the RHS. Probably useful, but uncommon.
Only track the difference between one value and the next. This holds for arrays too. Only track the difference between one bin value and the next.
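A minimal delta transform sketch (not the XCDF on-disk format): store the first value, then successive differences. Deltas of slowly varying data span a smaller range than the raw values, so they pack into fewer bits, and the same transform applies unchanged to array fields and histogram bin contents.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Encode: out[0] = v[0], out[i] = v[i] - v[i-1].
std::vector<long> DeltaEncode(const std::vector<long>& v) {
  std::vector<long> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i)
    out[i] = (i == 0) ? v[0] : v[i] - v[i - 1];
  return out;
}

// Decode by running-sum: exact inverse of DeltaEncode.
std::vector<long> DeltaDecode(const std::vector<long>& d) {
  std::vector<long> out(d.size());
  long running = 0;
  for (std::size_t i = 0; i < d.size(); ++i) {
    running += d[i];
    out[i] = running;
  }
  return out;
}
```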
Need to check for presence of "x" character before attempting hex parse
This means exposing the XCDFFile::Write() method. Most likely, we'll also need to create an XCDFField class (or 3 of them) and expose XCDFField::Add and XCDFField::Get().
This feature is needed, since 2D data shows up everywhere, but true 3D data is much much less common.
How to handle the field entry count checking?
How to handle fields stored as deltas?
If a CSV with a space delimiter or a tab delimiter is parsed with the XCDF utility, it is converted into an xcdf file containing only the first field. This isn't a huge issue, but it seems like it could issue a warning or an error.
Example:
xcdf-utility paste -o test.xcd test_space.csv
With test_space.csv:
test.unsign/U/1 test.sign/I/1 test.flt/F/0.0001
12412 1123 1.632512
122213 12314 6.231241
2123123 -1858684 10.46312
Reading an XCDF file with many fields is currently a bottleneck for analysis. Often only one or two of the fields are actually used. We could speed up analysis by a factor of 5 or more if we check that a field has external references before we decode its data. This probably means counting the number of base references (probably 1, in XCDFDataManager) and skipping the read if the total number of references equals the number of base references.
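The proposed check, sketched with a trivial intrusive count (names illustrative; the real count would come from XCDFDataManager's GetReferenceCount()): each field records how many references the library itself holds, and the block reader decodes a field only when some external user also holds one.

```cpp
#include <cassert>

// A field is "active" only if someone beyond the library references it.
class FieldRefs {
 public:
  explicit FieldRefs(unsigned baseRefs)
      : baseRefs_(baseRefs), refs_(baseRefs) { }

  // Called when e.g. the user requests the field via GetUnsignedField().
  void AddUserReference() { ++refs_; }

  // Block reader: skip both the read and the parse when inactive.
  bool IsActive() const { return refs_ > baseRefs_; }

 private:
  unsigned baseRefs_;
  unsigned refs_;
};
```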
It would be nice from a user perspective to know how the bytes in the file are allocated among the fields. This could be printed out in e.g. xcdf-utility info after the number of entries is written. Express this as bytes/entry.
We never used it, and it needlessly complicates the code.
It may be useful for a file to contain a field that is an expression derived from other fields in the file, e.g. "R" --> "sqrt(X_X + Y_Y)".
From Andy:
Sometimes I think that it would also be cool to have calculated
fields. For example: r = sqrt(x_x + y_y) where x,y are stored and
r is computed on the fly.
When I think about this more, I sometimes convince myself that this
is a misuse as simple equations can be coded into a text line.
But, sometimes there are calculations that are too complex to code
easily into a text line. An extreme example of this is time,RA,dec,
and theta,phi. Only 3 of these 5 are required, the other 2 are
computed. There is a CPU/storage tradeoff here and the optimal
solution may be to just store the 5 variables, but in other cases,
it may be easier to compute variables on the fly. This would require
that something like a function be carried along as an arbitrary
field type, which may be way beyond the scope of XCDF.
This is all just thinking out loud. I don’t know the right answer.
Imagine an XCDF file with the following field structure
A --> B --> C
A --> D
E
where root field A is the parent of B and D and B is the parent of 2D field C, and E is an unrelated root field. We can compare/histogram all field combinations except (CB) and (CD). Generally (CB) would be unusual since B only contains a number of entries, but (CD) would not be uncommon, since node D may contain a weight or something else to cut/draw against. In such a case, it would be proper to only draw items in C that correlate with the proper entry in B/D. Since we can get at the parent field in XCDFField/Node, it should be possible to implement this e.g.:
unsigned GetParentIndex(unsigned index)
that uses std::find on the parent iterator. Or better yet, a routine that does this, since this interface should be invisible to the user.
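A sketch of that routine, assuming (as in XCDF) that the parent field holds the entry count of each child segment; rather than std::find, a running sum over the parent's counts locates the segment containing a given child index, which is what lets C's entries be correlated with the matching entry of a sibling field like D.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map a child entry index to the parent entry whose segment contains it.
// parentCounts[p] = number of child entries under parent entry p.
std::size_t GetParentIndex(const std::vector<unsigned>& parentCounts,
                           std::size_t childIndex) {
  std::size_t cumulative = 0;
  for (std::size_t p = 0; p < parentCounts.size(); ++p) {
    cumulative += parentCounts[p];
    if (childIndex < cumulative)
      return p;
  }
  return parentCounts.size();   // out of range
}
```

As the text suggests, this lookup would live behind the comparison/histogram machinery, invisible to the user.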
This is related to #69.
Teensy issue with xcdf/utility/NumericalExpression.h
This code fails to compile with complaints about missing definition of XCDFPtr.
#include <xcdf/utility/NumericalExpression.h> int main(){}
Adding
#include <xcdf/XCDFPtr.h>
to the top of NumericalExpression.h is all that's needed to fix it.
It's a small thing, I just assume that we want an include file to itself include everything needed for it.