jimbraun / xcdf
XCDF: eXplicitly Compacted Data Format. See documentation at Read the Docs:
Home Page: https://xcdf.readthedocs.io/en/latest/
License: Other
Field aliases should be accessible via the GetXField() methods and should be created via a CreateAlias() method. Aliases should not be iterated over when applying the visitor idiom, but should be viewable with e.g. a GetAliases() method.
Often it is useful to create histograms simply by specifying a number of bins and letting the analysis software figure out the binning. XCDF should support this:
We could probably do this dynamically. We know
the max/min in each data block. So we histogram
that and, if the next block has a larger range, rebin
appropriately. Since the bins would always become
larger, I don't think this would break down (except
we'd probably need to use more than the specified
number of bins, then rebin at the end). This is not a trivial addition, so I'll add it to the
GitHub XCDF issue tracker and implement it when
I can.
Jim
On 12/4/14, 12:44 PM, Segev BenZvi wrote:
OK, then keeping things simple might be the best approach.
Can the function work if we don't know min/max? I.e., suppose we just say that we want the data binned into 50 bins, and then have the function figure out the min/max of the data. Would that break down quickly when looking at more than one XCD file?
Segev
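A minimal sketch of the scheme Jim describes above: keep a fixed number of bins and, when a new data block exceeds the current range, double the bin width and merge adjacent bin pairs. Because bins only ever grow, existing counts always map exactly onto the coarser binning. The class name `GrowableHist` and its interface are illustrative, not XCDF API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed bin count; the range grows upward by doubling the bin width.
// (A real implementation would also grow downward and handle the
// final rebin to the user-requested bin count.)
class GrowableHist {
 public:
  GrowableHist(std::size_t nBins, double min, double max)
      : min_(min), width_((max - min) / nBins), counts_(nBins, 0) { }

  double Max() const { return min_ + width_ * counts_.size(); }
  double BinWidth() const { return width_; }
  unsigned long Count(std::size_t i) const { return counts_[i]; }

  void Fill(double x) {
    if (x < min_) return;           // sketch grows upward only
    while (x >= Max()) Coarsen();   // double the width until x fits
    counts_[static_cast<std::size_t>((x - min_) / width_)] += 1;
  }

 private:
  // Merge adjacent bin pairs: new bin i absorbs old bins 2i and 2i+1.
  void Coarsen() {
    std::vector<unsigned long> merged(counts_.size(), 0);
    for (std::size_t i = 0; i < counts_.size(); ++i)
      merged[i / 2] += counts_[i];
    counts_.swap(merged);
    width_ *= 2.;
  }

  double min_;
  double width_;
  std::vector<unsigned long> counts_;
};
```

Since the block headers already carry per-block max/min, `Fill` could equally be driven block-by-block: call `Coarsen` until the new block's max fits, then fill its entries without further range checks.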
When building XCDF against a homebrewed Python (2.7.9) in Yosemite 10.10.3, the program compiles perfectly but the Python bindings crash when calling "import xcdf" in a Python session.
It seems to be a CMake issue in which FIND_PACKAGE(PythonLibs) grabs the system Python rather than the homebrewed version. Unfortunately the same error also occurs if building against the same version of Python installed with ape.
The CMake issue with FIND_PACKAGE(PythonLibs) is a known problem (Homebrew/legacy-homebrew#25118). The issue with ape is new. There doesn't appear to be an issue with the XCDF code but the build system needs to be fixed.
Hi Jim,
Attached is an XCDF output file (simulated Ne20 from the data challenge) and a script that extracts three fields from the data in a loop. To run it, just download the XCDF trunk, compile it in a build directory, copy these files into the build directory, and run
python nhit.py
This should work without errors, but there may be something weird happening to heap memory.
The problem I was running into was as follows. I created a python iterator called "fields" that can be used to loop over a subset of the fields in an XCDF file from within python. For example, if you look into the script I attached, you'll see
for r in f.fields("phony.Nhit, phony.Energy, phony.pclType"):
...
The python bindings in src/pybindings/PyXCDF.cc take the string "phony.Nhit, phony.Energy, phony.pclType" and pass it to a field visitor (see XCDFFieldsByNameSelector.h). Inside the field visitor class, the string is parsed and used to push the requested field data into a python tuple return value.
The problem appears in the object I use to pass the string from python to the C++ field visitor. The object is an XCDFFieldIterator, defined on line 174 of PyXCDF.cc. If you look at the object, you'll see that the last member is a char buffer called fieldNames_. The XCDFFieldIterator struct is what I use to maintain the state of the "fields" iterator in python, and fieldNames_ stores the data we want to pass to the C++ field visitor.
As long as I use a static char buffer to store fieldNames_, the program works fine, but if I change it to a std::string or a char* allocated with new all hell breaks loose.
What seems to be happening is that some part of the program is attempting an invalid access to heap memory when I use std::string or a char*. I'm not sure if this is the python or in the core library. I am trying to figure this out, but if you're willing to help me with the profiling I'd really appreciate it.
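A plausible explanation for the static-buffer-works/std::string-crashes behavior, shown here as a self-contained stand-in rather than actual PyXCDF.cc code: CPython allocates extension objects with a malloc-style tp_alloc that never runs C++ constructors. A POD char buffer member needs no constructor, but a std::string member is left holding garbage, and its first use or destruction scribbles on the heap. The usual fix is placement new in the object's init/new slot and an explicit destructor call in dealloc.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <string>

typedef std::string FieldString;

// Stand-in for the XCDFFieldIterator struct (names illustrative).
struct IteratorLike {
  long refcountStandIn;     // stands in for PyObject_HEAD (POD: fine)
  FieldString fieldNames_;  // non-POD: its constructor must be run by hand
};

IteratorLike* MakeIterator(const char* names) {
  // Like tp_alloc: raw memory, no constructors run.
  IteratorLike* it = static_cast<IteratorLike*>(std::malloc(sizeof(IteratorLike)));
  // Without this placement new, fieldNames_ holds uninitialized pointers
  // and any use corrupts the heap:
  new (&it->fieldNames_) FieldString(names);
  return it;
}

void FreeIterator(IteratorLike* it) {
  it->fieldNames_.~FieldString();  // run the destructor explicitly
  std::free(it);                   // like tp_free
}
```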
A copy of the XCDFFile reference is kept in NumericalExpression just so we can disable the copy constructor. This should be handled more cleanly.
List of issues landing on this branch will be included in this meta-issue
We can possibly use regex.h (the C++ &lt;regex&gt; tools are only available from C++11). We should do a field-name search first, then a regex match. The thinking here is that "select-fields 'rec.*'" could be useful.
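A sketch of that lookup order, using POSIX regex.h to avoid the C++11-only &lt;regex&gt; (the function name and signature are illustrative): try an exact field-name match first, and only fall back to treating the selector as an anchored extended regex so that e.g. "rec.*" expands to all matching fields.

```cpp
#include <cassert>
#include <regex.h>
#include <string>
#include <vector>

// Exact name match wins; otherwise interpret the selector as a regex.
std::vector<std::string>
SelectFields(const std::vector<std::string>& fields, const std::string& sel) {
  std::vector<std::string> out;
  for (size_t i = 0; i < fields.size(); ++i)
    if (fields[i] == sel) { out.push_back(fields[i]); return out; }

  regex_t re;
  // Anchor the pattern so "rec.*" cannot match in the middle of a name.
  if (regcomp(&re, ("^" + sel + "$").c_str(), REG_EXTENDED | REG_NOSUB) != 0)
    return out;   // not a valid regex: no matches
  for (size_t i = 0; i < fields.size(); ++i)
    if (regexec(&re, fields[i].c_str(), 0, 0, 0) == 0)
      out.push_back(fields[i]);
  regfree(&re);
  return out;
}
```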
In XCDFDefs.h: this enum causes a conflict with an enum in GEANT4 that has the same name.
Is it possible to move the enum into a namespace, or make the enumerators more explicit, e.g. XCDF_NONE?
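Both suggested fixes, sketched with illustrative names (the actual enumerators in XCDFDefs.h may differ): wrapping the enum in a namespace, and prefixing the enumerators so that XCDF_NONE can no longer collide with a GEANT4 enumerator.

```cpp
#include <cassert>

// Fix 1 + 2 combined: the enum lives in a namespace AND the
// enumerators carry an XCDF_ prefix.
namespace xcdf {
  enum FieldType { XCDF_NONE, XCDF_UNSIGNED, XCDF_SIGNED, XCDF_FLOAT };
}

// An unrelated library (e.g. GEANT4) is now free to define its own NONE
// at global scope without any conflict:
enum OtherLibState { NONE, ACTIVE };
```

Pre-C++11 this is the portable option; with C++11 available, `enum class` would scope the enumerators without needing the prefix.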
The remove-comments command of the XCDF utility does not actually accept the -o output option that its help text advertises.
Help shows:
remove-comments {-o outfile} {infiles} Remove all comments from an XCDF file
When it is used:
xcdf-utility remove-comments -o gammas.xcd gammas_test.xcd
Only one input file is allowed for remove-comments. Quitting
[jbraun@dyn-9-98:~/data 2335]$ ~/git/XCDF/build/xcdf-utility add-comment "test2" test3.xcd > test4.xcd
XCDF FATAL ERROR: /Users/jbraun/git/xcdf/src/XCDFFile.cc,ReadFrame:375]: Read failed. Byte offset: 4641
libc++abi.dylib: terminating with uncaught exception of type XCDFException
Abort trap: 6
This way we can completely ignore fields that are not used when reading back the file. Skip both reading the data into memory and attempting to parse field data from the block.
Test that fields are active by checking XCDFDataManager::GetReferenceCount() != 3
https://github.com/jimbraun/XCDF/blob/master/include/xcdf/utility/FieldNodeDefs.h#L52-L55
// Knowing size limits is up to the user
T operator[](unsigned index) const {
const T& datum = field_[index];
return datum;
<<<<<<< HEAD
=======
//return field_[index];
>>>>>>> 8d9d6fcde1000c8369a52eed108c8461af0de450
}
unsigned GetSize() const {return field_.GetSize();}
const std::string& GetName() const {return field_.GetName();}
So the 3.00.01 release will not build.
Maybe we can combine Stash() with the data holding container into a vector. This std::vector would be used to back both Scalar and Vector data, and could eliminate the SSVector class, as we can use std::vector::const_iterator in both places (note: the need for T* as an iterator was the reason for the SSVector class).
In many cases, the data will be such that the user does not want to fill every field for every event. Logically, there would be groups of fields that are filled at any given time. It would make sense to support this concept in XCDF. There would be a default unnamed group, and users could add groups by name. To add fields to those groups, the AddXXXXField methods would take an optional group-name argument.
This would lay the foundation to supporting more structured data, possibly using an object-oriented storage format built on top of xcdf, while the files themselves remain valid XCDF and are flat, allowing one to use all the generic XCDF tools and analysis support.
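A sketch of how the grouping bookkeeping could look (all names hypothetical, not XCDF API): fields added without a group name land in the default unnamed group, and an optional trailing argument assigns them elsewhere, so whole groups can later be filled or skipped together.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry backing the AddXXXXField group-name argument.
class FieldGroups {
 public:
  // Empty group name = the default unnamed group.
  void AddField(const std::string& name, const std::string& group = "") {
    groups_[group].push_back(name);
  }

  std::size_t GroupSize(const std::string& group) const {
    std::map<std::string, std::vector<std::string> >::const_iterator it =
        groups_.find(group);
    return it == groups_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string> > groups_;
};
```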
This is for histograms. On block write, check the block max/min for each field and track the global max/min. Write these into the trailer.
A quick examination of the code reveals mostly proper INCREF and DECREF, but a closer inspection is required. For example, with PyArg_ParseTuple(args, format, &filename, &filemode), are we given these references or are they just loaned? Note that XCDFFieldsByNameSelector and XCDFTupleSetter appear as though they leak, as they have a PyObject as a member, but actually do not because the reference is subsequently given away, not copied, in GetTuple()
XCDFFieldsByNameSelector parses the field specification string for each entry. This is slow, much slower than just reading all the fields. This behavior must be improved. The best option is probably to put an XCDFFieldsByNameSelector* in the XCDFFieldIterator object. We can't create a new PyTuple in operator() (since this is called many times per event), so create a new PyTuple on construction and when GetTuple() is called. Possibly wrap this in another object. XCDFFieldsByNameSelector would need a destructor and have copy/assignment disallowed.
This is relevant now that we have 2D vectors. The vector comparison logic in xcdf/utility/Histogram.h and xcdf/utility/Node.h is fuzzy and should be fixed. We currently support scalar-scalar, scalar-vector, and vector-vector (provided the two vectors are the same length) relations.
The following vector field relations should be supported:
So basically:
Relations are not transitive, so when e.g. making a 2D histogram with a weight expression, both axes need to be compared against the weight expression, and the axes must be compared against each other.
Read performance is lower than expected with this data.
xcdf-utility recover was not recovering any events. This issue appeared somewhere between 2.07 and 2.09.
Here is an example of the difference between the two versions:
./xcdf-utility version
XCDF version 2.9.0
xcdf-utility recover -o test.xcd $HAWCROOT/data/hawc/data/2014/12/run002186/trig_run002186_00002.xcd
XCDF FATAL ERROR: /tmp/xcdf/xcdf-2.09.00/src/XCDFFile.cc,ReadFrame:375]: Read failed. Byte offset: 86745736
Corrupt file: Recovered 0 events.
./xcdf-utility version
XCDF version 2.7.0
./xcdf-utility recover -o test2.xcd $HAWCROOT/data/hawc/data/2014/12/run002186/trig_run002186_00002.xcd
XCDF FATAL ERROR: /data/disk01/home/iwisher/XCDF_old/XCDF-2.07.00/src/XCDFFile.cc,ReadFrame:370]: Read failed. Byte offset: 86745736
Corrupt file: Recovered 142490 events.
9+9 will not evaluate as an expression, since we try to evaluate the longest expression possible (keeping "+" and "-") to catch e.g. +3.14159e+0.
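For contrast, the standard C library's strtod already implements a longest-match numeric scan, but it stops where a literal can no longer continue; the XCDF scanner described above is greedier, keeping '+' and '-' so that "+3.14159e+0" survives as one token, which is exactly why "9+9" trips it up. A stand-alone illustration (not the XCDF parser):

```cpp
#include <cassert>
#include <cstdlib>

// Consume the longest valid numeric literal at the start of s and
// report where the scan stopped.
double ParseLongestNumber(const char* s, const char** rest) {
  char* end = 0;
  double v = std::strtod(s, &end);
  *rest = end;
  return v;
}
```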
See #14. Fix by allowing the user to choose whether to force-load the comments. By default, this should probably be true, but set default to false to avoid needing changes in AERIE as well.
Need pass/fail instead of needing to interpret the results as with the current tests.
getRecord() calls Rewind() after the data is returned, making it impossible to extract data for a given field at a given record without iterating through the entire file.
CentOS 6.8 provides cmake 2.8.12.2 which fails to configure the current XCDF master:
CMake Error at CMakeLists.txt:31 (CMAKE_POLICY):
Policy "CMP0026" is not known to this version of CMake.
CMake Error at CMakeLists.txt:32 (CMAKE_POLICY):
Policy "CMP0042" is not known to this version of CMake.
This is due to the following check in CMakeLists.txt which, I guess, was meant to identify cmake 3:
IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12)
CMAKE_POLICY(SET CMP0026 OLD) # Read LOCATION properties from build targets
CMAKE_POLICY(SET CMP0042 OLD) # Disable @rpath in target install name
ENDIF()
Simple patch:
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -27,7 +27,7 @@ ENDIF ("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_BINARY_DIR}")
IF (COMMAND cmake_policy)
CMAKE_POLICY(SET CMP0009 NEW) # Do not follow symlinks w/ FILE_GLOB_RECURSE
- IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12)
+ IF ("${CMAKE_VERSION}" VERSION_GREATER 2.8.12.2)
CMAKE_POLICY(SET CMP0026 OLD) # Read LOCATION properties from build targets
CMAKE_POLICY(SET CMP0042 OLD) # Disable @rpath in target install name
ENDIF()
But most likely only the major version of cmake should be checked instead of the full version string.
The HAWC REC data still compresses by ~40% when gzipped. Using the delta field should improve this, but probably not 100%, because rare values well outside the typical range affect the compression. Internal gzip should be an option. Gzip status should be an entry in the block header (needs to be in block header, not file header) to support concatenation, then gzip the block contents if requested. This is a trivial addition.
From Trac:
Segev:
The XCDF library currently has a Histogram python module that can be used to create 1D histograms without loading XCDF data into memory (in contrast with numpy.histogram). People are starting to use it because our data are no longer being written in ROOT format. However, the python Histogram is slow and does not take advantage of the fast variable selection made possible using the field visitor pattern in XCDF.
So, a couple of questions for users:
Should we add a C++ histogram class to XCDF (+ python bindings) with selection routines similar to those in xcdf-utility?
Should this class be added directly to the XCDF library or as a project in AERIE that builds off XCDF?
The advantage of adding the class directly to XCDF is that we can do analysis on REC data without actually needing to load the AERIE environment. The disadvantage is that XCDF could become bloated over time if we start packing it with analysis features. Any thoughts?
Lukas:
A third possibility could be a new xcdf-utility collection that builds on xcdf but won't require the whole aerie framework. This way, we avoid xcdf bloat for uses that only need an efficient storage library. This could also help to avoid contaminating the base xcdf code with non-core functions. The price to pay is yet another package, but ape could provide an xcdf-analysis package to help install xcdf+histogram only.
Jim:
I intended to implement this functionality in XCDF. I'll move this ticket to the XCDF repository:
https://github.com/jimbraun/XCDF/issues
For files written to a stream, we have to read through the file to find the comment block at the end. CommentsBegin()/CommentsEnd() should load the trailer if it isn't yet loaded, but currently they do not.
Additionally, "xcdf-utility comments" would be useful.
Particularly in Symbol.h and Node.h. This prevents using these headers in libraries where they are included from multiple source files.
I can't believe I wrote this. It is a nightmare and should be updated to something more maintainable.
The deltas logically will always span a smaller range than the actual values. We should always compress using deltas. Issues:
This replaces issue #9, which would be a much worse solution.
From memory, this looked to be hard, but it needs to happen.
Replace custom memory management with std::shared_ptr everywhere. Using the compile definition "-std=c++0x" seems to work well enough on RHEL6 and OSX.
[jbraun@i3-dhcp-172-16-223-209:/splitTest 802]$ xcdf-utility select "eventC < 10000" tt.xcd > tt2.xcd
[jbraun@i3-dhcp-172-16-223-209:/splitTest 803]$ xcdf-utility count tt2.xcd
146306
[jbraun@i3-dhcp-172-16-223-209:~/splitTest 809]$ xcdf-utility select "xx < 10000" tt.xcd > tt2.xcd
XCDF FATAL ERROR: /Users/jbraun/software/ApeInstalled/build/xcdf/xcdf-2.04.00/include/xcdf/utility/Expression.h,ParseValue:154]: Unable to parse symbol "xx"
Why isn't an error thrown when "eventC" is used? It is not a field in the file.
This should be reasonable behavior. The problem is that the code checks whether we have an open ostream rather than directly passing whether we need to write the header.
bbaugh [4:41 PM]
Looks like it isn't passed as a python Exception
try:
    xcf = xcdf.XCDFFile('/Users/bbaugh/Dropbox/plots/hit-dropping/frac_flag_sim.txt')
except Exception as inst:
    print type(inst)
Yields:
libc++abi.dylib: terminating with uncaught exception of type XCDFException
Abort trap: 6
bbaugh [4:42 PM]
If it were being passed to python then it would print out the type of the exception.
dlennarz [4:47 PM]
that's unfortunate
bbaugh [5:01 PM]
It seems like it would be a good feature. If the python bindings are supplied by the XCDF project, I would put in a feature request on GitHub.
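The usual pattern for that feature, shown with the CPython API replaced by a stand-in so the example is self-contained (none of these names are real PyXCDF code): every binding entry point catches XCDFException at the C boundary, records an error for the caller (PyErr_SetString in real bindings), and returns NULL instead of letting the C++ exception escape and abort the interpreter.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for XCDFException.
struct XCDFExceptionStandIn : std::runtime_error {
  explicit XCDFExceptionStandIn(const std::string& m)
      : std::runtime_error(m) { }
};

// Stand-in for the thread's Python error indicator.
static std::string lastError;

// Stand-in for a binding entry point such as XCDFFile's constructor.
void* OpenFileBinding(const std::string& path) {
  try {
    if (path.rfind(".xcd") == std::string::npos)
      throw XCDFExceptionStandIn("not an XCDF file: " + path);
    return new int(0);          // stand-in for a real XCDFFile handle
  } catch (const XCDFExceptionStandIn& e) {
    lastError = e.what();       // real code: PyErr_SetString(XcdfError, ...)
    return 0;                   // NULL tells Python an exception is set
  }
}
```

With this in place, the `except Exception as inst:` block in the chat above would catch a proper Python exception instead of the process dying with "Abort trap: 6".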
"(field1, field2)==(1,2)" should be shorthand for "field1==1&&field2==2". "field1==(1,2)" should be shorthand for "field1==1 || field1==2". If we require the RHS to be constants, this is easily implemented within the existing "in" functionality, e.g. "in((field1, field2), (1, 2))" or "in((field1, field2), ((1, 2),(3,4)))" as shorthand for "(field1==1&&field2==2)||(field1==3&&field2==4)".
It is unclear whether it would be useful to also allow variables on the RHS. Probably useful, but uncommon.
Only track the difference between one value and the next. This holds for arrays too. Only track the difference between one bin value and the next.
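A minimal delta transform sketch (not the XCDF on-disk format): store the first value, then successive differences. Deltas of slowly varying data span a smaller range than the raw values, so they pack into fewer bits, and the same transform applies unchanged to array fields and histogram bin contents.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Encode: out[0] = v[0], out[i] = v[i] - v[i-1].
std::vector<long> DeltaEncode(const std::vector<long>& v) {
  std::vector<long> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i)
    out[i] = (i == 0) ? v[0] : v[i] - v[i - 1];
  return out;
}

// Decode by running-sum: exact inverse of DeltaEncode.
std::vector<long> DeltaDecode(const std::vector<long>& d) {
  std::vector<long> out(d.size());
  long running = 0;
  for (std::size_t i = 0; i < d.size(); ++i) {
    running += d[i];
    out[i] = running;
  }
  return out;
}
```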
Need to check for presence of "x" character before attempting hex parse
This means exposing the XCDFFile::Write() method. Most likely, we'll also need to create an XCDFField class (or 3 of them) and expose XCDFField::Add and XCDFField::Get().
This feature is needed, since 2D data shows up everywhere, but true 3D data is much much less common.
How to handle the field entry count checking?
How to handle fields stored as deltas?
If a CSV with a space delimiter or a tab delimiter is parsed with the XCDF utility, it is converted into an xcdf file containing only the first field. This isn't a huge issue, but it seems like it could issue a warning or an error.
Example:
xcdf-utility paste -o test.xcd test_space.csv
With test_space.csv:
test.unsign/U/1 test.sign/I/1 test.flt/F/0.0001
12412 1123 1.632512
122213 12314 6.231241
2123123 -1858684 10.46312
Reading an XCDF file with many fields is currently a bottleneck for analysis. Often only one or two of the fields are actually used. We could speed up analysis by a factor of 5 or more if we check that a field has external references before we decode its data. This probably means counting the number of base references (probably 1, in XCDFDataManager) and skipping the read if the total number of references equals the number of base references.
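The proposed check, sketched with a trivial intrusive count (names illustrative; the real count would come from XCDFDataManager's GetReferenceCount()): each field records how many references the library itself holds, and the block reader decodes a field only when some external user also holds one.

```cpp
#include <cassert>

// A field is "active" only if someone beyond the library references it.
class FieldRefs {
 public:
  explicit FieldRefs(unsigned baseRefs)
      : baseRefs_(baseRefs), refs_(baseRefs) { }

  // Called when e.g. the user requests the field via GetUnsignedField().
  void AddUserReference() { ++refs_; }

  // Block reader: skip both the read and the parse when inactive.
  bool IsActive() const { return refs_ > baseRefs_; }

 private:
  unsigned baseRefs_;
  unsigned refs_;
};
```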
It would be nice from a user perspective to know how the bytes in the file are allocated among the fields. This could be printed out in e.g. xcdf-utility info after the number of entries is written. Express this as bytes/entry.
We never used it, and it needlessly complicates the code.
It may be useful for a file to contain a field that is an expression derived from other fields in the file, e.g. "R" --> "sqrt(X_X + Y_Y)".
From Andy:
Sometimes I think that it would also be cool to have calculated
fields. For example: r = sqrt(x_x + y_y) where x,y are stored and
r is computed on the fly.
When I think about this more, I sometimes convince myself that this
is a misuse as simple equations can be coded into a text line.
But, sometimes there are calculations that are too complex to code
easily into a text line. An extreme example of this is time,RA,dec,
and theta,phi. Only 3 of these 5 are required, the other 2 are
computed. There is a CPU/storage tradeoff here and the optimal
solution may be to just store the 5 variables, but in other cases,
it may be easier to compute variables on the fly. This would require
that something like a function be carried along as an arbitrary
field type, which may be way beyond the scope of XCDF.
This is all just thinking out loud. I don’t know the right answer.
Imagine an XCDF file with the following field structure
A --> B --> C
A --> D
E
where root field A is the parent of B and D and B is the parent of 2D field C, and E is an unrelated root field. We can compare/histogram all field combinations except (CB) and (CD). Generally (CB) would be unusual since B only contains a number of entries, but (CD) would not be uncommon, since node D may contain a weight or something else to cut/draw against. In such a case, it would be proper to only draw items in C that correlate with the proper entry in B/D. Since we can get at the parent field in XCDFField/Node, it should be possible to implement this e.g.:
unsigned GetParentIndex(unsigned index)
that uses std::find on the parent iterator. Or better yet, a routine that does this, since this interface should be invisible to the user.
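A sketch of that routine, assuming (as in XCDF) that the parent field holds the entry count of each child segment; rather than std::find, a running sum over the parent's counts locates the segment containing a given child index, which is what lets C's entries be correlated with the matching entry of a sibling field like D.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map a child entry index to the parent entry whose segment contains it.
// parentCounts[p] = number of child entries under parent entry p.
std::size_t GetParentIndex(const std::vector<unsigned>& parentCounts,
                           std::size_t childIndex) {
  std::size_t cumulative = 0;
  for (std::size_t p = 0; p < parentCounts.size(); ++p) {
    cumulative += parentCounts[p];
    if (childIndex < cumulative)
      return p;
  }
  return parentCounts.size();   // out of range
}
```

As the text suggests, this lookup would live behind the comparison/histogram machinery, invisible to the user.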
This is related to #69.
Teensy issue with xcdf/utility/NumericalExpression.h
This code fails to compile with complaints about missing definition of XCDFPtr.
#include <xcdf/utility/NumericalExpression.h> int main(){}
Adding
#include <xcdf/XCDFPtr.h>
to the top of NumericalExpression.h is all that's needed to fix it.
It's a small thing, I just assume that we want an include file to itself include everything needed for it.