scikit-hep / uproot5 Goto Github PK

View Code? Open in Web Editor NEW

218.0 19.0 67.0 3.85 MB

ROOT I/O in pure Python and NumPy.

Home Page: https://uproot.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 100.00% HTML 0.01%

root root-cern hep hep-ex hep-py numpy big-data bigdata file-format analysis

uproot5's Introduction

Uproot is a library for reading and writing ROOT files in pure Python and NumPy.

Unlike the standard C++ ROOT implementation, Uproot is only an I/O library, primarily intended to stream data into machine learning libraries in Python. Unlike PyROOT and root_numpy, Uproot does not depend on C++ ROOT. Instead, it uses Numpy to cast blocks of data from the ROOT file as Numpy arrays.

Installation

Uproot can be installed from PyPI using pip.

pip install uproot

Uproot is also available using conda.

conda install -c conda-forge uproot

If you have already added conda-forge as a channel, the -c conda-forge is unnecessary. Adding the channel is recommended because it ensures that all of your packages use compatible versions (see conda-forge docs):

conda config --add channels conda-forge
conda update --all

Getting help

Start with the tutorials and reference documentation.

Report bugs, request features, and ask for additional documentation on GitHub Issues.
If you have a "How do I...?" question, start a GitHub Discussion with category "Q&A".
Alternatively, ask about it on StackOverflow with the [uproot] tag. Be sure to include tags for any other libraries that you use, such as Pandas or PyTorch.
To ask questions in real time, try the Gitter Scikit-HEP/uproot chat room.

Installation for developers

Uproot is an ordinary Python library; you can get a copy of the code with

git clone https://github.com/scikit-hep/uproot5.git

and install it locally by calling pip install -e . in the repository directory.

If you need to develop Awkward Array as well, see its installation for developers.

Dependencies

Uproot's only strict dependencies are NumPy and packaging. Strict dependencies are automatically installed by pip (or conda).

Awkward Array is highly recommended and is automatically installed by pip (or conda), though it is possible to use Uproot without it. If you need a minimal installation, pass --no-deps to pip and pass library="np" to every array-fetching function, or globally set uproot.default_library to get NumPy arrays instead of Awkward Arrays.

awkward: Uproot 5.x requires Awkward 2.x.

The following libraries are also useful in conjunction with Uproot, but are not necessary. If you call a function that needs one, you'll be prompted to install it. (Conda installs most of these automatically.)

For ROOT files, compressed different ways:

lz4 and xxhash: if reading ROOT files that have been LZ4-compressed.
zstandard: if reading ROOT files that have been ZSTD-compressed.
ZLIB and LZMA are built in (Python standard library).

For accessing remote files:

minio: if reading files with s3:// URIs.
xrootd: if reading files with root:// URIs.
HTTP/S access is built in (Python standard library).

For distributed computing with Dask:

dask: see uproot.dask.
dask-awkward: for data with irregular structure ("jagged" arrays), see dask-awkward.

For exporting TTrees to Pandas:

pandas: if library="pd".
awkward-pandas: if library="pd" and the data have irregular structure ("jagged" arrays), see awkward-pandas.

For exporting histograms:

boost-histogram: if converting histograms to boost-histogram with histogram.to_boost().
hist: if converting histograms to hist with histogram.to_hist().

Acknowledgements

Support for this work was provided by NSF cooperative agreements OAC-1836650 and PHY-2323298 (IRIS-HEP), grant OAC-1450377 (DIANA/HEP), and PHY-2121686 (US-CMS LHC Ops).

Thanks especially to the gracious help of Uproot contributors (including the original repository).

_{Jim Pivarski} 💻 📖 🚇 🚧	_{Pratyush Das} 💻 🚇	_{Chris Burr} 💻 🚇	_{Dmitri Smirnov} 💻	_{Matthew Feickert} 🚇	_{Tamas Gal} 💻	_{Luke Kreczko} 💻 ⚠️
_{Nicholas Smith} 💻	_{Noah Biederbeck} 💻	_{Oksana Shadura} 💻 🚇	_{Henry Schreiner} 💻 🚇 ⚠️	_{Mason Proffitt} 💻 ⚠️	_{Jonas Rembser} 💻	_benkrikler 💻
_{Hans Dembinski} 📖	_{Marcel R.} 💻	_{Ruggero Turra} 💻	_{Jonas Rübenach} 💻	_bfis 💻	_{Raymond Ehlers} 💻	_{Andrzej Novak} 💻
_{Josh Bendavid} 💻	_{Doug Davis} 💻	_{Chao Gu} 💻	_{Lukas Koch} 💻	_{Michele Peresano} 💻	_Edoardo 💻	_{JMSchoeffmann} 💻
_{alexander-held} 💻	_{Giordon Stark} 💻	_{Ryunosuke O'Neil} 💻	_{ChristopheRappold} 📖	_{Cosmin Deaconu} ⚠️ 💻	_{Carlos Pegueros} 📖 💡 ⚠️ ✅	_{Benjamin Tovar} 💻
_{Duncan Macleod} 🚇	_mpad 💻	_{Peter Fackeldey} 💻	_{Kush Kothari} 💻	_{Aryan Roy} 💻	_{Jerry Ling} 💻	_kakwok 💻
_{Dmitry Kalinkin} 💻 🚇	_{Nikolai Hartmann} 💻	_{Kilian Lieret} 📖	_{Daniel Cervenkov} 💻	_{Beojan Stanislaus} 💻	_{Angus Hollands} 💻 🚧	_{Luis Antonio Obis Aparicio} 💻
_renyhp 💻	_{Lindsey Gray} 💻	_ioanaif 💻	_{OTABI Tomoya} ⚠️	_{Jost Migenda} 📖	_{Gaétan Lepage} ⚠️	_{HaarigerHarald} 💻
_{Ben Greiner} ⚠️	_{Robin Sonnabend} 💻	_{Bo Johnson} 💻	_Miles 💻	_djw9497 💻

💻: code, 📖: documentation, 🚇: infrastructure, 🚧: maintainance, ⚠: tests/feedback, 🤔: foundational ideas.

uproot5's People

Contributors

Stargazers

Watchers

uproot5's Issues

Uproot4 and Double32_t

Hi,
I'm facing a weird issue when dealing with Double_32t with uproot4.
I have uploaded a small tree to reproduce the problem, in particular:

uproot4.open("test_tree.root")["Hyp3O2"]["RHyperTriton"]["m"].array() converts the floating variable "m" as expected
-uproot4.open("test_tree.root")["Hyp3O2"]["RHyperTriton"]["mppi_vert"].array() fails while converting the double32 variable "mppi_vert" with the following error: AttributeError: 'NoneType' object has no attribute 'title'

The link to the tree is:

https://drive.google.com/file/d/14kF2gEKV7-eIbhSByd-F0cnUuVwrBopp/view?usp=sharing

Please note that the conversion with old uproot works nicely.
Any idea of what is happening here?

Handle "long long" types

They are not included in uproot4/interpretation/identity.py::_parse_node.

Follow scikit-hep/uproot3#508.

Possible bug or old version incompatibility, depending on how deep it goes

In this scan, Python 2.6 somehow got off track and was killed (memory error) for these files:

issue390.root
issue465-flat.root
issue433-splitlevel2.root
issue213.root
issue452.root
issue431b.root
issue126b.root
issue126a.root
issue407.root

I don't know if the same is true for Python 2.7. It happens when deserializing the TTree metadata (not even reading arrays); the TObjArray of TBranches (second level) reads a huge number as its length and somehow doesn't stop attempting to read. Deserialization logic doesn't sound like a Python 2 vs 3 thing, which is why it's relevant (Python 2 isn't, by itself, relevant): it could be a deeper bug.

It's possible that these get off track in Python 3 as well—there are plenty of TTree/TBranch/TLeaf objects out there that don't match ROOT's standard streamers, but are correct according to their own streamers. For this, Uproot first attempts to use the standard streamer, then falls back on reading the file's streamer and using those (once). Python 2 could be failing to fall back.

But you'd think it would hit the end of its (very limited) chunk and raise the error there, instead of running wild. Maybe old NumPy (1.10.4) has different slice semantics and the chunk sizes aren't limited? If so, then it's just a question of finding when that changed and declaring that the minimum NumPy.

Reading DAOD_PHYSLITE prototype

It might make sense to discuss this here. I'm trying to read as much as possible from the current prototype for the DAOD_PHYSLITE format at ATLAS using uproot. The following uses these example root files: DAOD_PHYSLITE.art.pool.root and DAOD_PHYSLITE.art_split99.pool.root. I'm using uproot4 0.0.27. The things i currently can't read are the following:

Links into multiple other collections (branches whoose names end in "Links"). On the one hand those are vector<vector<...> and on the other hand a custom type (ElementLink<DataVector<something>>). Uproot can read these, but i get some "Unknown" objects out:

>>> import uproot4
>>> f = uproot4.open("DAOD_PHYSLITE.art.pool.root")
>>> t = f["CollectionTree"]
>>> arrays = t["AnalysisElectronsAuxDyn.trackParticleLinks"].array(library="np")
>>> arrays[8]
<STLVector [[<Unknown ElementLink<DataVector<xAOD::TrackParticle_v1_3e at 0x7fa30578b190>, ...], ...] at 0x7fa30578b2b0>

In principle the data part of these ElementLinks are just an index and a hash to identify the collection linked to. This can be seen for the branches that are just vector<ElementLink<...>> (single jagged). ROOT seems to be able to split these for the single-jagged case:

>>> t["AnalysisJetsAuxDyn.btaggingLink"]
<TBranchElement 'AnalysisJetsAuxDyn.btaggingLink' (2 subbranches) at 0x7fa304c861c0>
>>> t["AnalysisJetsAuxDyn.btaggingLink"].keys()
['AnalysisJetsAuxDyn.btaggingLink.m_persKey', 'AnalysisJetsAuxDyn.btaggingLink.m_persIndex']
>>> t["AnalysisJetsAuxDyn.btaggingLink.m_persIndex"].array()
<Array [[0, 1, 2, 3, 4, 5, ... 2, 3, 4, 5, 6]] type='100 * var * uint32'>
>>> t["AnalysisJetsAuxDyn.btaggingLink.m_persKey"].array()
<Array [[1030373024, ... 1030373024]] type='100 * var * uint32'>

Would there be a reasonable way to also read the multi-jagged links when i know that they should contain these uint32 numbers (m_persIndex and m_persKey)? So far i couldn't get ROOT to split them ...

When i set the default split level to 99 one more thing ROOT seems to split is a larger structure used for association of soft terms to the MET (branches with names *METAssoc*.*). In principle the branches are mostly vector<vector<float>, but i get the following error when trying to read them:

>>> import uproot4
>>> f = uproot4.open("DAOD_PHYSLITE.art_split99.pool.root")
>>> t = f["CollectionTree"]
>>> t["METAssoc_AnalysisMETAux.trkpx"].array()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    t["METAssoc_AnalysisMETAux.trkpx"].array()
  File "/home/nikolai/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 1984, in array
    interpretation = self.interpretation
  File "/home/nikolai/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 2120, in interpretation
    self._interpretation = uproot4.interpretation.identify.interpretation_of(
  File "/home/nikolai/.local/lib/python3.8/site-packages/uproot4/interpretation/identify.py", line 443, in interpretation_of
    if branch.streamer is not None:
  File "/home/nikolai/.local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 2232, in streamer
    for element in streamerinfo.walk_members(self._file.streamers):
  File "/home/nikolai/.local/lib/python3.8/site-packages/uproot4/streamers.py", line 372, in walk_members
    base = streamers[element.name][element.base_version]
KeyError: -1

The error also occurs when i run .show() - both on the branch and on the whole tree.

Values read from TH2* are mixed up

It appears that something is going wrong with reading TH2* histograms compared to ROOT or uproot3. When reading a TH2, the second axis values seem to be reversed, as well as the over and underflow next to each other. I can reproduce it (see below), and it seems like it should be straightforward, but somehow I haven't yet found the source.

from pathlib import Path
filename = Path("reproducer.root")
if not filename.exists():
    import ROOT
    # Fill to create an identity, but with different number of bins to ensure that we don't accidentally get it right.
    h = ROOT.TH2D("test", "test", 10, 0, 10, 12, 0, 12)
    for i in range(0, 10):
        h.Fill(i + 0.5, i + 0.5)

    f_out = ROOT.TFile("reproducer.root", "RECREATE")
    h.Write()
    f_out.Close()

import uproot4
import uproot as uproot3
import numpy as np

hist_3 = uproot3.open("reproducer.root")[b"test"]
hist_4 = uproot4.open("reproducer.root")["test"]

# Ignore overflow for now since uproot3.values ignores overflow
np.testing.assert_allclose(hist_3.values, hist_4.values()[1:-1, 1:-1])

this gives:

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 19 / 120 (15.8%)
Max absolute difference: 1.
Max relative difference: 1.
 x: array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],...
 y: array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],...

Just printing all of the values gives some info. Expected from uproot3 (note: excluding flow bins):

>>> print(hist_3.values)
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

Uproot4 (note: including flow bins and the reverse):

>>> print(hist_4.values()[:, ::-1])
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Any help is appreciated!

Simplify GHA conda

Miniconda is already installed in GHA. You may need https://github.com/scikit-hep/vector/blob/60d3637efc98ecf53665ea6e81198bc4d77827d8/.github/workflows/ci.yml#L57
but I don't think you need to depend on a custom action.

https://github.com/scikit-hep/uproot4/blob/4cd75f06218cf381e5906876e6b0b50b9dd9ca82/.github/workflows/build-test.yml#L21

std::vector<vector<float>> not correctly being read in .array() method

Hello! I am analyzing a TTree where one branch is of type std::vector<vector> is not being read. file_str is the ROOT file:

ttree1 = uproot4.open(file_str)["CharmAnalysis"]
ttree1["TruthParticles_Selected_daughterInfoT__eta"].array()

Here is the full stack trace:

IndexError                                Traceback (most recent call last)
<timed eval> in <module>

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   1284             interpretation_executor,
   1285             library,
-> 1286             arrays,
   1287         )
   1288 

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
    493 
    494         elif isinstance(obj, tuple) and len(obj) == 3:
--> 495             uproot4.source.futures.delayed_raise(*obj)
    496 
    497         else:

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
     31         exec("raise exception_class, exception_value, traceback")
     32     else:
---> 33         raise exception_value.with_traceback(traceback)
     34 
     35 

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in basket_to_array(basket)
    467                     branch.entry_offsets,
    468                     library,
--> 469                     branch,
    470                 )
    471         except Exception:

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/interpretation/objects.py in final_array(self, basket_arrays, entry_start, entry_stop, entry_offsets, library, branch)
    218                 for global_i in range(start, stop):
    219                     local_i = global_i - start
--> 220                     output[global_i - entry_start] = basket_array[local_i]
    221 
    222             start = stop

~/.local/cori/3.7-anaconda-2019.07/lib/python3.7/site-packages/uproot4/interpretation/objects.py in __getitem__(self, where)
     67         if uproot4._util.isint(where):
     68             byte_start = self._byte_offsets[where]
---> 69             byte_stop = self._byte_offsets[where + 1]
     70             data = self._byte_content[byte_start:byte_stop]
     71             chunk = uproot4.source.chunk.Chunk.wrap(self._branch.file.source, data)

IndexError: index 12344 is out of bounds for axis 0 with size 12344```

Thoughts about Uproot4 ↔ boost-histogram/hist integration

Since this comes after reading and writing, it's a few steps removed.

However, I think that the histogram objects in Uproot4 will not be usable as such, but will consist entirely of conversion methods to NumPy-style, boost-histogram, and hist. They should reuse code as much as possible, so it might be hist-by-way-of-boost-histogram and/or aghast.

Mentioning @henryiii and @LovelyBuggies.

Support for boost-histogram 0.10.0

In scikit-hep/boost-histogram#410, @HDembinski requested that I add an underscore to sum_of_weighted_deltas_squared, and this has broken Uproot's to_boost() method on TProfiles.

Now when reading a TProfile, we get the error:

AttributeError                            Traceback (most recent call last)
<ipython-input-10-d946cc02fe81> in <module>
----> 1 mplhep.histplot(rfile['hprof;1'].to_boost())

/usr/local/Caskroom/miniconda/base/envs/bh-talk/lib/python3.8/site-packages/uproot4/behaviors/TProfile.py in to_boost(self)
    178         view.sum_of_weights_squared
    179         view.value = values
--> 180         view.sum_of_weighted_deltas_squared
    181 
    182         raise NotImplementedError(repr(self))

AttributeError: 'WeightedMeanView' object has no attribute 'sum_of_weighted_deltas_squared'

Support XRootD URLs with multiple servers.

From the TXNetFile docs:

The "url" argument must be of the form

root://server1:port1[,server2:port2,...,serverN:portN]/pathfile,

Note that this means that multiple servers (>= 1) can be specified in the url. The connection will try to connect to the first server:port and if that does not succeed, it will try the second one, and so on until it finds a server that will respond.

I didn't realise this was an pre-existing feature in ROOT but I'd been asking for it for a while. Being able to do this better matches with how data is stored on the grid and makes it easier to when sites are down.

The implementation of this shouldn't only check that the server is alive, it should also fallback if errors are returned when trying to access the data (i.e. only querying the config isn't enough to know if the site is down).

Histograms should have "to_pandas" like Uproot 3's "pandas" method

Giammi reminded me that this is a useful function. Once in Pandas form, there's a lot you can do.

Here: https://stackoverflow.com/questions/63790713/uproot-processing-a-th2d-using-the-uproot-method-pandas/63794125?noredirect=1#comment112844565_63794125

HTTP protocol doesn't follow redirect

I cannot open

uproot4.open("https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root")

yet I can open the place it 302 redirects to:

uproot4.open("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root")

Implement std::bitset

mc10events.root doesn't have it, but some other samples do; see this for searching.

Calling uproot.open many times uses up all available threads

Not sure quite how to explain it, but it seems like the reliance on numpy's mmap is causing an inability to handle opening many files at once.

$ python
>>> import uproot
u>>> uproot.version.version
'3.12.0'
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> exit()

versus

>>> import uproot4 as uproot
>>> uproot.version.version
'0.0.23'
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
>>> a = uproot.open("/nfs/slac/atlas/fs1/d/yuzhan/collinearw_files/June2020_Production/merged_files/Wj_AB212108_v2_mc16a.root")
Traceback (most recent call last):
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/source/file.py", line 110, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/reading.py", line 142, in open
    **options  # NOTE: a comma after **options breaks Python 2
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/reading.py", line 537, in __init__
    file_path, **self._options  # NOTE: a comma after **options breaks Python 2
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/source/file.py", line 117, in __init__
    file_path, **opts  # NOTE: a comma after **opts breaks Python 2
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/source/file.py", line 246, in __init__
    [FileResource(file_path) for x in uproot4._util.range(num_workers)]
  File "/gpfs/slac/atlas/fs1/u/gstark/collinearw/py3/lib/python3.6/site-packages/uproot4/source/futures.py", line 351, in __init__
    worker.start()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-f74f0/x86_64-centos7-gcc8-opt/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
>>> exit()

File access methods performance studies

Data provided by @chrisburr; discussion below

If I'm making a fundamental mistake in these file access methods, then I don't want to design around it. I've been thinking about how I can adequately test that. Do you have a suggestion for a publicly available test file?

I default to using these three open data files for XRootD:

$ xrdfs root://eospublic.cern.ch/ ls -l /eos/opendata/lhcb/AntimatterMatters2017/data
-r-- 2017-03-07 13:53:05   666484974 /eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root
-r-- 2017-03-07 13:53:08   444723234 /eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetUp.root
-r-- 2017-03-07 13:53:08     2272072 /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root

I've just copied them to CERN S3 so they can be accessed using standard HTTP(S) as well:

Would a test through a household ISP mean anything, given that most of these files would be used on university or lab computers or on the GRID?

It's definitely useful as the quality of institute connections varies a lot in my experience and it's more likely to expose issues with latency, especially given you'll be accessing them from the US. I think a good implementation should only be limited by bandwidth when reading large enough files.

If you put together a test I can easily try it at a few different institutes as well as pushing a test job to every LHCb grid site.

Originally posted by @chrisburr in #1 (comment)

Handle ROOT's memberwise splitting

Consider the weird serialization in scikit-hep/uproot3#373, scikit-hep/uproot3#374, scikit-hep/uproot3#403, scikit-hep/uproot3#475, and scikit-hep/uproot3#495. It's field-at-a-time inside of each entry. I had thought it was Boost-inside-ROOT, but not for most of the above. It may be some ROOT serialization mode that I'm unaware of.

Branch-joining logic for parent branches (Awkward Arrays only?)

Write split std::map and invoke reading at the structural branch level. The branch-joining logic will need to be used at both TBranch.array and HasBranches.arrays.

Test it for the new-style Lorentz vectors, too.

flow option for values/edges

Would be convenient to have flow option for values/edges methods on THX matching with the to_numpy() method, esp when porting uproot3 code.

Trouble with modern LorentzVector classes

I have flat ntuples containing std::vector of:

ROOT::Math::PositionVector3D<ROOT::Math::Cartesian3D<double>,ROOT::Math::DefaultCoordinateSystemTag>
ROOT::Math::DisplacementVector3D<ROOT::Math::Cartesian3D<double>,ROOT::Math::DefaultCoordinateSystemTag>
ROOT::Math::LorentzVector<ROOT::Math::PtEtaPhiE4D<double> >

uproot4 seems to be choking badly on these classes. Can you please take a look and make sure that it is able to handle these? An example file can be found at /uscms_data/d2/aperloff/YOURWORKINGAREA/SUSY/slc7/CMSSW_10_2_21/src/TreeMaker/Production/test/SplitLevelTest/Summer16v3.TTJets_SingleLeptFromT_TuneCUETP8M1_13TeV-madgraphMLM-pythia8_split99Lorentz_RA2AnalysisTree.root.

I tried exploring the file with:

import uproot4
f = uproot4.open("/uscms_data/d2/aperloff/YOURWORKINGAREA/SUSY/slc7/CMSSW_10_2_21/src/TreeMaker/Production/test/SplitLevelTest/Summer16v3.TTJets_SingleLeptFromT_TuneCUETP8M1_13TeV-madgraphMLM-pythia8_split99Lorentz_RA2AnalysisTree.root")
t = f["TreeMaker2/PreSelection"]
t.show()

Most of the branches look as expected, but the ROOT::Math::LorentzVector branches have:

Electrons            | std::vector<ROOT::Ma | AsObjects(AsVector(True, Unknown_R

I have turned the split level to 99, so I would expect these to be AsJagged and not AsObjects. Also, use the Unknown_R thing going to be a problem? Even worse, the whole thing crashes on ROOT::Math::PositionVector3D<ROOT::Math::Cartesian3D<double>,ROOT::Math::DefaultCoordinateSystemTag> with two errors:

uproot4.interpretation.identify.NotNumerical
ValueError: invalid C++ type name syntax at char 68

A full log has been attached.
uproot4crash.log

Predefined classes are preventing "wrong" class/streamer versions from being read

... but these do happen in real data, often from files created by an unreleased version of ROOT. I suspect that's what happened with demo-double32.root. (I might have made it while collaborating with Brian; my ROOT had TIOFeatures but an old TBranch class version.) Uproot3 reads this TTree with no problem because it imports and uses the wrong streamer from the file; Uproot4 fails because its expected streamer (the "right" one) doesn't match the data.

Failed endcheck or read off end of chunk should trigger an attempt to read the streamer and try again (once). This would require a custom Exception for these cases and retry logic in the TKey.get. Test case: demo-double32.root. Preferably, it should use a local classes dict.

Even if a read finally fails (can't find a good streamer or even the streamer from the file doesn't work), these two error cases should be reported to the user as "failed deserialization," rather than "endcheck size didn't match" or "attempting to read beyond end of chunk." These are very common symptoms of deserialization failures and are frequently reported with confusion in the Issues.

Maybe I should also add breadcrumbs to pinpoint the location of the failure. Beyond being more self-explanatory for users, this could be an investment in developing more deserializers because I wouldn't have to spend the time searching for the point of error.

Test a URL with an HTTP port number

It seems that file_object_path_split is splitting on the colon between the server name and the file path.

Internal error while coordinating basket-reads for a branch

I think I may have found a bug.
During skimming, I store quantities as std::vector and write those into the root tree. However, when using uproot4 and doing something like
uproot4.open("Test_1.root")["Events"]["Muon_Pt"].array()
I get the following error:
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/nfs/dust/cms/user/wiens/anaconda3/envs/hepML/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 1548, in array _ranges_or_baskets_to_arrays( File "/nfs/dust/cms/user/wiens/anaconda3/envs/hepML/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 509, in _ranges_or_baskets_to_arrays uproot4.source.futures.delayed_raise(*obj) File "/nfs/dust/cms/user/wiens/anaconda3/envs/hepML/lib/python3.8/site-packages/uproot4/source/futures.py", line 33, in delayed_raise raise exception_value.with_traceback(traceback) File "/nfs/dust/cms/user/wiens/anaconda3/envs/hepML/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 451, in basket_to_array interpretation = branchid_interpretation[id(branch)] KeyError: 139865518582800
Doing the exact same thing but using uproot (i.e. uproot3) instead it just works.

Memory leak

It seems, at least for me, that uproot4 is using twice as memory to load a branch compared to uproot3.

A created a simple but large enough TTree to test this with this script

#include <array>

void test() {
    TFile file("test.root", "recreate");
    TTree tree("tree", "test memory leak");

    std::array<Float_t, 1000> array;
    tree.Branch("array", &array);

    for (int i = 0; i < 100000; i++) {
        for (auto &x : array)
            x = i;
        tree.Fill();
    }
    file.Write();
    file.Close();
}

And them I loaded this file using uproot3 and 4

import uproot
with uproot.open('test.root') as file:
    data = file['tree/array'].array()

import uproot4
with uproot4.open('test.root') as file:
    data = file['tree/array'].array(library='np')

Checking the memory consumption of those processes I got this

$ ps -e -o command,pid,%mem | grep python
python  21032  10.5   # uproot4
python  21256   5.2   # uproot3

The uproot4 version is using twice as memory, so it seams the data is duplicated in memory. Next I deleted the data array with del data in both processes and that release the memory used by the array but the "duplicated" data remained in uproot4.

$ ps -e -o command,pid,%mem | grep python
python  21032  5.7  # uproot4
python  21256  0.4  # uproot3

For this test I used the 0.23 version of uproot4. I'm doing something wrong or this is indeed a bug?

Add ability to break file connection in objects

And use this option in __deepcopy__ and when handing objects to boost-histogram/hist. It should have the file_path and uuid, or at least for the latter set the object_path and cache_key in a permanent way.

@henryiii

parsing a struct with 2D arrays

This is a continuation of https://stackoverflow.com/questions/62519766/parsing-a-struct-using-2d-arrays-using-uproot/62520558#62520558

Jim,

Thanks, for the fix on StackExchange. Here is a small root file along with the macro that generated it so that you can test the automated interpretation.

2DinStruct.root.gz
Test2DinStruct.C.gz

Ron

Thoughts about porting Uproot3 writing to Uproot4

This will take some thought, and it will come after the two reading projects.

For now, though, I should bring this to @reikdas's attention.

Port most of the tests from the old test_issues.py

Non-existent aliases tripping up arrays()?

I took uproot4 for a spin for the first time and it choked on a call to arrays():

---------------------------------------------------------------------------
KeyInFileError                            Traceback (most recent call last)
~/workspace/hgc_l1_trigger_autoencoder/reco/reco.py in 
----> 24 hits.arrays()

/usr/local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py in arrays(self, expressions, cut, filter_name, filter_typename, filter_branch, aliases, compute, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library, how)
    914                 return None
    915 
--> 916         aliases = _regularize_aliases(self, aliases)
    917         arrays, expression_context, branchid_interpretation = _regularize_expressions(
    918             self,

/usr/local/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py in _regularize_aliases(hasbranches, aliases)
    100 def _regularize_aliases(hasbranches, aliases):
    101     if aliases is None:
--> 102         return hasbranches.aliases
    103     else:
    104         new_aliases = dict(hasbranches.aliases)

/usr/local/lib/python3.8/site-packages/uproot4/behaviors/TTree.py in aliases(self)
     21     @property
     22     def aliases(self):
---> 23         aliases = self.member("fAliases")
     24         if aliases is None:
     25             return {}

/usr/local/lib/python3.8/site-packages/uproot4/model.py in member(self, name, bases, recursive_bases, none_if_missing)
    288             return None
    289         else:
--> 290             raise uproot4.KeyInFileError(
    291                 name,
    292                 """{0}.{1} has only the following members:

KeyInFileError: not found: 'fAliases' because .Model_TTree_v5 has only the following members:

    '@fUniqueID', '@fBits', 'fName', 'fTitle', 'fLineColor', 'fLineStyle', 'fLineWidth', 'fFillColor', 'fFillStyle', 'fMarkerColor', 'fMarkerStyle', 'fMarkerSize', 'fEntries', 'fTotBytes', 'fZipBytes', 'fSavedBytes', 'fTimerInterval', 'fScanField', 'fUpdate', 'fMaxEntryLoop', 'fMaxVirtualSize', 'fAutoSave', 'fEstimate', 'fBranches', 'fLeaves', 'fIndexValues', 'fIndex'

in file ../dat/config1_mu+_95_to_100GeV_4Tesla_t0.root

The tree looks like:

hits.show(interpretation_width=50)
name                 | typename             | interpretation                                    
---------------------+----------------------+---------------------------------------------------
eventID              | int32_t              | AsDtype('>i4')                                    
pdgID                | int32_t              | AsDtype('>i4')                                    
totalEnergy_GeV      | double               | AsDtype('>f8')                                    
P_X_GeV              | double               | AsDtype('>f8')                                    
P_Y_GeV              | double               | AsDtype('>f8')                                    
P_Z_GeV              | double               | AsDtype('>f8')                                    
beamX_cm             | double               | AsDtype('>f8')                                    
beamY_cm             | double               | AsDtype('>f8')                                    
beamZ_cm             | double               | AsDtype('>f8')                                    
hit_ID               | std::vector<int32_t> | AsJagged(AsDtype('>i4'), header_bytes=10)         
hit_x_cm             | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_y_cm             | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_z_cm             | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_Edep_keV         | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_EdepNonIonizing_ | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_TOA_ns           | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_TOA_last_ns      | std::vector<double>  | AsJagged(AsDtype('>f8'), header_bytes=10)         
hit_type             | std::vector<int32_t> | AsJagged(AsDtype('>i4'), header_bytes=10)         
ESum_MeV             | double               | AsDtype('>f8')                                    
COG_Z_cm             | double               | AsDtype('>f8')                                    
NHits                | int32_t              | AsDtype('>i4')

One fishy thing is that any call of the form hist[col_name].array() is perfectly fine, but hist.arrays([col1, col2]) fails.

So, I tried to make a MWE, but unfortunately it fails to reproduce so I wonder about the root of the issue:

#%%
from ROOT import TFile, TTree
from array import array

f = TFile('foo.root','recreate')
t = TTree('foo','foo')
maxn = 10
n = array('i',[0])
d = array('f',maxn*[0.])
t.Branch('mynum',n,'mynum/I')
t.Branch('myval',d,'myval[mynum]/D')

for i in range(25):
    n[0] = min(i,maxn)
    for j in range(n[0]):
        d[j] = i*0.1+j
        t.Fill()

f.Write()
f.Close()

# %%
import uproot4 as ur

#%%
t = ur.open('./foo.root:foo')

# %%
t.show()

# %%
t.arrays(library='pd')

This example's show() yields:

name                 | typename             | interpretation                    
---------------------+----------------------+-----------------------------------
mynum                | int32_t              | AsDtype('>i4')                    
myval                | double[]             | AsJagged(AsDtype('>f8'))

I've noticed a couple of differences between the failing and passing cases:

Failing is coming from C++, while passing is all Python,
Failing uses std::vector<>, while passing uses array,
Failing prints AsJagged(AsDtype('>f8'), header_bytes=10) , while passing has just AsJagged(AsDtype('>f8')), i.e. no header_bytes=10.

Let me know if any of this is helpful and whether I can help making better sense of it.

Cursor.skip_over can only be used on an object with non-null `num_bytes`

Using uproot4 v'0.0.16', when trying to read a tree that was copied from another file (may be coincidental - just including all details), I receive the following traceback:

>>> import uproot4 as uproot
>>> f = uproot.open("AnalysisResults.root")
>>> f.keys()
['AliAnalysisTask;3']
>>> f["AliAnalysisTask"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/reading.py", line 1371, in __getitem__
    return self.key(where).get()
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/reading.py", line 870, in get
    out = cls.read(chunk, cursor, context, self._file, self)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 478, in read
    versioned_cls.read(chunk, cursor, context, file, parent, concrete=concrete),
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 123, in read
    self.read_members(chunk, cursor, context)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/models/TTree.py", line 665, in read_members
    chunk, cursor, context, self._file, self._concrete
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 123, in read
    self.read_members(chunk, cursor, context)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/models/TObjArray.py", line 42, in read_members
    chunk, cursor, context, self._file, self._parent
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/deserialization.py", line 280, in read_object_any
    obj = cls.read(chunk, cursor, context, file, parent)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 478, in read
    versioned_cls.read(chunk, cursor, context, file, parent, concrete=concrete),
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 123, in read
    self.read_members(chunk, cursor, context)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/models/TBranch.py", line 626, in read_members
    concrete=self._concrete,
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/model.py", line 123, in read
    self.read_members(chunk, cursor, context)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/models/TBranch.py", line 296, in read_members
    cursor.skip_over(chunk, context)
  File "/alf/data/rehlers/substructure/.venv/lib/python3.7/site-packages/uproot4/source/cursor.py", line 152, in skip_over
    "Cursor.skip_over can only be used on an object with non-null "
TypeError: Cursor.skip_over can only be used on an object with non-null `num_bytes`

I can read the tree just fine in root v6.20/02 (+ some alice patches). I can send the file, but need to keep it private, so I can't post it here. Any help or suggestions are appreciated!

Reading large doubly jagged arrays iteratively/lazily?

Edit: sorry, some copy&paste error destroyed the intro 😉 now fixed

So basically what we need is to process many branches of doubly jagged arrays in files which do not fit into memory.

Here is an example with a small file, showing the structure of such a branch (with uproot3 which falls back to ObjectArray and uproot4 which gives a nice awkward1 array):

import uproot
import uproot4
from km3net_testdata import data_path
print(uproot.__version__)  # 3.12.0
print(uproot4.__version__)  # 0.0.20

filename = data_path("offline/km3net_offline.root")
f = uproot.open(filename)
f4 = uproot4.open(filename)

f["E/Evt/trks/trks.rec_stages"].array()
# <ObjectArray [[[1, 3, 5, 4], [1, 3, 5], [1, 3], ...

f4["E/Evt/trks/trks.rec_stages"].array()
# <Array [[[1, 3, 5, 4], [1, ... 1], [1], [1]]] type='10 * var * var * int64'>

I uploaded a larger file here: http://131.188.167.67:8889/doubly_jagged.root for which f["E/Evt/trks/trks.rec_stages"].array() takes extremely long.

So I tried to utilise uproot4.lazy() and played around with uproot4.iterate() but I have not figured out how to read the data iteratively. Here are my naive approaches (based on some

trks = uproot4.lazy({"doubly_jagged.root": "E/Evt/trks"})
trks[0, "trks.rec_stages"]

which yiels:

ValueError: generated array does not conform to expected form:

{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": "int32"
    }
}

but generated:

{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": "int64"
    }
}

(https://github.com/scikit-hep/awkward-1.0/blob/0.2.33/src/libawkward/virtual/ArrayGenerator.cpp#L46)

I am not sure why it confuses between int32 and int64, maybe due to empty arrays? No clue...

ping @zinebaly

Test all compression algorithms

Use Zmumu and HZZ. (The "sample*.root" files are too small to actually be compressed.)

uproot4 unable to open ROOT file with colons in the name that uproot3 can

Hi. I've come across an issue where if a ROOT file has a colon in the name of it uproot3 can open the file but uproot4 fails. I can't give you the file, but I can show you a minimal failing example and then a reproducible example with public files.

Minimal Failing Example

$ tree .
.
├── data-tree
│   └──  data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root
├── issue.py
├── requirements.txt

1 directory, 3 files
$ cat requirements.txt 
uproot
uproot4
$ docker run --rm -it -v $PWD:/data -w /data python:3.8 /bin/bash
root@510598a7f4e8:/data# python -m pip install --upgrade pip setuptools wheel
root@510598a7f4e8:/data# python -m pip install -r requirements.txt
root@510598a7f4e8:/data# python --version
Python 3.8.5
root@510598a7f4e8:/data# python -m pip list
Package        Version
-------------- -------
awkward        0.13.0
cachetools     4.1.1
numpy          1.19.1
pip            20.2.2
setuptools     49.6.0
uproot         3.12.0
uproot-methods 0.7.4
uproot4        0.0.18
wheel          0.35.1
root@510598a7f4e8:/data# cp data-tree/data16_13TeV\:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root data-tree/renamed.root
root@510598a7f4e8:/data# python
Python 3.8.5 (default, Aug  5 2020, 08:22:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot as uproot3
>>> import uproot4
>>> from pathlib import Path
>>> 
>>> uproot3_file = uproot3.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
>>> print(f"uproot3 opens file as {uproot3_file}")
uproot3 opens file as <ROOTDirectory b'/home/feickert/workarea/submitDir/data-tree//data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root' at 0x7fce6da5b5e0>
>>> 
>>> # uproot4 fails with the ':' in the filename
>>> uproot4.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # even if that is inside a pathlib object
>>> uproot4.open(
...     Path(
...         "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
...     )
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # but the file itself is fine
>>> uproot4.open("data-tree/renamed.root")
<ReadOnlyDirectory '/' at 0x7fce725e7670>

Failing Reproducible Example

# issue.py
import uproot as uproot3
import uproot4
from pathlib import Path


def main():
    # curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
    uproot3.open("uproot-HZZ-lz4.root")
    uproot3.open("uproot:HZZ-lz4.root")
    uproot3.open(Path("uproot:HZZ-lz4.root"))

    uproot4.open("uproot-HZZ-lz4.root")
    uproot4.open("uproot:HZZ-lz4.root")


if __name__ == "__main__":
    main()

root@510598a7f4e8:/data# curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
root@510598a7f4e8:/data# cp uproot-HZZ-lz4.root uproot:HZZ-lz4.root 
root@510598a7f4e8:/data# python issue.py 
Traceback (most recent call last):
  File "issue.py", line 17, in <module>
    main()
  File "issue.py", line 9, in main
    uproot3.open("uproot:HZZ-lz4.root")
  File "/usr/local/lib/python3.8/site-packages/uproot/rootio.py", line 63, in open
    raise ValueError("URI scheme not recognized: {0}".format(path))
ValueError: URI scheme not recognized: uproot:HZZ-lz4.root

Comments

I realize that this is probably because uproot4's open's path is

https://github.com/scikit-hep/uproot4/blob/b8828069b9ae52c742fb704f07e1a4e179fe7b30/uproot4/reading.py#L43-L46

and it isn't a good idea to have a file with a colon in it in general. However, there are ATLAS files that do, and it would be nice if uproot3 behavior toward filenames could still be supported. If support for this is firmly out of scope it would be great if there could be some huge warning about this in the docs.

XRootDResource relies on querying parameters which is not supported by all storage elements

get_server_config queries some parameters for vector reading

https://github.com/scikit-hep/uproot4/blob/73d103dd5588bdf478937e475007fabd1a5803ec/uproot4/source/xrootd.py#L32-L41

not all storage elements seem to be reporting this correctly (seems those that use dCache) - they just report back the name of the parameter, e.g.

>>> import XRootD.client
>>> fs = XRootD.client.FileSystem("root://prometheus.desy.de:1094/")
>>> fs.query(XRootD.client.flags.QueryCode.CONFIG, "readv_iov_max")
(<status: 0, code: 0, errno: 0, message: '[SUCCESS] ', shellcode: 0, error: False, fatal: False, ok: True>, b'readv_iov_max\n')

Unfortunately i can't give a minimal reproducer with opening a root file since i couldn't find publicly accessible root files on any of these storages, but essentially it breaks when trying to convert this value to an int (the follwing works with access to ATLAS VO)

>>> import uproot4
>>> f = uproot4.open("root://lcg-lrz-rootd.grid.lrz.de:1094/pnfs/lrz-muenchen.de/data/atlas/dq2/atlaslocalgroupdisk/rucio/data15_13TeV/f4/ba/DAOD_PHYSLITE.21568620._000001.pool.root.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/python/uproot4/uproot4/reading.py", line 80, in open
    file = ReadOnlyFile(
  File "/home/nikolai/python/uproot4/uproot4/reading.py", line 139, in __init__
    self._source = Source(file_path, **self._options)
  File "/home/nikolai/python/uproot4/uproot4/source/xrootd.py", line 182, in __init__
    self._max_num_elements, self._max_element_size = get_server_config(file_path)
  File "/home/nikolai/python/uproot4/uproot4/source/xrootd.py", line 37, in get_server_config
    readv_iov_max = int(readv_iov_max)
ValueError: invalid literal for int() with base 10: b'readv_iov_max\n'

So some kind of fallback (default value that is configurable if needed?) would be needed to support these storages for now.

The method .numpy discards -inf and inf on x and y axis of imported TH2D

I tried on TH2D graphs and upon the conversion of an imported TH2D using myTH2D.numpy(), the arrays myTH2D.index (y-axis) and myTH2D.column (x-axis) have their dimension INCREASED by 1 respect to the respective rows and columns of the matrix myTH2.values (z-axis).

ttree.iterate documentation bug

Noticed this: https://github.com/scikit-hep/uproot4/blob/21047924d1092a3bcd077774474edf1b5b4acaa8/uproot4/behaviors/TBranch.py#L1103-L1106

It appears that the documentation implies report, events = ttree.iterate() when it's actually events, report = ttree.iterate().

Implement attachment of streamers to branches

Attach streamers to branches, but only for non-trivial types. (Avoid reading any streamers, if possible.)

May be necessary for #27.

Lazyarrays in uproot4

The only references to lazy-loading (of arrays) I could find in the code so far are those using uproot4.lazy(), which requires a file and an object path.

Is there a plan to provide the old .lazyarray() interface as it was available in uproot3? Or does the current design make it hard to provide a lazyarray-interface for already opened files?

Btw. I noticed that the .array() method is very fast in many of our files which require custom interpretations (which is really nice!) but still, the memory requirement is sometimes huge (due to large dtypes) compared to what a user usually extracts from our branches and the Python GC is going crazy.

Also we usually deal with large amounts of branches and the overhead of the file opening might not be negligible.

Porting code based on uproot3 to uproot4 - interpretations with custom-dtypes

I started porting our I/O library based on uproot3 to uproot4 and although I know that the docs are not done yet, I am also hoping to solve some performance issues which are related to dtypes falling back to object-type, so I thought I give it a try.

One of the current issues is that I have not figured out yet how to pass custom interpretations to .array(). This is the uproot3 code:

import uproot

f = uproot.open("km3net_online.root")

dtype = [("dom_id", "i4"), ("dq_status", "u4"), ("hrv", "u4"), ("fifo", "u4"), ("status3", "u4"), ("status4", "u4")] + [(f"ch{c}", "u1") for c in range(31)]
path = "KM3NET_SUMMARYSLICE/KM3NET_SUMMARYSLICE/vector<KM3NETDAQ::JDAQSummaryFrame>"

data = f[path].array(uproot.asjagged(uproot.astable(uproot.asdtype(dtype)), skipbytes=10))                                                              

data.dtype  # dtype('O')

data[0]   # Out[8]: <Table [<Row 0> <Row 1> <Row 2> ... <Row 61> <Row 62> <Row 63>] at 0x7ff5eb6586d0>

data[0].dtype  # dtype([('dom_id', '<i4'), ('dq_status', '<u4'), ('hrv', '<u4'), ('fifo', '<u4'), ('status3', '<u4'), ('status4', '<u4'), ('ch0', 'u1'), ('ch1', 'u1'), ('ch2', 'u1'), ('ch3', 'u1'), ('ch4', 'u1'), ('ch5', 'u1'), ('ch6', 'u1'), ('ch7', 'u1'), ('ch8', 'u1'), ('ch9', 'u1'), ('ch10', 'u1'), ('ch11', 'u1'), ('ch12', 'u1'), ('ch13', 'u1'), ('ch14', 'u1'), ('ch15', 'u1'), ('ch16', 'u1'), ('ch17', 'u1'), ('ch18', 'u1'), ('ch19', 'u1'), ('ch20', 'u1'), ('ch21', 'u1'), ('ch22', 'u1'), ('ch23', 'u1'), ('ch24', 'u1'), ('ch25', 'u1'), ('ch26', 'u1'), ('ch27', 'u1'), ('ch28', 'u1'), ('ch29', 'u1'), ('ch30', 'u1')])

data[0].dom_id

# array([806451572, 806455814, 806465101, 806483369, 806487219, 806487226,
#       806487231, 808432835, 808435278, 808447180, 808447186, 808451904,
#       808451907, 808469129, 808472260, 808472265, 808488895, 808488990,
#    ...
#       808981864, 808982018, 808982041, 808982077, 808982547, 808984711,
#       808996773, 808997793, 809006037, 809007627, 809503416, 809521500,
#       809524432, 809526097, 809544058, 809544061], dtype=int32)

In uproot4, this is what I came up with

# similar lines as above until the branch interpretation

interp = uproot4.interpretation.jagged.AsJagged(uproot4.interpretation.numerical.AsDtype(dtype), header_bytes=10)
# AsJagged(AsDtype("[('dom_id', '<i4'), ('dq_status', '<u4'), ('hrv', '<u4'), ('fifo', '<u4'), ('status3', '<u4'), ('status4', '<u4'), ('ch0', 'u1'), ('ch1', 'u1'), ('ch2', 'u1'), ('ch3', 'u1'), ('ch4', 'u1'), ('ch5', 'u1'), ('ch6', 'u1'), ('ch7', 'u1'), ('ch8', 'u1'), ('ch9', 'u1'), ('ch10', 'u1'), ('ch11', 'u1'), ('ch12', 'u1'), ('ch13', 'u1'), ('ch14', 'u1'), ('ch15', 'u1'), ('ch16', 'u1'), ('ch17', 'u1'), ('ch18', 'u1'), ('ch19', 'u1'), ('ch20', 'u1'), ('ch21', 'u1'), ('ch22', 'u1'), ('ch23', 'u1'), ('ch24', 'u1'), ('ch25', 'u1'), ('ch26', 'u1'), ('ch27', 'u1'), ('ch28', 'u1'), ('ch29', 'u1'), ('ch30', 'u1')]"), header_bytes=10)

f[path].array(interp)  # <Array [[{dom_id: 1954091312, ... ch30: 48}]] type='3 * var * {"dom_id": int32, ...'>

The first problem is that the endianness is not picked up correctly, the dom_id for example should be little endian:

import numpy as np

np.array(f[path].array(interp)[0].dom_id).byteswap()[:4]
# array([806451572, 806455814, 806465101, 806483369], dtype=int32)

The second problem is that I am not sure if it is now a "type-safe" (high-performance) readout. In case of uproot3, the dtype='O' really had a horrible performance, for obvious reasons. Now I think that with awkward1 I should be able to correctly map the structure, or am I wrong? At least what I got here is a awkward1.highlevel.Array.

However, checking the dtype is giving me an error and it seems that it is an object type:

>>> f[path].array(interp).dtype                                                                                                    
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-52-98270c2c8c48> in <module>
----> 1 f[path].array(interp).dtype

~/.virtualenvs/km3net/lib/python3.8/site-packages/awkward1/highlevel.py in __getattr__(self, where)
   1015         """
   1016         if where in dir(type(self)):
-> 1017             return super(Array, self).__getattribute__(where)
   1018         else:
   1019             if where in self._layout.keys():

~/.virtualenvs/km3net/lib/python3.8/site-packages/awkward1/_connect/_pandas.py in dtype(self)
    180 
    181         else:
--> 182             return np.dtype(np.object)
    183 
    184     @property

AttributeError: 'NumpyMetadata' object has no attribute 'object'

Thanks in advance!

Here is the file:
km3net_online.root.zip

Superfluous print statement in iterate

#101 introduced a print statement in iterate. Affects uproot4 versions >0.0.22. I believe this was added for debugging purposes, but is not meant to be in the package.

Reproduce:

download https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ.root
run the script:

import uproot4

for _ in uproot4.iterate("uproot-HZZ.root"): pass

output:

uproot-HZZ.root 0

due to https://github.com/scikit-hep/uproot4/blob/1accbe3a70eda72cad954187291ddaec5f36e22d/uproot4/behaviors/TBranch.py#L213

Allow access of a an open file-like object (root file within a `.tar.gz`)

As far as I can tell, you can only open a root file via a path name?

I am working with a massive amount of files that are individually in .tar.gzs. I would like to access the root files within without explicitly extracting them first.

Right now I am doing this dance:

with tarfile.open('/path/to/DATA.tar.gz', 'r') as tf:
    names = [x for x in tf.getnames() if x.endswith(".root")]
    for name in names:
        with tempfile.NamedTemporaryFile() as tmpfile:
            tinfo = tf.getmember(name)
            rootfile = tf.extractfile(tinfo)
            tmpfile.write(rootfile.read(-1))
            tmpfile.seek(0)
            branch = uproot4.lazy(f"{tmpfile.name}:path/to/branch")
            result = process(branch)

I would love if I could do something like (probably not fully correct but you get the idea):

with tarfile.open('/path/to/DATA.tar.gz', 'r') as tf:
    names = [x for x in tf.getnames() if x.endswith(".root")]
    for name in names:
        tinfo = tf.getmember(name)
        rootfile = tf.extractfile(tinfo)
        uprootfile = uproot4.open(rootfile)
        branch = uprootfile.get("path/to/branch")
        result = process(branch)

Indices for array branches

I am working with a tree that contains branches which are vectors of floats:

>>> tree["weight_bTagSF_eigenvars_B_up"].show(interpretation_width=40)
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------------
weight_bTagSF_eigenv | std::vector<float>   | AsJagged(AsDtype('>f4'), header_bytes=10

In ROOT, I use [0] to obtain the first entry of the vector for every event:

root [8] nominal->Scan("weight_bTagSF_eigenvars_B_up[0]")
************************
*    Row   * weight_bT *
************************
*        0 * 1.0968146 *
*        1 * 0.8027607 *
*        2 * 1.0042166 *
*        3 * 1.0752149 *
*        4 * 0.9913491 *
*        5 * 0.7981501 *

To obtain the same values in uproot4 (0.0.16), I need to use [:,0]:

>>> weight = "weight_bTagSF_eigenvars_B_up[:,0]"
>>> tree.arrays(weight)[weight]
<Array [1.1, 0.803, 1, ... 0.795, 0.808, 0.791] type='14992 * float32'>

Is this change in behavior on purpose? I'd like to turn user-provided strings into weights extracted from the file. In ROOT, such a string looks like weight_bTagSF_eigenvars_B_up[0]*some_other_weight. To handle this different convention, I would either translate the strings into what uproot4 expects, or ask users to adopt this way of indexing. I was wondering whether there may exist some "ROOT-like" interpretation I can switch on?

I noticed that the above does not work with library=np:

>>> weight = "weight_bTagSF_eigenvars_B_up[:,0]"
>>> tree.arrays(weight, library="np")[weight]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 967, in arrays
    output = compute.compute_expressions(
  File "[...]/lib/python3.8/site-packages/uproot4/compute/python.py", line 347, in compute_expressions
    output[expression] = _expression_to_function(
  File "<dynamic>", line 1, in <lambda>
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
>>> tree.arrays("weight_bTagSF_eigenvars_B_up", library="np")["weight_bTagSF_eigenvars_B_up"]
array([array([1.0968146, 1.0773798, 1.0343741, 1.1100905, 1.0188507, 1.0471411],
      dtype=float32),
       array([0.8027608, 1.0142494, 0.9376403, 0.980236 , 0.9652771, 0.95773  ],
      dtype=float32),
       array([1.0042167 , 0.93642455, 1.0683353 , 0.9941829 , 0.9865473 ,
       1.0125985 ], dtype=float32),
       ...,
       array([0.7953308, 0.9993567, 0.9192695, 0.9617004, 0.9411704, 0.9501564],
      dtype=float32),
       array([0.80789626, 0.99212795, 0.9876958 , 1.0175581 , 1.0071241 ,
       0.99970514], dtype=float32),
       array([0.79134053, 0.99477154, 0.93990743, 0.95846075, 0.9771205 ,
       0.95926076], dtype=float32)], dtype=object)

I don't know whether this is a limitation of numpy or a bug. It is possible (but likely very inefficient) to extract the same information like this:

>>> np.asarray(tree.arrays("weight_bTagSF_eigenvars_B_up", library="np")["weight_bTagSF_eigenvars_B_up"].tolist())[:,0]
array([1.0968146 , 0.8027608 , 1.0042167 , ..., 0.7953308 , 0.80789626,
       0.79134053], dtype=float32)

KeyInFileError to show near matches?

KeyInFileError currently shows a truncating "Known keys" list. This cool regarding the general format of what to ask for instead, but since we quite commonly have files with a large number of branches one still needs to go back to some kind of [key for key in f.keys() if 'bla' in key] to find the correct name.

It would be really cool if this would instead show closest matches first. I guess this could get computationally pricey, i.e. slow, so maybe it would only get computed if the number of keys was not too crazy?

Need behaviors (and possibly Models) for boost::histogram objects in ROOT files

Compared to uproot3 I can properly deserialize a boost::histogram object that has been stored in a ROOT directory or branch.

However I end up with a totally opaque object.

I look at how you proceed for the ROOT histograms (using member) and wrapping it in a Python class.

However my objects (the boost histograms) have no member.

Could you maybe point me in the right direction so that I could get these? I would be happy to contribute the "behavior" for these then.

Incorrect units in memory size

In https://github.com/scikit-hep/uproot4/blob/7a03d3bc44ca76315c5f45de678b79ff8afc91ee/uproot4/_util.py#L242
the unit kB, MB are used for what appear to be kiB, MiB, etc.
See https://en.wikipedia.org/wiki/Binary_prefix

Deploy a wheel, not just an sdist tarball.

For these reasons.

scikit-hep/scikit-hep-testdata#33 (comment)

Streamer-based strings and STL strings.

Maybe this can be done without actually looking at the streamers. It might depend on whether the data are inside an object or at top-level: I think that objects always need the equivalent of Uproot3's TTree.attachstreamers.

Error reading single and multi index branchs with pandas

Hi folks, I'm working with a TTree with two kinds of branches, a struct of doubles and single integers. And when a I try to load its data with pandas strange things happens.

The trees is like so.

>>> tree.show()

name                 | typename             | interpretation                    
---------------------+----------------------+-----------------------------------
ionization           | struct {double x; do | AsDtype("[('x', '>f8'), ('y', '>f8
electron             | struct {double x; do | AsDtype("[('x', '>f8'), ('y', '>f8
ion                  | struct {double x; do | AsDtype("[('x', '>f8'), ('y', '>f8
estatus              | int32_t              | AsDtype('>i4')                    
istatus              | int32_t              | AsDtype('>i4')

When I load a single branch every thing works fine. I get a MultiIndex DataFrame for the struct and Single Index for the integer branch.

>>> tree.arrays('ion', library='pd')

>>> tree.arrays('istatus', library='pd')

But when I try to load those braches at the same time the index gets messed up and I end up with an empty DataFrame with a bunch of Nan subindexes.

>>> tree.arrays(['ion', 'istatus'], library='pd')

I don't know if this is a bug or I'm doing an illegal operation reading those two kinds of data at the same time. But when a do the same thing with library='np' every thing works fine.

XRootD source doesn't fall back to non-vector read

In this CMS open data example, I cannot load a branch's array due to a NotImplementedError

>>> import uproot4
>>> file = uproot4.open("root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/ZZTo4mu.root")
>>> file["Events/Muon_eta"].array(entry_stop=244800)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ncsmith/src/uproot4/uproot4/behaviors/TBranch.py", line 2017, in array
    _ranges_or_baskets_to_arrays(
  File "/Users/ncsmith/src/uproot4/uproot4/behaviors/TBranch.py", line 3180, in _ranges_or_baskets_to_arrays
    hasbranches._file.source.chunks(ranges, notifications=notifications)
  File "/Users/ncsmith/src/uproot4/uproot4/source/xrootd.py", line 263, in chunks
    raise NotImplementedError(
NotImplementedError: TODO: Probably need to fall back to a non-vector read

Can this be implemented?

Calling lazy() on a filename that doesn't exist produces confusing error

This is what happens if you call lazy() on a path that doesn't exist:

>>> import uproot4
>>> uproot4.lazy('nonexistent_file.root')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/miniconda3/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 585, in lazy
    raise ValueError(
ValueError: allow_missing=True and no TTrees found in

Obviously this doesn't really indicate the actual problem--that the file doesn't exist. It also strangely claims that allow_missing=True, which is not correct since the default is False. It says the same thing even if you explicitly set allow_missing=False:

>>> uproot4.lazy('nonexistent_file.root', allow_missing=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/miniconda3/lib/python3.8/site-packages/uproot4/behaviors/TBranch.py", line 585, in lazy
    raise ValueError(
ValueError: allow_missing=True and no TTrees found in

Unable to parse higher-dimensional (basic) arrays in leaves

An issue in UnROOT was raised (JuliaHEP/UnROOT.jl#9) since the automatic higher dimensional data or in general is still not possible or experimental, so I did what I usually do: fire up uproot and uproo4 😉 however there seems to be a problem to parse the dimensionality of those arrays. I am quite sure this worked in other files in the past, so I am not sure what's happening here but you see that the data under structs is correctly parsed, however the same data saved as multi-dimensional arrays in as TLeafD does not parse correctly.

I have not worked much with uproot4 yet, so I am sorry to just dump the error here, but I guess you already know what's going wrong:

In [21]: import uproot4

In [22]: f = uproot4.open("/Users/tamasgal/Downloads/test_array.root")

In [23]: f["arrays"].show()
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------
nInt                 | int32_t              | AsDtype('>i4')
6dVec                | double               | AsDtype('>f8')
2x3Mat               | double               | AsDtype('>f8')

In [24]: f["structs"].show()
name                 | typename             | interpretation
---------------------+----------------------+-----------------------------------
nInt                 | int32_t              | AsDtype('>i4')
2x3mat               | double[6]            | AsDtype("('>f8', (6,))")

In [25]: f["arrays/6dVec"].array()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-dbf33b2ce1b8> in <module>
----> 1 f["arrays/6dVec"].array()

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   1620             interpretation_executor,
   1621             library,
-> 1622             arrays,
   1623         )
   1624

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
    516
    517         elif isinstance(obj, tuple) and len(obj) == 3:
--> 518             uproot4.source.futures.delayed_raise(*obj)
    519
    520         else:

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
     35         exec("raise exception_class, exception_value, traceback")
     36     else:
---> 37         raise exception_value.with_traceback(traceback)
     38
     39

~/.virtualenvs/km3net/lib/python3.7/site-packages/uproot4/behaviors/TBranch.py in basket_to_array(basket)
    482                         len(basket_arrays[basket.basket_num]),
    483                         interpretation,
--> 484                         branch.file.file_path,
    485                     )
    486                 )

ValueError: basket 0 in tree/branch /arrays;1:6dVec has the wrong number of entries (expected 1, obtained 6) when interpreted as AsDtype('>f8')
    in file /Users/tamasgal/Downloads/test_array.root