astronomy-commons / hipscat-import
HiPSCat import - generate HiPSCat-partitioned catalogs
Home Page: https://hipscat-import.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
e.g. for a catalog with a total file size of 10T of bz2-compressed CSV: the size of the intermediate files, and the size of the final catalog.
Possibly also include size estimates in the estimate_pixel_threshold notebook
Only catalog creation has command line parsing set up. It doesn't really seem that useful, and is just extra code to maintain and keep from getting out of sync.
Create documentation as part of continuous integration and automatically push docs to readthedocs or GitHub Pages.
Re-partition an existing catalog, using new partitioning arguments (or filters)
The tqdm progress bar stage names are inconsistent:
Planning : 100%|██████████| 5/5 [00:00<00:00, 14364.05it/s]
mapping: 100%|██████████| 6/6 [00:04<00:00, 1.24it/s]
Binning : 100%|██████████| 2/2 [00:00<00:00, 2.19it/s]
splitting: 83%|████████▎ | 5/6 [00:09<00:01, 1.64s/it]
I think this can be fixed with a line or two in /pipeline_resume_plan.py:wait_for_futures
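For illustration, a minimal sketch of what that fix might look like, assuming a wait_for_futures(futures, stage_name) helper (the real signature may differ): capitalize each stage name and pad it to a fixed width so the bars render consistently.

```python
from dask.distributed import as_completed
from tqdm import tqdm

def wait_for_futures(futures, stage_name):
    """Wait on dask futures with a consistently formatted progress bar."""
    # Capitalize and pad to a fixed width so "Planning", "Mapping",
    # "Binning", and "Splitting" all render with the same case and spacing.
    description = f"{stage_name.capitalize():<10}"
    for future in tqdm(as_completed(futures), desc=description, total=len(futures)):
        future.result()
```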
This might look like a report of what's inside the catalog, to be verified against the catalog owner's expectation of the catalog contents.
Useful things to consider:
When importing a catalog, we store one intermediate healpix histogram per input file. This uses a very naive numpy binary array storage, and the files can grow large when importing large catalogs at high healpix order.
One alternative is the healsparse library (though this introduces an additional dependency on healpy).
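For scale: a dense int64 histogram at order 10 has 12 × 4^10 ≈ 12.6 million entries (~100 MB per input file). A lightweight alternative, sketched below with no new dependency, is to persist only the nonzero (pixel, count) pairs; healsparse would be the more principled version of the same idea.

```python
import numpy as np

order = 10
n_pixels = 12 * 4**order  # ~12.6 million pixels at order 10

histogram = np.zeros(n_pixels, dtype=np.int64)
# ... accumulate counts for one input file ...

# Instead of np.save(path, histogram) (the full dense array), persist only
# the nonzero entries -- tiny when one input file touches few pixels.
nonzero = np.nonzero(histogram)[0]
np.savez_compressed("partial_histogram.npz",
                    pixels=nonzero, counts=histogram[nonzero])

# Restoring the dense array for aggregation:
with np.load("partial_histogram.npz") as data:
    restored = np.zeros(n_pixels, dtype=np.int64)
    restored[data["pixels"]] = data["counts"]
```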
ZTF data as imported has list data for individual observations. This is not efficient for reads or timeseries analysis - flatten out to a row per observation.
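A sketch of the flattening, using pandas' explode with hypothetical column names (mjd, mag) standing in for the real ZTF fields:

```python
import pandas as pd

# One row per object, with a list per observation field (hypothetical names).
df = pd.DataFrame({
    "objectid": [1, 2],
    "mjd": [[58000.1, 58001.2], [58000.3]],
    "mag": [[19.2, 19.3], [20.1]],
})

# One row per observation; lists must be equal length within each row.
# (Exploding multiple columns at once requires pandas >= 1.3.)
flat = df.explode(["mjd", "mag"], ignore_index=True)
```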
Configurable logging level via command line argument.
MAST folks are interested in strategies for making sure that we retain rounding/precision in our import process, particularly for floating point values.
One potential issue is that their data has been exported from a SQL database (where they presumably have the control they want over rounding/precision) into a CSV, and converting that string representation back into a float can be lossy.
The catalog providers may have some format or units they can provide to help us with this, but they will vary by catalog. e.g. for AllWISE, we see descriptions like ra %11.7f deg (link) 😵💫
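One cheap guard, sketched here: render the parsed float back through the provider's printf-style format and compare it to the original text, so lossy conversions get flagged at import time. The format string below is the AllWISE ra description from above.

```python
raw = "310.5012345"            # string as it appears in the CSV
parsed = float(raw)            # what the reader will store
rendered = "%11.7f" % parsed   # back through the provider's format

# Strip the width padding before comparing; a mismatch flags a lossy
# string -> float -> string round trip for this column.
assert rendered.strip() == raw
```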
As requested by Troy Raen of IPAC/IRSA.
This is for general infrastructure support for neighbor tables, not the generation of the bounding box stuff.
Also helpful to have something that converts type information into schema, either from a dtype dict or CSV with names and types.
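A sketch of the dtype-dict case, assuming pyarrow as the schema target:

```python
import numpy as np
import pyarrow as pa

# Map column names and numpy dtype strings to a parquet-ready schema.
dtypes = {"id": "int64", "ra": "float64", "dec": "float64"}
schema = pa.schema(
    (name, pa.from_numpy_dtype(np.dtype(dtype)))
    for name, dtype in dtypes.items()
)
```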
For exploration reasons, it would be useful to support pixel_threshold=1. For example, we could estimate the minimum distance between objects from the resulting catalog.
If you have a file with some empty ra/dec values, an error bubbles up from the healpy library.
e.g.
id,ra,dec
701,,
702,310.5,-27.5
Results in an error like:
theta = array([ nan, 2.05076187])
def check_theta_valid(theta):
    """Raises exception if theta is not within 0 and pi"""
    theta = np.asarray(theta)
    if not ((theta >= 0).all() and (theta <= np.pi + 1e-5).all()):
>       raise ValueError("THETA is out of range [0,pi]")
E       ValueError: THETA is out of range [0,pi]

../../.local/lib/python3.10/site-packages/healpy/pixelfunc.py:157: ValueError
I'm torn between some possible options here (in order of preference):
This is pretty low priority for now, as this only came up because I'd misconfigured my reader and it was parsing everything as nan =P
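One of the options might look like this sketch: validate positions before they ever reach healpy, failing with a pointed message instead of the THETA error (column names assumed):

```python
import pandas as pd

def check_radec(frame: pd.DataFrame):
    """Fail fast with a readable message instead of healpy's THETA error."""
    bad = frame["ra"].isna() | frame["dec"].isna()
    if bad.any():
        raise ValueError(
            f"found {bad.sum()} rows with empty ra/dec "
            f"(first bad row index: {frame.index[bad][0]})"
        )
```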
Leave breadcrumbs about where the survey data is coming from. Could be useful if you want to go back to the original data for e.g. looking at object images.
The code should be regularly built and tested to ensure that no new breaking bugs are introduced.
In the macauff cross-match routine, there is additional information on likelihood of non-matches. We currently don't have anywhere to put these in the catalog or association tables, and so ignore the results. These are likely to be more useful as we move to real LSST data, and we should find a nice place for them to live.
Possibly in an extension table.
I'm trying to run the AllWISE example import. After an initial completion of Planning:
Planning : 100%|█████████████████████████████████| 5/5 [00:00<00:00, 615.24it/s]
I get a never-ending string of errors like this:
2023-08-09 12:31:14,359 - distributed.nanny - WARNING - Restarting worker
Planning : 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/stage/irsa-staff-shupe/hipscat/allwise/runimport.py", line 11, in <module>
    args = ImportArguments(
           ^^^^^^^^^^^^^^^^
  File "<string>", line 33, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 78, in __post_init__
    self._check_arguments()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 123, in _check_arguments
    self.resume_plan = ResumePlan(
                       ^^^^^^^^^^^
  File "<string>", line 9, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 51, in __post_init__
    self.gather_plan()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 61, in gather_plan
    raise ValueError(
ValueError: tmp_path (/stage/irsa-staff-shupe/hipscat/allwise/tmp/allwise2/intermediate) contains intermediate files. choose a different directory or use --resume flag
see the one in hipscat
From @troyraen's experience importing AllWISE:
In general, the warnings and errors are pretty opaque. I'm guessing that some types and amounts of dask warnings are safe to ignore, but which ones do I really need to pay attention to? And when it does actually fail, how do I know where to look for the problem? I'm pretty sure that several of my runs encountered fatal errors during earlier steps like "Mapping", but they kept running and didn't actually fail until the final "Finishing" step, so the specific error was just a downstream effect of a much earlier problem. I don't know hipscat or dask well enough to know specifically, but I wonder:
- if there are some dask errors or warnings that hipscat should catch and then terminate the run; and
- if there are 1 or 2 basic checks that could be done at the end of each step that must pass, else the run terminates.
In either case, if hipscat would raise custom error messages that gave some info about where the user might look for the problem and/or solution, that would be really helpful.
We should implement a basic check at the end of each stage to make sure it completed successfully.
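As a sketch of what such a check could look like (the file pattern and signature are assumptions, not the current code):

```python
from pathlib import Path

def check_stage_output(intermediate_dir, expected_count, stage_name):
    """Hypothetical end-of-stage check: one intermediate file per input file."""
    found = len(list(Path(intermediate_dir).glob("*.npz")))
    if found != expected_count:
        raise RuntimeError(
            f"{stage_name} wrote {found} of {expected_count} expected files "
            f"in {intermediate_dir}; check the dask worker logs for this stage."
        )
```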
If you're resuming from a previous interrupted import run, you have to specify --resume to pick up from where you left off.
We could create an additional flag like --clean to remove any intermediate files from previous runs and start fresh.
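A sketch of what the hypothetical --clean behavior could do before the run starts:

```python
import shutil
from pathlib import Path

def clean_intermediate(tmp_path):
    """Remove intermediate files from a previous run (the opposite of --resume)."""
    intermediate = Path(tmp_path) / "intermediate"
    if intermediate.exists():
        shutil.rmtree(intermediate)
```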
Could (and should) hipscat-import create the output directory? From my limited experience, I'd like it to do that.
Two things here:
1 - IPAC is currently only generating their catalogs at a fixed order. It would be nice to do some apples-to-apples comparison with their formats.
2 - For very small/sparse catalogs, we might only have pixels at orders 0 and 1. Cross matching these to big ol' surveys is going to be very inefficient, and we would be better served with only a handful of very small pixels. One way to achieve this would be to force all pixels to be at order 4 or 5. In these catalogs, the file size distribution is pretty wonky anyway.
This has some more requirements work to do to figure out if we want to include these in 1.0 (or at all)
It's pretty friendly, and is allowed under LSST DM guidelines: https://developer.lsst.io/python/cli.html
Support for writing source metadata.
Support for writing association metadata.
May also include some pipeline work for generating an association table, given a primary and join table.
Currently, this is split across prose documentation and API documentation. It's OK to repeat yourself if it makes the process easier for users.
We can distinguish the arguments that are more "runtime" from those that are specific to catalog importing; the current breakdown in /catalogs/arguments.rst is useful for the additional descriptions.
The unit test tests/hipscat_import/association/test_association_map_reduce.py::test_map_association is timing out and causing the CI to fail. This didn't seem to start until after #107 was merged, although I can't think of a reason why the two would be related... since it was my change that triggered this failure I'll investigate o7
Based on fully written parquet files.
e.g. previous _hipscat_id values, other coordinate systems
Migrating this over from astronomy-commons/lsd2#11, since we're working on importing in this repo now.
Task:
Create a cache for each pixel of objects that are nearby in neighboring pixels.
Read the cache when computing cross-match and consider the neighboring objects as potential matches.
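As a sketch of the first step, the candidate pixels contributing to a pixel's margin cache are the pixel itself plus its same-order neighbors (healpy, nested scheme); filtering objects by distance to the pixel boundary would follow:

```python
import healpy as hp

def margin_source_pixels(order, pixel):
    """Pixels whose objects could fall within the margin of `pixel`:
    the pixel itself plus its immediate neighbors at the same order."""
    nside = 2**order
    neighbors = hp.get_all_neighbours(nside, pixel, nest=True)
    return {pixel} | {int(p) for p in neighbors if p >= 0}
```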
Right now the prototype for margin caching is working on the LSD2 repo. Now the main task is working it into the hipscat-import package.
Now that the negative tree generation code has been merged into hipscat main, we can go about using the tree in the actual margin cache generation.
Steps two and three in the detailed design section:
Input catalogs could have different formats and coordinate systems for object positions:
First two points are related to #46.
This issue is about adding conversion of input catalogs into ICRS; see astronomy-commons/hipscat#152 for the hipscat support of different coordinate systems.
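For the conversion itself, astropy makes this straightforward; a sketch for galactic input positions:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Convert galactic input positions to ICRS before importing.
coords = SkyCoord(l=[120.0, 240.0] * u.deg, b=[-30.0, 15.0] * u.deg,
                  frame="galactic")
icrs = coords.icrs
ra, dec = icrs.ra.deg, icrs.dec.deg
```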
Copied from design doc
Create a dedicated map-reduce pipeline within the hipscat-import pipeline structure.
Content currently looks like:
order,pixel,num_objects
3.0,0.0,792734.0
3.0,1.0,706177.0
But all those are integers. The writer for partition info should coerce inputs to integer.
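A sketch of that coercion at write time, assuming the writer goes through pandas:

```python
import pandas as pd

partition_info = pd.DataFrame(
    {"order": [3.0, 3.0], "pixel": [0.0, 1.0], "num_objects": [792734.0, 706177.0]}
)
# Coerce every column to integer before writing partition info.
partition_info = partition_info.astype(int)
partition_info.to_csv("partition_info.csv", index=False)
```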
due to some of the changes in the hipscat package for the new margin cache code, the API has been changed and might break the CI once hipscat has been updated. to ensure a smooth transition, i propose we do the following:
1. pin hipscat-import's dependency on hipscat to version 0.1.0
2. release hipscat==0.1.1
3. make the hipscat-import compatibility changes + remove the pin on the hipscat dependency
4. release hipscat-import==0.1.1
(the last step might not be dependent on the margin cache changes, but can also be done whenever :])
numpy is still listed as an optional dependency, despite being used for a lot of core functionality... I assume no one's noticed this because everyone installs with .[dev] or has numpy installed locally already 😭
Hyphen is more comfortable for python folks, while C++-aphiles like underscore. Don't make people choose, since we can just support both with argparse.
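argparse lets both spellings alias to the same destination; a sketch, with pixel_threshold as the example option:

```python
import argparse

parser = argparse.ArgumentParser()
# Register both spellings as option strings for one destination.
parser.add_argument("--pixel-threshold", "--pixel_threshold",
                    dest="pixel_threshold", type=int, default=1_000_000)

args = parser.parse_args(["--pixel_threshold", "5000"])
assert args.pixel_threshold == 5000
```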
What are some best practices when setting up your dask client? What's the minimum resources you should allocate, given some input size and runtime expectation? How do you spin up a dask cluster on more than one node?
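The docs could anchor this discussion with a minimal single-node example; the numbers below are illustrative, not recommendations:

```python
from dask.distributed import Client

# A conservative single-node setup: a few single-threaded workers (the
# import stages are mostly CPU-bound) with an explicit per-worker memory
# limit sized to comfortably hold the largest input chunk.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="8GB")
```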
The field we've been calling HOP (higher order pixel) or HOP+.
See https://docs.google.com/document/d/1v0U_5YB1WcpvnYB9Pg2TJ-0CWO2zJqAcGl8hoK1n1dc/edit