hipscat-import

HiPSCat import: a utility for ingesting large survey data into the HiPSCat structure.

Check out our ReadTheDocs site for more information on partitioning, installation, and contributing.

Contributing

See the contribution guide for complete installation instructions and contribution best practices.

Acknowledgements

This project is supported by Schmidt Sciences.

This project is based upon work supported by the National Science Foundation under Grant No. AST-2003196.

This project acknowledges support from the DIRAC Institute in the Department of Astronomy at the University of Washington. The DIRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, and the Washington Research Foundation.

hipscat-import's People

Contributors

camposandro, delucchi-cmu, dependabot[bot], hombit, jeremykubica, maxwest-uw, mi-dai, schwarzam, smcguire-cmu, troyraen

Forkers

mjuric, schwarzam

hipscat-import's Issues

Create docker image for import tool

  • We may need to do this to run the importer on the RSP.
  • Docker images (well, Singularity images) are recommended for running jobs on PSC.
  • Looking into this for the COIaC engagement.

Association Tables - basic support

Support for writing association metadata.

May also include some pipeline work for generating an association table, given a primary and join table.

Validate imported data for scientific use

MAST folks are interested in strategies for making sure that we retain rounding/precision in our import process, particularly for floating point values.

One potential issue is that we're reading their data that has been exported from a SQL database (where they're presumably happy with the level of control on their rounding/precision) into a CSV, and converting this string representation again into a float can be lossy.

The catalog providers may have some format or units they can provide to help us here, but these will vary by catalog; e.g. for AllWISE, we see column descriptions like ra %11.7f deg (link).
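
As a quick illustration of the round-trip concern, the sketch below formats a value at a provider's published precision and checks that parsing the CSV string back reproduces it. The %11.7f format is borrowed from the AllWISE example above; the helper itself is hypothetical.

fmt = "%11.7f"  # AllWISE's published format for ra, in degrees

def survives_round_trip(value):
    text = fmt % value                  # the string the CSV export would hold
    return (fmt % float(text)) == text  # does parse-then-format reproduce it?

print(survives_round_trip(310.1234567))  # True: 7 decimals round-trip in a double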

Add --clean option to import when resuming

If you're resuming from a previous interrupted import run, you have to specify --resume to pick up from where you left off.

We could create an additional flag like --clean to remove any intermediate files from previous runs and start fresh.
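
A minimal sketch of what that flag could do; the flag is hypothetical, and the <tmp_path>/<catalog_name>/intermediate layout is assumed from the resume error messages.

import shutil
from pathlib import Path

# Hypothetical --clean behavior: remove intermediate files from a
# previous interrupted run so the import starts fresh.
def clean_intermediate(tmp_path, catalog_name):
    intermediate = Path(tmp_path) / catalog_name / "intermediate"
    if intermediate.exists():
        shutil.rmtree(intermediate)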

Add option to generate a catalog at a fixed healpix order

Two things here:

1. IPAC is currently only generating their catalogs at a fixed order. It would be nice to do some apples-to-apples comparison with their formats.
2. For very small/sparse catalogs, we might only have pixels at orders 0 and 1. Cross-matching these to big ol' surveys is going to be very inefficient, and we would be better served with only a handful of very small pixels. One way to achieve this would be to force all pixels to be at order 4 or 5 (see the sketch below). In these catalogs, the file size distribution is pretty wonky anyway.
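
A sketch of the fixed-order assignment itself (order 5 is illustrative; the importer plumbing is not shown):

import healpy as hp
import numpy as np

# Instead of splitting adaptively by pixel_threshold, map every row
# directly to its HEALPix pixel at one fixed order (NESTED scheme).
ORDER = 5
NSIDE = hp.order2nside(ORDER)

def fixed_order_pixels(ra_deg, dec_deg):
    return hp.ang2pix(NSIDE, ra_deg, dec_deg, nest=True, lonlat=True)

print(fixed_order_pixels(np.array([310.5]), np.array([-27.5])))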

Negative space changes

Now that the negative tree generation code has been merged into hipscat main, we can go about using the tree in the actual margin cache generation.

See steps two and three in the detailed design section.

Add provenance information in metadata

Leave breadcrumbs about where the survey data is coming from. Could be useful if you want to go back to the original data for e.g. looking at object images.

`test_map_association` test timing out in our CI

The unit test

tests/hipscat_import/association/test_association_map_reduce.py::test_map_association

is timing out and causing the CI to fail. It didn't seem to start until after #107 was merged, although I can't think of a reason why the two would be related... Since it was my change that triggered this failure, I'll investigate o7

Give guidance on expected final catalog size

e.g. for a catalog whose total input is 10 TB of bz2-compressed CSV: the expected size of intermediate files and the size of the final catalog.

Possibly also include size estimates in the estimate_pixel_threshold notebook.

Enhance support of input positions

Input catalogs could have different formats and coordinate systems for object positions:

  • Non-ICRS frames
  • Galactic or ecliptic frames
  • #338

The first two points are related to #46.

This issue is about adding conversion of input catalogs into ICRS; see astronomy-commons/hipscat#152 for hipscat support of different coordinate systems.
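
A sketch of the conversion step with astropy; the column handling here is illustrative, not the importer's actual reader API:

import astropy.units as u
from astropy.coordinates import SkyCoord

# Convert galactic longitude/latitude to ICRS ra/dec, in degrees.
def galactic_to_icrs(l_deg, b_deg):
    icrs = SkyCoord(l=l_deg * u.deg, b=b_deg * u.deg, frame="galactic").icrs
    return icrs.ra.deg, icrs.dec.deg

ra, dec = galactic_to_icrs(121.17, -21.57)  # roughly M31's position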

Flatten ZTF data

ZTF data as imported has list data for individual observations. This is not efficient for reads or time-series analysis; flatten it out to a row per observation.
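
A minimal sketch of the flattening with pandas; the "mjd" and "mag" columns are illustrative stand-ins for ZTF's per-observation arrays:

import pandas as pd

# Each input row is one object with list-valued observation columns;
# explode to one row per observation.
df = pd.DataFrame({
    "object_id": [1, 2],
    "mjd": [[58000.1, 58001.2], [58000.3]],
    "mag": [[18.2, 18.3], [19.1]],
})
flat = df.explode(["mjd", "mag"], ignore_index=True)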

Add catalog verification pipeline

This might look like a report of what's inside the catalog, to be verified against the catalog owner's expectation of the catalog contents.

Useful things to consider (a sketch of two of these checks follows the list):

  • total number of sources
  • column-level schema (names and types)
  • column-level units
  • values "close" to original values (identical for integers, "close enough" for floating points)
  • distribution of radec among points
  • allow for comparisons against some golden file(s)
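
A sketch of the row-count and schema checks, assuming the catalog is a directory of parquet files and that a "golden" schema and row count are available for comparison:

import pyarrow.parquet as pq

# Compare a catalog's schema and total row count against expectations.
def verify_catalog(catalog_path, expected_rows, expected_schema):
    problems = []
    dataset = pq.ParquetDataset(catalog_path)
    if not dataset.schema.equals(expected_schema):
        problems.append("column-level schema does not match expectation")
    total = sum(fragment.count_rows() for fragment in dataset.fragments)
    if total != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {total}")
    return problems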

Spruce up progress bar stage names

The tqdm progress bar stage names are inconsistent:

Planning : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14364.05it/s]
mapping: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00,  1.24it/s]
Binning  : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.19it/s]
splitting:  83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                       | 5/6 [00:09<00:01,  1.64s/it]
  • should pad the stage name to the same length for every stage
  • should use consistent capitalization

I think this can be fixed with a line or two in `pipeline_resume_plan.py:wait_for_futures`.
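
A sketch of that normalization (the width and the exact hook in wait_for_futures are assumptions):

# Capitalize and pad stage names so the tqdm bars line up.
def format_stage_name(stage_name, width=10):
    return stage_name.capitalize().ljust(width)

# e.g. tqdm(futures, desc=format_stage_name("mapping"))  ->  "Mapping   "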

Import non-match results

In the macauff cross-match routine, there is additional information on likelihood of non-matches. We currently don't have anywhere to put these in the catalog or association tables, and so ignore the results. These are likely to be more useful as we move to real LSST data, and we should find a nice place for them to live.

Possibly in an extension table.

Support pixel_threshold=1

For exploration purposes, it would be useful to support pixel_threshold=1. For example, we could estimate the minimum distance between objects from the resulting catalog.

Simplify documentation for catalog import arguments

Currently, this is split across prose documentation and API documentation. It's OK to repeat ourselves if it makes the process easier for users.

We can distinguish the arguments that are more "runtime" from those that are specific to catalog importing; the current breakdown in /catalogs/arguments.rst is useful for the additional descriptions.

hipscat margin cache API updates

Due to some of the changes in the hipscat package for the new margin cache code, the API has changed and might break the CI once hipscat is updated. To ensure a smooth transition, I propose we do the following:

  • merge the new margin cache code into hipscat
  • pin hipscat-import's dependency on hipscat to version 0.1.0
  • release hipscat==0.1.1
  • merge in the hipscat-import compatibility changes + remove the pin on the hipscat dependency
  • release hipscat-import==0.1.1

(The last step might not be dependent on the margin cache changes, but can also be done whenever :])

Catalog import stages should fail louder

From @troyraen's experience importing AllWISE:

In general, the warnings and errors are pretty opaque. I'm guessing that some types and amounts of dask warnings are safe to ignore, but which ones do I really need to pay attention to? And when it does actually fail, how do I know where to look for the problem? I'm pretty sure that several of my runs encountered fatal errors during earlier steps like "Mapping", but they kept running and didn't actually fail until the final "Finishing" step, so the specific error was just a downstream effect of a much earlier problem. I don't know hipscat or dask well enough to know specifically, but I wonder:

  1. if there are some dask errors or warnings that hipscat should catch and then terminate the run; and
  2. if there are 1 or 2 basic checks that could be done at the end of each step that must pass, else the run terminates.

In either case, if hipscat would raise custom error messages that gave some info about where the user might look for the problem and/or solution, that would be really helpful.

We should implement some basic checks at the end of each stage to make sure they have completed successfully; a sketch follows.
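
A sketch of check (2), assuming each completed task leaves a marker file in the intermediate directory (the file layout and names are illustrative):

from pathlib import Path

# Fail fast if a stage's expected outputs are missing, pointing the user
# at the stage where the real error happened.
def check_stage_done(intermediate_dir, stage, expected_count):
    done = list(Path(intermediate_dir).glob(f"{stage}_*.done"))
    if len(done) != expected_count:
        raise RuntimeError(
            f"{stage} stage appears incomplete: {len(done)} of "
            f"{expected_count} tasks finished. Check the dask worker logs "
            f"from the {stage} stage for the root cause."
        )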

Make `numpy` a required dependency

numpy is still listed as an optional dependency, despite being used for a lot of core functionality... I assume no one's noticed because everyone installs with .[dev] or already has numpy installed locally 😭

Handle nans in radec

If you have a file with some empty radec values, an error bubbles up from the healpy library.

e.g.

id,ra,dec
701,,
702,310.5,-27.5

Results in an error like:

theta = array([       nan, 2.05076187])

    def check_theta_valid(theta):
        """Raises exception if theta is not within 0 and pi"""
        theta = np.asarray(theta)
        if not ((theta >= 0).all() and (theta <= np.pi + 1e-5).all()):
>           raise ValueError("THETA is out of range [0,pi]")
E           ValueError: THETA is out of range [0,pi]

../../.local/lib/python3.10/site-packages/healpy/pixelfunc.py:157: ValueError

I'm torn between some possible options here (in order of preference):

  • do nothing and let the healpy ValueError bubble up
  • catch the nans / out of range values first and give a more useful error message (sketched below)
  • strip the nans / out of range values out first and process the good rows

This is pretty low priority for now, as it only came up because I'd misconfigured my reader and it was parsing everything as nan =P
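
For reference, a sketch of the second option: validate before healpy is called, so the user sees which rows are bad instead of the raw THETA error (names illustrative):

import numpy as np

# Flag missing or out-of-range positions before healpy sees them.
def check_radec(ra, dec):
    bad = ~(np.isfinite(ra) & np.isfinite(dec) & (dec >= -90.0) & (dec <= 90.0))
    if bad.any():
        raise ValueError(
            f"found {bad.sum()} row(s) with missing or out-of-range radec, "
            f"first at index {np.flatnonzero(bad)[0]}"
        )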

Neighbor/Border Margin caching for LSDB crossmatching

Migrating this over from astronomy-commons/lsd2#11 since we're working on importing in this repo now.

Task:

  • Create a cache, for each pixel, of objects that are nearby in neighboring pixels.
  • Read the cache when computing a cross-match, and consider the neighboring objects as potential matches.

Right now the prototype for margin caching is working in the LSD2 repo; the main task now is working it into the hipscat-import package.
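
At the heart of the task is a neighbor lookup; a minimal sketch with healpy (order 3 is illustrative):

import healpy as hp

# For a given NESTED pixel, find the (up to 8) adjacent pixels whose
# border objects belong in this pixel's margin cache.
ORDER = 3
NSIDE = hp.order2nside(ORDER)

def neighboring_pixels(pixel):
    neighbors = hp.get_all_neighbours(NSIDE, pixel, nest=True)
    return [int(p) for p in neighbors if p != -1]  # -1 marks a missing neighbor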

Import pipeline for Macauff association table

Pipeline description

Copied from the design doc.

Create a dedicated map-reduce pipeline within the hipscat-import pipeline structure (a sketch of the shard-aggregation step follows the list).

  • read the XML file, and create a new parquet metadata file from it
  • for each input file / chunk
    • fetch left catalog partition info
    • fetch right catalog partition info
    • map left and right to healpix
    • split to match left partitions
      • include all columns, including likelihood scores
      • write to sharded parquet using generated parquet metadata
  • for each left partition
    • aggregate sharded files
  • write metadata files
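
A sketch of the "aggregate sharded files" step for one left partition, assuming the shards for a pixel are written under a single directory (layout illustrative):

import pyarrow.parquet as pq

# Read all shards for one left partition and write the combined file.
def aggregate_shards(shard_dir, output_path):
    table = pq.ParquetDataset(shard_dir).read()
    pq.write_table(table, output_path)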

Milestones

  • Pipeline boilerplate (see similar PR)
  • Test data files (XML and CSV)
  • Pipeline map reduce

Dask workers restarting results in ValueError on intermediate files

I'm trying to run the AllWISE example import. After an initial completion of Planning:

Planning : 100%|█████████████████████████████████| 5/5 [00:00<00:00, 615.24it/s]

I get a never-ending string of errors like this:

2023-08-09 12:31:14,359 - distributed.nanny - WARNING - Restarting worker
Planning :   0%|                                          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/stage/irsa-staff-shupe/hipscat/allwise/runimport.py", line 11, in <module>
    args = ImportArguments(
           ^^^^^^^^^^^^^^^^
  File "<string>", line 33, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 78, in __post_init__
    self._check_arguments()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 123, in _check_arguments
    self.resume_plan = ResumePlan(
                       ^^^^^^^^^^^
  File "<string>", line 9, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 51, in __post_init__
    self.gather_plan()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 61, in gather_plan
    raise ValueError(
ValueError: tmp_path (/stage/irsa-staff-shupe/hipscat/allwise/tmp/allwise2/intermediate) contains intermediate files. choose a different directory or use --resume flag
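
The traceback shows multiprocessing's spawn start method re-running the driver script (runpy.run_path on runimport.py) inside each restarted worker process, which re-creates ImportArguments against the now-populated tmp_path. Guarding the entry point usually avoids this; a sketch, with the pipeline hand-off elided:

from hipscat_import.catalog.arguments import ImportArguments

def main():
    args = ImportArguments(
        # ... the same arguments as the failing run ...
    )
    # hand args to the import pipeline runner here

if __name__ == "__main__":
    # Only the driver process runs the import; dask's spawned worker
    # processes re-import this module and skip this block.
    main()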

Sphinx documentation

Create documentation as part of continuous integration, and automatically push docs to Read the Docs or GitHub Pages.
