hipscat-import

HiPSCat import: a utility for ingesting large survey data into the HiPSCat structure.

Check out our ReadTheDocs site for more information on partitioning, installation, and contributing.

Contributing

See the contribution guide for complete installation instructions and contribution best practices.

Acknowledgements

This project is supported by Schmidt Sciences.

This project is based upon work supported by the National Science Foundation under Grant No. AST-2003196.

This project acknowledges support from the DIRAC Institute in the Department of Astronomy at the University of Washington. The DIRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, and the Washington Research Foundation.

hipscat-import's People

Contributors

camposandro, delucchi-cmu, dependabot[bot], hombit, jeremykubica, maxwest-uw, mi-dai, schwarzam, smcguire-cmu, troyraen

Forkers

mjuric, schwarzam

hipscat-import's Issues

Create docker image for import tool

  • We may need to do this to run the importer on the RSP.
  • Docker images (well, Singularity images) are recommended for running jobs on PSC.
  • Looking into this for the COIaC engagement.

Association Tables - basic support

Support for writing association metadata.

May also include some pipeline work for generating an association table, given a primary and join table.

Validate imported data for scientific use

MAST folks are interested in strategies for making sure that we retain rounding/precision in our import process, particularly for floating point values.

One potential issue is that we're reading their data that has been exported from a SQL database (where they're presumably happy with the level of control on their rounding/precision) into a CSV, and converting this string representation again into a float can be lossy.

The catalog providers may have some format or units they can provide to help us here, but these will vary by catalog; e.g. for AllWISE, we see column descriptions like ra %11.7f deg (link).
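
As a quick illustration of the round-trip concern, the sketch below formats a value at a provider's published precision and checks that parsing the CSV string back reproduces it. The %11.7f format is borrowed from the AllWISE example above; the helper itself is hypothetical.

fmt = "%11.7f"  # AllWISE's published format for ra, in degrees

def survives_round_trip(value):
    text = fmt % value                  # the string the CSV export would hold
    return (fmt % float(text)) == text  # does parse-then-format reproduce it?

print(survives_round_trip(310.1234567))  # True: 7 decimals round-trip in a double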

Add --clean option to import when resuming

If you're resuming from a previous interrupted import run, you have to specify --resume to pick up from where you left off.

We could create an additional flag like --clean to remove any intermediate files from previous runs and start fresh.
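
A minimal sketch of what that flag could do; the flag is hypothetical, and the <tmp_path>/<catalog_name>/intermediate layout is assumed from the resume error messages.

import shutil
from pathlib import Path

# Hypothetical --clean behavior: remove intermediate files from a
# previous interrupted run so the import starts fresh.
def clean_intermediate(tmp_path, catalog_name):
    intermediate = Path(tmp_path) / catalog_name / "intermediate"
    if intermediate.exists():
        shutil.rmtree(intermediate)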

Add option to generate a catalog at a fixed healpix order

Two things here:

1. IPAC is currently only generating their catalogs at a fixed order. It would be nice to do some apples-to-apples comparison with their formats.
2. For very small/sparse catalogs, we might only have pixels at orders 0 and 1. Cross-matching these to big ol' surveys is going to be very inefficient, and we would be better served with only a handful of very small pixels. One way to achieve this would be to force all pixels to be at order 4 or 5 (see the sketch below). In these catalogs, the file size distribution is pretty wonky anyway.
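
A sketch of the fixed-order assignment itself (order 5 is illustrative; the importer plumbing is not shown):

import healpy as hp
import numpy as np

# Instead of splitting adaptively by pixel_threshold, map every row
# directly to its HEALPix pixel at one fixed order (NESTED scheme).
ORDER = 5
NSIDE = hp.order2nside(ORDER)

def fixed_order_pixels(ra_deg, dec_deg):
    return hp.ang2pix(NSIDE, ra_deg, dec_deg, nest=True, lonlat=True)

print(fixed_order_pixels(np.array([310.5]), np.array([-27.5])))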

Negative space changes

Now that the negative tree generation code has been merged into hipscat main, we can go about using the tree in the actual margin cache generation.

See steps two and three in the detailed design section.

Add provenance information in metadata

Leave breadcrumbs about where the survey data is coming from. Could be useful if you want to go back to the original data for e.g. looking at object images.

`test_map_association` test timing out in our CI

The unit test

tests/hipscat_import/association/test_association_map_reduce.py::test_map_association

is timing out and causing the CI to fail. It didn't seem to start until after #107 was merged, although I can't think of a reason why the two would be related... Since it was my change that triggered this failure, I'll investigate o7

Give guidance on expected final catalog size

e.g. for a catalog whose total input is 10 TB of bz2-compressed CSV: the expected size of intermediate files and the size of the final catalog.

Possibly also include size estimates in the estimate_pixel_threshold notebook.

Enhance support of input positions

Input catalogs could have different formats and coordinate systems for object positions:

  • Non-ICRS frames
  • Galactic or ecliptic frames
  • #338

The first two points are related to #46.

This issue is about adding conversion of input catalogs into ICRS; see astronomy-commons/hipscat#152 for hipscat support of different coordinate systems.
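
A sketch of the conversion step with astropy; the column handling here is illustrative, not the importer's actual reader API:

import astropy.units as u
from astropy.coordinates import SkyCoord

# Convert galactic longitude/latitude to ICRS ra/dec, in degrees.
def galactic_to_icrs(l_deg, b_deg):
    icrs = SkyCoord(l=l_deg * u.deg, b=b_deg * u.deg, frame="galactic").icrs
    return icrs.ra.deg, icrs.dec.deg

ra, dec = galactic_to_icrs(121.17, -21.57)  # roughly M31's position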

Flatten ZTF data

ZTF data as imported has list data for individual observations. This is not efficient for reads or time-series analysis; flatten it out to a row per observation.
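
A minimal sketch of the flattening with pandas; the "mjd" and "mag" columns are illustrative stand-ins for ZTF's per-observation arrays:

import pandas as pd

# Each input row is one object with list-valued observation columns;
# explode to one row per observation.
df = pd.DataFrame({
    "object_id": [1, 2],
    "mjd": [[58000.1, 58001.2], [58000.3]],
    "mag": [[18.2, 18.3], [19.1]],
})
flat = df.explode(["mjd", "mag"], ignore_index=True)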

Add catalog verification pipeline

This might look like a report of what's inside the catalog, to be verified against the catalog owner's expectation of the catalog contents.

Useful things to consider (a sketch of two of these checks follows the list):

  • total number of sources
  • column-level schema (names and types)
  • column-level units
  • values "close" to original values (identical for integers, "close enough" for floating points)
  • distribution of radec among points
  • allow for comparisons against some golden file(s)
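
A sketch of the row-count and schema checks, assuming the catalog is a directory of parquet files and that a "golden" schema and row count are available for comparison:

import pyarrow.parquet as pq

# Compare a catalog's schema and total row count against expectations.
def verify_catalog(catalog_path, expected_rows, expected_schema):
    problems = []
    dataset = pq.ParquetDataset(catalog_path)
    if not dataset.schema.equals(expected_schema):
        problems.append("column-level schema does not match expectation")
    total = sum(fragment.count_rows() for fragment in dataset.fragments)
    if total != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {total}")
    return problems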

Spruce up progress bar stage names

The tqdm progress bar stage names are inconsistent:

Planning : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14364.05it/s]
mapping: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00,  1.24it/s]
Binning  : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.19it/s]
splitting:  83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                       | 5/6 [00:09<00:01,  1.64s/it]
  • should pad the stage name to the same length for every stage
  • should use consistent capitalization

I think this can be fixed with a line or two in `pipeline_resume_plan.py:wait_for_futures`.
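
A sketch of that normalization (the width and the exact hook in wait_for_futures are assumptions):

# Capitalize and pad stage names so the tqdm bars line up.
def format_stage_name(stage_name, width=10):
    return stage_name.capitalize().ljust(width)

# e.g. tqdm(futures, desc=format_stage_name("mapping"))  ->  "Mapping   "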

Import non-match results

In the macauff cross-match routine, there is additional information on likelihood of non-matches. We currently don't have anywhere to put these in the catalog or association tables, and so ignore the results. These are likely to be more useful as we move to real LSST data, and we should find a nice place for them to live.

Possibly in an extension table.

Support pixel_threshold=1

For exploration purposes, it would be useful to support pixel_threshold=1. For example, we could estimate the minimum distance between objects from the resulting catalog.

Simplify documentation for catalog import arguments

Currently, this is split across prose documentation and API documentation. It's OK to repeat ourselves if it makes the process easier for users.

We can distinguish the arguments that are more "runtime" from those that are specific to catalog importing; the current breakdown in /catalogs/arguments.rst is useful for the additional descriptions.

hipscat margin cache API updates

Due to some of the changes in the hipscat package for the new margin cache code, the API has changed and might break the CI once hipscat is updated. To ensure a smooth transition, I propose we do the following:

  • merge the new margin cache code into hipscat
  • pin hipscat-import's dependency on hipscat to version 0.1.0
  • release hipscat==0.1.1
  • merge in the hipscat-import compatibility changes + remove the pin on the hipscat dependency
  • release hipscat-import==0.1.1

(The last step might not be dependent on the margin cache changes, but can also be done whenever :])

Catalog import stages should fail louder

From @troyraen's experience importing AllWISE:

In general, the warnings and errors are pretty opaque. I'm guessing that some types and amounts of dask warnings are safe to ignore, but which ones do I really need to pay attention to? And when it does actually fail, how do I know where to look for the problem? I'm pretty sure that several of my runs encountered fatal errors during earlier steps like "Mapping", but they kept running and didn't actually fail until the final "Finishing" step, so the specific error was just a downstream effect of a much earlier problem. I don't know hipscat or dask well enough to know specifically, but I wonder:

  1. if there are some dask errors or warnings that hipscat should catch and then terminate the run; and
  2. if there are 1 or 2 basic checks that could be done at the end of each step that must pass, else the run terminates.

In either case, if hipscat would raise custom error messages that gave some info about where the user might look for the problem and/or solution, that would be really helpful.

We should implement some basic checks at the end of each stage to make sure they have completed successfully; a sketch follows.
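
A sketch of check (2), assuming each completed task leaves a marker file in the intermediate directory (the file layout and names are illustrative):

from pathlib import Path

# Fail fast if a stage's expected outputs are missing, pointing the user
# at the stage where the real error happened.
def check_stage_done(intermediate_dir, stage, expected_count):
    done = list(Path(intermediate_dir).glob(f"{stage}_*.done"))
    if len(done) != expected_count:
        raise RuntimeError(
            f"{stage} stage appears incomplete: {len(done)} of "
            f"{expected_count} tasks finished. Check the dask worker logs "
            f"from the {stage} stage for the root cause."
        )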

Make `numpy` a required dependency

numpy is still listed as an optional dependency, despite being used for a lot of core functionality... I assume no one's noticed because everyone installs with .[dev] or already has numpy installed locally 😭

Handle nans in radec

If you have a file with some empty radec values, an error bubbles up from the healpy library.

e.g.

id,ra,dec
701,,
702,310.5,-27.5

Results in an error like:

theta = array([       nan, 2.05076187])

    def check_theta_valid(theta):
        """Raises exception if theta is not within 0 and pi"""
        theta = np.asarray(theta)
        if not ((theta >= 0).all() and (theta <= np.pi + 1e-5).all()):
>           raise ValueError("THETA is out of range [0,pi]")
E           ValueError: THETA is out of range [0,pi]

../../.local/lib/python3.10/site-packages/healpy/pixelfunc.py:157: ValueError

I'm torn between some possible options here (in order of preference):

  • do nothing and let the healpy ValueError bubble up
  • catch the nans / out of range values first and give a more useful error message (sketched below)
  • strip the nans / out of range values out first and process the good rows

This is pretty low priority for now, as it only came up because I'd misconfigured my reader and it was parsing everything as nan =P
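
For reference, a sketch of the second option: validate before healpy is called, so the user sees which rows are bad instead of the raw THETA error (names illustrative):

import numpy as np

# Flag missing or out-of-range positions before healpy sees them.
def check_radec(ra, dec):
    bad = ~(np.isfinite(ra) & np.isfinite(dec) & (dec >= -90.0) & (dec <= 90.0))
    if bad.any():
        raise ValueError(
            f"found {bad.sum()} row(s) with missing or out-of-range radec, "
            f"first at index {np.flatnonzero(bad)[0]}"
        )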

Neighbor/Border Margin caching for LSDB crossmatching

Migrating this over from astronomy-commons/lsd2#11 since we're working on importing in this repo now.

Task:

  • Create a cache, for each pixel, of objects that are nearby in neighboring pixels.
  • Read the cache when computing a cross-match, and consider the neighboring objects as potential matches.

Right now the prototype for margin caching is working in the LSD2 repo; the main task now is working it into the hipscat-import package.
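
At the heart of the task is a neighbor lookup; a minimal sketch with healpy (order 3 is illustrative):

import healpy as hp

# For a given NESTED pixel, find the (up to 8) adjacent pixels whose
# border objects belong in this pixel's margin cache.
ORDER = 3
NSIDE = hp.order2nside(ORDER)

def neighboring_pixels(pixel):
    neighbors = hp.get_all_neighbours(NSIDE, pixel, nest=True)
    return [int(p) for p in neighbors if p != -1]  # -1 marks a missing neighbor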

Import pipeline for Macauff association table

Pipeline description

Copied from the design doc.

Create a dedicated map-reduce pipeline within the hipscat-import pipeline structure (a sketch of the shard-aggregation step follows the list).

  • read the XML file, and create a new parquet metadata file from it
  • for each input file / chunk
    • fetch left catalog partition info
    • fetch right catalog partition info
    • map left and right to healpix
    • split to match left partitions
      • include all columns, including likelihood scores
      • write to sharded parquet using generated parquet metadata
  • for each left partition
    • aggregate sharded files
  • write metadata files
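
A sketch of the "aggregate sharded files" step for one left partition, assuming the shards for a pixel are written under a single directory (layout illustrative):

import pyarrow.parquet as pq

# Read all shards for one left partition and write the combined file.
def aggregate_shards(shard_dir, output_path):
    table = pq.ParquetDataset(shard_dir).read()
    pq.write_table(table, output_path)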

Milestones

  • Pipeline boilerplate (see similar PR)
  • Test data files (XML and CSV)
  • Pipeline map reduce

Dask workers restarting results in ValueError on intermediate files

I'm trying to run the AllWISE example import. After an initial completion of Planning:

Planning : 100%|█████████████████████████████████| 5/5 [00:00<00:00, 615.24it/s]

I get a never-ending string of errors like this:

2023-08-09 12:31:14,359 - distributed.nanny - WARNING - Restarting worker
Planning :   0%|                                          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/stage/irsa-staff-shupe/hipscat/allwise/runimport.py", line 11, in <module>
    args = ImportArguments(
           ^^^^^^^^^^^^^^^^
  File "<string>", line 33, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 78, in __post_init__
    self._check_arguments()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/arguments.py", line 123, in _check_arguments
    self.resume_plan = ResumePlan(
                       ^^^^^^^^^^^
  File "<string>", line 9, in __init__
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 51, in __post_init__
    self.gather_plan()
  File "/stage/irsa-staff-shupe/micromamba/envs/hipscat/lib/python3.11/site-packages/hipscat_import/catalog/resume_plan.py", line 61, in gather_plan
    raise ValueError(
ValueError: tmp_path (/stage/irsa-staff-shupe/hipscat/allwise/tmp/allwise2/intermediate) contains intermediate files. choose a different directory or use --resume flag
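
The traceback shows multiprocessing's spawn start method re-running the driver script (runpy.run_path on runimport.py) inside each restarted worker process, which re-creates ImportArguments against the now-populated tmp_path. Guarding the entry point usually avoids this; a sketch, with the pipeline hand-off elided:

from hipscat_import.catalog.arguments import ImportArguments

def main():
    args = ImportArguments(
        # ... the same arguments as the failing run ...
    )
    # hand args to the import pipeline runner here

if __name__ == "__main__":
    # Only the driver process runs the import; dask's spawned worker
    # processes re-import this module and skip this block.
    main()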

Sphinx documentation

Create documentation as part of continuous integration, and automatically push docs to Read the Docs or GitHub Pages.
