
AstroCLIP

Official PyTorch implementation and pre-trained models for the paper AstroCLIP: A Cross-Modal Foundation Model for Galaxies.


AstroCLIP is a novel, cross-modal, self-supervised foundation model that creates a shared embedding space for multi-band imaging and optical spectra of galaxies. These embeddings encode meaningful physical information shared between both modalities, and can be used as the basis for competitive zero- and few-shot learning on a variety of downstream tasks, including similarity search, redshift estimation, galaxy property prediction, and morphology classification.

Web App

Check out our interactive similarity search app, enabling both in-modal and cross-modal search for galaxies: https://astroclip.streamlit.app/

Installation

The training and evaluation code requires PyTorch 2.0. Additionally, an up-to-date eventlet is required for wandb. Note that the code has only been tested with the specified versions and expects a Linux environment. To install the AstroCLIP package and its dependencies, run the commands below.

pip install --upgrade pip
pip install --upgrade eventlet torch lightning[extra]
pip install -e .

NOTE: Once the package is installed, it provides three command-line shortcuts: astroclip_trainer and spectrum_trainer, which both link to astroclip/trainer.py, and image_trainer, which links to astroclip/astrodino/trainer.py. The shortcuts are defined in the project.scripts section of the pyproject.toml file.
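For reference, a minimal sketch of what that section might look like; the entry-point functions shown here are assumptions for illustration, not copied from the repository:

[project.scripts]
# hypothetical entry points; see pyproject.toml for the actual targets
astroclip_trainer = "astroclip.trainer:main_cli"
spectrum_trainer = "astroclip.trainer:main_cli"
image_trainer = "astroclip.astrodino.trainer:main"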

Handling roots

The package expects to load models and data by default from

{ASTROCLIP_ROOT}

You can configure ASTROCLIP_ROOT, as well as the Weights & Biases group in which runs are saved, by creating a .env file in the root of astroclip with the following content:

ASTROCLIP_ROOT="/mnt/ceph/users/polymathic/astroclip"
WANDB_ENTITY_NAME="flatiron-scipt"

If no .env file is provided, the default paths at Flatiron are assumed.

Pretrained Models

We provide the pretrained AstroCLIP model on the Huggingface model hub for easy access. Additionally, we provide the pretrained single-modal models for galaxy images and spectra as well. Model details, checkpoints, configs and logs are below.

Model Name         Pretraining       # Params   Download
AstroCLIP          CLIP              370M       ckpt config logs
Image Encoder      DINOv2            302M       ckpt config logs
Spectrum Encoder   Masked Modeling   43M        ckpt config logs

Loading the Pretrained Models

The pretrained AstroCLIP model can be loaded using the following:

from astroclip.models import AstroClipModel
model = AstroClipModel.load_from_checkpoint(
    checkpoint_path = "path_to_model.ckpt",
)
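
Once loaded, the model can embed batches of images or spectra. The snippet below is a sketch: the input_type keyword and the tensor shapes are assumptions based on the model's two input modalities, not a verified API.

import torch
from astroclip.models import AstroClipModel

model = AstroClipModel.load_from_checkpoint("path_to_model.ckpt")
model.eval()

# illustrative random batches; real inputs are Legacy Survey images and DESI spectra
images = torch.randn(8, 3, 144, 144)   # assumed image shape
spectra = torch.randn(8, 7781, 1)      # assumed spectrum shape

with torch.no_grad():
    im_emb = model(images, input_type="image")      # assumed keyword argument
    sp_emb = model(spectra, input_type="spectrum")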

High-Level Performance Overview

Below, we include a high-level performance overview of our models on a variety of downstream tasks. This is non-exhaustive, and we refer the reader to the paper for the full details.

Source       Model               Type         Redshift   Properties   Morphology
Image        AstroCLIP*          Zero-Shot    0.79       0.47         0.76
Image        Image Encoder*      Zero-Shot    0.63       0.37         0.78
Image        Stein et al.        Zero-Shot    0.36       0.26         0.76
Image        ResNet18            Supervised   0.77       0.43         -
Image        ZooBot [1]          Supervised   -          -            0.88
Spectrum     AstroCLIP*          Zero-Shot    0.99       0.63         -
Spectrum     Spectrum Encoder*   Zero-Shot    0.99       0.64         -
Spectrum     Conv+Att [2]        Supervised   0.99       0.60         -
Photometry   MLP                 Supervised   0.68       0.42         -

We report R-squared metrics for redshift and galaxy property estimation (averaged across all properties) and accuracy for galaxy morphology classification (averaged across all labels). Our models are marked with an asterisk (*). [1] We use the results reported in Walmsley et al. (2021). [2] We use the encoder from Melchior et al. (2022).

Data Access

The AstroCLIP model is trained on the cross-matched sample containing optical spectra from the Dark Energy Spectroscopic Instrument (DESI) Early Data Release (EDR) and multi-band images (g,r,z) from the DESI Legacy Survey prepared by Stein et al. (2022). We provide the dataset as a HuggingFace dataset, which can be accessed directly using:

from datasets import load_dataset

# This downloads about 60 GB of data
dset = load_dataset('astroclip/data/dataset.py')
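
Continuing from the snippet above, the splits and individual cross-matched pairs can be inspected as with any HuggingFace dataset; the column names below ('image', 'spectrum') are assumptions about the schema, so check print(dset) for the actual features:

# inspect the available splits and features
print(dset)

# pull out one cross-matched image-spectrum pair
example = dset["train"][0]                               # assumes a "train" split
image, spectrum = example["image"], example["spectrum"]  # assumed column names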

For reproducibility, we include the scripts and a brief description of how to generate the cross-matched dataset in astroclip/data/crossmatch.

Image Pretraining Dataset


While the AstroCLIP and Spectrum Encoder models are trained on the image-spectrum dataset, we pretrain the galaxy image model separately on the full Stein et al. (2022) image dataset, which consists of 76M galaxy images. This dataset can be accessed using this Globus endpoint:

https://app.globus.org/file-manager?origin_id=9fb0fc0e-e760-11ec-9bd2-2d2219dcc1fa&origin_path=%2F

The directory is organized into south and north surveys, where each survey is split into chunks of 1,000,000 galaxies (sorted by decreasing z-band flux) and saved in HDF5 format. For more details, see here.
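
A minimal h5py sketch for reading one such chunk; the file name and dataset key below are illustrative assumptions, so check the Globus listing for the actual layout:

import h5py

# hypothetical chunk file from the south survey
with h5py.File("south/images_npix152_000000000_001000000.h5", "r") as f:
    print(list(f.keys()))        # inspect the datasets actually stored in the file
    images = f["images"][:64]    # assumed key; loads the first 64 galaxy images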

Pretraining

AstroCLIP is trained using a two-step process:

  1. We pre-train a single-modal galaxy image encoder and a single-modal galaxy spectrum encoder separately.
  2. We CLIP-align these two encoders on a paired image-spectrum dataset.

Single-Modal Pretraining

Image Pretraining - DINOv2 ViT:

AstroCLIP uses a Vision Transformer (ViT) to encode galaxy images. Pretraining is performed using the DINOv2 package, which combines self-distillation, masked-modeling, and contrastive objectives. Overall, we use largely the same training regime; however, we modify some of the contrastive augmentations to suit the astrophysical context. Model training can be launched with the following command:

image_trainer -c astroclip/astrodino/config.yaml

We train the model using 20 A100 GPUs (on 5 nodes) for 250k steps, which takes roughly 46 hours.

Spectrum Pretraining - Masked Modeling Transformer:

AstroCLIP uses a 1D Transformer to encode galaxy spectra. Pretraining is performed using a masked-modeling objective, whereby the 1D spectrum is split into contiguous, overlapping patches. Model training can be launched with the following command:

spectrum_trainer fit -c config/specformer.yaml

We train the model using 4 A100 GPUs (on 1 node) for 30k steps, which takes roughly 12 hours.
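
To make the patching step described above concrete, here is a schematic of splitting a batch of 1D spectra into contiguous, overlapping patches with torch.Tensor.unfold; the spectrum length, patch size, and overlap are illustrative values, not those in config/specformer.yaml:

import torch

spectra = torch.randn(4, 7781)       # hypothetical batch of 1D spectra
patch_size, overlap = 20, 10         # illustrative patching parameters
stride = patch_size - overlap

# sliding window over the wavelength axis: (batch, num_patches, patch_size)
patches = spectra.unfold(1, patch_size, stride)
print(patches.shape)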

CLIP Alignment:

Once pretrained, we align the image and spectrum encoders using cross-attention projection heads, maximizing the similarity between cross-modal embeddings that correspond to the same galaxy while simultaneously minimizing the similarity between cross-modal embeddings that correspond to different galaxies. Model training can be launched with the following command:

spectrum_trainer fit -c config/astroclip.yaml

We train the model using 4 A100 GPUs (on 1 node) for 25k steps, stopping early if the validation loss does not improve for a fixed number of steps. This takes roughly 12 hours.
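
For intuition, the alignment objective is the standard symmetric InfoNCE (CLIP) loss; below is a schematic implementation, with an illustrative temperature value (the actual heads and hyperparameters live in config/astroclip.yaml):

import torch
import torch.nn.functional as F

def clip_loss(image_emb, spectrum_emb, temperature=0.07):
    # cosine similarities between every image/spectrum pair in the batch
    image_emb = F.normalize(image_emb, dim=-1)
    spectrum_emb = F.normalize(spectrum_emb, dim=-1)
    logits = image_emb @ spectrum_emb.T / temperature
    # matching pairs sit on the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)
    # symmetric cross-entropy: image-to-spectrum and spectrum-to-image
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))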

Downstream Tasks

We demonstrate that the AstroCLIP embeddings can be used to easily perform a variety of downstream tasks. In particular, we demonstrate:

  1. In-modal and cross-modal similarity search
  2. Photometric redshift prediction
  3. Physical property estimation from images
  4. Physical property estimation from spectra
  5. Morphology classification from images

The details of these downstream tasks and the results in our paper can be found in astroclip/downstream_tasks.
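
As an example of task (1), in-modal and cross-modal similarity search reduce to nearest-neighbor lookup in the shared embedding space. The sketch below uses random placeholder embeddings and an assumed 512-dimensional embedding; in practice, the vectors come from the pretrained model as shown earlier:

import torch
import torch.nn.functional as F

# placeholder embeddings; in practice, embed galaxies with the pretrained model
query = F.normalize(torch.randn(512), dim=-1)            # one galaxy image embedding
gallery = F.normalize(torch.randn(10000, 512), dim=-1)   # spectrum embeddings to search

sims = gallery @ query                  # cosine similarities to every gallery galaxy
top5 = torch.topk(sims, k=5).indices    # indices of the five most similar galaxies
print(top5)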

Acknowledgements

This repository uses datasets and contrastive augmentations from Stein et al. (2022). The image pretraining is built on top of the DINOv2 framework; we also thank Piotr Bojanowski for valuable conversations around image pretraining.

License

AstroCLIP code and model weights are released under the MIT license. See LICENSE for additional details.

Citation

TODO

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Liam Parker 💻
Francois Lanusse 💻 🔣
Siavash Golkar 💻
Leopoldo Sarra 💻 🔧
Shirley Ho 🤔 🔬
Miles Cranmer 🤔 🎨

This project follows the all-contributors specification. Contributions of any kind welcome!
