
arrakis's Introduction

Hi 👋, I'm Alec Thomson

Radio astronomer. Huge fan of Stokes Q and U.

  • 🔭 I'm currently working on the Rapid ASKAP Continuum Survey (RACS)
  • 🌱 I'm currently learning Python asyncio, and dipping my toes into Rust 🦀
  • 👯 I'm looking to collaborate on anything linearly polarised at radio frequencies
  • 🤔 I'm looking for help with adapting the Arrakis pipelines to more surveys or telescopes
  • 💬 Ask me about radio astronomy and large data pipelines
  • 📫 How to reach me: [email protected]
  • ⚡ Fun fact: RACS detected 3 million galaxies in 300 hours. That's nearly 3 per second!

Languages and Tools

bash docker git linux pandas python rust


arrakis's People

Contributors

alecthomson, pre-commit-ci[bot], tjgalvin


arrakis's Issues

RMExtract-related issues and notes

When running the FRion correction stage of the arrakis pipeline, a number of issues internal to RMextract were found. This issue aims to record them for future generations.

The compute nodes of Petrichor do not have external network access. Internally, RMextract attempts to download data files from remote sites over FTP, which fails. @AlecThomson found that by supplying dummy proxy settings to RMextract it could be convinced to use a cached copy of these files, so long as the server URL is of the form file:///path/to/ionex/files.

In this approach a secondary issue was found: the data file is copied into place and extracted using a subprocess call to gunzip if the file ends in .Z. Although this works in a single-process environment, concurrently running tasks would each try to download (i.e. copy from the local cache) and extract the same file. This results in race conditions where the file is constantly overwritten, and either gunzip fails outright or the extracted file is incomplete. I added a collection of os.path.exists-style checks and early escapes, which seems to get me by.
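Roughly, the kind of early-escape guard described above (a minimal sketch only; the function name and paths are hypothetical, not RMextract's actual internals):

import os
import shutil
import subprocess

def fetch_ionex(cache_path: str, dest_path: str) -> str:
    # Skip the copy/gunzip if another task has already extracted the file.
    # This only narrows the race window; a proper file lock would be more robust.
    unzipped = dest_path[:-2] if dest_path.endswith(".Z") else dest_path
    if os.path.exists(unzipped):
        return unzipped
    if not os.path.exists(dest_path):
        shutil.copy(cache_path, dest_path)
    if dest_path.endswith(".Z"):
        subprocess.run(["gunzip", "-f", dest_path], check=True)
    return unzipped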

Should we merge these fixes together and maintain a static fork?

ASKAP correlations are not what wsclean expects

As we know, ASKAP MeasurementSets are not 'standard'. I had previously assumed that the only difference was a factor of 2, which we applied to the images. However, it's not that simple, unfortunately.

WSclean assumes the following formulation for Stokes (taken straight from aocommon):

$$ \begin{bmatrix} I \\ Q \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.5 & 0 & 0 & 0.5\\ 0.5 & 0 & 0 & -0.5\\ 0 & 0.5 & 0.5 & 0\\ 0 & - 0.5 i & 0.5 i & 0 \end{bmatrix} \begin{bmatrix} XX\\ XY\\ YX\\ YY \end{bmatrix} $$
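For concreteness, the same convention as a quick numpy sketch (just a restatement of the matrix above, not the ASKAP correction):

import numpy as np

# WSClean/aocommon convention: Stokes I, Q, U, V from linear correlations XX, XY, YX, YY
WSCLEAN_STOKES = np.array([
    [0.5,  0.0,   0.0,   0.5],   # I = (XX + YY) / 2
    [0.5,  0.0,   0.0,  -0.5],   # Q = (XX - YY) / 2
    [0.0,  0.5,   0.5,   0.0],   # U = (XY + YX) / 2
    [0.0, -0.5j,  0.5j,  0.0],   # V = i(YX - XY) / 2
])

def to_stokes(xx, xy, yx, yy):
    return WSCLEAN_STOKES @ np.array([xx, xy, yx, yy])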

ASKAP, however, can have an arbitrary rotation of the PAF. See the ASKAP Observation Guide, Eqn. D1. I've been able to solve for the matrix that gives the appropriate transformation from ASKAP's Xs and Ys into the ones WSClean expects.

Emil also gave us this handy script to get the rotation:

#!/usr/bin/env python
"""Print the PAF rotation (pol_axis) angle, in degrees, of an ASKAP MeasurementSet."""
import sys

import numpy as np
from casacore.tables import table

if len(sys.argv) != 2:
    sys.exit(f"Usage: {sys.argv[0]} <ms_file>")

ms = sys.argv[1]

# The receptor angle is stored per feed and receptor in the FEED subtable
tf = table(f"{ms}/FEED", readonly=True, ack=False)
ms_feed = tf.getcol("RECEPTOR_ANGLE")
pol_axis = -(np.degrees(ms_feed) - 45.0)
print(pol_axis[0][0])
tf.close()

I'll add a script similar to fix_ms_dir to fix this issue.

Benchmark imaging

We need to:

  1. Determine best default parameters for wsclean imaging of cubes.
  2. Benchmark the time taken per field for imaging.

Need to make LINMOS happy

Some experiments with LINMOS have revealed hard requirements on both the header and the file naming. These must be met for LINMOS to perform leakage correction appropriately.

LINMOS expects a filename like:

image.blahblah.{STOKES}.blahblah.beam{BEAMNUMBER}.blahblah.fits

Here STOKES must be lower case, i.e. .i., .q., .u., or .v., and the beam number must be presented in zero-padded format, e.g. .beam00 (note the preceding .).
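As a concrete illustration (the helper name is hypothetical), the expected pattern is something like:

def linmos_name(stokes: str, beam: int) -> str:
    # lower-case Stokes label and zero-padded beam number, as LINMOS expects
    return f"image.blahblah.{stokes.lower()}.blahblah.beam{beam:02d}.blahblah.fits"

print(linmos_name("Q", 3))  # image.blahblah.q.blahblah.beam03.blahblah.fits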

Also, in the header there must be a dummy Stokes axis in NAXIS3 - this is the 2nd axis (index 1) in numpy. So the ordering must be (FREQ, STOKES, Y, X) in numpy, or (X, Y, STOKES, FREQ) in FITS. Without this ordering, leakage subtraction will fail. Right now I've stacked FREQ into NAXIS3 (it needs to be NAXIS4).
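A minimal sketch of the kind of axis/header shuffle needed, assuming the input cube is (FREQ, Y, X) in numpy and using astropy (file names are placeholders):

import numpy as np
from astropy.io import fits

with fits.open("image.blahblah.q.blahblah.beam00.blahblah.fits") as hdul:
    data = hdul[0].data      # numpy shape (FREQ, Y, X)
    header = hdul[0].header

# Move the FREQ WCS keywords from axis 3 to axis 4
for key in ("CTYPE3", "CRVAL3", "CDELT3", "CRPIX3", "CUNIT3"):
    if key in header:
        header[key.replace("3", "4")] = header[key]

# Insert a dummy (length-1) Stokes axis as NAXIS3
header["CTYPE3"] = "STOKES"
header["CRVAL3"] = 2.0   # 1/2/3/4 = I/Q/U/V; Q here
header["CDELT3"] = 1.0
header["CRPIX3"] = 1.0
header["CUNIT3"] = ""

# numpy (FREQ, STOKES, Y, X) == FITS (X, Y, STOKES, FREQ)
fits.writeto("fixed.fits", data[:, np.newaxis, :, :], header, overwrite=True)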

Also, interestingly, LINMOS will default to looking for

image.blahblah.i.blahblah.beam{BEAMNUMBER}.blahblah.fits

when correcting e.g. image.blahblah.q.blahblah.beam{BEAMNUMBER}.blahblah.fits. This can be overridden using the linmos.stokesinames parameter to point to any other FITS file. We may want to consider keeping the model files around for a more robust subtraction method.

Ping @tjgalvin for thoughts

Common resolutions

Putting some thoughts down here on the 'proper' approach to common resolution.

Right now, in the imager stage, we convolve an entire field to a common resolution across beams and frequency.

In principle, we actually only need to convolve each beam along the frequency axis. Then, when we go to LINMOS each cubelet together, we can make a last-minute check and convolution before the images are co-added. This way, each cubelet would be kept at the best possible resolution.

For top-level purposes, one could then decide to convolve all (or some subset) of the cubelets to whatever resolution and reproduce a catalogue from those convolved data. We just need to ensure the right check happens in the LINMOS stage, and that the pipeline picks up the correct PSF information when making a catalogue.
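A rough sketch of the per-cubelet, frequency-only convolution using radio-beam and astropy (not the pipeline's actual code; flux-scale corrections for the changed beam area are omitted):

import astropy.units as u
from astropy.convolution import convolve_fft
from radio_beam import Beams

def convolve_to_common(cube, beams: Beams, pixscale: u.Quantity):
    """cube: (nchan, ny, nx) array; beams: per-channel beams for this cubelet."""
    common = beams.common_beam()  # smallest beam that contains every channel's beam
    out = cube.copy()
    for chan in range(len(beams)):
        # kernel needed to take this channel's beam to the common beam
        # (channels already at the common resolution would need to be skipped)
        conv_beam = common.deconvolve(beams[chan])
        kernel = conv_beam.as_kernel(pixscale)
        out[chan] = convolve_fft(cube[chan], kernel, allow_huge=True)
    return out, common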

Remove RACS as submodule

The RACS 'database' is too big to keep as a submodule. Let's get rid of it and set things up for users to fetch it themselves. Eventually this will be a proper database; when that happens we can simply query it.

RMSs passed in as linmos weights

I found this while adventuring on the high seas with flint, but it looks to me like passing RMSs through as weights into linmos is not the right thing to do. We should be passing in the inverse RMS squared, or at the very least the inverse RMS. I believe the higher the weight, the more preference linmos gives to that image, so lower RMSs (i.e. better images) are actually given less importance.

I will confirm this is the correct behavior in the next day or two. Putting it here so the beer I drink later doesn't make me forget.
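A toy illustration (numbers made up) of why raw RMSs invert the preference:

import numpy as np

rms = np.array([0.5, 2.0])       # noise of two overlapping images (better image first)
inv_var = 1.0 / rms**2           # proposed weights
print(inv_var / inv_var.sum())   # [0.94 0.06] -> the low-noise image dominates
print(rms / rms.sum())           # [0.2 0.8]   -> raw RMS weights favour the noisier image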

Migrate to Prefect v2

We should look into porting process_region.py and process_spice.py to Prefect v2 for easier management of large-scale processing.

Logger issues when moving to prefect2

Each of the individual arrakis processing stages is currently a dask-based pipeline. The workflow is constructed using dask.delayed functions, which build a graph that is then traversed by dask workers.

The previous Prefect version 1 pipeline has been updated to version 2. This pipeline sequentially calls each of the individual dask workflows. The intent is that some of the Prefect 2 features (agents for remotely starting pipelines, a RESTful API) can be leveraged when full RACS processing commences.

We are creating a single Client/Cluster pair which is supplied to the main Prefect workflow, and are relying on the prefect_dask.get_dask_client context manager to pass these compute resources through to the dask workflows started by each of the separate arrakis stages.
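A stripped-down sketch of that setup (not the actual arrakis flow; a LocalCluster stands in for the real Client/Cluster pair):

from dask.distributed import LocalCluster
from prefect import flow, task
from prefect_dask import DaskTaskRunner, get_dask_client

cluster = LocalCluster(n_workers=4)

@task
def run_stage():
    # hand the flow's dask resources through to the stage's dask.delayed workflow
    with get_dask_client() as client:
        ...  # the stage would compute its delayed graph against this client

@flow(task_runner=DaskTaskRunner(address=cluster.scheduler_address))
def process():
    run_stage.submit()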

This setup, though, is losing information emitted by the arrakis logger. The supported methods of attaching Prefect log handlers to a module (defined via export PREFECT_LOGGING_EXTRA_LOGGERS="arrakis" or a Prefect logging configuration file) are not being properly propagated through to the code executed within the @dask.delayed functions. Similarly, the logs that should be printed to stdout in these functions are not being printed to stdout. This makes debugging difficult. Grrr.

I have tried to hijack the arrakis logger while the delayed functions are being executed and manually attach the Prefect Orion logger handler. However, by that point there is no Prefect context that can be obtained (e.g. the flow-run/task-run ID). Unless we want to pass this context information into the dask workflow when we enter some main, relying on the Prefect helpers (e.g. get_run_context and get_run_logger) is not possible, as I understand things.

The way I have managed to get information out of the encapsulated dask workflows is to retrieve the distributed.worker logger and use that, similar to the snippet below.

import logging
from dask import delayed

@delayed
def cutout(...):
    # use the logger that the dask worker has already configured
    logger = logging.getLogger("distributed.worker")
    logger.setLevel(logging.INFO)

This seems to work reliably, although it emits messages using the distributed.worker formatter.

And in typing all this up as an MWE, I think this works just as well:

import logging

from arrakis.logger import logger
from dask import delayed

@delayed
def cutout(...):
    # bump the level of the module-level arrakis logger inside the delayed function
    logger.setLevel(logging.INFO)

The real gotcha in all of this seems to be the logging level. There needs to be a setLevel(logging.INFO) call inside the dask.delayed functions - putting it at module level, outside the functions, does not seem to work, even though the logger object created in arrakis.logger already has this level set. All this is to say I am not sure why this is needed. It might be related to how serialisation works? As the dask workers are scaled up and down with the Cluster.adapt method, there might be a need to reset this often?

Not sure where the problem is coming from.

Add back an automated imager skip stage

This is being raised as a note and perhaps a discussion when things are running more routinely.

When optimising the imager dask routine I removed the checks that search for existing images, essentially removing the image_reuse option. Since I am now routinely cleaning up all non-MFS files, this check would be fairly useless in most cases.

There is now a --skip_imager option in the configuration parser. However, I am curious whether there is a need to add this option back in. If so, I think the actual thing to do would be to scan for the final cubes: if all 36 (or 72) cubes and their corresponding weight files exist, the entire imager stage could be automatically skipped; otherwise, burn it all down and start again.
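A rough sketch of what that check could look like (file patterns and names are hypothetical, not the pipeline's real ones):

from pathlib import Path

def can_skip_imager(outdir: Path, nbeams: int = 36) -> bool:
    # Only skip the imager stage if every expected cube and weight file exists
    for beam in range(nbeams):
        cubes = list(outdir.glob(f"*.beam{beam:02d}.*cube*.fits"))
        weights = list(outdir.glob(f"*.beam{beam:02d}.*weights*.fits"))
        if not cubes or not weights:
            return False
    return True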

Dealing with many files

As a design principle, Arrakis produces many files. This is actually somewhat desirable, since it allows us to process the data in parallel. However, for our full runs processing the entire sky we'll need to make some considerations. Many HPC filesystems don't like having millions of files, and enforce inode limits on users.

There are some options for us to consider, and we'll need to do some testing and brainstorming. One painful constraint is that many of the tools used in the pipeline (e.g. linmos) are built to handle FITS files, and FITS files only. Further, converting in and out of other file formats will chew up considerable time.

To be explicit, we need to consider some kind of combined file format for the cubelets.

Some options off the top of my head (a rough sketch of the first option follows the list):

  • Multi-extension FITS
  • HDF5
  • Zarr
  • tar
  • SquashFS
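For instance, bundling per-source cubelets into a single multi-extension FITS might look roughly like this (illustrative only; paths are hypothetical):

from pathlib import Path
from astropy.io import fits

cubelets = sorted(Path("cutouts").glob("*.cube.fits"))
hdus = [fits.PrimaryHDU()]
for path in cubelets:
    with fits.open(path, memmap=False) as hdul:
        # one image extension per cubelet, keeping its header and a traceable EXTNAME
        hdus.append(fits.ImageHDU(data=hdul[0].data.copy(), header=hdul[0].header.copy(), name=path.stem))
fits.HDUList(hdus).writeto("cubelets_combined.fits", overwrite=True)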
