davidt3 / daxa

Democratising Archival X-ray Astronomy (DAXA) is an easy-to-use Python module for downloading multi-mission X-ray telescope data and processing it into usable archives. Users can acquire entire archives, or filter observations based on ID/positions/time. Supports XMM; partial support for eROSITA, Chandra, NuSTAR, Swift, Suzaku, ASCA, ROSAT, and INTEGRAL.

License: BSD 3-Clause "New" or "Revised" License

Python 98.72% TeX 1.28%
astronomy astrophysics python x-ray-astronomy xmm chandra erosita xga archival-astronomy nustar

daxa's Introduction

Documentation Status

What is Democratising Archival X-ray Astronomy (DAXA)?

DAXA is a Python module designed to make the acquisition and processing of archives of X-ray astronomy data as painless as possible. It provides a consistent interface to the downloading and cleaning processes of each telescope, allowing the user to easily create multi-mission X-ray archives and helping the community make better use of archival X-ray data. This process can be as simple or as in-depth as the user requires; if the default settings are used, data can be acquired and processed into an archive in only a few lines of code.

As the missions (i.e. telescopes) that should be included in the archive are defined, the user can filter the desired observations based on a unique identifier (i.e. observation ID), on whether observations are near a coordinate (or set of coordinates), and on the time frame in which the observations were taken. As such, it is possible to very quickly identify what archival data might be available for a set of objects you wish to study. It is also possible to place no filters on the desired observations, and so process every observation available for a set of missions.
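As an illustration, here is a minimal sketch of that workflow; the class and method names below follow the DAXA documentation, but treat the exact signatures as indicative rather than definitive:

from astropy.coordinates import SkyCoord
from daxa.mission import XMMPointed
from daxa.archive import Archive

# Declare the mission(s) whose data should be included in the archive
xmm = XMMPointed()

# Keep only observations near the coordinates of interest
xmm.filter_on_positions(SkyCoord([149.59, 150.23], [7.00, 7.65], unit='deg'))

# Build an archive from whatever observations survived the filtering
arch = Archive('my_first_archive', [xmm])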

Documentation is available on ReadTheDocs, and can be found here, or accessed by clicking on the documentation build status at the top of the README. The source for the documentation can be found in the 'docs' directory in this repository.

Installing DAXA

We strongly recommend that you make use of Python virtual environments, or (even better) Conda/Mamba virtual environments when installing DAXA.

DAXA is available on the popular Python Package Index (PyPI), and can be installed like this:

pip install daxa

You can also fetch the current working version from the git repository, and install it (this method has replaced 'python setup.py install'):

git clone https://github.com/DavidT3/DAXA
cd DAXA
python -m pip install .

Alternatively, you could use the 'editable' option (this has replaced running setup.py and passing 'develop'), so that any changes you pull from the remote repository are reflected without having to reinstall DAXA.

git clone https://github.com/DavidT3/DAXA
cd DAXA
python -m pip install --editable .

Which missions are supported?

DAXA is still in a relatively early stage of development, and as such the support for local re-processing is limited; however, support for the acquisition and use of pre-processed data is implemented for a wide selection of telescopes:

  • XMM-Newton Pointed
  • eROSITA Commissioning
  • eROSITA All-Sky Survey DR1 (German Half)
  • [Under Development - data acquisition implemented] NuSTAR
  • [Under Development - data acquisition implemented] Chandra
  • [Under Development - RASS/pointed data acquisition implemented] ROSAT
  • [Under Development - XRT/BAT/UVOT data acquisition implemented] Swift
  • [Under Development - data acquisition implemented] Suzaku
  • [Under Development - data acquisition implemented] ASCA
  • [Under Development - data acquisition implemented] INTEGRAL

If you would like to help with any of the telescopes above, or adding another X-ray telescope, please get in contact!

Required telescope-specific software

DAXA makes significant use of existing processing software released by the telescope teams, and as such there are some specific non-Python dependencies that need to be installed if a given mission is to be included in a DAXA-generated archive.

An alternative to installing the dependencies yourself

[Under Development] - A Docker image containing the relevant telescope-specific software is being created. The built image will be released on DockerHub (or some other convenient platform), and the dockerfile used for building the image will also be released for anyone to use/modify. The dockerfile is heavily inspired by/based on the HEASoft Docker image.

XMM-Newton

Science Analysis System (SAS) - v14 or higher

Analysing the processed archives

Once an archive of cleaned X-ray data has been created, it can be analysed in all the standard ways; however, you may also wish to consider X-ray: Generate and Analyse (XGA), a companion module to DAXA.

XGA is also completely open source, and is a generalised tool for the analysis of X-ray emission from astrophysical sources. The software operates on a 'source based' paradigm, where the user declares sources or samples of objects which are analogous to astrophysical sources in the sky, with XGA determining which data (if any) are relevant to a particular source, and providing a powerful (but easy to use) interface for the generation and analysis of data products. The module is fully documented, with tutorials and API documentation available (support for telescopes other than XMM is still under development).

Problems and Questions

If you encounter a bug, or would like to make a feature request, please use the GitHub issues page; it really helps to keep track of everything.

However, if you have further questions, or just want to make doubly sure I notice the issue, feel free to send me an email at [email protected]

daxa's People

Contributors

davidt3, dependabot[bot], guptaagr, jessicapilling, tobywallage


daxa's Issues

Should design a DAXA-specific cleaning process at some point

This would be mission-agnostic, and ideally support any of the telescopes which DAXA ends up being able to reduce data for. This would be an alternative to the mission-specific methods I am implementing first (i.e. the SAS cleaning methods for XMM).

Add the basic structure to the documentation

Set up the general structure, with an installation guide, intro section, contact section etc.

Don't need to make it perfect for this issue, or write any tutorials, but sketch out the framework.

Set up convenience functions to easily create Archives in particular circumstances

I.e. the simplest could be 'process all available observations from XMM', or 'process all available observations from XMM, Chandra, and eROSITA' (once support for other telescopes is added).

That would provide an archive instance which could be passed into processing functions, both the telescope-specific processing that is generally provided by a particular telescope's software suite, and the planned mission-agnostic processing that I will eventually add to DAXA (issue #17).

Other examples of convenience functions like this could be ones that would assemble an archive from multiple telescopes for observations relevant to some particular sources.
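A minimal sketch of the simplest such helper, where the function name and body are hypothetical rather than an existing DAXA API:

from daxa.mission import XMMPointed
from daxa.archive import Archive

def process_all_xmm(archive_name):
    # Hypothetical convenience function - declare the mission with no
    # filters at all, so every available XMM observation is included
    xmm = XMMPointed()
    return Archive(archive_name, [xmm])

The returned Archive instance could then be handed straight to the processing functions discussed above.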

Add combined sky-coverage calculation capabilities

This should be able both to assess how much of the sky is covered by a particular set of data, and to produce coverage maps which can be stored alongside the processed datasets to allow for the identification of data relevant to a particular source.
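A hedged sketch of the coverage-fraction half of this, approximating each observation's field of view as a circle on a HEALPix grid (healpy is an assumed dependency here, and the pointing list and FoV radius are illustrative):

import numpy as np
import healpy as hp

def sky_coverage_fraction(ra_deg, dec_deg, fov_radius_deg, nside=1024):
    # Collect the set of HEALPix pixels that fall inside any pointing's FoV
    covered = set()
    for ra, dec in zip(ra_deg, dec_deg):
        vec = hp.ang2vec(ra, dec, lonlat=True)
        covered.update(hp.query_disc(nside, vec, np.radians(fov_radius_deg)))
    # Fraction of the whole sky covered by at least one observation
    return len(covered) / hp.nside2npix(nside)

# e.g. two overlapping XMM-like pointings with a ~15 arcmin radius FoV
print(sky_coverage_fraction([150.0, 150.2], [7.0, 7.1], 0.25))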

PN non-imaging mode sub-exposures

As mentioned in issue #34, DAXA cannot currently parse XMM ODF summary files. As such it is difficult to efficiently identify which exposures are in which observing mode, and other details about them.

Currently only PN imaging mode data will be processed by DAXA (though hopefully that will change at some point), but as issue #34 is not implemented, and I don't want to be reading headers for thousands of FITS files if I can avoid it, epchain will currently attempt to process every sub-exposure as an imaging mode observation.

This will cause an error for other data modes (like timing, for instance).

At the moment I am just going to let them fail, and then catch them further down the line, rather than identifying them a priori and never running the commands in the first place.

XMM data weirdness

This isn't really a question to me, but just XMM in general.

On XSA the observation 0001730401 shows only RGS data being available - the quality report just indicates RGS as well, no EPIC data.

However, when I acquire the ODFs I find unscheduled PN observations and scheduled MOS observations - so what gives??

This is being left here mostly as a reminder to myself to try and solve this mystery.

The documentation is not building on RTD

I will add more information as I explore the issue, but every build of the DAXA documentation on Read the Docs has failed thus far.

I think it's a dependency versioning problem.

Add anomalous CCD state checking for MOS

I am basically following the eSAS guide at this point, but checking for CCDs in anomalous states is going to be a good idea.

This should enable filtering based on what states the user considers acceptable as well.

The choice of acceptable states should of course be recorded for the archive.

Implement a wrapper for the eSAS espfilter soft-proton filter function

Again following the example of the XMM eSAS manual, I will be using espfilter to find bad time intervals with high levels of soft proton flaring, courtesy of the Sun.

In the currently released version of eSAS there is a script called PN-FILTER (and an equivalent MOS implementation) that calls espfilter, but the upcoming version of eSAS (per the unreleased manual I found) has removed it, so that eSAS adds functions rather than processing scripts to SAS.

I will be attempting to make DAXA compatible with as many versions of SAS/eSAS as possible (whilst remaining consistent) by not using PN-FILTER, and instead making an espfilter function for DAXA that supports both PN and MOS.

Downloading specific instruments for XMM currently downloads then deletes irrelevant data

I intended the downloading of specific instruments to minimise disk/bandwidth usage by not downloading data that a user considers irrelevant to their use case, or that can't (yet) be processed by DAXA. Unfortunately for XMM, downloading ODFs (observation data files) via the AIO URLs (and thus the AstroQuery interface) for specific instruments is currently impossible; regardless of the specified instrument, all instrument ODFs are downloaded.

This is happening on the XSA end, and I've sent in a ticket asking if this is intended behaviour, but whatever the answer ends up being, I have to deal with it for the time being. As such, the XMMPointed class download behaviour will acquire all instrument data for a given observation.

Then (assuming that this doesn't break any pre-built data processing tasks downstream) it will delete those ODF files which relate to instruments that have NOT been selected by the user.
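A rough sketch of that clean-up step; the assumption that the instrument codes (PN, M1, M2, R1, R2, OM) appear verbatim in ODF file names should be verified before relying on this:

import os
from glob import glob

def remove_unselected_odfs(odf_dir, chosen_insts):
    # Assumed naming convention: each ODF file name contains its instrument code
    all_insts = {'PN', 'M1', 'M2', 'R1', 'R2', 'OM'}
    for inst in all_insts - set(chosen_insts):
        for file_path in glob(os.path.join(odf_dir, '*{}*'.format(inst))):
            os.remove(file_path)

remove_unselected_odfs('raw_data/0099280101/odf', chosen_insts={'PN'})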

Add previous-process awareness to XMM processing tasks

Need to ensure that things are run in the right order - for instance cif_build must be run before everything else, odf_ingest must be run before basically everything else, etc.

Currently I just rely on the user doing that, but that won't be a permanent state of affairs - I'll make use of the process_success property of Archive to check whether dependencies have been run, and whether they were successful.
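A minimal sketch of the sort of guard I have in mind, assuming (and this structure is an assumption) that process_success maps process names to per-ObsID success flags:

def check_dependencies(archive, required=('cif_build', 'odf_ingest')):
    # Hypothetical guard - refuse to continue if a prerequisite process
    # either hasn't been run at all, or failed for every observation
    for proc in required:
        success = archive.process_success.get(proc)
        if success is None:
            raise ValueError("{} has not been run yet".format(proc))
        if not any(success.values()):
            raise ValueError("{} failed for every observation".format(proc))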

Ensure that a new CCF is created if a different analysis date is used

In the case where CCFs already exist but cif_build is run again with a different analysis date set, make sure they are overwritten. The date information should be stored somewhere as well.

This will be integrated into the backend database I suspect, in some way that I have yet to figure out.

If a CCF is re-created, then presumably reduction should be re-run to be completely valid?
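One possible way to track the date, sketched with a hypothetical JSON sidecar file next to the CCFs (this is not how DAXA currently stores anything):

import os
import json

def ccf_needs_rebuild(ccf_dir, analysis_date):
    # Hypothetical sidecar recording the analysis date the CCF was built with
    record = os.path.join(ccf_dir, 'ccf_analysis_date.json')
    if not os.path.exists(record):
        return True
    with open(record) as rec_file:
        return json.load(rec_file).get('analysis_date') != analysis_date

If this returns True, cif_build should be re-run (and, presumably, everything downstream of it).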

cleaned_evt_lists fails for 0099280101 because of emanom and calclosed

Currently no checks are performed at any stage to identify the filter value of a particular sub-exposure of an observation, and as such everything is blindly thrown into emanom (if the user chooses to run it). This method will fail for any CalClosed filter data, which then carries through to cleaned_evt_lists, because DAXA tries to create cleaned versions of those event lists as well and expects there to be an emanom log file, even though CalClosed data are not useful observation data for us.
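A hedged sketch of an a priori check that could avoid this, reading the FILTER keyword from each event list header with astropy; the keyword name and the 'CalClosed' value follow this issue's description and standard XMM headers, but should be verified against real files:

from astropy.io import fits

def is_cal_closed(evt_path):
    # Read the FILTER keyword from the primary header of the event list;
    # CalClosed sub-exposures should be skipped rather than fed to emanom
    return fits.getval(evt_path, 'FILTER', ext=0).strip().lower() == 'calclosed'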

Add an Archive class

Instances of the Archive class will be capable of storing and accessing multiple missions, and will probably be the most user-facing class of this module. They will contain a bunch of convenience methods, and probably the planned sky coverage generation capabilities (issue #18).

Start work on the DAXA paper for JOSS

It won't need to be very long, and some liberties can be taken in terms of writing about features that don't exist yet, because this won't be going on arXiv until they do exist.

SAS v21's (upcoming not released) eSAS implementation is quite different from previous versions

I am currently implementing an eSAS-based XMM processing method, and have accidentally found the eSAS v21 manual indexed on Google. It indicates that many of the eSAS tools have had their inputs changed considerably to better resemble normal SAS functions (i.e. they'll take arguments to point to specific event lists etc.).

This is obviously great, more control is better, but it does mean that there will be a significant difference in behaviour. As I do not want to lock people into one specific version of SAS if it can be avoided (especially considering that version isn't even out yet) I will have to build two different approaches (though within the same Python function) for SAS v21 and any lower SAS version (though I don't think I will allow any SAS version below v14).

Hopefully this won't be too difficult; considering I already identify the installed SAS version in the find_sas function, it'll just be some extra work.

XMM scheduled and unscheduled observations

I want to ensure that any unscheduled (with U in their exposure identifier rather than S) PN observations are processed by epchain, but it's not clear to me whether that is true by default.

You can set the 'schedule' flag in epchain to S or U, but it only triggers if odfaccess=odf rather than oal, which is not explained...

To be honest, exactly what an 'unscheduled' observation is isn't really explained either.

Failed to find or open the following file: (ffopen) toto.in.mos[1]

This is happening during emchain runs, and at another point in the stderr output there is 'sh: lcurve: command not found'.

I suspect they might be connected.

lcurve is part of the xronos section of HEASoft, which I may not have selected for my laptop install of HEASoft. This could help me learn which parts of HEASoft are actually required for SAS to work in its entirety.

My ICER install of HEASoft is the whole thing, so I can test running emchain on there to see if the same problem pops up.

I should normalise how DAXA calls emchain and epchain as much as possible

Currently emchain will loop through all available sub-exposures, including unscheduled observations, without any extra intervention. As such the processing of an entire ObsID-MOSX set of data happens as one process.

As epchain has to have the sub-exposures manually specified, each sub-exposure of each observation is processed separately. As such it gets its own success/log/error entry in the Archive records - considerably more granular.

I think I should change emchain's behaviour in DAXA so it is more comparable to how epchain behaves. I can address separate sub-exposures by themselves in emchain (using the exposure argument) - this will also make it easier to check that a particular process for a particular sub-exposure did work when it comes to looking for anomalous CCD states in MOS observations.
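Something like the per-sub-exposure invocation below is what I have in mind; the emchain argument names here follow my reading of the SAS documentation, so double-check them (and note that the usual SAS environment, e.g. SAS_ODF/SAS_CCF, is assumed to already be set up):

from subprocess import run

def single_exp_emchain(instrument, exp_id):
    # Process one MOS sub-exposure at a time, mirroring how epchain
    # is already called once per sub-exposure
    cmd = "emchain instruments={i} exposures={e}".format(i=instrument, e=exp_id)
    run(cmd, shell=True, check=True)

single_exp_emchain('M1', 'S001')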

NOTE ON ExceptionGroup IN JUPYTER NOTEBOOK

Very limited parts of DAXA use a new Python feature (introduced in 3.11, and backported by the exceptiongroup module) that allows me to raise a set of exceptions together.

Specifically, this is used when Python errors occur during the parallel tasks that run command-line SAS tools (and possibly other telescope-specific command-line tools in the future) - to be clear, Python errors shouldn't happen in those parallelised tasks, but if they do, an ExceptionGroup is used.

It seems that, at the moment (this is true on my setup on the date this issue was created), Jupyter notebooks do not show the tracebacks properly for ExceptionGroup. For instance, in a notebook a test-raised ExceptionGroup gives this traceback:

ExceptionGroup: pythony errors (3 sub-exceptions)

Whereas in a script run from terminal this is what you get (and should get):

  + Exception Group Traceback (most recent call last):
    | File "/Users/dt237/code/test_daxa/testo.py", line 12, in
    | success, errors, outs = cif_build(arch)
    | ^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 209, in wrapper
    | raise ExceptionGroup("pythony errors", python_errors)
    | ExceptionGroup: pythony errors (3 sub-exceptions)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +------------------------------------

So just be aware of that!
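For anyone wanting to reproduce this, here is a self-contained snippet that raises a comparable ExceptionGroup; running it in a notebook and then as a terminal script should show the difference:

import sys

# On Python < 3.11 ExceptionGroup is not built in, so use the backport
if sys.version_info < (3, 11):
    from exceptiongroup import ExceptionGroup

errs = [NameError("name 'boi' is not defined") for _ in range(3)]
raise ExceptionGroup("pythony errors", errs)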

Small-window mode PN processing errors

When running epchain on small-window mode data without some extra configuration, it will throw errors when it finds that most of the CCD IME files are missing (small-window mode just uses one CCD). These errors aren't fatal to the epchain process, but they do contaminate the stderr output, which DAXA parses to try to find any truly fatal errors.

As such we should identify which CCDs are available a priori and pass that list to the ccds parameter of epchain. Ideally this will eventually be done by parsing the SAS summary file (issue #34), but for now I think I can just search through files in the ODF directory.
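As a stop-gap, something like the sketch below could work; the assumed file-name pattern (instrument code, schedule flag, exposure number, then a two-digit CCD number before 'IME') should be checked against real ODFs before trusting it:

import os
import re

def find_pn_ccds(odf_dir, exp_id):
    # Assumed pattern: e.g. ..._PNS00304IME.FIT, where '04' is the CCD number
    patt = re.compile(r'PN{}(\d{{2}})IME\.FIT'.format(exp_id))
    ccds = set()
    for file_name in os.listdir(odf_dir):
        match = patt.search(file_name)
        if match:
            ccds.add(int(match.group(1)))
    return sorted(ccds)

# The resulting list could then be passed to epchain's ccds parameter
print(find_pn_ccds('raw_data/0099280101/odf', 'S003'))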

Support the acquisition and reduction of proprietary data

What it says on the tin really. For XMM for instance you need to provide a login and password, and I'll also have to make sure that the proprietary data belonging to a particular user are marked as usable in the fetch_obs_info method, as currently all proprietary observations are marked as unusable.

Process logging storage keys

Currently the logs, errors, processed errors, and warnings are stored in archives under either an ObsID or an ObsID+instrument+sub exposure ID combo.

This is somewhat at odds with what the docstrings in the Archive class say, as they state either an ObsID or an ObsID+instrument key combo.

I should consider having lower level instrument and then sub-exposure dictionaries to store the results/logs in, rather than ObsID+instrument+exposure ID. I intend to implement some sort of lookup method that can grab all results for an ObsID, or a specific ObsID instrument combo, and that would probably be easier with more distinct layers of dictionaries.
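The nested layout I am considering would look something like the illustrative (not currently implemented) structure below:

# Hypothetical nested storage: ObsID -> instrument -> sub-exposure -> log
process_logs = {
    '0099280101': {
        'PN': {'S003': 'epchain output for this sub-exposure...'},
        'M1': {'S001': 'emchain output for this sub-exposure...'},
    },
}

# A lookup method could then grab everything for an ObsID, or narrow down
all_for_obs = process_logs['0099280101']
just_pn = process_logs['0099280101']['PN']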
