gthompson / kitchensinkgt Goto Github PK

Miscellaneous research codes, organized into projects with a common set of libraries. This is just where I actively work, before packing codes up for release when publishing papers, datasets etc.

Python 1.65% Jupyter Notebook 96.97% MATLAB 0.02% HTML 1.30% Shell 0.01% Perl 0.02% Rich Text Format 0.01% IDL 0.03%

seismology obspy astropy astrophysics volcanology personal-finance volcano-seismology volcano-infrasound

kitchensinkgt's People

Contributors

Stargazers

Watchers

kitchensinkgt's Issues

How big is MSEED compared to PICKLE format?

Spikes and dropouts

Q: Spikes and dropouts are quite commonplace due to telemetry problems / interference. My QC codes eliminate some of these in the pre-processing. Probably good to make you broadly aware of these problems, but not go into specific examples.

A: I know these problems exist in the dataset, but they have not come up with the ~600 events I have relabelled so far. So figure out how to deal with them when I find them.

My QC metrics seem to be working!

Should event-receiver distance be a feature?

Q: Events also look substantially different depending on whether they are recorded close or far away. I haven’t thought much about this, but perhaps distance of event from station, or location, should be a ‘feature’ in the model, or used in weighting somehow. The problem is we don’t have locations for 99% of events, but the Amplitude Source Location method I independently invented in 2000 could give useful constraints.

A: Our discussion centred on the fact that an event near MBGH could look like noise on MBWH and vice versa. So we need a way to filter out noise traces - perhaps with a separate ML step, or regular QC metrics, e.g. looking to see if there are significant amplitude and frequency changes within a waveform. Or we could eliminate noise traces manually (or do event labelling based on each channel, rather than each event), and build up a set of labelled noise versus signal for an earlier step of machine learning. If we do not do this, event classifications could be biased by noise, e.g. MBWH LP class might be trained on 70 LPs and 30 noise signals. Same for all channels and event classes. (I recall that Jean-Philippe had also suggested training a separate noise model).

Add frequency change metrics?

Q: Some signals such as hybrids and lp-rockfalls change frequency (high to low for hybrids, low to high for lp-rockfalls). So can we think of a good way to measure the frequency change during the signal itself (and not during the noise). Perhaps scan the signal in 3-s sliding window and record the highest and lowest number of zero crossings per second, and their times, and then these frequencies, the frequency change and the time between low f and high f (-ve time for hybrid, +ve for lp-rockfall) could be features?

A: She noted a spectrogram already includes this information, so maybe we should just use a spectrogram as a feature? The problem, as Alexis had mentioned, is that waveforms are vastly different in length, and AAA needs a fixed length feature vector. Nevertheless, I could compute what I suggested above from the spectrogram, similar to where I compute other frequency metrics from the spectrogram.

Work on metrics to replicate AEF files

Implement energy-magnitude scale

Compute the energy-magnitude scale.
To do this, compute waveform energy on each station (Z component or 3-component vector?), correct for distance and Q, to get an energy magnitude for each station. The network magnitude should then be the median of these (or a weighted mean, e.g. weight estimates by distance from median).

Next stage would be to correlate against magnitudes computed by MVO, SRC, Guadeloupe etc.

Could also compute duration magnitudes.

refactor 00 using enhance_waveform_file as a template, then re-run 00 on Hal

improve metrics/signaltonoise function?

Why not instead just take the ratio of the amplitudes of the "loudest" second and the "quietest" second?
def signaltonoise(tr):

function snr, highval, lowval = signaltonoise(tr)

# Here we just make an estimate of the signal-to-noise ratio
#
# Normally the trace should be pre-processed before passing to this routine, e.g.
# * remove ridiculously large values
# * remove any sequences of 0 from start or end
# * detrend
# * bandpass filter
#
# Processing:
#    1. ensure we have at least 1 seconds
#    2. take absolute values
#    3. compute the maximum of each 1-s of data, call this time series M
#    4. compute 95th and 5th percentile of M, call these M95 and M5
#    5. estimate signal-to-noise ratio as M95/M5
#
# Correction. The above seems like a poor algorithm. Why not instead just take the ratio of the amplitudes of 
# the "loudest" second and the "quietest" second.

Create a mapping from all WAVfiles back to Sfiles too

correct channel mappings

Analog network clipping

Q: There is also another seismic network (the “analog” network), which had much less dynamic range, but captured the only data from the first 20 months of the eruption, so these signals are an important target. The result is that many signals are “clipped” (saturated).

A: Marielle agreed it makes sense to train on the digital network first, and then just try to apply this to analog network and see how it does. And then iterate from there, coming up with strategies to train a model and/or correct saturated data.

M: Also this might be interesting from a “can our model train on some data generalize to different data” point of view. Would be interesting in a publication!

Michi Wenner's Jupyter notebook

Try Michi Wenner's (or was it Miriam's?) Jupiter Notebook on my dataset, and see what features she had

Reconcile number of events in catalog on Hal9000

There are 400,977 lines in rea_MVOE_all.csv on hal9000.

But cum_classes_by_month_on_hal.txt adds up to 235,804.

Ensure I use enough events to train a model

For every feature and new type of time series or domain you add, need more events to train a model.

What is compute_Gains.py?

Do we want to mix corrected & uncorrected metrics in event CSV and Pickle files?

Or send them to different versions of the CSV file?

Use multiple stations

Q: What are some possible strategies to take advantage of the entire seismic network? So far, I trained models on what are the three most common seismic channels (vertical components from 3 different stations). I was thinking of just implementing some sort of weighted voting system. So for each event and each channel, we get a probability for each event class. Then we do a weighted average of these probabilities across all channels. Others have just treated one event recorded on, say, 10 different channels as 10 separate events. But that ignores the information that we know they are the same event, and likely adds confusion as each seismic site has different characteristics that make waveforms for the same event significantly different.

A: I tried explaining that for each event channel, we can get a probability vector from scikit-learn, of length equal to event class. And then we could just sum or average over all event channels available for that event. And this would still work even if we had just one channel (indeed, for long periods in late 1997 after most of network destroyed, we only have MBWH-Z). But I don't think my explanation was clear, as Marielle felt the scheme would fall apart if all channels not available. I think I need to create an example to explain better.

M: I understand your thought I believe. My suggestion would have been to implement a ML algorithm instead of the weighting systems, which would have been able to learn that some events still need to be detected even if only seen on 2 or 3stations (which would probably be lost in the weighting system). But this solution raise the problem that to train such a model, all stations would need to be available at all times. Hence the weighted system might be more adapted.

Incorporate 3-component stations

For 3-component stations, we may compute polarization parameters too (linearity, planarity and ellipticity traces), and treat these as separate traces. Cannot really derive separate features from them, as those features would be missing for 1-C stations.

Do all events need same sampling rate?

Q: How important is it to ensure that all “signals” have the same sampling rate? For example, from 1996-2004 the sampling rate was 75 Hz, and from 2005-today it is 100 Hz, and there is likely a transition period where some stations were 75 Hz and others were 100 Hz. If we used the wave polarization traces I suggested in item 2 below, the sampling rate might be much lower, perhaps 1 Hz or 10 Hz.

A: Marielle seemed to think the code took care of sampling rate differences. But we could also just try it. This could impact 2 above, as polarization metrics are likely to be at much lower sampling frequency.

M: Need to check that it is ok in the code. The danger would be to introduce a biais in the dataset …

Statistical (non-ML) approach

Complete my effort from SeisanDB to CSV. See if there are additional metrics in Tanya's work.

Write code to compile existence of different NSLC combos, whether data corrected, and whether channel name correct

Average parameters over events by label, grouped by traceID. Do not average over all traceIDs

I previously averaged metrics for each label, ignoring different traceIDs. Better would be to average over events, but grouped by TraceID since each station/channel sees events differently.

extract any useful functions from seisanDBwrappers.py

Merge stationXML/inventory objects

Recrunching the data progress so far

10:55 am on Tuesday November 30th
1996-2002 complete on hal
2003-08-16 hal (4.5 months left)
2005-11-29 hal (1 month left)
2006-03-17 hal (9.5 months left)
2007-03-07 newton (10 months left)
2008-07-28 faraday (1 month left)
total of 26 months left

add hal job to run 2007-07 and beyond. so then only need newton (which is extremely slow) to do 4 more months.

Hopefully by tomorrow morning, all will be finished.

Convert Seisan CAL files from BBaptie to stationXML and try to add to my workflow

Classification check - Richard Luckett

Send reclassified events to Richard and ask him to independently verify

merge code from enhance_waveform_file into 00

Any advantage to using instrument corrected traces?

Probably not since we use a 0.5 Hz high pass (check this), and response is flat above this for all sensors. And AAA code normalized the amplitudes anyway. Note that bandpass filter in AAA is currently turned off.

Instrument corrected traces and metrics computed from them more useful for other catalog-based work, e.g. AEF, energy-magnitude scale, plots etc.

Modify 00_convert_seisandb_to_csvfiles.py to process uncorrected and corrected files simultaneously?

See MVO-Seisan-event-database-to-Pickle-CSV-files

01_compute_gains.py appears to be an attempt to generalize 00_convert_seisandb_to_csvfiles.py to create both uncorrected and corrected pickle files and traceCSV files. Consider merging elements of this 01_compute_gains.py into 00_convert_seisandb_to_csvfiles.py, to process uncorrected and corrected waveforms within the same workflow.

75.0 or 75.19 Hz? need to check continuous vs event data and merge

definitely 75.0

see experiment_75Hz notebook

Add pre-computed features to the 34x3 computed by AAA so far.

What is the best way to add pre-computed features to the 34x3 you used? As part of my automatic data QC, I compute many time series and spectral parameters, and they reside in Pickle files and also in CSV files.
The current method I use, where I have written featureFunctions to read precomputed features from files, works but is inefficient as it repeats the same numbers in every domain. I think I need to find where in the code the feature vector is sent to scikit-learn, and add external features to the vector there.
Marielle was confused, and suggested doing what I had already done.

Refactor all codes to use libMontyML

Add a new metric to compute reduced displacement

Add metrics to compute reduced displacement and reduced pressure. These could be incorporated into event catalog monthly CSV files, along with ML, ME and seismicenergy etc.

Incorporate 3-D data

Q: What are some possible strategies for considering 3-component seismic data? Seismologists visually identify different wave types (e.g. P, S, Love, Rayleigh) best by examining 3-component data. Even better, we use tools which compute the planarity, rectilinearity and ellipticity of the particle motion/wave polarization, and show those as time series plots. So my thinking was simply to treat these as additional numpy time series arrays and compute features on them.

A: OK to add these as new traces, though many features computed on them might not make any sense. Can always implement some sort of feature compression, to remove non-sensical features (e.g. I could implement this at same stage mentioned in 1 above).

M: Always the question of how local or how global you want the model to be: one model per axis, one model per station, one model per volcano. Curse of Dimensionality might be a problem, if so need to perform feature compression or selection.

Find the most important features

perform some sort of PCA to identify most important features for each class, and that way reduce the feature vector length and computation time. (Especially for real-time applications). Alexis can help with this

Add autocorrelation function as new domain

Q: My collaborator from 15-20 years ago, Horst Langer, used the autocorrelation function rather than the raw seismic event waveform. Would it make sense to add that as a new domain, in addition to time series, spectral and cepstral? Or would it be redundant?

A: Add autocorrelation as a new domain and see if it helps.

Merge processed events from MacBook2020 with much larger unprocessed catalog from hal9000

Around 17,000 S-files were processed with https://github.com/gthompson/kitchensinkGT/blob/master/PROJECTS/MontserratML/00_convert_seisandb_to_csvfiles.py on my 2020 MacBook. But there are around 400,000 on hal9000. A bug had introduced lots of incorrect entries in the reawav_MVOE_YYYYMM.csv files, so I wrote https://github.com/gthompson/kitchensinkGT/blob/master/PROJECTS/MontserratML/tool_remove_duplicates.ipynb to rectify this, and I think I fixed the original bug too. However, I now need to load the REA_WAV_MVOE_YYYY_MM.csv files on hal9000 and rectify against https://github.com/gthompson/kitchensinkGT/blob/master/PROJECTS/MontserratML/catalog_unique.csv which came from the MacBook2020. For example, I may need to read the D, R, r, e, l, h, t columns and definitely the new_subclass, weight, checked, split, delete, ignore and main class columns.