Hi 👋🏻 I'm an independent cloud architect and software alchemist.
I help build, operate, and manage software systems. I wrote my first line of code in 1989 and have over 25 years' experience shipping to production.
Want some help on a short- or long-term project? I'm available! My clients range from individuals to Fortune 50 enterprises. Check out my site for more: https://www.redwoodconsulting.io/
Some projects outside my client work:
DeepCell AI cellular segmentation for cancer research. [repo]
I'm working with the DiMi Lab to deploy DeepCell imaging tools on Google Cloud. It's a zero-to-one effort running DeepCell on the cloud in their target environment.
DeepCell was developed by the Van Valen Lab at Caltech.
I'm writing a tool for You Need A Budget to streamline my personal workflow. I want to mark shared transactions as split to a certain category, but it's rather a hassle to do by hand (especially on mobile devices). I don't intend to truly productionize this, but I might. It's a web app written in Kotlin compiled to JavaScript, because, why not?
I'm a Google Cloud Expert, and hold a 2nd degree black belt in Seido Karate.
Larger data testing is tedious because post-processing is Hella Slow™: 8 minutes or more for 1.3 GB inputs.
Note that infrastructure doesn't seem to make a big difference for post-processing time, and the GPU is not used at all during this phase (based on monitoring charts plus knowledge of the implementation).
This represents post-processing time broken down by machine type, GPU (or not), and input size. Note that post-processing time doesn't vary much across configurations.
It would be really nice if we didn't have to wait for this.
(1) Skip post-processing in benchmarks.
    - Note in the benchmark data whether post-processing was run.
    - The output is a bit meaningless in terms of correctness.
(2) Speed up post-processing. #28
Option 1 could look like this: skip post-processing by passing a no-op function as the postprocessing_fn in the constructor of the Application object (we may need to create a subclass of the Mesmer class to override the constructor).
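A minimal sketch of that idea, assuming DeepCell's Application stores its postprocessing_fn from the constructor (the subclass name is hypothetical):

```python
import numpy as np

# No-op replacement for the expensive post-processing step. The output
# won't be meaningful for correctness, but it's fine for benchmarking
# the prediction phase alone.
def noop_postprocessing_fn(model_output, **kwargs):
    """Return the raw model output unchanged."""
    return model_output

# Subclass sketch, assuming the deepcell package is installed
# (left commented so this snippet stays self-contained):
#
# from deepcell.applications import Mesmer
#
# class MesmerNoPostprocessing(Mesmer):
#     def __init__(self, model=None):
#         super().__init__(model=model)
#         self.postprocessing_fn = noop_postprocessing_fn
```

The benchmark notebook could then record "post-processing: skipped" alongside the timing data.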
Relating to #94, we need to determine whether CPU prediction is affected by first vs. subsequent runs.
Task: run the ~230 MB sample through the benchmark on n1-standard-8 (no GPU, batch size 16), then restart the kernel (NOT a new instance) and run it again.
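Within a single kernel session, the same warm-up question can be probed by timing consecutive calls (a sketch; `predict_fn` is a stand-in for the benchmark's prediction call):

```python
import time

def time_runs(predict_fn, data, runs=2):
    """Time consecutive calls in the same process. A large gap between
    run 1 and run 2 suggests warm-up effects (graph tracing, caching)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(data)
        timings.append(time.perf_counter() - start)
    return timings
```

The kernel-restart variant above is still needed to distinguish per-process warm-up from per-instance effects (e.g. disk caches surviving the restart).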
From the commit history, it looks like the dataset may have been replaced with the tissue_net dataset. The expected hash values don't match, but this could just be the naming inside the .npz file.
Objective of this work: determine the difference between the old commit data (which is still available on s3 as of 2023-11-17 at least) and the newly available tissue net data.
When running the e2e benchmark notebook on Vertex AI, there was a kernel warning:
```
2023-12-03 07:19:17.937327: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/jupyter/.local/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-12-03 07:19:17.937385: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-12-03 07:19:17.937413: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (72a01191f8f9): /proc/driver/nvidia/version does not exist
```
I think this is because we're using a TF 2.10 kernel but have installed TF 2.8 (DeepCell's dependency).
If the kernel is relevant: how can we fix this?
If the kernel is irrelevant: can we use something different? Like a basic python kernel?
I'm not sure how much to worry about this; perhaps it means we (and/or DeepCell??) aren't using modern Vertex AI kernels optimally…
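Independent of which kernel we pick, a quick sanity check for the missing driver library the warning complains about, using only the standard library:

```python
import ctypes

def cuda_driver_loadable():
    """Check whether libcuda.so.1 (the NVIDIA driver library named in
    the TF warning) can be loaded in this environment. Returns False
    on instances without a GPU driver, which would make the warning
    expected rather than a misconfiguration."""
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False
```

If this returns False on a supposedly GPU-enabled instance, the problem is the instance/driver setup, not the TF 2.8 vs. 2.10 kernel mismatch.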
As a user I can: go to the GitHub repo, download it locally, get the IPython notebook, get the sample data, upload the notebook to the test environment (Vertex AI), and configure it (instance types/sizes, GPUs, ...) to verify. (Part of the test is figuring out the config parameters.)
Notebook that runs prediction on parameterized input file
```
def test_zero_image_one_mask():
    """Test reconstruction with an image of all zeros and a mask that's not"""
    result = reconstruction(np.zeros((10, 10)), np.ones((10, 10)))
>   assert_array_almost_equal(result, 0)
E   AssertionError:
E   Arrays are not almost equal to 6 decimals
E
E   Mismatched elements: 100 / 100 (100%)
E   Max absolute difference: 1.
E   Max relative difference: inf
E    x: array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
E          [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
E          [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],...
E    y: array(0)

test_reconstruction.py:113: AssertionError
```
I'm surprised to see max abs = 1 and max rel = inf (the relative difference is inf because the expected value is 0). It's also 100% of elements mismatched, so something weird is happening.
The test test_two_image_peaks asserts that out.dtype == _supported_float_type(mask.dtype). Meanwhile, the current reconstruct implementation does indeed create the result image as a float, even if the inputs were ints to begin with.
I'm not sure we need to (always?) do this. The core of the algorithm is to adjust to the neighborhood max. The max can't have more precision than any of the starting numbers, and the max can't be capped to more precision than the mask precision. So if ints are masking ints, why not have int results?
However, floats masked by ints are floats, and arguably ints masked by floats should be floats. For example, 10 mask 0.51 should be 5.1, not 5. (Right?)
The current behavior is to always return floats. This is undesirable for performance, as it precludes updating in place. I wonder if we can simply ship this as new behavior; downstream usages could be affected if they assume floats and start getting ints. Can we control it via "Yet Another Parameter"™️?
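If we do ship dtype-preserving behavior, the output dtype could simply follow NumPy's standard promotion rules, which match the reasoning above (`reconstruction_dtype` is a hypothetical helper, not the current API):

```python
import numpy as np

def reconstruction_dtype(marker_dtype, mask_dtype):
    """Pick the output dtype by standard NumPy promotion: int inputs
    stay int, and a float appears only when one of the inputs is float."""
    return np.result_type(marker_dtype, mask_dtype)
```

Int-in/int-out would then allow in-place updates, while float inputs still promote the result to float as the tests expect.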
The model file is relatively large (100 MB). Cache the download to disk to avoid refetching. Also, the notebook doesn't support downloading the model in the first place 😬
Use this gs uri: gs://davids-genomics-data-public/cellular-segmentation/deep-cell/vanvalenlab-tf-model-multiplex-downloaded-20230706/MultiplexSegmentation.tgz
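A disk-cache sketch (the cache directory is an assumption, and it shells out to `gsutil cp`, which must be on the PATH):

```python
from pathlib import Path
import subprocess

MODEL_URI = (
    "gs://davids-genomics-data-public/cellular-segmentation/deep-cell/"
    "vanvalenlab-tf-model-multiplex-downloaded-20230706/MultiplexSegmentation.tgz"
)
# Hypothetical cache location; any persistent directory works.
CACHE_DIR = Path.home() / ".cache" / "deepcell-imaging"

def cached_model_path():
    """Download the model archive once; later calls reuse the cached file."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / "MultiplexSegmentation.tgz"
    if not target.exists():
        subprocess.run(["gsutil", "cp", MODEL_URI, str(target)], check=True)
    return target
```

First run pays the download cost; every kernel restart after that reads from disk.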
Some earlier samples were generated with a previous convention, following the DeepCell API (one file == a 4D array starting with num samples).
The other samples are 3D: x, y, channel. (One array == one input)
This creates problems because worksheets & people don't know which shape to expect, and therefore whether they need to add a new axis or not.
We should normalize one way or the other. My general thinking is that a thing is a single thing until it is a group of things, which we could represent as either a list of things or a numpy vector of the things. In other words, the shape of a single data example is not a list.
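Whichever convention we pick, a small shim at the API boundary can absorb the other shape (a sketch; `ensure_batched` is a hypothetical helper, assuming DeepCell's 4D convention of (batch, x, y, channel)):

```python
import numpy as np

def ensure_batched(arr):
    """Normalize a sample to 4D (batch, x, y, channel): a single 3D
    example (x, y, channel) gets a new leading axis; a 4D batch
    passes through unchanged."""
    if arr.ndim == 3:
        return arr[np.newaxis, ...]
    if arr.ndim == 4:
        return arr
    raise ValueError(f"expected a 3D or 4D array, got {arr.ndim}D")
```

Then worksheets can always hand data to the model the same way, regardless of which convention the sample file used.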
To support #10 , let's at least add DeepCell's Mesmer data (multiplex_tissue) to the repo. This gives us an easily accessible starting point for test data (albeit quite small at 512 x 512).
The persistent disk is a relatively small expenditure ($0.14 per day for a forgotten 100 GB persistent disk). We probably don't need to worry too much about this.
Still, it would be nice to know if we're vastly over-provisioned. Let's try running a benchmark with a ~10 GB persistent disk, or 50 GB. Use one of the larger files.
Also consider simply not caring for now, assuming DevOps processes & cost monitoring would catch the issue. (Really, though?) It's still just a few cents.
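The back-of-envelope math, assuming roughly $0.04/GB-month for a standard persistent disk (the exact rate varies by region and disk type):

```python
# 100 GB standard persistent disk, assumed ~$0.04 per GB-month:
disk_gb = 100
monthly_usd = disk_gb * 0.04        # about $4.00 per month
daily_usd = monthly_usd * 12 / 365  # about $0.13 per day
```

So even a forgotten disk costs on the order of cents per day, which supports the "don't worry too much" conclusion.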
The larger sample data I've been using, Xenium_FFPE_Human_Breast_Cancer_Rep1_if_image.tif, was obtained from 10x Genomics, but isn't super easy to fetch, especially not programmatically.
It would be nice if we had a comparable ~500MB sample available.
The optimization only supported the dilation method: finding local maxima. We need to support erosion (local minima) as well. It's basically a question of flipping the min/max signs.
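The sign flip in one place (a sketch; `clamp_step` is a hypothetical helper, not the fast-hybrid code itself): reconstruction by dilation propagates maxima and clamps the marker below the mask, while erosion is the dual, propagating minima and clamping above the mask.

```python
import numpy as np

def clamp_step(marker, mask, method="dilation"):
    """One pointwise clamp of the reconstruction loop: dilation keeps
    the marker <= mask; erosion (the dual) keeps the marker >= mask."""
    if method == "dilation":
        return np.minimum(marker, mask)
    if method == "erosion":
        return np.maximum(marker, mask)
    raise ValueError(f"unknown method: {method}")
```

In the fast-hybrid scan, the same flip applies to the neighborhood reduction (max for dilation, min for erosion), so one `method` parameter should cover both.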
The cython fast-hybrid implementation is a bit "raw", requiring manually running cythonize in the right directory, etc.
The file should be repackaged into a proper module in deepcell-imaging, with appropriate setup.py etc. so that pip knows to build the extension as part of installation.
This could also be accomplished by publishing the fast-hybrid implementation as its own library, and including that as a dependency to this repo.
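A packaging sketch for the in-repo option (module path and file names are assumptions about the eventual layout):

```python
# setup.py (sketch; assumes the Cython source lives at
# deepcell_imaging/fast_hybrid.pyx)
import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="deepcell-imaging",
    packages=["deepcell_imaging"],
    ext_modules=cythonize("deepcell_imaging/fast_hybrid.pyx"),
    include_dirs=[np.get_include()],
)
```

With modern pip, cython and numpy would also need to be declared as build requirements in pyproject.toml's `[build-system] requires` so the extension builds in an isolated environment.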