abagen's Introduction

abagen: A toolbox for the Allen Brain Atlas genetics data

This package provides a Python interface for fetching and working with the Allen Human Brain Atlas (AHBA) microarray expression data.

Overview

In 2013, the Allen Institute for Brain Science released the Allen Human Brain Atlas, a dataset containing microarray expression data collected from six human brains (Hawrylycz et al., 2012). This dataset has offered an unprecedented opportunity to examine the genetic underpinnings of the human brain, and has already yielded novel insight into, e.g., adolescent brain development and functional brain organization.

However, in order to be used effectively in most analyses, the AHBA microarray expression data often need to be (1) collapsed into regions of interest (e.g., parcels or networks), and (2) combined across donors. While these steps may seem trivial, they involve a number of analytic choices that can dramatically influence the resulting data and any downstream analyses. Arnatkevičiūte et al., 2019 provided a thorough treatment of this in a recent publication, demonstrating how the techniques and code used to prepare the raw AHBA data have varied widely across published reports. We extended this work in a recent preprint (Markello et al., 2021) to quantify how different processing choices can impact statistical analyses of the AHBA.

The current Python package, abagen, aims to provide reproducible workflows for processing and preparing the AHBA microarray expression data for analysis.

Installation requirements

Currently, abagen works with Python 3.6+ and requires a few dependencies:

  • nibabel
  • numpy (>=1.14.0)
  • pandas (>=0.25.0), and
  • scipy

There are some additional (optional) dependencies you can install to speed up some functions:

  • fastparquet, and
  • python-snappy

These latter packages are primarily used to facilitate loading the (rather large!) microarray expression dataframes provided by the Allen Institute.

For detailed information on how to install abagen, including these dependencies, refer to our installation instructions.

Quickstart

At its core, using abagen is as simple as:

>>> import abagen
>>> expression = abagen.get_expression_data('myatlas.nii.gz')

where 'myatlas.nii.gz' points to a brain parcellation file.

This function can also be called from the command line with:

$ abagen --output-file expression.csv myatlas.nii.gz

For more detailed instructions on how to use abagen please refer to our user guide!

Development and getting involved

If you've found a bug, are experiencing a problem, or have a question about using the package, please head on over to our GitHub issues and make a new issue with some information about it! Someone will try to get back to you as quickly as possible, though please note that the primary developer for abagen (@rmarkello) is a graduate student, so responses may take some time!

If you're interested in getting involved in the project: welcome ✨! We're thrilled to welcome new contributors. You should start by reading our code of conduct; all activity on abagen should adhere to the CoC. After that, take a look at our contributing guidelines so you're familiar with the processes we (generally) try to follow when making changes to the repository! Once you're ready to jump in head on over to our issues to see if there's anything you might like to work on.

Citing abagen

For up-to-date instructions on how to cite abagen please refer to our documentation.

License Information

This codebase is licensed under the 3-clause BSD license. The full license can be found in the LICENSE file in the abagen distribution.

Reannotated gene information located at abagen/data/reannotated.csv.gz and individualized donor parcellations for the Desikan-Killiany atlas located at abagen/data/native_dk are taken from Arnatkevičiūte et al., 2018 and are separately licensed under the CC BY 4.0; these data can also be found on figshare.

Corrected MNI coordinates used to match AHBA tissue samples to MNI space, located at abagen/data/corrected_mni_coordinates.csv, are taken from the alleninf package, provided under the 3-clause BSD license.

All microarray expression data is copyrighted under non-commercial reuse policies by the Allen Institute for Brain Science (© 2010 Allen Institute for Brain Science. Allen Human Brain Atlas. Available from: human.brain-map.org).

All trademarks referenced herein are property of their respective holders.

abagen's People

Contributors

4lovi4, abkosar, ellen-77, gifuni, gshafiei, ianyliu, jamesfrierson1, liuzhenqi77, michaelnicht, rmarkello, vincebaz, yingqiuz

abagen's Issues

Add option to use Allen ontology for collapsing across samples

The issue

In addition to the MNI coordinates of each sample, the Allen Institute provides a detailed ontology of the brain structure each sample was ostensibly taken from in each donor brain. Used in a bunch of published works (e.g., Hawrylycz et al., 2015), this ontology provides a built-in "atlas" for reducing microarray samples down into the sort of region x gene dataframes returned by abagen.get_expression_data().

Taking from the aforementioned paper:

For each brain, 345–911 samples spanning one (n = 4) or both (n = 2) hemispheres were analyzed using whole-genome Agilent microarrays. In total, samples from 232 discrete brain structures were sampled at least once in at least one brain. We first focused on comparing expression patterns for a smaller set of 96 brain regions that were sampled at least twice in at least five brains, pooling across hemispheres.

Proposed solution

Allow an alternative to the currently implemented abagen.get_expression_data() function (which requires a Nifti-like atlas) that lets users combine samples using the built-in Allen Institute ontology. By default I think the 232 structures should be used, and then users can specify (via the return_counts parameter) whether they want to know how many samples were pooled for each structure and reduce down to the 96 as desired.

Perhaps this could even be the default for abagen.get_expression_data()? That is, make atlas an optional parameter and, if none is supplied, use the Allen ontology?
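
A rough sketch of that proposed change (names here are illustrative, not final):

def get_expression_data(atlas=None, atlas_info=None, return_counts=False,
                        **kwargs):
    if atlas is None:
        # `_collapse_by_ontology` is a hypothetical helper that would group
        # samples into the 232 Allen ontology structures described above
        return _collapse_by_ontology(return_counts=return_counts, **kwargs)
    # ... existing Nifti-based workflow continues here ...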

Include other metrics for probe selection

The issue

We currently implement a differential stability metric in order to select a representative probe for each gene in the Allen data. To do this we calculate the correlation of each probe (across brain regions) between every pair of donors (6 choose 2 = 15 pairs) and then select the probe with the highest average correlation across all the pairs; the remaining probes are discarded.
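
For concreteness, a rough sketch of that computation (illustrative only, not the actual abagen implementation):

import itertools
import numpy as np

def differential_stability(expression):
    # `expression` is a list of (probe, region) arrays, one per donor
    pair_corrs = []
    for don1, don2 in itertools.combinations(expression, 2):  # 15 pairs
        # correlate each probe's regional expression pattern between donors
        pair_corrs.append([np.corrcoef(p1, p2)[0, 1]
                           for p1, p2 in zip(don1, don2)])
    # average correlation across donor pairs, one value per probe
    return np.asarray(pair_corrs).mean(axis=0)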

While there's precedent for this in the literature, there are plenty of other feasible mechanisms by which to collapse across probes!

Proposed solution

Implementing any of these additional metrics taken from Table 2 of the manuscript by Arnatkevic̆iūtė et al. (2018) would be fantastic!

[Table 2 of Arnatkevič̆iūtė et al. (2018), listing candidate probe-selection metrics]

Extra space in `abagen.correct.remove_distance` doc-string

The issue

Introduced by #76. Due to an errant space in the doc-string of abagen.correct.remove_distance() the rendered Sphinx documentation for the function looks like this:

[screenshot of the mis-rendered Sphinx documentation for remove_distance()]

That's...not right. There shouldn't be the line split and the weird highlighting on the first part of the description.

Proposed solution

Removing the extra space on L47 of abagen.correct right before "between" should do the trick! That is, the doc-string should be changed as follows:

     residualized : (R, R) numpy.ndarray
         Provided `coexpression` data residualized against spatial distance
-         between region pairs
+        between region pairs

Add issue + PR templates

The issue

As of #105 we have some vastly improved contributing guidelines! To help smooth out the contribution process even more it would be great to include issue and PR templates.

Proposed solution

Add in some templates! There are some fantastic step-by-step walkthroughs for creating issue and pull request templates directly from the GitHub UI.

I'm flexible about what the exact body of the templates should look like, but there are lots of examples that we can pull from!

Unquote printed Allen API URL query in abagen.mouse

The issue

The following code snippet highlights the undesirable behavior:

>>> from abagen import mouse
>>> info = mouse.get_structure_info(id=22, attributes='name', verbose=True)
'Querying https://api.brain-map.org/api/v2/data/Structure/query.json?criteria=%5Bid%24eq22%5D%2Contology%5Bid%24eq1%5D&only=name%2Cid&'
>>> info
                                    name
id                                      
22  Posterior parietal association areas

That URL printed by the call to mouse.get_structure_info() is "quoted" for the purposes of making the actual API query, but should be "unquoted" for printing purposes since it's almost entirely unreadable by the average person! The whole point of printing the query URL is to make sure that it's formatted correctly, but I can't tell if it's formatted correctly if I can barely read the thing.

Proposed solution

In abagen.mouse.utils, change L63 from:

        print("Querying {}...".format(url))

to

        print("Querying {}...".format(urllib.parse.unquote_plus(url)))

should do the trick! We only unquote for the purposes of printing the URL, but leave the actual URL quoted for the query.
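
For reference, a quick demonstration of the difference using the standard library:

>>> from urllib.parse import quote, unquote_plus
>>> quoted = quote('criteria=[id$eq22]', safe='=')
>>> quoted
'criteria=%5Bid%24eq22%5D'
>>> unquote_plus(quoted)
'criteria=[id$eq22]'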

Refactor API reference docs

The issue

The current documentation for abagen is a bit of a mess. In particular, the API reference is totally overwhelming to navigate, despite not even being that long or in-depth! There's definitely a need to update the docs a bit more generally (e.g., #39), but the API requires some special love and attention.

Proposed solution

Refactoring the API reference in the docs to look more like, e.g., netneurotools would be awesome! In particular, I really like the little tables with brief descriptions of the functions that then link out to more detailed descriptions of their use. That layout is adapted from nilearn, so using the .rst files from those toolboxes should be sufficient to move abagen in the right direction.

Command line implementation of get_expression_data()

The issue

The abagen.get_expression_data() function is the primary workflow of use in the abagen toolbox. Right now, everything else in the toolbox (besides the abagen.mouse module) supports that function. It might be nice, therefore, to allow access to the function via the command line instead of having to use it from within a Python terminal session for those who are less comfortable with Python.

Proposed solution

Add a CLI for abagen.get_expression_data() that allows users to point to their atlas image file and specify an output CSV file for the dataframe(s) returned by the function. I'm imagining something like:

abagen --output expression.csv my_atlas.nii.gz

which would use my_atlas.nii.gz to generate a region x gene dataframe that is saved to expression.csv in the current directory.

All the parameters of the function should be accessible via optional arguments to the function, and I am happy to discuss / iterate on the actual command name itself, but this might be a nice feature for some users!
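
A bare-bones sketch of the entry point (argument names are illustrative and up for discussion):

import argparse
import abagen

def main():
    parser = argparse.ArgumentParser(prog='abagen')
    parser.add_argument('atlas', help='Nifti parcellation image')
    parser.add_argument('--output', default='expression.csv',
                        help='Path for the output region x gene CSV')
    args = parser.parse_args()
    # run the primary workflow and save the resulting dataframe
    abagen.get_expression_data(args.atlas).to_csv(args.output)

if __name__ == '__main__':
    main()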

Update documentation for abagen.mouse

Thanks to @yingqiuz, abagen can now query the Allen Brain database for mouse in situ hybridization data! 🎉 Unfortunately, this is documented exactly nowhere outside of the codebase. This should be rectified! It would be great to update the documentation to reflect the new functionality of the abagen.mouse module with some good examples and how-to guides.

Currently the documentation is a bit sparse (just a generic install/usage with a brief API reference), so there's not a great location to add the mouse docs. We should think about how to restructure the documentation to be a bit more extensible, especially since there are plans to continue developing / growing abagen! For now, I think adding a separate page (mouse.rst) after the current "user guide" will suffice, but this is definitely the time to start thinking about what the purpose of abagen is and how the documentation can support that purpose.

Allow single hemisphere mirroring with `lr_mirror`

The issue

As of #87, abagen.get_expression_data() accepts the lr_mirror parameter (a boolean; default: False) that will, if set, mirror tissue samples bilaterally across hemispheres, effectively duplicating the number of samples and dramatically increasing the spatial coverage of the AHBA dataset.

There are some instances, however, where it might be desirable to not mirror samples bilaterally and rather only mirror left --> right or right --> left.

Proposed solution

Modify the lr_mirror parameter to accept 'left' and 'right', which would specify the SOURCE hemisphere from which samples would be mirrored.

That might be a bit confusing (though the documentation should be clear), so maybe it would be good to also accept some other inputs (e.g., 'lr' would mean 'mirror left to right' and 'rl' would be the opposite)? Open to suggestions on what exactly the choices would be such that this would make sense and be the least confusing.

I still think by default any form of mirroring should be OFF, but it would be nice to expose these additional options!

RuntimeWarning when using DK atlas in example

Raised initially in #42.

The issue

Running the code below raises the following RuntimeWarning:

>>> import abagen
>>> atlas = abagen.fetch_desikan_killiany()
>>> expression = abagen.get_expression_data(atlas.image, atlas_info=atlas.info, exact=False)
/path/to/python/site-packages/scipy/ndimage/measurements.py:1328: RuntimeWarning: invalid value encountered in true_divide
  for dir in range(input.ndim)]

While harmless (the outputs seem to look okay) it's frustrating to get a warning when running an example!

The source

From some digging it looks like this is being triggered when trying to match a sample (specifically, well_id 2572) to regions in abagen.allen._assign_sample, and only occurs when atlas_info is provided.

The issue seems to be that one of the regions in the DK atlas that overlaps (or is close to) this sample does not match the structure/hemisphere designation of the sample. As a result of this mismatch, the label (region 81) is set to 0. Unfortunately, this 0 is never discarded and is instead passed to scipy.ndimage.center_of_mass(), which then attempts to find the center of mass of all the zero values in the atlas image. Because center_of_mass considers 0 to be a background value, trying to specify it as a label to be estimated gives strange results.

The solution

Changing the following lines:

From abagen/allen.py, lines 62 to 66 (at commit 195293a):

    if atlas_info is not None and sample_info is not None:
        for old_label in labels:
            new_label = _check_label(old_label, sample_info, atlas_info)
            nz_labels[nz_labels == old_label] = new_label
        labels, counts = np.unique(nz_labels, return_counts=True)

to

    if atlas_info is not None and sample_info is not None:
        for old_label in labels:
            new_label = _check_label(old_label, sample_info, atlas_info)
            if old_label != new_label:
                nz_labels[nz_labels == old_label] = new_label
        labels, counts = np.unique(nz_labels[nz_labels.nonzero()],
                                   return_counts=True)

should address the issue. (The if old_label != new_label addition isn't entirely necessary but is a nice check, nonetheless.)

This will ensure that the array of labels passed to scipy.ndimage.center_of_mass will contain no zero values.

Include demographic information on donors

The issue

The AHBA donors differ (pretty significantly) on age, sex, ethnicity, medical conditions, and post-mortem interval. It might be nice to provide an easy mechanism for users to get this demographic information for either reporting with results or using to correct expression data. It's all neatly summarized in Table 3 of the Arnatkevičiūte et al., 2019 manuscript!

Proposed solution

The following function (perhaps in abagen.datasets) would be great! We can probably just include this as a CSV with the abagen distribution:

def fetch_donor_info(donors='all'):
    """
    Returns dataframe with donor demographic information

    Parameters
    ----------
    donors : list, optional
        List of donors for whom to fetch demographic info; can be either donor 
        number or UID. Can also specify 'all' to get info for all available
        donors. Default: 'all'

    Returns
    -------
    info : pandas.DataFrame
        With columns ['donor', 'age', 'sex', 'ethnicity', 'medical_conditions', 
        'postmortem_interval'] detailing basic demographic info about donors
    """

Move tests for submodules into submodules

The issue

Currently we have three (tiny) submodules in abagen:

  1. abagen.mouse,
  2. abagen.cli, and
  3. abagen.datasets

The tests for abagen.mouse live inside the submodule, but it would be great if we could move the tests for the other two into their respective modules as well!

Proposed solution

Move abagen/tests/test_datasets.py into abagen/datasets/tests/test_datasets.py and move abagen/tests/test_cli.py into abagen/cli/tests/test_run.py.

It might also be good to rename abagen/datasets/datasets.py to something like abagen/datasets/fetchers.py for clarity purposes! The relevant test file should be renamed in this case, as well.

Fetching of T1/2w MRI images for donors

The issue

We currently permit automated fetching of the microarray expression data from the Allen Institute database. Even though the raw MRI images are likely less immediately useful to users, it would be great to provide a similar API call for fetching them.

Proposed solution

Add a function abagen.fetch_raw_mri() to the codebase. I imagine it would accept all the same arguments as abagen.fetch_microarray() and simply download the T1/2w images for the specified subjects to the same location as the microarray expression data!
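
Something like the following signature, mirroring fetch_microarray() (purely illustrative), could work:

def fetch_raw_mri(data_dir=None, donors=None, resume=True, verbose=1):
    """
    Downloads T1w/T2w MRI images from the Allen Human Brain Atlas

    Accepts the same parameters as `abagen.fetch_microarray()` and saves the
    images alongside the microarray data for each specified donor
    """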

See #44 for additional information on path handling.

Add interpolation for missing ROIs

The issue

When a region of interest in a user-specified atlas contains no expression samples we currently either (1) leave the region blank, filling the region's expression values with NaNs or (2) find the sample closest to the center-of-mass of the region and assign those expression values.

Neither of these leaves me feeling particularly warm and fuzzy, though the first option is the current default for a reason. Still, having alternatives would be awesome!

Proposed solution

One alternative that's been previously published on (Romero-Garcia et al., 2018) involves interpolating missing regions using nearby samples. This could potentially get highly complicated as there may be situations where the values would need to be extrapolated, not interpolated, based on the proximity of surrounding samples and that's...never fun. Nonetheless, this would be a nice alternative to the nearest-neighbor matching currently implemented.

In the event this is implemented I imagine that the exact parameter currently used in abagen.get_expression_data() would change from being a boolean to accepting string inputs: (1) None indicates to leave the region blank and fill with NaNs, (2) 'nearest' would indicate to perform the current matching, and (3) 'linear' would indicate to perform an interpolation procedure. It should probably also be renamed to fill_missing to indicate that this procedure will only be applied to regions missing samples.
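
As a rough illustration of what the 'linear' option could do under the hood (a sketch that assumes we have sample MNI coordinates, their expression values, and region centroids in hand):

import numpy as np
from scipy.interpolate import griddata

def interpolate_missing(sample_xyz, sample_values, centroid_xyz):
    # `sample_values` is an (n_samples,) vector of expression values for a
    # single gene; linearly interpolate at region centroids, falling back to
    # nearest-neighbor wherever a centroid lies outside the samples' convex
    # hull (where linear interpolation returns NaN)
    interp = griddata(sample_xyz, sample_values, centroid_xyz,
                      method='linear')
    nearest = griddata(sample_xyz, sample_values, centroid_xyz,
                       method='nearest')
    return np.where(np.isnan(interp), nearest, interp)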

Revamp testing framework to alleviate downloads from Allen

The issue

The first test build of #67 broke due to timeout errors attempting to connect to the Allen Institute API. This is the first instance of such an error, and it's very possible it's an isolated event, but it's worth considering alternative methods for running the test suite.

Currently, for every test build, we download microarray expression data for two donors from the Allen Institute. The data are cached and used for every test in the build, but with six builds (x2 for push and PR) this can get quite onerous.

Beyond the human microarray expression downloads, the abagen.mouse test suite makes a number of calls to the Allen API to query different data on mouse microarray expression. None of these calls are particularly "heavy" but the scale of them can be something that might be...less than ideal.

Proposed solution

Human microarray data

Perhaps it would be worth adopting something like how nilearn tests their dataset fetchers (e.g., here). They pre-generate (empty) files that are named like the real files such that when they call the dataset fetcher function the empty files are returned without ever having to download anything. This would greatly speed up the amount of time spent in the tests since we wouldn't be waiting for the downloaded data!

Unfortunately, in order to run the large majority of the tests, we do need some sort of data to work with. One option is to simply cache the microarray data with Travis. An alternative is to generate simulated data "on-the-fly" that looks like the AHBA data, but is simply hard-coded into the tests. This would have the added benefit of probably being much, much smaller than the real data and therefore also making tests go a bit faster.
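
For the simulated-data route, a pytest fixture along these lines (shapes and column names are entirely illustrative) could stand in for the real downloads:

import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def fake_microarray():
    # tiny stand-in for the real AHBA files: 10 probes x 20 samples
    rng = np.random.RandomState(1234)
    microarray = pd.DataFrame(rng.rand(10, 20))
    probes = pd.DataFrame({'probe_id': np.arange(10),
                           'gene_symbol': ['gene%d' % i for i in range(10)]})
    return microarray, probes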

Mouse microarray data

I'm honestly not sure of a good alternative to testing the calls to the Allen API! Will need to think on this...

Add Zenodo integration to repository

The issue

Currently in the "Acknowledgments" section of the README we provide some guidance on citing abagen:

...if you use this code it would be good to...provide a link back to the ``abagen`` 
repository with the version of the code used...

This is fine, but it would be great to provide some other means by which to cite abagen!

Proposed solution

Zenodo + GitHub have some really amazing features that allow syncing a GitHub repository such that every time a new release is made it triggers Zenodo to issue a new (citeable!) DOI. Setting this up and then adding a bit to the README describing how to cite the Zenodo release of abagen would be awesome!

Change all within-package imports to be relative

The issue

Currently abagen.correct and abagen.datasets use absolute imports, which are formatted as:

from abagen import x

instead of relative imports, which are formatted as:

from . import x

All other modules in abagen use the relative import method, and it would be great to make this consistent throughout the package. While there's some contention about which of these is more appropriate, I personally prefer using relative imports within packages.

Proposed solution

Modifying the imports in abagen.correct and abagen.datasets to use the relative import syntax should mean simply replacing the following lines, where the red lines beginning with - indicate the current import format and the green lines beginning with + indicate the desired import format:

abagen.correct, L11

- from abagen import utils
+ from . import utils

abagen.datasets, L16

- from abagen import io
+ from . import io

Referenced line numbers are accurate as of 77cb1f7.

Include RNAseq data fetchers + loaders?

The issue

For two of the AHBA donors, RNA sequencing was run on a number of tissue samples. These data are, generally, of higher quality than the microarray data used throughout most of abagen. It might be nice to include an automated way to fetch and load these data for people who want to do...things...with them. Maybe, eventually, we will think about including some potential things you can do with RNA sequencing data. For now, we should consider adding ways to get and load it easily.

Proposed solution

Adding the following function to abagen.datasets.fetchers:

def fetch_rnaseq(data_dir=None, donors=None, resume=True, verbose=1):
    """
    Downloads RNA-sequencing data from the Allen Human Brain Atlas

    Parameters
    ----------
    data_dir : str, optional
        Directory where data should be downloaded and unpacked. Default:
        current directory
    donors : list, optional
        List of donors to download; can be either donor number or UID. Can also
        specify 'all' to download all available donors. Default: 9861
    resume : bool, optional
        Whether to resume download of a partly-downloaded file. Default: True
    verbose : int, optional
        Verbosity level (0 means no message). Default: 1

    Returns
    -------
    data : dict
        Dictionary with keys ['counts', 'tpm', 'ontology', 'genes',
        'annotation'], where corresponding values are lists of filepaths to
        downloaded CSV files.

    References
    ----------
    Hawrylycz, M. J., Lein, E. S., Guillozet-Bongaarts, A. L., Shen, E. H., Ng,
    L., Miller, J. A., ... & Abajian, C. (2012). An anatomically comprehensive
    atlas of the adult human brain transcriptome. Nature, 489(7416), 391.
    """

and the following functions to abagen.io:

def read_tpm():
def read_counts():
def read_genes():

should be a good start! The current abagen.io.read_annotation and abagen.io.read_ontology work for the ontology and annotation files shipped with the RNA sequencing data, already, too.
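
The new readers could be thin wrappers around pandas; a sketch (exact file layouts would need checking against the downloaded data):

import pandas as pd

def read_tpm(fname):
    # genes x samples matrix of transcripts-per-million values
    return pd.read_csv(fname, header=None, index_col=0)

def read_counts(fname):
    # genes x samples matrix of raw RNAseq counts
    return pd.read_csv(fname, header=None, index_col=0)

def read_genes(fname):
    # metadata for the genes included in the RNAseq data
    return pd.read_csv(fname)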

Installation issue -- pandas.io has no attribute parquet

Package looks great; I recently tried to install it out of curiosity!

I created a new python 3.6 environment and installed the package with the [io] option. I did this both through git and pip install. Both times, when trying to import, I got the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-1-1a564bd5acfd> in <module>()
----> 1 import abagen

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_backward_compatible(spec)

~/anaconda2/envs/py3/lib/python3.6/site-packages/abagen-0.0.1-py3.6.egg/abagen/__init__.py in <module>()
      5 from .info import __version__
      6 
----> 7 from abagen.allen import get_expression_data
      8 from abagen.correct import keep_stable_genes, remove_distance
      9 from abagen.datasets import fetch_microarray, fetch_desikan_killiany

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_backward_compatible(spec)

~/anaconda2/envs/py3/lib/python3.6/site-packages/abagen-0.0.1-py3.6.egg/abagen/allen.py in <module>()
     12 from scipy.spatial.distance import cdist
     13 
---> 14 from abagen import datasets, io, process, utils
     15 
     16 

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_backward_compatible(spec)

~/anaconda2/envs/py3/lib/python3.6/site-packages/abagen-0.0.1-py3.6.egg/abagen/datasets.py in <module>()
     16 import requests
     17 from sklearn.utils import Bunch
---> 18 from abagen import io
     19 
     20 WELL_KNOWN_IDS = Recoder(

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)

~/anaconda2/envs/py3/lib/python3.6/importlib/_bootstrap.py in _load_backward_compatible(spec)

~/anaconda2/envs/py3/lib/python3.6/site-packages/abagen-0.0.1-py3.6.egg/abagen/io.py in <module>()
     12 import pandas as pd
     13 try:
---> 14     eng = pd.io.parquet.get_engine('fastparquet')
     15     assert 'SNAPPY' in eng.api.compression.compressions
     16     use_parq = True

AttributeError: module 'pandas.io' has no attribute 'parquet'

Apologies if this is a silly problem on my end. Let me know if I can help or provide more information!
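
For anyone hitting this: the traceback suggests abagen is relying on the private pd.io.parquet namespace, which has moved across pandas versions. A more defensive check (a sketch, not necessarily the eventual fix) would import the optional dependencies directly:

try:
    # probe for the optional parquet dependencies directly instead of
    # reaching into pandas' private pd.io.parquet namespace
    import fastparquet  # noqa
    import snappy  # noqa -- provided by the python-snappy package
    use_parq = True
except ImportError:
    use_parq = False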

Manage versioning with versioneer

The issue

Currently, versioning for abagen (i.e., the single version that exists) has been ad-hoc. Since I'm planning (read: hoping) to ramp up development soon it might be nice to develop a more principled system for versioning.

Proposed solution

Use versioneer! Thankfully the set-up for versioneer is pretty straightforward:

Quick Install

1. `pip install versioneer` to somewhere in your $PATH
2. Add a [versioneer] section to your setup.cfg, and
3. Run `versioneer install` in your source tree and then commit the results
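
For step 2, a typical [versioneer] section in setup.cfg looks something like the following (values here are illustrative defaults and would need checking against the versioneer docs):

[versioneer]
VCS = git
style = pep440
versionfile_source = abagen/_version.py
versionfile_build = abagen/_version.py
tag_prefix =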

It would be great to get abagen set up to use this system.

Error when only one donor provided to `abagen.get_expression_data()`

Calling abagen.get_expression_data() with files from only a single donor currently results in a (very lengthy and overwhelming) error; this is not great.

The error is generated by pandas during a call to process.get_stable_probes(), a function that picks a single "stable" microarray probe to represent expression data for a given gene amongst many potentially redundant or noisy probes. The selection process is performed by computing a stability metric ("differential stability"), examining which probes have the most consistent brain-wide expression pattern across donors; when only one donor is provided, there is no way to compute this cross-donor metric!

I see two potential solutions to this that might happen at varying timeframes:

Interim solution

So that users don't receive an ugly error, we could check whether the files argument provided to get_expression_data() has more than one donor and raise an error if not. Indeed, adding to the existing checks implemented against files could handle this. Something like the following could work:

for key in ['microarray', 'probes', 'annotation', 'pacall', 'ontology']:
    if key not in files:
        raise KeyError('Provided `files` dictionary is missing {}. '
                       'Please check inputs.'.format(key))
    if len(files.get(key)) <= 1:
        raise ValueError('Files from at least two donors are required for '
                         'generating expression data; {} {} file(s) were '
                         'provided.'.format(len(files.get(key)), key))

However, this is only a temporary solution: even though users should generally provide information for all available donors, it is not a strict requirement, and they should not get an ugly error when they don't.

Long-term solution

There are two longer-term solutions to this problem:

  1. Completely circumvent the call to get_stable_probes() within get_expression_data() if only a single donor is provided, or
  2. Permit other mechanisms for reducing the number of probes besides the currently implemented "differential stability" metric.

Given that the latter solution would be a fantastic enhancement to the current module it would be my preference to enact that one! The recent paper by Arnatkevičiūte et al., 2018 had a good breakdown of some alternative metrics (pg 8, Table 2) that could serve as a starting point for this!

Add additional donor normalization options

The issue

The current method for normalizing between-donor microarray expression differences prior to collapsing across donors is to use a scaled robust sigmoid correction procedure. This is cool and good and I think the right move, but is perhaps a bit non-standard so providing alternative options would be nice.

Proposed solution

Adding a parameter to abagen.get_expression_data() like donor_norm and adding the ability to z-score the data would be cool. That way someone could run either:

abagen.get_expression_data(atlas, donor_norm='srs')

or

abagen.get_expression_data(atlas, donor_norm='zscore')

and get back interpretable results.
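
For reference, rough sketches of the two normalizations on a single vector of expression values (illustrative; treat the scaled robust sigmoid formulation here as one common variant, not necessarily abagen's exact implementation):

import numpy as np

def srs(x):
    # scaled robust sigmoid: sigmoid on median/IQR-normalized values,
    # rescaled to the unit interval (IQR / 1.35 approximates the SD)
    med = np.median(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    sig = 1 / (1 + np.exp(-(x - med) / (iqr / 1.35)))
    return (sig - sig.min()) / (sig.max() - sig.min())

def zscore(x):
    return (x - x.mean()) / x.std()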

My only concern with this solution is that the number of optional parameters for abagen.get_expression_data() is getting a bit overwhelming! It might be worthwhile making this function a "workflow" (with a CLI, too?) and refactoring some of the other functions to be more modular / tool-oriented so that a user could construct an alternative workflow if they wanted.

Add fetcher for Gryglewski et al., 2018 expression maps

The issue

There was a really nice paper last year by Gryglewski et al., 2018 that used some variogram modelling to generate "smooth" microarray expression maps for all the genes with valid Entrez IDs from the AHBA dataset. They've generously made these maps available online for download.

Proposed solution

I'm proposing a new function in abagen.datasets that lets people fetch maps for a given gene, provided they have the Entrez ID:

def fetch_gryglewski_map(entrez, data_dir=None, resume=True, verbose=1):
    """
    Fetches expression maps from [DS2]_ for gene with Entrez ID `entrez`

    Parameters
    ----------
    entrez : int
        Entrez ID of gene for which to fetch expression map
    data_dir : str, optional
        Directory where data should be downloaded and unpacked. Default:
        current directory
    resume : bool, optional
        Whether to resume download of a partly-downloaded file. Default: True
    verbose : int, optional
        Verbosity level (0 means no message). Default: 1

    Returns
    -------
    maps : dict
        Dictionary with keys ['vol', 'vol_mirror', 'surf', 'surf_mirror'],
        where corresponding values are lists of filepaths to downloaded Nifti
        (volumetric) and MGH (surface) files.

    References
    ----------
    .. [DS2] Gryglewski, G., Seiger, R., James, G. M., Godbersen, G. M.,
    Komorowski, A., Unterholzner, J., ... & Kasper, S. (2018). Spatial analysis
    and high resolution mapping of the human whole-brain transcriptome for
    integrative analysis in neuroimaging. NeuroImage, 176, 259-267.
    """

    # fetch and return dictionary with files for specified map
    # can use other abagen.datasets.fetch_X functions as models

Improve data fetcher path handling

The issue(s)

If a user does not have a local copy of all the Allen Institute human microarray data then, when calling abagen.get_expression_data, the data are automatically fetched. Currently, this fetching procedure will create a directory in the user's current directory called allenbrain and all the data are downloaded and unzipped in there. Users can optionally provide a path to the data_dir parameter, but this requires that the specified path be to a directory inside of which is a directory labelled allenbrain with all the necessary data. This can be confusing as many people think that data_dir should be the allenbrain directory itself.

Furthermore, most users might not want to have to constantly point to the directory where they store it (or deal with constantly downloading duplicate copies of it in their current working directory). It might be nice to think about using a $HOME directory (a la nilearn, sklearn, etc.) to store the downloaded data and attempt to fetch from, if no data_dir path is provided.

Proposed solutions

To address the first issue we could modify the checking of the data_dir parameter to accept EITHER the current scenario (a path to a directory inside of which is a directory labelled allenbrain with all the necessary data) OR a path to a directory inside of which is all the Allen Brain Atlas data (i.e., it does not need to be labelled allenbrain). So, providing either data_dir=/home/ross/Desktop or data_dir=/home/ross/Desktop/allenbrain in the following example should work:

>>> import abagen
>>> atlas = abagen.fetch_desikan_killiany()

# allenbrain directory exists on Desktop
>>> data_dir = '/home/ross/Desktop'
>>> abagen.get_expression_data(atlas.image, data_dir=data_dir)

# data_dir points to allenbrain directory itself
>>> data_dir = '/home/ross/Desktop/allenbrain'
>>> abagen.get_expression_data(atlas.image, data_dir=data_dir)

assuming that /home/ross/Desktop/allenbrain was organized as:

allenbrain/
├── normalized_microarray_donor9861/
├── normalized_microarray_donor10021/
...
└── normalized_microarray_donor15697/

To address the second issue we could consider creating the allenbrain directory in the user's $HOME directory by default (instead of in the current directory). If it doesn't already exist there, we can do a quick check in the current working directory (and the data_dir path, if provided) and, if nothing else exists, fetch the data to the $HOME directory.

Error using downloaded data

Hi Ross,

I am keen to use your code to generate genes for a parcellation scheme.

I have already downloaded the Allen data, as I have been using it for a while; however, when I use

import abagen
files = abagen.fetch_microarray(data_dir='/Users/petermccolgan/github/abagen/allen_data', donors='all')

It starts downloading the data again and creates an allenbrain folder in that path.

I already have the data downloaded in that path and just wanted to point to where it is.

Do you know how I can do this?

On the off chance you're at OHBM this week it would be great to meet up and get a few pointers if you're free (no worries if not though)

Thanks

Peter

Change abagen.mouse data fetcher caching location

The issue

If you use any of the data fetchers in the abagen.mouse module the results are cached in the installed abagen distribution. Unfortunately, if you install a new version of abagen (e.g., via pip install --upgrade abagen) that distribution is wiped, so the next time the data are fetched they'll need to be re-downloaded from the internet which can be costly / annoying.

Proposed solution

Cache the data in a different location! This is highly related to #44; it might make a lot more sense to use a centralized location for all the abagen data fetched from the internet.

Add sets of pre-determined gene groups

The issue

Burt et al., 2018 performed their analyses using pre-determined groups of genes (i.e., "brain-specific," "neuron- and oligodendrocyte-specific," etc.). It would be nice to incorporate these "standard" sets into abagen so that users don't have to dig through the supplementary materials to get them. (Also, the excel file that contains the gene groups for the aforementioned article auto-converted some of the gene symbols to dates, so converting those back for users would be helpful!)

Another potential source for gene groups would be the Molecular Signatures Database, which includes groupings for different biological processes.

Proposed solution

Collect the gene symbols for the different groupings into separate CSV files and include them with the abagen distribution in abagen/data/gene_sets/.

Also add a new function (abagen.get_gene_group(group)) to the codebase, where group can be any of the available groups (e.g., 'brain', 'neuron', 'oligodendrocyte', 'synaptome', 'layers'). This function would simply load the queried set from the distribution and return a list of gene symbols!
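
A sketch of the loader (paths, filenames, and column names here are illustrative):

import pandas as pd
from pkg_resources import resource_filename

def get_gene_group(group):
    # load the requested gene set shipped with the abagen distribution
    fname = resource_filename('abagen',
                              'data/gene_sets/{}.csv'.format(group))
    return pd.read_csv(fname)['gene_symbol'].tolist()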

Make use of pytest parametrize consistent in testing framework

The issue

There's a really elegant feature of pytest that allows you to provide a list of parameters to a testing function and it will run the test once for each set of parameters (see: pytest.mark.parametrize). This is currently used in some of the abagen.tests functions, but not all, which is frustrating because inconsistencies in the testing framework make figuring out the best way to write new tests difficult!

Proposed solution

Update all current tests that don't use the pytest.mark.parametrize and instead have a bunch of for loops in them (or a bunch of serial one-liners) to use the parametrize marker! From a quick look, the list is (in no particular order):

  • abagen.tests.test_allen
    • test_extra_get_expression_data()
  • abagen.tests.test_correct
    • test_resid_dist()
    • test_keep_stable_genes()
  • abagen.tests.test_io
    • test_readfiles()
  • abagen.tests.test_utils
    • test_coords_transforms()
    • test_check_atlas_info()
    • test_check_metric()

We can use any of the below as examples for how to implement the pytest.mark.parametrize function:

  • abagen.mouse.tests.test_mouse
    • test_get_experiments_from_gene()
    • test_get_unionization_from_experiment()
    • test_get_unionization_from_gene()
  • abagen.mouse.tests.test_structure
    • test_get_structure_coordinates()
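
For anyone picking this up, here's a generic illustration of the marker (not taken from the abagen test suite):

import pytest

@pytest.mark.parametrize('values, expected', [
    ([1, 2, 3], 2.0),
    ([0, 10], 5.0),
])
def test_mean(values, expected):
    # pytest generates one test per (values, expected) pair defined above
    assert sum(values) / len(values) == expected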

Automatically fetch human microarray data

Currently you have to provide the files downloaded from abagen.fetch_microarray() to abagen.get_expression_data() as in a typical workflow:

>>> import abagen
>>> files = abagen.fetch_microarray(donors='all')
>>> atlas = abagen.fetch_desikan_killiany()
>>> data = abagen.get_expression_data(files, atlas.image)

It would be great if the latter function simply called the fetcher internally so users would just need to run:

>>> import abagen
>>> atlas = abagen.fetch_desikan_killiany()
>>> data = abagen.get_expression_data(atlas.image)

Ideally you should be able to specify which donors should be used with abagen.get_expression_data() (as you can do with the donors keyword argument in abagen.fetch_microarray()). That way people can control if they want to exclude, e.g., the donors with samples from both hemispheres.

This would require a minor refactor of abagen.get_expression_data() to (1) remove the files argument and (2) adding the donors keyword argument (default set to 'all'), so the signature would look like:

def get_expression_data(atlas, atlas_info=None, *, donors='all', exact=True,
                        tolerance=2, metric='mean', ibf_threshold=0.5,
                        corrected_mni=True, reannotated=True,
                        return_counts=False, return_donors=False):

The only internal change needed would be to immediately call files = abagen.fetch_microarray(donors=donors) right at the top of the function.

It might be good to consider also exposing the data_dir keyword argument of fetch_microarray() in the signature of get_expression_data() in case users have the AHBA files downloaded somewhere non-standard and want to specify the path, but that can be discussed!

Add within-sample (across-gene) normalization option

The issue

As of #90 users have the option to specify what type of donor-level normalization they would like to perform via the donor_norm parameter in abagen.get_expression_data(). This normalization procedure attempts to correct for gross donor-level differences in microarray expression levels across samples and genes.

Critically, this normalization is performed separately for each donor within-genes and across-samples (unless donor_norm='batch', but that's a special case), and is done after samples have been aggregated into regions in the user-provided atlas. It might be nice to provide an additional option to normalize within-samples and across-genes prior to aggregating samples within regions in the user-provided atlas. That would reduce the possibility that an "outlier" sample might influence a given region's expression values.

Proposed solution

Add (yet another) optional parameter to abagen.get_expression_data(): sample_norm. This will mirror the functionality of the donor_norm parameter in that it will accept 'srs', 'zscore', and None as inputs (but not 'batch', because that doesn't really make sense...).

Importantly, the normalization will take place after probe filtering / selection but prior to aggregating samples within atlas regions. The sample normalization should be performed separately for each donor. We should be able to use the abagen.correct.normalize_expression() function for this (just making sure that we transpose the dataframes before passing them, since by default it operates on the column-wise axis), so implementation shouldn't be too difficult. An if-clause in abagen.get_expression_data() to the effect of:

if sample_norm is not None:
    data = correct.normalize_expression(data.T, norm=sample_norm).T

should work! I think this would be best positioned within the for subj in range(num_subj) for-loop in the get_expression_data() function, somewhere before groupby_label() is called.

Move to full setuptools install

The issue

Always a fan of how the folks over at nibabel run things, I thought it might be nice to mirror their recent switch to a fully setuptools-based installation build. This is incredibly low priority since the current install procedure works just fine, but it seems nice / easier to maintain!

Proposed solution

I'm honestly not sure of all the required steps here, but referring to the setuptools documentation on this would probably be a good start! Happy to discuss / dig into this when the times comes.

Get expression samples inside a supplied mask

The issue

Sometimes people don't want ROI x gene expression matrices! 😮 Sometimes, as in this recent paper by @illdopejake, they want all the samples inside (or near) a mask. It would be great to supply this option!

Proposed solution

A function called something like abagen.get_samples_in_mask() that returns a dictionary (?) where the keys are the donor IDs and the values are lists of sample IDs that fall within a user-specified distance of the mask (i.e., provide the ability to dilate the mask on-the-fly in the event that no samples are directly inside the mask). This would allow users to then use the relevant samples for whatever analyses they wanted to do!

I'm open to the function returning something else instead of the dictionary. It could, for example, return a list of well IDs (which are unique) for each sample, with no mention of the donor from which they were drawn. In this case a suite of helper functions where you could supply well IDs and get the expression data for the given IDs without having to specify the donor from which they originated would be super useful. (Behind the scenes this could simply iterate through the donors and match them to the specified well IDs).

def get_samples_in_mask(mask, tolerance=None):
    """
    Returns all expression samples that are within supplied `mask`

    Parameters
    ----------
    mask : niimg-like object
        A boolean mask in MNI space. True (or 1) where a voxel should be
        considered.
    tolerance : int, optional
        Distance (in mm) that a sample must be from `mask` for it to be
        retained. If not set only samples directly within the supplied mask
        will be retained. Default: None

    Returns
    -------
    samples : dict
        Where keys are donor IDs and values are lists of sample IDs for samples
        from that donor that fall within `mask`

    << OR >>

    samples : numpy.ndarray
        List of sample IDs that fall within `mask`
    """

    # relevant code here
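
A brute-force sketch of the matching itself, under the assumption that each donor's annotation dataframe carries mni_x/mni_y/mni_z and well_id columns (inefficient, but it illustrates the idea):

import nibabel as nib
import numpy as np

def _samples_in_mask(mask, annotation, tolerance=0):
    img = nib.load(mask)
    # xyz (mm) coordinates of all non-zero voxels in the mask
    ijk = np.column_stack(np.nonzero(np.asarray(img.dataobj)))
    mask_xyz = nib.affines.apply_affine(img.affine, ijk)
    coords = annotation[['mni_x', 'mni_y', 'mni_z']].to_numpy()
    # minimum distance (mm) from every sample to the mask
    dist = np.linalg.norm(coords[:, None] - mask_xyz[None], axis=-1).min(axis=1)
    return annotation.loc[dist <= tolerance, 'well_id'].tolist()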

Add L/R flipping procedure (Romero-Garcia et al., 2018, NeuroImage)

The issue

Only two of the six donors from the Allen Institute have samples taken from the right hemisphere of the brain, so many analyses ignore the right hemisphere and only focus on the left. Still, it seems wasteful to simply drop these samples, so finding a way to use them would be great.

Proposed solution

Romero-Garcia and colleagues (2018, NeuroImage) proposed a sample flipping procedure wherein they reflect all samples from the right hemisphere onto the left, thereby substantially increasing the anatomical coverage of the left hemisphere. It would be nice to add this ability to the abagen workflows. I'd prefer it to be OFF by default, so a new boolean parameter (reflect_hemi=False) to abagen.get_expression_data() would probably work!

In the event that it was set to True we could look for all samples with x-coordinates > 0 (i.e., in the right hemisphere) and duplicate those samples, multiplying the x-coordinate by -1, then proceed as normal with the rest of the workflow. There might be some difficulties since the sample-specific well_id will be duplicated, so we would need to figure out how best to handle identifying the reflected samples and matching them to annotation information.
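
The duplication step itself could be as simple as the following sketch (assuming an annotation dataframe with an mni_x column, which is an assumption about the file layout):

import pandas as pd

def reflect_samples(annotation):
    # duplicate right-hemisphere samples (mni_x > 0) onto the left
    right = annotation[annotation['mni_x'] > 0].copy()
    right['mni_x'] *= -1
    # note: the duplicated rows retain their original well_id values,
    # which is exactly the bookkeeping problem described above
    return pd.concat([annotation, right], ignore_index=True)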

Add Allen Institute references to README + doc-string

The issue

The Allen Institute has some citation policies that we should probably do a better job highlighting throughout our documentation and codebase!

Proposed solution

As it applies to the current toolbox, we should add the following publication to all Acknowledgments and References sections:

Hawrylycz, M.J. et al. (2012) An anatomically comprehensive atlas of the adult
human transcriptome, Nature 489: 391-399. doi:10.1038/nature11405

Furthermore, we should probably add a disclaimer in the License Information section of the README that all the data obtained from calling abagen.fetch_microarray() is copyrighted by Allen Institute. Adding the following would be a good start:

All microarray expression data is copyrighted under [non-commercial reuse
policies](http://alleninstitute.org/legal/terms-use/) by the Allen Institute
for Brain Science (© 2010 Allen Institute for Brain Science. Allen Human Brain
Atlas. Available from: human.brain-map.org).

Remove mutable default arguments in abagen.datasets

The issue

It's bad form to have the default settings of keyword arguments be mutable objects (like lists, numpy arrays, etc.). Currently, however, the donors parameter in abagen.datasets.fetch_microarray() is a list (['9861']). While I don't think this is causing any real issues as currently implemented to conform to good practices it would be great to fix this!

Proposed solution

This should hopefully be as simple as modifying the function signature so that the default is donors=None and then inside the function adding a check like:

if donors is None:
    donors = ['9861']

That way the actual default will still be the same (i.e., calling abagen.datasets.fetch_microarray() with no parameters will return a single donor), but we won't have this pesky mutable object that could potentially (somehow) wreak havoc!

Set datatype of counts dataframe in get_expression_data to integer

The issue

In abagen.get_expression_data() we create a counts dataframe that keeps track of the number of samples from each donor that are assigned to a given region of interest in whatever atlas the user provides. Currently this dataframe is created with float values, but since it's just a tally it would be great if we explicitly set it to be int.

Proposed solution

Modify the instantiation of the dataframe to explicitly set the datatype to int, from:

counts = pd.DataFrame(np.zeros((len(all_labels) + 1, num_subj)),
                      index=np.append([0], all_labels))

to

counts = pd.DataFrame(np.zeros((len(all_labels) + 1, num_subj), dtype=int),
                      index=np.append([0], all_labels))

should do the trick!

Column ordering for abagen.mouse fetchers is inconsistent

The issue

This is horribly pedantic, but when you call abagen.mouse.fetch_allenref_genes() and abagen.mouse.fetch_allenref_structures() TWICE (the first time the data are grabbed from the web and cached, and the second time they are read from the cache) the columns of the resulting pandas.DataFrame are:

['acronym', 'id', 'name']. 

However, when you call abagen.mouse.fetch_rubinov2015_structures() (any number of times) the columns are:

['id', 'acronym', 'name']. 

It would be great for these to be in the same order!

Proposed solution

For both fetch_allenref_genes() and fetch_allenref_structures() modifying the lines that read:

else:
    genes/structures = pd.read_csv(fname)

to:

else:
    genes/structures = pd.read_csv(fname)[entries]

would do the trick! The entries object is a list defined earlier in both functions that is ordered ['id', 'acronym', 'name'] and would ensure consistent outputs. This is how fetch_rubinov2015_structures() works so making them consistent would be great!

Integrate ability to handle surface parcellations

The issue

Currently all the samples for the donors are mapped to volumetric space. That is, we have xyz coordinates for each sample both in the donors' native MRI space and mapped to the stereotaxic MNI space. The primary function (abagen.get_expression_data()) accepts a volumetric atlas to generate an ROI x gene microarray expression matrix.

But not all atlases are volumetric. It would be fantastic to accept surface-based atlases/parcellations!

Proposed solution

Romero-Garcia and colleagues (2018, NeuroImage) went through all the trouble of processing the donor brains through FreeSurfer's recon-all pipeline and made the reconstructions openly available on the University of Cambridge repository. We could use these reconstructions to match samples to surface vertices and then apply surface parcellations.

Unfortunately, surface parcellations come in a variety of formats:

  1. Standard parcellations can be defined in any of the fsaverage spaces; do we limit which ones we accept (i.e., only accept fsaverage or fsaverage5 or fsaverage6)? Moreover, I haven't checked the quality of the registration to fsaverage space for any of the reconstructed donor brains; this might be worth doing.
  2. We could accept probabilistic atlas files (i.e., FreeSurfer .gcs files), but then would have to make a subprocess call out to FreeSurfer in order to apply those parcellations to the donor brains. Do we allow these sorts of files and just warn users when FreeSurfer isn't installed?

Allowing surface parcellations and surface reconstructions would also add a whole bunch of additional concerns with matching samples to parcels (i.e., when considering "nearby" samples do we travel along the surface or continue to use Euclidean distance?).

Add verbose logging statements (optional)

Currently abagen.get_expression_data() takes a really long time (this should also probably be addressed...) and provides exactly zero information about its progress during that period. It would be fantastic to add some helpful logging / print statements to the function to let people running it know what's happening and how it's progressing.

At this point any status messages would be helpful, but am open to discussing what the optimal locations would be for providing status updates!
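
One lightweight option (a sketch of the general pattern, not a final design) would be a module-level logger gated by a verbose keyword:

import logging

LGR = logging.getLogger('abagen')
logging.basicConfig(format='[%(asctime)s] %(message)s')

def get_expression_data(atlas, verbose=1, **kwargs):
    # map verbosity to logging levels: 0 = warnings only, 1 = progress info
    LGR.setLevel(logging.INFO if verbose else logging.WARNING)
    LGR.info('Matching tissue samples to atlas: %s', atlas)
    # ... remainder of the workflow, with LGR.info() at each major step ...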

Address FutureWarnings for pandas.DataFrame.get_values()

The issue

As of pandas version 0.25.0 the DataFrame.get_values() method is being deprecated. Calling it returns the following warning:

FutureWarning: The 'get_values' method is deprecated and will be removed in a future
version. Use '.values' or 'np.asarray(..)' instead.

Examining the documentation for DataFrame.values, however, suggests using DataFrame.to_numpy() instead. This method, unfortunately, was only added in pandas version 0.24.0.

Proposed solution

In order to support older versions of pandas all calls to DataFrame.get_values() should be replaced with np.asarray(DataFrame). This should be an easy find-and-replace, but since the call to .get_values() is pretty prevalent throughout the repo it would be good to go through and ensure that things are changed correctly!
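
That is, each occurrence changes along these lines (assuming numpy is imported as np):

# before: deprecated as of pandas 0.25.0
values = df.get_values()

# after: works across older and newer pandas versions
values = np.asarray(df)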

Memory issues with `abagen.get_expression_data`

The issue

Using the fantastic memory_profiler package and running a few tests with abagen reveals that, in the best case, the get_expression_data() function uses ~4GiB of RAM and, in the worst case, up to ~7GiB.

This isn't great, and it would be nice to find ways to optimize this down a bit. Unfortunately, I think at a minimum we're going to be dealing with ~2GiB of RAM given that's about how much loading and storing the raw microarray + pacall data uses, but beyond that we can definitely try and get things down.

Currently, the worst of the memory usage is happening when lr_mirror is set to True. The step where we duplicate the samples has, for a brief period of time, both the original microarray data AND the duplicated microarray data in memory, leading to huge jumps until the original is removed.
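
For reference, numbers like those above can be reproduced with memory_profiler's memory_usage function (the atlas filename below is a placeholder):

from memory_profiler import memory_usage
import abagen

# Peak memory (in MiB) while running the full workflow with mirroring on
peak = memory_usage(
    (abagen.get_expression_data, ('myatlas.nii.gz',), {'lr_mirror': True}),
    max_usage=True,
)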

Proposed solution

I tried to resolve this issue by adding an optional keyword argument inplace to the samples.mirror_samples command, but that didn't seem to help a ton (at least not with the current implementation). I'm going to have to dig into the internals of things a bit more and try and figure out if there are stages where we can offload holding things in memory.

The flip side is that offloading things from memory might increase runtime (due to the need to load them back in whenever they're needed), so perhaps this might warrant a low_memory parameter or something?

Output values

Hi Ross,

I just wanted to check something. The values I get are between 0 and 1.

At the risk of asking a stupid question, can I assume that 1 is highest gene expression and 0 is lowest, when I am comparing expression levels across regions?

Thanks again for all your help.

bw

Peter

Update Arnatkevičiūte reference in README

The fantastic manuscript on preprocessing AHBA microarray data by Arnatkevičiūte et al. was published in NeuroImage 🎉 (https://www.sciencedirect.com/science/article/pii/S1053811919300114). It would be great to update the reference (and links) in the README to reference this development!

To Do

Since the README is in reStructuredText this will be a bit different from simply updating a Markdown document (which is the default for GitHub). In particular, the following link will need to be modified from:

this in a `recent manuscript <https://www.biorxiv.org/content/early/2018/07/30/
380089>`_, demonstrating how the techniques and code used to prepare the raw
AHBA data have varied widely across published reports.

to

this in a `recent manuscript <https://www.sciencedirect.com/science/article/pii
/S1053811919300114>`_, demonstrating how the techniques and code used to
prepare the raw AHBA data have varied widely across published reports.

Additionally, the reference in the Acknowledgments section should be updated from:

.. [1] Arnatkeviciute, A., Fulcher, B. D., & Fornito, A. (2018). A practical
   guide to linking brain-wide gene expression and neuroimaging data. bioRxiv,
   380089.

to

.. [1] Arnatkevic̆iūtė, A., Fulcher, B. D., & Fornito, A. (2019). A practical
   guide to linking brain-wide gene expression and neuroimaging data.
   NeuroImage, 189, 353-367.

Finally, there is one reference in the codebase (abagen/allen.py, in the doc-string of get_expression_data()) that needs to be updated, as well! The updated formatting for that reference should be identical to how the reference in the Acknowledgments section of the README should be updated.

The hippocampal formation: cortex or subcortex?

The issue

tl;dr The Allen Institute ontology classifies the hippocampus as part of cortex, not subcortex, which could cause problems when matching some microarray samples to ROIs.


When users provide a file or dataframe to the atlas_info parameter in abagen.get_expression_data() they are required to specify a broad structural class for each region in their atlas (in a column labelled 'structure' in the file/dataframe). The current options for this structural class include:

  1. cortex,
  2. subcortex,
  3. cerebellum,
  4. brainstem,
  5. white matter, and
  6. other (i.e., ventricles and such)

We match these designations with the information from the Allen ontology so that samples that don't fall directly within a region in the atlas aren't incorrectly assigned to regions in the atlas across hemispheric / structural boundaries.

That is, if one of the samples from the Allen Institute is labelled as having come from the left hemisphere subcortex we make sure to only assign it to a region in the user-specified atlas labelled as belonging to the left hemisphere subcortex. This impacts only a minority of samples (i.e., we don't currently check whether this is the case for those samples having coordinates directly within a region in the atlas), but a significant minority, nonetheless.

While matching these designations seems like a reasonable approach in most cases, the one point of contention that a general user might have is that the Allen Institute ontology classifies the hippocampal formation (including the subiculum, dentate gyrus, and CA1-4) as part of "cortex" rather than "subcortex". Specifically, their ontology specifies:

brain 
└─ gray matter
   └─ telencephalon
      └─ cerebral cortex
         └─ limbic lobe
            └─ hippocampal formation

Thus, if a researcher provides an atlas where they label all their hippocampal ROIs as "subcortex" they're liable to get vastly different results than if they label all their hippocampal ROIs as "cortex."

While I have it on good authority that the hippocampus is often considered part of "allocortex," I'm hesitant to add this as a permissible structural class to abagen since it seems quite a bit more specific than the current (rather broad) structural designations listed above (1-6).

Proposed solution

I genuinely don't know! It would be great to allow either specification for the hippocampus (i.e., "cortex" or "subcortex"), but the current framework for getting these structural classes from the Allen ontology doesn't allow for this hedging. I can think about how to modify it for this one instance in particular, but in the interim it would be great to come up with alternatives.

One option that might be worthwhile is to simply allow users to specify either (or both) of the expected 'hemisphere' and 'structure' information in atlas_info and just use whatever is available. Then, users who have hippocampal ROIs can refrain from specifying the 'structure' for their ROIs and we'll do our best to ensure samples simply don't cross hemispheric boundaries. This isn't necessarily ideal because there's the possibility that samples will get incorrectly assigned across e.g., cortical/subcortical boundaries for regions that aren't the hippocampus (but we might still consider this option outside of the current problem!).

Alternatively (and perhaps most immediately appealing), we can add a warning on the documentation about this designation and inform users to specify that their hippocampal ROIs are part of "cortex" (not "subcortex") when they provide atlas_info.
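
To make that recommendation concrete, here's roughly what a compliant atlas_info might look like when built with pandas; the region IDs and labels are invented, and only the 'structure' (plus the optional 'hemisphere') column matters for this issue:

import pandas as pd

# Hypothetical atlas_info: the hippocampal ROIs are labelled 'cortex' to
# agree with the Allen ontology, even though 'subcortex' might feel natural
atlas_info = pd.DataFrame({
    'id': [1, 2, 3],
    'hemisphere': ['L', 'R', 'L'],
    'structure': ['cortex', 'cortex', 'subcortex'],
}, index=['hippocampus_lh', 'hippocampus_rh', 'thalamus_lh'])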

Mirror `abagen.mouse` functionality for human API

The issue

The abagen.mouse module provides some functionality for making Pythonic queries to the Allen API. For example, to get gene expression data for Prodynorphin from the anterior cingulate of the mouse, you would call:

>>> from abagen import mouse
>>> mouse.get_unionization_from_gene(name='prodynorphin', 
...                                  structures='Anterior cingulate area')
                      expression_density
gene_id structure_id                    
18376   31                      0.017199

Though the mouse module hasn't received much attention since #32, it might be nice to have a similar module for querying information from the human API.

Proposed solution

I'm thinking that an abagen.api module could hold all this for the human data and would be designed similarly to the abagen.mouse module. An example function could include:

>>> from abagen import api
>>> api.get_expression_from_gene(name='prodynorphin', donors='9861',
...                              structures='cingulate gyrus')

This is quite an open-ended enhancement so happy to workshop things a bit!

Option for returning donor-level expression arrays

abagen.get_expression_data() currently returns a single region x gene expression array, collapsed via a provided metric across donors (default is averaging across donors). It might be nice to provide an option for returning the individual donor-level arrays: something like return_donors=True would thus return a list instead of a single array.

This could be done by changing the following lines in abagen.allen.get_expression_data():

abagen/allen.py (lines 349 to 350 at commit 93c1a8a):

expression = [process.normalize_expression(e) for e in expression]
expression = pd.concat(expression).groupby('label').aggregate(metric)

Modifying them to something like the following (and including return_donors as an optional keyword argument in the function definition) should work:

expression = [process.normalize_expression(e) for e in expression]
if not return_donors:
    expression = pd.concat(expression).groupby('label').aggregate(metric)

Thus, the expression data would only be concatenated and aggregated across donors if users didn't want the individual donor-level arrays.
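
Usage, assuming the keyword is added as sketched above, would then look like:

>>> # would return a list of region x gene DataFrames, one per donor
>>> expression = abagen.get_expression_data(files, atlas, return_donors=True)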

add further atlases or keep it lightweight? [enhancement]

From the documentation I understand that the Desikan-Killiany atlas is included to provide an example of how to use the toolbox and that it's comparably easy to use any atlas one might want. However, it might be a nice feature if some well-known and heavily used parcellations were already included. Depending on your thoughts and on how you want the toolbox structure and size to be, I would be happy to help out (e.g., including some atlases I already have in ALPACA). To keep it lightweight, nilearn's datasets.fetch_* functions could be used to get the atlas files, while the respective CSVs could be included in the toolbox.
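
To illustrate the lightweight route with a fetcher nilearn actually provides (the choice of the Schaefer parcellation here is just an example):

from nilearn import datasets

# Fetch (and locally cache) the atlas image instead of bundling it with abagen
schaefer = datasets.fetch_atlas_schaefer_2018(n_rois=400)
atlas_image = schaefer.maps  # path to the parcellation NIfTI file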

Constrain matching of microarray samples to atlas regions

abagen.get_expression_data() currently accepts an optional argument, atlas_info, which is used to constrain the matching of microarray samples to anatomical regions in the user-supplied atlas. This constraint ensures that samples taken from e.g., the left cortex of a donor are not incidentally matched to a left subcortical atlas region.

However, while this constraint is currently employed during the initial matching of donor samples to anatomical regions (see abagen.allen._assign_sample()), it is not used when the parameter exact=False is supplied to the abagen.get_expression_data() call.

The usage of the exact argument is currently described in the README.md:

...due to how samples were collected from the donor brains, it is possible that some regions in the atlas may not be represented by any expression data. If you require a full matrix with expression data for every region, you can specify the following:

>>> expression = abagen.get_expression_data(files, atlas, atlas_info, exact=False)

By default, abagen will attempt to be as precise as possible in matching microarray samples with brain regions. Specifying exact=False will, at the cost of this precision, ensure that every brain region is matched to at least one sample.

The mechanism for ensuring every brain region is matched to at least one sample is performed by finding the microarray sample closest to the centroid of the given brain region. This could very easily cause a brain region in e.g., left subcortex to be matched to a sample from e.g., the left cortex of a donor.

If exact=False is specified and atlas_info is provided, it would be great to ensure that samples are not incorrectly matched to anatomical regions. This would require some potentially significant changes to the code in abagen.get_expression_data(), but I am happy to discuss avenues for how this might best be tackled!
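
One possible avenue, sketched very loosely: when the exact=False fallback kicks in, restrict the candidate samples to those whose annotations agree with the region's atlas_info entry before taking the closest one. None of the names below are abagen internals:

import numpy as np

def closest_valid_sample(centroid, sample_xyz, sample_info, region_info):
    # Keep only samples whose broad annotations match the target region
    # (assumes at least one such sample exists)
    valid = np.asarray(
        (sample_info['hemisphere'] == region_info['hemisphere'])
        & (sample_info['structure'] == region_info['structure'])
    )
    candidates = sample_xyz[valid]
    # Of the remaining samples, pick the one nearest the region centroid
    dists = np.linalg.norm(candidates - centroid, axis=1)
    return candidates[dists.argmin()]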

Perform donor normalization before probe selection?

The issue

In the current workflow for abagen.get_expression_data() there are a few calls to the probes.filter_probes() and probes.collapse_probes() functions, both of which use microarray expression data from donors to select "good" quality probes to be retained in future processing. These functions (though primarily the latter) operate on expression values from all tissue samples pooled across donors, performing numerical calculations like averaging, variance estimation, PCA, etc. However, the expression data passed to these functions are "raw"; that is, no within-donor normalization procedures have been performed. Thus, donor-specific effects present in the raw microarray data could potentially be biasing the results of these functions.

My question is: should we perform within-donor normalization (i.e., normalizing each probe across all samples from a given donor) prior to providing the microarray data to the probe filtering/selection functions?

We would still use the "raw" microarray data when pooling across tissue samples in a given brain region and perform a separate within-donor normalization towards the end of the workflow, prior to aggregating across donors, but this pre-probe-selection donor normalization might help ensure that results aren't biased by donor effects. The flip side to this is that it's possible that performing the normalization procedure prior to removing "noisy" probes might result in biased expression values...
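
To be explicit about what that would mean, a minimal version of the within-donor step could be a per-probe z-score across a donor's samples (just one possible normalization, not necessarily the one abagen should adopt):

def normalize_within_donor(microarray):
    # `microarray` is a probe x sample expression DataFrame for one donor;
    # z-score each probe across that donor's samples (zero-variance probes
    # would need special handling in practice)
    mean = microarray.mean(axis=1)
    std = microarray.std(axis=1)
    return microarray.sub(mean, axis=0).div(std, axis=0)

# Applied per donor before probes.filter_probes() / probes.collapse_probes():
# normalized = {donor: normalize_within_donor(expr) for donor, expr in raw.items()}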

Proposed solution

Unclear! I don't think this warrants yet another parameter for the function, so it would be good to make a decision and simply implement it one way or the other. It's worth noting that adding this donor normalization step will likely change which probes are selected, so it will have a relatively large impact on the downstream expression matrix.

Matching samples to brainstem regions

The issue

Currently abagen entirely ignores the brainstem in its sample matching when atlas_info is provided to abagen.get_expression_data(). That is, the structure IDs for brainstem nuclei aren't parsed from the ontology information provided by the Allen Institute and therefore using a brainstem atlas image will yield imprecise results.

Notably, this will ONLY affect those samples that don't reside directly inside one of the parcels in the provided atlas image; however, it will still change the results one might get.

Proposed solution

The abagen.process.ONTOLOGY dictionary will need to be updated to include the Allen Ontology structure IDs for the brainstem. While we're at it, it might be worthwhile to completely update the abagen.process.ONTOLOGY dictionary to include all structural classes.

The following structures from the Allen Ontology encompass more than one broad structural class and should therefore be ignored:

id    name           structures
4005  brain          all structures
4006  gray matter    cortex, subcortex, cerebellum, brainstem
4007  telencephalon  cortex, subcortex
4833  metencephalon  brainstem, cerebellum

This leaves the following structural classes as the "roots" of all sub-structures in the ontology:

id    name             structure
4008  cerebral cortex  cortex
4275  cerebral nuclei  subcortex
4391  diencephalon     subcortex
9001  mesencephalon    subcortex
4696  cerebellum       cerebellum
9131  pons             brainstem
9512  myelencephalon   brainstem
9218  white matter     white matter
9352  sulci & spaces   other

This latter table should be used in place of the current abagen.process.ONTOLOGY dictionary to ensure that we allow atlases that include parcels in any of these broad structural designations.
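
Assuming the table above, the updated mapping could look something like this (the actual layout of abagen.process.ONTOLOGY may differ):

# Hypothetical structure of the updated mapping: each broad structural class
# maps to the Allen ontology IDs that serve as its roots
ONTOLOGY = {
    'cortex': [4008],                 # cerebral cortex
    'subcortex': [4275, 4391, 9001],  # cerebral nuclei, diencephalon, mesencephalon
    'cerebellum': [4696],
    'brainstem': [9131, 9512],        # pons, myelencephalon
    'white matter': [9218],
    'other': [9352],                  # sulci & spaces
}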
