mannlabs / alphapept

A modular, python-based framework for mass spectrometry. Powered by nbdev.

Home Page: https://mannlabs.github.io/alphapept/

License: Apache License 2.0

Python 42.82% Inno Setup 0.10% Batchfile 0.05% HTML 53.25% CSS 0.04% JavaScript 3.69% Dockerfile 0.05%
mass-spectrometry bioinformatics proteomics alphapept-ecosystem

alphapept's Introduction

AlphaPept

DOI: 10.1038/s41467-024-46485-4

AlphaPept: a modern and open framework for MS-based proteomics

Published in Nature Communications.

Be sure to check out the other packages of our ecosystem.

Windows Quickstart

  1. Download the latest installer here, install and click the shortcut on the desktop. A browser window with the AlphaPept interface should open. In the case of Windows Firewall asking for network access for AlphaPept, please allow.
  2. In the New Experiment, select a folder with raw files and FASTA files.
  3. Specify additional settings such as modifications with Settings.
  4. Click Start and run the analysis.

See also below for more detailed instructions.

Current functionality

Feature            Implemented
Type               DDA
Filetypes          Bruker, Thermo
Quantification     LFQ
Isobaric labels    None
Platform           Windows

Linux and macOS should, in principle, work but are not heavily tested and might require additional setup (see the detailed instructions below). To read Thermo files, we use Mono, which works on Mac and Linux. Bruker files can be read on Linux but not yet on macOS.

Python Installation Instructions

Requirements

We highly recommend the Anaconda or Miniconda Python distribution, which comes with a powerful package manager. See below for additional instructions for Linux and Mac, as they require an additional installation of Mono to use the RawFileReader.

AlphaPept can be used as an application as a whole or as a Python package from which individual modules are called. Depending on the use case, AlphaPept needs different requirements, and you might not want to install all of them.

Currently, we have the default requirements.txt, additional requirements to run the GUI (gui), and packages used for development (develop).

Therefore, you can install AlphaPept in multiple ways:

  • The default: alphapept
  • With GUI packages: alphapept[gui]
  • With packages for development: alphapept[develop] (or combined: alphapept[develop,gui])

The requirements contain pinned versions and are automatically upgraded and tested with dependabot; this stable set allows for a reproducible workflow. By default, however, the requirements are installed unpinned to avoid conflicts from overly strict package versions. To use the strictly pinned versions, use the stable flag, e.g. alphapept[stable].

For end users who want to set up a processing environment in Python, "alphapept[stable,gui-stable]" is the batteries-included version to use.

Python

It is strongly recommended to install AlphaPept in its own environment.

  1. Open the console and create a new conda environment: conda create --name alphapept python=3.8
  2. Activate the environment: conda activate alphapept
  3. Install AlphaPept via pip: pip install "alphapept[stable,gui-stable]". If you want to use AlphaPept as a package without the GUI dependencies and without strict version dependencies, use pip install alphapept.

If AlphaPept is installed correctly, you should be able to import AlphaPept as a package within the environment; see below.
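A quick sanity check from within the activated environment (a minimal sketch, assuming the package exposes __version__, as most nbdev-built packages do):

import alphapept
print(alphapept.__version__)  # prints the installed version if the import works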


Linux

  1. Install the build-essentials: sudo apt-get install build-essential.
  2. Install AlphaPept via pip: pip install "alphapept[stable,gui-stable]". If you want to use AlphaPept as a package without the GUI dependencies and strict version dependencies, use pip install alphapept.
  3. Install libgomp.1 with sudo apt-get install libgomp1.
Bruker Support
  1. Copy-paste the Bruker library for feature finding to your /usr/lib folder with sudo cp alphapept/ext/bruker/FF/linux64/alphapeptlibtbb.so.2 /usr/lib/libtbb.so.2.
Thermo Support
  1. Install Mono from the mono-project website (Mono Linux). NOTE: the installed Mono version should be at least 6.10, which requires you to add the ppa to your trusted sources!
  2. Install pythonnet with pip install "pythonnet>=2.5.2"

Mac

  1. Install AlphaPept via pip: pip install "alphapept[stable,gui-stable]". If you want to use AlphaPept as a package without the GUI dependencies and strict version dependencies, use pip install alphapept.
Bruker Support

Only supported for preprocessed files.

Thermo Support
  1. Install Homebrew and pkg-config: brew install pkg-config
  2. Install Mono from mono-project website Mono Mac
  3. Register the Mono-Path to your system: For macOS Catalina, open the configuration of zsh via the terminal:
  • Type in cd to navigate to the home directory.
  • Type nano ~/.zshrc to open the configuration of the terminal
  • Add the path to your Mono installation: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/lib/pkgconfig:/Library/Frameworks/Mono.framework/Versions/Current/lib/pkgconfig:$PKG_CONFIG_PATH. Make sure that the path matches your version (here 6.12.0).
  • Save everything and execute . ~/.zshrc
  4. Install pythonnet with pip install "pythonnet>=2.5.2"

Developer

  1. Redirect to the folder of choice and clone the repository: git clone https://github.com/MannLabs/alphapept.git
  2. Navigate to the alphapept folder with cd alphapept and install the package with pip install . (default users) or with pip install -e . to enable developer mode. Note that you can use the different requirements here as well (e.g. pip install ".[gui-stable]").

GPU Support

Some functionality of AlphaPept is GPU-optimized using NVIDIA's CUDA. To enable this, additional packages need to be installed.

  1. Make sure you have a working CUDA toolkit installation that is compatible with CuPy. To check, type nvcc --version in your terminal.
  2. Install CuPy. Make sure to install the cupy version matching your CUDA toolkit (e.g. pip install cupy-cuda110 for CUDA toolkit 11.0). A quick smoke test follows below.
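A minimal smoke test for the GPU setup (a sketch; any small CuPy computation will do):

import cupy as cp

# Runs on the GPU if the CUDA toolkit and a matching CuPy build are installed
x = cp.arange(10)
print(x.sum())  # -> 45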

Additional Notes

To access Thermo files, we have integrated RawFileReader into AlphaPept. We rely on Mono for Linux/Mac systems.

To access Bruker files, we rely on the timsdata-library. Currently, only Windows is supported. For feature finding, we use the Bruker Feature Finder, which can be found in the ext folder of this repository.

Notes for NBDEV

  • For developing with the notebooks, install the nbdev package (see the development requirements)
  • To facilitate navigating the notebooks, use jupyter notebook extensions. They can be called from a running jupyter instance like so: http://localhost:8888/nbextensions. The extensions collapsible headings and toc2 are very beneficial.

Standalone Windows Installer

To use AlphaPept as a stand-alone program for end-users, it can be installed on Windows machines via a one-click installer. Download the latest version here.

Docker

It is possible to run AlphaPept in a docker container. For this, we provide two Dockerfiles: Dockerfile_thermo and Dockerfile_bruker, depending on which filetypes you want to analyse. They are split because of drastically different requirements.

To run, navigate to the AlphaPept repository and rename the dockerfile you want to use, e.g. Dockerfile_thermo to Dockerfile.

  • Build the image with: docker build -t docker-alphapept:latest .
  • To run, use docker run -p 8505:8505 -v /Users/username/Desktop/docker:/home/alphapept/ docker-alphapept:latest alphapept gui (note that -v maps a local folder for convenient file transfer)
  • Access the AlphaPept GUI via localhost:8505 in your browser.
  • Note 1: The Thermo Dockerfile is built on a Jupyter image, so you can also start a jupyter instance: docker run -p 8888:8888 -v /Users/username/Desktop/docker:/home/jovyan/ docker-alphapept:latest jupyter notebook --allow-root

Docker Troubleshooting on M1-Mac

  • The Thermo dockerfile was tested on an M1 Mac. Resources were set to 18 GB RAM, 2 CPUs, and 200 GB disk.
  • It was possible to build the Bruker dockerfile with the platform tag --platform linux/amd64. However, it was very slow, so the Bruker dockerfile is not recommended on an M1 Mac. On Windows it worked nicely.

Additional Documentation

The documentation is automatically built from the jupyter notebooks (nbs/index.ipynb) and can be found here.

Version Performance

An overview of the performance of different versions can be found here. We re-run multiple tests on datasets for different versions so that users can assess what changes from version to version. Feel free to suggest a test set.

How to use

AlphaPept is meant to be a framework to implement and test new ideas quickly, but also to serve as a performant processing pipeline. In principle, there are three use cases:

  • GUI: Use the graphical user interface to select settings and process files manually.
  • CMD: Use the command-line interface to process files. Useful when building automatic pipelines.
  • Python: Use python modules to build individual workflows. Useful when building customized pipelines and using Python as a scripting language or when implementing new ideas.

Windows Standalone Installation

For the Windows installation, simply click the shortcut after installation. The Windows installation also installs the command-line tool, so that you can call alphapept via alphapept in the command line.

Python Package

Once AlphaPept is correctly installed, you can use it like any other Python module.

from alphapept.fasta import get_frag_dict, parse
from alphapept import constants
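# b- and y-ion fragment masses for a peptide, keyed by ion label (e.g. 'b1', 'y2')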

peptide = 'PEPT'

get_frag_dict(parse(peptide), constants.mass_dict)
{'b1': 98.06004032687,
 'b2': 227.10263342687,
 'b3': 324.15539728686997,
 'y1': 120.06551965033,
 'y2': 217.11828351033,
 'y3': 346.16087661033}

Using as a tool

If alphapept is installed in a conda or virtual environment, launch this environment first.

To launch the command-line interface, use: alphapept

This allows you to select different modules. To start the GUI, use: alphapept gui

To run a workflow, use: alphapept workflow your_own_workflow.yaml. An example workflow is easily generated by running the GUI once and saving the settings, which can then be modified on a per-project basis.

CMD / Python

  1. Create a settings-file. This can be done by changing the default_settings.yaml in the repository or using the GUI.
  2. Run the analysis with the new settings file. alphapept run new_settings.yaml

Within Python (i.e., a Jupyter notebook), the following code would be required:

from alphapept.settings import load_settings
import alphapept.interface
settings = load_settings('new_settings.yaml')
r = alphapept.interface.run_complete_workflow(settings)

This also allows you to break the workflow down into individual steps, e.g.:

settings = alphapept.interface.import_raw_data(settings)
settings = alphapept.interface.feature_finding(settings)

Notebooks

Within the notebooks, we try to cover most aspects of a proteomics workflow:

  • Settings: General settings to define a workflow
  • Chem: Chemistry related functions, e.g., for calculating isotope distributions
  • Input / Output: Everything related to importing and exporting and the file formats used
  • FASTA: Generating theoretical databases from FASTA files
  • Feature Finding: How to extract MS1 features for quantification
  • Search: Comparing theoretical databases to experimental spectra and getting Peptide-Spectrum-Matches (PSMs)
  • Score: Scoring PSMs
  • Recalibration: Recalibration of data based on identified peptides
  • Quantification: Functions for quantification, e.g., LFQ
  • Matching: Functions for Match-between-runs
  • Constants: A collection of constants
  • Interface: Code that generates the command-line-interface (CLI) and makes workflow steps callable
  • Performance: Helper functions to speed up code with CPU / GPU
  • Export: Helper functions to make exports compatible with other software tools
  • Label: Code supporting isobaric label search
  • Display: Code related to displaying in the streamlit gui
  • Additional code: Overview of additional code not covered by the notebooks
  • How to contribute: Contribution guidelines
  • AlphaPept workflow and files: Overview of the workflow, files and column names

Contributing

If you have a feature request or a bug report, please post it either as an idea in the discussions or as an issue on the GitHub issue tracker. Upvoting features on the discussions page will help to prioritize what to implement next. If you want to contribute, submit a PR. You can find more guidelines for contributing and how to get started here. We will gladly guide you through the codebase and credit you accordingly. Additionally, you can check out the Projects page on GitHub. You can also contact us via [email protected].

If you like the project, consider starring it!

Cite us

If you use this project in your research, please cite:

Strauss, M.T., Bludau, I., Zeng, WF. et al. AlphaPept: a modern and open framework for MS-based proteomics. Nat Commun 15, 2168 (2024). https://doi.org/10.1038/s41467-024-46485-4

alphapept's People

Contributors

ammarcsj, cmdoret, dependabot-preview[bot], dependabot[bot], elena-krismer, eugeniavoytik, github-actions[bot], hugokitano, ibludau, jalew188, mgleeming, mschwoer, romanzenka, straussmaximilian, swillems


alphapept's Issues

Medium: change the overall progress value during DB generation in GUI when it starts

Is your feature request related to a problem? Please describe.
The overall progress during DB generation only starts several minutes after the tool is run.

Describe the solution you'd like
Maybe it would be better to show 0.1% from the beginning, right when the user starts the tool. Otherwise, it could look like nothing is happening. (see the screenshot - https://i.gyazo.com/cb1c9bb4d2721d4074147183881ce3a8.jpg)

[Medium] "UnboundLocalError: local variable 'file' referenced before assignment" error in the Results section

Describe the bug
[Medium] "UnboundLocalError: local variable 'file' referenced before assignment" error in the Results section when you start the AP for the first time or when user cleans the "alphapept/finished" folder.

To Reproduce

  1. Install the tool on your computer and run it for the first time.
  2. Switch to "Results" section.
  3. "UnboundLocalError: local variable 'file' referenced before assignment" is displayed in this section.

Expected behavior
It would be nice to avoid showing the error when the folder is still empty.
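The underlying pattern is presumably a variable that is only assigned inside a loop over found result files; a guard along these lines would avoid it (a sketch, all names hypothetical):

import os

finished_dir = os.path.join(os.path.expanduser('~'), 'alphapept', 'finished')  # hypothetical location
result_files = os.listdir(finished_dir) if os.path.isdir(finished_dir) else []
if not result_files:
    print('No finished runs yet.')  # show a friendly message instead of raising
else:
    file = result_files[-1]
    # ... display results for `file` ...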

Screenshots
https://gyazo.com/5d207434bd44156650f1612f8c1fe9e8

Version (please complete the following information):

  • OS: Windows
  • Version: 10.0.19041
  • Installation Type: One-Click Installer

Medium: wrong size for one option in the dropdown menu in GUI

Describe the bug
In the GUI, for all parameters in the "Settings" tab that have a dropdown menu, there is always a big empty space, or one option from the menu is disproportionately large.

To Reproduce

  1. Run the AlphaPept.
  2. In the "Settings" tab extend "fasta" parameters and click on "Protease" selection option.

Can be reproduced for the following dropdown menus:

  • "general > score"
  • "quantification > mode"

Expected behavior
All dropdown menu options should have the same size.

Screenshots
https://i.gyazo.com/b2ef0ef36aa23bbb006e1aaf83d76404.jpg
https://i.gyazo.com/26d6244c747c468a7f1a86ee481a385a.jpg

Version (please complete the following information):

  • OS: Windows
  • Version: 0.3.11-dev0
  • Installation Type: One-Click Installer

DependaBot Integration

Currently, dependabot is not tracking the software versions.
Check why this is.
Ideally, going back to the requirements.txt would be nice.

Replace NPZ with HDF

The NPZ format to store data files is only a temporary solution. @swillems has some fast implementations that use indexing and are being used in the ion networks project. This would allow saving even the query data in the hdf.

Key steps would be:

  1. Replace the save function in io from npz to hdf (see the sketch below).
  2. Replace the save function in FASTA.
    One would need to think about how to store the peptide dict and the sequences here.
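A minimal sketch of the first step, using h5py to mirror np.savez-style storage (illustrative only, not the actual implementation):

import h5py
import numpy as np

def save_arrays_as_hdf(path, arrays):
    # arrays: dict mapping names to np.ndarray, mirroring np.savez semantics
    with h5py.File(path, 'w') as hdf_file:
        for name, array in arrays.items():
            hdf_file.create_dataset(name, data=array)

def load_arrays_from_hdf(path):
    with h5py.File(path, 'r') as hdf_file:
        return {name: hdf_file[name][()] for name in hdf_file}

save_arrays_as_hdf('example.hdf', {'masses': np.arange(5.0)})
print(load_arrays_from_hdf('example.hdf'))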

Windows GUI crashes due to access restrictions

Describe the bug
The Win GUI crashes and reports "PermissionError: [WinError 5]".

To Reproduce
Steps to reproduce the behavior:

  1. Open Win GUI (0.3.13-dev0)
  2. Drag and drop Fasta files and raw files
  3. Start Analysis
  4. See error

Expected behavior
Expected analysis to start

Screenshots
image

Version (please complete the following information):

  • Windows 10 Pro 64-bit (10.0, Build 19041)
  • Version [0.3.13-dev0]
  • One-Click Installer

Additional context
After receiving the error, I tried to run as admin, but I was unable to load any raw files or FASTA files into the GUI.

Medium: change the message that is shown after "Check" of the settings if no problems were found

Is your feature request related to a problem? Please describe.
If a user checks the settings specified for the run in the "Settings" tab and no problems were found, they get the following message:
"Found a total of 0 problems: []" (see the screenshot - https://i.gyazo.com/8a515079cffed3dba4f92ffdce1123f8.png)

Describe the solution you'd like
It would be better to change the message for the case when no problems were found in the result of the check to something like:
"No problems were found."

nbdev windows vs. macOS / Linux

@jalew188 already encountered the following issue, which I can now confirm: line endings of Jupyter notebooks on Windows include a carriage return, which unfortunately means that an nbdev_build_lib or nbdev_clean_nbs call creates massive "differences" that are in fact just line endings. This obscures which code has been modified, making it difficult to work on both Windows and macOS/Linux machines simultaneously...

strings are loaded as binary strings from HDF_File (io.py)

Describe the bug
With h5py==3.2.1, pandas raised an error: ERROR:root:Scoring of file d:\DataSets\APTest\20170518_QEp1_FlMe_SA_BOX0_HeLa12_Ecoli1_Shotgun.ms_data.hdf failed. Exception Can only use .str accessor with string values!. alphapept ran smoothly after switching to h5py==2.10.0. I checked the data frame from .ms_data.hdf; it seems that strings are loaded (or stored) as binary strings from HDF_File with h5py==3.2.1.
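A possible interim workaround, sketched for a pandas column of byte strings (column name hypothetical):

import pandas as pd

df = pd.DataFrame({'sequence': [b'PEPTIDE', b'PROTEIN']})  # as loaded with h5py 3.x
df['sequence'] = df['sequence'].str.decode('utf-8')        # back to Python strings
print(df['sequence'].str.len())                            # .str accessor works again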

To Reproduce
Install h5py==3.2.1, and run the whole workflow.

Expected behavior
Error ...... Can only use .str accessor with string values!

Version (master branch, 3.11-dev0):

  • OS: Windows 10
  • Python: 3.8.0
  • Installation Type: pip

Selection of multiple proteases

Is your feature request related to a problem? Please describe.
Some people use multiple proteases in one experiment. It would be cool if it were possible to also select multiple proteases in alphapept.

Screenshot 2021-05-19 at 15 43 18

Quantification Accuracy with MaxLFQ

Currently, the quantification accuracy with MaxLFQ is not yet satisfactory.

These are the results when running the species test from PXD010012
Screen Shot 2020-08-10 at 12 51 21 AM

Notably, no clear distinction between the species is possible. The original data from the paper looks like this:
Screen Shot 2020-08-10 at 12 49 56 AM

When testing only the algorithm part on the data, the following results can be observed.

Screen Shot 2020-08-10 at 12 50 29 AM

The following two observations will be investigated:

  • Why is there a population at zero?
  • E.coli seems to have a population on the left side. Why?
  • From investigating the evidence files from MaxQuant, there seems to be a discrepancy between the protein groups, i.e., sometimes multiple protein groups are merged. Find out why this happens.

Extreme: exception occurs at the end of DB generation for Thermo HeLa test run

Describe the bug
An unknown exception occurs when the DB generation process ends for the test HeLa file (Thermo) -
/04_hela_testrun/20190402_QX1_SeVW_MA_HeLa_500ng_LC11.raw.

To Reproduce

  • see the attached settings.txt file with all information about the parameters of the run.
  1. Load the attached settings renaming the file into settings.yaml.
  2. Specify the path to the HeLa file and run the analysis.
  3. At the end of the process of DB generation you'll get an exception

!!! After getting the exception, the alphapept.exe process(es) still appear as active in the Task Manager.
settings.txt

Screenshots
https://i.gyazo.com/9b04df0db8a5c85c492e408928301c03.png

Version (please complete the following information):

  • OS: Windows 10
  • Version: 0.3.11-dev0
  • Installation Type: One-Click Installer

Code Revision - Architecture

In terms of architecture, several design ideas can be utilized.

So far, a major approach has been to use numba-optimized functions for the core code. Numba allows using OOP via jitclasses. A downside is that you need to type the variables, which affects the clear and easy Python syntax.

As discussed, we could use a combination of having regular Python classes with Numba functions.
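A sketch of that combination: a plain Python class that delegates its hot loop to a numba-compiled function, keeping readable syntax without jitclass typing (illustrative only):

import numpy as np
from numba import njit

@njit
def _total_intensity(intensities):
    # hot loop compiled by numba; the wrapper class stays plain Python
    total = 0.0
    for value in intensities:
        total += value
    return total

class Spectrum:
    def __init__(self, intensities):
        self.intensities = np.asarray(intensities, dtype=np.float64)

    def total_intensity(self):
        return _total_intensity(self.intensities)

print(Spectrum([1.0, 2.0, 3.0]).total_intensity())  # -> 6.0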

The UI implementation relies much more on OOP, and basically no numba functions are employed.

This issue is intended to collect ideas on where we should revise the code so that we have more flexibility for further modules.

[Medium] "New experiment" section is not updated if user specifies the same path several times

To Reproduce

  1. Run AlphaPept and switch to the "New experiment" section.
  2. E.g., remove the fasta file (or any other necessary file) from your experimental folder and copy the path of this folder into the path input field.
  3. "No fasta files in folder" message appears that is correct.
  4. Add the fasta file back to the folder and enter the same path into the field again > we still see the same "No fasta files in folder" message. If the path is the same as before, the section is not updated.

Expected behavior
The section should update when a file path is entered, even if it's the same path as before.

Version (please complete the following information):

  • OS: Windows
  • Version: 10.0.19041
  • Installation Type: One-Click Installer

High: prevent the changing of the values in the Settings tab using the mouse scroll wheel

Is your feature request related to a problem? Please describe.
In the "Settings" tab if a user looks at all available options and scroll them up and down and if at this moment the mouse cursor appears on one of the IntegetInput/FloatInput options, the values of these options could be changed even without the user's notice.

Describe the solution you'd like
Prevent changing the IntegerInput/FloatInput options in the "Settings" tab via mouse scrolling.

Option to cancel a running job

Is your feature request related to a problem? Please describe.
If you realise that you selected inappropriate settings or similar, there should be a possibility to cancel a job. Also, in case alphapept gets stuck (as happened for me - see below), you should be able to cancel.

Describe the solution you'd like
There could be a cancellation button in the 'queue' tab on the 'status' page.

Nice to have
There could also be an option to prioritise jobs in the queue. In case an urgent analysis comes up you could move it up in the list.

Additional context
Specifically, I have an issue with a job that (I guess) got stuck; it restarts and gets stuck at the same point again when I relaunch alphapept (it has been stuck for >20 min already). I will write a separate issue for this.

Screenshot 2021-05-14 at 10 36 34

Github workflow security issue?

Updating the github workflow before pushing a commit allows running arbitrary code on the runners. While some changes require modifying the github workflow to actually make the CI pass, this can pose a serious security threat once the repository is publicly accessible. How can/will we deal with this?

[Medium] "Auto-Update Page" checkbox is automatically unchecked when user switches between different sections

To Reproduce

  1. Run AlphaPept and in the "Status" section check "Auto-Update Page" check box.
  2. Switch to any other section, i.e. "New experiment".
  3. Switch back to the "Status" section and note that the "Auto-Update Page" check box is unchecked again.

Expected behavior
It would be nice to preserve the state of this check box when switching between sections.

Version (please complete the following information):

  • OS: Windows
  • Version: 10.0.19041
  • Installation Type: One-Click Installer

HeLa performance drastically decreased

Hi,
After investigating the latest performance runs, I noticed that we get only approx. 40k peptides for Thermo runs.
In the past, before the automated tracking, when making the performance runs we had approx. 50k, so it seems that we lost 20% somewhere...
I checked the settings, and they did not change (same FASTA / file, tolerances, etc.).
I presume that we have introduced a bug along the way (raw conversion? feature calibration? mapping MS1 to MS2?) and should investigate.

[High] Impossible to reanalyze the file if the previous analysis was terminated

Describe the bug
If you terminate the analysis process at any point (e.g. FF) and then try to rerun the analysis using the default workflow steps, the exception
"Processing of D:\04_hela_testrun\20190402_QX1_SeVW_MA_HeLa_500ng_LC11.ms_data.hdf for step raw_conversion failed. Exception File extension .hdf not understood." occurs in the terminal and the program freezes.

To Reproduce
Steps to reproduce the behavior:

  1. Run any analysis with any settings and terminate the process, e.g. at the FF stage.
  2. In the New experiment section, enter the path to the same folder again. Here we see that the .ms_data.hdf file appears in the Raw files section (I also don't think it should be included in this section -
    image
    )
  3. Leave all other options at their defaults (the continue_runs check box in the Workflow shouldn't be selected, which means that the ms_data file should be deleted) and Submit an experiment.
  4. "Processing of D:\04_hela_testrun\20190402_QX1_SeVW_MA_HeLa_500ng_LC11.ms_data.hdf for step raw_conversion failed. Exception File extension .hdf not understood." occurs in the log and the whole run is frozen (longer than 30 minutes). (see attached log.file)

Expected behavior
It should be possible to rerun any terminated process.
As a suggestion, maybe it makes sense not to show the .ms_data.hdf file in the Raw files section.

Version (please complete the following information):

  • OS: Windows
  • Version: 10.0.19042
  • Installation Type: One-Click Installer v.0.3.23

Attached log file.
2021_05_27_thermo_hela_run2.log

Test Scope

To have a maintainable package, automated tests and performance benchmarks are crucial.
I have the following tests in mind, considering a versioning scheme X.Y.Z with X = Major, Y = Minor, Z = Patch. In terms of branching, we would have master/dev and feature branches.

Unit Tests:

Simple function tests within nbdev. They should be run for every push on every branch. Duration ~ minutes

Workflow test:

Tests that run the full pipeline (i.e., perform a search on HeLa Thermo / Bruker data). We could run them for every version, even minor versions on dev. Duration: <1 h.
We would auto-create a settings template for the current version and replace the file_path with the respective filenames.

Integration test:

Tests that try all possible (or at least most) combinations of settings.
This is something we could do for every Minor version. Duration will be several hours.

Installer Test:

I think shipping is very crucial, and we should have one-click installers ready for each patch. Compiling an installer takes approx. <10 minutes, so this could be done for every push to the dev branch.

UI Test:

This is a very difficult test but very important for keeping a user base: it can happen that a new feature is implemented and then the GUI doesn't work anymore. The current settings scheme is very flexible, so the core functionality should be covered by the Workflow test. A proper GUI test would probably involve tools like pyautogui that automatically "click" through workflows, as sketched below. If we want to be fancy, we could also use this to automatically produce screenshotted documentation for each version. Ideally, this would run for every push to the dev branch.
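A toy sketch of such a click-through test with pyautogui (reference image and flow purely illustrative):

import pyautogui

# Locate a previously saved screenshot of the Start button on screen and click it
location = pyautogui.locateCenterOnScreen('start_button.png')  # hypothetical reference image
if location is not None:
    pyautogui.click(location.x, location.y)
    pyautogui.screenshot('after_start.png')  # doubles as auto-generated documentation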

Performance test:

The workflow test from above will not give us a good estimate of performance. We will get execution time and protein and peptide counts, but we should also consider metrics like quantification accuracy. For this, we should use multi-species samples with known mixing ratios, which are computationally more demanding and which I would hence consider a different kind of test.
The idea would be to have a set of PRIDE datasets like PXD010012, which we always re-run. As we could use the analysis results from the repository, we would also have a baseline to compare our results to.
Depending on the number of datasets, this could take considerable time.
This is something we could potentially do for every minor version.

List of potential performance test sets:

  • PXD010012: Online PASEF Paper
  • PXD006109: BoxCar

Implementation

For running those tests, I will use GitHub-Actions self-hosted runners. This would allow us to use powerful workstations to run the tests.

Ideally, we can also set up runners for each Windows / Linux and Mac.

At some point, one could also make the testing results more explorable, i.e., pushing the results to a db and having a little dashboard app that shows performance over version/time.

Also, note that we can always trigger the tests manually.

Let me know if you would suggest additional tests or think the current test set should be optimized.

Styleguide

We should also add an automatic style test.

High: a fasta_paths parameter is always empty in the Settings tab in GUI

Describe the bug
The list of paths to fasta files is always empty in the "Settings" tab of the GUI. It can't be modified and is not updated with the paths to the fasta files selected in the "Experiment" tab.

To Reproduce

  1. Run the AlphaPept.
  2. In the "Settings" tab extend "fasta" parameters and click on a "fasta_paths" option.
  3. Nothing happens, and the user has no option to specify the path. It stays empty even if the user specified the paths to the fasta file(s) in the "Experiment" tab.

Screenshots
https://i.gyazo.com/b7962a7232e288fd1a65b003f9101b7c.png

Version (please complete the following information):

  • OS: Windows 10
  • Version: 0.3.11-dev0
  • Installation Type: One-Click Installer

Timing no longer reported in history

Describe the bug
The timing plot is not correctly displayed in release 0.3.23-dev0

Screenshots
Screenshot 2021-05-26 at 16 53 56

Version (please complete the following information):

  • OS: Windows 10 Pro
  • Version: 20H2, OS build 19042.804
  • Installation Type One-Click Installer

Axis labels in History tab

Describe the bug
Not really a bug, but I would add units to the axes in the History tab of the GUI.
Specifically, rt_length should, I guess, be in minutes, and timing as well.

UI Bugs and Enhancements

The current UI has several bugs, and several ideas for enhancement exist.

Bugs

  • ProgressBar is not correctly implemented
  • Logging sometimes shows weird spaces
  • Stop busy indicator after run is complete
  • Loading settings file via drag and drop doesn't work
  • Program crashes after completing the run

Enhancements

  • The yellow progress bar is hardly readable at more than 50% progress
  • Auto-download FASTA files from UniProt
  • Starting GUI is relatively slow: Speed up or create starting screen
  • Include more information in the files dialog (i.e., the number of entries in the FASTA file)
  • Ability to make crude data exploration in the explore tab (i.e., search for sequence/protein), make histograms
  • General Code Cleanup: Variable names etc

Usability: Database / Fasta Files

Currently, one can select FASTA files and a database file. This isn't very clear:
when are we creating a new database file, and when are we using the existing one?

Stability: Waiting indicator

Several steps can become unresponsive in extreme cases, for example when dropping thousands of files in the file dialog or when displaying first_search in the explore tab.

Possible Solutions:
Start the respective steps in a thread and show a wait indicator.

Formatting of settings

Is your feature request related to a problem? Please describe.
I think the settings selection is not optimal in that you cannot distinguish what is a setting selection and what opens a drop-down menu (see screenshot).

Screenshot 2021-05-19 at 15 17 57

Here, 'raw' opens a drop-down selection, 'use_profile_ms1' is a setting and 'fasta' is a drop-down again.

Describe the solution you'd like
I would generally prefer if the overall drop-down categories weren't selected via a tick box but rather with a plus on the right side, as is done for the settings in general. Another option would be to indent the actual settings selection or to use another background color.

ValueError when using a small fasta (contaminants.fasta) for BSA raw

Probably because the fasta file is too small to be divided into more than one fasta_block. A possible guard is sketched after the traceback.

ValueError                                Traceback (most recent call last)
<ipython-input-3-cdeabb6ebdae> in <module>
      1 from alphapept.runner import run_alphapept
----> 2 run_alphapept(params)

/home/feng/alphapept/alphapept/alphapept/runner.py in run_alphapept(settings, callback)
     67                 cb = callback
     68 
---> 69             spectra, pept_dict, fasta_dict = generate_database_parallel(settings, callback = cb)
     70             logging.info('Digested {:,} proteins and generated {:,} spectra'.format(len(fasta_dict), len(spectra)))
     71 

/home/feng/alphapept/alphapept/alphapept/fasta.py in generate_database_parallel(settings, callback)
    708     spectra_set.append(spectra[-1])
    709 
--> 710     pept_dict = merge_pept_dicts(pept_dicts)
    711 
    712     return spectra_set, pept_dict, fasta_dict

/home/feng/alphapept/alphapept/alphapept/fasta.py in merge_pept_dicts(list_of_pept_dicts)
    515 
    516     if len(list_of_pept_dicts) < 2:
--> 517         raise ValueError('Need to pass at least two elements to merge.')
    518 
    519     new_pept_dict = list_of_pept_dicts[0]

ValueError: Need to pass at least two elements to merge.
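A possible guard, sketched here (merge semantics assumed; not necessarily the project's actual fix):

def merge_pept_dicts_tolerant(list_of_pept_dicts):
    if not list_of_pept_dicts:
        raise ValueError('Need at least one element to merge.')
    if len(list_of_pept_dicts) == 1:
        # A small FASTA may yield a single block; nothing to merge
        return list_of_pept_dicts[0]
    new_pept_dict = list_of_pept_dicts[0]
    for pept_dict in list_of_pept_dicts[1:]:
        for key, value in pept_dict.items():
            # assuming values are lists of protein indices, as used elsewhere in fasta.py
            new_pept_dict.setdefault(key, []).extend(value)
    return new_pept_dict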

Python versions

For the CI main, we apparently use Python 3.6, while the sample files use Python 3.8.
This is very strange and turns out to mess up some things with the CI, since dependencies might differ. Probably we should just make a single install script where e.g. a fixed conda setup (updates are relatively unreliable otherwise) installs Python and alphapept consistently. Calling this install script equally in all the .github/workflows should then be far more consistent.

Branch cleanup

There are quite some "stale" branches, and some of the HDF branches were only merged and not properly deleted after merging/pulling.
I am not sure how it works if multiple people work on the same branch, but we should probably try to keep the github as clean as possible. The stale branches can probably be deleted, unless someone is actively working on them?
Merged/pulled branches can probably be deleted easily with a "merge and delete" instead of just merging, although I am not sure how good this is if other contributors are still working on such a branch locally...

For what it is worth, I only consider the following branches active and all others can be deleted:

  • master
  • develop
  • readability
  • all dependabot branches

Dependabot

Dependabot currently pushes to master. This means that develop is behind master, which should never happen normally. Perhaps it is best to update dependabot to push to develop, so that master is always the "correct release version"?

Installation security block

Describe the bug
The AlphaPept installation is cancelled after the progress bar has already reached the end. A security warning pops up, as shown in the screenshot. The error occurs both as user and as administrator.

To Reproduce
Steps to reproduce the behavior:

  1. Start the installation

Expected behavior
Installation should result in AlphaPept being installed.

Screenshots

Screenshot 2021-05-10 at 10 06 31

Version (please complete the following information):

  • OS: Windows 10 Pro
  • Version: 20H2, OS build 19042.804
  • Installation Type One-Click Installer

Naming convention

I revised the settings file to have some more consistency:

Please find attached the current set of options:
I roughly tried to sort the categories as follows:

calibration: Related to calibration
experiment: The experiment details
fasta: related to creating a theoretical database from fasta
features: related to feature finding
general: general workflow settings
misc: everything else
quantification: related to quantification options
raw: related to handling raw files
search: search options

All options are defined in the settings_template.yaml. This is what is used to create the user interface. It defines the type, some min/max and default values and also a brief description.

As there is always a lot of debate about the naming convention, I would be happy to collect ideas on which options should be renamed and which ones you find missing.

Issue of get_isoforms with itertools.product

For phosphorylation, as an example, there may be many phospho sites on a sequence, resulting in a lot of isoforms. If max_isoforms is large enough, there may be too many isoforms to consider; if max_isoforms is too small, modifications on left-side AAs may be excluded due to the behavior of itertools.product.

from alphapept.fasta import get_isoforms  # assumed import; isoform generation lives in the fasta module

seq = "SBCDMFSSSSSSSSSSSMMFD"  # maybe more phosites when sequence gets longer
mod_dict = dict(zip("S,M".split(","), "(ph)S,(ox)M".split(",")))

max_isoforms = 1000000
peplist = []
from time import perf_counter
start = perf_counter()
for i in range(1000):
    peptides = get_isoforms(mod_dict, seq, max_isoforms)
    peplist.extend(peptides)
end = perf_counter()
print(peptides[:100])
import sys
print(f'Memory usage and running time for 1000 repeats: {sys.getsizeof(peplist)/10**6:2f} MB, {end-start:2f} s, # of isoforms without repeat: {len(peptides)}')

Outputs:

['SBCDMFSSSSSSSSSSSMMFD', 'SBCDMFSSSSSSSSSSSM(ox)MFD', 'SBCDMFSSSSSSSSSSS(ox)MMFD', 'SBCDMFSSSSSSSSSSS(ox)M(ox)MFD', 'SBCDMFSSSSSSSSSS(ph)SMMFD', 'SBCDMFSSSSSSSSSS(ph)SM(ox)MFD', 'SBCDMFSSSSSSSSSS(ph)S(ox)MMFD', 'SBCDMFSSSSSSSSSS(ph)S(ox)M(ox)MFD', 'SBCDMFSSSSSSSSS(ph)SSMMFD', 'SBCDMFSSSSSSSSS(ph)SSM(ox)MFD', ..., 'SBCDMFSSSSSS(ph)SS(ph)S(ph)S(ph)S(ox)MMFD', 'SBCDMFSSSSSS(ph)SS(ph)S(ph)S(ph)S(ox)M(ox)MFD', 'SBCDMFSSSSSS(ph)S(ph)SSSSMMFD', 'SBCDMFSSSSSS(ph)S(ph)SSSSM(ox)MFD', 'SBCDMFSSSSSS(ph)S(ph)SSSS(ox)MMFD', 'SBCDMFSSSSSS(ph)S(ph)SSSS(ox)M(ox)MFD']
Memory usage and running time for 1000 repeats: 271.614064 MB, 14.134513 s, # of isoforms without repeat: 32768

Deinstall

Describe the bug
If I uninstall AlphaPept, this does not delete the .alphapept folder.
Should this be the case?

Git Config Issues

Hi,
The current git configs seem to be out of order. Please make sure that your git config is set correctly and that you are using the correct credentials.

Screen Shot 2020-11-18 at 14 54 19

I am planning on re-writing the git history so that we all are consistent.
For now, I will change @ibludau and my username. @swillems and @jalew188 should I change yours as well? To the biochem email or the personal?

Installation on macOS Big Sur with M1 ARM chip

Describe the bug
I want to install alphapept on my MacBook with an M1 chip (ARM). This does not work out of the box. I want to share my learnings and open issues here so others can be more efficient in solving them as new dependency versions become available.

Several packages will not work, such as numba, PyQt5 and pythonnet. One error message (regarding numba) is:

FileNotFoundError: [Errno 2] No such file or directory: 'llvm-config'

To Reproduce
Steps to reproduce the behavior:

conda create --name alphapept python=3.8
conda activate alphapept
cd
git clone https://github.com/MannLabs/alphapept.git
cd alphapept
pip install -r requirements.txt
pip install .

Expected behavior
The installation runs through without an error, and within the Python console I can run:

import alphapept

Version (please complete the following information):

  • OS: macOS Big Sur 11.2.2
  • Version: Current version (commit 1060009)
  • Installation Type: pip (see above)

Bug: Duplication of functions in `io` module

Greetings,

While reviewing the AlphaPept code, I noticed that the io.py module has a number of functions which appear to be repeated, as noted below.

  1. Functions for loading Thermo data -- looks like this relates to the switch away from pymsfilereader, since it's no longer a required installation
  2. Functions for extracting mzML information -- these have exactly the same name but different code, so I'm assuming the second one supersedes the first. But that's potentially a nasty bug.
  3. Functions for reading mzML information

GUI speed

The GUI seems extremely slow compared to the CLI (thermo quick test approx. 4 min instead of <2 min) on a MacBook. Am I the first/only one to notice this, or is this a common issue/bug?

Current ToDo

Things that need to be done for the Alpha Release (v.0.4.0)

Major

  • DIA library export
  • Upgrade GPU FF to CPU
  • Include GPU CI/CD
  • Test unspecific search again

Minor Bugs

  • Unknown AAs -> What to do with this?
  • Update Runner notebooks

Stability

  • Include a check that when running a workflow the requirements are met

Bugs

  • Revise Quantification (@jalew188, @straussmaximilian ) -> Upgrade split level
  • Bug in Testrunner 2020-10-26 00:18:05 INFO - Numba version 0.50.1 -> Something is wrong here

GUI

  • Make check settings button usable (i.e. this setting needs to be changed) or remove
  • Fix speed issue (why is process slower than cmd) #116
  • Get FileWatcher to work again
  • Fix progress bar
  • Make data preview happen in GUI again

Installer

  • Loading of ext does not work -> Check installed version
  • Auto upload for proper release cycle

Due Diligence

  • Go over Documentation
  • Check that all settings are being used
  • Revise logging output
  • Revise naming convention (peptides / precursors / psms, m_offset <> m_tol -> ms1_tolerance, ms2_tolerance)
  • Version: When are we calling what? version ?
  • Clean up Performance plots
  • Clean up Git Config
  • Clean up Branches

CI / CD

  • CI Test for Windows, Linux and Mac #134
  • Perform Integration Test
  • Include Styleguide Tests

Future

Features

  • PTM Support #39

Code Stability

  • Test Coverage
  • Include Bound checks in functions to make usage more stable (i.e. error when having negative intensity) #58

Isobaric Labels

  • TMT <> EasiTag <> Silac
  • Second Peptide Search: Shared fragments etc

Quantification

  • Constantins Quantification

Performance

  • Optimal parameters for Bruker FF
  • Speed up CLI

Installer Compilation

The current route to create an installer on Windows is to use pyinstaller and inno setup.
While the pyinstaller script (create_installer.bat) runs through on some machines, it does not work on the self-hosted runner.

In general, we should create an installation routine that is platform-specific, as PyInstaller does not support cross-platform compilation.

Also, the current installer script create_installer.bat does not test intermediate steps (e.g. was pyinstaller successful?), so we could end up with an exe installer that installs something that does not work; intermediate testing steps are therefore necessary.

This is particularly relevant for the GUI, as PyQT can cause problems, but a GUI is not that straightforward to test.

This issue will be linked to a project so that we can track the ToDo.

Compatibility of workflow steps

Describe the bug
Currently, people can select 'lfq_quantification' without ever having imported or searched any file. It would be important to link the selection of workflow steps to whether an hdf file with the required information is already present.

Importantly, in these cases the settings files need to be saved differently, to ensure that users can still trace how the final results were created in case multiple different settings were used and stacked on top of each other.

Results display gets messed-up when multiple settings were used on the same folder

Describe the bug
If you run a second analysis on the same folder as a previous analysis, the result files are overwritten. In the GUI, however, you can still select the yaml of the initial analysis and see the initial parameters, but with non-matching results in the tables.

To Reproduce
Steps to reproduce the behaviour:

  1. Run one analysis with FDR=1
  2. Start a second analysis with FDR=0.01 in the same folder (rename the yaml file)
  3. Go to results tab and select the yaml of the first analysis
  4. Select the protein FDR table and sort by FDR >> the maximum FDR is 0.01, but if you scroll up to 'run log' it states FDR=1

Expected behavior
I think it would be good to append the name of the yaml file to all result files as a general suffix for an analysis. This way, alternative analysis results can be stored in the same folder.
Alternatively, if only one set of results should be available per folder, I would suggest restructuring the results panel so that the results.yaml settings are shown, and not whatever settings were saved by the user at some point: they might not match the shown results.

Version (please complete the following information):

  • OS: Windows 10 Pro
  • Version: 20H2, OS build 19042.804
  • Installation Type: One-Click Installer

MacOS gets different results than Windows for the same raw file.

I used alphapept import to extract MS1 and MS2 data on Windows, and then ran alphapept workflow to identify from the extracted ms_data. I got exactly the same identifications as when running the whole workflow on Windows. So the bug should be in pyrawfilereader: the same code behaves differently on Windows and macOS.

Starting page for the GUI

General idea
I would like to suggest a starting page for the GUI which briefly introduces AlphaPept and provides a mini-overview of what can be done and how. This could also be the place to provide a download button for the detailed user guide that's currently in the making. I think this would be good especially for people using AlphaPept for the first time, when there is nothing to see or use on the 'status' page. What do you think?

Inconsistency between workflow selection and additional settings

Is your feature request related to a problem? Please describe.
I find it a bit unintuitive that you first select workflow steps, but can then still choose settings for parts of the workflow that you didn't select.

Screenshot 2021-05-19 at 15 38 37

Here, matching is not selected in the workflow, but I can still adjust parameters for it.

Describe the solution you'd like
Would it be possible to restrict the settings to the ones that are actually relevant for the workflow steps that were chosen?

Weird error in search.py

I used a wrong raw file (a DIA raw file) while testing, and a weird error appeared, shown below. It may be because some spectra are empty in the DIA run. Should we check whether idxs_lower and idxs_higher are empty during the run? (See the sketch after the traceback.)

ValueError                                Traceback (most recent call last)
 in 
      1 from alphapept.runner import run_alphapept
----> 2 run_alphapept(params)

~/opt/anaconda3/lib/python3.8/site-packages/alphapept/runner.py in run_alphapept(settings, callback)
    140             cb = callback
    141 
--> 142         fasta_dict, pept_dict = search_parallel_db(settings, callback=cb)
    143 
    144     else:

~/opt/anaconda3/lib/python3.8/site-packages/alphapept/search.py in search_parallel_db(settings, calibration, callback)
   1062         file_npz, settings_ = to_process[0]
   1063         settings_['search']['parallel'] = True
-> 1064         search_db((file_npz, settings_))
   1065     else:
   1066         with Pool(n_processes) as p:

~/opt/anaconda3/lib/python3.8/site-packages/alphapept/search.py in search_db(to_process)
   1022             features = pd.read_hdf(base+'.hdf', 'features')
   1023 
-> 1024         psms, num_specs_compared = get_psms(query_data, db_data, features, **settings["search"])
   1025         if len(psms) > 0:
   1026             psms, num_specs_scored = get_score_columns(psms, query_data, db_data, features, **settings["search"])

~/opt/anaconda3/lib/python3.8/site-packages/alphapept/search.py in get_psms(query_data, db_data, features, parallel, m_tol, m_offset, ppm, min_frag_hits, callback, m_offset_calibrated, **kwargs)
    249     idxs_lower, idxs_higher = get_idxs(db_masses, query_masses, m_offset, ppm)
    250     frag_hits = np.zeros(
--> 251         (len(query_masses), np.max(idxs_higher - idxs_lower)), dtype=int
    252     )
    253 

<__array_function__ internals> in amax(*args, **kwargs)

~/opt/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
   2665     5
   2666     """
-> 2667     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
   2668                           keepdims=keepdims, initial=initial, where=where)
   2669 

~/opt/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     88                 return reduction(axis=axis, out=out, **passkwargs)
     89 
---> 90     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
     91 
     92 

ValueError: zero-size array to reduction operation maximum which has no identity
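A defensive sketch for the offending np.max call (illustrative; the actual fix may differ):

import numpy as np

def max_candidate_width(idxs_lower, idxs_higher):
    # Guard against runs where no query spectra / database candidates exist
    widths = idxs_higher - idxs_lower
    return int(widths.max()) if widths.size > 0 else 0

print(max_candidate_width(np.array([], dtype=int), np.array([], dtype=int)))  # -> 0 instead of ValueError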

Documentation not working

Describe the bug

The documentation page build is failing. NBDEV documentation is best tested by installing Jekyll and running it locally (works best on Mac). There is a tutorial here.
