kusterlab / curve_curator

Analysis platform for large-scale dose-dependent data

License: Apache License 2.0

Makefile 0.03% Python 99.97%

analysis bioinformatics curve-fitting dashboard dose-response p-value proteomics statistics decryptm viability

curve_curator's Issues

Bokeh version ~3.4 changed scatterplot API

Bokeh version ~3.4 changed the scatter plot API. As a result, all data points disappear upon selection.
https://docs.bokeh.org/en/latest/docs/user_guide/basic/scatters.html#ug-basic-scatters
https://docs.bokeh.org/en/latest/docs/releases.html

Adapt the dashboard to these changes. This will be covered in 0.4.1.

To prevent older versions from failing, pin the bokeh version in 0.4.0.

Also adapt this in the poetry builds.
https://python-poetry.org/docs/configuration/

Incorrect number of non-missing values is calculated during thresholding when multiple controls are present

Make sure the correct number of missing values is calculated during thresholding.
Change the assertion error to a ValueError so that the error is more meaningful.

Switch to tomllib standard library

A TOML parser (tomllib) is part of the standard library starting from Python 3.11. It can replace the toml package and simultaneously enable better TOML syntax, e.g., inf values.

https://docs.python.org/3/library/tomllib.html#module-tomllib

This raises the minimum required Python version to 3.11.

Bug: Incorrect R2 calculation

The total sum of squares is calculated incorrectly.
It should be:

ss_total = np.sum((y - np.mean(y)) ** 2)
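
For context, a minimal sketch of how this term enters the standard R2 definition (1 - SS_res / SS_total). The function name r_squared is an assumption for illustration and is not the CurveCurator method:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_total."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_total = np.sum((y - np.mean(y)) ** 2)  # total sum of squares around the mean
    return 1.0 - ss_res / ss_total

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                             # prediction equals the data
mean_only = r_squared(y, np.full_like(y, y.mean()))   # prediction is just the mean
```

A perfect fit gives 1 and a mean-only prediction gives 0. Note that ss_total is 0 for constant y, so that case needs separate handling.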

Add a spectronaut parser

Protein and Peptide from DIA should be fine for now...

Only fixed params of logistic model are used for calling the core method after fitting

Here the fixed parameters are used even though a model has been fitted, which then raises an error:

curve_curator/curve_curator/models.py

Line 864 in b87b212

def build_model(self):

Example:

# data
line_x = np.linspace(-10.3, -4.3, 1000)
x = np.log10([0.1, 1.0, 10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0, 10000.0, 30000.0]) - 9
y = pd.Series([1, 1, 4, 6, 7, 16, 55, 105, 147, 160])


# Define the logistic Model
logistic_model = LogisticModel()

# Fit the unrestricted model with ordinary least squares (ols)
logistic_model.find_best_guess_ols(x, y)
logistic_model.fit_ols(x,y)
logistic_model(line_x)

Error message:
TypeError: LogisticModel.core() missing 4 required positional arguments: 'pec50', 'slope', 'front', and 'back'

Fold change calculation

Hello!

I would like to know how to derive the estimated 'Curve Fold Change' given in the output curves file?

I have tried taking the log2-transformed average response for the samples with highest concentration, minus the log2-transformed average response for samples with lowest concentration (not control samples). Even if I get somewhat close to the output 'Curve Fold Change' value, they are not identical.

Add DIANN DIA PEPTIDE parser

Not yet implemented...

NaNs in the dashboard break automatic range adjustment after curve selection

The JavaScript function Math.max(...) expects only finite values. If any argument is NaN, it returns NaN and the range is not adjusted at all.

[Screenshots: dashboard example before and after the fix]

Remove completely empty raw columns from the input

So far, one can add empty intensity columns. This can mess up correct decoy formation and FDR estimation.

Add a filter system that removes all these empty columns and warns the user.
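
A minimal sketch of such a filter, assuming that "empty" means a raw column without a single finite value. The function name drop_empty_columns and the column names are hypothetical:

```python
import warnings

import numpy as np
import pandas as pd

def drop_empty_columns(df, raw_cols):
    """Drop raw columns that contain no finite value and warn about them."""
    empty = [c for c in raw_cols if not np.isfinite(df[c].astype(float)).any()]
    if empty:
        warnings.warn(f"Removed {len(empty)} empty raw column(s): {empty}")
    return df.drop(columns=empty), empty

# Illustrative data: "Raw 2" has no observations at all.
df = pd.DataFrame({
    "Name": ["P1", "P2"],
    "Raw 1": [10.0, 12.0],
    "Raw 2": [np.nan, np.nan],
})
cleaned, removed = drop_empty_columns(df, ["Raw 1", "Raw 2"])
```

If columns filled with zeros should also count as empty, the condition would need to be extended accordingly.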

CurveCurator

Hi I am unsure of how to interpret this error or what to do to fix it.

Uncaught exception

Traceback (most recent call last):
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\pandas\core\indexes\base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Modified sequence'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 198, in run_module_as_main
File "", line 88, in run_code
File "C:\Users.conda\envs\CurveCuratorEnv\Scripts\CurveCurator.exe_main.py", line 7, in
sys.exit(main())
^^^^^^
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\curve_curator_main.py", line 99, in main
data = data_parser.load(config)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\curve_curator\data_parser.py", line 427, in load
df = load_mq_tmt_peptides(path, search_engine_version, unique_cols=unique_cols, sum_cols=raw_cols, first_cols=first_cols, max_cols=max_cols)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\curve_curator\data_parser.py", line 156, in load_mq_tmt_peptides
df['Modified sequence'] = clean_modified_sequence(df['Modified sequence'])
~~^^^^^^^^^^^^^^^^^^^^^
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\pandas\core\frame.py", line 4090, in getitem
indexer = self.columns.get_loc(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users.conda\envs\CurveCuratorEnv\Lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 'Modified sequence'

Add Curve RMSE to the output file

The RMSE of the curve fit can be an interesting parameter for fold-change analysis.
It is currently not reported in the curves.txt file.
Make this available to the user.

Linear interpolation option only works well for non replicated data

The current interpolation only works for a single observation per dose because it fills in between adjacent data points. With replicates, this can distort the interpolated estimate when the adjacent data points are both the upper or both the lower observations at their doses.

curve_curator/curve_curator/quantification.py

Lines 274 to 282 in 01b22bc

 f_linear = interpolate.interp1d(x_data, y_data, kind='linear')
 x_linear = fit_params['x_interpolated']
 # Mask terminal missing values
 if not all(finite_values):
     x_linear = x_linear[(x_linear >= x_data.min()) & (x_linear <= x_data.max())]
 # Add the values to the fit variables
 x_fit = np.append(x, x_linear)
 y_fit = np.append(y, f_linear(x_linear))
 weights = None

Todo:

Make it a function
Make it compatible with aggregations per dose such that the linear interpolation estimate is relative to the means (robust)
Unit test it
Think about weights again and how to construct them (if possible).
Warn users that weights are ignored when interpolation is activated.
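
One possible shape for such a function, as a rough sketch only: aggregate the finite observations per dose to their mean and interpolate linearly between those means. The function name and signature are assumptions, and np.interp is swapped in for interp1d:

```python
import numpy as np

def interpolate_on_dose_means(x, y, x_new):
    """Linear interpolation between per-dose means of the finite observations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    finite = np.isfinite(y)
    doses = np.unique(x[finite])  # sorted unique doses with at least one value
    means = np.array([y[finite & (x == d)].mean() for d in doses])
    # Mask interpolation points outside the observed dose range
    x_new = np.asarray(x_new, dtype=float)
    x_new = x_new[(x_new >= doses.min()) & (x_new <= doses.max())]
    return x_new, np.interp(x_new, doses, means)

# Duplicates per dose, one missing value at the highest dose
x = [0, 0, 1, 1, 2, 2]
y = [1, 3, 2, 4, np.nan, 5]
x_lin, y_lin = interpolate_on_dose_means(x, y, [-1.0, 0.5, 1.5, 3.0])
```

Here the per-dose means are 2, 3, and 5, so the interpolated values at 0.5 and 1.5 are 2.5 and 4.0, and the points outside the dose range are dropped. A median could replace the mean if more robustness is needed.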

Add direct support for MSFragger Search engine

Make it possible for Protein and Peptide data in DDA LFQ and TMT
Add it to Readme (how to search the data) and example toml files

Replicates

Hi,

Please could you give an example of how a TOML and input file with replicates would look?
I have my control in triplicate, and my inhibitor treatments in duplicates.
If possible I want to do one curve fit based off the replicates without averaging them prior to the analysis.
Is it possible to generate a curve with the both duplicates present and error bars?
Also, is there any way to change the size of the text on the axis of the generated figures? It is very small.

Thank you

Add max_imputation parameter

Make the maximal imputation parameter available to the user. For backward compatibility, the default value is equivalent to the max missing value.
Report how many curves were filtered out because they contained too many imputed values.
Add this to toml file parameter and readme

Report the number of curves that cannot be processed due to missing values in the controls

Report the number of curves that cannot be processed due to missing values in the controls.

Filter out those all-NaN curves
Report the number of affected rows
Terminate CurveCurator if no curves are left for processing

In generic mode, no duplicates are aggregated

Currently, using the generic matrix upload, the "Name" duplicates are not aggregated.

TOML:

measurement_type = 'OTHER'
data_type = 'OTHER'
search_engine = 'OTHER'

However, they should be aggregated to enforce the "Name" column being unique to enable the three different modes of aggregation of viability data. Make this also clearer with more elaborate example data (CYL viability).

Correctly transfer missing values during the median normalization step

Currently, missing values are changed to 0 after median normalization, which is undesired behavior and an unexpected imputation step.

NaNs should be conserved and dealt with elsewhere.

PD Example question

Good morning,

First of all congratulations on this useful tool :)

I have data output from PD with triplicates. While looking into your TOML examples, I noticed there's an example for triplicates, but unfortunately, I couldn't find any example specific to PD. Could you give an example for PD output data? I've been trying to implement it but haven't succeeded so far. Any help would be greatly appreciated!

Thank you,
Paula

PD: For the control setting on TOML, should we add all the sample names of the control curve or only the 0 value?

Bug: Incorrect number of return columns for N <= 4

CurveCurator breaks if the data has N <= 4 data points because the wrong number of NaN values is returned. It returns 19 but expects 20.

https://github.com/kusterlab/curve_curator/blob/58ab06f48147320872232c677f0d7cd3e7039b56/curve_curator/quantification.py#L270C21-L270C30

High degree of imputed data can cause while loop trap

Some users got trapped in a while loop during decoy simulation because there was a high degree of data imputation, leading to 0 variance, which in turn made the min_noise threshold 0 and prevented a successful while loop exit.

This can be fixed by removing 0 variance estimates from the empirical noise distribution.
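
A minimal sketch of that filtering step. The function name and the example values are hypothetical and only illustrate the idea of excluding zero-variance estimates before deriving a noise threshold:

```python
import numpy as np

def clean_noise_distribution(variances):
    """Keep only finite, strictly positive variance estimates."""
    v = np.asarray(variances, dtype=float)
    return v[np.isfinite(v) & (v > 0)]

# Heavily imputed curves yield variance 0; these must not enter the distribution.
variances = [0.0, 0.04, 0.0, 0.09, np.nan]
noise = clean_noise_distribution(variances)
min_noise = noise.min()
```

With the zeros removed, the smallest remaining estimate is strictly positive, so a threshold derived from it cannot collapse to 0.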

New PD output has no flanking amino acids

New ProteomeDiscoverer (3.1.0.622) output files do not show the flanking amino acids in the Annotated Sequence column, i.e. APEPTIDE instead of [K].APEPTIDE.[R] previously. This results in an error in the following line:

curve_curator/curve_curator/search_engine_outputs/ProteomeDiscoverer.py

Line 36 in 0760d7d

 df['Modified sequence'] = df['Modified sequence'].str.split('.', expand=True)[1].str.upper() 

Should be an easy fix to check if the flanking amino acids exist before removing them.
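
One way this check could look, as a sketch rather than the actual fix: strip an optional "[X]." prefix and ".[X]" suffix with a regular expression, so sequences without flanking residues pass through unchanged. The helper name strip_flanking is hypothetical:

```python
import re

# Matches an optional "[X]." prefix or an optional ".[X]" suffix.
FLANKING = re.compile(r"^\[[^\]]*\]\.|\.\[[^\]]*\]$")

def strip_flanking(annotated_sequence: str) -> str:
    """Remove flanking amino acids if present and upper-case the sequence."""
    return FLANKING.sub("", annotated_sequence).upper()

old_style = strip_flanking("[K].APEPTIDE.[R]")  # older PD format
new_style = strip_flanking("APEPTIDE")          # PD 3.1 format without flanks
```

On a pandas column, the same pattern could be applied with .str.replace(pattern, '', regex=True) followed by .str.upper().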

Apply the max_missing parameter without the controls

There are situations where many controls are used. This can undermine the purpose of the max_missing parameter. As the controls are aggregated anyway, it makes sense to apply the parameter only to the observations covering the dose range and not to the controls.

Correct index update from string selection with newer bokeh version

With the newer bokeh version, one had to double-click to re-select other dots via text search.

This was caused by the re-selection not being triggered by:

source.change.emit()

The trick is to correctly emit selection changes with:

source.selected.change.emit()

which invokes in turn the table update via:

source.selected.js_on_change('indices', CustomJS(args=dict(source=source, view=source_view_table), code=code))

Add negative value warning and clipping option

CurveCurator expects all values to be positive (since a negative signal is also impossible).

2 new features:

If negative values are present in the ratio data, warn the user and count how many are present in the data.
Give the option to clip negative (and positive) values to a pre-specified range as a new optional parameter to the TOML file.
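
A rough sketch of both features together. The function name and the lower/upper arguments are assumptions and do not correspond to existing TOML parameters:

```python
import numpy as np

def count_and_clip(ratios, lower=None, upper=None):
    """Count negative values and optionally clip to [lower, upper]; NaNs are kept."""
    ratios = np.asarray(ratios, dtype=float)
    n_negative = int(np.sum(ratios < 0))  # NaN < 0 evaluates to False
    if lower is None and upper is None:
        return ratios, n_negative  # clipping is optional
    return np.clip(ratios, lower, upper), n_negative

ratios = [-0.5, 0.2, np.nan, 3.0]
clipped, n_negative = count_and_clip(ratios, lower=0.0, upper=2.0)
```

The count is taken before clipping so that the warning reflects the original data, and missing values are left untouched.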

Improve unit testing for basic model functions

Add test suites for these methods for both sigmoidal and mean models:

AUC
Residual
SSE
R2
RMSE

Warning message for decoys incorrectly triggered

A warning message is triggered if fewer than 10k decoys are generated, but the warning message warns that there are fewer than 1k decoys generated:

curve_curator/curve_curator/data_simulator.py

Lines 166 to 167 in 0760d7d

 if n < 1e4:
     ui.warning(f' * Less then 1k decoys may result in inaccurate FDR estimations. Consider increasing the decoy_ratio parameter.')
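
One way to keep the condition and the message in sync is to derive both from a single constant. The sketch below assumes 10k is the intended threshold, which the issue does not state explicitly, and returns the message instead of calling ui.warning so that it is self-contained:

```python
from typing import Optional

MIN_DECOYS = 10_000  # assumed intended threshold (the condition uses 1e4)

def decoy_warning(n: int, min_decoys: int = MIN_DECOYS) -> Optional[str]:
    """Return a warning text if too few decoys were generated, else None."""
    if n < min_decoys:
        return (f" * Fewer than {min_decoys // 1000}k decoys may result in inaccurate "
                "FDR estimations. Consider increasing the decoy_ratio parameter.")
    return None

few = decoy_warning(5_000)
enough = decoy_warning(10_000)
```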

Bug: aggregation behavior of pandas is changing NaNs to 0

The pandas aggregation function silently converts NaNs to 0.

https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html

Fix this by introducing the min_count parameter.
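
A small demonstration of the behavior and of the effect of min_count, using made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Raw 1": [1.0, 2.0, np.nan, np.nan],
})

# Default: a group consisting only of NaNs sums to 0
default_sum = df.groupby("Name")["Raw 1"].sum()

# min_count=1: at least one valid value is required, otherwise the result is NaN
nan_safe_sum = df.groupby("Name")["Raw 1"].sum(min_count=1)
```

Group "B" has no valid observation, so it sums to 0 by default but stays NaN with min_count=1, while group "A" sums to 3 in both cases.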

Make dashboard (i.e. Gene, Name) search case insensitive

Dear Florian Bayer,

I would be highly grateful if you could make the search function in the dashboard case INsensitive. My pinky already hurts from pressing down the shift key.

Thanks a lot!

Stephan Eckert
