mahnoorngondal / itfsc

integrated transcription factor analysis for single cell data

License: MIT License

itfsc's Introduction

iTFSC

Integrated transcription factor analysis for single cell data (iTFSC) is an R package designed to consolidate transcription factor (TF) information across multiple tools to arrive at a more robust list of TFs. The package also supports downstream analysis and visualization, including differential expression analysis. The package will be validated using four different cancer types.

Requirements

To run this package, install and load the following dependencies (please use R version 4.1.1):

library(Seurat)
library(SeuratDisk)
library(SCENIC)
library(BITFAM)
library(dorothea)
library(piano)
library(ggplot2)
library(dplyr)
library(tidyr)
library(AUCell)
library(RcisTarget)
library(GENIE3)
library(base)
library(tibble)
library(ComplexHeatmap)
library(ggVennDiagram)
library(reshape2)
library(ggpubr)

If you have any issues installing SCENIC, please follow the installation guide at this link: http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Setup.html

Project detail:

Description: develop an integrated transcription factor analysis tool for single-cell and bulk data. The tool will combine 4-5 existing transcription factor tools for single-cell data (e.g., SCENIC, DoRothEA, BITFAM) to give users a transcription factor probability generated by the combined analysis. One way to select the best transcription factors is simply to extract those most commonly reported across the tools. Another is to use differential expression to choose the best transcription factors across cell types and find common or high-probability ones. I am doing something similar for my research project, and I have always thought a package or tool that does this for me would be helpful. The main idea is to ensure that the transcription factors we obtain are the ones actually involved, which is assessed through reproducibility across tools and through other downstream analyses.
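The "most common transcription factor" idea can be sketched in base R as below. This is a minimal illustration, not the package's implementation: the TF names are placeholders, the per-tool lists are made up, and the threshold `min_tools` is an assumed parameter.

```r
# Placeholder TF lists, one per tool (illustrative values only, not real results).
tf_lists <- list(
  SCENIC   = c("TF_A", "TF_B", "TF_C", "TF_D"),
  DoRothEA = c("TF_A", "TF_B", "TF_E"),
  BITFAM   = c("TF_A", "TF_C", "TF_B", "TF_F")
)

# TFs reported by every tool (strict intersection).
common_tfs <- Reduce(intersect, tf_lists)

# Count how many tools report each TF (each tool votes once per TF).
votes <- table(unlist(lapply(tf_lists, unique)))

# Relaxed consensus: keep TFs reported by at least `min_tools` tools.
min_tools <- 2  # assumed threshold
consensus_tfs <- names(votes)[votes >= min_tools]

print(common_tfs)
print(consensus_tfs)
```

The strict intersection is the most conservative choice; the vote threshold trades some stringency for recall when one tool misses a TF.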

Features: the tool will include the following features:

integrated and fast TF analysis using 4-5 existing tools
extract common TFs generated from all tools
differential expression between cell types using limma on the output of the results from different tools
GSEA on the results from differential expression analysis
(if time) Apply the tools for bulk data (given there are at least 100 patients)
(if time) use the transcription factors for the deconvolution of bulk data
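The limma step in the feature list could look roughly like the sketch below. It assumes a TF activity matrix with TFs in rows and cells in columns; the matrix here is simulated, and the names `activity`, `cell_type`, `TypeA`, and `TypeB` are hypothetical. Only standard limma calls (`lmFit`, `makeContrasts`, `contrasts.fit`, `eBayes`, `topTable`) are used.

```r
library(limma)

set.seed(1)

# Simulated TF activity scores: 10 TFs x 40 cells (illustrative values only).
n_cells <- 40
activity <- matrix(
  rnorm(10 * n_cells), nrow = 10,
  dimnames = list(paste0("TF", 1:10), paste0("cell", seq_len(n_cells)))
)
cell_type <- factor(rep(c("TypeA", "TypeB"), each = n_cells / 2))

# Make TF1 clearly more active in TypeB so there is something to detect.
is_b <- cell_type == "TypeB"
activity["TF1", is_b] <- activity["TF1", is_b] + 3

# One design column per cell type, no intercept.
design <- model.matrix(~ 0 + cell_type)
colnames(design) <- levels(cell_type)

# Fit the linear model, define the contrast, and moderate the statistics.
fit <- lmFit(activity, design)
contrast <- makeContrasts(BvsA = TypeB - TypeA, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Full ranked table of TFs for the TypeB vs TypeA comparison.
de_results <- topTable(fit2, coef = "BvsA", number = Inf)
print(head(de_results))
```

The ranked table from `topTable` is the kind of input a subsequent GSEA step would consume.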

Example data

The RDS file for these datasets can be downloaded from here: https://drive.google.com/drive/folders/1WL0TxDAQpPGzmGy8gltT-x-ezSw6Ndh1?ths=true

How the user will run the example data:

Download the data from the Example data section (link above).
You can use the example data directly in the functions. No further processing is required because the data has already been processed with the standard Seurat pipeline. To convert your own data into a standard Seurat object, please refer to this tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
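Loading a downloaded RDS file might look like the sketch below. The helper `load_example_data` and the file name in the usage comment are hypothetical and not part of the package; the helper only reads the file and warns if the object is not a Seurat object.

```r
# Hypothetical helper: read an RDS file and check that it holds a Seurat object.
load_example_data <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  obj <- readRDS(path)
  if (!inherits(obj, "Seurat")) {
    warning("Object is not a Seurat object; convert it before running the analysis.")
  }
  obj
}

# Usage (file name is an assumption; use the path of the file you downloaded):
# seurat_obj <- load_example_data("example_data.rds")
```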

Expected results:

a robust list of TFs generated using three methods for extracting transcriptional activity scores from single-cell data
we will also provide a Venn diagram to show how many common TFs exist
we will also use heatmaps, as depicted in the workflow image below, to show each individual method's output

itfsc's Issues

Add high level documentation

Add parallel processing

design document update

Is your feature request related to a problem? Please describe.
update design document

Describe the solution you'd like
make biorender figure

Describe alternatives you've considered
Use .drawio instead

Additional context
None

Monica - Feedback

WBS Document:
Delegates tasks in a temporal manner with actionable items. The tasks are split into activities that are feasible and accomplishable.
Makes use of multiple packages (SCENIC, R Shiny, BITFAM, DoRothEA)
Details the output of each analysis with user-friendliness in mind
Data visualizations are also taken into account
Comment: Workflow is clear and concise, activities are split into manageable tasks.

Materials:
Working with scRNA-seq data, specifically 4 datasets (breast cancer, colon cancer, lung cancer, ovarian cancer).
Three different software requirement specification documents. Document SRS doc 3 is the most detailed document, introducing the software package and its utilization for a user.

Comments:
Very well thought out project, WBS document and SDS documents are informative and clear.

I recommend making a single document listing the necessary inputs for all the different packages you will run, along with their outputs. This will enable the user to identify the data types needed to use your package and ascertain what outputs they will be comparing. This is a small detail, but a one-page takeaway might help users feel prepared to begin analysis if they can check what they need.
Progress has been made on subsetting the data, writing a function, and testing that it works.
Are there transcription factors that are known to be dysregulated across all cancers (cancer agnostic) and some that are specific to each cancer you are investigating? Very cool to test the accuracy of the different transcription factor inference programs and converge on the output that is the same across programs (computational). From a biology perspective, are you able to identify a transcription factor network that is common to all 4 cancer datasets you are using (a cancer transcription factor network signature)? Potentially look into ChIP-seq data as a validation step for the findings?

Code and Test File:
R code: reading in data, normalization of data, initial run of BITFAM
Test code: checking that raw scRNA-seq counts are present and that normalization of the data was performed

The goals of this project can be achieved and progress has been made :)

unit test needs to import function from code

The current unit test file cannot find the functions (Normalization_check and Rawcount_check) it needs to run. They will need to be sourced from the other file. :)
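A minimal sketch of sourcing the functions before testing them follows. The function bodies and the file name `qc_functions.R` are assumptions made for illustration, since the repository's actual definitions are not shown here. The file is written to a temporary directory only so the example is self-contained.

```r
# Write hypothetical versions of the two functions to a separate file.
functions_file <- file.path(tempdir(), "qc_functions.R")
writeLines(c(
  "# Assumed definition: raw counts are non-negative whole numbers.",
  "Rawcount_check <- function(counts) {",
  "  is.numeric(counts) && all(counts >= 0) && all(counts == round(counts))",
  "}",
  "# Assumed definition: normalized values differ from the raw counts",
  "# and are no longer all whole numbers.",
  "Normalization_check <- function(normalized, counts) {",
  "  !identical(normalized, counts) && any(normalized != round(normalized))",
  "}"
), functions_file)

# In the unit test file: load the functions before calling them.
source(functions_file)

# Small toy matrix (genes x cells) and a log-normalized version of it.
counts <- matrix(c(0, 3, 1, 5, 2, 0), nrow = 2)
normalized <- log1p(t(t(counts) / colSums(counts)) * 1e4)

stopifnot(Rawcount_check(counts))
stopifnot(Normalization_check(normalized, counts))
```

In the real project the `writeLines` step would not be needed; a single `source()` call at the top of the test file, pointing at the existing code file, would be enough.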

cell quality check unit test

Cell quality control could be added to the unit test:
Cell quality control can be done using metrics such as the number of reads per cell, the number of genes detected per cell, and the percentage of reads mapping to mitochondrial genes. Whether it has been completed can be checked by setting a modest range for each metric and then checking that every cell falls within those ranges.
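The three metrics can be computed and checked in base R as sketched below. The thresholds, the toy count matrix, and the helper name `cell_quality_check` are illustrative assumptions; the `^MT-` pattern follows the usual convention for human mitochondrial gene symbols.

```r
# Toy count matrix: genes in rows, cells in columns (illustrative values only).
counts <- matrix(
  c(10, 8, 6, 1,   # good_cell
     1, 0, 0, 0,   # low_cell
     2, 1, 1, 6),  # high_mt_cell
  nrow = 4,
  dimnames = list(c("G1", "G2", "G3", "MT-ND1"),
                  c("good_cell", "low_cell", "high_mt_cell"))
)

# Per-cell QC metrics.
mito_rows <- grepl("^MT-", rownames(counts))
qc <- data.frame(
  n_counts   = colSums(counts),
  n_genes    = colSums(counts > 0),
  percent_mt = 100 * colSums(counts[mito_rows, , drop = FALSE]) / colSums(counts)
)

# Hypothetical helper: TRUE for cells inside the accepted ranges.
cell_quality_check <- function(qc, min_counts, min_genes, max_percent_mt) {
  pass <- qc$n_counts >= min_counts &
    qc$n_genes >= min_genes &
    qc$percent_mt <= max_percent_mt
  pass[is.na(pass)] <- FALSE  # cells with zero counts fail
  setNames(pass, rownames(qc))
}

passed <- cell_quality_check(qc, min_counts = 5, min_genes = 2, max_percent_mt = 20)
print(qc)
print(passed)
```

A unit test could then assert `all(passed)` on the filtered object to confirm that quality control was applied.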

SCENIC implementation

I need assistance implementing SCENIC in my analysis.
SCENIC is the last method I want to add to my code, and I need help writing the code to run it on my dataset.