Decline_Package

1. Background

The optimal approaches for studying lung function decline, particularly in the setting of omic-scale predictors, are not known. The goal of this research project is to test different statistical models (e.g. slope, GEE, GLM) in simulated and real-world cohorts (population and case-control) in order to compare Type 1 and Type 2 error, identify factors leading to heterogeneity, and provide a modeling foundation for future omics data.

Aims

Develop a population of simulated spirometry data with a) autocorrelation estimated from real-world data, b) four groups representing normal, lower baseline, rapid decline, and both lower baseline and rapid decline, and c) both linear and quadratic decline.
(This code package) Test a set of models in real-world data: a) simple slope models; b) LMM with random intercept, slope, or both; c) GEE; d) age, time; e) quadratic or linear terms (for underlying linear or quadratic decline). We will look at the effects of smoking (as a positive control) and selected SNPs.
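
To make item a) concrete, a simple slope model can be read as a two-stage approach: fit one least-squares line per individual, then relate the fitted slopes to a predictor. The base-R sketch below covers only the first stage, using toy values copied from the example data in section 5. It is an assumption about what "simple slope" means here, not the code in Analysis_MAIN.R, and `fev1_slope` is a hypothetical helper name.

```r
# Toy long-format data (values copied from the example table in section 5)
dat <- data.frame(
  IID              = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  pre_fev1         = c(2771, 2500, 2450, 3510, 3450, 3320, 3220, 2570, 2600, 2540),
  timefactor_spiro = c(0, 4, 10, 0, 9, 12, 17, 0, 6, 8)
)

# Hypothetical helper: least-squares FEV1 slope (mL per year) for one individual
fev1_slope <- function(d) {
  fit <- lm(pre_fev1 ~ timefactor_spiro, data = d)
  coef(fit)[["timefactor_spiro"]]
}

# One slope per IID; the names of the result are the IID values
slopes <- sapply(split(dat, dat$IID), fev1_slope)
print(round(slopes, 1))
```

A second stage would then regress `slopes` on a baseline predictor such as smoking_status_base or one of the rs columns.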

Authorship

We anticipate up to 4 authors per cohort, with additional authors as required for writing or analysis.

Methods

Prepare the cohort dataset as requested below. We require pre_fev1 in mL, age, sex, height, race, smoking status, pack-years, SNP allele frequency, fev1pp (you can use GLI global or values previously calculated for your cohort), fev1/fvc ratio, follow-up time, and number of visits. We anticipate that the dataset will be clean, i.e. minimal missingness, only subjects with existing longitudinal data (>= 2 time points, smoking data), and erroneous data removed (QC’d spirometry, spurious outliers identified). Large discrepancies in sample size between models or in baseline characteristics will be assessed and, if necessary, we will request that the datasets be re-prepared. We selected 12 SNPs: 6 from prior GWAS (COPD, lung function, or lung function decline) and 6 proxies (null control). If you have related individuals, please reach out to Jingwen. Errors, particularly with GEE model 13, are known to occur; the software will continue to run.

Results

All result files will be added to the source folder. Please zip/tar/etc. them with the filename cohort_date.zip and send to Jingwen ([email protected]) and Matt Moll ([email protected]). See the project document for further details on the project.

2. Columns for data set

IID: Unique individual ID (numeric). Note: two different individuals CANNOT have the same IID
FID: family ID (numeric). For unrelated individuals, fill the "FID" column with a single constant value, e.g. a column of 1s
pre_fev1: FEV1 (in mL)
SNPs: SNP information; column names MUST start with lower-case "rs", e.g. rs507211, rsChrPosRefAlt
timefactor_spiro: time since the baseline exam (in YEARS). At baseline, timefactor_spiro=0
age: time-varying age
smoking_status: time-varying smoking status (never=0, former=1, current=2)

Baseline

age_baseline: baseline age
ht_baseline: baseline height (in cm)
smoking_packyears_base: pack-years at baseline
sex: biological sex (female=0, male=1)
smoking_status_base: baseline smoking status (never=0, former=1, current=2). Used as the grouping variable for glmmkin.

Other cohort-specific variables

PCs (PC1, PC2, ... ...)
equipchange ......

Variables for summary, not for analysis

pre_fev1fvc: ratio of fev1 and fvc (fev1/fvc)
fev1_pp: fev1 percent predicted
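
Before running the analysis, it may help to confirm that the columns listed so far are present and coded as described. The following base-R sketch is one way to do that. `check_decline_columns` is a hypothetical helper, not part of the package, and treating the two summary variables as required is an assumption.

```r
# Column names taken from the lists in this section
core_cols <- c("IID", "FID", "pre_fev1", "timefactor_spiro", "age", "smoking_status",
               "age_baseline", "ht_baseline", "smoking_packyears_base", "sex",
               "smoking_status_base")
summary_cols <- c("pre_fev1fvc", "fev1_pp")  # assumed to be required as well

# Hypothetical helper: returns a character vector of problems (empty if none found)
check_decline_columns <- function(dat) {
  problems <- character(0)
  missing_cols <- setdiff(c(core_cols, summary_cols), names(dat))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  # SNP columns must start with lower-case "rs"
  if (!any(grepl("^rs", names(dat)))) {
    problems <- c(problems, "no SNP column starting with lower-case 'rs'")
  }
  # Coding checks; NA values are reported as invalid
  for (v in intersect(c("smoking_status", "smoking_status_base"), names(dat))) {
    if (!all(dat[[v]] %in% c(0, 1, 2))) {
      problems <- c(problems, paste(v, "has values other than 0, 1, 2"))
    }
  }
  if ("sex" %in% names(dat) && !all(dat$sex %in% c(0, 1))) {
    problems <- c(problems, "sex has values other than 0, 1")
  }
  problems
}

# Made-up single row, for illustration only
example_row <- data.frame(
  IID = 1, FID = 1, pre_fev1 = 2771, timefactor_spiro = 0, age = 50,
  smoking_status = 1, age_baseline = 50, ht_baseline = 170,
  smoking_packyears_base = 10, sex = 1, smoking_status_base = 1,
  pre_fev1fvc = 0.7, fev1_pp = 80, rs507211 = 0
)
print(check_decline_columns(example_row))
```

An empty result means none of the checked problems were found; it does not replace spirometry QC.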

Kinship matrix (for related data): both row names and column names MUST be IID

3. Important Note

Cohorts with multiple racial groups should conduct race-stratified analyses.
If a cohort has 3 or fewer repeated measurements per individual, you may encounter an error message for GEE model 13 (e.g. something related to contrasts for variables with fewer than 2 levels). If this error occurs only in GEE model 13, please ignore it, but kindly notify us and provide results for the remaining models.
Please check the files in the output folder "decline_package_output" to make sure that you have all the requested files and plots, and that the plots display normally.
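
To see in advance whether the GEE model 13 error is likely, you can count spirometry measurements per individual. This is a base-R sketch that assumes each row of the long-format data is one visit; `visits_per_individual` is a hypothetical helper, not part of the package, and the toy values are copied from the example data in section 5.

```r
# Hypothetical helper: number of rows with a non-missing FEV1 for each IID
visits_per_individual <- function(dat) {
  table(dat$IID[!is.na(dat$pre_fev1)])
}

dat <- data.frame(
  IID      = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  pre_fev1 = c(2771, 2500, 2450, 3510, 3450, 3320, 3220, 2570, 2600, 2540)
)

n_visits <- visits_per_individual(dat)
print(n_visits)

# TRUE would mean no individual has more than 3 measurements, the situation
# in which the note on GEE model 13 says the error may appear
at_risk <- max(n_visits) <= 3
print(at_risk)
```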

4. Required packages

If the current version does not work, you can try the previous version. Both versions should give the SAME output.

R version: R/4.2.1 (current version)

geepack_1.3.9
GMMAT_1.4.0
dplyr_1.1.2
readxl_1.4.3
ggplot2_3.4.2
kinship2_1.9.6 (optional) for kinship matrix

R version: R/4.0.2 (previous version)

geepack_1.3-2
GMMAT_1.3.1
dplyr_1.0.2
readxl_1.3.1
ggplot2_3.3.2
kinship2 (optional)
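
To compare your environment with the two configurations listed in this section, the following base-R sketch prints the running R version and the installed version of each listed package. `report_versions` is a hypothetical helper, not part of the package; it only reports and does not install anything.

```r
# Packages named in this section (kinship2 is optional and left out here)
required_pkgs <- c("geepack", "GMMAT", "dplyr", "readxl", "ggplot2")

# Hypothetical helper: one row per package with install status and version
report_versions <- function(pkgs = required_pkgs) {
  # system.file() returns "" when a package is not installed
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  version <- vapply(pkgs, function(p) {
    if (installed[[p]]) as.character(utils::packageVersion(p)) else NA_character_
  }, character(1))
  data.frame(package = pkgs, installed = unname(installed),
             version = unname(version), stringsAsFactors = FALSE)
}

cat("R version:", as.character(getRversion()), "\n")
print(report_versions())
```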

5. For analysis, use "Analysis_MAIN.R"

Example code is included as comments inside Analysis_MAIN.R. Please make sure your dataset has all the columns listed in section 2 (Columns for data set).

Example data:

IID	FID	pre_fev1	timefactor_spiro	age	smoking_status	age_baseline	…	PC1	rs507211	rs4077833
1	1	2771	0	50	1	50	…	-0.0293	0	1
1	1	2500	4	54	1	50	…	-0.0293	0	1
1	1	2450	10	60	1	50	…	-0.0293	0	1
2	11	3510	0	38	0	38	…	-0.0038	0	2
2	11	3450	9	47	0	38	…	-0.0038	0	2
2	11	3320	12	50	0	38	…	-0.0038	0	2
2	11	3220	17	55	0	38	…	-0.0038	0	2
3	24	2570	0	42	2	42	…	0.0071	1	0
3	24	2600	6	48	2	42	…	0.0071	1	0
3	24	2540	8	50	2	42	…	0.0071	1	0
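
The example rows can also be used to illustrate a few internal consistency checks implied by the column definitions in section 2. This base-R sketch rebuilds the visible columns of the table (the columns hidden behind the ellipsis are omitted). The one-year tolerance on age is an assumption made to allow for rounding, and none of these checks are claimed to be performed by Analysis_MAIN.R.

```r
# Visible columns of the example table
dat <- data.frame(
  IID              = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  FID              = c(1, 1, 1, 11, 11, 11, 11, 24, 24, 24),
  pre_fev1         = c(2771, 2500, 2450, 3510, 3450, 3320, 3220, 2570, 2600, 2540),
  timefactor_spiro = c(0, 4, 10, 0, 9, 12, 17, 0, 6, 8),
  age              = c(50, 54, 60, 38, 47, 50, 55, 42, 48, 50),
  smoking_status   = c(1, 1, 1, 0, 0, 0, 0, 2, 2, 2),
  age_baseline     = c(50, 50, 50, 38, 38, 38, 38, 42, 42, 42)
)

checks <- c(
  # every individual has a baseline visit at time 0
  baseline_at_zero = all(tapply(dat$timefactor_spiro, dat$IID, min) == 0),
  # time-varying age agrees with baseline age plus follow-up time (assumed 1-year tolerance)
  age_matches_time = all(abs(dat$age - (dat$age_baseline + dat$timefactor_spiro)) <= 1),
  # an individual belongs to exactly one family
  one_fid_per_iid = all(tapply(dat$FID, dat$IID, function(x) length(unique(x))) == 1),
  # no individual has two rows at the same time point
  no_duplicate_visits = !any(duplicated(dat[, c("IID", "timefactor_spiro")]))
)
print(checks)
```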

Example Kinship matrix for 5 individuals (IID = 11,20,31,42,50):

IID	11	20	31	42	50
11	0.5	0	0.25	0.125	0
20	0	0.5	0.25	0.125	0
31	0.25	0.25	0.5	0.25	0.25
42	0.125	0.125	0.25	0.5	0
50	0	0	0.25	0	0.5
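
The example matrix can be entered in R and checked against the requirement in section 2 that both row names and column names are IID. In this base-R sketch, `check_kinship` is a hypothetical helper, not part of the package. The symmetry check is an added assumption, based on kinship matrices being symmetric by construction.

```r
ids <- c("11", "20", "31", "42", "50")
kin <- matrix(
  c(0.5,   0,     0.25, 0.125, 0,
    0,     0.5,   0.25, 0.125, 0,
    0.25,  0.25,  0.5,  0.25,  0.25,
    0.125, 0.125, 0.25, 0.5,   0,
    0,     0,     0.25, 0,     0.5),
  nrow = 5, byrow = TRUE, dimnames = list(ids, ids)
)

# Hypothetical helper: named logical vector, one entry per check
check_kinship <- function(kin, iid) {
  c(
    has_names   = !is.null(rownames(kin)) && !is.null(colnames(kin)),
    names_match = identical(rownames(kin), colnames(kin)),
    symmetric   = isSymmetric(unname(kin)),
    # every IID in the dataset must appear in the matrix
    covers_iid  = all(as.character(unique(iid)) %in% rownames(kin))
  )
}

print(check_kinship(kin, iid = c(11, 20, 31, 42, 50)))
```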
