Tutorial for R package
htlgmm
Heterogeneous Transfer Learning Dealing with Unmatched Features
Datasets with information on all desired features typically have small sample sizes, whereas one or more external studies with a limited set of predictors often have large sample sizes. The HTL-GMM method (Heterogeneous Transfer Learning with Generalized Method of Moments) builds potentially high-dimensional GLMs and Cox PH models on all desired features by incorporating such external studies.
This tutorial is an introduction to the HTL-GMM method. We describe the R package usage and some application examples. Please refer to https://arxiv.org/abs/2312.12786 for details of the HTL-GMM method.
Code for HTL-GMM Manuscript
Please refer to https://github.com/RuzhangZhao/htlgmm_data_analysis to download the code for the simulations and real data applications in the HTL-GMM manuscript.
Method Recap
The main study and external study can be from the same or different populations.
The external study can come in the form of individual-level data or a pre-trained model with parameter estimates.
\({\bf A}\) and \(\tilde{\bf A}\): study-specific adjustment factors; \({\bf Z}\): overlapping features in both main and external studies; \({\bf W}\): unmatched features only in main study; \({\bf Y}\): outcome variable (family = “gaussian”: continuous; family = “binomial”: binary)
Function Usage of Package htlgmm
Primary Functions of Package htlgmm
We primarily provide three functions:
htlgmm,
cv.htlgmm,
gwas.htlgmm. The first two work similarly to the functions glmnet and cv.glmnet from the R package glmnet.
Like cv.glmnet, cv.htlgmm performs cross validation with automatically generated tuning parameter list(s) to select the tuning parameters. We recommend that users call cv.htlgmm when no prefixed tuning parameters are known.
htlgmm
cv.htlgmm
gwas.htlgmm
Input Parameters of cv.htlgmm
y : Response Variable.
Z : Overlapping Features.
W : Unmatched Features.
A : Study-specific Adjustment Factors. In the default setting, the input A is ignored: there is no intercept term when family = “gaussian” (all variables are centered), and only an intercept term when family = “binomial”.
ext_study_info : Parameter estimates from the external study for the overlapping features. “ext_study_info” comes as a list, including the estimated coefficients and the estimated variance (variance-covariance matrix).
family : Currently focuses on linear (“gaussian”) and logistic (“binomial”) regression.
penalty_type : Select from c(“none”,“adaptivelasso”,“lasso”,“ridge”,“elasticnet”). We allow all penalty choices available from glmnet. The default is “lasso”.
beta_initial: The initial estimate of beta if a consistent estimator is available. The default is NULL, in which case the main study is used for the initial estimate according to “initial_with_type”.
tune_weight: Whether to assign a tuning weight for the regularization of the weighting matrix. The default is FALSE.
tune_weight_method: Method for weighting-matrix regularization. The default is “multiplicative shrinkage”, which can also be specified as “mshrink” or “ms”. The other choice is “ridge”.
Click for more details of input parameters
initial_with_type: Method for obtaining the initial estimate of beta from the main study data only, chosen from c(“ridge”,“lasso”,“glm”); “ridge” and “lasso” use cross-validated penalized regression.
weight_adaptivelasso: The customized adaptive weight for adaptive lasso when penalty_type = “adaptivelasso”.
alpha: The elasticnet mixing parameter, between 0 and 1, corresponding to alpha in glmnet. Only used when penalty_type = “elasticnet”.
hat_thetaA: If A is not NULL, one can provide hat_thetaA as input. If hat_thetaA = NULL, we estimate it by glm on the main study.
V_thetaA: If A is not NULL, one can provide V_thetaA as input. If V_thetaA = NULL, we estimate it by glm on the main study.
remove_penalty_Z: Do not penalize Z if TRUE. The default is FALSE.
remove_penalty_W: Do not penalize W if TRUE. The default is FALSE.
inference: Whether to do inference without a penalty, or post-selection inference with the adaptive lasso penalty. The default is “default”, which is TRUE when penalty_type = “none” or “adaptivelasso”, and FALSE otherwise.
fix_C: When fix_C = NULL, the optimal C is computed. To customize fix_C, match its dimension to dim(A)+2*dim(Z)+dim(W) and make sure it is positive definite.
fix_inv_C: When fix_inv_C = NULL, the optimal C is computed. To customize fix_inv_C, match its dimension to dim(A)+2*dim(Z)+dim(W) and make sure it is positive definite. When both fix_C and fix_inv_C are given, fix_C is used.
type_measure: Select from c(“default”, “mse”, “deviance”, “auc”). The default is mse (linear) and auc (logistic). “deviance” is an alternative for binary y.
fix_weight: The fixed weight for the regularization of the weighting matrix. The default is NULL, in which case the best weight is selected via cross validation.
weight_list: A preset weight list. The default is NULL, in which case the weight list is generated automatically.
gamma_adaptivelasso: The gamma for adaptive lasso, selected from c(1/2,1,2). The default is 1/2.
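Putting these parameters together, a minimal call using only the arguments documented above might look as follows (a sketch: y, Z, W, and ext_study_info are assumed to be constructed as in the simulation sections below).
library(htlgmm)
# Minimal sketch of a cv.htlgmm call; all arguments are documented above.
fit<-cv.htlgmm(y=y,                           # response variable
               Z=Z,                           # overlapping features
               W=W,                           # unmatched features
               ext_study_info=ext_study_info, # external parameter estimates
               family="binomial",
               penalty_type="lasso",          # default penalty
               tune_weight=FALSE)             # TRUE adds weighting-matrix regularization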
Output Values of cv.htlgmm
beta : Estimated coefficients of all predictors, in the order (A, Z, W).
lambda_list: The lambda list for cross validation.
weight_list: The weight list for cross validation, when tune_weight = TRUE.
cv_mse: The mean squared error (mse) for cross validation when family = “gaussian”.
cv_auc: The area under the ROC curve (auc) for cross validation when family = “binomial”. For each lambda, cv_auc is the average auc across all folds.
cv_dev: The deviance (dev) for cross validation when family = “binomial”. An alternative to cv_auc.
lambda_min: The selected best lambda.
weight_min: The selected best weight, when tune_weight = TRUE.
selected_vars: A list of results for predictors with nonzero estimated coefficients.
position: the position of nonzero predictors.
name: the name of nonzero predictors.
coef: the coefficient of nonzero predictors.
variance: the variance of nonzero predictors.
variance-covariance: the variance-covariance matrix of nonzero predictors.
pval: the p values of nonzero predictors.
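The fitted object can then be inspected through the fields documented above; a sketch, continuing the hypothetical fit from the example after the input parameters:
head(fit$beta)           # coefficient estimates, ordered as (A, Z, W)
fit$lambda_min           # lambda selected by cross validation
fit$selected_vars$name   # names of predictors with nonzero coefficients
fit$selected_vars$pval   # p values, available when inference is performed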
Simulation Examples with Binary Outcome
In this simulation example, we illustrate the usage of cv.htlgmm for a binary outcome.
Simulation Setup
This simulation example illustrates how the cv.htlgmm method is implemented, compared with cv.glmnet. The simulation is performed under a logistic regression setting. We use 20 overlapping features (Z) and 100 unmatched features (W). 3 of the 20 Z features have nonzero effects on the outcome, and 2 of the 100 W features have nonzero effects.
The main study sample size n is 500, while the external study sample size nE is 3000.
Click for data generation simulation code
library(htlgmm)
pZ=20 # overlapping features
pW=100 # unmatched features
coef<-c(rep(0,pZ+pW))
coef[1:3]<-0.5
coef[c(21,22)]<-0.5
n=500
nE=3000
n_joint=n+nE
main_index<-1:n
print(paste0("Nonzero positions: ",paste(which(coef!=0),collapse = ", ")))## [1] "Nonzero positions: 1, 2, 3, 21, 22"
We jointly generate main and external studies.
y is generated with both Z and W.
set.seed(1)
Z_joint<-matrix(rnorm(n_joint*pZ),n_joint,pZ)
colnames(Z_joint)<-paste0("Z",1:pZ)
W_joint<-matrix(rnorm(n_joint*pW),n_joint,pW)
colnames(W_joint)<-paste0("W",1:pW)
Z<-Z_joint[main_index,] # separate main and external study for Z
ZE<-Z_joint[-main_index,]
W<-W_joint[main_index,] # only need main study for W
y_joint<-rbinom(n=n_joint,size=1,prob=expit(cbind(Z_joint,W_joint)%*%coef))
y<-y_joint[main_index] # separate main study y
yE<-y_joint[-main_index] # separate external study y
Display the data structure of main study (y, Z, W) with the first 2 rows and first 30 columns.
## y Z1 Z2 Z3 Z4 Z5 Z6
## [1,] 1 -0.6264538 0.5205997 -1.3254177 0.7051107 0.5232667 -2.4675028
## [2,] 0 0.1836433 0.3775619 0.9519797 -0.6268688 0.9935537 -0.7000656
## Z7 Z8 Z9 Z10 Z11 Z12
## [1,] 0.6010915 -0.9252129 0.97027832 0.5375559 0.6986309 0.5574862
## [2,] -2.7671158 0.1805965 -0.02072336 0.2271813 -1.1650711 0.3520408
## Z13 Z14 Z15 Z16 Z17 Z18 Z19
## [1,] 0.3707608 2.5285947 -0.06455497 -1.504228 1.2651240 0.7859485 1.380371
## [2,] -0.7265963 0.1232134 -0.58150321 1.093152 0.5123215 -0.4794245 1.682890
## Z20 W1 W2 W3 W4 W5 W6
## [1,] 0.3768347 -1.00203611 -0.7132184 1.042499 0.9530678 1.1411060 0.3168121
## [2,] 1.2553529 0.02590761 -1.0166592 1.080826 -0.4157007 -0.6290421 -1.2707453
## W7 W8 W9
## [1,] 0.7156816 -0.9463759 0.52378339
## [2,] 1.4276169 -1.4034296 -0.06035219
Display the data structure of external study (yE, ZE).
## yE Z1 Z2 Z3 Z4 Z5 Z6
## [1,] 0 0.07730312 -1.1346302 -0.7948034 -1.411522 -0.5676023 0.9514099
## [2,] 1 -0.29686864 0.7645571 0.6060846 1.083870 0.4837350 0.4570987
## Z7 Z8 Z9 Z10 Z11 Z12 Z13
## [1,] -0.5682999 0.1965621 0.5263087 -1.0905102 0.8084132 -1.680998 -0.4447731
## [2,] 0.1131262 -0.4199427 -1.1767110 -0.4629759 1.5886737 1.265549 0.4094111
## Z14 Z15 Z16 Z17 Z18 Z19 Z20
## [1,] -1.6220979 0.5468718 -0.6363353 -1.6490749 0.3413341 1.564013 -1.024849
## [2,] 0.1897096 -0.1667157 0.7258137 -0.7714769 0.4136665 -0.795476 -1.249276
Scale Z and W predictors
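The scaling code is collapsed on this page; a minimal version, assuming standard scale() preprocessing of the main study predictors, is:
# Assumed preprocessing: standardize the main study predictors.
Z<-scale(Z)
W<-scale(W)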
Train with Main Study by cv.glmnet
We run cv.glmnet with lasso on the main study only.
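The cv.glmnet code is collapsed on this page; res_glmnet, which is reused below for beta_initial and for the comparison, is presumably fit as follows (a sketch under that assumption):
library(glmnet)
# Lasso on the main study only (alpha = 1 is the glmnet default).
res_glmnet<-cv.glmnet(x=cbind(Z,W),y=y,family="binomial")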
Train with External Study by glm
Summarize the external study into a pre-trained model with parameter estimates. Here we fit a GLM to the external study (yE, ZE), extract the estimated coefficients (excluding the intercept term; for Z only) and the variance-covariance matrix, and convert the results into a list.
resglm<-glm(y~.,data = data.frame(y=yE,ZE),family = "binomial")
study_external = list(
Coeff=resglm$coefficients[-1],
Covariance=vcov(resglm)[-1,-1])
ext_study_info<-list()
ext_study_info[[1]] <- study_external
Train with Main and External Studies by
cv.htlgmm
Train with cv.htlgmm; the family is binomial. First train HTL-GMM using lasso. The tune_weight option introduces the regularization of the weighting matrix.
res_htlgmm<-cv.htlgmm(y=y,
Z=Z,
W=W,
ext_study_info = ext_study_info,
family="binomial",
tune_weight = T,
penalty_type = "lasso",
beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))
For HTL-GMM using adaptive lasso, the adaptive weight is recommended to be taken from ridge regression. When using adaptive lasso, the inference outputs are generated by default.
res_ridge<-cv.glmnet(x=cbind(Z,W),y=y,alpha=0,family="binomial")
weight_adaptivelasso<-1/abs(c(coef(res_ridge,s="lambda.min")[-1]))^(1/2)
res_htlgmm_ada<-cv.htlgmm(y=y,
Z=Z,
W=W,
ext_study_info = ext_study_info,
family="binomial",
tune_weight = T,
penalty_type = "adaptivelasso",
weight_adaptivelasso = weight_adaptivelasso,
beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))
Check the estimation error between the estimated coefficients and the true coefficients.
## Estimation Error: lasso(0.2046);
## htlgmm(0.0734)
Click for comparison code
est_coef_htlgmm<-res_htlgmm$beta[-1]
ee_htlgmm<-sum((coef-est_coef_htlgmm)^2)
ee_htlgmm<-round(ee_htlgmm,4)
est_coef_lasso<-coef.glmnet(res_glmnet,s="lambda.min")[-1]
ee_lasso<-sum((coef-est_coef_lasso)^2)
ee_lasso<-round(ee_lasso,4)
cat(paste0("Estimation Error: ","lasso(",ee_lasso,");\n htlgmm(",ee_htlgmm,")"))## Estimation Error: lasso(0.2046);
## htlgmm(0.0734)
Check the out-of-sample prediction performance on test data with a sample size of 2000.
## AUC on test data: lasso(0.7355);
## htlgmm(0.7467)
Click for test data generation simulation code
n_test = 2000
set.seed(203)
Z_test<-matrix(rnorm(n_test*pZ),n_test,pZ)
colnames(Z_test)<-paste0("Z",1:pZ)
W_test<-matrix(rnorm(n_test*pW),n_test,pW)
colnames(W_test)<-paste0("W",1:pW)
y_test<-rbinom(n=n_test,size=1,prob=expit(cbind(Z_test,W_test)%*%coef))
library(pROC)
# intercepts of estimated coefficients are ignored
# since intercepts do not affect AUC.
pred_y_htlgmm<-cbind(Z_test,W_test)%*%est_coef_htlgmm
pred_y_lasso<-cbind(Z_test,W_test)%*%est_coef_lasso
suppressMessages(auc_htlgmm<-c(auc(y_test,c(htlgmm::expit(pred_y_htlgmm)),direction = "<")))
suppressMessages(auc_lasso<-c(auc(y_test,c(htlgmm::expit(pred_y_lasso)),direction = "<")))
cat(paste0("AUC on test data: ","lasso(",round(auc_lasso,4),");\n htlgmm(",round(auc_htlgmm,4),")"))## AUC on test data: lasso(0.7355);
## htlgmm(0.7467)
Click for detailed inference output of cv.htlgmm
Example of beta
est_coef_htlgmm<-res_htlgmm$beta
names(est_coef_htlgmm)<-c("Intercept",paste0('Z',1:pZ),paste0('W',1:pW))
est_coef_htlgmm[1:30] # only show the first 30 coefficients
## Intercept Z1 Z2 Z3 Z4 Z5
## -0.044383546 0.498662972 0.448938638 0.308491332 0.000000000 0.000000000
## Z6 Z7 Z8 Z9 Z10 Z11
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## Z12 Z13 Z14 Z15 Z16 Z17
## 0.000000000 0.000000000 0.000000000 0.028778728 -0.008967114 -0.053259808
## Z18 Z19 Z20 W1 W2 W3
## 0.000000000 0.000000000 0.000000000 0.461327236 0.356570948 0.000000000
## W4 W5 W6 W7 W8 W9
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Example of selected_vars: the coef, variance, and pval vectors of the selected predictors are shown below.
## [1] -0.04140164 0.51867246 0.46843552 0.31214784 0.48650693 0.40330147
## [7] -0.03980315 -0.03332190
## [1] 0.007575373 0.002137143 0.002087740 0.002199780 0.007014063 0.007770544
## [7] 0.006641592 0.006219383
## [1] 6.343018e-01 3.268335e-29 1.158337e-24 2.826334e-11 6.283285e-09
## [6] 4.759089e-06 6.252620e-01 6.726392e-01
The positions passing significance level 0.05 after Benjamini-Hochberg (BH) FDR adjustment.
print(paste0("The predictors whose effects on outcome are nonzero: ",paste(c(paste0('Z',1:pZ),paste0('W',1:pW))[coef!=0],collapse = ", ")))## [1] "The predictors whose effects on outcome are nonzero: Z1, Z2, Z3, W1, W2"
print(paste0("The predictors selected by htlgmm after FDR adjustment :",paste(res_htlgmm_ada$selected_vars$FDR_adjust_name,collapse = ", ")))## [1] "The predictors selected by htlgmm after FDR adjustment :Z1, Z2, Z3, W1, W2"
The lambda list and cross validation auc.
## lambda cv_mse.1 cv_mse.2 cv_mse.3 cv_mse.4 cv_mse.5 cv_mse.6
## 1 0.5551131 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## 2 0.5057983 0.7249778 0.7248000 0.7253333 0.7239111 0.7328000 0.7212444
## 3 0.4608646 0.7224889 0.7226667 0.7233778 0.7328000 0.7233778 0.7155556
## 4 0.4199227 0.7232000 0.7226667 0.7248000 0.7285333 0.7182222 0.7107556
## 5 0.3826179 0.7223111 0.7248000 0.7306667 0.7244444 0.7162667 0.7100444
## 6 0.3486272 0.7212444 0.7304889 0.7328000 0.7232000 0.7143111 0.7091556
The best selected lambda.
## [1] 0.0542351
Real Application Example for Risk Prediction
The real application is 10-year risk prediction of “incident” breast cancer using proteomics and risk factor data.
The proteomics data is downloaded from https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/past-data-releases
The risk factors are downloaded from https://biobank.ndph.ox.ac.uk/showcase/browse.cgi
Because of data privacy issues, we use simulated risk factor and proteomics data for the application of cv.htlgmm here.
Click for risk factor and proteomics data generation simulation code
Load the required packages, including htlgmm.
Simulated UKB data including disease status, risk factors and proteomics
The risk factor names are listed from the breast cancer related ones. We record the summarized mean and standard deviation for each risk factor, including whether the predictor is binary or continuous.
risk_factor_names<-c("age","BMI","Height","Smoking_Status","Alcohol_Freq",
"Mother_Breast_Cancer_History","Silbings_Breast_Cancer_History",
"Number_of_Birth","Age_Menarche","Ever_Used_Oral_Contraceptive",
"Ever_Received_Hormone_Replacement_Therapy",
"Ever_Received_Bilateral_Oophorectomy",
"Ever_Had_Still_Birth","Prevalent_Benign_Breast_Disease")
mean_risk_factor<-c(59.7636,27.1449,162.1822,0.3442,0.1804,0.0758,0.0409,1.8789,12.9352,0.8001,0.4922,0.0961,0.3094,0.0083)
sd_risk_factor<-c(5.4963,5.0753,6.1417,0.4751,0.3845,0.2647,0.1980,1.1317,1.6063,0.3999,0.4999,0.2948,0.4622,0.0907)
names(mean_risk_factor)<-risk_factor_names
print(mean_risk_factor)
## age
## 59.7636
## BMI
## 27.1449
## Height
## 162.1822
## Smoking_Status
## 0.3442
## Alcohol_Freq
## 0.1804
## Mother_Breast_Cancer_History
## 0.0758
## Silbings_Breast_Cancer_History
## 0.0409
## Number_of_Birth
## 1.8789
## Age_Menarche
## 12.9352
## Ever_Used_Oral_Contraceptive
## 0.8001
## Ever_Received_Hormone_Replacement_Therapy
## 0.4922
## Ever_Received_Bilateral_Oophorectomy
## 0.0961
## Ever_Had_Still_Birth
## 0.3094
## Prevalent_Benign_Breast_Disease
## 0.0083
For simplicity, we assume all binary risk factors follow a binomial distribution and continuous risk factors follow a normal distribution. We perform a case-control analysis to allow fast computation in this simulation. The main study sample size is assumed to be 2,000, and the external study is 9 times larger, a ratio similar to the actual data.
There are in total 1,500 proteins in the UK Biobank proteomics data; we use 500 for the simulation.
binary_predictors<-c("Mother_Breast_Cancer_History",
"Silbings_Breast_Cancer_History",
"Ever_Used_Oral_Contraceptive",
"Ever_Received_Hormone_Replacement_Therapy",
"Ever_Received_Bilateral_Oophorectomy",
"Ever_Had_Still_Birth",
"Prevalent_Benign_Breast_Disease")
n_main<-2000
n_external<-n_main*9
n_joint<-n_main+n_external
p_protein<-500
set.seed(2023)
risk_factor<-list()
for( i in which(risk_factor_names%in%binary_predictors)){
risk_factor[[i]]<-rbinom(n = n_joint,size = 1,prob = mean_risk_factor[i])
}
for( i in which(!risk_factor_names%in%binary_predictors)){
risk_factor[[i]]<-rnorm(n = n_joint,mean = mean_risk_factor[i],sd = sd_risk_factor[i])
}
risk_factor_matrix<-Reduce("cbind",risk_factor)
colnames(risk_factor_matrix)<-risk_factor_names # name the columns, matching the display below
Check the data structure of the risk factor matrix, which is similar to the actual data structure.
## age BMI Height Smoking_Status Alcohol_Freq
## [1,] 64.52 31.98 160.26 -0.23 0.61
## [2,] 44.70 27.05 168.29 0.38 0.20
## Mother_Breast_Cancer_History Silbings_Breast_Cancer_History
## [1,] 0 0
## [2,] 0 0
## Number_of_Birth Age_Menarche Ever_Used_Oral_Contraceptive
## [1,] 2.62 14.42 0
## [2,] 3.12 13.53 1
## Ever_Received_Hormone_Replacement_Therapy
## [1,] 0
## [2,] 1
## Ever_Received_Bilateral_Oophorectomy Ever_Had_Still_Birth
## [1,] 0 0
## [2,] 0 0
## Prevalent_Benign_Breast_Disease
## [1,] 0
## [2,] 0
The proteins are assumed to follow N(0,1).
proteomics_matrix<-matrix(rnorm(n_joint*p_protein),nrow = n_joint,ncol = p_protein)
eg_proteomics<-proteomics_matrix[1:6,1:5]
colnames(eg_proteomics)<-c("aarsd1","abhd14b","abl1","acaa1","ace2")
eg_proteomics
## aarsd1 abhd14b abl1 acaa1 ace2
## [1,] -1.1105242 -0.01414405 -1.0864208 -0.5041349 4.490670e-01
## [2,] 1.0969956 0.81478464 -0.3507938 -1.2923906 -1.026354e-01
## [3,] 0.1096429 1.17141003 -0.6785858 1.6844419 -6.698569e-05
## [4,] -0.4895755 0.09112312 -0.5685032 -0.6172262 -5.422962e-01
## [5,] 1.0021940 0.66385649 -1.4594543 0.6078855 -1.003801e+00
## [6,] -1.0121142 -0.64776973 -0.4683433 -1.1744974 -1.432753e+00
We assume the disease status follows a binomial distribution, with a case proportion of roughly 0.1.
coef_risk_factor<-c(0.1172,0.1216,0.1067,-0.0312,0.0816,0.1038,0.0755,-0.0507,-0.0099,-0.0099,0.0376,-0.0266,-0.0251,0.0628)
coef_proteomics<-rep(0,p_protein)
set.seed(2024)
coef_proteomics[sample(1:p_protein,20)]<-rnorm(20,mean=0,sd = 0.1)
mean_case<-0.1
Y<-rbinom(n_joint,size = 1,prob =expit(
log(mean_case/(1-mean_case))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(proteomics_matrix%*%coef_proteomics)))
print(paste0("simulated disease prevalence:",sum(Y)/n_joint))## [1] "simulated disease prevalence:0.11255"
Separate the simulated data into main study and external study.
main_study<-data.frame(Y = Y[1:n_main],risk_factor_matrix[1:n_main,],proteomics_matrix[1:n_main,])
external_study<-data.frame(Y = Y[-c(1:n_main)],risk_factor_matrix[-c(1:n_main),])
Scale the predictors from the external study. Extract summary statistics from the external study.
external_study[,-1]=scale(external_study[,-1])
family = "binomial"
ext_study_info=list()
external_glm=glm(Y~.,data = external_study,family = family)
study_external = list(
Coeff=external_glm$coefficients[-1],
Covariance=vcov(external_glm)[-1,-1])
ext_study_info[[1]] = study_external
#ext_study_info
Split the main study into train and test sets with proportions 70% and 30%.
library(caret) # assumed import for preProcess(); the split code is collapsed on this page
set.seed(2023) # assumed seed for reproducibility
test_index = sample(1:n_main, round(0.3*n_main)) # assumed 30% test split
y = main_study$Y[-test_index]
X = as.matrix(main_study[-test_index,-1])
y_test = main_study$Y[test_index]
X_test = as.matrix(main_study[test_index,-1])
prefunc <- preProcess(X, method = c("center", "scale"))
X = predict(prefunc, X)
X_test = as.matrix(predict(prefunc, X_test))
p_risk = ncol(external_study)-1
Train with Main Study by cv.glmnet
y and X are training sets for main study, while y_test and X_test are test sets for main study. We perform normalization on X and apply the same transformation to X_test.
Perform cv.glmnet on the main study’s training set only and check the performance on the main study’s test set. The criterion is AUC.
start_t=Sys.time()
main_lasso=cv.glmnet(x=X, y=y, family="binomial")
end_t=Sys.time()
sprintf('cv.glmnet time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.glmnet time: 14.27 secs"
pred_main_lasso=c(predict(main_lasso,newx=X_test,s="lambda.min"))
suppressMessages(auc_main_lasso<-auc(y_test,expit(pred_main_lasso)))
Train with Main and External Studies by
cv.htlgmm
Perform cv.htlgmm on the main study’s training set together with the external study, and check the performance on the main study’s test set. The initial beta for cv.htlgmm is taken from the output of cv.glmnet.
start_t=Sys.time()
main_external_htlgmm=htlgmm::cv.htlgmm(y = y,
Z = X[,c(1:p_risk)],
W = X[,-c(1:p_risk)],
ext_study_info = ext_study_info,
penalty_type = "lasso",
beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
family = "binomial")
end_t=Sys.time()
sprintf('cv.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.htlgmm time: 18.03 secs"
Perform cv.htlgmm with the option of weight tuning by
setting tune_weight = TRUE.
start_t=Sys.time()
main_external_htlgmm_tune=htlgmm::cv.htlgmm(y = y,
Z = X[,c(1:p_risk)],
W = X[,-c(1:p_risk)],
ext_study_info = ext_study_info,
family = "binomial",
penalty_type = "lasso",
beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
tune_weight = TRUE)
end_t=Sys.time()
sprintf('cv.htlgmm with weight tuning time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "cv.htlgmm with weight tuning time: 35.51 secs"
Performance Comparison
Check the AUC comparison between cv.glmnet and
cv.htlgmm.
## AUC on test data: main study only lasso(0.544);
## htlgmm(0.5477);
## htlgmm_tune_weight(0.5921)
Click for comparison code
pred_main_external_htlgmm=c(X_test%*%main_external_htlgmm$beta[-1])
suppressMessages(auc_main_external_htlgmm<-pROC::auc(y_test,expit(pred_main_external_htlgmm)))
pred_main_external_htlgmm_tune=c(X_test%*%main_external_htlgmm_tune$beta[-1])
suppressMessages(auc_main_external_htlgmm_tune<-pROC::auc(y_test,expit(pred_main_external_htlgmm_tune)))
cat(paste0("AUC on test data: main study only lasso(",round(auc_main_lasso,4) ,");\n htlgmm(", round(auc_main_external_htlgmm,4),");\n htlgmm_tune_weight(",round(auc_main_external_htlgmm_tune,4),")"))## AUC on test data: main study only lasso(0.544);
## htlgmm(0.5477);
## htlgmm_tune_weight(0.5921)
Plot to show the change of cross validation deviance against log lambda (tuning parameter for lasso penalty).
df = data.frame(lambda = main_external_htlgmm$lambda_list,
cv_dev=main_external_htlgmm$cv_dev,
cv_auc=main_external_htlgmm$cv_auc)
library(ggplot2)
ggplot(df)+
geom_line(aes(x=log(lambda),y=cv_auc))+
geom_vline(xintercept = log(main_external_htlgmm$lambda_min),
linetype=2)+
theme(panel.background=element_rect(fill='transparent', color='black'),
axis.text=element_text(size=14,face="bold"))+
geom_text(aes(x=log(main_external_htlgmm$lambda_min),y=median(cv_auc),label='lambda with max cv auc'))
GWAS version of HTLGMM
In this example, a simulation is generated to illustrate the application of gwas.htlgmm.
We simulate SNP_matrix from a binomial distribution, with allele frequencies drawn from a uniform distribution. We consider 100 SNPs, with a main study sample size of 10k and external summary statistics based on a sample size of 100k.
The adjustment covariates include age, sex, and BMI. Assume age and sex are adjusted for in the external summary statistics, while BMI is not. We can further adjust for BMI using both the main study and the external summary statistics via gwas.htlgmm.
Simulation Setup
Click for genotype and phenotype generation simulation code, including
how to set up main and external studies
Simulate SNP matrix and covariates
library(htlgmm)
snp_count<-100
n_ext<-1e5
n_main<-1e4
n_joint<-n_main+n_ext
set.seed(2024)
allele_freq<-runif(snp_count,min = 0.1,max = 0.9)
snp_matrix<-sapply(1:snp_count, function(i){
rbinom(n=n_joint,size=2,prob=allele_freq[i])})
age<-rnorm(n_joint,mean = 60,sd = 5)
sex<-rbinom(n_joint,size=1,prob = 0.5)
BMI<-rnorm(n_joint,mean = 27,sd = 5)
Simulate binary outcome by randomly setting causal SNPs
Randomly set the true causal SNP positions and check the case and control counts for the simulated disease.
true_snp_position<-sample(1:100,5)
coef_snp<-rep(0,snp_count)
coef_snp[true_snp_position]<-log(1.2)*sample(c(1,-1),5,replace=TRUE)
print(paste0("True Causal SNP Positions: ",paste(sort(true_snp_position),collapse = ", ")))## [1] "True Causal SNP Positions: 7, 10, 19, 27, 98"
coef_risk_factor<-c(0.1172,-0.05,0.1216)
risk_factor_matrix<-cbind(age,sex,BMI)
t2d<-rbinom(n_joint,size = 1,prob = expit(
log(0.1/(1-0.1))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(snp_matrix%*%coef_snp)))
table(t2d)
## t2d
## 0 1
## 98858 11142
Split simulated data into main study and external study
A consists of age and sex; W only includes BMI; Z includes one SNP at a time. In the general case, A and W can contain multiple covariates. For the A input of gwas.htlgmm, a column of 1s is required for the intercept term.
A<-cbind(1,risk_factor_matrix[1:n_main,c(1,2)])
AE<-cbind(1,risk_factor_matrix[-c(1:n_main),c(1,2)])
W<-risk_factor_matrix[1:n_main,3,drop=F]
y<-t2d[1:n_main]
yE<-t2d[-c(1:n_main)]
snp_matrix_main<-snp_matrix[1:n_main,]
snp_matrix_ext<-snp_matrix[-c(1:n_main),]
Simulate External GWAS Summary Statistics
The simulated external summary statistics ext_study_info correspond to the external summary statistics one usually obtains from a large consortium.
Please follow the list format below for external summary statistics.
ext_study_info<-list()
for(i in 1:snp_count){
ZE<-snp_matrix_ext[,i]
resglm<-glm(y~.,data = data.frame(y=yE,ZE,AE[,-1]),family = "binomial")
study_external = list(
Coeff=resglm$coefficients[2],
Covariance=vcov(resglm)[2,2])
ext_study_info[[i]] <- study_external
}
print(ext_study_info[[1]])
## $Coeff
## ZE
## -0.008973933
##
## $Covariance
## [1] 0.0003068138
Input Parameters of gwas.htlgmm
gwas.htlgmm requires:
(1) : All the covariates (A & W) except the intercept term should be centered with mean 0;
(2) : The SNP variable also needs to be centered with mean 0;
(3) : If beta_initial is provided, it should correspond to the centered version and follow the order (A, W, Z). A minimal centering sketch is given after this list.
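A minimal centering sketch satisfying (1)-(3), assuming A carries a leading column of 1s and W and the SNP vector Z are defined as in the simulation above:
# Center covariates and the SNP; keep the intercept column of 1s in A.
A[,-1]<-scale(A[,-1],scale=FALSE)
W<-scale(W,scale=FALSE)
Z<-scale(Z,scale=FALSE)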
Train with Main Study and External GWAS by
gwas.htlgmm
Key Parameter List of Function
gwas.htlgmm
y : Outcome of Interest. Binary or continuous.
Z : SNP.
W : Covariate of Interest, unadjusted in the external GWAS.
A : Study-specific Adjustment Factors: all the covariates adjusted for in the external summary statistics. When family = “binomial”, a column of 1s should be added for the intercept term.
ext_study_info : External GWAS Summary Statistics. “ext_study_info” comes as a list, including the estimated coefficient and estimated variance for the SNP.
family : linear (“gaussian”) and logistic (“binomial”).
beta_initial : The initial estimate of beta, corresponding to the centered version and following the order (A, W, Z).
repeated_term : Repeated computations that can be reused across SNPs.
output_SNP_only : Only output the SNP’s updated coefficient and variance estimates. Otherwise, information for all covariates is output.
output_tmp : Whether to output repeated_term.
Note that repeated_term should be computed in advance to
accelerate the computation.
start_t=Sys.time()
library(speedglm,quietly = T)
# first compute the repeated terms.
repeated_term <- gwas.htlgmm.repeated.term(y,A,"binomial")
beta_list<-c()
se_list<-c()
for(i in 1:snp_count){
# center SNP
Z<-snp_matrix_main[,i]
Z<-scale(Z,scale=F)
# beta initial
resglm<-speedglm(y~.,data = data.frame(y=y,A[,-1],W,Z),family=binomial())
beta_initial<-resglm$coefficients
# find external gwas
GWAS_ext_study_info = list()
GWAS_ext_study_info[[1]] = ext_study_info[[i]]
res.gwas<-gwas.htlgmm(y,Z,W,GWAS_ext_study_info,A,"binomial",beta_initial,repeated_term=repeated_term)
beta_list<-c(beta_list,res.gwas$beta)
se_list<-c(se_list,sqrt(res.gwas$variance))
}
end_t=Sys.time()
sprintf('gwas.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))
## [1] "gwas.htlgmm time: 3.35 secs"
Verify the results
In our simulation, because BMI acts as a strong overall risk factor, GWAS summary statistics that do not adjust for BMI will not lead to correct conclusions.
## [1] "External GWAS identifies siginicant SNP number of : 93"
## [1] "gwas.htlgmm identifies siginificant SNP positions: 7, 10, 19, 27, 98"
Click for comparison code
pval_list_ext<-sapply(1:snp_count, function(i){
  # Wald chi-square statistic: squared coefficient over its variance
  pchisq(ext_study_info[[i]][[1]]^2/ext_study_info[[i]][[2]],1,lower.tail = F)
})
pval_list_htlgmm<-pchisq(beta_list^2/se_list^2,1,lower.tail = F)
print(paste0("External GWAS identifies number of significant SNPs: ",sum(pval_list_ext<0.05/100)))
## [1] "External GWAS identifies number of significant SNPs: 93"
print(paste0("gwas.htlgmm identifies significant SNP positions: ",paste(which(pval_list_htlgmm<0.05/100),collapse = ", ")))
## [1] "gwas.htlgmm identifies significant SNP positions: 7, 10, 19, 27, 98"
Accelerate gwas.htlgmm by plink2
Tutorial of plink2 with example files
Check this link for an introduction to plink2: http://ruzhangzhao.github.io/tutorial/plink2
To be updated.
Fast Matrix Production & Inverse
In our package htlgmm, we provide fast implementation of
matrix product. Please use crossprod_fast to replace
crossprod, and use prod_fast to replace
%*%. For a matrix A, a matrix B,
a vector b.
crossprod_fast(A) : generates \(A^\top\cdot A\), previously coded as crossprod(\(A\)) or t(\(A\))%*%\(A\).
crossprod_fast(A,B) : generates \(A^\top\cdot B\), previously coded as crossprod(\(A\),\(B\)) or t(\(A\))%*%\(B\).
crossprod_fast(A,b) : generates \(A^\top\cdot b\), previously coded as crossprod(\(A\),\(b\)) or t(\(A\))%*%\(b\).
prod_fast(A,B) : generates \(A\cdot B\), previously coded as \(A\)%*%\(B\).
prod_fast(A,b) : generates \(A\cdot b\), previously coded as \(A\)%*%\(b\).
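As a quick sanity check (a sketch, assuming htlgmm is loaded), the fast routines should agree with their base R counterparts up to numerical tolerance:
set.seed(1)
A<-matrix(rnorm(200),20,10);B<-matrix(rnorm(200),20,10);b<-rnorm(20)
all.equal(crossprod_fast(A),crossprod(A))           # A^T A
all.equal(crossprod_fast(A,B),crossprod(A,B))       # A^T B
all.equal(c(crossprod_fast(A,b)),c(crossprod(A,b))) # A^T b
all.equal(prod_fast(A,t(B)),A%*%t(B))               # A B^T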
Under the same setting, for the matrix product \(A\cdot B\), %*% takes time:
## user system elapsed
## 0.585 0.007 0.593
Our fast implementation prod_fast takes time:
## user system elapsed
## 0.163 0.003 0.166
Under the same setting, for \(A^\top\cdot B\), crossprod takes time:
## user system elapsed
## 0.980 0.008 0.988
Our fast implementation crossprod_fast takes time:
## user system elapsed
## 0.161 0.002 0.164
Click for comparison code
set.seed(1)
x<-matrix(rnorm(1e5),1e2,1e3)
y<-matrix(rnorm(1e5),1e3,1e2)
standard_f<-function(){.<-sapply(1:100, function(i){.<-x%*%y })}
## The standard matrix dot product takes time:
system.time(standard_f())
## user system elapsed
## 0.605 0.009 0.615
fast_f<-function(){.<-sapply(1:100, function(i){.<-prod_fast(x,y)})}
## The new fast matrix dot product takes time:
system.time(fast_f())
## user system elapsed
## 0.161 0.003 0.165
set.seed(1)
x<-matrix(rnorm(1e5),1e3,1e2)
y<-matrix(rnorm(1e5),1e3,1e2)
standard_f<-function(){
.<-sapply(1:100, function(i){.<-crossprod(x,y) })
}
## The standard matrix dot product takes time:
system.time(standard_f())
## user system elapsed
## 0.973 0.012 1.002
fast_f<-function(){
.<-sapply(1:100, function(i){.<-crossprod_fast(x,y)})
}
## The new fast matrix dot product takes time:
system.time(fast_f())
## user system elapsed
## 0.164 0.003 0.167
In the package htlgmm, we also provide a fast implementation of the matrix inverse for positive definite matrices. We compare the computation speed of the following methods:
solve : the default matrix inverse from package base.
spdinv : the inverse of a symmetric positive definite matrix from package Rfast.
chol2inv : the inverse from Choleski (or QR) decomposition from package base.
choinv : the inverse based on Cholesky decomposition from package htlgmm.
sqrtchoinv : for a positive definite matrix \(A\), with \(L\) = sqrtchoinv(\(A\)), \(L^\top\cdot L\) is the inverse of \(A\); from package htlgmm.
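As a sanity check of the identities stated above (a sketch, assuming htlgmm is loaded):
set.seed(1)
x<-matrix(rnorm(2000),100,20)
S<-crossprod(x)                  # a symmetric positive definite matrix
all.equal(choinv(S),solve(S))    # Cholesky-based inverse matches solve()
L<-sqrtchoinv(S)
all.equal(crossprod(L),solve(S)) # L^T L equals the inverse of S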
Under the same setting, for the positive definite matrix inverse, solve takes time:
## user system elapsed
## 5.543 0.028 5.577
The matrix inverse by ‘spdinv’ takes time:
## user system elapsed
## 2.978 0.008 2.994
The matrix inverse by ‘chol2inv’ takes time:
## user system elapsed
## 1.583 0.003 1.589
Our fast matrix inverse takes time:
## user system elapsed
## 1.353 0.014 1.370
Our fast matrix inverse and square root takes time:
## user system elapsed
## 0.775 0.016 0.808
Click for comparison code
set.seed(1)
x<-matrix(rnorm(2e7),1e4,2e3)
xtx<-crossprod_fast(x)
## The matrix inverse by 'solve' takes time:
system.time(.<-base::solve(xtx))
## user system elapsed
## 5.521 0.017 5.545
# The remaining timing calls are collapsed on this page;
# they are reconstructed here from the method list above.
## The matrix inverse by 'spdinv' takes time:
system.time(.<-Rfast::spdinv(xtx))
## user system elapsed
## 2.904 0.006 2.915
## The matrix inverse by 'chol2inv' takes time:
system.time(.<-chol2inv(chol(xtx)))
## user system elapsed
## 1.572 0.003 1.576
## The matrix inverse by 'choinv' takes time:
system.time(.<-choinv(xtx))
## user system elapsed
## 1.331 0.010 1.342
## The matrix inverse and square root by 'sqrtchoinv' takes time:
system.time(.<-sqrtchoinv(xtx))
## user system elapsed
## 0.771 0.007 0.779
Feedback & Reference
https://github.com/RuzhangZhao/htlgmm/issues
Please submit any feedback and bug reports on the GitHub issues page, or email Ruzhang Zhao (ruzhangzhao@gmail.com) & Nilanjan Chatterjee (nchatte2@jhu.edu).
Please consider citing our work: Zhao, Ruzhang, et al. “Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets.” arXiv preprint arXiv:2312.12786 (2023).