Tutorial for R packge `htlgmm`

Heterogeneous Transfer Learning Dealing with Unmatched Features

setup

In recent days, datasets with information on all desired features typically have small sample sizes, but one or more external studies with limited set of predictors sometimes have large sample sizes. HTL-GMM method (Heterogeneous Transfer Learning with Generalized Method of Moments) aims to build GLMs and Cox PH models with all desired features incorporating the external studies, which can be potentially high-dimensional.

This tutorial serves as an introduction of HTL-GMM method. We will introduce the R package usage and some application examples. Please refer to https://arxiv.org/abs/2312.12786 for the work of HTL-GMM method.

Code for HTL-GMM Manuscript

Please refer to https://github.com/RuzhangZhao/htlgmm_data_analysis to download the code for simulation and real data application mentioned in the HTL-GMM Manuscript.

Installation

devtools::install_github("RuzhangZhao/htlgmm",force = T)

Method Recap

setup

Main study and external study can be from the same or different population.
External study can be in the form of individual data or pre-trained model with parameter estimation.
\({\bf A}\) and \(\tilde{\bf A}\): study-specific adjustment factors; \({\bf Z}\): overlapping features in both main and external studies; \({\bf W}\): unmatched features only in main study; \({\bf Y}\): outcome variable (family = “gaussian”: continuous; family = “binomial”: binary)

Function Usage of Package htlgmm

Primary Functions of Package htlgmm

We primarily provid three functions htlgmm, cv.htlgmm, gwas.htlgmm. The first two work similarly like functions glmnet and cv.glmnet from R package glmnet. cv.glmnet/cv.htlgmm does cross validation with automatically generated tuning parameter list(s) to select tuning parameters. We recommend users to use cv.htlgmm without knowing prefixed tuning parameters.

htlgmm setup

cv.htlgmm setup

gwas.htlgmm setup

Input Parameters of `cv.htlgmm`

setup

y : Response Variable.
Z : Overlapping Features.
W : Unmatched Features.
A : Study-specific Adjustment Factors. For default setting, we ignore the input of A if there is no intercept term (all variables are centered) when family = “gaussian” or only intercept term when family = “binomial”.
ext_study_info : Parameter estimation from External Study regarding overlapping features. “ext_study_info” comes as list, including the information of estimated coefficient, estimated variance (variance-covariance matrix).
family : Currently focus on linear (“gaussian”) and logistic (“binomial”).
penalty_type : Select from c(“none”,“adaptivelasso”,“lasso”,“ridge”,“elasticnet”). We allow all penalty choices available from glmnet. The default is “lasso”.
beta_initial: The initial estimation for beta if a consistent estimator is available. The default is NULL, and main study is used for initial estimation according to “initial_with_type”.
tune_weight: Whether to assign tuning weight for the regularization of weighting matrix. The default is FALSE.
tune_weight_method: Method for weighting matrix regularization. The default is “multiplicative shrinkage”, which can also be used as “mshrink” or “ms”. The other choice is “ridge”.

Click for more details of input parameters

initial_with_type: Get initial estimation for beta using main study data only by cross validation using penalty regression, chosen from c(“ridge”,“lasso”,“glm”).
weight_adaptivelasso: The customized adaptive-weight of adaptive lasso when using penalty_type = “adaptivelasso”.
alpha: The elasticnet mixing parameter, between 0 and 1, which is corresponding to the alpha for glmnet. Only used when penalty_type = “elasticnet”.
hat_thetaA: If A is not NULL, one can provide hat_thetaA as the input. If “hat_thetaA = NULL”, we estimate hat_thetaA with glm by main study.
V_thetaA: If A is not NULL, one can provide V_thetaA as the input. If “V_thetaA = NULL”, we estimate V_thetaA with glm by main study.
remove_penalty_Z: Do not penalize Z if it is TRUE. The default is FALSE.
remove_penalty_W: Do not penalize W if it is TRUE. The default is FALSE.
inference: Whether to do inference without penalty or post-selection inference with adaptive lasso penalty. The default is “default”, which is TRUE when penalty_type = “none” or “adaptivelasso”, and FALSE otherwise.
fix_C: When fix_C = NULL, the optimal C is computed. When user wants to customize the fix_C, please match its dimension as dim(A)+2*dim(Z)+dim(W) and make sure it is positive definite.
fix_inv_C: When fix_inv_C = NULL, the optimal C is computed. When user wants to customize the fix_inv_C, please match its dimension as dim(A)+2*dim(Z)+dim(W) and make sure it is positive definite. When fix_C and fix_inv_C are both given, the fix_C will be used.
type_measure: Select from c(“default”, “mse”, “deviance”, “auc”). Default is mse(liner), auc(logistic). “deviance” is another choice for binary y.
fix_weight: The fixed weight for the regularization of weighting matrix. The default is NULL. If it is NULL, select the best weight via cross validation.
weight_list: The weight list if it is preset. The default is NULL and weight list will be generated.
gamma_adaptivelasso: The gamma for adaptive lasso. Select from c(1/2,1,2). The default is 1/2.

Output Values of `cv.htlgmm`

beta : Target coefficient estimation of all predictors. The coefficients go with the order of (A, Z, W).
lambda_list: The lambda list for cross validation.
weight_list: The weight list for cross validation, when set tune_weight = TRUE.
cv_mse: The mean square error(mse) when family = “gaussian” for cross validation.
cv_auc: The area under ROC curve(auc) when family = “binomial” for cross validation. For each lambda, cv_auc is the average auc across all folds.
cv_dev: The deviance(dev) when family = “binomial” for cross validation An alternative choice of cv_auc.
lambda_min: The selected best lambda.
weight_min: The selected best weight, when set tune_weight = TRUE.
selected_vars: A list of results for predictors with nonzero estimated coefficients.
- position: the position of nonzero predictors.
- name: the name of nonzero predictors.
- coef: the coefficient of nonzero predictors.
- variance: the variance of nonzero predictors.
- variance-covariance: the variance-covariance matrix of nonzero predictors.
- pval: the p values of nonzero predictors.

Simulation Examples with Binary Outcome

In this simulation example, we illustrate the usage of cv.htlgmm for binary outcome.

Simulation Setup

The simulation example is to illustrate how the cv.htlgmm method is implemented, compared with cv.glmnet. The simulation is performed under logistic regression setting. We use 20 overlapping features (Z) and 100 unmatched features (W). 3/20 of Z whose effects on outcome are nonzero, while 2/100 of W whose effects on outcome are nonzero.

Main study sample size n is 500, while external study sample size nE is 3000.

Click for data generation simulation code

library(htlgmm)
pZ=20 # overlapping features
pW=100 # unmatched features 
coef<-c(rep(0,pZ+pW))
coef[1:3]<-0.5
coef[c(21,22)]<-0.5
n=500
nE=3000
n_joint=n+nE
main_index<-1:n
print(paste0("Nonzero positions: ",paste(which(coef!=0),collapse = ", ")))

## [1] "Nonzero positions: 1, 2, 3, 21, 22"

We jointly generate main and external studies.

y is generated with both Z and W.

set.seed(1)
Z_joint<-matrix(rnorm(n_joint*pZ),n_joint,pZ)
colnames(Z_joint)<-paste0("Z",1:pZ)
W_joint<-matrix(rnorm(n_joint*pW),n_joint,pW)
colnames(W_joint)<-paste0("W",1:pW)
Z<-Z_joint[main_index,]  # separate main and external study for Z
ZE<-Z_joint[-main_index,]

W<-W_joint[main_index,] # only need main study for W

y_joint<-rbinom(n=n_joint,size=1,prob=expit(cbind(Z_joint,W_joint)%*%coef))
y<-y_joint[main_index] # separate main study y
yE<-y_joint[-main_index] # separate external study y

Display the data structure of main study (y, Z, W) with the first 2 rows and first 30 columns.

main_study<-cbind(y,Z,W)
head(main_study,2)[,1:30]

##      y         Z1        Z2         Z3         Z4        Z5         Z6
## [1,] 1 -0.6264538 0.5205997 -1.3254177  0.7051107 0.5232667 -2.4675028
## [2,] 0  0.1836433 0.3775619  0.9519797 -0.6268688 0.9935537 -0.7000656
##              Z7         Z8          Z9       Z10        Z11       Z12
## [1,]  0.6010915 -0.9252129  0.97027832 0.5375559  0.6986309 0.5574862
## [2,] -2.7671158  0.1805965 -0.02072336 0.2271813 -1.1650711 0.3520408
##             Z13       Z14         Z15       Z16       Z17        Z18      Z19
## [1,]  0.3707608 2.5285947 -0.06455497 -1.504228 1.2651240  0.7859485 1.380371
## [2,] -0.7265963 0.1232134 -0.58150321  1.093152 0.5123215 -0.4794245 1.682890
##            Z20          W1         W2       W3         W4         W5         W6
## [1,] 0.3768347 -1.00203611 -0.7132184 1.042499  0.9530678  1.1411060  0.3168121
## [2,] 1.2553529  0.02590761 -1.0166592 1.080826 -0.4157007 -0.6290421 -1.2707453
##             W7         W8          W9
## [1,] 0.7156816 -0.9463759  0.52378339
## [2,] 1.4276169 -1.4034296 -0.06035219

Display the data structure of external study (yE, ZE).

external_study<-cbind(yE,ZE)
head(external_study,2)

##      yE          Z1         Z2         Z3        Z4         Z5        Z6
## [1,]  0  0.07730312 -1.1346302 -0.7948034 -1.411522 -0.5676023 0.9514099
## [2,]  1 -0.29686864  0.7645571  0.6060846  1.083870  0.4837350 0.4570987
##              Z7         Z8         Z9        Z10       Z11       Z12        Z13
## [1,] -0.5682999  0.1965621  0.5263087 -1.0905102 0.8084132 -1.680998 -0.4447731
## [2,]  0.1131262 -0.4199427 -1.1767110 -0.4629759 1.5886737  1.265549  0.4094111
##             Z14        Z15        Z16        Z17       Z18       Z19       Z20
## [1,] -1.6220979  0.5468718 -0.6363353 -1.6490749 0.3413341  1.564013 -1.024849
## [2,]  0.1897096 -0.1667157  0.7258137 -0.7714769 0.4136665 -0.795476 -1.249276

Scale Z and W predictors

Z<-scale(Z)
ZE<-scale(ZE)
W<-scale(W)

Train with Main Study by cv.glmnet

We do cv.glmnet for lasso only on main study.

Train with External Study by glm

Summarize external study to be pre-trained model with parameter estimation. Here we use GLM with external study (yE,ZE). Extract estimated coefficients (other than intercept terms, only for Z), variance-covariance matrix. Convert the results into a list.

resglm<-glm(y~.,data = data.frame(y=yE,ZE),family = "binomial")
study_external = list(
                Coeff=resglm$coefficients[-1],
                Covariance=vcov(resglm)[-1,-1])
ext_study_info<-list()
ext_study_info[[1]] <- study_external

Train with Main and External Studies by `cv.htlgmm`

Train with cv.htlgmm, the family is binomial. First train HTL-GMM using lasso. The tune_weight option introduces the regularization of weighting matrix.

res_htlgmm<-cv.htlgmm(y=y,
                      Z=Z,
                      W=W,
                      ext_study_info = ext_study_info,
                      family="binomial",
                      tune_weight = T,
                      penalty_type = "lasso",
                      beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))

For HTL-GMM using adaptive lasso, the adaptive weight is recommended using weight from ridge regression. When using adaptive lasso, the inference outputs are generated by default.

res_ridge<-cv.glmnet(x=cbind(Z,W),y=y,alpha=0,family="binomial")
weight_adaptivelasso<-1/abs(c(coef(res_ridge,s="lambda.min")[-1]))^(1/2)
res_htlgmm_ada<-cv.htlgmm(y=y,
                      Z=Z,
                      W=W,
                      ext_study_info = ext_study_info,
                      family="binomial",
                      tune_weight = T,
                      penalty_type = "adaptivelasso",
                      weight_adaptivelasso = weight_adaptivelasso,
                      beta_initial = as.vector(coef(res_glmnet,s="lambda.min")))

Check the estimation error between estimated coefficients and true coefficients.

## Estimation Error: lasso(0.2046);
##  htlgmm(0.0734)

Click for comparison code

est_coef_htlgmm<-res_htlgmm$beta[-1]
ee_htlgmm<-sum((coef-est_coef_htlgmm)^2)
ee_htlgmm<-round(ee_htlgmm,4)

est_coef_lasso<-coef.glmnet(res_glmnet,s="lambda.min")[-1]
ee_lasso<-sum((coef-est_coef_lasso)^2)
ee_lasso<-round(ee_lasso,4)

cat(paste0("Estimation Error: ","lasso(",ee_lasso,");\n htlgmm(",ee_htlgmm,")"))

## Estimation Error: lasso(0.2046);
##  htlgmm(0.0734)

Check the out of sample prediction performance on test data with 2000 sample size.

## AUC on test data: lasso(0.7355);
##  htlgmm(0.7467)

Click for test data generation simulation code

n_test = 2000
set.seed(203)
Z_test<-matrix(rnorm(n_test*pZ),n_test,pZ)
colnames(Z_test)<-paste0("Z",1:pZ)
W_test<-matrix(rnorm(n_test*pW),n_test,pW)
colnames(W_joint)<-paste0("W",1:pW)

y_test<-rbinom(n=n_test,size=1,prob=expit(cbind(Z_test,W_test)%*%coef))
library(pROC)
# intercepts of estimated coefficients are ignored 
# since intercepts do not affect AUC. 
pred_y_htlgmm<-cbind(Z_test,W_test)%*%est_coef_htlgmm
pred_y_lasso<-cbind(Z_test,W_test)%*%est_coef_lasso
  
suppressMessages(auc_htlgmm<-c(auc(y_test,c(htlgmm::expit(pred_y_htlgmm)),direction = "<")))
suppressMessages(auc_lasso<-c(auc(y_test,c(htlgmm::expit(pred_y_lasso)),direction = "<")))

cat(paste0("AUC on test data: ","lasso(",round(auc_lasso,4),");\n htlgmm(",round(auc_htlgmm,4),")"))

## AUC on test data: lasso(0.7355);
##  htlgmm(0.7467)

Click for detailed inference output of cv.htlgmm

Example of beta

est_coef_htlgmm<-res_htlgmm$beta
names(est_coef_htlgmm)<-c("Intercept",paste0('Z',1:pZ),paste0('W',1:pW))
est_coef_htlgmm[1:30] # only show the first 30 coefficients

##    Intercept           Z1           Z2           Z3           Z4           Z5 
## -0.044383546  0.498662972  0.448938638  0.308491332  0.000000000  0.000000000 
##           Z6           Z7           Z8           Z9          Z10          Z11 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000 
##          Z12          Z13          Z14          Z15          Z16          Z17 
##  0.000000000  0.000000000  0.000000000  0.028778728 -0.008967114 -0.053259808 
##          Z18          Z19          Z20           W1           W2           W3 
##  0.000000000  0.000000000  0.000000000  0.461327236  0.356570948  0.000000000 
##           W4           W5           W6           W7           W8           W9 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000

HTLGMM selected_vars

print(res_htlgmm_ada$selected_vars$coef)

## [1] -0.04140164  0.51867246  0.46843552  0.31214784  0.48650693  0.40330147
## [7] -0.03980315 -0.03332190

print(res_htlgmm_ada$selected_vars$variance)

## [1] 0.007575373 0.002137143 0.002087740 0.002199780 0.007014063 0.007770544
## [7] 0.006641592 0.006219383

print(res_htlgmm_ada$selected_vars$pval)

## [1] 6.343018e-01 3.268335e-29 1.158337e-24 2.826334e-11 6.283285e-09
## [6] 4.759089e-06 6.252620e-01 6.726392e-01

The FDR adjusted positions passing significant level 0.05 after BH adjustment.

print(paste0("The predictors whose effects on outcome are nonzero: ",paste(c(paste0('Z',1:pZ),paste0('W',1:pW))[coef!=0],collapse = ", ")))

## [1] "The predictors whose effects on outcome are nonzero: Z1, Z2, Z3, W1, W2"

print(paste0("The predictors selected by htlgmm after FDR adjustment :",paste(res_htlgmm_ada$selected_vars$FDR_adjust_name,collapse = ", ")))

## [1] "The predictors selected by htlgmm after FDR adjustment :Z1, Z2, Z3, W1, W2"

The lambda list and cross validation auc.

head(lambda_mse_df<-data.frame(lambda=res_htlgmm$lambda_list,
                  cv_mse=t(res_htlgmm$cv_auc)))

##      lambda  cv_mse.1  cv_mse.2  cv_mse.3  cv_mse.4  cv_mse.5  cv_mse.6
## 1 0.5551131 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## 2 0.5057983 0.7249778 0.7248000 0.7253333 0.7239111 0.7328000 0.7212444
## 3 0.4608646 0.7224889 0.7226667 0.7233778 0.7328000 0.7233778 0.7155556
## 4 0.4199227 0.7232000 0.7226667 0.7248000 0.7285333 0.7182222 0.7107556
## 5 0.3826179 0.7223111 0.7248000 0.7306667 0.7244444 0.7162667 0.7100444
## 6 0.3486272 0.7212444 0.7304889 0.7328000 0.7232000 0.7143111 0.7091556

The best selected lambda.

lambda_min<-res_htlgmm$lambda_min
print(lambda_min)

## [1] 0.0542351

Real Application Example for Risk Prediction

The real application for 10-year risk prediction of “incident” Breast Cancer using proteomics and risk factor data

setup

The proteomics data is downloaded from https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/past-data-releases

The risk factors are downloaded from https://biobank.ndph.ox.ac.uk/showcase/browse.cgi

Because of data privacy issue, we use simulated risk factor and proteomics data for the application of cv.htlgmm here.

Click for risk factor and proteomics data generation simulation code

Load the required package including htlgmm.

library(caret)
library(glmnet)
library(htlgmm)
library(pROC)

Simulated UKB data including disease status, risk factor and proteomics

The risk factors names are listed from the Breast Cancer related ones. Record the summarized mean and standard deviance for each risk factor, including whether the predictors are binary or continuous.

risk_factor_names<-c("age","BMI","Height","Smoking_Status","Alcohol_Freq",
                     "Mother_Breast_Cancer_History","Silbings_Breast_Cancer_History",
                     "Number_of_Birth","Age_Menarche","Ever_Used_Oral_Contraceptive",
                     "Ever_Received_Hormone_Replacement_Therapy",
                     "Ever_Received_Bilateral_Oophorectomy",
                     "Ever_Had_Still_Birth","Prevalent_Benign_Breast_Disease")

mean_risk_factor<-c(59.7636,27.1449,162.1822,0.3442,0.1804,0.0758,0.0409,1.8789,12.9352,0.8001,0.4922,0.0961,0.3094,0.0083)

sd_risk_factor<-c(5.4963,5.0753,6.1417,0.4751,0.3845,0.2647,0.1980,1.1317,1.6063,0.3999,0.4999,0.2948,0.4622,0.0907)

names(mean_risk_factor)<-risk_factor_names
print(mean_risk_factor)

##                                       age 
##                                   59.7636 
##                                       BMI 
##                                   27.1449 
##                                    Height 
##                                  162.1822 
##                            Smoking_Status 
##                                    0.3442 
##                              Alcohol_Freq 
##                                    0.1804 
##              Mother_Breast_Cancer_History 
##                                    0.0758 
##            Silbings_Breast_Cancer_History 
##                                    0.0409 
##                           Number_of_Birth 
##                                    1.8789 
##                              Age_Menarche 
##                                   12.9352 
##              Ever_Used_Oral_Contraceptive 
##                                    0.8001 
## Ever_Received_Hormone_Replacement_Therapy 
##                                    0.4922 
##      Ever_Received_Bilateral_Oophorectomy 
##                                    0.0961 
##                      Ever_Had_Still_Birth 
##                                    0.3094 
##           Prevalent_Benign_Breast_Disease 
##                                    0.0083

For simplicity, we assume all binary risk factors follow binomial distribution and continuous risk factors follow normal distribution. We perform case-control analysis to allow fast computation in this simulation. The main study is assumed to be 2,000, and the external study is 9 times more, where the ratio is similar to actual case.

There are in total 1,500 proteins in UK Biobank proteomics data, we use 500 for simulation.

binary_predictors<-c("Mother_Breast_Cancer_History",
                     "Silbings_Breast_Cancer_History",
                     "Ever_Used_Oral_Contraceptive",
                     "Ever_Received_Hormone_Replacement_Therapy",
                     "Ever_Received_Bilateral_Oophorectomy",
                     "Ever_Had_Still_Birth",
                     "Prevalent_Benign_Breast_Disease")

n_main<-2000
n_external<-n_main*9
n_joint<-n_main+n_external
p_protein<-500
set.seed(2023)
risk_factor<-list()
for( i in which(risk_factor_names%in%binary_predictors)){
    risk_factor[[i]]<-rbinom(n = n_joint,size = 1,prob = mean_risk_factor[i])
}
for( i in which(!risk_factor_names%in%binary_predictors)){
    risk_factor[[i]]<-rnorm(n = n_joint,mean = mean_risk_factor[i],sd = sd_risk_factor[i])
}
risk_factor_matrix<-Reduce("cbind",risk_factor)

Check the data structure of the risk factor matrix, which is similar to the actual data structure.

colnames(risk_factor_matrix)<-risk_factor_names
round(head(risk_factor_matrix,2),2)

##        age   BMI Height Smoking_Status Alcohol_Freq
## [1,] 64.52 31.98 160.26          -0.23         0.61
## [2,] 44.70 27.05 168.29           0.38         0.20
##      Mother_Breast_Cancer_History Silbings_Breast_Cancer_History
## [1,]                            0                              0
## [2,]                            0                              0
##      Number_of_Birth Age_Menarche Ever_Used_Oral_Contraceptive
## [1,]            2.62        14.42                            0
## [2,]            3.12        13.53                            1
##      Ever_Received_Hormone_Replacement_Therapy
## [1,]                                         0
## [2,]                                         1
##      Ever_Received_Bilateral_Oophorectomy Ever_Had_Still_Birth
## [1,]                                    0                    0
## [2,]                                    0                    0
##      Prevalent_Benign_Breast_Disease
## [1,]                               0
## [2,]                               0

The proteins are assumed to follow N(0,1).

proteomics_matrix<-matrix(rnorm(n_joint*p_protein),nrow = n_joint,ncol = p_protein)
eg_proteomics<-proteomics_matrix[1:6,1:5]
colnames(eg_proteomics)<-c("aarsd1","abhd14b","abl1","acaa1","ace2")
eg_proteomics

##          aarsd1     abhd14b       abl1      acaa1          ace2
## [1,] -1.1105242 -0.01414405 -1.0864208 -0.5041349  4.490670e-01
## [2,]  1.0969956  0.81478464 -0.3507938 -1.2923906 -1.026354e-01
## [3,]  0.1096429  1.17141003 -0.6785858  1.6844419 -6.698569e-05
## [4,] -0.4895755  0.09112312 -0.5685032 -0.6172262 -5.422962e-01
## [5,]  1.0021940  0.66385649 -1.4594543  0.6078855 -1.003801e+00
## [6,] -1.0121142 -0.64776973 -0.4683433 -1.1744974 -1.432753e+00

We assume the disease status follows binomial distribution, and the case:control ratio to be roughly 0.1.

coef_risk_factor<-c(0.1172,0.1216,0.1067,-0.0312,0.0816,0.1038,0.0755,-0.0507,-0.0099,-0.0099,0.0376,-0.0266,-0.0251,0.0628)
coef_proteomics<-rep(0,p_protein)
set.seed(2024)
coef_proteomics[sample(1:p_protein,20)]<-rnorm(20,mean=0,sd = 0.1)

mean_case<-0.1
Y<-rbinom(n_joint,size = 1,prob =expit(
    log(mean_case/(1-mean_case))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(proteomics_matrix%*%coef_proteomics)))
print(paste0("simulated disease prevalence:",sum(Y)/n_joint))

## [1] "simulated disease prevalence:0.11255"

Separate the simulated data into main study and external study.

main_study<-data.frame(Y = Y[1:n_main],risk_factor_matrix[1:n_main,],proteomics_matrix[1:n_main,])

external_study<-data.frame(Y = Y[-c(1:n_main)],risk_factor_matrix[-c(1:n_main),])

Scale the predictors from external study. Extract summary statistics from external study.

external_study[,-1]=scale(external_study[,-1])
family = "binomial"
ext_study_info=list()
external_glm=glm(Y~.,data = external_study,family = family)
study_external = list(
                Coeff=external_glm$coefficients[-1],
                Covariance=vcov(external_glm)[-1,-1])
ext_study_info[[1]] = study_external
#ext_study_info

Split main study into train and test with the proportion (70% and 30%).

library(caret)
set.seed(2024)
test_index=createDataPartition(main_study$Y,p = 0.3)$Resample1

y = main_study$Y[-test_index]
X = as.matrix(main_study[-test_index,-1])
y_test = main_study$Y[test_index]
X_test = as.matrix(main_study[test_index,-1])
prefunc <- preProcess(X, method = c("center", "scale"))
X = predict(prefunc, X)
X_test = as.matrix(predict(prefunc, X_test))        
p_risk = ncol(external_study)-1

Train with Main Study by cv.glmnet

y and X are training sets for main study, while y_test and X_test are test sets for main study. We perform normalization on X and apply the same transformation to X_test.

Perform cv.glmnet only for main study’s training set and check the performance on main study’s test set. The criterion is AUC.

start_t=Sys.time()
main_lasso=cv.glmnet(x=X, y=y, family="binomial")
end_t=Sys.time()
sprintf('cv.glmnet time: %.2f %s', end_t-start_t, units(end_t-start_t))

## [1] "cv.glmnet time: 14.27 secs"

pred_main_lasso=c(predict(main_lasso,newx=X_test,s="lambda.min"))
suppressMessages(auc_main_lasso<-auc(y_test,expit(pred_main_lasso)))

Train with Main study and External Study by `cv.htlgmm`

Perform cv.htlgmm for main study’s training set with external study and check the performance on main study’s test set. The initial beta for cv.htlgmm is taken from the output of cv.glmnet.

start_t=Sys.time()
main_external_htlgmm=htlgmm::cv.htlgmm(y = y,
                               Z = X[,c(1:p_risk)],
                               W = X[,-c(1:p_risk)],
                               ext_study_info = ext_study_info,
                               penalty_type = "lasso",
                               beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
                               family = "binomial")
end_t=Sys.time()
sprintf('cv.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))

## [1] "cv.htlgmm time: 18.03 secs"

Perform cv.htlgmm with the option of weight tuning by setting tune_weight = TRUE.

start_t=Sys.time()
main_external_htlgmm_tune=htlgmm::cv.htlgmm(y = y,
                               Z = X[,c(1:p_risk)],
                               W = X[,-c(1:p_risk)],
                               ext_study_info = ext_study_info,
                               family = "binomial",
                               penalty_type = "lasso",
                               beta_initial = as.vector(coef(main_lasso,s="lambda.min")),
                               tune_weight = TRUE)
end_t=Sys.time()
sprintf('cv.htlgmm with weight tuning time: %.2f %s', end_t-start_t, units(end_t-start_t))

## [1] "cv.htlgmm with weight tuning time: 35.51 secs"

Performance Comparison

Check the AUC comparison between cv.glmnet and cv.htlgmm.

## AUC on test data: main study only lasso(0.544);
##  htlgmm(0.5477);
##  htlgmm_tune_weight(0.5921)

Plot to show the change of cross validation deviance against log lambda (tuning parameter for lasso penalty).

Click for comparison code

pred_main_external_htlgmm=c(X_test%*%main_external_htlgmm$beta[-1])
suppressMessages(auc_main_external_htlgmm<-pROC::auc(y_test,expit(pred_main_external_htlgmm)))

pred_main_external_htlgmm_tune=c(X_test%*%main_external_htlgmm_tune$beta[-1])
suppressMessages(auc_main_external_htlgmm_tune<-pROC::auc(y_test,expit(pred_main_external_htlgmm_tune)))

cat(paste0("AUC on test data: main study only lasso(",round(auc_main_lasso,4) ,");\n htlgmm(", round(auc_main_external_htlgmm,4),");\n htlgmm_tune_weight(",round(auc_main_external_htlgmm_tune,4),")"))

## AUC on test data: main study only lasso(0.544);
##  htlgmm(0.5477);
##  htlgmm_tune_weight(0.5921)

Plot to show the change of cross validation deviance against log lambda (tuning parameter for lasso penalty).

df = data.frame(lambda = main_external_htlgmm$lambda_list,
                cv_dev=main_external_htlgmm$cv_dev,
                cv_auc=main_external_htlgmm$cv_auc)
library(ggplot2)
ggplot(df)+
  geom_line(aes(x=log(lambda),y=cv_auc))+
  geom_vline(xintercept = log(main_external_htlgmm$lambda_min),
             linetype=2)+
  theme(panel.background=element_rect(fill='transparent', color='black'),
        axis.text=element_text(size=14,face="bold"))+
  geom_text(aes(x=log(main_external_htlgmm$lambda_min),y=median(cv_auc),label='lambda with max cv auc'))

GWAS version of HTLGMM

In this example, simulation example is generated to illustrate the application of gwas.htlgmm.

We simulate SNP_matrix by binomial distribution, where the allele frequencies are simulated from uniform distribution. We consider 100 SNPs, with main study sample size 10k and external summary statistics with sample size 100k.

The adjustment covariates include age, sex, BMI. Assume age and sex are adjusted from external summary statistics, while BMI is not adjusted. We can further adjust BMI using both main study and external summary statistics by gwas.htlgmm.

Simulation Setup

Click for genotype and phenotype generation simulation code, including how to set up main and external studies

Simulate SNP matrix and covariates

library(htlgmm)
snp_count<-100
n_ext<-1e5
n_main<-1e4
n_joint<-n_main+n_ext
set.seed(2024)
allele_freq<-runif(snp_count,min = 0.1,max = 0.9)
snp_matrix<-sapply(1:snp_count, function(i){
    rbinom(n=n_joint,size=2,prob=allele_freq[i])})

age<-rnorm(n_joint,mean = 60,sd = 5)
sex<-rbinom(n_joint,size=1,prob = 0.5)
BMI<-rnorm(n_joint,mean = 27,sd = 5)

Simulate binary outcome by randomly setting casual SNPs

Randomly set true casual SNP postions and check the case and control counts for simulated disease.

true_snp_position<-sample(1:100,5)
coef_snp<-rep(0,snp_count)
coef_snp[true_snp_position]<-log(1.2)*sample(c(1,-1),5,replace=TRUE)
print(paste0("True Causal SNP Positions: ",paste(sort(true_snp_position),collapse = ", ")))

## [1] "True Causal SNP Positions: 7, 10, 19, 27, 98"

coef_risk_factor<-c(0.1172,-0.05,0.1216)
risk_factor_matrix<-cbind(age,sex,BMI)

t2d<-rbinom(n_joint,size = 1,prob = expit(
    log(0.1/(1-0.1))+c(scale(risk_factor_matrix)%*%coef_risk_factor)+c(snp_matrix%*%coef_snp)))
table(t2d)

## t2d
##     0     1 
## 98858 11142

Split simulated data into main study and external study

A consist of age and sex; W only includes BMI; Z only includes each SNP. In general case, A, and W can be lists of covariates. For the A input of gwas.htlgmm, a column of 1 is required for intercept term.

A<-cbind(1,risk_factor_matrix[1:n_main,c(1,2)])
AE<-cbind(1,risk_factor_matrix[-c(1:n_main),c(1,2)])
W<-risk_factor_matrix[1:n_main,3,drop=F] 
y<-t2d[1:n_main]
yE<-t2d[-c(1:n_main)]
snp_matrix_main<-snp_matrix[1:n_main,]
snp_matrix_ext<-snp_matrix[-c(1:n_main),]

Simulate External GWAS Summary Statistics

The simulated external summary statistics ext_study_info is corresponding to the external summary statistics we usually obtain from a large consortia.

Please follow the list format below for external summary statistics.

ext_study_info<-list()
for(i in 1:snp_count){
    ZE<-snp_matrix_ext[,i]
    resglm<-glm(y~.,data = data.frame(y=yE,ZE,AE[,-1]),family = "binomial")
    study_external = list(
        Coeff=resglm$coefficients[2],
        Covariance=vcov(resglm)[2,2])
    ext_study_info[[i]] <- study_external
}
print(ext_study_info[[1]])

## $Coeff
##           ZE 
## -0.008973933 
## 
## $Covariance
## [1] 0.0003068138

# center A and W except intercept term
A[,-1]<-scale(A[,-1],scale=F)
W<-scale(W,scale=F)

Randomly set true casual SNP postions and check the case and control counts for simulated disease.

## [1] "True Causal SNP Positions: 7, 10, 19, 27, 98"

## t2d
##     0     1 
## 98858 11142

The simulated external summary statistics ext_study_info is corresponding to the external summary statistics we usually obtain from a large consortia.

Input Parameters of `gwas.htlgmm`

gwas.htlgmm requireds:

(1) : All the covariates (A & W) except intercept term should be centered with mean 0;
(2) : The SNP variable also needs to be centered with mean 0;
(3) : With provided beta_initial, the beta_initial should be corresponding to be the centered version and follow the order of (A, W, Z).

Train with Main Study and External GWAS by `gwas.htlgmm`

Key Parameter List of Function gwas.htlgmm

y : Outcome of Interest. Binary of continuous.
Z : SNP.
W : Covariate of Interest, Unadjusted in External GWAS.
A : Study-specific Adjustment Factors. All the covariates adjusted in external summary statistics. When family = “binomial”, a column with 1 should be added for intercept term.
ext_study_info : External GWAS Summary Statistics. “ext_study_info” comes as list, including the information of estimated coefficient, estimated variance for the SNP.
family : linear (“gaussian”) and logistic (“binomial”).
beta_initial :
repeated_term : The repeated computation which can be reused.
output_SNP_only : Only output SNP updated estimated coefficient, estimated variance. Otherwise, information of all covariates can be outputted.
output_tmp : Output repeated_term or not.

Note that repeated_term should be computed in advance to accelerate the computation.

start_t=Sys.time()
library(speedglm,quietly = T)
# first compute the repeated terms. 
repeated_term <- gwas.htlgmm.repeated.term(y,A,"binomial")
beta_list<-c()
se_list<-c()
for(i in 1:snp_count){
    # center SNP
    Z<-snp_matrix_main[,i]
    Z<-scale(Z,scale=F)
    # beta initial 
    resglm<-speedglm(y~.,data = data.frame(y=y,A[,-1],W,Z),family=binomial())
    beta_initial<-resglm$coefficients
    # find external gwas
    GWAS_ext_study_info = list()
    GWAS_ext_study_info[[1]] = ext_study_info[[i]]
    res.gwas<-gwas.htlgmm(y,Z,W,GWAS_ext_study_info,A,"binomial",beta_initial,repeated_term=repeated_term)
    beta_list<-c(beta_list,res.gwas$beta)
    se_list<-c(se_list,sqrt(res.gwas$variance))
}
end_t=Sys.time()
sprintf('gwas.htlgmm time: %.2f %s', end_t-start_t, units(end_t-start_t))

## [1] "gwas.htlgmm time: 3.35 secs"

Verify the results

In our simulation, because BMI works as a general strong risk factor, without BMI, the GWAS summary statistics will not result in correct conclusion.

## [1] "External GWAS identifies siginicant SNP number of : 93"

## [1] "gwas.htlgmm identifies siginificant SNP positions: 7, 10, 19, 27, 98"

Click for comparison code

pval_list_ext<-sapply(1:snp_count, function(i){
  pchisq(ext_study_info[[i]][[1]]^2/ext_study_info[[i]][[2]]^2,1,lower.tail = F)
})
pval_list_htlgmm<-pchisq(beta_list^2/se_list^2,1,lower.tail = F)
print(paste0("External GWAS identifies siginicant SNP number of : ",sum(pval_list_ext<0.05/100)))

## [1] "External GWAS identifies siginicant SNP number of : 93"

print(paste0("gwas.htlgmm identifies siginificant SNP positions: ",paste(which(pval_list_htlgmm<0.05/100),collapse = ", ")))

## [1] "gwas.htlgmm identifies siginificant SNP positions: 7, 10, 19, 27, 98"

Accelerate `gwas.htlgmm` by plink2

Tutorial of plink2 with example files

Check this link for introduction of plink2: http://ruzhangzhao.github.io/tutorial/plink2

To be updated.

Fast Matrix Production & Inverse

In our package htlgmm, we provide fast implementation of matrix product. Please use crossprod_fast to replace crossprod, and use prod_fast to replace %*%. For a matrix A, a matrix B, a vector b.

crossprod_fast(A) : generates \(A^\top\cdot A\), previously coded as crossprod(\(A\)) or t(\(A\))%*%\(A\).
crossprod_fast(A,B) : generates \(A^\top\cdot B\), previously coded as crossprod(\(A\),\(B\)) or t(\(A\))%*%\(B\).
crossprod_fast(A,b) : generates \(A^\top\cdot b\), previously coded as crossprod(\(A\),\(b\)) or t(\(A\))%*%\(b\).
prod_fast(A,B) : generates \(A\cdot B\), previously coded as \(A\)%*%\(B\).
prod_fast(A,b) : generates \(A\cdot b\), previously coded as \(A\)%*%\(b\).

Under the same setting \(A\cdot B\) for matrix dot product, %*% takes time:

##    user  system elapsed 
##   0.585   0.007   0.593

Our fast implementation prod_fast takes time:

##    user  system elapsed 
##   0.163   0.003   0.166

Under the same setting \(A^\top\cdot B\) for matrix dot product, crossprod takes time:

##    user  system elapsed 
##   0.980   0.008   0.988

Our fast implementation crossprod_fast takes time:

##    user  system elapsed 
##   0.161   0.002   0.164

Click for comparison code

set.seed(1)
x<-matrix(rnorm(1e5),1e2,1e3)
y<-matrix(rnorm(1e5),1e3,1e2)
standard_f<-function(){.<-sapply(1:100, function(i){.<-x%*%y })}
## The standard matrix dot product takes time:
system.time(standard_f())

##    user  system elapsed 
##   0.605   0.009   0.615

fast_f<-function(){.<-sapply(1:100, function(i){.<-prod_fast(x,y)})}
## The new fast matrix dot product takes time:
system.time(fast_f())

##    user  system elapsed 
##   0.161   0.003   0.165

set.seed(1)
x<-matrix(rnorm(1e5),1e3,1e2)
y<-matrix(rnorm(1e5),1e3,1e2)
standard_f<-function(){
    .<-sapply(1:100, function(i){.<-crossprod(x,y) })
}
## The standard matrix dot product takes time:
system.time(standard_f())

##    user  system elapsed 
##   0.973   0.012   1.002

fast_f<-function(){
    .<-sapply(1:100, function(i){.<-crossprod_fast(x,y)})
}
## The new fast matrix dot product takes time: 
system.time(fast_f())

##    user  system elapsed 
##   0.164   0.003   0.167

In package htlgmm, we also provide fast implementation of matrix inverse for positive definite matrix. We compare the computation speed among the following methods:

solve : the default matrix inverse from package base.
spdinv : the inverse of a symmetric positive definite matrix from package Rfast.
chol2inv : the inverse from Choleski (or QR) decomposition from package base.
choinv : the inverse based on Cholesky decomposition from package htlgmm.
sqrtchoinv : for positive definite matrix \(A\), \(L\) = sqrtchoinv(\(A\)), we have \(L^\top\cdot L\) is the inverse of \(A\).

Under the same setting for positive definition matrix inverse, solve takes time:

##    user  system elapsed 
##   5.543   0.028   5.577

The matrix inverse by ‘spdinv’ takes time:

##    user  system elapsed 
##   2.978   0.008   2.994

The matrix inverse by ‘chol2inv’ takes time:

##    user  system elapsed 
##   1.583   0.003   1.589

Our fast matrix inverse takes time:

##    user  system elapsed 
##   1.353   0.014   1.370

Our fast matrix inverse and square root takes time:

##    user  system elapsed 
##   0.775   0.016   0.808

Click for comparison code

set.seed(1)
x<-matrix(rnorm(2e7),1e4,2e3)
xtx<-crossprod_fast(x)
## The matrix inverse by 'solve' takes time:
system.time(.<-base::solve(xtx))

##    user  system elapsed 
##   5.521   0.017   5.545

## The matrix inverse by 'spdinv' takes time:
system.time(.<-Rfast::spdinv(xtx))

##    user  system elapsed 
##   2.904   0.006   2.915

## The matrix inverse by 'chol2inv' takes time:
system.time(.<- base::chol2inv(xtx))

##    user  system elapsed 
##   1.572   0.003   1.576

## The new fast matrix inverse takes time:
system.time(.<-htlgmm::choinv(xtx))

##    user  system elapsed 
##   1.331   0.010   1.342

## The new fast matrix inverse and square root takes time:
system.time(.<-htlgmm::sqrtchoinv(xtx))

##    user  system elapsed 
##   0.771   0.007   0.779

Feedback & Reference

https://github.com/RuzhangZhao/htlgmm/issues

Please submit any feedback and bugs on github issue page or email Ruzhang Zhao (ruzhangzhao@gmail.com) & Nilanjan Chatterjee (nchatte2@jhu.edu).

Please consider citing our work: Zhao, Ruzhang, et al. “Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets.” arXiv preprint arXiv:2312.12786 (2023).

Tutorial for R packge htlgmm

Heterogeneous Transfer Learning Dealing with Unmatched Features