Welcome to the AB-PRS R package tutorial! In this guide, you will learn how to use the package’s functions in two distinct ways. The first approach is straightforward: you can use the ABPRS function to directly obtain the selected \(\theta_{SNP}\) supports and their corresponding weights (Example 1). The second approach walks you through using individual helper functions, each corresponding to a step in the AB-PRS flowchart, or Figure 1 in the paper (Example 2). The tutorial uses simulated datasets generated by a function within the package, so no external data downloads are necessary. For demonstrative purposes, we will be generating binary outcomes in the first example and continuous outcomes in the second one.

Below is a list of all functions available in the package, each of which will be covered in detail. For more specific information about any of these functions, you can refer to the user manual by using the help command ?function_name after the package is loaded in your R environment:

Installation and Loading

Install library

To install the AB-PRS package, please run the following function in R:

remotes::install_github("Cedars-CIG/ABPRS") 

Load library

After installing the AB-PRS package, you can use it the same way as you use other R packages. Simply load it using the code below and use ? to view the user manual of a specific function.

library(ABPRS)

Example 1: Overall Function on Simulated Dataset with Binary Outcome

Part 1: Data Generation and Splitting

First, you need to generate the simulated data, specifically the phenotype and genotype information. The data_simulation function in our package allows users to generate those information at ease with high customization.

Using the following code, you will generate data for 10,000 individuals with 250 SNPs, including 25 additive, 25 codominant, and 200 non-effective SNPs. All SNPs have a minor allele frequency (MAF) of 0.3. Furthermore, this example generates binary phenotypes. The effect size is set to 0.7, and the baseline effect (beta0) is -19.5, which maintains the case prevalence to around 30% in this dataset. Note that a larger beta value will result in fewer case outcomes.

For more details on how the outcomes are generated, please refer to the manual by running ?data_simulation in R.

data <- data_simulation(m=250, effect_snps_vec = c(25, 0, 0, 25), n=10000, 
                        maf=0.3, effect_size=0.7, beta0= (-19.5), binary = TRUE)

From the simulation function, a dataframe with phenotype in the first column and genotype in the rest of the columns is produced.

#> 'data.frame':    10000 obs. of  251 variables:
#>   [list output truncated]
Phenotype SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
1 1 0 0 1 0 1 0 1 0 0 0
2 1 2 0 0 0 0 0 0 0 1 0
3 0 0 1 0 1 0 1 1 0 0 2
4 0 0 0 0 1 0 0 0 2 1 1
5 0 0 1 1 0 0 0 0 1 0 0
6 1 0 0 2 1 0 1 0 0 0 1
7 0 1 0 1 1 1 0 1 0 2 1
8 1 0 1 0 2 1 1 1 1 0 2

As you can see, the case outcome is ~30% using the above combination of parameters:

mean(data[,1])
#> [1] 0.3153

Now that you have the data, you can divide them for training, validation, and testing purposes. The following code produce three non-overlapping partitions with 50%, 25%, and 25% of the data respectively.

sample <- sample(1:3,size=10000,replace=T,prob=c(0.5, 0.25, 0.25))
data_train = data[which(sample==1),]
data_val = data[which(sample==2),]
data_test = data[which(sample==3),]
rm(sample)

For clarity in upcoming analysis, you can store the data into separate variables.

training_phenotype = data_train[,1]
validation_phenotype = data_val[,1]
testing_phenotype = data_test[,1]

training_genotype = data_train[,-1]
validation_genotype = data_val[,-1]
testing_genotype = data_test[,-1]

rm(data_test, data_train, data_val)

Part 2: Generate pre-trained polygenic risk scores (pre-trained PRSs)

Next, you need to generate pre-trained polygenic risk scores.In this example, we first run a simple binomial regression model on the training data to obtain GWAS effect sizes (betas). Then, we apply these betas to the training, validation, and testing genotypes using matrix multiplication. For real datasets, you can use PRSice, PLINK P+T, PRScs, LDPred2, or any method to generate the pre-trained PRSs.

#Generate GWAS on training dataset
gwas <- matrix(nrow=250,ncol=2)
for (i in 1:250){
  gwas[i,1]<-summary(glm(training_phenotype~training_genotype[,i],family="binomial"))$coef[2,1]
  gwas[i,2]<-i
}
  
# PRS calculation
training_prs <- as.matrix(training_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
validation_prs <- as.matrix(validation_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
testing_prs <- as.matrix(testing_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])

Here are snapshots of the pre-trained polygenic risk scores for this particular training, validation, and testing data:

str(training_prs, give.attr = FALSE)
#>  num [1:4979, 1] 9.74 7.05 9.5 7.5 5.9 ...
str(validation_prs, give.attr = FALSE)
#>  num [1:2434, 1] 6.32 5.81 5.75 7.64 5.76 ...
str(testing_prs, give.attr = FALSE)
#>  num [1:2587, 1] 6.29 6.9 9.16 7.65 8.51 ...

Part 3: AB-PRS Function

Now that you have the phenotype, genotype, and pre-trained PRSs of the split datasets, you can run the AB-PRS function! The AB-PRS function generates weights of the pre-trained polygenic risk scores and that of the selected \(\theta_{SNPs}\). This function also outputs the theta1 and theta2 values of the selected \(\theta_{SNPs}\), which are the corrected theta encodings for heterozygous (Aa) and homozygous alternative (aa) alleles, respectively.

Note that this function includes many parameters, with the first seven being required fields. You can learn about each parameter and their corresponding default values in detail by running ?ABPRS in R.

weights <- ABPRS(pre_trained_prs = training_prs, validation_prs=validation_prs, 
                 training_genotype=training_genotype, validation_genotype=validation_genotype,
                 training_phenotype=training_phenotype, validation_phenotype=validation_phenotype,
                 family="binomial", covariate=NULL, biglasso=FALSE, 
                 lam.max=NULL, lam.min=NULL, nlambda=100,
                 alpha=0.1, tolerance=0.025, threshold=0.01,
                 err=1e-5, delta=NULL)

Here is a snapshot of the output of the ABPRS function:

Weight theta1 theta2
Pre_Trained_PRS 1.910984 NA NA
SNP2 2.576104 0.0671804 0.1510881
SNP4 2.269562 0.0118356 0.1274996
SNP5 1.385844 0.0040549 0.2207389
SNP10 1.790937 0.0065061 0.2339151

Here are all the SNPs selected as supports:

rownames(weights)[-1]
#>  [1] "SNP2"   "SNP4"   "SNP5"   "SNP10"  "SNP20"  "SNP21"  "SNP26"  "SNP27" 
#>  [9] "SNP28"  "SNP29"  "SNP30"  "SNP31"  "SNP32"  "SNP33"  "SNP34"  "SNP35" 
#> [17] "SNP36"  "SNP37"  "SNP38"  "SNP39"  "SNP40"  "SNP41"  "SNP42"  "SNP43" 
#> [25] "SNP44"  "SNP45"  "SNP46"  "SNP47"  "SNP48"  "SNP49"  "SNP50"  "SNP54" 
#> [33] "SNP55"  "SNP60"  "SNP75"  "SNP78"  "SNP83"  "SNP107" "SNP111" "SNP114"
#> [41] "SNP117" "SNP127" "SNP130" "SNP131" "SNP132" "SNP143" "SNP147" "SNP148"
#> [49] "SNP166" "SNP185" "SNP199" "SNP206" "SNP211" "SNP239"

Next, use the apply_weights helper function to apply the generated weights to datasets like the testing or external data, producing the final AB-PRSs!

testing_abprs <- apply_weights(testing_prs, testing_genotype, weights)

Here is a summary of the resulting dataframe:

str(testing_abprs)
#> 'data.frame':    2587 obs. of  1 variable:
#>  $ ABPRS: num  16.9 17.3 22.9 18.2 21.8 ...

Part 4: Model Evaluation

Lastly, you can evaluate various PRS models using the model evaluation function. Here, we compare the performance of the pre-trained PRS and AB-PRS.

model_evaluation(phenotype=testing_phenotype, 
                 'Pre-Trained PRS' = testing_prs, 
                 'AB-PRS' = testing_abprs,
                 binary = TRUE, 
                 bin=10,
                 filename="Evaluation_Binary")
#> [1] "Done"

After the model evaluation function, an html file would be produced.

Example 2: Step-by-step Function on Simulated Dataset with Continuous Outcome

In addition to the main ABPRS function, this package also includes functions corresponding to each step of the AB-PRS flowchart, offering better understanding and customization.

Part 1: Data Generation and Splitting

In this example, we will use simulated dataset with continuous outcome.

Using the following code, you will generate data for 10,000 individuals with 250 SNPs, including 25 additive, 25 codominant, and 200 non-effective SNPs. All SNPs have a minor allele frequency (MAF) of 0.3. Continuous phenotype is generated with an effect size of 0.3 and baseline measurement of -1.

data <- data_simulation(m=250, effect_snps_vec = c(25, 0, 0, 25), n=10000, 
                        maf=0.3, effect_size=0.3, beta0= (-1), binary = FALSE)

Here is a snapshot of the generated dataset:

Phenotype SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8
7.737812 1 0 0 1 1 2 2 0
7.335880 0 2 1 0 0 0 1 0
5.195942 1 0 0 1 1 1 1 1
7.033109 0 0 1 0 0 2 0 0
5.513719 0 0 1 0 0 0 1 0
7.597184 0 0 1 1 1 1 0 1
3.702107 0 0 0 1 1 0 1 0
3.086818 0 0 1 1 0 0 0 0

Now that you have the data, you can divide them into training, validation, and testing datasets, and separate the phenotype and genotype for clarity (similar to Example 1). Once again, we are creating three non-overlapping partitions containing 50%, 25%, and 25% of the data, respectively.

sample <- sample(1:3,size=10000,replace=T,prob=c(0.5, 0.25, 0.25))
data_train = data[which(sample==1),]
data_val = data[which(sample==2),]
data_test = data[which(sample==3),]
rm(sample)

training_phenotype = data_train[,1]
validation_phenotype = data_val[,1]
testing_phenotype = data_test[,1]

training_genotype = data_train[,-1]
validation_genotype = data_val[,-1]
testing_genotype = data_test[,-1]

rm(data_test, data_train, data_val)

Part 2: Calculate pre-trained polygenic risk scores

To generate the pre-trained polygenic risk scores, we use the same method as with Example 1, but change the family object for regression models from binomial to gaussian.

#Generate GWAS on training dataset
gwas <- matrix(nrow=250,ncol=2)
for (i in 1:250){
  gwas[i,1]<-summary(glm(training_phenotype~training_genotype[,i],family="gaussian"))$coef[2,1]
  gwas[i,2]<-i
}
  
# PRS calculation
training_prs <- as.matrix(training_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
validation_prs <- as.matrix(validation_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
testing_prs <- as.matrix(testing_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])

Part 3: AB-PRS Functions

The AB-PRS process contains three main steps, shown in Figure 1 of the AB-PRS paper. This package provides functions to run the process step-by-step.

Step 1: Convert genotype to new \(\theta_{SNP}\) encoding

First, you will be converting a genotype matrix of any kind of encoding to a new \(\theta_{SNP}\) encoding. To compute \(\theta_{SNPs}\) encoding values, you should be using the training data with the learning_theta_snps function. Since we have a dataset with continuous outcomes, you should be using "gaussian" for the family parameter. Datasets with binary outcomes should use "binomial" instead.

thetas_mat <- learning_theta_snps(phenotype=training_phenotype, genotype=training_genotype,
                                  pre_trained_prs=training_prs, covariate=NULL, family="gaussian")

Here is what thetas_mat looks like:

SNP theta1 theta2 pval
SNP1 0.0079103 0.0342831 0.5780480
SNP2 0.0396693 0.0215623 0.2636475
SNP3 0.0997384 0.0252641 0.0045660
SNP4 -0.0487391 -0.0202916 0.1735727
SNP5 0.0201627 -0.0472736 0.2570937
SNP6 -0.0240418 0.0433101 0.2438731
SNP7 0.0351357 -0.0249107 0.2270028
SNP8 0.0380499 0.0072375 0.2729571

Then, we use the thetas matrix to turn our original genotype matrix with additive encoding to one with theta encoding.

theta_train <- encoding_theta_snps(training_genotype, thetas_mat)
theta_val <- encoding_theta_snps(validation_genotype, thetas_mat)
theta_test <- encoding_theta_snps(testing_genotype, thetas_mat)

Here’s an example of what theta_train looks like:

SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8
1 0.0079103 0.0000000 0.0000000 -0.0487391 0.0201627 0.0433101 -0.0249107 0.0000000
3 0.0079103 0.0000000 0.0000000 -0.0487391 0.0201627 -0.0240418 0.0351357 0.0380499
4 0.0000000 0.0000000 0.0997384 0.0000000 0.0000000 0.0433101 0.0000000 0.0000000
5 0.0000000 0.0000000 0.0997384 0.0000000 0.0000000 0.0000000 0.0351357 0.0000000
6 0.0000000 0.0000000 0.0997384 -0.0487391 0.0201627 -0.0240418 0.0000000 0.0380499
9 0.0079103 0.0000000 0.0997384 -0.0487391 0.0201627 -0.0240418 0.0351357 0.0000000
11 0.0000000 0.0215623 0.0997384 -0.0202916 0.0000000 0.0000000 0.0000000 0.0000000
12 0.0000000 0.0000000 0.0252641 -0.0487391 0.0000000 0.0433101 0.0351357 0.0380499

Step 2: Select important \(\theta_{SNPs}\) with adaptive variable selection and FDR control

Next, we use the adaptive_variable_selection function to select important \(\theta_{SNPs}\) as supports.

support_ABPRS <- adaptive_variable_selection(pre_trained_prs = training_prs, 
                                            validation_prs = validation_prs,
                                            training_phenotype = training_phenotype, 
                                            validation_phenotype = validation_phenotype,
                                            training_theta_encoding = theta_train, 
                                            validation_theta_encoding = theta_val,
                                            family = "gaussian", biglasso = FALSE,
                                            lam.max = NULL, lam.min = NULL, nlambda = 100,
                                            alpha = 0.1, tolerance = 0.025, threshold = 0.01,
                                            err = 1e-05, delta = NULL)

Here are your supports:

support_ABPRS
#>  [1] "SNP2"   "SNP3"   "SNP6"   "SNP10"  "SNP11"  "SNP20"  "SNP26"  "SNP27" 
#>  [9] "SNP28"  "SNP29"  "SNP30"  "SNP31"  "SNP32"  "SNP33"  "SNP34"  "SNP35" 
#> [17] "SNP36"  "SNP37"  "SNP38"  "SNP39"  "SNP40"  "SNP41"  "SNP42"  "SNP43" 
#> [25] "SNP44"  "SNP45"  "SNP46"  "SNP47"  "SNP48"  "SNP49"  "SNP50"  "SNP66" 
#> [33] "SNP80"  "SNP84"  "SNP88"  "SNP94"  "SNP96"  "SNP101" "SNP107" "SNP109"
#> [41] "SNP118" "SNP135" "SNP141" "SNP148" "SNP149" "SNP153" "SNP164" "SNP172"
#> [49] "SNP176" "SNP190" "SNP221" "SNP234"

Step 3: Calculate final AB-PRS by boosting with fine-tuned \(\theta_{SNPs}\)

To calculate the final AB-PRS, you need to first train a model using the training PRS and the theta-encoded genotype of selected \(\theta_{SNPs}\).

# Generate gaussian model from training data
trainABPRS <- data.frame(prs=training_prs, theta_train[,support_ABPRS])
modABPRS <- glm(training_phenotype ~ ., data = trainABPRS, family="gaussian")

Then, you can apply the weights derived from the above regression model to any data you wish. In this case, we applied the weights to the testing data.

# Apply the model to generate polygenic risk scores for testing data
testABPRS <- data.frame(prs=testing_prs, theta_test[,support_ABPRS])
testing_abprs = as.matrix(testABPRS)%*%modABPRS$coefficients[-1]

Here is what the AB-PRSs for the testing data look like:

str(as.vector(testing_abprs))
#>  num [1:2523] 9.15 7.17 7.97 8.57 8.93 ...

Part 4: Model Evaluation

Here, we compare the pre-trained PRS and AB-PRS models for continuous traits.

model_evaluation(phenotype=testing_phenotype, 
                 'Pre-Trained PRS' = testing_prs, 
                 'AB-PRS' = testing_abprs,
                 binary = FALSE, 
                 bin=10,
                 filename="Evaluation_Continuous")
#> [1] "Done"

After the model evaluation function, an html file would be produced.