Welcome to the AB-PRS R package tutorial! In this guide, you will
learn how to use the package’s functions in two distinct ways. The first
approach is straightforward: you can use the ABPRS
function
to directly obtain the selected \(\theta_{SNP}\) supports and their
corresponding weights (Example 1). The second
approach walks you through using individual helper functions, each
corresponding to a step in the AB-PRS flowchart, or Figure 1 in the
paper (Example 2). The tutorial uses simulated
datasets generated by a function within the package, so no external data
downloads are necessary. For demonstrative purposes, we will be
generating binary outcomes in the first example and continuous outcomes
in the second one.
Below is a list of all functions available in the package, each of
which will be covered in detail. For more specific information about any
of these functions, you can refer to the user manual by using the help
command ?function_name
after the package is loaded in your
R environment:
ABPRS
: the main AB-PRS function to
obtain weights for selected \(\theta_{SNPs}\).
apply_weights
: a helper function to
apply weights to a data.
learning_theta_snps
: learns
theta1
and theta2
values.
encoding_theta_snps
: encodes \(\theta_{SNPs}\) on genotype data.
adaptive_variable_selection
:
selects important \(\theta_{SNPs}\)
(supports).
data_simulation
: generates
simulated datasets used in the paper.
model_evaluation
: evaluates the
performance of different PRS models.
To install the AB-PRS package, please run the following function in R:
remotes::install_github("Cedars-CIG/ABPRS")
After installing the AB-PRS package, you can use it the same way as
you use other R packages. Simply load it using the code below and use
?
to view the user manual of a specific function.
library(ABPRS)
First, you need to generate the simulated data, specifically the
phenotype and genotype information. The data_simulation
function in our package allows users to generate those information at
ease with high customization.
Using the following code, you will generate data for 10,000 individuals with 250 SNPs, including 25 additive, 25 codominant, and 200 non-effective SNPs. All SNPs have a minor allele frequency (MAF) of 0.3. Furthermore, this example generates binary phenotypes. The effect size is set to 0.7, and the baseline effect (beta0) is -19.5, which maintains the case prevalence to around 30% in this dataset. Note that a larger beta value will result in fewer case outcomes.
For more details on how the outcomes are generated, please refer to
the manual by running ?data_simulation
in R.
data <- data_simulation(m=250, effect_snps_vec = c(25, 0, 0, 25), n=10000,
maf=0.3, effect_size=0.7, beta0= (-19.5), binary = TRUE)
From the simulation function, a dataframe with phenotype in the first column and genotype in the rest of the columns is produced.
#> 'data.frame': 10000 obs. of 251 variables:
#> [list output truncated]
Phenotype | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 | SNP9 | SNP10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 1 |
5 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
6 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 1 |
8 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 |
As you can see, the case outcome is ~30% using the above combination of parameters:
mean(data[,1])
#> [1] 0.3153
Now that you have the data, you can divide them for training, validation, and testing purposes. The following code produce three non-overlapping partitions with 50%, 25%, and 25% of the data respectively.
sample <- sample(1:3,size=10000,replace=T,prob=c(0.5, 0.25, 0.25))
data_train = data[which(sample==1),]
data_val = data[which(sample==2),]
data_test = data[which(sample==3),]
rm(sample)
For clarity in upcoming analysis, you can store the data into separate variables.
training_phenotype = data_train[,1]
validation_phenotype = data_val[,1]
testing_phenotype = data_test[,1]
training_genotype = data_train[,-1]
validation_genotype = data_val[,-1]
testing_genotype = data_test[,-1]
rm(data_test, data_train, data_val)
Next, you need to generate pre-trained polygenic risk scores.In this example, we first run a simple binomial regression model on the training data to obtain GWAS effect sizes (betas). Then, we apply these betas to the training, validation, and testing genotypes using matrix multiplication. For real datasets, you can use PRSice, PLINK P+T, PRScs, LDPred2, or any method to generate the pre-trained PRSs.
#Generate GWAS on training dataset
gwas <- matrix(nrow=250,ncol=2)
for (i in 1:250){
gwas[i,1]<-summary(glm(training_phenotype~training_genotype[,i],family="binomial"))$coef[2,1]
gwas[i,2]<-i
}
# PRS calculation
training_prs <- as.matrix(training_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
validation_prs <- as.matrix(validation_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
testing_prs <- as.matrix(testing_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
Here are snapshots of the pre-trained polygenic risk scores for this particular training, validation, and testing data:
str(training_prs, give.attr = FALSE)
#> num [1:4979, 1] 9.74 7.05 9.5 7.5 5.9 ...
str(validation_prs, give.attr = FALSE)
#> num [1:2434, 1] 6.32 5.81 5.75 7.64 5.76 ...
str(testing_prs, give.attr = FALSE)
#> num [1:2587, 1] 6.29 6.9 9.16 7.65 8.51 ...
Now that you have the phenotype, genotype, and pre-trained PRSs of the split datasets, you can run the AB-PRS function! The AB-PRS function generates weights of the pre-trained polygenic risk scores and that of the selected \(\theta_{SNPs}\). This function also outputs the theta1 and theta2 values of the selected \(\theta_{SNPs}\), which are the corrected theta encodings for heterozygous (Aa) and homozygous alternative (aa) alleles, respectively.
Note that this function includes many parameters, with the first
seven being required fields. You can learn about each parameter and
their corresponding default values in detail by running
?ABPRS
in R.
weights <- ABPRS(pre_trained_prs = training_prs, validation_prs=validation_prs,
training_genotype=training_genotype, validation_genotype=validation_genotype,
training_phenotype=training_phenotype, validation_phenotype=validation_phenotype,
family="binomial", covariate=NULL, biglasso=FALSE,
lam.max=NULL, lam.min=NULL, nlambda=100,
alpha=0.1, tolerance=0.025, threshold=0.01,
err=1e-5, delta=NULL)
Here is a snapshot of the output of the ABPRS function:
Weight | theta1 | theta2 | |
---|---|---|---|
Pre_Trained_PRS | 1.910984 | NA | NA |
SNP2 | 2.576104 | 0.0671804 | 0.1510881 |
SNP4 | 2.269562 | 0.0118356 | 0.1274996 |
SNP5 | 1.385844 | 0.0040549 | 0.2207389 |
SNP10 | 1.790937 | 0.0065061 | 0.2339151 |
Here are all the SNPs selected as supports:
rownames(weights)[-1]
#> [1] "SNP2" "SNP4" "SNP5" "SNP10" "SNP20" "SNP21" "SNP26" "SNP27"
#> [9] "SNP28" "SNP29" "SNP30" "SNP31" "SNP32" "SNP33" "SNP34" "SNP35"
#> [17] "SNP36" "SNP37" "SNP38" "SNP39" "SNP40" "SNP41" "SNP42" "SNP43"
#> [25] "SNP44" "SNP45" "SNP46" "SNP47" "SNP48" "SNP49" "SNP50" "SNP54"
#> [33] "SNP55" "SNP60" "SNP75" "SNP78" "SNP83" "SNP107" "SNP111" "SNP114"
#> [41] "SNP117" "SNP127" "SNP130" "SNP131" "SNP132" "SNP143" "SNP147" "SNP148"
#> [49] "SNP166" "SNP185" "SNP199" "SNP206" "SNP211" "SNP239"
Next, use the apply_weights
helper function to apply the
generated weights to datasets like the testing or external data,
producing the final AB-PRSs!
testing_abprs <- apply_weights(testing_prs, testing_genotype, weights)
Here is a summary of the resulting dataframe:
str(testing_abprs)
#> 'data.frame': 2587 obs. of 1 variable:
#> $ ABPRS: num 16.9 17.3 22.9 18.2 21.8 ...
Lastly, you can evaluate various PRS models using the model evaluation function. Here, we compare the performance of the pre-trained PRS and AB-PRS.
model_evaluation(phenotype=testing_phenotype,
'Pre-Trained PRS' = testing_prs,
'AB-PRS' = testing_abprs,
binary = TRUE,
bin=10,
filename="Evaluation_Binary")
#> [1] "Done"
After the model evaluation function, an html file would be produced.
In addition to the main ABPRS function, this package also includes functions corresponding to each step of the AB-PRS flowchart, offering better understanding and customization.
In this example, we will use simulated dataset with continuous outcome.
Using the following code, you will generate data for 10,000 individuals with 250 SNPs, including 25 additive, 25 codominant, and 200 non-effective SNPs. All SNPs have a minor allele frequency (MAF) of 0.3. Continuous phenotype is generated with an effect size of 0.3 and baseline measurement of -1.
data <- data_simulation(m=250, effect_snps_vec = c(25, 0, 0, 25), n=10000,
maf=0.3, effect_size=0.3, beta0= (-1), binary = FALSE)
Here is a snapshot of the generated dataset:
Phenotype | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 |
---|---|---|---|---|---|---|---|---|
7.737812 | 1 | 0 | 0 | 1 | 1 | 2 | 2 | 0 |
7.335880 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 |
5.195942 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
7.033109 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
5.513719 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
7.597184 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
3.702107 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
3.086818 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
Now that you have the data, you can divide them into training, validation, and testing datasets, and separate the phenotype and genotype for clarity (similar to Example 1). Once again, we are creating three non-overlapping partitions containing 50%, 25%, and 25% of the data, respectively.
sample <- sample(1:3,size=10000,replace=T,prob=c(0.5, 0.25, 0.25))
data_train = data[which(sample==1),]
data_val = data[which(sample==2),]
data_test = data[which(sample==3),]
rm(sample)
training_phenotype = data_train[,1]
validation_phenotype = data_val[,1]
testing_phenotype = data_test[,1]
training_genotype = data_train[,-1]
validation_genotype = data_val[,-1]
testing_genotype = data_test[,-1]
rm(data_test, data_train, data_val)
To generate the pre-trained polygenic risk scores, we use the same
method as with Example 1, but change the family
object for regression models from binomial
to
gaussian
.
#Generate GWAS on training dataset
gwas <- matrix(nrow=250,ncol=2)
for (i in 1:250){
gwas[i,1]<-summary(glm(training_phenotype~training_genotype[,i],family="gaussian"))$coef[2,1]
gwas[i,2]<-i
}
# PRS calculation
training_prs <- as.matrix(training_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
validation_prs <- as.matrix(validation_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
testing_prs <- as.matrix(testing_genotype[,(gwas[,2])])%*%as.matrix(gwas[,1])
The AB-PRS process contains three main steps, shown in Figure 1 of the AB-PRS paper. This package provides functions to run the process step-by-step.
First, you will be converting a genotype matrix of any kind of
encoding to a new \(\theta_{SNP}\)
encoding. To compute \(\theta_{SNPs}\)
encoding values, you should be using the training data with the
learning_theta_snps
function. Since we have a dataset with
continuous outcomes, you should be using "gaussian"
for the
family
parameter. Datasets with binary outcomes should use
"binomial"
instead.
thetas_mat <- learning_theta_snps(phenotype=training_phenotype, genotype=training_genotype,
pre_trained_prs=training_prs, covariate=NULL, family="gaussian")
Here is what thetas_mat
looks like:
SNP | theta1 | theta2 | pval |
---|---|---|---|
SNP1 | 0.0079103 | 0.0342831 | 0.5780480 |
SNP2 | 0.0396693 | 0.0215623 | 0.2636475 |
SNP3 | 0.0997384 | 0.0252641 | 0.0045660 |
SNP4 | -0.0487391 | -0.0202916 | 0.1735727 |
SNP5 | 0.0201627 | -0.0472736 | 0.2570937 |
SNP6 | -0.0240418 | 0.0433101 | 0.2438731 |
SNP7 | 0.0351357 | -0.0249107 | 0.2270028 |
SNP8 | 0.0380499 | 0.0072375 | 0.2729571 |
Then, we use the thetas matrix to turn our original genotype matrix with additive encoding to one with theta encoding.
theta_train <- encoding_theta_snps(training_genotype, thetas_mat)
theta_val <- encoding_theta_snps(validation_genotype, thetas_mat)
theta_test <- encoding_theta_snps(testing_genotype, thetas_mat)
Here’s an example of what theta_train
looks like:
SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 | |
---|---|---|---|---|---|---|---|---|
1 | 0.0079103 | 0.0000000 | 0.0000000 | -0.0487391 | 0.0201627 | 0.0433101 | -0.0249107 | 0.0000000 |
3 | 0.0079103 | 0.0000000 | 0.0000000 | -0.0487391 | 0.0201627 | -0.0240418 | 0.0351357 | 0.0380499 |
4 | 0.0000000 | 0.0000000 | 0.0997384 | 0.0000000 | 0.0000000 | 0.0433101 | 0.0000000 | 0.0000000 |
5 | 0.0000000 | 0.0000000 | 0.0997384 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0351357 | 0.0000000 |
6 | 0.0000000 | 0.0000000 | 0.0997384 | -0.0487391 | 0.0201627 | -0.0240418 | 0.0000000 | 0.0380499 |
9 | 0.0079103 | 0.0000000 | 0.0997384 | -0.0487391 | 0.0201627 | -0.0240418 | 0.0351357 | 0.0000000 |
11 | 0.0000000 | 0.0215623 | 0.0997384 | -0.0202916 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
12 | 0.0000000 | 0.0000000 | 0.0252641 | -0.0487391 | 0.0000000 | 0.0433101 | 0.0351357 | 0.0380499 |
Next, we use the adaptive_variable_selection
function to
select important \(\theta_{SNPs}\) as
supports.
support_ABPRS <- adaptive_variable_selection(pre_trained_prs = training_prs,
validation_prs = validation_prs,
training_phenotype = training_phenotype,
validation_phenotype = validation_phenotype,
training_theta_encoding = theta_train,
validation_theta_encoding = theta_val,
family = "gaussian", biglasso = FALSE,
lam.max = NULL, lam.min = NULL, nlambda = 100,
alpha = 0.1, tolerance = 0.025, threshold = 0.01,
err = 1e-05, delta = NULL)
Here are your supports:
support_ABPRS
#> [1] "SNP2" "SNP3" "SNP6" "SNP10" "SNP11" "SNP20" "SNP26" "SNP27"
#> [9] "SNP28" "SNP29" "SNP30" "SNP31" "SNP32" "SNP33" "SNP34" "SNP35"
#> [17] "SNP36" "SNP37" "SNP38" "SNP39" "SNP40" "SNP41" "SNP42" "SNP43"
#> [25] "SNP44" "SNP45" "SNP46" "SNP47" "SNP48" "SNP49" "SNP50" "SNP66"
#> [33] "SNP80" "SNP84" "SNP88" "SNP94" "SNP96" "SNP101" "SNP107" "SNP109"
#> [41] "SNP118" "SNP135" "SNP141" "SNP148" "SNP149" "SNP153" "SNP164" "SNP172"
#> [49] "SNP176" "SNP190" "SNP221" "SNP234"
To calculate the final AB-PRS, you need to first train a model using the training PRS and the theta-encoded genotype of selected \(\theta_{SNPs}\).
# Generate gaussian model from training data
trainABPRS <- data.frame(prs=training_prs, theta_train[,support_ABPRS])
modABPRS <- glm(training_phenotype ~ ., data = trainABPRS, family="gaussian")
Then, you can apply the weights derived from the above regression model to any data you wish. In this case, we applied the weights to the testing data.
# Apply the model to generate polygenic risk scores for testing data
testABPRS <- data.frame(prs=testing_prs, theta_test[,support_ABPRS])
testing_abprs = as.matrix(testABPRS)%*%modABPRS$coefficients[-1]
Here is what the AB-PRSs for the testing data look like:
str(as.vector(testing_abprs))
#> num [1:2523] 9.15 7.17 7.97 8.57 8.93 ...
Here, we compare the pre-trained PRS and AB-PRS models for continuous traits.
model_evaluation(phenotype=testing_phenotype,
'Pre-Trained PRS' = testing_prs,
'AB-PRS' = testing_abprs,
binary = FALSE,
bin=10,
filename="Evaluation_Continuous")
#> [1] "Done"
After the model evaluation function, an html file would be produced.