Documentation

ProteinPRS is a web portal designed for the visualization, querying, and downloading of polygenic risk score (PRS) models for plasma protein levels.
These models are based on Olink data for nearly 3,000 proteins measured in over 50,000 samples from the UK Biobank.
The models are derived using the SNPBoost algorithm (https://github.com/hklinkhammer/snpboost), which efficiently processes high-dimensional omics data to generate optimized sparse polygenic models that account for the joint effects of multiple variants.

ProteinPRS models incorporate complete genome-wide data, capturing both local cis-regulatory effects and distant regulatory mechanisms. Variants selected in the cis-region may be regulatory pQTL variants or their linkage disequilibrium (LD) proxies. Variants selected in distant regions may reflect the effects of transcription factors or reveal protein-protein interactions that affect protein concentration.

Similar to transcriptome-wide association studies (TWAS), ProteinPRS models enable the inference of the genetically regulated component of protein expression in an independent dataset.
The imputed protein levels can then be associated with available phenotypes, as in a proteome-wide association study (PWAS).
This approach can be utilized for gene-based prioritization of significant loci identified by genome-wide association studies (GWAS).
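
As a minimal sketch of the PWAS step described above, the following Python code regresses a quantitative phenotype on the imputed level of a single protein. The function name and the input vectors are hypothetical, covariates (e.g., age, sex, genetic principal components) are deliberately omitted, and a real analysis would typically rely on a statistics package to obtain p-values.

```python
import math

def simple_association(imputed, phenotype):
    """Regress a quantitative phenotype on the imputed level of one protein.

    Illustrative only: one predictor, no covariates, no p-value.
    """
    n = len(imputed)
    if n != len(phenotype) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 samples")
    mean_x = sum(imputed) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, phenotype))
    if sxx == 0:
        raise ValueError("imputed levels have zero variance")
    beta = sxy / sxx                      # effect of imputed level on phenotype
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(imputed, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    t_stat = beta / se if se > 0 else math.inf
    return {"beta": beta, "se": se, "t": t_stat}

# Hypothetical imputed protein levels and phenotype values for five samples
result = simple_association([0, 1, 2, 3, 4], [0.1, 0.9, 2.2, 2.8, 4.0])
print(result)
```

Repeating this test for every protein with an available model, and correcting for the number of proteins tested, gives the proteome-wide scan.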

The PRS model for each protein can be downloaded as a scoring file compatible with the PLINK2 score function. Each file lists variant identifiers (rsIDs), effect alleles, and weights.
The score function can be applied to standard genetic data formats (e.g., PLINK text, PLINK binary, and VCF).
Additional annotation files with genomic coordinates in hg19 and hg38 are also provided, so that the scoring files can be adapted to datasets that use a different variant identifier scheme (e.g., chr:bp:ref:alt).
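
To illustrate what the scoring and annotation files are used for, here is a minimal Python sketch that remaps rsIDs to chr:bp:ref:alt identifiers and computes a weighted allele sum for one sample. The column names (rsid, effect_allele, weight, hg19_id, hg38_id), the tab-separated layout, and all values are assumptions made for illustration and may differ from the downloaded files. PLINK2 applies its own handling of missing genotypes and score scaling, so this sum is not guaranteed to match its output.

```python
import csv
import io

def load_scoring(text):
    """Parse a tab-separated scoring table into {variant_id: (effect_allele, weight)}."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {r["rsid"]: (r["effect_allele"], float(r["weight"])) for r in rows}

def remap_ids(scoring, annotation_text, build_col="hg38_id"):
    """Replace rsIDs with chr:bp:ref:alt identifiers taken from an annotation table."""
    rows = csv.DictReader(io.StringIO(annotation_text), delimiter="\t")
    lookup = {r["rsid"]: r[build_col] for r in rows}
    # Variants without an annotation entry are dropped.
    return {lookup[rsid]: value for rsid, value in scoring.items() if rsid in lookup}

def score_sample(scoring, dosages):
    """Sum weight * effect-allele dosage over the variants in the scoring table.

    dosages maps variant_id -> {allele: dosage}; variants or alleles that are
    absent contribute 0 (a simplification).
    """
    total = 0.0
    for variant_id, (allele, weight) in scoring.items():
        total += weight * dosages.get(variant_id, {}).get(allele, 0.0)
    return total

# Hypothetical file contents
scoring_text = "rsid\teffect_allele\tweight\nrs1\tA\t0.5\nrs2\tG\t-0.2\n"
annotation_text = "rsid\thg19_id\thg38_id\nrs1\t1:100:C:A\t1:150:C:A\n"

scoring = load_scoring(scoring_text)
sample = {"rs1": {"A": 2.0}, "rs2": {"G": 1.0}}
print(score_sample(scoring, sample))
print(remap_ids(scoring, annotation_text))
```

In practice the same result is obtained by passing the scoring file directly to PLINK2, and the remapping step is only needed when the target dataset does not use rsIDs.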

An abstract on SNPBoost-based protein level prediction has been submitted to ESHG 2024 (https://2024.eshg.org/), and a corresponding manuscript is in preparation. However, several studies that apply the implemented boosting algorithm to classical phenotype prediction have already been published:

With the growing availability of large biobank data and intensive research efforts, several alternatives likely exist.
A major reference database for polygenic risk score prediction of molecular markers, including proteomics data, is OMICSPRED (https://www.omicspred.org/). 
However, most models are based only on local cis-expression regulation and neglect distant trans-pQTL effects.
Thus, by taking distant pQTL effects into account, ProteinPRS models might improve protein prediction as well as precision in GWAS prioritization: while cis-eQTL are expected to be correlated across nearby genes, trans-eQTL are expected to be independent across the genes in a locus.

The major limitations are as follows:
  1. Protein expression is primarily environmentally driven; thus, genetic prediction is limited and heterogeneous across genes. It is advisable to check the prediction performance of the model (e.g., Pearson correlation and p-value) for each protein, since genetic regulation substantially influences concentration for some proteins and is negligible for others.
  2. The models are based on plasma concentrations and do not account for the tissue and cell specificity of protein regulation.
  3. The models were trained on data from predominantly European samples. Although gene-expression regulation may be less affected by population-specific effects than disease phenotypes are, ProteinPRS models, like all PRS models, might perform worse when applied to other populations.
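
The per-protein performance check recommended in the first limitation can be sketched in Python as follows. The thresholds, field names, and protein labels are hypothetical and only illustrate the idea of restricting downstream analyses to proteins whose models predict reasonably well. The pearson_r helper shows how the same metric could be recomputed in an independent dataset where measured protein levels are available.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation between predicted and measured protein levels."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    spp = sum((p - mean_p) ** 2 for p in predicted)
    smm = sum((m - mean_m) ** 2 for m in measured)
    spm = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    if spp == 0 or smm == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return spm / math.sqrt(spp * smm)

def select_models(performance, min_r=0.1, max_p=0.05):
    """Keep proteins whose reported performance passes both (assumed) thresholds."""
    return [
        row["protein"]
        for row in performance
        if row["pearson_r"] >= min_r and row["p_value"] <= max_p
    ]

# Hypothetical performance table for two proteins
performance = [
    {"protein": "PROT_A", "pearson_r": 0.60, "p_value": 1e-20},
    {"protein": "PROT_B", "pearson_r": 0.02, "p_value": 0.30},
]
print(select_models(performance))
```

Suitable cut-offs depend on the sample size and the aim of the study, so the defaults above should be treated as placeholders.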