Back to blog
Tutorial

ORA vs GSEA: A Side-by-Side Tutorial in R with clusterProfiler

By Abdullah Shahid · · 10 min read

After you have a DEG list, the next question is almost always: what biology does this represent? Over-representation analysis and gene set enrichment analysis are the two methods the field uses to answer that question, and they are not interchangeable. They ask different questions, require different inputs, and give results that are not directly comparable. Using the wrong one for your data, or using the right one incorrectly, is one of the most reliable ways to produce enrichment results that will not survive peer review.

This tutorial walks through both methods in clusterProfiler, the most widely used R package for functional enrichment analysis. It covers the conceptual distinction, the exact code for each method, the correct way to apply FDR correction, and how to read the results. Every code block runs on real DESeq2 output.

The Conceptual Difference

ORA (over-representation analysis) asks: given my list of significant genes, are any GO terms represented more often than expected by chance? It uses a one-sided Fisher exact test, comparing the fraction of your DEG list annotated to a term against the fraction of your background gene set annotated to the same term. It requires two inputs: a list of significant genes and a background gene set.

GSEA asks: is a GO term systematically enriched at the extremes of a ranked list of all my genes? It does not require a significance cutoff. Every tested gene participates, ordered from most upregulated to most downregulated by some continuous statistic, typically the DESeq2 Wald statistic. The enrichment score measures how much a gene set’s members cluster at the top or bottom of that ranking. GSEA was first described by Subramanian et al. in PNAS (2005) and has been one of the most cited methods in computational biology since.

The practical consequence of this difference is substantial. ORA with a strict fold change cutoff may miss a pathway where many genes show modest, consistent upregulation that does not individually cross your threshold. GSEA captures exactly that signal because it uses the full ranking. Conversely, ORA gives you a cleaner story when a small, specific set of genes is dramatically changed and the rest of the transcriptome is quiet.

ORA without a proper background gene set is statistically wrong

The background for ORA must be the set of genes that were actually tested in your differential expression analysis, not the full genome. If you use the full genome as background, you inflate the denominator of the Fisher test and make GO terms appear enriched when they are not. In clusterProfiler, pass the universe argument to enrichGO as the gene IDs of all genes with a non-NA adjusted p-value from DESeq2. Omitting the universe argument causes enrichGO to default to the full annotation database, which is incorrect for RNA-seq data.

Setup and Data Preparation

library(DESeq2)
library(clusterProfiler)
library(enrichplot)
library(org.Hs.eg.db)
library(ggplot2)
# Load DESeq2 results (shrunken, from lfcShrink)
# Assumes you have already run DESeq2 and saved res_shrunk
res_df <- as.data.frame(res_shrunk) |>
tibble::rownames_to_column("ensembl_id") |>
dplyr::filter(!is.na(padj))
# All tested genes: this is the correct ORA background
background_genes <- res_df$ensembl_id
# Significant DEGs: padj < 0.05 and |log2FC| >= 1
sig_genes <- res_df |>
dplyr::filter(padj < 0.05, abs(log2FoldChange) >= 1) |>
dplyr::pull(ensembl_id)
cat("Background (tested) genes:", length(background_genes), "\n")
cat("Significant DEGs: ", length(sig_genes), "\n")
# Ranked gene list for GSEA: all tested genes sorted by Wald statistic
# Use stat from the UNSHRUNKEN results object (res, not res_shrunk)
# The Wald statistic is more appropriate than log2FC for ranking
ranked_genes <- res_df$stat
names(ranked_genes) <- res_df$ensembl_id
ranked_genes <- sort(ranked_genes[!is.na(ranked_genes)], decreasing = TRUE)
cat("Genes in GSEA ranking: ", length(ranked_genes), "\n")

Note the distinction between the objects: sig_genes is for ORA, background_genes is the ORA universe, and ranked_genes is for GSEA. These serve different roles and cannot be substituted for each other.

Running ORA with enrichGO

# Over-representation analysis on Biological Process GO terms
ora_result <- enrichGO(
gene = sig_genes,
universe = background_genes, # correct background: all tested genes
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP", # Biological Process; also try "MF" and "CC"
pAdjustMethod = "BH", # Benjamini-Hochberg FDR correction
pvalueCutoff = 0.05,
qvalueCutoff = 0.20, # q-value is an additional FDR measure
readable = TRUE # converts gene IDs to gene symbols in output
)
# How many terms were significant?
cat("Significant ORA terms:", nrow(as.data.frame(ora_result)), "\n")
# Reduce redundant terms using semantic similarity
# The simplify() function removes GO terms that are too similar to each other
ora_simplified <- simplify(
ora_result,
cutoff = 0.7, # similarity threshold; lower = more aggressive pruning
by = "p.adjust",
select_fun = min
)
cat("Terms after simplify():", nrow(as.data.frame(ora_simplified)), "\n")

The simplify() function from the enrichplot package removes GO terms that share a high proportion of their annotated genes. Parent-child relationships in the GO hierarchy mean that “inflammatory response” and “regulation of inflammatory response” will often co-appear in ORA results. After simplify(), the list is shorter and more interpretable.

# Visualize ORA results as a dotplot
dotplot(ora_simplified,
showCategory = 20,
title = "ORA: Biological Process (top 20, simplified)",
font.size = 10) +
ggplot2::theme(axis.text.y = element_text(size = 9))
clusterProfiler dotplot showing top 20 enriched GO Biological Process terms from ORA, with dot size proportional to gene count and color representing adjusted p-value
Figure 1: ORA dotplot from enrichGO showing the top 20 simplified Biological Process terms. Dot size is proportional to the number of DEGs annotated to the term. Color represents the Benjamini-Hochberg adjusted p-value: darker color indicates stronger significance.

Running GSEA with gseGO

# Gene Set Enrichment Analysis on Biological Process GO terms
# Input: all tested genes ranked by Wald statistic (continuous, no threshold)
gsea_result <- gseGO(
geneList = ranked_genes, # named numeric vector, sorted descending
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP",
minGSSize = 10, # minimum gene set size to test
maxGSSize = 500, # maximum gene set size to test
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
verbose = FALSE,
seed = 42 # set for reproducibility
)
cat("Significant GSEA terms:", nrow(as.data.frame(gsea_result)), "\n")
# Simplify GSEA results the same way as ORA
gsea_simplified <- simplify(gsea_result, cutoff = 0.7, by = "p.adjust", select_fun = min)
cat("Terms after simplify():", nrow(as.data.frame(gsea_simplified)), "\n")

Setting seed is important for reproducibility. The permutation procedure used to estimate significance in gseGO involves random sampling. Without a fixed seed, results will differ between runs, which complicates reproduction and version control.

The minGSSize and maxGSSize parameters control which gene sets are tested. Very small gene sets (fewer than 10 genes) produce unstable enrichment scores. Very large gene sets (more than 500 genes) often correspond to broad, non-specific terms like “metabolic process” that are rarely useful for biological interpretation. The defaults in gseGO are 10 and 500 respectively, which is appropriate for most RNA-seq experiments.

# Enrichment plot for the top GSEA term
# This is the classic GSEA plot showing the running enrichment score
top_term <- gsea_simplified@result$ID[1]
gseaplot2(gsea_simplified,
geneSetID = top_term,
title = gsea_simplified@result$Description[1])
GSEA enrichment plot from clusterProfiler gseaplot2 showing the running enrichment score for the top enriched GO term, with the leading edge marked and gene positions shown as a rug plot
Figure 2: GSEA enrichment plot for the top enriched Biological Process term. The running sum rises steeply in the upper portion of the ranking (genes ranked by Wald statistic), indicating the gene set is enriched among upregulated genes. The leading edge subset, genes to the left of the peak, is the core of the signal.

The Ridgeplot: Seeing Many Terms at Once

The ridgeplot is one of the most useful visualizations for GSEA because it shows the fold change distribution for all significant gene sets simultaneously. Each ridge represents one GO term, and the x-axis shows where the member genes fall in your fold change distribution.

# Ridgeplot: fold change distribution per significant gene set
ridgeplot(gsea_simplified,
showCategory = 20,
fill = "p.adjust") +
ggplot2::labs(
title = "GSEA ridgeplot: fold change distribution by GO term",
x = "Log2 fold change"
)
GSEA ridgeplot showing log2 fold change distribution for top 20 significant GO terms, with each ridge colored by adjusted p-value and peaks indicating the direction of enrichment
Figure 3: GSEA ridgeplot for the top 20 significant Biological Process terms. Each ridge shows the log2 fold change distribution of member genes for that term. Terms with peaks to the right of zero are enriched among upregulated genes; peaks to the left indicate downregulation.

Interpreting and Comparing Results

ORA and GSEA will often identify overlapping terms, but they will not agree perfectly, and when they disagree the disagreement is informative.

A term that appears in ORA but not GSEA typically means a small, specific set of genes is strongly differentially expressed for that term, but the rest of the genome does not show a consistent signal in that direction. GSEA requires a genome-wide pattern; ORA only requires a local one.

A term that appears in GSEA but not ORA is the more interesting case. It usually means many genes are consistently but modestly shifted in one direction, without any individual gene crossing your fold change threshold. This is exactly the class of biology that ORA misses. Whole pathways under coordinated regulatory pressure, shifts in metabolic state, cell-type compositional changes reflected in the transcriptome: these often produce GSEA hits that never appear in ORA.

When a term appears in both, check that the direction is consistent. An ORA result cannot tell you direction (enrichment is enrichment), but GSEA tells you whether the enrichment is in the upregulated or downregulated tail via the sign of the normalized enrichment score (NES). Positive NES means enriched among upregulated genes; negative NES means enriched among downregulated genes.

When To Use Which

ConsiderationORAGSEA
Input requiredDEG list and backgroundAll tested genes, ranked by statistic
Threshold sensitivityHigh: results change with padj and LFC cutoffLow: uses the full continuous ranking
Background sensitivityVery high: wrong background = wrong resultNone: no hypergeometric test
Detects subtle coordinated shiftsNoYes
Works with small DEG listsYesCaution: ranking is noisy with few DEGs
Preferred for publicationAcceptable with correct backgroundGenerally preferred by reviewers
Directionality in resultsNo (enrichment only)Yes (positive or negative NES)
Computational costLowModerate (permutation-based)
clusterProfiler functionenrichGO()gseGO()

For most RNA-seq papers, running both and comparing them is more informative than choosing one. Report ORA results if you want a clean, specific gene-list-driven story. Report GSEA results if the biology involves coordinated pathway-level responses. If they agree, the result is stronger. If they disagree, understanding why is itself a biological finding.

The clusterProfiler documentation by the Yu lab is the authoritative reference for both methods and covers extensions to KEGG, Reactome, and custom gene sets. The original GSEA methodology paper by Subramanian et al. (2005) and the clusterProfiler 4.0 paper by Wu et al. are the primary citations for your methods section.

NotchBio runs both ORA and GSEA in parallel on every dataset, applies Benjamini-Hochberg correction by default, uses the tested gene universe as the ORA background, and runs simplify() on both result sets before returning the output. If you want publication-ready enrichment results without configuring each parameter manually, that is what the platform produces.

Further reading

Read another related post

View all posts