Back to blog
Tutorial

Reducing GO Term Redundancy: simplify, rrvgo, and What Works

By Abdullah Shahid · · 10 min read

You ran ORA or GSEA, applied Benjamini-Hochberg correction, and got 300 significant GO terms. Now what? Presenting 300 terms in a paper is not an option. Manually curating them is not realistic. And simply picking the top 10 by adjusted p-value misses the fact that those 10 terms may all be near-identical siblings in the GO hierarchy, saying the same biological thing ten different ways.

The GO hierarchy is a directed acyclic graph. A parent term like “immune response” contains child terms like “innate immune response,” “adaptive immune response,” and dozens more specific children. When a gene set is enriched for any of those children, it is almost certainly enriched for the parent too. Standard enrichment analysis reports all of them. Your job is to collapse the redundant ones without losing the biological signal.

This tutorial covers three approaches in order of effort: simplify() from clusterProfiler, rrvgo from Bioconductor, and REVIGO as a web-based fallback. It also covers a custom uniqueness-score approach for the cases where all three run into problems.

Why Enrichment Outputs Hundreds of Similar Terms

The Gene Ontology is organized as a hierarchy where more specific terms inherit from broader ones. A gene annotated to “positive regulation of T cell proliferation” is also implicitly annotated to “regulation of T cell proliferation,” “T cell proliferation,” “lymphocyte proliferation,” “leukocyte proliferation,” “cell proliferation,” and so on up to “biological process.” When you run ORA or GSEA, all of these terms may test as significant if enough of your DEGs are annotated at any level.

This is not a bug in the enrichment software. It is a faithful reflection of the GO data structure. The redundancy is in the annotation, not the analysis. The result is that your 300 significant terms may represent only 20 to 30 distinct biological themes, each appearing in the output multiple times at different levels of specificity.

The goal of redundancy reduction is not to throw away information. It is to identify one representative term per cluster of semantically similar terms, so you can present the biological story in a form that is both accurate and readable.

Method 1: clusterProfiler simplify()

The fastest approach is simplify(), which ships with clusterProfiler and requires no additional installation. It works by computing pairwise semantic similarity between all significant terms using GOSemSim, then clustering the terms and selecting one representative per cluster.

library(clusterProfiler)
library(enrichplot)
library(org.Hs.eg.db)
# Assume ora_result is your enrichGO output from the previous tutorial
# Apply simplify to remove redundant GO terms
ora_simplified <- simplify(
ora_result,
cutoff = 0.7, # similarity threshold; 0.5 is aggressive, 0.9 is lenient
by = "p.adjust", # pick the most significant term as representative
select_fun = min # lower p.adjust = more significant
)
# Compare before and after
cat("Terms before simplify():", nrow(as.data.frame(ora_result)), "\n")
cat("Terms after simplify(): ", nrow(as.data.frame(ora_simplified)), "\n")

The cutoff parameter controls aggressiveness. A cutoff of 0.7 means two terms with semantic similarity above 0.7 will be merged into one cluster. Lower values (0.4 to 0.6) produce fewer, broader clusters. Higher values (0.8 to 0.9) produce more, tighter clusters with less reduction. For most RNA-seq papers, 0.7 is a reasonable default that reduces a 200-term list to roughly 30 to 60 representative terms.

simplify() works well for ORA results from enrichGO. It also works for GSEA results from gseGO. The key limitation is that it can only be applied to GO-based results from clusterProfiler. If you ran enrichment against KEGG, Reactome, or a custom gene set database, simplify() is not available.

Method 2: rrvgo

rrvgo is a dedicated Bioconductor package for GO term redundancy reduction by Sayols (2023). It is more flexible than simplify() because it accepts any list of GO term IDs and p-values as input, regardless of which enrichment tool produced them. It is heavily influenced by REVIGO and mimics its core algorithm in a programmatic interface.

rrvgo install issues and the most common fix

rrvgo depends on GOSemSim, which requires large annotation data packages that can fail to install on some systems due to memory or timeout constraints. If BiocManager::install(“rrvgo”) fails, try installing GOSemSim separately first: BiocManager::install(“GOSemSim”). Then install rrvgo. If the annotation database download fails mid-install, set options(timeout = 600) before trying again. On HPC systems with restricted internet access, you may need to pre-download the org.Hs.eg.db tarball and install locally using install.packages(“path/to/tarball”, repos = NULL, type = “source”).

# Install rrvgo if needed
if (!requireNamespace("rrvgo", quietly = TRUE)) {
BiocManager::install("rrvgo")
}
library(rrvgo)
# Extract GO term IDs and their scores from your ORA result
# rrvgo needs GO IDs and a numeric score; use -log10(padj) so higher = more significant
go_terms <- ora_result@result$ID
go_scores <- setNames(-log10(ora_result@result$p.adjust), go_terms)
# Step 1: Compute semantic similarity matrix
sim_matrix <- calculateSimMatrix(
go_terms,
orgdb = "org.Hs.eg.db",
ont = "BP", # must match the ontology you ran ORA/GSEA with
method = "Rel" # "Rel" (relevance) is a good default; also try "Wang"
)
# Step 2: Reduce based on similarity matrix
reduced_terms <- reduceSimMatrix(
sim_matrix,
scores = go_scores, # higher scores = preferred as representative
threshold = 0.7, # same logic as simplify(): 0.7 is a good default
orgdb = "org.Hs.eg.db"
)
# How many clusters?
cat("Clusters after rrvgo reduction:", length(unique(reduced_terms$cluster)), "\n")
cat("Total input terms: ", nrow(reduced_terms), "\n")

The reduced_terms object is a data frame with one row per input GO term. The cluster column identifies which representative term each input term was assigned to. The parentTerm column names the representative for that cluster.

# Visualize as a treemap
# Each rectangle is one cluster; size proportional to number of terms in the cluster
treemapPlot(reduced_terms)
rrvgo treemap visualization showing GO term clusters as rectangles, with larger rectangles representing bigger clusters of semantically similar terms, colored by cluster
Figure 1: rrvgo treemap of reduced GO terms. Each rectangle represents one cluster of semantically similar terms. The representative term for each cluster is labeled. Larger rectangles contain more input terms. Colors distinguish clusters with no semantic ordering.
# Extract only the representative terms for reporting
representative_terms <- reduced_terms[reduced_terms$go == reduced_terms$parent, ]
representative_terms <- representative_terms[order(representative_terms$score, decreasing = TRUE), ]
# Print the top 15 representative terms
head(representative_terms[, c("go", "term", "score", "size")], 15)

The scatter plot is useful for exploring the semantic landscape. Each term is a point positioned by the first two components of a principal coordinates analysis of the similarity matrix. Terms near each other in the plot are semantically similar.

# Scatter plot: semantic similarity space
scatterPlot(sim_matrix, reduced_terms,
algorithm = "pca",
label.size = 3)

Method 3: REVIGO Web Tool

REVIGO by Supek et al. (2011) is the original web-based tool for GO term redundancy reduction and remains useful as a quick fallback when rrvgo installation fails or when you need an interactive visualization to share with collaborators. The input is a list of GO term IDs and their p-values, pasted directly into the web interface.

The algorithm REVIGO uses is conceptually similar to rrvgo: it computes pairwise semantic similarity and selects representative terms. The main practical difference is that REVIGO runs online and does not require any local installation, but it is also limited to the GO annotation version it maintains internally, which may lag behind the Bioconductor annotation packages you used in your enrichment analysis.

To use REVIGO from R, export your GO term list first:

# Export GO terms and p-values for REVIGO input
revigo_input <- data.frame(
go_id = ora_result@result$ID,
padj = ora_result@result$p.adjust
)
write.table(revigo_input,
"revigo_input.txt",
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE)

Paste the contents of revigo_input.txt into the REVIGO web interface at revigo.irb.hr. Select the appropriate organism and ontology. Download the results as a CSV and import back into R for any further processing or plotting.

Method 4: Custom Uniqueness Score Approach

When simplify() is unavailable (non-GO databases), rrvgo fails to install, and REVIGO does not match your annotation version, a custom approach based on term uniqueness is the last resort. This is also the approach to reach for when you want full control over which representative is chosen from each cluster.

The core idea is to compute a uniqueness score for each GO term, where uniqueness is defined as the fraction of that term’s annotated genes that do not appear in any other significant term. Terms with high uniqueness represent biology not captured by any other significant result and are the most information-dense candidates for reporting.

# Custom uniqueness-based filtering
# Requires the gene sets for each significant GO term
library(dplyr)
# Extract the gene sets from the ORA result
go_genesets <- ora_result@geneSets[ora_result@result$ID]
# Compute pairwise Jaccard similarity between gene sets
jaccard_sim <- function(a, b) length(intersect(a, b)) / length(union(a, b))
compute_uniqueness <- function(go_id, all_genesets) {
term_genes <- all_genesets[[go_id]]
other_genes <- unlist(all_genesets[names(all_genesets) != go_id])
unique_genes <- setdiff(term_genes, other_genes)
length(unique_genes) / length(term_genes)
}
uniqueness_scores <- sapply(
names(go_genesets),
compute_uniqueness,
all_genesets = go_genesets
)
# Filter: keep terms with uniqueness above 0.25
# (at least 25% of their genes are not shared with any other significant term)
unique_terms <- ora_result@result |>
dplyr::filter(ID %in% names(uniqueness_scores[uniqueness_scores > 0.25])) |>
dplyr::arrange(p.adjust)
cat("Terms after uniqueness filter:", nrow(unique_terms), "\n")

A uniqueness threshold of 0.25 is a reasonable starting point. Higher thresholds (0.4 to 0.5) produce a more stringent filter that keeps only the most distinct terms. Lower thresholds converge toward showing all terms. The right value depends on how correlated your significant terms are; examine the distribution of uniqueness scores first and choose a cutoff that separates the long tail of near-zero-uniqueness terms from the core of genuinely distinct ones.

Side-by-side dotplots showing 200 GO terms before redundancy reduction and 28 representative terms after, demonstrating the collapse of redundant parent-child and sibling terms
Figure 2: GO term dotplots before and after redundancy reduction. The left panel shows 200 significant terms from enrichGO, with many near-identical terms from the same GO subtree. The right panel shows 28 representative terms after rrvgo reduction at threshold 0.7, retaining distinct biological themes.

Which Method To Use

MethodWorks forRequires installationHandles non-GO databasesReproducibleNotes
simplify()enrichGO, gseGO results onlyNo (ships with clusterProfiler)NoYesFastest; use first
rrvgoAny GO term list with IDsYes (Bioconductor)NoYesMore flexible; good visualizations
REVIGOAny GO term listNo (web tool)NoLimited (web tool)Good for sharing; annotation may lag
Custom uniquenessAny enrichment resultNo (base R + dplyr)YesYesLast resort; full control

The practical sequence is: try simplify() first for GO-based results, reach for rrvgo when you need more control or the scatter/treemap visualizations, use REVIGO when you want an interactive web-based workflow or when rrvgo installation fails, and fall back to the custom uniqueness approach when your enrichment results come from a non-GO database or when you want to control the representative-selection logic exactly.

NotchBio applies GO term redundancy reduction automatically on every enrichment analysis run, using simplify() with a default threshold of 0.7 for ORA results and rrvgo for the scatter plot visualization included in the run report. If you want to adjust the threshold or explore the full unreduced term list, both are available in the downloadable run outputs.

Further reading

Read another related post

View all posts