Tutorial

How to Run DESeq2 in R: From Salmon Counts to DEG Results

By Abdullah Shahid · April 24, 2026 · 12 min read

rna-seq tutorial differential-expression bioinformatics

Getting a list of differentially expressed genes from DESeq2 is not the hard part. Getting a list you can actually trust is.

Most DESeq2 errors are quiet. The pipeline runs cleanly, you get thousands of genes, and none of it flags an error. But if you fed it normalized counts instead of raw counts, set your reference level wrong, or forgot to include batch in the design formula, your results are biased and you have no warning.

This tutorial walks through the full DESeq2 workflow in R: importing Salmon quant.sf files with tximeta, building the DESeqDataSet, running the model, applying apeglm shrinkage, and exporting a clean ranked results table. We use the r-rnaseq conda environment from Blog 1. Your Salmon outputs live in results/quant/ from Blog 4.

Five-stage DESeq2 pipeline diagram showing tximeta import producing a SummarizedExperiment, then DESeqDataSet construction with design formula, then DESeq() running size factor estimation plus dispersion estimation plus Wald test, then lfcShrink with apeglm, then results table with columns baseMean log2FoldChange lfcSE pvalue padj — Figure 1: The full DESeq2 pipeline from Salmon outputs to a ranked DEG table. Each stage depends on the previous one correctly completing. The design formula is the most consequential decision you will make in this entire workflow.

Why Raw Counts Matter More Than You Think

DESeq2 expects raw, un-normalized integer counts. This is not a preference. It is a requirement for the statistical model to work correctly.

The negative binomial model that DESeq2 uses requires knowledge of the raw count variance. If you pre-normalize your data before passing it to DESeq2, the model cannot accurately estimate dispersion, and your p-values will be wrong. Do not provide TPM, RPKM, or CPM to DESeq2.

It is important to not provide counts that were pre-normalized for sequencing depth, as the statistical model is most powerful when applied to un-normalized counts and is designed to account for library size differences internally.

The tximeta import pipeline handles this correctly. It converts Salmon’s estimated transcript-level abundances into gene-level counts using the offsets method, which accounts for differences in transcript length across samples. The resulting counts are appropriate for DESeq2 without any further transformation.

How to Import Salmon quant.sf Files with tximeta

Activate the R environment and open an R session or RStudio.

# Activate from terminal:
# conda activate r-rnaseq

library(tximeta)
library(DESeq2)
library(dplyr)
library(readr)

Build the sample table

The sample table maps each Salmon output directory to its metadata. Every column you want available in the design formula must be present here.

# List all quant directories
quant_dirs <- list.dirs("results/quant", recursive = FALSE, full.names = TRUE)

# Parse sample names from directory names
# Assumes directories are named like: SRR26891234_quant
sample_names <- sub("_quant$", "", basename(quant_dirs))

# Load sample metadata (from pysradb output in Blog 2)
metadata <- read_csv("sample_sheet.csv") |>
  filter(sample_id %in% sample_names) |>
  arrange(match(sample_id, sample_names))

# Build the coldata table for tximeta
coldata <- data.frame(
  names      = sample_names,
  files      = file.path(quant_dirs, "quant.sf"),
  condition  = factor(metadata$condition),
  batch      = factor(metadata$batch),     # include if present
  row.names  = sample_names
)

# Set the reference level BEFORE building the DESeqDataSet
# The reference level is your control condition
coldata$condition <- relevel(coldata$condition, ref = "control")

# Verify all quant.sf files exist
stopifnot(all(file.exists(coldata$files)))
cat("All quant.sf files found:", nrow(coldata), "samples\n")

Import with tximeta

# tximeta automatically fetches GENCODE/Ensembl transcript annotations
# It attaches genomic ranges to the SummarizedExperiment
se <- tximeta(coldata)

# Summarize transcript-level estimates to gene level
# The tx2gene mapping is built automatically from the metadata in se
gse <- summarizeToGene(se)

cat("Gene-level SummarizedExperiment dimensions:", dim(gse), "\n")
# Should show ~62,000 genes x 12 samples for GENCODE v47

tximeta vs tximport: which one to use

Use tximeta if you quantified with GENCODE or Ensembl references. It auto-fetches the tx2gene mapping and attaches genomic coordinates. Use tximport if you used a custom reference or if tximeta fails to recognize your index. The downstream DESeq2 code is identical either way.

Diagram of a SummarizedExperiment object with three colored blocks: a pink assay block containing the counts matrix with genes as rows and samples as columns, a blue rowRanges block containing genomic coordinates per gene, and a green colData block containing the sample metadata table with condition and batch columns, with an arrow showing that colData columns feed directly into the DESeq2 design formula — Figure 2: The SummarizedExperiment produced by tximeta. The colData slot holds your sample metadata. Every variable you want in the design formula must be a column in colData. The rowRanges slot holds gene coordinates, which tximeta fetches automatically for GENCODE and Ensembl references.

How to Build a DESeqDataSet with the Right Design Formula

The design formula is the most consequential decision in the analysis. It tells DESeq2 which variables to model and which comparison to make.

# Simple two-group design: condition only
dds <- DESeqDataSet(gse, design = ~ condition)

# If you have a batch effect that MultiQC confirmed (from Blog 3)
# Add batch to the design to control for it
# dds <- DESeqDataSet(gse, design = ~ batch + condition)

The order of terms in the formula matters. DESeq2 tests the last term. With ~ batch + condition, it controls for batch and tests condition. With ~ condition + batch, it would test batch instead.

Filter low-count genes

Genes with very low counts across all samples add noise and reduce statistical power by increasing the multiple testing burden. Filter them before running the model.

# Keep genes with at least 10 counts in at least 3 samples
# The 3 here should match your smallest group size
keep <- rowSums(counts(dds) >= 10) >= 3
dds  <- dds[keep, ]

cat("Genes retained after filtering:", nrow(dds), "\n")
# Typically 18,000 to 22,000 protein-coding genes for human

Set your reference level before running DESeq()

DESeq2 computes fold changes as treated vs reference. If you do not call relevel() before constructing the DESeqDataSet, R defaults to alphabetical ordering. Your fold changes will be correct in magnitude but potentially in the wrong direction. Set the reference level explicitly and confirm it with levels(dds$condition)[1].

How to Run DESeq() and Interpret Each Step

The single DESeq() call runs three sequential steps internally: size factor estimation, dispersion estimation, and the Wald test.

# This takes 2-10 minutes depending on dataset size
dds <- DESeq(dds)

DESeq2 prints progress as it runs.

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Each step matters. Size factor estimation corrects for library depth differences between samples. Dispersion estimation models the gene-level variance, which is what makes the Wald test conservative enough to control false discoveries in RNA-seq data.

Extract results for the contrast of interest

# Extract results for the condition comparison
res <- results(
  dds,
  name     = "condition_treated_vs_control",
  alpha    = 0.05     # FDR threshold for independent filtering
)

# Summary of results
summary(res)

The summary() output looks like:

out of 19,847 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 1,243, 6.3%
LFC < 0 (down)   : 987, 5.0%
outliers [1]     : 23, 0.12%
low counts [2]   : 2,104, 10.6%

How to Apply apeglm LFC Shrinkage

Raw log2 fold changes from DESeq2 are maximum likelihood estimates. Genes with low counts will have extreme fold changes with very high uncertainty. These noisy estimates make volcano plots uninformative and mislead gene ranking.

lfcShrink with apeglm applies a Bayesian shrinkage estimator that pulls unreliable fold changes toward zero while preserving large, well-supported fold changes. apeglm and ashr shrinkage methods help to preserve the size of large LFC and can be used to compute s-values. These properties are related to wide-tailed priors that avoid shrinking large LFCs.

library(apeglm)

# Get the coefficient name to shrink
resultsNames(dds)
# [1] "Intercept" "condition_treated_vs_control"

# Apply apeglm shrinkage
res_shrunk <- lfcShrink(
  dds,
  coef = "condition_treated_vs_control",
  type = "apeglm"
)

# Compare before vs after shrinkage for a low-count gene
# Raw LFC for a gene with baseMean = 4: might show log2FC = 8.3
# Shrunken LFC for same gene: pulled toward ~1.2

Use the shrunken results for visualization and ranking. Use the unshrunken results only if you need p-values with specific numeric contrast weights.

Two side-by-side MA plots: left plot labeled Before lfcShrink shows points scattered widely on the y-axis for genes with low baseMean on the left, with some extreme values reaching plus or minus 10, colored grey for non-significant and red for significant; right plot labeled After apeglm shrinkage shows the same data but with low-baseMean points pulled tightly toward zero on the y-axis, the significant genes remain elevated but the noise from low-count genes is eliminated, making the plot much cleaner — Figure 3: MA plots before and after apeglm shrinkage. Low-count genes on the left side of the x-axis have extreme raw fold changes that shrinkage pulls toward zero. Genes with genuine large fold changes and sufficient counts remain elevated after shrinkage.

How to Annotate Results with Gene Symbols

DESeq2 works with Ensembl gene IDs by default. Convert them to gene symbols before exporting.

library(AnnotationDbi)
library(org.Hs.eg.db)

# Add gene symbol column to the results
res_df <- as.data.frame(res_shrunk) |>
  tibble::rownames_to_column("gene_id") |>
  mutate(
    # Strip version number from Ensembl ID (e.g. ENSG00000123456.14 -> ENSG00000123456)
    gene_id_clean = sub("\\..*", "", gene_id),
    symbol = mapIds(
      org.Hs.eg.db,
      keys     = gene_id_clean,
      column   = "SYMBOL",
      keytype  = "ENSEMBL",
      multiVals = "first"
    )
  ) |>
  # Move symbol to the front
  select(gene_id, symbol, baseMean, log2FoldChange, lfcSE, pvalue, padj) |>
  arrange(padj, desc(abs(log2FoldChange)))

cat("Total genes in results:", nrow(res_df), "\n")

Export DEG tables

# All genes, ranked by adjusted p-value
write_csv(res_df, "results/deseq2_all_genes.csv")

# Significant DEGs only: padj < 0.05 and |log2FC| > 1 (2-fold change)
degs <- res_df |>
  filter(padj < 0.05, abs(log2FoldChange) > 1)

write_csv(degs, "results/deseq2_significant_degs.csv")

cat("Significant DEGs (padj < 0.05, |LFC| > 1):", nrow(degs), "\n")
cat("  Upregulated:  ", sum(degs$log2FoldChange > 0), "\n")
cat("  Downregulated:", sum(degs$log2FoldChange < 0), "\n")

What the results columns mean

baseMean is the average normalized count across all samples. log2FoldChange is the shrunken effect size. lfcSE is the posterior standard deviation. pvalue is the raw Wald test p-value. padj is the Benjamini-Hochberg adjusted p-value. Always use padj for significance thresholds, never the raw pvalue.

Annotated table showing the DESeq2 results DataFrame with six columns: gene_id showing ENSG IDs, symbol showing gene names like IFNG and TNF and MYC, baseMean showing average normalized counts, log2FoldChange showing shrunken effect sizes, pvalue, and padj; arrows point to key columns with labels explaining baseMean equals average expression across all samples, log2FoldChange equals shrunken LFC use this for ranking, and padj equals BH-corrected p-value use this for significance not pvalue — Figure 4: The annotated DESeq2 results table. The padj column is the only column to use for significance cutoffs. The log2FoldChange column contains apeglm-shrunken estimates, which are reliable for ranking and visualization. baseMean is useful for distinguishing low-expressed genes from well-measured ones.

A Complete Runnable Script

Here is the full analysis from import to export in one script. Save it as run_deseq2.R.

#!/usr/bin/env Rscript
# Full DESeq2 analysis from Salmon quant.sf files to DEG table

suppressPackageStartupMessages({
  library(tximeta)
  library(DESeq2)
  library(apeglm)
  library(AnnotationDbi)
  library(org.Hs.eg.db)
  library(dplyr)
  library(readr)
})

# ── 1. Build sample table ────────────────────────────────────────────
quant_dirs   <- list.dirs("results/quant", recursive = FALSE, full.names = TRUE)
sample_names <- sub("_quant$", "", basename(quant_dirs))
metadata     <- read_csv("sample_sheet.csv") |>
                  arrange(match(sample_id, sample_names))

coldata <- data.frame(
  names     = sample_names,
  files     = file.path(quant_dirs, "quant.sf"),
  condition = factor(metadata$condition),
  row.names = sample_names
)
coldata$condition <- relevel(coldata$condition, ref = "control")
stopifnot(all(file.exists(coldata$files)))

# ── 2. Import with tximeta ───────────────────────────────────────────
se  <- tximeta(coldata)
gse <- summarizeToGene(se)

# ── 3. Build DESeqDataSet ────────────────────────────────────────────
dds  <- DESeqDataSet(gse, design = ~ condition)
keep <- rowSums(counts(dds) >= 10) >= 3
dds  <- dds[keep, ]
cat("Genes after filtering:", nrow(dds), "\n")

# ── 4. Run DESeq2 ────────────────────────────────────────────────────
dds <- DESeq(dds)

# ── 5. Extract results with apeglm shrinkage ─────────────────────────
coef_name  <- grep("condition_", resultsNames(dds), value = TRUE)[1]
res_shrunk <- lfcShrink(dds, coef = coef_name, type = "apeglm")

# ── 6. Annotate and export ───────────────────────────────────────────
res_df <- as.data.frame(res_shrunk) |>
  tibble::rownames_to_column("gene_id") |>
  mutate(
    gene_id_clean = sub("\\..*", "", gene_id),
    symbol = mapIds(org.Hs.eg.db, keys = gene_id_clean,
                    column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
  ) |>
  select(gene_id, symbol, baseMean, log2FoldChange, lfcSE, pvalue, padj) |>
  arrange(padj, desc(abs(log2FoldChange)))

dir.create("results/deseq2", showWarnings = FALSE)
write_csv(res_df, "results/deseq2/all_genes.csv")
write_csv(filter(res_df, padj < 0.05, abs(log2FoldChange) > 1),
          "results/deseq2/significant_degs.csv")

cat("Analysis complete.\n")
cat("Results written to results/deseq2/\n")

Run it from the terminal:

conda activate r-rnaseq
Rscript run_deseq2.R

Manual DESeq2 vs NotchBio: From quant.sf to DEG Table

The R workflow above is rigorous and reproducible. It still requires the r-rnaseq environment from Blog 1, a working AnnotationDbi installation, and careful attention to reference levels and design formula construction.

NotchBio runs the equivalent DESeq2 pipeline automatically when you upload Salmon outputs or complete the upstream quantification within the platform. The design formula, size factor estimation, dispersion plots, apeglm shrinkage, and annotated results table are all produced without writing a line of R.

Step	Manual R Pipeline	NotchBio
Install DESeq2 + apeglm	BiocManager or conda, 5-10 min	Not required
Build sample table	Write coldata DataFrame manually	Auto-built from metadata
Set reference level	relevel() before DESeqDataSet	UI dropdown
Design formula	Typed manually, easy to get wrong	Guided form
Batch correction	Add batch to formula manually	Detected from metadata
Low-count filtering	rowSums() filter written by hand	Automatic
apeglm shrinkage	lfcShrink() with correct coef name	Always on
Gene symbol annotation	AnnotationDbi mapIds()	Included in results table
Export DEG table	write_csv()	Download from dashboard
Time to first results	30-60 min including debugging	10 min from quant.sf upload

If you want publication-ready DEG tables from your Salmon outputs without managing an R environment, notchbio.app runs the same pipeline on our infrastructure.

Read another related post

View all posts

Research Guide