Back to blog
Tutorial

Differential Expression in Python with PyDESeq2: A Tutorial

By Abdullah Shahid · · 10 min read

Python users doing RNA-seq analysis have historically had two uncomfortable options: learn R well enough to run DESeq2, or use Python-native tools that do not implement the same statistical model. PyDESeq2 changes that. It is a faithful Python reimplementation of the DESeq2 negative binomial GLM, written by the Owkin team, and it integrates directly with AnnData objects and the scanpy ecosystem. If your analysis lives in Python and you want publication-grade differential expression without switching languages, PyDESeq2 is the current best option.

This tutorial covers the complete workflow: installing the library, loading count data, fitting the model, extracting results, validating against an R baseline, generating plots, and exporting for enrichment analysis.

When To Choose PyDESeq2 Over R DESeq2

The choice is almost entirely about your existing ecosystem rather than statistical performance. The two implementations share the same mathematical model: negative binomial GLM with shrunken dispersion estimates and Wald tests. Results between them are close but not identical because they differ in some numerical details of the optimization routines.

Choose PyDESeq2 when your analysis pipeline is already Python-native, your count data lives in AnnData objects from a scanpy workflow, or you are building an automated analysis system in Python where calling out to R would complicate the architecture. Choose R DESeq2 when your team works primarily in R, when you are working with complex multi-factor designs where the R implementation has more battle-tested edge case handling, or when a reviewer specifically asks you to use R DESeq2 because it is the field standard.

The one situation where you should be cautious about PyDESeq2 is small sample sizes, below three replicates per group. In that regime the dispersion estimation can diverge slightly from the R implementation. In practice the difference is unlikely to change your conclusions, but it is worth validating against R if your sample sizes are small and the paper is high-stakes.

PyDESeq2 results are close but not identical to R DESeq2

PyDESeq2 implements the same statistical model as R DESeq2 but differs in some numerical details of the optimization. Expect log2 fold changes and adjusted p-values to be very similar but not exactly equal. For most experiments the differences are well below any meaningful threshold. For small sample sizes or complex designs, run a validation comparison against R DESeq2 before relying solely on PyDESeq2 results for publication.

Installation

# Install from PyPI
# Requires Python 3.8 or later
pip install pydeseq2
# Optional but recommended: install anndata for AnnData input format
pip install anndata scanpy
# For plotting
pip install seaborn matplotlib

PyDESeq2 has no R dependency. It is pure Python with numpy, scipy, and pandas under the hood.

Loading Count Data and Metadata

PyDESeq2 accepts count data as either a pandas DataFrame or an AnnData object. The DataFrame path is the simplest starting point if you are coming from a CSV count matrix.

import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# Load the raw count matrix
# Rows must be SAMPLES, columns must be GENES
# This is the TRANSPOSE of the typical R convention
counts_df = pd.read_csv("counts_matrix.csv", index_col=0)
# PyDESeq2 expects samples as rows, genes as columns
# If your matrix has genes as rows, transpose it
if counts_df.shape[0] > counts_df.shape[1]:
# Likely genes x samples, need to transpose
counts_df = counts_df.T
print(f"Shape: {counts_df.shape}") # should be (n_samples, n_genes)
# Load sample metadata
metadata_df = pd.read_csv("sample_metadata.csv", index_col=0)
# Ensure sample IDs match between counts and metadata
assert all(counts_df.index == metadata_df.index), \
"Sample order mismatch between counts and metadata"
print(metadata_df)

The transposition is the most common point of confusion for people coming from R. DESeq2 in R expects genes as rows and samples as columns. PyDESeq2 follows the AnnData / scanpy convention where observations (samples) are rows and variables (genes) are columns. If you get this wrong, the model will appear to run but your results will be meaningless.

# Pre-filter: remove genes with fewer than 10 counts across all samples
# PyDESeq2 does its own internal filtering but this speeds up computation
gene_counts = counts_df.sum(axis=0)
counts_filtered = counts_df.loc[:, gene_counts >= 10]
print(f"Genes after filtering: {counts_filtered.shape[1]}")

Building the DeseqDataSet and Fitting the Model

The DeseqDataSet object in PyDESeq2 is the equivalent of DESeqDataSet in R. It holds the count matrix, the metadata, and the design specification.

# Build the DeseqDataSet
# design_factors: the column(s) in metadata to include in the model
dds = DeseqDataSet(
counts = counts_filtered,
metadata = metadata_df,
design_factors = "condition", # equivalent to ~ condition in R
ref_level = ["condition", "control"], # set the reference level
n_cpus = 4, # parallel processing
refit_cooks = True # refit for outlier samples (recommended)
)
# Run the full pipeline: size factor estimation, dispersion, GLM fit
dds.deseq2()

The ref_level argument is important. It sets which condition is the denominator in your fold change calculation. Setting it explicitly prevents PyDESeq2 from defaulting to alphabetical order, which may give you the wrong direction of effect.

Extracting and Examining Results

After fitting the model, create a DeseqStats object to run the Wald test and extract results for a specific contrast.

# Run the Wald test for the contrast of interest
stat_res = DeseqStats(
dds,
contrast = ["condition", "treated", "control"],
alpha = 0.05,
cooks_filter = True,
independent_filter = True
)
# Run the test and apply Benjamini-Hochberg correction
stat_res.run_wald_test()
stat_res.p_adjoust() # applies BH correction
# Extract results as a DataFrame
results_df = stat_res.results_df
print(results_df.head())
print(f"\nTotal genes tested: {len(results_df)}")
print(f"Significant (padj < 0.05): {(results_df['padj'] < 0.05).sum()}")

The results DataFrame has the same columns as the R DESeq2 output: baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj. The column semantics are identical.

# Apply lfcShrink for better fold change estimates
# PyDESeq2 implements the same apeglm-equivalent shrinkage
stat_res.lfc_shrink(coeff="condition_treated_vs_control")
# Extract the filtered DEG table
results_df = stat_res.results_df.copy()
degs = results_df[
results_df["padj"].notna() &
(results_df["padj"] < 0.05) &
(results_df["log2FoldChange"].abs() >= 1)
].copy()
print(f"DEGs up: {(degs['log2FoldChange'] > 0).sum()}")
print(f"DEGs down: {(degs['log2FoldChange'] < 0).sum()}")

Validation: Comparing Against R DESeq2

If your experiment is important enough to publish, it is worth spending 15 minutes running the same analysis in R and comparing the top results. The comparison does not need to be perfect; you are looking for major discrepancies, not numerical identity.

# Load R DESeq2 results (exported from R as a CSV)
r_results = pd.read_csv("deseq2_r_results.csv", index_col=0)
# Merge on gene ID
comparison = results_df.join(
r_results[["log2FoldChange", "padj"]].rename(
columns={"log2FoldChange": "r_lfc", "padj": "r_padj"}
),
how="inner"
)
# Correlation of log2FC estimates
lfc_corr = comparison["log2FoldChange"].corr(comparison["r_lfc"])
print(f"log2FC correlation (Python vs R): {lfc_corr:.4f}")
# Expect > 0.99 for standard experiments
# Check that significant genes agree
python_sig = set(comparison[comparison["padj"] < 0.05].index)
r_sig = set(comparison[comparison["r_padj"] < 0.05].index)
overlap = len(python_sig & r_sig)
print(f"Significant gene overlap: {overlap} / {len(python_sig | r_sig)}")

A log2FC correlation above 0.99 and a significant gene overlap above 95 percent means the two implementations agree well and you can proceed with the Python results. If either metric is substantially lower, investigate whether the reference levels, design formulas, and input matrices are truly equivalent between the two runs before proceeding.

PCA Plot with Scanpy

Visualizing sample separation before differential expression is a critical sanity check. If treated and control samples do not separate on the first two PCs, something is wrong with your data or metadata.

import scanpy as sc
import anndata as ad
# Build an AnnData object from normalized counts for visualization
# Use VST-transformed values or log1p of normalized counts
# PyDESeq2 stores normalized counts in dds.layers["normed_counts"]
adata = ad.AnnData(
X = dds.layers["normed_counts"],
obs = metadata_df,
var = pd.DataFrame(index=counts_filtered.columns)
)
# Log-transform for PCA
sc.pp.log1p(adata)
# Compute PCA
sc.pp.pca(adata, n_comps=10)
# Plot colored by condition
sc.pl.pca(
adata,
color = "condition",
title = "PCA of normalized counts",
save = "pca_condition.png"
)
PCA plot from PyDESeq2 analysis showing clear separation between treated and control samples along PC1, generated with scanpy
Figure 1: PCA of log-normalized counts colored by condition. Clear separation along PC1 confirms that condition is the dominant source of variance. If samples from different conditions cluster together, investigate batch effects before proceeding with DE analysis.

Volcano Plot with Matplotlib

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
# Prepare data
plot_df = results_df.copy()
plot_df = plot_df[plot_df["padj"].notna()]
plot_df["-log10padj"] = -np.log10(plot_df["padj"].clip(lower=1e-30))
# Assign colors
def assign_color(row):
if row["padj"] < 0.05 and row["log2FoldChange"] >= 1:
return "up"
elif row["padj"] < 0.05 and row["log2FoldChange"] <= -1:
return "down"
return "ns"
plot_df["direction"] = plot_df.apply(assign_color, axis=1)
color_map = {"up": "#E63946", "down": "#457B9D", "ns": "#CCCCCC"}
fig, ax = plt.subplots(figsize=(8, 6))
for direction, color in color_map.items():
subset = plot_df[plot_df["direction"] == direction]
ax.scatter(
subset["log2FoldChange"],
subset["-log10padj"],
c=color, s=15, alpha=0.6, linewidths=0, label=direction
)
# Threshold lines
ax.axhline(-np.log10(0.05), color="grey", linestyle="--", linewidth=0.8)
ax.axvline(-1, color="grey", linestyle="--", linewidth=0.8)
ax.axvline(1, color="grey", linestyle="--", linewidth=0.8)
ax.set_xlabel("Log2 fold change (treated vs control)", fontsize=12)
ax.set_ylabel("-Log10 adjusted p-value", fontsize=12)
ax.set_title("Differential expression: treated vs control", fontsize=13)
ax.set_xlim(-6, 6)
ax.set_ylim(0, 30)
up_count = (plot_df["direction"] == "up").sum()
down_count = (plot_df["direction"] == "down").sum()
ax.legend(
handles=[
mpatches.Patch(color="#E63946", label=f"Up ({up_count})"),
mpatches.Patch(color="#457B9D", label=f"Down ({down_count})"),
mpatches.Patch(color="#CCCCCC", label="NS"),
],
frameon=False
)
plt.tight_layout()
plt.savefig("volcano_treated_vs_control.png", dpi=300, bbox_inches="tight")
plt.show()
Volcano plot from PyDESeq2 results showing upregulated genes in red on the right, downregulated genes in blue on the left, and non-significant genes in grey
Figure 2: Volcano plot of treated vs control differential expression results from PyDESeq2. Red: padj below 0.05 and log2FC at or above 1. Blue: padj below 0.05 and log2FC at or below -1. Grey: not significant or below fold change threshold.

Exporting for Enrichment Analysis

The export format is the same as the R workflow: a full results table for GSEA input, and a filtered DEG list for ORA.

# Full results for GSEA (all tested genes, sorted by Wald stat)
gsea_input = results_df[results_df["stat"].notna()].copy()
gsea_input = gsea_input.sort_values("stat", ascending=False)
gsea_input[["stat"]].to_csv("gsea_ranked_pydeseq2.csv")
# DEG list for ORA
degs.to_csv("degs_pydeseq2.csv")
# Background for ORA: all genes with non-null padj
background = results_df[results_df["padj"].notna()].index.tolist()
pd.Series(background).to_csv("ora_background.csv", index=False, header=False)
print(f"Exported {len(gsea_input)} genes for GSEA ranking")
print(f"Exported {len(degs)} DEGs for ORA")
print(f"Background size for ORA: {len(background)}")

Python vs R DE Library Comparison

LibraryLanguageImplements DESeq2 modelAnnData supportActive maintenanceBest for
PyDESeq2PythonYes (reimplementation)YesYesPython-native pipelines, scanpy workflows
diffxpyPythonNo (Wald test on GLM)YesLimitedQuick DE in scanpy, not publication-primary
edgeR via rpy2Python via RYes (edgeR)PartialVia rpy2 bridgeR-quality results from Python with complexity
DESeq2 via rpy2Python via RYes (original)PartialVia rpy2 bridgeMaximum fidelity to R DESeq2, harder to set up
pydeseq2 + pyGSEAPythonYes + enrichmentYesYesEnd-to-end Python DE and enrichment

PyDESeq2 is the right choice for most Python-native workflows. If you need absolute fidelity to the original R implementation and are comfortable with the rpy2 bridge, the R-via-Python approach works but adds significant setup complexity.

NotchBio accepts a count matrix upload and returns results in both R-compatible and Python-compatible formats, including a pandas DataFrame export of the full results table and a ranked gene list ready for pyGSEA or gseapy. If you want the statistical rigor of DESeq2 without the language choice overhead, that is what the platform is built for.

Further reading

Read another related post

View all posts