Back to blog
Bioinformatics

Why Reproducibility Should Not Be Optional in RNA-Seq Pipelines

By Abdullah Shahid · · 8 min read

Picture your wet-lab bench. Every reagent has a lot number. Every kit has an expiration date. The cell line has a passage number recorded in your notebook. The antibody came from a specific catalog entry with a unique identifier. You would not run an immunostaining experiment, publish the result, and then describe the antibody as “a rabbit anti-phospho-AKT antibody.” Your reviewers would send it back.

Now picture your RNA-seq analysis. In most published papers, the computational equivalent of “a rabbit anti-phospho-AKT antibody” is the entire methods section. “Reads were aligned using STAR and differential expression was performed using DESeq2.” No version numbers. No parameter choices. No reference genome release. No statement about which samples failed QC and were excluded. No seed set for the random number generator. No record of what the count matrix looked like before the filtering step that removed 40 percent of genes.

The lab bench has stricter reproducibility standards than the computation that generates the paper’s primary result. That is the problem.

The Scale of the Gap

A 2021 systematic analysis published in Briefings in Bioinformatics examined 1,000 randomly selected RNA-seq publications and evaluated whether their reported methodology contained enough information to reproduce the computational pipeline. The findings were unambiguous. The vast majority of papers omitted at least one category of required methodological information. Parameter choices were the most consistently absent. Reference genome versions and annotation releases were frequently unspecified. Tool versions appeared in fewer than half of papers reviewed.

This is not a fringe problem. These are papers published in respected journals, from well-funded labs, passing peer review. The reviewers did not ask. The journals did not require it. The field had collectively agreed that computational pipelines did not need to meet the same documentation standard as bench protocols. That agreement is now eroding as more researchers attempt to reproduce published results and fail.

The cost is not merely academic inconvenience. When a DEG list cannot be reproduced, the downstream experiments it motivated may be replicated on a false signal. When a clinical biomarker study cannot be reproduced, the biomarker development program built on it may be flawed from its foundation. The stakes are proportional to the science the analysis supports.

The Five Things Every RNA-Seq Run Must Capture

Reproducibility in computational biology is not technically complex. It requires discipline, not expertise. The five items below, documented at the time the analysis runs, are sufficient to enable reproduction in most cases.

The five-item reproducibility checklist for every RNA-seq run

  1. Tool names and exact versions for every step: aligner, quantifier, DE tool, enrichment tool. Use sessionInfo() in R and conda env export or pip freeze in Python. 2. Reference genome assembly name, patch level, and annotation release with a stable URL or DOI. 3. A parameter file or configuration record for every tool that accepts non-default parameters. 4. The sample-to-condition mapping table, including batch assignments. 5. Any filtering or exclusion decisions made during or after analysis, with explicit criteria stated.

These five items are not burdensome. They already exist somewhere in your analysis environment. The discipline is deciding to commit them alongside the code before you close the laptop.

The one item that is harder than it looks is stable reference file access. URLs in README files break routinely. FTP paths on genome databases get reorganized. Archiving the exact GTF and FASTA files you used in a project directory, even though it adds storage overhead, is the most reliable solution. The alternative is a broken link two years later and no way to recover what you actually ran against.

Why Containers Do Not Solve This

Docker and Singularity have become the standard answer to the reproducibility problem, and they do solve one real problem: capturing which version of each tool was installed in the analysis environment. A container that records the exact version of every software dependency means anyone with the image can run the same tools without the dependency resolution problems that plague ad hoc installations.

But containers capture the environment, not the analysis. A Docker image that contains STAR 2.7.11a does not tell you which parameters STAR was run with. It does not tell you which GTF file was used as the annotation. It does not tell you which samples were excluded from the DESeq2 analysis because they failed PCA inspection. It does not tell you what the seed was for any stochastic steps. And if the container image is not archived in a stable registry, the image itself will eventually become inaccessible, reproducing the broken-URL problem at a different layer of the stack.

Containers are necessary and not sufficient. They solve the software version problem. They leave the other four problems untouched.

Layered diagram showing five components of a reproducible RNA-seq run, with containers solving only the bottom tool-version layer and the upper four layers requiring explicit documentation
Figure 1: The five layers of a reproducible RNA-seq run. Containers capture only the bottom layer. The upper four layers require explicit documentation that most published analyses do not provide.

The reproducibility problem in RNA-seq analysis is not a knowledge problem. Most researchers know what they should be documenting. It is a friction problem. Committing session info, parameter files, and sample mappings at the end of every analysis requires discipline applied consistently at a moment when you are already eager to move on to interpretation.

The engineering solution to friction problems is removing the friction. In production software, this is standard practice: deterministic builds, content-addressed storage, and immutable infrastructure make reproducibility automatic rather than optional. The same pattern applies to bioinformatics pipelines.

The pattern looks like this: every analysis run produces a snapshot automatically, capturing the tool versions, parameters, reference genome version, and sample mapping in a single record associated with a stable identifier. Reproducing the analysis means pointing the platform at the same snapshot ID and clicking run. The researcher does not need to remember to commit anything because the platform commits everything.

Diagram showing the anatomy of a NotchBio run snapshot with locked tool versions, parameters, reference genome, sample mapping, and a permalink with one-click rerun
Figure 2: The anatomy of a NotchBio run snapshot. Every analysis run produces a permalink that captures all five reproducibility layers. Sharing the link is sufficient for anyone to reproduce the exact analysis, or for a reviewer to verify that the described methods match what was actually run.

What This Looks Like in Practice

The difference between a reproducible and an irreproducible analysis is not in the science. It is in the infrastructure.

An irreproducible analysis runs well, produces results, and generates a paper. When someone tries to reproduce it two years later, they find that the STAR version in the conda environment is no longer pinned, the GTF URL returns a 404, and the code contains a filtering step that was applied interactively during exploration and never committed. The results cannot be reproduced. The follow-up experiment cannot verify the original finding.

A reproducible analysis runs the same code, produces the same results, and generates the same paper. When someone tries to reproduce it two years later, they check out the repository, read the parameter file, download the archived reference files from the stable URL in the config, and re-run the pipeline. They get the same count matrix, the same DEG list, and the same enrichment results. The science can be verified.

The difference between these two outcomes is not the analysis. It is whether the person running it treated documentation as part of the work or as an afterthought.

NotchBio’s Approach

NotchBio generates a permalink for every RNA-seq run that captures all five reproducibility layers automatically. Every run record includes: the exact version of every tool in the pipeline, the full parameter set used, the reference genome assembly and annotation release with a stable identifier, the sample-to-condition mapping table, and any QC filtering decisions made by the platform. Rerunning from the permalink reproduces the analysis identically.

For labs submitting to journals that require code and data availability statements, the NotchBio run permalink satisfies the computational reproducibility requirement for any standard bulk RNA-seq differential expression analysis. The permalink is stable, the record is complete, and no additional documentation is required.

Reproducibility should not require discipline applied under deadline pressure at the end of a project. It should be what happens by default every time an analysis runs. That is the standard your reagent lot numbers already meet. It is the standard your computational pipeline should meet too.

If your lab runs RNA-seq analyses regularly and you want to see what a fully reproducible run record looks like, start a free trial at notchbio.app. The first run takes under five minutes to set up, and the permalink it generates is the evidence your next reviewer may ask for.

Further reading

Read another related post

View all posts