Back to blog
Research Guide

The Reproducibility Crisis in Bulk RNA-Seq: What Actually Breaks

By Abdullah Shahid · · 9 min read

Download someone’s published RNA-seq pipeline. Try to run it. There is a better-than-even chance something fails before you see a single output file.

The dependency is pinned to a version that no longer exists on Bioconductor. The reference genome URL is a dead link. A hardcoded path points to a directory on a server at an institution the first author left two years ago. The filtering step that removed 30 percent of genes before differential expression is mentioned nowhere in the methods section. The pipeline runs, gives you numbers, and you have no way of knowing whether those numbers match what was published.

This is not a niche problem. A 2019 analysis by Bhandari and colleagues examined RNA-seq papers across major journals and found the majority could not be reproduced from published materials alone. The field has known about this for years. The problem persists because the incentives that drive publication do not reward the extra hours it takes to make an analysis genuinely repeatable.

The State of the Field

The reproducibility gap in computational biology is not primarily about fraud. Most researchers who publish irreproducible analyses are not cutting corners deliberately. They are cutting corners under deadline pressure, and the corners they cut are the ones that feel administrative rather than scientific: version documentation, parameter files, data availability statements, path portability.

The result is a large body of published work that is nominally open but practically closed. The code exists on GitHub. The data exists in GEO. The paper describes what was done at a high level. But the connective tissue that would let someone actually replicate the result from scratch, including exact tool versions, exact parameter choices, exact reference files, and the exact sample-to-condition mapping, is missing.

A bioinformatician who posted to Reddit put it directly: the code is provided “at your own risk,” and most of it is badly written enough that you would not want to de-spaghettify it anyway. The response from the community was not disagreement. It was recognition.

Half the breakthroughs I read in bioinformatics are not reproducible, scalable, or even usable in real pipelines. The gap between what is marketed and what actually works in day-to-day bioinformatics is getting huge.

r/bioinformatics community , 279 upvotes, 63 comments

The Five Things That Actually Break

Reproducibility failures in RNA-seq are not random. The same categories appear over and over when you try to rerun published analyses. Understanding them specifically makes it possible to prevent them systematically.

Tool version drift is the most common failure. A pipeline written against DESeq2 1.36 may behave differently under 1.40, and the difference matters when dispersion estimation algorithms change between versions. Bioconductor releases major updates every six months. An undated sessionInfo() call at the end of a script tells you what the author ran on one machine at one point in time. It does not help you reproduce that environment two years later without significant effort.

Undocumented parameters account for a large fraction of the remaining failures. STAR has dozens of parameters that affect alignment behavior. featureCounts has parameters controlling how ambiguous reads are handled, whether multi-mapping reads are included, and whether strand specificity is enforced. The defaults change between versions. A methods section that says “reads were aligned using STAR” tells you almost nothing about what the pipeline actually did.

Missing or unstated reference files are a quiet killer. Which Ensembl release? Which genome assembly: GRCh38 or GRCh38.p14? Primary assembly or with ALT contigs? The GTF annotation file you use affects gene count substantially for genes with complex alternative splicing. These choices are rarely documented explicitly, and the file URLs that appear in READMEs routinely expire.

Hardcoded paths break the moment anyone tries to run the pipeline on a different machine, which is to say immediately. A script that opens /home/jsmith/projects/liver_rnaseq/data/raw/ is only runnable by one person on one server, and often not even by that person two years later.

Undocumented filtering steps produce the most scientifically consequential failures. If you removed lowly expressed genes with a threshold that is not stated, if you excluded samples based on PCA inspection without recording which samples and why, if you trimmed the DEG list for pathway analysis with a secondary filter not mentioned in the paper: these choices silently shape the published result. Someone trying to reproduce the analysis will get different numbers and have no way to know whether the difference is in their setup or in the undocumented step they cannot see.

The Minimum Viable Reproducibility Checklist

Most of the above failures are preventable with straightforward habits. The following five items, captured at the time the analysis is run, are sufficient to enable reproduction in most cases.

Five things every RNA-seq run must capture

  1. Tool names and exact versions for every step: aligner, quantifier, DE tool, enrichment tool. Use sessionInfo() in R and conda env export or pip freeze in Python environments. 2. Reference genome assembly and annotation release, with a stable URL or DOI. 3. A parameter file or configuration record for every tool that accepts non-default parameters. 4. The sample-to-condition mapping table, including batch assignments if applicable. 5. Any filtering or exclusion decisions made during or after analysis, with the criteria stated explicitly.

None of these require new infrastructure. A plain text file committed to the same repository as the analysis code covers all five. The discipline is not in the tools; it is in deciding that the five minutes required to write it down are worth the hours of confusion they prevent for anyone who reads the paper later, including yourself six months from now.

The one item that is genuinely harder than it looks is stable reference file access. DOIs for genome releases exist but are not universally used. Archiving the exact GTF and FASTA files you used in a project-level data directory is the most reliable solution, even though it adds storage. The alternative is a URL that will break.

Why Containers Do Not Solve This

Docker and Singularity are the standard answer to the reproducibility problem. Containerize the environment, and anyone who has the container can reproduce the run. In practice, containers solve the dependency problem but leave the other four failure modes untouched.

A container that captures the exact versions of every tool does not document the parameters those tools were run with. It does not record which reference genome was used. It does not explain the filtering decisions. It does not make hardcoded paths portable if those paths do not exist inside the container. And a container image that is not archived in a stable registry will eventually become inaccessible, reproducing the broken-URL problem at a different layer of the stack.

Containers are necessary but not sufficient. They solve the “which version of STAR” question definitively. They do not solve any of the other four questions.

The deeper issue is that reproducibility requires documentation of intent, not just environment. A container captures what was installed. It does not capture what the researcher decided to do with it.

Diagram showing the components of a complete RNA-seq run snapshot: tool versions, parameters, reference files, sample mapping, and filtering decisions
Figure 1: The anatomy of a reproducible RNA-seq run. A container captures only the tool version layer. The remaining four layers require explicit documentation.

Reproducibility-by-Default vs DIY

There are two architectural approaches to this problem, and they have different cost profiles.

The DIY approach puts the reproducibility burden on the individual researcher. You write the pipeline, you document the parameters, you commit the session info, you archive the reference files. With discipline and good tooling, this works. It requires that everyone on the team exercises that discipline consistently, that the institutional infrastructure exists to archive files long-term, and that the pipeline is maintained as tools and references update. In practice, most academic labs do not sustain this consistently.

The reproducibility-by-default approach builds the documentation into the platform. Every analysis run produces a snapshot automatically: the tool versions used, the parameters applied, the reference genome accessed, the sample mapping uploaded. The snapshot is stored with a stable identifier. Reproducing the analysis means pointing the platform at the same snapshot ID.

This is not a novel idea in software engineering. Deterministic builds, content-addressed storage, and immutable infrastructure have been standard practice in production software for over a decade. Bioinformatics has been slow to adopt these patterns, partly because the field grew up in academic computing environments where long-term reproducibility was not a primary concern.

NotchBio generates a permalink for every RNA-seq run that captures tool versions, parameter settings, reference genome version, and sample-to-condition mapping in a single record. Rerunning from that permalink reproduces the analysis with the same configuration. For labs that want publication-grade reproducibility without building the infrastructure themselves, this is what the default should look like.

A More Honest Conversation About Standards

The bioinformatics community has been having the reproducibility conversation for at least a decade. The papers documenting the problem are well-cited. The solutions are not technically complex. The gap between knowing what is needed and doing it consistently is a cultural and incentive problem, not a technical one.

Peer review could close this gap quickly. A reviewer who requires a reproducibility checklist as a condition of acceptance would shift the behavior of every author who submits to that journal. The tools exist. The documentation standards exist. What has been missing is the expectation that they will be used.

Until that expectation is enforced externally, the researchers who benefit most from building reproducibility habits are the ones whose future selves will thank them. The analysis you document properly today is the one you can hand to a new lab member in two years without spending a week reconstructing what you did. That is not an altruistic argument. It is a practical one.

Further reading

Read another related post

View all posts