Industrial Bioinformatics Is Still In Its Infancy

By Abdullah Shahid · 10 min read

A bioinformatician with fifteen years of experience, working at the intersection of academic genomics and pharmaceutical applications, posted an observation to the r/bioinformatics community that drew nearly 600 upvotes and 66 comments. The post did not describe a new tool or a methodological finding. It described a professional frustration that many in the field recognized immediately: despite the sophistication of the science being done, most bioinformatics groups, even in commercial settings, operate with the instincts and infrastructure of a graduate student’s computing environment scaled up by accident.

The observation deserves a direct response. Not as a criticism of the people doing the work, most of whom are excellent scientists operating under constraints they did not choose. But as an honest assessment of where the field sits relative to where it needs to be, and what the gap actually consists of.

Bioinformatics is still in its infancy. The field moves fast but the infrastructure underneath it is often embarrassingly immature. We use tools held together with bash scripts and prayer, and call it a pipeline.

r/bioinformatics community, 598 upvotes, 66 comments

What the Senior Practitioner Is Actually Observing

The frustration expressed in that post is not about the algorithms or the science. Those are, in many cases, genuinely sophisticated. The frustration is about the operating layer below the science: the infrastructure, the process, the norms of practice.

In software engineering, a field that has had four decades to formalize its dysfunction and develop countermeasures, there are established practices for building systems that work reliably under real-world conditions. Code is version controlled. Changes are reviewed before they are merged. Systems have tests that run automatically when something changes. Deployments are tracked. The behavior of a system at any point in the past can be recovered, not from someone’s memory, but from a commit log.

Most bioinformatics groups, including those in well-funded pharmaceutical and biotech companies, operate with a fraction of this infrastructure in place. The reasons are understandable: the people building bioinformatics systems were trained as scientists, not engineers. The incentive system rewards publications, not maintainable code. The career path does not typically distinguish between someone who produces good science with poor software practices and someone who produces good science with excellent ones.

The result is what the poster described: tools held together by bash scripts and prayer, pipelines that work until someone changes a dependency version, analyses that cannot be reproduced because the person who ran them has left the lab, and codebases that a new team member cannot read without a week of explanation from the original author.

What Industry Actually Needs

The distinction between what academia rewards and what commercial bioinformatics requires is worth making explicit.

Academic bioinformatics optimizes for discovery. Novel findings, publishable results, demonstrated capability. Reproducibility is a virtue but not a hard requirement. Code that works once is sufficient to generate a paper.

Commercial bioinformatics optimizes for reliability. An analysis that fails one time in twenty is not a minor inconvenience; it is a production incident. An analysis whose results cannot be traced back to specific inputs and specific tool versions is a compliance problem. An analysis that only one person understands is a business continuity risk.

These are genuinely different optimization targets, and they require different practices. The transition from academic to industrial bioinformatics is not primarily a scientific transition. It is an engineering transition. Scientists who make it successfully are not those who learn more biology or more statistics, but those who develop the engineering habits that make complex computational systems reliable.

The question to ask your team: are you doing engineering or research?

Research accepts uncertainty as a feature. You are exploring territory you do not understand, and your process reflects that. Engineering assumes you understand what you want to build and optimizes for doing it reliably and repeatably. Bioinformatics pipelines in commercial settings almost always need to be engineering, even when the science inside them is novel. The failure to distinguish these two modes is one of the most common causes of technical debt accumulation in genomics groups.

The Engineering Practices That Close the Gap

The 2015 paper “Engineering Bioinformatics” by Lawlor and Walsh identified a significant and specific lack of software engineering practices in bioinformatics compared to commercial software development. A 2022 systematic review in PeerJ Computer Science identified the same core gaps: requirements gathering, documentation, testing, and integration. A 2024 paper in Bioinformatics added that bioinformaticians who work in organized teams with code review and knowledge sharing produce substantially better software quality. None of these findings are surprising. All of them are consistently ignored in practice.

The practices that close the gap are not exotic. They are the standard toolkit of software engineering applied to the specific constraints of bioinformatics.

Version control for everything, including data and results. Git for code is standard in most groups, but version-controlling the reference genome files, the configuration parameters, the sample manifests, and the output count matrices is less common. A commit log that can answer “what exactly was run against this sample on this date” is not a luxury in a clinical or regulatory context; it is a minimum.
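One way to make "what exactly was run against this sample" answerable is to commit a small manifest alongside the code. The following is a minimal sketch, not a prescribed method: the function names, the manifest layout, and the example parameter are all hypothetical. It records a SHA-256 content hash for each input file plus the parameters used, so the large files themselves need not live in Git.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large references do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, params: dict) -> dict:
    """Record a content hash for every input plus the exact parameters used.

    The layout ({"inputs": ..., "params": ...}) is an illustrative assumption.
    """
    paths = sorted(str(p) for p in input_paths)
    return {
        "inputs": {p: sha256_of_file(Path(p)) for p in paths},
        "params": params,
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest with sorted keys so diffs between runs stay readable."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Committing the resulting JSON file with each run means a changed reference genome or a changed parameter shows up as an ordinary diff in the commit log.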

Automated testing at the pipeline level. Bioinformatics pipelines fail in ways that do not produce error messages. A parameter change that silently inverts the direction of your fold changes, a reference genome mismatch that reduces mapping rate by 8 percent without raising an exception, a batch variable that gets dropped from the design matrix: none of these produce stack traces. Automated tests that run a known dataset through the pipeline and check that the output matches expected values catch these failures before they propagate into published results. Most bioinformatics pipelines have zero automated tests.
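A known-answer check of this kind can be very small. The sketch below is illustrative only: `log2_fold_change` is a hypothetical stand-in for a real pipeline step, and the three hand-checked cases are made-up values chosen so the expected results are exact. The point is that a silently inverted comparison fails the check even though nothing raises an exception.

```python
import math


def log2_fold_change(treated_mean: float, control_mean: float,
                     pseudocount: float = 1.0) -> float:
    """Hypothetical stand-in for a pipeline step: log2(treated / control)."""
    return math.log2((treated_mean + pseudocount) / (control_mean + pseudocount))


# Tiny hand-checked dataset: gene -> (treated, control, expected log2 fold change).
KNOWN_CASES = {
    "gene_up": (7.0, 1.0, 2.0),     # (7+1)/(1+1) = 4   -> log2 = 2
    "gene_down": (1.0, 7.0, -2.0),  # (1+1)/(7+1) = 1/4 -> log2 = -2
    "gene_flat": (3.0, 3.0, 0.0),   # equal means       -> log2 = 0
}


def check_known_cases(fold_change_fn, tolerance: float = 1e-9) -> list:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    for gene, (treated, control, expected) in KNOWN_CASES.items():
        observed = fold_change_fn(treated, control)
        if not math.isclose(observed, expected, abs_tol=tolerance):
            failures.append(f"{gene}: expected {expected}, got {observed}")
    return failures
```

Calling `check_known_cases(log2_fold_change)` returns an empty list, while a version with the arguments swapped reports both the up-regulated and the down-regulated gene. In a real pipeline the same pattern would run a small fixed dataset end to end and compare against stored outputs.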

Documentation written for the person who runs the pipeline, not the person who built it. The standard of documentation in a bioinformatics group is often “the code is self-documenting” or “I can explain it to you in a meeting.” Neither of these survives personnel turnover. What is needed is documentation that lets someone with the relevant scientific background, but no prior exposure to your codebase, run, understand, and maintain the pipeline. That is different from code comments, and it is almost entirely absent from most research pipelines.

Formal dependency management. A conda environment file pinning tool versions, or a container image with a specific tag and a documented provenance, is the minimum for reproducibility. Analyses that depend on whatever version of a tool is currently installed on the cluster are not reproducible by definition.
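Pinning only helps if something checks that the running environment still matches the pins. As a rough sketch, assuming the pinned versions are available as a plain dictionary (the function name and that input format are hypothetical), the standard-library `importlib.metadata` module can report drift for Python packages. It does not cover command-line tools such as aligners, which would need their own version checks.

```python
from importlib import metadata


def find_version_drift(pinned: dict) -> dict:
    """Compare installed Python package versions against pinned ones.

    Returns {package: (pinned_version, installed_version_or_None)} for
    every package whose installed version differs from the pin.
    """
    drift = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is pinned but not installed at all
        if installed != wanted:
            drift[name] = (wanted, installed)
    return drift
```

Running a check like this at the start of a pipeline, and refusing to continue when the result is non-empty, turns "whatever is installed on the cluster" into an explicit failure rather than a silent difference.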

A separation between research code and production code. Exploratory analysis, which is the majority of what bioinformaticians actually write, should look different from code that runs in a production pipeline. Notebooks, ad hoc scripts, and interactive analysis are appropriate for exploration. The code that gets used repeatedly, at scale, on samples that matter, needs the engineering treatment: version control, tests, documentation, dependency pinning.

The Academic vs Industrial Practice Gap

Practice | Academic norm | Industrial requirement | Why it matters
Version control | Git for code; often absent for data and configs | Git for code AND data AND configs AND results | Audit trail requires everything to be versioned
Code review | Rare; PI review is qualitative | Required before merging to production | Catches silent failures before they run on real samples
Automated testing | Almost never | Required for production pipelines | Catches regressions when dependencies update
Documentation | README.md if you are lucky | Runbooks written for the unfamiliar user | Enables personnel turnover without knowledge loss
Dependency pinning | conda env when it is remembered | Locked container with SHA-verified image | Ensures identical behavior across runs and machines
Incident tracking | “I remember that run failing” | Formal log with root cause analysis | Regulators and auditors require this in clinical contexts
Separation of research vs production code | Not made | Explicit distinction enforced | Prevents exploratory code from running on production samples
On-call rotation | Does not exist | Standard for production systems | Ensures issues are addressed without bottlenecking on one person

Why This Matters for Hiring

The gap between academic and industrial bioinformatics practice is most visible at the hiring interface. Groups looking for senior bioinformaticians are frequently disappointed by candidates who have excellent scientific credentials and poor engineering habits. Groups looking for software engineers who can work in bioinformatics find that many software engineers have the engineering habits but lack the domain knowledge to work productively without significant ramp time.

The hybrid role, sometimes called bioinformatics engineer or computational scientist, is the one that is actually needed and the one that the field has not yet developed a reliable pipeline to produce. Graduate training in bioinformatics still prioritizes algorithmic understanding over software engineering practice. The result is a hiring market where the most valuable practitioners are those who have learned both through experience, which means they are expensive and scarce.

The implication for groups building bioinformatics teams is not to lower the bar. It is to be explicit about which deficits are acceptable and which are not. A scientist who lacks engineering habits but is aware of the gap and motivated to close it is a very different hire than one who has never been told that the gap exists. The former can grow into the role. The latter will reproduce academic practices in a setting that requires something different.

Where Platforms Substitute for Engineering Culture

The engineering maturity problem in bioinformatics has a structural component that individual discipline cannot solve. Most bioinformatics groups are small. A two-person group doing RNA-seq analysis for a biotech company does not have the personnel to run code reviews, maintain a test suite, and document everything to on-call-ready standards while also doing the science.

For standard analyses, platforms that embed engineering practices in their architecture substitute for the engineering culture that small groups cannot sustain. A platform where every run is automatically version-pinned, automatically logged, and automatically generates reproducible output is implementing the practices the post above is describing, without requiring a team of four to maintain the infrastructure.

This is not an argument that platforms replace engineering expertise. Novel methods, custom pipelines, and genuinely new scientific questions still require people who can write production-quality code. It is an argument that the 90 percent of bioinformatics that consists of standard analyses should not require each team to build and maintain engineering-grade infrastructure it cannot sustain, and does not need to if the platform is built correctly.

NotchBio is built around the practices this post describes: locked runs, versioned parameters, reproducible outputs, automatic audit trails, and methods text generated from the run record rather than reconstructed from memory. The instincts it embodies are engineering instincts, not academic ones. For a field that is still catching up to that distinction, that matters more than any individual feature.

The senior practitioner who wrote that post was not wrong. Bioinformatics is still in its infancy as an engineering discipline. The question is whether your group’s practices are helping it grow up.
