Back to blog
Bioinformatics

GTF and GFF Files: Why They Hurt and How To Tame Them

By Abdullah Shahid · · 11 min read

You download a GTF file for your organism of interest. You run it through your aligner. The aligner either crashes, silently produces wrong counts, or succeeds in a way that only reveals its problems three steps later when featureCounts cannot find the feature type it was told to look for. You download a second GTF from a different database for the same genome. It is structured differently. You spend an afternoon figuring out why two annotation files for the same genome disagree on where a gene ends.

Every generation of bioinformaticians goes through this. It is not a rite of passage. It is a consequence of annotation formats that were designed incrementally by different groups with different priorities over several decades, and a community norm that treats getting your GTF working as the researcher’s problem rather than an infrastructure problem worth solving properly.

This post will not make GTF files fun. It will give you a map of what you are actually dealing with, the specific failures that appear most often, and the fastest paths through them.

Why GTF and GFF Formats Disagree

GTF and GFF3 are not the same format with different file extensions, though they are often treated as interchangeable. GTF (Gene Transfer Format) emerged from the Ensembl project and the early microarray era as a simpler derivative of GFF2. GFF3 (General Feature Format version 3) is a later, more formally specified standard with richer hierarchical relationships between features. The two formats have different field conventions, different ways of representing parent-child relationships between genes, transcripts, and exons, and different conventions for attribute quoting and key-value separation.

The practical consequence is that a tool expecting GTF format will often fail on a GFF3 file, and vice versa, even when both files describe the same genome annotation. This is predictable and unavoidable given the history.

The source divergence compounds the format divergence. NCBI, Ensembl, and GENCODE all provide annotation files for the human genome, and all three differ in meaningful ways beyond format. NCBI uses its own chromosome naming convention (NC_000001.11 for chromosome 1 in GRCh38). Ensembl uses the bare chromosome number (1). UCSC uses the chr prefix (chr1). Using a STAR index built with one naming convention and a GTF file using a different one produces zero-mapped reads with no error message, which is one of the most confusing failure modes in RNA-seq analysis.

Feature hierarchy conventions also differ. GENCODE GTF files contain gene, transcript, exon, CDS, UTR, and start/stop codon features. NCBI RefSeq GFF3 files structure the same information differently, with mRNA as the parent of exon rather than transcript, and region features at the top level that some tools interpret as chromosome-level entries and others ignore. AGAT, the self-described GTF/GFF Swiss army knife, handles the conversion between these structures but introduces its own complications.

Diagram showing the GTF feature hierarchy from gene to transcript to exon and CDS, with annotations showing which levels STAR, featureCounts, and 10x Genomics cellranger require
Figure 1: The standard GTF feature hierarchy. Tools differ in which levels they require: STAR needs exon and transcript, featureCounts needs exon, and 10x cellranger requires transcript-level entries that some GFF3 files omit.

The Specific Gotchas That Break Pipelines

Understanding the general history is useful context. Understanding the specific failure modes is what saves you time at 11pm when your alignment job is failing.

Missing transcript-level features in prokaryotic GFF3 files. Bacterial genome annotations from NCBI RefSeq typically structure genes with CDS features as direct children of gene features, with no intermediate mRNA or transcript entry. Tools like 10x Genomics Cell Ranger and some versions of the nf-core RNA-seq pipeline explicitly require a transcript-level entry between the gene and the exon. When it is missing, the pipeline fails with a message about unrecognised feature types or, worse, silently produces a count matrix with zeros for every gene.

The nf-core/rnaseq GitHub issue tracker has a well-documented thread on this exact problem: the pipeline would fail on prokaryotic GTF files because the assumption of an exon feature was baked into the workflow. The resolution was to edit the GTF to add synthetic transcript entries, which AGAT can do but does not always do correctly on the first attempt.

Chromosome naming mismatches. As described above, this is the silent killer. STAR builds an index using the chromosome names in the FASTA file. If the GTF uses different chromosome names, STAR will report a very high mapping rate against the genome but zero or near-zero reads will be assigned to genes, because the GTF chromosome names do not match the index. There is no error. The numbers just do not make sense.

Check this before alignment: grep "^>" genome.fa | head -5 shows your FASTA chromosome names. cut -f1 annotation.gtf | sort -u | head -5 shows your GTF chromosome names. They must match exactly, including the chr prefix or lack thereof.

Coordinates that extend beyond the chromosome length. Some GFF3 files from NCBI contain feature coordinates that exceed the reported length of the chromosome, usually by a few bases. This is technically a malformed file. Most tools ignore it silently. A few, including some versions of AGAT and some validation tools, will either fail or strip the affected features during processing, causing genes near chromosome ends to disappear from your count matrix.

Attribute quoting differences. GTF format uses double-quoted attribute values: gene_id "ENSG00000139618". GFF3 format uses unquoted values in key=value notation: ID=gene-BRCA2. Tools parsing GTF that receive GFF3 input, or vice versa, may parse attribute fields incorrectly and produce empty or malformed feature names. This is hard to diagnose from error messages alone because the tool often succeeds but assigns all reads to a single gene with a garbled name.

AGAT: Where It Helps and Where It Hurts

AGAT (Another GTF/GFF Analysis Toolkit) is the community’s most-used tool for GTF/GFF manipulation and format conversion. It is genuinely useful, and it is also the source of some of the most confusing secondary problems in GTF troubleshooting.

Where AGAT reliably helps: converting GFF3 to GTF format when the input is well-formed, adding transcript-level entries to files that only have gene and exon, fixing simple coordinate errors, and checking files for consistency against the GFF3 specification. The agat_convert_sp_gff2gtf.pl script handles the most common conversion use case adequately for most Ensembl and GENCODE source files.

Where AGAT introduces problems: NCBI RefSeq GFF3 files contain region features at the top level that represent chromosomes, scaffolds, and organelles. AGAT sometimes treats these as gene-level entries during conversion, which can inflate your annotation with thousands of spurious features. The output GTF may be syntactically valid while being scientifically wrong. Filtering the output with grep -v "^#" annotation.gtf | awk '$3 != "region"' after conversion is a safe habit.

AGAT also performs semantic inference during conversion, meaning it will add features it believes are implied by the structure of the input file. In most cases this is helpful. In cases where the input file has non-standard structure, AGAT may infer the wrong feature hierarchy and produce a GTF that is internally consistent but does not match the original annotation. When in doubt, validate the output with agat_sq_stat_basic.pl and compare gene counts to the source.

Validate AGAT output before alignment

AGAT performs semantic inference during GFF3-to-GTF conversion. Always check the gene and transcript counts in the converted file: agat_sq_stat_basic.pl —gff converted.gtf. If the count differs substantially from the source annotation, the conversion introduced errors. Compare a few specific genes manually using grep before proceeding to alignment.

A Normalization Script That Handles 80 Percent of Cases

For the most common scenario, a well-formed GFF3 or GTF file from Ensembl, GENCODE, or NCBI that needs to be used with a tool expecting standard GTF, the following bash workflow resolves the most frequent issues without AGAT.

#!/bin/bash
# gtf_normalize.sh : basic GTF sanity checks and chr-prefix normalization
# Usage: bash gtf_normalize.sh input.gtf genome.fa output.gtf
INPUT_GTF=$1
GENOME_FA=$2
OUTPUT_GTF=$3
# Step 1: Check chromosome naming in genome FASTA
echo "=== Genome chromosome names (first 5) ==="
grep "^>" "$GENOME_FA" | head -5 | sed 's/>.*//' | cut -d' ' -f1
# Step 2: Check chromosome naming in GTF
echo "=== GTF chromosome names (first 5 unique) ==="
grep -v "^#" "$INPUT_GTF" | cut -f1 | sort -u | head -5
# Step 3: Detect chr-prefix mismatch and fix if needed
GENOME_HAS_CHR=$(grep "^>" "$GENOME_FA" | head -1 | grep -c "^>chr")
GTF_HAS_CHR=$(grep -v "^#" "$INPUT_GTF" | head -1 | grep -c "^chr")
if [ "$GENOME_HAS_CHR" -eq 1 ] && [ "$GTF_HAS_CHR" -eq 0 ]; then
echo "Adding chr prefix to GTF chromosome names..."
grep -v "^#" "$INPUT_GTF" | sed 's/^\([0-9XYM]\)/chr\1/' > "$OUTPUT_GTF"
elif [ "$GENOME_HAS_CHR" -eq 0 ] && [ "$GTF_HAS_CHR" -eq 1 ]; then
echo "Stripping chr prefix from GTF chromosome names..."
grep -v "^#" "$INPUT_GTF" | sed 's/^chr//' > "$OUTPUT_GTF"
else
echo "Chromosome naming consistent. Copying GTF..."
grep -v "^#" "$INPUT_GTF" > "$OUTPUT_GTF"
fi
# Step 4: Check for transcript-level features (required by many tools)
TRANSCRIPT_COUNT=$(grep -v "^#" "$OUTPUT_GTF" | awk '$3 == "transcript"' | wc -l)
echo "=== Transcript feature count: $TRANSCRIPT_COUNT ==="
if [ "$TRANSCRIPT_COUNT" -eq 0 ]; then
echo "WARNING: No transcript features found. Tools requiring transcript level may fail."
echo "Consider running: agat_convert_sp_gff2gtf.pl --gff $INPUT_GTF -o $OUTPUT_GTF"
fi
# Step 5: Check for coordinates beyond chromosome bounds (basic)
echo "=== Max coordinate in GTF ==="
grep -v "^#" "$OUTPUT_GTF" | awk '{if($5>max) max=$5} END{print max}'

This script does not replace AGAT for complex conversions. It catches the two most common immediate failures (chr-prefix mismatch and missing transcript features) before you discover them through a failed alignment run.

The Source-by-Source Decision Matrix

SourceFormatChr namingHas transcript levelCommon problemsRecommended use
EnsemblGTFNo chr prefix (1, 2, X)YesExtra scaffolds and patches inflate annotationRemove non-primary contigs: grep ”^[0-9XYM]“
GENCODEGTFchr prefix (chr1, chrX)YesVery large file, includes pseudogenes and lncRNA by defaultUse primary assembly file
NCBI RefSeqGFF3NC_ accessionsUsually for eukaryotesRegion features, NC_ naming mismatches, no transcript in prokaryotesConvert with AGAT, validate output
UCSCGTFchr prefix (chr1, chrX)YesBased on NCBI or Ensembl with coordinate lift-over; may lag current releaseMatch genome assembly version carefully
Ensembl BacteriaGFF3Accession-basedNo (prokaryote)No exon features in some releasesAdd synthetic transcript and exon features

When To Download a Different Version vs When To Fix What You Have

The most time-efficient decision rule is this: if you are more than one conversion step away from a valid GTF for your tool, download a different source file.

AGAT is reliable for one-step conversions: GFF3 to GTF, adding transcript features to a well-formed file, stripping region entries from a RefSeq file. If you find yourself running AGAT, validating the output, finding problems, running AGAT again with different flags, and still getting unexpected gene counts, the time cost of that iteration almost always exceeds the time cost of finding a pre-converted GTF from a different source.

For human and mouse, Ensembl and GENCODE provide high-quality GTF files that work with STAR, HISAT2, Salmon, and featureCounts with minimal preprocessing. The chr-prefix difference between Ensembl and GENCODE is the only regular mismatch you need to manage, and the normalization script above handles it in seconds.

For less-studied organisms, NCBI RefSeq is often the only option. In those cases, AGAT conversion is necessary, and the validate-before-proceeding habit is mandatory.

For prokaryotes specifically, adding synthetic transcript and exon entries from AGAT is the standard path. Verify the output transcript count matches your expected gene count and inspect three to five genes manually before running alignment.

NotchBio validates uploaded GTF and GFF files automatically before any pipeline run begins, flagging missing transcript-level features, chr-prefix mismatches against the selected reference genome, and coordinates that exceed chromosome bounds. For most standard organisms, it selects and validates the reference annotation without requiring a manual download.

Further reading

Read another related post

View all posts