Question

featureCounts fails to load annotation file

yasmin.hilliam ▴ 10

Hi,

I am running into issues loading my GTF annotation file into featureCounts. I have used featureCounts previously on this dual-seq dataset to count reads aligned to a bacterial genome, but now I want to examine the reads aligned to the human genome. I downloaded the reference genome and the GTF annotation file at the same time and from the same source, NCBI (GCF_000001405.40), and aligned the reads using hisat2. featureCounts consistently fails with this error:

Failed to open the annotation file ~/20241010.GCF_000001405.40.NCBI.RefSeq.gtf, or its format is incorrect, or it contains no 'transcript' features.

I have tried counting exons, transcripts, and genes using the -t flag, both with and without the -g flag. I have also tried removing the comment lines from the top of my GTF file, and editing the GTF as described in this thread using the following code:

cut -d';' -f1 20241010.GCF_000001405.40.NCBI.RefSeq.gtf.gtf > 20241010.GCF_000001405.40.NCBI.RefSeq.edited.gtf

So far nothing has worked. My script is included below as well as the heads of both the original and edited versions of the GTF.

I have used full file paths throughout the script; I have only removed them here because they contain my username.

#!/bin/bash

#SBATCH --job-name=featurecounts.human

#SBATCH -o ~/slurm.out/featurecounts.human.out

#SBATCH -e ~/slurm.out/featurecounts.human.err

# Number of compute nodes
#SBATCH --nodes=1

# Number of tasks per node
#SBATCH --ntasks-per-node=2

# Number of CPUs per task
#SBATCH --cpus-per-task=4

# Request memory
#SBATCH --mem=8G

# Walltime
#SBATCH --time=48:00:00

# Set array
#SBATCH --array=1-16

# Email notifications (comma-separated options: BEGIN,END,FAIL)
#SBATCH --mail-type=BEGIN,FAIL,END

source /optnfs/common/miniconda3/etc/profile.d/conda.sh

filelist=($(cat ~/coinfection.rnaseq.files.txt))

i=${filelist[${SLURM_ARRAY_TASK_ID}]}

conda activate featurecounts

featureCounts \
-a ~/20241010.GCF_000001405.40.NCBI.RefSeq.edited.gtf \
-o ~/featurecounts.human/"$i" \
~/human.aligned.reads/"$i".aligned.reads.sorted.bam \
-O -t transcript -g gene_id -f -d 50 --verbose

[]$ head -n 7 20241010_GCF_000001405.40_GRCh38.p14.NCBI_RefSeq.gtf 
#gtf-version 2.2
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
NC_000001.11    BestRefSeq  gene    11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true"; 
NC_000001.11    BestRefSeq  transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; db_xref "GenBank:NR_046018.2"; db_xref "HGNC:HGNC:37102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript";

[]$ head -n 7 20241010_GCF_000001405.40_GRCh38.p14.NCBI_RefSeq.edited.gtf 
#gtf-version 2.2
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 08/23/2024
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2024_08
NC_000001.11    BestRefSeq  gene    11874   14409   .   +   .   gene_id "DDX11L1"
NC_000001.11    BestRefSeq  transcript  11874   14409   .   +   .   gene_id "DDX11L1"

I'm at a bit of a loss and would appreciate any help.

featurecounts rnas RNASeqPower • 1.2k views

Posted 6 months ago by yasmin.hilliam

score 0 · Answer 1 · 2024-10-11

I imagine Wei Shi will be along later with actual help, but for now do note that you can use an SAF file instead of a GTF. It's easy enough to do that.

> library(txdbmaker)
> txdb <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz")
> ex <- exonsBy(txdb, "gene")
## collapse overlapping exons
> ex <- reduce(ex)
## turn into a GRanges
> exgr <- unlist(ex)
> saf <- data.frame(names(exgr), as.character(seqnames(exgr)), start(exgr), end(exgr), strand(exgr))
> head(saf)
  names.exgr.
1        A1BG
2        A1BG
3        A1BG
4        A1BG
5        A1BG
6        A1BG
  as.character.seqnames.exgr..
1                 NC_000019.10
2                 NC_000019.10
3                 NC_000019.10
4                 NC_000019.10
5                 NC_000019.10
6                 NC_000019.10
  start.exgr. end.exgr. strand.exgr.
1    58345183  58347029            -
2    58347353  58347640            -
3    58350370  58350651            -
4    58351391  58351687            -
5    58352283  58352555            -
6    58352928  58353197            -

> write.table(saf, "thesaffile.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)

And then you can just point featureCounts to the SAF file instead of the GTF, adding -F SAF so that it knows the annotation format.
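Before handing the SAF file to featureCounts, it may be worth a quick sanity check. The following is only a sketch: `count_bad_saf_lines` is a hypothetical helper (not part of Subread or Rsubread), the demo rows are copied from the `head(saf)` output above, and it assumes the file has no header line, as written by the `write.table` call with `col.names = FALSE`.

```shell
#!/bin/bash
# Hypothetical helper (not part of Subread): count lines that do not have
# exactly five tab-separated columns, or whose Start/End (columns 3 and 4)
# are not plain integers. Assumes the SAF file has no header line.
count_bad_saf_lines() {
    awk -F'\t' 'NF != 5 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ { bad++ }
                END { print bad + 0 }' "$1"
}

# Demo on a small stand-in file; replace with the real SAF path.
demo_saf="$(mktemp)"
printf 'A1BG\tNC_000019.10\t58345183\t58347029\t-\n' > "$demo_saf"
printf 'A1BG\tNC_000019.10\t58347353\t58347640\t-\n' >> "$demo_saf"

bad_lines="$(count_bad_saf_lines "$demo_saf")"
echo "malformed lines: $bad_lines"
rm -f "$demo_saf"
```

A non-zero count would suggest the table was written with a different separator or with extra columns such as row names.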

score 0 · Answer 2 · 2024-10-13

Hi Yasmin, I'm following up on the issue you're having with featureCounts and the GRCh38 RefSeq annotation. I tested featureCounts from subread-2.0.7 using the same GTF file downloaded from the NCBI RefSeq FTP site. No error was reported, and featureCounts generated the count table. I used this command:

./subread-2.0.7-Linux-x86_64/bin/featureCounts \
    -t transcript -g gene_id \
    -a GCF_000001405.40_GRCh38.p14_genomic.gtf \
    -o del4.FC -p -O \
    -f -d 50 --verbose \
    test-minimum.bam

This produced the following count table:

$ head del4.FC
# Program:featureCounts v2.0.7; Command:"./subread-2.0.7-Linux-x86_64/bin/featureCounts" "-a" "GCF_000001405.40_GRCh38.p14_genomic.gtf" "-o" "del4.FC" "-p" "-O" "-t" "transcript" "-g" "gene_id" "-f" "-d" "50" "--verbose" "test-minimum.bam"
Geneid  Chr     Start   End     Strand  Length  test-minimum.bam
DDX11L1 NC_000001.11    11874   14409   +       2536    0
WASH7P  NC_000001.11    14362   29370   -       15009   0
MIR6859-1       NC_000001.11    17369   17436   -       68      0
MIR6859-1       NC_000001.11    17369   17391   -       23      0
MIR6859-1       NC_000001.11    17409   17431   -       23      0
MIR1302-2HG     NC_000001.11    29774   35418   +       5645    0
MIR1302-2       NC_000001.11    30366   30503   +       138     0
MIR1302-2       NC_000001.11    30438   30458   +       21      0

In other words, featureCounts correctly loaded the "transcript" annotations and used "gene_id" as the feature names. Can you provide the version number of your featureCounts program? Also, you said that you used full paths in the script, and it seems that your script was run on SLURM. Can you try adding this command to your script before calling featureCounts:

ls -l /FULL_PATH_TO/20241010.GCF_000001405.40.NCBI.RefSeq.gtf

This will show whether the compute node has access to the file. You don't need to edit the GTF file to make featureCounts work with it.
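The access check can be extended slightly inside the job script. This is a minimal sketch, not part of Subread: `check_gtf` is a hypothetical helper, and the demo file stands in for the real GTF. It confirms that the file is readable from the node where the job runs, and counts the rows whose third column matches the feature type passed to -t.

```shell
#!/bin/bash
# Hypothetical helper (not part of Subread): confirm that an annotation file
# is readable from this node, then count the rows whose third column equals
# the feature type that will be given to featureCounts via -t.
check_gtf() {
    local gtf="$1" feature="$2"
    if [ ! -r "$gtf" ]; then
        echo "cannot read: $gtf" >&2
        return 1
    fi
    # Skip comment lines; GTF columns are tab-separated.
    awk -F'\t' -v f="$feature" '!/^#/ && $3 == f { n++ }
                                END { print n + 0 }' "$gtf"
}

# Demo on a tiny stand-in file; replace with the full path to the real GTF.
demo_gtf="$(mktemp)"
printf '#gtf-version 2.2\n' > "$demo_gtf"
printf 'NC_000001.11\tBestRefSeq\tgene\t11874\t14409\t.\t+\t.\tgene_id "DDX11L1";\n' >> "$demo_gtf"
printf 'NC_000001.11\tBestRefSeq\ttranscript\t11874\t14409\t.\t+\t.\tgene_id "DDX11L1";\n' >> "$demo_gtf"

n_transcripts="$(check_gtf "$demo_gtf" transcript)"
echo "transcript rows: $n_transcripts"
rm -f "$demo_gtf"
```

If the helper reports that the file cannot be read, the problem is with the path or permissions on the compute node. A count of 0 on a file that visibly contains transcript rows would instead suggest that the columns are not tab-separated.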