Hello,
My team would greatly appreciate assistance with running featureCounts using the human NCBI T2T assembly (assembly (T2T-CHM13v2.0) as a reference; when we run it we end up with nearly 14,000 fewer genes than what the annotation supposedly contains.What (if any) modifications can be made to run Subread or RSubread featureCounts on the new NCBI release and obtain counts data for all ~57 thousand genes supposedly in the assembly?
Typically my lab has used the Ensembl/Gencode GRCh38 for mapping total RNA-Seq projects with STAR, and then counted reads with the featureCounts command from Subread or Rsubread. We are attempting to set up new pipelines with the latest release 43 (GRCh38.p13) at gencode, https://www.gencodegenes.org/human/. In parallel we have attempted to implement a pipeline to run in with the new NCBI T2T assembly from https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4), in case that newer assembly may improve gene-level biomarker discovery. This assembly supposedly has 57,771 genes, per awk code that a staff scientist at NCBI provided my colleague:
Number of unique genes:
zcat GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz | grep -v "^#" | awk 'BEGIN {FS="\t"; OFS= "\t"} $3=="gene" {print $9}' | cut -f 3 -d ";" | sort | uniq | wc -l)
57771
Number of gene placements in the GTF:
zcat GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz | grep -v "^#" | cut -f 3 | sort | uniq -c
699164 CDS
1017051 exon
57816 gene
65594 start_codon
65411 stop_codon
112905 transcript
However, with code such as the text below using, we end up with a much smaller number of genes, closer to 42-43,000 thousand. We have even tried implementing this specifying gene id to export Entrez IDs by adding the flag <-g db_xref> in case the default gene_id identifier was causing an issue. The NCBI staff scientist suggested that the difference of our tabulated genes vs the GTF assembly annotation of over 57 thousand, reflects unprocessed pseudogenes. We suspect this may have to do with how the NCBI GTF is set up, and are wondering if there is some way to modify the call of how the GTF is used by Subread/Rsubread to obtain an output that is more similar to the output of Gencode and reflects all 57,771 genes. We tried searching for other posts on the topic and found on one biostars at https://www.biostars.org/p/9547097/#google_vignette , leading us to attempt to use the T2T SAF at https://bioinf.wehi.edu.au/Rsubread/annot/ but that led to 0% alignments because the SAF format was chr1, chr2 etc and the NCBI designations differ (NC_060930.1).
What (if any) modifications can be made to run Subread or RSubread featureCounts on the new NCBI release and obtain counts data for all ~57 thousand genes? Many thanks in advance for your consideration.
Run of featureCounts
Creates a directory for the output of Subread and then changes directories to /run/media/userT/Padlock_DT/paired/results/T2T/STAR
mkdir -p /run/media/userT/Padlock_DT/paired/results/T2T/counts/
cd /run/media/userT/Padlock_DT/paired/results/T2T/STAR/
Runs featureCounts in paired-end mode, reverse strand, with 32 threads.
featureCounts -T 32 -s 2 -p -C FALSE -d 50 --verbose --countReadPairs \
-a $gtf_T2T \
-o /run/media/userT/Padlock_DT/paired/results/T2T/counts/T2T_featurecounts.txt \
The above was from Subread and NOT an R session so we do not have that, but the version of Subread used was 2.0.3; we plan to upgrade to 2.04.
Yet if it helps, I believe the Rsubread translation may be akin to code I ran in R a while ago, updating that to reflect the newer GTF would be roughly equivalent to:
MyOutput=featureCounts(files=RBam, isPairedEnd=TRUE, annot.ext="GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf", countChimericFragments=FALSE, minFragLength=50, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=32, countReadPairs=TRUE, verbose=TRUE)
sessionInfo( ) R version 4.0.5 (2021-03-31) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux 8.4 (Ootpa)
Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.12.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] Rsubread_2.4.2
loaded via a namespace (and not attached): [1] compiler_4.0.5 Matrix_1.3-2 grid_4.0.5 lattice_0.20-41
```
featureCounts only counts genes that have an
exon
feature by default. I suspect this is the reason why not all the genes were counted.Where in the documentation can I find this? In the user guide I see
exon
as default feature.Sorry I meant to say
exon
. I have corrected this.