Hi,
I am using DEXSeq to analyze differential exon usage. I did my alignment on GRCm39 (Ensembl) with fasta and .gtf file from Ensembl. Alignment was done by STAR (see below). When I run dexseq_count.py to quantify the reads on every exon in my .bam files, I see that the output file contains very few to no reads for each exon. In contrast, when I import the same .bam file into Igv, I see thousands of reads on the same exons. Why does the python script count only a few dozen in comparison? I think my alignment has worked perfectly, it is just the counting with the python scripts that is the problem. In addition, when I try later to import the counts.txt file with DEXSeqDataSetFromHTSeq. I get an error that is only resolved when I go back to the .gff file and the counts.txt and remove all the quotation marks. I would appreciate any help I can get.
EDIT: Since I am not getting any answer, can sb suggest how I could get a count matrix like the one below, where I am having counts per exon? I guess it would be possible to get sth like this with featureCounts. However, when I tried, I do not get the numbering for each exon (I get sth like this: ENSMUSG00000025902.7, without the ":002")
Here is an example output from the counts table:
"ENSMUSG00000000001":"001" 0
"ENSMUSG00000000001":"002" 0
"ENSMUSG00000000001":"003" 0
"ENSMUSG00000000001":"004" 0
"ENSMUSG00000000001":"005" 0
"ENSMUSG00000000001":"006" 0
removing the quotation marks:
ENSMUSG00000000001:001 0
ENSMUSG00000000001:002 0
ENSMUSG00000000001:003 0
ENSMUSG00000000001:004 0
ENSMUSG00000000001:005 0
ENSMUSG00000000001:006 0
output count table (bottom)
_ambiguous 30288 _ambiguous_readpair_position 0 _empty 43052386 _lowaqual 0 _notaligned 0
####===Genome Generate via STAR===###
STAR --runMode genomeGenerate --genomeDir ./genomes/GRCm39/ --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.106.gtf --sjdbOverhang 99 –-runThreadN 6 --genomeSAsparseD 2 --limitGenomeGenerateRAM 409086573952
####===Alignment via STAR===###
STAR --genomeLoad LoadAndExit --genomeDir GRCm39/
for i in $(ls *READ1* | sed s/-READ1-Sequences.fastq// | sort -u);
do
STAR --runThreadN 5 --genomeDir GRCm39/ --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --readFilesIn .projects/1day_4/${i}-READ1-Sequences.fastq ./projects/day_4/${i}-READ2-Sequences.fastq --outFileNamePrefix ./projects/day_4/bam_files/${i}.
done
STAR --genomeLoad Remove --genomeDir /genomes/GRCm39/
####===generating gff file===###
python /.python_scripts/dexseq_prepare_annotation.py "./GRCm39_ENSEMBL/Mus_musculus.GRCm39.106.gtf" GRCm39.106.gff
####===counting exons===###
!/bin/bash
for i in ./bam_files_ENSEMBL/*.bam
do
python ./python_scripts/dexseq_count.py -p yes -f bam -r pos ./GRCm39_ENSEMBL/Flattened_GRCm39.gff $i ${i}_counted.txt
done
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] writexl_1.4.0 pasilla_1.24.0 DEXSeq_1.42.0 RColorBrewer_1.1-3 BiocParallel_1.30.2
[6] edgeR_3.38.1 SingleCellExperiment_1.18.0 ensembldb_2.20.1 AnnotationFilter_1.20.0 GenomicFeatures_1.48.1
[11] AnnotationDbi_1.58.0 limma_3.52.1 DESeq2_1.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
[16] MatrixGenerics_1.8.0 matrixStats_0.62.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.2 IRanges_2.30.0
[21] S4Vectors_0.34.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] ProtGenerics_1.28.0 bitops_1.0-7 bit64_4.0.5 filelock_1.0.2 progress_1.2.2 httr_1.4.3
[7] tools_4.2.0 utf8_1.2.2 R6_2.5.1 DBI_1.1.2 lazyeval_0.2.2 colorspace_2.0-3
[13] tidyselect_1.1.2 prettyunits_1.1.1 bit_4.0.4 curl_4.3.2 compiler_4.2.0 cli_3.3.0
[19] xml2_1.3.3 DelayedArray_0.22.0 rtracklayer_1.56.0 scales_1.2.0 genefilter_1.78.0 rappdirs_0.3.3
[25] stringr_1.4.0 digest_0.6.29 Rsamtools_2.12.0 XVector_0.36.0 pkgconfig_2.0.3 dbplyr_2.1.1
[31] fastmap_1.1.0 rlang_1.0.2 rstudioapi_0.13 RSQLite_2.2.14 BiocIO_1.6.0 generics_0.1.2
[37] hwriter_1.3.2.1 dplyr_1.0.9 RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.4-1
[43] Rcpp_1.0.8.3 munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.1 stringi_1.7.6 yaml_2.3.5
[49] zlibbioc_1.42.0 BiocFileCache_2.4.0 grid_4.2.0 blob_1.2.3 parallel_4.2.0 crayon_1.5.1
[55] lattice_0.20-45 Biostrings_2.64.0 splines_4.2.0 annotate_1.74.0 hms_1.1.1 KEGGREST_1.36.0
[61] locfit_1.5-9.5 pillar_1.7.0 rjson_0.2.21 geneplotter_1.74.0 biomaRt_2.52.0 XML_3.99-0.9
[67] glue_1.6.2 BiocManager_1.30.18 png_0.1-7 vctrs_0.4.1 gtable_0.3.0 purrr_0.3.4
[73] assertthat_0.2.1 cachem_1.0.6 ggplot2_3.3.6 xtable_1.8-4 restfulr_0.0.13 survival_3.3-1
[79] tibble_3.1.7 GenomicAlignments_1.32.0 memoise_2.0.1 statmod_1.4.36 ellipsis_0.3.2
OK, I found the problem. I used Truseq lib prep and therefore I had wrongly chosen the strand orientation. I needed to specify -s reverse in the python dexseq_count.py script. However, I still need to remove the ""