I am trying to use DEXSeq to analyze a mouse paired-end stranded RNA-seq data (poly-A RNA capture). I Aligned the reads using STAR and obtained the reference genome and annotation file (which I collapsed with dexseq_prepare_annotation.py) from GENCODE (release m11). I used HTSeq to count exons (using command "htseq-count -m union -s reverse -i gene_id -t exon) and over 80% mapped to a feature.
I set up my countFiles, collapsed annotation file and sampleTable in R:
countFiles <- list.files(directory, pattern=".txt", full.names = TRUE)
sampleTable <- data.frame(row.names = c("WT_RNA_1", "WT_RNA_2", "WT_RNA_3", "KO_RNA_1", "KO_RNA_2", "KO_RNA_3"), condition = c("WT", "WT", "WT", "KO", "KO", "KO"), libType = c("paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end"))
When I try to create a DEXSeqdataset it gives me an out of bounds error:
dxd <- DEXSeqDataSetFromHTSeq(countFiles, sampleData = sampleTable, design= ~sample + exon + condition:exon, flattenedfile = flattenedFile)
Error in FUN(X[[i]], ...) : subscript out of bounds
I have looked over the count files and the collapsed annotation and I can't seem to figure out where the mismatch between these files is occurring. What is curious is that I do not see this error when I use DEXSeq to analyze RNA-seq data from human cell lines. Is there something I am missing regarding alignment and counting of the mouse data set?
Thank you for your help
Hi, it's probable that either the gtf file or that one or more of the count files are corrupted. Could you check that count files have all the same number of lines and that the gtf file has the same number of exonic regions as the number of lines of the count files?
Alejandro
Hi Alejandro,
It turns out it was likely the GENCODE gtf file. Switched to the mouse annotation file on Ensembl and it worked well. It is still odd to me considering the count files were created using the GENCODE gtf so I thought it would still result with the same number of lines but I guess not.
Mthabisi