Hello!
I am trying to run DEXseq on some paired-end data and I am getting some errors that I don't normally get.
I used the DEXSeq python scripts to prepare the data and I am using "Homo_sapiens.GRCh38.106.gtf" which I converted with 'python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.106.gtf Homo_sapiens.GRCh38.106.chr.gff'
I sorted the STAR output sam files by name
#!/bin/bash
cd /dataVol/data/Vasudevan/13Apr22/paired2/sams/ &&
FILES=*.sam
for f in $FILES
do
describer=$(echo ${f} | sed 's/.sam//')
echo $describer
# sort
samtools sort -O sam -@ 32 -n -T temp -o "sorted_"${describer}.sam ${describer}.sam
done
I then used the DEXseq scripts to count the data
#!/bin/bash
cd /dataVol/data/Vasudevan/17Apr22/DEX/SAMS/ &&
FILES=*.sam
for f in $FILES
do
echo "Processing $f file..."
describer=$(echo ${f} | sed 's/.sam//')
echo $describer
python3 /home/met/.local/R_Libraries/DEXSeq/python_scripts/dexseq_count.py -p yes -s reverse -r name Homo_sapiens.GRCh38.106.chr.gff ${describer}.sam ${describer}.txt
done
So then when I use the DEXSeq script I made from the Vignette.
#!/usr/bin/Rscript --vanilla --slave
## Change to data directory
setwd("/dataVol/data/Vasudevan/17Apr22/DEX/SAMS/")
today <- Sys.Date()
dt <- format(today, format="%d%b%y")
library(DEXSeq)
countFiles = list.files(getwd(), pattern="txt$", full.names=TRUE)
basename(countFiles)
flattenedFile = list.files(getwd(), pattern="gff$", full.names=TRUE)
basename(flattenedFile)
sampleTable = data.frame(row.names = c("control1", "control2", "control3", "control4", "test1", "test2", "test3", "test4"), condition = c("control", "control", "control", "control", "test", "test", "test", "test"), LibType = c("paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end"))
sampleTable
dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design = ~ sample + exon + condition:exon, flattenedfile=flattenedFile)
dxd = estimateSizeFactors(dxd)
dxd = estimateDispersions(dxd)
try({
tiff(file=paste(dt, "_Dispersion_estimates_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotDispEsts(dxd)
dev.off()
})
# Save an image
save.image(file=paste(dt,"_1.RData", sep=""))
dxd = testForDEU(dxd)
dxd = estimateExonFoldChanges(dxd, fitExpToVar="condition")
dxr1 = DEXSeqResults(dxd)
dxr1
table(dxr1$padj < 0.05)
table(tapply(dxr1$padj < 0.05, dxr1$groupID, any))
try({
tiff(file=paste(dt, "_MA_plot_DEXSeq.tiff", sep=""), width=8*300, height=8*300, res=300)
plotMA(dxr1, cex=0.8)
dev.off()
})
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
DEXSeqHTML(dxr1, FDR=0.001, color=c("#FF000080", "#0000FF80"), mart=ensembl, filter="ensembl_gene_id", attributes=c("external_gene_name"))
# Save an image
save.image(file=paste(dt,"_2.RData", sep=""))
# clean all
rm(list=ls(all=TRUE))
quit("yes")
I get the first Error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 688116 did not have 3 elements
Calls: DEXSeqDataSetFromHTSeq -> lapply -> lapply -> FUN -> read.table -> scan
Execution halted
When I open up the files with vim, The line 688116 is close to the bottom and has some data for goodness of HTSeq count. There were four of them,.
_ambiguous 36642
_ambiguous_readpair_position 0
_empty 1704024
_lowaqual 0
_notaligned 0
I deleted them. This fixed the error.
The second error.
The scripts runs until
[1] "sorted_SE7840_SA145170_S1_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[2] "sorted_SE7840_SA145171_S2_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[3] "sorted_SE7840_SA145172_S3_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[4] "sorted_SE7840_SA145173_S4_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[5] "sorted_SE7840_SA145174_S5_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[6] "sorted_SE7840_SA145175_S6_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[7] "sorted_SE7840_SA145176_S7_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[8] "sorted_SE7840_SA145177_S8_L003_Hs_Ensembl_106_star_paired_twopass.Aligned.out.txt"
[1] "Homo_sapiens.GRCh38.106.chr.gff"
condition LibType
control1 control paired-end
control2 control paired-end
control3 control paired-end
control4 control paired-end
test1 test paired-end
test2 test paired-end
test3 test paired-end
test4 test paired-end
Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: DEXSeqDataSetFromHTSeq -> sapply -> sapply -> lapply
Execution halted
I searched for "DEXSeq and subscript out of bounds and I found some answers where I should see if the "the number of lines containing "exonic_part" in the third column of the flattened gtf file correspond to the number of lines from the count files?" from here 'DEXSeq: Error in FUN(X[[i]], ...) : subscript out of bounds'
When I look at the gff file with vim. There are 745398 rows and the end of the file looks like this.
Y dexseq_prepare_annotation.py exonic_part 26622512 26622608 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "007"
Y dexseq_prepare_annotation.py exonic_part 26623586 26623666 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "008"
Y dexseq_prepare_annotation.py exonic_part 26627943 26628022 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "009"
Y dexseq_prepare_annotation.py exonic_part 26628271 26628437 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "010"
Y dexseq_prepare_annotation.py exonic_part 26630647 26630749 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "011"
Y dexseq_prepare_annotation.py exonic_part 26633345 26633431 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "012"
Y dexseq_prepare_annotation.py exonic_part 26634523 26634652 . - . gene_id "ENSG00000237917"; transcripts "ENST00000435945"; exonic_part_number "013"
Y dexseq_prepare_annotation.py aggregate_gene 26626520 26627159 . - . gene_id "ENSG00000231514"
Y dexseq_prepare_annotation.py exonic_part 26626520 26627159 . - . gene_id "ENSG00000231514"; transcripts "ENST00000435741"; exonic_part_number "001"
Y dexseq_prepare_annotation.py aggregate_gene 56855244 56855488 . + . gene_id "ENSG00000235857"
Y dexseq_prepare_annotation.py exonic_part 56855244 56855488 . + . gene_id "ENSG00000235857"; transcripts "ENST00000431853"; exonic_part
Then the end of the count files. 68115 rows.
"ENSG00000289711+ENSG00000084072":"037" 1
"ENSG00000289711+ENSG00000084072":"038" 2
"ENSG00000289711+ENSG00000084072":"039" 9
"ENSG00000289711+ENSG00000084072":"040" 9
"ENSG00000289712":"001" 0
"ENSG00000289712":"002" 0
"ENSG00000289713":"001" 0
"ENSG00000289714":"001" 0
"ENSG00000289718":"001" 0
"ENSG00000289718":"002" 0
"ENSG00000289718":"003" 0
"ENSG00000289718":"004" 0
"ENSG00000289718":"005" 0
There is quite a big jump in the GFF file, from 26626520 to 56855244. Also, it isn't clear how I should be checking. as there are quite a few less rows in my count file. Could it be that the GFF file needs to be sorted by name? Is this what is fragging me?
Could this be a memory error?
Any advice or assistance is greatly appreciated. Thank you!
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.2
I tried it over again with the sam file directly from the STAR alignment. Same problem. I will sort by coordinate and try it again. I am also going to use a computer with 256Gb of memory and see if that works. Thank you
Ran it on the USC CARC SLURM cluster with 512Gb of RAM and same error.
I tried a different alignment and sorting by name and coordinate. Nada. Is no one using this anymore? What are people using? If I have to I'll go back to Ensembl 100. If it still barfs on me, i'm hosed.
Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?
HI Alejandro! Thank you!! Absolutely. I am getting it from Ensembl, here, http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz I really appreciate your help!!
Hi Matthew. I Will try to reproduce the error message. Could you point me to the link where you are getting the raw gtf file?