I have been working with DEXSeq. My data were initially aligned using a RefSeq annotation, but because DEXSeq uses Ensembl, I switched to a GTF/GFF from Ensembl for the model organism I am working on.
The code blocks below ran with no errors:
library(GenomicFeatures)
download.file(
"https://ftp.ensembl.org/pub/release-110/gtf/danio_rerio/Danio_rerio.GRCz11.110.chr.gtf.gz",
destfile="/path/to/Downloads/Danio_rerio.GRCz11.110.chr.gtf.gz")
# Must use a **GTF** for the following:
txdb = makeTxDbFromGFF("path/to/Danio_rerio.GRCz11.110.chr.gtf.gz")
## It seems the DEXSeq object won't be created with RefSeq, so trying a different GFF (Danio_rerio.GRCz11.110.chr.gff3) from Ensembl
inDir="path/to/Downloads/"
flattenedFile = list.files(inDir, pattern="\\.gff3$", full.names=TRUE)
# Provide the path to the directory containing counts files
countsDir <- "path/to/Counts_Using_Ensembl/Folder_with_counts"
# List all files in the directory ending with ".txt"
countsFiles <- list.files(countsDir, pattern = "\\.txt$", full.names = TRUE)
sampleTable = data.frame(
row.names = c("1_Experiment.txt", "2_Experiment.txt", "3_Experiment.txt",
"1_Control.txt", "2_Control.txt", "3_Control.txt" ),
condition = c("knockdown", "knockdown", "knockdown",
"control", "control", "control" ),
libType = c("paired-end", "paired-end", "paired-end",
"paired-end", "paired-end", "paired-end" )
)
My script runs quite well until I hit this code block in Jupyter Lab:
library("DEXSeq")
# Create the DEXSeqDataSet object
dxd <- DEXSeqDataSetFromHTSeq(
countsFiles,
sampleData=sampleTable, # the sample annotation data frame defined above
design= ~ sample + exon + condition:exon,
flattenedfile=flattenedFile )
Error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 326107 did not have 3 elements
Traceback:
1. DEXSeqDataSetFromHTSeq(countsFiles_unquote, sampleData, design = ~sample +
. exon + condition:exon, flattenedfile = flattenedFile)
2. lapply(countfiles, function(x) read.table(x, header = FALSE,
. stringsAsFactors = FALSE))
3. lapply(countfiles, function(x) read.table(x, header = FALSE,
. stringsAsFactors = FALSE))
4. FUN(X[[i]], ...)
5. read.table(x, header = FALSE, stringsAsFactors = FALSE)
6. scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
. nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE,
. fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip,
. multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
. flush = flush, encoding = encoding, skipNul = skipNul)
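The traceback shows that the failure comes from read.table() on one of the count files rather than from the annotation file. As a minimal sketch for locating the offending lines, the snippet below compares the number of whitespace-separated fields on each line with the first line. The helper name find_odd_lines is hypothetical, and the demo file is made up for illustration; it is not one of the real count files.

```r
# Hypothetical helper: report lines whose field count differs from the first line.
find_odd_lines <- function(path) {
  lines <- readLines(path)
  # Split each line on runs of whitespace and count the resulting fields
  n <- lengths(strsplit(trimws(lines), "[[:space:]]+"))
  odd <- which(n != n[1])
  data.frame(line = odd, fields = n[odd])
}

# Made-up demo file: two well-formed rows plus one malformed trailing line
demo <- tempfile(fileext = ".txt")
writeLines(c("geneA:001 12", "geneA:002 7", "_ambiguous"), demo)
odd <- find_odd_lines(demo)
print(odd)
```

Running the helper over each entry of countsFiles should show whether the problem lines sit at the end of the files, which would suggest trailing summary rows.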
I have tried removing the last few lines of each counts file, which contain summary data, using the code recommended by DEXSeq (you can see it here). I also tried running it with the RefSeq GTF/GFF, but the same error persists. Finally, I tried removing the quotes and reading from a separate "Unquote" directory holding the unquoted files, using a modified version of what this person recommended when he came across a similar error here. Any guidance is appreciated.
My Session Information Is Below:
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1
Matrix products: default
BLAS: /path/to/Resources/lib/libRblas.0.dylib
LAPACK: /path/to/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] C/UTF-8/C/C/C/C
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.33 IRdisplay_1.1 utf8_1.2.3 base64enc_0.1-3
[5] fastmap_1.1.1 glue_1.6.2 htmltools_0.5.5 repr_1.1.6
[9] lifecycle_1.0.3 cli_3.6.1 fansi_1.0.4 vctrs_0.6.3
[13] pbdZMQ_0.3-9 compiler_4.3.1 tools_4.3.1 evaluate_0.21
[17] pillar_1.9.0 crayon_1.5.2 rlang_1.1.1 jsonlite_1.8.7
[21] IRkernel_1.3.2 uuid_1.1-0
Update: it seems that the bottom lines of the counts files contained summary statistics. Once those lines were removed, it worked.
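For anyone hitting the same error, here is a minimal sketch of one way to drop such lines before building the DEXSeq object. It assumes the summary lines start with an underscore (for example "_ambiguous" or "_empty"); if your files quote the feature names or use a different prefix, the filter needs adjusting. The helper name strip_summary_lines and the demo data are hypothetical.

```r
# Hypothetical helper: write a copy of a counts file without summary lines.
# Assumption: summary lines begin with "_" (e.g. "_ambiguous", "_empty").
strip_summary_lines <- function(infile, outfile) {
  lines <- readLines(infile)
  # Keep non-blank lines that do not start with an underscore
  keep <- !startsWith(lines, "_") & nzchar(trimws(lines))
  writeLines(lines[keep], outfile)
  invisible(sum(!keep))  # number of lines dropped
}

# Made-up demo: two exon rows followed by two summary rows
demo_in <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeLines(c("geneA:001\t12", "geneA:002\t7", "_ambiguous\t0", "_empty\t3"), demo_in)
dropped <- strip_summary_lines(demo_in, demo_out)

# The cleaned file now parses with the same call that failed in the traceback
cleaned <- read.table(demo_out, header = FALSE, stringsAsFactors = FALSE)
```

Writing the cleaned copies to a separate directory and pointing countsFiles at that directory keeps the original counts untouched.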