I am using DEXSeq to perform a DEU analysis on some bulk RNA-Seq data but I am getting some strange behavior. My data comes from a transgenic animal model, so I've introduced a custom gtf file for the transgene. I am able to follow the vignette for DEXSeq well, but I went back today to try and update my summarized experiment object with rowRanges because I realized this didn't get done the first time.
Here is the code I am trying to run:
library(GenomicFeatures) # for the exonicParts() function
library(txdbmaker) # for the makeTxDbFromGFF() function
library(GenomicRanges)
## Flatten GTF
gtf <- makeTxDbFromGFF(file = "/custom.gtf")
flattened_gtf <- exonicParts(gtf, linked.to.single.gene.only = TRUE)
# Subset the relevant ranges
subset_ranges <- subsetByOverlaps(flattened_gtf, ranges = GRanges(seqnames = "chrMouse_BAC", ranges = IRanges(start = 36912, end = 39292)))
# 1. Adjust E23 and E24
# Extend the end of E23 by 50 base pairs
end(subset_ranges[2]) <- end(subset_ranges[2]) + 50
# Shorten the start of E24 by 50 base pairs
start(subset_ranges[3]) <- start(subset_ranges[3]) + 50
# 2. Merge E25 and E26
# Merge the ranges for E25 (row 4) and E26 (row 5)
merged_range <- reduce(subset_ranges[4:5]) # Combine the two ranges into one
# Update metadata for the merged range (using metadata from E25 as a base)
merged_metadata <- subset_ranges[4] # Use metadata from E25
start(merged_metadata) <- start(merged_range) # Update start coordinate
end(merged_metadata) <- end(merged_range) # Update end coordinate
# Remove the original E25 and E26
subset_ranges <- subset_ranges[-c(4, 5)]
# Add the new merged range back to the subset
subset_ranges <- append(subset_ranges, merged_metadata)
# 3. Replace the modified subset into the original GRanges object
# Remove the original ranges from the full object
flattened_gtf <- flattened_gtf[!flattened_gtf %over% GRanges(seqnames = "chrMouse_BAC", ranges = IRanges(start = 36912, end = 39292))]
# Add the modified ranges back into the GRanges object
flattened_gtf <- c(flattened_gtf, subset_ranges)
names(flattened_gtf) =
sprintf("%s:E%0.3d", flattened_gtf$gene_id, flattened_gtf$exonic_part)
This all works great, and I then move on to setting up my DEXSeq object
## Load in BAM files
library(GenomicAlignments)
library(DEXSeq)
library(stringr)
library(SummarizedExperiment)
library(BiocParallel)
# load metadata
meta <- read.table(file = "/meta.txt", sep = "\t", header = TRUE)
meta$Pedigree <- paste0("S", meta$Pedigree)
# read exon count data
all_files <- dir("/final_exon_overlaps", pattern = "*.txt$", full.names = TRUE)
counts_list <- lapply(X = all_files, read.table, header = TRUE)
names(counts_list) <- paste0("S", str_extract(string = basename(all_files), pattern = "[[:digit:]]+"))
# merge exon count data
counts_matrix <- do.call(cbind, counts_list)
colnames(counts_matrix) <- gsub(pattern = "X", replacement = "S", str_extract(string = colnames(counts_matrix), pattern = ".*(?=_Aligned)"))
# arrange metadata to match order of counts_matrix
new_order <- match(colnames(counts_matrix), meta$Pedigree)
meta <- meta[new_order,]
if( all(match(colnames(counts_matrix), meta$Pedigree) == seq_len(length(meta$Pedigree))) ){ print("Correct Order") }
# transform into a summarized experiment
counts_matrix <- as.matrix(counts_matrix)
row_data <- DataFrame(flattened_gtf)[-1]
col_data <- DataFrame(meta)
rownames(col_data) <- col_data$Pedigree
se <- SummarizedExperiment(assays = list(counts = counts_matrix), rowData = row_data, colData = col_data)
colData(se)$Condition <- factor(c("baseline", "ifn"))
#rowRanges(se) <- flattened_gtf
# perform DEXSeq analysis
# load a BCPARAM object to allow multi-threading
param <- MulticoreParam(4)
dxd <- DEXSeqDataSetFromSE(se, design = ~ Pedigree + exon + Condition:exon)
If I run DEXSeqDataSetFromSE
after adjusting the rowRanges, then I get the following error:
Error in SummarizedExperiment(nCountData, colData = colData, rowRanges = featureRanges) :
the rownames of the supplied assay(s) must be NULL or identical to
those of the RangedSummarizedExperiment object (or derivative) to
construct
but if I don't adjust the rowRanges then I get no error, and the analysis seems to work. I can't quite work out what is causing this behavior, because everything I can think to check seems to be okay; the lengths are equal, the rownames are equal, the order is the same. Is anyone able to make any recommendations?
sessionInfo( )
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /n/app/openblas/0.3.19-gcc-9.2.0/lib/libopenblas_haswellp-r0.3.19.so; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] stringr_1.5.1 DEXSeq_1.52.0
[3] RColorBrewer_1.1-3 DESeq2_1.46.0
[5] BiocParallel_1.40.0 GenomicAlignments_1.42.0
[7] Rsamtools_2.22.0 Biostrings_2.74.1
[9] XVector_0.46.0 SummarizedExperiment_1.36.0
[11] MatrixGenerics_1.18.1 matrixStats_1.5.0
[13] txdbmaker_1.2.1 GenomicFeatures_1.58.0
[15] AnnotationDbi_1.68.0 Biobase_2.66.0
[17] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1
[19] IRanges_2.40.1 S4Vectors_0.44.0
[21] BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
[4] filelock_1.0.3 bitops_1.0-9 fastmap_1.2.0
[7] RCurl_1.98-1.16 BiocFileCache_2.14.0 XML_3.99-0.18
[10] digest_0.6.37 lifecycle_1.0.4 statmod_1.5.0
[13] survival_3.5-8 KEGGREST_1.46.0 RSQLite_2.3.9
[16] magrittr_2.0.3 genefilter_1.88.0 compiler_4.4.0
[19] rlang_1.1.4 progress_1.2.3 tools_4.4.0
[22] utf8_1.2.4 yaml_2.3.10 rtracklayer_1.66.0
[25] prettyunits_1.2.0 S4Arrays_1.6.0 bit_4.5.0.1
[28] curl_4.3.2 DelayedArray_0.32.0 xml2_1.3.6
[31] abind_1.4-8 hwriter_1.3.2.1 grid_4.4.0
[34] fansi_1.0.6 xtable_1.8-4 colorspace_2.1-1
[37] ggplot2_3.5.1 scales_1.3.0 biomaRt_2.62.0
[40] cli_3.6.3 crayon_1.5.3 generics_0.1.3
[43] httr_1.4.7 rjson_0.2.23 DBI_1.2.3
[46] cachem_1.1.0 splines_4.4.0 zlibbioc_1.52.0
[49] parallel_4.4.0 restfulr_0.0.15 vctrs_0.6.5
[52] Matrix_1.7-0 jsonlite_1.8.9 geneplotter_1.84.0
[55] hms_1.1.3 bit64_4.5.2 locfit_1.5-9.10
[58] annotate_1.84.0 glue_1.8.0 codetools_0.2-20
[61] stringi_1.8.4 gtable_0.3.6 BiocIO_1.16.0
[64] UCSC.utils_1.2.0 munsell_0.5.1 tibble_3.2.1
[67] pillar_1.9.0 rappdirs_0.3.3 GenomeInfoDbData_1.2.13
[70] R6_2.5.1 dbplyr_2.5.0 httr2_1.0.7
[73] lattice_0.22-6 png_0.1-8 memoise_2.0.1
[76] Rcpp_1.0.14 SparseArray_1.6.0 pkgconfig_2.0.3