Hi,
I am trying to get bulk RNA-seq data ready for analysis with DESeq2. My steps are outlined below: I built a salmon index from the latest zebrafish transcriptome, ran my quantifications with salmon using that index, and then tried to use tximeta to prepare the data for DESeq2, which is where I run into issues. Any help is appreciated.
First, I built the index:
salmon index -t Danio_rerio.GRCz11.cdna.all.fa.gz -i danio_rerio_index
I then ran the salmon quantification on my thirteen samples using a poorly made batch script...
cd FishOnlySeq
for fn in *.gz; do
    # sample name = file name without the directory and (if present) the .fastq.gz suffix
    samp=$(basename "${fn}" .fastq.gz)
    echo "Processing sample ${samp}"
    salmon quant -i danio_rerio_index \
        -l A \
        -r "${fn}" \
        --threads 12 \
        --validateMappings --rangeFactorizationBins 4 \
        --seqBias \
        -o "quants/${samp}_quant"
done
Since zebrafish isn't one of the references tximeta recognizes automatically, I needed to run makeLinkedTxome...
library(tximeta)

index_dir <- file.path("danio_rerio_index")
fasta_url <- "http://ftp.ensembl.org/pub/release-104/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz"
gtf_url <- "http://ftp.ensembl.org/pub/release-104/gtf/danio_rerio/Danio_rerio.GRCz11.104.gtf.gz"

makeLinkedTxome(indexDir = index_dir,
                source = "Ensembl",
                organism = "Danio rerio",
                release = "104",
                genome = "GRCz11",
                fasta = fasta_url,
                gtf = gtf_url)
I needed to make the coldata for tximeta to act on...
file_path <- file.path("quants")
sf_files <- list.files(file_path, recursive = TRUE, full.names = TRUE, pattern = "quant.sf")
# the quant directory name (second-to-last path component) serves as the sample name
sample_names <- stringr::word(sf_files, -2, sep = "/")
coldata <- data.frame(files = sf_files,
                      names = sample_names)
This looked like it all worked, but here's the last bit of input and the error messages...
> txi_data <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13
found matching linked transcriptome:
[ Ensembl - Danio rerio - release 104 ]
loading existing EnsDb created: 2021-11-03 18:43:46
loading existing transcript ranges created: 2021-11-03 18:43:46
Warning message:
In checkAssays2Txps(assays, txps) :
Warning: the annotation is missing some transcripts that were quantified.
5607 out of 57192 txps were missing from GTF/GFF but were in the indexed FASTA.
(This occurs sometimes with Ensembl txps on haplotype chromosomes.)
In order to build a ranged SummarizedExperiment, these txps were removed.
To keep these txps, and to skip adding ranges, use skipMeta=TRUE
Example missing txps: [ENSDART00000181147, ENSDART00000185226, ENSDART00000188669, ...]
> gene_summarized <- summarizeToGene(txi_data)
loading existing EnsDb created: 2021-11-03 18:43:46
obtaining transcript-to-gene mapping from database
Error: incomplete input
In addition: Warning messages:
1: In cleanColumns(x, columns) :
Column 'tx_id' is not present in the database and has been removed
2: In cleanColumns(x, columns) :
Column 'tx_id' is not present in the database and has been removed
3: In prefixColumns(x, columns, clean, with.tables) :
The submitted table names are not valid in the database and were thus dropped.
4: In cleanColumns(x, columns) :
Column 'tx_id' is not present in the database and has been removed
According to http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-tximeta you should be able to feed the tximeta object into DESeq2 directly, but that failed as well:
ddsTxi <- DESeqDataSet(txi_data, design = ~ condition)
using counts and average transcript lengths from tximeta
Error in DESeqDataSet(txi_data, design = ~condition) :
all variables in design formula must be columns in colData
So where am I going wrong? I appreciate any help.
sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.30.1 GenomicFeatures_1.42.3 AnnotationDbi_1.52.0 SummarizedExperiment_1.20.0
[5] Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1
[9] S4Vectors_0.28.1 BiocGenerics_0.36.1 MatrixGenerics_1.2.1 matrixStats_0.61.0
[13] tximeta_1.8.5 magrittr_2.0.1
loaded via a namespace (and not attached):
[1] ProtGenerics_1.22.0 bitops_1.0-7 bit64_4.0.5 RColorBrewer_1.1-2
[5] progress_1.2.2 httr_1.4.2 tools_4.0.4 utf8_1.2.2
[9] R6_2.5.1 colorspace_2.0-2 DBI_1.1.1 lazyeval_0.2.2
[13] withr_2.4.2 tidyselect_1.1.1 prettyunits_1.1.1 bit_4.0.4
[17] curl_4.3.2 compiler_4.0.4 xml2_1.3.2 DelayedArray_0.16.3
[21] rtracklayer_1.50.0 scales_1.1.1 genefilter_1.72.1 readr_2.0.2
[25] askpass_1.1 rappdirs_0.3.3 stringr_1.4.0 digest_0.6.28
[29] Rsamtools_2.6.0 rmarkdown_2.11 XVector_0.30.0 pkgconfig_2.0.3
[33] htmltools_0.5.2 dbplyr_2.1.1 fastmap_1.1.0 ensembldb_2.14.1
[37] rlang_0.4.12 rstudioapi_0.13 RSQLite_2.2.8 shiny_1.7.1
[41] generics_0.1.1 jsonlite_1.7.2 BiocParallel_1.24.1 vroom_1.5.5
[45] dplyr_1.0.7 RCurl_1.98-1.5 GenomeInfoDbData_1.2.4 Matrix_1.3-2
[49] munsell_0.5.0 Rcpp_1.0.7 fansi_0.5.0 lifecycle_1.0.1
[53] stringi_1.7.5 yaml_2.2.1 zlibbioc_1.36.0 BiocFileCache_1.14.0
[57] AnnotationHub_2.22.1 grid_4.0.4 blob_1.2.2 promises_1.2.0.1
[61] crayon_1.4.2 lattice_0.20-41 splines_4.0.4 Biostrings_2.58.0
[65] annotate_1.68.0 hms_1.1.1 locfit_1.5-9.4 knitr_1.36
[69] pillar_1.6.4 geneplotter_1.68.0 biomaRt_2.46.3 XML_3.99-0.8
[73] glue_1.4.2 BiocVersion_3.12.0 evaluate_0.14 BiocManager_1.30.16
[77] vctrs_0.3.8 tzdb_0.2.0 httpuv_1.6.3 gtable_0.3.0
[81] openssl_1.4.5 purrr_0.3.4 assertthat_0.2.1 ggplot2_3.3.5
[85] cachem_1.0.6 xfun_0.27 mime_0.12 xtable_1.8-4
[89] AnnotationFilter_1.14.0 later_1.3.0 survival_3.2-7 tibble_3.1.5
[93] GenomicAlignments_1.26.0 memoise_2.0.0 tximport_1.18.0 ellipsis_0.3.2
[97] interactiveDisplayBase_1.28.0
Could you try this again, but after clearing the EnsDb entry from your BiocFileCache? I think what happened is that you parsed the GTF in the window between the release and when Johannes made this change to the code: ensembldb support for Ensembl's transcript_name column.
To clear a BFC entry do this:
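Roughly the following (a sketch assuming the default tximeta-managed BiocFileCache; check the bfcinfo() listing to identify the right entry before removing anything):

library(tximeta)
library(BiocFileCache)
# locate the BiocFileCache that tximeta uses; getTximetaBFC() returns NULL
# if you never set a custom location, in which case the default cache is used
bfcloc <- getTximetaBFC()
bfc <- if (is.null(bfcloc)) BiocFileCache() else BiocFileCache(bfcloc)
# list the cached resources and find the stale EnsDb entry
bfcinfo(bfc)
# adjust the search string to match what bfcinfo() shows for your EnsDb
q <- bfcquery(bfc, "Danio_rerio.GRCz11.104")
# remove it so tximeta rebuilds the EnsDb from the GTF on the next run
bfcremove(bfc, q$rid)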
If that doesn't work, could you email me one or two quantification directories in a tar or zip? To maintainer("DESeq2"). On my end, using the latest release, I have tx_id in the columns of the EnsDb.

I think it won't help even to clear the cache. I've reported this upstream issue here -- ensembldb support for Ensembl's transcript_name column.

In the meantime, I think you may actually get around this error by not using the word "Ensembl" for the source. Try "LocalEnsembl" instead. This will avoid using the EnsDb objects on AnnotationHub, and that may clear up the issue.
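For example, re-running your makeLinkedTxome call with only the source changed (a sketch re-using the index path and URLs from your post):

library(tximeta)
# same index and annotation as before; only `source` changes to "LocalEnsembl",
# which tells tximeta to build the EnsDb locally from the GTF instead of
# looking up a pre-built one on AnnotationHub
makeLinkedTxome(indexDir = "danio_rerio_index",
                source = "LocalEnsembl",
                organism = "Danio rerio",
                release = "104",
                genome = "GRCz11",
                fasta = "http://ftp.ensembl.org/pub/release-104/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz",
                gtf = "http://ftp.ensembl.org/pub/release-104/gtf/danio_rerio/Danio_rerio.GRCz11.104.gtf.gz")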
Hey Michael,
Really appreciate your reply! I decided to get around this by going with tximport instead (roughly the route sketched below). If I do revisit the analysis and attempt to use tximeta, I'll try the fixes you've mentioned. Sorry for the delay in getting back to you, and your continued support for this Bioconductor package is much appreciated.
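A sketch of that tximport route (not my exact code; it assumes the same Ensembl 104 GTF and that coldata has a condition column describing the samples):

library(tximport)
library(GenomicFeatures)
library(DESeq2)

# build a transcript-to-gene table from the same Ensembl 104 GTF used for the index
txdb <- makeTxDbFromGFF("Danio_rerio.GRCz11.104.gtf.gz")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, columns = "GENEID", keytype = "TXNAME")

# import the salmon quantifications; the cDNA FASTA carries versioned transcript IDs
# while the GTF-derived TXNAMEs do not, hence ignoreTxVersion = TRUE
files <- coldata$files
names(files) <- coldata$names
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)

# hand off to DESeq2 (assumes a `condition` column in coldata)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)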
No worries. tximport works :)
BTW, ensembldb is now updated; if you re-install, it will give you v2.18.2, and users have reported all issues cleared up. So you don't need to rename the source anymore.
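For example (assuming your Bioconductor installation is recent enough to pull in ensembldb 2.18.2):

# re-install/update ensembldb from Bioconductor
BiocManager::install("ensembldb")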