Dear friends,
We are trying to use Salmon for DTU analysis. We want to separate exogenous from endogenous transcripts by following this post https://www.biostars.org/p/443701/ and this paper https://f1000research.com/articles/7-952
We are focusing on a gene called ASCL1 (endo-ASCL1). We transduced cells with lentiviral vector containing ASCL1 ORF only (Lenti-ASCL1). There should not be any 5'3' UTR on this ASCL1. We added Lenti-ASCL1 to genecode reference.
However, in the result, we still see large amount of Lenti-ASCL1 and endo-ASCL1 transcripts in some untranduced cells, especially when endo-ASCL1 is high. Looks like high endo-ASCL1 can bring the Lenti-ASCL1 level high as well even though the sample is not transduced.
Have you seen this before? What would be a better way to deconvolute ASCL1 quantification using Salmon? to better separate endo-ASCL1 from Lenti-ASCL1? Are there any parameters we should tune to achieve this?
Thank you so much for your help!
Kai
Here is the code we ran.
#in linux shell
#get the reference geneome for salmon
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.transcripts.fa.gz
#add lenti ASCL1 to this genome
cat ASCL1_transcript.fa >> gencode.v39.transcripts.fa
#use salmon to index transcrpt reference
salmon index -t gencode.v39.transcripts.fa -i gencode.index
#run salmon
for fn in $files;
do
echo "Processing sample ${samp}"
salmon quant -i gencode.index -l IU \
-1 ${sample_dir}/$fn"_R1_001.fastq.gz" \
-2 ${sample_dir}/$fn"_R2_001.fastq.gz" \
-p 20 --gcBias -o ${workdir}/quant/${fn}_quant
done
#in R
library(tidyverse, exclude = "select")
library(rnaseqDTU)
library(tximport)
library(GenomicFeatures)
library(reshape2)
#DTU analysis
samples<-data.frame(sample_id = c("1_S1_quant", "5_S5_quant", "6_S6_quant"),
condition = c('s1_no_lenti-ASCL1',
's5_high_endo-ASCL1',
's6_lenti-ASCL1'))
files <- file.path("dtu/quant",
samples$sample_id, "quant.sf")
names(files) <- samples$sample_id
#Use tximport to read in transcript counts
txi <- tximport(files, type="salmon", txOut=T,
countsFromAbundance="no")
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
rownames(cts)<-sub("\\|.*", "", rownames(cts))
#gtf
gtf <- "gencode.v39.annotation.gtf.gz"
txdb.filename <- "gencode.v39.annotation.sqlite"
txdb<-tryCatch(
expr = {
loadDb(txdb.filename)
},
error = function(e){
txdb <- makeTxDbFromGFF(gtf)
saveDb(txdb, txdb.filename)
}
)
txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
ASCL1_transcript<-data.frame(GENEID = "ENSG00000139352.4", TXNAME = "FU-ASCL1-CGW")
txdf<-rbind(txdf, ASCL1_transcript)
tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
txdf <- txdf[match(rownames(cts),txdf$TXNAME),]
all(rownames(cts) == txdf$TXNAME)
counts <- data.frame(gene_id=txdf$GENEID,
feature_id=txdf$TXNAME,
cts)
#get ascl1
ascl1_counts<-counts[counts$gene_id == "ENSG00000139352.4",] %>%
.[,-(1:2)] %>%
(function(x, add=1){x + add}) %>%
log2() %>%
t %>%
as.data.frame %>%
rownames_to_column("sample")
ascl1_counts
sample ENST00000266744.4 FU-ASCL1-CGW
1 X1_S1_quant 3.459432 0.00000
2 X5_S5_quant 12.715446 7.85213
3 X6_S6_quant 13.197477 10.85678
the ENST00000266744.4 should be endo-ASCL1 transcript FU-ASCL1-CGW is the lenti-ASCL1 transcript
X1_S1_quant is the cell line with low endo ASCL1. The lenti-ASCL1 has zero counts, which should be correct.
X5_S5_quant is the cell line with high endo ASCL1. The problem is we don't know why the lenti-ASCL1 has non zero value. And the value is not low.
X6_S6_quant is the cell line with both high endo-ASCL1 and high lenti-ASCL.
sessionInfo( )
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4.4 GenomicFeatures_1.44.2 tximport_1.20.0 rnaseqDTU_1.12.0 devtools_2.4.3
[6] usethis_2.1.5 rafalib_1.0.0 edgeR_3.34.1 limma_3.48.3 stageR_1.14.0
[11] DEXSeq_1.38.0 RColorBrewer_1.1-2 AnnotationDbi_1.54.1 DESeq2_1.32.0 SummarizedExperiment_1.22.0
[16] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4 IRanges_2.26.0 S4Vectors_0.30.2 MatrixGenerics_1.4.3
[21] matrixStats_0.61.0 Biobase_2.52.0 BiocGenerics_0.38.0 BiocParallel_1.26.2 DRIMSeq_1.20.0
[26] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 readr_2.1.2
[31] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] colorspace_2.0-3 rjson_0.2.21 hwriter_1.3.2 ellipsis_0.3.2 rprojroot_2.0.2 XVector_0.32.0
[7] fs_1.5.2 rstudioapi_0.13 remotes_2.4.2 bit64_4.0.5 fansi_1.0.2 lubridate_1.8.0
[13] xml2_1.3.3 splines_4.1.2 cachem_1.0.6 geneplotter_1.70.0 pkgload_1.2.4 jsonlite_1.8.0
[19] Rsamtools_2.8.0 broom_0.7.12 annotate_1.70.0 dbplyr_2.1.1 png_0.1-7 compiler_4.1.2
[25] httr_1.4.2 backports_1.4.1 assertthat_0.2.1 Matrix_1.4-0 fastmap_1.1.0 cli_3.2.0
[31] prettyunits_1.1.1 tools_4.1.2 gtable_0.3.0 glue_1.6.2 GenomeInfoDbData_1.2.6 rappdirs_0.3.3
[37] Rcpp_1.0.8.3 cellranger_1.1.0 vctrs_0.3.8 Biostrings_2.60.2 rtracklayer_1.52.1 brio_1.1.3
[43] ps_1.6.0 testthat_3.1.2 rvest_1.0.2 lifecycle_1.0.1 restfulr_0.0.13 statmod_1.4.36
[49] XML_3.99-0.9 zlibbioc_1.38.0 scales_1.1.1 vroom_1.5.7 hms_1.1.1 yaml_2.3.5
[55] curl_4.3.2 memoise_2.0.1 biomaRt_2.48.3 stringi_1.7.6 RSQLite_2.2.10 genefilter_1.74.1
[61] BiocIO_1.2.0 desc_1.4.1 filelock_1.0.2 pkgbuild_1.3.1 rlang_1.0.2 pkgconfig_2.0.3
[67] bitops_1.0-7 lattice_0.20-45 GenomicAlignments_1.28.0 bit_4.0.4 tidyselect_1.1.2 processx_3.5.2
[73] plyr_1.8.6 magrittr_2.0.2 R6_2.5.1 generics_0.1.2 DelayedArray_0.18.0 DBI_1.1.2
[79] pillar_1.7.0 haven_2.4.3 withr_2.5.0 survival_3.3-1 KEGGREST_1.32.0 RCurl_1.98-1.6
[85] modelr_0.1.8 crayon_1.5.0 utf8_1.2.2 BiocFileCache_2.0.0 tzdb_0.2.0 progress_1.2.2
[91] locfit_1.5-9.5 grid_4.1.2 readxl_1.3.1 blob_1.2.2 callr_3.7.0 reprex_2.0.1
[97] digest_0.6.29 xtable_1.8-4 munsell_0.5.0 sessioninfo_1.2.2
What sequence are you using for the transgene transcript you are adding to the reference? Is it just the CDS? The transgene will have a 5' and 3' UTR of some sort, even if its just a Kozak sequence on the 5' end and a SV40 polyA site on the 3' end. Salmon might have a hard time distinguishing CDS from CDS+UTR, particularly if the UTR is not quite correctly annotated in the reference (which is not uncommon). For example if the normally used UTR is shorter than the annotated one (which is often the case), then salmon might have to split the difference between assigning the read to the isoform with the UTR and without. This might be particularly problematic if the UTR contains sequence that looks like sequence in other genome locations that reads could be assigned to. If you look at a read alignment for samples given high transgene expression that shouldn't have the transgene, do you see a consistent, expression level across the UTR, or do you see that coverage is higher across the CDS?
Hi Dr Sudbery, yes I found looking at the coverage in IGV helps a lot. I did find some differences between CDS and UTR and between samples. The conclusion is more clear when I normalized the data by sequencing depth using deepTools. Thank you so much.