I've aligned long read data with oarfish and now I wanted to try out doing differential isoform usage with fishpond + swish
I've aligned to the gencode v42 human fasta file, which has the following header conformation
head /SAN/vyplab/vyplab_reference_genomes/sequence/human/gencode/gencode.v42.transcripts.fa
>ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|lncRNA|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
I've got a dataframe of the file paths of the data and the names.
This is fine
k = tximeta(coldata = data.frame(files = m, names = names(m)), type = 'oarfish')
y <- scaleInfReps(k) # scales counts
y <- labelKeep(y) # labels features to keep
set.seed(120)
y <- swish(y, x="condition") # simplest Swish case
The errors start appearing here:
# run on the transcript-level dataset
iso <- isoformProportions(y)
Error in isoformProportions(y) : geneCol %in% names(mcols(y)) is not TRUE
Ok that's fine, perhaps tximeta wasn't able to properly infer the right species/version, but I should be able to manually make that
tx_2_gene <- mcols(y) %>%
as.data.frame() %>%
rownames_to_column('gene') %>%
separate(gene,sep = "\\|",into = c("transcript","geneCol",NA,NA,"transcript_id"
,"symbol",'tx_leng','biotype'),
remove = FALSE) %>%
select(gene,geneCol)
And let's see if that works
updated_m <- mcols(y) %>%
as.data.frame() %>%
rownames_to_column('gene') %>%
left_join(tx_2_gene) %>%
column_to_rownames('gene')
mcols(y) <- DataFrame(updated_m)
# run on the transcript-level dataset
iso <- isoformProportions(y)
iso <- isoformProportions(y)
Error in isoformProportions(y) : geneCol %in% names(mcols(y)) is not TRUE
> "geneCol" %in% names(mcols(y))
[1] TRUE
Let's try the summarizeToGene gene function
gse <- summarizeToGene(k,tx2gene = tx_2_gene)
Error in missingMetadata(object, summarize = TRUE) :
use of this function requires transcriptome metadata which is missing.
either: (1) the object was not produced by tximeta, or
(2) tximeta could not recognize the digest of the transcriptome.
If (2), use a linkedTxome to provide the missing metadata and rerun tximeta
or use tx2gene, txOut=FALSE (and skipMeta=TRUE if Salmon/piscem)
Okay following the linkedTxome path
fastaFTP <- c("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.transcripts.fa.gz")
> gtfPath <- file.path("/Users/annaleigh/Downloads/gencode.v42.annotation.gtf.gz")
> makeLinkedTxome(indexDir='/Users/annaleigh/cluster/vyplab/first_weeks/recalled_nanopore/oarfish/',
+ source="localgencode",
+ organism="Homo sapiens",
+ release="42",
+ genome="GRCH38",
+ fasta=fastaFTP,
+ gtf=gtfPath,
+ write=FALSE,
+ jsonFile = "provaLinkedTxome.json")
Error: lexical error: invalid char in json text.
/Users/annaleigh/cluster/vyplab/
(right here) ------^
So - what am I doing wrong here?
Sessioninfo
> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Ventura 13.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.20.0 fishpond_2.12.0 lubridate_1.9.4 forcats_1.0.0 purrr_1.0.2
[6] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0 tximeta_1.24.0 IsoformSwitchAnalyzeR_2.6.0
[11] pfamAnalyzeR_1.6.0 dplyr_1.1.4 stringr_1.5.1 readr_2.1.5 sva_3.54.0
[16] genefilter_1.88.0 mgcv_1.9-1 nlme_3.1-166 satuRn_1.14.0 DEXSeq_1.52.0
[21] RColorBrewer_1.1-3 AnnotationDbi_1.68.0 DESeq2_1.46.0 SummarizedExperiment_1.36.0 GenomicRanges_1.58.0
[26] GenomeInfoDb_1.42.1 IRanges_2.40.1 S4Vectors_0.44.0 MatrixGenerics_1.18.1 matrixStats_1.5.0
[31] Biobase_2.66.0 BiocGenerics_0.52.0 BiocParallel_1.40.0 limma_3.62.2 ggplot2_3.5.1
[36] data.table_1.16.4
I'm not certain that this will work with Oarfish, the documentation only mentions Salmon and piscem.
I'd use tximport for now.
Success
Failure
Thank you for any advice
Lol I'm not smart
This solution didn't work because....the column name was wrong
however, tximport does not produce a format that the fishpond data can read out of the gate
That's true. We haven't yet developed the Oarfish to tximeta pipeline yet.
Can you say what functionality you need? Do you want to do testing on isoform proportions (DTU)?
You could use tximeta with
skipMeta=TRUE
which would produce the correctly shaped SummarizedExperiment object, and then manually add the gene ID.Trying to do differential gene expression(for other reasons) and also DTU in the same bucket. I will try the skipMeta=TRUE and update with code for fellow lost Googlers
Thanks Annaleigh! I'll make sure to also reply here once we've worked out Oarfish to tximeta with metadata, or any other analysis pipeline updates wrt Oarfish and Bioc.
More failures encountered