Question

Can Tximport average transcript length be used for GOSeq?

0

Entering edit mode

casey.rimland ▴ 170

@caseyrimland-14915

Last seen 6.7 years ago

University of Cambridge, National Insti…

I’m trying to run GOSeq with human RNA Seq data that has been processed with Salmon, summarized to gene level with tximport, and DGE run using DeSeq2.

I was reading the vignette for GOSeq and got stuck/confused on the part where you need to know what human genome build you used... It says you need to know this to access the correct transcript length in their database.

This may be a stupid question/lack of understanding on my part...But with Salmon/tximport I do not have this information because I did not map to a specific genome, correct? So then could I just use the avgTX length from the tximport function to input to GOSeq to correct for length biases?

Thanks!

deseq2 tximport goseq • 2.4k views

ADD COMMENT • link 7.1 years ago casey.rimland ▴ 170

1

Entering edit mode

Thanks for the reply Steve!

So I spent literally all day trying to figure out how to go about doing that, now that I know its okay...and I think maybe I finally figured it out? Getting the data into the right format for GOseq is a bit of a beast for a novice R user!!

Any chance you're feeling super friendly and could take a look at my code?

Also, just in case someone in the future needs to do something like this, maybe the code will help them avoid a day of banging their head on the wall (assuming its correct code :D).

When I run GOseq the pwf plots are looking a bit odd which is the only thing that has me worried... This is link to the pwf plot https://ibb.co/h69iLH

#Get transcript lengths from Tximport data
txi.length.df <- as.data.frame(txi[["length"]])

#Subset for only genes input to the DeSeq DGE model
txi.length.df <- subset(txi.length.df, rownames(txi.length.df) %in% rownames(results.df))

#Remove samples not being included in analyses at this time
txi.length.df_sub <- within(txi.length.df, rm("SRR000020643", "SRR000020650", "SRR000020653", "SRR000020660", "SRR000020663", "SRR000020670", "SRR000020680", "SRR000020690", "SRR000026811"))

#Calculate mean tx length
txi.length.df_sub$avgTx <- rowMeans(txi.length.df_sub)

#Load file containing the list of genes DE from DeSeq2 output
DE.genes <- read.delim(file = "/users/rimlandca/desktop/RNA Seq Results/GOseq files/DEGs/Upreg overlap.txt")

#Load file containing GO terms (downloaded GO terms from Ensembl Biomart online for protein coding genes)

go.terms <- read.csv("/users/rimlandca/desktop/RNA Seq Results/GOseq files/Category Mappings/GO term.txt", as.is=TRUE)
head(go.terms)
names(go.terms) <- c("Gene.stable.ID","GO.term.accession")
summary(go.terms)

#Background Genes
expressed.genes <- row.names(txi.length.df_sub)

#Gene Length
gene.lengths <- txi.length.df_sub$avgTx

#Make vector where DEGs = TRUE and non-DEGs = FALSE

DEgenes <- expressed.genes %in% DE.genes$ID
names(DEgenes) <- expressed.genes
head(DEgenes)
sum(DEgenes)

#Run GoSeq!

nullp.result <- nullp(DEgenes = DEgenes, bias.data = gene.lengths)

GO <- goseq(pwf = nullp.result, gene2cat = go.terms)

write.table(GO[GO$over_represented_pvalue < 0.01],row.names=FALSE,file="GO_terms.txt", quote = FALSE,col.names = FALSE)

ADD REPLY • link 7.1 years ago casey.rimland ▴ 170

1

Entering edit mode

There's a lot of code here, but I'm confused. If you need the gene or transcript lengths, it's simply:

lengthData <- rowMeans(txi$length)
lengthData <- lengthData[ x ]

Where 'x' is a vector of the row names in the order you want them (lined up with DEgenes)

ADD REPLY • link 7.1 years ago Michael Love 43k

0

Entering edit mode

Yea, I am still at the learning R stage where I tend to be very very inefficient with my code :D!

I have edited the above post with updated code and made it much more efficient. In my "banging my head" state yesterday I did a ton of rather useless steps I now realize.

The pwf plots are still looking a bit off I think, especially the second one? They are for two different DEG lists for contrasts between different biological groups.

https://imgur.com/a/1kyV8

https://ibb.co/iw8Bcx

Also, one more question: is it acceptable to use the built in goseq database's GO classification for hg19, even though I technically did not map to hg19 since I used Salmon? For some reason the data I downloaded from Ensembl Biomart for GO classifications seems to be different to what GOseq has and is maybe not as complete...

GO <- goseq(pwf = nullp.result, "hg19", "ensGene") gives me 22,205 GO classifications.

GO <- goseq(pwf = nullp.result, gene2cat = go.terms) gives me 17,705 GO classifications

p.s. Go Tar Heels!

ADD REPLY • link 7.1 years ago casey.rimland ▴ 170

0

Entering edit mode

The second plot is definitely not correct. I don't know what these are supposed to look like other than the power should increase with length. I'd make sure that you are providing the correct DEgenes, because it looks like you may have swapped DE and non-DE status.

For the second part, what matters is that the gene IDs that you used after quantifying and summarizing to gene-level match up with the gene IDs in GO. I'm not sure exactly about the internal mechanics of goseq().

ADD REPLY • link 7.1 years ago Michael Love 43k

0

Entering edit mode

I am pretty confident I included correctly the DE genes n the second plot as the number of "True"/"1" is equal to the number of differentially expressed genes in the list I imported... To get this list, I have been taking an overlap of genes significantly up-regulated in Tissue C when compared to either Tissue A or B.

The second plot was definitely odd though and I can't figure out why... It seems like a similar problem to what was posted here: Question: GoSeq: opposite pwf plot for up- and downregulated genes. But there wasn't really a clear resolution or reason for why the plots were coming out reversed when looking at subsets of genes that either up or down-regulated...

My first pwf plot which looks normal-ish is for the overlap of genes upregulated in Tissue A when compared to Tissue B and Tissue C. Similar thing holds true for the Tissue B upregulated genes. Tissue C is where it gets odd...

Tissue A upregulated (https://ibb.co/fD4wUc)
Tissue B upregulated (https://ibb.co/dKGh2x)
Tissue C upregulated (https://ibb.co/j7FLpc)

So something weird is happening with the list for Tissue C it seems?... Is it biologically possible that perhaps the genes DE in Tissue C are just shorter?

ADD REPLY • link 7.1 years ago casey.rimland ▴ 170

score 0 · Answer 1 · 2018-03-17

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 1 day ago

United States

Yes it can, and it's arguably a better value to use than what is in the database anyway.

ADD COMMENT • link 7.1 years ago Steve Lianoglou ★ 13k