Visualization tool for GC-bias in SE data?
1
0
Entering edit mode
AKSR ▴ 20
@aksr-5026
Last seen 4.5 years ago

Need your help with 2 related questions, please. Thanks, in advance!

QUESTION 1

From this ALPINE link, under the section Description:

It is currently designed for un-stranded pairedend RNA-seq data.

So for both visualization and correction of GC-bias in RNA-Seq, I understand ALPINE is a preferred tool, but requires PE data. Right? Is there a GC-bias visualization tool for Single End Illumina RNA-Seq reads?

QUESTION 2

Is there a GC-bias correction tool for Single End Illumina RNA-Seq reads?

GC-bias ALPINE visualization • 1.8k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
3
Entering edit mode
@mikelove
Last seen 5 days ago
United States

Alpine requires PE. For SE, you don’t know the size of the fragment, but can try to infer it. An approximation with longer SE reads is to just look at read GC. I always recommend MultiQC with the FASTQC GC curves and even wrote a plugin to draw a theoretical GC curve for human or mouse transcriptomes.

https://multiqc.info/docs/#theoretical-gc-content

ADD COMMENT
0
Entering edit mode

Thank you, this is helpful. A naive follow-up question, sorry:

What R syntax lines / package(s) do I use to directly input transcript FASTA file to generateDistn of your fastqcTheoreticalGC BioConductor package, to generate the final 2 column tsv file of %GC Vs. % of genome for use with MultiQC?

Does the syntax below look OK to you (I did generate a text file that looked OK, but still, to be 100% sure), THANKS!:

library("BSgenome")
library("fastqcTheoreticalGC")

Input_FASTA <- readDNAStringSet("Alltranscripts.fasta", "fasta")
generateDistn(seqs=Input_FASTA,
              file="fastqc_theoretical_gc.txt",
              name="MyFaveSpecies")
ADD REPLY
0
Entering edit mode

What organism are you using? If it's human or mouse, you don't have to generate the distribution, just use the built-in curves.

ADD REPLY
0
Entering edit mode

Medicago truncatula, a legume plant species

ADD REPLY
0
Entering edit mode

Ok so the first argument of generateDistn is actually described in the man page, see:

?generateDistn

It says you should input a DNAStringSet of transcript sequences. There are many places to obtain a transcript FASTA file and then you can read this into R using the Biostrings package to obtain something to provide to the first argument of generateDistn.

ADD REPLY
0
Entering edit mode

Yes, thank you. Since my original comment, I'd updated it to reflect the syntax I used ( I think successfully) to obtain the end result. I am not sure if the syntax lines using readDNAStringSet, followed by your generateDistn are visible to you. If they are visible, I request you please confirm if it looks OK to you, is what I am requesting you. Thanks, in advance.

ADD REPLY
1
Entering edit mode

Yes, this looks correct

ADD REPLY
0
Entering edit mode

Thanks a lot, I appreciate it.

ADD REPLY
0
Entering edit mode

With your help I added this dashed GC theoretical curve for the transcriptome for the species of interest. Thanks Michael - the green curves are for the ~ 150 libraries I want to check GC-bias for (single ended, cuz sorta old data) Theoretical-Vs-Observed-GCcurves

In this image, does the shift of actual data (green) to the right of the theoretical GC curve (black dotted) mean that there is indeed some GC bias, resulting on-average in the preferential sequencing of genomic regions with a slightly higher GC % than would be expected if there were no bias? Could you please confirm or correct my interpretation? Thanks, in advance!

ADD REPLY
1
Entering edit mode

My guess is that there is probably not technical bias (which looks a little different, see alpine paper), but instead you are expressing some genes with higher GC content than what is computed by the theoretical curve (all transcripts at a constant rate). You can actually modify the "weights" per transcript in the script if you want to explore that - you could put in the estimated expression of each transcript instead of constant.

ADD REPLY

Login before adding your answer.

Traffic: 447 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6