Hi all,
I have previously worked directly with read count files, but this is my first time generating read counts from FASTQ files.
QIAseq miRNA Library prep was used for the experiment. (I was not the one who performed the library prep; I just received the FASTQ files.)
After reading a few tutorials and following http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html, Salmon seems to be the state-of-the-art method for generating read count files. My project is specifically focused on small non-coding RNAs (sncRNAs).
Can someone please help me with these questions?
- I want to identify specific sncRNAs as biomarkers (for example, miR-182-5p being increased in cancer patients). Is it necessary to create read count files with genes as row names? Can read count files with sncRNAs as row names be created directly, rather than first creating one with genes and then mapping it to sncRNAs with something like miRBase?
- Is Salmon (following the steps in the link above) recommended for quantifying sncRNAs? https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4869-5 states that alignment-free tools are not ideal for small RNAs.
I'll tell you what I know ... I think we need to start a biocfortunes package ...
Thank you for sharing your thoughts.
I used Salmon with an index built from the hg38 GENCODE transcriptome. After the tximport step, I ended up with a read count file containing a significant number of protein-coding genes. I am not sure whether this is some impurity in the sample or something wrong with the Salmon quantification (which was run without adapter trimming).
I am hoping that adapter trimming will give better results.
If the sequences are 20-something bp long, then it seems logical to me to throw away any reads that are longer than that after adapter trimming, at least if you want to quantify the mature sequences. That would at least remove the obvious noise. For salmon, how did you index the transcriptome? You would need to lower the default k-mer size of 31 quite a bit to even be able to map these small sequences. And if so, I think you would need to use the genome as a decoy to get a notable advantage from salmon. You would actually not need tximport, as small RNAs do not have isoforms, do they? Can you add some details? By the way, is this normal RNA-seq or specifically small RNA-seq? If the former, these small RNAs will not be properly represented, both because standard RNA extraction kits do not capture them well and because the size selection steps during library prep are optimized for fragments notably longer than 20-something bp (rather in the 200 bp range).
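To make the k-mer point concrete, here is a minimal sketch (plain Python, not salmon code; the function name `kmers_per_read` and the 22 nt read length are just for illustration). A read of length L contains L - k + 1 k-mers, so a read shorter than k contains none and cannot be matched against a k-mer index at all.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Return how many k-mers a read of this length contains (0 if shorter than k)."""
    return max(0, read_length - k + 1)


# A 22 nt read, roughly the size of a mature miRNA (illustrative value).
for k in (31, 21, 15):
    print(f"k={k}: {kmers_per_read(22, k)} k-mers")
```

With k = 31 a 22 nt read yields no k-mers at all, while k = 21 leaves two and k = 15 leaves eight, which is why the index has to be built with a much smaller k for this kind of data.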
Thank you for your suggestions.
Throwing away longer reads sounds right. I'll try using Cutadapt for that, in addition to adapter trimming.
I did realize the default k-mer size is large. I'm rerunning with k values of 10 and 20.
I did not quite understand the reason for not using tximport. I have been wondering whether mapping to genes is necessary: can reads be mapped directly to miRNA or other sncRNA names?
The QIAseq miRNA Library kit was used. I was not the one who prepared the samples; I just received the FASTQ files.
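For reference, the length filter I have in mind looks roughly like the sketch below. Cutadapt can do this itself through its `--minimum-length` and `--maximum-length` options, so this is only meant to show the logic. The helper names and the 18 to 30 nt window are my own assumptions, not values from the kit documentation, and the reader assumes plain 4-line FASTQ records.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of lines (assumes unwrapped records)."""
    it = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times groups consecutive lines into records.
    yield from zip(it, it, it, it)


def filter_by_length(
    records: Iterable[FastqRecord], min_len: int = 18, max_len: int = 30
) -> Iterator[FastqRecord]:
    """Keep only records whose sequence length lies in [min_len, max_len]."""
    for record in records:
        if min_len <= len(record[1]) <= max_len:
            yield record


# Toy input: one miRNA-sized read, one long read, one very short read.
example = [
    "@read1\n", "A" * 22 + "\n", "+\n", "I" * 22 + "\n",
    "@read2\n", "C" * 50 + "\n", "+\n", "I" * 50 + "\n",
    "@read3\n", "G" * 10 + "\n", "+\n", "I" * 10 + "\n",
]
kept = list(filter_by_length(read_fastq(example)))
print([header for header, _, _, _ in kept])
```

On real data the lines would come from the trimmed FASTQ file (for example via `gzip.open(path, "rt")`) instead of the toy list.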
k has to be odd, so I'm trying 9 and 21 instead.
I did use the genome as a decoy, and I used the whole transcriptome from GENCODE for indexing. Maybe I should index using just the non-coding transcriptome. However, only the lncRNA transcriptome was available from GENCODE, which is not what I want.
I wonder what would be a good metric to assess the quantification quality.
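As far as I understand, the usual reason for requiring odd k is that a k-mer of even length can be identical to its own reverse complement, which makes its strand ambiguous. With odd k this cannot happen, because the middle base would have to be its own complement. A small sketch to convince myself (plain Python, helper names are mine):

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_complementary(k: int) -> int:
    """Count k-mers over ACGT that equal their own reverse complement."""
    return sum(
        1
        for bases in product("ACGT", repeat=k)
        if "".join(bases) == reverse_complement("".join(bases))
    )


print(reverse_complement("ACGT"))      # an even-length k-mer that maps to itself
print(count_self_complementary(4))     # even k: some k-mers are self-complementary
print(count_self_complementary(5))     # odd k: none are
```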
You may well get lots of protein-coding genes, depending on what's in the transcripts file. Do note that most miRNA transcripts are complementary to some portion of the mature mRNA of their targets, which is how they work after all, so it's not unexpected that salmon would count reads complementary to a transcript as a hit. There may be some way to tell salmon not to do that, but in general I would not use salmon; I would probably use bowtie instead. You absolutely don't want gapped alignments, and you don't want the aligner to think that you have paired-end reads (you don't, do you? That would be hilarious), because one of the pairs will be on the opposite strand. You could look at the documentation for GeneGlobe to see if they say how they do the alignment, but I am pretty sure they just use bowtie.
Thanks, I will try out Bowtie.
Are GeneGlobe and Qiagen the same company? According to https://www.qiagen.com/us/resources/download.aspx?id=bea2dcfa-0a5c-47c5-afd8-8b0fe90a471a&lang=en, Bowtie is used.
If you used the QIAseq miRNA library kit, then you are wasting your time. That kit adds UMI barcodes, and the intended workflow is to upload the FASTQ files into GeneGlobe and process them there to get the UMI counts. Doing anything else, unless you are an expert and have some special knowledge that allows you to do better than Qiagen, is counter-productive.
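For context on what a UMI count is, here is a minimal sketch of the general idea: reads that carry the same UMI and are assigned to the same miRNA are treated as PCR copies of one original molecule and counted once. This is not GeneGlobe's actual algorithm (which I have not seen, and which may, for example, also correct sequencing errors in the UMIs); the function name and the UMI sequences are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def umi_counts(assignments: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UMIs per feature from (feature, umi) pairs, one pair per read."""
    umis_by_feature: Dict[str, Set[str]] = defaultdict(set)
    for feature, umi in assignments:
        umis_by_feature[feature].add(umi)
    return {feature: len(umis) for feature, umis in umis_by_feature.items()}


# Hypothetical read assignments: (miRNA name, UMI sequence).
reads = [
    ("hsa-miR-182-5p", "AACGTTAGCCTA"),
    ("hsa-miR-182-5p", "AACGTTAGCCTA"),  # same UMI again, treated as a PCR duplicate
    ("hsa-miR-182-5p", "GGTACCATTGCA"),
    ("hsa-miR-21-5p", "TTGACCGATACG"),
]
print(umi_counts(reads))
```

Here three reads for miR-182-5p collapse to two UMIs, so the UMI count is lower than the raw read count whenever duplicates are present.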
Oh, thank you for that.
Do they allow downloading the UMI counts file, or do they only show visualizations and results based on it? And can the UMI counts file be used just like a read count file (i.e. normalized and then used for DE analysis)?
That question is about GeneGlobe, which is a Qiagen product. I would recommend going to their site and figuring it out for yourself.
Alright thanks