Differential expression analysis for STAR-aligned RNA-Seq data
1
0
Entering edit mode
sum31278 • 0
@sum31278-11334
Last seen 8.0 years ago

I have done alignment for paired-end RNA-Seq data using STAR. I now want to do differential expression analysis using the following workflow:

http://www.bioconductor.org/help/workflows/rnaseqGene/

I have a few questions before I proceed.  

1) I have generated both Unsorted and SortedByCoordinate BAM files. Which one should I use for differential expression analysis?

2) I used the option --sjdbGTFfile during alignment. Do I need to modify the downstream workflow because of this?

3) I prepared a sampleTable as suggested in workflow. It has just two columns: sample name and group (case, control). Do I need to add other details or it would be sufficient.

Thanks.

Sumit Paliwal

 

 

 

 

deseq2 rnaseq STAR • 5.2k views
ADD COMMENT
0
Entering edit mode

There is more than one Bioc tools available for counting reads. featureCounts in Rsubread package is one of them and it has the same speed for counting reads in location-sorted bam files and unsorted bam files. Counting 20 million reads only costs about half a minute.

ADD REPLY
0
Entering edit mode

Hi Wei Shi,

I did use featureCounts as follows:


> fc <- featureCounts(files= filenames, annot.ext = "Rnor_6.0.84.gtf", isGTFAnnotationFile =TRUE, GTF.featureType = "exon", GTF.attrType = "gene_id", chrAliases=NULL, useMetaFeatures = TRUE, allowMultiOverlap = FALSE, isPairedEnd=TRUE, requireBothEndsMapped = TRUE, checkFragLength = TRUE, nthreads=6)
   
|| Load annotation file Rnor_6.0.84.gtf ...
||    Number of features is 305352 
||    Number of meta-features is 32662 
||    Number of chromosomes is 162                

                          
|| Process BAM file SM-C-F1_Aligned.sortedByCoord.out.bam...                  ||
||    Assign fragments (read pairs) to features...                            ||
||    Each fragment is counted once.                                          ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
||    128001 reads have missing mates.                                        ||
||    Input was converted to a format accepted by featureCounts.              ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
|| Process BAM file SM-C-M5_Aligned.sortedByCoord.out.bam...                  ||
||    Assign fragments (read pairs) to features...                            ||
||    Each fragment is counted once.                                          ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||


There are 5 BAM files but it was not progressing beyond this point.  I am using R for Ubuntu-14.04-LTS on a system that has 64GB RAM with 8-cores. Any suggestions.

One more thing to notice is that  for GTC file , the number of chromosomes it shows is 162. Is something wrong here?

Thanks

ADD REPLY
0
Entering edit mode

Is your Rsubread version up to date?

ADD REPLY
0
Entering edit mode

I am using 'Rsubread' version 1.12.6.

ADD REPLY
0
Entering edit mode

Your Rsubread version is 5 years old. You need to update both your R and Rsubread to their latest version.

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 5 days ago
United States

1) You need to decide first how you are going to count the reads, which tool you will use from the table in the workflow. I believe they are all fastest if you provide name sorted BAM files, but you should check the documentation of the tool you choose to use.

2) no

3) That is sufficient, but the important thing is that you confirm that the files you provide to the counting tool are in the same order as the rows of sampleTable. We make sure this is the case by having the files be a column of the sampleTable and provide this column to the counting tool.

ADD COMMENT
0
Entering edit mode

Thank you Michael Love for the reply. However, I did not understand how the order of the file would make the difference.

ADD REPLY
0
Entering edit mode
Just make sure the order of samples in colData matches the columns of the count matrix.
ADD REPLY
0
Entering edit mode

Hi Michael, 

I am following the workflow as suggested.

When I ran the following command, after sometime the system responded with a warning stating it is low on memory and R should be closed.


se <- summarizeOverlaps(features=ebg, reads = bamfiles, mode = "Union", singleEnd = FALSE, ignore.strand = TRUE, fragments =TRUE)


To solve this issue I did the following:


library("BiocParallel")
register(SnowParam(stop.on.error = TRUE, timeout = Inf)


However, it displayed  the following warning: 


Warning message: In initRefFields(.self, .refClassDef, as.environment(.self), list(...)) : NAs introduced by coercion to integer range


 Also, it did not solve the purpose.

I have generated the alignment files using paired end reads with a total of 50 million reads/sample.  Here, I am processing 5 files.

I am using R-3.3.1 (64-bit) for windows. The system has 64GB RAM with 8-cores.

I am surely missing something. Please suggest

ADD REPLY
0
Entering edit mode

There are numerous ways to produce a count matrix suitable for DESeq2. If you are having difficulty with one, I'd recommend you try an alternative method, for example featureCounts() from the Rsubread package, or you can use a lightweight transcript quantifier such as Salmon, Sailfish or kallisto and then use the Bioconductor package txmiport to build a DESeqDataSet.

ADD REPLY

Login before adding your answer.

Traffic: 415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6