outlier detection of RNAseq samples
wd ▴ 30
@wd-7410
Last seen 4.7 years ago
Germany

Hi

I have RNA-seq data for six different treatments (A, B, C, D, E, F) of a model organism, with four biological (NOT technical) replicates per treatment.

FastQC revealed no abnormalities in the RNAseq data, and after normalization (rlog transformation) with DESeq2 I generated a PCA plot (using the 500 most variable genes).

Based on the PCA plot (see link: http://imgur.com/NVcWv5j) and a hierarchical clustering (HC) analysis (not shown), I would think that the dots marked with a rectangle (1, 2, 3) can be considered outliers and might be left out of further differential expression analysis (between treatments).

However, this is just based on visual inspection of the PCA/HC analysis. I was wondering if there is any objective metric to determine whether an RNAseq sample can be considered as an outlier (instead of just by visual inspection of PCA, like most papers do).

In a recent paper, Conesa et al. 2016 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8) state the following:

"Reproducibility among technical replicates should be generally high (Spearman R2 > 0.9) [1], but no clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system."

So, based on Conesa et al. 2016, one might consider including all replicates (incl. outliers), but then you might end up with a lower number of differentially expressed genes between treatments...

Any advice/help regarding this topic would be much appreciated.

deseq2 rnaseq PCA hierarchical clustering outlier

Based on eye-balling your PCA plot, I am not sure you can justify excluding the marked points as outliers. Your sample size is quite low (statistically speaking - I know that it's hard to have more) and the variability is not so small as to clearly flag the points as 'wild' outliers. But if you want to use a statistical test for outlier removal, you can calculate the mean (or median) pairwise distance (within group or maybe for all groups pooled) and the standard deviation. Then you can flag those points that lie more than 2 sd from the mean/median. I'll note though that while this makes it consistent between groups, the threshold is still arbitrary (although frequently used).
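A minimal sketch of one way to read this rule, in plain Python: for each sample, take its mean Euclidean distance to all other samples, then flag samples whose value exceeds the mean of those values by more than 2 sd. The function names, the toy coordinates (standing in for PC scores or rlog values), and the one-sided cut are assumptions made for illustration, not something prescribed in the thread.

```python
import math
import statistics

def mean_pairwise_distances(samples):
    """For each sample, the mean Euclidean distance to every other sample."""
    names = list(samples)
    result = {}
    for a in names:
        dists = [math.dist(samples[a], samples[b]) for b in names if b != a]
        result[a] = statistics.mean(dists)
    return result

def flag_outliers(samples, n_sd=2.0):
    """Names of samples whose mean distance exceeds mean + n_sd * sd."""
    mpd = mean_pairwise_distances(samples)
    centre = statistics.mean(mpd.values())
    spread = statistics.stdev(mpd.values())  # sample sd (n - 1 denominator)
    return [name for name, d in mpd.items() if d > centre + n_sd * spread]

# Hypothetical coordinates: seven samples close together, one far away.
pooled = {
    "S1": (0.0, 0.0), "S2": (1.0, 0.0), "S3": (0.0, 1.0), "S4": (1.0, 1.0),
    "S5": (0.5, 0.5), "S6": (0.0, 0.5), "S7": (1.0, 0.5), "S8": (10.0, 10.0),
}
# Hypothetical single group of four replicates, one of them far away.
small_group = {
    "A1": (0.0, 0.0), "A2": (1.0, 0.0), "A3": (0.0, 1.0), "A4": (50.0, 50.0),
}

print(flag_outliers(pooled))
print(flag_outliers(small_group))
```

One caveat worth knowing when applying this within a group: with the sample standard deviation, no value among n can lie more than (n - 1) / sqrt(n) sd from the mean, which is 1.5 for n = 4. A 2 sd cut therefore cannot flag anything inside a group of four replicates, however extreme one of them is, so the rule only has a chance of firing when groups are pooled.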

@ryan-c-thompson-5618
Last seen 6 weeks ago
Icahn School of Medicine at Mount Sinai…

The samples you have highlighted are certainly farther than average from the group means, but I wouldn't consider them outliers. For example, the highlighted sample from group A is far away from the others along PC2, but all of group A is spread out over PC2, so this is not out of the ordinary, and group A does cluster tightly along PC1. Similarly, group B clusters tightly along PC2. So in both groups, there is clearly at least a subset of genes that are quite consistent within the groups.

If you are really concerned that these samples may be dragging down your analysis, I recommend you use voomWithQualityWeights from the limma package. It will attempt to identify and down-weight outlier samples in the analysis. In addition, you can compare the list of lowest-weighted samples to the list of outlier samples that you identified by eye to see if the weighting method matches your intuitions.


I agree with Ryan. Probably it's just large within-group variability relative to the large-scale differences across groups. But you can try out limma's quality weighting to see if it helps.

@peter-langfelder-4469
Last seen 4 weeks ago
United States

As far as "objective" measures to identify outlier samples go, I would check out the article by Oldham et al. (myself included): Network methods for describing sample relationships in genomic datasets: application to Huntington's disease. BMC Syst Biol. 2012 Jun 12;6(1):63. PMID: 22691535.

The gist of the method is to sum each sample's distances to all other samples (or, conversely, its network adjacencies in a sample network), standardize the sums, and flag as outliers the samples with a high standardized distance (or, conversely, a strongly negative standardized connectivity).
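A rough sketch of the connectivity variant in plain Python, under stated assumptions: the adjacency between two samples is taken here as ((1 + Pearson correlation) / 2) raised to a power of 2, and the cutoff of -2 is an arbitrary choice. Both, along with the function names and toy expression profiles, are illustrative assumptions and may differ from the definitions and defaults used in the paper.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def standardized_connectivity(profiles, power=2):
    """Z-score of each sample's summed adjacency to all other samples."""
    names = list(profiles)
    k = {}
    for a in names:
        # Assumed adjacency: ((1 + cor) / 2) ** power, summed over other samples.
        k[a] = sum(
            ((1 + pearson(profiles[a], profiles[b])) / 2) ** power
            for b in names if b != a
        )
    mean_k = statistics.mean(k.values())
    sd_k = statistics.stdev(k.values())
    return {name: (value - mean_k) / sd_k for name, value in k.items()}

def low_connectivity_samples(profiles, cutoff=-2.0, power=2):
    """Names of samples whose standardized connectivity is below the cutoff."""
    z = standardized_connectivity(profiles, power)
    return [name for name, score in z.items() if score < cutoff]

# Hypothetical expression profiles (one value per gene) for eight samples;
# S8 runs opposite to the rest.
profiles = {
    "S1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "S2": [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],
    "S3": [0.9, 2.2, 2.8, 4.1, 5.2, 5.9],
    "S4": [1.1, 2.1, 3.2, 3.9, 5.1, 6.2],
    "S5": [1.0, 1.8, 3.0, 4.3, 4.9, 6.0],
    "S6": [0.8, 2.0, 3.1, 4.0, 5.0, 6.3],
    "S7": [1.1, 2.0, 2.9, 4.1, 5.1, 5.8],
    "S8": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}

print(low_connectivity_samples(profiles))
```

Because the score is a z-score across samples, the cutoff remains a judgment call, much like the 2 sd rule discussed earlier in the thread; the gain is that every sample is scored by the same explicit procedure rather than by eye.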

wd ▴ 30
@wd-7410
Last seen 4.7 years ago
Germany

Dear Fabian, Ryan, Michael and Peter

Thank you for your valuable advice! Very much appreciated.

Kind regards

Wannes

 
