I have an issue with the density plot of my microarray data. The data span 6 chips of the Illumina HT12 v4 beadarray. I loaded all the raw data into Illumina IGS and extracted the SampleProbeProfile and ControlProbeProfile, then performed neqc normalization and transformation in limma. The PCA plot separates the samples into two clearly distinct clusters. The density plots of the normalized data also form two separate peaks: some curves have one peak and some the other, and these correspond to the two clusters. Can I continue using these data or not? I also tried standardizing them (mean = 0, sd = 1); the two peaks remain, but this time they are much closer together. Should I use this second transformation instead?
I see two populations that correspond to different treatments, so it falls into the second case you mention... the alarming one :)
Here I provide my code for better understanding:
library(limma)
library(RColorBrewer)
# The expression profiles for all the samples (parental and gefR) together
raw.data <- read.ilmn(files="Sample_Probe_Profile.txt", ctrlfiles="TableControl.txt", other.columns="Detection")
propexpr(raw.data) # proportion of probes that are truly expressed; around 52% here
table(raw.data$genes$Status) #there are 769 negative control probes
# Normalization and filtering
norm.data <- neqc(raw.data)
expressed <- rowSums(norm.data$other$Detection > 0.99) >= 2 # keep probes detected (Detection > 0.99) in at least 2 arrays, i.e. filter out probes that are not expressed
norm.data <- norm.data[expressed,]
dim(norm.data) # [1] 23922 24
colnames(norm.data) = c(rep('CrizoG1',3), rep('CrizoG3',3), rep('LDKG3',3), rep('AlecG2',3), rep('AlecG3',3), rep('APG1',3), rep('APG2',3), rep('Par',3)) # Crizo, Alec, AP and LDK are different treatments; G1, G2 and G3 are different generations of the same treated population; Par is the parental line
targets <- readTargets() # targets file with information about the treatment (which drug) and the replicates (3 technical replicates per condition)
# Plot densities after normalizing and cleaning (truly expressed only) the data
require(graphics)
numOfBioReplicates = targets$BiolRep[length(targets$BiolRep)] # number of biological samples = last entry of BiolRep
col.palette = rainbow(numOfBioReplicates) # one colour per biological sample
plot(density(norm.data$E[,1]), ylim=c(0,0.3), main='Normalized and Cleaned Expressions', xlab='log2 Expression Values', ylab='Density')
# Density plot of all the arrays: technical replicates of the same sample get the same colour,
# different colours only for different biological samples
for (i in 2:ncol(norm.data$E)) lines(density(norm.data$E[,i]), col=(col.palette[factor(targets$BiolRep)])[i])
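For reference, here is a minimal sketch of the standardization (mean = 0, sd = 1) and of a PCA plot as mentioned above; it shows just one possible way to do them (per-array scale() and prcomp() on the transposed matrix), so the details may differ from what was actually run:
# Sketch: standardize each array (column) to mean 0 and sd 1, then re-plot the densities
std.E <- scale(norm.data$E) # scale() centres and scales each column, i.e. each array
plot(density(std.E[,1]), main='Standardized Expressions', xlab='Standardized log2 Expression', ylab='Density')
for (i in 2:ncol(std.E)) lines(density(std.E[,i]), col=(col.palette[factor(targets$BiolRep)])[i])
# Sketch: PCA with the arrays as observations
pca <- prcomp(t(norm.data$E))
plot(pca$x[,1], pca$x[,2], xlab='PC1', ylab='PC2', pch=16, col=col.palette[factor(targets$BiolRep)])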
Thank you very much for any feedback!
Best,
Eleni
Dear Eleni,
again, I'm not an Illumina expert, but I would assume/guess that the detection p-value is significant, i.e. very small, if a probe's signal is different from/above the negative controls -- yes or no? :-)
If yes, then your filtering step keeps probes only if they have at least two (out of 24) "very bad" detection p-values (> 0.99), i.e. you would be selecting probes that are *not* expressed in at least two samples. That doesn't make sense to me.
If norm.data$other$Detection is supposed to be 1 if the probe actually is expressed, my assumption/guess is wrong and I apologize :-). But I'd anyway try a density plot without the filtering step since I cannot see any other potential reason for your problem in your code.
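To make the two possibilities concrete, here is a small sketch of both filters side by side (the 0.05 cut-off is just an example value; I am assuming the Detection matrix is the one read in via read.ilmn's other.columns):
det <- norm.data$other$Detection # probes x arrays
# Case 1: Detection is a detection p-value (small = expressed), so "expressed" probes would be selected like this:
keep.if.pvalue <- rowSums(det < 0.05) >= 2 # 0.05 is only an example cut-off
# Case 2: Detection is close to 1 when the probe is expressed, so your original filter is the right way round:
keep.if.score <- rowSums(det > 0.99) >= 2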
Hope this helps,
- axel
Thank you, Axel, for taking the time to look into my problem. Actually, the value I see there is not the detection p-value (as the neqc() guide suggests) but an exceedance p-value. Thus norm.data$other$Detection is supposed to be 1 if the probe actually is expressed, so the higher it is the better (I had a long discussion with Gordon Smyth about this on a previous experiment I analyzed - see https://support.bioconductor.org/p/77385/#77700).
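For reference, a quick way to see this on the data itself is to compare the Detection values of the negative control probes with those of the regular probes (a sketch; the exact Status label for the negative controls may differ between files, hence the tolower()):
neg <- tolower(raw.data$genes$Status) == "negative" # label may be 'negative' or 'NEGATIVE' depending on the input file
summary(as.vector(raw.data$other$Detection[neg, ]))  # negative control probes
summary(as.vector(raw.data$other$Detection[!neg, ])) # regular probes
# With exceedance values, the regular probes sit near 1 and the negative controls near 0;
# with detection p-values it is the other way round.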
But thank you very much for the input!
Best regards,
Eleni
Actually, I also tried a density plot without the filtering, and the distributions were well aligned, one on top of the other. However, I think I should do some filtering to capture the truly expressed genes; otherwise I may end up with many false positives, don't you think?
Thanks a lot!
Eleni
Dear Eleni,
sorry for the late response; I'm on holiday and hence very busy... :-)
Some thoughts:
- AFAIK independent filtering is supposed to increase the power in the massive multiple testing situation, i.e., reducing the number of false negatives rather than false positives.
- it should be independent of your experimental groups and your density plots suggest that this is not the case for your detection-based filtering.
- you may of course consider another filtering criterion. There is the PNAS paper by Bourgon et al. (http://www.ncbi.nlm.nih.gov/pubmed/20460310) and some, hmm, discussion about it on this list (https://support.bioconductor.org/p/53061/).
- if the experts tend to slightly disagree, what should the average applied bioinformatician, who is usually expected to deliver analysis results by yesterday, do? *I* usually start with no filtering at all; often we have hundreds or thousands of DEGs and little need for additional power. A more practical consideration is that some customer will inevitably scroll down the top table looking for his or her favourite gene XYZ and conclude that my/your analysis is wrong if that gene does not show up because it was removed by a filter.
- yet another thought is that you will of course be interested in genes that are consistently below the detection level in one experimental group and above it in another, even though fold change, p-value etc. cannot easily be quantified for such genes. Hence you should not require detection in more than your smallest group size of arrays (see the sketch after this list). In your particular setting, with many rather small groups, I would be very careful with filtering by detection.
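To illustrate the last two points, here is a small sketch of (a) a detection filter that requires no more arrays than your smallest group and (b) an overall-intensity filter that ignores detection and group labels entirely; the 0.99 cut-off and the group size of 3 are taken from your code, while the intensity quantile is an arbitrary example:
det <- norm.data$other$Detection # Detection matrix of the neqc output (ideally before any other filtering)
# (a) detection-based, requiring no more arrays than the smallest group (3 here):
keep.det <- rowSums(det > 0.99) >= 3
# (b) overall-intensity-based, independent of detection and of the group labels:
avg <- rowMeans(norm.data$E)
keep.int <- avg > quantile(avg, 0.25) # 0.25 is an arbitrary example cut-off
filtered <- norm.data[keep.det, ]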
Just a couple of thoughts; it is your data and hence up to you to make a decision that you can justify... :-)
Hope this helps,
- axel