Question

duplicateCorrelation function and custom array design


Bela Tiwari ▴ 60


Hello,

I have been working with data given to me by biologists who designed and printed their own arrays. In short, they have 48 blocks with 1089 spots per block. Duplicate spots of a "gene" are printed on the same block, i.e. a gene appears on two spots of an array, both within a single block.

I wanted to run the duplicateCorrelation function on their data, and in so doing I discovered rather a lot about their array layout which, in hindsight, I should have asked about first. Genes and species-specific control spots have been spotted twice per block, with a nice even spacing of 520 between them. However, there are also blank "buffer" wells, which cause problems because they do not appear to be at a 520 spacing, and including them in an object passed to duplicateCorrelation causes a failure when it reaches the unwrapdups function. The latter function sensibly expects the number of spots on an array, divided by the "spacing" and by the number of duplicates, to give a whole number. Because of the way the blank buffer wells have been laid out and labelled in the GAL file, data with the array layout they defined does not satisfy this. I have to be grateful in some ways that this function failed and I stopped to find out a lot more about the array layout!

So, here are my questions.

Firstly, after normalising my RGList object and getting an MAList object, I can get a list of indices for just the genes and species-specific controls. I have read through the duplicateCorrelation function and believe that if I input the indices into the function sensibly, I can get a value for the correlation. However, just because I believe it doesn't mean it's true! Below are the lines I have changed in the duplicateCorrelation function so that only the gene and control spots are used to generate the correlation values. If anyone out there knows this function well and has a moment, can they check that what I have done at least verges on sensible? And if not, any advice on how to deal with this situation would be most welcome!

Effectively, only 3 lines have changed: the parameter list now has indices (an integer vector), and M and weights now take in only those entries from the object with those indices.

    mydupcorr <- function (object, indices, design = rep(1, ncol(M)), ndups = 2,
                           spacing = 1, block = NULL, trim = 0.15, weights = NULL)
    {
        if (is(object, "MAList")) {
            M <- object$M[indices,]  # altered to add indices
            if (missing(design) && !is.null(object$design))
                design <- object$design
            if (missing(ndups) && !is.null(object$printer$ndups))
                ndups <- object$printer$ndups
            if (missing(spacing) && !is.null(object$printer$spacing))
                spacing <- object$printer$spacing
            if (missing(weights) && !is.null(object$weights))
                weights <- object$weights[indices,]  # altered to add indices
        }

In my case, a sample command would be:

    dupcorr <- mydupcorr(myMAList, indices, design = design1, ndups = 2, spacing = 520)

Hmmm, as I write this, I just realised that I could have just done this all on the command line like:

    dupcorr <- mydupcorr(myMAList[indices,], design = design1, ndups = 2, spacing = 520)

but my essential question remains the same: is this sensible?

My second question is due to my lack of experience with the functions involved. If I try to use the consensus correlation value generated via the above function as input into the lmFit function, will it matter if I include only the MAList elements for my genes and species-specific controls? I.e. does it matter if I give myMAList[indices,] as the object parameter to the lmFit function, rather than the whole MAList object? I don't think lmFit needs to refer back to the array layout as stored within myMAList$printer, but I'm not well versed enough to know whether there are downstream effects of entering only a subset of the MAList object to lmFit.

And finally, if you have made it this far in the email: if anyone has suggestions for web pages, articles, other documents, etc. that give advice on how to design a good array layout, I'd love to hear about them. The biologists I'm working with will be designing some new arrays soon, and tips for how they should lay things out, especially with consideration of the "usual" requirements software programs/functions may have, would be great!

thank you,

Bela Tiwari

*************************
Dr. Bela Tiwari
Lead Bioinformatician
CEH Oxford
Mansfield Road
Oxford, OX1 3SR
01865 281975
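
The whole-number requirement described in the question can be checked before calling duplicateCorrelation. The following is only a rough sketch in base R: the 48 blocks and 1089 spots per block come from the post, but the figure of exactly 520 duplicated features per block (1040 spots, leaving 49 buffer spots) is an assumption, and `dups_layout_ok` is a hypothetical helper name, not a limma function.

```r
# Layout numbers: 48 blocks x 1089 spots are taken from the post.
# ASSUMPTION: each block holds exactly 520 features printed twice.
n_blocks        <- 48
spots_per_block <- 1089
ndups           <- 2
spacing         <- 520

# Hypothetical helper: TRUE when the spots can be folded into
# complete groups of ndups * spacing rows.
dups_layout_ok <- function(n_spots, ndups, spacing) {
  n_spots %% (ndups * spacing) == 0
}

n_all  <- n_blocks * spots_per_block   # all spots, buffers included
n_kept <- n_blocks * ndups * spacing   # spots left after dropping buffers

full_ok   <- dups_layout_ok(n_all,  ndups, spacing)  # buffers included
subset_ok <- dups_layout_ok(n_kept, ndups, spacing)  # buffers removed
```

Under these assumptions the full array fails the check and the buffer-free subset passes it, which matches the failure reported above.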

Updated 20.5 years ago by Gordon Smyth • written 20.5 years ago by Bela Tiwari

score 0 · Answer 1 · 2004-10-29

>Bela Tiwari btiwari at ceh.ac.uk
>Fri Oct 29 16:45:17 CEST 2004
...
>Hmmm, as I write this, I just realised that I could have just done this
>all on the command line like:
>
>dupcorr <- mydupcorr(myMAList[indices,], design = design1, ndups = 2,
>spacing = 520)

There is no need to make your own mydupcorr() function. Just

    dupcorr <- duplicateCorrelation(myMAList[indices,], design = design1, ndups = 2, spacing = 520)
    fit <- lmFit(myMAList[indices,], design = design1, ndups = 2, spacing = 520, correlation=dupcorr$consensus)

is the way to do it. All you need is that myMAList[indices, ] is a valid MAList object with genuinely regularly spaced duplicates.

For example, if your blocks contain 520 genes printed twice, followed by a number of unwanted spots, then you could use

    pr <- printorder(myMAList$printer)
    indices <- pr$printorder <= 2*520

and proceed from there.

Do check that the estimated correlation is reasonably large though. With dups at this spacing you'd expect to get a correlation around 0.6 or higher. If you don't get a correlation which is at least 0.3, say, then something may be wrong.

>but my essential question remains the same - is this sensible?

>My second question is due to my lack of experience with the functions
>involved - if I try to use the correlation consensus value generated via
>the above function as input into the lmFit function, will it matter if I
>include only the MAList elements for my genes and species-specific
>controls? I.e. does it matter if I give myMAList[indices,] as the
>object parameter to the lmFit function, rather than the whole MAList
>object. I don't think lmFit needs to refer back to the array layout as
>stored within myMAList$printer, but I'm not well versed enough to know
>if there are downsteam effects of entering only a subset of the MAList
>object to lmFit or not.

No, there's no problem. Removing blanks and unwanted negative control spots should actually improve the process.

Best
Gordon
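
Before passing a subset to duplicateCorrelation, it may be worth confirming that the retained spots really are "genuinely regularly spaced duplicates". The sketch below does this in base R on a made-up layout. It assumes each block holds 520 features printed twice (positions 1-520 and 521-1040) followed by 49 buffer spots, and that spots are stored block by block in print order. The feature IDs and the names `pos_in_block`, `ids` and `folded` are invented for illustration. On real data the ID vector would come from the gene annotation of the MAList.

```r
# ASSUMED layout: 520 features printed twice per block, then 49 buffer spots.
n_blocks <- 48
n_genes  <- 520
n_buffer <- 49
spots_per_block <- 2 * n_genes + n_buffer   # 1089, as in the question

# Position of each spot within its block (stand-in for the print order)
pos_in_block <- rep(seq_len(spots_per_block), times = n_blocks)
block        <- rep(seq_len(n_blocks), each = spots_per_block)

# Made-up feature IDs: duplicates share an ID, buffers are labelled "buffer"
ids <- ifelse(pos_in_block <= 2 * n_genes,
              paste0("b", block, "_g", (pos_in_block - 1) %% n_genes + 1),
              "buffer")

# Keep only the first 2 * 520 spots printed in each block
indices  <- pos_in_block <= 2 * n_genes
kept_ids <- ids[indices]

# Fold the kept spots into spacing x ndups x groups (column-major fill),
# then compare the first and second copy of every feature.
folded     <- array(kept_ids, dim = c(n_genes, 2, n_blocks))
dups_match <- all(folded[, 1, ] == folded[, 2, ])
```

If `dups_match` is FALSE on real data, the spacing or the choice of retained spots does not line up with the duplicate layout, and the estimated correlation would not be meaningful.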