Question

RNA degradation plot for GeneChip Mouse Gene 2.0 ST Array

0

Entering edit mode

alakatos ▴ 130

@alakatos-6983

Last seen 5.2 years ago

United States

Hello All,

I processed my data with package "oligo" function "rma". I would like to make an RNA degradation plot as part of my QC.

Unfortunately, package "oligo" did not offer this function.

However, I found the following function in package "affy"

AffyRNAdeg(abatch,log.it=TRUE)
summaryAffyRNAdeg(rna.deg.obj,signif.digits=3)
plotAffyRNAdeg(rna.deg.obj, transform = "shift.scale", cols = NULL, ...)

I tried to feed my "oligo" processed eset object to "affy" but it threw an error message.

Would you please recommend a way around how to make an RNA degradation plot for my "oligo" processed eset object?

Thank you for your help in advance.

Anita

rnadegradationplot affymetrix mouse gene arrays • 1.8k views

ADD COMMENT • link updated 7.8 years ago by James W. MacDonald 67k • written 7.8 years ago by alakatos ▴ 130

score 0 · Answer 1 · 2017-02-15

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 19 hours ago

United States

RNA degradation plots don't make any sense for Affy's random primer arrays.

Back in the good 'ol days (you know, the 2000's), Affy array processing was based on oligo-dT primers that were used for converting mRNA to cDNA, starting at the poly-A tail. Because of this, the probes for each gene were clustered near the 3' end of the gene, to ensure that you would be able to get reasonable measurements. In other words, there is no guarantee that the RT will proceed to the end of the mRNA (and the mRNA may have been partially degraded, starting from the 5' end), so you want to put the probes right close to where you start the RT to ensure you have enough cDNA to get a good measurement.

Since the Exon ST arrays, Affy has changed to using random primers for generating the cDNA, so there is no longer a 3' bias to the resulting labeled cDNA, and no expectation that you would be able to detect any mRNA degradation. You could hypothetically figure out where each of the probes 'belong' in each transcript and come up with your own version of the plot, but it wasn't particularly useful for the 3' biased arrays and I would imagine will be even less so for the random primer arrays.

ADD COMMENT • link 7.8 years ago James W. MacDonald 67k

0

Entering edit mode

Hi James,

Thank you for your answer.

One of the reviews of our paper insisted on showing (figure) no RNA degradation in our sample despite of RIN > 9.

Would you please suggest a plot which could satisfy him/her.

Thank you very much.

Anita

ADD REPLY • link 7.8 years ago alakatos ▴ 130

0

Entering edit mode

That's awesome! They don't like the RNA Integrity Number? That, like, everybody else uses? That's being a true contrarian.

Anyway, you could do some sort of bootleg approximation, with emphasis on the bootleg part of that statement. First, do note that the Gene ST arrays have (in general) four probes per probe set region (PSR), which is the 'probeset' summarization level, and they have any number of PSRs that they summarize to make an estimate for a transcript. So this isn't going to be as easy as it was for the 3' biased arrays. First, let's extract our data. I will be using Rat Gene 2.0 ST arrays, but it's all the same. First let's get set up.

> library(oligo)
> dat <- read.celfiles(list.celfiles())
> con <- db(pd.ragene.2.0.st)

Now let's get all the probe IDs (fids) per transcript-level probeset:

> z <- dbGetQuery(con, "select meta_fsetid, featureSet.fsetid, fid, chrom, start, stop, strand from featureSet inner join core_mps on core_mps.fsetid=featureSet.fsetid inner join pmfeature on pmfeature.fsetid=featureSet.fsetid where type='1';")

> head(z)
  meta_fsetid   fsetid     fid chrom    start     stop strand
1    17610302 17610307 2015417     1 70465581 70465633      1
2    17610302 17610307  502283     1 70465581 70465633      1
3    17610302 17610307 1373826     1 70465581 70465633      1
4    17610302 17610307 1406169     1 70465581 70465633      1
5    17610302 17610312 2146007     1 70457581 70457605      1
6    17610302 17610313 1606586     1 70457356 70457384      1

Now let's make a GRanges object and sort EDIT:I forgot to make the strand

> strand <- c("+","-")[z$strand+1]
> strand[is.na(strand)] <- "*"
> zgr <- GRanges(paste0("chr", z$chrom), IRanges(z$start, z$stop), strand, meta_fsetid = z$meta_fsetid, fsetid = z$fsetid, fid = z$fid)

> zgr <- sort(zgr)

And split that out by meta_fsetid (transcript-level probeset ID)

> zgrlst <- splitAsList(zgr, mcols(zgr)[,1])

> zgrlst
GRangesList object of length 30472:
$17610302
GRanges object with 9 ranges and 3 metadata columns:
      seqnames               ranges strand | meta_fsetid    fsetid       fid
         <Rle>            <IRanges>  <Rle> |   <integer> <integer> <integer>
  [1]     chr1 [70457356, 70457384]      - |    17610302  17610313   1606586
  [2]     chr1 [70457356, 70457384]      - |    17610302  17610313    871869
  [3]     chr1 [70457356, 70457384]      - |    17610302  17610313   1811125
  [4]     chr1 [70457356, 70457384]      - |    17610302  17610313   2454300

So now we have, for each probeset, in PSR order, the fid values. In the example above we have four fids that all go into one fsetid (PSR), that are then combined into a meta_fsetid. The order of the PSRs is correct, but we aren't sure that the fids are in genomic order. And I don't know how to ensure that they are, so let's ignore that, shall we?

We now just want to extract the probe expression values for a set of these probesets, in order (which we already have), and then plot. So let's just use a subset,

> whatevs <- lapply(zgrlst[sample(1:length(zgrlst), 5000)], function(x) mcols(x)$fid)

Lots of these are pretty short, so let's just use those with >= 20 probes, and also subset longer probesets to just 20 probes.

> whatevs <- whatevs[sapply(whatevs, length) >= 20]
> whatevs <- lapply(whatevs, function(x) x[1:20])
> length(whatevs)
[1] 2959

So this list has the rows for the probes for each probeset, in genomic-ish order, so we want to extract and compute a mean expression level for each probeset-position. Put another way, we want the mean expression for all the probes that come first in a probeset, by sample, the mean for the second set of probes, etc. So we end up with a mean expression value for each relative position over all this subset of probesets.

Not only is that a mouthful, this code is going to be a whopper. First get the expression values from our GeneFeatureSet, then we'll extract the probes in order, and compute the mean values. Note that I have 36 samples in my GeneFeatureSet, so I iterate over each sample here.

> vals <- exprs(dat)

> whatevs2 <- sapply(1:36, function(x) rowMeans(do.call(cbind, lapply(whatevs, function(y) log2(vals[y,x])))))

> plot(1:20, whatevs2[,1], "l", ylim = c(min(whatevs2), max(whatevs2) + 36), xlab = "5' <-----> 3' Probe Number", ylab = "Sample number")
> for(i in 2:36) lines(1:20, whatevs2[,i]+i)

ADD REPLY • link 7.8 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you very, very much for this detailed help.

Anita

ADD REPLY • link 7.8 years ago alakatos ▴ 130

0

Entering edit mode

FWIW: after removing the NAs from the transcript-level probeset table (z), this code is also working nicely for mouse Gene ST1.1 arrays (pd.mogene.1.1.st.v1)

Specifically; after the dbGetQuery (code box 2), and before making the GRanges object (box 3), add this line :

z <- z[!is.na(z$start) | !is.na(z$stop),]

Otherwise an error will occur:

> zgr <- GRanges(paste0("chr", z$chrom), IRanges(z$start, z$stop), strand, meta_fsetid = z$meta_fsetid, fsetid = z$fsetid, fid = z$fid)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  solving row 200731: range cannot be determined from the supplied arguments (too many NAs)

ADD REPLY • link 7.8 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi,

I am trying to follow this step and i am getting an error throwing up:

Error in vals[x, y] : subscript out of bounds

This is when i run the following line:

> whatevs2 <- sapply(1:36, function(x) rowMeans(do.call(cbind, lapply(whatevs, function(y) log2(vals[y,x])))))

Is the exprs(dat) is that using affy or oligo? and what level was the exprs computed? as i have computed it at the core level according to the oligo package. I am pretty new to this area and have been asked to do this plot for this type of chip for my analysis.

The chips were run using mogene.2.0.st array and pd.mogene.2.0.st annotation

Thanks

Danielle

ADD REPLY • link 7.7 years ago danielle.newby • 0

0

Entering edit mode

To analyze your data set you likely copied all of James's code above "as is". Then you also copied this line:

whatevs2 <- sapply(1:36, function(x) rowMeans(do.call(cbind, lapply(whatevs, function(y) log2(vals[y,x])))))

The problem is the number 36; as James indicated this line of code iterates over each of the 36 samples he has in his dataset. If you have a different number of samples you have to change it in that line, i.e. to dim(vals)[2].

The intensity values that are used to generate the 'RNA digestion plot' are the raw, non-normalized expression values. Recall that you read in the raw data using dat <- read.celfiles(list.celfiles()), and subsequently you extract the expression values from the non-normalized GeneFeatureSet using the function exprs() [vals <- exprs(dat)].

Finally, as was also discussed above, you could wonder what the value is of a RNA digestion plot for these arrays since in the labeling procedure random primers are used.

ADD REPLY • link 7.7 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi Guido,

Thanks for your comments really helpful! Yep i was using the RMA normalised data so now it works thank you! I have been told to do this plot even though its not really suitable for this type of chip - i dont think it will tell me anything different that i dont already know from my NUSE, RLE, PCA but there you go! :)

Thanks for your help,

Danielle

ADD REPLY • link 7.7 years ago danielle.newby • 0