Question

Extracting Data from MSn experiment data ("OnDiskMSnExp")

jamesrgraham ▴ 20

Hello All,

I am doing some targeted metabolomics and was wondering whether there is a quicker way to extract data from an MSn experiment ("OnDiskMSnExp") data object.

I read in a number of files:

raw_data <- readMSData(files = FILES, pdata = new("NAnnotatedDataFrame", pd),
               mode = "onDisk", centroided = FALSE, msLevel = 1)

And perform peak picking:

comp_sg_cent_mz <- raw_data %>%
  smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
  pickPeaks(refineMz = "descendPeak") %>%
  filterRt(initial_rtr) %>%
  filterMz(initial_mzr)

I then write out the data:

write.table(comp_sg_cent_mz, file = main_peak_file_name, row.names = FALSE, append = TRUE, col.names = TRUE, sep = "\t")

And get something like this:

"file"  "rt"         "mz"              "i"
1       404.2169952  391.283958025663  14271.6536796537
1       404.7310068  391.283864868878  14570.7012987013
1       405.245991   391.2839380729    13788.5194805195
1       405.760002   391.28338580945   10999.5714285714

Which I then process further.

The issue is that the write.table call takes at least one minute per write, which is problematic with many files and compounds. Is there a faster way to access this data?

Thanks for any and all advice! james

MSnbase • 1.6k views

Updated 5.7 years ago by Laurent Gatto • written 5.7 years ago by jamesrgraham

score 1 · Answer 1 · 2019-07-30

Hi James,

The reason the write.table function takes so long is that all queued processing steps are applied to the data during that call. In on-disk mode, data manipulation operations are cached and applied only when you access the data (in your case, when you call write.table, which in turn (I guess) calls the as.data.frame function). This means that when you call e.g. smooth on your data, the smooth function is only added to a lazy processing queue and is not applied to the data (because the data is not kept in memory, it cannot be modified in place). Each time you access intensity or m/z values, the data is imported from the original (mzML) files and the smooth function is applied before the values are returned.
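The queueing behaviour described above can be illustrated with a toy sketch in base R. This is not MSnbase code; `make_lazy`, `add_step` and `access` are hypothetical helpers that only mimic the idea of a lazy processing queue.

```r
# Toy stand-in for an on-disk object: the raw values plus a queue of pending steps.
make_lazy <- function(values) list(values = values, queue = list())

# Queue a processing step without running it.
add_step <- function(obj, fun) {
    obj$queue <- c(obj$queue, list(fun))
    obj
}

# Apply every queued step, in order, each time the data is accessed.
access <- function(obj) {
    Reduce(function(v, f) f(v), obj$queue, obj$values)
}

lazy <- make_lazy(c(1, 4, 9, 16, 25))
lazy <- add_step(lazy, function(v) v * 2)  # nothing is computed yet
lazy <- add_step(lazy, function(v) v + 1)  # still nothing computed

length(lazy$queue)  # two pending steps
access(lazy)        # both steps run now, and again on every later access
```

Every call to `access()` repeats all queued steps, which is why the cost only shows up at the point where the data is finally read.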

You have, however, two options to speed up your function calls:

1) Call filterRt and filterMz before you call smooth and pickPeaks. That way the data processing will only be applied to the subset you are actually interested in. With your code you are performing the smoothing and peak picking on the full data set for each compound.

2) Call smooth and pickPeaks once on the full data set and export the data as mzML files (with writeMSData). Then re-read this data and call filterRt and filterMz on the already processed data.

Hope this helps.
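The two options could look roughly like the sketch below. The wrapper names `extract_compound` and `process_once` and the arguments `rtr`, `mzr` and `out_files` are made up for illustration; the smoothing and peak-picking settings are copied from the question. It assumes `out_files` holds one output path per input file.

```r
library(MSnbase)

# Option 1: subset first, so that smoothing and peak picking are only
# applied to the retention time and m/z window of one compound.
extract_compound <- function(raw_data, rtr, mzr) {
    sub <- filterRt(raw_data, rt = rtr)
    sub <- filterMz(sub, mz = mzr)
    sub <- smooth(sub, method = "SavitzkyGolay", halfWindowSize = 4L)
    sub <- pickPeaks(sub, refineMz = "descendPeak")
    as(sub, "data.frame")
}

# Option 2: process the full data set once, write centroided mzML files,
# and re-read them so later filtering works on already processed data.
process_once <- function(raw_data, out_files) {
    cent <- smooth(raw_data, method = "SavitzkyGolay", halfWindowSize = 4L)
    cent <- pickPeaks(cent, refineMz = "descendPeak")
    writeMSData(cent, file = out_files)
    readMSData(out_files, mode = "onDisk", centroided. = TRUE, msLevel. = 1L)
}

# Hypothetical usage:
# peaks <- extract_compound(raw_data, rtr = initial_rtr, mzr = initial_mzr)
# cent  <- process_once(raw_data, out_files = sub("\\.mzML$", "_cent.mzML", FILES))
```

With option 1, smoothing sees only the points inside the m/z window, so values near the window edges may differ slightly from those obtained on the full spectrum; a somewhat wider window is probably safer. With option 2, the phenotype data passed as `pdata` in the original `readMSData` call would have to be supplied again when re-reading.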

cheers, jo

score 1 · Answer 2 · 2019-07-30

You can use the respective accessors to extract this information from the object. Below, I use the serine object created in the MSnbase centroiding vignette:

> str(head(intensity(serine)))
List of 6
 $ F1.S628: num [1:10] 0 48 0 48 48 48 0 144 48 0
 $ F1.S629: num [1:12] 0 43 43 87 0 0 173 43 0 0 ...
 $ F1.S630: num [1:9] 0 84 84 42 84 0 42 42 0
 $ F1.S631: num [1:9] 0 90 134 90 45 90 45 45 0
 $ F1.S632: num [1:7] 0 42 42 42 42 42 0
 $ F1.S633: num [1:8] 0 37 0 111 74 37 148 0
> str(rtime(serine))
 Named num [1:43] 175 175 176 176 176 ...
 - attr(*, "names")= chr [1:43] "F1.S628" "F1.S629" "F1.S630" "F1.S631" ...
> str(head(mz(serine)))
List of 6
 $ F1.S628: num [1:10] 106 106 106 106 106 ...
 $ F1.S629: num [1:12] 106 106 106 106 106 ...
 $ F1.S630: num [1:9] 106 106 106 106 106 ...
 $ F1.S631: num [1:9] 106 106 106 106 106 ...
 $ F1.S632: num [1:7] 106 106 106 106 106 ...
 $ F1.S633: num [1:8] 106 106 106 106 106 ...
> head(fromFile(serine))
F1.S628 F1.S629 F1.S630 F1.S631 F1.S632 F1.S633 
      1       1       1       1       1       1

And if you need a data.frame, you can coerce your object with:

> head(as(serine, "data.frame"))
  file      rt       mz  i
1    1 175.212 106.0407  0
2    1 175.212 106.0422 48
3    1 175.212 106.0437  0
4    1 175.212 106.0451 48
5    1 175.212 106.0466 48
6    1 175.212 106.0480 48
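For illustration, the same long-format table can be assembled by hand from the accessor outputs shown above. The helper `spectra_to_long` is hypothetical, and the small lists below are made-up stand-ins with the same shape as the results of `fromFile()`, `rtime()`, `mz()` and `intensity()`, so the sketch runs without any data files.

```r
# Combine per-spectrum accessor results into one long data.frame with the
# columns file, rt, mz and i (one row per data point).
spectra_to_long <- function(file_idx, rt, mz_list, int_list) {
    n <- lengths(mz_list)
    stopifnot(length(n) == length(int_list), all(n == lengths(int_list)))
    data.frame(
        file = rep(unname(file_idx), n),  # repeat per-spectrum values once per point
        rt   = rep(unname(rt), n),
        mz   = unlist(mz_list, use.names = FALSE),
        i    = unlist(int_list, use.names = FALSE)
    )
}

# Made-up values, shaped like the accessor output for two spectra.
mz_list  <- list(F1.S1 = c(106.04, 106.05), F1.S2 = c(106.04, 106.05, 106.06))
int_list <- list(F1.S1 = c(0, 48), F1.S2 = c(43, 87, 0))
rt       <- c(F1.S1 = 175.2, F1.S2 = 175.5)
file_idx <- c(F1.S1 = 1L, F1.S2 = 1L)

long <- spectra_to_long(file_idx, rt, mz_list, int_list)
```

On a real object the call would be `spectra_to_long(fromFile(x), rtime(x), mz(x), intensity(x))`. For an on-disk object, `mz()` and `intensity()` each read the raw files, so this is not expected to be faster than the coercion.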

In addition, you can read in multiple files at once with readMSData, and these will be processed in parallel on a file-by-file basis using BiocParallel.
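A minimal sketch of setting up the parallel backend, assuming two workers and a character vector of mzML paths; the helper names `use_parallel` and `read_all` are made up, and the right backend and worker count depend on your machine.

```r
library(MSnbase)
library(BiocParallel)

# Register a default BiocParallel backend for the session.
# SnowParam works on all platforms; the default of 2 workers is arbitrary.
use_parallel <- function(workers = 2L) {
    register(SnowParam(workers = workers))
}

# Read all files into a single on-disk object; per-file work is then
# dispatched to the registered backend when the data is accessed.
read_all <- function(files) {
    readMSData(files, mode = "onDisk", centroided. = FALSE, msLevel. = 1L)
}

# Hypothetical usage:
# use_parallel(4L)
# raw_data <- read_all(FILES)
```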