Extracting Data from MSn experiment data ("OnDiskMSnExp")
jamesrgraham ▴ 20
@jamesrgraham-21485

Hello All,

I am doing some targeted metabolomics and was wondering whether there is a quicker way to extract data from an MSn experiment ("OnDiskMSnExp") object.

I read in a number of files:

raw_data <- readMSData(files = FILES, pdata = new("NAnnotatedDataFrame", pd),
               mode = "onDisk", centroided = FALSE, msLevel = 1)

And perform peak picking:

comp_sg_cent_mz <- raw_data %>%
  smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
  pickPeaks(refineMz = "descendPeak") %>%
  filterRt(initial_rtr) %>%
  filterMz(initial_mzr)

I then write out the data:

write.table(comp_sg_cent_mz, file = main_peak_file_name, row.names = FALSE, append = TRUE, col.names = TRUE, sep = "\t")

And get something like this:

"file"  "rt"         "mz"              "i"
1       404.2169952  391.283958025663  14271.6536796537
1       404.7310068  391.283864868878  14570.7012987013
1       405.245991   391.2839380729    13788.5194805195
1       405.760002   391.28338580945   10999.5714285714

Which I then process further.

The issue is that the write.table function takes at least one minute to write out (which is problematic with many files and compounds). Is there a faster way to access this data?

Thanks for any and all advice! james

MSnbase • 1.5k views
Johannes Rainer ★ 2.1k
@johannes-rainer-6987

Hi James,

the reason write.table takes so long is that all processing steps are applied to the data during that call. In on-disk mode, data manipulation operations are queued and only applied when you access the data (in your case when you call write.table, which in turn, I guess, calls as.data.frame). So when you call e.g. smooth on your data, the smooth step is only added to a lazy processing queue and not applied to the data (the data is not kept in memory, so it cannot be modified in place). Each time you access intensity or m/z values, the data is imported from the original (mzML) files and the queued steps, such as smooth, are applied before the values are returned.
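The lazy queue described above can be illustrated with a toy model in base R. This is only a sketch of the idea, not MSnbase code: new_lazy, add_step, values and the loader are hypothetical names, and the two queued functions merely stand in for smooth and pickPeaks.

```r
# Toy model of an on-disk object: a loader plus a queue of pending steps.
# All names are hypothetical; this is not how MSnbase is implemented.
new_lazy <- function(loader) {
  list(loader = loader, queue = list())
}

# Queuing a step is cheap: nothing is read or computed yet.
add_step <- function(x, f) {
  x$queue <- c(x$queue, list(f))
  x
}

# Accessing the values reads the data and replays the whole queue, every time.
values <- function(x) {
  out <- x$loader()
  for (f in x$queue) out <- f(out)
  out
}

n_loads <- 0
loader <- function() {
  n_loads <<- n_loads + 1   # count how often the "file" is read
  c(0, 48, 0, 48, 48)
}

obj <- new_lazy(loader)
obj <- add_step(obj, function(v) v * 2)      # stand-in for smooth()
obj <- add_step(obj, function(v) v[v > 0])   # stand-in for pickPeaks()

n_loads       # still 0: nothing has been evaluated so far
values(obj)   # reads the data and applies both steps
values(obj)   # reads and processes everything again
n_loads       # 2
```

Every access pays the full cost of reading and processing, which is why the time shows up in write.table and not in the smooth or pickPeaks calls.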

To speed up your function calls you have two possibilities:

1) Call filterRt and filterMz before you call smooth and pickPeaks. That way the data processing will only be applied to the subset you are actually interested in. With your current code you are performing the smoothing and peak picking on the full data set for each compound.

2) Call smooth and pickPeaks once on the full data set and export the result as mzML files (with writeMSData). Then re-read this data and call filterRt and filterMz on the already processed data.
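The two options could look roughly like the sketch below. It assumes MSnbase is installed; extract_window, cache_centroided and out_files are hypothetical names, and the smoothing and peak picking settings are simply copied from the question. Note that filtering by m/z before smoothing may change values near the edges of the window, so a slightly wider m/z range than the final one may be worth testing.

```r
# Sketch only: assumes MSnbase is installed. `rtr` and `mzr` are numeric
# ranges of length 2, like initial_rtr and initial_mzr in the question.

# Option 1: subset first, so smoothing and peak picking only touch the window.
extract_window <- function(raw, rtr, mzr) {
  x <- MSnbase::filterRt(raw, rtr)
  x <- MSnbase::filterMz(x, mzr)
  x <- MSnbase::smooth(x, method = "SavitzkyGolay", halfWindowSize = 4L)
  x <- MSnbase::pickPeaks(x, refineMz = "descendPeak")
  methods::as(x, "data.frame")
}

# Option 2: process the full data once, store it as mzML, and re-read it.
# `out_files` is a hypothetical vector with one output path per input file.
cache_centroided <- function(raw, out_files) {
  x <- MSnbase::smooth(raw, method = "SavitzkyGolay", halfWindowSize = 4L)
  x <- MSnbase::pickPeaks(x, refineMz = "descendPeak")
  MSnbase::writeMSData(x, file = out_files)
  # The phenodata is not passed along here; supply `pdata` again if needed.
  MSnbase::readMSData(out_files, mode = "onDisk")
}

# Usage (not run here):
# peaks <- extract_window(raw_data, initial_rtr, initial_mzr)
# cent  <- cache_centroided(raw_data, file.path("centroided", basename(FILES)))
# sub   <- MSnbase::filterMz(MSnbase::filterRt(cent, initial_rtr), initial_mzr)
```

With option 2 the expensive steps run once per file instead of once per compound, at the cost of extra disk space for the processed mzML files.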

hope this helps.

cheers, jo

Thank you, jo! I will try this out.

@laurent-gatto-5645

You can use the respective accessors to extract this information from the object. Below, I use the serine object created in the MSnbase centroiding vignette:

> str(head(intensity(serine)))
List of 6
 $ F1.S628: num [1:10] 0 48 0 48 48 48 0 144 48 0
 $ F1.S629: num [1:12] 0 43 43 87 0 0 173 43 0 0 ...
 $ F1.S630: num [1:9] 0 84 84 42 84 0 42 42 0
 $ F1.S631: num [1:9] 0 90 134 90 45 90 45 45 0
 $ F1.S632: num [1:7] 0 42 42 42 42 42 0
 $ F1.S633: num [1:8] 0 37 0 111 74 37 148 0
> str(rtime(serine))
 Named num [1:43] 175 175 176 176 176 ...
 - attr(*, "names")= chr [1:43] "F1.S628" "F1.S629" "F1.S630" "F1.S631" ...
> str(head(mz(serine)))
List of 6
 $ F1.S628: num [1:10] 106 106 106 106 106 ...
 $ F1.S629: num [1:12] 106 106 106 106 106 ...
 $ F1.S630: num [1:9] 106 106 106 106 106 ...
 $ F1.S631: num [1:9] 106 106 106 106 106 ...
 $ F1.S632: num [1:7] 106 106 106 106 106 ...
 $ F1.S633: num [1:8] 106 106 106 106 106 ...
> head(fromFile(serine))
F1.S628 F1.S629 F1.S630 F1.S631 F1.S632 F1.S633 
      1       1       1       1       1       1 

And if you need a data.frame, you can coerce your object with

> head(as(serine, "data.frame"))
  file      rt       mz  i
1    1 175.212 106.0407  0
2    1 175.212 106.0422 48
3    1 175.212 106.0437  0
4    1 175.212 106.0451 48
5    1 175.212 106.0466 48
6    1 175.212 106.0480 48
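If the processed data fits in memory, one possible way to use this coercion for many compounds is to convert once and then subset the resulting data.frame per compound in plain R. The sketch below uses a small made-up data.frame with the same four columns as a stand-in for the coerced object; subset_window and the example values are hypothetical.

```r
# `peaks` stands in for the result of as(x, "data.frame"); values are made up.
peaks <- data.frame(
  file = c(1, 1, 1, 2),
  rt   = c(404.2, 404.7, 410.0, 404.5),
  mz   = c(391.2840, 391.2839, 391.2838, 120.0000),
  i    = c(14271, 14570, 13788, 500)
)

# Keep rows whose rt and mz fall inside the closed ranges rtr and mzr.
subset_window <- function(df, rtr, mzr) {
  keep <- df$rt >= rtr[1] & df$rt <= rtr[2] &
          df$mz >= mzr[1] & df$mz <= mzr[2]
  df[keep, , drop = FALSE]
}

hits <- subset_window(peaks, rtr = c(404, 406), mzr = c(391.28, 391.29))
nrow(hits)   # 2

# Write the subset as a tab-separated file.
out <- tempfile(fileext = ".tsv")
write.table(hits, file = out, sep = "\t", row.names = FALSE, quote = FALSE)
```

Subsetting an in-memory data.frame does not touch the raw files, so the per-compound cost is only the filtering and the write itself.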

In addition, you can read in multiple files at once with readMSData, and the processing of these will be done in parallel on a file-by-file basis using BiocParallel.
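A minimal sketch of choosing the parallel backend is shown below. It assumes BiocParallel is installed; make_backend and the worker counts are hypothetical, and the sensible number of workers depends on the machine and on disk speed.

```r
# Sketch: pick a BiocParallel backend; assumes BiocParallel is installed.
# `make_backend` is a hypothetical helper, not part of any package.
make_backend <- function(workers = 2L) {
  if (.Platform$OS.type == "windows") {
    BiocParallel::SnowParam(workers = workers)       # socket-based workers
  } else {
    BiocParallel::MulticoreParam(workers = workers)  # forked workers
  }
}

# Usage (not run here):
# BiocParallel::register(make_backend(4L))
# BiocParallel::bpparam()   # shows the backend that is now the default
```

Registering a backend sets the default that functions with a BPPARAM argument pick up, so the processing code itself would not need to change.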

Thank you, Laurent! I will do a bunch of testing.
