Dear all,
I was searching Bioconductor for a function that assembles peptides into protein groups (i.e. protein inference). The problem and my preferred solution are nicely described in Figure 1 of the IDPicker paper by Zhang et al. (https://www.ncbi.nlm.nih.gov/pubmed/17676885).
What are your suggestions?
Kind regards, Daniel
My investigation so far
I looked at MSnID::infer_parsimonious_accessions(). Here, no grouping of equivalent proteins/peptides occurs (Step B in the figure). Internally, the which.max() call picks only the first of several equally scoring protein accessions, so the result depends on the order of the input. Ideally, I would like to keep this equally good information.
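To illustrate the order dependence, here is a minimal Python sketch of a greedy parsimony step. It is not the MSnID code: the function name `greedy_parsimony` and the toy mapping are my own assumptions, and Python's `max()` merely stands in for `which.max()`, since both return the first of several equal maxima.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides. Ties go to the first protein in
    iteration order (hypothetical stand-in, not the MSnID implementation)."""
    remaining = set().union(*protein_to_peptides.values())
    chosen = []
    while remaining:
        # max() returns the first maximal item it encounters
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & remaining))
        chosen.append(best)
        remaining -= protein_to_peptides[best]
    return chosen

# Hypothetical toy data: pro4 and pro9 are indistinguishable by peptides
order_a = {"pro4": {"pep2", "pep10"}, "pro9": {"pep2", "pep10"}}
order_b = {"pro9": {"pep2", "pep10"}, "pro4": {"pep2", "pep10"}}

print(greedy_parsimony(order_a))  # ['pro4']
print(greedy_parsimony(order_b))  # ['pro9']
```

The same evidence yields a different accession depending only on input order, which is the information I would like to keep rather than drop.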
In the example from the figure (step D, middle cluster) the following difference would occur:
- MSnID::infer_parsimonious_accessions() gets 2 protein groups: "pro4,9" with "pep2;pep10" and "pro6" with "pep6"
- IDPicker gets 1 protein group: "pro4,9;pro6" with 3 peptides: "pep2;pep6;pep10"
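The grouping I have in mind can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not IDPicker or MSnID code: the function names are hypothetical, and the toy mapping is invented so that it matches the outcome listed above rather than copied from the figure. It merges proteins with identical peptide sets, then collects everything connected through shared peptides into one cluster.

```python
from collections import defaultdict

def group_equivalent(protein_to_peptides):
    """Merge proteins that map to exactly the same peptide set."""
    by_peptides = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)
    return {",".join(sorted(prots)): set(peps)
            for peps, prots in by_peptides.items()}

def clusters(group_to_peptides):
    """Connected components of the bipartite group/peptide graph."""
    peptide_to_groups = defaultdict(set)
    for group, peptides in group_to_peptides.items():
        for pep in peptides:
            peptide_to_groups[pep].add(group)
    seen, result = set(), []
    for start in group_to_peptides:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over groups sharing a peptide
            group = stack.pop()
            if group in component:
                continue
            component.add(group)
            for pep in group_to_peptides[group]:
                stack.extend(peptide_to_groups[pep] - component)
        seen |= component
        peptides = set().union(*(group_to_peptides[g] for g in component))
        result.append((sorted(component), sorted(peptides)))
    return result

# Hypothetical toy data (assumed, not taken from the paper)
toy = {
    "pro4": {"pep2", "pep10"},
    "pro9": {"pep2", "pep10"},
    "pro6": {"pep6", "pep10"},
}
print(clusters(group_equivalent(toy)))
# [(['pro4,pro9', 'pro6'], ['pep10', 'pep2', 'pep6'])]
```

With this toy input, pro4 and pro9 are kept together as one entry, and pro6 stays attached to them through the shared pep10 instead of being reported separately with pep6 alone.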
Off topic
- Deciding on the order within a protein group requires more information from the measurement and should not be part of this question.
- For the later step of aggregating intensities by protein group (i.e. after step D in the figure), MSnbase::combineFeatures() seems to be a good fit.
Thanks!
What I find quite challenging about protein grouping is the question of what to do when the experiment contains many runs. To me it feels like the problem immediately becomes a lot more complicated. Which proteins and which precursors should be considered identified? Should one use a run-specific FDR, a global FDR, or both? Is it better to have a protein with 2 peptides in a single run than a protein with one peptide in two runs? What to do if the runs are strongly heterogeneous (e.g. a fractionated sample) and are not expected to contain the same proteins? How to avoid an unreasonable increase in the number of protein groups as the number of runs grows?
I wonder whether there are any packages/publications that can deal with protein grouping in the multi-run setting?