As I understand it, the pRoloc package allows one to study the localisation of proteins inside cells, using relative quantitation of known organelle residents, termed organelle markers.
In the vignette, it uses the `tan2009r1` data with markers that have been obtained by mining the pRolocdata datasets and curation by various members of the Cambridge Centre for Proteomics.
table(pRolocmarkers("dmel"))
##
##  Cytoskeleton            ER         Golgi      Lysosome       Nucleus
##             7            24             7             8            21
##            PM    Peroxisome    Proteasome  Ribosome 40S  Ribosome 60S
##            25             4            14            22            32
## mitochondrion
##            15
This shows markers for only a subset of organelles: Cytoskeleton, ER, Golgi, Lysosome, Nucleus, PM (plasma membrane, I assume), Peroxisome, Proteasome (not sure what this is?), Ribosome 40S, Ribosome 60S, and Mitochondrion. It misses many other subcellular compartments, such as Cytosol, Actin Filaments, Vesicles, etc.
Are these markers proteins that are specific to the organelle of interest? That is, proteins that are not found in other subcellular compartments (not multi-localizing proteins)?
What I would like to do is use information from the Human Protein Atlas with pRoloc's `addMarkers()` function. This way, instead of 11 organelles, I would have data on more subcellular compartments. The subcellular location data from their website includes many proteins that multi-localize. Can this data be used as marker data instead? Or can you only use pRolocmarkers (Homo sapiens has only 872, for example) with their UniProt protein identifiers?
Thank you for your prompt and detailed reply Laurent.
Thanks for clearing up that markers can be selected in a number of ways (e.g. pRolocmarkers, GO CC [with additional curation], HPA, etc.).
There are 11 high-confidence marker categories in `pRolocmarkers`. I want to use the Human Protein Atlas (HPA) instead because it has finer-scale information on subcellular location. For example, instead of just Nucleus as in `pRolocmarkers`, the HPA lists sub-compartments.
We designed a "Chromatome" assay and want to assess the efficacy of this technique, i.e. how successfully it enriches for chromatin-bound proteins in the nucleus. Would the number of marker categories affect how well the `pRoloc` algorithm separates categories? `pRolocmarkers` has 12 while HPA has 34. Your manuscript says "failure to extract organelle markers that cover the whole subcellular diversity in the data; this leads to prediction errors, as protein profiles of unknown localization can only be associated with organelles that appear in the labeled training data", so shouldn't more organelle markers be better?
HPA has four categories of "reliability"/confidence for where proteins localize:
1) Validated: i) genetic methods using siRNA silencing or CRISPR/Cas9 knockout, ii) expression of a fluorescent protein-tagged protein at endogenous levels, iii) independent antibodies targeting different epitopes.
2) Supported: Agreement with external experimental data from the Uniprot database
3) Approved: Lack of external experimental information (i.e. only found by HPA method: integrating transcriptomics data and antibody-based image profiling approach)
4) Uncertain: HPA showed contradictory results compared to complementary information about the protein location.
So in theory, if I were to take only those proteins with high reliability (e.g. Validated or Supported) that do not multi-localize and use those as markers, I should get good classification of which parts of the nucleus our proteins localize to?
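To make the filtering idea above concrete, here is a minimal base R sketch. It assumes the HPA table has already been reduced to three columns, and the column names (`uniprot`, `reliability`, `location`), the `;` separator and all of the values are hypothetical stand-ins, not the real HPA file layout. It also assumes the HPA entries have already been mapped to the same identifiers used as feature names in the quantitative data.

```r
# Toy stand-in for an HPA subcellular location table.
# Column names, identifiers and values are hypothetical.
hpa <- data.frame(
  uniprot     = c("P00001", "P00002", "P00003", "P00004", "P00005"),
  reliability = c("Validated", "Supported", "Approved", "Validated", "Uncertain"),
  location    = c("Nucleoli", "Nuclear speckles", "Nucleoplasm",
                  "Nucleoplasm;Cytosol", "Nuclear membrane"),
  stringsAsFactors = FALSE
)

# Keep only the two highest reliability categories
keep_reliability <- hpa$reliability %in% c("Validated", "Supported")

# Keep only proteins annotated to exactly one location
n_locations <- lengths(strsplit(hpa$location, ";", fixed = TRUE))
keep_single <- n_locations == 1

candidates <- hpa[keep_reliability & keep_single, ]

# Named character vector: names are protein identifiers, values are classes
hpa_markers <- setNames(candidates$location, candidates$uniprot)
print(hpa_markers)

# With pRoloc loaded and an MSnSet `obj` (hypothetical) whose feature names
# match names(hpa_markers), the vector could then be passed on:
# obj <- addMarkers(obj, hpa_markers)
```

In this toy table only `P00001` and `P00002` survive: `P00004` is Validated but annotated to two locations, and the remaining entries fall in the lower reliability categories.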
Another caveat is how many of these markers are used. There are 116 high-confidence nucleus markers in `pRolocmarkers`; I haven't yet checked how many are in the HPA data, but you do mention in the manuscript "an inevitable trade-off...increasing the number of markers to better characterize the multivariate data." I'm guessing the best way to find out is heuristically (like manually selecting the perplexity hyperparameter for t-SNE), or would that be a faux pas, like trying out different statistical methods until you get the best result (p-hacking)?
P.S. I really should have searched Wikipedia for proteasome before asking here, sorry! After a bit more reading I now know the differences between proteasome/lysosome/endosome/phagosome/aggresome. Thanks!
My decision on whether I should use multi-localizing markers for my biological question:
From Gatto et al., 2014: "Although proteins with genuine multiple localizations are of particular interest (see below), one must be careful when assessing multiple GO CC terms and distinguish proteins present in more than one subcellular niche (multilocalization) from changes in localization under different conditions and incorrect annotation." According to the HPA, half of all proteins localize to multiple locations. This reflects the spatial restriction and temporal ordering of molecular function within a compartment; some proteins may also have context-specific functions in different parts of the cell (moonlighting).
With this information in hand I think, for my application, that I should only use high-confidence single localization markers.
This is of course a good approach, but whether this can be done also depends on the resolution in your data. If there aren't enough sub-nuclear markers or they don't form clusters, over-annotation will end up being counterproductive.
If you focus on a set of classes of interest, and limit the annotation for others, that's fine.
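As a rough illustration of limiting the annotation to classes with enough markers, here is a base R sketch. The marker vector, the class sizes and the threshold `min_n` are made up for illustration and are not a recommendation. pRoloc also has a `minMarkers()` function intended for this kind of filtering, so its documentation is worth checking before relying on a hand-rolled version like this one.

```r
# Toy marker vector: names are protein identifiers, values are classes.
# Class names and sizes are made up for illustration.
markers <- c(rep("Nucleoli", 8), rep("Nuclear speckles", 2),
             rep("Nucleoplasm", 6), "Nuclear membrane")
names(markers) <- paste0("prot", seq_along(markers))

min_n <- 5  # arbitrary threshold, chosen only for this example

# Count markers per class and find the classes below the threshold
class_sizes <- table(markers)
small_classes <- names(class_sizes)[class_sizes < min_n]

# Relabel markers from under-represented classes as "unknown"
filtered <- markers
filtered[filtered %in% small_classes] <- "unknown"

print(table(filtered))
```

Whether the classes that remain actually form clusters in the data still has to be checked separately, for example on a PCA plot of the marker profiles.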
Hope this helps.