Question

Spectra and PSMatch- trouble in reducePSMs, joinSpectraData, or countIdentifications?

0

Entering edit mode

Jane_Huang • 0

@4ce3b8bd

Last seen 5 months ago

United States

Hello,

I used MS-GF+ on the mzML file, using the .faa file as reference, and produce the .mzid file. All files are in the following link: https://figshare.com/s/b65fc594da19f0f9347f . The raw data was produced by Bruker Impact II (Q-TOF MS/MS), and was transformed into mzML file by ProteoWizard msconvert.

However, when I tried to use the function countIdentifications, an error popped up. I am not sure whether there was something wrong about the reducePSMs, joinSpectraData, or countIdentifications.

My codes were as below. The MS-GF+ was run in Mac terminal; other codes were in R Studio.

java -version 
java version "1.8.0_421"

java -Xmx3500M -jar MSGFPlus.jar -s "HH090441864_2024-09-09_2648.mzML" -d "protein.faa" -tda 0 -inst 2 -e 1 -maxMissedCleavages 2 -o HH090441864_2.mzid

 library(Spectra)
library(PSMatch)
sp<-Spectra("HH090441864_2024-09-09_2648.mzML")
id<-readPSMs("HH090441864_2.mzid")

id_filtered<-filterPSMs(id)
id_f_r<-reducePSMs(id_filtered)

sp_ident <- joinSpectraData(sp, id_f_r,
                      by.x = "spectrumId",
                      by.y = "spectrumID")

sp_ident_count <- countIdentifications(sp_ident)

#Error: BiocParallel errors
 # 1 remote errors, element index: 1
 # 0 unevaluated and other errors
 # first remote error:
#Error in as.vector(x, mode): coercing an AtomicList object to an atomic vector is supported only for
  #objects with top-level elements of length <= 1

Thank you in advance for any thoughts!

ProteomicsWorkflow PSMatch Spectra RforProteomics • 659 views

ADD COMMENT • link updated 5 months ago by Laurent Gatto 1.6k • written 5 months ago by Jane_Huang • 0

score 1 · Answer 1 · 2024-09-28

1

Entering edit mode

Laurent Gatto 1.6k

@laurent-gatto-5645

Last seen 7 days ago

Belgium

Tthis happens because the sequence spectraVariable is expected to be an atomic vector or a List with elements of length 1, but it's not - some of your MS2 scans have > 1 sequence:

> lengths(sp_ident[["sequence"]])
  [1] 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 2 1 0 1 1 0 1 0
 [38] 0 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
 [75] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 2 1 1 0 0 0 0 1 0 0
> sp_ident[["sequence"]][30]
CharacterList of length 1
[[1]] LERRFGR EFGKGIKR

You should filter out your PSMs first, and only keep the best one for each MS2 scan.

This can be traced back to the identification data:

> which(lengths(id_f_r$sequence) > 1)
scan=1054  scan=109  scan=127 scan=1284  scan=133 scan=1377 scan=1459 scan=1464 
       22        26        72        79        93       106       135       138 
scan=1611 scan=1629 scan=1833 scan=2296 scan=2301 scan=2369 scan=2433 scan=2543 
      190       196       272       419       422       448       474       510 
scan=2547 scan=2586   scan=30 scan=3089 scan=3157 scan=3201  scan=322 scan=3269 
      512       517       650       680       705       720       728       750 
scan=3333 scan=3718 scan=3758 scan=3864 scan=4104 scan=4108 scan=4111 scan=4126 
      775       914       925       956      1035      1037      1038      1044 
scan=4168 scan=4289 scan=4292 scan=4709 scan=4742 scan=4802 scan=4822 scan=4876 
     1053      1090      1091      1230      1241      1267      1272      1297 
scan=4923 scan=4961 scan=5122 scan=5524 scan=5692  scan=571 scan=5713 scan=5718 
     1318      1328      1392      1538      1579      1585      1586      1588 
scan=5948 scan=6086 scan=6177 scan=6206 scan=6234 scan=6414 scan=6422 scan=6424 
     1671      1735      1757      1772      1790      1852      1856      1858 
scan=6429 scan=6463 scan=6618 scan=6668 scan=6788 scan=6943 scan=6954 scan=7136 
     1861      1872      1924      1942      1984      2060      2065      2137 
scan=7152 scan=7171 scan=7282 scan=7369 scan=7650 scan=7660 scan=7896 scan=7904 
     2145      2155      2194      2215      2314      2316      2375      2376 
scan=7934  scan=804 scan=8308   scan=91 
     2380      2412      2498      2568

I suggest you check why this happens.

ADD COMMENT • link 5 months ago Laurent Gatto 1.6k

0

Entering edit mode

Thank you very much for the detailed explanation!! Sorry for the delay response. Indeed, there were multiple MS2 scan matching to more than one peptide- and strangely all peptides were ranked 1.

In fact, all of the matches in my original data were ranked 1, and when I did filterPSMs, no peak was removed due to rank >1. I also didn't use decoy database- can this be the reason why there's a problem with ranking?

I used MSGFPlus for generating my mzID file, and also tried posting the question there (https://github.com/MSGFPlus/msgfplus/issues/156#issuecomment-2380245837), but haven't got a definite solution yet (currently trying to create a new column for redundant sequence).

> id_0927_filtered<-filterPSMs(id_0927)
Starting with 32852 PSMs:
Removed 0 decoy hits.
Removed 0 PSMs with rank > 1.
Removed 29928 shared peptides.
2924 PSMs left.

ADD REPLY • link 5 months ago Jane_Huang • 0

0

Entering edit mode

I used scan 30 as an example:

>test2<-as.data.frame(id_filtered)
>test3<-filter(test2,spectrumID="scan=30")

>test3
  sequence spectrumID chargeState rank passThreshold experimentalMassToCharge calculatedMassToCharge
1  LERRFGR    scan=30           3    1          TRUE                 312.1834               311.8507
2 EFGKGIKR    scan=30           3    1          TRUE                 312.1834               312.1871
    peptideRef modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
1  Pep_LERRFGR      0   FALSE    M   R   147 153 XP_023107941.2         321            
2 Pep_EFGKGIKR      0   FALSE    G   R   872 879 XP_044889830.1         935            
                                                                                  DatabaseDescription
1 XP_023107941.2 LOW QUALITY PROTEIN: putative testis-specific Y-encoded-like protein 3 [Felis catus]
2               XP_044889830.1 xin actin-binding repeat-containing protein 2 isoform X4 [Felis catus]
  scan.number.s. scan.start.time acquisitionNum                     spectrumFile                idFile
1             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
2             30         189.737             30 HH090441864_2024-09-09_2648.mzML HH090441864_1001.mzid
  MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
1             20                33     8.410246e-07     9.655486          <NA>    <NA>      NA          NA
2             20                33     8.410246e-07     9.804787          <NA>    <NA>      NA          NA
  subOriginalResidue subReplacementResidue subLocation
1               <NA>                  <NA>          NA
2               <NA>                  <NA>          NA

ADD REPLY • link 5 months ago Jane_Huang • 0

1

Entering edit mode

Hmm, it is a bit surprising that you have so many the same scores. I would try to look at the actual spectra, and check the matching peaks, while filtering them out and continue with the unique matches.

ADD REPLY • link 5 months ago Laurent Gatto 1.6k