Entering edit mode
peirinl
•
0
@peirinl-14078
Last seen 4.5 years ago
Dear all,
I do annotation and removing multiple mappings,
anno_dataSet <- AnnotationDbi::select(hgu133plus2.db, keys=(featureNames(dataSet_filtered)), columns = c("SYMBOL", "GENENAME"), keytype="PROBEID") probe_stats <- anno_dataSet %>% group_by(PROBEID) %>% summarize(no_of_matches = n_distinct(SYMBOL)) %>% filter(no_of_matches > 1) ids_to_exlude <- ((featureNames(dataSet_filtered) %in% probe_stats$PROBEID) | featureNames(dataSet_filtered) %in% subset(anno_dataSet, is.na(SYMBOL))$PROBEID) dataSet_final <- subset(dataSet_filtered, !ids_to_exlude) fData(dataSet_final)$PROBEID <- rownames(fData(dataSet_final)) fData(dataSet_final) <- left_join(fData(dataSet_final), anno_dataSet) rownames(fData(dataSet_final)) <-fData(dataSet_final)$PROBEID
But I get
Error in `row.names<-.data.frame`(`*tmp*`, value = value) : duplicate 'row.names' are not allowed Besides: Warning message: non-unique value when setting 'row.names': '227320_at'
My question is,
1. is it normal to have a duplicate probe unremoved after I removing multiple mappings?
2. can I remove the duplicate data, so that I can change the rownames using PROBEID? If so, what is the best way?
Thank you.
If I run your code with
as the dataSet_filtered then I get no errors, so I would advise double-checking the integrity of the original object - I can't offhand see an easy way that '227320_at' would have got through the filtering by ids_to_exlude (or be re-introduced by the left_join) - you might want to debug by tracing this particular IDs appearance (or any other ID that is duplicated in fData(dataSet_final)$PROBEID before the final line of your analysis).
I agree with Gavins's answer - the code will exclude all probe IDs with multiple mappings to gene symbols, so by definition there should not be any duplicate PROBEIDs in fData as I retain only the PROBEIDs matching to exactly one gene symbol.
Are you using the correct annotation database for your array? Is it the hgu133plus2 platform?
The code obviously won't work if the annotation database / package is not the right one for your array.
You are right, the duplication is induced by left_join. Before left_join,
> anyDuplicated(rownames(fData(dataSet_final)))
[1] 0
> anyDuplicated(anno_dataSet$PROBEID)
[1] 2
> rownames(fData(riester_final))[28589]
[1] "227321_at"
> rownames(fData(riester_final))[28588]
[1] "227320_at"
After Left join,
> anyDuplicated(fData(riester_final)$PROBEID)
[1] 28589
> fData(riester_final)$PROBEID[28589]
[1] "227320_at"
> fData(riester_final)$PROBEID[28588]
[1] "227320_at"
Clearly, I dont know how to explain. Base on my understanding, it should not have such insertion on left_joint. How can I solve that?
I try another dataset, also using hgu133plus2.db. The outcome is same.
This is very strange. 2273020_at _is_ appearing in riester_final, so that implies it is not duplicated in the anno_dataSet, so the inner join - assuming it is the dplyr version and is being done on PROBEID (maybe set an explicit 'by' in the join to enforce this) - should only return one row, as you expect. Could you report the result of doing
anno_dataSet %>% filter(PROBEID=="2273020_at")
andprobe_stats %>% filter(PROBEID=="2273020_at")
and similar with your riester_final before the join: I can only imagine that somehow the n_distinct isn't resolving all identical cases - maybe your annotation has two identical symbols with different gene-names, or something like that.This is the behavior of left_join()
and the documentation, consistent with this behavior
There are multiple symbols associated with key 227320_at
So there are no surprises in the code.
It looks like the two instances aren't distinguished by the SYMBOL (RFLNA in both cases), so rather than use n_distinct(SYMBOL) (which will give 1), you may want to use n() instead, which will give the number of rows (2 in this case) in the group. Or use an alternative philosophy regarding multi-symbol probeids, such as just quoting the first (like James' annotateEset method will), or concatenating them...