Is it possible to keep sequence names when haplotypes are collapsed using the Collapse tool of the QSutils package?
I have a number of FASTA-formatted sequences, some of which are identical and therefore redundant for a phylogenetic analysis. To remove the duplicates, I am using the Collapse function of QSutils. My issue is that after collapsing, the sequences are renamed from ">Species1", ">Species2", ">Species3", etc. to "1", "2", "3", and so on. I would like to keep the name of at least one of the sequences in each collapsed group, rather than have them replaced by numbers. Is it possible to do this with Collapse?
Example:
> example_sequences
  A DNAStringSet instance of length 5
    width seq                names
[1]    18 ATTAGACACCAGAGGCTT Example_A
[2]    18 ATTAGACATCAGAGGCTT Example_B
[3]    18 ATTAGACATCAGAGGCTT Example_C
[4]    18 ATTAGACACCAGAGGCTT Example_D
[5]    18 ATTAGACACGTTAGGCTT Example_E

> Collapse(example_sequences)
$nr
[1] 2 2 1

$hseqs
  A DNAStringSet instance of length 3
    width seq                names
[1]    18 ATTAGACACCAGAGGCTT 1
[2]    18 ATTAGACATCAGAGGCTT 2
[3]    18 ATTAGACACGTTAGGCTT 3
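One workaround I have considered is sketched below. It is not a feature of QSutils. It assumes that Collapse returns the haplotype sequences unmodified in $hseqs, as the output above suggests. Each collapsed haplotype is looked up in the original set, and the name of its first identical occurrence is copied over. The variable names collapsed and first_hit are only illustrative.

```r
library(Biostrings)
library(QSutils)

# Rebuild the example set shown above
example_sequences <- DNAStringSet(c(
  Example_A = "ATTAGACACCAGAGGCTT",
  Example_B = "ATTAGACATCAGAGGCTT",
  Example_C = "ATTAGACATCAGAGGCTT",
  Example_D = "ATTAGACACCAGAGGCTT",
  Example_E = "ATTAGACACGTTAGGCTT"
))

collapsed <- Collapse(example_sequences)

# For each haplotype, index of the first identical sequence in the original set.
# Comparing as plain character vectors avoids relying on any S4 match() method.
first_hit <- match(as.character(collapsed$hseqs),
                   as.character(example_sequences))

# Replace the numeric labels with the name of that first occurrence
names(collapsed$hseqs) <- names(example_sequences)[first_hit]
```

Given the Collapse output shown above, this should label the three haplotypes Example_A, Example_B and Example_E while leaving the counts in $nr untouched. If the counts are not needed, example_sequences[!duplicated(as.character(example_sequences))] would also keep one named copy of each sequence without calling Collapse at all.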
Result of sessionInfo():
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] QSutils_1.4.0 Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.2 S4Vectors_0.24.3 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 lattice_0.20-38 ape_5.3 psych_1.9.12.31 grid_3.6.1 nlme_3.1-143
[7] zlibbioc_1.32.0 tools_3.6.1 compiler_3.6.1 mnormt_1.5-6 BiocManager_1.30.10