Hi,
I'm totally new to working with Bioconductor and hope you can help me with the following problem.
My data is a "Large DataFrame" with a different DNA sequence (seq) in every row. The sequences are of class "DNAStringSet" from the "Biostrings" package. One variable (pos) contains the position of the first nucleobase of each sequence. The goal is to extract the nucleobase at one specific position. This position is not covered by every row, and the rows do not all start at the same position, so the distance between the starting position of a row and the position I'm looking for varies. The position information is also stored in @ranges, which is of class "GroupedIRanges" from the "XVector" package. So I tried using the subseq function:
data$subseq_test <- subseq(data$seq@ranges, start = 6, end = 6)
# not working:
# Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘subseq’ for signature ‘"GroupedIRanges"’
Without @ranges:
data$subseq_test <- subseq(data$seq, start = 6, end = 6)
# working, but not the way I want: it gives me the 6th nucleobase of every row, counting from 1 at the beginning of each sequence
As I read in another post, @ranges should not be used. My question is: how can I get this one position?
Here you can see some information of the data:
https://www.dropbox.com/s/ktffp9s7mm183hu/data.png?dl=0
And my sessioninfo:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4  stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
[1] Biostrings_2.42.1  XVector_0.14.1  IRanges_2.8.2  S4Vectors_0.12.2  BiocGenerics_0.20.0  BiocInstaller_1.24.0
If 'pos' is a vector telling you the location of the base you're interested in, can you achieve what you want with something along the lines of
subseq(data$seq, start=pos, width=1)
or, if pos is only defined for certain rows, then you may need to subset data first to be those rows with a defined pos.
Thank you, Gavin, for your answer! Using width was a good idea. For start, however, I needed to compute a new variable that gives the actual position within each row, counting from one. As I realized, this is still not the "real" solution, because there are some insertions and deletions that also need to be considered when calculating the position.
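The per-row start described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the poster's actual code: the names seqs, target and rel_start are hypothetical, and it assumes that every row covers the target position and that there are no insertions or deletions.

```r
library(Biostrings)

# Toy data, invented for illustration only
seqs   <- DNAStringSet(c("ACGTCCGT", "GGGTTTCC", "TTACGATA"))
pos    <- c(100L, 103L, 98L)  # reference position of the first base in each row
target <- 104L                # reference position of the base we want

# Convert the reference position into a 1-based offset within each row
rel_start <- target - pos + 1L   # 5, 2, 7

# subseq() accepts one start per sequence
base <- subseq(seqs, start = rel_start, width = 1L)
as.character(base)   # "C" "G" "T"
```

If a row contains insertions or deletions relative to the reference, this simple offset no longer matches, so the alignment information would have to be taken into account as well.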
In general this solution works, but my start position is not contained in every DNAString, so the start variable has NA values for some rows. If I use only the complete cases, i.e. drop all rows with NA in the start position variable, then the code works fine. For my further analysis I need the other rows as well, so I cannot drop the NA cases. That is why I tried the following:
I do not get the error here. The ifelse should skip the NA rows for the subseq function, but as it seems (in row 1, pos is NA), it is not doing what it should. It would be great if somebody had an explanation or a solution for my problem.
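One possible explanation, assuming the attempt wrapped subseq() inside ifelse(): ifelse() evaluates its yes and no arguments on the whole vector, so subseq() would still be called for the NA rows. A way around this, sketched here on made-up toy data with hypothetical names (seqs, rel_start, ok), is to call subseq() only on the rows with a usable start and write the results back into a full-length vector, so no row has to be dropped:

```r
library(Biostrings)

# Toy data, invented for illustration only
seqs      <- DNAStringSet(c("ACGTCCGT", "GGGTTTCC", "TTACGATA"))
rel_start <- c(NA, 2L, 7L)   # row 1 does not cover the target position

# Rows where the start is defined and lies inside the sequence
ok <- !is.na(rel_start) & rel_start >= 1L & rel_start <= width(seqs)

# Full-length result; rows without a usable start stay NA
base <- rep(NA_character_, length(seqs))
base[ok] <- as.character(subseq(seqs[ok], start = rel_start[ok], width = 1L))
base   # NA "G" "T"
```

The result is a plain character vector of the same length as the data, so it can be stored as a new column next to the rows whose position is NA.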