Hello community,
I have some DNA sequencing data that I am analyzing using R (I am a fairly inexperienced user). For downstream analysis, I need to check if the reads found within a .bam file are aligned correctly. Currently I am performing the following:
library(Rsamtools)
library(GenomicAlignments)
param <- ScanBamParam(what="seq") #sets up the parameter to read
gal <- readGAlignments("file.bam", param=param) #opens the BAM file
qseq <- mcols(gal)$seq # extracts the sequencing data
qseq_on_ref <- sequenceLayer(qseq, cigar(gal)) # aligns the sequencing data
qseq_on_ref
This gives me the following (see picture - colours are automatically applied?)
>qseq_on_ref
DNAStringSet object of length 47135:
width seq
[1] 280 TTCACGCACGTGA-CTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGTGACTCTAGATCATAATCAGCCATACCAC
[2] 274 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...CTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCA
[3] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGAATCTAGATCATAATCAGCCATACCAC
[4] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[5] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
... ... ...
[47131] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[47132] 180 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...ATCGAGGGATTCATCGAAAACGGATGGGAAGGCATGAT
[47133] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[47134] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[47135] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTT...AGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
I am stuck here because I can't expand the ' ... ' in the middle of the sequence. I also can't display more than 10 sequences on the screen.
q_seq_on_ref[1:20]
The above only displays 1-5 then 16-20:
[1] 280 TTCACGCACGTGA-CTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGTGACTCTAGATCATAATCAGCCATACCAC
[2] 274 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...CAGAACGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCA
[3] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGAATCTAGATCATAATCAGCCATACCAC
[4] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[5] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
... ... ...
[16] 279 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...CGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCA
[17] 280 TTCACGCACGTGACCTATGATCTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[18] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[19] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
[20] 280 TTCACGCACGTGACCTATGAACTCAGGAGTCGACGATTTTGACCTTGACATGCTCCCCGGGGGATC...GCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCAC
I suspect the answers to this will be fairly trivial - I am not used to working with this type of ouput in R.
Any suggestions on
- Viewing the full range (1 to 47135) of sequencing reads.
- Viewing full length of sequencing reads.
- Alternative methods to view data.
Would be greatly appreciated.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] FastqCleaner_1.8.0 shinyBS_0.61 shiny_1.6.0 stringr_1.4.0 ShortRead_1.48.0
[6] GenomicAlignments_1.26.0 SummarizedExperiment_1.20.0 Biobase_2.50.0 MatrixGenerics_1.2.1 matrixStats_0.58.0
[11] Rsamtools_2.6.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 Biostrings_2.58.0 XVector_0.30.0
[16] IRanges_2.24.1 S4Vectors_0.28.1 BiocParallel_1.24.1 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] bslib_0.2.4 lattice_0.20-41 htmltools_0.5.1.1 yaml_2.2.1 rlang_0.4.10 jquerylib_0.1.3
[7] later_1.1.0.1 withr_2.4.1 RColorBrewer_1.1-2 jpeg_0.1-8.1 GenomeInfoDbData_1.2.4 lifecycle_1.0.0
[13] zlibbioc_1.36.0 hwriter_1.3.2 htmlwidgets_1.5.3 latticeExtra_0.6-29 fastmap_1.1.0 crosstalk_1.1.1
[19] httpuv_1.5.5 Rcpp_1.0.6 xtable_1.8-4 promises_1.2.0.1 DT_0.17 BiocManager_1.30.10
[25] cachem_1.0.4 DelayedArray_0.16.2 jsonlite_1.7.2 mime_0.10 png_0.1-7 digest_0.6.27
[31] stringi_1.5.3 grid_4.0.3 tools_4.0.3 bitops_1.0-6 magrittr_2.0.1 sass_0.3.1
[37] RCurl_1.98-1.2 crayon_1.4.1 ellipsis_0.3.1 Matrix_1.3-2 rstudioapi_0.13 R6_2.5.0
[43] compiler_4.0.3
Can you say more about what your goals are? Trying to look at 47,000 sequences in order to see if they are 'aligned correctly' doesn't seem like a useful exercise. I mean that's a lot of sequences. And what's the definition/criteria for correct alignments in this context, and how would you know if you have fulfilled those criteria by eyeballing a bunch of letters on your screen?
The idea behind lots of the functions in Bioconductor (like the
show
method for aDNAStringSet
) is that A.) It's not useful to crank out a bunch of stuff on the console because how would you ever even look at that, and B.) when dealing with large data it's easy enough to say 'hmm, let me look at this' and end up with R compliantly printing out a metric gobton of data while you are maniacally punching on the keyboard trying to get it to just stop. So by definitionshow
ing the data will just provide the small snippet you see, which is probably more useful than any other alternative.