Hi,
I'm new to using BioStrings and the BSgenome packages in R, and am having some trouble with my analysis.
What I'm trying to do is to take the latest human genome reference release from the GRC and transform it into something useable for GATK analysis. Essentially, I want to read the FASTA file, remove the alt contigs and patches and mask the PARs in the Y chromosome.
I wanted to ask a couple of questions:
1. I want to remove the alternate contigs as well as the patches on the latest GRC reference (GRCh38.p11) to create a primary reference assembly. I've laid out what I've done but it is quite long-winded - is there a quicker way of doing this? What I've done is read the file as a DNAStringSet, select the sequences to keep, reorder the data and write sequences out.
In addition, is there a way to read the sequence description separate from the sequence ID? I've had to manually separate the strings and then piece them back together so that it makes sense. i.e.
x <- readDNAStringSet("GCA_000001405.26_GRCh38.p11_genomic.fa") y <- data.frame(seqID=sapply(strsplit(names(x)," "), `[`, 1), chromInfo = gsub("^\\S+ ", "", sapply(strsplit(names(x),","), `[`, 1), perl=T), assemblyInfo = sapply(strsplit(names(x),","), `[`, 2), seqHeader = names(x), seqLen = x@ranges@width)
then decide which to keep and concatenate the columns together to make it output like the GATK format. It works, but was wondering whether there was a more efficient method.
2. I want to hard mask the PARs in the Y chromosome (i.e. change the sequence to N's). I've tried using the 'ExtractAt' and 'ReplaceAt' functions, but they haven't worked well for me (e.g. I keep getting errors like:
> extractAt(fgenome, at) Error in .normarg_at2(at, x) : some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x' In addition: Warning messages: 1: In .normarg_at2(at, x) : 'at' names were ignored 2: In .V_recycle(at, x, "at", "'length(x)'") : 'length(x)' is not a multiple of 'length(at)'
I've also tried creating a data frame of the Y chromosome, manually substituted the ACTGs into Ns and then created a DNAStringSet from there. But when I try to replace the Y Chromosome DNAStringSet in the original data set with this StringSet, I am unable to do it...
at <- data.frame(chrom="Y GRCh38.p11:CM000686.2:1:57227415", start=10001, end = 2781479) temp <- getSeq(fgenome, as(at, "GRanges")) at <- gsub("A", "N", temp) at <- gsub("C", "N", at) at <- gsub("T", "N", at) at <- gsub("G", "N", at) YPAR1mask <- DNAStringSet(at) replaceAt(fgenome, grep ("Y GRCh38.p11:CM000686.2:1:57227415", fgenome), YPAR1mask)
What I want to mask is the following:
Name Chr Start Stop
PAR#1 | Y | 10,001 | 2,781,479 |
PAR#2 | Y | 56,887,903 | 57,217,415 |
Any help would be much appreciated.
Thanks!