I'm dealing with the results of some VCF files generated with exactSNP over BAM files made with subjunc aligner (all installed from Rsubread package) from a RNA-Seq project (made with HiSeq/TruSeq/mRNA only). Phred scores are the default (phred+33). The score scale of the FASTQ was detected as Sanger/Illumina-1.8.
Now, I want keep only the high quality variants with at least 2 supporting reads. I've tried this two filtering expressions:
vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$SR %in% c(NA, '1'))), ]
And:
vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$MM %in% c(NA, '1'))), ]
The first one simply get rid of all variants (empty CollapsedVCF as result). The second returns about 5% of the variants in the original file (which seems reasonable to me).
Can someone explain the difference between these two results? A suggestion on how to filter VCFs generated with subjunc/exactSNP is much welcome too.
My parameters were based on guidelines used by GATK pipelines. They suggest at least 2 supporting reads to call a SNPs as a good candidate. Your parameters work fine. But, as we are using these results for clinical investigation, a more stringent criteria seems appropriate. About indels, I'm treating the file generated by subjunc with this expression:
Why is SR field coded as char instead of int?
Thanks for pointing this out. We will change the 'SR' field to Integer type.
For your indel filtering, you should not use the 'QUAL' and 'DP' fields. See my explanation above.
For indels, I'm using the subjunc-generated indels file. The QUAL field there has other scale (I'm not sure exactly which scale, though). As we already had some independent clinical/laboratorial results, I've calibrated the parameters based on those previous findings. The expected indels were there. With SR > 2 those findings are filtered out. About SNPs, with QUAL > 20 and DP > 5 I've got pretty reasonable results. I confess I'm very satisfied with Rsubread/subread.