Hi,
I have a number of single-end FASTQ files containing reads from a barcoding experiment. I have a large list of barcodes (~120,000) and I want to count the number of exact barcode matches in the FASTQ files.
I have been looking into the ShortRead package but I'm not entirely sure if it's the right tool for this as I can't figure out how to use it to do this.
Can anyone suggest a way I can get counts for exact matches in R?
What do you mean by "barcoding experiment"? Something like sequencing reads that contain a barcode, as in a CRISPRi screen? If so, it probably comes down to making a FASTA file with all barcode sequences and running an end-to-end alignment with something like bowtie2, with the penalty parameters set to a high value such as 10000 so that only perfect end-to-end matches get aligned and everything else goes unmapped. Directly in R, the Rsubread package can probably do that, but running bowtie2 is basically a one-liner in bash. Then you could use something like featureCounts to count reads per barcode.
Hi, you can do this in bash with a one liner,
assuming the barcode is GTGAAA. Here I count the first 4000, but if you remove the head pipe it will count the entire file (which takes at least a few minutes for a typical RNA-seq file).
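A minimal sketch of what such a one-liner could look like, wrapped in a small function so it can be reused. The function name count_barcode and the file name reads.fastq are hypothetical, and reading "the first 4000" as the first 4000 lines is an assumption. Note that grep -c counts lines that contain the barcode anywhere, not individual occurrences, and it does not distinguish sequence lines from header or quality lines.

```shell
# Count FASTQ lines that contain a barcode as a substring.
# Usage: count_barcode BARCODE FASTQ [MAX_LINES]
# (hypothetical helper; names are illustrative only)
count_barcode() {
  barcode=$1
  fastq=$2
  max_lines=${3:-}
  if [ -n "$max_lines" ]; then
    # Only inspect the first MAX_LINES lines (a FASTQ read spans 4 lines).
    head -n "$max_lines" "$fastq" | grep -c -F -e "$barcode"
  else
    # Scan the whole file.
    grep -c -F -e "$barcode" "$fastq"
  fi
}

# Equivalent one-liners for a hypothetical reads.fastq:
#   head -n 4000 reads.fastq | grep -c GTGAAA
#   grep -c GTGAAA reads.fastq
```

Called as count_barcode GTGAAA reads.fastq 4000 it looks only at the first 4000 lines; without the third argument it scans the whole file.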
A
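For a list of ~120,000 barcodes, running grep once per barcode would mean one pass over the file for each of them. One possible alternative, sketched here under the assumption that every barcode has the same length and sits at a fixed position in the read, is a single awk pass with a hash lookup. The function name count_barcodes, the one-barcode-per-line list file, and the start/length arguments are all hypothetical and not from the posts above.

```shell
# Count exact barcode matches at a fixed position in each read.
# Usage: count_barcodes BARCODES_TXT FASTQ START LENGTH
#   BARCODES_TXT: one barcode per line (assumed format)
#   START/LENGTH: 1-based position and length of the barcode in the read (assumed)
# Prints "barcode<TAB>count" for every barcode seen at least once.
count_barcodes() {
  awk -v start="$3" -v len="$4" '
    # First file: load the barcode list into a hash.
    NR == FNR { known[$1] = 1; next }
    # Second file: sequence lines are lines 2, 6, 10, ... of the FASTQ.
    FNR % 4 == 2 {
      bc = substr($0, start, len)
      if (bc in known) count[bc]++
    }
    END { for (bc in count) printf "%s\t%d\n", bc, count[bc] }
  ' "$1" "$2" | sort
}
```

For example, count_barcodes barcodes.txt reads.fastq 1 6 would count 6-base barcodes at the start of each read. Barcodes with zero matches are not printed, and reads whose barcode sits at a different offset are not counted.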
You may be able to do it with umi_tools extract: https://umi-tools.readthedocs.io/en/latest/reference/extract.html
This question is more suited to a general forum like Biostars or Bioinformatics Stack Exchange.