Retrieving specific sequences from a fastq
2
0
Entering edit mode
ferbecneu • 0
@ferbecneu-16188
Last seen 4.1 years ago

Hi, Im working with a 4c-seq experiment and one of the issues with this technique is that you have to delete reads that originated adjacent from the sequence of interest, I have the sequence corresponding to these reads, I need to delete these reads, till now I only found the reads that have this sequence using vmatchpattern but how can i delete these reads and create a new fastq archive without them?

Thank you very much for your help!

shortread biostrings • 4.4k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

See ShortRead::fastqFilter() including the example.

ADD COMMENT
0
Entering edit mode
ferbecneu • 0
@ferbecneu-16188
Last seen 4.1 years ago

Thank you Martin, I read the function you said: (filterFastq(files, destinations, ..., filter = FilterRules(), compress=TRUE, yieldSize = 1000000L). However Im starting with bioinformatics and I dont know how to write filter/filterRules to remove the reads that contain the sequence: GCAGTGGGGCGGGTCTAAGCAGA.

Can you help me doing so please? Thank you very much!

ADD COMMENT
0
Entering edit mode

Looks like this is a COMMENT rather than Answer; please use the 'Add comment' button for comments.

To get some example data, I entered the following into my R session

library(ShortRead)
example(fastqFilter)

This creates a variable that points to a small fastq file; I read it in and extracted the short read sequences, just to get a look

> sr = readFastq(fl)
> sread(sr)
  A DNAStringSet instance of length 256
      width seq
  [1]    36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
  [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
  [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT
  [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA
  [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT
  ...   ... ...
[252]    36 GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGT
[253]    36 GTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
[254]    36 GATGAACTAAGTCAACCTCAGCACTAACCTTGCGAG
[255]    36 GTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTA
[256]    36 GCAATCTGCCGACCACTCGCGATTCAATCATGACTT

Suppose I wanted to get rid of sequences with "TTACC". I could use the grepl() function to find the reads that do not contain this pattern, and subset the original reads to get those that I want to keep

> sr[!grepl("TTACC", sread(sr))]
class: ShortReadQ
length: 244 reads; width: 36 cycles

So it looks like there are 244 reads that satisfy my criterion. Now write a function that does this, and test...

fun <- function(x)
    x[ !grepl("TTACC", sread(x)) ]

Verify that it works

> fun(sr)
class: ShortReadQ
length: 244 reads; width: 36 cycles

Then use it in the filterFastq function, creating a new fastq file with the filtered results

> out <- tempfile()
> filterFastq(fl, out, filter=fun)
[1] "/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//Rtmpnfw9lM/file172496fd16a17"
attr(,"filter")
                 Reads KeptReads Nucl KeptNucl
s_1_sequence.txt   256       244 9216     8784

verify that the output file contains the correct number of sequences

> readFastq(out)
class: ShortReadQ
length: 244 reads; width: 36 cycles
ADD REPLY
0
Entering edit mode

Thank you very much! This was super useful for me :D

ADD REPLY

Login before adding your answer.

Traffic: 682 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6