Question

help with cleanUpdTSeq for PAS analysis

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 17 months ago

United States

Dear Bo, Sounds like you have not clustered reads into peaks yet. After you do, you should not have too many distinct potential polyA sites. FYI, we got 205470 peaks for one time point and 75511 for another time point. Sarah Sheppard, a Ph.D student who recently graduated, wrote some perl script to do such clustering. I cced her in this email in case you are interested in her scripts. Also, we divide all potential peaks into batches, each with about 10,000 peaks and run them in the school cluster hpcc simultaneously. Here is how to run your R script using qsub in the cluster assuming your R script is called predPas.R . qsub cwd predPas.sh Here is the content in the predPas.sh #!/bin/bash #$ -V #$ -l mem_free=26G /share/bin/R-3.0.1/lib64/R/bin/R CMD BATCH predPas.R The R session output will be in predPas.Rout. For missing libraries, you could contact the cluster management group to install the proper R packages. Here is the user manual for cleanUpdTSeq if you have not read it http: //www.bioconductor.org/packages/2.13/bioc/vignettes/cleanUpdTSeq/inst/ doc/cleanUpdTSeq.pdf. We probably need to add a section on how to run it in a cluster environment and add the clustering script. Thanks for your feedback! Best regards, Julie On 9/12/13 10:04 AM, "Han, Bo" <bo.han@umassmed.edu> wrote: Hello, Dear Prof. Zhu, This is a Graduate Student in the Zamore Lab. Congratulations on your recent paper on bioinformatics and thanks to that, our analysis on PAS could be easier. Xin, my collaborator has met with you last week and got the document of cleanUpdTSeq. I was following that document, but met a problem. Seems that the program requires lots of memory and I couldn't finish the running even on our best node in school. I have merged the reads that mapped to the same position, ,and even separated the reads to different chromosomes (each 510M). But still took too much memory and cannot finish running. I would really appreciate it if you can give me some suggestions. The data (with individual chromosomes) are located at /home/hanb/scratch/p repachytene/PAS/mouse.testis.wild.type.adult.PASseq/chromosomes. Here is the code that I ran the R: #source("http://bioconductor.org/biocLite.R") #biocLite("cleanUpdTSeq") library(cleanUpdTSeq) library(BSgenome.Mmusculus.UCSC.mm9) args = commandArgs (T) inputBedFile = args[1] testSet = read.table(inputBedFile, sep = "\t", header = F) peaks = BED2GRangesSeq (testSet, withSeq=F) testSet.NaiveBayes = buildFeatureVector (peaks,BSgenomeName = Mmusculus, upstream = 40, downstream = 30, wordSize = 6, alphabet=c("ACGT"), sampleType = "unknown",replaceNAdistance = 30, method = "NaiveBayes", ZeroBasedIndex = 1, fetchSeq = TRUE) data(data.NaiveBayes) predictTestSet(data.NaiveBayes$Negative, data.NaiveBayes$Positive, testSet.NaiveBayes, outputFile = paste (inputBedFile, ".tsv", sep=""), assignmentCutoff = 0.5) And how I got the bed from fastq and ran the Rscript (last two lines) #! /bin/bash -x ################ # Major Config # ################ export PPP_PATH=/home/hanb/nearline/ppp export PATH=$PPP_PATH:$PATH # FQ input FQ=$1 # CPU CPU=8 # step STEP=1 # genome Genome=mm9 # bowtie2 index BOWTIE2_INDEX=/home/hanb/nearline/Mus_musculus/UCSC/mm9/Sequence/Whole GenomeFasta/genome # chrom Chrome=/home/hanb/nearline/small_RNA_Pipeline/common_files/mm9/ChromIn fo.txt # GTF file with gene annotation GTF_FILE=/home/hanb/scratch/mm9.newest.07dpp.gtf # check read length READ_LEN=`head -2 $FQ | awk '{getline; printf "%d", length($1)}'` # STAR sjdbOverhang option STARsjdbOverhang=$((READ_LEN-1)) # STAR genome index directory STAR_Genome_INDEX_DIR=/home/hanb/nearline/PE_Pipeline/common_files/mm9 /STARgenomeIndex STAR_rRNA_INDEX_DIR=/home/hanb/nearline/PE_Pipeline/common_files/mm9/r RNAIndex # STAR genome index STAR_Genome_INDEX=$STAR_Genome_INDEX_DIR/$STARsjdbOverhang STAR_rRNA_INDEX=$STAR_rRNA_INDEX_DIR/$STARsjdbOverhang # MM MM=$(((READ_LEN+1)/25)) # prefix PREFIX=${FQ%.fq} # trimmed 3' adaptor and filter quality ADAPTOR_SEQ="AATGGAATTC" # allow two As in front of the 3' adaptor # ADAPTOR_SEQ="AAAAAAAAAAAAAAAAAAAATGGAATTCTCGGGTGCCAAGGC" # reads are subjected to quality trimming first until reaching phred >= 3 # then find and trim adaptor # reads without adaptor found are still reported but later filtered # reads with N is dumped # reads shorter than 25nt after quality trimming and adaptor clipping are dumped as well [ ! -f .status.${STEP}.flexbar ] && \ flexbar \ -r $FQ \ -t ${PREFIX}.trimmed \ -f fastq-i1.8 \ -n $CPU \ -as $ADAPTOR_SEQ \ -at 1.0 \ -ao 10 \ --min-readlength 25 \ --max-uncalled 0 \ -q 3 && \ awk '{NAME=$1;getline;SEQ=$1;getline;getline;QUAL=$1; ++A; if (substr (SEQ,length($0)-5)=="AAAAAA") {++B; printf "%s\n%s\n+\n%s\n", NAME, SEQ, QUAL}}END{print A"\t"B >> "filter_6A.log"}' ${PREFIX}.trimmed.fastq > ${PREFIX}.trimmed.fq && \ touch .status.${STEP}.flexbar STEP=$((STEP+1)) # map to genome using STAR with stringent condition # ideally, polyA by internal priming shoud be able to be mapped INPUT=${PREFIX}.trimmed.fq [ ! -f .status.${STEP}.genome_mapping_with_polyA ] && \ STAR \ --runMode alignReads \ --genomeDir $STAR_Genome_INDEX \ --readFilesIn ${INPUT} \ --runThreadN $CPU \ --outFilterScoreMin 0 \ --outFilterScoreMinOverLread 0.90 \ --outFilterMatchNmin 0 \ --outFilterMatchNminOverLread 0.90 \ --outFilterMultimapScoreRange 1 \ --outFilterMultimapNmax -1 \ --outFilterMismatchNmax 10 \ --outFilterMismatchNoverLmax 0.10 \ --alignIntronMax 0 \ --alignIntronMin 21 \ --alignSJDBoverhangMin 1 \ --outFilterIntronMotifs RemoveNoncanonicalUnannotated \ --outSAMstrandField intronMotif \ --outSAMattributes All \ --genomeLoad NoSharedMemory \ --outFileNamePrefix ${PREFIX}.x_rRNA.${Genome}.+polyA.unmapped.softSlipping. \ --outReadsUnmapped Fastx \ --outSJfilterReads Unique \ --seedSearchStartLmax 20 \ --seedSearchStartLmaxOverLread 1.0 \ --chimSegmentMin 0 && \ touch .status.${STEP}.genome_mapping_with_polyA STEP=$((STEP+1)) # map unmapped without polyA ## run trimpoly ## convert FQ to FA for trimpoly [ ! -f .status.${STEP}.trimpoly ] && \ awk '{name=substr($1,2);print ">"name; getline;print $1;getline;getline}' ${PREFIX}.trimmed.fq > ${PREFIX}.trimmed.fa && \ trimpoly -e 3 -s 5 < ${PREFIX}.trimmed.fa > ${PREFIX}.trimmed.trimpoly.log && trimpoly2 ${PREFIX}.trimmed.fq ${PREFIX}.trimmed.trimpoly.log > ${PREFIX}.trimmed.-polyA.fq && \ touch .status.${STEP}.trimpoly STEP=$((STEP+1)) # map to genome again INPUT=${PREFIX}.trimmed.-polyA.fq [ ! -f .status.${STEP}.genome_mapping_without_polyA ] && \ STAR \ --runMode alignReads \ --genomeDir $STAR_Genome_INDEX \ --readFilesIn ${INPUT} \ --runThreadN $CPU \ --outFilterScoreMin 0 \ --outFilterScoreMinOverLread 0.72 \ --outFilterMatchNmin 0 \ --outFilterMatchNminOverLread 0.72 \ --outFilterMultimapScoreRange 1 \ --outFilterMultimapNmax -1 \ --outFilterMismatchNmax 15 \ --outFilterMismatchNoverLmax 0.15 \ --alignIntronMax 0 \ --alignIntronMin 21 \ --alignSJDBoverhangMin 1 \ --outFilterIntronMotifs RemoveNoncanonicalUnannotated \ --outSAMstrandField intronMotif \ --outSAMattributes All \ --genomeLoad NoSharedMemory \ --outFileNamePrefix ${PREFIX}.x_rRNA.${Genome}.-polyA. \ --outReadsUnmapped Fastx \ --outSJfilterReads Unique \ --seedSearchStartLmax 20 \ --seedSearchStartLmaxOverLread 1.0 \ --chimSegmentMin 0 && \ touch .status.${STEP}.genome_mapping_without_polyA STEP=$((STEP+1)) # getting statistics InputReads=`grep 'Number of input reads' ${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'` UniquReads=`grep 'Uniquely mapped reads number' ${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'` MultiReads=`grep 'Number of reads mapped to multiple loci' ${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'` AllMapReads=$((UniquReads+MultiReads)) UnMapReads=$((InputReads-UniquReads-MultiReads)) # normalization factor NormScale=`echo $UniquReads | awk '{printf "%f",1000000.0/$1}'` # convert format [ ! -f .status.${STEP}.data_convert1 ] && \ samtools view -bS ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.sam > ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam && \ samtools sort -@ $CPU ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted && \ rm -rf ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.sam ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam && \ touch .status.${STEP}.data_convert1 STEP=$((STEP+1)) # making bigWig [ ! -f .status.${STEP}.bigWig ] && \ samtools view -h ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.bam | \ filterSoftClippedReads -3 | \ samtools view -bS - | \ bedtools bamtobed -i - | \ awk 'BEGIN{OFS="\t";FS="\t"}{if ($6=="+") {$2=$3-1} else {$3=$2+1} print $0}' | \ uniq \ > ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed && \ awk 'BEGIN{OFS="\t"}{if ($5==255) {print $0 > "/dev/stdout"} else {print $0 > "/dev/stderr"}}' ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed \ 1> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed \ 2> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.multip.bed && \ paraFile="${RANDOM}${RANDOM}.para" && \ echo "bedtools genomecov -scale $NormScale -bg -strand + -i ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed -g $Chrome > ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.bedgraph && bedGraphToBigWig ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.bedgraph $Chrome ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.BW" > $paraFile && \ echo "bedtools genomecov -scale $NormScale -bg -strand - -i ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed -g $Chrome | awk 'BEGIN{OFS=\"\t\"}{\$4 = -\$4; print \$0}' > ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.bedgraph && bedGraphToBigWig ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.bedgraph $Chrome ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.BW" >> $paraFile && \ ParaFly -CPU $CPU -c $paraFile -failed_cmds ${STEP}.failed_commands && \ echo "track name=U.5p.${PREFIX}(+) description=${PREFIX}(+) maxHeightPixels=25 alwaysZero=on autoScale=on yLineMark=0 yLineOnOff=on type=bigWig color=$((RANDOM%255)),$((RANDOM%255)),$((RANDOM%255)) visibility=full bigDataUrl=http://zlab.umassmed.edu/~hanb/Bo/${PREFIX}.x_rRNA.${Genome }.-polyA.sorted.noS.5p.unique.plus.BW" > ${PREFIX}.x_rRNA.${Genome}.-polyA.5p.unique.track && \ echo "track name=U.5p.${PREFIX}(-) description=${PREFIX}(-) maxHeightPixels=25 alwaysZero=on autoScale=on yLineMark=0 yLineOnOff=on type=bigWig color=$((RANDOM%255)),$((RANDOM%255)),$((RANDOM%255)) visibility=full bigDataUrl=http://zlab.umassmed.edu/~hanb/Bo/${PREFIX}.x_rRNA.${Genome }.-polyA.sorted.noS.5p.unique.minus.BW" >> ${PREFIX}.x_rRNA.${Genome}.-polyA.5p.unique.track && \ touch .status.${STEP}.bigWig STEP=$((STEP+1)) [ ! -f .status.${STEP}.doCleanUpdTSeq ] && \ awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,1,1,$6}' ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed | uniq | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,NR,1,$6}' > ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed2 && \ Rscript $PPP_PATH/cleanUpdTSeq.R ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed2 && \ touch .status.${STEP}.doCleanUpdTSeq On Sep 4, 2013, at 7:28 PM, "Li, Xin" <xin.li@umassmed.edu> wrote: <cleanupdtseq.pdf> [[alternative HTML version deleted]]

Normalization Clustering convert cleanUpdTSeq Normalization Clustering convert cleanUpdTSeq • 1.6k views

ADD COMMENT • link 11.6 years ago Julie Zhu ★ 4.3k