Entering edit mode
Dear Bo,
Sounds like you have not clustered reads into peaks yet. After you do,
you should not have too many distinct potential polyA sites. FYI, we
got 205470 peaks for one time point and 75511 for another time point.
Sarah Sheppard, a Ph.D student who recently graduated, wrote some perl
script to do such clustering. I cced her in this email in case you are
interested in her scripts.
Also, we divide all potential peaks into batches, each with about
10,000 peaks and run them in the school cluster hpcc simultaneously.
Here is how to run your R script using qsub in the cluster assuming
your R script is called predPas.R .
qsub cwd predPas.sh
Here is the content in the predPas.sh
#!/bin/bash
#$ -V
#$ -l mem_free=26G
/share/bin/R-3.0.1/lib64/R/bin/R CMD BATCH predPas.R
The R session output will be in predPas.Rout. For missing libraries,
you could contact the cluster management group to install the proper R
packages.
Here is the user manual for cleanUpdTSeq if you have not read it http:
//www.bioconductor.org/packages/2.13/bioc/vignettes/cleanUpdTSeq/inst/
doc/cleanUpdTSeq.pdf. We probably need to add a section on how to run
it in a cluster environment and add the clustering script.
Thanks for your feedback!
Best regards,
Julie
On 9/12/13 10:04 AM, "Han, Bo" <bo.han@umassmed.edu> wrote:
Hello, Dear Prof. Zhu,
This is a Graduate Student in the Zamore Lab.
Congratulations on your recent paper on bioinformatics and thanks to
that, our analysis on PAS could be easier.
Xin, my collaborator has met with you last week and got the document
of cleanUpdTSeq. I was following that document, but met a problem.
Seems that the program requires lots of memory and I couldn't finish
the running even on our best node in school. I have merged the reads
that mapped to the same position, ,and even separated the reads to
different chromosomes (each 510M). But still took too much memory and
cannot finish running.
I would really appreciate it if you can give me some suggestions. The
data (with individual chromosomes) are located at /home/hanb/scratch/p
repachytene/PAS/mouse.testis.wild.type.adult.PASseq/chromosomes.
Here is the code that I ran the R:
#source("http://bioconductor.org/biocLite.R")
#biocLite("cleanUpdTSeq")
library(cleanUpdTSeq)
library(BSgenome.Mmusculus.UCSC.mm9)
args = commandArgs (T)
inputBedFile = args[1]
testSet = read.table(inputBedFile, sep = "\t", header = F)
peaks = BED2GRangesSeq (testSet, withSeq=F)
testSet.NaiveBayes = buildFeatureVector (peaks,BSgenomeName =
Mmusculus, upstream = 40,
downstream = 30, wordSize = 6, alphabet=c("ACGT"),
sampleType = "unknown",replaceNAdistance = 30,
method = "NaiveBayes", ZeroBasedIndex = 1, fetchSeq = TRUE)
data(data.NaiveBayes)
predictTestSet(data.NaiveBayes$Negative, data.NaiveBayes$Positive,
testSet.NaiveBayes,
outputFile = paste (inputBedFile, ".tsv", sep=""), assignmentCutoff =
0.5)
And how I got the bed from fastq and ran the Rscript (last two lines)
#! /bin/bash -x
################
# Major Config #
################
export PPP_PATH=/home/hanb/nearline/ppp
export PATH=$PPP_PATH:$PATH
# FQ input
FQ=$1
# CPU
CPU=8
# step
STEP=1
# genome
Genome=mm9
# bowtie2 index
BOWTIE2_INDEX=/home/hanb/nearline/Mus_musculus/UCSC/mm9/Sequence/Whole
GenomeFasta/genome
# chrom
Chrome=/home/hanb/nearline/small_RNA_Pipeline/common_files/mm9/ChromIn
fo.txt
# GTF file with gene annotation
GTF_FILE=/home/hanb/scratch/mm9.newest.07dpp.gtf
# check read length
READ_LEN=`head -2 $FQ | awk '{getline; printf "%d", length($1)}'`
# STAR sjdbOverhang option
STARsjdbOverhang=$((READ_LEN-1))
# STAR genome index directory
STAR_Genome_INDEX_DIR=/home/hanb/nearline/PE_Pipeline/common_files/mm9
/STARgenomeIndex
STAR_rRNA_INDEX_DIR=/home/hanb/nearline/PE_Pipeline/common_files/mm9/r
RNAIndex
# STAR genome index
STAR_Genome_INDEX=$STAR_Genome_INDEX_DIR/$STARsjdbOverhang
STAR_rRNA_INDEX=$STAR_rRNA_INDEX_DIR/$STARsjdbOverhang
# MM
MM=$(((READ_LEN+1)/25))
# prefix
PREFIX=${FQ%.fq}
# trimmed 3' adaptor and filter quality
ADAPTOR_SEQ="AATGGAATTC" # allow two As in front of the 3' adaptor
# ADAPTOR_SEQ="AAAAAAAAAAAAAAAAAAAATGGAATTCTCGGGTGCCAAGGC"
# reads are subjected to quality trimming first until reaching phred
>= 3
# then find and trim adaptor
# reads without adaptor found are still reported but later filtered
# reads with N is dumped
# reads shorter than 25nt after quality trimming and adaptor clipping
are dumped as well
[ ! -f .status.${STEP}.flexbar ] && \
flexbar \
-r $FQ \
-t ${PREFIX}.trimmed \
-f fastq-i1.8 \
-n $CPU \
-as $ADAPTOR_SEQ \
-at 1.0 \
-ao 10 \
--min-readlength 25 \
--max-uncalled 0 \
-q 3 && \
awk '{NAME=$1;getline;SEQ=$1;getline;getline;QUAL=$1; ++A; if (substr
(SEQ,length($0)-5)=="AAAAAA") {++B; printf "%s\n%s\n+\n%s\n", NAME,
SEQ, QUAL}}END{print A"\t"B >> "filter_6A.log"}'
${PREFIX}.trimmed.fastq > ${PREFIX}.trimmed.fq && \
touch .status.${STEP}.flexbar
STEP=$((STEP+1))
# map to genome using STAR with stringent condition
# ideally, polyA by internal priming shoud be able to be mapped
INPUT=${PREFIX}.trimmed.fq
[ ! -f .status.${STEP}.genome_mapping_with_polyA ] && \
STAR \
--runMode alignReads \
--genomeDir $STAR_Genome_INDEX \
--readFilesIn ${INPUT} \
--runThreadN $CPU \
--outFilterScoreMin 0 \
--outFilterScoreMinOverLread 0.90 \
--outFilterMatchNmin 0 \
--outFilterMatchNminOverLread 0.90 \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax -1 \
--outFilterMismatchNmax 10 \
--outFilterMismatchNoverLmax 0.10 \
--alignIntronMax 0 \
--alignIntronMin 21 \
--alignSJDBoverhangMin 1 \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--outSAMstrandField intronMotif \
--outSAMattributes All \
--genomeLoad NoSharedMemory \
--outFileNamePrefix
${PREFIX}.x_rRNA.${Genome}.+polyA.unmapped.softSlipping. \
--outReadsUnmapped Fastx \
--outSJfilterReads Unique \
--seedSearchStartLmax 20 \
--seedSearchStartLmaxOverLread 1.0 \
--chimSegmentMin 0 && \
touch .status.${STEP}.genome_mapping_with_polyA
STEP=$((STEP+1))
# map unmapped without polyA
## run trimpoly
## convert FQ to FA for trimpoly
[ ! -f .status.${STEP}.trimpoly ] && \
awk '{name=substr($1,2);print ">"name; getline;print
$1;getline;getline}' ${PREFIX}.trimmed.fq > ${PREFIX}.trimmed.fa && \
trimpoly -e 3 -s 5 < ${PREFIX}.trimmed.fa >
${PREFIX}.trimmed.trimpoly.log &&
trimpoly2 ${PREFIX}.trimmed.fq ${PREFIX}.trimmed.trimpoly.log >
${PREFIX}.trimmed.-polyA.fq && \
touch .status.${STEP}.trimpoly
STEP=$((STEP+1))
# map to genome again
INPUT=${PREFIX}.trimmed.-polyA.fq
[ ! -f .status.${STEP}.genome_mapping_without_polyA ] && \
STAR \
--runMode alignReads \
--genomeDir $STAR_Genome_INDEX \
--readFilesIn ${INPUT} \
--runThreadN $CPU \
--outFilterScoreMin 0 \
--outFilterScoreMinOverLread 0.72 \
--outFilterMatchNmin 0 \
--outFilterMatchNminOverLread 0.72 \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax -1 \
--outFilterMismatchNmax 15 \
--outFilterMismatchNoverLmax 0.15 \
--alignIntronMax 0 \
--alignIntronMin 21 \
--alignSJDBoverhangMin 1 \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--outSAMstrandField intronMotif \
--outSAMattributes All \
--genomeLoad NoSharedMemory \
--outFileNamePrefix ${PREFIX}.x_rRNA.${Genome}.-polyA. \
--outReadsUnmapped Fastx \
--outSJfilterReads Unique \
--seedSearchStartLmax 20 \
--seedSearchStartLmaxOverLread 1.0 \
--chimSegmentMin 0 && \
touch .status.${STEP}.genome_mapping_without_polyA
STEP=$((STEP+1))
# getting statistics
InputReads=`grep 'Number of input reads'
${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'`
UniquReads=`grep 'Uniquely mapped reads number'
${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'`
MultiReads=`grep 'Number of reads mapped to multiple loci'
${PREFIX}.x_rRNA.${Genome}.-polyA.Log.final.out | awk '{print $NF}'`
AllMapReads=$((UniquReads+MultiReads))
UnMapReads=$((InputReads-UniquReads-MultiReads))
# normalization factor
NormScale=`echo $UniquReads | awk '{printf "%f",1000000.0/$1}'`
# convert format
[ ! -f .status.${STEP}.data_convert1 ] && \
samtools view -bS ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.sam >
${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam && \
samtools sort -@ $CPU
${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted && \
rm -rf ${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.sam
${PREFIX}.x_rRNA.${Genome}.-polyA.Aligned.out.bam && \
touch .status.${STEP}.data_convert1
STEP=$((STEP+1))
# making bigWig
[ ! -f .status.${STEP}.bigWig ] && \
samtools view -h ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.bam | \
filterSoftClippedReads -3 | \
samtools view -bS - | \
bedtools bamtobed -i - | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($6=="+") {$2=$3-1} else {$3=$2+1}
print $0}' | \
uniq \
> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed && \
awk 'BEGIN{OFS="\t"}{if ($5==255) {print $0 > "/dev/stdout"} else
{print $0 > "/dev/stderr"}}'
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed \
1> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed \
2> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.multip.bed && \
paraFile="${RANDOM}${RANDOM}.para" && \
echo "bedtools genomecov -scale $NormScale -bg -strand + -i
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed -g $Chrome
> ${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.bedgraph
&& bedGraphToBigWig
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.bedgraph
$Chrome
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.plus.BW" >
$paraFile && \
echo "bedtools genomecov -scale $NormScale -bg -strand - -i
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed -g $Chrome
| awk 'BEGIN{OFS=\"\t\"}{\$4 = -\$4; print \$0}' >
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.bedgraph
&& bedGraphToBigWig
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.bedgraph
$Chrome
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.minus.BW" >>
$paraFile && \
ParaFly -CPU $CPU -c $paraFile -failed_cmds ${STEP}.failed_commands &&
\
echo "track name=U.5p.${PREFIX}(+) description=${PREFIX}(+)
maxHeightPixels=25 alwaysZero=on autoScale=on yLineMark=0
yLineOnOff=on type=bigWig
color=$((RANDOM%255)),$((RANDOM%255)),$((RANDOM%255)) visibility=full
bigDataUrl=http://zlab.umassmed.edu/~hanb/Bo/${PREFIX}.x_rRNA.${Genome
}.-polyA.sorted.noS.5p.unique.plus.BW" >
${PREFIX}.x_rRNA.${Genome}.-polyA.5p.unique.track && \
echo "track name=U.5p.${PREFIX}(-) description=${PREFIX}(-)
maxHeightPixels=25 alwaysZero=on autoScale=on yLineMark=0
yLineOnOff=on type=bigWig
color=$((RANDOM%255)),$((RANDOM%255)),$((RANDOM%255)) visibility=full
bigDataUrl=http://zlab.umassmed.edu/~hanb/Bo/${PREFIX}.x_rRNA.${Genome
}.-polyA.sorted.noS.5p.unique.minus.BW" >>
${PREFIX}.x_rRNA.${Genome}.-polyA.5p.unique.track && \
touch .status.${STEP}.bigWig
STEP=$((STEP+1))
[ ! -f .status.${STEP}.doCleanUpdTSeq ] && \
awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,1,1,$6}'
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.unique.bed | uniq |
awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,NR,1,$6}' >
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed2 && \
Rscript $PPP_PATH/cleanUpdTSeq.R
${PREFIX}.x_rRNA.${Genome}.-polyA.sorted.noS.5p.bed2 && \
touch .status.${STEP}.doCleanUpdTSeq
On Sep 4, 2013, at 7:28 PM, "Li, Xin" <xin.li@umassmed.edu> wrote:
<cleanupdtseq.pdf>
[[alternative HTML version deleted]]