Entering edit mode
writetoroopali
•
0
@writetoroopali-18024
Last seen 6.1 years ago
Hi
I am trying to do alignment and counting from a fastq file using package Rsubread. I am following the following instructions:
http://combine-australia.github.io/RNAseq-R/07-rnaseq-day2.html
However, every time I try to build index for the human genome the process is killed after 50% index is build and .files and .log are there but no .reads file is generated
My code:
buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")
Please help.
Thank you.
Roopali
What is the version of Rsubread you use? You need to provide your session info when you post.
There are some changes made to the default setting of the
buildindex()
function. The function now generates a full index by default, meaning that ~17GB of memory is required when building the index for human genome. Does your computer have enough memory?Hi Wei,
Thank you so much for your reply. This is the sessionInfo for package "Rsubread":
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8
attached base packages:
character(0)
other attached packages:
[1] Rsubread_1.30.9
loaded via a namespace (and not attached):
[1] compiler_3.5.1 Matrix_1.2-14 graphics_3.5.1 tools_3.5.1
[5] utils_3.5.1 yaml_2.2.0 grDevices_3.5.1 stats_3.5.1
[9] gumbel_1.10-2 datasets_3.5.1 grid_3.5.1 methods_3.5.1
[13] numDeriv_2016.8-1 base_3.5.1 lattice_0.20-35
Also I have more than ~17GB memory on my computer. I also have done this on a server provided by my college and the same things happens there also. Process is killed after 40% or 50% of index.
Please help.
Roopali
Hi Roopali, could you also provide your R command?
My R command is:
buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna")
The reference genome sequence was downloaded from here: https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml
Thank you.
Roopali
Hi Roopali, could you also provide the screen output from
buildindex
? This function has been successfully tested on unix servers.Would you please run the following command on your Mac to see if it works?
buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000)
Hi Wei,
When I use buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna") I get:
//=========================== indexBuilder setting ===========================\\
|| Index name : GRCh38 ||
|| Index space : base-space ||
|| One block index : yes ||
|| Repeat threshold : 100 repeats ||
|| Distance to next subread : 1 ||
|| Input files : 1 file in total ||
|| o GRCh38_latest_genomic.fna ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| Check the integrity of provided reference sequences ... ||
|| There were 103 notes for reference sequences. ||
|| The notes can be found in the log file, 'GRCh38.log'. ||
|| Scan uninformative subreads in reference sequences ... ||
|| 8%, 2 mins elapsed, rate=5964.7k bps/s, total=3251m ||
|| 16%, 2 mins elapsed, rate=5583.2k bps/s, total=3251m ||
|| 24%, 3 mins elapsed, rate=5272.4k bps/s, total=3251m ||
|| 33%, 4 mins elapsed, rate=4985.5k bps/s, total=3251m ||
|| 41%, 6 mins elapsed, rate=4729.3k bps/s, total=3251m ||
|| 49%, 7 mins elapsed, rate=4549.2k bps/s, total=3251m ||
|| 58%, 8 mins elapsed, rate=4339.3k bps/s, total=3251m ||
|| 66%, 9 mins elapsed, rate=4172.1k bps/s, total=3251m ||
|| 74%, 11 mins elapsed, rate=4028.8k bps/s, total=3251m ||
|| 83%, 13 mins elapsed, rate=3840.1k bps/s, total=3251m ||
|| 91%, 14 mins elapsed, rate=3663.4k bps/s, total=3251m ||
|| 99%, 17 mins elapsed, rate=3437.3k bps/s, total=3251m ||
|| 796781 uninformative subreads were found. ||
|| These subreads were excluded from index building. ||
|| Build the index... ||
|| 8%, 19 mins elapsed, rate=1794.6k bps/s, total=3251m ||
|| 16%, 22 mins elapsed, rate=1748.1k bps/s, total=3251m ||
|| 24%, 25 mins elapsed, rate=1721.8k bps/s, total=3251m ||
|| 33%, 27 mins elapsed, rate=1703.2k bps/s, total=3251m ||
Killed
But when I used your command buildindex(basename="GRCh38",reference="GRCh38_latest_genomic.fna",gappedIndex=TRUE,indexSplit=TRUE, memory=4000) it works. I have .reads, .log, .files, .tab, .array files.
Can you please explain me the difference in the two settings?
The use of
indexSplit=TRUE
andmemory=4000
significantly reduces the amount of memory used bybuildindex
for index building. With this setting, the index is split into multiple blocks (indexSplit=TRUE
) and each block has a size no greater than 4000MB (memory=4000
).Setting
gappedIndex=TRUE
allows a gapped index to be built. A gapped index is about 8GB in size for human/mouse genome, which is much smaller than the full index. If you setgappedIndex=FALSE
to generate a full index it should still work on your Mac. The important thing here is to split the index and make sure that the size of index blocks fits into your computer memory.How the index is built has an impact on your mapping speed. A one-block full index (generated by default setting) will enable maximum mapping speed to be achieved. Splitting index into blocks will slow down the mapping. The use of a gapped index will also slow down the mapping compared to using a full index. However all these depend on the computing resource you have. You might further fine-tune the parameters (eg. slightly increasing the value of
memory
parameter) to try to get the best setting for your computer.Hi Wei,
Thank you so much for your response.
I just have one more question: Does the one-block index or split index have impact on the number of assigned reads while using featureCounts?
Hi Roopali, the counts will be slightly different as the mapping results will be slightly different when using different types of indices. However this should have very little effect on the downstream analysis such as differential expression analysis.
Hi Wei,
Thank you for your help so far. I am using the gapped index to do alignment and counting. Here's the code:
fastq.files <- list.files(pattern = ".fastq.gz$", full.names = TRUE)
buildindex(basename="GRCh37",reference="GRCh37_latest_genomic.fna",
gappedIndex=TRUE,indexSplit=TRUE, memory=4000)
align(index="/storage/work/r/rus82/GRCh37",readfile1=fastq.files)
bam.files <- list.files(pattern = ".BAM$", full.names = TRUE)
fc <- featureCounts(bam.files, annot.inbuilt="hg19")
So after alignment 87127 reads are mapped but when I count, only around 400 reads are assigned out of 120000. Is this due to gapped index?
No worries Roopali. The
featureCounts
counting is not affected by the use of a full index or a gapped index. Could you please show the counting summary by running the following command? This will tell you the reasons why your reads were not counted.fc$stat
Hi Wei,
This is the output of fc$stat.
> fc$stat
Status
1 Assigned
2 Unassigned_Unmapped
3 Unassigned_MappingQuality
4 Unassigned_Chimera
5 Unassigned_FragmentLength
6 Unassigned_Duplicate
7 Unassigned_MultiMapping
8 Unassigned_Secondary
9 Unassigned_Nonjunction
10 Unassigned_NoFeatures
11 Unassigned_Overlapping_Length
12 Unassigned_Ambiguity
X.storage.work.r.rus82.sample1.fastq.gz.subread.BAM
1 431
2 32873
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 86696
11 0
12 0
X.storage.work.r.rus82.sample2.fastq.gz.subread.BAM
1 446
2 35721
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 83833
11 0
12 0
Also, is the variable 'nTrim3' and 'nTrim5' in the align function for removing the adapter sequences?
The reason why most of your reads were not assigned to genes is because they do not overlap any genes (see the 'Unassigned_NoFeatures' category). This suggests that a lot of off-target reads were generated in your samples.
Hi Dr. Shi,
I followed the code in this discussion. The result is very good with 80% mapped and counted using featureCounts. Since I want to use DESeq2 for the downstream analysis, would you please let me know how I can export the featureCounts data as matrix table for the DESeq2? Thank you,
If you read the help page for
featureCounts
function, you will find that it returns the matrix table in thecounts
component of its returned object.