Question

Import bamfiles for RNAsequencing analysis

0

Entering edit mode

Merlin ▴ 10

@merlin-15723

Last seen 5.3 years ago

Vancouver

Hello Everyone,
I'm new to RNAseq world and Rstudio in general and I want to analyze a couple of bamfiles from illumina.
So far, I successfully followed tutorials for the analysis of RNAseq like this:

https://master.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html

Unfortunately, all the tutorials available start by giving packages and descriptions about how to import into R but none of them shows how to import bamfiles not packaged.

I learned about how to look into bamfile for example by running this commands:

library(Rsamtools)

bam1 <-"test.bam"
bam1 <-scanBam("test.bam")

bam1

so that I can look what's inside but after that I'm stopped because I don't know how to proceed.
I would be super happy to receive some help, just for one step forward.

Thank you very much

P

R rnaseq bam file • 29k views

ADD COMMENT • link updated 7.0 years ago by Gordon Smyth 52k • written 7.0 years ago by Merlin ▴ 10

0

Entering edit mode

Can you say a bit more about what you are trying to do? What exactly do you mean by 'analyze a couple of bamfiles' and also what do you mean by 'import bamfiles not packaged'?

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,
I want to do differential gene expression analysis, same as described in the link but using my own bamfiles.
The link provides a package named airway which is easy to import into R following the description but I can not do the same thing with my files.

Thanks

ADD REPLY • link 7.0 years ago Merlin ▴ 10

0

Entering edit mode

Don't answer your own question unless you are answering your own question. Instead use the ADD REPLY button and type in the box that pops up.

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

ok sorry about that

ADD REPLY • link 7.0 years ago Merlin ▴ 10

score 0 · Answer 1 · 2018-05-02

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 19 hours ago

United States

To import data from bam files, you usually start R in the directory that contains your bam files and then do something like

library(Rsamtools)
bfl <- BamFileList(dir(".", "bam$"), yieldsize = 2e6)

Which is a shorter way of doing something like this

## look for all files in the current directory that end with 'bam'
bamfiles <- dir(".", "bam$")

## load all those files into a BamFileList
bfl <- BamFileList(bamfiles, yieldsize = 2e6)

And then you would proceed with the rest of the RNA-Seq workflow.

ADD COMMENT • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

After setting the directory where my two bam files are located, I run the lines you gave me and this is the output:

> library(Rsamtools)

> fls <- dir(".", "bam$")
> bfl <- BamFileList(bamfiles, yieldsize = 2e6)
Error in validObject(.Object) :
invalid class “BamFileList” object: the 'listData' slot must be a list containing BamFile objects

There is something to adjust in the third line, probably I have to make a list first?

thanks

ADD REPLY • link 7.0 years ago Merlin ▴ 10

0

Entering edit mode

You are creating a character vector of your bam files (called fls) and then passing something else (called bamfiles) to your call to BamFileList. I don't know what bamfiles are in this context, so I can't really tell you anything.

Also note that I misspelled the yieldSize argument. R is case-sensitive, so using yieldsize won't work.

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

I adjusted the misspelled of yealdsize to yealdSize and seems that it's ok now:

> fls
[1] "sample1.bam" "sample2.bam"

Can you confirm me that now I have imported the bam files and I can go for the next step?

It should be an excel table with detailed informations for each of my samples named comma-separated value (CSV)

Thank you

ADD REPLY • link 7.0 years ago Merlin ▴ 10

0

Entering edit mode

You haven't imported the bam files. All you have done is created a BamFileList, and now if you use summarizeOverlaps it will automatically import the data from the bam files and count up the number of reads that overlap the exons for each of your genes.

Do note that any workflow that is intended for people to run will necessarily use convenient, artificially small data sets. This is true of the workflow you are using as a template. So as a high level overview, you now want to read in the data from your bam files and then count up all the reads that overlap any gene for your organism. So you need to have something that says where all the genes are.

The workflow uses an artificial example that you won't use. Instead you would likely use something like the TxDb.Hsapiens.UCSC.hg38.knownGene package, which has information about where all the genes in the hg38 genome are (assuming you are using human data). You would install and load that package, and then call exonsBy, just like in the workflow, but using the TxDb instead. You would then call summarizeOverlaps to get counts of all the reads in your two bam files that overlap any of the genes in the TxDb.

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,

As you told me to do, I downloaded the package TxDb.Hsapiens.UCSC.hg38.knownGene

but when I I go to create the txt database with this line

txdb <- makeTranscriptDbFromUCSC( genome='hg38', tablename='refGene' )
tx_by_exon <- exonsBy(txdb, 'tx')

I have this error:

Error: could not find function "makeTranscriptDbFromUCSC"

I noticed that many people had this issue with this line command but I still can't find the solution to go ahead.
Do you have any suggestion?

Thank you

-P

ADD REPLY • link 7.0 years ago Merlin ▴ 10

0

Entering edit mode

You don't need to make a transcript database, because you just downloaded and installed one! The next step is

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

Ryan C. Thompson · Answer 2 · 2018-05-09

0

Entering edit mode

Merlin ▴ 10

@merlin-15723

Last seen 5.3 years ago

Vancouver

Hi,
It has been days that I'm struggling to make this line working .. but I can't do by myself.

this is what I get

> library( "GenomicFeatures" )
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)
Error in exonsBy(TxDb.Hsapiens, UCSC.hg38.knownGene) : 
  object 'TxDb.Hsapiens' not found

exonBy is under GenomicFeature but it gives me errors, can you please help me with that?

Or maybe I can use a different way for counting reads in genes and make the matrice

I will appreciate a lot

thank you

ADD COMMENT • link updated 7.0 years ago by Ryan C. Thompson ★ 7.9k • written 7.0 years ago by Merlin ▴ 10

0

Entering edit mode

There is a typo in the code. There is a comma between Hsapiens and UCSC, when there should be a period. Hence, the line of code should be:

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

ADD REPLY • link 7.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thank you for the answer,it's working now.

In order to use summarizeOverlap I'm using this script :

> tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
> library( "GenomicAlignments" )
> se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",
+ singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )
Error in summarizeOverlaps(exonsByGene, BamFileList(bamFiles), mode = "Union", :
object 'exonsByGene' not found

I m not sure why do I get those error, is it because I'm missing packages?

Thank you

P

ADD REPLY • link 7.0 years ago Merlin ▴ 10

0

Entering edit mode

I'm afraid I can't debug your code line by line. It looks to me like you are trying to cut and paste together pre-existing code from different sources, and that is not going to work. In this case, the summarizeOverlaps line refers to a variable named exonsByGene, but in your previous line of code, you called the variable tx_by_exon. If you are inexperienced in programming, I recommend you find some introductory tutorials and go through them to learn the basics of R programming before attempting more complex analyses.

ADD REPLY • link 7.0 years ago Ryan C. Thompson ★ 7.9k

score 0 · Answer 3 · 2018-05-09

I messed up the topic, I changed something from the first posts.
I will summarize here all my codes:

I created fist the bam files list:

list_bam <- lapply ( c("SRR1039508_subset.bam", "SRR1039512_subset.bam",

"SRR1039516_subset.bam","SRR1039520_subset.bam","SRR1039509_subset.bam",

"SRR1039513_subset.bam" , "SRR1039517_subset.bam","SRR1039521_subset.bam" ) , scanBam )

I generated the full paths to the files I want to perform the counting operation on:

bamFiles <- file.path("/Users/patriziopanelli/Documents/BamFilesTest",sampleTable$fileName )

bamFiles

Launch this code by protocol

seqinfo( BamFile( bamFiles[1] ) )

I created a GRange list

txdb <-TxDb.Hsapiens.UCSC.hg38.knownGene

grl <- exonsBy(txdb, by="gene")

grl

And now I'm stuck here...

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

library( "GenomicAlignments" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",

singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )

I can assure that it ain't been easy to reach this point, yes I m new to Rstudio and RNAseq analysis but to me every time I go further each step it's a great satisfaction and I don't want to get lost in tons or codes of R ..I'm trying to learn it by using it.

I hope this post can clarify the work that I m doing ..or if there is something wrong, please tell me. My goal for now is to create the count matrice

Thank you

score 0 · Answer 4 · 2018-05-14

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

The simplest and fastest Bioconductor way to get gene counts from aligned BAM files is to use featureCounts(). For example

files <- c("SRR1039508_subset.bam", "SRR1039512_subset.bam",
           "SRR1039516_subset.bam", "SRR1039520_subset.bam",
           "SRR1039509_subset.bam", "SRR1039513_subset.bam",
           "SRR1039517_subset.bam", "SRR1039521_subset.bam")
fc <- featureCounts(files, annot.inbuilt="hg38")

Then the matrix of counts is found in fc$counts.

ADD COMMENT • link 7.0 years ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,
I used a different code but at the end the result is the same.
Please find the attachment.
I also normalized in Rlog the reads by using this codes:

#Save the counts and statistics
write.table(x=data.frame(counts.75$annotation[,c("GeneID","Length")],counts.75$counts,stringsAsFactors=FALSE),file="counts.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)
write.table(x=data.frame(counts.75$stat,stringsAsFactors=FALSE),file="stat.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)

counts_df.75 <- as.data.frame(counts.75$counts)
colnames(counts_df.75) <- rownames(sampleTable1)

#Normalization with rlog:
library("DESeq2")
dds <- DESeqDataSetFromMatrix(counts_df.75, colData=sampleTable1, design = ~1 )

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

rld <- rlog(dds, blind = TRUE)
colnames(rld) <- sampleTable1$EXTERNAL_ID

I'm looking forward for the next step..
Can you tell me what should I do to get the count matrice?

Thank you

ADD REPLY • link 6.9 years ago Merlin ▴ 10