Hi Ashutosh,
Please keep these exchanges on the BioC mailing list. I've cc'd the
list here.
> But when I exactly follow the script from edgeR manual on the data
and on making DGE list, I get library size as NA.
This depends on how the data is stored. readDGE() expects read counts
where absence of a tag id in a single sample means a 0 count for that
sample, while ArrayExpress appears to store 0s as NAs (with aligned
ids across the dataset). So, I believe this is why your library sizes
come out as NA (summing over columns with NAs present). As you see
below, that's why I added:
counts[is.na(counts)] <- 0
before I created the 'DGEList'.
> Also can you please explain or give reference to function "grp <-
sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the
moment, I just copied into the terminal but I want to know what it
signifies?
It's just a way to convert this vector:
> colnames(counts)
[1] "DLCK.TG_1" "DLCK.TG_2" "DLCK.TG_3" "DLCK.TG_4" "WT_1" "WT_3"
[7] "WT_4" "WT_6"
into this:
> grp
DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4
WT_6
"DLCK.TG" "DLCK.TG" "DLCK.TG" "DLCK.TG" "WT" "WT" "WT"
"WT"
There are many other ways to do this, including:
grp <- gsub("_[0-9]","",colnames(counts))
but, these are typically customized to the dataset (and filename
structure) on hand.
Best, Mark
On 26.12.2013, at 11:02, "Dhingra, Ashutosh" <a.dhingra at="" vumc.nl="">
wrote:
> Dear Mark,
>
> Thanks a lot for your kind help! It works with your script. I was
also able to find a error in the target list because of which I was
getting out of bound error.
>
>
> But when I exactly follow the script from edgeR manual on the data
and on making DGE list, I get library size as NA.
>
> Also can you please explain or give reference to function "grp <-
sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the
moment, I just copied into the terminal but I want to know what it
signifies?
>
>
>
> Best regards,
> Ashutosh
> ________________________________________
> From: Mark Robinson [mark.robinson at imls.uzh.ch]
> Sent: 23 December 2013 17:35
> To: Ashutosh [guest]
> Cc: bioconductor at r-project.org; Dhingra, Ashutosh
> Subject: Re: [BioC] differential expression- edgeR
>
> Dear Ashutosh,
>
> I wasn't able to (nicely) get exactly the same files from GEO, but
here is an alternative:
>
> See:
>
http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/
>
> f <- "
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/M
TAB/E-MTAB-71/E-MTAB-71.raw.1.zip"
> bf <- basename(f)
> download.file(f, bf) # download ZIP from ArrayExpress
>
> unzip(bf) # should put the 8 TXT files in current dir
>
> tg <- dir(".","^DLCK|WT")
>
> library("edgeR")
> counts <- readDGE(tg)$counts
> counts[is.na(counts)] <- 0
>
> grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])
> d <- DGEList(counts=counts, group=grp)
>
> This should return:
>
>> d
> An object of class "DGEList"
> $counts
> DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3
WT_4 WT_6
> AAAAAAAAAAAAAAAAA 22653 3059 1366 6574 7782 35096
6623 9633
> AAAAAAAAAAAAAAAAC 82 51 55 93 412 134
335 519
> AAAAAAAAAAAAAAAAG 2 3 7 9 59 5
45 84
> AAAAAAAAAAAAAAAAT 118 471 359 717 1842 94
2465 3311
> AAAAAAAAAAAAAAACA 67 4 4 12 17 108
12 21
> 844311 more rows ...
>
> $samples
> group lib.size norm.factors
> DLCK.TG_1 DLCK.TG 651172 1
> DLCK.TG_2 DLCK.TG 2685418 1
> DLCK.TG_3 DLCK.TG 3202246 1
> DLCK.TG_4 DLCK.TG 2460753 1
> WT_1 WT 3142262 1
> WT_3 WT 294909 1
> WT_4 WT 3517977 1
> WT_6 WT 3558260 1
>
> [? proceed from here ?]
>
> Alternatively, from your original code, I think you want 'skip=1'
and is it possible that you have an extra unwanted file in your
'targets' list ?
>
> Anyways, hope that helps.
>
> Best, Mark
>
>
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
>
http://ow.ly/riRea
>
>
>
>
>
>
> On 23.12.2013, at 16:50, Ashutosh [guest] <guest at="" bioconductor.org=""> wrote:
>
>>
>> I am new to R and edgeR. I am trying to follow to example
>> 9. Case Study: deep-sequenced short tags from the edgeR manual.
>>
>> I download the data from
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105
>> by clicking the full table and then saving webpage as .txt file
containing
>>
>> First 4 rows as below folled by fifth which contains the tag seq
and count.
>>
>> #SEQUENCE =
>> #COUNT =
>> #TPM = tags per million
>> SEQUENCE COUNT TPM
>> CATCGCCAGCGGGCACC 1 0.37
>>
>> Now we I follow the steps of 9.3 reading data and creating the
DGElist: After running
>> < d<- readDGE(targets, skip = 5, comment.char = "#"), I get
>> Error in taglist[[i]] : subscript out of bounds
>>
>> Can anyone please help, how I can solve this issue.
>>
>> Best regards,
>> Ashutosh
>>
>>
>> -- output of sessionInfo():
>>
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
base
>>
>> other attached packages:
>> [1] edgeR_3.4.2 limma_3.18.7
>>
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>