Entering edit mode
saddamhusain77
•
0
@saddamhusain77-19420
Last seen 5.8 years ago
HI, I am analysing differential transcript usage on rna seq data and I am stuck at this point- It seems to me that issue is in the ctsgenco object where a single row contains multiple transcript ids but dont know how to correct it.
> filesgenco <- list.files( pattern = ".txt",full.names = TRUE)
> filesgenco
[1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt"
> names(filesgenco)
NULL
> mycols
condition sample_id
HECON1 control HECON1
HECON2 control HECON2
HETRE1 treated HETRE1
HETRE2 treated HETRE2
> names(filesgenco) <- mycols$sampleID
> filesgenco
[1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt"
> txigenco <- tximport(filesgenco, type="salmon", txOut=TRUE,
+ countsFromAbundance="scaledTPM")
reading in files with read_tsv
1 2 3 4
>
> ctsgenco <- txigenco$counts
> ctsgenco <- ctsgenco[rowSums(ctsgenco) > 0,]
> gtf <- "gencode.v29.annotation.gtf"
> txdfgenco <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
'select()' returned 1:many mapping between keys and columns
> tabgenco <- table(txdfgenco$GENEID)
> txdfgenco$ntx <- tabgenco[match(txdfgenco$GENEID, names(tabgenco))]
> head(txdfgenco)
GENEID TXNAME ntx
1 ENSG00000000003.14 ENST00000612152.4 5
2 ENSG00000000003.14 ENST00000373020.8 5
3 ENSG00000000003.14 ENST00000614008.4 5
4 ENSG00000000003.14 ENST00000496771.5 5
5 ENSG00000000003.14 ENST00000494424.1 5
6 ENSG00000000005.5 ENST00000373031.4 2
> ctsgenco[1:3,1:3]
[,1]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.6308597
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.0000000
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 28.3130559
[,2]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.6636265
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.0825407
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 18.5500264
[,3]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.00000
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.00000
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 54.07508
> range(colSums(cts)/1e6)
[1] 16.26968 26.21900
> range(colSums(ctsgenco)/1e6)
[1] 12.64907 17.67470
> all(rownames(ctsgenco) %in% txdfgenco$TXNAME)
[1] FALSE
Any help/suggestion would be appreciated.
Thank you Michael. I have just corrected my post and have added entire code I am using.
I can see by eye why the rownames in
cts
are not contained in theTXNAME
column oftxdf
. Can you see it too?You should chop off the extra characters.
Try using this function:
The gene symbol and description. Please tell me if I have used the function properly as I am getting following output.
That is not correct. You are applying it to the counts. I suggested to remove the extra characters from the rownames of the counts.
Do you see the problem I am pointing out? You have extra characters on the rownames of the counts, and this doesn't match what's in
TXNAME
. You may want to collaborate with a bioinformatician to help you with this step, if the manipulation of objects in R is very difficult.Yeah sure. I will consider your advice. Thank you for your time.
I am sorry to bother you again but the dmDSdata output does'nt look like what is there in the vignettes. It shows just 1 genes.
I'm not sure.
But I also can't tell from what you have posted above, what you are providing as
countsgenco
. You could show for example thedim
of this. Are you providing it with the counts for all the genes?The row names of countgenco is feature_id as you could notice above. Countsgenco is nothing but the counts and ctsgenco is cts; I have just added "genco" acronym.
It might be relevant to ask, since I have run the code beyond this point just to see how things work and got 900 genes after stageR testing. Do i need to run the DEXseq codes as the stageR output has already given me the differential used transcript along with the p-adj values and it appears to work similar to DRIMseq?
You don't need to run both methods. You can just run one or the other.
If you are having continued difficulty with the DRIMSeq set up above, I'd make a new post with all your code and tag the DRIMSeq package so those authors are notified. (I think if you add a tag to an existing post it doesn't email the package maintainers).
Thank you for helping me on this issue.