I have 10 sample with two condition (case/controll). I aligned using STAR 2.25a . On the left image after aligned I count using htseq-count and then I select only counts are from protein-coding and I use for generate pca plot. In the second case (right image) I use rsem with STAR and I use to import the counts using tximport. In the first case I use ensemble 74 in the second case hg38/84. Both pca are different What could be wrong? In the first case seem the date of the run do not affect pca (Using htseqcount) in the second case PC1 are able to divide one batch .
Here the script for prepare the counts that I use (only protein coding)
ret<-data.frame(rownames(dt)) ret$ensembl <- sapply(strsplit(rownames(dt),split="nn+"),"[",1) #use biomart #ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" ) ensembl = useMart( host="dec2013.archive.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl" ) genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol","family","gene_biotype","family_description"), filters = "ensembl_gene_id", values = list(ret$ensembl,"protein_coding"), mart = ensembl ) idx <- match( ret$ensembl, genemap$ensembl_gene_id ) ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ] ret$entrez <- genemap$entrezgene[ idx ] ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ] ret$family <- genemap$family[ idx ] ret$biotype <- genemap$gene_biotype[ idx ] ret$family_description <- genemap$family_description[ idx ] #prepare coding matrice1<-ret[ret$biotype=="protein_coding",] head(matrice1$ensembl) countRaw<-as.data.frame(counts(dds,normalized=FALSE)) countRawCoding <- countRaw[matrice1$ensembl,]
and for the plot pca
Thanks so much:
Well I don't see anything obvious. Those look correct.
I don't know then why gene level counts would be very different across htseq and RSEM. It's expected to have some differences, because RSEM is also probabilistically assigning multimapping reads, but the correlation across methods is usually very high.
Try some basic EDA. Compare column sums across software, plot htseq vs RSEM for the same sample, etc.
It could also be the Ensembl version that makes a difference.
All the little differences in a pipeline can add up to different quantifications...
I try to do some deep analysis . I found something strange. If I use tximport and rsem I found 2 sample are really different . Please see the Image
http://imgur.com/AntGAu9
In the first row you can see data obtained from htseq-count and the rlog trasfromation seem work. (From Left to right comparison 1vs2,3vs4,4vs5,boxplot). In the second row the same but using data obtained from tximport and rsem. Sample 3 and Sample 4 seem are not normalize.
If you want to make a clean comparison you will need to eliminate any unnecessary differences. For example, you are using different genomes: GRCh38.84 vs GRCh37.74
But in the end, you can feel free to use whichever pipeline you prefer or think makes the most sense for your data.
I have try a new comparison with both methods. I found some really relevant difference from both experiment. Here I report pca from both methods. In this case I see the same distortion but with TPM I see more separation. Any suggestion?? http://imgur.com/a/ojx7b
I don't have any specific recommendation here.
My preferred pipeline these days is to use transcript quantifiers and tximport.