Hello,
I have run DESeq2 for genome wide ATAC-seq data. At the moment genomic locations are reported in the DESeq results table in this style: "Chr112421625" where 1245 and 1625 are the start and end positions along the chromosome. I would like to obtain the peaks for just one chromosome for downstream analysis. Is there any way of subsetting the results of DESeq2 by providing it with a chromosome number?
I am reluctant to subset the dataset and just run the analysis on the single chromosome as I assume that would reduce the power of the analysis,
I hope this makes sense,
Thank you!
By the way, if you just want to subset the
dds
by gene name, you can do:Where
genes
is a character vector of the names of genes on a particular chromosome.Thank you. I don't think I can use tximeta as I work on a non-model organism. How do you manually add the gene ranges into the dds?
Ok, the part where you annotate the gene ranges is up to you.
You may want to ask a bioinformatics collaborator about how to annotate your data, as this is out of scope for the DESeq2 software. There are various annotation packages on Bioconductor that you can use to get the information you need.
Another option would be to post to Biostars (if you do, link from here to there so others can see your post).
I tried to use dds[dds %over% x,] where x is a vector containing the list of coordinates of the chromosome of interest. However, this returned the error:
Instead I used this:
ddsObjchr1 <- ddsObj[subrows, ] where sub_rows contains the row names corresponding to Chr1
Is this a valid way? I am not sure as I noticed when I check the table of padj < 0.05 for chr1 for the full dataset I retrieve less hits than when I check the table of padj < 0.05 for chr1 for the subsetted results (ddsObj_chr1).
The
%over%
trick would only worked if you had a RangedSummarizedExperiment.The second way is how to go if you don't have ranges, it is perfectly valid.
"when I check the table of padj < 0.05 for chr1 for the full dataset I retrieve less hits than when I check the table of padj < 0.05 for chr1"
You will get different p-values depending on if you give DESeq2 different rows of data. It is a hierarchical Bayesian model, so it depends on information shared across genes. I can't even say if you will get more or less with subsetting, there are many complex interdependencies in the data and the model.
Ok thank you, I thought it would be due to that. I just thought it was worth double checking the validity as the number of hits for the chr rose from 4 to 13 which was surprising - I didn't want to proceed with the method simply because it gave more more hits without it being statistically sound