I'm not sure biotype is quite the right field to query. A particular GO annotation may be more appropriate: GO:0003700 is 'DNA-binding transcription factor activity'. Carrying that annotation doesn't necessarily make a gene a transcription factor, but I'd guess it is hard for a transcription factor to fall outside that classification.
You can query for genes annotated directly with that term in biomaRt with something like:
library(biomaRt)
ensembl_mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("ensembl_gene_id",
                                "external_gene_name",
                                "chromosome_name",
                                "start_position",
                                "end_position"),
                 filters = "go",
                 values = "GO:0003700",
                 mart = ensembl_mart)
If you want genes annotated with that term or with anything below it in the ontology, the query is slightly different:
results <- getBM(attributes = c("ensembl_gene_id",
                                "external_gene_name",
                                "chromosome_name",
                                "start_position",
                                "end_position"),
                 filters = "go_parent_term",
                 values = "GO:0003700",
                 mart = ensembl_mart)
Very helpful, thank you!
My aim here is to filter a full HT-seq list of counts for factors with DNA-binding activity, keeping only those above some cutoff, let's say DNA-binding factors with absolute counts above 100. Do you think importing that list into R as a data.frame and doing a simple intersect/matching operation would do it? Thanks again.
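For what it's worth, a minimal sketch of that matching step might look like the following. It assumes the counts sit in a data.frame with one row per gene, and the column names (ensembl_gene_id, count), the toy values, and the cutoff variable are all made up for illustration. The results data.frame here is a small stand-in for the getBM() output above, so the snippet runs without a network connection.

```r
# Toy stand-in for a count table: one row per gene (hypothetical IDs and values).
counts <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002",
                      "ENSG00000000003", "ENSG00000000004"),
  count = c(250, 40, 1200, 500),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result; only the ID column is needed for matching.
results <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold on the raw counts.
cutoff <- 100

# TRUE for rows whose gene ID appears among the GO-annotated genes.
is_annotated <- counts$ensembl_gene_id %in% results$ensembl_gene_id

# Keep annotated genes whose count is strictly above the cutoff.
go_counts <- counts[is_annotated & counts$count > cutoff, ]
print(go_counts)
```

If the IDs in the count table carry a version suffix (for example from a GENCODE annotation), they would need trimming before matching, e.g. with sub("\\..*$", "", counts$ensembl_gene_id).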
I guess filtering one list vs the other for