Hi all,
I'm trying to create a dendrogram with qualitative data in a txt file. This is what I have so far:
library(tm)
library(wordcloud)
RQ1_cloud <- readLines("RQ1.txt")
RQ1_corpus <- Corpus(VectorSource(RQ1_cloud))
RQ1_clean <- tm_map(RQ1_corpus, tolower)
RQ1_clean <- tm_map(RQ1_clean, removeNumbers)
RQ1_clean <- tm_map(RQ1_clean, removePunctuation)
RQ1_clean <- tm_map(RQ1_clean, stripWhitespace)
RQ1_clean <- tm_map(RQ1_clean, removeWords, stopwords())
wordcloud(RQ1_clean, min.freq = 10, scale = c(2, 0.2), colors = brewer.pal(9, "RdPu"))
RQ1_tdm <- TermDocumentMatrix(RQ1_clean)
freq <- colSums(as.matrix(RQ1_tdm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(RQ1_tdm, 0.2)
RQ1_matrix <- as.matrix(RQ1_tdm)
RQ1_sorted <- sort(colSums(RQ1_matrix), decreasing = T)
head(RQ1_sorted)
1 2 5 3 4
1484 1430 104 0 0
RQ1_df <- data.frame(word = names(RQ1_sorted), freq = RQ1_sorted)
head(RQ1_df)
word freq
1 1 1484
2 2 1430
5 5 104
3 3 0
4 4 0
findFreqTerms(RQ1_tdm, lowfreq=50)
character(0)
findAssocs(RQ1_tdm, c("digital"), corlimit=0.85)
$digital
abstract abstractly” abstractness
1.00 1.00 1.00
acceptance access acclimate
1.00 1.00 1.00
etc.
dtmss <- removeSparseTerms(RQ1_tdm, 0.15)
library(cluster)
d <- dist(t(dtmss), method="euclidian")
fit <- hclust(d, method="complete", members= NULL)
However, that last call fails with:
Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)
I've seen online that I need to remove zero variance columns/rows but I'm unsure how to do that. Thank you all for your help!
-Abby
I ran: apply(dtmss, 1, var) == 0
It returned: logical(0)
Does that mean there are rows with zero variance? If so, is there a way to remove them from the data? I'm trying to make a dendrogram of word associations from an article.
Also this is the output
Let me know if you can help. Thank you!
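On the zero-variance question: for a plain numeric matrix, one way to drop zero-variance rows is to compute the variance of each row and keep only the rows where it is positive. This is a minimal sketch on a small made-up count matrix (the name m and its values are assumptions, not the RQ1 data); a TermDocumentMatrix would likely need as.matrix() first.

```r
# Hypothetical toy term-by-document count matrix (not the RQ1 data)
m <- matrix(c(3, 0, 1,
              2, 2, 2,
              0, 5, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("digital", "media", "brain"),
                            c("d1", "d2", "d3")))

# Variance of each row (term) across the documents
row_var <- apply(m, 1, var)

# Keep rows whose variance is defined and greater than zero
keep <- !is.na(row_var) & row_var > 0
m_clean <- m[keep, , drop = FALSE]

rownames(m_clean)   # "media" is constant across documents, so it is dropped
```

Using drop = FALSE keeps the result a matrix even if only one row survives, which matters if it is passed on to dist().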
Hey, so, dtmss is not numerical data. Are you following some tutorial to do this work?
No, it's not numerical data. I'm trying to make a dendrogram of word associations from the article "6 Ways Digital Media Impacts the Brain" by Saga Briggs. First I tried to follow this YouTube video, but I got:
Error in train_word2vec("C:/Users/Abby/Documents/R/RQ1", output_file = "C:/Users/Abby/Documents/R/RQ1", : could not find function "train_word2vec"
I think that's because train_word2vec is a function from the wordVectors package, which I believe is not available anymore. So then I used this video to make a word cloud, which was successful. Then I was playing around with this video https://www.youtube.com/watch?v=ys6y18Piqfc and this website https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html to make my dendrogram. With the video, I was able to get the word association:
1 2 5 3 4
1484 1430 104 0 0
word freq
1 1 1484
2 2 1430
5 5 104
3 3 0
4 4 0
abstract abstractly” abstractness
1.00 1.00 1.00
acceptance access acclimate
1.00 1.00 1.00
However, when I go on to the hclust() step, that's when I get the message: "Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)"
When I do apply(dtmss, 1, var) == 0, I get "logical(0)" which I've been told indicates that I have a vector that's supposed to contain boolean values, but the vector has zero length.
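One situation that produces both symptoms at once is a matrix with no rows at all. The sketch below uses a hypothetical empty matrix m0 (an assumption for illustration, not the real dtmss) to show apply() returning a zero-length result and hclust() then failing on the distances.

```r
# Hypothetical empty term-by-document matrix: 0 terms, 5 documents
m0 <- matrix(numeric(0), nrow = 0, ncol = 5,
             dimnames = list(NULL, as.character(1:5)))

# There are no rows to iterate over, so the result has length zero
zero_var <- apply(m0, 1, var) == 0
length(zero_var)

# After transposing there are 5 documents with 0 coordinates each,
# so every pairwise distance is NA
d0 <- dist(t(m0), method = "euclidean")
all(is.na(d0))

# hclust() rejects NA distances; capture the error instead of stopping
res <- tryCatch(hclust(d0, method = "complete"), error = function(e) e)
inherits(res, "error")
```

So logical(0) here says nothing about zero variance. It only says there was nothing to compute the variance of.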
Hi, I may not have time to look through all of those videos - sorry. The apply() function will not work if the data is non-numerical. Basically, dist(t(dtmss), method="euclidian") will only work correctly if dtmss is numerical and has no NA, NaN, NULL, or -Inf values. What is the output of str(dtmss)?
Unfortunately, this is also somewhat out of the scope of this forum, which is dedicated to Bioconductor packages. I can continue to respond, though, in order to help.
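The checks described above can be sketched as a small helper that is run before dist(). Note that check_dist_input is a hypothetical name made up for this example, not a function from tm or stats, and the three test matrices are invented.

```r
# Hypothetical helper (not part of tm or stats): inspects a matrix before dist()
check_dist_input <- function(m) {
  m <- as.matrix(m)
  c(is_numeric = is.numeric(m),
    has_rows   = nrow(m) > 0,
    has_cols   = ncol(m) > 0,
    all_finite = all(is.finite(m)))   # FALSE if any NA, NaN, Inf or -Inf
}

ok    <- matrix(c(1, 0, 2, 3), nrow = 2)
bad   <- matrix(c(1, NA, 2, Inf), nrow = 2)
empty <- matrix(numeric(0), nrow = 0, ncol = 5)

check_dist_input(ok)
check_dist_input(bad)
check_dist_input(empty)
```

The empty case passes the all_finite check because there are no values to test, which is why the row and column counts are checked separately.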
No problem! Thank you so much for all your help! The output of str(dtmss) is
List of 6
$ i : int(0)
$ j : int(0)
$ v : num(0)
$ nrow : int 0
$ ncol : int 5
$ dimnames:List of 2
..$ Terms: NULL
..$ Docs : chr [1:5] "1" "2" "3" "4" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
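The str() output shows nrow 0 and Terms: NULL, so dtmss has no terms left, which would explain the NA distances. As I understand the documented behaviour, removeSparseTerms(x, sparse) keeps only terms whose share of empty documents is below sparse, so with 5 documents and sparse = 0.15 a term has to occur in every document. The totals earlier in the thread are 0 for documents 3 and 4, so no term can pass. Those totals also look like per-document sums, since colSums() on a TermDocumentMatrix sums over documents and rowSums() gives the per-term frequency. Below is a rough sketch of one way around this, assuming the empty documents come from blank lines in the file. The sample text, the variable names, and the 0.5 threshold are all made up for illustration.

```r
library(tm)

# Hypothetical stand-in for readLines("RQ1.txt"); blank lines become empty documents
lines <- c("digital media changes attention and memory",
           "",
           "digital media changes reading habits and attention",
           "",
           "memory and attention adapt to digital media")
lines <- lines[nzchar(trimws(lines))]   # drop blank lines before building the corpus

corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)

# Rows of a TermDocumentMatrix are terms, so rowSums() gives term frequencies
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq)

# A looser threshold than 0.15; check that terms survive before clustering
tdm_small <- removeSparseTerms(tdm, 0.5)
stopifnot(nrow(tdm_small) >= 2)

# Cluster terms (rows), so no t() is needed here
d <- dist(as.matrix(tdm_small), method = "euclidean")
fit <- hclust(d, method = "complete")
plot(fit)
```

With only a handful of documents the term vectors are very short, so the resulting dendrogram is coarse. Splitting the article into more documents (for example one per sentence or paragraph) would give the clustering more to work with.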
install.packages("devtools")
devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(magrittr)
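# These two checks are skipped when RQ1.txt already exists in the working directory
if (!file.exists("RQ1.txt")) unzip("RQ1.txt", exdir = "RQ1")
if (!file.exists("RQ1.txt")) prep_word2vec(origin = "RQ1", destination = "RQ1.txt", lowercase = TRUE, bundle_ngrams = 2)
# Train the model once, then reuse the saved vectors on later runs
if (!file.exists("RQ1_vectors.bin")) {
  model <- train_word2vec("RQ1.txt", "RQ1_vectors.bin", vectors = 200, threads = 4, window = 12, iter = 5, negative_samples = 0)
} else {
  model <- read.vectors("RQ1_vectors.bin")
}
# Words closest to "digital"
model %>% closest_to("digital")
RQ1 <- c("digital")
# Collect the 20 nearest words for each seed term
term_set <- lapply(RQ1, function(rq1) {
  nearest_words <- model %>% closest_to(model[[rq1]], 20)
  nearest_words$word
}) %>% unlist
subset <- model[[term_set, average = FALSE]]
# Cluster the selected word vectors by cosine distance and plot the dendrogram
subset %>%
  cosineDist(subset) %>%
  as.dist %>%
  hclust %>%
  plot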
This is what I ended up going with, if anyone is interested! I suggest cleaning the text first and saving it as a .txt file.
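For anyone who wants to see what the last step does without installing wordVectors, here is a base R sketch of the same idea: turn word vectors into cosine distances and cluster them. The vectors below are invented for illustration and are not output from train_word2vec. I'm assuming cosine distance means 1 minus cosine similarity, which is my understanding of what cosineDist computes.

```r
# Hypothetical word vectors (rows are words); not output from train_word2vec
vecs <- rbind(digital = c(0.9, 0.1, 0.3),
              media   = c(0.8, 0.2, 0.4),
              brain   = c(0.1, 0.9, 0.2),
              memory  = c(0.2, 0.8, 0.1))

# Scale each row to unit length, so the dot product equals cosine similarity
unit <- vecs / sqrt(rowSums(vecs^2))

# Cosine distance taken as 1 - cosine similarity (assumption stated above)
cos_dist <- 1 - unit %*% t(unit)

# Convert to a dist object, cluster, and draw the dendrogram
d <- as.dist(cos_dist)
fit <- hclust(d)
plot(fit)

# Cut the tree into two groups to see which words were joined
groups <- cutree(fit, k = 2)
groups
```

Words whose vectors point in similar directions end up on the same branch, which is the "word association" reading of the dendrogram.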