Question

Error with hclust

Abby ▴ 10

Hi all,

I'm trying to create a dendrogram with qualitative data in a txt file. This is what I have so far:

RQ1_cloud <- readLines("RQ1.txt")

RQ1_corpus <- Corpus(VectorSource(RQ1_cloud))

RQ1_clean <- tm_map(RQ1_corpus, tolower)

RQ1_clean <- tm_map(RQ1_clean, removeNumbers)

RQ1_clean <- tm_map(RQ1_clean, removePunctuation)

RQ1_clean <- tm_map(RQ1_clean, stripWhitespace)

RQ1_clean <- tm_map(RQ1_clean, removeWords, stopwords())

wordcloud(RQ1_clean, min.freq = 10, scale = c(2, 0.2), colors = brewer.pal(9, "RdPu"))

RQ1_tdm <- TermDocumentMatrix(RQ1_clean)

freq <- colSums(as.matrix(RQ1_tdm))

length(freq)

ord <- order(freq)

dtms <- removeSparseTerms(RQ1_tdm, 0.2)

RQ1_matrix <- as.matrix(RQ1_tdm)

RQ1_sorted <- sort(colSums(RQ1_matrix), decreasing = T)

head(RQ1_sorted)

1 2 5 3 4

1484 1430 104 0 0

RQ1_df <- data.frame(word = names(RQ1_sorted), freq = RQ1_sorted)

head(RQ1_df)

word freq

1 1 1484

2 2 1430

5 5 104

3 3 0

4 4 0

findFreqTerms(RQ1_tdm, lowfreq=50)

character(0)

findAssocs(RQ1_tdm, c("digital"), corlimit=0.85)

$digital

           abstract             abstractly”            abstractness 

               1.00                    1.00                    1.00 

         acceptance                  access               acclimate 

               1.00                    1.00                    1.00

etc.

dtmss <- removeSparseTerms(RQ1_tdm, 0.15)

library(cluster)

d <- dist(t(dtmss), method="euclidian")

fit <- hclust(d, method="complete", members= NULL)

However, the last line fails with this error:

Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)

I've seen online that I need to remove zero variance columns/rows but I'm unsure how to do that. Thank you all for your help!

-Abby

dendrogram hclust textanalysis R RStudio • 3.6k views

ADD COMMENT • link 3.7 years ago • updated 3.6 years ago Abby ▴ 10

Kevin Blighe · Answer 1 · 2021-04-05

Hi,

It can also mean that there are NA, -Inf, NULL, or NaN values in your data, dtmss.

To search for rows of 0 variance, it would be:

apply(dtmss, 1, var) == 0

..or,

matrixStats::rowVars(dtmss) == 0

These will return boolean vectors of TRUE | FALSE, with TRUE representing any row having zero variance. Note that these functions will also return NA if there is even 1 NA value in the row; thus, you can use this information to perform the additional filtering for NA values.
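
To actually drop such rows, the logical vector can be used as a row index. The following is a minimal sketch, assuming the data has already been converted to an ordinary numeric matrix. The matrix `m` and its row names are made up for illustration; for a tm object, `as.matrix(dtmss)` would be the analogous starting point.

```r
# Hypothetical example matrix: rows are terms, columns are documents.
m <- rbind(
  alpha = c(1, 2, 3, 4),   # varies across documents: keep
  beta  = c(5, 5, 5, 5),   # zero variance: drop
  gamma = c(1, NA, 2, 3),  # contains NA, so var() returns NA: drop
  delta = c(0, 1, 0, 2)    # varies across documents: keep
)

# Per-row variance; NA wherever a row contains a missing value.
row_var <- apply(m, 1, var)

# Keep only rows whose variance is defined and strictly positive.
keep <- !is.na(row_var) & row_var > 0
m_filtered <- m[keep, , drop = FALSE]

rownames(m_filtered)
```

Using `drop = FALSE` keeps the result a matrix even if only one row survives the filter.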

Another option to consider would be to impute the missing values and then filter out zero-variance rows prior to running dist() / hclust().

For example, impute missing values as 0:

dtmss[is.na(dtmss)] <- 0

Impute with half the lowest non-zero value:

dtmss[is.na(dtmss)] <- (min(dtmss[dtmss > 0], na.rm = TRUE) / 2)

The best strategy will depend on the distribution of the input data and how it was processed.

Kevin

ADD COMMENT • link 3.7 years ago Kevin Blighe ★ 4.0k

I ran the function: apply(dtmss, 1, var) == 0

It gave me: logical(0)

Does that mean there are rows with zero variance? And if there are, is there a way I can take them out of the data? I'm trying to make a dendrogram of word associations from an article.

Also, this is the output of dput():

dput(dtmss)
structure(list(i = integer(0), j = integer(0), v = numeric(0), 
    nrow = 0L, ncol = 5L, dimnames = list(Terms = NULL, Docs = c("1", 
    "2", "3", "4", "5"))), class = c("TermDocumentMatrix", "simple_triplet_matrix"
), weighting = c("term frequency", "tf"))

Let me know if you can help. Thank you!

ADD REPLY • link updated 3.6 years ago by Kevin Blighe ★ 4.0k • written 3.7 years ago by Abby ▴ 10

Hey, so, dtmss is not numerical data. Are you following some tutorial to do this work?

ADD REPLY • link 3.6 years ago Kevin Blighe ★ 4.0k

No, it's not numerical data. I'm trying to make a dendrogram of word associations from the article "6 Ways Digital Media Impacts the Brain" by Saga Briggs.

First I tried to follow this YouTube video, but I got: Error in train_word2vec("C:/Users/Abby/Documents/R/RQ1", output_file = "C:/Users/Abby/Documents/R/RQ1", : could not find function "train_word2vec". I think that's because train_word2vec was a function from the package wordVectors, which I believe is not available anymore.

So then I used this video to make a word cloud, which was successful. Then I was playing around with this video https://www.youtube.com/watch?v=ys6y18Piqfc and this website https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html to make my dendrogram. With the video, I was able to get the word association:

RQ1_tdm <- TermDocumentMatrix(RQ1_clean)

freq <- colSums(as.matrix(RQ1_tdm))

length(freq)

ord <- order(freq)

dtms <- removeSparseTerms(RQ1_tdm, 0.2)

RQ1_matrix <- as.matrix(RQ1_tdm)

RQ1_sorted <- sort(colSums(RQ1_matrix), decreasing = T)

head(RQ1_sorted)

1 2 5 3 4

1484 1430 104 0 0

RQ1_df <- data.frame(word = names(RQ1_sorted), freq = RQ1_sorted)

head(RQ1_df)

word freq

1 1 1484

2 2 1430

5 5 104

3 3 0

4 4 0

findFreqTerms(RQ1_tdm, lowfreq=50)

character(0)

findAssocs(RQ1_tdm, c("digital"), corlimit=0.85)

$digital

abstract abstractly” abstractness

1.00 1.00 1.00

acceptance access acclimate

1.00 1.00 1.00

However, when I go on to:

dtmss <- removeSparseTerms(RQ1_tdm, 0.15)

library(cluster)

d <- dist(t(dtmss), method="euclidian")

fit <- hclust(d, method="complete", members= NULL)

plot(fit, hang=-1)

That's when I'm getting the message: "Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)"

When I do apply(dtmss, 1, var) == 0, I get "logical(0)" which I've been told indicates that I have a vector that's supposed to contain boolean values, but the vector has zero length.

ADD REPLY • link 3.6 years ago Abby ▴ 10

Hi, I may not have time to look through all of those videos, sorry. The apply() function will not work if the data is non-numerical. Basically, dist(t(dtmss), method="euclidian") will only work correctly if dtmss is numerical and has no NA, NaN, NULL, or -Inf values. What is the output of str(dtmss)?

Unfortunately, this is also somewhat out of the scope of this forum, which is dedicated to Bioconductor packages. I can continue to respond, though, in order to help.
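
The precondition described above (numeric input with no NA, NaN, or Inf values) can be checked before calling dist() and hclust(). This is only a sketch: `is_clusterable` is a hypothetical helper, not part of any package, and it is deliberately stricter than dist() itself, which can tolerate some missing values. The matrices `good` and `bad` are made-up examples.

```r
# Hypothetical helper: TRUE only if x is a numeric matrix with at least two
# rows, at least one column, and no NA/NaN/Inf values.
is_clusterable <- function(x) {
  is.matrix(x) && is.numeric(x) &&
    nrow(x) >= 2 && ncol(x) >= 1 &&
    all(is.finite(x))
}

# Made-up example data: one clean matrix and one with a missing value.
good <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
bad <- good
bad[2, 1] <- NA

is_clusterable(good)
is_clusterable(bad)

# Only cluster when the check passes.
if (is_clusterable(good)) {
  fit <- hclust(dist(good, method = "euclidean"), method = "complete")
}
```

The row and column checks matter as much as the NA check: a matrix with no columns has nothing to compute distances from, even though it contains no missing values.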

ADD REPLY • link 3.6 years ago Kevin Blighe ★ 4.0k

No problem! Thank you so much for all your help! The output of str(dtmss) is

List of 6

$ i : int(0)

$ j : int(0)

$ v : num(0)

$ nrow : int 0

$ ncol : int 5

$ dimnames:List of 2

..$ Terms: NULL

..$ Docs : chr [1:5] "1" "2" "3" "4" ...

attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
attr(*, "weighting")= chr [1:2] "term frequency" "tf"
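
The `nrow : int 0` entry suggests that no terms survived removeSparseTerms(), so t(dtmss) would have five rows and no columns. The base-R sketch below illustrates what appears to happen in that situation. It uses an empty numeric matrix as a stand-in for as.matrix(dtmss), which is an assumption; the tm object itself is not used, and `empty_tdm` is a made-up name.

```r
# Stand-in for as.matrix(dtmss): 0 terms (rows) by 5 documents (columns).
empty_tdm <- matrix(numeric(0), nrow = 0, ncol = 5,
                    dimnames = list(Terms = NULL, Docs = as.character(1:5)))

dim(empty_tdm)

# With no rows to iterate over, the zero-variance check returns an empty vector.
zero_var <- apply(empty_tdm, 1, var) == 0
length(zero_var)

# Transposing gives 5 documents with no columns to compare, so the pairwise
# distances are undefined.
d <- dist(t(empty_tdm), method = "euclidean")
all(is.na(d))

# hclust() rejects undefined distances; capture the error instead of stopping.
result <- tryCatch(hclust(d, method = "complete"), error = function(e) e)
inherits(result, "error")
```

If this reading is right, the thing to check is the sparsity threshold rather than zero-variance rows. In the counts shown earlier, documents 3 and 4 contain no terms at all, so every term is missing from at least two of the five documents, and a cut-off of 0.15 would be expected to remove all of them. Running dim(dtmss) after removeSparseTerms() is a quick way to confirm whether any terms remain.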

ADD REPLY • link 3.6 years ago Abby ▴ 10

# Install wordVectors from GitHub
install.packages("devtools")
devtools::install_github("bmschmidt/wordVectors")

library(wordVectors)
library(magrittr)

if (!file.exists("RQ1.txt")) unzip("RQ1.txt", exdir = "RQ1")

if (!file.exists("RQ1.txt")) prep_word2vec(origin = "RQ1", destination = "RQ1.txt", lowercase = T, bundle_ngrams = 2)

# Train the model once; on later runs, reload the saved vectors instead
if (!file.exists("RQ1_vectors.bin")) {
  model = train_word2vec("RQ1.txt", "RQ1_vectors.bin", vectors = 200, threads = 4, window = 12, iter = 5, negative_samples = 0)
} else {
  model = read.vectors("RQ1_vectors.bin")
}

model %>% closest_to("digital")

# Collect the 20 words closest to each seed term
RQ1 = c("digital")

term_set = lapply(RQ1, function(rq1) {
  nearest_words = model %>% closest_to(model[[rq1]], 20)
  nearest_words$word
}) %>% unlist

subset = model[[term_set, average = F]]

# Cluster the selected words by cosine distance and plot the dendrogram
subset %>%
  cosineDist(subset) %>%
  as.dist %>%
  hclust %>%
  plot

This is what I ended up going with, if anyone is interested! I suggest cleaning the text first and saving it as a txt file.

ADD REPLY • link 3.6 years ago Abby ▴ 10