TransWikia.com

Detect pairs after simple frequency

Stack Overflow Asked on November 27, 2021

After these steps:

library(quanteda)

df <- data.frame(
  text = c(rep("only a small text", 6),
           "remove this word lower frequency")
)
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])

How is it possible to find pairs or triples of words (n = 2:3) that occur in more than 5 documents?

2 Answers

ngrams need to be constructed before converting to a dfm, because the order of words is lost once a text is turned into a document-feature matrix.
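The point about word order can be checked directly: a dfm only records counts per feature, so two sentences made of the same words in different orders produce identical rows, and the original bigrams can no longer be recovered (a small sketch, assuming quanteda is installed; the two example sentences are invented for illustration):

```r
library(quanteda)

# same three words, different order
d <- dfm(tokens(c("small text only", "only small text")))
d
# both documents show identical counts for "small", "text" and "only",
# so there is no way to tell which bigrams each document contained
```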

The clean quanteda way then would be:

library(quanteda)

df <- data.frame(
  text = c(rep("only a small text", 6),
           "remove this word lower frequency")
)
tdfm <- df %>%
  corpus() %>%                        # when you have a data.frame it usually makes sense to construct a corpus first to retain the other columns as meta-data
  tokens(remove_punct = TRUE, 
         remove_numbers = TRUE) %>%
  tokens_ngrams(n = 2:3) %>%          # construct ngrams
  dfm() %>%                           # convert to dfm
  dfm_trim(min_docfreq = 5)           # select ngrams that appear in at least 5 documents

tdfm
#> Document-feature matrix of: 7 documents, 5 features (14.3% sparse).
#>        features
#> docs    only_a a_small small_text only_a_small a_small_text
#>   text1      1       1          1            1            1
#>   text2      1       1          1            1            1
#>   text3      1       1          1            1            1
#>   text4      1       1          1            1            1
#>   text5      1       1          1            1            1
#>   text6      1       1          1            1            1
#> [ reached max_ndoc ... 1 more document ]

Created on 2020-07-22 by the reprex package (v0.3.0)

Update based on comment

If you want to create ngrams only from words that appear in at least 4 documents, I think it makes most sense to first construct a dfm without ngrams, keep the terms that pass the document-frequency filter, and then use that dfm to subset the tokens before constructing the ngrams (as no tokens_trim function exists):

# first construct a dfm without ngrams
dfm_onegram <- df %>%
  corpus() %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE) %>%
  dfm() %>%
  dfm_trim(min_docfreq = 4)    # keep words that appear in at least 4 documents

dfm_ngram <- df %>% 
  corpus() %>% 
  tokens(remove_punct = TRUE, 
         remove_numbers = TRUE) %>%
  tokens_keep(featnames(dfm_onegram)) %>% # keep only tokens that appear in at least 4 docs (in the dfm_onegram object)
  tokens_ngrams(n = 2:3) %>%         
  dfm() %>%
  dfm_trim(min_docfreq = 5) 

Keep in mind, though, that rare words are now skipped when the ngrams are formed. If a document contains the text "only a rare small text", the resulting ngram will still be "only_a_small".
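This caveat can be reproduced in isolation (a small sketch, assuming quanteda is installed; the text "only a rare small text" and the manual removal of "rare" stand in for a word that failed the document-frequency filter):

```r
library(quanteda)

# a document containing the rare word "rare"
toks <- tokens("only a rare small text")

# drop "rare" as if it had been filtered out by tokens_keep(),
# then build trigrams from the remaining tokens
toks_kept <- tokens_remove(toks, "rare")
tokens_ngrams(toks_kept, n = 3)
# the trigram "only_a_small" bridges the gap left by the removed word
```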

Answered by JBGruber on November 27, 2021

As in the previous question, just expand the features you are looking for: construct the 2- and 3-grams before building the dfm.

tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  # 2 and 3 grams
  tokens_ngrams(n = 2:3) %>% 
  dfm()
           
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
Document-feature matrix of: 7 documents, 5 features (14.3% sparse).
       features
docs    only_a a_small small_text only_a_small a_small_text
  text1      1       1          1            1            1
  text2      1       1          1            1            1
  text3      1       1          1            1            1
  text4      1       1          1            1            1
  text5      1       1          1            1            1
  text6      1       1          1            1            1
  text7      0       0          0            0            0

Answered by phiver on November 27, 2021

