Identifying common keyphrase frequency in large dataset

Asked by Jonathan Hedger on Data Science, March 5, 2021

I have a dataset of profiles which contain freeform text describing the work history of a number of individuals.

I would like to attempt to identify frequently used words or groups of words across the set of profiles in order that I can build a taxonomy (of skills) related to the profiles.

For example, if the words ‘conversion rate optimisation’ appear together 300 times across all profiles, I would see this on my list as a high-frequency keyphrase. I would expect to be able to filter the list by phrase length: single keywords, two-word strings, and three-word strings.

I would then be able to manually pick out frequently used keyphrases relating to skills that could be added to a master taxonomy list.

I would also need some way of filtering out invalid words (‘I’, ‘and’, etc.).

What is the best way to get something like this done?

2 Answers

I would like to attempt to identify frequently used words or groups of words

The difficulty here would be to capture multiword terms, as opposed to single words. This implies using n-grams for various values of $n$, and that can cause a bias when comparing the frequency of two terms of different length (number of words).
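As a quick illustration of that bias (using a made-up sentence, not real profile data): a text of $L$ tokens contains $L$ unigrams but only $L - n + 1$ $n$-grams, and a specific 3-gram can never occur more often than any of the words it contains, so raw counts are not directly comparable across lengths.

```python
# Why comparing raw counts across n-gram lengths is biased: a text of L tokens
# yields L unigrams but only L - n + 1 n-grams. (Illustrative sentence only.)
tokens = "skilled in conversion rate optimisation and conversion tracking".split()

for n in (1, 2, 3):
    grams = list(zip(*[tokens[i:] for i in range(n)]))
    print(f"{n}-grams: {len(grams)} occurrences in total")
```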

I would also need some way of filtering out invalid words like ('I', 'and' etc)

These are called stop words (sometimes function words or grammatical words). They are characterized by the fact that they appear very frequently even though they form quite a small subset of the vocabulary (this is related to Zipf's law for natural language). These two properties make them easy enough to enumerate in a predefined list so that they can be excluded; many such lists are freely available online.
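For a concrete example of such a predefined list, NLTK ships one for English (this is just one option; the exact contents depend on the NLTK version, and the corpus has to be downloaded once):

```python
# One widely used predefined stop word list, via NLTK (requires a one-time download).
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print(len(stop_words))          # roughly a couple hundred words
print(sorted(stop_words)[:10])  # e.g. 'a', 'about', 'above', ...
```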

Since you don't have any predefined list of terms, a baseline approach could go along these lines:

  1. For every value of $n$ to consider, collect all the $n$-grams
  2. Remove any $n$-gram made up only (or mostly) of stop words. (Note: it might be better to do this step first, but only if it's safe to assume that multiword terms don't contain stop words.)
  3. Calculate the document frequency for every candidate term (the same DF as in TF-IDF weights)
  4. Filter out the terms which have a very low document frequency (experiment with different values for the threshold). This step should eliminate a lot of noise, but probably not all of it.
  5. You will probably still need a bit of manual filtering here if your goal is to obtain a clean list of actual terms. Normally there should be few long n-grams left and the ones left should be good for the most part, however there might still be a lot of false positive unigrams and bigrams.

This approach is very basic, but it's easily adjustable: you can adapt it to your data, possibly add steps, etc. Otherwise there are probably specialized tools for terminology extraction, but I'm not familiar with any.
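Here is a rough sketch of this baseline in Python. It is only one possible reading of the steps above: the stop word list, the tokenization, the maximum $n$, and the document-frequency threshold are all placeholders to adjust to your own profiles.

```python
# Rough sketch of the baseline above: n-gram collection, stop word filtering,
# document frequency (DF) counting, and a DF threshold. The stop word list,
# tokenizer, and threshold below are placeholders to adapt to your data.
import re
from collections import Counter

STOP_WORDS = {"i", "and", "the", "a", "an", "of", "in", "to", "for", "with", "on"}
MIN_DF = 2          # minimum number of profiles a term must appear in
MAX_N = 3           # consider unigrams, bigrams and trigrams

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    return zip(*[tokens[i:] for i in range(n)])

def candidate_terms(text):
    """All 1..MAX_N grams of a profile that are not made up only of stop words."""
    tokens = tokenize(text)
    for n in range(1, MAX_N + 1):
        for gram in ngrams(tokens, n):
            if not all(tok in STOP_WORDS for tok in gram):
                yield " ".join(gram)

def document_frequencies(profiles):
    """DF = number of profiles containing the term at least once."""
    df = Counter()
    for text in profiles:
        df.update(set(candidate_terms(text)))   # set(): count each profile once
    return df

# Toy usage with made-up profiles:
profiles = [
    "Led conversion rate optimisation projects and SEO campaigns.",
    "Focused on conversion rate optimisation and web analytics.",
    "Experienced in web analytics and paid search.",
]
df = document_frequencies(profiles)
frequent = {term: n for term, n in df.items() if n >= MIN_DF}
for term, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(f"{n:3d}  {term}")
```

The remaining manual filtering (step 5) would then be done on the surviving list, ideally sorted by document frequency within each $n$-gram length so that the length bias mentioned earlier doesn't distort the ranking.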

Answered by Erwan on March 5, 2021

Clustering is the wrong tool for this purpose.

If you want to identify frequent patterns, use frequent pattern mining.

Here, you will want to consider order and locality, so some form of frequent sequence mining is certainly the way to go.

But since you likely only have a few hundred CVs, you can probably afford to simply count all words, 2-grams, 3-grams and 4-grams (which is still linear in the size of the input) and print the most frequent combinations of each length.

If you can afford to load multiple copies of your data into main memory, I suggest you simply use a dict and count all occurrences.
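A minimal sketch of that in-memory counting, using collections.Counter (a dict subclass); the naive whitespace tokenization and the example profiles are placeholders:

```python
# Count all 1-4 grams across all profiles in memory with a Counter.
# Tokenization and the example profiles are placeholders.
from collections import Counter

profiles = [
    "Led conversion rate optimisation and SEO campaigns.",
    "Worked on conversion rate optimisation for e-commerce sites.",
]

counts = Counter()
for text in profiles:
    tokens = text.lower().split()
    for n in range(1, 5):
        counts.update(" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)]))

for term, c in counts.most_common(10):
    print(c, term)
```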

Answered by Has QUIT--Anony-Mousse on March 5, 2021
