Wikipedia word frequency list

A domain name is one of the most valuable assets for every mISV (and every online business). A domain has to represent a lot of information in a limited number of characters. It has to be SEO friendly (descriptive), easy to remember, and easy to spell. I have done a lot of work to find good (and unregistered) domain names for my products.

It is written in many places that good keywords in a domain name help your customers better understand what your product does, and that they are especially useful for SEO (Search Engine Optimization) purposes. Countless articles suggest looking at the Google search count for a word's popularity before registering a domain (and also at the search result count). So I needed a list of words with some indication of how important they are. More precisely, I needed a list of phrases, not just single words, but this article is about single words only. I use this word list in combination with the Google External Keyword Tool and Google Trends to hunt for the perfect domain name.

Computational linguistics is specifically concerned with the question of how frequently given words appear in different written contexts (also known as corpora). A frequency list is a list of words (in a given language) and their associated frequencies in given texts. It is like a dictionary with an additional "importance" number.

We all think of a dictionary as a fixed list of words, but it is more like a list where words continuously appear and disappear, each word having a rank within it.

And there are power laws (or long tails, as some prefer to call them). A small number of words get all the attention (they are used most frequently), while a large number of words are used rarely (long tail keywords).

There are some well-known word frequency lists:

My own word frequency list

I decided to create my own list of words and associated frequencies, based on all the articles in the English version of Wikipedia.

Wikipedia is HUGE. The English part alone is 21GB in XML format. It takes about 5 hours to parse the entire file and extract statistics for all tokens that look like words.

Some statistics:

It seems that the word frequency distribution follows Zipf's law, and you can even find plots similar to the following one elsewhere on the web.
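In its simplest form, Zipf's law says that the frequency of the r-th most frequent word is roughly inversely proportional to its rank:

    f(r) ≈ C / r

so the second most common word appears about half as often as the first, the third about a third as often, and so on. On a log-log plot this shows up as an approximately straight line.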

Wikipedia word frequency plot

The chart can be divided into four parts:

A Google study shows that there are 14M one-word phrases (unigrams) and 315M two-word phrases (bigrams). Currently I have no plans to extract two-word phrases due to their large number, but it would be interesting to analyze them in the context of two-word domain names.

Extracting words from Wikipedia

The process of extracting all the words and counting them is not an easy task. I used the Qt XML library for parsing. The steps to create your own word frequency list are roughly:

1. Download the English Wikipedia dump (a single 21GB XML file).
2. Parse the XML and pull out the text of every article.
3. Strip the Wikipedia markup from the article text.
4. Tokenize the text into words and filter out the noise.
5. Collect statistics: count every word and export the final list.

The good news is that Wikipedia is much cleaner and more organized than the rest of the web. My main difficulties were parsing the Wikipedia markup language (it is not strict in some places) and managing memory (limited to 2GB, with memory leaks at some point). On Linux you can use Valgrind to check for leaks and other memory problems.
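A minimal sketch of the parsing step (not the exact code I used; the dump file name and the naive letters-only tokenization are assumptions) looks like this with Qt's QXmlStreamReader, which streams the file instead of loading all 21GB into memory:

    #include <QDebug>
    #include <QFile>
    #include <QHash>
    #include <QRegExp>
    #include <QString>
    #include <QStringList>
    #include <QXmlStreamReader>

    int main()
    {
        QFile dump("enwiki-pages-articles.xml");    // assumed dump file name
        if (!dump.open(QIODevice::ReadOnly))
            return 1;

        QXmlStreamReader xml(&dump);
        QHash<QString, quint64> counts;             // word -> occurrences

        while (!xml.atEnd()) {
            xml.readNext();
            // Each article's wiki markup is stored in a <text> element.
            if (xml.isStartElement() && xml.name() == QLatin1String("text")) {
                // Naive tokenization: runs of letters only. Real code must
                // also strip the wiki markup and filter out the noise.
                const QStringList tokens = xml.readElementText()
                    .split(QRegExp("[^A-Za-z]+"), QString::SkipEmptyParts);
                for (int i = 0; i < tokens.size(); ++i)
                    ++counts[tokens.at(i).toLower()];
            }
        }
        qDebug() << "distinct tokens:" << counts.size();
        return 0;
    }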

The "collect statistics" part can be done in different ways. I used my own implementation of a ternary search tree. It is fast and memory efficient for counting words. It also implements some filtering of the strings found in Wikipedia, like exceptionally long strings (URLs, for example) and other noise.
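A minimal sketch of such a tree (assuming non-empty, lowercased ASCII words, no rebalancing, and with the filtering left out; the node layout and function names are illustrative, not my exact implementation):

    #include <cstddef>
    #include <string>

    struct Node {
        char ch;                   // one character of the key
        Node *lo, *eq, *hi;        // children: less than, equal, greater than
        unsigned long count;       // > 0 marks the end of a stored word
        explicit Node(char c) : ch(c), lo(0), eq(0), hi(0), count(0) {}
    };

    // Insert `word` into the tree rooted at `node`, or bump its count
    // if it is already there. Assumes `word` is non-empty.
    void insert(Node *&node, const std::string &word, std::size_t i = 0)
    {
        if (!node)
            node = new Node(word[i]);
        if (word[i] < node->ch)
            insert(node->lo, word, i);        // branch left, same position
        else if (word[i] > node->ch)
            insert(node->hi, word, i);        // branch right, same position
        else if (i + 1 < word.size())
            insert(node->eq, word, i + 1);    // match, advance to next char
        else
            ++node->count;                    // whole word consumed
    }

    // Return the count recorded for `word`, or 0 if it was never inserted.
    unsigned long lookup(const Node *node, const std::string &word,
                         std::size_t i = 0)
    {
        if (!node)
            return 0;
        if (word[i] < node->ch)
            return lookup(node->lo, word, i);
        if (word[i] > node->ch)
            return lookup(node->hi, word, i);
        if (i + 1 < word.size())
            return lookup(node->eq, word, i + 1);
        return node->count;
    }

Because each node stores a single character, words sharing a prefix share nodes, which is what keeps the memory usage low for millions of distinct words.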

Some selected words and associated counts:

When you look at counts published on the web, keep in mind that only relative counts matter. The relative count, (word count / total word count), can be interpreted as the probability of occurrence of a given word in a given corpus.
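For example (the numbers are made up for illustration): a word that appears 2,000 times in a corpus of 1,000,000 tokens has a relative count of 2,000 / 1,000,000 = 0.002, so a randomly picked token from that corpus is that word with probability 0.2%. Only such normalized numbers can be meaningfully compared between lists built from corpora of different sizes.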

Free service to search for available domain names