Researching Competitors’ Keywords in Delicious
Currently I am developing software that creates collages semi-automatically. There are many strong competitors with outstanding collage-making products. I am researching those products and their associated search keywords for SEO purposes. Here is what I found very useful for researching competitors’ keywords and understanding how users think about competitor products.
First, let’s create a list of important competitors. The collage software industry has many participants, each focused on a particular niche. My competitors are those who produce software for photo collages, particularly the kind of collage that you can put on a wall as a poster (or use as computer wallpaper), turn into a postcard, or use as a combined image for a blog post. I have a short list of around 20 software products that appear in Google when searching for particular phrases that describe my product: “photo collage software”, “photo poster”, etc.
People bookmark websites in different ways. For example, I use XMarks for Firefox. The XMarks company knows how people describe different websites and which keywords they associate with them. We have no access to that information, so let’s look at what we can collect from the publicly visible bookmarks at Delicious.com.
If you haven’t yet completed your list of competitors, just type the search phrases into the Delicious search form and look for the most bookmarked websites. To the right of each bookmark is a number. Clicking on the number reveals the users who bookmarked this site. On the right of that page is the most important information: the total number of times each keyword is associated with this website.
At the top of this list are the most important words describing the product; the rest are long-tail words somehow related to it. On the left are the users’ descriptions of the site. Descriptions are commonly in English, but some are in German, Chinese and Spanish, which suggests that this competitor’s product has users in those countries. Most of the descriptions are copy-pasted from the home page and are of no particular interest to me. I dug into the descriptions and found some very interesting ones that can be classified as follows:
- Product description in the users’ terms. Most users have no knowledge of image processing, so they use simple words to describe what the product does (and probably use the same words to search for products).
- Possible product application. For example: “I can use this for gifts for my mom”, “Fun to teach students”.
- Keywords translated into other languages. Translating keywords can be tricky, but tags in other languages give me an idea of the translations and of some variations when an exact translation is not possible: “Montagens de fotos”, “Collagen aus Fotos”.
- Product opinions. For example: “Looks very easy and fun”, “It is not customizable”, “looks like fun to try, but probably cannot download at work”.
It will be interesting to look at the descriptions when my site becomes popular, but meanwhile I have to dig through my competitors’ bookmarks. The main difficulty with this way of researching keywords and user opinions is that it is time-consuming. I would be happy if there were a way to automatically extract all the descriptions and filter out the duplicates.
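The filtering half of that wish is easy to sketch. The Python below is only an illustration and assumes the descriptions have already been collected by hand into a list of strings; it does not fetch anything from Delicious. The names `normalize` and `unique_descriptions` and the `boilerplate` parameter are hypothetical, introduced here for the example.

```python
import re
import unicodedata


def normalize(description):
    """Lower-case, drop punctuation and collapse whitespace so that
    near-identical copies of a description compare equal."""
    text = unicodedata.normalize("NFKC", description).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def unique_descriptions(descriptions, boilerplate=()):
    """Return descriptions in their original order, skipping duplicates
    and anything that matches known home-page text (boilerplate)."""
    known = {normalize(b) for b in boilerplate}
    seen = set()
    result = []
    for description in descriptions:
        key = normalize(description)
        if not key or key in known or key in seen:
            continue
        seen.add(key)
        result.append(description)
    return result
```

Passing the competitor’s home-page tagline as `boilerplate` removes the copy-pasted descriptions and leaves the ones users wrote themselves.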
Wikipedia word frequency list
Domain names are one of the most valuable assets for every mISV (and every online business). A domain has to convey a lot of information in a limited number of characters. It has to be SEO friendly (descriptive), easy to remember and easy to spell. I have done a lot of work to find good (and unregistered) domain names for my products.
Many sources say that good keywords in a domain name help your customers better understand what your product does and that they are especially useful for SEO (Search Engine Optimization). Countless articles suggest checking the Google search count for the popularity of words before registering a domain (and also the search result count). So I needed a list of words with some indication of how important they are. More precisely, I needed a list of phrases, not single words, but this article is about single words only. I use this word list in combination with the Google External Keyword Tool and Google Trends to hunt for a perfect domain name.
Computational linguistics is concerned, among other things, with the question of how frequently given words appear in a body of written text (known as a corpus). A frequency list is a list of words (in a given language) with their associated frequencies in given texts. It is like a dictionary with an additional “importance” number.
We all think of a dictionary as a fixed list of words, but it is more like a list in which words continuously appear and disappear, with each word having a rank.
And there are power laws (or long tails, as some prefer to call them). A small number of words get all the attention (they are used most frequently), and a large number of words are used rarely (long-tail keywords).
There are some well-known word frequency lists:
- Wiktionary: http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
- Google 1 trillion tokens: Web 1T 5-Gram
- BNC: ftp://ftp.itri.bton.ac.uk/bnc/all.num.o5 (Large file!)
My own word frequency list
I decided to create my own list of words and associated frequencies based on all articles that are in the English version of Wikipedia.
Wikipedia is HUGE. The English part alone is 21GB in XML format. It takes 5 hours to parse the entire file and extract statistics for all tokens that look like words.
Some statistics:
- Total tokens (words, no numbers): 1,570,455,731
- Unique tokens (words, no numbers): 5,800,280
It seems that the word frequency distribution follows Zipf’s law; the chart of count against rank looks similar to other published plots of this kind.
The chart can be divided into four parts:
- Rank(1-50) Count(86M-3M) Examples(the, of, and, in, to, a, is) Stop words.
- Rank(51-3K) Count(2.4M-56K) Examples(university, January, tea, sharp) Words that form the “core” of the English dictionary, that is, the most frequently used words.
- Rank(3K-200K) Count(56K-118) Examples(officiates, polytonality, neoligism) Words that can be found in large, comprehensive dictionaries (those above rank 50K are mostly long-tail words).
- Rank(200K-5.8M) Count(117-1) Examples(euprosthenops, eurotrochilus, lokottaravada) Terms from obscure niches, misspelled words, transliterated words from other languages, new words and “not words at all”
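One rough way to check the Zipf’s law observation on your own counts is to fit a straight line to log(count) against log(rank); for a Zipf-like list the slope is close to -1. The sketch below is an illustration of that check, not the method used to produce the chart, and the function name `zipf_exponent` is made up for the example.

```python
import math


def zipf_exponent(counts):
    """Estimate s in count ~ C / rank**s by a least-squares line fit
    on the log-log scale. `counts` is any iterable of word counts."""
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative; Zipf's exponent is its magnitude.
    return -sxy / sxx
```

A value near 1 supports the Zipf reading. On real data the stop words at the head and the noise in the tail tend to bend the line, so it can be more informative to fit only the middle ranks.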
A Google study shows that there are 14M one-word and 315M two-word phrases (bigrams). Currently I have no plans to extract two-word phrases because of their large number, but it would be interesting to analyze them in the context of two-word domain names.
Extracting words from Wikipedia
Extracting all the words and counting them is not an easy task. I used the Qt XML library for parsing. The steps to create your own word frequency list are:
- Download a copy of Wikipedia. I used the version dumped in XML format.
- Write a parser to extract text from the <title> and <text> tags.
- Wikipedia uses its own markup language. Write a parser to extract the text from the markup and filter out unnecessary parts (this is the difficult and vague part).
- Filter out numbers and special characters.
- Tokenize.
- Collect useful statistics.
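The steps above can be sketched in a few lines of Python. This is not the Qt-based parser I used; it is a simplified illustration built on the standard library, and its markup cleanup is far cruder than what real Wikipedia text needs. The names `strip_markup` and `count_words` are hypothetical, and lower-casing every token is an assumption of the sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Runs of letters only: digits, underscores and punctuation split tokens.
WORD_RE = re.compile(r"[^\W\d_]+")


def strip_markup(wikitext):
    """Very rough cleanup of wiki markup; real dumps need many more rules."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", wikitext)  # non-nested templates
    text = re.sub(r"<[^>]+>", " ", text)  # inline HTML-like tags
    # [[target|label]] becomes label, [[target]] becomes target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"https?://\S+", " ", text)  # bare URLs
    return text


def count_words(xml_source):
    """Stream an XML dump (file name or binary file object) and count
    the words found in the title and text elements."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag in ("title", "text") and elem.text:
            words = WORD_RE.findall(strip_markup(elem.text))
            counts.update(word.lower() for word in words)
        if tag == "page":
            elem.clear()  # release the finished page to limit memory use
    return counts
```

Streaming with `iterparse` and clearing each finished page matters here because the dump is far too large to load as a single tree.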
The good news is that Wikipedia is much cleaner and better organized than the rest of the web. My main difficulties were parsing the Wikipedia markup language (it is not strict in some parts) and managing memory (I was limited to 2GB and had memory leaks at some point). On Linux you can use Valgrind to check for leaks and other memory problems.
The “Collect useful statistics” part can be done in different ways. I used my own implementation of a ternary search tree. It is fast and memory-efficient for counting words. It also filters out some strings found in Wikipedia, such as exceptionally long strings (URLs, for example) and other noise.
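For readers who have not met the data structure, here is a minimal ternary search tree counter in Python. It is a sketch of the general technique, not my implementation: the class name `TernaryCounter` and the `max_len` cutoff of 40 characters are assumptions made for the example, and the memory advantage applies mainly to a compiled implementation (in Python a plain dictionary would be simpler and smaller).

```python
class _Node:
    """One character of a stored word, with three child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.count = 0  # number of times a word ending here was added


class TernaryCounter:
    def __init__(self, max_len=40):
        self.root = None
        self.max_len = max_len  # longer strings are treated as noise

    def add(self, word):
        """Count one occurrence of word; return False if it was filtered."""
        if not word or len(word) > self.max_len:
            return False
        if self.root is None:
            self.root = _Node(word[0])
        node, i = self.root, 0
        while True:
            ch = word[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = _Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = _Node(ch)
                node = node.hi
            else:
                i += 1
                if i == len(word):
                    node.count += 1
                    return True
                if node.eq is None:
                    node.eq = _Node(word[i])
                node = node.eq

    def count(self, word):
        """Return how many times word was added (0 if never)."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(word):
                    return node.count
                node = node.eq
        return 0
```

Words that share a prefix share nodes, which is where the memory saving comes from, and the length check in `add` is the simplest form of the noise filtering described above.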
Some selected words and associated counts:
- Google 197920
- Twitter 894
- domain 111850
- domainer 22
- Wikipedia 3226237
- Wiki 176827
- Obama 22941
- Oprah 3885
- Moniker 4974
- GoDaddy 228
When you look at counts published on the web, keep in mind that only relative counts matter. The relative count = (word count / total word count) can be read as the probability of a given word occurring in a given corpus.
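As a small worked example, the relative count can be computed from the numbers in this post. The total below is the token count reported earlier; the `per_million` helper is only a convenience added for this illustration, because the raw probabilities are very small.

```python
TOTAL_TOKENS = 1_570_455_731  # total word tokens reported above


def relative_frequency(count, total=TOTAL_TOKENS):
    """Probability that a randomly picked token is this word."""
    return count / total


def per_million(count, total=TOTAL_TOKENS):
    """The same ratio, scaled to occurrences per million tokens."""
    return 1_000_000 * count / total


for word, count in [("Google", 197920), ("Wikipedia", 3226237), ("Twitter", 894)]:
    print(f"{word}: {per_million(count):.2f} per million tokens")
```

With these figures “Google” comes out at roughly 126 occurrences per million tokens and “Wikipedia” at roughly 2,054. That reflects the corpus talking about itself more than general usage, a reminder that a relative count is only meaningful for the corpus it was measured on.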


