Well, this has certainly been a remarkably debate-ridden week in the world of search. Not only have we had the death of SEO (or not, actually, if you have half a brain or more!) caused by Google Instant, but the principle of latent Dirichlet allocation has also been thrust into the awareness (if not always the understanding) of the SEO community by Ben Hendrickson at SEOMoz.
However, latent Dirichlet allocation (LDA) is a fairly advanced statistical concept, built on complex probability maths. If you have a mathematically inclined brain, and you’re brave, you could check out the Wikipedia entry for LDA here. Several people in the SEO community seem to confuse LDA with a concept many of us in SEO have known about for a few years – latent semantic indexing (LSI).
Latent semantic indexing is also a statistical concept built on probability maths, but it is less complex than LDA. If you’re still feeling brave, the Wikipedia page describing the mathematical concept is here.
LDA and LSI both describe mathematical models that are designed to be used for information retrieval – i.e. returning search results.
In a nutshell, LSI examines the words used in a document and looks for their relationships with other words. Google was granted a patent in 2005 that discusses looking at three types of relationship – lateral (where a word means the same, or very nearly the same, as another, e.g. car, automobile and auto), “kind of” (where a word is a kind of something else, e.g. car and vehicle), and “part of” (where a word is part of a larger concept, e.g. engine and car).
LSI allows a search engine to determine the kinds of words a web page might be relevant for, even if they are not actually used on the page itself. By writing content that is packed full of words that are related to each other, you strengthen the document for all of those words.
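To make the idea concrete, here is a toy sketch of relationship-based scoring. The relationship table and weights are invented for illustration – real LSI derives word relationships statistically from a large corpus, not from a hand-built dictionary:

```python
# Hypothetical relationship table: how strongly each query word relates
# to other words (lateral, "kind of" and "part of" relationships).
# Invented weights for illustration only -- not Google's actual data.
RELATED = {
    "car": {"automobile": 1.0, "auto": 1.0, "vehicle": 0.7, "engine": 0.5},
    "vehicle": {"car": 0.7, "automobile": 0.7, "auto": 0.7},
}

def relevance(document_words, query_word):
    """Score a document for a query word, crediting related words too."""
    score = 0.0
    for word in document_words:
        if word == query_word:
            score += 1.0  # exact match
        else:
            # partial credit for words related to the query word
            score += RELATED.get(query_word, {}).get(word, 0.0)
    return score

doc = ["my", "car", "needs", "a", "new", "engine"]
print(relevance(doc, "car"))      # exact match plus "part of" credit: 1.5
print(relevance(doc, "vehicle"))  # no exact match, but still scores: 0.7
```

The point of the sketch is the second call: the document never mentions “vehicle”, yet it still earns a relevance score for it through its relationship with “car”.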
However, LSI has one major weakness – ambiguity. How could a search engine determine whether you are talking about Microsoft Office, or the office in which you work? Especially if you excel in what you do, providing a fresh outlook in your job as a publisher of bibles (you want to give more people access to the word of God, see). I can’t think of any OneNote or PowerPoint puns, but you get the idea.
LDA, on the other hand, is a significant extension of LSI. Words are grouped into topics, and a word can exist in more than one topic – in fact, most do. LDA tackles ambiguity by comparing a document to two topics at a time and determining which topic is closer to the document, across all combinations of topics that seem broadly relevant. In doing so, LDA helps an information retrieval system (such as a search engine) determine which documents are most relevant to which topics.
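A toy sketch of that topic-comparison idea follows. The two topics and their word probabilities are invented for illustration – real LDA infers topic distributions statistically from a corpus rather than having them written down by hand – but it shows how an ambiguous word like “office” gets disambiguated by the company it keeps:

```python
import math

# Two hypothetical topics, each a probability distribution over words.
# "office" appears in both -- the surrounding words settle the ambiguity.
TOPICS = {
    "software": {"office": 0.2, "excel": 0.2, "spreadsheet": 0.2,
                 "microsoft": 0.2, "install": 0.2},
    "workplace": {"office": 0.2, "desk": 0.2, "meeting": 0.2,
                  "colleague": 0.2, "manager": 0.2},
}

def topic_log_likelihood(words, topic, smoothing=1e-3):
    """How well a topic's word distribution explains the document.

    Log probabilities are summed; unknown words get a tiny smoothed
    probability instead of zeroing the whole score out.
    """
    dist = TOPICS[topic]
    return sum(math.log(dist.get(w, smoothing)) for w in words)

doc = ["office", "spreadsheet", "excel", "formulas"]
scores = {t: topic_log_likelihood(doc, t) for t in TOPICS}
best = max(scores, key=scores.get)
print(best)  # prints "software"
```

Even though “office” is equally probable under both topics, “spreadsheet” and “excel” pull the document decisively towards the software topic – which is the comparison-across-topics behaviour described above, stripped down to two topics.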
The boffins at SEOMoz (apologies for the Bigmouthmedia-style start to that sentence) have built an excellent tool to determine an LDA score for text documents against search terms, and have tested many hundreds of sets of Google search results against what might be expected from their LDA scores.
Now clearly, the results did not exactly match the LDA-based predictions, but they came so very close that SEOMoz – famous for their belief that link profile far outweighs most other ranking factors in Google – now suggest that the text on your web pages has a much stronger influence than they used to believe.
Looking beyond that, clearly there are a large number of factors that Google uses to rank websites – it’s all a rich tapestry, so they say. However, LDA is possibly the most spam-resistant way of determining what a web page is about. LSI synonym spamming won’t necessarily help – especially if the synonyms are ambiguous or, worse, change the context of the original word’s meaning. Google-bombing links won’t help LDA scoring. Admittedly, it doesn’t prevent some of the black-hat mainstays like hidden content and cloaking, but then there are human-applied ranking penalties and bans for that kind of behaviour.
I’m not suggesting LDA can’t be spammed (clearly it can, by some forms of keyword density spamming), but it does allow documents that aren’t spammed to compete well, topic for topic, with those that are.
So, do Google use LDA algorithms? There don’t seem to be any patents specifically mentioning it yet, but patents aren’t always published quickly. Additionally, the patents (if they exist) will almost certainly be clouded in a little fog to prevent SEOs from reverse-engineering Google. However, the correlation seems very strong, and of course LDA is a specific mathematical information retrieval model. My belief is that it is likely that they do apply it.
The final, albeit tenuous, piece of evidence is the way Google talk about their search results, and about how you should write your content.
Matt Cutts is always saying things like “if you want to rank well, write great content.” To many this seems like a cop-out, but for an LDA-based information retrieval system, it is exactly what you should be doing.
Additionally, at the search event earlier this week, while Google Instant was being extolled to the masses of gathered journalists, Ben Gomes – one of the chief engineers at Google, and in many ways, the public face of Google Instant – referred over and over again to searching for topics, and refining your search topic. Not keywords. Not search terms. Topics.