Session IDs and SEO

This page was originally written by Robert Morrison on 8 Nov 2007 as an article for Oyster Web.

There is no such thing as a session. There is no spoon. It's all fake, but more importantly it's a problem.

HTTP, the mechanism by which web pages are retrieved from the Internet to be displayed on your screen, is stateless. This means that each request is a new 'conversation' with no recollection of any past 'conversations', so no persisting 'state' exists for the duration of your visit to a site. In essence each request is a brand new 'visit'.

You have a goldfish librarian: generally your requested documents are returned quickly and efficiently, but the librarian never remembers your face. Perhaps you request another document from the same librarian, but there is no recollection that you borrowed something else a few moments earlier. But, to beat the last bit of life out of my metaphor, goldfish are not great with shopping carts and other Web 2.0 stuff and Internet Services.

So we create the myth of a persistent session; by some other mechanism (on top of simple HTTP) clients (in this case the 'client' is the web browsing program on your computer) can identify themselves to the Web Site serving the requests. This forms the basis for storing specific information about that particular client. All you need is a session ID. An accepted and widely used means for communicating session IDs between client and server is the cookie:

  1. The client (browser or Search Engine spider) makes a request - it 'asks' for a web page from the server

  2. The server generates a Session ID and passes it back to the client as a cookie, in a header sent along with the data (i.e. the web page) that was requested.

  3. On every subsequent request the browser now includes the cookie. The cookie contains the session ID.

  4. When handling each of these requests, the server now 'knows' who the client is so that Dynamic Pages such as shopping carts can be handled appropriately.
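The four steps above can be sketched in a few lines of Python. This is only an illustration, not any particular server's implementation: the cookie name `SESSIONID`, the in-memory `SESSIONS` store and the `handle_request` function are all names made up for this example.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory store: session ID -> data kept for that client.
# A real site would more likely use a database or cache.
SESSIONS = {}


def handle_request(cookie_header):
    """Handle one request; return (session_id, Set-Cookie value or None)."""
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)
    morsel = cookie.get("SESSIONID")
    if morsel is not None and morsel.value in SESSIONS:
        # Steps 3 and 4: the client sent its ID back, so the server 'knows' it.
        return morsel.value, None

    # Step 2: no usable cookie, so generate an ID and hand it back as a cookie.
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"cart": []}
    out = SimpleCookie()
    out["SESSIONID"] = session_id
    out["SESSIONID"]["path"] = "/"
    return session_id, out["SESSIONID"].OutputString()


# Step 1: first request, no cookie yet.
sid, set_cookie = handle_request(None)
SESSIONS[sid]["cart"].append("widgets")

# Later request: the browser includes the cookie it was given.
sid2, set_cookie2 = handle_request(f"SESSIONID={sid}")
```

The second call finds the same session, so the cart contents survive from one request to the next even though HTTP itself remembers nothing.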

See? It's all an illusion. And it works most of the time, but some (very old) browsers do not support cookies, and in any case users are free to disable them in their browser settings. In our example of a shopping cart, a client losing their session identifier would mean losing the contents of their cart. We must find a means other than cookies to propagate the session ID: lots of application developers choose to do this by appending a variable to the end of the URL.
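As a rough sketch of what 'appending a variable to the end of the URL' involves, here is one way to do it with Python's standard library. The function name `add_session_id` and the parameter name `SESSIONID` are assumptions for the sake of the example; real frameworks each have their own conventions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_session_id(url, session_id, param="SESSIONID"):
    """Return url with the session ID appended as a query variable."""
    parts = urlsplit(url)
    # Keep the existing variables, dropping any stale session ID first.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


plain = add_session_id(
    "http://www.oyster-web.co.uk/shop/products/widgets.htm", "abc123"
)
with_query = add_session_id(
    "http://www.oyster-web.co.uk/shop/products.htm?product=widgets", "abc123"
)
```

Every link the server writes into a page would pass through a function like this for clients that do not return cookies.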

You are likely to have seen variables appended in this manner when you ran a search on Google or filled out some other website form. Putting session IDs in URLs also opens up the issue of session hijacking, but I shall leave that particular can-o-worms for another post... And talking of Google, we've reached the main point of this article: search engines can't be relied upon to support cookies.

Session IDs in the URLs of your website can cause 'duplicate content' problems with search engines. This can lead to diminished search engine rankings.

Normally, URL variables mean the page will be different. Dynamic 'product' pages often use them to display the correct product on an otherwise standard page:

http://www.oyster-web.co.uk/shop/products.htm?product=widgets
http://www.oyster-web.co.uk/shop/products.htm?product=sprockets

These pages are different and we want both of them indexed. However:

http://www.oyster-web.co.uk/shop/products/widgets.htm
http://www.oyster-web.co.uk/shop/products/widgets.htm?SESSIONID=3y483io2u8902urk9802u9032m903

These pages are really the same and we only want that content indexed once. Furthermore, in order for the session ID to persist, it must be added to each of the links on the page! This means that the pages will have different links for every user that visits the site without cookies, yet the indexable content will always be the same.
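To make the duplication concrete, the sketch below strips the session variable from a handful of URLs and counts how many distinct pages are left. It is an illustration under assumptions: `strip_session_id` is a made-up helper, and the second session ID is invented for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="SESSIONID"):
    """Return url with the session variable removed, other variables kept."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


crawled = [
    "http://www.oyster-web.co.uk/shop/products.htm?product=widgets",
    "http://www.oyster-web.co.uk/shop/products.htm?product=sprockets",
    "http://www.oyster-web.co.uk/shop/products/widgets.htm",
    "http://www.oyster-web.co.uk/shop/products/widgets.htm"
    "?SESSIONID=3y483io2u8902urk9802u9032m903",
    # Hypothetical second visitor with a different session ID.
    "http://www.oyster-web.co.uk/shop/products/widgets.htm?SESSIONID=another",
]

unique_pages = {strip_session_id(url) for url in crawled}
```

Five crawled URLs collapse to three real pages: the `product` variable changes the content and is kept, while the session variable does not and is discarded.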

What can be done?

Fortunately, we can make use of another piece of HTTP technology. A User Agent is another piece of information usually included in HTTP requests. It describes the type of browser that is being used, and usually the particular version of the software. Here are some examples:

  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)

  • Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

  • Opera/9.23 (Windows NT 5.1; U; en)

  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

These represent (1) IE 6, (2) Firefox, (3) Opera and (4) Googlebot.

"Googlebot?!" you say? Why yes. We can identify when Googlebot is examining our pages by looking at all the page requests that identify themselves as having come from Googlebot. There are equivalents for Yahoo and the others too, plus specialised versions that, for example, examine the images or other special content on your website.

So we alter the script that dynamically builds the page, and now it looks out for Googlebot's User Agent signature. Deciding that Googlebot (or any other search engine spider) doesn't really need a session ID, our script sends the requested page without the appended session ID (remember that normally all the links are dynamically written with the session ID for anyone not using cookies). In this way we present to the search engines a dynamic site, but without the enormous potential for duplicate content.
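A minimal sketch of that check might look like the following. The names `CRAWLER_TOKENS`, `is_crawler` and `build_link` are invented for this example, and the token list only covers Googlebot; other spiders would need their own entries.

```python
# Lower-case substrings that mark a search engine spider's User Agent.
# Only Googlebot is listed here; extend as needed (assumption of this sketch).
CRAWLER_TOKENS = ("googlebot",)


def is_crawler(user_agent):
    """Return True if the User Agent string looks like a known spider."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def build_link(path, session_id, user_agent, has_cookie):
    """Write a link, appending the session ID only when it is needed."""
    if has_cookie or is_crawler(user_agent):
        # Cookie users carry the ID already; spiders don't need one at all.
        return path
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}SESSIONID={session_id}"


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
FIREFOX = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) "
    "Gecko/20070725 Firefox/2.0.0.6"
)

spider_link = build_link("/shop/products/widgets.htm", "abc123", GOOGLEBOT, False)
visitor_link = build_link("/shop/products/widgets.htm", "abc123", FIREFOX, False)
```

Bear in mind that the User Agent is simply whatever the client chooses to send, so a check like this identifies requests that claim to be Googlebot rather than proving that they are.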

Of course, this shouldn't be abused. Supplying search engines with different content from what ordinary users see contravenes search engine ethical guidelines.
