Google: Changes in Google Ranking Strategies




A paper with more current research was posted to the new Spider-Food SEO forum on December 21, 2005. Google Reputation and Trusted Content looks at the impact of the several updates in Google's algorithm during the year of 2005 (February - May, July, and October-November).


INTRODUCTION

This paper is a followup to On the Googleness of Being, published at Spider-Food.Net on February 11, 2005.

Around the middle of December 2004, I realized that Google had begun adding fresh Xenite content into their primary cache on a weekly basis. By submitting newly created content for crawling at different times through the week, I found that anything submitted by Friday would usually appear in the cache (and, by extension, in search results) by the next Monday. The pattern has held consistent through the middle of April 2005.

In February, I wrote that Google displays a secondary cache which is built from their daily crawls and is only used for minimal reporting. I have noticed differences of only a few days between the primary and secondary caches for some of my pages. In retrospect, I believe that I was only seeing cache from different data centers[1]. However, subsequent observations have led me to conclude that Google is, in fact, maintaining a historical footprint of Web site caches. Around March 6, 2005, Google's search results began displaying far fewer instances of descriptions and cache for a broad variety of search queries. I estimate that somewhere between 50% and 75% of the results for my queries lacked cache and description data.

Within two weeks of this event, Google began displaying cache and title data from 2004 for many of the uncached sites. Sampling the cache results, I found data from as far back as January 2004. It appears that Google was relying on shards[2] from a period extending back over a year to supply cache and title data. From about March 10 through April 10, I observed steady updates to the Google cache on a weekly basis. Old data was consistently replaced by new data extracted from recent crawls. During this period, people reported in several search engine optimization forums that their sites, which had not been crawled by Google for months, were being recrawled.

In February, writing about what I perceived to be the two distinct caches in Google's search results, I said: "I believe Google is comparing the pages in primary and secondary cache, and if it finds a difference it reschedules a crawl for the site. The second crawl seems to be what kicks the content of secondary cache into primary cache." I now think the process is more sophisticated. Google maintains a footprint of reported changes in content. The footprint appears to be established by simple HTTP header requests. Any page which generates a 200 response code from the server is fetched. Any page which generates a 304 (NOT MODIFIED) code from the server is not fetched. The sampling of server response codes helps Google determine whether a site should be recrawled.
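The 200/304 pattern described above can be sketched in a few lines of Python. This is only an illustration of the behavior I observed, not Google's actual crawler logic. The function names are hypothetical, and the use of an If-Modified-Since header is an assumption about how such a check could be made; only the 200 and 304 response codes come from my observations.

```python
from email.utils import formatdate

# Hypothetical names throughout; this only illustrates the observed 200/304 pattern.
HTTP_OK = 200
HTTP_NOT_MODIFIED = 304


def conditional_headers(last_crawl_epoch):
    """Build a header a crawler could send to ask 'changed since my last visit?'.

    Assumption: the check is made with If-Modified-Since.
    """
    return {"If-Modified-Since": formatdate(last_crawl_epoch, usegmt=True)}


def should_fetch(status_code):
    """Return True when the server's response suggests a full fetch is needed."""
    if status_code == HTTP_NOT_MODIFIED:
        return False  # server reports no change since the last crawl
    return status_code == HTTP_OK  # other codes are left out of this sketch


def pages_to_recrawl(sampled_codes):
    """Given {url: status_code} from a sampling pass, list the URLs to fetch in full."""
    return [url for url, code in sampled_codes.items() if should_fetch(code)]
```

Under this sketch, a largely static site that answers most header checks with 304 would see few full-page fetches, which is consistent with the reduced fetching described in the next paragraph.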

By mid-March, Google had substantially reduced the number of full-page requests from Xenite.Org. The nature of Xenite's content is largely static, but once or twice a year I usually revise the basic appearance of the site. These revisions usually result in significant changes to on-site navigation, advertising, disclaimers, and cross-linking. The changes do not normally alter the body text of the content. By implementing a partial revision of page layouts across selected portions of Xenite.Org's network, as well as adding new content, I observed increased full-page fetching activity from Google. All the changed content was eventually recrawled and reindexed, although updates to the cache might lag by 9-10 days.

New content, optimized for high placement in rankings with standard on-page factors[3], generally appeared in Google's search results within 5-10 days of being submitted for crawling. The rankings for targeted search terms usually began in the 30s or 40s. Within 2 weeks, a typical new page would break into the top 10 results for the targeted term. Within 3-4 weeks, many pages were ranked 1st or in the top 5. Competitiveness for search terms varied[4]. Many search expressions rated in the "Not Optimized" range of 1..10. The competitiveness of a search expression is not directly related to the level of traffic for that expression.

For example, a search for "google" produces only results from Google's Web sites. There is no level of competitiveness because there is no competition in the top 10 listing. But Google receives millions of visitors each day.

Recap of "On The Googleness of Being"

In "On the Googleness of Being", I asserted that Google was replacing forwarding pages (pages whose on-page content indicates that a URL has changed) with the actual (dynamic) URLs of the moved content for Xenite's forums. In fact, we maintain duplicate sets of forwarding pages on both Xenite.Org and SF-FANDOM for historical reasons (there are many off-network inbound links for the Xenite pages which send traffic to our forums). The Xenite forwarding pages have been largely dropped from Google's search results, whereas prior to February 1, 2005, it was common for both the Xenite and SF-FANDOM forwarding pages to be listed in the top 10 results. Now, about 50% of SF-FANDOM's forwarding pages are listed and the rest of the search expressions produce direct links to the dynamic forum URLs. We maintain 1st through 5th place rankings for our forums, which is consistent with past performance.

There have been several reports in search engine optimization forums that Google is now increasing its cache size per page. That is, the reported cache now exceeds the 101 Kilobyte limit which Google had previously imposed. The increased amount of cache per page and the use of cached data from the previous 14 months imply that Google has substantially increased its server resources.

In "On the Googleness of Being" I reported that:
  • "Google is influenced by smaller content pages than it is by larger content pages." My observations since early February have not been consistent with that statement. However, the de-evolution of the Google cache in March was probably a significant factor in the change in observed behavior.

  • "Google has swung back to embracing random fresh content." In fact, the crawling behavior I have observed, where 304 NOT MODIFIED codes are returned by servers, explains this shift in results. Google is not "embracing random fresh content"; it is actively seeking ALL fresh content.

  • "Inbound links are not important for the new content on established sites, provided that those sites are internally well-linked." Continued observations of Google's changes in results where new content appears support that conclusion.

  • "Where Google detects redirection or supercession of content, it is bumping the new content up in the rankings at the expense of the older, redirecting pages ... WITHOUT REGARD FOR WHERE INBOUND LINKS ARE POINTING." This continues to be so in the search results I have monitored. However, this behavior does not appear to be related to the 302 redirect issues which have caused much concern in search engine optimization communities. The redirection referred to here is handled through JavaScript and/or HTTP-EQUIV meta tags in page headers.
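To make the last point concrete, here is a minimal Python sketch of how a program could recognize an HTTP-EQUIV refresh tag in a forwarding page and extract the target URL. It is offered only as an illustration of this kind of page-level redirection, not as a description of how Google detects it. The class and function names are hypothetical, the example.com URL is a placeholder, and JavaScript redirects are not handled.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect the target URL of the first <meta http-equiv="refresh"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=http://example.com/new-page".
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = url.strip().strip("'\"")


def find_meta_refresh_target(html):
    """Return the refresh target URL found in the page, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target
```

A forwarding page containing `<META HTTP-EQUIV="Refresh" CONTENT="0; URL=http://example.com/new">` would yield `http://example.com/new`, while an ordinary page would yield `None`.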


I also introduced several terms to explain Google's apparent behavior. They are:
  • REPUTATION, where Google appears to distinguish a site's importance on the basis of its past performance in Google's database. Performance may include ranking for multiple search queries. Performance may include obtaining a large number of inbound links. Performance may include obtaining inbound links from trusted sources (see TRUSTED CONTENT SITE below). Performance may include measurable growth in specific content (as opposed to growth through the addition of random topics).

  • TRUSTED CONTENT SITE, where Google appears to handle changes and additions to the content of an older, well-established, large-content site better than changes and additions made to a smaller, younger site. Or, where Google appears to confer a status or reputation upon a site due to its top-level domain (in particular, sites with .EDU, .GOV, and .MIL top-level domains now seem to be treated as more than equals with other sites).

  • LISTING INHERITANCE, where Google appears to transfer the search results positioning of one page to another page because the first (older) page is redirecting to the second (newer) page. The relative difference in page origination dates may be a factor. That is, an older page does not appear to replace a newer page.

  • CHILD INHERITANCE, where Google appears to confer a measure of importance to a page newly added to a large content site. The child page may be ranked in search results in part according to criteria associated with its parent page or related pages (siblings) from the same site. A child page may therefore be deemed as important and valuable a resource as a parent page. Children of TRUSTED CONTENT SITES are most likely to inherit parent REPUTATION.

  • TIMERANK, where Google appears to measure a site's value by accumulating timestamps or measurements of timestamps over a period of six to twelve months.


With respect to REPUTATION, in "Googleness" I asked, "how much data can Google track for a Web site?" The question may have been answered in part by the subsequent release of a patent application titled Information Retrieval Based on Historical Data. The Abstract describes the methodology as:
"A system identifies a document and obtains one or more types of history data associated with the document. The system may generate a score for the document based, at least in part, on the one or more types of history data."
This patent application also supports the concept of TIMERANK (referred to as AGERANK in some other discussions, and possibly by other names elsewhere). A thorough summary of the patent application presents in plain English an "analysis and interpretation of 63 patent components", many of which are similar in nature to the concepts of TIMERANK and REPUTATION.


This page is Copyright © 2005-2006 Michael L. Martinez. All Rights Reserved. No portions of this document may be reproduced electronically or otherwise without express written permission, except as occurs through normal browser caching or search engine indexing. Original document copyrights remain those of their respective owners.
The page was created by Michael Martinez.