A paper with more current research was posted to the new Spider-Food SEO forum on December 21, 2005. "Google Reputation and Trusted Content" looks at the impact of several updates to Google's algorithm during 2005 (February - May, July, and October-November).
Google: Changes in Google Ranking Strategies
INTRODUCTION
Around the middle of December 2004, I realized that Google had begun adding fresh Xenite content to its primary cache on a weekly basis. By submitting newly created content for crawling at different times through the week, I found that anything submitted by Friday would usually appear in the cache (and, by extension, in search results) by the next Monday. The pattern has held consistent through the middle of April 2005.
In February, I wrote that Google displays a secondary cache which is built from its daily crawls and is only used for minimal reporting. I had noticed differences of only a few days between the primary and secondary caches for some of my pages. In fact, I believed that I was only seeing cache from different data centers[1]. However, subsequent observations have led me to conclude that Google is, in fact, maintaining a historical footprint of Web site caches.
Around March 6, 2005, Google's search results began displaying far fewer descriptions and cache links for a broad variety of search queries. I estimate that somewhere between 50% and 75% of the results for my queries lacked cache and description data. Within two weeks of this event, Google began displaying cache and title data from 2004 for many of the uncached sites. Sampling the cache results, I found data from as far back as January 2004. It appears that Google was relying on shards[2] from a period extending back over a year to supply cache and title data.
From about March 10 through April 10, I observed steady updates to the Google cache on a weekly basis. Old data was consistently replaced by new data extracted from recent crawls.
During this period, people reported in several search engine optimization forums that their sites, which had not been crawled by Google for months, were being recrawled. In February, writing about what I perceived to be the two distinct caches in Google's search results, I said: "I believe Google is comparing the pages in primary and secondary cache, and if it finds a difference it reschedules a crawl for the site. The second crawl seems to be what kicks the content of secondary cache into primary cache."
I now think the process is more sophisticated. Google maintains a footprint of reported changes in content. The footprint appears to be established by simple HTTP header requests. Any page which generates a 200 response code from the server is fetched. Any page which generates a 304 (NOT MODIFIED) code from the server is not fetched. The sampling of server response codes helps Google determine whether a site should be recrawled.
By mid-March, Google had substantially reduced the number of full-page requests from Xenite.Org. Xenite's content is largely static, but once or twice a year I usually revise the basic appearance of the site. These revisions usually result in significant changes to on-site navigation, advertising, disclaimers, and cross-linking. The changes do not normally alter the body text of the content. After implementing a partial revision of page layouts across selected portions of Xenite.Org's network, as well as adding new content, I observed increased full-page fetching activity from Google. All the changed content was eventually recrawled and reindexed, although updates to the cache might lag by 9-10 days.
New content, optimized for high placement in rankings with standard on-page factors[3], generally appeared in Google's search results within 5-10 days of being submitted for crawling. The rankings for targeted search terms usually began in the 30s or 40s.
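The response-code sampling described above can be sketched in a few lines. This is only an illustration of the idea, not a description of Google's crawler: the names `PageRecord`, `conditional_headers`, `should_fetch`, `changed_fraction`, and `site_needs_recrawl` are hypothetical, and the 20% recrawl threshold is an arbitrary assumption. The sketch also assumes the crawler sends a conditional request with an `If-Modified-Since` header, since a server only answers 304 to a conditional request.

```python
from dataclasses import dataclass
from email.utils import formatdate
from typing import Dict, List, Optional


@dataclass
class PageRecord:
    """Hypothetical crawler-side record for one URL."""
    url: str
    last_fetched: Optional[float] = None  # Unix time of the last full fetch


def conditional_headers(record: PageRecord) -> Dict[str, str]:
    """Build the headers for a conditional request (standard HTTP behavior)."""
    if record.last_fetched is None:
        return {}  # never fetched before: send an unconditional request
    return {"If-Modified-Since": formatdate(record.last_fetched, usegmt=True)}


def should_fetch(status_code: int) -> bool:
    """A 200 response means fetch the page; a 304 (Not Modified) means skip it."""
    return status_code == 200


def changed_fraction(status_codes: List[int]) -> float:
    """Share of sampled pages that reported a change (status 200)."""
    if not status_codes:
        return 0.0
    changed = sum(1 for code in status_codes if should_fetch(code))
    return changed / len(status_codes)


def site_needs_recrawl(status_codes: List[int], threshold: float = 0.2) -> bool:
    """ASSUMPTION: recrawl once the changed share reaches an arbitrary threshold."""
    return changed_fraction(status_codes) >= threshold


# Example: one of four sampled pages reports a change.
sample = [200, 304, 304, 304]
print(changed_fraction(sample))
print(site_needs_recrawl(sample))
```

Under these assumptions, a largely static site returns mostly 304 responses and attracts few full-page fetches, while a layout revision flips many pages to 200 and pushes the site over the threshold.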
Within 2 weeks, a typical new page would break into the top 10 results for the targeted term. Within 3-4 weeks, many pages were ranked 1st or in the top 5. Competitiveness for search terms varied[4]. Many search expressions rated in the "Not Optimized" range of 1 to 10. The competitiveness of a search expression is not directly related to the level of traffic for that expression. For example, a search for "google" produces only results from Google's Web sites. There is no level of competitiveness because there is no competition in the top 10 listing. Yet Google receives millions of visitors each day.
Recap of "On The Googleness of Being"
There have been several reports in search engine optimization forums that Google is now increasing its cache size per page. That is, the reported cache now exceeds the 101 Kilobyte limit which Google had previously imposed. The increased amount of cache per page and the use of cached data from the previous 14 months imply that Google has substantially increased its server resources. In "On the Googleness of Being" I reported that:
I also introduced several terms to explain Google's apparent behavior. They are:
With respect to REPUTATION, in "Googleness" I asked, "how much data can Google track for a Web site?" The question may have been answered in part by the subsequent release of a patent application titled "Information Retrieval Based on Historical Data". The Abstract describes the methodology as: "A system identifies a document and obtains one or more types of history data associated with the document. The system may generate a score for the document based, at least in part, on the one or more types of history data."
This patent application also supports the concept of TIMERANK (referred to as AGERANK in some other discussions, and possibly by other names elsewhere). A thorough summary of the patent application presents in plain English an "analysis and interpretation of 63 patent components", many of which are similar in nature to the concepts of TIMERANK and REPUTATION.
The page was created by Michael Martinez. This page is Copyright © 2005-2006 Michael L. Martinez. All Rights Reserved. No portions of this document may be reproduced electronically or otherwise without express written permission, except as occurs through normal browser caching or search engine indexing. Original document copyrights remain those of their respective owners.