Google: Changes in Google Ranking Strategies page 2




Google: Changes in Google Ranking Strategies

Continued from previous page

NEW CONCEPTS

One of the ideas expressed in the patent application is STALENESS (I will refer to this as FIXED CONTENT). The application suggests that stale information (formerly active content which has become fixed) may be preferable with respect to some queries. I have found evidence that this may indeed be true of queries concerning movies. Web sites devoted to highly anticipated movies have undergone fundamental shifts in search results since the beginning of February and March. For movie franchises which have completed their initial release cycles (trailers-in-theaters, release-in-theaters, release-on-DVD), FIXED CONTENT now ranks more highly than active content, with notable exceptions that may be explained by REPUTATION and TRUSTED CONTENT SITE.

Another idea expressed in the application is what I will call REASSOCIATION, in which a document which was originally relevant to one expression becomes more relevant to another. I have tested the principle of REASSOCIATION by altering a subset of on-network links to a specific page which had been hard-branded for an older but related expression. Only a small number of changes in link anchor text were required to propel the older site, actively being updated, into the top ten results for the new expression. The majority of inbound links for the site, both on-network and off-network, continue to use the older anchor texts established over a period of several years.

The patent distinguishes between the aging factors for documents and the aging factors for links. The implication is that Google has moved toward a link object paradigm. A link has properties (equivalent to function methods) associated with it. For example, there may be the LINK SPACE (the document on which the link appears), the LINK RANGE (the document to which the link points), the LINK TEXT (usually called "anchor text", but this may be explicit text or text embedded in another element, such as an IMG tag), the LINK SCOPE (the current age of the LINK RANGE or LINK TEXT), and the LINK WEIGHT (the importance of the LINK SPACE). There could also be meta-properties, such as STATE (is the link pointing to an active page?), REDUNDANCY (is the link a duplicate within the LINK SPACE?), RICHNESS (does the link include attributes?), or STYLE (is it a text link, a JavaScript link, or an image link?).
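To make the link object idea concrete, here is a minimal Python sketch of such a record. The class, field, and function names are my own illustrative assumptions; nothing in the patent application prescribes this layout, and LINK SCOPE is modeled here simply as the number of days since the link was first seen.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class LinkStyle(Enum):
    """STYLE meta-property: how the link is expressed on the page."""
    TEXT = "text"
    JAVASCRIPT = "javascript"
    IMAGE = "image"


@dataclass(frozen=True)
class Link:
    space_url: str                  # LINK SPACE: document on which the link appears
    range_url: str                  # LINK RANGE: document to which the link points
    text: str                       # LINK TEXT: anchor text or embedded text
    first_seen: date                # basis for LINK SCOPE (assumed: first crawl date)
    weight: float                   # LINK WEIGHT: importance of the LINK SPACE
    style: LinkStyle = LinkStyle.TEXT
    target_is_active: bool = True   # STATE
    has_attributes: bool = False    # RICHNESS

    def scope_in_days(self, today: date) -> int:
        """LINK SCOPE expressed as the link's age in days."""
        return (today - self.first_seen).days


def redundant_links(links):
    """REDUNDANCY: links that repeat an earlier (space, range) pair."""
    seen = set()
    duplicates = []
    for link in links:
        key = (link.space_url, link.range_url)
        if key in seen:
            duplicates.append(link)
        else:
            seen.add(key)
    return duplicates
```

Treating the link as its own record, rather than as a fragment of its parent document, is what would allow a link to age on a different schedule from the page that carries it.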

By separating links from their parent documents for evaluation, Google would be able to determine if a document's content is being used to manipulate the relevance of another document. For example, if the document.text remains unchanged but its associated links change periodically, the document may be serving as a pointer to active content. There are, however, legitimate reasons for links to change. A directory which is actively updated will, on occasion, change the properties of its links due to Webmaster update requests or editorial decisions. A site map may change the properties of its links as URLs are altered or replaced. A forum or shopping site which makes use of session IDs may change its link content frequently.
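The check described above (fixed text, changing links) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the Snapshot record, the use of a SHA-256 fingerprint, and the function names are hypothetical and are not taken from the patent application.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One crawl of a document: its text and its outbound links."""
    text: str
    links: frozenset  # set of (target_url, anchor_text) pairs


def text_fingerprint(text: str) -> str:
    """Stable fingerprint of the document text, ignoring its links."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def looks_like_pointer(snapshots) -> bool:
    """True if the text never changed across crawls but the links did."""
    if len(snapshots) < 2:
        return False
    fingerprints = {text_fingerprint(s.text) for s in snapshots}
    link_sets = {s.links for s in snapshots}
    return len(fingerprints) == 1 and len(link_sets) > 1
```

A flag like this could only be a starting point, since directories, site maps, and session-ID sites would trip it for innocent reasons.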

PRODUCTIVE HISTORY would be one way to describe the idea of tracking a site's performance in search results. Since millions of queries are conducted across Google each day, the simplest method of measuring a URL's performance would be to create a ranking vector which records every position from 1 to 1000 at which a URL is returned (without regard for the queries). This could lead to an exorbitant amount of data for extremely popular sites, but the daily vector could be averaged and then stored in a monthly vector. The monthly vector would have up to 31 elements, of which the first 28 would be most significant (alternatively, Google could just go with an artificial 28-day month to match its approximate weekly update cycles and allow for a 4-week rebuild process). The average of a URL's vector performances could be used as a measure of the site's popularity, scope of content, and general importance.
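The averaging scheme just described can be sketched as follows. This is a hypothetical illustration of my own, assuming a 28-day month and a simple arithmetic mean; the function names are invented, and nothing here is confirmed as Google's method.

```python
from statistics import mean


def daily_average(positions):
    """Average of every position (1 to 1000) a URL was returned at in one day."""
    valid = [p for p in positions if 1 <= p <= 1000]
    return mean(valid) if valid else None


def monthly_vector(daily_positions, days=28):
    """Collapse each day's positions to one average per day.

    daily_positions maps a day number (1-based) to the list of positions
    recorded that day. Days with no appearances are stored as None.
    """
    return [daily_average(daily_positions.get(day, []))
            for day in range(1, days + 1)]


def performance(vector):
    """Average of the daily averages, skipping days with no data."""
    observed = [v for v in vector if v is not None]
    return mean(observed) if observed else None
```

For example, `monthly_vector({1: [2, 4], 2: [10]}, days=3)` gives `[3, None-free averages for days 1 and 2, and None for day 3]`, that is, `[3, 10, None]`, and `performance` of that vector is 6.5. A lower number indicates a URL that tends to appear nearer the top of whatever result sets it appears in.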

But should Google attempt to normalize URLs? How is Google to distinguish between Web hosting domains like Geocities and large content domains like Xenite.Org?

RELEVANCE

Relevance has long been a special factor in Google's methodology. When fulfilling a query, Google creates a list of up to 1,000 documents deemed to be "relevant" to the query. Relevance is determined by a combination of on-page factors (words in the document) and off-page factors (anchor text of links pointing to the document, directory descriptions). A document's relevance score is calculated based on the query. After the results set is sorted by relevance, other factors are used to refine the sorting.

The patent application classifies search engines by their methods of determining relevance:
[0007] Ideally, a search engine, in response to a given user's search query, will provide the user with the most relevant results. One category of search engines identifies relevant documents based on a comparison of the search query terms to the words contained in the documents. Another category of search engines identifies relevant documents using factors other than, or in addition to, the presence of the search query terms in the documents. One such search engine uses information associated with links to or from the documents to determine the relative importance of the documents.


What seems especially important here is the final statement, "One such search engine uses information associated with links to or from the documents to determine the relative importance of the documents" (emphasis added). This overview distinguishes between relevance and importance, and only associates importance with the latter stipulated method of determining relevance. Ranking results may be determined more by relevance than by importance, or more by importance than by relevance, or equally by both factors (in addition to other factors taken into consideration).

On the basis of these relationships, we can classify Google's determination of RELEVANCE as WEAK, MODERATE, or STRONG depending upon the weighting of IMPORTANCE. That is, in any results set where the order of the documents is largely determined by RELEVANCE, the RELEVANCE is STRONG. In any results set where the order of the documents is largely determined by IMPORTANCE, the RELEVANCE is WEAK. Where the two factors contribute about equally, the RELEVANCE is MODERATE. The patent application provides an example of how WEAK RELEVANCE may be indicated in the results:
[0033] Ranking component 330 may assign a ranking score (also called simply a "score" herein) to one or more documents in document corpus 340. Ranking component 330 may assign the ranking scores prior to, independent of, or in connection with a search query. When the documents are associated with a search query (e.g., identified as relevant to the search query), search engine 125 may sort the documents based on the ranking score and return the sorted set of documents to the client that submitted the search query. Consistent with aspects of the invention, the ranking score is a value that attempts to quantify the quality of the documents. In implementations consistent with the principles of the invention, the score is based, at least in part, on the history data from history component 320.
Emphasis has been added.

The significance of pre-ordered ranking is that Google may feel certain documents are SUPER-RELEVANT, that is, they are important to a class of related queries regardless of both their present content and the present content of their inbound links' anchor text. One possible use of SUPER-RELEVANCE would be to elevate a news site article to the top of search results rankings if the news site has a history of specific relevance to a general class of queries.

Suppose CBSNEWS.COM has carried approximately 300 stories over the course of the past 5 years dealing with the Star Wars movies. Although CBS' news site may not currently offer any stories about Star Wars, one of its most recent stories on Star Wars may be offered as a highly favored result in a search for star wars movie news (NOTE: the performance of this query may be time-sensitive).

We can infer that CBS is deemed SUPER-RELEVANT for a query on "star wars movie news" because it has several hundred documents which include the expression "star wars" and it is a news site. Other factors may have propelled the CBS March 2005 story to the top of current search results.

SUPER-RELEVANCE may be indicated in other ways. For example, a query about "movies" brings up sites like HOLLYWOOD.COM and ROGEREBERT.COM, but a refined query such as "action movies" or "romantic movies" is dominated by categorical sites such as ABOUT.COM, AMAZON.COM, and dating sites. ABOUT and AMAZON in particular have hierarchical structures which are defined according to concept. These concept sections have frequently appeared in numerous queries.

Google may have inferred that most queries about specific types of movies are associated with specific actions, such as dating or shopping. Hence, any categorical search for movies may be deemed part of the class of queries for movies which have, in the past, led searchers to sites like AMAZON and ABOUT.

SUPER-RELEVANCE may also be applied to query topics. A topic may be deemed relevant to other topics. For example, a search for michael martinez produces many results. The names "michael" and "martinez" are very common, and there are numerous individuals whose names are variations on "michael martinez". But why does Google propel one site to the top of the listings over others? After all, any page about anyone named Michael Martinez should be equally relevant to a query about "Michael Martinez". In this case, one might reasonably expect that Google would propel the most important page to the top of the search results.

At the time of this writing, however, a relatively unimportant page (MartinezPhoto.Com) is preferred over a domain (Xenite.Org). The former has fewer than 10 inbound links and approximately as few mentions on other pages. A query for michael martinez xenite.org produces over 10,000 results, all of which appear to refer to the Michael Martinez of Xenite.Org, as one would expect. The implication is that "Michael Martinez" is closely associated with "Xenite.Org".

But it turns out that "Michael Martinez" is more often associated with "Houston". A search for michael martinez houston returns 1.6 million results, and about half the top 10 results refer to the Michael Martinez of Xenite.Org, who lives (or has lived) in the Houston area.

The original query for "Michael Martinez" is thus deemed relevant to "Michael Martinez of Houston" and "Michael Martinez in Houston", both of which produce "Local results" pointing to the photography studio. It should be noted that "Michael Martinez Houston" returns two Local Results for individuals, neither of whom is the Michael Martinez of Xenite.Org.

A reasonable inference is that many people have performed searches for "Michael Martinez" in conjunction with "Houston", and both on-page and off-page content identify the photography page more closely with Houston than the content pages of Xenite.Org. In appearance, it seems as if Google has applied an aggregation of a class of query result sets to determine the rankings for a specific result set within the class.

If this inference is correct, then we can conclude that Google is indeed using user behavior to modify its ranking algorithm, which implies that search results rankings will be more dependent upon where query results terminate in chains of successive searches than upon other off-page factors (such as inbound link anchor text). Google may be attempting to learn what is relevant by watching how users refine their queries.

While some extremely elaborate hypotheses have been put forward to explain how Google might track user behavior, a simpler method has been overlooked in the literature I have reviewed so far. That is, Google need only store a little bit of information with each query in order to identify any patterns: IP address, query string, date and time, data center, and user-agent. A combination of IP address and user-agent would serve as a "primary key" for indexing the query data. If certain query strings are frequently typed in (and WordTracker has long documented the popularity of queries), Google could optimize its service by storing query strings with pre-determined result sets.

These pre-determined result sets could be updated in an offline indexing process similar to the classic PageRank calculation process. The query tool would then only have to search against a popular queries index, extract a pre-determined result set, and then do some minor manipulation of the results to determine ordering with respect to the specific query.
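The query log and popular queries index described above might look something like the following Python sketch. Everything in it is an assumption made for illustration: the record layout, the normalization rule, the popularity threshold, and the `rank_offline` and `rank_live` callables are hypothetical stand-ins, not anything documented by Google.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryRecord:
    """The small amount of data stored with each query."""
    ip_address: str
    user_agent: str
    query: str
    timestamp: datetime
    data_center: str

    @property
    def user_key(self):
        # IP address plus user-agent serves as the "primary key".
        return (self.ip_address, self.user_agent)


def normalize(query):
    """Fold case and collapse whitespace so equivalent queries match."""
    return " ".join(query.lower().split())


def popular_queries(log, min_count):
    """Queries that occur at least min_count times in the log."""
    counts = Counter(normalize(record.query) for record in log)
    return {q for q, n in counts.items() if n >= min_count}


def build_index(log, min_count, rank_offline):
    """Offline step: pre-determine a result set for each popular query."""
    return {q: rank_offline(q) for q in popular_queries(log, min_count)}


def answer(query, index, rank_live):
    """Query time: use the stored result set if present, else rank live."""
    key = normalize(query)
    if key in index:
        return index[key]
    return rank_live(key)
```

The minor reordering of a stored result set for the specific query is left out of the sketch; it would be applied to whatever `answer` retrieves from the index.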

If Google is pursuing this kind of query analysis and pre-ordering of results, one possible retaliatory consequence would be the evolution of "query spam", by which people with sufficient resources alter the SUPER RELEVANCE of targeted queries by inserting weighted queries into Google's collection stream. Over a period of time, the artificial queries should induce Google's pre-process to reorder its results sets on the basis of new query patterns. However, Google may be banking on the sheer number of natural queries outweighing the impact of artificial queries.







This page is Copyright © 2005-2006 Michael L. Martinez. All Rights Reserved. No portions of this document may be reproduced electronically or otherwise without express written permission, except as occurs through normal browser caching or search engine indexing. Original document copyrights remain those of their respective owners.
The page was created by Michael Martinez.
SE cOnsulting provided the Houston search engine optimization for this page.