Google Data Leak Reveals Inner Workings of Ranking Algorithm

TL;DR

Foreword

Recently, a significant amount of internal documentation on Google Search’s Content Warehouse API was exposed, shedding light on key factors that Google considers when ranking content. The leaked documents were made public on GitHub on March 13th by an automated bot named yoshi-code-bot. They were also shared with Rand Fishkin, co-founder of SparkToro, earlier this month.

Here is a top-level summary of the core takeaways from the leak:

Ranking Features

2,596 modules are represented in the API documentation with a total of 14,014 attributes. These modules are related to components of YouTube, Assistant, Books, video search, links, web documents, crawl infrastructure, an internal calendar system, and the People API.

Ranking Weights

The documentation does not specify the weighting of ranking features, only their existence.

Twiddlers

Twiddlers are re-ranking functions that can adjust a document’s information retrieval score or alter its ranking. They can also impose category constraints to promote diversity by limiting the type of results, such as allowing only a certain number of blog posts in a search engine results page (SERP).

Here are some boosts identified in the docs:

NavBoost
QualityBoost
RealTimeBoost
WebImageBoost
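
To make the idea of a re-ranking function more concrete, here is a minimal sketch in Python. It is purely illustrative: the Result class, the category labels, the scores and both function names are assumptions made up for this example, and nothing here reflects how Google actually implements Twiddlers or any of the boosts listed above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    category: str  # hypothetical labels such as "blog", "news", "product"
    score: float   # stand-in for an initial information retrieval score


def boost_twiddler(results, category, factor):
    """Return new results with the scores of one category multiplied by factor."""
    return [
        Result(r.url, r.category, r.score * factor if r.category == category else r.score)
        for r in results
    ]


def category_cap_twiddler(results, category, max_count):
    """Sort by score and keep at most max_count results of the given category."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    kept, seen = [], 0
    for r in ranked:
        if r.category == category:
            if seen >= max_count:
                continue  # constraint reached: drop further results of this type
            seen += 1
        kept.append(r)
    return kept


results = [
    Result("a.example/post", "blog", 0.91),
    Result("b.example/post", "blog", 0.88),
    Result("c.example/post", "blog", 0.85),
    Result("d.example/story", "news", 0.80),
    Result("e.example/item", "product", 0.70),
]

# Boost news results, then allow only two blog posts on the page.
reranked = category_cap_twiddler(boost_twiddler(results, "news", 1.2), "blog", 2)
print([r.url for r in reranked])
```

In this toy example the boosted news story moves to the top and the third blog post is dropped, which mirrors the two behaviours described above: adjusting a score and constraining the mix of result types.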

Demotions

Content can be demoted for a variety of reasons, such as:

Anchor mismatch - A link doesn’t match the target site it’s linking to
SERP demotion - Signals from the results page indicate user dissatisfaction
Nav demotion - Applied to pages that exhibit poor navigation practices or user experience issues
Product reviews incorrectly implemented
Location - ‘Global’ and ‘super global’ pages can be demoted, which suggests Google attempts to associate pages with a location and ranks them accordingly
Exact match domains

Change History

Google keeps a copy of every version of every page it has ever indexed, meaning it can “remember” every change ever made to a page. However, it only uses the last 20 changes of a URL when analysing links.
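
The difference between storing everything and using only the most recent changes can be sketched with a bounded queue. This is an illustration only: the UrlHistory class and its fields are hypothetical, and the only figure taken from the leak coverage is the limit of 20.

```python
from collections import deque

MAX_VERSIONS_FOR_LINKS = 20  # the limit reported above


class UrlHistory:
    """Hypothetical store: a full archive plus a window of recent versions."""

    def __init__(self):
        self.all_versions = []  # every version ever recorded
        self.recent = deque(maxlen=MAX_VERSIONS_FOR_LINKS)  # oldest entries fall off

    def record(self, snapshot):
        self.all_versions.append(snapshot)
        self.recent.append(snapshot)


history = UrlHistory()
for i in range(1, 26):
    history.record(f"version-{i}")

print(len(history.all_versions))  # 25
print(len(history.recent))        # 20
print(history.recent[0])          # version-6
```

After 25 recorded changes the archive still holds all of them, while the window used for analysis starts at the sixth version.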

Links

Links continue to hold significant importance, and Google’s PageRank algorithm remains a key factor in its ranking features. The presence of these factors does not necessarily contradict previous statements by Google spokespeople regarding the importance of links in ranking. It is possible for multiple factors to play a role simultaneously. The weighting of these features remains undisclosed.

Successful user clicks are also considered important. To achieve high rankings, it is essential to consistently produce high-quality content and prioritise user experiences, as indicated in the documents. Google utilises various metrics, including badClicks, goodClicks, lastLongestClicks, and unsquashedClicks.

Lengthier documents might be cut off, whereas shorter content is assigned a score ranging from 0 to 512 based on its originality. Scores are also assigned to Your Money or Your Life (YMYL) content, such as health and news.

Google retains author information associated with content and tries to determine if an entity is the author of a document.

SiteAuthority: The documents reference a site-level attribute called “siteAuthority”.

Google indicated the existence of this concept in 2011 after the launch of the Panda update, publicly stating that “low-quality content on a section of a site can impact the site’s overall ranking.” However, Google has denied the existence of a website authority score since that time.

Chrome data: A module known as ChromeInTotal indicates that Google incorporates data from its Chrome browser for ranking purposes.

Whitelists: A couple of modules (isElectionAuthority and isCovidLocalAuthority) indicate that Google whitelists certain domains related to elections and COVID. This is consistent with what has been common knowledge for some time: Google and Bing maintain exception lists for specific algorithms that unintentionally impact websites.

Other interesting findings:

Freshness matters: Google considers dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate).
To determine whether a document is or isn’t a core topic of the website, Google vectorises pages and sites, then compares the page embeddings (siteRadius) to the site embeddings (siteFocusScore).
Google stores domain registration information (RegistrationInfo).
Importance of page titles: Google has a feature called titlematchScore which evaluates the alignment between a page title and a search query.
Google analyses the average weighted font size of terms in documents (avgTermWeight) and anchor text for ranking purposes.
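
The page-versus-site comparison mentioned in the list above can be illustrated with a generic embedding sketch. The leaked documentation names siteRadius and siteFocusScore but does not explain how they are calculated, so everything below is an assumption: the three-dimensional vectors are invented, the site vector is simply the average of the page vectors, and cosine distance is used as one common way to measure how far a page sits from the site's overall topic.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Invented page embeddings for a hypothetical SEO-focused site.
page_embeddings = {
    "/seo-guide": [0.9, 0.1, 0.0],
    "/link-building": [0.8, 0.2, 0.0],
    "/banana-bread": [0.0, 0.1, 0.9],
}

site_vector = centroid(list(page_embeddings.values()))

# Distance of each page from the site's average topic (0 means identical direction).
distance = {
    path: 1.0 - cosine_similarity(vector, site_vector)
    for path, vector in page_embeddings.items()
}

off_topic = max(distance, key=distance.get)
print(off_topic)  # /banana-bread
```

In this toy data the recipe page sits furthest from the site's average vector, which is the kind of signal that could mark a page as outside a site's core topic.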


Sepideh Masihpour

Content Marketing Executive
