
E-A-T: Classification of websites via vector space analysis according to authority and expertise

This is a translated post by Bill Slawski. The original is here.


Classification of websites according to E-A-T

Google writes in a patent that it may use vector space analysis to classify websites based on features found on those websites.

This post is about a new Google patent application that was filed in August 2018 and published in early 2020 by the World Intellectual Property Organization (WIPO).

The patent application describes using neural networks to learn patterns and characteristics of websites and to classify those websites.

The classification system described in it assigns each website a representation vector that is used to classify the website within a certain knowledge domain.

These domains of knowledge can be topics such as health, finance, and others. Sites classified in certain knowledge domains can have an advantage in ranking.

These website classifications can be more granular than simple topical categories within a knowledge domain. The patent breaks the categories down much further:

For instance, the website classifications may include a first category of websites authored by experts in the knowledge domain, e.g., doctors, a second category of websites authored by apprentices in the knowledge domain, e.g., medical students, and a third category of websites authored by laypersons in the knowledge domain.

I remember discussions in the SEO industry about the Google Quality Rater Guidelines and the references in them to E-A-T, or expertise, authority, and trustworthiness. The QRG point to health websites with different levels of E-A-T, similar to the classifications from this new Google patent application about website representation vectors:

High E-A-T medical advice should be written or prepared by a person or organization with appropriate medical expertise. High E-A-T medical advice or information should be written in a professional style and should be edited, reviewed, and updated regularly.

The guidelines indicate that there are sites that have been created by people with less expertise on topics:

It's even possible to have everyday expertise in YMYL topics. For example, there are forums and support pages for people with specific diseases. Sharing personal experience is a form of everyday expertise. Consider this example: here, forum participants tell how long their loved ones lived with liver cancer. This is an example of sharing personal experiences (in which they are experts), not medical advice. Specific medical information and advice (rather than descriptions of life experiences) should come from doctors or other health professionals.

The classifications include an expert level, an apprentice level and a lay level.

These classifications are based on different levels of expertise. The patent says that it also classifies pages on the basis of authority, but it says nothing about trustworthiness, so it does not describe a full classification of websites based on E-A-T. The process captures two aspects of E-A-T, so it can achieve part of the goal of the Quality Rater Guidelines, which ask human reviewers to rate whether well-ranked pages show a high level of authority and expertise.

If this process also limits the set of websites from which Google returns results based on the knowledge domain they belong to, Google has to search fewer websites to answer a query than if it searched its entire web index. Let's take a closer look at the process behind this patent application.

The process divides websites into specific knowledge domains and tries to find different types of websites within those specific knowledge domains:

  • Receiving website representation vectors and quality ratings that represent quality measurements of websites compared to other websites
  • Classifying first websites, each of which has a quality rating below a first threshold
  • Classifying second websites, each of which has a quality rating above a second threshold that is higher than the first threshold
  • Generating a first combined sample representation from the websites classified as first websites
  • Generating a second combined sample representation from the websites classified as second websites
  • Receiving another website
  • Determining a first difference measure between the first combined sample representation and the representation of the other website
  • Determining a second difference measure between the second combined sample representation and the representation of the other website
  • Classifying the other website on the basis of the first and second difference measures, or into a class that matches neither
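
The patent leaves the exact form of the combined representation and the difference measure open. A minimal sketch, assuming the combined sample representation is simply the mean of a group's vectors and the difference measure is Euclidean distance (both are my assumptions, as are the toy data and the max_distance cutoff), could look like this:

```python
import numpy as np

def combined_representation(vectors):
    """Summarize a group of website vectors; here simply their mean (an assumption)."""
    return np.mean(vectors, axis=0)

def classify_website(vector, rep_low, rep_high, max_distance=2.0):
    """Assign a website vector to the low-quality or high-quality group,
    or to neither if it is far from both combined representations."""
    d_low = np.linalg.norm(vector - rep_low)    # first difference measure
    d_high = np.linalg.norm(vector - rep_high)  # second difference measure
    if min(d_low, d_high) > max_distance:       # matches neither group
        return "unclassified"
    return "low_quality" if d_low < d_high else "high_quality"

# Toy data: website representation vectors paired with quality ratings.
rng = np.random.default_rng(0)
sites = [(rng.normal(loc, 0.3, size=8), q)
         for loc, q in [(0.0, 0.2), (0.1, 0.3), (1.0, 0.8), (0.9, 0.9)]]

threshold = 0.5
low_group = [v for v, q in sites if q < threshold]    # "first websites"
high_group = [v for v, q in sites if q >= threshold]  # "second websites"

rep_low = combined_representation(low_group)
rep_high = combined_representation(high_group)

new_site = rng.normal(0.95, 0.3, size=8)  # representation of "another website"
print(classify_website(new_site, rep_low, rep_high))
```

The point of the sketch is only the shape of the flow: group by quality rating, summarize each group once, then classify new sites against the summaries instead of against every known website.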

 

Search queries require results from identified sources of knowledge

The patent application shows that the method uses terms from the search query to determine that the query requires data from a particular knowledge domain.

The query can then be answered from that particular knowledge domain. The procedure includes:

  • Generating preprocessed answers for future search queries from the authoritative sources
  • Receiving a search query that is directed at the particular knowledge domain
  • Answering the search query with one of the preprocessed answers
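
A rough sketch of that three-step flow, assuming a simple keyword lookup maps query terms to a knowledge domain (the keyword lists, answer strings, and function names below are illustrative, not from the patent):

```python
# Step 1: answers prepared ahead of time from sources classified as authoritative.
PREPROCESSED_ANSWERS = {
    "health": "Answer prepared from sites classified as medical-expert authored.",
    "finance": "Answer prepared from sites classified as finance-expert authored.",
}

DOMAIN_KEYWORDS = {
    "health": {"symptom", "treatment", "cancer", "diagnosis"},
    "finance": {"mortgage", "interest", "loan", "investment"},
}

def answer_query(query: str) -> str | None:
    """Steps 2 and 3: detect the knowledge domain from the query terms and
    return a preprocessed answer if one exists."""
    terms = set(query.lower().split())
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if terms & keywords:
            return PREPROCESSED_ANSWERS.get(domain)
    return None  # fall back to the regular search pipeline

print(answer_query("best treatment for knee symptom"))
```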

Advantages of using vector space analysis

The search engine can select data, search data, or both, only for websites with a certain classification. This reduces the computational resources required to find search results, for example by not selecting or searching websites that do not match the classification. This can:

  • Reduce the storage space required to store data for potential search results, e.g. only data for websites with a certain classification may need to be stored
  • Reduce the number of websites the search system has to analyze, e.g. by restricting a search to websites with the specific classification
  • Reduce the bandwidth used to deliver search results to a requesting device
  • Address potential issues with previous systems, such as increased use of bandwidth, memory, or processor cycles, degraded performance, or a combination of two or more of these
  • Improve the search result pages generated by a search system by including only websites with a certain classification, e.g. a quality-based classification, in the generated result pages
  • Use patterns or characteristics learned from existing websites to classify previously unseen websites, without the need for manual user ratings or signals
  • Identify websites that are more likely to match search queries for a knowledge domain, e.g. that are more likely to be authorities for that domain, by classifying previously unseen websites
  • Use combined representations based on existing website classifications, which means the features used for classification are not limited to human-perceptible features and can be learned algorithmically by analyzing the website

Note that the patent says this helps identify sites that are authoritative for various knowledge domains.

The website representation vector patent application can be found here:

Website Representation Vector to Generate Search Results and Classify Website
Publication number: WO 2020/033805
Applicant: GOOGLE LLC
Inventor: Yevgeny Tsykynovsky
Filed: August 10, 2018
Published: February 13, 2020

Abstract:

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using website representations to generate, store, or both, search results. One of the methods includes receiving data representing each website in a first plurality of websites associated with a first knowledge domain of a plurality of knowledge domains and having a first classification; receiving data representing each website in a second plurality of websites associated with the first knowledge domain and having a second classification; generating a first composite representation of the first plurality of websites; generating a second composite representation of the second plurality of websites; receiving a representation of a third website; determining a first difference measure between the first composite representation and the representation; determining a second difference measure between the second composite representation and the representation; and based on the first difference measure and the second difference measure, classifying the third website.

How data from the classification system can be used

The search engine can use data from this website representation vector classification system to return search results.

The classification system can use representations for each of a number of websites A-N and use those representations to determine a classification for each of the websites A-N.

For a search query, the search engine can use the classification to select a category of websites with the same or a similar classification.

It can then return search results from this category of websites.

The classifications of these websites are based on the feature patterns that the websites contain.
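
As a rough illustration of that selection step, assuming classifications have already been computed and stored (the Site structure, example URLs, and field names are invented for this sketch):

```python
from dataclasses import dataclass

@dataclass
class Site:
    url: str
    domain: str          # knowledge domain, e.g. "health"
    classification: str  # e.g. "expert", "apprentice", "layperson"

INDEX = [
    Site("https://example-clinic.test", "health", "expert"),
    Site("https://example-forum.test", "health", "layperson"),
    Site("https://example-bank.test", "finance", "expert"),
]

def candidates_for(query_domain: str, wanted: set[str]) -> list[Site]:
    """Return only sites whose classification matches the one selected for
    the query, shrinking the pool the ranking stage has to score."""
    return [s for s in INDEX
            if s.domain == query_domain and s.classification in wanted]

print(candidates_for("health", {"expert"}))
```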

How are website classifications generated?

This was the part of the description of the patent that interested me the most.

This part begins by saying that the website representation vector classification system could use any suitable method of generating classifications, which gives Google great flexibility.

But it then goes further, telling us that the classification can be based on the content of websites, which is used to generate representations of those websites.

This content can include:

  • Text from the website
  • Images on the website
  • Other website content, e.g. links
  • Or a combination of two or more of these elements

The patent then provides details on how a neural network is incorporated:

The website classification system may use a mapping that maps the website content for website A to a vector space and thereby identifies a representation for website A.

For instance, the website classification system may use a neural network that represents the mapping to create a feature vector A that represents website A, using the content of website A as input to the neural network.
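
As a loose analogy rather than Google's actual model, a tiny feed-forward mapping from a bag-of-words view of a page's text to a fixed-length feature vector might look like this (vocabulary, layer sizes, and random weights are placeholders; in practice the weights would be learned):

```python
import numpy as np

VOCAB = ["artificial", "intelligence", "doctor", "patient", "loan", "interest"]
HIDDEN, OUT = 16, 8
rng = np.random.default_rng(42)
W1 = rng.normal(size=(len(VOCAB), HIDDEN))  # stand-in for learned weights
W2 = rng.normal(size=(HIDDEN, OUT))

def website_vector(text: str) -> np.ndarray:
    """Map website content to a point in a vector space (a 'feature vector A')."""
    counts = np.array([text.lower().count(word) for word in VOCAB], dtype=float)
    hidden = np.tanh(counts @ W1)   # one hidden layer representing the mapping
    return hidden @ W2

vec = website_vector("Artificial intelligence helps the doctor and the patient.")
print(vec.shape)  # (8,)
```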

Labels used in website classification

The website classification can be based on the use of labels. The labels:

  • Can be alphanumeric, numeric, or alphabetic characters, symbols, or a combination of two or more thereof
  • May identify the type of company that published the relevant website, such as a nonprofit or for-profit company
  • May indicate an industry that is described on the website, e.g. artificial intelligence or education
  • Can identify the type of person who authored the content, such as a doctor, medical student, or layperson
  • Could also be scores that represent a website classification
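
Collecting those possibilities into a single illustrative record (the field names and example values are mine, not from the patent) might look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebsiteLabel:
    raw: str               # alphanumeric/symbol label, e.g. "health-expert-01"
    publisher_type: str    # e.g. "nonprofit" or "for-profit"
    industry: str          # e.g. "artificial intelligence", "education"
    author_type: str       # e.g. "doctor", "medical student", "layperson"
    score: Optional[float] # optional rating representing the classification

label = WebsiteLabel("health-expert-01", "nonprofit", "healthcare", "doctor", 0.92)
print(label)
```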

The scores for the classifications:

  • Can be compared against various thresholds to assign categories
  • Can be specific to a particular knowledge domain
  • Can be used to classify websites that cover more than one knowledge domain
  • Can be used to select websites that provide an answer to multiple searches for specific knowledge domains
  • Can correspond to the authority of the respective website for the respective knowledge domain
  • Or a combination of the above

Input data used to classify websites can relate to the following things, for example:

  • The position of certain words relative to one another, e.g. that the word "artificial" generally appears near the word "intelligence"
  • Certain phrases found on the website
  • For each of the classifications A-B, a difference measure or a similarity measure that represents the similarity between the respective classification and the other website
  • The classification A-B that is most similar
  • The classification A-B with the highest similarity, or with the shortest distance between the other feature vector and the respective mean feature vector A-B, to name a few examples
  • A ratio between two similarity measures for choosing a classification for the other website (see the sketch below)
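
For that last item, a ratio test could be sketched as follows (the margin value is arbitrary; the patent does not specify one):

```python
def choose_by_ratio(sim_a: float, sim_b: float, margin: float = 1.2) -> str:
    """Pick classification A or B only when one similarity clearly dominates
    the other; otherwise leave the website unassigned."""
    if sim_b == 0 or sim_a / sim_b >= margin:
        return "A"
    if sim_a == 0 or sim_b / sim_a >= margin:
        return "B"
    return "undecided"

print(choose_by_ratio(0.81, 0.55))  # -> "A"
```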

The patent offers several other ways in which input data can be considered during the classification process.

Quality ratings that can be used to classify a website can reflect the following:

  • Authority
  • Responsiveness to a specific knowledge domain
  • Another feature of the website
  • Or a combination of two or more of these

Takeaways from the patent

  • Websites can be classified based on text, images, and links
  • Classified site quality ratings can indicate a site's authority or relevance to a particular knowledge domain, or both
  • Labels used to classify sites could include information about the company behind a site, the industry described on the site, and the type of person who wrote the content
  • A website can be classified as covering more than one knowledge domain

If you want to know more about E-A-T and how to optimize the signals that lead to a better E-A-T classification, I recommend the article "E-A-T optimization: How do you optimize E-A-T on Google?"
