How Search Engines Crawl and Index Web Content

Movéo Meetup #3

Web content is crawled and indexed before it’s displayed in search results

> Step 1: Search engine spider discovers a page

> Step 2: Search engine decides if content is worthy of inclusion in main index

> Step 3: Content is placed in appropriate index

> Step 4: If content is relevant (according to the search engine’s algorithm), it is displayed in SERPs

Ranking Process

Ideally, search engines want to rank only one version (whichever is the original source) of an article or page. Doing so gives the user variety, rather than multiple instances of the same thing at different URLs (a.k.a. duplicate content).

This is important to consider when creating content or Web pages.
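
One way to spot duplicate content on your own site before a search engine does is to fingerprint each page's text and group the URLs that share a fingerprint. The sketch below is only an illustration: the URLs, the page text, and the helper names `content_fingerprint` and `group_duplicates` are made up for this example, and real search engines use far more sophisticated near-duplicate detection than an exact hash.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing it and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs whose text is identical after normalization."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical pages: the print version repeats the article text.
pages = {
    "http://example.com/article": "How search engines crawl the Web.",
    "http://example.com/article?print=1": "How  search engines\ncrawl the Web. ",
    "http://example.com/about": "About our agency.",
}

duplicate_groups = group_duplicates(pages)
print(duplicate_groups)
```

Each group it prints is a set of URLs competing to be the one version that gets ranked.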

Crawling

Search engines constantly “crawl” the Web for new or updated content

> Search engines generally find new content via links, but you can submit URLs manually

> You can block pages from being crawled (for example, print versions of pages)

> It’s a good idea to have a site architecture that is easy to crawl - use sitemaps

> Make sure robots don’t waste time crawling duplicate content

> Crawl rate is how often, and for how long, your site is crawled
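
Blocking print versions from being crawled is usually done with a robots.txt rule. As a rough sketch, Python's standard `urllib.robotparser` module can check how a rule set would be read by a well-behaved robot. The `/print/` directory and the example.com URLs are assumptions made for illustration, not paths from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all robots out of the print versions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules above allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("http://example.com/article"))
print(is_crawlable("http://example.com/print/article"))
```

The first URL is allowed and the second is blocked, so robots spend their time on the original article rather than its duplicate.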

Search engine robots will spend a limited amount of time crawling your content. Avoid duplicate content to ensure you get the most out of the allotted time the robot is on your site.

To improve your crawl efficiency

> Point more internal links to most important pages

> Get external links

> Review your site architecture

> The “nofollow” attribute asks robots not to follow a link

> Update your content often

> Use sitemaps

> View crawl stats in Webmaster Tools
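
An XML sitemap is simply a list of the URLs you want robots to find. A minimal sketch of generating one with Python's standard library follows. The namespace is the one defined by the sitemaps.org protocol. The `build_sitemap` helper and the example.com URLs are hypothetical, and optional fields such as last-modified dates are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical list of the most important pages on a site.
sitemap_xml = build_sitemap([
    "http://example.com/",
    "http://example.com/services",
])
print(sitemap_xml)
```

The resulting file is typically saved as sitemap.xml at the site root and submitted through Webmaster Tools.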

Improving Crawl Rate

Want to know more about how Google crawls your site? Google Webmaster Tools lets you see how often Google crawls it.

http://www.google.com/webmasters/tools/



Indexing

Avoid duplicate content and shady link building techniques for the best chance of reaching the main index.

Main issues

> Once content is discovered, search engines decide whether or not it’s worthy of their main index

> Duplicate content is a huge concern here - if multiple versions are published, which will be indexed?

> You can work content into the main index if it’s not there already, but this could take months

> Indexation is a good indicator of your Website’s “health” in search engines

> Shady backlinks and other black hat tactics can get your content penalized

> There are ways to prevent indexation

Tools/Solutions

Diagnose your indexing issues

> Site command (site:yourdomain.com)

> Internal link distribution (Webmaster Tools)

> Robots.txt

> Meta-robots tag

> XML sitemaps and HTML site maps

> URL removal in Google Webmaster Tools
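
A meta-robots tag is one of the ways to prevent indexation, and a stray `noindex` is a common reason a page never shows up. The sketch below uses Python's standard `html.parser` to read the directives from a page's HTML. The class and function names and the sample markup are invented for this example, and it only looks at `<meta name="robots">`, not at robot-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of meta-robots directives present in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks not to be indexed.
sample_html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
directives = robots_directives(sample_html)
print("noindex" in directives)
```

Running a check like this over newly published pages is a quick way to catch an accidental `noindex`.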

There are a number of tools that can help you discover crawling and indexing errors. Check them routinely, especially after new content is published.

Use the site command to view indexed pages

Site Command

The site command works in Google, Bing and Yahoo.

site:example.com
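
If you check several domains or sections routinely, the site command can be turned into a search link. This is a small sketch that assumes Google's standard `q` search parameter. The `site_query_url` helper is hypothetical, and the link is meant to be opened in a browser, not scraped.

```python
from urllib.parse import urlencode


def site_query_url(domain: str, path: str = "") -> str:
    """Build a Google search URL for the query site:<domain><path>."""
    query = f"site:{domain}{path}"
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_query_url("example.com"))
print(site_query_url("example.com", "/blog"))
```

Passing a path narrows the check to one section of the site, which helps after publishing new content there.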
