
Transcript of getting_rid_of_duplicate_content_iss-ben_dangelo.ppt

Page 1

Getting Rid of Duplicate Content Issues Once and For All

Ben D’Angelo, Software Engineer

PubCon, Las Vegas, November 13, 2008

Page 2

What are “duplicate content issues”?

Multiple disjoint situations!

• Duplicate content within your site or sites

  – Multiple URLs pointing to the same page, similar pages

  – Different countries (same language)

• Duplicate content across other sites

  – Syndicated content

  – Scraped content

Page 3

Guiding principle

One URL for one piece of content

Why?

• Users don’t like duplicates in results

• Saves resources in our index—more room for other pages from your site!

• Saves resources on your server

Page 4

Sources of duplicates within your sites

• Multiple URLs pointing to the same page

  – www vs non-www

  – Session IDs, URL parameters (see the sketch after this list)

  – Printable versions of pages

  – CNAMEs

• Similar content on different pages

• Manufacturer’s databases

• Different countries
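
To make the "multiple URLs, one page" problem concrete, here is a minimal Python sketch (not part of the original slides) of URL normalization: it drops parameters that do not change what the page shows, so several variant URLs collapse to a single address. The parameter names and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that do not change what the page shows.
IGNORED_PARAMS = {"sid", "sessionid", "utm_source", "utm_medium", "print"}

def normalized(url):
    """Drop non-substantive query parameters and lowercase the host."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))

urls = [
    "http://www.example.com/widgets?sid=12345",
    "http://www.example.com/widgets?utm_source=newsletter",
    "http://www.example.com/widgets",
]
# All three variants collapse to one normalized URL: one page, many addresses.
print({normalized(u) for u in urls})   # a set with a single entry
```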

Page 5

How does Google handle this?

• Many systems for de-duping URLs at various stages in our crawl/index pipeline

  – General idea: cluster pages, choose the “best” representative (see the sketch after this list)

• Different filters are used for different types of duplicate content

• Goal: serve one version of the content in search results

• Generally just a filter: it will not destroy your site
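
As an illustration only (the slides do not describe Google’s actual systems, which are far more involved), a toy version of "cluster pages, choose the best representative" could fingerprint each page body and keep one URL per fingerprint:

```python
import hashlib
from collections import defaultdict

def fingerprint(html):
    """Identical page bodies produce identical fingerprints."""
    return hashlib.md5(html.strip().lower().encode("utf-8")).hexdigest()

def dedupe(pages):
    """pages maps URL -> HTML; return one representative URL per content cluster."""
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[fingerprint(html)].append(url)
    # "Best" is a stand-in here: simply prefer the shortest URL in each cluster.
    return [min(urls, key=len) for urls in clusters.values()]

pages = {
    "http://www.example.com/widgets": "<h1>Widgets</h1>",
    "http://www.example.com/widgets?sid=42": "<h1>Widgets</h1>",
    "http://www.example.com/about": "<h1>About us</h1>",
}
print(dedupe(pages))   # two representatives for three URLs
```

Whatever the real implementation looks like, the takeaway from the slide stands: duplicates are filtered out of the results, not penalized.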

Page 6

What can you do about your site?

• For exact dupes: 301 redirect (see the sketch after this list)

  – Tracking URLs

  – www vs non-www (also Google Webmaster Tools)

• Near duplicates: noindex / robots.txt

  – Printable pages

  – Clones of other sites

• Domains by country

  – Different languages are not duplicate content

  – Use unique content specific to the country

  – Use different TLDs (also Google Webmaster Tools) for geo-targeting

• URL parameters

  – Put data which does not affect the substance of a page in a cookie
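
A minimal sketch of the "exact dupes: 301" advice, assuming the pages are served by a small Python handler; on Apache or another web server the same canonicalization would normally be a rewrite rule. The preferred host, the tracking-parameter names, and the port are placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit, parse_qsl, urlencode

CANONICAL_HOST = "www.example.com"                            # placeholder preferred host
TRACKING_PARAMS = {"utm_source", "utm_medium", "sessionid"}   # placeholder parameter names

class CanonicalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore the port so the sketch behaves the same on :80 and :8000.
        host = self.headers.get("Host", "").split(":")[0]
        parts = urlsplit(self.path)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k not in TRACKING_PARAMS]
        clean_path = parts.path + ("?" + urlencode(kept) if kept else "")

        if host != CANONICAL_HOST or clean_path != self.path:
            # Permanent redirect: every duplicate URL points at one preferred URL.
            self.send_response(301)
            self.send_header("Location", "http://" + CANONICAL_HOST + clean_path)
            self.end_headers()
            return

        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>One URL for one piece of content</h1>")

if __name__ == "__main__":
    HTTPServer(("", 8000), CanonicalHandler).serve_forever()
```

For the near-duplicate cases on this slide (printable pages, clones), the corresponding move is a robots meta noindex tag, an X-Robots-Tag: noindex response header, or a robots.txt rule that keeps crawlers away from those URLs.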

Page 7

What can you do about your site?

Choose www or non-www as preferred

Page 8

What can you do about your site?

Page 9

What can you do about another site?

• Include original absolute URL in syndicated content (see the sketch after this list)

• Syndicate different content

• If you use syndicated content, manage your expectations

• Don’t worry about scrapers or proxies too much; they generally don’t affect your rankings

  – If you are concerned, file a DMCA request (http://www.google.com/dmca.html) or a spam report (https://www.google.com/webmasters/tools/spamreport)
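
One way to follow the "include the original absolute URL" advice, sketched in Python and not taken from the slides: rewrite relative links to absolute ones before handing the HTML to syndication partners, so every copy points back to the original page. The regex is deliberately naive (a real feed generator would use an HTML parser), and the URLs are hypothetical.

```python
import re
from urllib.parse import urljoin

def absolutize_links(html, original_url):
    """Rewrite href="/foo"-style links as absolute URLs based on original_url."""
    def repl(match):
        return 'href="' + urljoin(original_url, match.group(1)) + '"'
    return re.sub(r'href="([^"]+)"', repl, html)

article = '<p>Read the <a href="/specs">full specs</a>.</p>'
print(absolutize_links(article, "http://www.example.com/widgets/review"))
# href="/specs" becomes href="http://www.example.com/specs"
```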

Page 10

Best practices for Google

• Avoid duplicate URLs / sites

• Generate unique, compelling content for users

• Don’t be overly concerned with duplicate content

• Let us know about any issues at the Webmaster Help Forum

Page 11

Useful links

Webmaster Central: http://google.com/webmasters/

• Webmaster Central Blog: http://googlewebmastercentral.blogspot.com/

• Webmaster Help Center: http://www.google.com/support/webmasters/

• Webmaster Discussion Group: http://groups.google.com/group/Google_Webmaster_Help

Page 12

Thank You!