Datamining the Australian web graph by Frank Vitetta

description

How easy is it to crawl the Australian web graph - or, in other words, crawl all Australian sites? Frank has set himself this challenge and in his talk he will cover web crawling in depth, as well as a number of interesting findings and trends about the Australian web market that he came across along the way.

Transcript of Datamining the Australian web graph by Frank Vitetta

Page 1: Datamining the Australian web graph by Frank Vitetta
Page 2: Datamining the Australian web graph by Frank Vitetta

@[email protected]

Clients we work for

Page 3: Datamining the Australian web graph by Frank Vitetta

Overview

The basic techniques of web crawling

Backlink tools - Moz, Ahrefs and Majestic SEO

Outreachr.com – the tool

An Australian challenge

Insights into the Outreachr database

Owning the data – what you can do with it

Take-aways

Page 4: Datamining the Australian web graph by Frank Vitetta

Web crawling – an introduction

- A web crawler is a computer program that browses the web in a methodical and automated manner.

- They are called crawlers because they crawl through a site one page at a time, following the links to other pages on the site until all pages have been read.

- All major search engines and SEO tools deploy crawlers - also known as "spiders" or "bots".
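
To make the "following the links" step concrete, here is a minimal sketch of link extraction using only the Python standard library. The names `LinkExtractor` and `extract_links` and the sample page are illustrative assumptions, not anything shown in the talk; a real crawler would also fetch pages, respect robots.txt and rate-limit itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so the same page is not queued twice.
                self.links.append(urldefrag(absolute).url)


def extract_links(base_url, html):
    """Return the absolute URLs linked from one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = '<a href="/news">News</a> <a href="http://example.org/a#top">A</a>'
print(extract_links("http://example.com.au/", sample))
```

The returned list is what a crawler would add to its queue of pages still to visit.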

Page 5: Datamining the Australian web graph by Frank Vitetta

Web crawling – an introduction

Breadth First Search

• BFS begins at a root node and inspects all neighbouring nodes.

• For each of those neighbours in turn, it inspects their unvisited neighbours, and so on.

• Assumption: If we start with "good" pages, this keeps us close to other good pages.

• Variations of this algorithm are more memory-efficient and popular in computing.
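
As a rough illustration of the breadth-first idea, the sketch below walks a tiny in-memory link graph with a queue. The graph `toy_web` and the function name `bfs_order` are made up for the example; in a real crawl the neighbours would come from fetching each page and extracting its links.

```python
from collections import deque


def bfs_order(graph, root):
    """Return nodes in the order a breadth-first crawl would visit them."""
    visited = {root}
    queue = deque([root])
    order = []
    while queue:
        node = queue.popleft()          # oldest discovered node first
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)  # mark on discovery to avoid re-queuing
                queue.append(neighbour)
    return order


# Hypothetical link graph: each site maps to the sites it links to.
toy_web = {
    "abc.net.au": ["news.com.au", "smh.com.au"],
    "news.com.au": ["smh.com.au", "example.com"],
    "smh.com.au": ["abc.net.au"],
    "example.com": [],
}
print(bfs_order(toy_web, "abc.net.au"))
```

Everything one link away from the seed is visited before anything two links away, which is why the choice of seed pages matters so much.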

Page 6: Datamining the Australian web graph by Frank Vitetta

Web Crawling – An Introduction

Depth First Search

• A version was investigated in the 19th century by French mathematician Charles Pierre Trémaux as a strategy for solving mazes.

• Algorithm for traversing or searching tree or graph data structures.

• Starts at the root and explores as far as possible along each branch before backtracking.
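
A depth-first traversal can be sketched the same way with a stack instead of a queue. The site structure `toy_site` and the name `dfs_order` are assumptions for illustration only.

```python
def dfs_order(graph, root):
    """Return nodes in the order an iterative depth-first search visits them."""
    visited = set()
    order = []
    stack = [root]
    while stack:
        node = stack.pop()              # most recently discovered node first
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbours in reverse so the first-listed link is followed first.
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


# Hypothetical site: each page maps to the pages it links to.
toy_site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": [],
    "/a/1": [],
    "/a/2": [],
}
print(dfs_order(toy_site, "/"))
```

The whole "/a" branch is exhausted before "/b" is touched, which is the backtracking behaviour described above.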

Page 7: Datamining the Australian web graph by Frank Vitetta

Popular SEO tools

Page 8: Datamining the Australian web graph by Frank Vitetta

Web crawling tools

Page 9: Datamining the Australian web graph by Frank Vitetta

Tool index sizes

[Bar chart: index sizes of Moz, Majestic Fresh and Ahrefs, in billion URLs, million root domains and billion links; axis from 0 to 2,000]

Remember:

- Number of pages per domain

- Number of links per domain

E.g. eBay AU has 80M pages

Page 10: Datamining the Australian web graph by Frank Vitetta

2 years ago we came up with an internal tool to handle outreach

Page 11: Datamining the Australian web graph by Frank Vitetta

We had to come up with a new tool

Page 12: Datamining the Australian web graph by Frank Vitetta

Why?

- Be more efficient in finding the right sites for our clients

- Speed up the contact process

- Outsource some of the most repetitive work (e.g. sending emails, filling in contact forms)

- Work for various clients in various languages

- Codebase ownership = freedom to run custom campaigns

- We don’t want to piss people off! We keep a historical index of who we have contacted in the past.

Page 13: Datamining the Australian web graph by Frank Vitetta

Outreachr.com - how we do it

- Discovery (engine scraping, Twitter, own index)

- Get SEO stats (Moz & PR)

- Social

- Contact extraction (crawling sites, Whois data)

- Sorting algorithm

- New campaign queries
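
The contact extraction step could look something like the sketch below, which pulls e-mail addresses out of crawled page source. This is an assumption about one possible approach, not Outreachr's actual code; the regular expression is deliberately simple and will miss obfuscated addresses, and the Whois lookup is not covered.

```python
import re

# Simplified pattern: good enough for a sketch, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return unique e-mail addresses found in page source, in first-seen order."""
    seen = []
    for match in EMAIL_RE.findall(html):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen


# Hypothetical page fragment.
page = '<a href="mailto:Editor@example.com.au">Contact</a> or sales@example.com.au'
print(extract_emails(page))
```

Running this over every crawled page of a domain and keeping the first hit is one simple way to arrive at a per-domain "email found / no email found" figure.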

Page 14: Datamining the Australian web graph by Frank Vitetta

Outreachr - interface

Page 15: Datamining the Australian web graph by Frank Vitetta

Insights into the Aussie web graph

Page 16: Datamining the Australian web graph by Frank Vitetta

The Australian challenge

Page 17: Datamining the Australian web graph by Frank Vitetta

Step 1 - We started with a small, tight seed set (abc.net.au, news.com.au, theaustralian.com.au and other popular Australian news sites).

After obtaining over 1M URLs and analysing over 8M links, we had found only 90,000 unique domains out of 2.4M registered .au domains.
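
The coverage implied by those numbers is simple arithmetic: 90,000 out of 2.4M is under 4%. The sketch below shows one way the unique-domain count and the coverage ratio might be computed. The helper names and sample URLs are assumptions; note that it counts hostnames, whereas a real crawler would collapse subdomains into registered domains (for example with the Public Suffix List).

```python
from urllib.parse import urlsplit


def unique_au_hosts(urls):
    """Return the set of distinct .au hostnames seen in a list of URLs."""
    hosts = set()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]             # treat www.x and x as the same site
        if host.endswith(".au"):
            hosts.add(host)
    return hosts


def coverage(found, registered):
    """Share of registered domains that the crawl has discovered."""
    return found / registered


# Hypothetical crawl output.
crawled = [
    "http://www.abc.net.au/news",
    "https://abc.net.au/",
    "http://news.com.au/x",
    "http://example.com/",
]
print(sorted(unique_au_hosts(crawled)))
# Figures from the slide: 90,000 unique domains vs 2.4M registered .au domains.
print(f"{coverage(90_000, 2_400_000):.2%}")
```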

The Australian web graph is hard to crawl

Page 18: Datamining the Australian web graph by Frank Vitetta

2012 stats from AusRegistry – 2.4M registered .au domains

Source: http://www.auda.org.au/pdf/ausregistry-q4-1112.pdf

Page 19: Datamining the Australian web graph by Frank Vitetta

Any tool using breadth-first search will struggle to crawl Aussie sites efficiently

Page 20: Datamining the Australian web graph by Frank Vitetta

Australian sites link out to sites all over the world

Page 21: Datamining the Australian web graph by Frank Vitetta

.com.au sites link to .com as much as .com.au

[Diagram: outbound links from domain.com.au by target extension: AU (45), Com (40), Net (24)]
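
A breakdown like this can be produced by tallying the top-level domain of every outbound link. The sketch below is only an illustration with made-up links; the talk does not say how its figures were computed or what unit they are in.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_counts(urls):
    """Count outbound links by the last label of their hostname (au, com, net ...)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            counts[host.rsplit(".", 1)[-1].lower()] += 1
    return counts


# Hypothetical outbound links found on a .com.au site.
outbound = [
    "http://www.abc.net.au/",
    "http://news.com.au/",
    "http://example.com/page",
    "http://cdn.example.net/lib.js",
    "http://another.com/",
]
print(tld_counts(outbound).most_common())
```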

Page 22: Datamining the Australian web graph by Frank Vitetta

So what have we learned from our database?

- Ranking domains (1.5M)

- Breadth-first crawl from ranking domains (2M)

- Twitter domains (0.4M)

Page 23: Datamining the Australian web graph by Frank Vitetta

.edu sites have an avg. PR of 5.49!

[Bar chart: PR for .com, .net, .org and .edu; axis from 0 to 6]

Page 24: Datamining the Australian web graph by Frank Vitetta

Regional level – Australia has the highest avg. PR

[Bar chart: PR for .co.uk, .fr, .au and .nz; axis from 0 to 3.5]

Page 25: Datamining the Australian web graph by Frank Vitetta

Moz loves .org sites

[Bar chart: DA for .com, .net, .org and .edu; axis from 23 to 30]

Page 26: Datamining the Australian web graph by Frank Vitetta

Australia has the highest avg. DA

[Bar chart: DA for .co.uk, .fr, .au and .nz; axis from 0 to 60]

Page 27: Datamining the Australian web graph by Frank Vitetta

Quite a big disparity between PR and DA

[Bar chart: DA for .ac.uk, .co.uk, .fr, .au and .nz; axis from 0 to 60]

[Bar chart: PR for .ac.uk, .co.uk, .fr, .au and .nz; axis from 0 to 6]

Page 28: Datamining the Australian web graph by Frank Vitetta

You need fewer links to rank in Australia

[Bar chart: root domain links for .com, .uk and .au; data labels 84, 67 and 48; axis from 0 to 90]

Page 29: Datamining the Australian web graph by Frank Vitetta

43% success rate in grabbing emails off domains

[Pie chart, AU: email found vs. no email found]

Page 30: Datamining the Australian web graph by Frank Vitetta

50% success rate in grabbing emails off domains

[Pie chart, COM: email found vs. no email found]

Page 31: Datamining the Australian web graph by Frank Vitetta

16% of sites linked to their Facebook page

[Pie chart, AU: link to Facebook page vs. no link found]

Page 32: Datamining the Australian web graph by Frank Vitetta

18% of sites linked to their Facebook page

[Pie chart, COM: link to Facebook page vs. no link found]

Page 33: Datamining the Australian web graph by Frank Vitetta

61% of sites linked to their Twitter page

[Pie chart, AU: link to Twitter page vs. no link found]

Page 34: Datamining the Australian web graph by Frank Vitetta

70% of sites linked to their Twitter page

[Pie chart, COM: link to Twitter page vs. no link found]
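
Percentages like the Facebook and Twitter figures above can be derived by checking each site's outbound links for a social network hostname. The sketch below is one hypothetical way to do it; the names `social_profiles` and `share_with` and the sample data are assumptions. It does not distinguish real profile links from share buttons, which a production version would need to do.

```python
from urllib.parse import urlsplit

# Network name -> domain that identifies it (illustrative, not exhaustive).
SOCIAL_HOSTS = {"facebook": "facebook.com", "twitter": "twitter.com"}


def social_profiles(links):
    """Map each social network to the first link on the site that points to it."""
    found = {}
    for link in links:
        host = (urlsplit(link).hostname or "").lower()
        for network, domain in SOCIAL_HOSTS.items():
            if host == domain or host.endswith("." + domain):
                found.setdefault(network, link)
    return found


def share_with(network, sites):
    """Fraction of sites with at least one link to the given network."""
    hits = sum(1 for links in sites.values() if network in social_profiles(links))
    return hits / len(sites)


# Hypothetical crawl result: site -> outbound links found on it.
sites = {
    "a.com.au": ["https://twitter.com/a", "https://www.facebook.com/a"],
    "b.com.au": ["https://twitter.com/b"],
    "c.com.au": ["https://example.com/"],
    "d.com.au": [],
}
print(share_with("twitter", sites), share_with("facebook", sites))
```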

Page 35: Datamining the Australian web graph by Frank Vitetta

And … domain extension distribution

- 74%: .com.au

- 7%: other .au (net.au, org.au ...)

- 19%: other (com, net ...), usually au.domain.com

Page 36: Datamining the Australian web graph by Frank Vitetta

Owning this data is really cool

Page 37: Datamining the Australian web graph by Frank Vitetta

Analysing ranking pages on Google (e.g. PR, DA, keywords in URL)

How difficult is it to rank, based on the sites we found on the first page?

Page 38: Datamining the Australian web graph by Frank Vitetta

Who are my online SERP competitors?

Based on a keyword set you control and you care about

Page 39: Datamining the Australian web graph by Frank Vitetta

eBay is the most visible site across 17k keywords analysed

Domain                            In top 10   Saturation
ebay.com.au                       2691        25.07
truelocal.com.au                  2308        21.5
yellowpages.com.au                1894        17.65
gumtree.com.au                    1819        16.95
google (images/video/shopping)    1765        16.44
tripadvisor.com.au                1753        16.33
forums.whirlpool.net.au           1392        12.97
productreview.com.au              1208        11.26
myshopping.com.au                 1130        10.53
abc.net.au                        1101        10.26
smh.com.au                        1100        10.25
itunes.apple.com                  1077        10.03
whitepages.com.au                 990         9.22
yelp.com.au                       965         8.99
whereis.com                       893         8.32
news.com.au                       833         7.76
wotif.com                         783         7.3
au.answers.yahoo.com              774         7.21
expedia.com.au                    672         6.26
getprice.com.au                   628         5.85

Compiled by analysing over 100,000 ranking domains
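
The talk does not define "saturation". One plausible reading is the percentage of analysed keywords for which a domain appears in the top 10, and the sketch below computes that. The table's figures look consistent with "In top 10" divided by roughly 10,700 keywords (2691 / 0.2507 ≈ 10,734) rather than the full 17k, so treat the denominator as an open assumption. The function name and sample rankings are hypothetical.

```python
from collections import Counter


def top10_saturation(serps):
    """serps maps keyword -> ordered list of ranking domains.

    Returns domain -> (keywords where it is in the top 10, percentage of keywords).
    """
    in_top10 = Counter()
    for domains in serps.values():
        # Count each domain once per keyword, even if it ranks more than once.
        for domain in set(domains[:10]):
            in_top10[domain] += 1
    total = len(serps)
    return {d: (n, round(100 * n / total, 2)) for d, n in in_top10.items()}


# Hypothetical rankings for four keywords.
serps = {
    "cheap flights": ["wotif.com", "expedia.com.au", "ebay.com.au"],
    "modem router": ["ebay.com.au", "forums.whirlpool.net.au"],
    "used cars": ["gumtree.com.au", "ebay.com.au", "ebay.com.au"],
    "plumber sydney": ["truelocal.com.au", "yellowpages.com.au"],
}
for domain, (count, pct) in sorted(
    top10_saturation(serps).items(), key=lambda item: -item[1][0]
)[:3]:
    print(domain, count, pct)
```

Restricting `serps` to a keyword set you care about gives the "who are my SERP competitors?" view from the previous slide.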

Page 40: Datamining the Australian web graph by Frank Vitetta

Big surprise! Nothing to do with home appliances

- broadband choice

- adsl 2

- microsoft certification

- modem router

Page 41: Datamining the Australian web graph by Frank Vitetta

Take-aways

- If you want to outreach in Australia, you probably need to be on Twitter.

- The top Aussie sites are aggregators (products, reviews or local business) - get listed to increase visibility.

- You are already lucky! You don’t need links from as many root domains as you would in other countries like the UK.

- Use a range of tools, including Open Site Explorer, Ahrefs (ahrefs.com) and MajesticSEO, to check your backlink profile, as no single tool seems to do a great job of indexing the Australian web.

- You need a .com.au to rank in Australia. 19% are .com or similar, but usually with an Australian subdomain (e.g. au.domain.com).

Page 42: Datamining the Australian web graph by Frank Vitetta

@[email protected]@orchidbox.com (Send me a tweet to get free Outreachr pro access for a month!)