Datamining the Australian web graph by Frank Vitetta
Clients we work for
Overview
The basic techniques of web crawling
Backlink tools - Moz, Ahrefs and Majestic SEO
Outreachr.com – the tool
An Australian challenge
Insights into the Outreachr database
Owning the data – what you can do with it
Take-aways
Web crawling – an introduction
- A web crawler is a computer program that browses the web in a methodical and automated manner.
- They are called crawlers because they crawl through a site one page at a time, following the links to other pages on the site until all pages have been read.
- All major search engines and SEO tools deploy crawlers - also known as "spiders" or "bots".
Breadth First Search
• BFS begins at a root node and inspects all neighbouring nodes.
• For each neighbour in turn, it inspects that node's unvisited neighbours, and so on.
• Assumption: If we start with "good" pages, this keeps us close to other good pages.
• Variations of this algorithm are more memory-efficient and widely used in computing.
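The breadth-first strategy described above can be sketched in a few lines. This is a minimal illustration on an in-memory link graph rather than a real crawler: the function name `bfs_crawl`, the `toy_graph` dictionary and the `max_pages` parameter are assumptions made for the example, and no HTTP fetching is performed.

```python
from collections import deque


def bfs_crawl(link_graph, seeds, max_pages=None):
    """Return pages in the order a breadth-first crawler would visit them.

    link_graph is a hypothetical dict mapping each page to the pages it links to.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated seeds, order preserved
    visited = set(queue)
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        page = queue.popleft()  # FIFO queue: oldest discovered page first
        order.append(page)
        for neighbour in link_graph.get(page, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# Toy link graph standing in for real pages (illustrative only).
toy_graph = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}
print(bfs_crawl(toy_graph, ["seed"]))
```

All pages one link away from the seed are visited before any page two links away, which is why a good seed set keeps the crawl close to other good pages.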
Depth First Search
• Invented in the 19th century by French mathematician Charles Pierre Trémaux (as a strategy for solving mazes).
• Algorithm for traversing or searching tree or graph data structures.
• Starts at the root and explores as far as possible along each branch before backtracking.
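For comparison, here is a minimal sketch of depth-first traversal using an explicit stack. The function name `dfs_crawl` and the toy graph are assumptions made for the example; a real crawler would fetch pages instead of reading a dictionary.

```python
def dfs_crawl(link_graph, root):
    """Return pages in depth-first order, starting from root.

    link_graph is a hypothetical dict mapping each page to the pages it links to.
    """
    visited = set()
    order = []
    stack = [root]
    while stack:
        page = stack.pop()  # LIFO stack: most recently discovered page first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is explored first.
        for neighbour in reversed(link_graph.get(page, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


# Toy link graph (illustrative only).
toy_tree = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
print(dfs_crawl(toy_tree, "root"))
```

The only structural difference from a breadth-first crawl is the stack in place of a queue: the branch through "a" is followed to its end before "b" is touched.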
Popular SEO tools
Web crawling tools
Tool index sizes
[Chart: index sizes for Moz, Majestic Fresh and Ahrefs, measured in billion URLs, million root domains and billion links]
Remember:
- Number of pages per domain
- Number of links per domain
E.g. eBay AU has 80M pages
2 years ago we came up with an internal tool to handle outreach
We had to come up with a new tool
- Be more efficient in finding the right sites for our clients
- Speed up the contact process
- Outsource some of the most repetitive work (e.g. sending emails/filling contact forms)
- Work for various clients in various languages
- Codebase ownership = freedom to run custom campaigns
- We don’t want to piss people off! We keep a historical index of who we have contacted in the past.
Why?
Outreachr.com - how we do it
Discovery (engine scraping, Twitter, own index)
Get SEO stats (Moz & PR)
Social
Contact extraction (crawling sites, Whois data)
Sorting algorithm
New campaign queries
Outreachr - interface
Insights into the Aussie web graph
Step 1 - We started with a small, tight seed set (abc.net.au, news.com.au, theaustralian.com.au and other popular Australian news sites). After obtaining over 1M URLs and analysing over 8M links, we had found only 90,000 unique domains out of 2.4M registered .au domains.
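Counting unique domains from a pile of crawled URLs can be sketched as below. This is an illustration, not the Outreachr code: the `AU_SECOND_LEVEL` set is a hand-picked, incomplete list of .au namespaces assumed for the example (a production crawler would more likely rely on the Public Suffix List), and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Assumed, incomplete list of .au second-level namespaces (illustrative only).
AU_SECOND_LEVEL = {"com.au", "net.au", "org.au", "edu.au", "gov.au", "asn.au", "id.au"}


def au_registered_domain(url):
    """Return the registered .au domain of a URL, or None if it is not one."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in AU_SECOND_LEVEL:
        return ".".join(labels[-3:])  # e.g. www.abc.net.au -> abc.net.au
    return None


def unique_au_domains(urls):
    """Collapse a list of URLs into the set of distinct registered .au domains."""
    return {d for d in map(au_registered_domain, urls) if d}


sample_urls = [
    "http://www.abc.net.au/news",
    "http://abc.net.au/",
    "https://www.news.com.au/x",
    "http://example.com/",
]
print(unique_au_domains(sample_urls))

# Coverage implied by the figures on the slide: 90,000 of 2.4M registered domains.
print(f"{90_000 / 2_400_000:.2%}")  # 3.75%
```

In other words, a crawl seeded from the big news sites reached under 4% of registered .au domains.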
The Australian web graph is hard to crawl
2012 stats from AusRegistry – 2.4M registered .au domains
Source http://www.auda.org.au/pdf/ausregistry-q4-1112.pdf
Any tool using breadth-first search will struggle to crawl Aussie sites efficiently
Australian sites link out to sites all over the world
[Diagram: outbound links from domain.com.au by target extension: AU (45), Com (40), Net (24)]
.com.au sites link to .com sites as much as to .com.au sites
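The claim that Australian sites link out all over the world can be checked by tallying the outbound links of a site by target extension. The sketch below assumes the outbound URLs have already been extracted from the crawled pages; the function name `outlink_extensions` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def outlink_extensions(outbound_urls):
    """Count outbound links by the last label of the target host (au, com, net...)."""
    counts = Counter()
    for url in outbound_urls:
        host = (urlsplit(url).hostname or "").lower()
        if "." in host:
            counts[host.rsplit(".", 1)[1]] += 1
    return counts


# Made-up outbound links found on a hypothetical .com.au site.
sample_outlinks = [
    "http://a.com.au/",
    "http://b.com/x",
    "http://c.net",
    "https://d.org.au",
]
print(outlink_extensions(sample_outlinks))
```

The more links that land outside "au", the more of a breadth-first crawler's frontier is spent on non-Australian sites.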
So what have we learned from our database?
Ranking domains (1.5M)
Breadth-first crawl from ranking domains (2M)
Twitter domains (0.4M)
[Chart: average PR for .co.uk, .fr, .au and .nz domains]
Regional level – Australia has the highest average PR
[Chart: DA for .ac.uk, .co.uk, .fr, .au and .nz domains]
[Chart: PR for .ac.uk, .co.uk, .fr, .au and .nz domains]
Quite a big disparity between PR and DA
You need fewer links to rank in Australia
[Chart: root domain links by extension: .com 84, .uk 67, .au 48]
[Chart: link to Facebook page vs. no link found]
18% of sites linked to their Facebook page
And … domain extension distribution
- .com.au: 74%
- other .au (net.au, org.au ..): 7%
- other (com, net ..), usually au.domain.com: 19%
Owning this data is really cool
Analysing ranking pages on Google (e.g. PR, DA, keywords in URL)
How difficult is it to rank, based on the sites we found on the first page?
Who are my online SERP competitors?
Based on a keyword set you control and you care about
eBay is the most visible site across the 17k keywords analysed
Domain                            In top 10   Saturation
ebay.com.au                       2691        25.07
truelocal.com.au                  2308        21.5
yellowpages.com.au                1894        17.65
gumtree.com.au                    1819        16.95
google (images/video/shopping)    1765        16.44
tripadvisor.com.au                1753        16.33
forums.whirlpool.net.au           1392        12.97
productreview.com.au              1208        11.26
myshopping.com.au                 1130        10.53
abc.net.au                        1101        10.26
smh.com.au                        1100        10.25
itunes.apple.com                  1077        10.03
whitepages.com.au                 990         9.22
yelp.com.au                       965         8.99
whereis.com                       893         8.32
news.com.au                       833         7.76
wotif.com                         783         7.3
au.answers.yahoo.com              774         7.21
expedia.com.au                    672         6.26
getprice.com.au                   628         5.85
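A visibility table of this kind could be compiled roughly as follows. The slides do not say how "Saturation" was calculated, so the definition used here (the percentage of analysed keywords for which a domain appears at least once in the top 10) is an assumption, as are the function name `top10_saturation` and the sample data.

```python
from collections import Counter


def top10_saturation(serps):
    """Map each domain to (keywords where it is in the top 10, % of keywords).

    serps is a hypothetical dict: keyword -> ordered list of ranking domains.
    The percentage definition is an assumption, not taken from the slides.
    """
    appearances = Counter()
    for ranking in serps.values():
        for domain in set(ranking[:10]):  # count a domain once per keyword
            appearances[domain] += 1
    total = len(serps)
    return {
        domain: (count, round(100 * count / total, 2))
        for domain, count in appearances.most_common()
    }


# Made-up search results for four keywords (illustrative only).
sample_serps = {
    "k1": ["ebay.com.au", "gumtree.com.au"],
    "k2": ["ebay.com.au"],
    "k3": ["abc.net.au"],
    "k4": ["ebay.com.au", "ebay.com.au"],
}
for domain, (count, pct) in top10_saturation(sample_serps).items():
    print(domain, count, pct)
```

Running this against a keyword set you control answers the question of who your SERP competitors are for exactly the terms you care about.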
Compiled by analysing over 100,000 ranking domains
Big surprise! Nothing to do with home appliances
- broadband choice
- adsl 2
- microsoft certification
- modem router
Take-aways
- If you want to outreach in Australia, you probably need to be on Twitter.
- The top Aussie sites are aggregators (products, reviews or local business) - get listed to increase visibility.
- You are already lucky! You don’t need to work to get as many root domains as you would in other countries like the UK.
- Use a range of tools, including Open Site Explorer, Ahrefs and MajesticSEO, to check backlink profiles, as no single tool seems to do a great job of indexing the Australian web.
- You need a .com.au to rank in Australia. 19% are .com, but usually with an Australian subdomain (e.g. au.domain.com).
@[email protected]@orchidbox.com
(Send me a tweet to get free Outreachr pro access for a month!)