SEO scraping with Excel (Google suggest and more)

Post on 20-Feb-2017


#SMConnect @Zen2SEO

Search Marketing Connect - 20 and 21 November 2015

SEO Scraping with Excel: from an “infinite” Google Suggest to SERP extractions for several goals, at no cost and with no programming skills needed

salsa dancing + travel + crime novels + lots of fun = Giuseppe Pastore (unconventional SEO manager)

Say hello! @Zen2SEO

Web Scraping - What

Web scraping = extracting information from websites by simulating human browsing with software

Web Scraping - Why

Price comparison, contact scraping, weather data monitoring, website change detection, research, web mashups and web data integration.

Web Scraping - How

Lots of techniques... that need coding.

I can’t code, but I like Excel.

Excel

SEO Tools for Excel

RegEx

XPath

http://seotoolsforexcel.com

Regular Expression (regex or regexp) = a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings

http://goo.gl/pqtNE0
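To make the regex definition concrete, here is a minimal sketch in Python (not part of the original deck, which works entirely in Excel). The sample keywords and the pattern are made up for illustration.

```python
import re

# Hypothetical sample data: a few keyword suggestions.
suggestions = ["milan news", "milano meteo", "milanotoday", "rome news"]

# Search pattern: "milan", optionally followed by "o",
# then either whitespace or the end of the string.
pattern = re.compile(r"^milano?(\s|$)")

# Keep only the strings that match the pattern.
matches = [s for s in suggestions if pattern.search(s)]
print(matches)
```

Only the keywords where "milan" or "milano" is a whole first word pass the filter; "milanotoday" does not.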

XPath = a query language for selecting nodes from an XML document

//*[@id="rso"]/div/div/h3/a
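As an illustration of what such an expression selects, the sketch below runs the same path with Python's standard library. The page markup is a hypothetical, simplified stand-in (real result pages are messier HTML), and ElementTree supports only a subset of XPath, so the path is written relative to the root with a leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page.
page = """
<html><body>
  <div id="rso">
    <div><div><h3><a href="http://example.com/one">First result</a></h3></div></div>
    <div><div><h3><a href="http://example.com/two">Second result</a></h3></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(page)

# Same path as on the slide: any element with id="rso", then div/div/h3/a.
links = root.findall(".//*[@id='rso']/div/div/h3/a")

hrefs = [a.get("href") for a in links]
titles = [a.text for a in links]
print(hrefs, titles)
```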

SCRAPING (EVERY!!!) SUGGEST

Google Suggest API to be discontinued

http://googlewebmastercentral.blogspot.it/2015/07/update-on-autocomplete-api.html

UberSuggest (takes data from Bing)

http://ubersuggest.org

Keyword Tool

http://keywordtool.io

Target #1 – Google Suggest

Step 1

http://suggestqueries.google.com/complete/search?output=toolbar&hl=it&q=milan

Step 2

=DownloadString("http://suggestqueries.google.com/complete/search?output=toolbar&hl=it&q="&A2)

<?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan news"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano finanza"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano meteo"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano marittima"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milanotoday"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano expo"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano malpensa"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milanuncios"/></CompleteSuggestion></toplevel>

Downloading the entire page code
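For readers who do code, steps 1 and 2 could be sketched in Python roughly as follows. This is an illustration under assumptions, not the deck's method: the function names are hypothetical, the endpoint is the one shown on the slide and may no longer respond, and the suggestions are read with an XML parser instead of regular expressions.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SUGGEST_URL = "http://suggestqueries.google.com/complete/search"

def build_suggest_url(query, lang="it"):
    """Build the URL from step 1 for a given keyword and language."""
    params = {"output": "toolbar", "hl": lang, "q": query}
    return SUGGEST_URL + "?" + urllib.parse.urlencode(params)

def parse_suggestions(xml_text):
    """Return the data attribute of every <suggestion> node."""
    root = ET.fromstring(xml_text)
    return [node.get("data") for node in root.iter("suggestion")]

def fetch_suggestions(query, lang="it"):
    """Download the XML (like DownloadString) and parse it. Needs network access."""
    with urllib.request.urlopen(build_suggest_url(query, lang), timeout=10) as resp:
        return parse_suggestions(resp.read())
```

parse_suggestions can be tried offline on the XML shown above; fetch_suggestions depends on the service still being available.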

Step 3

=RegexpReplace(DownloadString("http://suggestqueries.google.com/complete/search?output=toolbar&hl="&B2&"&q="&A2);"((.*)toplevel>)?<CompleteSuggestion><suggestion(\s)data=";"")

Input: <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion>...</toplevel>

Output: "milan"/></CompleteSuggestion>...</toplevel>

Deleting the node openings

Step 4

=RegexpReplace(A11;"/></CompleteSuggestion>(</toplevel>)?";",")

Input: "milan"/></CompleteSuggestion>"milan news"/></CompleteSuggestion>...</toplevel>

Output: "milan", "milan news","milano finanza", "milan","milano","milano meteo","milano marittima","milano expo","milano malpensa","milanotoday","milan store",

Deleting the node closings
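Steps 3 and 4 can be reproduced outside Excel to see what each replacement does. The sketch below uses Python's re module with the two patterns copied from the formulas; the function names are hypothetical.

```python
import re

def strip_openings(xml_text):
    """Step 3: delete the XML header and every opening
    <CompleteSuggestion><suggestion data= sequence."""
    pattern = r'((.*)toplevel>)?<CompleteSuggestion><suggestion(\s)data='
    return re.sub(pattern, "", xml_text)

def closings_to_commas(text):
    """Step 4: turn every node closing into a comma."""
    pattern = r'/></CompleteSuggestion>(</toplevel>)?'
    return re.sub(pattern, ",", text)

xml_text = ('<?xml version="1.0"?><toplevel>'
            '<CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion>'
            '<CompleteSuggestion><suggestion data="milan news"/></CompleteSuggestion>'
            '</toplevel>')

after_step3 = strip_openings(xml_text)
after_step4 = closings_to_commas(after_step3)
print(after_step3)
print(after_step4)
```

After both replacements only the quoted terms remain, each followed by a comma.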

Step 5

=SINISTRA(A14;TROVA(",";A14;1))

Input: "milan", "milan news","milano finanza", "milan","milano","milano meteo","milano marittima","milano expo","milano malpensa","milanotoday","milan store",

Output: "milan",

Finding the first comma and isolating everything to its left (SINISTRA and TROVA are the Italian-locale names of Excel's LEFT and FIND)

Step 6

=RegexpReplace(SINISTRA(A17;TROVA(",";A17;1));""",?";"")

Input: "milan",

Output: milan

Removing quotes: I’ve isolated the first result

Step 7

=DESTRA(A14;LUNGHEZZA(A14)-TROVA(",";A14;1))

Input (143 characters): "milan","milan news","milano finanza", "milan","milano","milano meteo","milano marittima","milano expo","milano malpensa","milanotoday","milan store",

Output (135 characters, after dropping the 8 characters of the first term): "milan news","milano finanza","milan","milano","milano meteo","milano marittima","milano expo","milano malpensa","milanotoday","milan store",

From the 10-result string I’m isolating the part to the right of the first term (DESTRA and LUNGHEZZA are the Italian-locale names of Excel's RIGHT and LEN)

Iterating 5-6-7

"milan news","milano finanza","milan","milano","milano meteo","milano marittima","milano expo","milano malpensa","milanotoday","milan store",

=SINISTRA(A14;TROVA(",";A14;1))
=RegexpReplace(SINISTRA(A17;TROVA(",";A17;1));""",?";"")
=DESTRA(A14;LUNGHEZZA(A14)-TROVA(",";A14;1))

milan
milan news
milano
milano finanza
milano meteo
milano marittima
milano expo
milano malpensa
milanotoday
milan store
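The iteration of steps 5, 6 and 7 amounts to repeatedly taking the first term and continuing with the rest. A minimal Python sketch of that loop follows; the helper names are hypothetical, and the quote-stripping pattern is the one from step 6.

```python
import re

def split_first(text):
    """Return (first term without quotes, remainder of the string)."""
    pos = text.find(",")                        # like TROVA(",";...)
    if pos == -1:                               # no comma left: everything is one term
        return re.sub(r'",?', "", text), ""
    head, tail = text[:pos + 1], text[pos + 1:] # like SINISTRA / DESTRA
    return re.sub(r'",?', "", head), tail       # step 6: drop quotes and comma

def all_terms(text):
    """Repeat steps 5-6-7 until the string is used up."""
    terms = []
    while text:
        term, text = split_first(text)
        terms.append(term.strip())
    return terms

print(all_terms('"milan", "milan news","milano finanza",'))
```

In Excel the same effect is obtained by copying the three formulas down the sheet, once per term.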

Iterating 5-6-7

=RegexpReplace(RegexpReplace(DownloadString("http://suggestqueries.google.com/complete/search?output=toolbar&tbm=&hl="&B2&"&lang_"&B2&"&q="&A2);"((.*)toplevel>)?<(/?Complete)?suggestion((\s)data=)?>?(</toplevel>)?";"");"/>";",")
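The single formula above does all the cleaning in two nested replacements. The sketch below applies the same inner pattern in Python to an already downloaded XML string. One assumption is flagged loudly: matching is done case-insensitively, because the response spells the tag "CompleteSuggestion" while the pattern says "suggestion"; whether the Excel add-in behaves this way is not stated in the deck.

```python
import re

# Inner pattern copied from the Excel formula.
PATTERN = r"((.*)toplevel>)?<(/?Complete)?suggestion((\s)data=)?>?(</toplevel>)?"

def suggestions_from_xml(xml_text):
    # ASSUMPTION: case-insensitive matching (see the note above).
    stripped = re.sub(PATTERN, "", xml_text, flags=re.IGNORECASE)
    # Outer replacement: every remaining "/>" becomes a comma.
    joined = stripped.replace("/>", ",")
    # Split on commas and drop the quotes around each term.
    return [term.strip('"') for term in joined.split(",") if term]

xml_text = ('<?xml version="1.0"?><toplevel>'
            '<CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion>'
            '<CompleteSuggestion><suggestion data="milan news"/></CompleteSuggestion>'
            '</toplevel>')
print(suggestions_from_xml(xml_text))
```

A suggestion that itself contains a comma would be split in two by the last step; that edge case is ignored here.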

Target #2 – Bing Suggest

Step 1

http://api.bing.com/osjson.aspx?query=milan

12 results, based on IP

["milan",["milan news","milano finanza","milan","milano today","milano","milan live","milanotoday","milannews.it","milannews","milanofinanza.it","milano meteo","milan calciomercato"]]

https://hide.me/en/proxy
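The Bing response is plain JSON: the query first, then the list of suggestions. A minimal Python sketch for reading it, with a hypothetical function name and a shortened copy of the response above as sample data:

```python
import json

def parse_osjson(body):
    """Return (query, suggestions) from a [query, [suggestions...]] response."""
    data = json.loads(body)
    return data[0], data[1]

sample = '["milan",["milan news","milano finanza","milan","milano today"]]'
query, suggestions = parse_osjson(sample)
print(query, len(suggestions), suggestions)
```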

Target #3 – Amazon Suggest

Step 1

http://completion.amazon.com/search/complete?method=completion&q=%q&search-alias=aps&mkt=1

http://completion.amazon.co.uk/search/complete?method=completion&q=%q&search-alias=aps&mkt=4

http://completion.amazon.co.jp/search/complete?method=completion&q=%q&search-alias=aps&mkt=6

Aps = All Product Selection (?)

["milano",["milani","milano cookies","milano bride","milano knife","kiko milano","milano moda","milano lego","giorgio milano","milano poker chips","milanos"],[{"sc":"1","nodes":[{"name":"Beauty","alias":"beauty"},{"name":"Health & Personal Care","alias":"hpc"}]},{},{},{},{},{},{},{},{},{}],[]]

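The %q in the Amazon URLs stands for the keyword. The sketch below shows one way to fill it in and to read the response in Python. The function names are hypothetical, and reading the third element as per-suggestion category nodes is an interpretation of the sample response above, not documented behaviour.

```python
import json
from urllib.parse import quote

TEMPLATE = ("http://completion.amazon.com/search/complete"
            "?method=completion&q=%q&search-alias=aps&mkt=1")

def fill_template(template, query):
    """Replace the %q placeholder with the URL-encoded keyword."""
    return template.replace("%q", quote(query))

def parse_amazon(body):
    """Return (suggestions, category names) from the JSON response."""
    data = json.loads(body)
    suggestions = data[1]
    # ASSUMPTION: data[2] holds one dict per suggestion, with optional "nodes".
    categories = [node["name"] for entry in data[2] for node in entry.get("nodes", [])]
    return suggestions, categories

sample = ('["milano",["milani","milano cookies"],'
          '[{"sc":"1","nodes":[{"name":"Beauty","alias":"beauty"},'
          '{"name":"Health & Personal Care","alias":"hpc"}]},{}],[]]')
suggestions, categories = parse_amazon(sample)
print(fill_template(TEMPLATE, "kiko milano"))
print(suggestions, categories)
```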

Target #4 – Google Image Suggest

http://suggestqueries.google.com/complete/search?json&client=toolbar&ds=i&q=%q

<?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="milano"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan napoli"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano expo"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano metro"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano skyline"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano marittima"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano metropolitana"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano navigli"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan news"/></CompleteSuggestion></toplevel>

Target #5 – Youtube Suggest

http://suggestqueries.google.com/complete/search?json&client=toolbar&ds=yt&q=%q

<?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="milano bangkok"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan napoli 0 4"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan napoli"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan palermo"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan palermo 3 2"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milano"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan napoli 0 4 auriemma"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan napoli 0 4 crudeli"/></CompleteSuggestion><CompleteSuggestion><suggestion data="milan udinese 3 2"/></CompleteSuggestion></toplevel>

Target #6 – Wikipedia Suggest

http://it.wikipedia.org/w/api.php?action=opensearch&search=%q

["milano",["Milano","Milano-Sanremo","Milano 2","Milano-Torino","Milano-Sanremo 2012","Milano-Sanremo 2014","Milano-Sanremo 2013","Milano-Sanremo 2015","Milano-Sanremo 2011","Milano-Sanremo 2010"],["Milano ( pronuncia /mi\u02c8lano/, in lombardo Milan, pronunciato /mi\u02c8l\u00e3\u02d0/ nel dialetto locale) \u00e8 una citt\u00e0 italiana di 1 342 806 abitanti, capoluogo dell'omonima citt\u00e0 metropolitana e della regione Lombardia, secondo comune italiano per numero di abitanti, tredicesimo comune dell'Unione europea e diciannovesimo del continente e, con l'agglomerato urbano, terza area metropolitana pi\u00f9 popolata d'Europa dietro Londra e Parigi.","La Milano-Sanremo \u00e8 una corsa in linea maschile di ciclismo su strada professionistico, una delle pi\u00f9 importanti corse ciclistiche del relativo circuito internazionale e prima grande classica nel calendario ciclistico stagionale.","Milano 2 (o anche Milano Due, abbreviato MI2 e M2) \u00e8 un quartiere residenziale sito nel territorio del comune italiano di Segrate, nella citt\u00e0 metropolitana di Milano.","La Milano-Torino \u00e8 una corsa in linea maschile di ciclismo su strada, che si svolge tra Milano e Torino, in Italia, ogni anno nel mese di ottobre, ed \u00e8 una delle classiche d'autunno.","La Milano-Sanremo 2012, centotreesima edizione della corsa, si \u00e8 disputata il 17 marzo 2012, per un percorso totale di 298 km.","La Milano-Sanremo 2014, centocinquesima edizione della corsa, valida come quarta prova del circuito UCI World Tour 2014, si svolse il 23 marzo 2014 su un percorso di 294km, con partenza da Milano ed arrivo a Sanremo.","La Milano-Sanremo 2013, centoquattresima edizione della corsa, si \u00e8 disputata il 17 marzo 2013 su un percorso accorciato per motivi meteorologici da 298 km a 255 km.","La Milano-Sanremo 2015, centoseiesima edizione della corsa, valida come quarta prova del circuito UCI World Tour 2015, si \u00e8 svolta il 22 marzo 2015 su un percorso di 293 km, con partenza 
da Milano ed arrivo a Sanremo.","La Milano-Sanremo 2011, centoduesima edizione della corsa, si \u00e8 disputata il 19 marzo 2011, per un percorso totale di 298 km.","La Milano-Sanremo 2010, centunesima edizione della corsa, si \u00e8 disputata il 20 marzo 2010 e ha affrontato un percorso totale di 298 km."],["https://it.wikipedia.org/wiki/Milano","https://it.wikipedia.org/wiki/Milano-Sanremo","https://it.wikipedia.org/wiki/Milano_2","https://it.wikipedia.org/wiki/Milano-Torino","https://it.wikipedia.org/wiki/Milano-Sanremo_2012","https://it.wikipedia.org/wiki/Milano-Sanremo_2014","https://it.wikipedia.org/wiki/Milano-Sanremo_2013","https://it.wikipedia.org/wiki/Milano-Sanremo_2015","https://it.wikipedia.org/wiki/Milano-Sanremo_2011","https://it.wikipedia.org/wiki/Milano-Sanremo_2010"]]
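The Wikipedia response above has four parts: the query, the titles, the descriptions and the page URLs, in matching order. A minimal Python sketch that lines them up (the function name and the shortened sample are made up for illustration):

```python
import json

def parse_wikipedia_opensearch(body):
    """Combine titles, descriptions and URLs into one record per suggestion."""
    query, titles, descriptions, urls = json.loads(body)
    return [
        {"title": t, "description": d, "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

# Shortened, hypothetical sample in the same shape as the response above.
sample = json.dumps([
    "milano",
    ["Milano", "Milano 2"],
    ["Milano è una città italiana.", "Milano 2 è un quartiere residenziale."],
    ["https://it.wikipedia.org/wiki/Milano", "https://it.wikipedia.org/wiki/Milano_2"],
])
records = parse_wikipedia_opensearch(sample)
print(records[0]["title"], records[0]["url"])
```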

SCRAPING (GOOGLE) SERPs

Target #2 – Google SERP

Step 1

//h3[@class='r']/a

XPath identification

Step 2

=XPathOnUrl("https://www.google.it/search?q=%q&hl=it&&tbs=lr:lang_1it,qdr:a&prmd=ivns&num=10&source=lnt";"(//h3[@class='r']/a)[1]";"href")

Href element extraction
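The same extraction can be sketched without Excel. The Python example below uses the standard library's HTML parser in place of a real XPath engine, so it only approximates the expression: it collects the href of any link found inside an h3 with class "r", not just direct children. The class name comes from the slide; Google's markup has very likely changed since, and the sample HTML is hypothetical.

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collects href values of <a> tags found inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_result_heading = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "r":
            self.in_result_heading = True
        elif tag == "a" and self.in_result_heading and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_result_heading = False

def nth_result_href(html, n):
    """Return the href of the n-th result (1-based), or None if there is none."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.hrefs[n - 1] if 0 < n <= len(parser.hrefs) else None

# Hypothetical sample markup.
html = ('<div><h3 class="r"><a href="http://example.com/1">One</a></h3>'
        '<h3 class="r"><a href="http://example.com/2">Two</a></h3>'
        '<h3><a href="http://example.com/x">Other</a></h3></div>')
print(nth_result_href(html, 1))
```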

Target #3 – Google Cache

Step 1

http://webcache.googleusercontent.com/search?hl=it&q=cache:http://www.miosito.it

Step 2

=RegexpFindOnUrl("http://webcache.googleusercontent.com/search?hl=it&q=cache%3Ahttp://www.giuseppepastore.com";"cache di Google di(.*)</a>\.(\s)")

cache di Google di <a href="http://www.giuseppepastore.com" dir="ltr">http://www.giuseppepastore.com</a>.

=RegexpFindOnUrl("http://webcache.googleusercontent.com/search?hl=it&q=cache%3Ahttp://www.giuseppepastore.com";" visualizzata il(.*)GMT ")

visualizzata il 16 nov 2015 14:29:53 GMT

(The patterns match the Italian cache page: "cache di Google di" = "Google's cache of", "visualizzata il" = "as it appeared on".)
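The two lookups can be tried offline on a saved copy of the cache page. The Python sketch below uses slightly tighter patterns than the formulas, with capture groups that return only the URL and the timestamp; the function names are hypothetical and the sample text is assembled from the two results shown above, not from a real page.

```python
import re

def cached_url(html):
    """Return the URL named in the 'cache di Google di' banner, or None."""
    m = re.search(r'cache di Google di\s*<a href="([^"]+)"', html)
    return m.group(1) if m else None

def snapshot_time(html):
    """Return the text between 'visualizzata il' and 'GMT', or None."""
    m = re.search(r'visualizzata il\s+(.*?)\s+GMT', html)
    return m.group(1) if m else None

# Hypothetical sample built from the two results shown above.
sample = ('cache di Google di <a href="http://www.giuseppepastore.com" dir="ltr">'
          'http://www.giuseppepastore.com</a>. '
          'visualizzata il 16 nov 2015 14:29:53 GMT ')
print(cached_url(sample), snapshot_time(sample))
```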

Conclusions

Google Suggest
Bing Suggest
Google Image Suggest
Youtube Suggest
Amazon Suggest
Wikipedia Suggest

(What-Ever-You-Want Suggest – as long as you can query a URL)

Conclusions

Google SERPs
Google Cache

(What-Ever-You-Want from any web page)

Thank you!

Giuseppe Pastore

@Zen2SEO