Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains


Transcript of Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Page 1: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Welcome to hell

Penalties and Big Data

Page 2: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

@portentint

ask me questions

Page 3: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Kinda cool

Page 4: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

and a cautionary tale

Page 5: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

How hard could machine learning be?

Page 6: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

I was a history major

Page 7: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Used: Links we disavowed and/or Google pointed at and said “these are bad.”

Page 8: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Trained: Based on domains that were clear disavows

Page 9: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Trained: Based on links Google pointed out

Page 10: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Total: 250,000 domains

Page 11: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Then: Hand-review results (random, 1,000 links)
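
A minimal sketch of how a random 1,000-link review sample could be drawn. The slides do not say how the sample was picked, so the function name `sample_for_review` and the fixed seed are assumptions for illustration only.

```python
import random


def sample_for_review(links, k=1000, seed=42):
    """Return up to k links chosen uniformly at random, without replacement.

    The seed is fixed (an assumption, not from the talk) so the same
    sample can be regenerated when reviewers compare notes.
    """
    rng = random.Random(seed)
    links = list(links)
    return rng.sample(links, min(k, len(links)))
```

Sampling without replacement means no link is reviewed twice, and the fixed seed keeps the review set reproducible.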

Page 12: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Lesson 1

Page 13: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Excel doesn’t like 250,000 row spreadsheets

Page 14: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Work with text files, or use SQLite
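
One possible way to follow that advice with Python's standard library. The two-column, tab-separated layout (domain, then a 0/1 disavow flag), the `domains` table, and all function names are hypothetical; the talk does not describe the actual file format.

```python
import csv
import sqlite3


def read_rows(path):
    """Yield (domain, disavowed) pairs from a tab-separated text file.

    Assumed layout (hypothetical): domain<TAB>0-or-1, one per line.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield row[0].strip().lower(), int(row[1])


def load_domains(conn, rows):
    """Create the table if needed and insert rows, replacing duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS domains ("
        "domain TEXT PRIMARY KEY, disavowed INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO domains (domain, disavowed) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def count_by_label(conn):
    """Return {disavowed_flag: number_of_domains}."""
    cur = conn.execute(
        "SELECT disavowed, COUNT(*) FROM domains GROUP BY disavowed"
    )
    return dict(cur.fetchall())
```

Usage would look like `conn = sqlite3.connect("links.db")` followed by `load_domains(conn, read_rows("domains.tsv"))`; both file names are placeholders.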

Page 15: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Here are a few rules of thumb for detecting bad links.

These are not causal. They are correlations.
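
A rough sketch of what checking one rule of thumb as a correlation might look like: compare how often links with a given trait were flagged against links without it. The record layout, the `bad` key, and the function name are all hypothetical, and a difference in rates here says nothing about cause.

```python
def flag_rate_by_trait(records, trait):
    """Return the share of flagged links for trait=True and trait=False.

    records: iterable of dicts holding a boolean under `trait` and a
    boolean under "bad" (hypothetical field names).
    A group with no records maps to None.
    """
    counts = {True: [0, 0], False: [0, 0]}  # trait value -> [bad, total]
    for r in records:
        c = counts[bool(r[trait])]
        if r["bad"]:
            c[0] += 1
        c[1] += 1
    return {k: (b / t if t else None) for k, (b, t) in counts.items()}
```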

Page 16: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

BigML: Upload source

Page 17: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

BigML: Dataset

Page 18: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

BigML: Model

Page 19: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains
Page 20: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains
Page 21: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Lesson 2

Page 22: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Crap data = crap results

Page 23: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Check your assumptions: Don’t behave like a history major trying to analyze 4 million records using machine learning.

Page 24: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Just enough to be dangerous

Page 25: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Lesson 3

Page 26: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

These are not causal. These are correlations.

Page 27: Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

@portentint portent.co/playpenguin

That’s it