Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Post on 14-Jul-2015

Transcript of Penguin, Penalties and 'Big Data' - How I analyzed 250,000 domains

Welcome to hell

Penalties and Big Data

@portentint

ask me questions

Kinda cool

and a cautionary tale

How hard could machine learning be?

I was a history major

Used: Links we disavowed and/or Google pointed at and said “these are bad.”

Trained: Based on domains that were clear disavows

Trained: Based on links Google pointed out

Total: 250,000 domains

Then: Hand-review results (a random sample of 1,000 links)
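The hand-review step above, pulling a random sample of 1,000 links from the full set, can be sketched with the Python standard library. The link list here is a stand-in; in practice it would come from the exported results file.

```python
import random

# Hypothetical stand-in for the full set of flagged links;
# in practice this would be read from the exported results.
all_links = [f"http://example-{i}.com/page" for i in range(250_000)]

# Fix the seed so the review sample is reproducible.
random.seed(42)

# Draw 1,000 links uniformly at random for hand review.
# random.sample() draws without replacement, so there are no duplicates.
review_sample = random.sample(all_links, k=1_000)

print(len(review_sample))  # 1000
```

Sampling without replacement matters here: reviewing the same link twice wastes hand-review time and skews the error estimate.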

Lesson 1

Excel doesn’t like 250,000-row spreadsheets

Work with text files, or use SQLite
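A minimal sketch of the SQLite route using only the standard library. The column names and the inline CSV are assumptions; in practice you would open the real export file instead of the string below.

```python
import csv
import io
import sqlite3

# Stand-in for a large exported CSV of domains and labels.
csv_text = "domain,disavowed\nexample.com,1\ngood-site.org,0\n"

conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
conn.execute("CREATE TABLE links (domain TEXT, disavowed INTEGER)")

# Stream the rows in; executemany keeps memory flat even at 250,000 rows.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO links (domain, disavowed) VALUES (?, ?)",
    ((row["domain"], int(row["disavowed"])) for row in reader),
)
conn.commit()

count = conn.execute(
    "SELECT COUNT(*) FROM links WHERE disavowed = 1"
).fetchone()[0]
print(count)  # 1
```

Unlike Excel, SQLite has no practical row ceiling at this scale, and queries over a few hundred thousand rows return in milliseconds.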

Here are a few rules of thumb for detecting bad links.

These are not causal. They are correlations.

BigML: Upload source

BigML: Dataset

BigML: Model
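The three BigML steps above (upload source, build dataset, train model) can be sketched with BigML's official Python bindings. This is a non-runnable sketch: it needs a BigML account, network access, and credentials in the environment, and the file name is an assumption.

```python
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

# 1. Upload source ("disavow_domains.csv" is a hypothetical file name)
source = api.create_source("disavow_domains.csv")
api.ok(source)  # block until the upload finishes processing

# 2. Build a dataset from the source
dataset = api.create_dataset(source)
api.ok(dataset)

# 3. Train a decision-tree model on the dataset
model = api.create_model(dataset)
api.ok(model)
```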

Lesson 2

Crap data = crap results

Check your assumptions: don’t behave like a history major trying to analyze 4 million records using machine learning.

Just enough to be dangerous

Lesson 3

These are not causal. These are correlations.

@portentint portent.co/playpenguin

That’s it