Data Gone Wrong - GDCNext 2013


Description

In the last five years, data analysis, A/B testing and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding the player and making decisions, but it can just as easily lead us astray. This talk exposes common mistakes and pitfalls, both technical and emotional, and provides practical guidance on how to improve the rigor of your tests and the quality of your data.

Transcript of Data Gone Wrong - GDCNext 2013

Page 1: Data Gone Wrong - GDCNext 2013


Page 2: Data Gone Wrong - GDCNext 2013

It’s the cornerstone of many of the biggest businesses in the US, including Google & Amazon, and the backbone of most scientific undertakings.


Page 3: Data Gone Wrong - GDCNext 2013

There’s plenty of cheerleading for data, so I want to spend today on cautionary tales & advice, in the hope of helping keep data more on the Iron Man side of the Robert Downey Jr. spectrum.


Page 4: Data Gone Wrong - GDCNext 2013

I love using numbers & testing to understand the world; I’m not a data hater by any means.

If you want to know about medieval Transylvania or the Ottoman invasion of Hungary, I’m your woman. That didn’t get me far on the job market.

Partly to do something different, partly to do games. But despite that intention it hasn’t been as different as I expected – user acquisition for games and catalogs is fundamentally the same. But Kongregate, because we’re an open platform with over 70k games and now a mobile game publisher, has provided some incredibly rich opportunities for data mining.


Page 5: Data Gone Wrong - GDCNext 2013

Part of the reason I’m telling you this is to make my first point:

And for an organization to do data right you can’t toss analysis back and forth over a wall to quants. It takes intimate knowledge of a game (and its development) to do good analysis, and multiple perspectives and theories are good.


Page 6: Data Gone Wrong - GDCNext 2013

Sometimes it’s immediately obvious – we had a mobile game we launched recently, an endless runner, that wasn’t filtering purchases from jailbroken phones and was showing an ARPPU of $500, which wasn’t very plausible and was easily caught. But most issues are much more subtle – tracking pixels not firing correctly for a particular game on a particular browser, tutorial steps being completed twice by some players but not by others, clients reporting strange timestamps, etc.

For this reason you should never rely on any analytic system where you can’t go in and inspect individual records. If you can’t check the detail you’ll never be able to find and fix problems. We use Google Analytics for reference and corroboration but nothing very crucial, and are using it less and less because of this.
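The jailbroken-phone story above can be sketched in a few lines. This is a minimal illustration with made-up record fields (`user_id`, `amount_usd`, `receipt_valid`), not Kongregate’s actual schema:

```python
# Minimal sketch of why record-level inspection matters: a handful of
# unvalidated purchases (e.g. from jailbroken phones) can wildly inflate ARPPU.
# Field names here are illustrative assumptions, not a real schema.

purchases = [
    {"user_id": 1, "amount_usd": 4.99,   "receipt_valid": True},
    {"user_id": 2, "amount_usd": 99.99,  "receipt_valid": True},
    {"user_id": 3, "amount_usd": 499.99, "receipt_valid": False},  # forged receipt
]

def arppu(records):
    """Average revenue per paying user over the given purchase records."""
    payers = {r["user_id"] for r in records}
    return sum(r["amount_usd"] for r in records) / len(payers) if payers else 0.0

raw = arppu(purchases)
clean = arppu([r for r in purchases if r["receipt_valid"]])
print(f"raw ARPPU: ${raw:.2f}  validated ARPPU: ${clean:.2f}")
```

Being able to pull the underlying rows is what makes the $500-style anomaly easy to diagnose; an aggregate-only dashboard would just show the inflated number.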


Page 7: Data Gone Wrong - GDCNext 2013

This looks like 4 separate pictures photoshopped together to create an appealing color grid, right?


Page 8: Data Gone Wrong - GDCNext 2013

Wrong.

So much of data is like these pictures – a set-up that appears straightforwardly to be one thing from one angle turns out to be completely different from another.


Page 9: Data Gone Wrong - GDCNext 2013

Except of course you know I’m setting you up.


Page 10: Data Gone Wrong - GDCNext 2013

I mentioned lifetime conversion and showed daily ARPPU. Lifetime conversion may be similar between the two games, but daily conversion is 40% higher for game 1.

This is why $/DAU is not a very interesting stat on its own. If someone quotes just D1 retention and $/DAU, that’s not enough information to judge how a game monetizes.
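The distinction can be made concrete with toy numbers (illustrative, not the actual games from the chart): two games with identical lifetime conversion, where one game’s buyers simply log in more often.

```python
# Toy illustration: identical lifetime conversion, 40% higher daily conversion,
# purely because game 1's buyers show up more often. All numbers are made up.

def daily_conversion(paying_dau, dau):
    """Share of a day's active users who paid that day."""
    return paying_dau / dau

def lifetime_conversion(payers, players):
    """Share of all players who have ever paid."""
    return payers / players

game1 = {"dau": 10_000, "paying_dau": 280, "players": 100_000, "payers": 2_000}
game2 = {"dau": 10_000, "paying_dau": 200, "players": 100_000, "payers": 2_000}

for name, g in (("game 1", game1), ("game 2", game2)):
    print(name,
          f"lifetime: {lifetime_conversion(g['payers'], g['players']):.1%}",
          f"daily: {daily_conversion(g['paying_dau'], g['dau']):.1%}")
```

Both games report 2% lifetime conversion, yet game 1’s daily conversion is 40% higher – the flat lifetime view hides it.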


Page 11: Data Gone Wrong - GDCNext 2013

It’s a living, changing system. Flat views are not enough.


Page 12: Data Gone Wrong - GDCNext 2013

So here are a series of likely traps analysts can fall into. I know I have. They’re not in a particular order of importance because they’re all important.

We tend to think of playerbases as monolithic, but really they are aggregations of all sorts of subgroups.

It’s sort of like watching a meal go through a snake.

Though with time cohorts it’s easy to lose track of events and changes in the game, so you can’t rely on those, either.


Page 13: Data Gone Wrong - GDCNext 2013


Page 14: Data Gone Wrong - GDCNext 2013


Page 15: Data Gone Wrong - GDCNext 2013

You may have noticed that win rates got a bit wacky towards the later missions in the last chart – this is a sample size issue.

Even games that overall have very substantial playerbases, like Tyrant, may end up with small sample sizes when you’re looking at uncommon behavior in subgroups.

Early test market data is often tantalizing & fascinating, but it’s often the most unreliable because you’re combining small sample sizes and a non-representative subgroup – the people who discover you first are the most hard-core.
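The wobble has a simple arithmetic root: the standard error of an observed win rate shrinks with the square root of the number of attempts. A quick back-of-the-envelope check (the 60% rate and attempt counts are just example inputs):

```python
import math

def win_rate_stderr(p, n):
    """Standard error of an observed win rate p measured over n attempts."""
    return math.sqrt(p * (1 - p) / n)

# A mission with a true 60% win rate, measured over different attempt counts:
for n in (5000, 500, 50, 10):
    half_width = 1.96 * win_rate_stderr(0.6, n)
    print(f"n={n:5d}  observed rate 60% ± {half_width:.1%} (95% CI)")
```

With only ten attempts on a late mission, a measured rate anywhere from roughly 30% to 90% is unsurprising – exactly the wackiness in the chart.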


Page 16: Data Gone Wrong - GDCNext 2013

Spending follows power-law, not normal (bell-curve), distributions, which affects everything.

Theoretically it’s not even possible to calculate the average value of a power-law distribution, since the infrequent top values could be arbitrarily large.

The sample size depends on the frequency of the event – tutorial completion & D1 retention should be fine with just a few hundred users, % buyer with 500+, but I don’t like to look at ARPPU with much less than 5,000. These are just my rules of thumb based on experience and probably have no mathematical basis.
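A quick simulation shows why ARPPU needs so much more data than a simple rate. This is a sketch drawing spend from a Pareto (power-law) distribution; the tail index α = 1.1 is an arbitrary illustration, not fitted to real data:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def pareto_spend(alpha, x_min=1.0):
    """One buyer's spend drawn from a Pareto (power-law) distribution."""
    u = 1.0 - random.random()          # uniform in (0, 1]
    return x_min / (u ** (1.0 / alpha))

# With a heavy tail (alpha close to 1), the sample mean never settles down:
# a single whale can move the ARPPU estimate even at large n.
for n in (100, 1_000, 10_000):
    sample = [pareto_spend(alpha=1.1) for _ in range(n)]
    print(f"n={n:6d}  ARPPU estimate: {sum(sample) / n:8.2f}")
```

Run it a few times with different seeds and the estimates jump around wildly – the practical version of “the average may not exist.”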


Page 17: Data Gone Wrong - GDCNext 2013


Page 18: Data Gone Wrong - GDCNext 2013


Page 19: Data Gone Wrong - GDCNext 2013

If you ask small questions you’ll usually get small answers. And the dirty secret of testing is that most tests are inconclusive anyway. It’s hard to move important metrics. So prioritize tests that significantly affect the game, like energy limits and regeneration, over button colors.


Page 20: Data Gone Wrong - GDCNext 2013

Your existing players are used to things working a certain way – a change in UI that makes things clearer for a new player may annoy/confuse an elder player. Where possible I like to test disruptive changes on new players only, and then roll out the test to other players if the test proves successful. A pricing change that increases non-buyer conversion might reduce repeat-buyer revenue.

For example, if you’re A/B testing your store, don’t assign people to the test unless they interact with the store. It’s often easier to split people as they arrive in your game, or at some other point, but a) there’s a chance you would end up with a non-equal distribution of interaction with the tested feature, and b) any signal from the test group would get lost in the noise of a larger sample.
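One way to implement “assign at the point of interaction” is a deterministic hash bucket evaluated the first time a player opens the store. A sketch with made-up experiment and function names:

```python
import hashlib

def store_variant(user_id: str) -> str:
    """Deterministic 50/50 bucket, stable across sessions and devices."""
    digest = hashlib.md5(f"store_test_v1:{user_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "variant"

# Assignment happens here, NOT at install: only players who actually open
# the store ever enter the experiment's denominator.
def on_store_opened(user_id: str, assignments: dict) -> str:
    if user_id not in assignments:
        assignments[user_id] = store_variant(user_id)
    return assignments[user_id]

assignments = {}
print(on_store_opened("player-123", assignments))
```

Salting the hash with the experiment name keeps this split independent of any other test running at the same time.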


Page 21: Data Gone Wrong - GDCNext 2013

Early results tend to be both volatile and fascinating – differences are exaggerated or totally change direction. People tend to remember the early, interesting results rather than the actual results. People also often want to end the test early if they see a big swing, which is a bad idea.

We tested to see what gain we were getting from bonusing larger currency packages, which had to be judged on total revenue to make sure we were capturing both transaction size and repeat purchase factors. To make sure the 15% lift was real we broke buyers into cohorts by how much they’d spent ($0-$10, $100-$200, $200-$500, etc) and checked the distribution in each test. On the bonus side of the test we saw fewer buyers under $20 and 30%+ gains on all the cohorts above $100, so we were confident that the gain was not being driven by a few big spenders.

Again this should be worked backward from the frequency and distribution of the metric you’re judging the test on. There are internet calculators to help you figure out what you need to get to statistical significance given an expected lift. My advice (if you have the playerbase and patience) is to then double or triple that. Why do I want my sample sizes so much bigger than the minimum?
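Those calculators typically implement the standard two-proportion formula; here is a rough sketch at 5% two-sided significance and 80% power (the 2% base rate and 15% lift are just example inputs):

```python
import math

def sample_size_per_arm(p_base, relative_lift, z_alpha=1.96, z_power=0.84):
    """Rough per-arm sample size for a two-proportion test at 5% two-sided
    significance and 80% power. A back-of-the-envelope, not an exact method."""
    p_test = p_base * (1 + relative_lift)
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_base - p_test) ** 2)

# e.g. detecting a 15% relative lift on a 2% conversion rate:
n = sample_size_per_arm(0.02, 0.15)
print(f"~{n:,} users per arm at the bare minimum")
```

Doubling or tripling that, as suggested above, buys protection against the true lift being smaller than your guess.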


Page 22: Data Gone Wrong - GDCNext 2013

It comes down to some of the issues with judging results by statistical significance itself. It doesn’t mean what you probably think it means.

Statistical significance tests assume that there is some true difference in lift, and that if you test there will be a bell curve distribution of results, with the true lift as the average. Your 5% result could be right on the mean, or it could be an outlier on either end. If it’s statistically significant then the chance is low (usually 5% or less) that there’s no lift at all. But the true lift could be 1% or 10%.

It’s possible you’d get two outlier results in the same direction, but it becomes less and less likely, and more likely that your test results represent the true mean. And the size of the effect you are testing does matter, as it helps you understand the relative importance of different factors, and what to prioritize testing next.
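A small simulation (illustrative parameters, not real test data) makes the point: when a test is underpowered, the runs that do clear the significance bar systematically exaggerate the true lift, or even flip its sign.

```python
import math
import random

random.seed(1)

def run_test(p_control, true_lift, n):
    """One simulated A/B test. Returns (observed relative lift, significant?)."""
    p_variant = p_control * (1 + true_lift)
    a = sum(random.random() < p_control for _ in range(n)) / n
    b = sum(random.random() < p_variant for _ in range(n)) / n
    se = math.sqrt(a * (1 - a) / n + b * (1 - b) / n)
    z = (b - a) / se if se else 0.0
    return (b - a) / a if a else 0.0, abs(z) > 1.96

# True lift is 10%, but each arm is far too small to detect it reliably.
results = [run_test(p_control=0.05, true_lift=0.10, n=2_000) for _ in range(500)]
significant = [lift for lift, sig in results if sig]
print(f"{len(significant)} of 500 runs were 'significant'; "
      f"their mean observed lift: {sum(significant) / len(significant):.0%} "
      f"(true lift: 10%)")
```

Every “significant” run here shows a lift far larger than the true 10% – the early, interesting result people remember.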


Page 23: Data Gone Wrong - GDCNext 2013

For example, we’ve had a lot of tests that increased registration and reduced retention, so much so that we now judge tests based on % retained registrations, because that’s what we really care about, but that’s not always possible.
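As a sketch of that composite metric (made-up numbers, with day-7 as an assumed retention window): variant B “wins” on raw registrations but loses on retained registrations, which is the outcome that actually matters.

```python
def retained_registration_rate(visitors, retained_registrations):
    """Registrations still active later (here: at day 7), per visitor.
    Folds the registration funnel and retention into one test metric."""
    return retained_registrations / visitors if visitors else 0.0

# Hypothetical A/B outcome: B registers more players but retains fewer.
variant_a = {"visitors": 10_000, "registrations": 2_000, "retained_d7": 600}
variant_b = {"visitors": 10_000, "registrations": 2_400, "retained_d7": 480}

for name, v in (("A", variant_a), ("B", variant_b)):
    print(name,
          f"registration rate: {v['registrations'] / v['visitors']:.0%}",
          f"retained registrations: "
          f"{retained_registration_rate(v['visitors'], v['retained_d7']):.1%}")
```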


Page 24: Data Gone Wrong - GDCNext 2013

A good example of this is adding a Facebook login button on our website. If a player comes back on a different browser they need to be able to log in.


Page 25: Data Gone Wrong - GDCNext 2013


Page 26: Data Gone Wrong - GDCNext 2013

This is about how you think about your business.
