The limits to A/B testing

Last week was about all the ways to test a game – broadly broken into quantitative, qualitative and technical testing.

  • Technical testing checks that the game is not buggy.
  • Qualitative testing uses small numbers of people (below 100) that you follow very closely, or even meet in person.
  • Quantitative testing treats people as statistics and goes for large volumes (1,000 or more).

As I wrote last week, all of these are needed as they give unique viewpoints into how the game works. This week, I’m going to have a deeper look at the quantitative testing part, where we do A/B tests on large groups of people. This approach has its limits, but it can be stretched surprisingly far, if you want to.

Here’s how I visualise the problem myself: if you put features on the x- and y-axes of a graph, and the commercial potential on the z-axis, you get a landscape like this. (Of course, it’s way more complex than what we can actually draw with a 3D landscape like this, but for illustration it works.)

Say you are convinced that combining an endless runner with gacha mechanics is the best idea ever. This means that you envision a large, unconquered mountain at the intersection of these two features.

You then make a first version of the game, and drop it on some unsuspecting test players. From there on, you iterate on your game, always trying to improve it slightly by running a series of A/B tests.

In these tests, you have (at least) two different versions of the game: A and B. You randomly assign each player to one of the versions. The easiest way is to advertise to get a group of, say, 1,000 players, then assign 500 of them to play version A and 500 to play version B. A simple way to do it is to generate a user ID number and say that even IDs get version A and odd IDs get version B. From there, there are plenty of more sophisticated things you can do, but this is the bare-bones version.
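The even/odd split described above can be sketched in a few lines. The sequential IDs here are just for illustration; real systems usually bucket on a hash of a stable user ID instead, so the split survives restarts:

```python
def assign_variant(user_id: int) -> str:
    """Bare-bones split: even IDs get version A, odd IDs get version B."""
    return "A" if user_id % 2 == 0 else "B"

# Simulate the 1,000-player cohort from the text.
groups = {"A": [], "B": []}
for uid in range(1000):
    groups[assign_variant(uid)].append(uid)

print(len(groups["A"]), len(groups["B"]))  # 500 500
```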

Now you look at which group played for longer, or spent more money. After watching them play for some time, you will have more information to feed to the development team for the next version, and the next A/B test.
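To decide "which group did better" you also want some notion of statistical significance, not just the raw counts. Here is a minimal sketch of a two-proportion z-test on a retention-style metric, using only the standard library; the player counts are hypothetical, and a real analysis pipeline would be more careful about metric choice and multiple comparisons:

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test comparing two conversion counts (e.g. players retained)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; p-value is two-sided.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: 60/500 of group A vs 90/500 of group B still playing a week later.
z, p = two_proportion_z(60, 500, 90, 500)
print(f"z = {z:.2f}, p = {p:.4f}")  # p below 0.05: B's lift looks real
```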

For those of you who studied optimisation, this might feel familiar. It is close to a version of the steepest ascent algorithm. This means that you might be able to optimise your way to the top of the nearest hill (a local optimum), but you will not be able to jump to a nearby, even higher hill, since that would require you to go downhill a short distance before you start climbing again. It’s like a blind person climbing a hill: they will get to the top of their hill, but cannot see the higher hills nearby.
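The local-optimum trap can be made concrete with a toy greedy hill climber on a made-up two-peaked landscape: started near the small hill, it tops out there and never discovers the taller peak.

```python
def hill_climb(f, x: float, step: float = 1.0, max_iters: int = 100) -> float:
    """Greedy 1-D hill climbing: move only to a strictly better neighbour."""
    for _ in range(max_iters):
        best = max([x - step, x, x + step], key=f)
        if best == x:
            return x  # no neighbour improves: stuck at a local optimum
        x = best
    return x

def landscape(x: float) -> float:
    """Two peaks: a small hill at x=0 (height 5) and a taller one at x=10 (height 12)."""
    return max(5 - 0.5 * x ** 2, 12 - 0.5 * (x - 10) ** 2, 0)

print(hill_climb(landscape, 1.0))   # climbs the nearby small hill at x=0...
print(hill_climb(landscape, 8.0))   # ...and finds x=10 only if it starts close by
```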


So, how well can this approach work in real life, and what are its limits? We have run some fairly bold tests. On an upcoming level-based puzzle game, we were unsure about the difficulty and how fast we should introduce new concepts. To find out, we tried three versions: the original, which felt a bit slow to us; a second that dropped every second level (and thus introduced concepts twice as fast); and a third that dropped 2 out of every 3 levels (introducing new concepts three times as fast). In this particular test, the original fared best, closely followed by the double-speed version. The triple-speed version was far worse – apparently it confused people with too much information.

We ran another fairly ambitious test when developing Benji Bananas. One of our role models for the game was Tiny Wings; another was Jetpack Joyride. If you have played both of these games, you will know that while both are endless runners, they end the game in different ways. Jetpack Joyride kills you as soon as you make a mistake, and forces you to start over. Tiny Wings, in contrast, forgives your mistake and lets you continue. It just costs you a few seconds of time, and eventually time runs out, which ends the game.

So, which one should we adopt for our cute swinging monkey? Should we be forgiving or harsh when the player makes a mistake? We thought we knew the answer, but wanted to test anyway. Actually, I have since asked whole roomfuls of game professionals which way they think we should have gone, and they favour our original intuition by about 2-to-1.

We were fairly sure we should be forgiving and go the Tiny Wings route. After all, it’s a very cute and casual game. When we tested it, however, the Jetpack Joyride-inspired instant-death version won out.

We did not believe our data; we were sure we must have made a mistake somehow. So we improved both versions (especially the timed, Tiny Wings-style version) and ran another test. With the same result.

We still did not believe it, and polished up things a third time, only to get the same result a third time. After a few months of wasted effort, we finally accepted the data and moved on.

After 9 months of tests like this, we finally had a working game. Some measures had improved enormously over the course of testing and iterating. For instance, measured by how many players completed 100 games or more within a few weeks of downloading, we had improved from 0.5% when we started testing to 20.5% in a version before launch. That’s about a 40X improvement – so much for A/B testing only being about small improvements!


A few more hints if you too decide to do some wild testing:

One concern is that you might tarnish your brand, or the game’s brand, by doing this. There’s an easy solution: simply invent another name to use during the testing period. If you are really concerned, you can use a separate company account as well. And if you are super concerned, you can swap out some key graphics so that no one can recognise your famous IP. With these measures, you get to test things out completely anonymously. The game will speak for itself, and no positive or negative brand associations will taint your data. If the ideas you’re testing turn out to be bad, you can quietly kill them with no bad PR as a result.

I would, however, suggest that you do not involve money in these early and risky tests. As long as you are giving away free entertainment that no one can in any way pay for, I think it is fair to run a few tests and see how people react. When you have paying customers later on, you need to be way more risk averse.


Until next week!

