The limits to AB Testing

Last week was about all the ways to test a game – broadly broken into quantitative, qualitative and technical testing.

  • The technical testing should test that the game is not buggy.
  • The qualitative testing uses small numbers (below 100) of people that you follow very closely or even meet in person.
  • The quantitative testing is where you just treat people as statistics, and go for large volumes (1000 or above).

As I wrote last week, all of these are needed as they give unique viewpoints into how the game works. This week, I’m going to have a deeper look at the quantitative testing part, where we do A/B tests on large groups of people. This approach has its limits, but it can be stretched surprisingly far, if you want to.

Here’s how I visualise the problem myself: if you put features on the x- and y-axes of a graph, and the commercial potential on the z-axis, you will get a landscape like this. (Of course, it’s way more complex than what we can actually draw with a 3D landscape like this, but for illustration it works.)

Say you are convinced that combining an endless runner with gacha mechanics is the best idea ever. This means that you envision there to be a large, unconquered mountain at the intersection of these two features.

You then make a first version of the game, and drop it on some unsuspecting test players. From there on, you iterate on your game, always trying to improve it slightly by running a series of A/B tests.

In these tests, you have (at least) two different versions of the game: A and B. You randomly assign one version or the other to each player. The easiest way is to advertise to get a group of, say, 1000 players. You then assign 500 of them to play version A, and 500 to play version B. A simple way to do it is to generate a User ID number, and say that even IDs get version A and odd IDs get version B. From there, there is a ton of more sophisticated stuff you can do, but this is the bare bones version.
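As a minimal sketch of the even/odd split described above: the function name `assign_variant` and the sequential IDs 0 to 999 are assumptions made for illustration, not anything from our actual code.

```python
def assign_variant(user_id: int) -> str:
    """Even user IDs get version A, odd user IDs get version B."""
    return "A" if user_id % 2 == 0 else "B"


# Hypothetical cohort: 1000 players with sequential user IDs.
groups = {"A": [], "B": []}
for user_id in range(1000):
    groups[assign_variant(user_id)].append(user_id)

print(len(groups["A"]), len(groups["B"]))
```

The split is only as random as the IDs themselves, so this works when IDs are handed out in arrival order and nothing about the player decides whether they get an even or an odd number.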

Now you will look at which group played for longer, or spent more money. After watching them play for some time, you will have more information to feed into the development team to make the next version, with the next A/B test.
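One common way to check whether a difference between the two groups is more than noise is a two-proportion z-test. This is a sketch of that general technique, not a description of how we analysed our own tests; the function name and the example counts are made up.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the assumption that A and B behave the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: players who came back the next day, out of 500 each.
z, p_value = two_proportion_z(200, 500, 170, 500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With cohorts of 500 per version, only fairly large differences will stand out clearly, which is one reason bold tests are easier to read than subtle ones.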

For those of you who studied optimisation, this might feel familiar. It becomes close to a version of the steepest ascent algorithm. This means that you might be able to optimise your way to the top of the nearest hill (a local optimum), but you will not be able to jump to a nearby even higher hill, since that would require you to go downhill a short distance before you start climbing up again. It’s like a blind person climbing a hill. He will get up on the hill, but cannot see the higher hills nearby.
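To make the hill-climbing picture concrete, here is a toy sketch on a one-dimensional landscape. The `HEIGHTS` values are invented for illustration: a small hill peaking at position 2 and a taller one at position 8.

```python
# Hypothetical "commercial potential" at each position of a 1D feature axis.
HEIGHTS = [0, 2, 4, 2, 1, 3, 6, 9, 12, 8, 3]


def hill_climb(start: int) -> int:
    """Step to the better neighbour until no neighbour is higher."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if 0 <= n < len(HEIGHTS)]
        best = max(neighbours, key=lambda n: HEIGHTS[n])
        if HEIGHTS[best] <= HEIGHTS[x]:
            return x  # no uphill step left: we are on a peak
        x = best


print(hill_climb(0))  # stops on the small hill at position 2
print(hill_climb(5))  # reaches the taller hill at position 8
```

Starting at position 0, the climber never finds the taller hill, because getting there means walking downhill through positions 3 and 4 first. Where you start matters as much as how well you climb.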


So, how well can this approach work in real life, and what are the limits? Some of the wildest things we have done are fairly bold tests. On an upcoming level-based puzzle game, we were unsure about the difficulty and how fast we should introduce new concepts. To find an answer, we tried out 3 versions: the original one that felt a bit slow for us, another that had dropped every second level (and thus introduced concepts twice as quickly), and a third one that had dropped out 2 of 3 levels (introducing new concepts three times faster). In this particular test, the original fared best, closely followed by the double speed version. The triple speed version was way worse – apparently confusing people with too much info.

We had another fairly ambitious test when developing Benji Bananas. One of our role models for the game was Tiny Wings, another was Jetpack Joyride. If you have played both of these games, you will know that while both are endless runners, they have different ways to end the game. Jetpack Joyride will kill you as soon as you make a mistake, and force you to start over. Tiny Wings, in contrast, will forgive your mistake, and let you continue. It will just cost you a few seconds of time, and eventually time will run out, which ends the game.

So, which one should we adopt for our cute swinging monkey? Should we be forgiving or should we be harsh when the player makes a mistake? We thought we knew the answer, but wanted to test anyway. Actually, I have since asked whole roomfuls of game professionals which way they think we should go, and about 2-to-1 favour our own intuition.

We were fairly sure we should be forgiving and go the Tiny Wings route. After all, it’s a very cute and casual game. When we tested it, however, the Jetpack Joyride-inspired instant-death version won out.

We did not believe our data. We were sure we must have made a mistake somehow. So, we improved both versions (and especially the timed/Tiny Wings version) and ran another test. With the same result.

We still did not believe it, and polished up things a third time, only to get the same result a third time. After a few months of wasted effort, we finally accepted the data and moved on.

After 9 months of tests like this, we finally had a working game. Some measures had improved by a lot over the course of testing and iterating. For instance, measured by how many players completed 100 games or more during a few weeks after downloading, we had improved from 0.5% when we started testing to 20.5% in a version before launch. That’s about 40X improvement – so much for A/B testing only being about small improvements!


A few more hints if you too decide to do some wild testing:

One concern is that you might tarnish your brand or the game brand when doing this. There’s an easy solution for it. You simply invent another name that you use during the testing period. If you are really concerned, you can have another company account as well. And if you are super concerned, you can switch out some key graphics so that no one can recognize your famous IP. With these measures, you get to test things out completely anonymously. The game will speak for itself and no positive or negative brand associations will tarnish your data. If the ideas you’re testing turn out to be bad, you can just quietly kill things with no bad PR as a result.

I would, however, suggest that you do not involve money in these early and risky tests. As long as you are giving away free entertainment that no one can in any way pay for, I think it is fair to run a few tests and see how people react. When you have paying customers later on, you need to be way more risk averse.


Until next week!


Testing and Iterating

I have a great idea: let’s throw away half of everything we do!

This is about the process of testing and iterating on a game until it works, or you decide that it will never work well enough. There are a huge number of ways to test a game, all with their own weaknesses and strengths. You should likely use a combination of several of these. The constant iteration and testing means that you will design and implement a lot of things that you end up throwing out. It’s frustrating, but it works.

Testing out a game will usually begin with a small number of other game experts discussing the high level drafts. At this point, I am convinced that you should already be talking to others about the idea. It is more likely that you lose money because you made the wrong product, than it is likely that you lose money because someone heard about your idea, copied it and stole your market. Just talking to others in the industry might very well help you make a much stronger concept to start with.

Once you have a first prototype, you can start testing it out on friends, family and other unfortunate people you happen to meet. At this stage, they can give you some general pointers about how interesting they find the concept, and help you roughly figure out who might be the target audience and who definitely is not. Just remember that it is very, very rough at this stage. Do not assume that your friends are in any way a representative sample of your customer base.

When we are a little further along, we have often been testing games in the lobby of our nearby university. Of course, the sample of people is again clearly skewed, but we can catch early UI misses this way.

We take a smartphone or tablet loaded up with our latest game version in one hand, and our own smartphone in the other hand. Then we stop a random person in the lobby, and ask them if they would like to help us out by playing our games for a minute. We hand them the device with the game, and record a video of their fingers (and voice) with the other smartphone. Then we just say nothing, apart from encouraging them to speak their minds.

It is quite common for the first test to reveal that 7 out of 10 participants had trouble at the same spot in the tutorial. We fix that, and then go back to do 10 more such tests.

A more automated way to get such tests done – as well as going a bit deeper into the game – is available at PlaytestCloud. They don’t stop people in lobbies, but rather have people test play a game while recording what is happening on their screens and what they are saying. The game company then gets a video of the whole thing, and can watch and annotate that back at the office. It is a very useful service.

We have also done some more traditional user experience testing with several cameras, one-way mirrors and questionnaires. While they work, they are quite cumbersome and, in the end, no more useful than the lobby testing or PlaytestCloud.

The deepest qualitative testing we do is in collaboration with our nearby university. Here we wire people up with an Emotiv EPOC brain scanner, and put a Tobii eye-tracking device in front of them.

Together with videos, this allows us to see exactly what they are experiencing and where they are looking. It is useful for pinpointing some very specific problems in the game.

So far, it has been all about user experience testing. Of course, you should also test the game functionality technically. On the Apple side, there is a somewhat manageable set of devices. On the Android side, there is not. (Our Benji game has reported over X thousand device versions that it runs on).

TestDroid is a convenient service where you can test out your app on a huge number of different Android devices/versions. We simply make the game play itself and record it doing that. There are, of course, multiple other options as well for how you might outsource technical quality assurance, and a lot of companies offering such services.

At this point, we have tested the game out conceptually, technically, and with a limited number of players that we have listened carefully to. It is now time to go for larger numbers and start working statistically.

We try to go into pre-alpha soft launch with our games as soon as possible, and then develop the games in iterations, gathering feedback all the time. We release the game in some place on the other side of the world (to make sure our friends do not influence the data), and advertise to get small cohorts of users. Usually, we buy some 500-1000 users in each round we test.

To have a look at what these users do in the game, we need some analytics software integrated. So far at Tribeflame, we have made our own bare bones version, as well as integrated a number of others like Flurry, Game Analytics, Google (Play) Analytics, DeltaDNA, etc. Some solutions are very basic, while others are quite comprehensive. The important part is that you can see at least some basic numbers about where you lose players during the first sessions, and you are able to track retention numbers over the first month.
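As a rough sketch of the kind of basic retention number meant here: given each player's install date and the dates they played, day-N retention is the share of the cohort that played exactly N days after installing. The data layout and the name `day_n_retention` are assumptions for illustration, not the format of any of the analytics tools named above.

```python
from datetime import date


def day_n_retention(install_dates: dict, play_dates: dict, n: int) -> float:
    """Share of installed players who played exactly n days after install."""
    retained = 0
    for player, installed in install_dates.items():
        days_played = play_dates.get(player, [])
        if any((day - installed).days == n for day in days_played):
            retained += 1
    return retained / len(install_dates)


# Hypothetical cohort of four players.
installs = {
    "p1": date(2024, 1, 1),
    "p2": date(2024, 1, 1),
    "p3": date(2024, 1, 2),
    "p4": date(2024, 1, 2),
}
plays = {
    "p1": [date(2024, 1, 2)],
    "p2": [date(2024, 1, 2)],
    "p3": [date(2024, 1, 9)],
}

print(day_n_retention(installs, plays, 1))  # day-1 retention
print(day_n_retention(installs, plays, 7))  # day-7 retention
```

Computed per test cohort, numbers like these are what you compare from one iteration of the game to the next.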

Each form of testing will give you its own unique look into some aspect of how the game works. None of them will give you the complete story, but they complement each other nicely. The soft launch metrics from thousands of players show you how people behave with good certainty, but they do not tell you why they behave like that. In contrast, small groups of players that you meet face to face, or bring in through PlaytestCloud, will be able to describe the problems much, much better, but on their own, they are only a small biased sample. Together these two approaches give quite a good picture of how the game works.

Social features in Games

How to best use social features in games is changing. It is now less about reaching real-world friends for virality, and more about forming in-game communities of strangers with retention as the goal. Let me explain.

The big boom for social games came with Facebook. Games like Mob Wars came in 2008, while Farmville took off in 2009. This first wave of social games was engineered for virality above everything else. They kept pestering their users to post to their friends, and to get those friends to also start playing the game.

The social features of these games were not really that deep. The games behaved sort of like my 2-year-old son. Here, he has loudly demanded that his uncle play with Legos with him – only to then completely ignore said uncle while happily playing next to him. They are both doing the same thing, but with very limited interaction.


That still has some value, even though there was widespread scorn for the term “social” when describing those games. There is social proof in having friends doing the same thing you do. The mainstream consumer starts doing something only when all their friends and acquaintances are also doing it.  

These games used a variety of ways to get people to invite their friends. There were suggestions that you brag about every achievement you got in the game by posting as visibly as possible on your Facebook wall. There were walls to unlock more gameplay that could only be passed by connecting to 3 or more friends in the game. And there were ways to send gifts to each other, in the hope of triggering the social obligation of reciprocation from your friends. (Have a look at Cialdini’s book “Influence: The Psychology of Persuasion” for more on tricks like these.)

All this was done to achieve a good “k-factor”, which is the measure of virality. The k-factor means “how many new customers does every existing customer bring in”. There’s an excellent explanation of it here.

In short, if your k-factor is above 1, that means that the game spreads on its own. You just need to seed it with some customers. Say you bring in 1000 customers through featuring and advertising. If the k-factor is 2, they will bring in 2000 of their friends, who will in turn bring in 4000 of their friends, etc. Eventually the whole world plays your game! (Or, what actually happens: the k-factor declines over time).
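The arithmetic in that example can be sketched as follows, assuming (unrealistically, as noted) that the k-factor stays constant from one generation of players to the next. The function name is made up for illustration.

```python
def viral_generations(seed: int, k: float, generations: int) -> list:
    """Size of each wave of new players when every player brings in k more."""
    sizes = [seed]
    for _ in range(generations):
        sizes.append(sizes[-1] * k)
    return sizes


# 1000 seeded players with a k-factor of 2, followed for two generations.
print(viral_generations(1000, 2, 2))  # [1000, 2000, 4000]
```

With k above 1 every wave is bigger than the last, which is exactly why the constant-k assumption cannot hold for long.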

If the k-factor is below 1 (which it usually is), then it still means that your marketing is cheaper. If you spend $3 per download to get people to download your app, you will eventually get 2 downloads for that price if your k-factor is 0.5, bringing your effective cost per download to $1.50.
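The reasoning behind that number: with a constant k-factor below 1, each paid download eventually yields 1 + k + k² + ... = 1 / (1 - k) downloads in total. A small sketch under that assumption, with function names invented for illustration:

```python
def downloads_per_paid_install(k: float) -> float:
    """Total downloads that eventually follow from one paid download (k < 1)."""
    if not 0 <= k < 1:
        raise ValueError("the geometric series only converges for 0 <= k < 1")
    return 1 / (1 - k)


def effective_cost_per_download(paid_cost: float, k: float) -> float:
    """Paid cost spread over all the downloads it eventually brings in."""
    return paid_cost / downloads_per_paid_install(k)


print(downloads_per_paid_install(0.5))        # 2.0
print(effective_cost_per_download(3.0, 0.5))  # 1.5
```

Even a modest k-factor is worth having: at k = 0.2, the same $3 download effectively costs $2.40.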

So far, this has been about the early focus on getting virality up by bringing in the real-world friends and acquaintances of the players. Early mobile games also tried to boost virality with similar methods, but it was way harder to get it to work well. New games are more focused on retention than on virality.

To get virality, you should focus on the player’s real world friends, but to get retention, you want to build new in-game connections between strangers.

Social features are good drivers for retention, but only when some demands are met. Players can come back to a game for a variety of social reasons. If there are clans or guilds, players will feel a social obligation to play and contribute to their clan. With competitive features, people will be comparing their own progress to peers and try to keep up.

The problem is that both of these only work with players who are at roughly the same level. If I start playing any of the King games right now, it will not inspire me much to see my wife at level 245. If anything, I might get disheartened and think that I will never be able to catch up.

Similarly, when I play Clash Royale in my friend’s clan, I am actually dragging him down. He’s way more interested in the game than I am, and he plays it a lot more, and better, than I do. Which means that I should not really be in his clan. It would be in his interest to have better players than me in the clan. If he keeps to the clan that I am in, the social features will quickly become a liability rather than an asset. He is likely to stop playing, just as I stopped playing. If he moves to a clan with his own level of players, the social pressure is kept constant, and he is way more likely to stick around.

I think that this is a universal rule: it is unlikely that your friends are interested in exactly the same games as you are, and that they are equally skilled at them. Therefore, we can build games that try to get people to invite their real-world friends, but that is for short term virality. For the long term, we should transition players into making new friends in the game. Friends that share their interest in the game, and are playing at the same level.