## In Defense Of A/B Testing

Recently, A/B testing has come under (unjust) criticism from different circles on the Internet. Even though this criticism contains some relevant points, the basic argument against A/B testing is flawed. It seems to confuse the A/B testing methodology with a specific implementation of it (e.g. testing red vs. green buttons and other trivial tests). Let’s look at different criticisms that have surfaced on the Web recently and see why they are unfounded.

### Argument #1: A/B Testing And The Local Minimum

Jason Cohen, in his post titled “Out of the Cesspool and Into the Sewer: A/B Testing Trap,” argues that A/B testing produces a local minimum, while the goal should be to reach the global minimum. For those who don’t understand the difference between a local and a global minimum (or maximum), think of a metric such as the bounce rate as a function of the different elements on your page. It’s like a region in space where every point represents a variation of your page; the lower a point is in space, the better it is. To borrow an example from Jason, here is the issue with the local vs. global minimum:

As even Jason acknowledges in his post, this argument isn’t really concerned with A/B testing, because the same methodology could be used to test radical changes to get to the global minimum. So, calling it an A/B testing trap is unfair because it doesn’t have anything to do with A/B testing. Rather, the argument uncovers the futility of testing only small changes.

So, if A/B testing is not the culprit, are local minima the real issue? No, even the theory of discounting local minima is flawed. The image above shows you a very simple one-dimensional fitness landscape. You can imagine the x-axis as the background color and the y-axis as the bounce rate. Jason’s argument goes something like this: if you tested dozens of shades of blue, you might decrease your bounce rate, but if you tried something completely different (such as yellow), you might achieve the absolute lowest bounce rate possible on your page.

There are two problems with this argument…

#### 1. You Never Know for Sure Whether You’ve Found the Global Minimum (or Maximum)

The global minimum (or absolute best) exists only in theory. Let’s continue with the example of an extreme yellow background giving you the global minimum (in the bounce rate). Upon further testing, what if you found that no background color at all gave you a lower bounce rate? Or better yet, that a background full of lolcat images gave you an even lower bounce rate? The point is, unless you have reduced the bounce rate to 0% (or raised the conversion rate to 100%), you can never be confident that you have indeed achieved the global optimum.

There is another way to determine whether you have found the global optimum: by exhausting all possibilities. Theoretically, if your page didn’t contain anything other than background color (and you couldn’t even add the background image because, well, your boss hates it), then you could cycle through all background colors available and see which one gave you the lowest bounce rate. In exhausting all possibilities, the color that gives you the lowest bounce rate should be the one that is absolutely the best. This brings us to the second point…

#### 2. It’s Not Just About the Background Color, My Friend

When optimizing a Web page, you can vary literally hundreds or thousands of variables (background color being just one of them). Headline, copy, layout, page length, video, text color and images are just a few such variables. Your goal for the page (in terms of conversion or bounce rate) is determined by all of these variables. This means that the fitness landscape (as seen in the images above) is not one-dimensional and never as simple as it appears. In reality, it is multi-dimensional, with a ton of variables affecting the minima and maxima:

Again, imagine the peaks as your conversion rate (or bounce rate) and the different dimensions as the variables on your page (only two are shown here, but in reality there are hundreds). Unlike in the one-dimensional case, exhausting all possibilities in a real-world scenario (i.e. in conversion optimization) is impossible. So, you are never guaranteed to have found the global maximum (or minimum). Lesson to be learned: embrace local minima.
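To put rough numbers on the point about exhausting all possibilities, here is a minimal sketch in Python. The page elements, the option counts and the traffic figures are all made up for illustration; only the multiplication matters.

```python
from math import prod

# Hypothetical page elements and how many options we consider for each.
options_per_element = {
    "background_color": 16,
    "headline": 5,
    "button_text": 4,
    "layout": 3,
    "hero_image": 6,
}

def total_variations(options):
    """Number of distinct pages when every option can be combined with every other."""
    return prod(options.values())

def days_to_test_all(options, visitors_per_day, visitors_per_variation):
    """Days needed to show every variation to the required number of visitors."""
    return total_variations(options) * visitors_per_variation / visitors_per_day

one_dimensional = {"background_color": 16}

print(total_variations(one_dimensional))       # 16
print(total_variations(options_per_element))   # 16 * 5 * 4 * 3 * 6 = 5760
# Assumed traffic: 2,000 visitors a day, 1,000 visitors needed per variation.
print(days_to_test_all(options_per_element, 2000, 1000))  # 2880.0
```

With a single variable, cycling through every option is a 16-variation test. With just five variables, the same exhaustive approach would already take years at this (hypothetical) traffic level, and a real page has far more than five.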

### Argument #2: A/B Tests Trivial Changes

Rand Fishkin of SEOMoz posted an article titled “Don’t Fall Into the Trap of A/B Testing Minutiae,” in which he reiterates Jason’s argument to not waste time testing small elements on a page (headline, text, etc.). His main argument is that getting to a local maximum (by testing trivial changes) takes up too much energy and time to make it worthwhile. See the image below, reproduced from his blog but modified a little to make the point:

The first point to make is that the opportunity cost is not the time required to run the test (which is weeks) but rather the time needed to set up the test (which is minutes). Once you have set up the test, it is pretty much automated, so you risk only the time spent setting it up. If an investment of 15 minutes to set up a button-color test ultimately yields a 1.5% improvement in your conversion rate, what’s wrong with that?

Many A/B testing tools (including Visual Website Optimizer; disclaimer: my start-up) make setting up small tests a no-brainer. They also monitor your test in the background, so if it isn’t a winner, it is automatically paused. What’s the risk then of doing such trivial tests? I see only the upside: increased sales and conversions.

To make his point, Rand gives the example of a recent Basecamp home page redesign, by which Basecamp managed to increase its conversion rate by 14%. Can you imagine the kind of effort that went into such a redesign (compared to a button-color test)? In fact, because the fitness landscape is multi-dimensional (and very complicated), a total redesign has a much higher probability of performing worse. A complex design can go wrong in many more ways than a simple button color can. Because we never hear of case studies of redesigns gone wrong (hello, survivorship bias), we shouldn’t conclude that testing radical changes is a better approach than testing minutiae (especially because radical changes require a huge investment in effort and time compared to small red vs. blue tests).

With a local minimum (or maximum), you at least know for sure that you are increasing your conversion rate, which leads directly to increased profit. This isn’t to say that we should give up on our hunt to achieve the global optimum. The global optimum is like world peace: incredibly hard to achieve, but we have to keep moving in that direction. Lesson to be learned: the ideal strategy is a mix of both small (red vs. blue) tests and radical redesign tests. By jumping across the mountains in the conversion rate fitness landscape, you ensure that you are constantly seeking better conversion rates.
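The mixed strategy can be sketched in a few lines of Python. The bounce rates below are invented, and the 12 “designs” are simply positions on a one-dimensional landscape like the one discussed earlier; treat it as an illustration of the idea, not as how any testing tool works.

```python
# Hypothetical bounce rates (%) for 12 neighbouring design variants.
# Index 2 is a local minimum; index 9 is the lowest point in this list.
bounce_rate = [62, 55, 50, 54, 60, 66, 58, 47, 41, 38, 44, 52]

def refine(start):
    """Small tests only: move to an adjacent variant while that lowers the bounce rate."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1)
                      if 0 <= i < len(bounce_rate)]
        best = min(neighbours, key=lambda i: bounce_rate[i])
        if bounce_rate[best] >= bounce_rate[current]:
            return current  # no neighbour is better: we are at a local minimum
        current = best

def refine_with_redesigns(starts):
    """Mixed strategy: try several radically different designs, then refine each one."""
    return min((refine(s) for s in starts), key=lambda i: bounce_rate[i])

small_only = refine(0)
mixed = refine_with_redesigns([0, 6, 11])

print(small_only, bounce_rate[small_only])  # 2 50
print(mixed, bounce_rate[mixed])            # 9 38
```

Starting from the first design, small tests alone settle on design 2 (a 50% bounce rate). Adding two radically different starting points finds design 9 (38%). Even then, nothing guarantees that 38% is the global minimum; it is only the best of what was tried.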

### Argument #3: A/B Testing Stifles Creativity

Jeff Atwood compares the movie Groundhog Day to (surprise, surprise) A/B testing and concludes that because the protagonist failed in the movie, A/B testing must also fail. Stripped of all (non-)comparisons, Jeff suggests that A/B testing lacks empathy and stifles creativity. He goes on to cite a tweet by Nathan Bowers:

> A/B testing is like sandpaper. You can use it to smooth out details, but you can’t actually create anything with it.

Whoever claimed that A/B testing is good for creating anything? Creation happens in the mind, not in a tool. The same flawed reasoning could be applied to a paint brush:

> A paint brush is like a stick with some fur. You can use it to poke your cat, but you can’t really create anything with it.

A/B testing, like a paint brush, is a tool, and like all tools, it has its properties and limitations. It doesn’t dictate what you can test; hence, it doesn’t limit your creativity. A/B testing or not, you can apply the full range of your creativity and empathy to coming up with a new design for your website. It is up to you whether to go with your gut and implement it on the website immediately or to take a more scientific approach and determine whether the new design converts better than the existing one. Lesson learned: A/B testing is a tool, not a guidebook for design.

### Summary

To reiterate the lessons learned from the three arguments above:

• Because you can never be sure that you have reached the global minimum, embrace local minima. Testing trivial changes takes a few minutes, but the potential outcome is far greater than the cost of those minutes.
• Constantly explore the best ways to increase your conversion rate by performing both trivial tests and radical redesign tests at regular intervals.
• A/B testing is a tool and does not kill your imagination (in fact, you need your imagination most when designing variations).
• Lastly, don’t feel guilty about performing A/B testing.



Paras Chopra is founder of Visual Website Optimizer, the world's easiest A/B testing tool. Used by thousands of companies worldwide across 75+ countries, it allows marketers and designers to create A/B tests and make them live on websites in less than 10 minutes.

1. 1

### Jay Dalisay

Well written.

2. 2

### Daniel Wheeler

OK, so here is a question for you all…..

When do you stop testing a page? For example, I recently started doing A/B testing on this page (as well as many others, but let’s take this as an example): http://uk.computers.toshiba-europe.com/innovation/generic/home-computing-laptop-range/

Long story short: the design I created managed a retention rate of 96% and a CTR of 94% (through to the next level in the user journey) to start with. We then implemented A/B testing on small elements of the page to try to increase the CTR, but I’m sure you can agree that improving on 96% is pretty much impossible.

So, when do you deem a page totally complete in its successful-ness, and when do you stop testing? Is 96% good enough to stop, or should you always try to achieve 100%, even if it costs you a lot of time and resources to get the last 4%?

Is 100% even possible? Anyone got any examples of a 100% CTR success story?

Thoughts?

• 3

### Giles

Stop testing when the increase in benefit is worth less than the time spent testing.

If you only have 4% improvement left (presumably this is also the hardest to capture), how much revenue will that 4% generate, and is that more than the cost of your time to test and improve the last 4%?

• 4

### Daniel Wheeler

A valid point, Giles, and I too have wondered the same thing.

However, without knowing how much effort it will take to get the 4%, how do you know how long it will take?

It’s a tricky issue.

Thanks for the feedback!

Dan

• 5

### Scriptin

Testing itself doesn’t increase any parameters (CTR, conversion etc.); it just _shows_ you how your _changes_ increase them.

So, the question should be ‘how do I know when to stop making changes?’ or something like that, not ‘when do I stop testing?’. You can make changes without testing at all, and you can stop testing at any moment. Better to think about when to stop improving your site.

100% is absolutely unreachable, because some users are just not interested in your site and get there by accident. Other users may be interrupted while on your site. And so on.

3. 6

### Karim

Great article! I couldn’t agree more!

4. 7

### Sulpher

Thanks for great article!

5. 8

### Max Luzuriaga

Some interesting points made here. I’ve personally never tried A/B testing, but then again, most of the sites I’ve designed haven’t been suited to A/B testing (blogs, etc.).

I would be happy to try it on a product page, if I were to ever design one.

• 9

### JMarsh

Try it on a blog! You might be surprised how much the color, position, or size of your “read more” links affects the number of people that click them. :)

6. 10

### JMarsh

Excellent article. I wasn’t even aware that A/B testing was under attack, and the fact that it is, is completely moronic. As someone who regularly does split and multivariate testing on design projects I think it is appalling that any professional would make such faulty claims about rigorous testing.

I have personally seen increases of 400% by changing a button color, 17% in global sales (millions of users) by removing a few payment options on a screen, 10% increase in value-per-transaction globally (again, millions of users) by changing the order of 3 buttons, and almost 20% by making “trivial” copy changes on a title and a button.

Anyone who attacks A/B testing either doesn’t understand it, or isn’t very good at it. And, if I may, I would like to add one more major point to your side of the argument:

During an A/B test the variables being tested are not random, as your data implies. Tests should be done by an interaction designer who knows what to look for and knows what factors could be influencing the conversion. I always have a goal in mind when I test and I even create some versions that I expect will fail, just to test my theories. I have never done an A/B test that delivered negative results, and often the results are better than expected.

The people arguing against A/B testing should be ashamed of themselves and it is sad that you had to write this article at all. That being said, well done.

• 11

### Paras Chopra

Great point about having a goal in mind. You are right, doing A/B testing randomly is not a great strategy. In fact, there should always be one or more goals in mind when you do A/B or multivariate testing.

7. 12

### RussellUresti

Yeah, as you pointed out, the issue they seem to have isn’t with A/B Testing, but with incremental changes vs. complete redesigns – and that’s just pretty moronic. A complete redesign takes a lot of time and effort and money to complete; it’s not something you can do once a month. Meanwhile, incremental changes are quick, easy, and relatively cheap; they can be executed every 2 – 6 weeks and result in small improvements. It’s called refinement. A design will never be perfect right off the bat, so even if you do a complete redesign (something drastic), there is still room for improvement, so you’ll never get to that “global minimum” through drastic actions – you can get close, but refinement is still necessary.

• 13

### Paras Chopra

Wow, great points. I agree that a design has to exist before you can think of optimizing it using A/B testing. And that is precisely the point many people misunderstand. Testing and optimization kicks in once you have a basic design in place. So, criticizing A/B testing for stifling creativity is unjust because it was never meant for that stage in the first place. Every ace artist knows that he must use the right tool for the job at hand.

8. 14

### Theodor

Together with my colleague I have done over 150 A/B tests on various sites. In the beginning we fell into the trap of testing small things like buttons, colors etc. To gain an uplift from small changes you need a lot of traffic. Nowadays we test only bigger changes when the traffic is below a certain point. The gains from our A/B testing have been significant. On one page, for example, we had to try three different versions before we gained a 40% increase in conversion. Needless to say, this really rocked our revenues. So anyone who has extensive experience in A/B testing will tell you that it is so worth the money. The ones that are fresh to the issue will run into many methodological traps and say that A/B testing does not work. Who do you trust?

9. 15

### Chris Goward

You make excellent points, Paras.

My feedback:
1. I don’t see any reason the pursuit of global minima and local minima need to be mutually exclusive. We recommend a strategy for our clients that combines both, which provides a mix of incremental improvement and learning with opportunities for dramatic improvement.

2. A/B/n testing does not need to focus on trivial changes. That’s more of a Multivariate testing approach. A/B/n variations that include dramatic redesigns can lead to large increases when planned properly. If you trust your hypotheses, the dramatic redesign approach ends up being more labour efficient as you avoid multiple tests of minutia.

3. I agree with the sentiment that A/B/n testing usually does stifle creativity, but your rebuttal is right on. A/B testing should be a tool to validate that your insightful, creative ideas also work for the business! A/B testing should be a tool to prove the value of your creativity. A designer who rejects A/B testing is a designer who is either: a. not concerned with business results, or b. not confident in her ability to produce positive business results.

• 16

### Paras Chopra

Thanks for commenting, Chris.

Exactly. There should be a healthy dose of both global and local optima. Discounting one at the expense of the other is not wise. My point was related to neglecting testing small changes because then you are optimizing locally (which critics think is a waste of time). Though I agree large scale changes have their own place and time, there is no reason why one shouldn’t test small changes if it increases sales and conversions. For the (small amount of) time invested in designing and testing small changes, if you don’t do it — it is like leaving money on the table.

• 17

### Chris Goward

Yes, there is no reason not to test small changes, unless the traffic levels are a bottleneck that are stopping you from running the next test. You should always have a test in market. Complex Multivariate tests of small changes often run into test duration problems (ie. they take forever to complete).

In general, we find our best results from large changes.

• 18

### David Oh

Nobody ever talks about global and local minima and maxima when practicing science, and A/B testing is just science applied to the particular goal of maximizing yield per visit. Not doing it is ridiculous. If you reach 96% CTR on a page, then you make macro changes to your flow before and after that particular page. A good strategy is go dynamic with another page, using dynamic ajax to increase reliability so that the user never sees another page loading.

You don’t always have to change a single variable when designing your experiments; in fact, sometimes you are forced to change more than a single variable. The trade-off is that your specific learnings decrease the more variables you change, but your probable spectrum of gains or losses may increase. And that’s how science is done all the time, in fields that do not include conversion optimization. Sometimes it’s necessary to go back and reduce variables, and sometimes you want to go even wilder and test even more.

11. 20

### Matthew

I agree with your points, and perhaps I should be writing my own blog on this, but isn’t the real issue with A/B testing the simplistic way it measures improvement?

In fact, you could be getting 40% conversion, with customers who find your site trustworthy and will remain loyal to you, only to have an A/B test convince you to change your design for a 60% conversion instead, where you’ve scared away the 40% who would have remained loyal in exchange for capturing more people who are willing, but skeptical.

In the long term, simple A/B testing still has its problems.

• 21

### David Oh

Matthew,

In your scenario, your hypothesis is that there is a real difference in returning visitors’ expectations of the design.

Isolate the changes so that only new visitors see it. Then you are testing new, non-returning visitors only. Voila, you have also created a funnel which differentiates between new and returning visitors, which is a common thing to do.

As for your comment regarding “conversion rates not telling you the quality of leads”.

You are absolutely right, but you don’t have to measure just one thing when running AB tests. You can measure both lead rate, and then their initial first entry propensity for sales (and maybe 24 hours, and then 1 week, and so on)

12. 22

### Anonymouse

> Basecamp managed to increase its conversion rate by 14%. Can you imagine the kind of effort that went into such a redesign

Given that 1% improvement took 15 minutes, we can calculate the maximum time they should have spent on the re-design: 3 hours. Actually, that’s low because we haven’t accounted for the time spent on failed experiments. So let’s multiply that 3 hours by 4 or 5.

The real answer is to *manage* risk. A “complete redesign” is risky because you’re operating “open loop” without customer input. (Think of the great Netscape to FireFox re-write, or Windows XP to Vista rewrite.) An “incremental-only” approach is risky because you can get stuck in a local minimum. (Think of Windows NT/2000 “playing it safe” while MacOSX changes architectures and radically improves their UI — and starts picking up market share.)

13. 23

### Cory

This whole argument seems very short sighted. First, I’ve dealt with A/B split testing for years, and perhaps marketers think it’s just “15 seconds” to set up a test, but it’s not. You have to have two designs, two implementations developed, and that takes time.

The second problem is, that even small yields (1-2%) can’t be proved to be more than coincidental or small anomalies.

Third thing is, when your focus is “marketing” based, or “sales” based, you miss the obvious opportunity to innovate. A perfect example is Apple Inc. In the early 80’s they were innovation focused and they were 10 years ahead of the competition. Once they fired Steve Jobs the whole company was run by salesmen. Now, Jobs is a good salesman, but he was an innovation focused leader. Once the “salesmen” of the late 80’s and early 90’s took over, there was no innovation; it was a matter of tiny, incremental updates that led to them going nearly bankrupt until Jobs returned. Once he returned, they focused on innovation and brought us the iPod, iPhone and OS X. We wouldn’t have these things today (no copycats like Android either) if it weren’t for someone taking “risks” and not letting the “familiar” and “easy” force them into minor revisions.

It’s not A/B split testing per se, it’s the mentality that people have who usually end up using A/B split or multivariate testing that leads to stifled creativity. People who let numbers rule their lives will be afraid to take risks and won’t allow innovation to penetrate their little bubble of local minimum. I know this because I’ve seen it happen too many times.

• 24

### David Oh

The argument isn’t short sighted, the argument is just very specific.

Your last paragraph is very cohesive– however, the problem is that the “process of continual refinement and improvement”, of which A/B testing is just one large tool, has become some kind of bogeyman for people who do NOT want any kind of metrics based testing in their lives.

1 to 2% is not a large enough yield improvement to provide statistical significance quickly, in my experience. Usually you want changes that result in at least a 10% change to provide chi-square significance in relatively good time. Otherwise it would take forever and ever, and your decision to make it the new control would be made outside of the boundaries of the test (e.g. your execs feel the new heading text more closely follows the “spirit” of the company).
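As a rough illustration of why small lifts take so long to confirm, here is a minimal sketch of a chi-square test on a 2×2 table in plain Python. The traffic and conversion numbers are hypothetical, and the lifts are assumed to be relative (a 2% lift means 10% becoming 10.2%).

```python
from math import erfc, sqrt

def chi_square_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """P-value of a 2x2 chi-square test (1 degree of freedom, no continuity correction)."""
    a, b = conv_a, visitors_a - conv_a   # control: converted, did not convert
    c, d = conv_b, visitors_b - conv_b   # variation: converted, did not convert
    n = visitors_a + visitors_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

# Hypothetical test: 10,000 visitors per version, control converts at 10%.
p_small_lift = chi_square_p_value(1000, 10_000, 1020, 10_000)  # 2% relative lift
p_large_lift = chi_square_p_value(1000, 10_000, 1100, 10_000)  # 10% relative lift

print(f"2% lift:  p = {p_small_lift:.3f}")
print(f"10% lift: p = {p_large_lift:.3f}")
```

With these made-up numbers, the 10% lift falls below the usual 0.05 threshold while the 2% lift is nowhere near it, so the smaller change would need far more traffic to confirm.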

• 25

### Paras Chopra

Hi Cory,

Great response. Well, the “15 seconds to set up a test” is true because my argument was about making small changes (like a headline test). It doesn’t require coming up with different designs.

Small changes are hard to prove to be statistically significant, but it is not impossible. Moreover, as per my original argument, you have to see the opportunity cost. If setting up a headline test takes a few minutes at maximum AND there are chances that you may see good results, why not do it? Even if you just see 1-2% improvement, what is it that you’ve lost there?

The last paragraph is a beautiful gist and is exactly what I wanted to say. A/B and multivariate testing are merely tools. As a marketer, you got to know when to use them for refinement or when to “innovate”.

• 26

### Cory

I suppose that in most cases it comes down to the wrong people being the decision makers. However in my experience the wrong people usually are the ones making the decisions, and testing such as this is just one of the tools that’s used to either stop or significantly slow down creativity.

This is usually how it goes: person A has good, innovative, and revolutionary ideas, while person B has a “slow and steady” mindset. Person A’s idea can be done in a month, but can’t get their ideas implemented because there is a 6 week test running and the test cannot be “disturbed”. Meanwhile, during the test, person A has other duties assigned, and by the time the test is done, person B wants to take the results of the test and do another 6 week test. Person B can do this because they’re in charge. A full quarter has gone by with no innovation and only minimal gains and they go through the whole loop again next quarter.

So, I agree with you that testing can bring in small gains, but the one thing testing can’t do is show you what you’re not gaining. What if a revolutionary change doesn’t just add but multiplies conversions?

Let me clarify that I’m not against testing at all. I’m against decisions that refuse to take risks. It may not always be appropriate or necessary to take big risks, but if I were to do a “split test” in my history on the amount of times smart risks did better than minimalist mentality, only anecdotally, the smart risk wins nearly every time (tested of course).

14. 27

### peach

I love SM, but to be honest I think the arguments presented against A/B testing were too lame to justify a whole article countering them. You could counter those arguments perfectly in 3 sentences; no need to waste more words on it.

Instead, I would love to see an article about doing some real statistics, how about a case study about multivariate testing multiple elements on multiple attributes, or doing some regression analysis on ecommerce website statistics!

• 28

### Paras Chopra

Hi Peach, thanks for commenting. Well, the response justified an article because the original criticisms themselves were long. It is easy to pass 3 sentences, but hard to ignore a full article.

15. 29

### Marc Watts

Nice article. I agree with some of what you wrote, but one sentence in particular ruined it for me.

“To reiterate the lessons learned from the three arguments above”

This sentence shows a fair amount of arrogance in your writing, as though what you wrote is fact and everyone has something to learn from you. Overall your arguments were extremely simple and maybe a little decorum would have ended it better.

• 30

### Paras Chopra

Marc, I am sorry if it came out to be an arrogant ending. I certainly didn’t mean it that way. The only reason why I re-iterated the points was to give a definite ending to the article which had lots of different (but connected) points.

16. 31

### martin

Just set up split testing for our front page last Friday, so this was a welcome read, and I was able to draw a few insights from users here with more experience in the field.

17. 32

### Artur Ejsmont

First time I’ve seen such a nice application of local/global minima!

Well done guys, very nice article!

art

18. 33

### LA

Seems to me that the argument against A/B testing is to not waste time with “small changes”. This argument is inherently flawed due to the fact that you would never know what a “small change” was unless you tested it.

Eg: I may think that a “small change” is changing the text of a shopping cart button from “add to cart” to “buy now”, so I don’t waste my time with making the change. Little do I know that if I did make the change and ran it through an A/B test to statistical significance, I would have increased my conversion rate by 30%.

19. 34

### Simon Day

In the web design world you absolutely CANNOT brush anything to one side, but especially A/B testing!

I use A/B testing to not only increase conversions but to also test designs and positional changes against bounce rates, time on site as well as conversions.

On a recent site I managed to get the bounce rate down from 74% to 9% by A/B testing and heat-mapping. Sales went through the roof because of it.

A/B testing is an INCREDIBLY powerful tool and anyone who thinks it is a waste of time is either a fool or doesn’t understand just how powerful it truly is.

20. 35

### Ryan Notz

I disagree with every argument you put forward, Paras.

Argument 1:
You misunderstood Jason’s original post. He is simply saying that it’s difficult to find the most optimum solution for a landing page (or even to make great strides in optimisation) if you aren’t willing to make big, radical changes sometimes. That is quite simply true, and he uses an analogy that you didn’t understand properly. His diagram, which you used at the top of your post, was an exaggerated cross section of his back yard in Austin. His analogy was water getting to the lowest point in his garden (admittedly, a somewhat confusing analogy). Water getting to a low point = landing page optimisation. Nothing he said should be contentious: he’s an advocate of A/B testing. The title: “A/B testing trap” simply referred to a trap that some people get into while A/B testing – the trap of only making small changes to an interface. You seemed to imply that he said A/B testing itself was a trap, which he did not.

Argument 2:
Your opening beef with Rand Fishkin was around the opportunity costs of A/B testing. You said: “The first point to make is that the opportunity cost is not the time required to run the test (which is weeks) but rather the time needed to set up the test (which is minutes).”

Ummm… no. If I’m running an A/B test on my homepage for button colours, I can’t simultaneously run another test. If my button colour test takes 4 weeks, that is 4 weeks that I have to wait until I can run another test. I call that an opportunity cost. I don’t give a crap about the 5 minutes it takes me to set it up.

Argument 3:
You took a pretty clear and rather insightful comment from Jeff Atwood and managed to confuse yourself and your readers along the way. He said: “A/B testing is like sandpaper. You can use it to smooth out details, but you can’t actually create anything with it.” Then you sarcastically replied: “A paint brush is like a stick with some fur. You can use it to poke your cat, but you can’t really create anything with it.”

While Jeff’s analogy isn’t perfect, it isn’t bad, either. Cabinetmakers use saws, drills, routers, lathes, etc to create something, then they sand it down for finishing touches (and maybe even paint it – that’s actually what a paint brush is for). When a designer needs to create a new interface, they look at business needs, speak with customers, developers and stakeholders, then wireframe, prototype and design a new interface. Once user tested and live, A/B testing should be used to fine tune the interface.

This is all basic stuff, and unfortunately it’s easily forgotten. All that you demonstrated in your post is that you never understood the basics to begin with.

Lessons learned:
— Read carefully before attempting to debunk another blogger
— Review your own logic carefully before publishing
— If you’re not an expert, don’t claim to be: blagging your way through life rarely works

21. 36

Interesting.
