Georgi Georgiev (Guest Post): “The frequentist vs Bayesian split in online experimentation before and after the ‘abandon statistical significance’ call”


Georgi Georgiev

  • Author of Statistical Methods in Online A/B Testing
  • Founder of Analytics-Toolkit.com
  • Statistics instructor at CXL Institute

In online experimentation, a.k.a. online A/B testing, one is primarily interested in estimating whether and how different user experiences affect key business metrics such as average revenue per user. A trivial example would be to determine whether a given change to the purchase flow of an e-commerce website is positive or negative as measured by average revenue per user, and by how much. An online controlled experiment would be conducted with actual users assigned randomly to either the currently implemented experience or the changed one.

Despite excellent motivation, good alignment of interests, and a growing body of knowledge, unbiased estimates from several sources show the median true effect in online experiments to be approximately zero [1]. Half of the proposed changes to business websites and mobile apps would have had no effect or a detrimental effect on the respective business, had they been permanently released to all end users. The effects of most of the rest are measured in single-digit percentages, except for a long and thin positive tail. Such a median effect size creates the need to statistically discern true from false null hypotheses, while the prevalence of small effect sizes necessitates relatively high-powered experiments.
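
To get a sense of the sample sizes this implies, here is a rough, illustrative calculation using the standard two-proportion approximation (the 5% baseline conversion rate, 2% relative lift, and error rates are invented but typical numbers, not figures from [1]):

```python
from scipy.stats import norm

# Illustrative only: users per group needed to detect a 2% relative lift
# on a 5% baseline conversion rate, two-sided test at alpha = 0.05, 80% power.
p1 = 0.05                      # baseline conversion rate
p2 = p1 * 1.02                 # 2% relative lift -> 5.1%
alpha, power = 0.05, 0.80

z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
n_per_group = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"Users needed per group: {n_per_group:,.0f}")   # roughly 750,000
```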

Due to the competitive nature of business enterprises, there is a constant push to improve one’s experimentation program. In such an environment, it did not take long to feel ripples from the calls to abandon statistical significance and the “Moving to a world beyond p < .05” special issue of The American Statistician. However, major moves away from p-values and frequentist (error) statistics had already been under way in online A/B testing for several years prior.

Online A/B testing falls for the Bayesian allure

A noticeable shift from frequentist to Bayesian approaches occurred rapidly in 2015 and 2016, when two of the three most-used A/B testing software providers switched their statistical engines from simple fixed-sample frequentist tests to Bayesian approaches.

At that time, the major motivating factors behind the move away from statistical significance were:

  • The promise that ‘stopping rules do not matter’ as advocated by some proponents of Bayesian approaches.
  • A belief that the goal of A/B testing is to estimate the probability that a given hypothesis is true or false or the probability that a certain change is better than a default state of things, given some data.

The allure of ‘Stopping rules do not matter’ was inspired by a realization that stopping rules do matter if one cares to discern between false and true effects. This happened quickly in online experimentation due to the near real-time nature of the data. When you look at results that show your business losing (or failing to capture) 100’s of thousands of dollars per day / month, next to which there is a glowing sign saying ‘99% confidence’ (or even ‘100% confidence’), or an equivalently low p-value, it is easy to stop a test early. Do this a couple of times and one starts to realize that looking at statistics computed under a fixed-sample assumption hour by hour or day by day is a really bad way to figure out what has a positive impact. One of the earliest short studies of the issue included the pointed quote: “A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time.” [2]. A less technical route to the same realization was noticing that the results reported by the accounting team after the fact fell far short of the combined impact projected from the tests.
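
A rough simulation of the peeking problem (with invented numbers: an A/A comparison checked with a fixed-sample t-test after each of twenty “days”) shows how quickly the nominal 5% false positive rate is inflated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
days, users_per_day, n_sims = 20, 1_000, 2_000
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0, 1, (days, users_per_day))   # control; the true effect is zero
    b = rng.normal(0, 1, (days, users_per_day))   # "variant"; identical distribution
    for d in range(1, days + 1):                  # peek at the cumulative data daily
        _, p = stats.ttest_ind(a[:d].ravel(), b[:d].ravel())
        if p < 0.05:                              # stop at the first "significant" peek
            false_positives += 1
            break

print(f"Realized false positive rate with daily peeking: {false_positives / n_sims:.1%}")
```

With twenty looks at a nominal 0.05 level the realized rate typically comes out in the vicinity of 25%, not 5%.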

Instead of recognizing that the core issue with such a flawed process of optional stopping is the broken error guarantees, several leading software companies turned to measures which did away with such guarantees altogether, namely different flavors of Bayesian inference and estimation [3].

The belief that Bayesian accounts of probability are better aligned with the objectives of online experimentation than frequentist ones relies on a combination of severe twisting of the meaning of words, oversimplification, and mischaracterization of frequentist inference. The major culprit, in my view, was a naïve mixing of decision-theoretic approaches and Bayesian methods. To avoid a major tangent in the article, those interested in an overview of the Bayesian versus frequentist debate in online experimentation should refer to [4].

Developments since the call to abandon statistical significance

In the years since the 2019 special issue of The American Statistician, the main concern raised about p-values has shifted slightly, with the focus now on how one is bound to misinterpret a p-value. These critiques unsurprisingly include taking a non-significant result to mean there is no real effect, interpreting the observed p-value as the probability of the null hypothesis being true or false, and other valid concerns. Prominently featured is the argument that proposed Bayesian alternatives such as Bayes factors and posterior odds ratios are somehow easier to grasp, despite their objectively higher complexity.

The debate in the online experimentation community is therefore no different than the broader scientific discussion on the issue. It is, however, notable that what has primarily been put forward by some of the highest authorities in the business consists of simplistic explanations and appeals to the ‘intuitiveness’ of Bayesian probability.

How intuitive is Bayesian probability, really?

As a prime example, this “definition” comes from the product documentation of Google Optimize – a widely used A/B testing tool between 2016 and late 2023:

“Probability to be best tells you which variant is likely to be the best performing overall. It is exactly what it sounds like — no extra interpretation needed!”

The only clarification of the above statement comes in the form of an answer to the question “What is ‘probability to beat baseline’? Is that the same as confidence?”, with the answer reading: “probability to beat baseline is exactly what it sounds like: the probability that a variant is going to perform better than the original”.

The above is the whole definition users of the software were expected to be satisfied with in regard to the Bayesian measures of probability they were presented with. I believe this betrays an entitlement and arrogance on the part of Bayesians that is not typical elsewhere.

Skeptical of such appeals to intuitiveness, I conducted a small poll among practitioners to see if they really understand probability in Bayesian terms. See [5] for the poll and its results, as well as a broader discussion of the topic; a brief summary follows here.

The question asked respondents to imagine a test with a true null (no treatment administered to either group) and to say how a ‘probability’ measure would change from day one, with 1,000 users per group, to day ten, with 10,000 users per group. No definition of ‘probability’ was given, on purpose. The possible answers were that such a ‘probability’ would ‘Increase substantially’, ‘Decrease substantially’, or ‘Remain roughly the same as on day one’.

The third option is what is most likely to happen with most Bayesian software used in the field, including the most popular one (Google Optimize): in all of them the posterior odds would remain roughly unchanged. Yet that option was chosen by fewer than a third of respondents. Two-thirds answered ‘Decrease substantially’, which would make sense in a logical construct in which the null hypothesis is either true or false (a frequentist view) and reflects the behavior of a consistent estimator of such a ‘probability’.
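
The claim can be checked with a quick simulation of the poll’s setup (assuming Beta(1,1) priors and a ‘probability to beat baseline’ computed by posterior sampling, as in most tools; the 5% conversion rate is an invented placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps, draws = 0.05, 500, 10_000
day1, day10 = [], []

for _ in range(reps):
    data = rng.random((2, 10_000)) < p            # true null: both arms convert at 5%
    for n, out in [(1_000, day1), (10_000, day10)]:
        ca, cb = data[0, :n].sum(), data[1, :n].sum()
        post_a = rng.beta(ca + 1, n - ca + 1, draws)   # Beta(1,1) prior + observed data
        post_b = rng.beta(cb + 1, n - cb + 1, draws)
        out.append((post_b > post_a).mean())      # "probability to beat baseline"

for name, d in [("day 1", np.array(day1)), ("day 10", np.array(day10))]:
    print(name, "mean:", d.mean().round(2), "spread (std):", d.std().round(2))
```

Under the true null the measure neither climbs toward 100% nor steadily falls toward 0%; its average and spread look much the same at 1,000 and at 10,000 users per group.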

While the poll had just 61 respondents, they all self-identified as belonging to the higher brackets of online A/B testing practitioners. By no means conclusive on its own, the poll remains a rare attempt to quantify the merit of the claim that Bayesian probability is more intuitive than frequentist probability. I would challenge all those critical of p-values and frequentist probability to conduct better polls and show whether it is indeed the case that other frameworks are more intuitive.

Other recent developments

A relatively new angle of attack against p-values that has gained some traction in the last couple of years focuses on replicability as well as false positive risk (FPR), in line with Ioannidis (2005) [6] and Benjamin et al. (2018) [7]. As framed by Colquhoun (2017) [8]: “We wish to answer this question: If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive?”. The inability of the p-value to answer the above question is pointed out as a deficiency in need of addressing. The proposed way to address it is to supplement, or outright replace, the reporting of p-values with reports of FPR, a.k.a. false positive probability.
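
For readers unfamiliar with the calculation, the formulation typically cited (in the spirit of Colquhoun [8]) looks like the sketch below; whether it answers the question it is presented as answering is disputed, as argued further below and in [10]:

```python
def false_positive_risk(alpha, power, p_null):
    """Screening-model FPR: P(H0 true | 'significant' result), given an assumed
    share of true nulls among tested hypotheses, the significance threshold,
    and the power against the assumed alternative."""
    return (alpha * p_null) / (alpha * p_null + power * (1 - p_null))

# Illustrative numbers only: alpha = 0.05, 80% power, half of tested changes truly null.
print(round(false_positive_risk(0.05, 0.80, 0.5), 3))   # ~0.059
```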

A good example of the argument for false positive risk can be found in a paper by Kohavi, Deng & Vermeer (2022) [9]. Its motivating example is an egregious misuse of statistical inference and estimation deserving of every critique imaginable. The paper contains several good points to that effect. However, it also features the following suggestions:

  • A call for lower p-value thresholds
  • Critiques of p-values as being too prone to misinterpretation
  • A recommendation to report false positive risk alongside p-values, and to use it to guide how low a p-value threshold should be in order to achieve arguably desirable FPR values

p-value thresholds in online experimentation

It should be noted that there is no consensus threshold in the industry as a whole. Different companies or departments might impose their own thresholds or stick to the textbook alpha of 0.05 for lack of a deeper understanding. More advanced teams may choose the threshold for each A/B test, or each type of A/B test performed, so that it reflects the potential impact of the decision(s) it informs, using some kind of decision framework. A sample of thresholds used is available in [1].

In business experiments the risks and rewards associated with any given experiment are often quantifiable. One can therefore arrive at significance thresholds and sample sizes which result in a (roughly) optimal balance between risk and reward. For example, a company might employ a less strict threshold of 0.1 or 0.05 for mundane tests run as part of regular quality assurance, whereas high-stakes experiments might be subject to much stricter requirements on both type I and type II error rates.
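
A minimal sketch of what such a decision framework might look like (all numbers and the simple value model are invented for illustration; this is not a description of any particular company’s framework):

```python
import numpy as np
from scipy.stats import norm

n = 50_000           # users per group, fixed by the traffic/time budget
sigma = 1.0          # per-user standard deviation of the metric (assumed known)
delta = 0.01         # effect size considered worth shipping
prior_true = 0.3     # guessed share of tested changes with a real effect of about delta
value_win = 100_000  # value of shipping a truly better variant
cost_loss = 150_000  # cost of shipping a change with no (or a negative) true effect

se = sigma * np.sqrt(2.0 / n)                      # SE of the difference in means

def expected_value(alpha):
    z_crit = norm.ppf(1 - alpha)                   # one-sided critical value
    power = 1 - norm.cdf(z_crit - delta / se)      # power against effect = delta
    return prior_true * power * value_win - (1 - prior_true) * alpha * cost_loss

alphas = np.linspace(0.001, 0.20, 400)
best = alphas[np.argmax([expected_value(a) for a in alphas])]
print(f"Threshold maximizing expected net value: alpha ~ {best:.3f}")
```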

False positive risk

The main issue with using false positive risk is that it cannot be objectively computed for any single online experiment. It also does not surface any test-specific information which is not already contained in the p-value; it merely augments the p-value with data from a set of other experiments of always questionable relevance. It relies on the assumption that the experiment at hand is drawn at random from a population much like a set of previously observed experiments, which is not at all what happens in practice.

A further major issue is that the formula typically put forth for the calculation of FPR does not compute what the FPR concept is defined as, namely the probability that a statistically significant result is a false positive. This is something I’ve examined in much detail in [10].

Where is online experimentation heading?

Despite examples like the above-mentioned work, in the past several years there has been no noticeable overall shift away from p-values and toward alternative measures of evidence, or even toward preferring lower p-value thresholds on Bayesian grounds. Critiques of p-values and proposals of alternatives remain a side topic in most of the published research, which continues to focus on improving the efficiency of existing methods, removing sources of bias, and dealing with violations of standard model assumptions in different scenarios.

For what it’s worth, experimentation programs at high-profile corporations continue to share mostly experiments conducted in a typically frequentist fashion. I’m aware of some experimentation programs which exhibit methodological eclecticism by offering users the ability to conduct both frequentist and Bayesian tests, with the caveat that most seem to use uniform or noninformative priors.

At a high level, to the extent that the field had swung Bayesian in the mid-to-late 2010s, it seems lately to have been heading back toward error-statistical territory. The adoption of Bayesian methods since 2019 is either mostly unchanged or somewhat on the decline. The decline, I believe, is partly due to the discontinuation of Google Optimize in late 2023. It also has to do with the rapidly increasing popularity of frequentist sequential testing, such as group sequential tests and methods of so-called ‘always valid inference’ [11][12]. Many vendors have added such methods in just the past couple of years, as there is now wide recognition that optional stopping is an issue for the trustworthiness of test outcomes. Yet reliable numbers on how many tests are conducted using frequentist methods versus Bayesian ones are near-impossible to come by, so estimates are made by proxy: rough counts of companies using particular vendors, methods used in publicly shared tests, methods discussed in research papers, and so on.
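
For the curious, here is a minimal one-sample sketch of the ‘always valid’ p-value produced by a mixture SPRT (normal data with known variance and a normal mixing distribution), in the spirit of [11][12]; production two-sample implementations differ in their details:

```python
import numpy as np

def always_valid_p(x, sigma=1.0, tau=1.0):
    """Running always-valid p-value for H0: mean = 0, after each observation in x."""
    x = np.asarray(x, dtype=float)
    n = np.arange(1, len(x) + 1)
    s = np.cumsum(x)                                   # running sum of the data
    # mixture likelihood ratio against H0, with a N(0, tau^2) mixing distribution
    lam = np.sqrt(sigma**2 / (sigma**2 + n * tau**2)) * np.exp(
        (tau**2 * s**2) / (2 * sigma**2 * (sigma**2 + n * tau**2))
    )
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

rng = np.random.default_rng(1)
print(always_valid_p(rng.normal(0.0, 1, 5000)).min())   # true null: rarely dips below 0.05
print(always_valid_p(rng.normal(0.2, 1, 5000))[-1])     # real effect: eventually tiny
```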

Share your reflections and questions in the comments on this post.

References:

[1] Georgiev, G. (2022) What Can Be Learned From 1,001 A/B Tests?, https://blog.analytics-toolkit.com/2022/what-can-be-learned-from-1001-a-b-tests/

[2] Downey, A. (2011) Repeated tests: how bad can it be? https://allendowney.blogspot.com/2011/10/repeated-tests-how-bad-can-it-be.html

[3] Deng, A., Lu, J., Chen, S. (2016) Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing; https://doi.org/10.1109/DSAA.2016.33

[4] Georgiev, G. (2020) Frequentist vs Bayesian Inference, https://blog.analytics-toolkit.com/2020/frequentist-vs-bayesian-inference/

[5] Georgiev, G. (2020) Bayesian Probability and Nonsensical Bayesian Statistics in A/B Testing, https://blog.analytics-toolkit.com/2020/bayesian-probability-and-nonsensical-bayesian-statistics-in-a-b-testing/

[6] Ioannidis, J.P.A. (2005) Why Most Published Research Findings Are False; https://doi.org/10.1371/journal.pmed.0020124

[7] Benjamin, D.J., et al. (2018) Redefine statistical significance; https://doi.org/10.1038/s41562-017-0189-z

[8] Colquhoun, D. (2017) The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science (4). https://doi.org/10.1098/rsos.171085

[9] Kohavi, R., Deng, A., Vermeer, L. (2022) A/B Testing Intuition Busters, KDD ’22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.3168–3177; https://doi.org/10.1145/3534678.3539160

[10] Georgiev, G. (2023) False Positive Risk in A/B Testing, https://blog.analytics-toolkit.com/2023/false-positive-risk-in-a-b-testing/

[11] Johari, R., Pekelis, L., Walsh, D. J. (2015) Always valid inference: Bringing sequential analysis to A/B testing, arXiv preprint arXiv:1512.04922

[12] Johari, R., Koomen, P., Pekelis, L., Walsh, D. (2017) Peeking at A/B Tests: Why it matters, and what to do about it; https://doi.org/10.1145/3097983.3097992

 

About Georgi Z. Georgiev:

Author of “Statistical Methods in Online A/B Testing”, founder of Analytics-Toolkit.com, statistics instructor at CXL Institute



25 thoughts on “Georgi Georgiev (Guest Post): “The frequentist vs Bayesian split in online experimentation before and after the ‘abandon statistical significance’ call””

  1. Georgi:
    Thank you so much for your guest post, giving us a 5-year review reflection on the impacts of “abandon significance” in a field very different from what we usually talk about. Or maybe it’s not that different? It would be interesting to hear your reflections on this. In that connection it would be great to hear a bit more on how A/B testing works–for example, are people asked if they prefer A to B?

    I’m still just taking in the many references in your guest post. I will especially recommend that people look at your discussion of false positive risks—not to be confused with false positive rates.
    https://blog.analytics-toolkit.com/2023/false-positive-risk-in-a-b-testing/
    I discuss this in my book Statistical Inference as Severe Testing (CUP 2019) [SIST] in Excursion 5 Tour II. You can find the corrected draft of this section on this blog here:

    Blurbs of 16 Tours: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST)


    I call it the diagnostic screening model of tests.
    “We examine an influential new front in the statistics wars based on what I call the diagnostic model of tests. (5.6) The model is a cross between a Bayesian and frequentist analysis. To get the priors, the hypothesis you’re about to test is viewed as a random sample from an urn of null hypotheses, a high proportion of which are true”.
    However, one might have thought that A/B testing is like diagnostic screening, and yet your criticisms of applying this model to A/B testing sound a lot like the ones I/we raise regarding its application in science! Your discussion in
    https://blog.analytics-toolkit.com/2023/false-positive-risk-in-a-b-testing/
    explains why I placed the discussion of the screening model in an excursion on power.

    Your closing remark in your discussion on false positive risk is a relevant note to end on:
    From Georgiev: “The above discussion on false positive risk is part of a much broader picture of Bayesian versus frequentist inference. Far from being a philosophical debate detached from practice, it has very real consequences on what experimenters think and do every day in online A/B testing and well beyond it. It is not a coincidence that the first article this author wrote on the topic addresses the “what you need to know” argument from Bayesians. Multiple better and better-worded arguments have been put forth since.
    Debates surrounding the replicability of scientific findings or their predictive value are as old as science itself. The proposed uses of false positive risk can be considered a local example regarding the replicability and predictive value of online experiments. The topics of replicability and predictive value are, however, much broader than what is encompassed by either p-values or false positive risk.”

    I don’t quite understand the poll you ran “to see if they really understand probability in Bayesian terms” but it might be unclarity about how A/B tests are done. I’m sure I’ll return with further comments/queries at a later date.

    • Georgi Georgiev

      Deborah:

      Thank you for the opportunity to contribute to this debate with some observations from the very practical application of online experimentation.

      A/B testing works with live users of a website (or mobile app), with users randomly assigned to the current version of the website (the control group) or a tested treatment (the treatment group, of which there can be several if multiple changes are pitted against each other). The assignment is typically triggered, so it happens when a user has a chance to interact with the changed experience, but there are other approaches as well. Basically, when you visit Google, Booking, or Amazon you are very likely to be a participant in several, or possibly several dozen, A/B tests going on at that very moment. Different aspects of online behavior are then measured, depending on the goal of the A/B test. It could be as simple as measuring how fast a page loads for you, or it could be whether you’ve purchased something during the day or within a given number of days following your initial assignment to an A/B test.

      I agree with your discussion of Diagnostic Screening methods in SIST and I’ve also appreciated your paper with Richard D. Morey “A Poor Prognosis for the Diagnostic Screening Critique of Statistical Tests”.

  2. vaccinelegit0q

    Isn’t this missing the real issue? That consumer and market data don’t meet the assumptions of statistical inference? The market is an open, nonlinear nonequilibrium system whose phase space keeps changing. Convergence theories and other assumptions such as iid and ergodicity don’t hold. There isn’t even basic data integrity – you don’t know if a “user” is a real person or a bot.

    For instance, there is no way to know if two users or shoppers have the same needs, knowledge states, etc. So if you A/B test a new feature and the consumer rejects it, did they reject it because…

    1. They have a need for it, but they didn’t understand it
    2. They understood it and will never need it
    3. They understand it and will need it in the future, but just not now

    Another common scenario is when consumers are engaged in knowledge foraging (shopping). You will often attract consumers who mistakenly visited a website believing that it could be relevant, only to realize that it’s not what they need at all (e.g. I went to a Pokemon website thinking Pokemon is a type of car). In this case the probability of conversion is undefined because the event is not even part of the sample space.

    A third example is decreasing entropy (yes, decreasing) as new outcomes quickly emerge but are then selected out (never visited again) as the system localizes along a certain path. For instance, Neil Patel (link below) talks about how 18 months ago, adding “AI” to your website massively increased conversions. But now? He writes, “there little to no lift in conversions from using the word “A.I.”.”

    It’s better that we leave statistical inference to domains where the data meet methodological assumptions.

    https://x.com/neilpatel/status/1825704437056418164

    • Georgi Georgiev

      I’ll try to sum up the arguments raised in the comment:

      1. Data integrity is questionable, maybe a ‘User’ isn’t really a person
      2. No way to know the inner thoughts / state of a person
      3. Users / consumers / shoppers may land on a website by mistake
      4. What works for people one way today, may not work the same way tomorrow (people’s behavior changes, even when presented with the same stimuli)

      I agree with all of the above being issues. However, I disagree with the claim that the data available in online A/B testing does not meet methodological assumptions and that somehow statistical inference / estimation do not apply.

      The reason is simple: data integrity issues, inner thoughts / states of a person, and diverging user motivations can all be modelled as random errors, just as the same is done with measurement error and the infinite other confounders in any experiment ever done.

      In fact, I find it hard to point to another field which is able to so thoroughly verify the statistical assumptions in their models as is done in online experimentation, and especially at larger corporations. Even small to medium A/B testing teams can and do run hundreds if not thousands of A/A tests with actual users, actual measurements, and actual everything, in order to verify that, for example, the obtained p-values follow a uniform distribution under the null, or that intervals have the desired coverage.
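
      A synthetic stand-in for that kind of A/A diagnostic (in practice it is done with real traffic rather than simulated draws; the 5% conversion rate and counts here are invented) might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, n_tests = 20_000, 0.05, 2_000
pvals = []

for _ in range(n_tests):                       # each iteration is one simulated A/A test
    a, b = rng.binomial(n, p), rng.binomial(n, p)
    pooled = (a + b) / (2 * n)                 # two-proportion z-test
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = (b / n - a / n) / se
    pvals.append(2 * (1 - stats.norm.cdf(abs(z))))

pvals = np.array(pvals)
print("share of p < 0.05:", (pvals < 0.05).mean())              # should be close to 0.05
print("KS test against Uniform(0,1):", stats.kstest(pvals, "uniform").pvalue)
```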

      This practice in online experimentation goes well beyond your typical mis-specification testing (something also practiced in the industry). Real-world A/A tests verify the model assumptions at almost every imaginable level as tests are done not only on the null model (A/A), but sometimes at certain values under an alternative by using artificially induced actual hurdles to consumer behavior. Artificially induced actual improvements are harder to come by and there is large incentive for these not to exist.

      Further, even small organizations can have hundreds or thousands of real-world A/B tests to meta-analyze in order to check if some of their assumptions hold, e.g. whether sequential tests really stop X% earlier than their fixed-sample equivalents, on average.

      The only issue which remains is number 4) on my list: the non-constant nature of human behavior and our ever-changing environment. External validity / generalizability issues are nothing new in social sciences. The A/B testing community is well aware of it and this is one reason why re-testing is becoming a more common topic in recent years.

      The issue is obviously worse if one is focused on measuring minute, extra-particular changes like adding the word ‘A.I.’ to copy. If one thinks in broader categories such as those presented in “The Lever Framework” then many things tend to work the same as they always did, with minor tweaks to account for those changes in preferences and the environment.

      I’m happy to go on, but I want to make sure I’m not missing the main point of the commenter. If it is that social sciences are not physics, then I agree and there is no debate. If it is that somehow due to the above the use of statistical inference and estimation in A/B testing is methodologically unwise, then I’m happy to continue addressing any points raised.

      • Georgi:

        I’m glad you had an effective reply to this charge.

        I’m guessing that, for generalizability, it wouldn’t be difficult to just ask people, be it via questionnaires, focus groups or whatever’s out there. For example, I’m often asked why I’m returning a product. I’m not asked why I decided not to order after getting almost to the end, but there’s info there.

        There must be some famous cases in marketing that market students are taught.

        • Georgi Georgiev

          Deborah,

          My specialty is not qualitative methods such as surveys, but colleagues well-versed in the discipline have shared that it is difficult to survey people who ‘almost got to the end, but decided not to order’, unless you already have an account at that place, e.g. Amazon. It is just not easy to act on the absence of an action, especially if you have no way to reach those people as is the case for most. Even if you are Amazon, if you start doing that and somehow nail the purchase window precisely, it can be seen as intrusive, pushy, etc. plus the information gathered may not be of high utility.

          What I’ve been told works really well is to survey people who have just purchased, asking them something like: “What was the one thing that made you almost not purchase?” or “What would you buy if you hadn’t bought our product?”. The insights obtained would then result in ideas for things to A/B test.

          • Georgi:
            I neglected to mention that these companies email me to say something like: “you have items in your cart”, or “you haven’t completed your purchase”. But they don’t ask why.

            • Georgi Georgiev

              Deborah,

              I see. These are typically sent out very shortly after a cart interaction (several hours to a couple of days, at most) so I can imagine that it is seen as a bad opportunity for polling, as the reasons for not purchasing at that given time can be many and mostly outside the influence of the merchant. Plus a survey question would detract from the main action the merchant wants you to take at that point in time.

      • vaccinelegit0q

        Great to hear your response!

        The claim…

        “The reason is simple: data integrity issues, inner thoughts / states of a person, and diverging user motivations can all be modelled as random errors”

        … is not true.

        It exhibits the problem I was pointing out: people are using equilibrium thinking and methods to make statements about nonequilibrium systems.

        For instance, two challenges that A/B advocates hand-wave away include:

        1. The sample space and phase space are indefinite – or at least ill-defined (violating Kolmogorov axioms)
        2. A random error cannot be discerned from (a) transition in phase or state space or (b) an unaccounted outcome within the sample space

        So here are my questions. How do you discern random error from:

        1. An unaccounted outcome within the sample space
        2. An irreversible change in phase or sample space

        An example of #1 would be A/B testing a birth control ad to determine which is better. However, you (unknowingly) captured observations from: men, post-menopausal women, women who want to get pregnant, and women who are not sexually active.

        An example of #2 is why Blackberry failed so spectacularly against iPhone and Android. What they thought was “random error”, was really the emergence of a permanent change that was diffusing through the market.

        These observations are not “random errors” because (if I’m not mistaken) random errors only apply to measurement errors which affect observing known outcomes within a known sample space. But as I pointed out, A/B testing does not account for unknown outcomes or a changing sample space.

        In other words, you think you’re using a six-sided dice, but you’re really using a ten-sided dice. So when you see a “7” or “10” you think it’s random error but it’s not. It’s part of the sample space which you did not account for. Or, you started with a 6-sided dice and after the 10th trial, unknown to you, the numbers changed from 1-6 to 4-9.

        This discussion is why we must, to paraphrase Gigerenzer, “Teach statistical thinking instead of just statistical methods”

        Here are some references for those who may be interested:

        • On probability as a basis for action (Deming)
        • On the Distinction Between Enumerative and Analytic Surveys (Deming)
        • A third transition in science? (Stuart A Kauffman, Andrea Roli)
        • Creative evolution in economics (Abigail Devereaux, Roger Koppl & Stuart Kauffman)
        • The Economy As An Evolving Complex System II (W. Brian Arthur, Steven N Durlauf, David Lane)
        • Complexity and the Economy (W. Brian Arthur)

        And probably anything from statistical mechanics.

        • Georgi Georgiev

          I’ve already stated that “If it is that social sciences are not physics, then I agree and there is no debate.”, so that will remain my response with regard to generalizability, especially at the level of a single test and not at the level of meta-analyses such as the Lever Framework.

          I fail to see any fundamental issue in your Example #1. If it were the case (possible), the main concern would be the higher cost to reach a prospective customer (or test subject) and the higher variance in the outcomes which would require a larger sample size.

          Regarding your random error remark: many in the industry are vigilant regarding model violations, be it IID or distributional. Where such violations are found to be material, i.e. in two-sided marketplaces (like eBay), different models and experimentation methods are employed which address that. However, there do not seem to be issues of the kind you imply in the vast majority of scenarios despite the huge number of A/A tests (and A/B tests where applicable) as well as mis-specification tests employed in the industry with the precise aim of capturing all kinds of expected and unexpected model violations. I’d be curious to hear how you’d imagine the issues you think are most impactful have not been caught despite all of that.

          • vaccinelegit0q

            There is a tremendous opportunity to develop robust, reliable statistical methods and design of experiments (e.g. factorial design) for use within the market. However, it cannot happen as long as people within the community refuse to accept that:

            1. The market is an open, evolutionary system. It does not have the statistical properties of a game of dice
            2. Consumers are not homogenous; and consumers who are heterogeneous still buy the same product

            For instance, I have been head of product (VP) several times. Here is how B2B shoppers really behave:

            1. They visit a brand’s website multiple times through the year, across multiple devices.
            2. They sign up for trials, if they exist, but rarely convert. Or if they convert, they do it for a small team and then cancel after 2-3 months. They’re just shopping.
            3. If a trial expires, they either sign up with another email address or ask for a trial extension.
            4. Shopping increases at the end of their fiscal year, but they don’t buy until the next fiscal year starts. This can be in January or July.
            5. When it comes to buying, even if they did a trial and visit the website, they pass on a recommendation to someone else who directly contacts sales

            However, A/B advocates assume that consumers show up to a brand’s website / app with the full intent to buy or not buy, try it out, make their final decision in that moment, and never return. That almost never happens, and if it does, it’s because the consumer had predetermined what they were going to buy, and it doesn’t matter what the website says or what the trial experience is.

            There is no evidence that A/B testing yields better business outcomes. Zero. In fact, many organizations find that they achieve better business outcomes when they stop doing A/B testing or avoid it altogether (like AirBnB, Linear, Apple,…). This will continue to be the case unless practitioners admit that the idealized and simple maths they learned in school don’t match the real world.

            My question is: Why the refusal to admit that new methods are required? Scientists like Stefan Turner, Brian Arthur, and even Deming have explored the statistics of complex systems. The opportunity is there, so why the reluctance amongst A/B testers?

            • Vaccinelegit:
              First of all, it would be more natural if you used your name in commenting. Second, I don’t understand what you mean in saying “They sign up for trials, if they exist, but rarely convert”. Are customers asked to sign up? I’ve never once been asked to sign up for a trial, maybe they’re not done in the U.S., and I do quite a lot of online shopping. And what is it to “convert”? You say “if they convert, they do it for a small team and then cancel after 2-3 months. They’re just shopping.” This sounds like subscribing to a service like a streaming service, where one would cancel after a special. Why would an organization achieve better business outcomes when they stop doing A/B testing?

            • Georgi Georgiev

              Replying to “There is a tremendous opportunity to develop robust…” as I initially did not see this comment.

              I’d be curious to hear more about: “There is no evidence that A/B testing yields better business outcomes. Zero.” – how would you measure that, overall, what’s your metric(s)? Where are you looking for such evidence?

              Also, I have serious doubts about this claim: “In fact, many organizations find that they achieve better business outcomes when they stop doing A/B testing or avoid it all together (like AirBnB, Linear, Apple,…).” Sources? My skepticism is due to knowing people currently in the experimentation teams of both AirBnB and Apple, as well as some who’ve been there until very recently. A/B testing is alive and well at those two companies and none of the people I know seem to share your concerns. Quite the contrary: at least one is an outspoken proponent of A/B testing and has been for years, and another is the head of their own A/B testing software company.

              As for your B2B example: I don’t think there are many in the industry who are not aware of their buying cycle. I have a B2B project and I fully expect my B2B shoppers to behave in the way you describe in your example since that’s what some of them do. I also had B2C clients with a typical purchase cycle of 1-2 months, with exceptions up to several months, but that is much rarer and limited to certain product/service categories. It is partly why A/B testing is most prominent among B2C products or services where the shopping behavior is much easier to track and statistically model, and not B2B where it is more difficult to do so on an end-to-end basis.

              While such issues limit the applicability of some kinds of tests, they don’t preclude running tests altogether. Even a buying cycle of over a year, across multiple devices and multiple people, does not preclude one from successfully A/B testing. Yes, the strategy should change by focusing on measuring more immediate actions as proxies for the end goal, e.g. instead of measuring sales you measure content engagement, views of deep content (technical resources, policy pages, etc.), trial sign-ups, post-trial engagement metrics, and so on and so forth. Are the results from those as compelling as those of tests run at an e-commerce retailer? Probably to a lesser extent, but what’s the alternative? Do everything blindly? Rely on qualitative feedback and pray that you did not screw up the implementation in some of the many unforeseen ways?

              The available meta data shows that roughly half of ideas put through an A/B test have no or negative effect. I can’t see how the ability to screen these off at the cost of missing some minute positive effects would not lead to positive practical consequences for most businesses.

          • Georgi:
            Can you give us an example of an A/B trial, or a link to one? Thank you.

            • Georgi Georgiev

              Mayo:
              Searching for “ab test case study” or “ab testing case studies” returns thousands of results. Depending on what you are after, they may or may not contain sufficient details. If you let me know what you’re looking to see in that example (changes people test, test outcomes, visuals, methodology, technical details, etc.), I can provide curated examples.

              Picking up on your last remark about how an organization is likely to do without A/B testing, I’ve shared one of the rarer case studies of what happens when one ‘ships’ without testing, even when there is no question that a change should be implemented (one way or another): https://blog.analytics-toolkit.com/2020/the-cost-of-not-ab-testing-case-study/

  3. rkenett

    A/B testing is indeed an interesting application area and this review is interesting. I also liked the reference to “severe twisting” of words.

    Two comments:

    1. Reference distribution – in the several projects we did on A/B testing, we started with A/A testing, i.e. comparing A with A. This is easy to do. It was very useful to debug some technical issues in how users are assigned which would have biased the A/B test that follows. In any case, this approach is empirical as it allows one to position evolving statistics in the context of a reference distribution of no change.
    2. Decision optimization perspective – My book on industrial statistics has a section on Sequential Sampling and A/B Testing (11.5) using the Two Arm Bandit Bayesian Strategy. In that context, the optimal strategy is determined by Dynamic Programming. The principle of Dynamic Programming is to optimize future possible trials, irrespective of what has been done in the past. We consider a truncated game in which only N trials are allowed, start with the last trial, and proceed inductively backward. The Python code for this more sophisticated approach is available in https://gedeck.github.io/mistat-code-solutions/IndustrialStatistics/. Steven Scott, working then at Google, implemented this approach which is focused on optimal decisions https://sites.socsci.uci.edu/~ivan/asmb.874.pdf
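
    A minimal sketch of the backward-induction idea for such a truncated two-arm Bernoulli bandit, assuming Beta(1,1) priors and a small horizon (not the implementation from the book or the linked code), might look like:

```python
from functools import lru_cache

N = 20  # total number of trials allowed (kept small; the state space grows quickly)

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2):
    """Expected number of future successes when playing the remaining trials optimally."""
    if s1 + f1 + s2 + f2 == N:
        return 0.0
    best = 0.0
    for arm, (s, f) in enumerate([(s1, f1), (s2, f2)]):
        p = (s + 1) / (s + f + 2)                     # posterior mean under a Beta(1,1) prior
        if arm == 0:
            win, lose = value(s1 + 1, f1, s2, f2), value(s1, f1 + 1, s2, f2)
        else:
            win, lose = value(s1, f1, s2 + 1, f2), value(s1, f1, s2, f2 + 1)
        best = max(best, p * (1 + win) + (1 - p) * lose)
    return best

print("Expected successes over", N, "trials:", round(value(0, 0, 0, 0), 3))
```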

    Reproducibility in the context of A/B testing is not a concern since software platforms are routinely updated and one needs methods that effectively work with new releases. The context here is a strategic path taking you through various releases and not individual versions of the software.

    The two paths sketched above have gained popularity as the technology evolved. This might be a ripple effect of the “abandon p value” discussions mentioned in this series of posts. However, I tend to think that they would have happened anyhow.

  4. Georgi:

    I’m interested to hear that you are suggesting that it is generally accepted that “The allure of ‘Stopping rules do not matter’ was inspired by a realization that stopping rules do matter if one cares to discern between false and true effects. …When you look at results that show your business losing (or failing to capture) 100’s of thousands of dollars per day / month, next to which there is a glowing sign saying ‘99% confidence’ (or even ‘100% confidence’), or an equivalently low p-value, it is easy to stop a test early.”

    This should be persuasive for Bayesian researchers who still claim stopping rules don’t matter. You say something called “google optimize” dropped out in 2023. Does this (ignoring stopping rules) have anything to do with it? Is the recommended way to proceed nowadays to use various sequential testing strategies? I’d be interested in any examples. Thank you for a very interesting post.

    • Georgi Georgiev

      Mayo:

      Re “Google Optimize”: it was discontinued in late 2023 and as happens in such cases, the company discontinuing the product is light on words regarding the decision: “Why was Optimize sunsetted [sic]? Optimize, though a longstanding product, did not have many of the features and services that our customers request and need for experimentation testing. We therefore have decided to invest in solutions that will be more effective for our customers.”

      Most A/B testing tools nowadays support both frequentist (usually some kind of sequential testing, though there are some vendors which are yet to implement proper sequential tests) and Bayesian, with the latter typically being restricted to using a uniform prior with no choice given to the user in that regard. Since I do not think there is anyone with any experience in the industry who believes said prior is an appropriate one, I’ve always dubbed these “Bayesian in name only”, or a more computationally expensive way to compute a fixed-sample p-value / CI.

      Examples of employing sequential testing by well-known vendors:

      Alpha and beta spending GST is used by: VWO, ABsmartly, Analytics-toolkit.com

      A variant of mSPRT dubbed Always-Valid p-values is used by: Optimizely, Amplitude, Eppo, Statsig

      For some vendors it is unclear: e.g. ABtasty uses either a beta-spending-only GST or a futility-only SPRT variant (their docs are unclear), while Convert.com is also very non-transparent about the type of frequentist sequential test used.

      Sequential testing use by large experimentation teams:
      GSTs: Booking, Spotify
      mSPRT: Uber, Netflix

      I think a very telling example is one of the remaining high-profile fully Bayesian solutions, or at least I thought it was, based on their SmartStats engine, which uses a Beta(1,1) prior and makes no mention of corrections for multiple evaluations in their original technical paper. But then I discovered a glossary entry on sequential testing and in it found this gem: “VWO uses a derivative of an approach called Alpha-Spending to correct Sequential Testing by Lan and DeMets.” and “By selecting Sequential Testing Correction in the SmartStats Configuration, decision probabilities will be adjusted to minimize errors while monitoring test results during data collection in the new test reports.”

      So the main entry I had planned as a highlight for the fully Bayesian case seems to have shifted to offering fully frequentist sequential adjustments (efficacy-only GST, it seems), at least as an option.

  5. Nick

    Lavine and Schervish (1999) demonstrate that Bayes factors and 2-sided p-values are incoherent and incompatible with the likelihood principle (which allows optional stopping). 1-sided p-values (which I know Georgi likes) are compatible with the LP so perhaps that’s the answer.

    • Nick:
      The answer to what? And why are Bayes factors incompatible with the LP. David Cox likes 1-sided p-values too, but vehemently rejects the LP–as do all error statisticians.

  6. Thanks for this interesting posting! So the Google Optimize promise is this: “Probability to be best tells you which variant is likely to be the best performing overall. It is exactly what it sounds like — no extra interpretation needed!”

    In Bayesian statistics, this will obviously depend on the prior, which if I haven’t missed anything hasn’t been discussed here. So I wonder, where does the prior come from, and on what information is it based? Personally I believe that in Bayesian statistics the posterior inherits meaning from the prior, and a meaningless prior will produce a meaningless posterior. Does this system work with a meaningful prior?

    • Georgi Georgiev

      @Christian Hennig:

      I’ll cite my review of their documentation:

      – begin quote

      The documentation states “With respect to winners and conversion rates, we do our best to use uninformative priors — priors with as little influence on the experiment results as possible.” There is no such thing as an uninformative prior, but there are minimally informative priors, and I suppose this is what is referred to here. But if Google Optimize were to use minimally informative priors, then the results should match those of an equivalent frequentist method (same hypothesis, error control, etc.), rendering all claims of superiority null and void. The only gains in efficiency can come from adding more assumptions in the form of informative priors.

      The use of uninformative priors will also render any claim for easier interpretation false – if the math is the same and the result is the same, how is adding more complexity to the interpretation any better? Have you tried explaining the so-called “non-informative” priors over which there is no consensus among Bayesians, to a non-statistician?

      -end of quote

      It is another example of how adding “Bayesian” to a tool is seen as a license to claim that what is being computed is the probability of a hypothesis rather than a straightforward p-value. In some cases this comes with the added ‘benefit’ of being treated as permission to ignore optional stopping. The latter seems to also have applied to Optimize, given this statement of theirs: “Additionally, Bayesian methods don’t require advanced statistical corrections when looking at different slices of the data.” (which I interpret in the sense of time slices).

  7. Georgi:

    I’ve been meaning to ask you how your monitoring relates to something called “anytime valid” p-values and e values, and how all of these connect to adaptive trials used by Wald and such agencies as the FDA? One of the researchers involved in developing these is Peter Grunwald. I take him to purport that his measures are superior, but I don’t know enough to evaluate them. I expect this to come up (by others) in a Neyman Seminar I’m giving next week at Berkeley; it would be great to know what you think.

    • Georgi Georgiev

      Deborah:

      The kind of monitoring I’ve found to offer the best trade-offs is of the group-sequential type, with early stopping for both efficacy and futility. This is pretty much the standard type of sequential test you’d see in a modern-day FDA-approved clinical trial. All COVID trials were of this kind; a prominent example is one I’ve covered here: https://blog.analytics-toolkit.com/2022/improve-your-a-b-tests-with-9-lessons-from-the-covid-19-vaccine-trials/#sequential (the “Test efficiently with robust results through sequential monitoring” section). I’ve settled on this approach as the best for most online A/B tests since:

      1) it offers 20-80% faster tests than equivalent fixed-sample designs. The exact average sample size in each experiment depends on the relationship between the target MDE and the actual (unknown) true effect.

      2) it trades power for early stopping much more favorably than other methods such as SPRT or mSPRT (which anytime-valid inference is a subset of). I believe I’m one of the first to publish a (limited) comparison of a GST variant (AGILE), SPRT, and mSPRT: https://blog.analytics-toolkit.com/2022/comparison-of-the-statistical-power-of-sequential-tests/ . The guys at Booking did their own sims a year later and it seems they obtained results similar to mine.

      3) it allows for very flexible planning, where both the number and timing of analyses can be altered as the test progresses without a noticeable impact on error control (I would not advocate taking this to an extreme, of course)
      – it allows various error-spending functions to be applied to fit particular needs, but in general spending functions which are conservative early on are preferred, partly due to the better estimation accuracy post-experiment, and partly due to the positive effect on generalizability (a small sketch of two common spending functions follows after this list)

      4) the non-continuous nature of the interim analyses means that one can achieve better generalizability by avoiding or mitigating known threats, e.g. hour-of-the-day bias, day-of-week bias, etc., whereas a continuous test may end too quickly unless intentionally stalled. For the most representative results data is analyzed on a weekly basis, but sometimes daily examinations are performed as well.
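
      Regarding the error-spending functions mentioned above, here is a small sketch of two common Lan-DeMets spending functions (cumulative alpha spent at each planned look; deriving the actual group-sequential boundaries then requires numerical integration or simulation, which is omitted here):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.5, 0.75, 1.0])        # information fractions at the planned looks

obf_like = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))   # conservative early
pocock_like = alpha * np.log(1 + (np.e - 1) * t)                      # spends more evenly

print("O'Brien-Fleming-like cumulative alpha:", np.round(obf_like, 4))
print("Pocock-like cumulative alpha:        ", np.round(pocock_like, 4))
```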

      What is described in Grunwald’s work is a more general approach akin to the mixture SPRT one we’ve seen adopted by some in the industry in the last couple of years. In my view the main driver behind the rising popularity of Anytime-Valid Inference is that it is “braindead” to apply. You just start gathering data and examine it as it gathers, without the need of even surface understanding of sequential methods. I think this appeals to vendors in a quickly growing industry (to give an idea of the growth: https://trends.google.com/trends/explore?date=all&q=ab%20testing,conversion%20rate%20optimization&hl=en) where most of the practitioners they serve have been in the business for only a couple of years and many have minimal or no training in statistics. For vendors the easier solution to stop people from peeking and blowing up their error rates is to implement something like Always-Valid Inference, which is foolproof in this manner. It would be harder to educate all of their users in stats.

      On the flip side, this comes at the cost of very poor statistical power. It is literally “off the charts” if you look at the article I shared above with the comparison of different sequential methods. However, only the more experienced and more statistically-savvy in the industry would know this and would fully appreciate what that means. Most of these people are not the target market for software vendors as they are part of the experimentation teams at large orgs (Google, Microsoft, Booking, etc.) who typically use internally developed tools.

      • Georgi:

        I’m extremely grateful to you for taking the time to look at the material I sent you and to explain so clearly the contrasts between these different sequential monitoring approaches. Your discussion of the Covid trials is quite enlightening, especially the emphasis on the importance of one-sided tests. I haven’t yet looked at your comparison of the method you favor and “mSPRT (which anytime-valid inference is a subset of)”.

        I’m very interested in your “braindead” comment that “Anytime-Valid Inference is ‘braindead’ to apply. You just start gathering data and examine it as it gathers, without the need of even surface understanding of sequential methods.” How would a surface understanding of sequential methods yield a better result that is forfeited by the braindead approach? Better power? More soon.

