Keynote: Practical Defaults for A/B Testing
Ronny Kohavi – Consultant and instructor for online controlled experimentation
1-minute video preview
Theo – Freelance CRO Specialist, feedback through our #CH2022 attendee survey:
Thanks for coming back and sharing your extensive knowledge again!
Slides
Notes
This is the link to the live notes of Ronny's talk.
Questions asked by attendees through our #CH2022 app:
- With a >0.05 alpha two-tailed, you will have (a lot of) false negatives and thus miss out on money. Isn't that worse than a false positive?
- Talking about statistical significance, what do you think about frequentist vs. Bayesian test evaluation?
- You are referring to two-tailed tests, but wouldn't you think one-tailed approaches support our CRO-related goals better, since we're aiming for improvements?
- Shouldn't we look at the risk of a wrong decision on a case-by-case basis? A million-dollar decision is riskier than a hundred-dollar decision.
- What if you don't run enough experiments to be confident of your success rate for managing FPR?
- Shouldn't FPR be compared with a coin toss? Even at 41% you are still beating the coin toss or the average HiPPO idea.
- Reducing alpha and replicating reduces FPR, but at a huge cost: lower power. And isn't finding real winners and reducing false negatives what we're really interested in?
- If the recommendation is to replicate an experiment when in doubt, why not extend the duration of the initial test?
- What benchmark would you use for the success rate when calculating FPR? (See the FPR sketch after this list.)
- Is an MDE of 5% still a good default for testing on a checkout page where 80% already converts, or do you need a lower MDE? (See the sample-size sketch after this list.)
- If the company doesn't have that amount of traffic, should they completely abandon the idea of running A/B tests?
- What would your advice be for testing on lower-traffic websites (<200k)?
- Would you ever recommend testing with a Bayesian approach over a frequentist one? Why or why not?
- How about setting alpha up front based on the costs of a false positive (low: only the cost of development and pushing live) versus a false negative (high: missing real growth)?
- What do you think of sequential analysis? Should we only start analyzing the results when the power is >80%?
- So if you can't get 200k visitors in two weeks, what is the recommendation? Guessing? HiPPO?
- For low-traffic websites: isn't running an experiment still better than not running it and not validating anything at all?
- Is it reasonable to be optimistic about the MDE and work with lower sample sizes when managing innovation teams working on "big-sized experiments"?
- Is there a case for using non-inferiority testing as the default?
- What should you do if you only have 5 days a year to test, during which 90% of sales are made?
- Did you optimize for speed? By how much did the numbers change?
- Why does a "flat OEC" equal no ship? E.g. a branding change caused a page update which had no OEC impact; does it make sense to still deploy the variation?
- How about experiments where there is no difference in cost/tech debt/maintenance between A and B? Couldn't we just take what little data we have and pick the highest mean with no test?
- What would your advice be for B2B websites with low traffic (10,000 monthly users or fewer)?
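
Several questions above ask how alpha, power, and a team's success rate combine into a false positive risk (FPR). The sketch below is a minimal illustration, not code from the talk: it applies Bayes' rule over a population of experiments, and the success rates used are assumptions chosen for illustration; the 8% case lands close to the 41% figure referenced above.

```python
def false_positive_risk(alpha: float, power: float, success_rate: float) -> float:
    """Share of statistically significant results that are in fact false positives.

    FPR = alpha * (1 - pi) / (alpha * (1 - pi) + power * pi),
    where pi is the prior probability that a tested idea truly works (the success rate).
    """
    false_positives = alpha * (1 - success_rate)   # truly flat ideas that still reach significance
    true_positives = power * success_rate          # truly winning ideas that reach significance
    return false_positives / (false_positives + true_positives)

# Success rates below are assumptions for illustration, not figures from the talk notes.
for pi in (0.33, 0.10, 0.08):
    print(f"success rate {pi:.0%}: FPR = {false_positive_risk(0.05, 0.80, pi):.0%}")
# With alpha = 0.05 and 80% power, an 8% success rate gives an FPR of roughly 42%,
# close to the 41% figure referenced in the questions above.
```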
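
Likewise, the questions about the 5% MDE default, a checkout page already converting at 80%, and the 200k-visitors-in-two-weeks threshold all trace back to the standard sample-size approximation for comparing two conversion rates. The function below is a rough sketch under assumed defaults (two-sided alpha of 0.05, 80% power), not a formula quoted from the slides.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(base_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect a relative lift in a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for a two-sided alpha of 0.05
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = base_rate * (1 - base_rate)          # Bernoulli variance of the conversion metric
    delta = base_rate * relative_mde                # absolute effect size to detect
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / delta ** 2)

# A 5% baseline with a 5% relative MDE needs roughly 120k users per variant
# (~240k in total), roughly the scale behind the "200k in two weeks" threshold
# raised above. A checkout page converting at 80% needs far fewer users for the
# same relative MDE, so a 5% MDE is easier, not harder, to power there.
print(sample_size_per_variant(0.05, 0.05))  # ~120,000 per variant
print(sample_size_per_variant(0.80, 0.05))  # ~1,600 per variant
```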