#Growth 9: A/B Testing & Bayesian Statistics with Intellimize's Guy Yalif

This is a podcast episode titled, #Growth 9: A/B Testing & Bayesian Statistics with Intellimize's Guy Yalif. The summary for this episode is: On today's episode of #Growth, host Matt Bilotti is talking all about testing – A/B testing to be exact. He breaks down the best ways to test, teaches us all about Bayesian Statistics and chats through this and more with Guy Yalif, co-founder and CEO of Intellimize.

Matt Bilotti: Hello and welcome to another episode of #Growth. I am your host, Matt Bilotti. And I am super excited to be joined by one of the most knowledgeable guys I know on the topic of AB testing to talk all about AB testing and his name is Guy Yalif, and I'm so excited to have him. Thanks for joining, Guy.

Guy Yalif: Matt, thanks for having me. Great to be here.

Matt Bilotti: Yeah. So, today what I want to talk about is... I know people are pretty familiar on basic AB testing, you run two things at the same time and one does better than the other and then you pick one and you kind of run forward with that. But more recently I've been learning about this thing called Bayesian statistics and through one of the tools that we use at Drift, which Guy is the CEO of called Intellimize, it incorporates this Bayesian statistics approach, which as I've learned more and more about it and was very, very confused at the beginning as to what it meant and how exactly it worked, as I learned more and more, I realized that it really feels like the future of AB testing for a lot of different reasons. And excited to go a little bit further into that. So Guy, maybe if you want to give a quick intro on your background and then we can dive right into the topic.

Guy Yalif: Sounds great, Matt. And yes, it'll be interesting to dig further into it. I am an aerospace engineer. I spent half of college coding AI to design airplanes. Thought I was going to do that for the rest of my life and love the idea. I then spent 10 years as a product guy and 10 years as a marketing guy before starting Intellimize with two long time machine learning friends. And we automatically optimize websites by personalizing the experience for each individual visitor. And Bayesian statistics is one of several techniques we use to help do that better.

Matt Bilotti: Very cool. Also, I had no idea that you did aerospace stuff. That's amazing. All right. Why don't we jump in, quick high level. I mean, I gave the basics of AB testing, maybe if you just want to give us a quick run through of, from your perspective, what is AB testing and why does it matter and how do you measure it? And then we can jump a little bit further into Bayesian after that.

Guy Yalif: Sounds great. AB testing is a great way for us as marketers, as growth professionals, with a number on our head, we have to deliver more revenue, more customers, more leads to sales every day, it's a great way for us to do data-driven marketing. And at its core, just as you said, we have an idea and we want to know if that experience is better to show everyone going forward than what's on our site currently, so that we can go deliver more revenue, more customers, and so on. And so just as you said, we show the current idea that's on our site, we show the new idea, and we flip a coin, randomly allocating everyone to one or the other, 50/50, and we use math, whose goal is to tell us, "Hey, are these two ideas performing the same? Are they not performing the same?" And if they're not performing the same, we then go look at the conversion rates of each one of those two ideas and assume that the higher observed conversion rate is in fact the higher performer. We go to engineering, we ask them to code it into the base site, and we show it to everyone forevermore. And the stats on which that's based are the stats we all learned in college.

Matt Bilotti: Very cool. And I know, now getting a little bit deeper, people know the term statistical significance. When I first started doing AB testing, both on the website and the product all across our business, there's this concept of statistical significance. What exactly does that mean? And what is the difference between like 90% and 95%?

Guy Yalif: Statistical significance is a measure of how likely what we're seeing is due to chance. So if we're seeing our new idea, let's call it variation B, performing better than our original site, variation A, and we have 90% statistical significance, it's saying that with 90% confidence, these two variations do not perform alike. And we're going to assume variation B is better. And there's a 10% chance that actually this is just random noise. And so if you up that to 95%, now there's only a 5% chance what we're seeing is really just a bunch of noise. And why does this matter? Because if you're an organization running a bunch of AB tests, eventually you're going to run into one of those noisy ones. And so you're really saying how comfortable am I drawing a conclusion when there's really no conclusion to be drawn at all? And if I run 20 tests and I'm using 90% statistical significance, on average, two of those tests I'm going to see there's a conclusion here when there isn't at all. Makes sense?
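
The "20 tests, two false conclusions" arithmetic can be sketched in a few lines of Python. This is only an illustration, and it assumes something the conversation leaves implicit: that none of the 20 tests has a real difference to find and that the tests are independent. The function names are made up for this example.

```python
def expected_false_alarms(num_tests: int, significance: float) -> float:
    """Average number of tests that look conclusive purely by chance,
    assuming no variation in any of the tests truly differs from its control."""
    false_alarm_rate = 1.0 - significance
    return num_tests * false_alarm_rate


def chance_of_any_false_alarm(num_tests: int, significance: float) -> float:
    """Probability that at least one of the independent tests is a false alarm."""
    return 1.0 - significance ** num_tests


for level in (0.90, 0.95):
    print(
        f"significance {level:.0%}: "
        f"about {expected_false_alarms(20, level):.1f} false alarms in 20 tests, "
        f"{chance_of_any_false_alarm(20, level):.0%} chance of at least one"
    )
```

Moving from 90% to 95% halves the expected number of false alarms, which is the trade-off Guy is describing.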

Matt Bilotti: Yeah, it makes a lot of sense. And one of the things that we were talking about before you jumped on here was about this concept that as you're running the test, you can be tricking yourself by checking it along the way. Right? Checking it an hour from now versus four hours from now. And explain how that can mess up your results and what exactly that means in the frame of statistical significance.

Guy Yalif: It's a great question and it's something just about everybody I know, including myself, is guilty of when I run an AB test. The statistics we're using for these, the name for them is frequentist statistics. That's a label that on its own is not as important, but the core underlying concepts are ones we all learn in college. And those are predicated on the notion that we're going to say, "Hey, sometime from now," let's say it's a week from now, "I want to test whether these two variations perform the same." That's really what I'm doing. And so a week from now, I'm going to, with 95% confidence, know whether or not these two variations perform differently. It turns out that with the existing math, our tests may, if we're checking them all along the way, go above 90% statistical significance in some interim period and then go back down before we reach a week from now, which is when we're supposed to be looking at the results given the math. This results in false positives, where we think we've got a winner delivering lift. But in reality, when we deploy this winner to production, there's no lift to be had. And we could commit to waiting for a week, but in reality, in a business setting, we likely feel pressured to call a winner quickly or keep a test running that's inconclusive longer. And so by checking the test every day, we're more likely to have false positives or type 1 errors, declaring a winner when in fact had we given it enough time, it wouldn't be a winner.
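
The peeking problem can be illustrated with a small simulation. The sketch below is not how any particular tool works. It assumes an A/A test (both variations truly convert at the same rate, here a made-up 10%), a standard two-proportion z-test, and one check per day for a week. All names and numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist


def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if std_err == 0 else (conv_b - conv_a) / n / std_err


def false_positive_rates(simulations=300, days=7, visitors_per_day=200,
                         true_rate=0.10, significance=0.90, seed=1):
    """Simulate A/A tests (no real difference) and return a pair:
    (rate when looking only on the last day, rate when stopping at any daily peek)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # two-sided cutoff
    final_hits = peek_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        crossed_on_any_day = False
        significant = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_day))
            significant = abs(z_score(conv_a, conv_b, day * visitors_per_day)) > critical
            crossed_on_any_day = crossed_on_any_day or significant
        final_hits += significant        # only the planned, end-of-test look
        peek_hits += crossed_on_any_day  # a winner is called the first time it looks significant
    return final_hits / simulations, peek_hits / simulations


final_rate, peeking_rate = false_positive_rates()
print(f"look once at the end: {final_rate:.0%} false winners")
print(f"peek every day:       {peeking_rate:.0%} false winners")
```

Because the last day is itself one of the daily looks, the peeking rate can never be lower than the look-once rate. The extra false winners come from tests that crossed the threshold mid-week and drifted back.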

Matt Bilotti: Okay. So this right here is at the crux of... I've learned about this Bayesian statistics thing and it sounds like it's really one of those things that helps address this core problem that most people aren't thinking about or at least when I started AB testing, I was like, "Ah, whatever. That's going to happen, but I'm 90% sure." And you all just kind of run with it. It's really easy to just kind of run forward. Can you talk about Bayesian statistics, and it's a really complicated topic, but maybe in some high level metaphor that's easy to grasp? And then we could talk a little bit as to how it then helps you avoid false winners.

Guy Yalif: Absolutely. And because we all ignored that in the existing stats, we have the situation where like at the end of the quarter or the end of the year, those above us in the organization say, "If I added up all the wins you declare for the last quarter, a year, man, we should have doubled our revenue. But it only went up like 20%. How did that happen then?"

Matt Bilotti: Been there.

Guy Yalif: Totally. Right?

Matt Bilotti: What is Bayesian statistics in a frame of a metaphor?

Guy Yalif: So it is intended to mitigate this challenge of checking results early, and it's different than frequentist statistics in several ways. So the frequentist statistics require that fixed time horizon, that week. So you may say, "Hey, I want to..." Let's pick a really simplistic example. "I want to decide does the sun rise in the east or does it rise in the west?" And so with frequentist statistics, I may say, "Look, I need to wait a month to figure that out." And so I've got to wait a month before I check. With Bayesian statistics, you can look at the result at any point in time and make a decision based on the probability that it's either rising in the east or the west. That's one. Two, frequentist statistics use just what I've seen in the test. So with frequentist statistics, I may say, look, I pretend I've never seen the sun rise before, and I'm going to observe it for 30 days. And I'll eventually conclude, hey, it probably rises in the east because 30 out of 30 it rose in the east. With Bayesian statistics, you can use what you've learned before as a hypothesis that you are then refining. And that gives you a couple of benefits. One, you often can reach a conclusion more quickly, which is something all of us want, because you've got this initial hypothesis like... Okay, with seeing the sun, it's straightforward. On your website, you probably have an idea of what's your average conversion rate on your website. And you can start your test already with that as a hypothesis rather than relearning that thing you already know. Second, you probably can do a better job, depending on how you set things up, of figuring out what's driving lift if you incorporate what you've learned before. You can decipher better, more quickly, did this variation perform better because it itself is better? Or really was it the visitor? Like, I don't know, super wealthy person tends to buy more. Or context of the visitor? Like this ad campaign does much better.
And so we too found this useful in helping decipher what lift is sort of really driven by the variation versus other things. And the third way Bayesian statistics is different is it takes into account measurement error. So it recognizes that what I'm seeing may not in fact be the ground truth. With the sun rising in the east, it's straightforward. It's always going to rise in the east. You're always going to see that. But with a variation, I may, for example, have a variation whose true conversion rate, meaning if I let it run forever, it's 20%. But I may show that variation to 10 visitors, and let's say four of them convert. In a frequentist world, you would say conversion rate's 40%. And I'd make decisions on that. In a Bayesian world, I would say the conversion rate is 40% plus or minus some really large range, I don't know, I'll make it up, 30%. Right. You've got some really large confidence interval. And if I show the variation to more people, I would expect the observed conversion rate to trend towards 20. And in a Bayesian world, at the same time, that confidence interval would get tighter. My prediction of what the true conversion rate is would get tighter and tighter so that I gain more and more confidence that it's within the range I've set up. And so in that way, it's quite different than the thinking we all go through with frequentist statistics where we just say, "Look, that conversion rate is 20%." Period. Makes sense?
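
The "40% plus or minus some really large range" idea can be sketched with a Beta distribution, a common textbook way to model a conversion rate. This is one possible illustration, not Intellimize's implementation. It assumes a flat prior (no earlier knowledge) and estimates the range, usually called a credible interval in Bayesian terms, from random draws. The visitor counts beyond Guy's 4-out-of-10 example are invented.

```python
import random


def credible_interval(conversions, visitors, mass=0.95, draws=20000, seed=0):
    """Approximate central credible interval for a conversion rate.

    Assumes a flat Beta(1, 1) prior, so the posterior is
    Beta(1 + conversions, 1 + non-conversions). The interval is read off
    sorted Monte Carlo draws rather than computed exactly.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        for _ in range(draws)
    )
    tail = (1 - mass) / 2
    low = samples[int(tail * draws)]
    high = samples[int((1 - tail) * draws) - 1]
    return low, high


# Guy's example first (4 of 10), then hypothetical larger samples near a true 20%.
for conversions, visitors in [(4, 10), (22, 100), (200, 1000)]:
    low, high = credible_interval(conversions, visitors)
    print(f"{conversions}/{visitors}: observed {conversions / visitors:.0%}, "
          f"plausible range {low:.0%} to {high:.0%}")
```

With 10 visitors the range is very wide. With 1,000 it closes in around 20%, which is the tightening Guy describes.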

Matt Bilotti: Yeah, it makes sense. So it's really a difference between, at the end of the day, looking with frequentist, the AB testing that most of us use or know, we look at the conversion rate and we say it's better or not. Whereas a Bayesian type system allows us to get closer and closer rates, start with a range and get closer and closer. Rather than watch a number go up and down, you watch a range close down. Right? So you can confidently know that the range is getting tighter versus are you sure that the total average that it's at right now is the true total average?

Guy Yalif: Matt, you're exactly right. And to put a finer point on it, the Bayesian approach, in doing just what you said, is a sequential one. What do I mean? So you can come in with this hypothesis that, hey, my conversion rate... or let's say the sun rises in the west. That's my hypothesis. For some reason, I believe that. With Bayesian statistics, I'll update my understanding about that hypothesis after every new data point. So I might see the sun rise in the east a couple of times and you know what? My hypothesis will start shifting. I'll start believing, hey, it's more possible that it could rise in the east. And as I see it rise in the east more and more often, I will update my prediction. I will update my belief on what is actually happening step-by-step so that at any point I could stop and say, well, does it rise in the east or does it rise in the west? That question, unlike frequentist, is not, does it cross a certain P-value and then I've got a yes or no answer? It's not like that. It is, I've got these two probability distributions. The probability distribution of it rising in the west is getting lower and lower and worse and worse. The probability distribution of it rising in the east is getting higher and higher and tighter and tighter. And I can make the call about like, hey, how much do these probability distributions overlap? And I can decide, am I comfortable with these probabilities, concluding, you know what? It's more likely it rises in the east. And I can do that at any point in time.
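
One way to put a number on "how much do these probability distributions overlap" is to estimate the probability that variation B's true conversion rate is higher than A's. The sketch below is an assumption-laden illustration rather than any vendor's method: it uses flat Beta priors, random sampling, and invented visitor counts.

```python
import random


def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(true rate of B > true rate of A) from Beta posteriors.

    Assumes flat Beta(1, 1) priors on both conversion rates.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: the same test looked at early and again much later.
early = probability_b_beats_a(conv_a=10, n_a=100, conv_b=12, n_b=100)
later = probability_b_beats_a(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"early look: {early:.0%} chance B is better")
print(f"later look: {later:.0%} chance B is better")
```

Early on the two distributions overlap heavily, so the answer is not far from a coin flip. With more data they separate and the probability approaches 100%. The call can be made at whichever point the probability is comfortable enough.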

Matt Bilotti: So, this is interesting. I feel like it's one of the harder points to grasp. I don't know if I've fully got it either. Is there a way to explain it for, I don't know, DG, who's the star of our core Seeking Wisdom podcast? If he's listening in, he would say that he's not a math guy. What is a really easy way to explain, like, your hypothesis is changing automatically? Right? How exactly does that happen?

Guy Yalif: There are many implementations and not everybody agrees. Very smart people disagree on the exact right implementation. But the core principle of that iterative approach is to say, look, my best guess is that this headline is going to convert at 10%. And as I see people come in and convert or not convert, I'm going to adjust that 10% on the fly after every visitor sees it to better approximate what I think the actual conversion rate of that headline is. And so I can get better and better at making guesses. Right? We do it all the time. When I'm riding a bike as a little kid who's never ridden a bike before, I start riding and I go straight for a little while and then I fall over to the left. And I think, all right, well, now I've learned. I fell over to the left, I'm going to lean a little bit more to the right. And I ride a little more to the right, and then I fall over to the right. And I learn about that. And I keep adjusting based on every turn of the pedal, a little over compensating one way, a little over compensating the other, eventually finding that right balance. That notion of this iterative dialing it in and narrowing in of like, hey, how should I maintain my balance on this bicycle is similar to this iterative approach in Bayesian.
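
The "adjust that 10% on the fly after every visitor" idea can be written as a tiny update rule. As Guy says, there are many implementations. This sketch picks the simplest textbook one (a Beta belief updated one visitor at a time), and the starting belief and the visitor stream are made up for illustration.

```python
def update_belief(alpha: float, beta: float, converted: bool):
    """One Bayesian update of a Beta(alpha, beta) belief after a single visitor."""
    return (alpha + 1, beta) if converted else (alpha, beta + 1)


def best_guess(alpha: float, beta: float) -> float:
    """Posterior mean: the current best guess at the conversion rate."""
    return alpha / (alpha + beta)


# Hypothetical starting belief: "this headline converts at about 10%",
# held loosely (worth roughly 20 visitors of evidence).
alpha, beta = 2.0, 18.0

# Hypothetical visitor stream: True means the visitor converted (6 of 20 do).
visitors = [False, True, False, False, True, False, True, False, False, True,
            False, False, True, False, False, True, False, False, False, False]

history = [best_guess(alpha, beta)]
for converted in visitors:
    alpha, beta = update_belief(alpha, beta, converted)
    history.append(best_guess(alpha, beta))

print([round(guess, 3) for guess in history])
```

Each conversion nudges the guess up and each non-conversion nudges it down, much like the small corrections on the bicycle. After these 20 visitors the guess has moved from 10% to 20%.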

Matt Bilotti: Got it. And so, in the context of how you would normally have your AB and you have your A as a control and your B is your variant, does that change much with this sort of Bayesian approach? Is there still an AB or is there an ABCDEF? How does that work?

Guy Yalif: So you can use frequentist or Bayesian statistics on an AB test, an ABCD test, a multivariate test. You can use it for all of them. At its core, the Bayesian approach is trying to solve a different problem. And this does get a little wonky, but like you almost said it before, the null hypothesis. When you're using frequentist based statistics, the actual question you're answering is do A and B perform the same? If they don't perform the same, you don't actually know which one's performing better. You assume, let's pretend B does better, that B is the higher performer. But actually, the true performance of B has some distribution around it and hopefully, it's above A. What you do know is they don't perform the same. With Bayesian statistics, you're not trying to answer the question, do they perform the same or not? You're trying to answer, how well does each one perform? And how accurately can I predict that? If I can predict it very poorly, then there's a lot of uncertainty and overlap in predicting that and I'm not sure. And as I get better and better at guessing, hey, you know what? It's really in this tight range, I can make better decisions. Additionally, Bayesian tries to minimize a notion of expected loss. What does that mean? So if there's a 10% probability that B is 10% better than A, in Bayesian statistics, that'll be treated the same as a 1% probability that B is 100% better than A, because those two multiply out to the same thing. In Bayesian, you're not just looking at the average difference, which is what you do in frequentist, here you're also saying what's expected? What do I think will happen? You're combining the probability that B is better than A along with the notion that if B is better than A, how much better will it be? You're combining those two notions and that has the effect of protecting you on the downside so that you don't put into production ideas that it turns out don't really work.
The con of that, the thing you pay for in that, is you do have the potential for more false positives. You have the potential for something that is only okay to make it out into production. It may not produce a big lift, but you are protecting yourself against having something that really sucks, that'll never make it out into production. And so that's one of the implications of taking a Bayesian approach in that... It's not the panacea. It doesn't solve everything. It does have pros and cons. This is one of the cons. In general, I happen to believe the pros, obviously, outweigh.
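
The expected-loss idea combines how likely one variation is to be better with how much better it would be. The sketch below first checks Guy's multiplication, then estimates an expected loss from sampled conversion rates. It is a simplified illustration under stated assumptions (flat Beta priors, invented counts, a made-up function name), not a description of how any specific tool decides.

```python
import math
import random


def expected_loss(conv_keep, n_keep, conv_other, n_other, draws=20000, seed=0):
    """Average conversion rate given up by shipping the 'keep' variation,
    counting only the scenarios in which the other variation is truly better.
    Assumes flat Beta(1, 1) priors and Monte Carlo sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        keep = rng.betavariate(1 + conv_keep, 1 + n_keep - conv_keep)
        other = rng.betavariate(1 + conv_other, 1 + n_other - conv_other)
        total += max(other - keep, 0.0)
    return total / draws


# Guy's arithmetic: probability times size of the lift comes out the same.
small_likely = 0.10 * 0.10    # 10% chance of a 10% lift
large_unlikely = 0.01 * 1.00  # 1% chance of a 100% lift
print("same expected value:", math.isclose(small_likely, large_unlikely))

# Hypothetical counts: A converts 100 of 1000 visitors, B converts 150 of 1000.
loss_if_ship_a = expected_loss(100, 1000, 150, 1000)
loss_if_ship_b = expected_loss(150, 1000, 100, 1000)
print(f"expected loss shipping A: {loss_if_ship_a:.4f}")
print(f"expected loss shipping B: {loss_if_ship_b:.4f}")
```

A decision rule built on this would ship a variation once its expected loss falls below whatever downside the team has agreed is acceptable, which is the "minimum acceptable downside" that comes up later in the conversation.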

Matt Bilotti: Yeah. So if I'm at a company and I'm looking at this Bayesian thing and I'm sitting here and saying, "Yeah, Guy, Matt, this sounds really cool, but where do I get started with this?" It seems like it's really easy for people to just kind of continue on with the frequentist stuff. Not to say that that is like the biggest mistake you'll ever make, or... I mean, I'm sure there are people out there that are going to strongly believe that it's not a mistake to continue with it. But where do they go and how do they think about it?

Guy Yalif: So I think you're spot on that continuing with frequentist is something many will do. In fact, it's got organizational credibility in most companies. Many of us just spend time driving buy-in on data-driven marketing to begin with, on AB testing. And these are the stats we're reusing. The motivation for doing this would be to not have that difficult moment at the end of the quarter or the end of the year where someone looks at us and says, "Hey, if I add up all your tests, we should have doubled revenue. And in fact, we're really up..." I don't know... "10, 20, 30%." That's the motivation for doing this. To do it, you could hire a very smart stats person or you could use one of the many tools out there that is starting to use or has been using a Bayesian approach to their statistics in return for having it happen less often that you bring a test to production and it's not actually a winner, because it's okay to peek early, and in return for getting answers sooner, and in return for being protected against the downside. The cost is now you have to get buy-in again in the org. Right? You will shift thinking from a P-value to some probabilistic view and some minimum downside that's acceptable. That'll take some time, but the benefits can be significant if it's a path you want to walk down.

Matt Bilotti: Yeah. I think it's funny that you use the example of sitting there at the end of the year and someone saying hey, if I add up all your tests, you said this thing was a 30% increase, this thing was a 20% increase. Where's the number? I have been there, not only from someone asking it to me, but also, we sat down and we looked at all of our AB tests and we were like, "These things don't add up. Right? Shouldn't they be if this is this much better and that one is that much better?" So I've totally been there. I'm sure there's other people who've been there. And it really is an interesting challenge to get your organization or your team rallied around this new concept, this new approach. Right? It might mean changing tools. It might mean re-educating people. It's definitely a really real part of the challenge.

Guy Yalif: I am with you and I have been there too. And those conversations are not pleasant, they are challenging trust-wise. And we could theoretically all solve it by waiting for the fixed amount of time we're supposed to wait for the frequentist test to run. But the reality of the business pressures day-to-day don't allow hardly any of us to go do that.

Matt Bilotti: Okay. So let's say we got someone listening, we've got a bunch of people listening, and those bunches of people are saying, "All right, I want to get started with this. Sounds good. One, how do I start these conversations and start to get our organization to move to this? Two, how do I actually start implementing this? And then three, where can I go learn more about this concept if I maybe didn't fully grasp everything? It is a tough subject, where can I go to find that information?"

Guy Yalif: In my humble opinion, you should begin by thinking through how much depth you want to bring the organization through. I mean, even in this conversation here, we've varied in altitude between simple metaphors of how I'm going to learn to ride a bike, to pretty detailed stuff about the maximum acceptable errors, and a bunch of places in between. And it may not be the case that you need to move the whole organization to understanding the full depth. Many of them may not care to. They may not understand that about the frequentist approach that you have now. But to decide to share with others, "Hey, we are going to shift our thinking to one that is more probabilistic." And one where we do need to decide what's the minimum acceptable downside for a test so that we can have more tests that have upside more quickly, that logic, that set of trade-offs, you can start talking people through. And on implementation, if you have a statistician who can do this in-house, they probably can help you a lot with driving the buy-in if this is something they believe in too. Otherwise, finding the tools that can help you do this. Maybe it is your existing tool. And on learning more, if you do Google Bayesian statistics, you will see a lot of information. There is not unanimity, one, that Bayesian is better than frequentist. There are very smart people who believe in both. And two, even within the folks who believe in Bayesian, there's not one universal way to implement it that is appropriate for every different situation. And so, if you Google it, you're going to see stuff from high level explanations down to simulations that are graphed that get quite detailed and specific. And my suggestion would be consume some of that. And just, as you said, go through it over and over again, to build the intuition. And if you want to dig in a little further, feel free to email me at guy@intellimize.com.
And my hunch is you'll say the same, Matt, so that we can continue that conversation and dig in a bunch further.

Matt Bilotti: Yeah, I think it's a fun one to have. I feel like every time I talk to someone about Bayesian and this future of AB testing, I'm always learning something new, whether I'm like finally connecting another piece of the dotted line or someone points me to another interesting piece of content. I'm super excited about this topic. I know it has saved us a bunch of accidental AB calls where we would've said B was better, but because we were using a Bayesian approach, it allowed us to be more certain that this thing is actually better and it's better within this range, which is higher than this other range. Right? It's more of the true north of the conversion rate. So, I'm a fan. Guy, I want to say thank you so much for joining today. I've learned a bunch. I hope our listeners... I was going to say viewers. There's no one viewing, you're listening to this. I know our listeners certainly have. And for all of you who are out there listening, I'm always open to feedback. My email is matt@drift.com. I really appreciate you tuning in, it means a lot. And if you have ideas for future topics, future guests, whatever it might be, feel free to send me a note. And if you liked it, six-star reviews in Seeking Wisdom fashion. And I'll catch you on the next episode. Guy, thank you again so much for joining today. I will talk to you again soon.

Guy Yalif: Thanks everyone for listening. Matt, thanks for the privilege of joining you. I'm looking forward to continuing the conversation offline.

Matt Bilotti: All right. Take care. Bye.

