
© 2023 Khan Academy

# Statistical significance of experiment

AP.STATS: VAR‑3 (EU), VAR‑3.E (LO), VAR‑3.E.2 (EK), VAR‑3.E.3 (EK)

Sal determines if the results of an experiment about advertising are statistically significant. Created by Sal Khan.

## Want to join the conversation?

- This is more a heads up than a question. It may be just me, but when I first watched this video, I was utterly confused and misled by it. The conclusion that the experiment *was significant* seemed to me - and to others in the comments, I see - to contradict the point that had been made throughout the video. I couldn't make sense of how results that appeared to be *not significant* should, just the opposite, be considered significant. It turns out I had it backwards because it wasn't at all clear to me what a re-randomization is! A re-randomization is all about mixing up at random the results of *one and the same* experiment. Imagine reshuffling the pages of the Encyclopedia Britannica many times, to see how often they *just happen to be found* in the proper order. If a specific pattern observed in an experiment emerges often from a re-randomization, this suggests that pure chance alone is often sufficient to produce the same pattern, thus invalidating the experiment. Conversely, the less often a pattern emerges from a re-randomization of the results, the more the results of the experiment are validated (the point being that chance alone proves *unlikely* to produce said results). The notion and the underlying logic of a re-randomization are better explained in the next video, here:

https://www.khanacademy.org/math/probability/statistical-studies/hypothesis-test/v/statistical-significance-on-bus-speeds

In short, if this video confuses you, it might make more sense to you after you watch the next. Hope this helps.(119 votes)
- What do you mean when you say "significant"? Is it an important study? I don't understand. Please give a couple of very simple examples of it.(14 votes)
- They should say "statistically significant" to avoid this confusion. "statistically significant" means the data in question is below some threshold of likeliness to have arisen by chance (by convention this threshold is often set at 5%).(10 votes)

- Let's assume for a moment that watching food commercials makes the kids eat more. Now if we take a treatment group and a control group and perform the experiment a large number of times, shouldn't the mean of the treatment group be higher than that of the control group in most cases?

In my opinion, it should be so if we distribute the kids randomly into treatment and control groups every time we perform the experiment.

As an analogy, we know a die is biased if it shows the same number most of the time when we throw it. Similarly, if there is a difference between the means of the treatment and control groups in most of the trials, we should say that it is due to the type of commercials.(2 votes)
- Dr C, I've read all your answers in this topic explaining the intricacies of re-randomization but it seems to me that I have a brain freeze even more after those answers... to me the situation is rather illogical... imagine, for example, somebody takes two groups - 250 zebras and 250 lions - for study, and gives both of the groups grass (or whatever zebras eat) and meat in the same proportion, and after observation it appears that the zebras have eaten only grass and the lions - only meat... then some statistician says "What the hell! Let's re-randomize this..." and makes 150 groups of zebras and lions just to show that there is no difference between their consumption of grass and meat... putting that to an extreme the statistician posits that there is no difference between the eating habits of zebras and lions... of course, this example is a huge amplification but it seems to me that it's from the same category... so, I still don't understand the aim of this video lesson though I had almost no problems with any of them on this site... and, of course, I can - and should - be wrong somewhere but just don't know where...(1 vote)

- I have a question. I've completely understood why we need to re-randomize the datapoints between the groups and recalculate our indicator (the difference between the means), to see how often it can happen that we get our result just by chance. However, now I found myself wondering: how many simulations are needed to correctly estimate the probability of getting a specific value for the indicator just by chance? In other words, how do we know that 150 simulations are enough? Ideally we would want to recalculate the indicator for every possible arrangement of the datapoints in the two groups. But that would be 500 choose 250 in our example if I'm not mistaken, which is on the order of 10^149 combinations! So how do we know that 150 simulations are enough?(5 votes)
- You're absolutely right. Once datasets become moderately large, it's not really possible to run *all* of the permutations. Instead, we randomly select a large number of permutations. The probability (p-value) we calculate is then an *estimate* of the actual p-value we would have gotten if we ran all of the permutations. One thing that we can then do is to make a confidence interval for the probability, which will depend on the estimated probability, as well as the number of simulations we ran. Once we run enough simulations, this confidence interval will be pretty narrow, so we'll be fairly confident in our result to the degree that we need.

So how many replications do we really need? It depends on what the p-value is and how certain we need to be. Whenever I do tests of this nature, I usually start off with 1000 or 10000, depending on the complexity of the algorithm - some codes run very fast, others take a bit longer.(7 votes)
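The two ideas in this exchange (the size of the full permutation space, and a confidence interval around the simulated p-value) can be sketched in a few lines of Python. This is only an illustration: the answer above does not say which interval to use, so the Wilson score interval here is an assumed choice, `wilson_interval` is a hypothetical helper name, and the 10,000-simulation count of 133 is a made-up figure chosen to give roughly the same proportion as 2 out of 150.

```python
import math

# Exact number of ways to split 500 results into two groups of 250.
n_arrangements = math.comb(500, 250)
print(f"500 choose 250 has {len(str(n_arrangements))} digits")


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion k/n.

    k: number of simulations at least as extreme as the observed result
    n: total number of simulations
    z: normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 2 extreme results out of 150 simulations, as in the video.
lo_150, hi_150 = wilson_interval(2, 150)
# Hypothetical: about the same proportion, but from 10,000 simulations.
lo_10k, hi_10k = wilson_interval(133, 10_000)

print(f"150 simulations:    {lo_150:.4f} to {hi_150:.4f}")
print(f"10,000 simulations: {lo_10k:.4f} to {hi_10k:.4f}")
```

The interval from 10,000 simulations comes out much narrower than the one from 150, which is the sense in which more simulations make the estimated p-value more trustworthy.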

- We should care about these two points, not just one of them, shouldn't we?(4 votes)
- I agree. I think we should care about those 2 points. I did not understand why Sal took care of just one of them.(2 votes)

- At 5:42, Sal looks at the probability that there would be a 10 gram difference. At 6:07, he says that there is a 2/150 probability that the results are due to chance, indicating that he is adding the data point from 10 and the point from -10. This makes sense, as they both are indicative of a 10 point difference (regardless of whether it is +10 or -10). But in one of the problems (the autism/diet one), one of the hints only includes the data points from the positive side.

So, if the result of THIS experiment was, in fact, an 8 gram difference, would it have been significant? Would it be a 4/150 (approx 2.7%) chance -- adding from the negative side -- or a 9/150 (6%) chance -- adding from both sides -- that the results are insignificant?(3 votes)
- Since it's not entirely clear which way they did the subtraction, my recommendation would be to go with the 2-sided test: meaning we add from both directions, so we'd get 9, and hence the probability of an 8g difference assuming the two groups have no difference is 6%.

If the question had given us an indication of direction, we could have used that and gotten 4 or 5 (2.7% or 3.3%) instead. That's certainly legitimate, but the television and snacking example doesn't give us that.(3 votes)
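The arithmetic in this exchange can be checked with a short Python sketch. The counts are taken from the thread as the commenters read them off the dot plot (4 dots on the negative side, 9 on both sides combined, so 5 on the positive side, out of 150); `tail_fractions` and `ALPHA` are hypothetical names introduced only for illustration.

```python
ALPHA = 0.05  # conventional 5% significance threshold


def tail_fractions(n_lower, n_upper, n_sims):
    """Return one-sided and two-sided fractions of simulated differences.

    n_lower: simulations with a difference at or beyond the negative cutoff
    n_upper: simulations with a difference at or beyond the positive cutoff
    n_sims:  total number of re-randomizations
    """
    return {
        "lower": n_lower / n_sims,
        "upper": n_upper / n_sims,
        "two_sided": (n_lower + n_upper) / n_sims,
    }


# Counts for an 8 gram difference, as quoted in the thread.
fractions = tail_fractions(n_lower=4, n_upper=5, n_sims=150)

for name, value in fractions.items():
    verdict = "below" if value < ALPHA else "not below"
    print(f"{name}: {value:.3f} ({verdict} the 5% threshold)")
```

With these counts, each one-sided fraction falls under 5% while the two-sided fraction does not, which is why the choice of direction matters for an 8 gram difference.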

- I don't understand why, during the randomization, Sal creates new groups that have a mix of both kids who saw food videos and game videos.

If we wanted to be significant, shouldn't we have repeated the experiment 1000 times, keeping the kids who watched the food commercials in one group and the others in their group, in order to be able to compare the means?

I feel like we're comparing apples with pears which doesn't make sense.(2 votes)
- It's because we are assuming that the treatment - watching a food commercial or a non-food commercial - has no effect on how much a person eats. Under that assumption, the groups are equivalent, it's only random chance that we got the results we did, and any random shuffling of the kids among the two groups would be an equally likely outcome.

Hence, *we* perform that shuffling, and calculate the difference between the groups each time. In doing so, we get a whole distribution of these differences.

If our assumption is true, then the *actual* difference that we observed (wasn't it 8 or 10 or something like that?) should be somewhere in the middle of that distribution. It doesn't have to be in the exact middle, but close enough that we don't think the observed result would be unreasonable. On the other hand, if the observed difference is out in the tails of this distribution of differences based on the shuffling, then we would think that our assumption is a very poor one, and therefore we would conclude that the type of commercial *does* influence how much someone eats.(5 votes)
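The shuffling described in this answer can be sketched in Python as follows. This is a minimal illustration, not the simulator used in the video: the cracker weights are synthetic (the real data are not given), and the function names, the seed values, and the normal distributions used to generate the data are all assumptions.

```python
import random
from statistics import mean


def rerandomized_differences(group_one, group_two, n_sims=150, seed=0):
    """Shuffle the pooled results into two new groups n_sims times.

    Returns the difference in means (new group one minus new group two)
    for each shuffle, keeping the original group sizes.
    """
    rng = random.Random(seed)
    pooled = list(group_one) + list(group_two)
    size_one = len(group_one)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diffs.append(mean(pooled[:size_one]) - mean(pooled[size_one:]))
    return diffs


def fraction_as_extreme(diffs, observed):
    """Fraction of shuffled differences at least as large in size as observed."""
    return sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)


# Synthetic cracker weights in grams (assumed, for illustration only).
data_rng = random.Random(42)
control = [max(0.0, data_rng.gauss(40, 25)) for _ in range(250)]
treatment = [max(0.0, data_rng.gauss(50, 25)) for _ in range(250)]

observed = mean(treatment) - mean(control)
diffs = rerandomized_differences(treatment, control)

print(f"observed difference in means: {observed:.1f} g")
print(f"fraction of shuffles as extreme: {fraction_as_extreme(diffs, observed):.3f}")
```

If the observed difference sits far out in the tails of `diffs`, the fraction printed on the last line is small, which is the situation the answer describes as evidence against the "no effect" assumption.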

- I need more things like this. Can someone please give me a link to something like this.(3 votes)
- How can we be so certain that this is significant? We do not know whether the 10gm difference is between 10gm and 20gm or 100gm and 110gm or 180gm and 190gm, for example. Ten grams is about 1/3 of an ounce, a very tiny bit. If we are looking at a 10gm vs 20gm consumption, this is definitely significant, since consumption doubled. But what if we were looking at 180gm vs 190gm? This is only a 5.5% increase. If the consumption was 240gm vs 250gm, the difference decreases to just 4%. How can we say definitively that the difference is significant if we do not have more information about the actual amounts consumed?(3 votes)
- What is the negative part of the graph for?(2 votes)
- The study found that for those tests, the mean was less than the original experiment.(1 vote)

## Video transcript

Voiceover: In an experiment aimed at studying the effect of advertising on eating behavior in children, a group of 500 children, seven to 11 years old, were randomly assigned to two different groups. After randomization, each child was asked to watch a cartoon in a private room containing a large bowl of goldfish crackers. The cartoon included two commercial breaks. The first group watched food commercials, mostly snacks, while the second group watched non-food commercials, games and entertainment products. Once the child finished watching the cartoon, the conductors of the experiment weighed the cracker bowls to measure how many grams of crackers the child ate. They found that the mean amount of crackers eaten by the children who watched food commercials is 10 grams greater than the mean amount of crackers eaten by the children who watched non-food commercials.

Let's just think about what
happens up to this point. They took 500 children and then they randomly assigned them to two different groups. You have group one over here and you have group two. Let's say that this right over here is the first group. The first group watched food commercials. This is group number one. They watched food commercials. We could call this the treatment group. We're trying to see what's the effect of watching food commercials. And then they tell us the second group watched non-food commercials, so this is the control group. Number two, this is non-food commercials. This is the control right over here.

Once the child finished watching the cartoon, for each child they weighed how much of the crackers they ate, and then they took the mean of it, and they found that the kids here ate 10 grams more on average than this group right over here. Just looking at that data makes you believe that, okay, well, something maybe happened over here. That maybe the treatment from watching the food commercials made the students eat more of the goldfish crackers. But the question that you always have to ask yourself in a situation like this is: well, isn't there some probability that this would have happened by chance, even if you didn't make them watch the commercials? If these were just two random groups and you didn't make either group watch a commercial, you made them all watch the same commercials, there's some chance that the mean of one group could be dramatically different than the other one. It just happened to be in this experiment that the mean here, that it looks like the kids ate 10 grams more.

How do you figure out what's the probability that this could have happened, that the 10 grams greater in mean amount eaten here could have just happened by chance? Well the way you do it is
what they do right over here. Using a simulator, they re-randomize the results into two new groups and measure the difference between the means of the new groups. They repeated the simulation 150 times and plotted the differences given. The resulting difference is as given below.

What they did is they said, okay, they have 500 kids and each kid, they had 500 children. Number one, two, three, all the way up to 500. For each child they measured how much was the weight of the crackers that they ate. Maybe child one ate two grams and child two ate four grams and child three ate, I don't know, 12 grams, all the way to child number 500 ate, I don't know, maybe they didn't eat anything at all, ate zero grams. We already know, let's say the first time around, the first half was in the treatment group, when we're just ranking them like this, and then the second, they're randomly assigned into these groups, and the second half was in the control group.

What they're doing now is they're taking the same results and they're re-randomizing it. Now they're saying, okay, let's maybe put this person in group number two and this person in group number two and this person stays in group number two and this person stays in group number one and this person stays in group number one. Now they're completely mixing up all of the results that they had. It's completely random whether the student had watched the food commercial or the non-food commercial, and then they're testing what's the mean of the new number one group and the new number two group. They're saying, well, what is the distribution of the differences in means? They see, when they did it this way, when they're essentially just completely randomly taking these results and putting them into two new buckets, you have a bunch of cases where you get no difference in the means. Out of the 150 times that
they repeated the simulation doing this little exercise here, one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13, 14, 15. I'm having trouble counting this, let's see. One, two, three, four, five, six, seven, eight, nine, 10, 11, 12. It's so small, I'm aging, but it looks like there's about, I don't know, high teens, about 20 times when there's actually no noticeable difference in the means of the groups where you just randomly allocate the results amongst the two groups.

When you look at this, if you just randomly put people into two groups, the situations where you get a 10 gram difference are actually very unlikely. Let's see, is this the difference? The difference between the means of the new groups. It's not clear whether this is group one minus group two or group two minus group one, but in either case the situations where you have a 10 gram difference in mean, it's only two out of the 150 times. When you do it randomly, when you just randomly put these results into two groups, the probability of the means being this different, it only happens two out of the 150 times. There's 150 dots here. That is on the order of 2%, or actually it's less than 2%, it's between one and 2%.

Let's say the situation we're talking about, let's say that this is group one minus group two in terms of how much was eaten, and so you're looking at this situation right over here. That's only one out of 150 times. It happened less frequently than one in 100 times. It happened only one in 150 times. If you look at that, you
say, well, the probability this was just random, the probability of getting the results that you got, is less than 1%. To me, and to most statisticians, that tells us that our experiment was significant, that the probability of getting the results that you got, the children who watched food commercials being 10 grams greater than the mean amount of crackers eaten by the children who watched non-food commercials, if you just randomly put 500 kids into two different buckets, based on the simulation results, it looks like, if you'd run the simulation 150 times, that only happened one out of the 150 times.

It seems like it's very unlikely that this was purely due to chance. If this was just a chance event, this would only happen roughly one in 150 times, but the fact that this happened in your experiment makes you feel pretty confident that your experiment is significant. In most studies, in most experiments, the threshold that they think about for something being statistically significant is that the probability of it happening by chance is less than 5%. This is less than 1%. I would definitely say that the experiment is significant.
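
The two fractions quoted in the transcript can be written out as a quick check. The symbols $\hat{p}_{\text{either}}$ and $\hat{p}_{\text{one}}$ are labels assumed here for illustration, not notation from the video; the counts (2 dots and 1 dot out of 150) and the 5% threshold come from the transcript itself.

$$
\hat{p}_{\text{either}} = \frac{2}{150} \approx 0.0133 \;(1.33\%), \qquad
\hat{p}_{\text{one}} = \frac{1}{150} \approx 0.0067 \;(0.67\%), \qquad
\hat{p}_{\text{one}} < \hat{p}_{\text{either}} < 0.05
$$

Either way of counting lands below the 5% threshold, so the conclusion does not depend on whether the difference is read as group one minus group two or the reverse.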