Estimating a P-value from a simulation

Example of estimating a P-value based on a simulation to approximate a sampling distribution assuming the null hypothesis is true.

Want to join the conversation?

  • Siddharth Rayaprolu
    But why is the p-value based on 20%? The alternative hypothesis just asks for >6%. In that case it's 15 students out of 40. So the p-hat is supposed to be 15/40, right?
    (29 votes)
    • ilya112358
      We are trying to reject the null hypothesis. We got a proportion of 20% from the sample, and we want to see how probable it is to get a value at least this high if the null hypothesis (a true proportion of 6%) were true. This probability is called the p-value.

      There are 25 students in a sample. 40 is the number of samples (each of size 25) that she simulates to estimate the p-value.

      Also, the p-value is NOT the probability that the null hypothesis is correct. We start the whole experiment assuming that it is correct, and if we fail to reject it we simply return to where we started.
      (8 votes)
  • White Shuu
    So here the p-value is 7.5%. This means the null hypothesis is not rejected. Correct?
    (5 votes)
    • Brian H
      I think that it would depend on the significance level that is set. Sometimes that could be 10%, other times less than 1%. As the significance level doesn't seem to be mentioned in this question we can't conclude if it is rejected. (Instead we're simply estimating the value that would be used to evaluate the rejection/acceptance decision.)
      (17 votes)
  • sdbshaad786
    Why take >= 20% for the p-value and not just 20%?
    (10 votes)
    • jaeshinhyun96
      I think that's because of the definition of the p-value itself. The p-value is the probability of getting test results "at least as extreme as" the observed result (here 20%). That means we have to take values more extreme than 20% into account.
      It might be quite confusing, but what we are trying to do here is see whether we can reject the null hypothesis (because that would support our suspicion that there are more vegetarians in our school).

      If the sample proportion was 20%, then we also include sample proportions that are greater than 20% in order to test the null hypothesis.
      (4 votes)
  • ricardoadam_
    Why is the professor running a simulation when you can calculate this from the binomial distribution with p = 0.06? For n = 25, P(p-hat >= 20%) = 1 - 0.98495 = 0.01505. That is very, very different from the 7.5% found in the exercise.
    (6 votes)
  • JorgeMercedes
    Please show us how to obtain the p-value without simulation.
    (6 votes)
  • gurjassingh92
    Why do we always take the significance level as 0.05? Is it a universal value or what?
    (3 votes)
    • Bjorn Sverre Flatbro
      It's simply a rule of thumb. In medicine, for instance, you would definitely NOT want a significance level as high as 0.05. Instead, you might want a significance level of, for instance, 0.001.
      The lower the significance level, the harder it is to reject H0. The reason you'd want H0 to be hard to reject in the medical field is simple. Imagine giving a medicine to patients when there is, for instance, a 5% chance (significance level of 0.05) of concluding that it works when it actually doesn't. That could be catastrophic.
      (6 votes)
  • Evan
    I tried working this problem by first calculating the standard deviation of the sample proportion, assuming the null hypothesis was true.

    sqrt((0.06*0.94)/25) = 0.0475

    I then tried plugging this into the normalcdf function on my calculator with the following inputs.

    minimum: 0.2
    maximum: 1
    mean: 0.06
    standard deviation: 0.0475

    I got an answer of about 0.16%. This is completely off from the 7.5% that Sal got in the video. Why does the way I tried to solve it not work? Thanks for your help!
    (3 votes)
    • deka
      The formula Z = (m_sample - m_population) / std_sample does give about 0.16% as the p-value.

      But this calculation relies on the Z-table, which assumes the sampling distribution is normal.

      As we see in the simulation above, it's not normally distributed. The expected number of successes (1.5) is also less than 10 (while the expected number of failures, 23.5, is greater than 10), so the normal condition is not met.

      In short, when the normal condition isn't met, the Z-table p-value isn't reliable and we're better off using simulation. That might be the (implicit) point of this video, I believe.
      (3 votes)
  • ilya112358
    It occurs to me that the sample size in this example is too small. For example, if we were building a confidence interval, we would require that the sample contain at least 10 successes. In this example, there are only 5 (20% of 25).

    After reading https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-proportion/a/conditions-inference-one-proportion I can say with more confidence that the sample size in Evie's experiment is quite small. That is probably why she used simulation: using a z-score is unjustified in this case, since the sampling distribution is not normal.
    (3 votes)
  • Charles g
    We took p as >= 20%, while in the question it is just 20%.
    (3 votes)
  • tamas.zeffer
    Based on the graph we have 45 samples, because there are 45 little balls on the graph. Just a small correction, assuming I am right. Am I right? :-)
    (2 votes)
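
Several comments above ask how to get this p-value without a simulation. Here is a minimal sketch, assuming the number of vegetarians in a sample of 25 follows a binomial distribution with p = 0.06 under the null hypothesis. The function and variable names are illustrative and do not come from the video. For comparison, it also computes the normal approximation that one commenter tried:

```python
import math


def binomial_upper_tail(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def normal_upper_tail(x: float, mean: float, sd: float) -> float:
    """P(Y >= x) for Y ~ Normal(mean, sd), via the complementary error function."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))


n, p0, p_hat = 25, 0.06, 0.20   # sample size, null proportion, observed proportion
k_obs = round(p_hat * n)        # observed count: 5 vegetarians out of 25

exact = binomial_upper_tail(n, p0, k_obs)

# Normal approximation (assumption: only reasonable when n*p0 and n*(1-p0) are both >= 10).
sd = math.sqrt(p0 * (1 - p0) / n)
approx = normal_upper_tail(p_hat, p0, sd)

print(f"exact binomial p-value: {exact:.5f}")
print(f"normal approximation:   {approx:.5f}")
print(f"expected successes under H0: {n * p0}")
```

The exact tail should come out near the 0.01505 quoted in the comments, and the normal approximation near 0.16%. The expected number of successes is only 1.5, so the gap between the two is a sign that the normal condition is not met here.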
    Default Khan Academy avatar avatar for user

Video transcript

- [Instructor] So we have a question here on p-values. It says Evie read an article that said 6% of teenagers were vegetarians, but she thinks it's higher for students at her school. To test her theory, Evie took a random sample of 25 students at her school, and 20% of them were vegetarians.
So just from this first paragraph, some interesting things are being said. It's saying that the true population proportion, if we believe this article, of teenagers that are vegetarian, we could say that is 6%. Now for her school, there is a null hypothesis that the proportion of students at her school that are vegetarian, so this is at her school, that the true proportion, the null would be that it's the same as the proportion of teenagers as a whole. So that would be the null hypothesis. And you can see that she's generating an alternative hypothesis, but she thinks it's higher for students at her large school. So her alternative hypothesis would be that the proportion, the true population parameter for her school, is greater than 6%.
And so to see whether or not you could reject the null hypothesis, you take a sample, and that's exactly what Evie did. She took a random sample of 25 students, and you calculate the sample proportion. And then you figure out what is the probability of getting a sample proportion this high or greater? And if that probability is lower than a threshold, then you will reject your null hypothesis. And that probability we call the p-value.
The p-value is equal to the probability that your sample proportion, as she's doing this for students at her school, is going to be greater than or equal to 20% if you assumed that your null hypothesis was true. So if you assumed that the true proportion at your school was 6% vegetarians, but you took a sample of 25 students where you got 20%, what is the probability of getting 20% or greater for a sample of 25? Now there's many ways to approach it, but it looks like she is using a simulation.
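Written out compactly, the setup described so far looks roughly like this. The symbols are the usual notation, with $p$ for the true proportion at her school and $\hat{p}$ for the sample proportion; the transcript itself states all of this in words only:

```latex
\begin{align*}
H_0 &: p = 0.06 \\
H_a &: p > 0.06 \\
\text{p-value} &= P\left(\hat{p} \ge 0.20 \mid p = 0.06,\ n = 25\right)
\end{align*}
```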
To see how likely a sample like this was to happen by random chance alone, Evie performed a simulation. She simulated 40 samples of n equals 25 students from a large population where 6% of the students were vegetarian. She recorded the proportion of vegetarians in each sample. Here are the sample proportions from her 40 samples. So what she's doing here with the simulation, this is an approximation of the sampling distribution of the sample proportions if you were to assume that your null hypothesis is true.
And it says below, Evie wants to test her null hypothesis, which is that the true proportion at her school is 6%, versus the alternative hypothesis that the true proportion at her school is greater than 6%, where p is the true proportion of students who are vegetarian at her school. And then they ask us, based on these simulated results, what is the approximate p-value of the test? And they say, the sample result, the sample proportion here, was 20%, we saw that right over here.
Well, if we assume that this is a reasonably good approximation of the sampling distribution of our sample proportions, there's 40 data points here, and in how many of these samples do we get a sample proportion that is greater than or equal to 20%? Well you could see this is 20% right over here, 20 hundredths, and so you see we have three right over here that meet this constraint. And so that is three out of 40. So if we think this is a reasonably good approximation, we would say that our p-value is going to be approximately three out of 40, that if the true population proportion for the school were 6%, if the null hypothesis were true, then approximately three out of every 40 times you would expect to get a sample with 20% or larger being vegetarians.
And so three-fortieths is what? Let's see, if I multiply both the numerator and denominator by two and a half, I say two and a half 'cause that's how you go from 40 to 100, and two and a half times three would be 7.5. So I would say this is approximately 7.5%, and this is actually a multiple choice question, and if we scroll down, we do indeed see approximately 7.5% right over there.
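
The simulation described in the transcript can be sketched in a few lines. This is only an illustration under stated assumptions: each simulated student is treated as an independent draw with a 6% chance of being vegetarian, and the function names and the fixed seed are arbitrary choices, not something shown in the video.

```python
import random


def simulate_sample_proportions(num_samples, n, p, rng):
    """Draw num_samples samples of size n; return each sample's proportion of successes."""
    proportions = []
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


def estimate_p_value(proportions, observed):
    """Fraction of simulated proportions at least as large as the observed one."""
    # The small tolerance guards against floating-point noise in the >= comparison.
    hits = sum(1 for prop in proportions if prop >= observed - 1e-9)
    return hits / len(proportions)


rng = random.Random(0)  # fixed seed (arbitrary) so a run can be repeated

small = simulate_sample_proportions(40, 25, 0.06, rng)      # same size as Evie's simulation
large = simulate_sample_proportions(20_000, 25, 0.06, rng)  # many more samples

print("40 simulated samples:    ", estimate_p_value(small, 0.20))
print("20,000 simulated samples:", estimate_p_value(large, 0.20))
```

With only 40 simulated samples the estimate can move quite a bit from one run to the next, since each dot is worth 2.5 percentage points. A much larger simulation should settle closer to the value that a direct binomial calculation gives.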