Question 1

What's the unit on the y-axis once you convert the frequency histogram to a density curve? You marked the peak as 0.2 in the density curve. What does 0.2 stand for?

Accepted Answer

I was struggling with this as well. The unit on the y-axis of the density curve is not percentage, it is density: percentage per unit of the x-axis (here, per glass of water). Density values can be greater than 1. In the frequency histogram the y-axis was percentage, but in the density curve the y-axis is density and the _area_ gives the percentage. When creating the density curve, the values on the y-axis are scaled so that the total area under the curve is 1. So a peak of 0.2 does not mean that 20% of people are at that value; it means that a narrow interval of width w around the peak holds roughly 0.2 × w of the data. This lets us pick an arbitrary interval on the x-axis and calculate the percentage from the area over that continuous interval, as opposed to reading the percentage for a single bucket off the frequency histogram. It can be confusing because the histogram and the density curve have the same visual shape, but the units on their y-axes are different things. I assume this is explained better than I can in a video down the line.
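
Here is a minimal sketch of that scaling in Python. The 16 values and the bucket width of 0.5 are made up for illustration (they are not the data from the video); the point is only that density = relative frequency / bucket width, and that the areas then add up to 1.

```python
# Hypothetical data: glasses of water per day for 16 people (made-up values).
data = [0.5, 1.2, 1.4, 1.8, 2.1, 2.3, 2.4, 2.6,
        2.7, 2.9, 3.1, 3.2, 3.4, 3.8, 4.3, 4.9]

bin_width = 0.5   # assumed bucket width
n_bins = 10       # buckets cover the range [0, 5)

# Count how many values fall in each bucket.
counts = [0] * n_bins
for x in data:
    counts[int(x // bin_width)] += 1

# Relative frequency: the share of the data in each bucket (sums to 1).
rel_freq = [c / len(data) for c in counts]

# Density: relative frequency per unit of x, i.e. share divided by width.
density = [p / bin_width for p in rel_freq]

# Area of each bar = height (density) * width; the areas sum to 1.
total_area = sum(d * bin_width for d in density)

print("tallest relative frequency:", max(rel_freq))
print("tallest density:", max(density))
print("total area:", total_area)
```

With buckets narrower than 1 unit, each density is larger than the corresponding relative frequency, which is how density values can end up above 1.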

Question 2

3:28 If we become more granular, so that there are fewer data points per rectangle, why doesn't the percentage decrease?

Accepted Answer

Because there are more rectangles.

When you have a relative frequency histogram (discrete) you plot the probability that the data falls in each of those rectangles, and each value of your data can be in exactly one of them. If you add them all up you get 100%.

Imagine that you have a continuous data set (like the quantity of water that people drink every day), and you try to fit the values in your discrete rectangles. Then each rectangle should be a range, and you plot the probability that someone would drink water inside that range.

What you are thinking makes a lot of sense. If I simply plot the probability of someone drinking water inside a smaller range, then the probability should be smaller, and the shape of the rectangles should change, being squashed.

BUT, then the area under the curve wouldn't be equal to 100%, because the total range is the same but the probabilities are smaller.

What those numbers represent is the probability that some value falls into each range, DIVIDED by the width of the range. Those values are the probability PER UNIT. Now it makes sense to have infinitely many small rectangles.

If you plot the probability divided by the range, then you can take an integral, summing up infinitely many infinitely thin rectangles, each with an area equal to the probability of a value falling into it. That area gets smaller as we make the rectangles narrower, because the width shrinks, but the height does not.

In summary, what you thought was the height of the rectangle is in reality the area of that rectangle. The height is the probability per unit of range, and the width is the range in which we could include values.
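
A small simulation can make this concrete. This is only a sketch under assumptions: the data are 100,000 simulated values from a triangular distribution on [0, 6] with its peak at 3 (a stand-in I picked, not the data from the video), and `histogram` is a hypothetical helper.

```python
import random

random.seed(0)

# Hypothetical continuous data: simulated "glasses per day" values in [0, 6].
data = [random.triangular(0.0, 6.0, 3.0) for _ in range(100_000)]

def histogram(values, width, hi=6.0):
    """Return (relative frequencies, densities) for equal-width bins on [0, hi)."""
    n_bins = round(hi / width)
    counts = [0] * n_bins
    for v in values:
        # Clamp so a value at the very top edge lands in the last bin.
        i = min(int(v / width), n_bins - 1)
        counts[i] += 1
    rel = [c / len(values) for c in counts]   # probability per rectangle
    dens = [r / width for r in rel]           # probability per unit of x
    return rel, dens

results = {}
for width in (1.0, 0.5, 0.1):
    rel, dens = histogram(data, width)
    area = sum(d * width for d in dens)
    results[width] = (max(rel), max(dens), area)
    print(f"width={width}: tallest rel. freq={max(rel):.3f}, "
          f"tallest density={max(dens):.3f}, total area={area:.3f}")
```

As the rectangles get narrower, the tallest relative frequency keeps shrinking, while the tallest density stays near the peak of the underlying curve (1/3 for this assumed triangular distribution) and the total area stays at 1.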

Question 3

When we switch from histogram to a density curve, we are told to conceptualize the buckets as getting infinitesimally narrow until they are no longer represented by bars but by a line connecting their tops.

However, the likelihood of any data point matching a given value becomes smaller and smaller as our buckets get narrower and narrower. So if you could zoom in, it would look like a spike followed by zero for a while, then another spike.

Is the best way to conceptualize this as incredibly small buckets connected by a sort of trend line that is "stiff" (i.e. not dropping to zero in the space between buckets)? Or should the x-axis of density curves be conceptualized as legitimately continuous values (i.e. not buckets)? And if so, why isn't the line at 0 most of the time?

Accepted Answer

That was exactly my question, although you asked it one year before me :P

As of now I think the analogy for the density curve as a relative frequency histogram with incredibly thin rectangles is incorrect.

From other sites, I've gathered that the density curve can rather be thought of as the *derivative of the cumulative distribution function/curve*.

I am not the best person to explain exactly what this means, but check out this link:
https://math.stackexchange.com/questions/210630/what-does-the-value-of-a-probability-density-function-pdf-at-some-x-indicate?rq=1

The probability of getting an X in an interval is represented by the *tops* of the rectangles for relative frequency histograms, but for a density curve it is represented by the area under the curve. Therefore, I think that the density curve is *not* a relative frequency histogram with really thin rectangles.
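
For anyone who wants to see the "derivative of the cumulative distribution function" idea numerically, here is a rough sketch. It assumes a normal curve with mean 3 and standard deviation 1 as a stand-in for the density curve (the video does not give a formula), and `cdf_slope` is a hypothetical helper that estimates the slope of the CDF with a small finite difference.

```python
from statistics import NormalDist

# Hypothetical density curve: normal with mean 3 and standard deviation 1.
dist = NormalDist(mu=3.0, sigma=1.0)

def cdf_slope(x, h=1e-5):
    """Central-difference estimate of the slope of the CDF at x."""
    return (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

points = (1.0, 2.9, 3.0, 4.5)
pairs = [(dist.pdf(x), cdf_slope(x)) for x in points]

for x, (density, slope) in zip(points, pairs):
    print(f"x={x}: density={density:.6f}, slope of CDF={slope:.6f}")
```

The two columns agree to many decimal places. Note that the slope estimate is just the probability of landing in [x - h, x + h] divided by the width 2h of that interval.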

Question 4

7:59 Not to be picky, but it would be molecule, right?

Accepted Answer

Well, technically, yes.

Question 5

So there are no values that are exactly 3? This part confuses me a little bit. What if you asked how many people drank 2.9 glasses of water, which was an original data point? Is there any way to tell what % drank 2.9 glasses of water, or do you just have to assume that it isn't exactly 2.9 glasses?

Accepted Answer

At 2:47 he mentioned that you have "more data points" and want "more granular categories", which suggests that the data set used is not the same one as in the first histogram. (3:20 - 16 million data points)

It seems like the lesson assumed that the first data set is discrete, 'cause it's countable (16 students), and the precision is only 1 decimal digit. It takes a huge number of data points to construct a distribution function of a continuous variable, and the variable itself, when described as a curve, is not countable.

In reality, the number of glasses of water drunk per day by, say, 1 billion people, with each data point written with 10 decimal digits, is effectively a continuous variable. The probability that someone drinks exactly 2.9000000000 glasses of water/day is close to 0.
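
A quick sketch of why that probability ends up close to 0. The normal curve with mean 3 and standard deviation 1 is an assumption standing in for the real density curve, and `prob_within` is a hypothetical helper.

```python
from statistics import NormalDist

# Hypothetical density curve: normal with mean 3 and standard deviation 1.
dist = NormalDist(mu=3.0, sigma=1.0)

def prob_within(center, half_width):
    """Probability that a value lands within half_width of center."""
    return dist.cdf(center + half_width) - dist.cdf(center - half_width)

# Shrink the interval around 2.9 from +/-0.1 down to +/-0.0000001.
half_widths = [10.0 ** -k for k in range(1, 8)]
probs = [prob_within(2.9, h) for h in half_widths]

for h, p in zip(half_widths, probs):
    print(f"2.9 +/- {h:g}: probability {p:.10f}")
```

Every time the interval gets 10 times narrower, the probability gets about 10 times smaller, so "exactly 2.9" has essentially no probability while "between 2.8 and 3.0" does.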

Question 6

At 8:50, shouldn't he take the integral between 2.9 and 3.1 of the function that gives that density curve? Why is he doing only approximations?

Accepted Answer

Actually, he should take the integral, Andrei, but since we are dealing with statistics, Sal approximated the area with a very thin rectangle so that students who are unfamiliar with integral calculus feel comfortable.
Anyway, both ways give approximately the same answer, but integration gives the more precise one.
Hope that helps.
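
To get a feel for how close the two approaches are, here is a sketch that compares them. The actual function behind the curve in the video is not given, so this assumes a normal curve with mean 3 and standard deviation 1 purely for illustration.

```python
from statistics import NormalDist

# Hypothetical density curve: normal with mean 3 and standard deviation 1.
dist = NormalDist(mu=3.0, sigma=1.0)
a, b = 2.9, 3.1

# One rectangle: width of the interval times the height at its midpoint.
rectangle = (b - a) * dist.pdf((a + b) / 2)

# The integral of the density from a to b, computed from the CDF.
exact = dist.cdf(b) - dist.cdf(a)

# Many thin rectangles (midpoint rule): this is what the integral sums up.
n = 1000
width = (b - a) / n
riemann = sum(dist.pdf(a + (i + 0.5) * width) for i in range(n)) * width

print(f"one rectangle:        {rectangle:.6f}")
print(f"1000 thin rectangles: {riemann:.6f}")
print(f"integral:             {exact:.6f}")
```

For an interval this narrow, the single rectangle is already very close to the integral, and slicing it into more rectangles closes the remaining gap.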

Question 7

I do not understand what is meant by "There might be many people between 2.9 and 3.1, but no one is exactly three glasses a day." Three glasses are three glasses, so why 2.9 or 3.1?

Accepted Answer

To give an analogy, let's suppose each glass is 200 ml.

It is practically impossible to drink exactly 600 ml: measured precisely enough, the amount might be 599.92... or 600.1197... or 600.023... ml.

Question 8

How come the total area under the curve is 1? Could someone please explain? I am a little confused about it.

Accepted Answer

1 in this case just represents 100%. The total area under the curve covers 100% of the data. So if the area under the curve over a certain interval is 0.2, it means 20% of the data falls in that interval. Hope this helps.

Course: AP®︎/College Statistics > Unit 4

Density Curves
