12 Appendix
In this appendix we dig deeper into concepts that we gloss over in the main text but that you may not be familiar with:
- Statistical Fundamentals
  - Centrality Measures
  - Percentiles
12.1 Statistical Fundamentals
12.1.1 Centrality Measures
In Chapter 2 we briefly discussed GDP per capita calculated from the Penn World Tables. Here we revisit that data and describe some measures of centrality. First, though: in the context of data, what is centrality?
Let’s first consider a small class of 9 students. On their first exam, they receive the following scores:
Student | Score |
---|---|
1 | 65 |
2 | 70 |
3 | 77 |
4 | 82 |
5 | 84 |
6 | 88 |
7 | 92 |
8 | 95 |
9 | 99 |
We want to know what the “center” of these scores is. But how can we do that?
- We could add up all the scores and divide by the number of students.
- We could find the score halfway between the lowest and the highest.
- Since there are 9 students, we could look at the 5th highest score: 4 students’ scores would fall below it and 4 above.
These are all considered measures of centrality. In the context of data, centrality is some estimate of the middle of the observations. Which measure we use often comes down to the question we are trying to answer.
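The three options in the list above can be tried directly on the exam scores. The following is a minimal Python sketch; the variable names are illustrative assumptions, not something used elsewhere in the text.

```python
import statistics

# Exam scores from the table above, one per student.
scores = [65, 70, 77, 82, 84, 88, 92, 95, 99]

# Option 1: add up all the scores and divide by the number of students.
mean_score = sum(scores) / len(scores)

# Option 2: the point halfway between the lowest and the highest score.
midrange_score = (min(scores) + max(scores)) / 2

# Option 3: the middle score of the ordered list (4 below, 4 above).
median_score = statistics.median(scores)

print(round(mean_score, 2), midrange_score, median_score)
```

The three approaches give three different answers (about 83.56, 82, and 84), which is why the choice of measure matters.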
12.1.1.1 Mean
The mean (or average) of a group is simply the sum of all numbers, divided by the number of observations.
That is, given the test scores above, we would have:
\[\frac{65+70+77+82+84+88+92+95+99}{9}=\frac{752}{9}\approx 83.56\]
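The same calculation can be written for any group of observations. As a sketch, suppose we label the observations \(x_1, x_2, \dots, x_n\) and call their mean \(\bar{x}\) (these symbols are shorthand assumed here for illustration):

\[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}=\frac{1}{n}\sum_{i=1}^{n}x_i\]

For the test scores, \(n=9\) and the observations sum to 752.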
While the mean is a great tool for measuring centrality, it is not always ideal. One of its weaknesses is that it is susceptible to outliers.
12.1.1.2 Median
The median of a group is the middle number in the ordered list of observations. In the table of students’ test scores above, we have already listed them from lowest to highest (this is what we mean by an ordered list).
Now look at student 5. Why do we care about student 5? Because we have 9 students’ scores, there are 4 test scores lower than student 5’s and 4 test scores higher. This is the median student, so the median test score is this student’s score: 84.
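The "middle of the ordered list" idea can be sketched in Python as follows. The function name `median_by_position` is hypothetical, and the input is assumed to be non-empty. The text above only covers an odd number of observations; for an even number, a common convention (assumed here) is to average the two middle values.

```python
def median_by_position(values):
    """Return the middle value of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2  # zero-based index of the middle position
    if n % 2 == 1:
        # Odd count: one observation sits exactly in the middle.
        return ordered[middle]
    # Even count: a common convention averages the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [65, 70, 77, 82, 84, 88, 92, 95, 99]
median_score = median_by_position(scores)

# Count how many scores fall strictly below and strictly above the median.
below = sum(1 for s in scores if s < median_score)
above = sum(1 for s in scores if s > median_score)
print(median_score, below, above)
```

For the 9 exam scores this picks student 5's score, with 4 scores below it and 4 above.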
12.1.2 Percentiles
If you are not familiar with the concept of the median, please read the section on it above, since the median is just a very specific case of a percentile: the 50th percentile.
Notice how, when we looked for the median, we wanted an equal number of observations above and below. The median sits at the middle, or the 50th percentile, of the ordered observations. In other words, 50% of observations have a value less than or equal to the median. We can use this interpretation going forward.
For example, if we had the 25th percentile of a group of observations, it would mean that 25% of observations were equal to or lower than this number. For the 75th percentile, 75% of observations are equal to or lower than this number.
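One way to turn this interpretation into a calculation is the "nearest rank" rule: take the smallest observation that has at least p% of the observations at or below it. The sketch below assumes that rule; the function name `percentile_nearest_rank` is hypothetical. Statistical software often interpolates between observations instead, so its answers can differ slightly.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest observation with at least p% of observations at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need at least one value and 0 < p <= 100")
    ordered = sorted(values)
    # 1-based position in the ordered list, rounded up to a whole observation.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


scores = [65, 70, 77, 82, 84, 88, 92, 95, 99]
p25 = percentile_nearest_rank(scores, 25)
p50 = percentile_nearest_rank(scores, 50)
p75 = percentile_nearest_rank(scores, 75)
print(p25, p50, p75)
```

With the 9 exam scores, this gives 77, 84, and 92, and the 50th percentile matches the median found earlier. With so few observations the percentages cannot be hit exactly, which is why the rule says "at least".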
12.1.3 Example
Let’s consider the following scenario: you are an advisor for a small organic foods grocery chain. They currently have about a dozen shops and have found that 80% of their profits come from customers within a 2-mile radius of each store. They also know that, since their products are organic and free trade whenever possible, they need wealthier customers who can afford their products.
The owner of the company has their data and marketing analyst provide you with the following statistic for an area they are considering for a new store: the average wealth of owners within a 2-mile radius of the potential new store is about the same as for the other stores. However, they did notice that a few houses have been bought up by wealthy investors.
What would you say in response? Is average an appropriate statistic in this case? Or under what conditions would an average be a good statistic?
To illustrate this, what if we wanted to know the average wealth of people on a city block? We would add up the wealth of the owners of the 5 or so houses along the block and divide by the number of owners. However, what we don’t know is that a billionaire investor has bought one of the houses on the otherwise poor block, anticipating that values would rise. We could have a scenario like the following:
\[\frac{0 + 0 + 0 + 0 + 1{,}000{,}000{,}000}{5}=200{,}000{,}000\]
In this case we could conclude that this block is extremely wealthy! After all, the average wealth is 200 million dollars! Even though 4 of our 5 observations have 0 wealth, the mean is still very high.
This is what we are referring to when we say that the mean is susceptible to outliers (it is not the only measure of centrality that is).
However, if we instead looked at the median of the data, we would get 0, which describes the typical owner on this block far better.
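The contrast between the two measures can be checked with a few lines of Python. This is a minimal sketch using the five hypothetical wealth values from the example above; the variable names are illustrative.

```python
import statistics

# Wealth of the five owners on the block: four with nothing, one billionaire.
wealth = [0, 0, 0, 0, 1_000_000_000]

mean_wealth = statistics.mean(wealth)      # pulled upward by the single outlier
median_wealth = statistics.median(wealth)  # the middle of the ordered values

print(f"mean:   {mean_wealth:,.0f}")
print(f"median: {median_wealth:,.0f}")
```

A single extreme observation moves the mean to 200 million dollars, while the median stays at 0.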