A pragmatic introduction to Statistics with examples in Excel


Let’s say you passed a job interview and were handed a salary offer of $90,000. Would you be happy to accept it?

Or consider the interviewer’s perspective:

How much salary would you need to earn to be in the top 5% (the 95th percentile)?

Or a quality inspector’s task:

A company is producing bags of a product. Each bag is labelled as weighing approximately 300 g. How do you check that the producer is not cheating by packing less?

Even though our gut feeling is enough for everyday situations, it is worth applying more rigorous techniques to make sure our intuition is correct. Statistics is the right tool for that job, and it’s widely used in data-driven organisations all over the world.

This tutorial contains step-by-step examples in Excel. If you use a different tool for your spreadsheets, you need to translate the steps accordingly.

Table of contents

  1. Introduction
  2. Sampling and Central Limit Theorem — applications in real life
  3. Statistical inference — Using a sample to predict the population parameters
  4. Hypothesis Testing for one sample — how the Volkswagen emissions scandal was discovered
  5. Hypothesis Testing for two samples — how to check if the gender gap exists
  6. Linear Regression — build a model which predicts the future

1. Introduction

We’ll tackle learning statistics from a practical angle. We’re going to use Excel and the Data Analysis ToolPak add-in.

Prerequisites:

  1. Enable the Data Analysis ToolPak by following the step-by-step instructions here.
  2. Download the little_prince_words.xlsx spreadsheet. It contains a couple of sheets, one for each concept described, together with the answer key. I suggest you don’t open the solutions before you practise on your own.
  3. Download little_prince_words.csv. It has a list of words from the brilliant book “The Little Prince” by Antoine de Saint-Exupéry.

Now, let’s use some statistics lingo.

The little-prince.xlsx is our data set. This is a collection of data. Our data set contains two variables: word and count.

Each variable has a type. When we talk about numbers, we call the variable quantitative. When we consider things like gender or level of satisfaction, we talk about qualitative variables.

The little-prince data set contains all the words from the book, with each word’s frequency stored in the column “count.” This is our population.

If we create another file with only ten rows from the original data set, we talk about a sample. So a sample is a subset of a population.

Population and sample are fundamental concepts. We rarely have access to the whole population (all people, all cars, all employees, etc.), and studying it would be very expensive. That’s why we use a sample and, based on the sample, draw our conclusions about the whole population.

Summarising Data

One of the most common tasks is analysing and presenting data in a table or on a chart. We have a couple of options here.

Frequency table

In little_prince_words.csv, we see a list of words, one word per row. The first row is the header: “word.”

Exercise 1.

Find out which word is the most common and how many unique words there are, and present each word’s frequency in the book.

To solve this task, we’re going to generate a frequency table.

Step 1. List all unique words

  1. In the spreadsheet, click Data / Advanced (next to Filter).
  2. Select “Copy to another location.”
  3. In the “List range” field, select all rows (with the header). To do this, click the icon with an arrow in the field, click the A1 cell, then press CMD + SHIFT + Down (on Mac) to select all rows with data.
  4. Check “Unique records only” to keep unique words only.
  5. In the “Copy to” field, choose where the results will be saved: the C1 cell.

Step 2. Count words

  1. The new column with unique words is created. Let’s name it “word.” Next to our column, we’ll create another one: “count.”
  2. In the second row of the D column (D2), type an Excel function:
=COUNTIF($A2:$A16836,C2)
  3. Then double-click the bottom-right corner of the D2 cell to populate the function to the other rows.
  4. Click the D1 cell, then click “Data” / “Sort.” Make sure “My list contains headers” is selected and sort from “Largest to Smallest” by the “count” column.
  5. You should see the most common word, “the,” which appears 980 times.
  6. At the bottom of the D column, add the total:
=SUM(D2:D2126)
  7. Write “frequency” in the E1 cell. In this column, we’d like to see the percentage of usage.
  8. Add this function in E2:
=D2/$D$2127
  9. Double-click the bottom-right corner of the E2 cell to propagate the function down.
  10. Select all values in column E, click “Home,” and format the column by clicking the “%” sign (“Percentage Style”).
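If you happen to be on a recent Excel (365 with dynamic arrays), roughly the same table can be sketched with two spill formulas instead of the Advanced Filter; treat this as an optional shortcut under that assumption, not as part of the exercise. In C2:

=UNIQUE(A2:A16836)

and in D2:

=COUNTIF($A$2:$A$16836,C2#)

The first formula spills the list of unique words; the second counts each of them against the original column (C2# refers to the whole spilled range).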

$\square$

Bar chart

Another excellent way to present data is in charts. Let’s take a look at the first one: a bar chart (a close relative of the histogram).

  1. Click on the first cell of the frequency table, and hold SHIFT and press the Right arrow to select both columns. And then press the CMD + SHIFT + Down to select all rows with data.
  2. Click “Insert” and choose the “Columns” chart (bar chart) / 2D column.

The Pie chart for this example doesn’t look great, so we will skip it.

Central Tendencies

Let’s focus on the frequency table created above. If you want to follow the step-by-step guide, either build the table yourself using the steps above or open the “solution 1” sheet in little_prince_words.xlsx.

Mean

The mean is the sum of all numbers divided by how many numbers there are. Let’s calculate the mean for our frequency table (column D contains the word counts):

=AVERAGE(D2:D2127)

We can say that, on average, a unique word appears about 16 times in the book (15.8 to be precise).

Standard deviation

Standard deviation shows how spread out our data are. A standard deviation of 0 means all numbers are the same. This is useful for describing the data points: for normally distributed data, for example, we can say that about 68% of the values lie within \(\pm1\) standard deviation of the mean.

To calculate the standard deviation in Excel, use:

=STDEV(D2:D2127)

(Strictly speaking, STDEV estimates a sample standard deviation; the population version is STDEV.P, but for a data set this size the difference is negligible.)

Our standard deviation for the little prince data set is 367 (366.799).

Notation

Finally, it’s a bit wordy to always write “population mean” or “population standard deviation,” which is why mathematicians use the Greek alphabet.

For the mean, we use \(\mu\) and for standard deviation \(\sigma\). We can also agree on the convention that we will use Greek letters for the population. We will use \(\bar x\) (x bar) for the sample mean, and \(s\) for sample standard deviation.
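Written out, these symbols stand for the familiar formulas (the only difference for the sample standard deviation is dividing by \(n-1\) instead of \(N\)):

$$\mu=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad \sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$

$$\bar x=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)^2}$$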

Percentile

Also, we would like a way to communicate how common or rare a value is. The percentile is a useful tool for this: it tells you what percentage of values is lower than the value under analysis.

For example, if your salary is in the 90th percentile, it means that approximately 90% of all salaries were lower than yours — Keep up the good work!
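In Excel, both directions of this question have a built-in function. A minimal sketch, assuming the word counts sit in D2:D2126 as in our frequency table:

=PERCENTILE.INC(D2:D2126,0.9)

=PERCENTRANK.INC(D2:D2126,D2)

The first returns the count at the 90th percentile; the second returns the percentile rank of the value in D2.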

Z-score

So far, we’ve figured out that, on average, a unique word in the Little Prince appears 16 times. If we consider the word “the,” we know it appears 980 times. In communication, we can use statistics to be more descriptive. In general, we can ask this question:

How far from the mean is the considered value located?

If a word appears four times, it is 12 below the mean. That’s more descriptive, but still not perfect. Suppose I study another example where the mean is 1 and my value is 0.0001; I’m only 0.9999 away from the mean. Which value is further from the mean? Numerically, 12 > 0.9999, but can we really compare apples to bananas?

We can do better. First, we subtract the mean from our considered value, and then we divide the difference by… the standard deviation. Why?

It tells us how many standard deviations away from the mean our value is located. A value \(1\sigma\) away from the mean is closer to the average than one \(3\sigma\) away. If your salary is more than \(2\sigma\) above the mean, it’s a decent income; if it’s close to the mean, it’s a more modest salary.

The Z-score tells us precisely that.

$$Z={x-\mu \over \sigma}$$

This is a generic, standard way to express how far from the mean our values are among all the examples we’re going to consider.

Exercise 2:

You get an offer of \$65,000. The mean salary is \$54,030 with a standard deviation of \$8,900. Did you get a specialist’s salary or a mid-level one?

Solution:

$$Z={65,000-54,030 \over 8900 }=1.23$$

It looks like the salary is not an expert-level one, because it is only about 1.2 standard deviations above the mean.
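If you prefer to let Excel do the arithmetic, its built-in STANDARDIZE function computes exactly this Z-score (arguments: the value, the mean, the standard deviation):

=STANDARDIZE(65000,54030,8900)

It returns roughly 1.23, matching the calculation above.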

$\square$

How do we know when a Z-score is large enough? By applying the “Empirical rule.”

Empirical rule and outliers

The normal distribution (bell curve) appears everywhere. And in statistics, when something is not “normal,” we can often transform it so that it is. Then, once we have the bell curve, we can apply the empirical rule, which says that:

  • 68% of values lie within \(1\sigma\) of the mean
  • 95% of values lie within \(2\sigma\) of the mean
  • 99.7% of values lie within \(3\sigma\) of the mean
  • Observations further than \(3\sigma\) from the mean are considered outliers.
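You can check how closely your own data follow the rule directly in a spreadsheet. A minimal sketch, assuming (hypothetically) that the values sit in D2:D2126, the mean in K1 and the standard deviation in K2:

=COUNTIFS(D2:D2126,">="&(K1-K2),D2:D2126,"<="&(K1+K2))/COUNT(D2:D2126)

This returns the share of values within one standard deviation of the mean. For heavily skewed data such as our word counts, don’t expect it to be close to 68%, since the rule applies to (approximately) normal data.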

2. Sampling and Central Limit Theorem

In chapter 1, we learned the fundamental language of statistics: outliers, data set, mean, standard deviation, normal distribution, etc.

We practised by using the little prince data set. However, we operated on the whole population, which in most cases is not possible. You cannot survey all the people in a country.

If we cannot consider everyone, let’s consider only a few: we take a sample and use it to gain insights into the whole population.

Sampling

If we take two people from our apartment, we probably won’t be able to claim anything meaningful about the population of the whole country. The sample must be representative. It is surprisingly easy (at least on paper) to get a representative sample: we need to choose it randomly. Choosing the sample at random is called the simple random sampling method; there are a few more methods, such as stratified sampling and cluster sampling.

If we want to understand some population parameters based on our sample, we will carry out a study.

There are two types of studies. In an Experimental study, we intervene directly, e.g., by inviting people to our lab and running the test on them. As you may guess, not every kind of experiment would be ethical; hence the second type, an Observational study, where we only observe and analyse what we “saw.”

It’s all easy once we deal with a normal distribution. However, what should be done if there is no normal distribution?

Central Limit Theorem

The Central Limit Theorem states that it doesn’t matter what the true population distribution is: if you take many sufficiently large samples and calculate the mean of each of them, those means will be approximately normally distributed. We say that the “sampling distribution of sample means” approaches a normal distribution.
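In symbols: for samples of size \(n\) drawn from a population with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of the sample mean \(\bar X\) is approximately normal with mean \(\mu\) and standard deviation

$$SE=\frac{\sigma}{\sqrt{n}}$$

The quantity \(SE\) is called the standard error, and it shrinks as the sample size grows.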

3. Statistical inference

Exercise 3

In the file little_prince_words.xlsx, open the sheet named “Sample.”

This sheet shows our word frequency table with a new column “id.” This has been added because the Data Analysis plugin works only on numeric values.

The goal is to create a random sample of 30 words, then check whether we can estimate the population mean using the sample mean.

Create sample

  1. To create a sample, we first need to go to “Data” / “Data Analysis” and select “Sampling.”
  2. Set the Input Range to \$A\$1:\$A\$2126 (the id column). Make sure “Labels” is checked. Set the Sampling Method to Random, with “Number of Samples” = 30. Set the Output Range to F1. Click OK.
  3. We have the sample, but we only see the word id instead of the word and the count. Let’s fix this.
  4. Label the sample id column (column F) as “id,” then label columns G and H as “word” and “count,” respectively. In the G2 cell, use the function below:

G2 cell:

=VLOOKUP(F2,$A$2:$C$2126,2)

in H2 enter:

=VLOOKUP(F2,$A$2:$C$2126,3)

Then double click on the bottom-right plus sign on both cells to populate these functions among all rows.

Now that we have a random sample, we can calculate the mean of the sample in K2:

=AVERAGE(H2:H31)

To calculate the standard deviation, we’re going to use the sample-specific function (in K3):

=STDEV.S(H2:H31)

Notice that if you do this on your own, you’ll get different values as this is a random sample. My sample mean is 5.5, and the standard deviation is 12.4.

Let’s pretend we don’t know the true population mean (16) and try to predict it based on the sample.

Estimating the population mean

Let’s calculate the range where the true mean may lie: the lower and the upper boundaries. This is called the Confidence Interval for Mean.

  1. In the same sheet, click “Data” / “Data Analysis” and choose “Descriptive Statistics.”
  2. For the “Input Range,” select our count column: \$H\$1:\$H\$31. Enter the Output Range as \$J\$5. Make sure “Labels in first row” is checked (the range includes the header), “Summary statistics” is checked, and “Confidence Level for Mean” is checked and set to 95%. Click OK.
  3. Resize the J column to see all the values. We can see that our mean and standard deviation match the values generated by Data Analysis.
  4. Calculate the lower boundary for the mean. It is our sample mean minus the margin of error, which Data Analysis has already calculated and calls “Confidence Level” (the last row of your descriptive statistics output).
  5. In J22 type “Lower,” in J23 type “Upper.” The K22 cell is equal to =K2-K20 (mean minus confidence level), and the K23 cell is equal to =K2+K20.
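The “Confidence Level” value Excel reports here is the margin of error, so the interval we build in steps 4 and 5 is:

$$\bar x \pm t^{*}_{n-1}\cdot\frac{s}{\sqrt{n}}$$

where \(t^{*}_{n-1}\) is the critical value of the t-distribution with \(n-1\) degrees of freedom for the chosen confidence level, \(s\) is the sample standard deviation and \(n\) the sample size.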

For my sample, the experiment failed: it suggested the true mean lies between 1 and 10, while we know the true mean is 15.8. Why didn’t it work? Partly because we chose too small a sample size (only 30) for such skewed data, and partly because a 95% interval is expected to miss the true value about 5% of the time anyway. For practice, try to repeat the experiment with a larger sample size, e.g., 200.

$\square$

The 95% you selected is a parameter called the Level of Confidence. It describes how likely it is that the interval we calculate contains the true population value. The higher that probability, the wider the interval (and the less informative it is).

4. Hypothesis Testing for one sample

We learned how to use samples to estimate the population mean. Now it’s time to test how true a claim is. We’re going to use hypothesis testing, a widely used method. Famously, it was used by researchers to uncover the Volkswagen emissions scandal (2015).

In general, hypothesis testing works like this:

  1. Assume that the claim we want to verify is true - the Null hypothesis \(H_0\)
  2. Formulate the alternative hypothesis \(H_a\). If \(H_0\) is rejected, we accept \(H_a\).
  3. Check whether our sample data support the claim.

To check whether the data support the claim, we use two tools: the t-test and the p-value.

Why a t-test? What we are doing here is statistical estimation: in our case, we estimate the true mean. We cannot use the Normal distribution, as we don’t know what the true standard deviation is. In that case, the t-distribution is a good fit.

What is the p-value? The p-value is a measure of evidence against the Null Hypothesis: it is the probability of observing data at least as extreme as ours, assuming the Null Hypothesis is true. For example, a p-value of 0.02 means that, if the Null Hypothesis were true, there would be only a 2% chance of seeing such data. In hypothesis testing, before we start the calculation, we choose a significance level, typically 0.05. This is the value below which we’re going to reject the Null Hypothesis in favour of the Alternative Hypothesis.

Exercise 4

For this example, we’ll be using little_prince_words.xlsx, the “hypothesis” sheet, which has one of the samples.

Let’s assume a Little Prince book publisher claims all the books weigh approximately 300 g. You’re going to check if that claim is true. You’ve taken your sample of 50 books, and their masses are in the “hypothesis” sheet.

Solution:

Step 1 - Formulate $H_0$ and $H_a$:

$H_0: \mu=300$ — on average, a book weighs 300 g

$H_a: \mu\ne300$ — on average, a book does not weigh 300 g; it can be more or less than this.

Step 2 - Calculate the sample statistics

In Excel, calculate the mean of your sample and the standard deviation (STDEV.S). We’re using a significance level of 0.05. Our sample size is 50. The hypothesized mean (the mean we use in \(H_0\)) is 300.

$\bar x=294.609$ — sample mean

$s=10.98013331$ — sample standard deviation

Step 3 - Calculate the test statistic

The test statistic is equal to:

$$t=\frac{\bar x-\mu_0}{s/\sqrt{n}}$$

where \(\mu_0\) is the hypothesized mean and \(n\) is the sample size.

We can calculate the above in Excel. The result is -3.47. Now we want to see, using the t-distribution, what probability is associated with this test statistic. From that, we get our p-value.
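As a minimal sketch of that calculation, assuming (hypothetically) that the sample mean sits in D2, the sample standard deviation in D3, the sample size in D4 and the hypothesized mean in D5, the statistic can be computed as:

=(D2-D5)/(D3/SQRT(D4))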

Step 4 - Calculate p-value

To calculate p-value we use this formula:

=T.DIST(D7,D4-1,TRUE)

D7 is our test statistic, and D4-1 is the “degrees of freedom” (sample size minus one). The formula returns 0.0005448. Because \(H_a\) uses \(\ne\), this is a two-tailed test, so we need to multiply that value by two and check whether the resulting p-value is less than our level of significance: 0.05.
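Alternatively, Excel’s built-in two-tailed helper returns the doubled value in one step (same cell layout, with ABS because the function expects a non-negative statistic):

=T.DIST.2T(ABS(D7),D4-1)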

$p-value < 0.05$ — therefore, we reject \(H_0\) in favour of \(H_a\): the sample gives strong evidence that the claim that the books weigh approximately 300 g on average is incorrect.

$\square$

5. Hypothesis Testing for two samples

In this chapter, we’ll learn how to check claims about two samples, for example, whether a gender pay gap is present.

Exercise 5

In the “gap” sheet from little_prince_words.xlsx we collected a sample of salaries for females and males from one company. We will perform hypothesis testing for two means to figure out if females earn less than males there.

Solution:

Step 1 - Formulate $H_0$ and $H_a$:

$H_0: \mu_{female}-\mu_{male}\ge0$ — on average, female salaries are greater than or equal to male salaries

$H_a: \mu_{female}-\mu_{male}<0$ — on average, females earn less than males

The $H_0$ must always contain the equality part (“=”), and $H_0$ and $H_a$ must be mutually exclusive and together cover all possibilities.

Step 2 - Calculate p-value

Open the “gap” sheet.

  1. Click “Data” / “Data Analysis” and choose “t-Test: Two Sample Assuming Unequal Variances”.
  2. Set “variable 1” to all values in the female column
  3. Set “variable 2” to all values in the male column
  4. We claimed that the difference between female and male salaries is greater than or equal to zero, so the Hypothesized Mean Difference must be set to 0.
  5. Alpha is 0.05 (confidence level 0.95).
  6. Set the Output Range to D1.

As we used \(\ge\) in our \(H_0\) (and a strict “<” in \(H_a\)), this is a one-tailed test, so we read the one-tail p-value from the output.
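For reference, the statistic behind “t-Test: Two Sample Assuming Unequal Variances” (Welch’s t-test) is:

$$t=\frac{\bar x_{female}-\bar x_{male}}{\sqrt{\dfrac{s_{female}^{2}}{n_{female}}+\dfrac{s_{male}^{2}}{n_{male}}}}$$

where the \(s^2\) are the sample variances and the \(n\) the sample sizes.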

The \({p-value}=0.0089\). Since it’s smaller than 0.05, we reject \(H_0\) in favour of \(H_a\). Based on our sample, we have evidence that males earn more than females at this company, so the gender gap exists there.

$\square$

6. Linear Regression

Exercise 6

Let’s assume we have a data set like the one in the “regression” sheet of our spreadsheet. The “day” column is the day of the year on which we performed some measurements. The “mood” column is how good we felt on that day, from 1 to 5. The “score” column shows our score on an exam.

Let’s try to build a model that will be able to predict what score we get based on the day of the year and our mood.
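The kind of model we’re after has the general form below; the job of the regression tool is to estimate the coefficients:

$$\widehat{score}=b_0+b_1\cdot day+b_2\cdot mood$$

where \(b_0\) is the intercept and \(b_1\), \(b_2\) are the coefficients of the two predictors.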

Solution:

Multiple linear regression example:

  1. Open the sheet and click on “Data” / “Data Analysis.”
  2. Select Regression. As the “Input Y Range,” select \$C\$1:\$C\$58 (this is the “score” column, our dependent variable).
  3. For the “Input X Range,” choose \$A\$1:\$B\$58 (two columns: day and mood).
  4. Make sure “Labels” is checked. Choose a 95% confidence level.
  5. Set the Output Range to $E$1. Click OK.

Interpreting the results

For a multiple regression model, we first check the “Adjusted R Square.” It tells us what share of the variation in the dependent variable the model explains, adjusted for the number of predictors. Here it’s 98%, which is pretty good.

The next step is to check whether the coefficients calculated for each variable are statistically significant. We check the p-value for each of them (we can skip the intercept). If any p-value is above the significance level, we remove that variable from the model and try to build another one without it.

In our case, all $p-values < 0.05$, so we use them all. Our final model looks like this:

$score=30.67*day+37.5*mood+11.05$

Where day is the day of the year (1 to 365) and mood is between 1 and 5. Let’s see how we would perform at the end of the year:

Score calculation for bad mood (mood=1):

$30.67*365+37.5*1+11.05=11243.1$

Score calculation for good mood (mood=5):

$30.67*365+37.5*5+11.05=11393.1$

$\square$

Resources