What does the F-distribution in ANOVA mean?
In a one-way ANOVA, the F-value is the ratio of the variability between groups to the variability within groups:

F = (SSB / df_between) / (SSW / df_within)

where,
SSB = sum of squares between groups
SSW = sum of squares within groups
df_between = degrees of freedom between groups (number of groups minus 1)
df_within = degrees of freedom within groups (total number of observations minus number of groups)
Also, based on the degrees of freedom between and within the groups, an F-critical value and a P-value are obtained.
If the calculated F-value is greater than the F-critical value, we infer that the treatment or strategy has a statistically significant effect on the outcome. The P-value is the probability of getting an F-value greater than or equal to the calculated F-value if the treatments had no effect.
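For example, SciPy can look up both quantities from the two degrees of freedom. This is a minimal sketch; the F-value and degrees of freedom below are illustrative only and not taken from any particular dataset.

import scipy.stats

f_calculated = 5.3             # illustrative F-value from some ANOVA
df_between, df_within = 2, 21  # illustrative degrees of freedom

# F-critical value: the point with 5% of the F-distribution above it
f_critical = scipy.stats.f.isf(0.05, df_between, df_within)

# P-value: probability of an F-value >= f_calculated when the treatments have no effect
p_value = scipy.stats.f.sf(f_calculated, df_between, df_within)

print(f'{f_critical=:.3f}')
print(f'{p_value=:.4f}')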
But it can be confusing what these F-values and the F-distribution actually are. Of course, you can find the equation of the F-distribution, which gives a probability distribution curve for various degrees of freedom. However, what are they really?
Let's understand what the F-distribution in an ANOVA is. In a population where there is no treatment, say we do the following steps:
1. Take 3 samples of size 10 each.
2. Calculate the F-value.
3. Note it down.
If we repeat the above 3 steps many times (say 50,000), we obtain a pool of F-values. When we plot a histogram of this pool of F-values, it will look like the F-distribution with degrees of freedom within samples = 27 and degrees of freedom between samples = 2.
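A compact sketch of that procedure looks like the following. The population parameters (mean 100, standard deviation 15) and the random seed are arbitrary choices for illustration; we will walk through a fuller example with concrete data next.

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # seeded only for reproducibility

fvals = []
for _ in range(50000):
    # three samples of size 10 from the same population, i.e. no treatment effect
    samples = [rng.normal(loc=100, scale=15, size=10) for _ in range(3)]
    fvals.append(scipy.stats.f_oneway(*samples).statistic)

# a histogram of fvals approximates the F-distribution with dfn = 2 and dfd = 27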
Let's see this with an example. We will use Python to calculate and visualize our results.
First we import all the packages necessary for the analysis.
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
import scipy
Let's take an example of the effect of two drugs on the recovery time (in hours) of patients from an illness. We take samples from three treatment groups:
- control, that is, the patients who are not given any medicine.
- drugA, that is, the patients treated with medicine A.
- drugB, that is, the patients treated with medicine B.
From each group, we take 8 observations.
The data
Following is example data from 8 individuals in each treatment group.
control = np.array([100,90,103,96,110,125,121,114])
drugA = np.array([91, 105, 84, 83, 92, 100, 88, 77])
drugB = np.array([108, 107, 92, 105, 83, 100, 80, 97])
df = pd.DataFrame()
df['control'] = control
df['drugA'] = drugA
df['drugB'] = drugB
df
 | control | drugA | drugB
---|---|---|---
0 | 100 | 91 | 108
1 | 90 | 105 | 107
2 | 103 | 84 | 92
3 | 96 | 83 | 105
4 | 110 | 92 | 83
5 | 125 | 100 | 100
6 | 121 | 88 | 80
7 | 114 | 77 | 97
Each column of the data stands for a treatment group. The values are equal to the number of hours the patients took to recover from the illness.
Now, to see if the drugs have any effect, we perform an ANOVA. A one-way ANOVA is applicable here because we are testing the effect of only one factor, i.e. the medicine. There may be two types of medicine plus the control, but the factor is just one: the medicine.
We will use the package scipy to quickly perform the ANOVA for us.
ANOVA with SciPy
result_ori = scipy.stats.f_oneway(control, drugA, drugB, axis=0)
fvalue_ori = result_ori.statistic
pvalue_ori = result_ori.pvalue
print(f'result of ANOVA: \n{result_ori}\n')
print(f'{fvalue_ori=}')
print(f'{pvalue_ori=}')
result of ANOVA:
F_onewayResult(statistic=5.302226772459687, pvalue=0.01367557501401989)
fvalue_ori=5.302226772459687
pvalue_ori=0.01367557501401989
The result has two values:
- statistic, that is the calculated F value.
- pvalue, that is the probability of getting an F-value greater than or equal to the calculated F-value even if the drugs had no effect.
In the above case the P-value is about 0.014, which is less than 0.05. This means there is less than a 5% chance of getting an F-value this large or larger if the drugs had no effect on the recovery time.
Typically, we have a null hypothesis stating that the drugs have no effect, and an alternate hypothesis stating that the drugs do have an effect. If the P-value is less than 0.05, we reject the null hypothesis and accept the alternate hypothesis, which is what we have done in the above example.
There is disagreement among statisticians and researchers in various fields about whether the arbitrary threshold of 0.05 is appropriate for deciding the significance of a treatment, but we will not go into that in this article.
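Written out as code, the decision rule applied above is very short; this is just a sketch using the pvalue_ori computed earlier and the conventional 0.05 threshold.

alpha = 0.05  # conventional significance level, as discussed above
if pvalue_ori < alpha:
    print('Reject the null hypothesis: the drugs appear to affect recovery time.')
else:
    print('Fail to reject the null hypothesis: no significant effect detected.')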
Now, let's see what it all really means in our example.
First we will get the mean and standard deviation of the control group.
mean_control = control.mean()
std_control = control.std()
print(f'mean control group = {mean_control}')
print(f'standard deviation control group = {std_control}')
mean control group = 107.375
standard deviation control group = 11.46666363856549
In a population of patients where there is no medicine, or no effect of a medicine, all samples come from the same distribution, i.e. they share the same mean and standard deviation.
We will generate several datasets; each dataset will have three groups, and each group will have recovery time values for 8 individuals. All of these groups will be drawn from the same distribution, using the mean and standard deviation of the control group.
Basically we are generating several datasets similar to our above dataset. Following is how we will generate one such dataset.
group1 = np.random.normal(loc=mean_control, scale=std_control, size=8)
group2 = np.random.normal(loc=mean_control, scale=std_control, size=8)
group3 = np.random.normal(loc=mean_control, scale=std_control, size=8)
# loc = mean
# scale = standard deviation
Then we perform the ANOVA to obtain the F-value as follows:
result = scipy.stats.f_oneway(group1, group2, group3, axis=0)
fvalue = result.statistic
fvalue
Now we will repeat the above process fifty thousand times to obtain fifty thousand F-values.
fvalues = np.array([])
for i in range(50000):
    # generate samples
    group1 = np.random.normal(loc=mean_control, scale=std_control, size=8)
    group2 = np.random.normal(loc=mean_control, scale=std_control, size=8)
    group3 = np.random.normal(loc=mean_control, scale=std_control, size=8)
    # perform ANOVA and get fvalue
    result = scipy.stats.f_oneway(group1, group2, group3, axis=0)
    fvalue = result.statistic
    # append fvalue to the collection of fvalues
    fvalues = np.append(fvalues, fvalue)
print(f'Number of f values in collection = {fvalues.size}')
We now have a collection of 50000 F values from a population where there is no medicine.
The above data has degrees of freedom equal to 2 between groups and 21 within groups. We will plot the F-values as a histogram and compare it with F-values drawn from the standard F-distribution with the same degrees of freedom, generated with the SciPy package.
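As a quick check on where those two numbers come from, for k groups with n observations each:

k = 3                  # number of groups
n = 8                  # observations per group
df_between = k - 1     # = 2
df_within = k * n - k  # = 21
print(df_between, df_within)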
# generate 50 thousand F-values using SciPy
dfn, dfd = 2, 21
rv = scipy.stats.f.rvs(dfn, dfd, size=50000)
We will now plot the F-values obtained from data and from SciPy as two histograms stacked one above the other.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(10,5))
ax1.hist(fvalues, bins=1000)
ax1.set_title('A. F values of Generated data')
ax2.hist(rv, bins=1000)
ax2.set_title('B. F values of standard F distribution')
plt.xlim(0,8)
plt.show()
Note that the two plots look very similar to each other, but are not exactly the same. The histogram divides the whole range of F-values into 1000 bins, and each F-value is put into the bin in which it lies. For example, an F-value of 2.3 will lie in the bin that contains all the values between, say, 2.2 and 2.4. The height of each bar in the histogram represents the number of F-values that fall in that particular bin. When the number of F-values is large enough, the histogram closely resembles the probability distribution of F-values (the F-distribution).
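To make the binning concrete, here is a small check using the fvalues array from above: np.histogram returns the bin edges and the count of F-values falling in each bin, which is exactly what the bar heights show.

counts, edges = np.histogram(fvalues, bins=1000)
print(edges[:4])   # the first few bin edges
print(counts[:3])  # how many F-values fell into each of those first bins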
In the above plots we can mark the following:
- Plot A:
  - the 95th percentile of the simulated F-values
  - the calculated F-value from our data
- Plot B:
  - the critical value, i.e. the value for which the probability of getting an F-value equal to or greater than it is 0.05 (5%) when the null hypothesis is true.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(10,5))
ax1.hist(fvalues, bins=1000)
ax1.set_title('A. F values of Generated data')
ax2.hist(rv, bins=1000)
ax2.set_title('B. F values of standard F distribution')
plt.xlim(0,8)
# black vertical line at the 95th percentile of the simulated F-values
ax1.vlines(np.percentile(fvalues, 95), 0, 1000,
           color='k',
           label='95th percentile',
           linestyle='--')
# red vertical line at the calculated F-value
ax1.vlines(result_ori.statistic, 0, 1000,
           color='red',
           label='calculated F-value',
           linestyle='--')
ax1.legend()
# green vertical line at the critical F-value
ax2.vlines(scipy.stats.f.isf(0.05, 2, 21), 0, 1000,
           color='green',
           label='critical F-value',
           linestyle='--')
ax2.legend()
plt.show()
The A part of the graph shows the distribution of F-values computed from samples whose mean and standard deviation are those we would obtain when there is no drug, or no effect of the drug.
The B part of the graph is the standard distribution of F-values with degrees of freedom 2 and 21.
Note that the 95th percentile of the F-values we simulated and the F-critical value obtained from SciPy are very close to each other.
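We can verify this numerically with the objects already defined above; a quick sketch:

crit = scipy.stats.f.isf(0.05, 2, 21)    # theoretical critical value at the 5% level
emp_crit = np.percentile(fvalues, 95)    # 95th percentile of our simulated F-values
print(f'{crit=:.3f}')
print(f'{emp_crit=:.3f}')
print(f'{fvalue_ori=:.3f}')
print('calculated F-value exceeds the critical value:', fvalue_ori > crit)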
So, if after doing the ANOVA we obtain an F-value greater than the critical value, i.e. an F-value whose P-value is less than 0.05, we infer that the drug has some effect on the recovery time of the patients (in general, that the treatment has an effect on the observations). This basically means that there is less than a 5% probability of getting such an F-value from samples having the same mean and standard deviation as the control.
Another way of saying this is that, based on the samples we observed, the probability is less than 5% that our treatment was not useful, so we conclude that the treatment was useful.
The final thing to learn from here is that the F-distribution is the distribution of F-values we would get if the mean of the control and the means of the treatment samples were statistically equal.
The shape of the F-distribution changes with the degrees of freedom. You can read more about this in textbooks or other posts; you can also plot the distributions yourself as below.
dfn1, dfd1 = 3, 29
rv1 = scipy.stats.f.rvs(dfn1, dfd1, size=50000)
plt.hist(rv1, bins=1000)
plt.xlim(0,8)
plt.show()

dfn1, dfd1 = 15, 29
rv1 = scipy.stats.f.rvs(dfn1, dfd1, size=50000)
plt.hist(rv1, bins=1000)
plt.xlim(0,8)
plt.show()
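Instead of drawing random samples, the theoretical density curve can also be plotted directly with scipy.stats.f.pdf. This is a small sketch; the degrees of freedom below are chosen only for illustration.

x = np.linspace(0, 8, 500)
for dfn_i, dfd_i in [(2, 21), (3, 29), (15, 29)]:
    plt.plot(x, scipy.stats.f.pdf(x, dfn_i, dfd_i), label=f'dfn={dfn_i}, dfd={dfd_i}')
plt.legend()
plt.xlim(0, 8)
plt.show()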