Trimmed Mean
Trimmed means are averaging techniques that do not count (i.e. trim off) extreme values. The goal is to make mean calculations more robust to extreme values by not considering those values when calculating the mean.
SciPy offers convenient functions for calculating trimmed means.
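To make the definition concrete, here is a minimal sketch of a trimmed mean written with NumPy alone. The helper name `manual_trim_mean` is hypothetical (it is not part of SciPy), and it assumes the count cut from each end is the proportion times the number of values, rounded down.

```python
import numpy as np


def manual_trim_mean(values, proportiontocut):
    """Hypothetical helper: mean after dropping a fraction of values from each end."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # Number of values dropped from EACH end (assumption: rounded down)
    n_cut = int(proportiontocut * ordered.size)
    if 2 * n_cut >= ordered.size:
        raise ValueError("proportiontocut would remove every value")
    # Average only the values left in the middle
    return ordered[n_cut:ordered.size - n_cut].mean()


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100]
print(manual_trim_mean(scores, 0.2))  # 6.5
```

With 12 scores and a proportion of 0.2, two values are dropped from each end, so the two scores of 100 no longer influence the result.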
Preliminaries
# Import libraries
import pandas as pd
from scipy import stats
Create DataFrame
# Create dataframe with two extreme values
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Bob', 'Jack', 'Jill', 'Kelly', 'Mark', 'Kao', 'Dillon'],
        'score': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100]}
df = pd.DataFrame(data)
df
| | name | score |
|---|---|---|
| 0 | Jason | 1 |
| 1 | Molly | 2 |
| 2 | Tina | 3 |
| 3 | Jake | 4 |
| 4 | Amy | 5 |
| 5 | Bob | 6 |
| 6 | Jack | 7 |
| 7 | Jill | 8 |
| 8 | Kelly | 9 |
| 9 | Mark | 10 |
| 10 | Kao | 100 |
| 11 | Dillon | 100 |
Calculate Non-Trimmed Mean
# Calculate non-trimmed mean
df['score'].mean()
21.25
Calculate Mean After Trimming Off Highest And Lowest
# Trim 20% of the scores from each end (the 2 lowest and the 2 highest of 12)
stats.trim_mean(df['score'], proportiontocut=0.2)
6.5
We can use trimboth to see which values are used to calculate the trimmed mean:
# Trim 20% of the scores from each end and view the values that remain
stats.trimboth(df['score'], proportiontocut=0.2)
array([ 3, 5, 4, 6, 7, 8, 9, 10])
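The number of values removed is the proportion times the number of observations, rounded down, so nearby proportions can keep exactly the same values. The sketch below illustrates this on the same 12 scores; the name `kept_counts` is just illustrative, and the result is sorted because trimboth does not promise to return the remaining values in order.

```python
import numpy as np
from scipy import stats

scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100])

# Map each proportion to how many values survive trimming
kept_counts = {}
for proportion in (0.2, 0.24, 0.25):
    # Sort for display only; the trimmed output may be unordered
    kept = np.sort(stats.trimboth(scores, proportiontocut=proportion))
    kept_counts[proportion] = kept.size
    print(proportion, kept.size, kept)
```

Here 0.2 and 0.24 both cut two values per end (12 × 0.2 = 2.4 and 12 × 0.24 = 2.88 both round down to 2), while 0.25 cuts three.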
Calculate Mean After Trimming Only Highest Extremes
The right tail refers to the highest values in the array, and the left tail refers to the lowest values in the array.
# Trim off the highest 20% of values and view trimmed mean
stats.trim1(df['score'], proportiontocut=0.2, tail='right').mean()
5.5
# Trim off the highest 20% of values and view non-trimmed values
stats.trim1(df['score'], proportiontocut=0.2, tail='right')
array([ 1, 3, 2, 4, 5, 6, 7, 9, 8, 10])
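For comparison, trimming the left tail removes the lowest values instead. This is a small sketch on the same scores, rebuilt here as a NumPy array so the block runs on its own; the variable name `left_trimmed` is illustrative.

```python
import numpy as np
from scipy import stats

scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100])

# Trim off the lowest 20% of values (2 of 12) and keep the rest
left_trimmed = stats.trim1(scores, proportiontocut=0.2, tail='left')

# Sort for display only; the trimmed output may be unordered
print(np.sort(left_trimmed))
print(left_trimmed.mean())
```

Because the two scores of 100 remain, the left-trimmed mean (about 25.2) is even higher than the non-trimmed mean, which shows why the choice of tail matters.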