Trimmed Mean

Trimmed means are averages computed after discarding (i.e. trimming off) the most extreme values. The goal is to make the mean more robust to outliers by leaving those values out of the calculation.

SciPy offers convenient functions for calculating trimmed means.

Preliminaries

# Import libraries
import pandas as pd
from scipy import stats

Create DataFrame

# Create dataframe with two extreme values
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Bob', 'Jack', 'Jill', 'Kelly', 'Mark', 'Kao', 'Dillon'], 
        'score': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100]
       }
df = pd.DataFrame(data)
df
name score
0 Jason 1
1 Molly 2
2 Tina 3
3 Jake 4
4 Amy 5
5 Bob 6
6 Jack 7
7 Jill 8
8 Kelly 9
9 Mark 10
10 Kao 100
11 Dillon 100

Calculate Non-Trimmed Mean

# Calculate non-trimmed mean
df['score'].mean()
21.25

Calculate Mean After Trimming Off Highest And Lowest

# Trim off the lowest 20% and the highest 20% of scores, then take the mean
stats.trim_mean(df['score'], proportiontocut=0.2)
6.5

We can use trimboth to see which values are used to calculate the trimmed mean. Note that the returned values are not guaranteed to be in sorted order:

# Trim off the lowest 20% and the highest 20% of scores and view the remaining values
stats.trimboth(df['score'], proportiontocut=0.2)
array([ 3,  5,  4,  6,  7,  8,  9, 10])
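
To make the role of proportiontocut concrete, here is a minimal pure-Python sketch of the same calculation. The helper name manual_trim_mean is hypothetical and not part of SciPy. It assumes the number of values cut from each end is the proportion times the number of values, rounded down, which is consistent with the eight values kept above (2 of 12 scores removed from each end).

```python
def manual_trim_mean(values, proportiontocut):
    """Illustrative helper (not a SciPy function): mean after trimming both tails."""
    ordered = sorted(values)
    # Number of values to drop from EACH end, rounded down (assumption stated above)
    n_cut = int(proportiontocut * len(ordered))
    kept = ordered[n_cut:len(ordered) - n_cut]
    if not kept:
        raise ValueError("proportiontocut is too large; no values remain")
    return sum(kept) / len(kept)


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100]

# 20% of 12 is 2.4, so 2 values are dropped from each end, leaving 3 through 10
trimmed = manual_trim_mean(scores, 0.2)
print(trimmed)
```

With a proportion of 0, nothing is trimmed and the helper returns the ordinary mean.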

Calculate Mean After Trimming Only Highest Extremes

The right tail refers to the highest values in the array, and the left tail refers to the lowest values.

# Trim off the highest 20% of values and view trimmed mean
stats.trim1(df['score'], proportiontocut=0.2, tail='right').mean()
5.5

# Trim off the highest 20% of values and view non-trimmed values
stats.trim1(df['score'], proportiontocut=0.2, tail='right')
array([ 1,  3,  2,  4,  5,  6,  7,  9,  8, 10])
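
Trimming only the lowest values works the same way with tail='left'. The following is a small sketch under the same data. It rebuilds the scores as a plain list so the block is self-contained, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same scores as in the dataframe above
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 100]

# Trim off the lowest 20% of values (2 of the 12 scores) and keep the rest
left_trimmed = stats.trim1(scores, proportiontocut=0.2, tail='left')

# The kept values are not guaranteed to be ordered, so sort them before viewing
kept_sorted = np.sort(left_trimmed)
left_trimmed_mean = left_trimmed.mean()

print(kept_sorted)
print(left_trimmed_mean)
```

Because both extreme scores sit in the right tail, trimming the left tail leaves them in, so this mean ends up higher than the non-trimmed mean rather than lower.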