Group Data By Time

On March 13, 2016, version 0.18.0 of pandas was released, with significant changes in how the resampling function operates. This tutorial follows v0.18.0 and will not work for previous versions of pandas.

First let's load the modules we care about

Want to learn more? I recommend these Python books: Python for Data Analysis, Python Data Science Handbook, and Introduction to Machine Learning with Python.

Preliminaries

# Import required packages
import pandas as pd
import datetime
import numpy as np

Next, let's create some sample data that we can group by time as an sample. In this example I am creating a dataframe with two columns with 365 rows. One column is a date, the second column is a numeric value.

Create Data

# Create a datetime variable for today
base = datetime.datetime.today()
# Create a list variable that creates 365 days of rows of datetime values
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
# Create a list variable of 365 numeric values
score_list = list(np.random.randint(low=1, high=1000, size=365))
# Create an empty dataframe
df = pd.DataFrame()

# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])
# Set the datetime column as the index
df.index = df['datetime']
# Create a column from the numeric score variable
df['score'] = score_list
# Let's take a took at the data
df.head()
datetime score
datetime
2016-03-11 21:27:36.859714 2016-03-11 21:27:36.859714 459
2016-03-10 21:27:36.859714 2016-03-10 21:27:36.859714 153
2016-03-09 21:27:36.859714 2016-03-09 21:27:36.859714 458
2016-03-08 21:27:36.859714 2016-03-08 21:27:36.859714 310
2016-03-07 21:27:36.859714 2016-03-07 21:27:36.859714 376

Group Data By Date

In pandas, the most common way to group by time is to use the .resample() function. In v0.18.0 this function is two-stage. This means that df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.)

# Group the data by month, and take the mean for each group (i.e. each month)
df.resample('M').mean()
score
datetime
2015-03-31 509.421053
2015-04-30 543.100000
2015-05-31 520.709677
2015-06-30 473.100000
2015-07-31 521.677419
2015-08-31 410.580645
2015-09-30 491.933333
2015-10-31 447.322581
2015-11-30 488.166667
2015-12-31 473.193548
2016-01-31 444.129032
2016-02-29 555.965517
2016-03-31 445.545455
# Group the data by month, and take the sum for each group (i.e. each month)
df.resample('M').sum()
score
datetime
2015-03-31 9679
2015-04-30 16293
2015-05-31 16142
2015-06-30 14193
2015-07-31 16172
2015-08-31 12728
2015-09-30 14758
2015-10-31 13867
2015-11-30 14645
2015-12-31 14669
2016-01-31 13768
2016-02-29 16123
2016-03-31 4901

Grouping Options

There are many options for grouping. You can learn more about them in Pandas's timeseries docs, however, I have also listed them below for your convenience.

Value Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter endfrequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T minutely frequency
S secondly frequency
L milliseconds
U microseconds
N nanoseconds