The 80/20 Pandas Tutorial

Tags: python, pandas, tutorial

An opinionated pandas tutorial on my preferred methods to accomplish the most essential data transformation tasks in a way that will make veteran R and tidyverse users smile.

Author: Matt Bowers
Published: November 25, 2020
Modified: November 8, 2023

Ahh, pandas. In addition to being everyone’s favorite mostly vegetarian bear from south central China, it’s also the python library for working with tabular data, a.k.a. dataframes. When you dive into pandas, you’ll quickly find out that there is a lot going on; indeed there are hundreds of methods for operating on dataframes. But luckily for us, as with many areas of life, there is a Pareto Principle, or 80/20 rule, that will help us focus on the small set of methods that collectively solve the majority of our data transformation needs.

If you’re like me, then pandas is not your first data-handling tool; maybe you’ve been using SQL or R with data.table or dplyr. If so, that’s great because you already have a sense for the key operations we need when working with tabular data. In their book, R for Data Science, Garrett Grolemund and Hadley Wickham describe five essential operations for manipulating dataframes. I’ve found that these cover the majority of my data transformation tasks to prepare data for analysis, visualization, and modeling.

  1. filtering rows based on data values
  2. sorting rows based on data values
  3. selecting columns by name
  4. adding new columns based on the existing columns
  5. creating grouped summaries of the dataset

I would add that we also need a way to build up more complex transformations by chaining these fundamental operations together sequentially.

Before we dive in, here’s the TLDR on the pandas methods that I prefer for accomplishing these tasks, along with their equivalents from SQL and dplyr in R.

description                                   | pandas              | SQL      | dplyr
filter rows based on data values              | query()             | WHERE    | filter()
sort rows based on data values                | sort_values()       | ORDER BY | arrange()
select columns by name                        | filter()            | SELECT   | select()
add new columns based on the existing columns | assign()            | AS       | mutate()
create grouped summaries of the dataset       | groupby() + apply() | GROUP BY | group_by() + summarise()
chain operations together                     | .                   |          | %>%

Imports and Data

We’ll use the nycflights13 dataset, which contains data on the roughly 300k flights that departed from New York City in 2013; the CSV we load below is a sample of about 33k of those flights.

import pandas as pd
import numpy as np

url = 'https://www.openintro.org/book/statdata/nycflights.csv'
storage_options = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
df = pd.read_csv(url, storage_options=storage_options)

Typically if I’m only going to be using a single dataframe, I’ll use the name “df”. This is a pretty strong convention in pandas, e.g. you can see the name “df” being used all over the pandas documentation; therefore it makes your code easier for others to understand. If there will be more than one dataframe, I suggest prepending a meaningful name to the “df”, e.g. flights_df.

Let’s have a look at the dataframe structure using the info() method.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32735 entries, 0 to 32734
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       32735 non-null  int64 
 1   month      32735 non-null  int64 
 2   day        32735 non-null  int64 
 3   dep_time   32735 non-null  int64 
 4   dep_delay  32735 non-null  int64 
 5   arr_time   32735 non-null  int64 
 6   arr_delay  32735 non-null  int64 
 7   carrier    32735 non-null  object
 8   tailnum    32735 non-null  object
 9   flight     32735 non-null  int64 
 10  origin     32735 non-null  object
 11  dest       32735 non-null  object
 12  air_time   32735 non-null  int64 
 13  distance   32735 non-null  int64 
 14  hour       32735 non-null  int64 
 15  minute     32735 non-null  int64 
dtypes: int64(12), object(4)
memory usage: 4.0+ MB
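
For a quick peek at a few actual rows to go along with that schema, head() does the trick.

# peek at the first few rows
df.head()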

Select rows based on their values with query()

query() lets you retain a subset of rows based on the values of the data; it’s like dplyr::filter() in R or WHERE in SQL. Its argument is a string specifying the condition to be met for rows to be included in the result. You specify the condition as an expression involving the column names and comparison operators like <, >, <=, >=, == (equal), and != (not equal). You can specify compound expressions using and and or, and you can even check if the column value matches any items in a list.

# compare one column to a value
df.query('month == 6')

# compare two column values
df.query('arr_delay > dep_delay')

# using arithmetic
df.query('arr_delay > 0.5 * air_time')

# using "and"
df.query('month == 6 and day == 1')

# using "or"
df.query('origin == "JFK" or dest == "JFK"')

# column value matching any item in a list
df.query('carrier in ["AA", "UA"]')
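
query() can also reference Python variables from the surrounding scope by prefixing them with @, which is handy when the value you’re filtering on lives in a variable. Here max_delay is just an illustrative variable, not a column in the dataset.

# reference a variable from the surrounding scope with @
max_delay = 60
df.query('dep_delay > @max_delay')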

You may have noticed that it seems to be much more popular to filter pandas data frames using boolean indexing. Indeed when I ask my favorite search engine how to filter a pandas dataframe on its values, I find this tutorial, this blog post, various questions on Stack Overflow, and even the pandas documentation, all espousing boolean indexing. Here’s what it looks like.

# canonical boolean indexing
df[(df['carrier'] == "AA") & (df['origin'] == "JFK")]

# the equivalent use of query()
df.query('carrier == "AA" and origin == "JFK"')

There are a few reasons I prefer query() over boolean indexing.

  1. query() does not require me to type the dataframe name again, whereas boolean indexing requires me to type it every time I wish to refer to a column.
  2. query() makes the code easier to read and understand, especially when expressions get complex.
  3. query() can be more computationally efficient than boolean indexing, especially on large dataframes, since it can evaluate the expression with the numexpr engine.
  4. query() can safely be used in dot chains, which we’ll see very soon.

Select columns by name with filter()

filter() lets you pick out a specific set of columns by name; it’s analogous to dplyr::select() in R or SELECT in SQL. You can either provide exactly the column names you want, or you can grab all columns whose names contain a given substring or which match a given regular expression. This isn’t a big deal when your dataframe has only a few columns, but is particularly useful when you have a dataframe with tens or hundreds of columns.

# select a list of columns
df.filter(['origin', 'dest'])

# select columns containing a particular substring
df.filter(like='time')

# select columns matching a regular expression
df.filter(regex='e$')

Sort rows with sort_values()

sort_values() changes the order of the rows based on the data values; it’s like dplyr::arrange() in R or ORDER BY in SQL. You can specify one or more columns on which to sort, where their order denotes the sorting priority. You can also specify whether to sort in ascending or descending order.

# sort by a single column
df.sort_values('air_time')

# sort by a single column in descending order
df.sort_values('air_time', ascending=False)

# sort by carrier, then within carrier, sort by descending distance
df.sort_values(['carrier', 'distance'], ascending=[True, False])
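
A handy follow-on pattern is to chain sort_values() with head() to grab the top N rows by some column.

# the five longest flights by air time
df.sort_values('air_time', ascending=False).head(5)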

Add new columns with assign()

assign() adds new columns which can be functions of the existing columns; it’s like dplyr::mutate() from R.

# add a new column based on other columns
df.assign(speed = lambda x: x.distance / x.air_time)

# another new column based on existing columns
df.assign(gain = lambda x: x.dep_delay - x.arr_delay)

If you’re like me, this way of using assign() might seem a little strange at first. Let’s break it down. In the call to assign(), the keyword argument speed tells pandas the name of our new column. The business to the right of the = is an inline lambda function that takes the dataframe we passed to assign() and returns the column we want to add.

I like using x as the lambda argument because it’s easy to type and it evokes tabular data (think design matrix), which reminds me that it refers to the entire dataframe. We can then access the other columns in our dataframe using the dot, like x.other_column.
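
A nice bonus of this style is that you can create several columns in a single assign() call, and later columns can even refer to ones defined earlier in the same call, since pandas evaluates the keyword arguments in order.

# create several columns at once; gain_per_hour uses the gain column defined just above it
df.assign(
    speed = lambda x: x.distance / x.air_time,
    gain = lambda x: x.dep_delay - x.arr_delay,
    gain_per_hour = lambda x: x.gain / (x.air_time / 60),
)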

It’s true that you can skip the whole lambda business and refer directly to the dataframe to which you are assigning inside the call to assign(). That might look like this.

# refer to the dataframe by name inside assign()
df.assign(speed = df.distance / df.air_time)

I prefer using a lambda for the following reasons.

  1. Using the lambda saves you from typing the dataframe name every time you want to refer to a column.
  2. The lambda makes your code more portable. Since you refer to the dataframe as a generic x, you can reuse this same assignment code on a dataframe with a different name, as sketched just after this list.
  3. Most importantly, the lambda will allow you to harness the power of dot chaining.
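
To illustrate the portability point, here’s a small sketch; jfk_df is just a hypothetical subset introduced for this example, and the assignment code is identical to the one used on df above.

# the same lambda-based assignment works on a dataframe with a different name
jfk_df = df.query('origin == "JFK"')
jfk_df = jfk_df.assign(speed = lambda x: x.distance / x.air_time)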

Chain transformations together with the dot chain

One of the awesome things about pandas is that the object.method() paradigm lets us easily build up complex dataframe transformations from a sequence of method calls. In R, this is effectively accomplished by the pipe %>% operator. For example, suppose we want to look at high-speed flights from JFK to Honolulu, which would require us to query for JFK to Honolulu flights, assign a speed column, and maybe sort on that new speed column.

We can say:

# neatly chain method calls together
(
    df
    .query('origin == "JFK"')
    .query('dest == "HNL"')
    .assign(speed = lambda x: x.distance / x.air_time)
    .sort_values(by='speed', ascending=False)
    .query('speed > 8.0')
)

We compose the dot chain by wrapping the entire expression in parentheses and indenting each line within. The first line is the name of the dataframe on which we are operating. Each subsequent line has a single method call.

There are a few great things about writing the code this way:

  1. Readability - It’s easy to scan down the left margin of the code to see what’s happening. The first line gives us our noun (the dataframe) and each subsequent line starts with a verb. You could read this as “take df, then query the rows where origin is JFK, then query for rows where destination is HNL, then assign a new column called speed, then sort the dataframe by speed, then query only for the rows where speed is greater than 8.0.”
  2. Flexibility - It’s easy to comment out individual lines and re-run the cell. It’s also easy to reorder operations, since only one thing happens on each line.
  3. Neatness - We have not polluted our workspace with any intermediate variables, nor have we wasted any mental energy thinking of names for any temporary variables.

By default, dot chains do not modify the original dataframe; they just output a temporary result that we can inspect directly in the output. If you want to store the result, or pass it along to another function (e.g. for plotting), you can simply assign the entire dot chain to a variable.

# store the output of the dot chain in a new dataframe
high_speed_flights_df = (
    df
    .assign(speed = lambda x: x.distance / x.air_time)
    .query('speed > 8.0')
)
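
And rather than storing the result, you can just as easily end the chain by handing it off to another call, for example a quick scatter plot (this sketch assumes matplotlib is installed, since pandas plotting uses it under the hood).

# end a chain with a plot instead of storing the result
(
    df
    .assign(speed = lambda x: x.distance / x.air_time)
    .plot.scatter(x='distance', y='speed', alpha=0.2)
)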

Collapsing rows into grouped summaries with groupby()

groupby() combined with apply() gives us flexibility and control over our grouped summaries; it’s like dplyr::group_by() and dplyr::summarise() in R. This is the primary pattern I use for SQL-style groupby operations in pandas. Specifically it unlocks the following essential functionality you’re used to having in SQL.

  1. specify the names of the aggregation columns we create
  2. specify which aggregation function to use on which columns
  3. compose more complex aggregations such as the proportion of rows meeting some condition
  4. aggregate over arbitrary functions of multiple columns

Let’s check out the departure delay stats for each carrier.

# grouped summary with groupby and apply
(
    df
    .groupby(['carrier'])
    .apply(lambda d: pd.Series({
        'n_flights': len(d),
        'med_delay': d.dep_delay.median(),
        'avg_delay': d.dep_delay.mean(),
    }))
    .head()
)
         n_flights  med_delay  avg_delay
carrier
9E          1696.0       -1.0  17.285967
AA          3188.0       -2.0   9.142409
AS            66.0       -4.5   5.181818
B6          5376.0       -1.0  13.137091
DL          4751.0       -2.0   8.529573

While you might be used to apply() acting over the rows or columns of a dataframe, here we’re calling apply on a grouped dataframe object, so it’s acting over the groups. According to the pandas documentation:

The function passed to apply must take a dataframe as its first argument and return a dataframe, a series or a scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.

We need to supply apply() with a function that takes each chunk of the grouped dataframe and returns (in our case) a series object with one element for each new aggregation column. Notice that I use a lambda to specify the function we pass to apply(), and that I name its argument d, which reminds me that it’s a dataframe. My lambda returns a pandas series whose index entries specify the new aggregation column names, and whose values constitute the results of the aggregations for each group. Pandas will then stitch everything back together into a lovely dataframe.

Notice how nice the code looks when we use this pattern. Each aggregation is specified on its own line, which makes it easy to see what aggregation columns we’re creating and allows us to comment, uncomment, and reorder the aggregations without breaking anything.
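
And because the result of the chain is itself just a dataframe, you can keep chaining; for example, here’s a smaller summary sorted to surface the carriers with the highest average departure delay.

# sort the grouped summary to see the highest average delays first
(
    df
    .groupby(['carrier'])
    .apply(lambda d: pd.Series({
        'n_flights': len(d),
        'avg_delay': d.dep_delay.mean(),
    }))
    .sort_values('avg_delay', ascending=False)
    .head()
)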

Here are some more complex aggregations to illustrate some useful patterns.

# more complex grouped summary
(
    df
    .groupby(['carrier'])
    .apply(lambda d: pd.Series({
        'avg_gain': np.mean(d.dep_delay - d.arr_delay), 
        'pct_delay_gt_30': np.mean(d.dep_delay > 30), 
        'pct_late_dep_early_arr': np.mean((d.dep_delay > 0) & (d.arr_delay < 0)), 
        'avg_arr_given_dep_delay_gt_0': d.query('dep_delay > 0').arr_delay.mean(),
        'cor_arr_delay_dep_delay': np.corrcoef(d.dep_delay, d.arr_delay)[0,1],
    }))
    .head()
)
          avg_gain  pct_delay_gt_30  pct_late_dep_early_arr  avg_arr_given_dep_delay_gt_0  cor_arr_delay_dep_delay
carrier
9E        9.247642         0.196934                0.110259                     39.086111                 0.932485
AA        7.743726         0.113237                0.105395                     30.087165                 0.891013
AS       16.515152         0.106061                0.121212                     28.058824                 0.864565
B6        3.411458         0.160528                0.084449                     37.306866                 0.914180
DL        7.622816         0.097874                0.100821                     30.078029                 0.899327

Here’s what’s happening.

  • np.mean(d.dep_delay - d.arr_delay) aggregates over the difference of two columns.
  • np.mean(d.dep_delay > 30) computes the proportion of rows where the delay is greater than 30 minutes. Generating a boolean series based on some condition and then using mean() to find the proportion comes up all the time.
  • np.mean((d.dep_delay > 0) & (d.arr_delay < 0)) shows that we can compute proportions where conditions on multiple columns are met.
  • d.query('dep_delay > 0').arr_delay.mean() computes the average arrival delay on flights where the departure was delayed. Here we first filter each grouped dataframe down to the subset of rows where departure delay is greater than zero using query(), and then we take the mean of the remaining arrival delays.
  • np.corrcoef(d.dep_delay, d.arr_delay)[0,1] computes the correlation coefficient between departure and arrival delays. Remember we can use pretty much any reduction operation to collapse values down to a scalar.

You might have noticed that the canonical pandas approach for grouped summaries is to use agg(). That works well for simpler summaries, e.g. taking the mean of every column or applying a single named aggregation function to each of a few columns. But because of the kind of data I work with these days, it’s much more common for me to use customized aggregations like those above, so the groupby() apply() idiom works best for me.
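
For comparison, here’s a rough sketch of the first grouped summary redone with agg() and named aggregation (available since pandas 0.25); it reads nicely when each aggregation is a simple function of a single column, but gets awkward for the multi-column, conditional aggregations shown above.

# the first grouped summary, redone with named aggregation
(
    df
    .groupby('carrier')
    .agg(
        n_flights=('dep_delay', 'size'),
        med_delay=('dep_delay', 'median'),
        avg_delay=('dep_delay', 'mean'),
    )
    .head()
)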

Wrapping Up

There you have it, how to pull off the five most essential data transformation tasks using pandas in a style reminiscent of my beloved dplyr. Remember that part of the beauty of pandas is that since there are so many ways to do most tasks, you can develop your own style based on the kind of data you work with, what you like about other tools, how you see others using the tools, and of course your own taste and preferences.

If you found this post helpful or if you have your own preferred style for accomplishing any of these key transformations with pandas, do let me know about it in the comments.
