I do a fair amount of data manipulation in pandas, so I find myself doing a lot of method chaining. In the past, I've struggled to find a way to keep that code concise without sacrificing readability.

What do I mean by that? Suppose we have census data on a group of people.

In [1]:

import pandas as pd

census = pd.read_csv("census.csv")
census.head()

Out[1]:

	age	name	state
0	31	Jessica	Wisconsin
1	35	Heather	New Jersey
2	33	Veronica	Nevada
3	68	Michael	Mississippi
4	67	Veronica	Utah

Note: these are not actual people. I used the following script to generate this data.

If we wanted to find the 5 states with the largest range of ages, we could do any of the following.

Option 1: Use intermediate variables

In [2]:

census_by_state = census.groupby("state")
age_range_by_state = census_by_state.apply(lambda group: group.age.max() - group.age.min())
age_range_by_state.sort_values(ascending=False).head()

Out[2]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

This code certainly works and is decently readable, but unless we're planning to reuse one of those intermediate variables, it is a bit cumbersome. The same thing can be accomplished without intermediate variables by chaining all the method calls together.

Option 2: All on one line

In [3]:

census.groupby("state").apply(lambda group: group.age.max() - group.age.min()).sort_values(ascending=False).head()

Out[3]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

This solution, however, is less readable because the line is too long, and we've lost the benefit we gained from naming our intermediate results.

Option 3: Line wrap on open or close parenthesis

In [4]:

census.groupby("state"
).apply(lambda group: group.age.max() - group.age.min()
).sort_values(ascending=False
).head(
)

Out[4]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

The lines are short enough to be readable, so this isn't too bad. Two things about this approach, however, aren't ideal.

groupby appears on the first line.
- To me, a line should clearly indicate what it does, with minimal distraction. The first line, census.groupby("state", jams two ideas (census and the grouping of it) into a single line. In a single-line statement, this wouldn't be a problem. But this is a multi-line statement in which every other line expresses a single idea, so a first line that expresses two ideas throws off the pacing of the statement.
Subsequent lines begin with a closing parenthesis.
- Starting with a closing parenthesis doesn't add anything to the readability of the current line. In fact, I would argue the opposite. The closing parenthesis doesn't tell the reader anything about what the current line is doing; it exists only because, syntactically, the call on the line before it must be closed.

Option 4: Line continuations

In [5]:

census\
.groupby("state")\
.apply(lambda group: group.age.max() - group.age.min())\
.sort_values(ascending=False)\
.head()

Out[5]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

I like this option much more than the previous one because it expresses a single idea per line and doesn't have the dangling close parenthesis at the beginning of each line. What I don't like about this approach, however, is that it relies on line continuations. They look awkward, are not recommended by PEP 8 (which prefers wrapping in parentheses), and suffer from the same problem that JSON does: every element except the last needs a "but wait, there's more" character (\ in this example, , in JSON), and the final element must not have one.
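To make that fragility concrete, here is a small sketch that compiles three variants of a backslash-continued statement without running them. The helper name compiles and the toy statements are made up for illustration, and the trailing-space case is an extra pitfall beyond the one described above.

```python
# Source snippets are kept as strings so we can ask Python whether they parse.
# "\\\n" below is a backslash immediately followed by a newline.

# Every line except the last ends in a backslash: this parses.
well_formed = "total = 1 + \\\n    2\n"

# The last line of the statement also ends in a backslash, so the next
# statement is swallowed into it and the result no longer parses.
extra_backslash = "total = 1 + \\\n    2 \\\nother = 3\n"

# A single space after the backslash is enough to break the continuation.
trailing_space = "total = 1 + \\ \n    2\n"


def compiles(source):
    """Return True if the source parses, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


for label, source in [("well formed", well_formed),
                      ("extra backslash", extra_backslash),
                      ("trailing space", trailing_space)]:
    print(label, "->", "ok" if compiles(source) else "SyntaxError")
```

Only the first variant parses, which is why adding, removing, or reordering lines in a backslash-continued chain takes more care than it should.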

Option 5: Group the statement in parentheses

In [6]:

(census
 .groupby("state")
 .apply(lambda group: group.age.max() - group.age.min())
 .sort_values(ascending=False)
 .head()
)

Out[6]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

This approach is used fairly often with implicit string concatenation to build long strings. I find it much more readable than option 3, and marginally more readable than option 4 because it removes the need for line continuation characters.
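For readers who haven't seen the string idiom, here is a minimal sketch of both uses of grouping parentheses. The variable names and sample strings are invented for illustration, and the second expression chains plain str methods rather than pandas ones.

```python
# Adjacent string literals inside parentheses are joined into one string.
headline = ("Top 5 states "
            "with the largest "
            "range of ages")

# The same grouping lets any chained expression span several lines,
# one method call per line, with no backslashes.
cleaned = ("  new jersey  "
           .strip()
           .title()
           .replace(" ", "_")
)

print(headline)
print(cleaned)
```

In both cases the open parenthesis tells Python to keep reading until the matching close, so no line needs its own continuation marker.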

Can we do better?

Part of the benefit of using intermediate variables, aside from reuse, is that their names convey additional meaning about the problem. Each of our solutions used a lambda function to calculate the age range.

In the first example, the name of the intermediate variable told the reader what the lambda computes:

age_range_by_state = census_by_state.apply(lambda group: group.age.max() - group.age.min())

But the final solution loses that meaning because the lambda is stuffed into the middle of the statement:

.apply(lambda group: group.age.max() - group.age.min())

If we use a normal, named function instead, the function's name conveys that meaning again.

In [7]:

def get_age_range(group):
    return group.age.max() - group.age.min()


(census
 .groupby("state")
 .apply(get_age_range)
 .sort_values(ascending=False)
 .head()
)

Out[7]:

state
Massachusetts    49
Iowa             47
Illinois         47
Tennessee        43
Arkansas         42
dtype: int64

To me, this is the most readable approach.
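As a variation on the same idea, the named function can be applied to just the age column with agg rather than to whole groups with apply. This is a sketch under assumptions: the tiny DataFrame below is a hypothetical stand-in for census.csv so the example runs on its own, and the names age_range and top_ranges are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for census.csv (not the real data).
census = pd.DataFrame({
    "age": [31, 35, 33, 68, 67, 20],
    "name": ["Jessica", "Heather", "Veronica", "Michael", "Veronica", "Sam"],
    "state": ["Wisconsin", "Wisconsin", "Nevada", "Nevada", "Utah", "Utah"],
})


def age_range(ages):
    """Return the gap between the oldest and youngest age in one group."""
    return ages.max() - ages.min()


top_ranges = (census
 .groupby("state")["age"]   # select the column first, so the function sees only ages
 .agg(age_range)
 .sort_values(ascending=False)
 .head()
)

print(top_ranges)
```

Selecting the column before aggregating keeps the function free of any knowledge about the DataFrame's layout, while the chain still reads as one idea per line.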

What do you think? Which approach do you prefer, and why?
