Making method chains readable
Posted on Sun 06 May 2018 in programming
I do a fair amount of data manipulation work in pandas and as such, I find myself doing a lot of method chaining. In the past I've struggled to find a good way of keeping my code concise while still maintaining readability.
What do I mean by that? Suppose we have census data on a group of people.
import pandas as pd
census = pd.read_csv("census.csv")
census.head()
Note: these are not actual people; I used the following script to generate this data.
If we wanted to find the top 5 states with the largest range of ages, we could do the following:
Option 1: Use intermediate variables
census_by_state = census.groupby("state")
age_range_by_state = census_by_state.apply(lambda group: group.age.max() - group.age.min())
age_range_by_state.sort_values(ascending=False).head()
This code certainly works and is decently readable, but unless we plan to reuse one of those intermediate variables, it is a bit cumbersome. The same thing can be accomplished without intermediate variables by chaining all the method calls together.
Option 2: All on one line
census.groupby("state").apply(lambda group: group.age.max() - group.age.min()).sort_values(ascending=False).head()
This solution, however, is less readable because the line is too long, and we've lost the benefit we gained from naming our intermediate results.
Option 3: Line wrap on open or close parenthesis
census.groupby("state"
).apply(lambda group: group.age.max() - group.age.min()
).sort_values(ascending=False
).head(
)
The lines are short enough to be readable, so this isn't too bad. There are two things about this approach, however, that aren't ideal.
- `groupby` appears on the first line.
  - To me, a line should clearly indicate what it does, with minimal distraction. `census.groupby("state"` is two ideas (`census` and the grouping of it) jammed into a single line. In a single-line statement, this wouldn't be a problem. But this is a multi-line statement, with the rest of the lines expressing a single idea per line, and having the first line express two ideas instead of one throws off the pacing of the statement.
- Subsequent lines begin with a closing parenthesis.
  - Starting with a closing parenthesis doesn't add anything to the readability of the current line. In fact, I would argue the opposite. The closing parenthesis doesn't give the reader any information about what the current line is doing and only exists because, syntactically, the line before it must be closed.
Option 4: Line continuations
census\
.groupby("state")\
.apply(lambda group: group.age.max() - group.age.min())\
.sort_values(ascending=False)\
.head()
I like this option much more than the previous one because it expresses a single idea per line and doesn't have the dangling closing parenthesis at the beginning of each line. What I don't like about this approach, however, is that it relies on line continuations. They look awkward, are not recommended by PEP 8, and suffer from the same problem that JSON does: the final element must not have the "but wait, there's more" character (`\` in this example, `,` in JSON).
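To make that last complaint concrete, here is a small sketch that compiles two versions of a backslash-continued chain without running them. The snippet and the helper name `compiles` are my own illustration, not part of the census example. The first source string ends the chain cleanly; the second leaves a backslash on the final step, so the next statement gets pulled into the expression.

```python
def compiles(source: str) -> bool:
    """Return True if the source parses as valid Python (it is never executed)."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# A chain where only the non-final steps end with a backslash.
CLEAN_CHAIN = "\n".join([
    'result = "  abc  "\\',
    "    .strip()\\",
    "    .upper()",
    "print(result)",
])

# The same chain with a leftover backslash on the final step, which joins
# the print statement onto the end of the expression.
LEFTOVER_BACKSLASH = "\n".join([
    'result = "  abc  "\\',
    "    .strip()\\",
    "    .upper()\\",
    "print(result)",
])

print(compiles(CLEAN_CHAIN))         # True
print(compiles(LEFTOVER_BACKSLASH))  # False
```

This is the practical cost of the approach: adding or removing the last step of a chain means remembering to edit the backslash on the line before it.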
Option 5: Group the statement in parentheses
(census
.groupby("state")
.apply(lambda group: group.age.max() - group.age.min())
.sort_values(ascending=False)
.head()
)
This approach is used fairly often when using implicit string concatenation to make long strings; I find it much more readable than option 3 and only marginally more readable than option 4 because it removes the need for line continuation characters.
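Since I mentioned implicit string concatenation, here is a small side-by-side sketch of both uses of grouping parentheses. The variable names and the string are made up for illustration. One extra thing the parentheses allow, which backslash continuations do not, is a trailing comment on each step.

```python
# Implicit string concatenation: adjacent literals inside the parentheses
# are joined into one string.
message = (
    "Method chains can get long, "
    "and so can strings; "
    "parentheses let both span lines."
)

# The same grouping trick applied to a method chain, one step per line.
words = (message
    .lower()               # normalise the case
    .replace(";", ",")     # swap the semicolon for a comma
    .split()               # break into whitespace-separated words
)

print(words[:3])  # ['method', 'chains', 'can']
```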
Can we do better?
Part of the benefit of using intermediate variables, aside from reuse, is that their names can convey additional meaning. Each of our solutions used a `lambda` function to calculate the age range.
In the first example, the name of the intermediate variable conveyed the meaning of the lambda:
age_range_by_state = census_by_state.apply(lambda group: group.age.max() - group.age.min())
But the final solution loses that meaning because the lambda is stuffed into the middle of the statement:
.apply(lambda group: group.age.max() - group.age.min())
If we use a normal function instead, we can again convey the meaning behind the function.
def get_age_range(group):
    return group.age.max() - group.age.min()
(census
.groupby("state")
.apply(get_age_range)
.sort_values(ascending=False)
.head()
)
To me, this is the most readable approach.
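For anyone who wants to run the final version without the census file, here is a minimal sketch on a tiny made-up table. The rows and the helper name `age_range` are invented for the example. It also departs slightly from the code above: it selects the `age` column before aggregating and uses `agg` instead of `apply`, partly because, as far as I know, recent pandas releases warn when `apply` operates on the grouping column.

```python
import pandas as pd

# Made-up rows standing in for census.csv; only the columns used here.
census = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY", "TX", "TX"],
    "age": [20, 60, 35, 30, 31, 18, 90],
})


def age_range(ages):
    """Difference between the oldest and youngest age in one group."""
    return ages.max() - ages.min()


top_states = (census
    .groupby("state")["age"]
    .agg(age_range)
    .sort_values(ascending=False)
    .head()
)

print(list(top_states.index))  # ['TX', 'CA', 'NY']
```

The named function still carries the meaning that the lambda lost, and the chain keeps one idea per line.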
What do you think? Which approach do you prefer, and why?