Pets of Seattle
Posted on Thu 04 October 2018 in programming
Seattle has a reputation for being a pet-friendly city. By some estimates, there are more dogs in the city than there are children, an impressive feat for a place as populous as Seattle. Seattle's open data portal contains, among other things, information on licensed pets.
Awesome!
Let's explore the kinds of insights that can be found by looking at this data.
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib_venn_wordcloud import venn2_wordcloud
%matplotlib inline
license_file = "pet_licenses.csv"
# Alternatively, the license file may be read from here
# license_file = "https://data.seattle.gov/api/views/jguv-t9rb/rows.csv?accessType=DOWNLOAD"
pets = pd.read_csv(license_file)
Cleaning the data
Let's inspect the dataset to ensure it's properly cleaned before any analysis is run on it.
First, what does our data look like?
pets.shape # (rows, columns)
pets.columns
Let's rename the columns so they're easier to work with:
pets.columns = [
    "license_issued", "license", "name",
    "species", "primary_breed",
    "secondary_breed", "zipcode",
]
# how many NaNs does each column have?
# https://stackoverflow.com/a/26266451
pets.isna().sum()
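As a quick aside on how that count works: isna() marks each missing cell as True, and sum() then adds up the True values per column. A minimal sketch on a made-up two-row frame (the rows are invented for illustration and are not from the Seattle dataset):

```python
import pandas as pd

# Invented rows, for illustration only
sample = pd.DataFrame({
    "name": ["Lucy", None],
    "secondary_breed": [None, None],
})

# isna() yields a True/False frame; sum() counts the True values per column
missing = sample.isna().sum()
print(missing)
```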
Given the number of NaN values in the secondary_breed column, most owners either don't know what breed their pet is mixed with or left that section blank when applying for a license.
A fair number of licenses also don't have the name listed. It's possible these animals were either babies or recently adopted at the time of their licensing.
Cute!
In either case, let's replace the missing values in secondary_breed with empty strings and drop the remaining rows that contain NaNs.
pets = (pets
    .fillna({"secondary_breed": ""})
    .dropna()
)
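The order of the two steps matters: filling secondary_breed first keeps pets without a listed secondary breed from being thrown away by dropna(). Here is a minimal sketch of that behavior on a made-up three-row frame (the rows are invented for illustration):

```python
import pandas as pd

# Invented sample rows, for illustration only
sample = pd.DataFrame({
    "name": ["Lucy", None, "Max"],
    "primary_breed": ["Retriever, Labrador", "Domestic Shorthair", "Poodle"],
    "secondary_breed": [None, None, "Mix"],
})

cleaned = (sample
    .fillna({"secondary_breed": ""})  # a missing secondary breed becomes ""
    .dropna()                         # drops the row that still has a missing name
)
print(cleaned)
```

Lucy survives even though her secondary breed was missing, while the unnamed pet is dropped.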
The zipcode field is useful if we want to analyze pet information based on location. To make it easy, let's only use rows that contain a 5-digit ZIP code.
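# raw string for the regex; .copy() avoids SettingWithCopyWarning on later column assignments
pets = pets[pets.zipcode.str.match(r"^\d{5}$")].copy()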
pets.describe()
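To make the ZIP code filter concrete, here is a small sketch of which kinds of values the pattern keeps. The values are invented to cover the obvious cases (a ZIP+4 code, a short entry, a long entry) and are not taken from the dataset:

```python
import pandas as pd

# Invented values showing the kinds of entries the filter keeps or drops
zipcodes = pd.Series(["98103", "98103-1234", "9810", "981033"])

# ^ and $ anchor the pattern, so only exactly five digits pass
is_five_digit = zipcodes.str.match(r"^\d{5}$")
kept = zipcodes[is_five_digit].tolist()
print(kept)
```

Only the plain five-digit entry passes. Note that a ZIP+4 value is dropped rather than truncated, which is the simple choice made here, not the only reasonable one.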
license_issued looks like it's a string; it would be more useful as a datetime object.
pets["license_issued"] = pd.to_datetime(pets["license_issued"])
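The payoff of the conversion is the .dt accessor, which exposes date parts directly. A minimal sketch, using invented ISO-formatted dates (the format used in the real file may differ):

```python
import pandas as pd

# Invented dates, for illustration only
issued = pd.Series(["2018-02-14", "2018-10-04"])
issued = pd.to_datetime(issued)

# Once the column is datetime, parts such as month and year are one attribute away
months = issued.dt.month.tolist()
years = issued.dt.year.tolist()
print(months, years)
```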
There are also duplicate licenses that we should get rid of:
pets = pets.drop_duplicates(subset=["license"])
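With subset=["license"], two rows count as duplicates whenever their license values match, regardless of the other columns, and by default the first occurrence is the one that is kept. A small sketch on invented rows:

```python
import pandas as pd

# Invented licenses; "A1" appears twice
sample = pd.DataFrame({
    "license": ["A1", "B2", "A1"],
    "name": ["Lucy", "Max", "Lucy"],
})

# Only the license column is compared; the first "A1" row is kept
deduped = sample.drop_duplicates(subset=["license"])
print(deduped)
```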
Our data should be clean enough now, so let's move on to the analysis.
Cats & Dogs
Unsurprisingly, cats and dogs are the most commonly licensed pets, although there are a few farm animals as well.
pets.species.value_counts()
cats = pets[pets.species == "Cat"]
dogs = pets[pets.species == "Dog"]
When do owners register their pets?
We've all heard the cliché of getting your loved ones a pet for Valentine's Day, but apparently February is the least popular month to license a pet.
ax = (pets
    .license_issued
    .dt.strftime("%m - %b")  # e.g. "02 - Feb"; the numeric prefix sorts months in calendar order
    .value_counts(normalize=True)
    .mul(100)  # fraction -> percent
    .sort_index()
    .plot.bar()
)
ax.set_ylabel("Percent of licenses issued", rotation=0, labelpad=60)
ax.set_xlabel("Month")
ax
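The numbers behind the bar chart come from a normalized value count per month. Here is a minimal sketch of that calculation without the plotting, using four invented issue dates and the numeric month instead of a formatted label:

```python
import pandas as pd

# Invented issue dates: two in January, one in February, one in March
dates = pd.Series(pd.to_datetime(
    ["2018-01-05", "2018-01-20", "2018-02-14", "2018-03-01"]
))

monthly_share = (dates
    .dt.month                      # month number, 1 through 12
    .value_counts(normalize=True)  # fraction of all licenses per month
    .mul(100)                      # convert to percent
    .sort_index()                  # calendar order
)
print(monthly_share)
```

Because the shares are normalized, they always add up to 100, so a low bar for one month means proportionally higher bars elsewhere.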
Most common names
If you're ever on the street and bump into a dog (or cat), your best bet is to call it Lucy.
cat_names = cats.name.value_counts()
dog_names = dogs.name.value_counts()
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
cat_names.head(10).plot.bar(
title="Most popular cat names",
ax=ax[0],
)
dog_names.head(10).plot.bar(
title="Most popular dog names",
ax=ax[1],
)
plt.show()
fig, ax = plt.subplots(1, 1)
ax.set_title("Top cat and dog names")
venn2_wordcloud(
[
set(cat_names.head(10).index),
set(dog_names.head(10).index),
],
set_labels=("Cats", "Dogs"),
ax=ax,
)
plt.show()
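The three regions of the Venn diagram are just set operations on the two top-ten lists. A minimal sketch with plain Python sets, using invented name lists rather than the real top tens:

```python
# Invented top-name sets, for illustration only
top_cat_names = {"Lucy", "Oliver", "Luna"}
top_dog_names = {"Lucy", "Charlie", "Luna"}

shared = top_cat_names & top_dog_names     # the overlapping middle region
cats_only = top_cat_names - top_dog_names  # left region
dogs_only = top_dog_names - top_cat_names  # right region
print(sorted(shared), sorted(cats_only), sorted(dogs_only))
```

With the real data, the same expressions would take set(cat_names.head(10).index) and set(dog_names.head(10).index) as inputs.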
Breed
Retrievers (Labrador and Golden) are the most popular type of dog.
Domestic shorthair cats are the most popular cat by a wide margin.
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
cats.primary_breed.value_counts().head(10).plot.bar(title="Top 10 cat breeds", ax=ax[0])
dogs.primary_breed.value_counts().head(10).plot.bar(title="Top 10 dog breeds", ax=ax[1])
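The dog chart counts each breed label separately, so treating retrievers as one group means combining labels. A minimal sketch of one way to do that, assuming the labels contain the word "Retriever" (the labels below are invented, and the real dataset's spelling may differ):

```python
import pandas as pd

# Invented breed labels; the real dataset's labels may be spelled differently
breeds = pd.Series([
    "Retriever, Labrador", "Retriever, Golden", "Chihuahua",
    "Retriever, Labrador", "Poodle",
])

# Flag every label that mentions "Retriever", then count the flags
is_retriever = breeds.str.contains("Retriever")
retriever_count = int(is_retriever.sum())
print(retriever_count, "of", len(breeds), "are retrievers")
```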
Where are they?
Are there specific places within Seattle that have more pets than others?
pets.zipcode.value_counts().head(10)
import folium
import requests
SEATTLE_COORDINATES = (47.63, -122.27)
DEFAULT_ZOOM = 10
# Thank you, SeattleIO, for providing this!
ZIPCODE_GEO_URL = (
"https://raw.githubusercontent.com/seattleio/seattle-boundaries-data/master/data/zip-codes.geojson"
)
ZIPCODE_GEO_TEXT = requests.get(ZIPCODE_GEO_URL).text
def plot_data_by_zipcode(data,
                         location=SEATTLE_COORDINATES,
                         zoom_start=DEFAULT_ZOOM,
                         geo_data=ZIPCODE_GEO_TEXT,
                         key_on="feature.properties.ZCTA5CE10",
                         fill_color="PuBuGn",
                         **kwargs):
    """Plot data on a choropleth map

    With the exception of location and zoom_start, all arguments
    in this function are passed to folium.Map.choropleth. The location
    and zoom_start arguments are passed in as parameters to the folium.Map
    constructor.

    Args:
        data - data to pass into the choropleth map. Corresponds to the
            data argument passed into folium.Map.choropleth and is usually
            a pandas DataFrame or Series.
        **kwargs - Additional arguments to pass to folium.Map.choropleth

    Returns:
        A folium.Map instance containing a choropleth of mapped data
    """
    map_ = folium.Map(location=location, zoom_start=zoom_start)
    # Note: later folium releases deprecate Map.choropleth in favor of
    # folium.Choropleth(...).add_to(map_)
    map_.choropleth(
        geo_data=geo_data,
        data=data,
        key_on=key_on,
        fill_color=fill_color,
        **kwargs,
    )
    return map_
plot_data_by_zipcode(
data=pets.zipcode.value_counts(),
legend_name="Pet Population by Zipcode",
)
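A choropleth can only color a ZIP code when a value in the index of data matches the property named by key_on in the GeoJSON. The sketch below shows one way to check for ZIP codes that would have no shape to color. It uses a hand-written stand-in for the GeoJSON (geometry omitted) and invented counts, and it assumes the property values are strings, like the zipcode column:

```python
import pandas as pd

# Hand-written stand-in for the GeoJSON; geometry is omitted for brevity
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ZCTA5CE10": "98103"}, "geometry": None},
        {"type": "Feature", "properties": {"ZCTA5CE10": "98115"}, "geometry": None},
    ],
}

# Invented counts, indexed by ZIP code like pets.zipcode.value_counts()
counts = pd.Series({"98103": 3, "98115": 1, "98999": 2})

# key_on="feature.properties.ZCTA5CE10" refers to this property on each feature
geo_zipcodes = {feature["properties"]["ZCTA5CE10"] for feature in geo["features"]}
unmatched = sorted(set(counts.index) - geo_zipcodes)
print(unmatched)
```

Any ZIP code listed in unmatched has licenses in the data but no boundary in the GeoJSON, so it would not appear on the map.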
North Seattle, especially around the Green Lake area, has the highest population of pets. If you ever want to go dog watching, you can use this data as a guide to find the best place to do so!