Heroes of the Storm Game Analysis
Posted on Tue 20 November 2018 in programming
The HOTS Logs website (https://hotslogs.com) has an API with a data dump of the past 30 days' worth of replays.
https://www.hotslogs.com/Info/API
It's an automated version of this reddit post:
That being said, this information is incredibly awesome! What kind of information can be gleaned from it?
import itertools
from multiprocessing import Pool
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
%matplotlib inline
replays = pd.read_csv("/home/brooks/data/hots/Replays.csv")
replays.head()
Let's prep the data for analysis
replays.rename(
{
"ReplayID": "id",
"GameMode(3=Quick Match 4=Hero League 5=Team League 6=Unranked Draft)": "game_mode",
"MapID": "map_id",
"Replay Length": "length",
"Timestamp (UTC)": "timestamp",
},
axis=1,
inplace=True,
)
The length
column is currently just a string. If we want to plot it, we should turn it into a timedelta object.
replays["length"] = pd.to_timedelta(replays.length)
The timestamp
column is still just a string. It needs to be in a date format that pandas understands in order to be useful.
%%time
with Pool() as pool:
replays["timestamp"] = pool.map(pd.to_datetime, replays.timestamp)
replays.head()
replays.iloc[0].timestamp
When do people play?¶
games_per_day = replays.resample("D", on="timestamp").timestamp.count()
ax = games_per_day.plot()
ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")
Looks like people play anywhere from 30k to 50k games on any given day. Except on the last day? That doesn't look right.
replays.sort_values(by="timestamp").tail(1)
ls -l ~/data/hots
Even though I downloaded this data on the 27th, it looks like the data only goes up to the first couple of hours on the 26th. Given this, it makes sense why the previous graph shows a drastic decrease in games played on the last day - because we don't have a full days' worth of data. If we were to omit games for the last day, we would see a much more accurate picture of games played.
replays = replays[replays.timestamp < "2018-09-26"]
games_per_day = replays.resample("D", on="timestamp").timestamp.count()
ax = replays.resample("D", on="timestamp").timestamp.count().plot()
ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")
That gives us a much better picture of things
ax = games_per_day.pct_change().apply(lambda pct: pct * 100).plot()
ax.set_ylabel("Percent change", rotation=0, labelpad=45)
ax.set_xlabel("Date")
ax.set_title("Variance of games played per day")
weekday_color = "#6272a4"
weekend_color = "#50fa7b"
ax = games_per_day.groupby(games_per_day.index.dayofweek).mean().plot.bar(
rot=0, # Let x labels be horizontal
color=[weekday_color] * 5 + [weekend_color] * 2,
)
ax.set_title("Average games played per day of the week")
ax.set_xlabel("Day of week (0=Monday, 6=Sunday)")
ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.legend(handles=[
mpatches.Patch(color=weekday_color, label="Weekday"),
mpatches.Patch(color=weekend_color, label="Weekend"),
])
Not surprisingly, people play more games on the weekend than they do during the week.
Match length¶
How long do matches last on average, and how does that vary depending on the map being played?
heroes_and_maps = pd.read_csv("/home/brooks/data/hots/HeroIDAndMapID.csv")
heroes_and_maps.columns = [column.lower() for column in heroes_and_maps.columns]
# Heroes are at the beginning of the csv file
# with non-null group and subgroup columns
heroes_and_maps.head()
# and maps are at the end, with null group & subgroup columns
heroes_and_maps.tail()
match_lengths = (
replays[["map_id", "length"]]
.merge(heroes_and_maps[["id", "name"]],
left_on="map_id",
right_on="id",
)
)
del match_lengths["map_id"]
match_lengths["length_mins"] = match_lengths.length.apply(lambda length: length.total_seconds() / 60)
match_lengths.head()
ax = (match_lengths
.groupby("name")
.length_mins
.mean()
.sort_values()
.plot.barh()
)
ax.set_title("Average duration of match")
ax.set_ylabel("")
ax.set_xlabel("Minutes")
Garden of Terror may have the longest average playtime (and Towers of Doom the shortest), but the playtime distribution is different for each map played.
fig, ax = plt.subplots(5, 3, figsize=(15, 35))
for axis in itertools.chain(*ax):
axis.set_xlabel("Match length (minutes)")
axis.set_ylabel("Games played")
match_lengths["length_mins"].hist(by=match_lengths["name"], rot=0, ax=ax)
plt.show()
All things being equal, with a dataset of this size I would expect the number of games to be approximately the same across all maps, but this is not the case. Cursed Hollow, Infernal Shrines, and others have ~60k matches played, but others like Warhead Junction and Garden of Terror only have ~7k matches played. Without knowing more information, my hunch is that the maps with a high number of matches played were the maps on current rotation.
The winning vs. losing team¶
How often does the winning team have a level advantage over the losing team?
There is basic win/loss information in the ReplayCharacters.csv
file, but unfortunately the laptop I'm using only has 3.3g of usable RAM and the file is too large to fit into memory via a pandas DataFrame (ala pd.read_csv(filename)
)
ls /home/brooks/data/hots/ReplayCharacters.csv -lh
So I'll have to get creative. In this case, I'm going to create a postgres table to contain the information I want, namely the replay id, in game level, experience contribution, and whether or not the player was on the winning team.
My script for creating such a table can be found here. On my laptop, it took a little bit over an hour to dump the entire 12M rows into the database.
connection = "postgresql://brooks:@localhost/hots"
%%time
query = """\
WITH winners AS (
SELECT
replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
FROM character
WHERE is_winner
GROUP BY replay_id
), losers AS (
SELECT
replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
FROM character
WHERE NOT is_winner
GROUP BY replay_id
)
SELECT
winners.replay_id, winners.level AS winning_level, winners.total_exp AS winning_exp,
losers.level AS losing_level, losers.total_exp AS losing_exp
FROM winners
JOIN losers
ON winners.replay_id=losers.replay_id
;
"""
outcomes = pd.read_sql(query, connection)
That's blazingly fast and memory efficient compared to trying to do the same thing in pandas
outcomes.head()
Back to the question at hand - how often do winning teams have the level advantage?
(outcomes.winning_level > outcomes.losing_level).value_counts(normalize=True) * 100
87.5% of teams win with a level advantage. What about just an experience advantage?
(outcomes.winning_exp > outcomes.losing_exp).value_counts(normalize=True) * 100
What can we learn from this?¶
If you're a Heroes of the Storm player, then it serves you well to know that, while having a higher team level doesn't necessarily cause a team to win, the two are highly correlated. Thus, if you're analyzing your own replays to find areas of improvement, look for the places where the enemy team's experience gain rate is higher than your own team's. It can be a sign of the other team outmaneuvering yours, and over the course of a match, could cost you a win.