Heroes of the Storm Game Analysis

Posted on Tue 20 November 2018 in programming

The HOTS Logs website (https://hotslogs.com) has an API with a data dump of the past 30 days' worth of replays.

https://www.hotslogs.com/Info/API

It's an automated version of this reddit post:

That being said, this information is incredibly awesome! What kind of information can be gleaned from it?

In [1]:
import itertools
from multiprocessing import Pool

import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

%matplotlib inline
In [2]:
replays = pd.read_csv("/home/brooks/data/hots/Replays.csv")
In [3]:
replays.head()
Out[3]:
ReplayID GameMode(3=Quick Match 4=Hero League 5=Team League 6=Unranked Draft) MapID Replay Length Timestamp (UTC)
0 155936477 3 1007 00:25:04 8/27/2018 1:31:50 AM
1 155291732 3 1003 00:21:51 8/27/2018 1:31:52 AM
2 155676469 3 1001 00:24:09 8/27/2018 1:31:52 AM
3 155291738 3 1002 00:16:35 8/27/2018 1:32:00 AM
4 155291753 3 1019 00:26:01 8/27/2018 1:32:03 AM

Let's prep the data for analysis

In [4]:
replays.rename(
    {
        "ReplayID": "id",
        "GameMode(3=Quick Match 4=Hero League 5=Team League 6=Unranked Draft)": "game_mode",
        "MapID": "map_id",
        "Replay Length": "length",
        "Timestamp (UTC)": "timestamp",
    },
    axis=1,
    inplace=True,
)

The length column is currently just a string. If we want to plot it, we should turn it into a timedelta object.

In [5]:
replays["length"] = pd.to_timedelta(replays.length)

The timestamp column is still just a string. It needs to be in a date format that pandas understands in order to be useful.

In [6]:
%%time

with Pool() as pool:
    replays["timestamp"] = pool.map(pd.to_datetime, replays.timestamp)
CPU times: user 14.3 s, sys: 1.17 s, total: 15.4 s
Wall time: 7min 22s
In [7]:
replays.head()
Out[7]:
id game_mode map_id length timestamp
0 155936477 3 1007 00:25:04 2018-08-27 01:31:50
1 155291732 3 1003 00:21:51 2018-08-27 01:31:52
2 155676469 3 1001 00:24:09 2018-08-27 01:31:52
3 155291738 3 1002 00:16:35 2018-08-27 01:32:00
4 155291753 3 1019 00:26:01 2018-08-27 01:32:03
In [8]:
replays.iloc[0].timestamp
Out[8]:
Timestamp('2018-08-27 01:31:50')

When do people play?

In [9]:
games_per_day = replays.resample("D", on="timestamp").timestamp.count()
In [10]:
ax = games_per_day.plot()

ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")
Out[10]:
Text(0.5, 0, 'Date')

Looks like people play anywhere from 30k to 50k games on any given day. Except on the last day? That doesn't look right.

In [11]:
replays.sort_values(by="timestamp").tail(1)
Out[11]:
id game_mode map_id length timestamp
791122 157198794 3 1016 00:24:07 2018-09-26 01:31:46
In [12]:
ls -l ~/data/hots
total 1385816
-rw-r--r-- 1 brooks brooks      3088 Oct 26 01:31  HeroIDAndMapID.csv
-rw-r--r-- 1 brooks brooks 382118435 Oct 27 11:26 'HOTSLogs Data Export Current.zip'
-rw-r--r-- 1 brooks brooks 979037223 Oct 26 02:02  ReplayCharacters.csv
-rw-r--r-- 1 brooks brooks  57903413 Oct 26 02:02  Replays.csv
drwxr-xr-x 6 brooks brooks      4096 Nov 19 12:43  venv/

Even though I downloaded this data on the 27th, it looks like the data only goes up to the first couple of hours on the 26th. Given this, it makes sense why the previous graph shows a drastic decrease in games played on the last day - because we don't have a full days' worth of data. If we were to omit games for the last day, we would see a much more accurate picture of games played.

In [13]:
replays = replays[replays.timestamp < "2018-09-26"]
games_per_day = replays.resample("D", on="timestamp").timestamp.count()
In [14]:
ax = replays.resample("D", on="timestamp").timestamp.count().plot()

ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")
Out[14]:
Text(0.5, 0, 'Date')

That gives us a much better picture of things

In [15]:
ax = games_per_day.pct_change().apply(lambda pct: pct * 100).plot()

ax.set_ylabel("Percent change", rotation=0, labelpad=45)
ax.set_xlabel("Date")
ax.set_title("Variance of games played per day")
Out[15]:
Text(0.5, 1.0, 'Variance of games played per day')
In [16]:
weekday_color = "#6272a4"
weekend_color = "#50fa7b"

ax = games_per_day.groupby(games_per_day.index.dayofweek).mean().plot.bar(
    rot=0,  # Let x labels be horizontal
    color=[weekday_color] * 5 + [weekend_color] * 2,
)

ax.set_title("Average games played per day of the week")
ax.set_xlabel("Day of week (0=Monday, 6=Sunday)")
ax.set_ylabel("Games played", rotation=0, labelpad=45)

ax.legend(handles=[
    mpatches.Patch(color=weekday_color, label="Weekday"),
    mpatches.Patch(color=weekend_color, label="Weekend"),
])
Out[16]:
<matplotlib.legend.Legend at 0x7f2a93db90b8>

Not surprisingly, people play more games on the weekend than they do during the week.

Match length

How long do matches last on average, and how does that vary depending on the map being played?

In [17]:
heroes_and_maps = pd.read_csv("/home/brooks/data/hots/HeroIDAndMapID.csv")
In [18]:
heroes_and_maps.columns = [column.lower() for column in heroes_and_maps.columns]
In [19]:
# Heroes are at the beginning of the csv file
# with non-null group and subgroup columns
heroes_and_maps.head()
Out[19]:
id name group subgroup
0 0 Unknown NaN NaN
1 1 Abathur Specialist Utility
2 2 Anub'arak Warrior Tank
3 3 Arthas Warrior Bruiser
4 4 Azmodan Specialist Siege
In [20]:
# and maps are at the end, with null group & subgroup columns
heroes_and_maps.tail()
Out[20]:
id name group subgroup
101 1018 Pull Party NaN NaN
102 1019 Volskaya Foundry NaN NaN
103 1020 Industrial District NaN NaN
104 1021 Escape From Braxis NaN NaN
105 1022 Alterac Pass NaN NaN
In [21]:
match_lengths = (
    replays[["map_id", "length"]]
    .merge(heroes_and_maps[["id", "name"]],
        left_on="map_id",
        right_on="id",
    )
)

del match_lengths["map_id"]
In [22]:
match_lengths["length_mins"] = match_lengths.length.apply(lambda length: length.total_seconds() / 60)
In [23]:
match_lengths.head()
Out[23]:
length id name length_mins
0 00:25:04 1007 Infernal Shrines 25.066667
1 00:21:15 1007 Infernal Shrines 21.250000
2 00:19:27 1007 Infernal Shrines 19.450000
3 00:25:16 1007 Infernal Shrines 25.266667
4 00:12:09 1007 Infernal Shrines 12.150000
In [24]:
ax = (match_lengths
      .groupby("name")
      .length_mins
      .mean()
      .sort_values()
      .plot.barh()
)

ax.set_title("Average duration of match")
ax.set_ylabel("")
ax.set_xlabel("Minutes")
Out[24]:
Text(0.5, 0, 'Minutes')

Garden of Terror may have the longest average playtime (and Towers of Doom the shortest), but the playtime distribution is different for each map played.

In [25]:
fig, ax = plt.subplots(5, 3, figsize=(15, 35))
for axis in itertools.chain(*ax):
    axis.set_xlabel("Match length (minutes)")
    axis.set_ylabel("Games played")
    
match_lengths["length_mins"].hist(by=match_lengths["name"], rot=0, ax=ax)

plt.show()

All things being equal, with a dataset of this size I would expect the number of games to be approximately the same across all maps, but this is not the case. Cursed Hollow, Infernal Shrines, and others have ~60k matches played, but others like Warhead Junction and Garden of Terror only have ~7k matches played. Without knowing more information, my hunch is that the maps with a high number of matches played were the maps on current rotation.

The winning vs. losing team

How often does the winning team have a level advantage over the losing team?

There is basic win/loss information in the ReplayCharacters.csv file, but unfortunately the laptop I'm using only has 3.3g of usable RAM and the file is too large to fit into memory via a pandas DataFrame (ala pd.read_csv(filename))

In [26]:
ls /home/brooks/data/hots/ReplayCharacters.csv -lh
-rw-r--r-- 1 brooks brooks 934M Oct 26 02:02 /home/brooks/data/hots/ReplayCharacters.csv

So I'll have to get creative. In this case, I'm going to create a postgres table to contain the information I want, namely the replay id, in game level, experience contribution, and whether or not the player was on the winning team.

My script for creating such a table can be found here. On my laptop, it took a little bit over an hour to dump the entire 12M rows into the database.

In [27]:
connection = "postgresql://brooks:@localhost/hots"
In [28]:
%%time
query = """\
WITH winners AS (
    SELECT
        replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
    FROM character
    WHERE is_winner
    GROUP BY replay_id
), losers AS (
    SELECT
        replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
    FROM character
    WHERE NOT is_winner
    GROUP BY replay_id
)
SELECT
    winners.replay_id, winners.level AS winning_level, winners.total_exp AS winning_exp,
    losers.level AS losing_level, losers.total_exp AS losing_exp
FROM winners
JOIN losers
ON winners.replay_id=losers.replay_id
;
"""

outcomes = pd.read_sql(query, connection)
CPU times: user 7.51 s, sys: 872 ms, total: 8.38 s
Wall time: 40.9 s

That's blazingly fast and memory efficient compared to trying to do the same thing in pandas

In [29]:
outcomes.head()
Out[29]:
replay_id winning_level winning_exp losing_level losing_exp
0 155291732 23 82518 22 69485
1 155291738 19 51890 16 32092
2 155291743 16 38761 14 29107
3 155291747 23 80746 21 63744
4 155291749 14 32180 11 17992

Back to the question at hand - how often do winning teams have the level advantage?

In [30]:
(outcomes.winning_level > outcomes.losing_level).value_counts(normalize=True) * 100
Out[30]:
True     87.510756
False    12.489244
dtype: float64

87.5% of teams win with a level advantage. What about just an experience advantage?

In [31]:
(outcomes.winning_exp > outcomes.losing_exp).value_counts(normalize=True) * 100
Out[31]:
True     95.032962
False     4.967038
dtype: float64

What can we learn from this?

If you're a Heroes of the Storm player, then it serves you well to know that, while having a higher team level doesn't necessarily cause a team to win, the two are highly correlated. Thus, if you're analyzing your own replays to find areas of improvement, look for the places where the enemy team's experience gain rate is higher than your own team's. It can be a sign of the other team outmaneuvering yours, and over the course of a match, could cost you a win.