The HOTS Logs website (https://hotslogs.com) has an API with a data dump of the past 30 days' worth of replays.

It's an automated version of this reddit post:

That being said, this information is incredibly awesome! What kind of information can be gleaned from it?

In [1]:

import itertools
from multiprocessing import Pool

import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:

replays = pd.read_csv("/home/brooks/data/hots/Replays.csv")

In [3]:

replays.head()

Out[3]:

	ReplayID	GameMode(3=Quick Match 4=Hero League 5=Team League 6=Unranked Draft)	MapID	Replay Length	Timestamp (UTC)
0	155936477	3	1007	00:25:04	8/27/2018 1:31:50 AM
1	155291732	3	1003	00:21:51	8/27/2018 1:31:52 AM
2	155676469	3	1001	00:24:09	8/27/2018 1:31:52 AM
3	155291738	3	1002	00:16:35	8/27/2018 1:32:00 AM
4	155291753	3	1019	00:26:01	8/27/2018 1:32:03 AM

Let's prep the data for analysis

In [4]:

replays.rename(
    {
        "ReplayID": "id",
        "GameMode(3=Quick Match 4=Hero League 5=Team League 6=Unranked Draft)": "game_mode",
        "MapID": "map_id",
        "Replay Length": "length",
        "Timestamp (UTC)": "timestamp",
    },
    axis=1,
    inplace=True,
)

The length column is currently just a string. If we want to plot it, we should turn it into a timedelta object.

In [5]:

replays["length"] = pd.to_timedelta(replays.length)

The timestamp column is still just a string. It needs to be in a date format that pandas understands in order to be useful.

In [6]:

%%time

with Pool() as pool:
    replays["timestamp"] = pool.map(pd.to_datetime, replays.timestamp)

CPU times: user 14.3 s, sys: 1.17 s, total: 15.4 s
Wall time: 7min 22s

In [7]:

replays.head()

Out[7]:

	id	game_mode	map_id	length	timestamp
0	155936477	3	1007	00:25:04	2018-08-27 01:31:50
1	155291732	3	1003	00:21:51	2018-08-27 01:31:52
2	155676469	3	1001	00:24:09	2018-08-27 01:31:52
3	155291738	3	1002	00:16:35	2018-08-27 01:32:00
4	155291753	3	1019	00:26:01	2018-08-27 01:32:03

In [8]:

replays.iloc[0].timestamp

Out[8]:

Timestamp('2018-08-27 01:31:50')

When do people play?¶

In [9]:

games_per_day = replays.resample("D", on="timestamp").timestamp.count()

In [10]:

ax = games_per_day.plot()

ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")

Out[10]:

Text(0.5, 0, 'Date')

Looks like people play anywhere from 30k to 50k games on any given day. Except on the last day? That doesn't look right.

In [11]:

replays.sort_values(by="timestamp").tail(1)

Out[11]:

	id	game_mode	map_id	length	timestamp
791122	157198794	3	1016	00:24:07	2018-09-26 01:31:46

In [12]:

ls -l ~/data/hots

total 1385816
-rw-r--r-- 1 brooks brooks      3088 Oct 26 01:31  HeroIDAndMapID.csv
-rw-r--r-- 1 brooks brooks 382118435 Oct 27 11:26 'HOTSLogs Data Export Current.zip'
-rw-r--r-- 1 brooks brooks 979037223 Oct 26 02:02  ReplayCharacters.csv
-rw-r--r-- 1 brooks brooks  57903413 Oct 26 02:02  Replays.csv
drwxr-xr-x 6 brooks brooks      4096 Nov 19 12:43  venv/

Even though I downloaded this data on the 27th, it looks like the data only goes up to the first couple of hours on the 26th. Given this, it makes sense why the previous graph shows a drastic decrease in games played on the last day - because we don't have a full days' worth of data. If we were to omit games for the last day, we would see a much more accurate picture of games played.

In [13]:

replays = replays[replays.timestamp < "2018-09-26"]
games_per_day = replays.resample("D", on="timestamp").timestamp.count()

In [14]:

ax = replays.resample("D", on="timestamp").timestamp.count().plot()

ax.set_ylabel("Games played", rotation=0, labelpad=45)
ax.set_xlabel("Date")

Out[14]:

Text(0.5, 0, 'Date')

That gives us a much better picture of things

In [15]:

ax = games_per_day.pct_change().apply(lambda pct: pct * 100).plot()

ax.set_ylabel("Percent change", rotation=0, labelpad=45)
ax.set_xlabel("Date")
ax.set_title("Variance of games played per day")

Out[15]:

Text(0.5, 1.0, 'Variance of games played per day')

In [16]:

weekday_color = "#6272a4"
weekend_color = "#50fa7b"

ax = games_per_day.groupby(games_per_day.index.dayofweek).mean().plot.bar(
    rot=0,  # Let x labels be horizontal
    color=[weekday_color] * 5 + [weekend_color] * 2,
)

ax.set_title("Average games played per day of the week")
ax.set_xlabel("Day of week (0=Monday, 6=Sunday)")
ax.set_ylabel("Games played", rotation=0, labelpad=45)

ax.legend(handles=[
    mpatches.Patch(color=weekday_color, label="Weekday"),
    mpatches.Patch(color=weekend_color, label="Weekend"),
])

Out[16]:

<matplotlib.legend.Legend at 0x7f2a93db90b8>

Not surprisingly, people play more games on the weekend than they do during the week.

Match length¶

How long do matches last on average, and how does that vary depending on the map being played?

In [17]:

heroes_and_maps = pd.read_csv("/home/brooks/data/hots/HeroIDAndMapID.csv")

In [18]:

heroes_and_maps.columns = [column.lower() for column in heroes_and_maps.columns]

In [19]:

# Heroes are at the beginning of the csv file
# with non-null group and subgroup columns
heroes_and_maps.head()

Out[19]:

	id	name	group	subgroup
0	0	Unknown	NaN	NaN
1	1	Abathur	Specialist	Utility
2	2	Anub'arak	Warrior	Tank
3	3	Arthas	Warrior	Bruiser
4	4	Azmodan	Specialist	Siege

In [20]:

# and maps are at the end, with null group & subgroup columns
heroes_and_maps.tail()

Out[20]:

	id	name	group	subgroup
101	1018	Pull Party	NaN	NaN
102	1019	Volskaya Foundry	NaN	NaN
103	1020	Industrial District	NaN	NaN
104	1021	Escape From Braxis	NaN	NaN
105	1022	Alterac Pass	NaN	NaN

In [21]:

match_lengths = (
    replays[["map_id", "length"]]
    .merge(heroes_and_maps[["id", "name"]],
        left_on="map_id",
        right_on="id",
    )
)

del match_lengths["map_id"]

In [22]:

match_lengths["length_mins"] = match_lengths.length.apply(lambda length: length.total_seconds() / 60)

In [23]:

match_lengths.head()

Out[23]:

	length	id	name	length_mins
0	00:25:04	1007	Infernal Shrines	25.066667
1	00:21:15	1007	Infernal Shrines	21.250000
2	00:19:27	1007	Infernal Shrines	19.450000
3	00:25:16	1007	Infernal Shrines	25.266667
4	00:12:09	1007	Infernal Shrines	12.150000

In [24]:

ax = (match_lengths
      .groupby("name")
      .length_mins
      .mean()
      .sort_values()
      .plot.barh()
)

ax.set_title("Average duration of match")
ax.set_ylabel("")
ax.set_xlabel("Minutes")

Out[24]:

Text(0.5, 0, 'Minutes')

Garden of Terror may have the longest average playtime (and Towers of Doom the shortest), but the playtime distribution is different for each map played.

In [25]:

fig, ax = plt.subplots(5, 3, figsize=(15, 35))
for axis in itertools.chain(*ax):
    axis.set_xlabel("Match length (minutes)")
    axis.set_ylabel("Games played")
    
match_lengths["length_mins"].hist(by=match_lengths["name"], rot=0, ax=ax)

plt.show()

All things being equal, with a dataset of this size I would expect the number of games to be approximately the same across all maps, but this is not the case. Cursed Hollow, Infernal Shrines, and others have ~60k matches played, but others like Warhead Junction and Garden of Terror only have ~7k matches played. Without knowing more information, my hunch is that the maps with a high number of matches played were the maps on current rotation.

The winning vs. losing team¶

How often does the winning team have a level advantage over the losing team?

There is basic win/loss information in the ReplayCharacters.csv file, but unfortunately the laptop I'm using only has 3.3g of usable RAM and the file is too large to fit into memory via a pandas DataFrame (ala pd.read_csv(filename))

In [26]:

ls /home/brooks/data/hots/ReplayCharacters.csv -lh

-rw-r--r-- 1 brooks brooks 934M Oct 26 02:02 /home/brooks/data/hots/ReplayCharacters.csv

So I'll have to get creative. In this case, I'm going to create a postgres table to contain the information I want, namely the replay id, in game level, experience contribution, and whether or not the player was on the winning team.

My script for creating such a table can be found here. On my laptop, it took a little bit over an hour to dump the entire 12M rows into the database.

In [27]:

connection = "postgresql://brooks:@localhost/hots"

In [28]:

%%time
query = """\
WITH winners AS (
    SELECT
        replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
    FROM character
    WHERE is_winner
    GROUP BY replay_id
), losers AS (
    SELECT
        replay_id, avg(team_level)::INTEGER as level, sum(exp_contribution) as total_exp
    FROM character
    WHERE NOT is_winner
    GROUP BY replay_id
)
SELECT
    winners.replay_id, winners.level AS winning_level, winners.total_exp AS winning_exp,
    losers.level AS losing_level, losers.total_exp AS losing_exp
FROM winners
JOIN losers
ON winners.replay_id=losers.replay_id
;
"""

outcomes = pd.read_sql(query, connection)

CPU times: user 7.51 s, sys: 872 ms, total: 8.38 s
Wall time: 40.9 s

That's blazingly fast and memory efficient compared to trying to do the same thing in pandas

In [29]:

outcomes.head()

Out[29]:

	replay_id	winning_level	winning_exp	losing_level	losing_exp
0	155291732	23	82518	22	69485
1	155291738	19	51890	16	32092
2	155291743	16	38761	14	29107
3	155291747	23	80746	21	63744
4	155291749	14	32180	11	17992

Back to the question at hand - how often do winning teams have the level advantage?

In [30]:

(outcomes.winning_level > outcomes.losing_level).value_counts(normalize=True) * 100

Out[30]:

True     87.510756
False    12.489244
dtype: float64

87.5% of teams win with a level advantage. What about just an experience advantage?

In [31]:

(outcomes.winning_exp > outcomes.losing_exp).value_counts(normalize=True) * 100

Out[31]:

True     95.032962
False     4.967038
dtype: float64

What can we learn from this?¶

If you're a Heroes of the Storm player, then it serves you well to know that, while having a higher team level doesn't necessarily cause a team to win, the two are highly correlated. Thus, if you're analyzing your own replays to find areas of improvement, look for the places where the enemy team's experience gain rate is higher than your own team's. It can be a sign of the other team outmaneuvering yours, and over the course of a match, could cost you a win.