Introduction

The research topic of this project is an analysis of National Hockey League (NHL) statistics. Our team’s objective is to determine which statistics influence the outcomes of NHL games and if these statistics can help predict an NHL team’s final points total. Additionally, we want to investigate if they can be used to predict the outcome of an individual game.

Specifically, the question that we have set out to answer in this report is: Can we build a model to predict the outcome of a hockey game that is better than flipping a coin?

As hockey fans, we find this topic of interest for a couple of reasons. First, it is common for hockey fans to make predictions based on what they have seen in previous games or how the teams look in the standings, but these predictions are often subjective and qualitative. Our group is interested to see whether there is a more quantitative approach to these predictions. Second, we wanted to investigate whether a quantitative approach to predictions could be used for sports betting. The group is interested to see whether a model can be created that consistently beats the odds set by the bookmakers.

The data used for this project is taken from three key sources:

  • hockey-reference.com
  • naturalstattrick.com
  • The NHL Game Data dataset on Kaggle

These sites provide both traditional hockey statistics, such as goals, assists and penalty minutes, and advanced, non-traditional hockey statistics that incorporate location information to determine shot quality. Using the techniques and methodologies from our DATA 603 course, we wanted to see whether a model would make use of both traditional and advanced hockey statistics.

Permission restrictions for use of data from hockey-reference.com can be found at the following link: https://www.sports-reference.com/data_use.html. For the purposes of this project, there are no restrictions for the dataset that we are using. Additionally, we have communicated with naturalstattrick.com to ensure that use of their data is acceptable for this project. A transcript of this communication can be provided upon request. As for the NHL Game Data from Kaggle, it is a public dataset and is not restricted for the purposes of this project.

To accomplish our tasks for this project we used a combination of R and Python. Various packages in R’s extensive statistics library were used for model building, whereas Python was used for data wrangling and for preparing data frames for analysis in R.

Methodology

Descriptive Statistics

Each descriptive statistic considered in the logistic and multiple linear regression models is described below. All variables are numeric and are either whole numbers or percentages.

Descriptive Statistics - Logistic Regression

Hockey Stat Abbrev. | Hockey Stat Name | Description
Corsi | Corsi | Any shot attempt (goals, shots on net, misses and blocks) outside of the shootout.
CF | Corsi For | Count of Corsi for that team.
CA | Corsi Against | Count of Corsi against that team.
CF% | Corsi For % | Percentage of total Corsi in games that team played that are for that team. CF*100/(CF+CA)
Fenwick | Fenwick | Any unblocked shot attempt (goals, shots on net and misses) outside of the shootout.
FF | Fenwick For | Count of Fenwick for that team.
FA | Fenwick Against | Count of Fenwick against that team.
FF% | Fenwick For % | Percentage of total Fenwick in games that team played that are for that team. FF*100/(FF+FA)
Shots | Shots | Any shot attempt on net (goals and shots on net) outside of the shootout.
SF | Shots For | Count of Shots for that team.
SA | Shots Against | Count of Shots against that team.
Goals | Goals | Any goal outside of the shootout.
GF | Goals For | Count of Goals for that team.
GA | Goals Against | Count of Goals against that team.
GF% | Goals For % | Percentage of total Goals in games that team played that are for that team. GF*100/(GF+GA)
SC | Scoring Chances | Each shot attempt (Corsi) taken in the offensive zone is assigned a value based on the area of the zone in which it was recorded.
SCF | Scoring Chances For | Count of Scoring Chances for that team.
SCA | Scoring Chances Against | Count of Scoring Chances against that team.
SCF% | Scoring Chances For % | Percentage of total Scoring Chances in games that team played that are for that team. SCF*100/(SCF+SCA)
SH% | Shooting % | Percentage of Shots for that team that were Goals. GF*100/SF
SV% | Save % | Percentage of Shots against that team that were not Goals. 100-(GA*100/SA)
PDO | Not defined | Shooting percentage plus save percentage. SH% + SV%
Hits | Hits | Count of hits for a team in a game.
PIM | Penalty Minutes | The number of penalty minutes a team gets in a game.
PowerPlayOpportunities | PowerPlayOpportunities | The number of times a team goes on the power play in a game.
powerPlayGoals | powerPlayGoals | The number of goals a team scores on the power play in a game.
faceOffWinPercentage | faceOffWinPercentage | Percentage of faceoffs that a team wins in a game.
giveaways | giveaways | The number of times a team gives away the puck.
takeaways | takeaways | The number of times a team takes the puck away from the other team.
z | Zone | For the following variables, z is a prefix for the zone. There are three zones: High Danger (HD), Medium Danger (MD) and Low Danger (LD).
zCF | “zone” Chances For | Count of “zone” Scoring Chances for that team.
zCA | “zone” Chances Against | Count of “zone” Scoring Chances against that team.
zCF% | “zone” Chances For % | Percentage of total “zone” Scoring Chances in games that team played that are for that team. zCF*100/(zCF+zCA)
zSF | “zone” Shots For | Count of Shots that are “zone” Scoring Chances for that team.
zSA | “zone” Shots Against | Count of Shots that are “zone” Scoring Chances against that team.
zSF% | “zone” Shots For % | Percentage of total Shots that are “zone” Scoring Chances in games that team played that are for that team. zSF*100/(zSF+zSA)
zGF | “zone” Goals For | Count of Goals off of “zone” Scoring Chances for that team.
zGA | “zone” Goals Against | Count of Goals off of “zone” Scoring Chances against that team.
zGF% | “zone” Goals For % | Percentage of total Goals off of “zone” Scoring Chances in games that team played that are for that team. zGF*100/(zGF+zGA)
zSH% | “zone” Shooting % | Percentage of “zone” Shots for that team that were Goals. zGF*100/zSF
zSV% | “zone” Save % | Percentage of “zone” Shots against that team that were not Goals. 100-(zGA*100/zSA)
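
The percentage statistics in this table are all built from the same few counts. As a minimal sketch of that arithmetic, the Python below computes CF%, SH%, SV% and PDO. PDO is taken as shooting percentage plus save percentage, following its description. The function names and the single-game counts are hypothetical and are used only for illustration; they are not values from our dataset.

```python
def pct_for(stat_for, stat_against):
    """Share of a statistic that belongs to the team, e.g. CF% = CF*100/(CF+CA)."""
    return stat_for * 100 / (stat_for + stat_against)


def shooting_pct(goals_for, shots_for):
    """SH%: percentage of the team's shots that were goals."""
    return goals_for * 100 / shots_for


def save_pct(goals_against, shots_against):
    """SV%: percentage of shots against the team that were not goals."""
    return 100 - goals_against * 100 / shots_against


def pdo(goals_for, shots_for, goals_against, shots_against):
    """PDO: shooting percentage plus save percentage."""
    return shooting_pct(goals_for, shots_for) + save_pct(goals_against, shots_against)


# Hypothetical counts for one team in one game (illustration only).
game = {"CF": 60, "CA": 40, "SF": 30, "SA": 25, "GF": 3, "GA": 2}

cf_pct = pct_for(game["CF"], game["CA"])
sh_pct = shooting_pct(game["GF"], game["SF"])
sv_pct = save_pct(game["GA"], game["SA"])
pdo_value = pdo(game["GF"], game["SF"], game["GA"], game["SA"])
print(cf_pct, sh_pct, sv_pct, pdo_value)
```

The same `pct_for` helper applies to FF%, GF%, SCF% and the zone-prefixed percentages, since each of them has the form for*100/(for+against).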

Descriptive Statistics - Multiple Linear Regression

Hockey Stat Abbrev. | Hockey Stat Name | Description | Relation
AvAge | Average Age | Average age of the hockey team | A lower average age means a younger team
W | Wins | Number of wins in a season | The more wins in a season, the better the team
L | Losses | Number of losses in a season | The more losses in a season, the worse the team
OL | Overtime Losses | Number of overtime losses in a season | The more overtime losses in a season, the worse the team, but these losses are not as bad as regular losses
PTS | Points | Number of points a team earns in a season | The more points in a season, the better the team
GF | Goals For | Number of goals a team scores in a season | The more goals for in a season, the better the team
GA | Goals Against | Number of goals a team lets in during a season | The more goals against in a season, the worse the team
GD | Goal Differential | Goals for minus goals against (GF - GA) | The higher the differential, the better the team
SOW | Shootout Wins | Number of wins for a team in the shootout | More SOW means the team is better at the shootout
SOL | Shootout Losses | Number of losses for a team in the shootout | More SOL means the team is worse at the shootout
SRS | Simple Rating System | Rating based on goal differential and strength of schedule | A higher SRS means a better team
SOS | Strength of Schedule | Measurement of the opponents' position in the standings | A lower SOS means the opponents are weaker
TG/G | Total Goals per Game | Total goals (GF + GA) per game | A higher TG/G means more goals per game
EVGF | Even Strength Goals For | Even-strength goals for a team in a season | A higher EVGF means the team is better at even strength
EVGA | Even Strength Goals Against | Even-strength goals against a team in a season | A lower EVGA means the team is better at even strength
PD | Penalty Differential | Penalty minutes for minus penalty minutes against, per game in a season | A higher PD means the team takes more penalties
SD | Shot Differential | Shots for minus shots against, per game in a season | A higher SD means the team takes more shots than it gives up
PDO | PDO | Shooting % + Save % at even strength | A higher PDO means the team is better at even strength
CF% | Corsi For % at 5 on 5 | Percentage of all shot attempts (goals, shots on net, misses and blocked shots) that are for the team | A higher CF% means the team is shooting more than its opponents
FF% | Fenwick For % at 5 on 5 | Percentage of unblocked shot attempts (goals, shots on net and misses) that are for the team, FF / (FF + FA). Above 50% means the team was controlling the puck more often than not. Blocked shots are not counted. | A higher FF% means the team is shooting more than its opponents
xGF | Expected Goals For | Expected goals for the team at even strength, based on where its shots came from compared to the league-wide shooting percentage for each shot location | The higher the xGF, the more goals a team is expected to score in a game
xGA | Expected Goals Against | Expected goals against the team at even strength, based on where the opponents' shots came from compared to the league-wide shooting percentage for each shot location | The lower the xGA, the fewer goals a team is expected to let in during a game
SCF% | Scoring Chances For % | Percentage of scoring chances in this team’s favor | A higher SCF% means the team has more of the scoring chances
HDF% | High Danger Scoring Chances For % | Percentage of high-danger scoring chances in this team’s favor | A higher HDF% means the team has more of the high-danger scoring chances
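
Several of the season-level variables above are simple functions of the others. The Python sketch below illustrates GD and TG/G as described in the table. It also computes PTS on the assumption that the standard NHL standings rule applies (two points per win, one point per overtime or shootout loss), which the table does not state. The function name and the season totals are hypothetical.

```python
def season_summary(wins, overtime_losses, goals_for, goals_against, games):
    """Derive PTS, GD and TG/G from season totals.

    Assumes the standard NHL rule of 2 points per win and 1 point per
    overtime/shootout loss; regulation losses earn nothing.
    """
    return {
        "PTS": 2 * wins + overtime_losses,
        "GD": goals_for - goals_against,
        "TG/G": (goals_for + goals_against) / games,
    }


# Hypothetical season totals for one team (illustration only).
summary = season_summary(wins=45, overtime_losses=7,
                         goals_for=250, goals_against=230, games=82)
print(summary)
```

Because PTS is an exact function of W and OL under this rule, these variables cannot all serve as independent predictors of points in the same model.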

Data Wrangling

The data that we have used for this project provides game-by-game statistics, as defined in the tables above, for the 2014-2018 seasons (five seasons in total). We wanted to use the 2014-2017 data to make predictions for the 2018 season, so the 2018 data was excluded from our model building. This left us with three seasons of 1,230 games each and one season of 1,271 games, for a total of 4,961 games of data that we could use to build our model.

Our goal was to build a logistic regression model to predict future games using previous data. Taken one game at a time, the data does not provide much value for our purposes, but it does provide insight into how a team plays when it is looked at in aggregate. With this in mind, for each game we averaged each statistic over all of the team's previous games in that season. The seasons were kept separate because teams tend to make drastic roster changes in the offseason. For example, for the 42nd game that the Calgary Flames played in the 2017 season, we would use an average of games 1-41 that the Flames played during the 2017 season.
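
The averaging step can be sketched as follows. This is a minimal standard-library illustration of the idea, not the wrangling script used for the project. The record layout (`season`, `team`, `game_number`), the function name, and the shot counts are all assumptions made for the example.

```python
from collections import defaultdict


def prior_game_averages(games, stat):
    """Average of `stat` over a team's earlier games in the same season.

    `games` is a list of dicts with the (hypothetical) keys "season",
    "team", "game_number" and the statistic itself. Returns a dict keyed
    by (season, team, game_number). A team's first game of a season has
    no earlier games, so its average is None.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    averages = {}
    ordered = sorted(games, key=lambda g: (g["season"], g["team"], g["game_number"]))
    for g in ordered:
        key = (g["season"], g["team"])
        n = counts[key]
        # Record the average BEFORE adding this game, so the game being
        # predicted never contributes to its own input.
        averages[(g["season"], g["team"], g["game_number"])] = totals[key] / n if n else None
        totals[key] += g[stat]
        counts[key] += 1
    return averages


# Hypothetical shot counts (illustration only).
games = [
    {"season": 2017, "team": "CGY", "game_number": 1, "shots": 30},
    {"season": 2017, "team": "CGY", "game_number": 2, "shots": 28},
    {"season": 2017, "team": "CGY", "game_number": 3, "shots": 35},
    {"season": 2018, "team": "CGY", "game_number": 1, "shots": 40},
]
shots_avg = prior_game_averages(games, "shots")
print(shots_avg)
```

Keying the running totals by season and team means the averages restart at the beginning of each season, which matches the decision to keep seasons separate.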

Once we had the average statistics over the previous games for each team, we subtracted the Away Team statistics from the Home Team statistics (Home minus Away) to create marginal statistics for each game. These marginal statistics are the inputs to our logistic regression model. For example, if the Home Team averaged 31.3 shots per game prior to game i and the Away Team averaged 29.4 shots per game prior to game i, then our model would see an average marginal shot difference of 1.9.
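
The Home minus Away step is a subtraction per statistic, sketched below in Python using the shot averages from the example above. The function name and the `hits_avg` values are hypothetical additions for illustration.

```python
def marginal_stats(home_avg, away_avg):
    """Home minus Away for every statistic present in the home averages."""
    return {name: home_avg[name] - away_avg[name] for name in home_avg}


# shots_avg values come from the example in the text; hits_avg values are made up.
home = {"shots_avg": 31.3, "hits_avg": 22.0}
away = {"shots_avg": 29.4, "hits_avg": 24.5}

marginals = marginal_stats(home, away)
print({name: round(value, 1) for name, value in marginals.items()})
```

A positive marginal means the home team has the higher prior-game average for that statistic, and a negative marginal means the away team does.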

All data wrangling was performed in Python and the results were exported to csv files to analyze in R.

Methodology - Logistic Regression

In order to create a prediction model for NHL games, the group utilized the logistic regression methodology described in DATA 603. The first step in this process was to evaluate the variables in the previously described datasets by running a full model, checking for multicollinearity, and running individual z-tests on all variables to test variable significance.

Assumptions that will be tested in the multiple logistic regression model include:

  • Multicollinearity
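
To make the link between the fitted model and the coin-flip benchmark concrete, the sketch below shows how a logistic regression turns marginal statistics into a home-win probability, and how prediction accuracy would then be compared against 0.5. The coefficients, intercept and games are invented for illustration only. They are not fitted values or results from this project.

```python
import math


def home_win_probability(marginals, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum of b_k * x_k)))."""
    z = intercept + sum(coefficients[name] * marginals[name] for name in coefficients)
    return 1 / (1 + math.exp(-z))


def accuracy(probabilities, home_won):
    """Share of games where predicting a home win for p > 0.5 matched the result."""
    correct = sum((p > 0.5) == won for p, won in zip(probabilities, home_won))
    return correct / len(home_won)


# Hypothetical coefficients and games (NOT fitted values or real results).
coefficients = {"shots_avg": 0.05, "hits_avg": -0.01}
intercept = 0.1
games = [
    ({"shots_avg": 1.9, "hits_avg": -2.5}, True),
    ({"shots_avg": -3.0, "hits_avg": 1.0}, False),
    ({"shots_avg": 0.5, "hits_avg": 0.0}, False),
]

probabilities = [home_win_probability(x, coefficients, intercept) for x, _ in games]
model_accuracy = accuracy(probabilities, [won for _, won in games])
beats_coin_flip = model_accuracy > 0.5
print(probabilities, model_accuracy, beats_coin_flip)
```

In practice the accuracy would be computed on games that were not used to fit the model, such as the held-out 2018 season.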

Methodology - Multiple Linear Regression

In order to create a prediction model for a team’s next season’s points totals, the group utilized the multiple linear regression methodology described in DATA 603. The individual coefficients test (t-test) is a partial model test that will be used to test the significance of the statistics from the previously described datasets. Additionally, a stepwise regression procedure will be used to compare the partial model tests to see if a more effective model can be built.

Assumptions that will be tested in the multiple linear regression model include:

  • Multicollinearity
  • Linearity
  • Equal variance
  • Normality
  • Outliers

Main Results of the Analysis

Results - Logistic Regression

We calculate the Variance Inflation Factor (VIF) for each variable to check whether our variables are collinear. We also use ggpairs to visually check for any type of multicollinearity.
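
For reference, the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all of the other predictors. The Python sketch below implements this definition from scratch on a small invented dataset. It is an illustration of the calculation, not the imcdiag implementation, and it assumes the remaining predictors are not exactly collinear among themselves. The cutoff of 10 is an assumption that appears consistent with the detection column in the output.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("remaining predictors are exactly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def vif(columns):
    """VIF per predictor: regress each one on the others, return 1 / (1 - R^2)."""
    centered = {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        centered[name] = [v - mean for v in values]  # centering absorbs the intercept
    result = {}
    for target, y in centered.items():
        others = [v for name, v in centered.items() if name != target]
        xtx = [[sum(p * q for p, q in zip(u, w)) for w in others] for u in others]
        xty = [sum(p * q for p, q in zip(u, y)) for u in others]
        beta = solve(xtx, xty)
        fitted = [sum(b * u[i] for b, u in zip(beta, others)) for i in range(len(y))]
        ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        ss_tot = sum(yi ** 2 for yi in y)
        # 1 - R^2 = ss_res / ss_tot, so VIF = ss_tot / ss_res (infinite for a perfect fit).
        result[target] = math.inf if ss_res < 1e-12 * ss_tot else ss_tot / ss_res
    return result


# Invented data: FF_avg closely tracks CF_avg, hits_avg does not.
data = {
    "CF_avg": [50, 55, 60, 52, 58, 62, 48, 57],
    "FF_avg": [38, 41, 46, 39, 44, 47, 36, 43],
    "hits_avg": [22, 30, 25, 28, 21, 27, 29, 24],
}
vifs = vif(data)
flagged = {name: value > 10 for name, value in vifs.items()}
print(vifs, flagged)
```

On this invented data, CF_avg and FF_avg come out far above the cutoff while hits_avg stays well below it. An infinite VIF arises when a variable is an exact linear combination of the others. One possible source of this, going by the Corsi and Fenwick definitions, is that Corsi against minus Fenwick against equals the number of blocked shots.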

After wrangling and compiling our datasets, we start out with 75 variables!

Code, Findings and Visualizations - Logistic Regression

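library(dplyr)   # %>%, mutate(), select()
library(mctest)  # imcdiag()
library(GGally)  # ggpairs()

# Drop rows that contain missing values
nhl.na = na.omit(nhl.data)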
names(nhl.na)
##  [1] "X1"                         "game_id"                   
##  [3] "date"                       "home"                      
##  [5] "home_goals"                 "away"                      
##  [7] "away_goals"                 "hoa"                       
##  [9] "result"                     "result_bool"               
## [11] "Game"                       "Team"                      
## [13] "TOI"                        "CF_avg"                    
## [15] "CA_avg"                     "CF%_avg"                   
## [17] "FF_avg"                     "FA_avg"                    
## [19] "FF%_avg"                    "SF_avg"                    
## [21] "SA_avg"                     "SF%_avg"                   
## [23] "GF_avg"                     "GA_avg"                    
## [25] "GF%_avg"                    "xGF_avg"                   
## [27] "xGA_avg"                    "xGF%_avg"                  
## [29] "SCF_avg"                    "SCA_avg"                   
## [31] "SCF%_avg"                   "HDCF_avg"                  
## [33] "HDCA_avg"                   "HDCF%_avg"                 
## [35] "HDSF_avg"                   "HDSA_avg"                  
## [37] "HDSF%_avg"                  "HDGF_avg"                  
## [39] "HDGA_avg"                   "HDGF%_avg"                 
## [41] "HDSH%_avg"                  "HDSV%_avg"                 
## [43] "MDCF_avg"                   "MDCA_avg"                  
## [45] "MDCF%_avg"                  "MDSF_avg"                  
## [47] "MDSA_avg"                   "MDSF%_avg"                 
## [49] "MDGF_avg"                   "MDGA_avg"                  
## [51] "MDGF%_avg"                  "MDSH%_avg"                 
## [53] "MDSV%_avg"                  "LDCF_avg"                  
## [55] "LDCA_avg"                   "LDCF%_avg"                 
## [57] "LDSF_avg"                   "LDSA_avg"                  
## [59] "LDSF%_avg"                  "LDGF_avg"                  
## [61] "LDGA_avg"                   "LDGF%_avg"                 
## [63] "LDSH%_avg"                  "LDSV%_avg"                 
## [65] "SH%_avg"                    "SV%_avg"                   
## [67] "PDO_avg"                    "blocks_avg"                
## [69] "goals_avg"                  "shots_avg"                 
## [71] "hits_avg"                   "pim_avg"                   
## [73] "powerPlayOpportunities_avg" "powerPlayGoals_avg"        
## [75] "faceOffWinPercentage_avg"   "giveaways_avg"             
## [77] "takeaways_avg"
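# Keep only the averaged statistics (CF_avg through takeaways_avg)
nhl.stats = subset(nhl.na, select = CF_avg:takeaways_avg)

# Attach the response variable
nhl.stats = nhl.stats %>% mutate(result_bool = nhl.na$result_bool)
# Reduced data set with a subset of the statistics dropped
nhl.reduced = nhl.na %>% select(-c(CF_avg:`CF%_avg`, `FF%_avg`:`SF%_avg`, `GF%_avg`, `xGF%_avg`, `SCF%_avg`, HDCF_avg:`SV%_avg`, PDO_avg, shots_avg, goals_avg))
# VIF for every predictor in the full set of statistics
imcdiag(nhl.stats %>% select(-c(result_bool)), as.numeric(nhl.stats$result_bool), method = "VIF")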
## 
## Call:
## imcdiag(x = nhl.stats %>% select(-c(result_bool)), y = as.numeric(nhl.stats$result_bool), 
##     method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##                                    VIF detection
## CF_avg                        334.0870         1
## CA_avg                             Inf         1
## CF%_avg                       720.8852         1
## FF_avg                        307.5446         1
## FA_avg                             Inf         1
## FF%_avg                       864.1731         1
## SF_avg                        673.3095         1
## SA_avg                        257.0018         1
## SF%_avg                       442.7898         1
## GF_avg                        370.8073         1
## GA_avg                        129.6370         1
## GF%_avg                        12.9360         1
## xGF_avg                        57.3259         1
## xGA_avg                        50.0463         1
## xGF%_avg                      133.4764         1
## SCF_avg                            Inf         1
## SCA_avg                            Inf         1
## SCF%_avg                      130.8243         1
## HDCF_avg                           Inf         1
## HDCA_avg                           Inf         1
## HDCF%_avg                      84.7504         1
## HDSF_avg                       71.0278         1
## HDSA_avg                       55.0845         1
## HDSF%_avg                      52.8002         1
## HDGF_avg                       46.3906         1
## HDGA_avg                       45.8362         1
## HDGF%_avg                       7.5802         0
## HDSH%_avg                       7.6312         0
## HDSV%_avg                       8.6963         0
## MDCF_avg                           Inf         1
## MDCA_avg                           Inf         1
## MDCF%_avg                     135.9731         1
## MDSF_avg                       48.0604         1
## MDSA_avg                       35.0233         1
## MDSF%_avg                      39.2388         1
## MDGF_avg                       29.1327         1
## MDGA_avg                       27.7779         1
## MDGF%_avg                       7.5209         0
## MDSH%_avg                       9.7202         0
## MDSV%_avg                       8.7675         0
## LDCF_avg                      110.1810         1
## LDCA_avg                      103.3410         1
## LDCF%_avg                     182.2968         1
## LDSF_avg                       59.9849         1
## LDSA_avg                       75.5658         1
## LDSF%_avg                      66.1091         1
## LDGF_avg                       22.7655         1
## LDGA_avg                       29.1783         1
## LDGF%_avg                      11.2325         1
## LDSH%_avg                      14.4888         1
## LDSV%_avg                      15.0828         1
## SH%_avg                     56065.1337         1
## SV%_avg                     49254.8411         1
## PDO_avg                    122287.7386         1
## blocks_avg                         Inf         1
## goals_avg                     252.7855         1
## shots_avg                     496.7065         1
## hits_avg                        1.7558         0
## pim_avg                         1.5844         0
## powerPlayOpportunities_avg      1.6405         0
## powerPlayGoals_avg              2.2401         0
## faceOffWinPercentage_avg        1.4877         0
## giveaways_avg                   1.8786         0
## takeaways_avg                   1.6381         0
## 
## Multicollinearity may be due to CF_avg CA_avg CF%_avg FF_avg FA_avg FF%_avg SF_avg SA_avg SF%_avg GF_avg GA_avg GF%_avg xGF_avg xGA_avg xGF%_avg SCF_avg SCA_avg SCF%_avg HDCF_avg HDCA_avg HDCF%_avg HDSF_avg HDSA_avg HDSF%_avg HDGF_avg HDGA_avg MDCF_avg MDCA_avg MDCF%_avg MDSF_avg MDSA_avg MDSF%_avg MDGF_avg MDGA_avg LDCF_avg LDCA_avg LDCF%_avg LDSF_avg LDSA_avg LDSF%_avg LDGF_avg LDGA_avg LDGF%_avg LDSH%_avg LDSV%_avg SH%_avg SV%_avg PDO_avg blocks_avg goals_avg shots_avg regressors
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================
# Pairwise plots of the reduced predictors (columns X1 through TOI are dropped)
ggpairs(data = nhl.reduced %>% select(-c(X1:TOI)),
        lower = list(continuous = wrap("smooth_loess", alpha = 0.1, size = 0.5, color = 'blue'),
                     combo = "facethist",
                     discrete = "facetbar",
                     na = "na"))