Scoring first in the NHL



The data for this analysis were collected in another section; I separated the two to keep things a little shorter. I’m working with SAS University Edition, which includes a Jupyter Notebook and gives access to SAS either through Python or by using the %% cell magic to flag SAS code. I’ll primarily work with SAS code. Unfortunately, I can’t do all the steps in SAS University Edition, since I cannot install external libraries onto its virtual machine. If a library is pure Python, I could probably use it locally, but that is not the case with things like sklearn and matplotlib. I mostly wanted to use SAS for its multilevel modeling capabilities. Otherwise sklearn or statsmodels could be used.

I’ve stored the data as a csv file to load into SAS, but I’ll load it through a pandas data frame initially.

import saspy
import pandas as pd
import numpy as np
from IPython.display import HTML
import math
df = pd.read_csv(#rename if you need to
    ,index_col=0) 
df['notwon'] = np.where(df['won']==1,0,1)
df['penalty_cen'] = df['penaltycount']-3.98446
print(df.head())
print(df.describe())
print(df['team'].unique())
#to fix the arizona team
#df.loc[df['team']=='ARI','team'] = 'PHX'

Create a SAS session and load the data frame into SAS for analysis.

sas_session = saspy.SASsession()
sas_session.saslib("nhl",path="/folders/myfolders/nhl/Update20180709")
sas_session.teach_me_SAS(False)
games = sas_session.df2sd(df,table='games',libref='nhl')
print(games.describe())
games.head()
Using SAS Config named: default
SAS Connection established. Subprocess id is 5436


27   
28   libname nhl    '/folders/myfolders/nhl/Update20180709'  ;
NOTE: Libref NHL was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders/nhl/Update20180709
29   
30   
        Variable      N  NMiss        Median          Mean        StdDev  \
0        gamekey  12356      0  2.016290e+07  5.380185e+07  7.256136e+07   
1          goals  12356      0  3.000000e+00  2.785529e+00  1.621079e+00   
2           home  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
3       overtime  12356      0  0.000000e+00  1.317579e-01  3.382410e-01   
4   penaltycount  12356      0  4.000000e+00  3.984461e+00  2.089983e+00   
5         period  12356      0  1.000000e+00  1.248786e+00  5.256075e-01   
6    scoredfirst  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
7   scoredsecond  12356      0  0.000000e+00  4.876174e-01  4.998669e-01   
8            won  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
9         notwon  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
10   penalty_cen  12356      0  1.554000e-02  9.906119e-07  2.089983e+00   

             Min           P25           P50           P75           Max  
0   201421.00000  2.015226e+07  2.016290e+07  2.018265e+07  2.018213e+08  
1        0.00000  2.000000e+00  3.000000e+00  4.000000e+00  1.000000e+01  
2        0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
3        0.00000  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  
4        0.00000  3.000000e+00  4.000000e+00  5.000000e+00  2.000000e+01  
5        1.00000  1.000000e+00  1.000000e+00  1.000000e+00  4.000000e+00  
6        0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
7        0.00000  0.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00  
8        0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
9        0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
10      -3.98446 -9.844600e-01  1.554000e-02  1.015540e+00  1.601554e+01  
gamekey goals home overtime penaltycount period scoredfirst scoredsecond team won notwon penalty_cen
0 201421 3 1 0 14 1 0 1 MTL 0 1 10.01554
1 201421 4 0 0 12 1 1 0 TOR 1 0 8.01554
2 201422 6 1 0 6 1 1 0 CHI 1 0 2.01554
3 201422 4 0 0 4 1 0 1 WSH 0 1 0.01554
4 201423 4 1 0 6 1 1 0 EDM 0 1 2.01554

One problem with the dataset is that the observations are not independent. Each game contributes two rows, one per team, and the value of won in one row determines it in the other: if you know who won, you know who lost. I won’t go into the details here, but I experimented with a couple of ways to approach this problem. First, I looked at a multilevel approach where games were nested within teams, a sort of cross-classification problem. This didn’t work out well because of the small number of observations per gamekey. The model would converge, but it often had trouble estimating the covariance matrix.

Instead I chose the alternative route of sampling one team from each game: stratified random sampling, with gamekey as the stratum.
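
The same one-team-per-game draw can also be sketched outside SAS. The snippet below is a minimal illustration using only the Python standard library, not the procedure used for the results in this post. The helper name sample_one_per_game, the seed value, and the toy rows are all made up for the example.

```python
import random
from collections import defaultdict

def sample_one_per_game(rows, seed=None):
    """Pick one row at random from each gamekey (the stratum)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    by_game = defaultdict(list)
    for row in rows:
        by_game[row["gamekey"]].append(row)
    # one draw per stratum, in order of gamekey
    return [rng.choice(by_game[key]) for key in sorted(by_game)]

# Toy rows (hypothetical values): two rows per game, one per team.
rows = [
    {"gamekey": 1, "team": "AAA", "won": 0},
    {"gamekey": 1, "team": "BBB", "won": 1},
    {"gamekey": 2, "team": "CCC", "won": 1},
    {"gamekey": 2, "team": "DDD", "won": 0},
]
sample = sample_one_per_game(rows, seed=2018)
print([r["team"] for r in sample])
```

With pandas 1.1 or later, df.groupby('gamekey').sample(n=1, random_state=...) should give an equivalent draw directly on the data frame.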

Regardless of the approach, the estimated parameters were usually close. Here is an example of a random sample and the full model.

def convertToProb(logit):
    odds = math.exp(logit)
    return odds/(1.0+odds)


%%SAS sas_session
proc sort data=nhl.games;
 by gamekey;
run;
Proc SurveySelect data=nhl.games out=nhl.gamesrand noprint Method = urs N = 1 outhits
    rep = 1;
    Strata gamekey;
run;
proc freq data=nhl.gamesrand;
    table won;
run;
proc print data=nhl.gamesrand(obs=10);
run;

The SAS System

The FREQ Procedure

won  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0    3093       50.06    3093                  50.06
1    3085       49.94    6178                  100.00

The SAS System

Obs gamekey Replicate goals home overtime penaltycount period scoredfirst scoredsecond team won notwon penalty_cen NumberHits ExpectedHits SamplingWeight
1 201421 1 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
2 201422 1 6 1 0 6 1 1 0 CHI 1 0 2.0155 1 0.5 2
3 201423 1 4 1 0 6 1 1 0 EDM 0 1 2.0155 1 0.5 2
4 201424 1 1 1 0 4 1 1 0 PHI 0 1 0.0155 1 0.5 2
5 201425 1 2 1 0 7 1 1 1 DET 1 0 3.0155 1 0.5 2
6 201426 1 6 1 0 8 1 1 1 COL 1 0 4.0155 1 0.5 2
7 201427 1 3 1 0 7 1 1 0 BOS 1 0 3.0155 1 0.5 2
8 201428 1 3 1 0 3 1 1 1 PIT 1 0 -0.9845 1 0.5 2
9 201429 1 4 0 0 5 1 1 1 CGY 0 1 1.0155 1 0.5 2
10 201521 1 3 1 0 2 1 0 1 TOR 0 1 -1.9845 1 0.5 2

About 50% of the sampled rows are wins, as expected: within each game there is a 50/50 chance of picking the winner or the loser.

Looking solely at the scoredfirst variable, we get an estimate of the effect that scoring first has on winning. This model predicts better than the intercept-only model. The odds ratio is pretty high at 4.6 (versus not scoring first). I find plugging in the estimates to be a useful exercise, and you can see this below. Not surprisingly, the predicted probability of about 68% is close to the proportion of games won by the team that scored first. This will change when we start to control for other factors. Next I will fit the full model.

%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') /param=ref;
model won(event='1')= scoredfirst/;
run; quit;

The SAS System

The LOGISTIC Procedure

Model Information
Data Set NHL.GAMESRAND
Response Variable won
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 6178
Number of Observations Used 6178
Response Profile
Ordered Value  won  Total Frequency
1              0    3071
2              1    3107

Probability modeled is won=1.

Class Level Information
Class Value Design Variables
scoredfirst 0 0
  1 1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 8566.317 7728.093
SC 8573.046 7741.551
-2 Log L 8564.317 7724.093
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 840.2236 1 <.0001
Score 820.9889 1 <.0001
Wald 782.1602 1 <.0001
Type 3 Analysis of Effects
Effect       DF  Wald Chi-Square  Pr > ChiSq
scoredfirst  1   782.1602         <.0001
Analysis of Maximum Likelihood Estimates
Parameter      DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept      1   -0.7489   0.0385          378.5074         <.0001
scoredfirst 1  1   1.5285    0.0547          782.1602         <.0001
Odds Ratio Estimates
Effect              Point Estimate  95% Wald Confidence Limits
scoredfirst 1 vs 0  4.611           4.143  5.133
Association of Predicted Probabilities and Observed Responses
Percent Concordant 46.5 Somers' D 0.365
Percent Discordant 10.1 Gamma 0.644
Percent Tied 43.4 Tau-a 0.182
Pairs 9541597 c 0.682
print("Probability of winning when scoring first: ")
print("{0:.2f}%".format(convertToProb(-.7489+1.5285)*100))
Probability of winning when scoring first: 
68.56%
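
The odds ratio in the output is just the exponentiated scoredfirst coefficient, and the intercept alone gives the probability for a team that did not score first. Here is a small sketch using the estimates printed above (the function name inv_logit is mine; it does the same job as convertToProb):

```python
import math

def inv_logit(logit):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

intercept = -0.7489     # from the PROC LOGISTIC output above
b_scoredfirst = 1.5285

odds_ratio = math.exp(b_scoredfirst)
p_first = inv_logit(intercept + b_scoredfirst)
p_not_first = inv_logit(intercept)

print("Odds ratio: {0:.3f}".format(odds_ratio))
print("P(win | scored first): {0:.2f}%".format(p_first * 100))
print("P(win | did not score first): {0:.2f}%".format(p_not_first * 100))
```

The two probabilities sum to slightly more than 100% here only because they come from a random draw of one team per game, not from both teams in the same games.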

The full model certainly performs better than the intercept-only model, and it converged properly. Overtime is the only non-significant factor, which suggests that one of the other variables carries much of the same information. That could be the period variable. It may be worth checking later which of the added variables overlap with overtime. As expected, our scoredfirst variable is important and significant. Penalties were centered around the mean of approximately 4 penalties (3.9…), so a penalty_cen of 0 indicates about 4 penalties.

%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;

The SAS System

The LOGISTIC Procedure

Model Information
Data Set NHL.GAMESRAND
Response Variable won
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 6178
Number of Observations Used 6178
Response Profile
Ordered Value  won  Total Frequency
1              0    3071
2              1    3107

Probability modeled is won=1.

Class Level Information
Class Value Design Variables
scoredfirst 0 0
  1 1
scoredsecond 0 0
  1 1
home 0 0
  1 1
overtime 0 0
  1 1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 8566.317 6727.321
SC 8573.046 6774.422
-2 Log L 8564.317 6713.321
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 1850.9960 6 <.0001
Score 1648.0410 6 <.0001
Wald 1235.9008 6 <.0001
Type 3 Analysis of Effects
Effect        DF  Wald Chi-Square  Pr > ChiSq
scoredfirst   1   870.7582         <.0001
scoredsecond  1   814.4736         <.0001
home          1   29.2964          <.0001
overtime      1   1.4692           0.2255
period        1   9.4676           0.0021
penalty_cen   1   10.0272          0.0015
Analysis of Maximum Likelihood Estimates
Parameter       DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept       1   -2.2164   0.1022          470.7539         <.0001
scoredfirst 1   1   1.9117    0.0648          870.7582         <.0001
scoredsecond 1  1   1.8564    0.0650          814.4736         <.0001
home 1          1   0.3241    0.0599          29.2964          <.0001
overtime 1      1   -0.1016   0.0838          1.4692           0.2255
period          1   0.1746    0.0567          9.4676           0.0021
penalty_cen     1   -0.0463   0.0146          10.0272          0.0015
Odds Ratio Estimates
Effect               Point Estimate  95% Wald Confidence Limits
scoredfirst 1 vs 0   6.764           5.958  7.680
scoredsecond 1 vs 0  6.401           5.635  7.271
home 1 vs 0          1.383           1.230  1.555
overtime 1 vs 0      0.903           0.767  1.065
period               1.191           1.065  1.331
penalty_cen          0.955           0.928  0.983
Association of Predicted Probabilities and Observed Responses
Percent Concordant 78.6 Somers' D 0.580
Percent Discordant 20.7 Gamma 0.584
Percent Tied 0.7 Tau-a 0.290
Pairs 9541597 c 0.790
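
The fit statistics in the two outputs are linked by simple arithmetic, which makes it easy to see how much the extra predictors help. The sketch below only reuses the -2 Log L values printed above. Comparing the full model directly against the scoredfirst-only model is my own addition, and it assumes both were fit to the same sample (the response profiles match).

```python
# -2 Log L values copied from the two PROC LOGISTIC outputs above
neg2ll_null = 8564.317    # intercept only
neg2ll_first = 7724.093   # intercept + scoredfirst
neg2ll_full = 6713.321    # intercept + all six predictors

def aic(neg2ll, n_params):
    """AIC = -2 log L + 2 * (number of estimated parameters)."""
    return neg2ll + 2 * n_params

# Likelihood-ratio chi-square = drop in -2 log L between nested models
lr_full_vs_null = neg2ll_null - neg2ll_full     # 6 degrees of freedom
lr_full_vs_first = neg2ll_first - neg2ll_full   # 5 degrees of freedom

print("LR, full vs intercept only: {0:.3f}".format(lr_full_vs_null))
print("LR, full vs scoredfirst only: {0:.3f}".format(lr_full_vs_first))
print("AIC, scoredfirst only: {0:.3f}".format(aic(neg2ll_first, 2)))
print("AIC, full model: {0:.3f}".format(aic(neg2ll_full, 7)))
```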

What is the probability of winning a game given these factors?

It can be broken up in many different ways.

Probability of winning all else being zero: 9.83%

Probability of winning when scoring first in the first period, not at home, no overtime, and ~4 penalties: 46.75%

Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 54.83%

Probability of winning when scoring first in the second period, not at home, no overtime, and ~4 penalties: 51.11%

Probability of winning when scoring first in the first period, scoring second, not at home, no overtime, and ~4 penalties: 84.89%

Probability of winning when scoring first in the second period, scoring second, not at home, no overtime, and ~4 penalties: 87.00%

print("Probability of winning all else being zero: ")
print("{0:.2f}%".format(convertToProb(-2.2164)*100))

print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+0.1746)*100))

print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+.3241+0.1746)*100))

print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+2*0.1746)*100))

print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+0.1746)*100))

print("Probability of winning when scoring first and second not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+2*0.1746)*100))
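
Summing coefficients by hand makes it easy to leave a term out or mislabel a scenario. One possible alternative is a small helper that builds the logit from named predictor values. This is only a sketch: the names COEF and win_probability are mine, and the coefficients are copied from the full-model output above.

```python
import math

# Estimates copied from the full-model PROC LOGISTIC output above
COEF = {
    "intercept": -2.2164,
    "scoredfirst": 1.9117,
    "scoredsecond": 1.8564,
    "home": 0.3241,
    "overtime": -0.1016,
    "period": 0.1746,
    "penalty_cen": -0.0463,
}

def win_probability(coef, **values):
    """Sum intercept + coefficient * value, then apply the inverse logit.

    Predictors that are not passed default to 0.
    """
    logit = coef["intercept"]
    for name, value in values.items():
        logit += coef[name] * value
    return 1.0 / (1.0 + math.exp(-logit))

scenarios = {
    "first goal, away, period 1": dict(scoredfirst=1, period=1),
    "first goal, home, period 1": dict(scoredfirst=1, home=1, period=1),
    "first two goals, away, period 1": dict(scoredfirst=1, scoredsecond=1, period=1),
}
for label, values in scenarios.items():
    print("{0}: {1:.2f}%".format(label, win_probability(COEF, **values) * 100))
```

Because each scenario names its predictors explicitly, the label and the arithmetic are harder to get out of sync.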

So when you control for the other factors, scoring first is no guarantee of winning. Scoring first in a home game gives you a slightly better than even chance. It isn’t until you also score the second goal that winning becomes much more likely than not. That makes sense, because it is harder to come back from two goals down than from one.

Ultimately, it is the team that scores the first two goals that tends to go on to win. So while scoring first is important, you absolutely need to step on the gas and get that second goal.

But this is just one sample from the population of games from 2013-2014 to 2017-2018. Could this just have been a fluke of the sample that was drawn? In fact, because I didn’t set the seed, if you rerun this code you will probably get slightly different estimates.

Let’s bootstrap the process and look at the sampling distribution of the estimates to get a better picture of the possible win probabilities. In SAS this can be done efficiently by drawing all the replicates in one PROC SURVEYSELECT call and then using a BY statement in PROC LOGISTIC.

%%SAS sas_session
proc sort data=nhl.games;
 by gamekey;
run;
proc surveyselect data=nhl.games
     out=nhl.gamessamp
     noprint Method = urs N = 1 outhits
     reps=100;  /* generate this many bootstrap resamples */
     Strata gamekey;
run;

proc print data=nhl.gamessamp(obs=10);
run;

The SAS System

Obs gamekey Replicate goals home overtime penaltycount period scoredfirst scoredsecond team won notwon penalty_cen NumberHits ExpectedHits SamplingWeight
1 201421 1 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
2 201421 2 4 0 0 12 1 1 0 TOR 1 0 8.0155 1 0.5 2
3 201421 3 4 0 0 12 1 1 0 TOR 1 0 8.0155 1 0.5 2
4 201421 4 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
5 201421 5 4 0 0 12 1 1 0 TOR 1 0 8.0155 1 0.5 2
6 201421 6 4 0 0 12 1 1 0 TOR 1 0 8.0155 1 0.5 2
7 201421 7 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
8 201421 8 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
9 201421 9 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
10 201421 10 3 1 0 14 1 0 1 MTL 0 1 10.0155 1 0.5 2
%%SAS sas_session
proc sort data=nhl.gamessamp;
 by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.estimates;
proc logistic data=nhl.gamessamp;
by replicate;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
ods trace off;

I cleared the outputs so you don’t have to see each iteration of the logistic step. Instead we can summarize the estimates visually or by the mean of their sampling distribution.

%%SAS sas_session
proc sql;
    select Variable, mean(Estimate) as MeanEst
      from nhl.estimates
      group by variable;
quit;

The SAS System

Variable MeanEst
Intercept -2.14334
home 0.342078
overtime -0.01201
penalty_cen -0.02247
period 0.10468
scoredfirst 1.90498
scoredsecond 1.836549
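
The mean is only one summary of the replicate estimates; their spread matters too. Below is a minimal sketch of how the per-replicate estimates for one variable could be summarized in Python once they have been pulled out of SAS. The function name summarize and the example values are made up for illustration; the values are not taken from nhl.estimates.

```python
import statistics

def summarize(estimates, lower=0.025, upper=0.975):
    """Mean, standard deviation, and a simple percentile interval."""
    ordered = sorted(estimates)
    n = len(ordered)
    # index of the nearest order statistic, clamped to the valid range
    lo_idx = max(0, min(n - 1, int(round(lower * (n - 1)))))
    hi_idx = max(0, min(n - 1, int(round(upper * (n - 1)))))
    return {
        "mean": statistics.mean(ordered),
        "sd": statistics.stdev(ordered),
        "interval": (ordered[lo_idx], ordered[hi_idx]),
    }

# Made-up replicate estimates for illustration only;
# the real values live in nhl.estimates.
fake_scoredfirst = [1.84, 1.88, 1.90, 1.91, 1.93, 1.95, 1.97]
print(summarize(fake_scoredfirst))
```

With saspy, something like sas_session.sd2df('estimates', libref='nhl') should bring the real estimates into a pandas data frame, though I have not run that step here.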
%%SAS sas_session
%MACRO histograms(curvar=);
    %let titlevar = &curvar;
    proc template;
        define statgraph Histogram;
            begingraph;
                entrytitle "Histogram of &titlevar ";
                layout lattice/columns=1;
                    layout overlay;
                        histogram Estimate;
                    endlayout;
                    layout overlay;
                        boxplot y=Estimate / orient=horizontal;
                    endlayout;
                endlayout;
            endgraph;
        end;
    run;

    proc sgrender data=nhl.estimates(where=(variable=&curvar)) template = Histogram;
    run;
%MEND histograms;
%histograms(curvar='scoredfirst')
%histograms(curvar='scoredsecond')
%histograms(curvar='home')

[Figures: histogram and box plot of the bootstrap estimates for scoredfirst, scoredsecond, and home]

As expected, the sampling distributions look roughly normal. The means are not far off from the original estimates. The probabilities can now be recalculated.

Variable Value
Intercept -2.14334
home 0.342078
overtime -0.01201
penalty_cen -0.02247
period 0.10468
scoredfirst 1.90498
scoredsecond 1.836549

Probability of winning at home before the first period 13.14%

Probability of winning not at home before the first period 9.70%

Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: 46.66%

Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 55.19%

Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: 49.28%

Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: 84.59%

Probability of winning when scoring first and second at home, in the first period, no overtime, and ~4 penalties: 88.54%

print("Probability of winning at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+0.342078+(3.9*-0.02247))*100))

print("Probability of winning not at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+(3.9*-0.02247))*100))

print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.10468)*100))

print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+0.10468)*100))

print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+2*0.10468)*100))

print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+1.836549+0.10468)*100))

print("Probability of winning when scoring first and second at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+1.836549+0.10468)*100))

Something else to consider in terms of independence is the relationship between games and teams. Wins might be considered nested within a team: a team playing well may win several games in a row, or something along those lines. We can examine this by looking at the intraclass correlation (ICC).

%%SAS sas_session
proc nlmixed data=nhl.gamesrand;
    ods output ParameterEstimates=p1; 
    parms gamma00 = -1 tau00=1;
    gamma0j = gamma00 + u0j;
    eta = gamma0j;
    expEta = exp(eta);
    pij = expEta/(1+expEta);
    model scoredfirst ~ binary(pij);
    random u0j ~ normal([0],[tau00]) subject=team OUT=randout;
    estimate 'ICC' tau00/(tau00+3.29);
run;

The SAS System

The NLMIXED Procedure

Specifications
Data Set NHL.GAMESRAND
Dependent Variable scoredfirst
Distribution for Dependent Variable Binary
Random Effects u0j
Distribution for Random Effects Normal
Subject Variable team
Optimization Technique Dual Quasi-Newton
Integration Method Adaptive Gaussian Quadrature
Dimensions
Observations Used 6178
Observations Not Used 0
Total Observations 6178
Subjects 31
Max Obs per Subject 238
Parameters 2
Quadrature Points 1
Initial Parameters
gamma00  tau00  Negative Log Likelihood
-1       1      4329.25972
Iteration History
Iteration  Calls  Negative Log Likelihood  Difference  Maximum Gradient  Slope
1 6 4314.1825 15.07724 14.4613 -151.871
2 16 4302.8614 11.32104 41.4893 -34.2401
3 22 4302.2958 0.565654 53.4288 -60.7572
4 29 4298.0747 4.22112 184.088 -20.6365
5 36 4280.8413 17.23342 415.583 -32.7086
6 40 4278.4399 2.401311 36.7201 -156.999
7 43 4278.3651 0.074847 34.0781 -0.19067
8 45 4278.2916 0.073493 8.94353 -0.09359
9 48 4278.2854 0.006165 2.81772 -0.01572
10 51 4278.2849 0.0005 0.078424 -0.00096
11 54 4278.2849 9.217E-7 0.002819 -1.92E-6
NOTE: GCONV convergence criterion satisfied.
Fit Statistics
-2 Log Likelihood 8556.6
AIC (smaller is better) 8560.6
AICC (smaller is better) 8560.6
BIC (smaller is better) 8563.4
Parameter Estimates
Parameter  Estimate  Standard Error  DF  t Value  Pr > |t|  95% Confidence Limits  Gradient
gamma00    -0.00313  0.03522         30  -0.09    0.9298    -0.07505  0.06879      -0.00232
tau00      0.01792   0.009702        30  1.85     0.0747    -0.00190  0.03773      0.002819
Additional Estimates
Label  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower     Upper
ICC    0.005416  0.002917        30  1.86     0.0732    0.05   -0.00054  0.01137

The ICC is about 0.005, which is extremely low: almost none of the variation in the response (scoredfirst in the model above) is attributable to the grouping by team, which suggests that outcomes in this sample are barely affected by team membership. We could still use a multilevel model to examine the random effect associated with each team, and that is shown below. Its fixed-effect estimates are very similar to what we were seeing with the bootstrapped model.
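
The 3.29 in the ESTIMATE statement is the variance of the standard logistic distribution, pi squared over 3, which stands in for the level-1 residual variance in a logit model. A quick check of the ICC arithmetic, using the tau00 estimate from the output above (the variable names are mine):

```python
import math

tau00 = 0.01792                 # between-team variance from PROC NLMIXED
level1_var = math.pi ** 2 / 3   # variance of the standard logistic, about 3.29

icc = tau00 / (tau00 + level1_var)
print("ICC: {0:.4f}".format(icc))
```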

Were any teams better at turning a first goal into a win? To answer that I’ll use a different approach.

%%SAS sas_session
proc glimmix data=nhl.gamesrand method=quad;
    class team;
    ODS OUTPUT SOLUTIONR=R;
    model won(event='1')= scoredfirst scoredsecond home overtime period penaltycount/dist=binary solution oddsratio ddfm=bw;
    nloptions maxiter=10000;
    random intercept scoredfirst /subject=team type=chol g solution;
run;

The SAS System

The GLIMMIX Procedure

Model Information
Data Set NHL.GAMESRAND
Response Variable won
Response Distribution Binary
Link Function Logit
Variance Function Default
Variance Matrix Blocked By team
Estimation Technique Maximum Likelihood
Likelihood Approximation Gauss-Hermite Quadrature
Degrees of Freedom Method Between-Within
Class Level Information
Class Levels Values
team 31 ANA BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LA MIN MTL NJ NSH NYI NYR OTT PHI PHX PIT SJ STL TB TOR VAN VGK WPG WSH
Number of Observations Read 6178
Number of Observations Used 6178
Response Profile
Ordered Value  won  Total Frequency
1              0    3071
2              1    3107
The GLIMMIX procedure is modeling the probability that won='1'.
Dimensions
G-side Cov. Parameters 3
Columns in X 7
Columns in Z per Subject 2
Subjects (Blocks in V) 31
Max Obs per Subject 238
Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 10
Lower Boundaries 2
Upper Boundaries 0
Fixed Effects Not Profiled
Starting From GLM estimates
Quadrature Points 1
Iteration History
Iteration  Restarts  Evaluations  Objective Function  Change  Max Gradient
0 0 4 6711.4553725 . 60.84809
1 0 3 6710.7871203 0.66825213 263.1968
2 0 3 6706.5499723 4.23714802 148.9014
3 0 2 6703.3524844 3.19748792 159.9946
4 0 3 6703.2107923 0.14169209 147.1521
5 0 2 6702.940897 0.26989531 130.1817
6 0 3 6702.7805741 0.16032292 116.9171
7 0 2 6702.4437877 0.33678642 48.60881
8 0 3 6702.1046842 0.33910344 5.312724
9 0 3 6702.043703 0.06098120 12.1805
10 0 3 6702.0167753 0.02692775 13.42428
11 0 3 6702.0127666 0.00400870 12.09659
12 0 4 6701.999187 0.01357963 0.689931
13 0 3 6701.9990994 0.00008755 0.015468
14 0 3 6701.9990992 0.00000016 0.000821
Convergence criterion (GCONV=1E-8) satisfied.

Estimated G matrix is not positive definite.

Fit Statistics
-2 Log Likelihood 6702.00
AIC (smaller is better) 6720.00
AICC (smaller is better) 6720.03
BIC (smaller is better) 6732.90
CAIC (smaller is better) 6741.90
HQIC (smaller is better) 6724.21
Fit Statistics for Conditional Distribution
-2 log L(won | r. effects) 6663.54
Pearson Chi-Square 6138.44
Pearson Chi-Square / DF 0.99
Estimated G Matrix
Effect Row Col1 Col2
Intercept 1 0.02931 0.000438
scoredfirst 2 0.000438 6.533E-6
Covariance Parameter Estimates
Cov Parm   Subject  Estimate  Standard Error
CHOL(1,1)  team     0.1712    0.05534
CHOL(2,1)  team     0.002556  0.07133
CHOL(2,2)  team     0         .
Solutions for Fixed Effects
Effect  Estimate  Standard Error  DF  t Value  Pr > |t|
Intercept -2.0111 0.1244 30 -16.17 <.0001
scoredfirst 1.9085 0.06506 6141 29.34 <.0001
scoredsecond 1.8539 0.06533 6141 28.38 <.0001
home 0.3270 0.06014 6141 5.44 <.0001
overtime -0.09928 0.08424 6141 -1.18 0.2386
period 0.1754 0.05700 6141 3.08 0.0021
penaltycount -0.05052 0.01484 6141 -3.40 0.0007
Odds Ratio Estimates
scoredfirst scoredsecond home overtime period penaltycount _scoredfirst _scoredsecond _home _overtime _period _penaltycount Estimate DF 95% Confidence Limits
1.4989 0.4909 0.5034 0.1318 1.2488 3.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 6.743 6141 5.936 7.661
0.4989 1.4909 0.5034 0.1318 1.2488 3.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 6.385 6141 5.617 7.257
0.4989 0.4909 1.5034 0.1318 1.2488 3.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 1.387 6141 1.233 1.560
0.4989 0.4909 0.5034 1.1318 1.2488 3.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 0.905 6141 0.768 1.068
0.4989 0.4909 0.5034 0.1318 2.2488 3.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 1.192 6141 1.066 1.333
0.4989 0.4909 0.5034 0.1318 1.2488 4.9908 0.4989 0.4909 0.5034 0.1318 1.2488 3.9908 0.951 6141 0.923 0.979
Effects of continuous variables are assessed as one unit offsets from the mean. The AT suboption modifies the reference value and the UNIT suboption modifies the offsets.
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
scoredfirst 1 6141 860.64 <.0001
scoredsecond 1 6141 805.20 <.0001
home 1 6141 29.57 <.0001
overtime 1 6141 1.39 0.2386
period 1 6141 9.47 0.0021
penaltycount 1 6141 11.58 0.0007
Solution for Random Effects
Effect Subject Estimate Std Err Pred DF t Value Pr > |t|
Intercept team ANA 0.1035 0.1275 6171 0.81 0.4169
scoredfirst team ANA 0.001546 0.04294 6171 0.04 0.9713
Intercept team BOS 0.1345 0.1305 6171 1.03 0.3027
scoredfirst team BOS 0.002008 0.05592 6171 0.04 0.9714
Intercept team BUF -0.2700 0.1522 6171 -1.77 0.0761
scoredfirst team BUF -0.00403 0.1123 6171 -0.04 0.9714
Intercept team CAR -0.2826 0.1553 6171 -1.82 0.0690
scoredfirst team CAR -0.00422 0.1174 6171 -0.04 0.9713
Intercept team CBJ 0.008767 0.1201 6171 0.07 0.9418
scoredfirst team CBJ 0.000131 0.003898 6171 0.03 0.9732
Intercept team CGY 0.08337 0.1249 6171 0.67 0.5046
scoredfirst team CGY 0.001245 0.03460 6171 0.04 0.9713
Intercept team CHI 0.1148 0.1248 6171 0.92 0.3576
scoredfirst team CHI 0.001714 0.04811 6171 0.04 0.9716
Intercept team COL -0.1140 0.1241 6171 -0.92 0.3584
scoredfirst team COL -0.00170 0.04749 6171 -0.04 0.9714
Intercept team DAL -0.04652 0.1179 6171 -0.39 0.6932
scoredfirst team DAL -0.00069 0.01952 6171 -0.04 0.9716
Intercept team DET -0.09946 0.1226 6171 -0.81 0.4173
scoredfirst team DET -0.00149 0.04182 6171 -0.04 0.9717
Intercept team EDM -0.06339 0.1213 6171 -0.52 0.6013
scoredfirst team EDM -0.00095 0.02658 6171 -0.04 0.9716
Intercept team FLA -0.03694 0.1195 6171 -0.31 0.7573
scoredfirst team FLA -0.00055 0.01543 6171 -0.04 0.9715
Intercept team LA -0.05814 0.1219 6171 -0.48 0.6333
scoredfirst team LA -0.00087 0.02468 6171 -0.04 0.9719
Intercept team MIN 0.1063 0.1267 6171 0.84 0.4014
scoredfirst team MIN 0.001587 0.04433 6171 0.04 0.9714
Intercept team MTL 0.01175 0.1219 6171 0.10 0.9232
scoredfirst team MTL 0.000175 0.005245 6171 0.03 0.9733
Intercept team NJ -0.1392 0.1221 6171 -1.14 0.2540
scoredfirst team NJ -0.00208 0.05832 6171 -0.04 0.9716
Intercept team NSH 0.1093 0.1279 6171 0.85 0.3929
scoredfirst team NSH 0.001632 0.04540 6171 0.04 0.9713
Intercept team NYI -0.03941 0.1202 6171 -0.33 0.7430
scoredfirst team NYI -0.00059 0.01672 6171 -0.04 0.9719
Intercept team NYR 0.1213 0.1217 6171 1.00 0.3193
scoredfirst team NYR 0.001810 0.05085 6171 0.04 0.9716
Intercept team OTT 0.05774 0.1216 6171 0.47 0.6349
scoredfirst team OTT 0.000862 0.02421 6171 0.04 0.9716
Intercept team PHI -0.06709 0.1233 6171 -0.54 0.5863
scoredfirst team PHI -0.00100 0.02831 6171 -0.04 0.9718
Intercept team PHX -0.2198 0.1461 6171 -1.50 0.1324
scoredfirst team PHX -0.00328 0.09139 6171 -0.04 0.9714
Intercept team PIT 0.1195 0.1313 6171 0.91 0.3629
scoredfirst team PIT 0.001784 0.04981 6171 0.04 0.9714
Intercept team SJ 0.03619 0.1214 6171 0.30 0.7656
scoredfirst team SJ 0.000540 0.01527 6171 0.04 0.9718
Intercept team STL 0.1790 0.1342 6171 1.33 0.1825
scoredfirst team STL 0.002672 0.07458 6171 0.04 0.9714
Intercept team TB 0.1397 0.1292 6171 1.08 0.2797
scoredfirst team TB 0.002086 0.05817 6171 0.04 0.9714
Intercept team TOR -0.06816 0.1281 6171 -0.53 0.5947
scoredfirst team TOR -0.00102 0.02818 6171 -0.04 0.9712
Intercept team VAN -0.08880 0.1202 6171 -0.74 0.4600
scoredfirst team VAN -0.00133 0.03721 6171 -0.04 0.9716
Intercept team VGK 0.01550 0.1552 6171 0.10 0.9205
scoredfirst team VGK 0.000231 0.007014 6171 0.03 0.9737
Intercept team WPG 0.08358 0.1203 6171 0.69 0.4873
scoredfirst team WPG 0.001248 0.03502 6171 0.04 0.9716
Intercept team WSH 0.1675 0.1264 6171 1.33 0.1850
scoredfirst team WSH 0.002501 0.07022 6171 0.04 0.9716
%%SAS sas_session
proc sql;
    select Subject, (Estimate+1.9085) as ScoredFirst
      from r
      where Effect="scoredfirst"
        order by ScoredFirst;
quit;

The SAS System

Subject ScoredFirst
team CAR 1.904281
team BUF 1.904468
team PHX 1.905218
team NJ 1.906421
team COL 1.906798
team DET 1.907015
team VAN 1.907174
team TOR 1.907482
team PHI 1.907498
team EDM 1.907554
team LA 1.907632
team DAL 1.907805
team NYI 1.907912
team FLA 1.907948
team CBJ 1.908631
team MTL 1.908675
team VGK 1.908731
team SJ 1.90904
team OTT 1.909362
team CGY 1.909745
team WPG 1.909748
team ANA 1.910046
team MIN 1.910087
team NSH 1.910132
team CHI 1.910214
team PIT 1.910284
team NYR 1.91031
team BOS 1.910508
team TB 1.910586
team WSH 1.911001
team STL 1.911172
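
To get a feel for how small these team-to-team differences are, the slopes can be put on the odds-ratio scale. The sketch below uses only the lowest and highest values from the table above. It is an illustration of the size of the spread, not a formal test.

```python
import math

# Lowest and highest team-specific slopes from the table above
slopes = {"CAR": 1.904281, "STL": 1.911172}

odds_ratios = {team: math.exp(b) for team, b in slopes.items()}
for team, oratio in odds_ratios.items():
    print("{0}: odds ratio {1:.3f}".format(team, oratio))

spread = odds_ratios["STL"] - odds_ratios["CAR"]
print("Spread between extremes: {0:.3f}".format(spread))
```

A spread this small is consistent with the GLIMMIX output, where the random-slope variance for scoredfirst is estimated at essentially zero and the G matrix is flagged as not positive definite.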

To break up the effect for each team, I will run the logistic regression by team.

%%SAS sas_session
proc sort data=nhl.games;
 by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.teamestimates;
proc logistic data=nhl.games;
by team;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
ods trace off;

%%SAS sas_session
proc print data=nhl.teamestimates(obs=10);
run;

The SAS System

Obs team Variable ClassVal0 DF Estimate StdErr WaldChiSq ProbChiSq _ESTTYPE_
1 ANA Intercept   1 -1.9796 0.4050 23.8903 <.0001 MLE
2 ANA scoredfirst 1 1 1.5697 0.2389 43.1713 <.0001 MLE
3 ANA scoredsecond 1 1 1.5974 0.2415 43.7494 <.0001 MLE
4 ANA home 1 1 0.6988 0.2314 9.1181 0.0025 MLE
5 ANA overtime 1 1 -0.5108 0.3259 2.4570 0.1170 MLE
6 ANA period   1 0.3861 0.2362 2.6721 0.1021 MLE
7 ANA penalty_cen   1 -0.00960 0.0528 0.0331 0.8556 MLE
8 BOS Intercept   1 -2.4694 0.4521 29.8340 <.0001 MLE
9 BOS scoredfirst 1 1 2.1861 0.2719 64.6617 <.0001 MLE
10 BOS scoredsecond 1 1 1.8727 0.2724 47.2734 <.0001 MLE
%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="scoredfirst"
        order by Estimate;
quit;

The SAS System

Variable team Estimate Pr > Chi-Square
scoredfirst NYI 1.1333 <.0001
scoredfirst VAN 1.3658 <.0001
scoredfirst PIT 1.5193 <.0001
scoredfirst ANA 1.5697 <.0001
scoredfirst LA 1.6351 <.0001
scoredfirst DET 1.6427 <.0001
scoredfirst PHX 1.6749 <.0001
scoredfirst CBJ 1.7355 <.0001
scoredfirst EDM 1.7418 <.0001
scoredfirst NSH 1.7765 <.0001
scoredfirst DAL 1.8462 <.0001
scoredfirst OTT 1.8750 <.0001
scoredfirst PHI 1.8914 <.0001
scoredfirst CGY 1.8929 <.0001
scoredfirst BUF 1.9120 <.0001
scoredfirst FLA 1.9808 <.0001
scoredfirst MIN 2.0916 <.0001
scoredfirst TOR 2.1204 <.0001
scoredfirst CHI 2.1262 <.0001
scoredfirst WPG 2.1340 <.0001
scoredfirst STL 2.1541 <.0001
scoredfirst SJ 2.1664 <.0001
scoredfirst NJ 2.1830 <.0001
scoredfirst BOS 2.1861 <.0001
scoredfirst TB 2.1991 <.0001
scoredfirst COL 2.2399 <.0001
scoredfirst CAR 2.2597 <.0001
scoredfirst NYR 2.2732 <.0001
scoredfirst MTL 2.3390 <.0001
scoredfirst VGK 2.3544 0.0002
scoredfirst WSH 2.5618 <.0001

Surprisingly, the Vegas Golden Knights rank near the top here. The Knights were really good at winning after scoring first (about a 73% probability of winning). But maybe it isn’t so surprising: there is only one season to go on, and they made it to the Stanley Cup Final, meaning they won a lot of games that season.

Another consideration: many of the teams most likely to win after scoring first have some of the best goalies in the league: Holtby, Fleury, Bishop/Vasilevskiy (for some of TB’s seasons), Price, and Lundqvist. A good goalie likely makes it even harder for the opponent to score the equalizer and then a second goal. That first goal also takes some pressure off the goalie. It probably also indicates more offensive zone possession, meaning the team is keeping the puck away from its own goal. VGK had Fleury, who played phenomenally throughout the regular season and playoffs.

What about COL and CAR? That seems to make less sense. But going back to the section about downloading the data, I reviewed the players who scored or assisted on the first goal. Colorado had RANTANEN and Carolina had STAAL pop up in different seasons. So it may just be that, for teams that win less often, the wins they did get tended to come in games where they scored first.
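The scoredfirst estimates in the table are on the log-odds scale, which makes the gaps between teams hard to read. A minimal sketch, assuming the usual logistic regression interpretation, exponentiates a few of them into odds ratios. The helper name `odds_ratio` and the choice of four teams are mine, for illustration only.

```python
import math

# scoredfirst estimates (log-odds) copied from the table above
scoredfirst = {"NYI": 1.1333, "PIT": 1.5193, "VGK": 2.3544, "WSH": 2.5618}

def odds_ratio(log_odds):
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(log_odds)

# Print teams from the smallest to the largest first-goal effect
for team, est in sorted(scoredfirst.items(), key=lambda kv: kv[1]):
    print("{0}: odds ratio {1:.2f}".format(team, odds_ratio(est)))
```

Read this way, scoring first multiplies the odds of winning by roughly 3 for NYI and roughly 13 for WSH, holding the other variables in the model fixed.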

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where team="VGK"
        order by Estimate;
run;
quit;

The SAS System

Variable team Estimate Pr > Chi-Square
Intercept VGK -1.8942 0.0563
penalty_cen VGK -0.1949 0.3537
period VGK 0.0616 0.9138
overtime VGK 0.3167 0.6663
home VGK 0.6651 0.2225
scoredsecond VGK 1.3322 0.0316
scoredfirst VGK 2.3544 0.0002

# The penalty_cen term is entered once (penalty_cen = 1), i.e. one penalty above the mean of 3.98
print("VGK Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+2.3544+.6651+.0616-.1949)*100))
print("VGK Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+1.3322+.6651+.0616-.1949)*100))
VGK Probability of winning after scoring first in the first period at home with about 5 penalties
72.95%
VGK Probability of winning after scoring second in the first period at home with about 5 penalties
49.25%
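
The arithmetic in that cell sums the relevant coefficients into a linear predictor and passes it through the inverse logit. `convertToProb` is defined elsewhere in the notebook and is not shown here, so the sketch below uses a hypothetical stand-in, `logit_to_prob`, plus a helper `win_probability` that I made up for illustration. It assumes overtime = 0 and uses the same inputs as the cell above (home = 1, period = 1, penalty_cen = 1).

```python
import math

def logit_to_prob(log_odds):
    """Inverse logit: map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# VGK coefficients copied from the table above (overtime left out, i.e. overtime = 0)
vgk = {
    "Intercept": -1.8942,
    "scoredfirst": 2.3544,
    "scoredsecond": 1.3322,
    "home": 0.6651,
    "period": 0.0616,
    "penalty_cen": -0.1949,
}

def win_probability(coefs, scoredfirst, scoredsecond, home, period, penalty_cen):
    """Sum the linear predictor for one scenario and convert it to a probability."""
    eta = (coefs["Intercept"]
           + coefs["scoredfirst"] * scoredfirst
           + coefs["scoredsecond"] * scoredsecond
           + coefs["home"] * home
           + coefs["period"] * period
           + coefs["penalty_cen"] * penalty_cen)
    return logit_to_prob(eta)

p_first = win_probability(vgk, 1, 0, 1, 1, 1)   # scored first, not second
p_second = win_probability(vgk, 0, 1, 1, 1, 1)  # scored second, not first
print("{0:.2f}%".format(p_first * 100))
print("{0:.2f}%".format(p_second * 100))
```

Naming each input makes it easier to see which scenario a probability belongs to than a bare sum of numbers does.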
%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="scoredsecond"
        order by Estimate;
run;
quit;

The SAS System

Variable team Estimate Pr > Chi-Square
scoredsecond VGK 1.3322 0.0316
scoredsecond NYI 1.4828 <.0001
scoredsecond MIN 1.4955 <.0001
scoredsecond NSH 1.5128 <.0001
scoredsecond PHX 1.5876 <.0001
scoredsecond EDM 1.5884 <.0001
scoredsecond ANA 1.5974 <.0001
scoredsecond SJ 1.6608 <.0001
scoredsecond PIT 1.7099 <.0001
scoredsecond CBJ 1.7114 <.0001
scoredsecond OTT 1.7161 <.0001
scoredsecond FLA 1.7174 <.0001
scoredsecond VAN 1.7430 <.0001
scoredsecond DAL 1.8158 <.0001
scoredsecond BOS 1.8727 <.0001
scoredsecond DET 1.8870 <.0001
scoredsecond BUF 1.8951 <.0001
scoredsecond CHI 1.9125 <.0001
scoredsecond NYR 1.9689 <.0001
scoredsecond CAR 1.9849 <.0001
scoredsecond WSH 2.0039 <.0001
scoredsecond PHI 2.0239 <.0001
scoredsecond STL 2.0241 <.0001
scoredsecond COL 2.0283 <.0001
scoredsecond CGY 2.1101 <.0001
scoredsecond TB 2.1191 <.0001
scoredsecond LA 2.1662 <.0001
scoredsecond TOR 2.1773 <.0001
scoredsecond WPG 2.1887 <.0001
scoredsecond MTL 2.2306 <.0001
scoredsecond NJ 2.2486 <.0001

Interestingly, the ranking for scoring second is flipped a little. VGK is now at the bottom, though again this is hard to interpret because it is based only on their one great season. NJ was good at sealing the deal when they got the second goal, even without the first. Pittsburgh seems to be the biggest outlier: they rank low on both the first-goal and second-goal effects, yet their win probability after scoring first is still over 50%. Another big factor for them seems to be home ice advantage, along with holding onto the win even when they scored in the first period.

There are other things I’m not controlling for that could be influencing the result, such as each team’s even-strength goals and penalty-killing percentage.

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where team="PIT"
        order by Estimate;
run;
quit;

The SAS System

Variable team Estimate Pr > Chi-Square
Intercept PIT -2.0994 <.0001
overtime PIT -0.0818 0.8050
penalty_cen PIT -0.0151 0.7778
period PIT 0.2954 0.1847
home PIT 0.9300 <.0001
scoredfirst PIT 1.5193 <.0001
scoredsecond PIT 1.7099 <.0001
# The penalty_cen term is entered once (penalty_cen = 1), i.e. one penalty above the mean of 3.98
print("Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.5193+.9300+.2954-.0151)*100))
print("Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.7099+.9300+.2954-.0151)*100))
Probability of winning after scoring first in the first period at home with about 5 penalties
65.25%
Probability of winning after scoring second in the first period at home with about 5 penalties
69.44%

Looking at home ice advantage when controlling for overtime and who scores first is interesting. Some teams actually have a negative or non-significant effect, while others, like PIT, have a large one. Some of the others are not so surprising: Toronto and Philadelphia (if I were a ref I’d be intimidated in Philadelphia too). Tampa Bay is surprising to me, given the other teams in the top spots. Certainly, they have a great fan base (I’m biased), but one of the problems is out-of-state fans coming in. It isn’t surprising to see more red sweaters than blue sweaters when, say, the Blackhawks come through. I wonder, though, if the ice has something to do with this. It can be hot and humid in Tampa, and the quality of the ice may be a factor, although it generally sits at the right temperature. Working against that argument, PHX is hot too but doesn’t rank high in the home ice effect. Ignoring significance, PHX has more of an effect than BOS, where I would have expected the opposite.

Perhaps more importantly, the top teams are also the teams that simply won more games over the seasons analyzed, and would therefore have more home wins.

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="home"
        order by Estimate;
run;
quit;

The SAS System

Variable team Estimate Pr > Chi-Square
home MTL -0.0638 0.7957
home NYI -0.0379 0.8621
home DET 0.0381 0.8691
home SJ 0.0876 0.7082
home VAN 0.0998 0.6632
home COL 0.1336 0.5819
home CGY 0.1383 0.5713
home MIN 0.1708 0.4756
home CHI 0.1838 0.4493
home BOS 0.1872 0.4215
home OTT 0.1959 0.3941
home CBJ 0.2357 0.3013
home BUF 0.2436 0.3206
home NYR 0.2618 0.2845
home WPG 0.2762 0.2346
home STL 0.3251 0.1768
home FLA 0.3905 0.0989
home PHX 0.3960 0.0867
home EDM 0.4256 0.0643
home CAR 0.4366 0.0673
home NJ 0.4426 0.0706
home LA 0.4487 0.0591
home NSH 0.4628 0.0445
home DAL 0.4668 0.0447
home WSH 0.5049 0.0427
home VGK 0.6651 0.2225
home ANA 0.6988 0.0025
home PHI 0.7184 0.0026
home TB 0.8152 0.0010
home TOR 0.8890 0.0006
home PIT 0.9300 <.0001

One final look at a different component of home ice advantage is suggested in Soccernomics: that the home influence has nothing to do with time changes, jet lag, or how the fans affect the players, but rather with how the fans affect the referees. To look at that, I will model the relationship between penalties and playing at home. This is just a simple model and, as you will see in the results, it doesn’t explain the variation in penalty count very well. Still, it can tell us the overall difference in penalties between the home team and the away team.

%%SAS sas_session
proc reg data=nhl.gamesrand;

model penaltycount= home won /;
run; quit;

The SAS System

The REG Procedure

Model: MODEL1

Dependent Variable: penaltycount

Number of Observations Read 6178
Number of Observations Used 6178

Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 326.99322 163.49661 37.78 <.0001
Error 6175 26721 4.32737
Corrected Total 6177 27048

Root MSE 2.08023 R-Square 0.0121
Dependent Mean 3.99077 Adj R-Sq 0.0118
Coeff Var 52.12603

Parameter Estimates
Variable DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 4.26670 0.04470 95.45 <.0001
home 1 -0.43584 0.05316 -8.20 <.0001
won 1 -0.11240 0.05316 -2.11 0.0345

Panel of heat maps of residuals by regressors for penaltycount.

R-squared is quite small, but then again I’ve had models so bad that R-squared was negative (that really shouldn’t be possible for a least-squares fit with an intercept, so that was a really bad model using Twitter data, if that explains anything). But the difference is significant. The intercept says you would expect an away team that lost to get 4.26670 penalties per game, which is close to the mean of 3.99 calculated without controlling for home ice. Somewhat obviously, there is a negative effect for the team that wins: take fewer penalties and you are more likely to win, which was also shown above.

When home ice is accounted for, the home team is called for 0.43584 fewer penalties per game. So that shows a slight favoritism towards the home team. Let’s bootstrap it again for a better estimate.
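
To make the PROC REG coefficients concrete, the sketch below plugs the four combinations of home and won into the fitted equation. The function name `expected_penalties` is hypothetical; the coefficients are copied from the parameter estimates above.

```python
# Coefficients copied from the PROC REG parameter estimates above
intercept, b_home, b_won = 4.26670, -0.43584, -0.11240

def expected_penalties(home, won):
    """Predicted penalty count for one team-game; home and won are 0/1 indicators."""
    return intercept + b_home * home + b_won * won

for home in (0, 1):
    for won in (0, 1):
        label = "{0} team, {1}".format("home" if home else "away",
                                       "won" if won else "lost")
        print("{0}: {1:.2f} penalties".format(label, expected_penalties(home, won)))
```

The intercept alone is the away team that lost; the home and won terms each shift that baseline down.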

%%SAS sas_session
proc sort data=nhl.gamessamp;
 by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.penestimates;
proc reg data=nhl.gamessamp;
by replicate;
model penaltycount= home won/;
run; quit;
ods trace off;
%%SAS sas_session
proc sql;
    select Variable, mean(Estimate) as MeanEst
      from nhl.penestimates
      group by variable;
run;
quit;
%histograms(curvar='home')

The SAS System

Variable MeanEst
Intercept 4.156299
home -0.29919
won -0.04627

The SAS System

The SGRender Procedure

The estimate from the original sample (-0.43584) turned out to be larger in magnitude than the bootstrapped mean (-0.29919). Still, there is a significant negative effect on the number of penalties a team is called for when it is the home team. The next question is whether this varies by team: do certain teams have more of a home ice advantage than others in terms of penalties called?
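
For readers without SAS, the same bootstrap idea can be sketched in Python with NumPy: resample rows with replacement, refit the regression on each replicate, and average the coefficients. This is a hypothetical analogue, not the procedure that built nhl.gamessamp. The data below are synthetic, generated with a known home effect of -0.4, and the sample size and replicate count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the games table (NOT the NHL data): 0/1 indicators
# for home and won, and a penalty count built with a known home effect of -0.4.
n = 2000
home = rng.integers(0, 2, size=n)
won = rng.integers(0, 2, size=n)
penaltycount = 4.2 - 0.4 * home - 0.1 * won + rng.normal(0, 2, size=n)

X = np.column_stack([np.ones(n), home, won])  # columns: intercept, home, won

def ols(X, y):
    """Least-squares coefficients for y ~ X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

n_reps = 500
estimates = np.empty((n_reps, X.shape[1]))
for r in range(n_reps):
    idx = rng.integers(0, n, size=n)  # resample row indices with replacement
    estimates[r] = ols(X[idx], penaltycount[idx])

mean_est = estimates.mean(axis=0)
ci_home = np.percentile(estimates[:, 1], [2.5, 97.5])
print("mean estimates (intercept, home, won):", np.round(mean_est, 3))
print("95% percentile interval for home:", np.round(ci_home, 3))
```

The percentile interval gives a distribution-free check on whether the home coefficient is reliably below zero, which is what a histogram of the replicate estimates shows visually.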

%%SAS sas_session
proc sort data=nhl.games;
 by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.penteamestimates;
proc reg data=nhl.games;
by team;
model penaltycount= home won/;
run; quit;
ods trace off;
%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, Probt 
      from nhl.penteamestimates
      where Variable="home"
        order by Estimate;
run;
quit;

The SAS System

Variable team Parameter Estimate Pr > |t|
home CGY -0.72226 0.0013
home CHI -0.62266 0.0004
home VAN -0.55893 0.0151
home TOR -0.48348 0.0213
home NJ -0.47302 0.0069
home MTL -0.46980 0.0228
home PIT -0.46902 0.0371
home NYR -0.42721 0.0142
home WSH -0.41880 0.0390
home NSH -0.40351 0.0507
home VGK -0.39992 0.1965
home FLA -0.39745 0.0721
home MIN -0.38571 0.0462
home CAR -0.36810 0.0254
home OTT -0.35791 0.1104
home ANA -0.33947 0.1173
home BUF -0.30478 0.1368
home CBJ -0.28451 0.1720
home WPG -0.26757 0.2471
home DET -0.25629 0.1476
home BOS -0.25514 0.2141
home COL -0.20313 0.3130
home PHI -0.12630 0.5781
home EDM -0.11030 0.5855
home NYI -0.08499 0.6757
home PHX -0.04440 0.8409
home SJ -0.04309 0.8336
home DAL -0.03506 0.8597
home LA -0.03000 0.8776
home TB 0.01714 0.9290
home STL 0.19078 0.4047

If you are a Pittsburgh rival and a Sidney Crosby conspiracist, you might see PIT’s top rank in the home ice effect on wins, together with fewer penalties at home (-0.46902), as proof of something nefarious. I’ll just leave it at that. Then there’s Calgary, with the largest home drop in penalties. Or Philadelphia, which no longer seems to have a home ice advantage.

One question, though: do these teams have a home ice advantage, or are they just more likely to take penalties on the road? Many of the teams that showed a home ice advantage in terms of wins are not at the top of the list in terms of penalties. In fact, for most teams the difference in penalties between home and away is not significant.
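
A quick tally of the table backs up that last point. The sketch below copies the p-values for the home coefficient from the table above and counts how many fall below 0.05. That cutoff is my assumption of a conventional threshold, not something fixed by the analysis.

```python
# Pr > |t| for the home coefficient, copied from the table above
home_pvalues = {
    "CGY": 0.0013, "CHI": 0.0004, "VAN": 0.0151, "TOR": 0.0213, "NJ": 0.0069,
    "MTL": 0.0228, "PIT": 0.0371, "NYR": 0.0142, "WSH": 0.0390, "NSH": 0.0507,
    "VGK": 0.1965, "FLA": 0.0721, "MIN": 0.0462, "CAR": 0.0254, "OTT": 0.1104,
    "ANA": 0.1173, "BUF": 0.1368, "CBJ": 0.1720, "WPG": 0.2471, "DET": 0.1476,
    "BOS": 0.2141, "COL": 0.3130, "PHI": 0.5781, "EDM": 0.5855, "NYI": 0.6757,
    "PHX": 0.8409, "SJ": 0.8336, "DAL": 0.8597, "LA": 0.8776, "TB": 0.9290,
    "STL": 0.4047,
}

alpha = 0.05  # assumed significance threshold
significant = sorted(team for team, p in home_pvalues.items() if p < alpha)

print("{0} of {1} teams have a significant home effect on penalties".format(
    len(significant), len(home_pvalues)))
print(", ".join(significant))
```

With 31 separate tests, a few of these would be expected to cross 0.05 by chance alone, so the count is best read as a rough guide.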

As with a lot of analyses, this just leads to more questions. Does the nationality of the refs come into play? Some of the top teams are Canadian (then again, three Canadian teams aren’t). Some of the teams are Original Six teams, some aren’t.

Summary

As I stated before, this is mostly for fun. It is a toy dataset that can be used to answer some questions. I would say that the first goal is important, but only because of the second goal. If a team scores both the first and second, the game is pretty much over. Home seems to provide, on average, some advantage, and on average the penalties are slightly lower for home teams than for away teams. This doesn’t hold up when looking at every team though.

What does it mean for strategy? Well, coming from someone who has never played, I don’t know. I do have some thoughts. One is that a team should try to score first. Is that helpful? Just go and score? Of course, if it were that easy… Still, I think there may be ways to do that. Load up the minutes of your best scoring lines in the first part of the game, including scoring defensemen. Or if you know which player is best at scoring the first goal, give them lots of ice time at the start.

Another strategy follows from the fact that it gets harder for the opposing team to win the later in the game the first goal is scored. An average team has about a 70% chance of winning if it scores the first goal in the 3rd period. So, give your most defensive-minded players more minutes in the first and part of the second. Really grind it out with the other team, with lots of hits, and tire them out. Make them cover the whole rink. Then switch and give more minutes to the best scorers, who will have fresh legs.

But…I’m not a coach and I don’t play hockey.

# The period coefficient is multiplied by 3, so this is the third-period scenario
print("Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+3*0.342078+0.10468)*100))
Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties: 
70.94%