Scoring first in the NHL

The data for this analysis were collected in another section; I separated the two to keep things a little shorter. I’m working with SAS University Edition, which includes a Jupyter Notebook and gives access to SAS either through Python or by using the %%SAS cell magic to flag SAS code. I’ll primarily work with SAS code. Unfortunately, I can’t do every step in SAS University Edition, since I cannot install external libraries onto its virtual machine. If a library is pure Python, I could probably use it locally, but that is not the case with things like sklearn and matplotlib. I mostly wanted to use SAS for its multilevel modeling capabilities; otherwise sklearn or statsmodels could be used.

I’ve stored the data as a csv file to load into SAS, but I’ll load it through a pandas data frame initially.

import saspy
import pandas as pd
import numpy as np
from IPython.display import HTML
import math

df = pd.read_csv(#rename if you need to
    ,index_col=0) 
df['notwon'] = np.where(df['won']==1,0,1)
df['penalty_cen'] = df['penaltycount']-3.98446
print(df.head())
print(df.describe())
print(df['team'].unique())
#to fix the Arizona team
#df.loc[df['team']=='ARI','team'] = 'PHX'

Create a SAS session and load the data frame into SAS for analysis.

sas_session = saspy.SASsession()
sas_session.saslib("nhl",path="/folders/myfolders/nhl/Update20180709")
sas_session.teach_me_SAS(False)
games = sas_session.df2sd(df,table='games',libref='nhl')
print(games.describe())
games.head()

Using SAS Config named: default
SAS Connection established. Subprocess id is 5436


 
 libname nhl    '/folders/myfolders/nhl/Update20180709'  ;
NOTE: Libref NHL was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders/nhl/Update20180709
 
 
        Variable      N  NMiss        Median          Mean        StdDev  \
      gamekey  12356      0  2.016290e+07  5.380185e+07  7.256136e+07   
        goals  12356      0  3.000000e+00  2.785529e+00  1.621079e+00   
         home  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
     overtime  12356      0  0.000000e+00  1.317579e-01  3.382410e-01   
 penaltycount  12356      0  4.000000e+00  3.984461e+00  2.089983e+00   
       period  12356      0  1.000000e+00  1.248786e+00  5.256075e-01   
  scoredfirst  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
 scoredsecond  12356      0  0.000000e+00  4.876174e-01  4.998669e-01   
          won  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
       notwon  12356      0  5.000000e-01  5.000000e-01  5.000202e-01   
 penalty_cen  12356      0  1.554000e-02  9.906119e-07  2.089983e+00   

             Min           P25           P50           P75           Max  
 201421.00000  2.015226e+07  2.016290e+07  2.018265e+07  2.018213e+08  
      0.00000  2.000000e+00  3.000000e+00  4.000000e+00  1.000000e+01  
      0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
      0.00000  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  
      0.00000  3.000000e+00  4.000000e+00  5.000000e+00  2.000000e+01  
      1.00000  1.000000e+00  1.000000e+00  1.000000e+00  4.000000e+00  
      0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
      0.00000  0.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00  
      0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
      0.00000  0.000000e+00  5.000000e-01  1.000000e+00  1.000000e+00  
    -3.98446 -9.844600e-01  1.554000e-02  1.015540e+00  1.601554e+01  

	gamekey	goals	home	penaltycount	period	scoredfirst	scoredsecond	team	won	notwon	penalty_cen
0	201421	3	1	14	1	0	1	MTL	0	1	10.01554
1	201421	4	0	12	1	1	0	TOR	1	0	8.01554
2	201422	6	1	6	1	1	0	CHI	1	0	2.01554
3	201422	4	0	4	1	0	1	WSH	0	1	0.01554
4	201423	4	1	6	1	1	0	EDM	0	1	2.01554

One of the problems with the dataset is that the observations are not really independent. Each game contributes two rows, one per team, and the value of won in one row determines it in the other: if you know who won, you know who lost. I won’t go into the details here, but I experimented with a couple of ways to approach this problem. For one, I looked at a multilevel approach where games were nested within teams, a sort of cross-classification problem. This didn’t work out that well because of the small number of observations per gamekey: the model would converge, but it often had trouble estimating the covariance matrix.

Instead, I chose the alternative route of sampling one team from each game: stratified random sampling, with gamekey as the stratum.

Regardless of the approach, the estimated parameters were usually close. Here is an example of a random sample and the full model.

def convertToProb(logit):
    odds = math.exp(logit)
    return odds/(1.0+odds)

%%SAS sas_session
proc sort data=nhl.games;
 by gamekey;
run;
Proc SurveySelect data=nhl.games out=nhl.gamesrand noprint Method = urs N = 1 outhits
    rep = 1;
    Strata gamekey;
run;
proc freq data=nhl.gamesrand;
    table won;
run;
proc print data=nhl.gamesrand(obs=10);
run;

The SAS System

The FREQ Procedure


won	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	3093	50.06	3093	50.06
1	3085	49.94	6178	100.00

The SAS System


Obs	gamekey	Replicate	goals	home	penaltycount	period	scoredfirst	scoredsecond	team	won	notwon	penalty_cen	NumberHits	ExpectedHits	SamplingWeight
1	201421	1	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
2	201422	1	6	1	6	1	1	0	CHI	1	0	2.0155	1	0.5	2
3	201423	1	4	1	6	1	1	0	EDM	0	1	2.0155	1	0.5	2
4	201424	1	1	1	4	1	1	0	PHI	0	1	0.0155	1	0.5	2
5	201425	1	2	1	7	1	1	1	DET	1	0	3.0155	1	0.5	2
6	201426	1	6	1	8	1	1	1	COL	1	0	4.0155	1	0.5	2
7	201427	1	3	1	7	1	1	0	BOS	1	0	3.0155	1	0.5	2
8	201428	1	3	1	3	1	1	1	PIT	1	0	-0.9845	1	0.5	2
9	201429	1	4	0	5	1	1	1	CGY	0	1	1.0155	1	0.5	2
10	201521	1	3	1	2	1	0	1	TOR	0	1	-1.9845	1	0.5	2

This means that about 50% of the sampled rows are wins, as expected: each game has a 50/50 chance of contributing its winner or its loser.
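For readers following along without SAS, the one-team-per-game draw can be sketched in plain Python. This is only an illustration under assumptions: the helper name `sample_one_per_game` and its `seed` argument are made up here, and the four rows are copied from the `df.head()` output above rather than taken from the full table.

```python
import random
from collections import defaultdict

# Four rows copied from the df.head() output above (two games, two teams each).
rows = [
    {"gamekey": 201421, "team": "MTL", "won": 0},
    {"gamekey": 201421, "team": "TOR", "won": 1},
    {"gamekey": 201422, "team": "CHI", "won": 1},
    {"gamekey": 201422, "team": "WSH", "won": 0},
]

def sample_one_per_game(rows, seed=None):
    """Pick one row at random from each gamekey stratum (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    by_game = defaultdict(list)
    for row in rows:
        by_game[row["gamekey"]].append(row)
    # One random choice per stratum, returned in gamekey order.
    return [rng.choice(by_game[key]) for key in sorted(by_game)]

sample = sample_one_per_game(rows, seed=2018)
for row in sample:
    print(row["gamekey"], row["team"], row["won"])
```

Passing a seed makes the draw reproducible, which the SAS step above does not do. In recent pandas versions, `df.groupby('gamekey').sample(n=1)` should give the same kind of draw directly on the data frame.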

Looking solely at the scoring-first variable gives an estimate of the effect that scoring first has on winning. This model does a better job of predicting than the intercept-only model, and the odds ratio is pretty high at 4.6 (versus not scoring first). I find plugging in the estimates to be a useful exercise, and you can see this below the output. Not surprisingly, the predicted probability of winning after scoring first, about 68%, is close to the observed proportion of wins among teams that scored first. This will change when we start to control for other factors. Next I will fit the full model.

%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') /param=ref;
model won(event='1')= scoredfirst/;
run; quit;

The SAS System

The LOGISTIC Procedure


Model Information
Data Set	NHL.GAMESRAND
Response Variable	won
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring


Number of Observations Read	6178
Number of Observations Used	6178


Response Profile
Ordered Value	won	Total Frequency
1	0	3071
2	1	3107

Probability modeled is won=1.


Class Level Information
Class	Value	Design Variables
scoredfirst	0	0
	1	1


Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	8566.317	7728.093
SC	8573.046	7741.551
-2 Log L	8564.317	7724.093


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	840.2236	1	<.0001
Score	820.9889	1	<.0001
Wald	782.1602	1	<.0001


Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
scoredfirst	1	782.1602	<.0001


Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-0.7489	0.0385	378.5074	<.0001
scoredfirst	1	1	1.5285	0.0547	782.1602	<.0001


Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
scoredfirst 1 vs 0	4.611	4.143	5.133


Association of Predicted Probabilities and Observed Responses
Percent Concordant	46.5	Somers' D	0.365
Percent Discordant	10.1	Gamma	0.644
Percent Tied	43.4	Tau-a	0.182
Pairs	9541597	c	0.682

print("Probability of winning when scoring first: ")
print("{0:.2f}%".format(convertToProb(-.7489+1.5285)*100))

Probability of winning when scoring first: 
68.56%
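The odds ratio and the fitted probabilities follow directly from the two coefficients in the output above. A small sketch of that arithmetic (the variable and function names are mine; the numbers are the estimates from the table):

```python
import math

intercept = -0.7489     # Intercept estimate from the output above
b_scoredfirst = 1.5285  # scoredfirst estimate from the output above

def logit_to_prob(logit):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

odds_ratio = math.exp(b_scoredfirst)                # odds multiplier for scoring first
p_first = logit_to_prob(intercept + b_scoredfirst)  # team scored first
p_not_first = logit_to_prob(intercept)              # team did not score first

print("odds ratio: {0:.3f}".format(odds_ratio))
print("P(win | scored first): {0:.2f}%".format(p_first * 100))
print("P(win | did not score first): {0:.2f}%".format(p_not_first * 100))
```

The ratio of the two implied odds, p/(1-p) for each case, reproduces the same odds ratio, which is a handy check that the coefficients were plugged in correctly.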

The full model certainly performs better than the intercept-only model (AIC of 6727 versus 8566), and it converged properly. Overtime is the only non-significant factor, which may mean that one of the other variables carries the same information; that could be the period scored. It may be worth looking later at which added variables overlap with overtime. As expected, our scoredfirst variable is important and significant. Penalties were centered around the mean of approximately 4 penalties (3.9…), so a penalty_cen of 0 indicates about 4 penalties.

%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;

The SAS System

The LOGISTIC Procedure


Model Information
Data Set	NHL.GAMESRAND
Response Variable	won
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring


Number of Observations Read	6178
Number of Observations Used	6178


Response Profile
Ordered Value	won	Total Frequency
1	0	3071
2	1	3107

Probability modeled is won=1.


Class Level Information
Class	Value	Design Variables
scoredfirst	0	0
	1	1
scoredsecond	0	0
	1	1
home	0	0
	1	1
overtime	0	0
	1	1


Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	8566.317	6727.321
SC	8573.046	6774.422
-2 Log L	8564.317	6713.321


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	1850.9960	6	<.0001
Score	1648.0410	6	<.0001
Wald	1235.9008	6	<.0001


Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
scoredfirst	1	870.7582	<.0001
scoredsecond	1	814.4736	<.0001
home	1	29.2964	<.0001
overtime	1	1.4692	0.2255
period	1	9.4676	0.0021
penalty_cen	1	10.0272	0.0015


Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-2.2164	0.1022	470.7539	<.0001
scoredfirst	1	1	1.9117	0.0648	870.7582	<.0001
scoredsecond	1	1	1.8564	0.0650	814.4736	<.0001
home	1	1	0.3241	0.0599	29.2964	<.0001
overtime	1	1	-0.1016	0.0838	1.4692	0.2255
period		1	0.1746	0.0567	9.4676	0.0021
penalty_cen		1	-0.0463	0.0146	10.0272	0.0015


Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
scoredfirst 1 vs 0	6.764	5.958	7.680
scoredsecond 1 vs 0	6.401	5.635	7.271
home 1 vs 0	1.383	1.230	1.555
overtime 1 vs 0	0.903	0.767	1.065
period	1.191	1.065	1.331
penalty_cen	0.955	0.928	0.983


Association of Predicted Probabilities and Observed Responses
Percent Concordant	78.6	Somers' D	0.580
Percent Discordant	20.7	Gamma	0.584
Percent Tied	0.7	Tau-a	0.290
Pairs	9541597	c	0.790

What is the probability of winning a game given these factors?

It can be broken up in many different ways.

Probability of winning all else being zero: 9.83%

Probability of winning when scoring first in the first period, not at home, no overtime, and ~4 penalties: 46.75%

Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 54.83%

Probability of winning when scoring first in the second period, not at home, no overtime, and ~4 penalties: 51.11%

Probability of winning when scoring first in the first period, scoring second, not at home, no overtime, and ~4 penalties: 84.89%

Probability of winning when scoring first in the second period, scoring second, not at home, no overtime, and ~4 penalties: 87.00%

print("Probability of winning all else being zero: ")
print("{0:.2f}%".format(convertToProb(-2.2164)*100))

print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+0.1746)*100))

print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+.3241+0.1746)*100))

print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+2*0.1746)*100))

print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+0.1746)*100))

print("Probability of winning when scoring first and second not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+2*0.1746)*100))
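Rather than typing each sum by hand, the same calculations can be driven from a dictionary of coefficients. This is just a sketch: `win_probability` and the `coef` dictionary are names I made up, and the values are the estimates from the full-model output above. Any variable left out of a scenario is treated as 0, which for penalty_cen means about 4 penalties.

```python
import math

# Estimates copied from the full-model output above.
coef = {
    "Intercept": -2.2164,
    "scoredfirst": 1.9117,
    "scoredsecond": 1.8564,
    "home": 0.3241,
    "overtime": -0.1016,
    "period": 0.1746,
    "penalty_cen": -0.0463,
}

def win_probability(coef, **scenario):
    """Sum the linear predictor for a scenario and apply the inverse logit."""
    logit = coef["Intercept"]
    for name, value in scenario.items():
        logit += coef[name] * value
    return 1.0 / (1.0 + math.exp(-logit))

scenarios = {
    "scored first, away, period 1": dict(scoredfirst=1, period=1),
    "scored first, home, period 1": dict(scoredfirst=1, home=1, period=1),
    "scored first and second, away, period 1": dict(scoredfirst=1, scoredsecond=1, period=1),
}
for label, scenario in scenarios.items():
    print("{0}: {1:.2f}%".format(label, win_probability(coef, **scenario) * 100))
```

Spelling out each scenario as keyword arguments makes it harder for a label and its calculation to drift apart, for example by describing a home game while leaving the home coefficient out of the sum.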

So when you control for the other factors, scoring first is no guarantee of winning. Having a home game and scoring first gives you a slightly better than even chance. It really isn’t until you are also the team that scores second that a win becomes more likely than not. Again, that makes sense, because it is that much harder to come back from two goals than from one.

Ultimately, it is the team that scores the first two goals that tends to go on to win. So while scoring first is important, you absolutely need to step on the gas and get that second goal.

But this is just one sample from the population of games from 2013-2014 to 2017-2018. Could this just have been a fluke of the sample that was drawn? In fact, because I didn’t set the seed, if you are rerunning this code you probably got slightly different estimates.

Let’s bootstrap the process and look at the sampling distribution of the estimates to get a better picture of the possible win probabilities. In SAS this can be done efficiently by drawing all the replicates with PROC SURVEYSELECT and then using a BY statement in PROC LOGISTIC.

%%SAS sas_session
proc sort data=nhl.games;
 by gamekey;
run;
proc surveyselect data=nhl.games
     out=nhl.gamessamp
     noprint Method = urs N = 1 outhits               
     reps=100;  /* generate this many bootstrap resamples */
     Strata gamekey;
run;

proc print data=nhl.gamessamp(obs=10);
run;

The SAS System


Obs	gamekey	Replicate	goals	home	penaltycount	period	scoredfirst	scoredsecond	team	won	notwon	penalty_cen	NumberHits	ExpectedHits	SamplingWeight
1	201421	1	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
2	201421	2	4	0	12	1	1	0	TOR	1	0	8.0155	1	0.5	2
3	201421	3	4	0	12	1	1	0	TOR	1	0	8.0155	1	0.5	2
4	201421	4	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
5	201421	5	4	0	12	1	1	0	TOR	1	0	8.0155	1	0.5	2
6	201421	6	4	0	12	1	1	0	TOR	1	0	8.0155	1	0.5	2
7	201421	7	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
8	201421	8	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
9	201421	9	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2
10	201421	10	3	1	14	1	0	1	MTL	0	1	10.0155	1	0.5	2

%%SAS sas_session
proc sort data=nhl.gamessamp;
 by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.estimates;
proc logistic data=nhl.gamessamp;
by replicate;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
ods trace off;

I cleared the outputs so you don’t have to see each iteration of the logistic step. Instead, we can summarize the estimates visually or by the mean of the sampling distribution.

%%SAS sas_session
proc sql;
    select Variable, mean(Estimate) as MeanEst
      from nhl.estimates
      group by variable;
run;
quit;

The SAS System


Variable	MeanEst
Intercept	-2.14334
home	0.342078
overtime	-0.01201
penalty_cen	-0.02247
period	0.10468
scoredfirst	1.90498
scoredsecond	1.836549
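The mean is one summary of the bootstrap distribution; a percentile interval is another common one. The sketch below shows the idea in Python. The function names are hypothetical and the list of replicate estimates is made up purely for illustration; in practice the values would come from the Estimate column of nhl.estimates for a single variable.

```python
import math

def percentile(sorted_vals, p):
    """Linear-interpolation percentile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def bootstrap_summary(estimates, alpha=0.05):
    """Return the mean and a (1 - alpha) percentile interval of the replicates."""
    vals = sorted(estimates)
    mean = sum(vals) / len(vals)
    return mean, percentile(vals, alpha / 2), percentile(vals, 1 - alpha / 2)

# Made-up replicate estimates, NOT the values from the run above.
fake_estimates = [1.84, 1.87, 1.89, 1.90, 1.90, 1.91, 1.92, 1.93, 1.95, 1.98]
mean, lower, upper = bootstrap_summary(fake_estimates)
print("mean={0:.3f}, 95% interval=({1:.3f}, {2:.3f})".format(mean, lower, upper))
```

With only 100 replicates the tails of the interval are rough, so a larger number of replicates would probably be worth the extra run time if the interval itself is of interest.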

%%SAS sas_session
%MACRO histograms(curvar=);
    %let titlevar = &curvar;
    proc template;
        define statgraph Histogram;
            begingraph;
                entrytitle "Histogram of &titlevar ";
                layout lattice/columns=1;
                    layout overlay;
                        histogram Estimate;
                    endlayout;
                    layout overlay;
                        boxplot y=Estimate / orient=horizontal;
                    endlayout;
                endlayout;
            endgraph;
        end;
    run;

    proc sgrender data=nhl.estimates(where=(variable=&curvar)) template = Histogram;
    run;
%MEND histograms;
%histograms(curvar='scoredfirst')
%histograms(curvar='scoredsecond')
%histograms(curvar='home')

[Figures: histogram and horizontal box plot of the bootstrap estimates, one pair each for scoredfirst, scoredsecond, and home]

As expected the sampling distributions look normal. The means are not far off from the original estimates. The probabilities can be recalculated again.

Variable	Value
Intercept	-2.14334
home	0.342078
overtime	-0.01201
penalty_cen	-0.02247
period	0.10468
scoredfirst	1.90498
scoredsecond	1.836549

Probability of winning at home before the first period (period = 0, penalty_cen = 3.9): 13.14%

Probability of winning not at home before the first period (period = 0, penalty_cen = 3.9): 9.70%

Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: 46.66%

Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 55.19%

Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: 49.28%

Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: 84.59%

Probability of winning when scoring first and second at home, in the first period, no overtime, and ~4 penalties: 88.54%

# Note: penalty_cen is centered at the mean, so the 3.9 used below means 3.9 penalties above the mean (about 8 in total), not ~4.
print("Probability of winning at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+0.342078+(3.9*-0.02247))*100))

print("Probability of winning not at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+(3.9*-0.02247))*100))

print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.10468)*100))

print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+0.10468)*100))

print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+2*0.10468)*100))

print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+1.836549+0.10468)*100))

print("Probability of winning when scoring first and second at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+1.836549+0.10468)*100))

Something else to consider in terms of independence is the relationship between games and teams. Wins might be considered nested within a team: a team playing well may string together several wins in a row, or something along those lines. We could examine this by looking at the intraclass correlation (ICC).

%%SAS sas_session
proc nlmixed data=nhl.gamesrand;
    ods output ParameterEstimates=p1; 
    parms gamma00 = -1 tau00=1;
    gamma0j = gamma00 + u0j;
    eta = gamma0j;
    expEta = exp(eta);
    pij = expEta/(1+expEta);
    model scoredfirst ~ binary(pij);
    random u0j ~ normal([0],[tau00]) subject=team OUT=randout;
    estimate 'ICC' tau00/(tau00+3.29);
run;

The SAS System

The NLMIXED Procedure


Specifications
Data Set	NHL.GAMESRAND
Dependent Variable	scoredfirst
Distribution for Dependent Variable	Binary
Random Effects	u0j
Distribution for Random Effects	Normal
Subject Variable	team
Optimization Technique	Dual Quasi-Newton
Integration Method	Adaptive Gaussian Quadrature


Dimensions
Observations Used	6178
Observations Not Used	0
Total Observations	6178
Subjects	31
Max Obs per Subject	238
Parameters	2
Quadrature Points	1


Initial Parameters
gamma00	tau00	Negative Log Likelihood
-1	1	4329.25972


Iteration History
Iteration	Calls	Negative Log Likelihood	Difference	Maximum Gradient	Slope
1	6	4314.1825	15.07724	14.4613	-151.871
2	16	4302.8614	11.32104	41.4893	-34.2401
3	22	4302.2958	0.565654	53.4288	-60.7572
4	29	4298.0747	4.22112	184.088	-20.6365
5	36	4280.8413	17.23342	415.583	-32.7086
6	40	4278.4399	2.401311	36.7201	-156.999
7	43	4278.3651	0.074847	34.0781	-0.19067
8	45	4278.2916	0.073493	8.94353	-0.09359
9	48	4278.2854	0.006165	2.81772	-0.01572
10	51	4278.2849	0.0005	0.078424	-0.00096
11	54	4278.2849	9.217E-7	0.002819	-1.92E-6


NOTE: GCONV convergence criterion satisfied.


Fit Statistics
-2 Log Likelihood	8556.6
AIC (smaller is better)	8560.6
AICC (smaller is better)	8560.6
BIC (smaller is better)	8563.4


Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	95% Confidence Limits		Gradient
gamma00	-0.00313	0.03522	30	-0.09	0.9298	-0.07505	0.06879	-0.00232
tau00	0.01792	0.009702	30	1.85	0.0747	-0.00190	0.03773	0.002819


Additional Estimates
Label	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper
ICC	0.005416	0.002917	30	1.86	0.0732	0.05	-0.00054	0.01137
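The constant 3.29 in the ESTIMATE statement is the variance of the standard logistic distribution, π²/3, which is the usual level-1 residual variance assumed for a logit model. A quick check of the ICC arithmetic in Python, using the tau00 estimate from the output above (the variable names are mine):

```python
import math

tau00 = 0.01792                # team-level variance from the output above
level1_var = math.pi ** 2 / 3  # variance of the standard logistic, about 3.29

icc = tau00 / (tau00 + level1_var)
print("level-1 variance: {0:.2f}".format(level1_var))
print("ICC: {0:.4f}".format(icc))
```

Because the level-1 variance is fixed by the link function, the ICC here depends only on tau00, so a team-level variance this small will always give an ICC near zero.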

The fixed-effect estimates in the multilevel model shown below are very similar to what we were seeing with the bootstrapped model. The ICC is about 0.005, which is extremely low and indicates that, in this sample, the outcome modeled here (scoredfirst) is barely affected by the grouping of teams. We could still use a multilevel model to examine the random effect associated with each team, and that is shown below.

Were any teams better at turning a first goal into a win? To answer that I’ll use a different approach.

%%SAS sas_session
proc glimmix data=nhl.gamesrand method=quad;
    class team;
    ODS OUTPUT SOLUTIONR=R;
    model won(event='1')= scoredfirst scoredsecond home overtime period penaltycount/dist=binary solution oddsratio ddfm=bw;
    nloptions maxiter=10000;
    random intercept scoredfirst /subject=team type=chol g solution;
run;

The SAS System

The GLIMMIX Procedure


Model Information
Data Set	NHL.GAMESRAND
Response Variable	won
Response Distribution	Binary
Link Function	Logit
Variance Function	Default
Variance Matrix Blocked By	team
Estimation Technique	Maximum Likelihood
Likelihood Approximation	Gauss-Hermite Quadrature
Degrees of Freedom Method	Between-Within


Class Level Information
Class	Levels	Values
team	31	ANA BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LA MIN MTL NJ NSH NYI NYR OTT PHI PHX PIT SJ STL TB TOR VAN VGK WPG WSH


Number of Observations Read	6178
Number of Observations Used	6178


Response Profile
Ordered Value	won	Total Frequency
1	0	3071
2	1	3107

The GLIMMIX procedure is modeling the probability that won='1'.


Dimensions
G-side Cov. Parameters	3
Columns in X	7
Columns in Z per Subject	2
Subjects (Blocks in V)	31
Max Obs per Subject	238


Optimization Information
Optimization Technique	Dual Quasi-Newton
Parameters in Optimization	10
Lower Boundaries	2
Upper Boundaries	0
Fixed Effects	Not Profiled
Starting From	GLM estimates
Quadrature Points	1


Iteration History
Iteration	Restarts	Evaluations	Objective Function	Change	Max Gradient
0	0	4	6711.4553725	.	60.84809
1	0	3	6710.7871203	0.66825213	263.1968
2	0	3	6706.5499723	4.23714802	148.9014
3	0	2	6703.3524844	3.19748792	159.9946
4	0	3	6703.2107923	0.14169209	147.1521
5	0	2	6702.940897	0.26989531	130.1817
6	0	3	6702.7805741	0.16032292	116.9171
7	0	2	6702.4437877	0.33678642	48.60881
8	0	3	6702.1046842	0.33910344	5.312724
9	0	3	6702.043703	0.06098120	12.1805
10	0	3	6702.0167753	0.02692775	13.42428
11	0	3	6702.0127666	0.00400870	12.09659
12	0	4	6701.999187	0.01357963	0.689931
13	0	3	6701.9990994	0.00008755	0.015468
14	0	3	6701.9990992	0.00000016	0.000821


Convergence criterion (GCONV=1E-8) satisfied.

Estimated G matrix is not positive definite.


Fit Statistics
-2 Log Likelihood	6702.00
AIC (smaller is better)	6720.00
AICC (smaller is better)	6720.03
BIC (smaller is better)	6732.90
CAIC (smaller is better)	6741.90
HQIC (smaller is better)	6724.21


Fit Statistics for Conditional Distribution
-2 log L(won \| r. effects)	6663.54
Pearson Chi-Square	6138.44
Pearson Chi-Square / DF	0.99


Estimated G Matrix
Effect	Row	Col1	Col2
Intercept	1	0.02931	0.000438
scoredfirst	2	0.000438	6.533E-6


Covariance Parameter Estimates
Cov Parm	Subject	Estimate	Standard Error
CHOL(1,1)	team	0.1712	0.05534
CHOL(2,1)	team	0.002556	0.07133
CHOL(2,2)	team	0	.


Solutions for Fixed Effects
Effect	Estimate	Standard Error	DF	t Value	Pr > \|t\|
Intercept	-2.0111	0.1244	30	-16.17	<.0001
scoredfirst	1.9085	0.06506	6141	29.34	<.0001
scoredsecond	1.8539	0.06533	6141	28.38	<.0001
home	0.3270	0.06014	6141	5.44	<.0001
overtime	-0.09928	0.08424	6141	-1.18	0.2386
period	0.1754	0.05700	6141	3.08	0.0021
penaltycount	-0.05052	0.01484	6141	-3.40	0.0007


Odds Ratio Estimates
scoredfirst	scoredsecond	home	overtime	period	penaltycount	_scoredfirst	_scoredsecond	_home	_overtime	_period	_penaltycount	Estimate	DF	95% Confidence Limits
1.4989	0.4909	0.5034	0.1318	1.2488	3.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	6.743	6141	5.936	7.661
0.4989	1.4909	0.5034	0.1318	1.2488	3.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	6.385	6141	5.617	7.257
0.4989	0.4909	1.5034	0.1318	1.2488	3.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	1.387	6141	1.233	1.560
0.4989	0.4909	0.5034	1.1318	1.2488	3.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	0.905	6141	0.768	1.068
0.4989	0.4909	0.5034	0.1318	2.2488	3.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	1.192	6141	1.066	1.333
0.4989	0.4909	0.5034	0.1318	1.2488	4.9908	0.4989	0.4909	0.5034	0.1318	1.2488	3.9908	0.951	6141	0.923	0.979

Effects of continuous variables are assessed as one unit offsets from the mean. The AT suboption modifies the reference value and the UNIT suboption modifies the offsets.


Type III Tests of Fixed Effects
Effect	Num DF	Den DF	F Value	Pr > F
scoredfirst	1	6141	860.64	<.0001
scoredsecond	1	6141	805.20	<.0001
home	1	6141	29.57	<.0001
overtime	1	6141	1.39	0.2386
period	1	6141	9.47	0.0021
penaltycount	1	6141	11.58	0.0007


Solution for Random Effects
Effect	Subject	Estimate	Std Err Pred	DF	t Value	Pr > \|t\|
Intercept	team ANA	0.1035	0.1275	6171	0.81	0.4169
scoredfirst	team ANA	0.001546	0.04294	6171	0.04	0.9713
Intercept	team BOS	0.1345	0.1305	6171	1.03	0.3027
scoredfirst	team BOS	0.002008	0.05592	6171	0.04	0.9714
Intercept	team BUF	-0.2700	0.1522	6171	-1.77	0.0761
scoredfirst	team BUF	-0.00403	0.1123	6171	-0.04	0.9714
Intercept	team CAR	-0.2826	0.1553	6171	-1.82	0.0690
scoredfirst	team CAR	-0.00422	0.1174	6171	-0.04	0.9713
Intercept	team CBJ	0.008767	0.1201	6171	0.07	0.9418
scoredfirst	team CBJ	0.000131	0.003898	6171	0.03	0.9732
Intercept	team CGY	0.08337	0.1249	6171	0.67	0.5046
scoredfirst	team CGY	0.001245	0.03460	6171	0.04	0.9713
Intercept	team CHI	0.1148	0.1248	6171	0.92	0.3576
scoredfirst	team CHI	0.001714	0.04811	6171	0.04	0.9716
Intercept	team COL	-0.1140	0.1241	6171	-0.92	0.3584
scoredfirst	team COL	-0.00170	0.04749	6171	-0.04	0.9714
Intercept	team DAL	-0.04652	0.1179	6171	-0.39	0.6932
scoredfirst	team DAL	-0.00069	0.01952	6171	-0.04	0.9716
Intercept	team DET	-0.09946	0.1226	6171	-0.81	0.4173
scoredfirst	team DET	-0.00149	0.04182	6171	-0.04	0.9717
Intercept	team EDM	-0.06339	0.1213	6171	-0.52	0.6013
scoredfirst	team EDM	-0.00095	0.02658	6171	-0.04	0.9716
Intercept	team FLA	-0.03694	0.1195	6171	-0.31	0.7573
scoredfirst	team FLA	-0.00055	0.01543	6171	-0.04	0.9715
Intercept	team LA	-0.05814	0.1219	6171	-0.48	0.6333
scoredfirst	team LA	-0.00087	0.02468	6171	-0.04	0.9719
Intercept	team MIN	0.1063	0.1267	6171	0.84	0.4014
scoredfirst	team MIN	0.001587	0.04433	6171	0.04	0.9714
Intercept	team MTL	0.01175	0.1219	6171	0.10	0.9232
scoredfirst	team MTL	0.000175	0.005245	6171	0.03	0.9733
Intercept	team NJ	-0.1392	0.1221	6171	-1.14	0.2540
scoredfirst	team NJ	-0.00208	0.05832	6171	-0.04	0.9716
Intercept	team NSH	0.1093	0.1279	6171	0.85	0.3929
scoredfirst	team NSH	0.001632	0.04540	6171	0.04	0.9713
Intercept	team NYI	-0.03941	0.1202	6171	-0.33	0.7430
scoredfirst	team NYI	-0.00059	0.01672	6171	-0.04	0.9719
Intercept	team NYR	0.1213	0.1217	6171	1.00	0.3193
scoredfirst	team NYR	0.001810	0.05085	6171	0.04	0.9716
Intercept	team OTT	0.05774	0.1216	6171	0.47	0.6349
scoredfirst	team OTT	0.000862	0.02421	6171	0.04	0.9716
Intercept	team PHI	-0.06709	0.1233	6171	-0.54	0.5863
scoredfirst	team PHI	-0.00100	0.02831	6171	-0.04	0.9718
Intercept	team PHX	-0.2198	0.1461	6171	-1.50	0.1324
scoredfirst	team PHX	-0.00328	0.09139	6171	-0.04	0.9714
Intercept	team PIT	0.1195	0.1313	6171	0.91	0.3629
scoredfirst	team PIT	0.001784	0.04981	6171	0.04	0.9714
Intercept	team SJ	0.03619	0.1214	6171	0.30	0.7656
scoredfirst	team SJ	0.000540	0.01527	6171	0.04	0.9718
Intercept	team STL	0.1790	0.1342	6171	1.33	0.1825
scoredfirst	team STL	0.002672	0.07458	6171	0.04	0.9714
Intercept	team TB	0.1397	0.1292	6171	1.08	0.2797
scoredfirst	team TB	0.002086	0.05817	6171	0.04	0.9714
Intercept	team TOR	-0.06816	0.1281	6171	-0.53	0.5947
scoredfirst	team TOR	-0.00102	0.02818	6171	-0.04	0.9712
Intercept	team VAN	-0.08880	0.1202	6171	-0.74	0.4600
scoredfirst	team VAN	-0.00133	0.03721	6171	-0.04	0.9716
Intercept	team VGK	0.01550	0.1552	6171	0.10	0.9205
scoredfirst	team VGK	0.000231	0.007014	6171	0.03	0.9737
Intercept	team WPG	0.08358	0.1203	6171	0.69	0.4873
scoredfirst	team WPG	0.001248	0.03502	6171	0.04	0.9716
Intercept	team WSH	0.1675	0.1264	6171	1.33	0.1850
scoredfirst	team WSH	0.002501	0.07022	6171	0.04	0.9716

%%SAS sas_session
proc sql;
    select Subject, (Estimate+1.9085) as ScoredFirst
      from r
      where Effect="scoredfirst"
        order by ScoredFirst;
run;
quit;

The SAS System


Subject	ScoredFirst
team CAR	1.904281
team BUF	1.904468
team PHX	1.905218
team NJ	1.906421
team COL	1.906798
team DET	1.907015
team VAN	1.907174
team TOR	1.907482
team PHI	1.907498
team EDM	1.907554
team LA	1.907632
team DAL	1.907805
team NYI	1.907912
team FLA	1.907948
team CBJ	1.908631
team MTL	1.908675
team VGK	1.908731
team SJ	1.90904
team OTT	1.909362
team CGY	1.909745
team WPG	1.909748
team ANA	1.910046
team MIN	1.910087
team NSH	1.910132
team CHI	1.910214
team PIT	1.910284
team NYR	1.91031
team BOS	1.910508
team TB	1.910586
team WSH	1.911001
team STL	1.911172

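The team-specific slopes above are nearly identical (about 1.904 to 1.911). The random-slope variance is estimated at essentially zero (CHOL(2,2) is 0, and SAS reports that the estimated G matrix is not positive definite), so the multilevel model cannot separate the teams on this effect. To break up the effect for each team, I will instead run the logistic regression by team.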

%%SAS sas_session
proc sort data=nhl.games;
 by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.teamestimates;
proc logistic data=nhl.games;
by team;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
ods trace off;

%%SAS sas_session
proc print data=nhl.teamestimates(obs=10);
run;

The SAS System


Obs	team	Variable	ClassVal0	DF	Estimate	StdErr	WaldChiSq	ProbChiSq	_ESTTYPE_
1	ANA	Intercept		1	-1.9796	0.4050	23.8903	<.0001	MLE
2	ANA	scoredfirst	1	1	1.5697	0.2389	43.1713	<.0001	MLE
3	ANA	scoredsecond	1	1	1.5974	0.2415	43.7494	<.0001	MLE
4	ANA	home	1	1	0.6988	0.2314	9.1181	0.0025	MLE
5	ANA	overtime	1	1	-0.5108	0.3259	2.4570	0.1170	MLE
6	ANA	period		1	0.3861	0.2362	2.6721	0.1021	MLE
7	ANA	penalty_cen		1	-0.00960	0.0528	0.0331	0.8556	MLE
8	BOS	Intercept		1	-2.4694	0.4521	29.8340	<.0001	MLE
9	BOS	scoredfirst	1	1	2.1861	0.2719	64.6617	<.0001	MLE
10	BOS	scoredsecond	1	1	1.8727	0.2724	47.2734	<.0001	MLE

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="scoredfirst"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Estimate	Pr > Chi-Square
scoredfirst	NYI	1.1333	<.0001
scoredfirst	VAN	1.3658	<.0001
scoredfirst	PIT	1.5193	<.0001
scoredfirst	ANA	1.5697	<.0001
scoredfirst	LA	1.6351	<.0001
scoredfirst	DET	1.6427	<.0001
scoredfirst	PHX	1.6749	<.0001
scoredfirst	CBJ	1.7355	<.0001
scoredfirst	EDM	1.7418	<.0001
scoredfirst	NSH	1.7765	<.0001
scoredfirst	DAL	1.8462	<.0001
scoredfirst	OTT	1.8750	<.0001
scoredfirst	PHI	1.8914	<.0001
scoredfirst	CGY	1.8929	<.0001
scoredfirst	BUF	1.9120	<.0001
scoredfirst	FLA	1.9808	<.0001
scoredfirst	MIN	2.0916	<.0001
scoredfirst	TOR	2.1204	<.0001
scoredfirst	CHI	2.1262	<.0001
scoredfirst	WPG	2.1340	<.0001
scoredfirst	STL	2.1541	<.0001
scoredfirst	SJ	2.1664	<.0001
scoredfirst	NJ	2.1830	<.0001
scoredfirst	BOS	2.1861	<.0001
scoredfirst	TB	2.1991	<.0001
scoredfirst	COL	2.2399	<.0001
scoredfirst	CAR	2.2597	<.0001
scoredfirst	NYR	2.2732	<.0001
scoredfirst	MTL	2.3390	<.0001
scoredfirst	VGK	2.3544	0.0002
scoredfirst	WSH	2.5618	<.0001

Surprisingly, the Vegas Golden Knights rank near the top here. The Knights were very good at winning after scoring first (about a 73% probability of winning). But maybe it isn’t so surprising: there is only one season to go on, and they made it to the Stanley Cup Final that year, meaning they won a lot of games.

Another consideration: many of the teams most likely to win after scoring first have had some of the best goalies in the league: Holtby, Fleury, Bishop/Vasilevskiy (for some of TB’s seasons), Price, and Lundqvist. A good goalie likely makes it even harder for the opponent to score the equalizer, let alone a go-ahead goal. The first goal also takes some pressure off the goalie. It probably also indicates more offensive-zone possession, meaning the team is keeping the puck away from its own goal. VGK had Fleury, who played phenomenally throughout the regular season and playoffs.

What about COL and CAR? They seem to make less sense. Going back to the section about downloading the data, I reviewed the players who scored or assisted on the first goal: RANTANEN (Colorado) and STAAL (Carolina) popped up in different seasons. So it could simply be that these lower-winning teams got most of their wins in games where they scored first.

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where team="VGK"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Estimate	Pr > Chi-Square
Intercept	VGK	-1.8942	0.0563
penalty_cen	VGK	-0.1949	0.3537
period	VGK	0.0616	0.9138
overtime	VGK	0.3167	0.6663
home	VGK	0.6651	0.2225
scoredsecond	VGK	1.3322	0.0316
scoredfirst	VGK	2.3544	0.0002

print("VGK Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+2.3544+.6651+.0616-.1949)*100))
print("VGK Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+1.3322+.6651+.0616-.1949)*100))

VGK Probability of winning after scoring first in the first period at home with about 5 penalties
72.95%
VGK Probability of winning after scoring second in the first period at home with about 5 penalties
49.25%
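For readers without the earlier helper at hand, the arithmetic above can be sketched as follows. This assumes convertToProb is the standard inverse logit. `logit_to_prob` and `win_prob` are hypothetical names used only for illustration. Setting penalty_cen=1 means one penalty above the centering value of 3.98446.

```python
import math

def logit_to_prob(log_odds):
    """Inverse logit: map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# VGK estimates copied from the table above.
vgk = {"Intercept": -1.8942, "scoredfirst": 2.3544, "scoredsecond": 1.3322,
       "home": 0.6651, "period": 0.0616, "penalty_cen": -0.1949}

def win_prob(coefs, **x):
    """Add intercept + coefficient * value for every predictor passed in."""
    log_odds = coefs["Intercept"] + sum(coefs[name] * value for name, value in x.items())
    return logit_to_prob(log_odds)

# Predictors left out (e.g. overtime) are implicitly held at 0.
first = win_prob(vgk, scoredfirst=1, home=1, period=1, penalty_cen=1)
second = win_prob(vgk, scoredsecond=1, home=1, period=1, penalty_cen=1)
print("scored first:  {0:.2f}%".format(first * 100))
print("scored second: {0:.2f}%".format(second * 100))
```

Keeping the coefficients in a dictionary makes it harder to drop or double-count a term than typing the sum by hand for each team.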

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="scoredsecond"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Estimate	Pr > Chi-Square
scoredsecond	VGK	1.3322	0.0316
scoredsecond	NYI	1.4828	<.0001
scoredsecond	MIN	1.4955	<.0001
scoredsecond	NSH	1.5128	<.0001
scoredsecond	PHX	1.5876	<.0001
scoredsecond	EDM	1.5884	<.0001
scoredsecond	ANA	1.5974	<.0001
scoredsecond	SJ	1.6608	<.0001
scoredsecond	PIT	1.7099	<.0001
scoredsecond	CBJ	1.7114	<.0001
scoredsecond	OTT	1.7161	<.0001
scoredsecond	FLA	1.7174	<.0001
scoredsecond	VAN	1.7430	<.0001
scoredsecond	DAL	1.8158	<.0001
scoredsecond	BOS	1.8727	<.0001
scoredsecond	DET	1.8870	<.0001
scoredsecond	BUF	1.8951	<.0001
scoredsecond	CHI	1.9125	<.0001
scoredsecond	NYR	1.9689	<.0001
scoredsecond	CAR	1.9849	<.0001
scoredsecond	WSH	2.0039	<.0001
scoredsecond	PHI	2.0239	<.0001
scoredsecond	STL	2.0241	<.0001
scoredsecond	COL	2.0283	<.0001
scoredsecond	CGY	2.1101	<.0001
scoredsecond	TB	2.1191	<.0001
scoredsecond	LA	2.1662	<.0001
scoredsecond	TOR	2.1773	<.0001
scoredsecond	WPG	2.1887	<.0001
scoredsecond	MTL	2.2306	<.0001
scoredsecond	NJ	2.2486	<.0001

Interestingly, the ordering for scoring second is shuffled a bit. VGK is now at the bottom, though again this is hard to interpret because it rests on their one great season. NJ was good at sealing the deal when they got the second goal, even without the first. Pittsburgh seems to be the biggest outlier: their estimates are relatively low for both the first and the second goal, although they are still over 50% to win after scoring first. The other big factor for them seems to be home-ice advantage, along with holding on to the win even when the goal came in the first period.

Some other things that I’m not controlling for could be influencing the result: a team’s even-strength goals, penalty-killing percentage, and so on.

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where team="PIT"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Estimate	Pr > Chi-Square
Intercept	PIT	-2.0994	<.0001
overtime	PIT	-0.0818	0.8050
penalty_cen	PIT	-0.0151	0.7778
period	PIT	0.2954	0.1847
home	PIT	0.9300	<.0001
scoredfirst	PIT	1.5193	<.0001
scoredsecond	PIT	1.7099	<.0001

print("Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.5193+.9300+.2954-.0151)*100))
print("Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.7099+.9300+.2954-.0151)*100))

Probability of winning after scoring first in the first period at home with about 5 penalties
65.25%
Probability of winning after scoring second in the first period at home with about 5 penalties
69.44%

Looking at home-ice advantage while controlling for overtime and who scores first is interesting. For some teams the effect is negative or not significant, while for others, like PIT, it is large. Some of the other top teams are not so surprising: Toronto and Philadelphia (if I were a ref, I’d be intimidated in Philadelphia too). Tampa Bay is surprising to me, given the other teams in the top spots. They certainly have a great fan base (I’m biased), but one of the problems is out-of-state fans coming in. It isn’t unusual to see more red sweaters than blue sweaters when, say, the Blackhawks come through. I wonder, though, whether the ice has something to do with this. It can be hot and humid in Tampa, and the quality of the ice may be a factor, although it is generally kept at the right temperature. Working against that argument, PHX is hot too but doesn’t rank high in the home-ice effect. Ignoring significance, PHX has a larger effect than BOS, and I would have expected the opposite.

Perhaps more importantly, the top teams are also the teams that simply won more games over the seasons analyzed, and so would have more home wins.

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, ProbChiSQ 
      from nhl.teamestimates
      where Variable="home"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Estimate	Pr > Chi-Square
home	MTL	-0.0638	0.7957
home	NYI	-0.0379	0.8621
home	DET	0.0381	0.8691
home	SJ	0.0876	0.7082
home	VAN	0.0998	0.6632
home	COL	0.1336	0.5819
home	CGY	0.1383	0.5713
home	MIN	0.1708	0.4756
home	CHI	0.1838	0.4493
home	BOS	0.1872	0.4215
home	OTT	0.1959	0.3941
home	CBJ	0.2357	0.3013
home	BUF	0.2436	0.3206
home	NYR	0.2618	0.2845
home	WPG	0.2762	0.2346
home	STL	0.3251	0.1768
home	FLA	0.3905	0.0989
home	PHX	0.3960	0.0867
home	EDM	0.4256	0.0643
home	CAR	0.4366	0.0673
home	NJ	0.4426	0.0706
home	LA	0.4487	0.0591
home	NSH	0.4628	0.0445
home	DAL	0.4668	0.0447
home	WSH	0.5049	0.0427
home	VGK	0.6651	0.2225
home	ANA	0.6988	0.0025
home	PHI	0.7184	0.0026
home	TB	0.8152	0.0010
home	TOR	0.8890	0.0006
home	PIT	0.9300	<.0001

One final look at a different component of home-ice advantage is suggested in Soccernomics: that the home effect has nothing to do with time changes, jet lag, or the players, and instead has something to do with the fans and the referees. To look at that, I will model the relationship between penalties and playing at home. This is just a simple model and, as you will see in the results, it doesn’t explain the variation in penalty count very well. Still, it can tell us the overall difference in penalties between the home team and the away team.

%%SAS sas_session
proc reg data=nhl.gamesrand;

model penaltycount= home won /;
run; quit;

The SAS System

The REG Procedure

Model: MODEL1

Dependent Variable: penaltycount


Number of Observations Read	6178
Number of Observations Used	6178


Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	326.99322	163.49661	37.78	<.0001
Error	6175	26721	4.32737
Corrected Total	6177	27048


Root MSE	2.08023	R-Square	0.0121
Dependent Mean	3.99077	Adj R-Sq	0.0118
Coeff Var	52.12603


Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	1	4.26670	0.04470	95.45	<.0001
home	1	-0.43584	0.05316	-8.20	<.0001
won	1	-0.11240	0.05316	-2.11	0.0345

[Figure: panel of heat maps of residuals by regressors for penaltycount]

R-squared is quite small (0.0121), but then again I’ve had models so bad that R-squared came out negative (that shouldn’t be possible for a least-squares fit with an intercept, so it was a really bad model, built on Twitter data if that explains anything). Still, the model is significant. The intercept says you would expect an away team that loses to take 4.26670 penalties per game, which is close to the overall mean of 3.99077 calculated without controlling for home ice. Somewhat obviously, there is a negative effect for the team that wins: the fewer penalties you take, the more likely you are to win, which was also shown above.

When home ice is accounted for, the home team takes 0.43584 fewer penalties per game. That shows slight favoritism towards the home team. Let’s bootstrap it again for a better estimate.
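To make the coefficients concrete, here is a small sketch that plugs the PROC REG estimates into the fitted equation for the four home/won combinations. The function name `expected_penalties` is only illustrative.

```python
# Coefficients copied from the PROC REG parameter estimates above.
INTERCEPT = 4.26670
HOME = -0.43584
WON = -0.11240

def expected_penalties(home, won):
    """Fitted penalty count for a team; home and won are 0 or 1."""
    return INTERCEPT + HOME * home + WON * won

for home in (0, 1):
    for won in (0, 1):
        label = "{0}, {1}".format("home" if home else "away", "won" if won else "lost")
        print("{0:<10} {1:.3f}".format(label, expected_penalties(home, won)))
```

The intercept on its own is the away team that lost. The home and won terms simply shift that baseline down, so the home winner ends up a little over half a penalty below it.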

%%SAS sas_session
proc sort data=nhl.gamessamp;
 by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.penestimates;
proc reg data=nhl.gamessamp;
by replicate;
model penaltycount= home won/;
run; quit;
ods trace off;

%%SAS sas_session
proc sql;
    select Variable, mean(Estimate) as MeanEst
      from nhl.penestimates
      group by variable;
run;
quit;
%histograms(curvar='home')

The SAS System


Variable	MeanEst
Intercept	4.156299
home	-0.29919
won	-0.04627

The SAS System

The original sample gave a larger home effect (-0.43584) than the bootstrapped mean (-0.29919). Still, there is a significant negative effect on the number of penalties a team is called for if they are the home team. The next question is whether this varies by team. Do certain teams have more of a home-ice advantage than others in terms of the penalties called?
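The bootstrap idea itself is easy to sketch outside SAS. The example below is a simplified illustration, not the procedure used above: it resamples rows with replacement and recomputes the home effect each time. It drops the won term, so the effect reduces to a difference in means, which is what the least-squares slope equals when the only regressor is a 0/1 variable. The data are made up for the example and are not the NHL data.

```python
import random
import statistics

def home_effect(rows):
    """Mean penalties at home minus mean penalties away."""
    home = [penalties for is_home, penalties in rows if is_home == 1]
    away = [penalties for is_home, penalties in rows if is_home == 0]
    return statistics.mean(home) - statistics.mean(away)

def bootstrap(rows, n_reps=500, seed=0):
    """Return the mean bootstrap estimate and a 95% percentile interval."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        # Draw a replicate the same size as the data, with replacement.
        sample = [rng.choice(rows) for _ in rows]
        estimates.append(home_effect(sample))
    estimates.sort()
    lower = estimates[int(0.025 * n_reps)]
    upper = estimates[int(0.975 * n_reps) - 1]
    return statistics.mean(estimates), (lower, upper)

# Made-up (home, penaltycount) rows purely for illustration, NOT the NHL data.
data_rng = random.Random(42)
toy_rows = [(h, max(0, round(data_rng.gauss(4.3 - 0.4 * h, 2))))
            for h in (0, 1) for _ in range(300)]

mean_est, (lower, upper) = bootstrap(toy_rows)
print("bootstrap mean: {0:.3f}, 95% interval: ({1:.3f}, {2:.3f})".format(mean_est, lower, upper))
```

Looking at the percentile interval as well as the mean of the replicates shows how much the estimate moves from sample to sample, which the single averaged value does not.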

%%SAS sas_session
proc sort data=nhl.games;
 by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.penteamestimates;
proc reg data=nhl.games;
by team;
model penaltycount= home won/;
run; quit;
ods trace off;

%%SAS sas_session
proc sql;
    select Variable, Team, Estimate, Probt 
      from nhl.penteamestimates
      where Variable="home"
        order by Estimate;
run;
quit;

The SAS System


Variable	team	Parameter Estimate	Pr > \|t\|
home	CGY	-0.72226	0.0013
home	CHI	-0.62266	0.0004
home	VAN	-0.55893	0.0151
home	TOR	-0.48348	0.0213
home	NJ	-0.47302	0.0069
home	MTL	-0.46980	0.0228
home	PIT	-0.46902	0.0371
home	NYR	-0.42721	0.0142
home	WSH	-0.41880	0.0390
home	NSH	-0.40351	0.0507
home	VGK	-0.39992	0.1965
home	FLA	-0.39745	0.0721
home	MIN	-0.38571	0.0462
home	CAR	-0.36810	0.0254
home	OTT	-0.35791	0.1104
home	ANA	-0.33947	0.1173
home	BUF	-0.30478	0.1368
home	CBJ	-0.28451	0.1720
home	WPG	-0.26757	0.2471
home	DET	-0.25629	0.1476
home	BOS	-0.25514	0.2141
home	COL	-0.20313	0.3130
home	PHI	-0.12630	0.5781
home	EDM	-0.11030	0.5855
home	NYI	-0.08499	0.6757
home	PHX	-0.04440	0.8409
home	SJ	-0.04309	0.8336
home	DAL	-0.03506	0.8597
home	LA	-0.03000	0.8776
home	TB	0.01714	0.9290
home	STL	0.19078	0.4047

If you are a Pittsburgh rival and a Sidney Crosby conspiracist, you might see PIT’s top rank in home-ice advantage as proof of something nefarious. I’ll just leave it at that. Then there’s Calgary. Or Philadelphia, which no longer seems to have a home-ice advantage.

One question, though: do these teams have a home-ice advantage, or are they just more likely to take penalties on the road? Many of the teams that showed a home-ice advantage in terms of wins are not at the top of the list for penalties. In fact, for most teams the difference in penalties between home and away is not significant.

As with a lot of analysis, this just leads to more questions. Does the nationality of the refs come into play? Some of the top teams are Canadian (then again, three Canadian teams aren’t). Some of the teams are Original Six teams, some aren’t.

Summary

As I stated before, this is mostly for fun. It is a toy dataset that can be used to answer some questions. I would say that the first goal is important, but only because of the second goal. If a team scores both the first and second, the game is pretty much over. Home seems to provide, on average, some advantage, and on average the penalties are slightly lower for home teams than for away teams. This doesn’t hold up when looking at every team though.

What does it mean for strategy? Well, coming from someone who has never played, I don’t know. I do have some thoughts. One is that a team should try to score first. Is that helpful? Just go and score? Of course, if it were easy… Still, I think there may be ways to do that. Load up the minutes of your best scoring lines, including scoring defensemen, in the first part of the game. Or, if you know which player is the best at scoring the first goal, give them lots of ice time at the start.

Another strategy comes from the fact that the later in the game the first goal comes, the harder it is for the opposing team to win. An average team has about a 70% chance of winning if it scores the first goal in the 3rd period. So give your most defensive-minded players more minutes in the first period and part of the second. Really grind it out with the other team, with lots of hits, and tire them out. Make them cover the whole rink. Then switch and give more minutes to the best scorers, who will have fresh legs.

But…I’m not a coach and I don’t play hockey.

print("Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+3*0.342078+0.10468)*100))

Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties: 
70.94%
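To see how much the timing of the first goal matters, the same calculation can be repeated for each period. This is a sketch under stated assumptions. Which number belongs to which predictor is inferred from the expression above: 0.342078 is taken to be the period coefficient because it is multiplied by 3, and 0.10468 is taken to be the home coefficient. The constant and function names are my own.

```python
import math

# Pooled estimates copied from the expression above; the labels are assumptions.
INTERCEPT = -2.14334
SCORED_FIRST = 1.90498
PERIOD = 0.342078   # assumed: coefficient on the period of the first goal
HOME = 0.10468      # assumed: coefficient on playing at home

def win_prob_by_period(period):
    """Win probability for a home team that scores first in the given period."""
    log_odds = INTERCEPT + SCORED_FIRST + PERIOD * period + HOME
    return 1.0 / (1.0 + math.exp(-log_odds))

probs = {period: win_prob_by_period(period) for period in (1, 2, 3)}
for period, prob in probs.items():
    print("first goal in period {0}: {1:.2f}%".format(period, prob * 100))
```

Under these assumptions the probability climbs with each period, which is the pattern behind the grind-it-out strategy described earlier.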