Scoring First in the NHL
The data for this analysis were collected in another section; I separated the two to keep things a little shorter. I’m working with SAS University Edition, which includes a Jupyter Notebook and gives access to SAS either through Python or by using the %% magics to flag SAS code. I’ll primarily work with SAS code. Unfortunately, I can’t do all the steps in SAS University Edition, since I cannot install external libraries onto its virtual machine. If a library is pure Python, I could probably use it locally, but that is not the case with things like sklearn and matplotlib. I mostly wanted to use SAS for its multilevel modeling capabilities. Otherwise sklearn or statsmodels could be used.
I’ve stored the data as a csv file to load into SAS, but I’ll load it through a pandas data frame initially.
import saspy
import pandas as pd
import numpy as np
from IPython.display import HTML
import math
df = pd.read_csv(#rename if you need to
,index_col=0)
df['notwon'] = np.where(df['won']==1,0,1)
df['penalty_cen'] = df['penaltycount']-3.98446
print(df.head())
print(df.describe())
print(df['team'].unique())
#to fix the arizona team (DataFrame.set_value was removed in pandas 1.0, so use .loc)
#df.loc[df['team']=='ARI','team'] = 'PHX'
Create a SAS session and load the data frame into SAS for analysis.
sas_session = saspy.SASsession()
sas_session.saslib("nhl",path="/folders/myfolders/nhl/Update20180709")
sas_session.teach_me_SAS(False)
games = sas_session.df2sd(df,table='games',libref='nhl')
print(games.describe())
games.head()
Using SAS Config named: default
SAS Connection established. Subprocess id is 5436
27
28 libname nhl '/folders/myfolders/nhl/Update20180709' ;
NOTE: Libref NHL was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders/nhl/Update20180709
29
30
Variable N NMiss Median Mean StdDev \
0 gamekey 12356 0 2.016290e+07 5.380185e+07 7.256136e+07
1 goals 12356 0 3.000000e+00 2.785529e+00 1.621079e+00
2 home 12356 0 5.000000e-01 5.000000e-01 5.000202e-01
3 overtime 12356 0 0.000000e+00 1.317579e-01 3.382410e-01
4 penaltycount 12356 0 4.000000e+00 3.984461e+00 2.089983e+00
5 period 12356 0 1.000000e+00 1.248786e+00 5.256075e-01
6 scoredfirst 12356 0 5.000000e-01 5.000000e-01 5.000202e-01
7 scoredsecond 12356 0 0.000000e+00 4.876174e-01 4.998669e-01
8 won 12356 0 5.000000e-01 5.000000e-01 5.000202e-01
9 notwon 12356 0 5.000000e-01 5.000000e-01 5.000202e-01
10 penalty_cen 12356 0 1.554000e-02 9.906119e-07 2.089983e+00
Min P25 P50 P75 Max
0 201421.00000 2.015226e+07 2.016290e+07 2.018265e+07 2.018213e+08
1 0.00000 2.000000e+00 3.000000e+00 4.000000e+00 1.000000e+01
2 0.00000 0.000000e+00 5.000000e-01 1.000000e+00 1.000000e+00
3 0.00000 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
4 0.00000 3.000000e+00 4.000000e+00 5.000000e+00 2.000000e+01
5 1.00000 1.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00
6 0.00000 0.000000e+00 5.000000e-01 1.000000e+00 1.000000e+00
7 0.00000 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00
8 0.00000 0.000000e+00 5.000000e-01 1.000000e+00 1.000000e+00
9 0.00000 0.000000e+00 5.000000e-01 1.000000e+00 1.000000e+00
10 -3.98446 -9.844600e-01 1.554000e-02 1.015540e+00 1.601554e+01
| | gamekey | goals | home | overtime | penaltycount | period | scoredfirst | scoredsecond | team | won | notwon | penalty_cen |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201421 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.01554 |
1 | 201421 | 4 | 0 | 0 | 12 | 1 | 1 | 0 | TOR | 1 | 0 | 8.01554 |
2 | 201422 | 6 | 1 | 0 | 6 | 1 | 1 | 0 | CHI | 1 | 0 | 2.01554 |
3 | 201422 | 4 | 0 | 0 | 4 | 1 | 0 | 1 | WSH | 0 | 1 | 0.01554 |
4 | 201423 | 4 | 1 | 0 | 6 | 1 | 1 | 0 | EDM | 0 | 1 | 2.01554 |
One problem with the dataset is that the observations are not really independent. The value of won for one team depends on who won the game: once you know which team won, you also know which team lost. I won’t go into the details here, but I experimented with a couple of ways to approach this problem. For one, I looked at a multilevel approach where games were nested within teams, a sort of cross-classification problem. This didn’t work out that well because of the small number of observations per gamekey. The procedure often had trouble estimating the covariance matrix, although it would converge.
Instead I chose the alternative route of sampling one team from each game: stratified random sampling, with gamekey as the stratum.
Regardless of the approach, the estimated parameters were usually close. Here is an example of a random sample and the full model.
def convertToProb(logit):
    odds = math.exp(logit)
    return odds/(1.0+odds)
%%SAS sas_session
proc sort data=nhl.games;
by gamekey;
run;
Proc SurveySelect data=nhl.games out=nhl.gamesrand noprint Method = urs N = 1 outhits
rep = 1;
Strata gamekey;
run;
proc freq data=nhl.gamesrand;
table won;
run;
proc print data=nhl.gamesrand(obs=10);
run;
The FREQ Procedure
won | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 3093 | 50.06 | 3093 | 50.06 |
1 | 3085 | 49.94 | 6178 | 100.00 |
Obs | gamekey | Replicate | goals | home | overtime | penaltycount | period | scoredfirst | scoredsecond | team | won | notwon | penalty_cen | NumberHits | ExpectedHits | SamplingWeight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 201421 | 1 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
2 | 201422 | 1 | 6 | 1 | 0 | 6 | 1 | 1 | 0 | CHI | 1 | 0 | 2.0155 | 1 | 0.5 | 2 |
3 | 201423 | 1 | 4 | 1 | 0 | 6 | 1 | 1 | 0 | EDM | 0 | 1 | 2.0155 | 1 | 0.5 | 2 |
4 | 201424 | 1 | 1 | 1 | 0 | 4 | 1 | 1 | 0 | PHI | 0 | 1 | 0.0155 | 1 | 0.5 | 2 |
5 | 201425 | 1 | 2 | 1 | 0 | 7 | 1 | 1 | 1 | DET | 1 | 0 | 3.0155 | 1 | 0.5 | 2 |
6 | 201426 | 1 | 6 | 1 | 0 | 8 | 1 | 1 | 1 | COL | 1 | 0 | 4.0155 | 1 | 0.5 | 2 |
7 | 201427 | 1 | 3 | 1 | 0 | 7 | 1 | 1 | 0 | BOS | 1 | 0 | 3.0155 | 1 | 0.5 | 2 |
8 | 201428 | 1 | 3 | 1 | 0 | 3 | 1 | 1 | 1 | PIT | 1 | 0 | -0.9845 | 1 | 0.5 | 2 |
9 | 201429 | 1 | 4 | 0 | 0 | 5 | 1 | 1 | 1 | CGY | 0 | 1 | 1.0155 | 1 | 0.5 | 2 |
10 | 201521 | 1 | 3 | 1 | 0 | 2 | 1 | 0 | 1 | TOR | 0 | 1 | -1.9845 | 1 | 0.5 | 2 |
This means that about 50% of the sample are wins, as expected when each game contributes one of its two teams with a 50/50 chance.
Looking solely at the scoring first variable, we get an estimate of the effect that scoring first has on winning. This model does a better job of predicting than the intercept-only model. The odds ratio is pretty high at 4.6 (versus not scoring first). I find plugging in the estimates to be a useful exercise, and you can see this below. Not surprisingly, the predicted probability is close to the proportion of wins where the winner scored first, at about 68%. This will change when we start to control for other factors. Next I will fit the full model.
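The one-team-per-game draw can also be sketched in plain Python. This is only an illustration of the idea, not the SURVEYSELECT step itself: the function name `sample_one_per_game` and the `seed` argument are hypothetical, and the toy rows just mimic the first two games shown in the head() output.

```python
import random
from collections import defaultdict

def sample_one_per_game(rows, seed=None):
    """Pick one row (team) uniformly at random from each gamekey stratum."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    by_game = defaultdict(list)
    for row in rows:
        by_game[row["gamekey"]].append(row)
    # One draw per stratum, so each team in a game has a 50/50 chance.
    return [rng.choice(teams) for _, teams in sorted(by_game.items())]

# Toy rows shaped like the first two games in the data (illustrative only).
rows = [
    {"gamekey": 201421, "team": "MTL", "won": 0},
    {"gamekey": 201421, "team": "TOR", "won": 1},
    {"gamekey": 201422, "team": "CHI", "won": 1},
    {"gamekey": 201422, "team": "WSH", "won": 0},
]
sample = sample_one_per_game(rows, seed=42)
print([(r["gamekey"], r["team"]) for r in sample])
```

Passing a seed is the design choice worth noting: it makes a rerun reproduce the same sample, so the estimates do not shift between runs.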
%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') /param=ref;
model won(event='1')= scoredfirst/;
run; quit;
The LOGISTIC Procedure
Model Information | |
---|---|
Data Set | NHL.GAMESRAND |
Response Variable | won |
Number of Response Levels | 2 |
Model | binary logit |
Optimization Technique | Fisher's scoring |
Number of Observations Read | 6178 |
---|---|
Number of Observations Used | 6178 |
Response Profile | ||
---|---|---|
Ordered Value | won | Total Frequency |
1 | 0 | 3071 |
2 | 1 | 3107 |
Probability modeled is won=1.
Class Level Information | ||
---|---|---|
Class | Value | Design Variables |
scoredfirst | 0 | 0 |
| | 1 | 1 |
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 8566.317 | 7728.093 |
SC | 8573.046 | 7741.551 |
-2 Log L | 8564.317 | 7724.093 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 840.2236 | 1 | <.0001 |
Score | 820.9889 | 1 | <.0001 |
Wald | 782.1602 | 1 | <.0001 |
Type 3 Analysis of Effects | |||
---|---|---|---|
Effect | DF | Wald Chi-Square | Pr > ChiSq |
scoredfirst | 1 | 782.1602 | <.0001 |
Analysis of Maximum Likelihood Estimates | ||||||
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -0.7489 | 0.0385 | 378.5074 | <.0001 |
scoredfirst | 1 | 1 | 1.5285 | 0.0547 | 782.1602 | <.0001 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
scoredfirst 1 vs 0 | 4.611 | 4.143 | 5.133 |
Association of Predicted Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 46.5 | Somers' D | 0.365 |
Percent Discordant | 10.1 | Gamma | 0.644 |
Percent Tied | 43.4 | Tau-a | 0.182 |
Pairs | 9541597 | c | 0.682 |
print("Probability of winning when scoring first: ")
print("{0:.2f}%".format(convertToProb(-.7489+1.5285)*100))
Probability of winning when scoring first:
68.56%
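As a cross-check on the output above, the odds ratio and both predicted probabilities can be recovered from the two coefficients. This is a minimal sketch using the estimates printed by PROC LOGISTIC; the helper name `logit_to_prob` is my own and does the same job as convertToProb.

```python
import math

def logit_to_prob(logit):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# Estimates copied from the single-predictor model output.
intercept = -0.7489
b_scoredfirst = 1.5285

odds_ratio = math.exp(b_scoredfirst)                # compare with the 4.611 reported by SAS
p_not_first = logit_to_prob(intercept)              # team did not score first
p_first = logit_to_prob(intercept + b_scoredfirst)  # team scored first

print("Odds ratio: {0:.3f}".format(odds_ratio))
print("P(win | did not score first): {0:.2f}%".format(p_not_first * 100))
print("P(win | scored first): {0:.2f}%".format(p_first * 100))
```

The odds ratio is just the exponentiated coefficient, which is why it stays the same no matter where the intercept sits, while the probabilities themselves do not.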
The full model certainly performs better than the intercept-only model, and it converged properly. Overtime is the only non-significant factor, which suggests that one of the other variables may carry the same information as overtime. That could be the period scored. It may be worth looking later at which added variables overlap with overtime. As expected, our scoredfirst variable is important and significant. Penalties were centered around the mean of approximately 4 penalties (3.9…), and so a 0 would indicate about 4 penalties.
%%SAS sas_session
proc logistic data=nhl.gamesrand;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
The LOGISTIC Procedure
Model Information | |
---|---|
Data Set | NHL.GAMESRAND |
Response Variable | won |
Number of Response Levels | 2 |
Model | binary logit |
Optimization Technique | Fisher's scoring |
Number of Observations Read | 6178 |
---|---|
Number of Observations Used | 6178 |
Response Profile | ||
---|---|---|
Ordered Value | won | Total Frequency |
1 | 0 | 3071 |
2 | 1 | 3107 |
Probability modeled is won=1.
Class Level Information | ||
---|---|---|
Class | Value | Design Variables |
scoredfirst | 0 | 0 |
| | 1 | 1 |
scoredsecond | 0 | 0 |
| | 1 | 1 |
home | 0 | 0 |
| | 1 | 1 |
overtime | 0 | 0 |
| | 1 | 1 |
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 8566.317 | 6727.321 |
SC | 8573.046 | 6774.422 |
-2 Log L | 8564.317 | 6713.321 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 1850.9960 | 6 | <.0001 |
Score | 1648.0410 | 6 | <.0001 |
Wald | 1235.9008 | 6 | <.0001 |
Type 3 Analysis of Effects | |||
---|---|---|---|
Effect | DF | Wald Chi-Square | Pr > ChiSq |
scoredfirst | 1 | 870.7582 | <.0001 |
scoredsecond | 1 | 814.4736 | <.0001 |
home | 1 | 29.2964 | <.0001 |
overtime | 1 | 1.4692 | 0.2255 |
period | 1 | 9.4676 | 0.0021 |
penalty_cen | 1 | 10.0272 | 0.0015 |
Analysis of Maximum Likelihood Estimates | ||||||
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -2.2164 | 0.1022 | 470.7539 | <.0001 |
scoredfirst | 1 | 1 | 1.9117 | 0.0648 | 870.7582 | <.0001 |
scoredsecond | 1 | 1 | 1.8564 | 0.0650 | 814.4736 | <.0001 |
home | 1 | 1 | 0.3241 | 0.0599 | 29.2964 | <.0001 |
overtime | 1 | 1 | -0.1016 | 0.0838 | 1.4692 | 0.2255 |
period | | 1 | 0.1746 | 0.0567 | 9.4676 | 0.0021 |
penalty_cen | | 1 | -0.0463 | 0.0146 | 10.0272 | 0.0015 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
scoredfirst 1 vs 0 | 6.764 | 5.958 | 7.680 |
scoredsecond 1 vs 0 | 6.401 | 5.635 | 7.271 |
home 1 vs 0 | 1.383 | 1.230 | 1.555 |
overtime 1 vs 0 | 0.903 | 0.767 | 1.065 |
period | 1.191 | 1.065 | 1.331 |
penalty_cen | 0.955 | 0.928 | 0.983 |
Association of Predicted Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 78.6 | Somers' D | 0.580 |
Percent Discordant | 20.7 | Gamma | 0.584 |
Percent Tied | 0.7 | Tau-a | 0.290 |
Pairs | 9541597 | c | 0.790 |
What is the probability of winning a game given these factors?
It can be broken up in many different ways.
Probability of winning all else being zero: 9.83%
Probability of winning when scoring first in the first period, not at home, no overtime, and ~4 penalties: 46.75%
Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 54.83%
Probability of winning when scoring first in the second period, not at home, no overtime, and ~4 penalties: 51.11%
Probability of winning when scoring first in the first period, scoring second, not at home, no overtime, and ~4 penalties: 84.89%
Probability of winning when scoring first in the second period, scoring second, not at home, no overtime, and ~4 penalties: 87.00%
print("Probability of winning all else being zero: ")
print("{0:.2f}%".format(convertToProb(-2.2164)*100))
print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+0.1746)*100))
print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+.3241+0.1746)*100))
print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+2*0.1746)*100))
print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+0.1746)*100))
print("Probability of winning when scoring first and second not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.2164+1.9117+1.8564+2*0.1746)*100))
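Typing the sums by hand makes it easy for a label and its calculation to drift apart. One way to guard against that is to keep the estimates in a dictionary and name each scenario explicitly. The sketch below uses the full-model estimates from the output above; the function `win_probability` and the `COEF` dictionary are hypothetical helpers, not part of the original notebook.

```python
import math

# Full-model estimates copied from the PROC LOGISTIC output.
COEF = {
    "Intercept": -2.2164,
    "scoredfirst": 1.9117,
    "scoredsecond": 1.8564,
    "home": 0.3241,
    "overtime": -0.1016,
    "period": 0.1746,
    "penalty_cen": -0.0463,
}

def win_probability(coef, **scenario):
    """Sum intercept + coefficient * value for each named factor, then invert the logit.

    Factors left out of the scenario are treated as 0, which for
    penalty_cen means the average of about 4 penalties.
    """
    logit = coef["Intercept"] + sum(coef[name] * value for name, value in scenario.items())
    return 1.0 / (1.0 + math.exp(-logit))

away = win_probability(COEF, scoredfirst=1, period=1)
home = win_probability(COEF, scoredfirst=1, home=1, period=1)
print("Scored first, away, first period: {0:.2f}%".format(away * 100))
print("Scored first, home, first period: {0:.2f}%".format(home * 100))
```

Because a misspelled factor name raises a KeyError, a scenario cannot silently drop or add a term.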
So when you control for many of the other factors, scoring first is no guarantee of winning. Having a home game and scoring first gives you a slightly better than even chance. It really isn’t until you are also the team that scores second that a win becomes more than likely. Again, that makes sense because it is all the harder to come back from two goals down than from just one.
Ultimately, it is the team that scores the first two goals that tends to go on to win. So while scoring first is important, you absolutely need to step on the gas and get that second goal.
But this is just one sample from the population of games from 2013-2014 to 2017-2018. Could this just have been a fluke of the sample that was drawn? In fact, because I didn’t set the seed, if you rerun this code you will probably get slightly different estimates.
Let’s bootstrap the process and look at the sampling distribution of the estimates to get a better picture of the possible win probabilities. In SAS this can be done efficiently by using PROC SURVEYSELECT to draw the replicates and then the BY statement in PROC LOGISTIC.
%%SAS sas_session
proc sort data=nhl.games;
by gamekey;
run;
proc surveyselect data=nhl.games
out=nhl.gamessamp
noprint Method = urs N = 1 outhits
reps=100; /* generate this many bootstrap resamples */
Strata gamekey;
run;
proc print data=nhl.gamessamp(obs=10);
run;
Obs | gamekey | Replicate | goals | home | overtime | penaltycount | period | scoredfirst | scoredsecond | team | won | notwon | penalty_cen | NumberHits | ExpectedHits | SamplingWeight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 201421 | 1 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
2 | 201421 | 2 | 4 | 0 | 0 | 12 | 1 | 1 | 0 | TOR | 1 | 0 | 8.0155 | 1 | 0.5 | 2 |
3 | 201421 | 3 | 4 | 0 | 0 | 12 | 1 | 1 | 0 | TOR | 1 | 0 | 8.0155 | 1 | 0.5 | 2 |
4 | 201421 | 4 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
5 | 201421 | 5 | 4 | 0 | 0 | 12 | 1 | 1 | 0 | TOR | 1 | 0 | 8.0155 | 1 | 0.5 | 2 |
6 | 201421 | 6 | 4 | 0 | 0 | 12 | 1 | 1 | 0 | TOR | 1 | 0 | 8.0155 | 1 | 0.5 | 2 |
7 | 201421 | 7 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
8 | 201421 | 8 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
9 | 201421 | 9 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
10 | 201421 | 10 | 3 | 1 | 0 | 14 | 1 | 0 | 1 | MTL | 0 | 1 | 10.0155 | 1 | 0.5 | 2 |
%%SAS sas_session
proc sort data=nhl.gamessamp;
by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.estimates;
proc logistic data=nhl.gamessamp;
by replicate;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0')overtime(ref='0')/param=ref;
model won(event='1')= scoredfirst scoredsecond home overtime period penalty_cen/;
run; quit;
ods trace off;
I cleared the outputs so you don’t have to see each iteration of the logistic step. Instead we can summarize the estimates visually or by the mean of their sampling distribution.
%%SAS sas_session
proc sql;
select Variable, mean(Estimate) as MeanEst
from nhl.estimates
group by variable;
quit;
Variable | MeanEst |
---|---|
Intercept | -2.14334 |
home | 0.342078 |
overtime | -0.01201 |
penalty_cen | -0.02247 |
period | 0.10468 |
scoredfirst | 1.90498 |
scoredsecond | 1.836549 |
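Alongside the mean, a percentile interval gives a feel for how much an estimate moves from one replicate to the next. The sketch below is only an illustration: `percentile_interval` is a hypothetical helper, and the `toy` list holds made-up placeholder values, not the actual rows of nhl.estimates. In practice the per-replicate Estimate values for one Variable would be passed in instead.

```python
import statistics

def percentile_interval(values, lower=2.5, upper=97.5):
    """Return the (lower, upper) percentiles using linear interpolation."""
    ordered = sorted(values)

    def pct(p):
        pos = (len(ordered) - 1) * p / 100.0
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return pct(lower), pct(upper)

# Made-up placeholder values standing in for one coefficient's replicates.
toy = [1.82, 1.86, 1.88, 1.90, 1.91, 1.93, 1.97]
mean_est = statistics.mean(toy)
low, high = percentile_interval(toy)
print("mean={0:.4f}, 95% percentile interval=({1:.4f}, {2:.4f})".format(mean_est, low, high))
```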
%%SAS sas_session
%MACRO histograms(curvar=);
%let titlevar = &curvar;
proc template;
define statgraph Histogram;
begingraph;
entrytitle "Histogram of &titlevar ";
layout lattice/columns=1;
layout overlay;
histogram Estimate;
endlayout;
layout overlay;
boxplot y=Estimate / orient=horizontal;
endlayout;
endlayout;
endgraph;
end;
run;
proc sgrender data=nhl.estimates(where=(variable=&curvar)) template = Histogram;
run;
%MEND histograms;
%histograms(curvar='scoredfirst')
%histograms(curvar='scoredsecond')
%histograms(curvar='home')
As expected, the sampling distributions look approximately normal. The means are not far off from the original estimates, so the probabilities can be recalculated with them.
Variable | Value |
---|---|
Intercept | -2.14334 |
home | 0.342078 |
overtime | -0.01201 |
penalty_cen | -0.02247 |
period | 0.10468 |
scoredfirst | 1.90498 |
scoredsecond | 1.836549 |
Probability of winning at home before the first period 13.14%
Probability of winning not at home before the first period 9.70%
Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: 46.66%
Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: 55.19%
Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: 49.28%
Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: 84.59%
Probability of winning when scoring first and second at home, no overtime, and ~4 penalties: 88.54%
print("Probability of winning at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+ 0.342078+(3.9*-0.02247))*100))
print("Probability of winning not at home before the first period")
print("{0:.2f}%".format(convertToProb(-2.14334+(3.9*-0.02247))*100))
print("Probability of winning when scoring first not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.10468)*100))
print("Probability of winning when scoring first at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+0.10468)*100))
print("Probability of winning when scoring first not at home, in the second period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+2*0.10468)*100))
print("Probability of winning when scoring first and second not at home, in the first period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+1.836549+0.10468)*100))
print("Probability of winning when scoring first and second at home, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+0.342078+1.836549+0.10468)*100))
Something else to consider in terms of independence is the relationship between games and teams. Wins might be considered nested within a team: a team playing well may string together more wins in a row, or something along those lines. We could examine this by looking at the intraclass correlation.
%%SAS sas_session
proc nlmixed data=nhl.gamesrand;
ods output ParameterEstimates=p1;
parms gamma00 = -1 tau00=1;
gamma0j = gamma00 + u0j;
eta = gamma0j;
expEta = exp(eta);
pij = expEta/(1+expEta);
model scoredfirst ~ binary(pij);
random u0j ~ normal([0],[tau00]) subject=team OUT=randout;
estimate 'ICC' tau00/(tau00+3.29);
run;
The NLMIXED Procedure
Specifications | |
---|---|
Data Set | NHL.GAMESRAND |
Dependent Variable | scoredfirst |
Distribution for Dependent Variable | Binary |
Random Effects | u0j |
Distribution for Random Effects | Normal |
Subject Variable | team |
Optimization Technique | Dual Quasi-Newton |
Integration Method | Adaptive Gaussian Quadrature |
Dimensions | |
---|---|
Observations Used | 6178 |
Observations Not Used | 0 |
Total Observations | 6178 |
Subjects | 31 |
Max Obs per Subject | 238 |
Parameters | 2 |
Quadrature Points | 1 |
Initial Parameters | ||
---|---|---|
gamma00 | tau00 | Negative Log Likelihood |
-1 | 1 | 4329.25972 |
Iteration History | |||||
---|---|---|---|---|---|
Iteration | Calls | Negative Log Likelihood | Difference | Maximum Gradient | Slope |
1 | 6 | 4314.1825 | 15.07724 | 14.4613 | -151.871 |
2 | 16 | 4302.8614 | 11.32104 | 41.4893 | -34.2401 |
3 | 22 | 4302.2958 | 0.565654 | 53.4288 | -60.7572 |
4 | 29 | 4298.0747 | 4.22112 | 184.088 | -20.6365 |
5 | 36 | 4280.8413 | 17.23342 | 415.583 | -32.7086 |
6 | 40 | 4278.4399 | 2.401311 | 36.7201 | -156.999 |
7 | 43 | 4278.3651 | 0.074847 | 34.0781 | -0.19067 |
8 | 45 | 4278.2916 | 0.073493 | 8.94353 | -0.09359 |
9 | 48 | 4278.2854 | 0.006165 | 2.81772 | -0.01572 |
10 | 51 | 4278.2849 | 0.0005 | 0.078424 | -0.00096 |
11 | 54 | 4278.2849 | 9.217E-7 | 0.002819 | -1.92E-6 |
NOTE: GCONV convergence criterion satisfied. |
Fit Statistics | |
---|---|
-2 Log Likelihood | 8556.6 |
AIC (smaller is better) | 8560.6 |
AICC (smaller is better) | 8560.6 |
BIC (smaller is better) | 8563.4 |
Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | Estimate | Standard Error | DF | t Value | Pr > |t| | 95% Confidence Limits | | Gradient |
gamma00 | -0.00313 | 0.03522 | 30 | -0.09 | 0.9298 | -0.07505 | 0.06879 | -0.00232 |
tau00 | 0.01792 | 0.009702 | 30 | 1.85 | 0.0747 | -0.00190 | 0.03773 | 0.002819 |
Additional Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Label | Estimate | Standard Error | DF | t Value | Pr > |t| | Alpha | Lower | Upper |
ICC | 0.005416 | 0.002917 | 30 | 1.86 | 0.0732 | 0.05 | -0.00054 | 0.01137 |
The ICC is about 0.005, which is extremely low and indicates that outcomes in this sample are barely affected by the grouping of teams. Note that the NLMIXED step above uses scoredfirst as the response, so strictly speaking this ICC describes how much of the variation in scoring first is attributable to team. We could still use a multilevel model to examine the random effect associated with each team, and that is shown below; its fixed-effect estimates are very similar to what we were seeing with the bootstrapped model.
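The 3.29 in the ESTIMATE statement is the variance of the standard logistic distribution, π²/3, which serves as the level-1 residual variance on the latent scale. A minimal sketch of the same calculation, using the tau00 estimate from the NLMIXED output (the function name `latent_icc` is my own):

```python
import math

def latent_icc(tau00):
    """ICC for a random-intercept logistic model on the latent scale."""
    residual_variance = math.pi ** 2 / 3  # about 3.29 for the logit link
    return tau00 / (tau00 + residual_variance)

# Between-team intercept variance reported by PROC NLMIXED.
icc = latent_icc(0.01792)
print("ICC: {0:.4f}".format(icc))
```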
Were any teams better at scoring first? To answer that I’ll use a different approach.
%%SAS sas_session
proc glimmix data=nhl.gamesrand method=quad;
class team;
ODS OUTPUT SOLUTIONR=R;
model won(event='1')= scoredfirst scoredsecond home overtime period penaltycount/dist=binary solution oddsratio ddfm=bw;
nloptions maxiter=10000;
random intercept scoredfirst /subject=team type=chol g solution;
run;
The GLIMMIX Procedure
Model Information | |
---|---|
Data Set | NHL.GAMESRAND |
Response Variable | won |
Response Distribution | Binary |
Link Function | Logit |
Variance Function | Default |
Variance Matrix Blocked By | team |
Estimation Technique | Maximum Likelihood |
Likelihood Approximation | Gauss-Hermite Quadrature |
Degrees of Freedom Method | Between-Within |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
team | 31 | ANA BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LA MIN MTL NJ NSH NYI NYR OTT PHI PHX PIT SJ STL TB TOR VAN VGK WPG WSH |
Number of Observations Read | 6178 |
---|---|
Number of Observations Used | 6178 |
Response Profile | ||
---|---|---|
Ordered Value | won | Total Frequency |
1 | 0 | 3071 |
2 | 1 | 3107 |
The GLIMMIX procedure is modeling the probability that won='1'.
Dimensions | |
---|---|
G-side Cov. Parameters | 3 |
Columns in X | 7 |
Columns in Z per Subject | 2 |
Subjects (Blocks in V) | 31 |
Max Obs per Subject | 238 |
Optimization Information | |
---|---|
Optimization Technique | Dual Quasi-Newton |
Parameters in Optimization | 10 |
Lower Boundaries | 2 |
Upper Boundaries | 0 |
Fixed Effects | Not Profiled |
Starting From | GLM estimates |
Quadrature Points | 1 |
Iteration History | |||||
---|---|---|---|---|---|
Iteration | Restarts | Evaluations | Objective Function | Change | Max Gradient |
0 | 0 | 4 | 6711.4553725 | . | 60.84809 |
1 | 0 | 3 | 6710.7871203 | 0.66825213 | 263.1968 |
2 | 0 | 3 | 6706.5499723 | 4.23714802 | 148.9014 |
3 | 0 | 2 | 6703.3524844 | 3.19748792 | 159.9946 |
4 | 0 | 3 | 6703.2107923 | 0.14169209 | 147.1521 |
5 | 0 | 2 | 6702.940897 | 0.26989531 | 130.1817 |
6 | 0 | 3 | 6702.7805741 | 0.16032292 | 116.9171 |
7 | 0 | 2 | 6702.4437877 | 0.33678642 | 48.60881 |
8 | 0 | 3 | 6702.1046842 | 0.33910344 | 5.312724 |
9 | 0 | 3 | 6702.043703 | 0.06098120 | 12.1805 |
10 | 0 | 3 | 6702.0167753 | 0.02692775 | 13.42428 |
11 | 0 | 3 | 6702.0127666 | 0.00400870 | 12.09659 |
12 | 0 | 4 | 6701.999187 | 0.01357963 | 0.689931 |
13 | 0 | 3 | 6701.9990994 | 0.00008755 | 0.015468 |
14 | 0 | 3 | 6701.9990992 | 0.00000016 | 0.000821 |
Convergence criterion (GCONV=1E-8) satisfied. |
Estimated G matrix is not positive definite.
Fit Statistics | |
---|---|
-2 Log Likelihood | 6702.00 |
AIC (smaller is better) | 6720.00 |
AICC (smaller is better) | 6720.03 |
BIC (smaller is better) | 6732.90 |
CAIC (smaller is better) | 6741.90 |
HQIC (smaller is better) | 6724.21 |
Fit Statistics for Conditional Distribution | |
---|---|
-2 log L(won | r. effects) | 6663.54 |
Pearson Chi-Square | 6138.44 |
Pearson Chi-Square / DF | 0.99 |
Estimated G Matrix | |||
---|---|---|---|
Effect | Row | Col1 | Col2 |
Intercept | 1 | 0.02931 | 0.000438 |
scoredfirst | 2 | 0.000438 | 6.533E-6 |
Covariance Parameter Estimates | |||
---|---|---|---|
Cov Parm | Subject | Estimate | Standard Error |
CHOL(1,1) | team | 0.1712 | 0.05534 |
CHOL(2,1) | team | 0.002556 | 0.07133 |
CHOL(2,2) | team | 0 | . |
Solutions for Fixed Effects | |||||
---|---|---|---|---|---|
Effect | Estimate | Standard Error | DF | t Value | Pr > |t| |
Intercept | -2.0111 | 0.1244 | 30 | -16.17 | <.0001 |
scoredfirst | 1.9085 | 0.06506 | 6141 | 29.34 | <.0001 |
scoredsecond | 1.8539 | 0.06533 | 6141 | 28.38 | <.0001 |
home | 0.3270 | 0.06014 | 6141 | 5.44 | <.0001 |
overtime | -0.09928 | 0.08424 | 6141 | -1.18 | 0.2386 |
period | 0.1754 | 0.05700 | 6141 | 3.08 | 0.0021 |
penaltycount | -0.05052 | 0.01484 | 6141 | -3.40 | 0.0007 |
Odds Ratio Estimates | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
scoredfirst | scoredsecond | home | overtime | period | penaltycount | _scoredfirst | _scoredsecond | _home | _overtime | _period | _penaltycount | Estimate | DF | 95% Confidence Limits | |
1.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 6.743 | 6141 | 5.936 | 7.661 |
0.4989 | 1.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 6.385 | 6141 | 5.617 | 7.257 |
0.4989 | 0.4909 | 1.5034 | 0.1318 | 1.2488 | 3.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 1.387 | 6141 | 1.233 | 1.560 |
0.4989 | 0.4909 | 0.5034 | 1.1318 | 1.2488 | 3.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 0.905 | 6141 | 0.768 | 1.068 |
0.4989 | 0.4909 | 0.5034 | 0.1318 | 2.2488 | 3.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 1.192 | 6141 | 1.066 | 1.333 |
0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 4.9908 | 0.4989 | 0.4909 | 0.5034 | 0.1318 | 1.2488 | 3.9908 | 0.951 | 6141 | 0.923 | 0.979 |
Effects of continuous variables are assessed as one unit offsets from the mean. The AT suboption modifies the reference value and the UNIT suboption modifies the offsets.
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
scoredfirst | 1 | 6141 | 860.64 | <.0001 |
scoredsecond | 1 | 6141 | 805.20 | <.0001 |
home | 1 | 6141 | 29.57 | <.0001 |
overtime | 1 | 6141 | 1.39 | 0.2386 |
period | 1 | 6141 | 9.47 | 0.0021 |
penaltycount | 1 | 6141 | 11.58 | 0.0007 |
Solution for Random Effects | ||||||
---|---|---|---|---|---|---|
Effect | Subject | Estimate | Std Err Pred | DF | t Value | Pr > |t| |
Intercept | team ANA | 0.1035 | 0.1275 | 6171 | 0.81 | 0.4169 |
scoredfirst | team ANA | 0.001546 | 0.04294 | 6171 | 0.04 | 0.9713 |
Intercept | team BOS | 0.1345 | 0.1305 | 6171 | 1.03 | 0.3027 |
scoredfirst | team BOS | 0.002008 | 0.05592 | 6171 | 0.04 | 0.9714 |
Intercept | team BUF | -0.2700 | 0.1522 | 6171 | -1.77 | 0.0761 |
scoredfirst | team BUF | -0.00403 | 0.1123 | 6171 | -0.04 | 0.9714 |
Intercept | team CAR | -0.2826 | 0.1553 | 6171 | -1.82 | 0.0690 |
scoredfirst | team CAR | -0.00422 | 0.1174 | 6171 | -0.04 | 0.9713 |
Intercept | team CBJ | 0.008767 | 0.1201 | 6171 | 0.07 | 0.9418 |
scoredfirst | team CBJ | 0.000131 | 0.003898 | 6171 | 0.03 | 0.9732 |
Intercept | team CGY | 0.08337 | 0.1249 | 6171 | 0.67 | 0.5046 |
scoredfirst | team CGY | 0.001245 | 0.03460 | 6171 | 0.04 | 0.9713 |
Intercept | team CHI | 0.1148 | 0.1248 | 6171 | 0.92 | 0.3576 |
scoredfirst | team CHI | 0.001714 | 0.04811 | 6171 | 0.04 | 0.9716 |
Intercept | team COL | -0.1140 | 0.1241 | 6171 | -0.92 | 0.3584 |
scoredfirst | team COL | -0.00170 | 0.04749 | 6171 | -0.04 | 0.9714 |
Intercept | team DAL | -0.04652 | 0.1179 | 6171 | -0.39 | 0.6932 |
scoredfirst | team DAL | -0.00069 | 0.01952 | 6171 | -0.04 | 0.9716 |
Intercept | team DET | -0.09946 | 0.1226 | 6171 | -0.81 | 0.4173 |
scoredfirst | team DET | -0.00149 | 0.04182 | 6171 | -0.04 | 0.9717 |
Intercept | team EDM | -0.06339 | 0.1213 | 6171 | -0.52 | 0.6013 |
scoredfirst | team EDM | -0.00095 | 0.02658 | 6171 | -0.04 | 0.9716 |
Intercept | team FLA | -0.03694 | 0.1195 | 6171 | -0.31 | 0.7573 |
scoredfirst | team FLA | -0.00055 | 0.01543 | 6171 | -0.04 | 0.9715 |
Intercept | team LA | -0.05814 | 0.1219 | 6171 | -0.48 | 0.6333 |
scoredfirst | team LA | -0.00087 | 0.02468 | 6171 | -0.04 | 0.9719 |
Intercept | team MIN | 0.1063 | 0.1267 | 6171 | 0.84 | 0.4014 |
scoredfirst | team MIN | 0.001587 | 0.04433 | 6171 | 0.04 | 0.9714 |
Intercept | team MTL | 0.01175 | 0.1219 | 6171 | 0.10 | 0.9232 |
scoredfirst | team MTL | 0.000175 | 0.005245 | 6171 | 0.03 | 0.9733 |
Intercept | team NJ | -0.1392 | 0.1221 | 6171 | -1.14 | 0.2540 |
scoredfirst | team NJ | -0.00208 | 0.05832 | 6171 | -0.04 | 0.9716 |
Intercept | team NSH | 0.1093 | 0.1279 | 6171 | 0.85 | 0.3929 |
scoredfirst | team NSH | 0.001632 | 0.04540 | 6171 | 0.04 | 0.9713 |
Intercept | team NYI | -0.03941 | 0.1202 | 6171 | -0.33 | 0.7430 |
scoredfirst | team NYI | -0.00059 | 0.01672 | 6171 | -0.04 | 0.9719 |
Intercept | team NYR | 0.1213 | 0.1217 | 6171 | 1.00 | 0.3193 |
scoredfirst | team NYR | 0.001810 | 0.05085 | 6171 | 0.04 | 0.9716 |
Intercept | team OTT | 0.05774 | 0.1216 | 6171 | 0.47 | 0.6349 |
scoredfirst | team OTT | 0.000862 | 0.02421 | 6171 | 0.04 | 0.9716 |
Intercept | team PHI | -0.06709 | 0.1233 | 6171 | -0.54 | 0.5863 |
scoredfirst | team PHI | -0.00100 | 0.02831 | 6171 | -0.04 | 0.9718 |
Intercept | team PHX | -0.2198 | 0.1461 | 6171 | -1.50 | 0.1324 |
scoredfirst | team PHX | -0.00328 | 0.09139 | 6171 | -0.04 | 0.9714 |
Intercept | team PIT | 0.1195 | 0.1313 | 6171 | 0.91 | 0.3629 |
scoredfirst | team PIT | 0.001784 | 0.04981 | 6171 | 0.04 | 0.9714 |
Intercept | team SJ | 0.03619 | 0.1214 | 6171 | 0.30 | 0.7656 |
scoredfirst | team SJ | 0.000540 | 0.01527 | 6171 | 0.04 | 0.9718 |
Intercept | team STL | 0.1790 | 0.1342 | 6171 | 1.33 | 0.1825 |
scoredfirst | team STL | 0.002672 | 0.07458 | 6171 | 0.04 | 0.9714 |
Intercept | team TB | 0.1397 | 0.1292 | 6171 | 1.08 | 0.2797 |
scoredfirst | team TB | 0.002086 | 0.05817 | 6171 | 0.04 | 0.9714 |
Intercept | team TOR | -0.06816 | 0.1281 | 6171 | -0.53 | 0.5947 |
scoredfirst | team TOR | -0.00102 | 0.02818 | 6171 | -0.04 | 0.9712 |
Intercept | team VAN | -0.08880 | 0.1202 | 6171 | -0.74 | 0.4600 |
scoredfirst | team VAN | -0.00133 | 0.03721 | 6171 | -0.04 | 0.9716 |
Intercept | team VGK | 0.01550 | 0.1552 | 6171 | 0.10 | 0.9205 |
scoredfirst | team VGK | 0.000231 | 0.007014 | 6171 | 0.03 | 0.9737 |
Intercept | team WPG | 0.08358 | 0.1203 | 6171 | 0.69 | 0.4873 |
scoredfirst | team WPG | 0.001248 | 0.03502 | 6171 | 0.04 | 0.9716 |
Intercept | team WSH | 0.1675 | 0.1264 | 6171 | 1.33 | 0.1850 |
scoredfirst | team WSH | 0.002501 | 0.07022 | 6171 | 0.04 | 0.9716 |
%%SAS sas_session
proc sql;
select Subject, (Estimate+1.9085) as ScoredFirst
from r
where Effect="scoredfirst"
order by ScoredFirst;
run;
quit;
Subject | ScoredFirst |
---|---|
team CAR | 1.904281 |
team BUF | 1.904468 |
team PHX | 1.905218 |
team NJ | 1.906421 |
team COL | 1.906798 |
team DET | 1.907015 |
team VAN | 1.907174 |
team TOR | 1.907482 |
team PHI | 1.907498 |
team EDM | 1.907554 |
team LA | 1.907632 |
team DAL | 1.907805 |
team NYI | 1.907912 |
team FLA | 1.907948 |
team CBJ | 1.908631 |
team MTL | 1.908675 |
team VGK | 1.908731 |
team SJ | 1.90904 |
team OTT | 1.909362 |
team CGY | 1.909745 |
team WPG | 1.909748 |
team ANA | 1.910046 |
team MIN | 1.910087 |
team NSH | 1.910132 |
team CHI | 1.910214 |
team PIT | 1.910284 |
team NYR | 1.91031 |
team BOS | 1.910508 |
team TB | 1.910586 |
team WSH | 1.911001 |
team STL | 1.911172 |
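To see how little these team-specific slopes differ, the random deviations can be added to the fixed effect and turned into odds ratios. This is only a sketch: the value 1.9085 is taken from the query above, the three deviations are copied from the random-effects table, and the names `FIXED_SCOREDFIRST` and `team_slope` are made up for the illustration.

```python
import math

# Fixed-effect value for scoredfirst, as used in the PROC SQL query above
FIXED_SCOREDFIRST = 1.9085

# Random-slope deviations copied from the random-effects table (three teams only)
random_slopes = {"PHX": -0.00328, "MTL": 0.000175, "STL": 0.002672}


def team_slope(deviation, fixed=FIXED_SCOREDFIRST):
    """Team-specific slope = fixed effect + that team's random deviation."""
    return fixed + deviation


for team, deviation in random_slopes.items():
    slope = team_slope(deviation)
    # exp(slope) is the odds ratio for winning when the team scores first
    print(f"{team}: slope={slope:.6f}, odds ratio={math.exp(slope):.3f}")
```

The odds ratios for these teams agree to within a few hundredths, which is why the multilevel model does not separate the teams on this effect.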
To break up the effect for each team, I will run the logistic regression by team.
%%SAS sas_session
proc sort data=nhl.games;
by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.teamestimates;
proc logistic data=nhl.games;
by team;
class scoredfirst(ref='0') scoredsecond(ref='0') home(ref='0') overtime(ref='0') / param=ref;
model won(event='1') = scoredfirst scoredsecond home overtime period penalty_cen;
run; quit;
ods trace off;
%%SAS sas_session
proc print data=nhl.teamestimates(obs=10);
run;
Obs | team | Variable | ClassVal0 | DF | Estimate | StdErr | WaldChiSq | ProbChiSq | _ESTTYPE_ |
---|---|---|---|---|---|---|---|---|---|
1 | ANA | Intercept | | 1 | -1.9796 | 0.4050 | 23.8903 | <.0001 | MLE |
2 | ANA | scoredfirst | 1 | 1 | 1.5697 | 0.2389 | 43.1713 | <.0001 | MLE |
3 | ANA | scoredsecond | 1 | 1 | 1.5974 | 0.2415 | 43.7494 | <.0001 | MLE |
4 | ANA | home | 1 | 1 | 0.6988 | 0.2314 | 9.1181 | 0.0025 | MLE |
5 | ANA | overtime | 1 | 1 | -0.5108 | 0.3259 | 2.4570 | 0.1170 | MLE |
6 | ANA | period | | 1 | 0.3861 | 0.2362 | 2.6721 | 0.1021 | MLE |
7 | ANA | penalty_cen | | 1 | -0.00960 | 0.0528 | 0.0331 | 0.8556 | MLE |
8 | BOS | Intercept | | 1 | -2.4694 | 0.4521 | 29.8340 | <.0001 | MLE |
9 | BOS | scoredfirst | 1 | 1 | 2.1861 | 0.2719 | 64.6617 | <.0001 | MLE |
10 | BOS | scoredsecond | 1 | 1 | 1.8727 | 0.2724 | 47.2734 | <.0001 | MLE |
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, ProbChiSQ
from nhl.teamestimates
where Variable="scoredfirst"
order by Estimate;
run;
quit;
Variable | team | Estimate | Pr > Chi-Square |
---|---|---|---|
scoredfirst | NYI | 1.1333 | <.0001 |
scoredfirst | VAN | 1.3658 | <.0001 |
scoredfirst | PIT | 1.5193 | <.0001 |
scoredfirst | ANA | 1.5697 | <.0001 |
scoredfirst | LA | 1.6351 | <.0001 |
scoredfirst | DET | 1.6427 | <.0001 |
scoredfirst | PHX | 1.6749 | <.0001 |
scoredfirst | CBJ | 1.7355 | <.0001 |
scoredfirst | EDM | 1.7418 | <.0001 |
scoredfirst | NSH | 1.7765 | <.0001 |
scoredfirst | DAL | 1.8462 | <.0001 |
scoredfirst | OTT | 1.8750 | <.0001 |
scoredfirst | PHI | 1.8914 | <.0001 |
scoredfirst | CGY | 1.8929 | <.0001 |
scoredfirst | BUF | 1.9120 | <.0001 |
scoredfirst | FLA | 1.9808 | <.0001 |
scoredfirst | MIN | 2.0916 | <.0001 |
scoredfirst | TOR | 2.1204 | <.0001 |
scoredfirst | CHI | 2.1262 | <.0001 |
scoredfirst | WPG | 2.1340 | <.0001 |
scoredfirst | STL | 2.1541 | <.0001 |
scoredfirst | SJ | 2.1664 | <.0001 |
scoredfirst | NJ | 2.1830 | <.0001 |
scoredfirst | BOS | 2.1861 | <.0001 |
scoredfirst | TB | 2.1991 | <.0001 |
scoredfirst | COL | 2.2399 | <.0001 |
scoredfirst | CAR | 2.2597 | <.0001 |
scoredfirst | NYR | 2.2732 | <.0001 |
scoredfirst | MTL | 2.3390 | <.0001 |
scoredfirst | VGK | 2.3544 | 0.0002 |
scoredfirst | WSH | 2.5618 | <.0001 |
Surprisingly, the Vegas Golden Knights rank near the top here. The Knights were really good at winning after scoring first (about a 73% probability of winning in the home-ice scenario computed below). But maybe it isn’t so surprising. There is only one season to go on, and they made it to the Stanley Cup Final, meaning they won a lot of games that season. With so few games, their estimate is also the least certain in the table (p = 0.0002, while every other team is below .0001).
Another consideration: many of the teams most likely to win after scoring first have some of the best goalies in the league: Holtby, Fleury, Bishop/Vasilevskiy (for some of the TB seasons), Price, and Lundqvist. If you have a good goalie, it likely makes it even harder for the opponent to score the equalizer and then a second goal. That first goal also takes some pressure off the goalie. It probably also indicates more offensive zone possession, meaning the team is keeping the puck away from its own goal. VGK had Fleury, who played phenomenally throughout the regular season and playoffs.
What about COL and CAR? That seems to make less sense. But going back to the section about downloading the data, I reviewed the players who scored first or assisted on the first goal. Colorado had RANTANEN and Carolina had STAAL pop up for different seasons. So it may simply be that these lower-winning teams tended to get their wins in the games where they scored first.
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, ProbChiSQ
from nhl.teamestimates
where team="VGK"
order by Estimate;
run;
quit;
Variable | team | Estimate | Pr > Chi-Square |
---|---|---|---|
Intercept | VGK | -1.8942 | 0.0563 |
penalty_cen | VGK | -0.1949 | 0.3537 |
period | VGK | 0.0616 | 0.9138 |
overtime | VGK | 0.3167 | 0.6663 |
home | VGK | 0.6651 | 0.2225 |
scoredsecond | VGK | 1.3322 | 0.0316 |
scoredfirst | VGK | 2.3544 | 0.0002 |
# Each coefficient is added once, so period = 1 and penalty_cen = 1
# (one penalty above the ~3.98 average, i.e. about 5 penalties)
print("VGK Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+2.3544+.6651+.0616-.1949)*100))
print("VGK Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-1.8942+1.3322+.6651+.0616-.1949)*100))
VGK Probability of winning after scoring first in the first period at home with about 5 penalties
72.95%
VGK Probability of winning after scoring second in the first period at home with about 5 penalties
49.25%
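The hand-typed sums above are easy to get wrong, so here is a small sketch that builds the same log-odds from a dictionary of coefficients. `convertToProb` is defined elsewhere in this post; the `inverse_logit` function below is my assumption of what it does (the standard logistic function), and `win_probability` is a hypothetical helper. The sums above add each coefficient once, which corresponds to period = 1 and penalty_cen = 1.

```python
import math


def inverse_logit(log_odds):
    """Standard logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# VGK estimates copied from the table above
vgk = {
    "Intercept": -1.8942,
    "scoredfirst": 2.3544,
    "scoredsecond": 1.3322,
    "home": 0.6651,
    "overtime": 0.3167,
    "period": 0.0616,
    "penalty_cen": -0.1949,
}


def win_probability(coefs, **scenario):
    """Add intercept + coefficient * value for each named predictor."""
    log_odds = coefs["Intercept"]
    for name, value in scenario.items():
        log_odds += coefs[name] * value
    return inverse_logit(log_odds)


p_first = win_probability(vgk, scoredfirst=1, home=1, period=1, penalty_cen=1)
p_second = win_probability(vgk, scoredsecond=1, home=1, period=1, penalty_cen=1)
print(f"scored first: {p_first:.2%}, scored second: {p_second:.2%}")
```

Passing `penalty_cen=0` instead gives the scenario for an average number of penalties, since that variable is centered.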
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, ProbChiSQ
from nhl.teamestimates
where Variable="scoredsecond"
order by Estimate;
run;
quit;
Variable | team | Estimate | Pr > Chi-Square |
---|---|---|---|
scoredsecond | VGK | 1.3322 | 0.0316 |
scoredsecond | NYI | 1.4828 | <.0001 |
scoredsecond | MIN | 1.4955 | <.0001 |
scoredsecond | NSH | 1.5128 | <.0001 |
scoredsecond | PHX | 1.5876 | <.0001 |
scoredsecond | EDM | 1.5884 | <.0001 |
scoredsecond | ANA | 1.5974 | <.0001 |
scoredsecond | SJ | 1.6608 | <.0001 |
scoredsecond | PIT | 1.7099 | <.0001 |
scoredsecond | CBJ | 1.7114 | <.0001 |
scoredsecond | OTT | 1.7161 | <.0001 |
scoredsecond | FLA | 1.7174 | <.0001 |
scoredsecond | VAN | 1.7430 | <.0001 |
scoredsecond | DAL | 1.8158 | <.0001 |
scoredsecond | BOS | 1.8727 | <.0001 |
scoredsecond | DET | 1.8870 | <.0001 |
scoredsecond | BUF | 1.8951 | <.0001 |
scoredsecond | CHI | 1.9125 | <.0001 |
scoredsecond | NYR | 1.9689 | <.0001 |
scoredsecond | CAR | 1.9849 | <.0001 |
scoredsecond | WSH | 2.0039 | <.0001 |
scoredsecond | PHI | 2.0239 | <.0001 |
scoredsecond | STL | 2.0241 | <.0001 |
scoredsecond | COL | 2.0283 | <.0001 |
scoredsecond | CGY | 2.1101 | <.0001 |
scoredsecond | TB | 2.1191 | <.0001 |
scoredsecond | LA | 2.1662 | <.0001 |
scoredsecond | TOR | 2.1773 | <.0001 |
scoredsecond | WPG | 2.1887 | <.0001 |
scoredsecond | MTL | 2.2306 | <.0001 |
scoredsecond | NJ | 2.2486 | <.0001 |
Interestingly, the ranking for scoring second is flipped a little. VGK is now at the bottom, though again this is hard to interpret because it is based only on their one great season. NJ was good at sealing the deal when they got the second goal, even without the first. Pittsburgh seems to be the biggest outlier: they sit in the lower part of both lists, whether they scored the first or the second goal, although their win probability is still over 50% after scoring first. Another big factor for them seems to be home ice advantage, along with holding on to the win even when the goal came in the first period.
Some other things that I’m not controlling for could be influencing the result: a team’s even-strength goals, penalty-killing percentage, and so on.
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, ProbChiSQ
from nhl.teamestimates
where team="PIT"
order by Estimate;
run;
quit;
Variable | team | Estimate | Pr > Chi-Square |
---|---|---|---|
Intercept | PIT | -2.0994 | <.0001 |
overtime | PIT | -0.0818 | 0.8050 |
penalty_cen | PIT | -0.0151 | 0.7778 |
period | PIT | 0.2954 | 0.1847 |
home | PIT | 0.9300 | <.0001 |
scoredfirst | PIT | 1.5193 | <.0001 |
scoredsecond | PIT | 1.7099 | <.0001 |
# Each coefficient is added once, so period = 1 and penalty_cen = 1
# (one penalty above the ~3.98 average, i.e. about 5 penalties)
print("Probability of winning after scoring first in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.5193+.9300+.2954-.0151)*100))
print("Probability of winning after scoring second in the first period at home with about 5 penalties")
print("{0:.2f}%".format(convertToProb(-2.0994+1.7099+.9300+.2954-.0151)*100))
Probability of winning after scoring first in the first period at home with about 5 penalties
65.25%
Probability of winning after scoring second in the first period at home with about 5 penalties
69.44%
Looking at home ice advantage while controlling for overtime and who scores first is interesting. For some teams the effect is slightly negative or not significant, while for others, like PIT, it is large. Some of the others are not so surprising: Toronto and Philadelphia (if I were a ref, I’d be intimidated in Philadelphia too). Tampa Bay is surprising to me, given the other teams in the top spots. Certainly they have a great fan base (I’m biased), but one of the problems is out-of-state fans coming in. It isn’t surprising to see more red sweaters than blue sweaters when, say, the Blackhawks come through. I wonder, though, if the ice has something to do with this. It can be hot and humid in Tampa, and the quality of the ice may be a factor, although it generally sits at the right temperature. Working against that argument, PHX is hot too but doesn’t rank high in the home ice effect. Ignoring significance, PHX has a larger effect than BOS; I would have expected the opposite.
Perhaps more importantly, the top teams are also the teams that simply won more games over the seasons analyzed, and so would have more home wins.
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, ProbChiSQ
from nhl.teamestimates
where Variable="home"
order by Estimate;
run;
quit;
Variable | team | Estimate | Pr > Chi-Square |
---|---|---|---|
home | MTL | -0.0638 | 0.7957 |
home | NYI | -0.0379 | 0.8621 |
home | DET | 0.0381 | 0.8691 |
home | SJ | 0.0876 | 0.7082 |
home | VAN | 0.0998 | 0.6632 |
home | COL | 0.1336 | 0.5819 |
home | CGY | 0.1383 | 0.5713 |
home | MIN | 0.1708 | 0.4756 |
home | CHI | 0.1838 | 0.4493 |
home | BOS | 0.1872 | 0.4215 |
home | OTT | 0.1959 | 0.3941 |
home | CBJ | 0.2357 | 0.3013 |
home | BUF | 0.2436 | 0.3206 |
home | NYR | 0.2618 | 0.2845 |
home | WPG | 0.2762 | 0.2346 |
home | STL | 0.3251 | 0.1768 |
home | FLA | 0.3905 | 0.0989 |
home | PHX | 0.3960 | 0.0867 |
home | EDM | 0.4256 | 0.0643 |
home | CAR | 0.4366 | 0.0673 |
home | NJ | 0.4426 | 0.0706 |
home | LA | 0.4487 | 0.0591 |
home | NSH | 0.4628 | 0.0445 |
home | DAL | 0.4668 | 0.0447 |
home | WSH | 0.5049 | 0.0427 |
home | VGK | 0.6651 | 0.2225 |
home | ANA | 0.6988 | 0.0025 |
home | PHI | 0.7184 | 0.0026 |
home | TB | 0.8152 | 0.0010 |
home | TOR | 0.8890 | 0.0006 |
home | PIT | 0.9300 | <.0001 |
One final look at a different component of home ice advantage is suggested in Soccernomics: that the home advantage has nothing to do with time changes, jet lag, or the fans’ effect on the players, but rather with the fans’ effect on the referees. To look at that, I will model the relationship between penalties and playing at home. This will just be a simple model, and as you will see in the results, it doesn’t explain the variation in penalty count very well. Still, it can tell us the overall difference in penalties between the home team and the away team.
%%SAS sas_session
proc reg data=nhl.gamesrand;
model penaltycount= home won /;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: penaltycount
Number of Observations Read | 6178 |
---|---|
Number of Observations Used | 6178 |
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 326.99322 | 163.49661 | 37.78 | <.0001 |
Error | 6175 | 26721 | 4.32737 | | |
Corrected Total | 6177 | 27048 | | | |
Root MSE | 2.08023 | R-Square | 0.0121 |
---|---|---|---|
Dependent Mean | 3.99077 | Adj R-Sq | 0.0118 |
Coeff Var | 52.12603 |
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
---|---|---|---|---|---|
Intercept | 1 | 4.26670 | 0.04470 | 95.45 | <.0001 |
home | 1 | -0.43584 | 0.05316 | -8.20 | <.0001 |
won | 1 | -0.11240 | 0.05316 | -2.11 | 0.0345 |
R-squared is quite small, but then again I’ve had models so bad that R-squared was negative (that shouldn’t happen for an ordinary least squares fit with an intercept, so it was a really bad model; it used Twitter data, if that explains anything). But the difference is significant. The intercept says you would expect an away team that loses to get 4.26670 penalties per game. This is close to the mean calculated without controlling for home ice (3.99077). Somewhat obviously, there is a negative effect for the team that wins: take fewer penalties and you are more likely to win, which was also shown above.
When the team is at home, there is a drop of 0.43584 penalties per game. So that shows slight favoritism toward the home team. Let’s bootstrap it again for a better estimate.
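To make the coefficients concrete, the fitted line can be evaluated for the four combinations of home/away and win/loss. This is a minimal sketch using the estimates from the Parameter Estimates table; `expected_penalties` is a hypothetical helper name.

```python
# Estimates copied from the PROC REG Parameter Estimates table
coef = {"Intercept": 4.26670, "home": -0.43584, "won": -0.11240}


def expected_penalties(home, won):
    """Predicted penalty count for a team given home (0/1) and won (0/1)."""
    return coef["Intercept"] + coef["home"] * home + coef["won"] * won


for home in (0, 1):
    for won in (0, 1):
        label = f"{'home' if home else 'away'}, {'won' if won else 'lost'}"
        print(f"{label}: {expected_penalties(home, won):.3f} penalties")
```

The home/away gap is the same in every row because the model has no interaction term between `home` and `won`.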
%%SAS sas_session
proc sort data=nhl.gamessamp;
by replicate;
run;
ods trace on;
ods output ParameterEstimates=nhl.penestimates;
proc reg data=nhl.gamessamp;
by replicate;
model penaltycount= home won/;
run; quit;
ods trace off;
%%SAS sas_session
proc sql;
select Variable, mean(Estimate) as MeanEst
from nhl.penestimates
group by variable;
run;
quit;
%histograms(curvar='home')
Variable | MeanEst |
---|---|
Intercept | 4.156299 |
home | -0.29919 |
won | -0.04627 |
The home estimate from the original sample (-0.43584) ended up larger in magnitude than the bootstrapped mean (-0.29919). Still, there is a significant negative effect on the number of penalties a team is called for if it is the home team. The next question is whether this varies by team. Do certain teams have more of a home ice advantage than others, in terms of the penalties called?
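One way to back up a statement like this from bootstrap output is a percentile interval over the per-replicate estimates. The sketch below is not what I ran in SAS: the replicate values are random placeholders standing in for the `home` rows of `nhl.penestimates`, and `percentile_interval` is a hypothetical helper using a simple nearest-rank rule.

```python
import random
import statistics


def percentile_interval(estimates, level=0.95):
    """Nearest-rank percentile interval over a list of bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lower_idx = int(tail * (len(ordered) - 1))
    upper_idx = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lower_idx], ordered[upper_idx]


# Placeholder data only: random draws, NOT the real replicate estimates
rng = random.Random(0)
replicate_estimates = [rng.gauss(-0.3, 0.1) for _ in range(1000)]

low, high = percentile_interval(replicate_estimates)
print(f"mean={statistics.mean(replicate_estimates):.3f}, "
      f"95% interval=({low:.3f}, {high:.3f})")
# If the interval does not contain zero, the effect is distinguishable from zero
print("interval excludes zero:", high < 0 or low > 0)
```

With the real estimates exported from SAS, only the `replicate_estimates` list would change.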
%%SAS sas_session
proc sort data=nhl.games;
by team;
run;
ods trace on;
ods output ParameterEstimates=nhl.penteamestimates;
proc reg data=nhl.games;
by team;
model penaltycount= home won/;
run; quit;
ods trace off;
%%SAS sas_session
proc sql;
select Variable, Team, Estimate, Probt
from nhl.penteamestimates
where Variable="home"
order by Estimate;
run;
quit;
Variable | team | Parameter Estimate | Pr > |t| |
---|---|---|---|
home | CGY | -0.72226 | 0.0013 |
home | CHI | -0.62266 | 0.0004 |
home | VAN | -0.55893 | 0.0151 |
home | TOR | -0.48348 | 0.0213 |
home | NJ | -0.47302 | 0.0069 |
home | MTL | -0.46980 | 0.0228 |
home | PIT | -0.46902 | 0.0371 |
home | NYR | -0.42721 | 0.0142 |
home | WSH | -0.41880 | 0.0390 |
home | NSH | -0.40351 | 0.0507 |
home | VGK | -0.39992 | 0.1965 |
home | FLA | -0.39745 | 0.0721 |
home | MIN | -0.38571 | 0.0462 |
home | CAR | -0.36810 | 0.0254 |
home | OTT | -0.35791 | 0.1104 |
home | ANA | -0.33947 | 0.1173 |
home | BUF | -0.30478 | 0.1368 |
home | CBJ | -0.28451 | 0.1720 |
home | WPG | -0.26757 | 0.2471 |
home | DET | -0.25629 | 0.1476 |
home | BOS | -0.25514 | 0.2141 |
home | COL | -0.20313 | 0.3130 |
home | PHI | -0.12630 | 0.5781 |
home | EDM | -0.11030 | 0.5855 |
home | NYI | -0.08499 | 0.6757 |
home | PHX | -0.04440 | 0.8409 |
home | SJ | -0.04309 | 0.8336 |
home | DAL | -0.03506 | 0.8597 |
home | LA | -0.03000 | 0.8776 |
home | TB | 0.01714 | 0.9290 |
home | STL | 0.19078 | 0.4047 |
If you are a Pittsburgh rival and a Sidney Crosby conspiracist, you might see PIT’s top rank in home ice advantage for wins as proof of something nefarious. I’ll just leave it at that. Then there’s Calgary, at the top of the penalty list. Or Philadelphia, which no longer seems to have a home ice advantage once we look at penalties.
One question, though: is it that these teams have a home ice advantage, or are they just more likely to take penalties on the road? Many of the teams that showed a home ice advantage in terms of wins are not at the top of the list for penalties. In fact, for most teams there is no significant difference in penalties between home and away games.
As with a lot of analysis, this just leads to more questions. Does the nationality of the refs come into play? Some of the top teams are Canadian (then again, three Canadian teams aren’t). Some of the teams are Original Six teams, some aren’t.
Summary
As I stated before, this is mostly for fun. It is a toy dataset that can be used to answer some questions. I would say that the first goal is important, but only because of the second goal. If a team scores both the first and second, the game is pretty much over. Home seems to provide, on average, some advantage, and on average the penalties are slightly lower for home teams than for away teams. This doesn’t hold up when looking at every team though.
What does it mean for strategy? Well, coming from someone who has never played, I don’t know. I do have some thoughts. One is that a team should try to score first. Is that helpful? Just go and score? Of course, if it were easy... I think there may be ways to do that. Front-load the minutes of your best scoring lines, including scoring defensemen, in the first part of the game. Or if you know which player is best at scoring the first goal, give them lots of ice time at the start.
Another strategy follows from the fact that it gets harder for the opposing team to win the later the first goal comes. An average team has about a 71% chance of winning at home if it scores the first goal in the 3rd period. So, give your most defensive-minded players more minutes in the first period and part of the second. Really grind it out with the other team, with lots of hits, and tire them out. Make them cover the whole rink. Then switch and give more minutes to the best scorers, who will have fresh legs.
But…I’m not a coach and I don’t play hockey.
# 3*0.342078 is the period term, so this is a first goal scored in the third period
print("Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties: ")
print("{0:.2f}%".format(convertToProb(-2.14334+1.90498+3*0.342078+0.10468)*100))
Probability of winning when scoring first at home, in the third period, no overtime, and ~4 penalties:
70.94%
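The same calculation can be repeated for each period to show how the timing of the first goal changes the picture. This is a sketch under assumptions: I read the numbers in the call above as intercept (-2.14334), scoredfirst (1.90498), period (0.342078), and home (0.10468), with period coded 1 to 3, and `inverse_logit` stands in for `convertToProb`, which is defined earlier in the post.

```python
import math


def inverse_logit(log_odds):
    """Standard logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Coefficients as read from the calculation above (the labels are my assumption)
INTERCEPT = -2.14334
SCORED_FIRST = 1.90498
PERIOD = 0.342078
HOME = 0.10468


def prob_win_scoring_first_at_home(period):
    """Win probability for a home team scoring first in the given period,
    with no overtime and an average number of penalties (penalty_cen = 0)."""
    return inverse_logit(INTERCEPT + SCORED_FIRST + PERIOD * period + HOME)


for period in (1, 2, 3):
    print(f"period {period}: {prob_win_scoring_first_at_home(period):.2%}")
```

Because the period coefficient is positive, the probability rises with each period, which is the basis for the grind-it-out-early idea.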