
October 12, 2012
So You Want to Predict the Future
Beta Testing Team Projections

by Dan Hanner


Editor's note: Stand up right now and give Dan a round of applause. He's going all Linux on his team projection system and saying: "Here you go." Call it open-source team ratings. The product of Dan's labor, preseason ratings for all 345 Division I teams, will soon appear in both the College Basketball Prospectus 2012-13, and, for a second consecutive season, in ESPN The Magazine. Sure, this is shop talk, and the "for practitioners only" warning label is duly affixed. But if you're one of those practitioners, here, step by granular step, is exactly how Dan comes up with his projections now, and how he intends to next year and beyond.

A predictive model is only as good as its inputs. If you watch a team practice in the off-season, attend AAU tournaments, and talk to the team's coaches, odds are you will have more information than any database will ever contain. If you knew Meyers Leonard was going to break out for Illinois last year, you probably attended the under-19 games in the off-season. But if you looked at Leonard's stats from his freshman year, you simply would not have predicted a breakout season. A statistical model will never replace the first-hand information reporters can collect about a team. But the real advantage of a statistical model is its consistency. A statistical model can provide a more even-handed analysis of multiple teams and remind us what we might be overlooking in our predictions. It can also help us avoid group-think.

I used 10 years of historical tempo-free player stats to predict player performance in 2013. Ken Pomeroy's player stats are fantastic, and Ken recently added the ability to link players across years. To reach back even further in time, I use data from RealGM.com and calculate the tempo-free stats myself. I also tabulated and merged other important variables onto the player data, such as the identity of the head coach and a player's RSCI high-school recruiting rank. For the current season, verbalcommits.com has links to the current rosters for every Division I team, as well as additional information about scholarship and non-scholarship players.

Key terms
Offensive rating (ORtg)
: A Dean Oliver statistic that approximates points scored per 100 possessions, giving credit to teammates for assists and offensive rebounds (since these also contribute to baskets). The formula is more than a page long. Ken, statsheet.com, and I each use a slightly different version of the formula. If you buy Dean's book and have the formula in front of you, I can explain the differences to you, but you probably don't care.

Raw offense and defense: Points scored and allowed per 100 possessions at the team level.

Adjusted offense and defense: Ken's adjustment based on schedule strength and game location. The adjusted offense and defense represent the expected performance on a neutral court against an average D-I team.

Pythagorean rating: In most sports, this combines points scored and points against to calculate an expected winning percentage. I use Ken's formula, which raises the adjusted offense and defense terms to an exponent of 10.25.

Single-variable analysis
: Let's say you had data on returning minutes and winning percentage. You can fit a line relating returning minutes (x-axis) to predicted winning percentage (y-axis).

Multivariate regression: This incorporates multiple x-axis variables in the analysis. For example, you might incorporate the number of elite recruits, the team's returning minutes, and the team's past performance as x-axis variables when predicting the y-axis variable, winning percentage.

For instance, in my predictions last season, the x-axis variables were the tempo-free stats (ORtg, minutes, possessions used, and percentages for defensive rebounding, blocks, and steals) of departing, returning, and debuting players, along with data about the player's class (since freshmen improve the most), the coach, and the RSCI recruiting data. My y-axis variables were each team's adjusted offense and defense. Once I projected each team's offense and defense, I used those to calculate Pythagorean winning percentages, which I then translated into predicted wins and losses for each team.
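To make the mechanics concrete, here is a minimal sketch of fitting a multivariate regression with ordinary least squares. The numbers are invented for illustration; they are not from my dataset, and the three x-axis variables are just examples of the kinds of inputs described above.

```python
import numpy as np

# Toy data (made up): predict a team's adjusted offense from three
# x-axis variables -- returning minutes share, number of elite
# recruits, and last season's adjusted offense.
X = np.array([
    [0.80, 1, 108.0],
    [0.35, 3, 112.0],
    [0.60, 0, 101.0],
    [0.90, 0, 104.0],
    [0.20, 2, 110.0],
    [0.55, 1, 106.0],
])
y = np.array([107.0, 109.0, 98.0, 105.0, 103.0, 104.0])

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, b_minutes, b_recruits, b_past = coef
prediction = A @ coef  # fitted adjusted offense for each team
```

The real model uses many more variables and ten years of data, but the shape of the computation is the same.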

Which brings me to 2012-13. I still use the same x-axis variables, but now I predict player performance. Next I combine the player projections to predict each team's lineup. Then I add up the player projections (weighted by playing time) to get each team's adjusted offense and defense. Finally I use those to calculate Pythagorean winning percentages and predict the wins and losses for each team.

Most analysts think about the lineup when they project team performance, so mimicking that process seemed like a good idea. But the new model adds several steps, and that can be confusing. With a simpler model it was clear what generated the prediction. With a complicated model, it can be hard to see what is causing Team X to be ranked ahead of Team Y. Thus I want to spend some time describing the model in detail.

I start by attempting to predict the performance of every D-I player. One of the key things I need to do is shrink the number of x-axis variables, otherwise this project becomes too complex. In particular, I want a more invariant measure of each player's value. ORtgs are great, but I don't want variations in schedule strength, shot volume, or teammate quality to impact the projections. So I start by controlling for these factors.

Predicting 2012-13: 16 steps

Step 1: Adjust raw player stats for schedule strength
Imagine Clemson had a raw offense of 100 and every player on the team had an ORtg of 100. Imagine the Tigers' adjusted offense was 107. In principle we could multiply each player's ORtg by 107/100 to get a schedule-adjusted ORtg. Of course, not every player played against the same quality of defense. There were some bench players who only played in garbage time, etc. I suspect there are ways to create schedule-adjusted player efficiency stats, but for now I'm going to hope this simple adjustment comes close enough.

Step 2: Adjust raw player stats for percentage of possessions used
As a player uses more possessions, his ORtg will fall. NBA analysts estimate this tradeoff at roughly 1.25 points of ORtg per additional point of usage, and I'm going to steal that observation. Thus a player who uses 30 percent of his team's possessions when on the floor and has an ORtg of 110 is recoded as having an ORtg of 122.5: since 20 percent is the average usage rate, 30 minus 20 equals 10, and the ORtg should be 12.5 higher.
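In code, the usage recoding is a one-liner (again, names are mine for illustration):

```python
AVG_USAGE = 20.0   # average percentage of possessions used
TRADEOFF = 1.25    # ORtg points per point of usage, per the NBA estimate

def usage_adjusted_ortg(ortg, usage_pct):
    """Recode ORtg as if the player had used an average 20 percent
    of his team's possessions while on the floor."""
    return ortg + TRADEOFF * (usage_pct - AVG_USAGE)

# The example from the text: 110 ORtg on 30 percent usage -> 122.5.
```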

Step 3: Adjust raw player efficiency for team quality
If I told you former Illinois guard Dee Brown's ORtg fell between 2005 and 2006, you might not be surprised. The average quality of his teammates fell between 2005 and 2006. (Deron Williams and Luther Head left.) So I make an adjustment to control for average teammate quality.

Hopefully as a result of all of the above the ORtgs have now been adjusted to more accurately reflect the player's true value. Next I project player performance. I use a slightly different methodology depending on the type of player, but in each case I use the 10 years of historical data to fit the model.

Step 4: Project the performance of freshmen
First a general note about these projections. I'm not trying to accurately project everything about these players. If you want a sense of whether a freshman will be an aggressive three-point shooter or a terrible offensive rebounder, take a look at some of the work Drew Cannon is doing. (Some of Drew's findings, such as the idea that post players develop more slowly than guards, may be important things to incorporate in the model in the future.) For the purposes of projecting lineups I need a few basic stats for players, but not everything. The most important stat will remain the adjusted ORtg.

To project the ORtg for an RSCI top-10 recruit, I average the performance of all RSCI top-10 freshmen in the last 10 years. To project the ORtg for a top-20 recruit, I average the performance of all top-20 freshmen over the same time period. Not surprisingly, top-10 freshmen are the most efficient on average, followed by top-20, top-50, and top-100 recruits. But what about the vast majority of freshmen who enter D-I as unranked recruits?

In the future I hope to incorporate a more comprehensive recruiting ranking system that ranks all D-I players. But that will require merging historical recruiting data with RealGM.com's data, a long and painful project. And given the relative plateau in freshmen performance outside the elite level, I'm not convinced there will be a huge pay-off for all that work.

Instead, I try to take advantage of the fact that different programs tend to get a different level of recruit. I estimate each program's average unranked recruit using the historical data. Without sharing all the program rankings, let me just say that they make a lot of intuitive sense. Gonzaga and Notre Dame, for example, excel at turning unranked recruits into efficient players. At the other end of that spectrum is a program like Grambling.
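The tier-averaging step for ranked freshmen can be sketched in a few lines. The tier cutoffs follow the article (top 10, 20, 50, 100); the data format and function names are hypothetical:

```python
from collections import defaultdict

def rsci_tier(rank):
    """Bucket an RSCI rank into the tiers used for freshman projections."""
    if rank is None:
        return "unranked"
    for cutoff in (10, 20, 50, 100):
        if rank <= cutoff:
            return f"top-{cutoff}"
    return "unranked"

def tier_averages(freshmen):
    """freshmen: iterable of (rsci_rank, adjusted_ortg) pairs from the
    historical data. Returns the average adjusted ORtg per tier."""
    buckets = defaultdict(list)
    for rank, ortg in freshmen:
        buckets[rsci_tier(rank)].append(ortg)
    return {tier: sum(v) / len(v) for tier, v in buckets.items()}
```

For unranked recruits, the model swaps the RSCI tier for the program's historical average, as described above.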

Step 5: Project the performance of junior college transfers
This group would more accurately be called "non-freshmen who have not yet played in D-I basketball." I group junior college, D-II, and D-III players together.

Once again historic data can give us an average level of performance for a particular category of player, in this case junior college transfers. Typically JC transfers are better than freshmen, but not quite as good as D-I transfers. A handful of JC transfers are former RSCI top-100 high school prospects who needed to get their academics in order. Not surprisingly, these JC transfers play better on average.

It'd be nice to sort JC transfers by program like the model does with freshmen, but there are relatively few JC players in my database, and team-by-team distinctions aren't feasible. So I use the historical data to pull out some rough trends. For example, better programs (read: major-conference programs) get better JC recruits, but this analysis is rather rough. There are junior college player rankings available online and I may attempt to use that information in the future.

One problem here is simply identifying JC transfers in the historical data. To my knowledge no one has kept a historical database of these players. A simple rule of thumb is to pick up players that debut on a roster after their freshman season and never played for another D-I team, but this will include a ton of junior and senior walk-ons. Based on my rough estimation with scholarship data I can find, walk-ons typically don't use more than 25 possessions in a season. Thus I drop all players with fewer than 25 possessions when constructing this group. Luckily, for current rosters identifying junior college transfers is easy.
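The rule of thumb above (debuted after freshman year, no prior D-I team, at least 25 possessions) translates directly into a filter. The field names here are hypothetical, not the actual schema of my database:

```python
def likely_jc_transfers(players):
    """Flag probable JC/D-II/D-III transfers in historical data: players
    whose first D-I season came after their freshman year, who never
    played for another D-I team, and who used at least 25 possessions
    (dropping likely walk-ons below that threshold)."""
    return [
        p for p in players
        if p["first_season"] > p["freshman_season"]
        and p["prior_d1_team"] is None
        and p["possessions"] >= 25
    ]
```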

Step 6: Project performance for returning sophomores, juniors, and seniors
Player development is not even. While a player with a 95 ORtg will typically have a higher projection than a player with a 90, the projection won't be five points higher. Players with low ORtgs that come back typically have more areas to improve on and make slightly more progress.

Another point to heed here is that past ORtgs are not all equally reliable. If a player used fewer than 25 possessions last year, his ORtg has virtually no predictive power. Conversely, the more a player plays, the more valuable the past information becomes.

While the most recent season is the most important, for players with multiple years of data earlier seasons do still have predictive power. If a player struggles as a sophomore, but did well as a freshman, the model is more likely to expect a bounce-back season as a junior. This observation's confirmed by 10 years of data and is intuitively attractive, but it does cause the model to make slightly more pessimistic projections for a player like Florida State's Michael Snaer. Many people believe Snaer's expanded his game and will become a true superstar after his efficient 2012 season, but the model is skeptical of whether this development is real based on his inconsistent play in 2010 and 2011.

Of course ORtg isn't the only thing that matters. Usage also matters. If a player never cracks the starting lineup, that means something, and this is particularly true for older players. We might not learn much from the ORtg of a player with only 40 possessions as a junior. But we learn a lot from the fact that the coach didn't trust the player enough to play him. (The one exception here is of course injuries, which is why coding them is important.)

Even when a player has multiple seasons of college data, the high school recruiting rank still has predictive power. A former top-100 recruit is more likely to break out, regardless of class. In 2012 Henry Sims was a surprise late bloomer for Georgetown, but he wasn't a complete surprise since he was once an elite recruit.

Finally, I also incorporate information about the coach when predicting player development. The top player development coaches are the coaches you would expect. Both elite coaches (like Mike Krzyzewski) and true player developers (like Utah State's Stew Morrill) tend to have the biggest impact on player development on the offensive end.

Step 7: Project the performance for transfers who sat out a year
Past ORtgs and usage rates still matter for transfers, but not nearly as much as they do for players that stay with the same team. Obviously in a whole new circumstance, against a whole new quality of competition, things may be different. You can't assume a player who had a 120 ORtg at Detroit will put up a 120 at Michigan. Again, high school recruiting ranks matter, and the relative rank of the program matters less, but players that play at multiple major-conference programs tend to do the best. Maybe these are just flukes (remember Syracuse's Wesley Johnson), but if multiple major-conference coaches like a player, there's usually a reason.

Step 8: Project the performance for transfers who play back-to-back seasons
Right now this group is the most frustrating to project. First there are the graduate-school transfers, who represent a recent trend -- we just don't have a large enough sample of players to project performance with much confidence. Second, there are the transfers who have been granted hardship waivers by the NCAA due to family challenges or even tragedies, leaving the players understandably distracted and at a performance disadvantage. Lastly there are the mid-year transfers, and there are all sorts of issues with such players. In addition to having their own small-sample issues, mid-year transfers generally have more trouble integrating with a team because they join in the middle of the schedule. I don't know what to make of transfers who play in back-to-back seasons, other than to say that these players remain a wild card.

Step 9: Project a player's defensive stats
Since blocks, steals, and defensive rebounds are all relatively poor measures of player defense, I keep the model simple and only use one year of past data for each player. High school recruiting rankings do help project extreme shot blockers, etc.

Step 10: Project playing time
Without question this is the most difficult and most important step. One could assign each player a position, for example, and assume players fight for playing time at each position. But I don't think that's quite accurate. With some exceptions, the data suggests college basketball lineups are rather fungible. Northwestern didn't play John Shurna at the center position last year because Shurna "won the job" at center. They played him there because Bill Carmody wanted to put his best players on the floor, and that often meant doing without a true post defender. (True, Shurna was playing center partly because of injuries, but the point about position flexibility still holds.) With some adjustments, the model's going to be focused on talent above position.

As another alternative, one could take last year's starters and only move in bench players or freshmen if there's a hole in the lineup. But I don't believe that's quite accurate either. In most cases incumbent starters will hold their positions because they're the better players. But if a coach has a new player (a transfer or elite recruit), the model allows for the possibility that an incumbent player can see his role slip.

Basically the model assumes a player's minutes will depend on his expected performance and the expected performance of his teammates. Thus each regression will include the projected stats for each player, and the projected stats for his teammates.

I don't include past playing time directly in order to prevent the model from being biased against freshmen and transfers. But I still attempt to use this information. One key observation is that if a player has poor measured statistics but played a bunch of minutes in the past, it's likely that he brought some key unmeasured contribution (e.g., lock-down defense) to the team. Sure enough we find that players who log more minutes than their measured statistics would suggest they should in year N, tend to do it again in year N+1. For that reason the model incorporates the measured stats of players and their teammates, as well as past deviations from expected playing time.
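The "unexpected playing time" feature can be sketched as the residual from a regression of minutes on measured stats. The specific predictors and values below are invented; only the residual idea comes from the article:

```python
import numpy as np

# Hypothetical year-N data: measured stats (ORtg, usage %) and the
# share of team minutes each player actually logged.
stats = np.array([[105.0, 18.0], [ 98.0, 22.0], [112.0, 25.0],
                  [ 95.0, 12.0], [101.0, 20.0]])
minutes_pct = np.array([0.70, 0.85, 0.80, 0.40, 0.55])

# Regress minutes on measured stats (with an intercept).
A = np.column_stack([np.ones(len(stats)), stats])
coef, *_ = np.linalg.lstsq(A, minutes_pct, rcond=None)

expected = A @ coef
# Positive residual = played more than his stats would suggest,
# hinting at an unmeasured contribution like lock-down defense.
unexpected_minutes = minutes_pct - expected
```

That residual is then carried forward as an x-axis variable when projecting year N+1 minutes.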

The defensive stats are given less weight than the projected ORtg in large part because defensive ability is so poorly measured by traditional stats. But I recognize that teams need size on defense. If a team's too short, it may have to give more minutes to a weak forward instead of a quality fifth guard. Thus I make an adjustment based on team size. There are plenty of major-conference teams that run lineups with only one post player, so I don't require teams to give 40 percent of their minutes to centers and power forwards.

I'm also convinced that teams will bump a weak point guard up in the rotation if there's a hole at the position. But demonstrating this empirically has proven elusive. First, the point guard role is actually quite permeable. Natural shooting guards can often become passable point guards when given the opportunity, but it is hard to tell which 2-guards can make the conversion based on historical stats. Second, assists are not as great a measure of the point guard position as you might think. Markel Starks was the "point guard" for Georgetown last year, but in the Hoya system he barely touched the ball. Thus the model is less inclined to believe a team will hand the ball to a freshman point guard than the data says it should be: freshman point guards started about 14 percent of the time in 2012, but the model predicted it would happen only five percent of the time.

Finally, I account for mid-year transfers, past injuries, and current injuries. Jabari Brown is projected to play a lower percentage of his own available minutes for Missouri this season because he joins the team after the first semester.

Step 11: Adjust playing time based on coaching effects
I plan to amplify this step in the future. Right now I have a very minor adjustment for coaches that are disinclined to use freshmen (see Bo Ryan), but there is more to be done. Some coaches tend to use a shorter rotation (John Thompson) while other coaches prefer a longer rotation (Tubby Smith), and that does impact a team's average performance. You can get a much better offensive rating with a tight rotation, and right now I haven't accounted for that.

My current setup results in a longer-than-usual 10-player rotation for most teams. (The exception is when the No. 9 or 10 player projects to get trivial minutes.) The longer lineup is not necessarily more accurate, but it allows for fluctuations in playing time due to probabilities of injury, academic ineligibility, and mid-year transfers. Thus while it's true very few coaches bring a 10-player rotation into the NCAA tournament, this assumption allows for a more accurate minutes prediction at the player level.

For teams with great starters but little depth, this may result in a slight under-prediction in my model. But realistically, like Cracked Sidewalks' top-three model, the offensive prediction is going to depend most heavily on the top three players in every rotation. Whether or not you have stars has a much larger impact on your overall offensive prediction than whether or not you have a good option at No. 10.

Step 12: Project percentage of possessions used for each player
Past values for percentage of possessions used tend to be the best predictor, with an emphasis on the previous season. Players do tend to become more aggressive as they become upperclassmen, and top-100 recruits do tend to shoot more, regardless of their ORtg. The key fact, as Ken Pomeroy noted years ago, is that role players very rarely become high-usage players.

Naturally the percentage of possessions used has to add up across players. No matter what happened last year, you can't have a roster this year where every player's under 20 percent usage. In such cases I have to inflate each player's usage until the numbers add up, and this is why losing a star player matters. The loss of a star forces all the returning players to shoot more than they were comfortable with previously, and ultimately that's going to lower their ORtgs.
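One simple way to make the usage rates add up is to scale everyone uniformly until the minutes-weighted average hits the required 20 percent (five on-floor players summing to 100 percent of possessions). Uniform scaling is my assumption for this sketch; the real model may distribute the extra usage differently:

```python
def inflate_usage(usages, minute_shares):
    """Scale projected usage rates so the minutes-weighted average
    equals 20 percent. usages are per-player usage percentages;
    minute_shares are each player's share of team minutes."""
    weighted_avg = (
        sum(u * m for u, m in zip(usages, minute_shares))
        / sum(minute_shares)
    )
    factor = 20.0 / weighted_avg
    return [u * factor for u in usages]
```

Lose a 30-percent-usage star and every returner's usage gets inflated, which (via the Step 2 tradeoff) drags their projected ORtgs down.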

Step 13: Back out the initial adjustments
To back out the raw player stats, we basically need to reverse steps 1 through 3. First, I reverse the team quality adjustment. If Deron Williams and Luther Head leave Dee Brown behind, that is going to hurt Brown's projection in 2006. Second, I reverse the percentage of possessions equation. If a player is going to shoot 30 percent of the time, his ORtg is going to be lower.

If you follow the progression backward, the next step would seem to be to reverse the schedule-adjusted ORtgs and convert them into raw offensive ratings. But I don't want to do that yet because I need to:

Step 14: Calculate adjusted offense
The model sums schedule-adjusted ORtgs, weighted by playing time and usage, to get a 2012-13 projection for each team's adjusted offense. All it took to get this far was about 2,000 lines of code.
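The weighted sum itself is the easy part. A sketch, assuming the weight on each player's ORtg is his share of team possessions (minutes times usage -- my reading of "weighted by playing time and usage"):

```python
def team_adjusted_offense(ortgs, minute_shares, usage_pcts):
    """Possession-weighted average of schedule-adjusted player ORtgs.

    Each player's weight is his share of team minutes multiplied by
    his usage rate, i.e. his share of team possessions.
    """
    weights = [m * u for m, u in zip(minute_shares, usage_pcts)]
    return sum(o * w for o, w in zip(ortgs, weights)) / sum(weights)
```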

To complete the process, I make the schedule-strength adjustment. For ease in comparison to last season I use the same schedule-strength adjustment as 2011-2012. That won't be accurate in some cases (see TCU), but it makes the player data easier to interpret.

Step 15: Calculate adjusted defense
Since we already have the starting lineup and the projected defensive rebound rate, block rate, and steal rate for various players, we can say quite a bit about a team's defense. But based on historical defensive data, there are also several other variables that matter. Height matters, of course. More height equates to better defense. I include measures of average team height, but the most important effect is the height at the center and power forward positions. Similarly, the number of elite recruits matters. On average, elite recruits are better defenders. And, not surprisingly, experience matters. If a team gives more minutes to new players and underclassmen, the defensive projection will be worse.

Lastly I include one unusual factor in the model. Recall that earlier I determined whether players played more minutes than their measured statistics would expect. It turns out this unexpected playing time does have some small amount of predictive power on team defense. If a team has a player with significant minutes and few measured statistics, it is more likely that this player is a lock-down defender. I call this the Travis Walton adjustment, because it's based on something Kyle Jen wrote about Michigan State's superb defender a few years ago. Teams that lose players like Walton do tend to do a little worse on defense the following season.

That being said, all of these defensive factors do a relatively poor job predicting team defense. The sad reality is that unless Luke Winn and David Hess spend countless hours charting players, we simply miss too much of what happens on the defensive end.

In an attempt to spare Luke and David all that toil, I've incorporated another factor. I think coaches have a dramatic impact on team defense. There's a reason that John Calipari is competing for a national title on a consistent basis, while Scott Drew hasn't been able to win a conference title at Baylor. Both coaches have had skilled offensive players. But Calipari simply teaches a level of defense that Drew can't match. Looking at 10 years of defensive data, after accounting for the above factors, the following coaches have the biggest annual impact on their own defense:

1. Bill Self
2. Rick Pitino
3. Kevin O'Neill
4. John Calipari
5. Bo Ryan
6. Larry Shyatt
7. Matt Painter
8. Frank Martin
9. Thad Matta
10. Mike Krzyzewski
11. Brad Stevens
12. Anthony Grant
13. Kevin Willard
14. Bruce Weber
15. Rick Barnes

In case you're wondering, Leonard Hamilton is at No. 19. Larry Shyatt only has one year of observed data, but he absolutely turned Wyoming around on the defensive end last year. Rick Barnes doesn't get a lot of credit for what Texas does on defense, but keep in mind how young his teams have been over the last decade.

Now, the question is how much weight to put on these defensive coach effects. Most national publications put a heavy weight on the offensive talent on a team. If I ignored these coaching effects, my preseason rankings would look a lot more similar to those in most national publications.

I fit the data best with a model that gives a particularly high weight to the most recent season for each coach, but also some weight to all historic seasons. Again, while this fits the data better, it will mean that the model disagrees more frequently with the preseason top 25. For example these coach effects, particularly on defense, cause the model to love Duke ahead of NC State in the ACC in 2012-2013.

Step 16: Calculate Pythagorean ratings and project wins and losses
Next I calculate the Pythagorean rating, which equals 1/(1 + (PredDef/PredOff)^10.25). Then I simulate each team against its schedule, where the probability of winning equals (Own Pyth. - Own Pyth. * Opp. Pyth.) / (Own Pyth. + Opp. Pyth. - 2 * Own Pyth. * Opp. Pyth.), and I calculate the expected number of conference wins for each team.
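Both formulas come straight from the text; here they are as runnable code (function names are mine):

```python
def pythagorean(pred_off, pred_def, exponent=10.25):
    """Expected winning percentage against an average D-I team."""
    return 1.0 / (1.0 + (pred_def / pred_off) ** exponent)

def win_prob(own, opp):
    """Probability that a team with Pythagorean rating `own` beats a
    team with rating `opp` (the log5-style formula in the text)."""
    return (own - own * opp) / (own + opp - 2 * own * opp)

def expected_wins(own_pyth, opponent_pyths):
    """Sum of single-game win probabilities over a schedule."""
    return sum(win_prob(own_pyth, p) for p in opponent_pyths)
```

A sanity check: two evenly matched teams each have a 50 percent chance, and the two teams' probabilities in any game sum to one.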

I haven't ranked Northern Kentucky or New Orleans for 2013 even though both teams should be playing a near full D-I schedule this year. But since those rosters didn't play a full D-I schedule in 2012, I don't have a great historical basis to make a projection.

Closing thoughts
Notice that nowhere in all of the above does the model explicitly turn returning minutes into an offensive prediction. But to break out a 12th grade science word, it happens "endogenously" in the system. Freshmen will typically have lower projected ORtgs than veteran players, and teams that return fewer players will have more freshmen higher in the lineup. Still, that effect's not linear. Losing one or two players to graduation will not be equivalent to losing seven players to graduation. And losing one or two post players could be more costly if a team has fewer quality big men on the roster.

I'm hardly a neutral observer, but let me point to two aspects I like about this approach to projecting team performance. First, we no longer get what I call a "Chip Armelin" penalty. In a returning minutes model, the loss of Armelin hurts Minnesota because he did play some minutes and score some points. But Armelin wouldn't be projected into the top 10 players for the Gophers this year, so the truth is his transfer should be nearly irrelevant. Sure enough, in the new playing-time projection model presented here, his departure is regarded as irrelevant.

Second and more importantly, this model picks up the loss of star players. In an additive returning minutes model, if a team loses a 20-point scorer and two 2-point scorers, that's the same as losing three eight-point scorers. But in this model losing a star player hurts more. It significantly lowers the quality of the starting rotation (which matters a lot), it has a team spillover effect (the Dee Brown effect), and it may impact the usage rate of all returning players.

The model still doesn't measure everything. I don't think any model can measure everything Draymond Green brought to Michigan State last year. But in this model, his departure gets a heavier weight than in a returning minutes model.

Bottom line
Data can't tell us everything about the upcoming season. There will always be surprises, and that's what makes college basketball great. But the techniques for predicting the season are getting better, and through them we can gain insight into teams we may have overvalued or overlooked.

Follow Dan on Twitter. Dan Hanner writes about college hoops at RealGM.com, and is a longtime contributor to the College Basketball Prospectus book series.

