As the 2013 season gets under way, it's a good time to look at how we fared in 2012 and release our latest modifications to the algorithm.  A few things elicited strong comments during the year, but on the whole the feedback has been positive.

For starters, we received congratulations after the WFTDA championships for how close we got the final rankings.  Of course it makes us feel good to have everyone agree with us, but it's worth remembering that no mathematical system is ever exactly right.  We just got lucky this year in how the championship teams finished up.  For a more rigorous analysis of our performance, we look at how the algorithm's predictions fared over all the bouts this year.  As I did last year, I constructed the following plot by comparing the actual DoS result of each of the 708 bouts in 2012 with the respective prediction from the algorithm (there were actually 743 bouts in 2012, but only 708 that did not involve an unranked team).

The x-axis shows the difference between the actual result and the prediction, from the perspective of the home team. A positive value means that the home team performed better than predicted and a negative value means that the home team performed worse. Of course a zero means that the prediction was exactly right. The height of each bar shows the number of bouts that had that result.
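For anyone who wants to reproduce that kind of plot from their own bout data, here is a minimal sketch of the idea. The file name, column names, and use of pandas/matplotlib are my assumptions for illustration only; this is not the actual ranking code.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical bout table: one row per 2012 bout, with the actual and
    # predicted DoS result from the home team's perspective.
    bouts = pd.read_csv("bouts_2012.csv")
    ranked = bouts.dropna(subset=["predicted_dos"])   # skip bouts with an unranked team

    # Positive error: home team did better than predicted; negative: worse.
    error = ranked["actual_dos"] - ranked["predicted_dos"]

    plt.hist(error, bins=40)
    plt.xlabel("Actual minus predicted DoS (home team perspective)")
    plt.ylabel("Number of bouts")
    plt.show()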

A quick comparison to how we did in 2011 shows that the 2012 version of the algorithm appears to be keeping closer to zero for more of the bouts.  So that's good: we're moving in the right direction.  The dashed line shows the same distribution from all the previous bouts that were used to train the algorithm, that is, all bouts from 2005 through the 2011 Championships.  Only the height of the curve was scaled to match the number of bouts in 2012.  The fact that it fits the data quite well (for those who geek out on this sort of thing, a chi-squared test gives an 87% probability that these two are describing the same distribution) means that our mathematical assumptions about the statistical variation are holding.
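For the curious, the comparison behind that 87% figure can be done along the following lines. This is only a sketch under my own assumptions (the file names, the binning, the error range, and the scipy call), not the code we actually ran.

    import numpy as np
    from scipy.stats import chisquare

    # Hypothetical arrays of prediction errors (actual minus predicted DoS):
    # one for the 708 bouts of 2012, one for the 2005-2011 training bouts.
    error_2012 = np.loadtxt("errors_2012.txt")
    error_train = np.loadtxt("errors_2005_2011.txt")

    bins = np.linspace(-1.0, 1.0, 21)                  # assumed error range
    observed, _ = np.histogram(error_2012, bins=bins)
    expected, _ = np.histogram(error_train, bins=bins)

    # Drop bins the training data never populates to avoid dividing by zero,
    # then scale the training histogram to the 2012 bout count and ask how
    # likely it is that both samples come from one distribution.
    mask = expected > 0
    observed, expected = observed[mask], expected[mask]
    expected = expected * observed.sum() / expected.sum()
    stat, p_value = chisquare(observed, f_exp=expected)
    print(f"chi-squared p-value: {p_value:.2f}")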

Of course not everything was peaches and cream.  A lot of people were concerned about a few glaring anomalies in which newly ranked teams appeared to be rated significantly higher than would seem credible.  Internally we referred to this as the "VRDL effect", although I think we actually received more comments about Black-n-Bluegrass.  I spent a while studying these and others in detail.  Luckily there aren't too many of them and the self-correcting nature of the algorithm eventually gets everything back in balance.  That being said, I was able to identify what appears to be a systematic effect whenever a team's first bout is a severe blowout.  This can cause a new team to be under-rated as well as over-rated, depending on the direction of the blowout. 

Needless to say, identifying an effect is much easier than fixing it.  For the coming 2013 season, I've implemented a scaling parameter that appears to minimize the impact.  There are still one or two anomalies, but on the whole the system looks to be more robust.  At this point, I return to the statement that no mathematical system can ever be perfect.  I will be watching along with everyone else to see how this latest version fares over the season.
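To give a flavor of what such a scaling parameter might look like (the real algorithm and its tuned values are not published here, so everything below is a made-up illustration), one option is to shrink how much a new team's first bout counts when that bout is a severe blowout:

    def first_bout_weight(dos, threshold=0.6, floor=0.5):
        """Hypothetical damping factor for a new team's first bout.

        dos is the bout result on a -1 to 1 scale from the new team's
        perspective; threshold and floor are illustrative values only.
        """
        if abs(dos) <= threshold:
            return 1.0                                # normal result: full weight
        # Severe blowout in either direction: reduce the bout's influence,
        # but never below the floor, so the bout still counts for something.
        excess = (abs(dos) - threshold) / (1.0 - threshold)
        return max(floor, 1.0 - excess)

    # Example: a 0.9 blowout in a first bout gets only half the usual weight.
    print(first_bout_weight(0.9))   # -> 0.5 with these made-up numbers

Because the scaling looks only at the magnitude of the result, it damps both the over-rating and the under-rating cases.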

So that's it.  The same algorithm is still going strong and I continue to be amazed at how well it's performing.  The only modifications have been to how we handle new/unranked teams.  The parameters have been re-tuned with all the latest adjustments.  I tried to keep the ratings in the same range as before, but there will be some slight variations, so don't be alarmed if your team goes up or down a few places.

As always, we welcome your feedback and encourage you to keep coming back.  We've got some big things building behind the scenes that we'll be bringing out in the next few months.


Comments

Woohoo for retrained parameters!
How many weeks did that take?

Running my code to re-optimize the parameters only takes about an hour. The real time was spent in diagnosing the systematic effect, investigating ways to deal with it, testing code using various possible fixes and comparing them against each other, and then writing the final code to perform the optimizations. I think I probably started working on it back in October, so it took a little over two months to get it all wrapped up.

reynman wrote:
Running my code to re-optimize the parameters only takes about an hour.

Southbay:
Math Envy.

Apparently Angel City is rated as high as it is because Arizona forfeited a game against them. Is that right? And if so, should that matter?

Yeah, there's no good solution for forfeits. I remind you that we delved into this topic in depth back in 2011.  A 40 point shift is a lot, so it will take a while for the algorithm to self-correct.  Apparently, forfeits are becoming more frequent than we originally expected.  That being said, I don't think we can choose which bouts to accept and which to ignore.  I repeat Aaron's wise comments from that previous blog:

"...we defer to WFTDA for whether a bout is official or unofficial.  Their rules are deliberated, tested, and revised, and they have a process in place for reporting and approving bouts.  I think it's better to stick with the big picture of roller derby, than forging our own opinions on the way it should be."

I agree there's no good way to reflect forfeits, but I still have a problem with rating something that did not happen. Although you may not want to choose which bouts to accept, in effect you are, through the way you reflect a forfeit. Yeah I know, you're on the horns of a dilemma.