Expanding the Elo rating system

Oliver Brown

One of the features I have planned for Tic-tac-toe Collection is a player rating system.

A common rating system is the Elo rating system, most famously used in chess. There is a lot of discussion about how good it is, but for my purposes it seems fine, at least as a starting point.

There are two fundamental restrictions on Elo that I would have to overcome though:

  • It is limited to two players.
  • It assumes each player has an equal chance of winning if they have equal skill.

Elo

First, an introduction to how the Elo rating works. There are two parts:

  1. Given two player ratings, calculate an expected result.
  2. Given a player rating, an expected result, and an actual result, calculate how much the rating should change.

The expected result is essentially the chance of winning, given as a decimal between 0 and 1, in which 0 represents a guaranteed loss and 1 a guaranteed win.

If you do better than expected, your rating goes up. If you do worse, it goes down. The bigger the difference, the bigger the change.
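As a concrete sketch of those two parts: the logistic curve with a 400-point scale is the standard Elo formula, but the K-factor of 32 is an assumption of mine, since nothing here fixes one.

```python
# A minimal two-player Elo sketch. The 400-point scale is standard Elo;
# K = 32 is an assumed K-factor (a common choice for casual play).
K = 32

def expected(rating, opponent_rating):
    """Expected score (chance of winning) for a player, between 0 and 1."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def update(rating, expected_score, actual_score, k=K):
    """New rating given an actual score: 1 for a win, 0.5 for a tie, 0 for a loss."""
    return rating + k * (actual_score - expected_score)
```

For example, a 1000-rated player facing a 1200-rated player has an expected score of about 0.24, and gains about 24 points by winning.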

Multiplayer

Simple Multiplayer Elo

I found an existing implementation of Elo for more than two players by Tom Kerrigan called Simple Multiplayer Elo.

In SME, players are ordered by performance, and then a simulated match is played between pairs of consecutive players, with the players’ ratings updated using the normal Elo method for two players.

This certainly works, and Tom includes some tests to show it behaves well. One oddity: if you beat someone with a much higher rating than you, but finish more than one place ahead of them, you get essentially no credit for it. For example, consider the following result:

Position | Player | Initial rating | vs player above | vs player below | Total
-------- | ------ | -------------- | --------------- | --------------- | -----
1st      | A      | 1000           |                 | +24             | +24
2nd      | B      | 1200           | −24             | +27             | +3
3rd      | C      | 1500           | −27             |                 | −27

Here, Player A with a rating of 1000 has beaten Player C with a rating of 1500, but essentially hasn’t gotten the credit for it.
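From that description, SME can be sketched as below. The pairwise-update details and the K-factor of 32 are my reading of the scheme rather than Tom's exact code, but with those assumptions it reproduces the table above.

```python
K = 32  # assumed K-factor

def expected(rating, opponent_rating):
    """Standard Elo expected score against a single opponent."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def sme_changes(ratings_in_finish_order, k=K):
    """Simple Multiplayer Elo: each pair of consecutive finishers plays a
    simulated two-player game, which the higher-placed player wins."""
    changes = [0.0] * len(ratings_in_finish_order)
    for i in range(len(ratings_in_finish_order) - 1):
        winner = ratings_in_finish_order[i]
        loser = ratings_in_finish_order[i + 1]
        e = expected(winner, loser)    # winner's expected score
        changes[i] += k * (1 - e)      # winner did better than expected
        changes[i + 1] -= k * (1 - e)  # loser's change mirrors it
    return changes
```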

My unnamed Elo

My solution is to conceptually run the algorithm on every pair of players (in this case there would be A v B, B v C and A v C). There is a straightforward optimization: the rating-change step does not need to know about the other players directly, only the difference between a single player's combined expected score and actual score. So the algorithm is actually:

  1. Calculate the expected scores for each pair of players.
  2. Sum the expected score for each player.
  3. Divide each score by the number of players to normalize back to the range 0 to 1.
  4. Calculate the rating change for each player using their actual score and the combined expected score.

The details

With the same data as above, the results are as follows:

Player | Initial rating | Expected score
------ | -------------- | --------------
A      | 1000           | 0.097
B      | 1200           | 0.304
C      | 1500           | 0.599

This is made up of the following score pairs:

Pair  | A    | B    | C
----- | ---- | ---- | ----
A v B | 0.24 | 0.76 |
B v C |      | 0.15 | 0.85
A v C | 0.05 |      | 0.95

This results in rating changes of:

Position | Player | Initial rating | Change
-------- | ------ | -------------- | ------
1st      | A      | 1000           | +29
2nd      | B      | 1200           | −9
3rd      | C      | 1500           | −19

  • Player A has gained more, which is good. That was basically the goal of the change.
  • Player B has turned a small gain into a moderate loss. That’s a little odd and probably not desired; after all, the victory over C should be more significant than the loss against A.
  • Player C has turned a big loss into a slightly smaller (but still big) loss. After some thought, that is probably reasonable: although C is expected to win, simply having more players in a game should reduce the expectation of winning, and therefore the penalty for failing to.

Improvements

So why did these results happen? It helps to look at a table that also includes the expected and actual results, which reveals a choice I have not yet mentioned:

Position | Player | Initial rating | Expected score | Actual score | Change
-------- | ------ | -------------- | -------------- | ------------ | ------
1st      | A      | 1000           | 0.097          | 1            | +29
2nd      | B      | 1200           | 0.304          | 0            | −9
3rd      | C      | 1500           | 0.599          | 0            | −19

Notice the actual scores used. Player A has a score of 1 indicating a win. Players B and C have a score of 0, which indicates a loss, but an equal loss.

If we instead use different values for the actual score, we get a more sensible result:

Position | Player | Initial rating | Actual score | Change
-------- | ------ | -------------- | ------------ | ------
1st      | A      | 1000           | 0.67         | +18
2nd      | B      | 1200           | 0.33         | +1
3rd      | C      | 1500           | 0            | −19

In this case, the “actual score” for a player in a C-player game who finishes ahead of N other players is:

2N / (C × (C − 1))
Explaining the formula

This formula is conceptually simple, but a bit opaque when simplified.

Think of it as each player getting a number of shares of the total score. The person in last place gets 0 shares, the person second from last gets 1 share, the next player gets 2 shares, and so on. The person in first place gets C − 1 shares.

That means a player who finishes ahead of N others gets N shares, and the total number of shares is the (C − 1)th triangular number.

The formula for the xth triangular number is:

T(x) = x × (x + 1) / 2

Substituting x = C − 1:

T(C − 1) = (C − 1) × ((C − 1) + 1) / 2

This simplifies to:

T(C − 1) = C × (C − 1) / 2

Therefore each player gets:

N / (C × (C − 1) / 2)

which simplifies to:

2N / (C × (C − 1))
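As a sketch, with N expressed in terms of a 1-based finishing position:

```python
def actual_score(position, player_count):
    """Share-based actual score for the player at 1-based `position` in a
    game of `player_count` players, assuming no ties."""
    n = player_count - position  # number of players finished ahead of
    return 2 * n / (player_count * (player_count - 1))
```

The scores always sum to 1 across all players, just as a single win (1) plus a single loss (0) does in the two-player case.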

Handling ties

To be as general as possible (and because it actually happens in some of my multiplayer Tic-tac-toe variants) we need to handle arbitrary ties.

The simplest way is to evenly distribute the score to all the players that are tied. So if the top two players in a three-player game tie, their scores of 0.67 and 0.33 are summed and split between them (0.5 each).

As a more complex example, consider:

Position | “Raw” score | Actual score
-------- | ----------- | ------------
1st      | 0.286       | 0.262
1st      | 0.238       | 0.262
3rd      | 0.190       | 0.143
3rd      | 0.143       | 0.143
3rd      | 0.095       | 0.143
6th      | 0.048       | 0.048
7th      | 0           | 0
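The tie handling can be sketched as below. Positions are 1-based with tied players sharing a position (e.g. [1, 1, 3, 3, 3, 6, 7] for the table above); the helper names are mine.

```python
from collections import Counter

def raw_scores(player_count):
    """Share-based "raw" scores for positions 1..player_count, no ties."""
    total = player_count * (player_count - 1)
    return [2 * (player_count - pos) / total for pos in range(1, player_count + 1)]

def tied_scores(positions):
    """Actual scores for 1-based positions where tied players share a
    position: the raw scores of the tied slots are averaged among them."""
    raw = raw_scores(len(positions))
    counts = Counter(positions)
    average = {pos: sum(raw[pos - 1 : pos - 1 + c]) / c for pos, c in counts.items()}
    return [average[pos] for pos in positions]
```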

Final thoughts

I’ve addressed the first restriction of Elo: that it only supports two players. As for unfair games, that will have to wait, as this post is long enough as it is.

Elo rating for unfair games


In my previous post, I described an extension to Elo that can handle multiple players. The next restriction to overcome is that Elo assumes players of equal skill have an equal chance of winning.

In most games, players don’t actually have an equal chance of winning a single game. Chess overcomes this by having a match consist of several individual games, with the players switching who goes first. This is a good solution for two-player games but gets awkward for multiplayer games. It is also inconvenient if a multi-game match is undesirable for any reason.

Adjusted rating

Elo is fundamentally designed to handle players of different skill levels and produce a probability of them winning. The approach is therefore to determine an adjusted rating for someone such that the probability of winning is as desired.

The formula for the Elo expected score, where R is the rating difference (the opponent’s rating minus the player’s), is:

E = 1 / (1 + 10^(R / 400))

In our case we have an expected score (the win probability) and want to work out a rating difference, so we can just rearrange to get:

R = 400 × log10(1 / E − 1)

We then subtract this adjustment from the player’s actual rating to get their effective rating to use in the rest of the Elo calculations.

The reason we subtract it is that, for players A and B, the rating difference is calculated as RB − RA, which is the opposite way round from what we want.

Here are the rating adjustments for some sample win probabilities:

Win probability | Rating adjustment
--------------- | -----------------
0.01            | −798
0.10            | −382
0.25            | −191
0.49            | −7
0.50            | 0
0.51            | +7
0.75            | +191
0.90            | +382
0.99            | +798

If you try to calculate the adjustment for a win probability of 0 or 1, you get -∞ and +∞ respectively.

This means, for example, that beating an equally rated opponent at a game you only have a 10% chance of winning is the same as beating someone rated 382 points higher at a fair game.
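The adjustment can be sketched as below. I’ve folded the subtraction in, so the function returns the net adjustment shown in the table (negative for games you are unlikely to win); the function names are mine.

```python
import math

def rating_adjustment(win_probability):
    """Net rating adjustment for a player whose chance of beating an
    equally rated opponent is `win_probability` (must be strictly
    between 0 and 1; the endpoints give infinities)."""
    return -400 * math.log10(1 / win_probability - 1)

def effective_rating(rating, win_probability):
    """Rating to use in the rest of the Elo calculations."""
    return rating + rating_adjustment(win_probability)
```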

Putting it all together

After explaining the theory of how to expand an Elo rating to support all the variations in Tic-tac-toe Collection, it’s time to put it into practice.

While writing this I’ve also been working on an implementation.

As well as adding it to Tic-tac-toe Collection, I’m planning on creating some kind of site and/or app just to work out Elo ratings for various results (and hopefully even to maintain ratings for players in any kind of competition you might want to run).

Elo rating for ad hoc teams


One final feature my expanded Elo rating needs (or at least the last I can think of) is the ability to deal with ad hoc teams.

By “ad hoc teams”, I mean teams of individual players with their own ratings that are on the same team for a specific game, but don’t generally stay as a team (established teams that always play together should be treated as their own “players” with their own rating).

This is not a common requirement, but the specific use case I had was an office ping pong table. Sometimes people would play singles and sometimes doubles, but with no really established teams.

Necessary features

Firstly, the two key rating operations need to work:

  • Estimating the result of an unplayed game
  • Updating ratings after an actual result

And all the existing features should be supported:

  • Two or more teams
  • Unfair games
  • Ties

Additionally, it should support teams of arbitrary (and mixed) sizes, including teams of size one. This brings us to one of the first less-obvious requirements: since this is an expansion of an existing system, it should be compatible with that system where they overlap. So the following additional requirement makes sense:

  • Teams of one should give the same result as just using individuals

Simple solution

Just as with unfair games, where an adjusted rating is calculated first and then used in the rest of the algorithm, an adjusted rating should be calculated for each team. This would trivially allow all the existing features to just work.

The most obvious way to calculate such a rating is a simple arithmetic mean of the members’ ratings. This definitely satisfies our key requirement, but would it produce meaningful results?
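A minimal sketch of the mean-based team rating (the function name is mine):

```python
def team_rating(member_ratings):
    """Adjusted rating for an ad hoc team: the arithmetic mean of its
    members' ratings. A one-player team is just that player's rating,
    so it matches the plain individual calculation."""
    return sum(member_ratings) / len(member_ratings)
```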

At this point I think simplicity has to win out over sophistication. The most general solution would allow players to be weighted within each team (perhaps different roles in a team have different impacts on the result), but I think those situations are more likely to be handled with a per-team rating.