Posts : 2608 Join date : 2020-11-17 Location : Netherlands
Subject: Understanding rating lists Sun May 08, 2022 8:11 am
I am not an expert at running a rating list. I have some experience running the GRL, but I can't stand in the shadow of those who have been doing it for years; from memory, both CEGT and CCRL started in 2005/06. So the below is no criticism at all, I just want to understand the different systems that are in use.
So Dio, Graham and Stefan, if you read this, I am interested in your comments.
This is a fragment of the ongoing testing of Rebel 15 for the CCRL 40/15 list. As we can see, Rebel has to play all the top engines, including multiple SF derivatives.
CEGT takes a totally different path, a bit like I did for the GRL: you create an elo pool based on the programmer's predicted elo gain, with a not-too-big elo margin, certainly not the 200-300 elo seen in the above example.
I want to ask Stefan if he is willing to describe his way of testing and how he composes a pool of engines.
adminx and Dio like this post
pohl4711
Posts : 159 Join date : 2022-03-01 Location : Berlin
Subject: Re: Understanding rating lists Sun May 08, 2022 12:36 pm
SPCC
Normal way of SPCC testing: the new engine plays 7 opponents (1000 games each, from 500 balanced HERT openings) which are not too much weaker or stronger than the engine being tested; the result should be in a range of 30%-70% if possible (vs. Stockfish, most opponents score below 30%...). When a tested engine is an update, the older version of the engine (and its played games vs. other engines) is deleted in the following weeks. If that leaves any other engine in the rating list with fewer than 7000 played games, that engine must be an opponent in the updated engine's test run, because each rated engine should have at least 7000 games. So, sometimes, an engine plays more than 7000 games...
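As an aside, the 30%-70% window translates directly into an Elo band via the logistic expectation curve. A minimal sketch (my own illustration, not Stefan's actual tooling; engine names and ratings are made up) of filtering a pool for opponents inside that window:
Code :
import math

def expected_score(test_elo, opp_elo):
    """Logistic Elo model: expected score of the tested engine."""
    return 1.0 / (1.0 + 10 ** ((opp_elo - test_elo) / 400.0))

def pick_opponents(test_elo, pool, lo=0.30, hi=0.70):
    """Keep pool entries whose expected score vs. the tested engine
    falls inside the 30%-70% window."""
    return [name for name, elo in pool.items()
            if lo <= expected_score(test_elo, elo) <= hi]

# Hypothetical engines and ratings, for illustration only
pool = {"EngineA": 3550, "EngineB": 3470, "EngineC": 3390, "EngineD": 3240}
print(pick_opponents(3400, pool))  # ['EngineB', 'EngineC']
# A 30%-70% score window corresponds to roughly +/-147 Elo, so
# EngineA (+150) and EngineD (-160) fall just outside it.
Incidentally, that ~147 Elo band is close to the +/-150 Elo rule CEGT describes below.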
Admin and Dio like this post
Dio
Posts : 222 Join date : 2021-08-28
Subject: Re: Understanding rating lists Sun May 08, 2022 6:31 pm
In CEGT, usually 50 per cent of the test opponents have a comparable playing strength to the engine to be tested. The other test opponents should be in the range of +/- 150 Elo. I try to ensure that the tested engine has a score of about 50 per cent at the end of the test. A first test usually includes at least 1000 test games, currently this value is significantly higher. In our experience, the Elo value is already quite stable after 1000 games and changes only slightly thereafter. The CEGT 40/4 list includes more than 3.500.000 games.
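For what it's worth, the "1000 games" observation lines up with the statistics. A minimal back-of-the-envelope sketch (my own, not CEGT's actual tooling) of the error margin of a measured Elo difference after n games:
Code :
import math

def elo_from_score(s):
    """Convert a score fraction (0 < s < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_interval(wins, losses, draws, z=1.96):
    """Approximate 95% confidence interval of the measured Elo difference."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0)
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    margin = z * math.sqrt(var / n)
    return elo_from_score(s - margin), elo_from_score(s + margin)

# 1000 games at a 50% score with a 60% draw rate:
lo, hi = elo_interval(wins=200, losses=200, draws=600)
print(f"{lo:+.1f} .. {hi:+.1f} Elo")  # about -13.6 .. +13.6
The margin shrinks only with the square root of the game count, so going from 1000 to 4000 games merely halves it, which matches the observation that the value changes only slightly after 1000 games.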
Admin likes this post
Admin Admin
Posts : 2608 Join date : 2020-11-17 Location : Netherlands
Thanks guys for the contributions. I can see that CEGT and SPCC have much in common with the way I composed the elo pool for the GRL regarding the upper limit; the only exception is CCRL. I guess in the end it does not matter much, judging from the individual TPRs. It seems that 6 draws against FF2 are enough to get a positive TPR of 87 elo.
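That TPR is simply a property of performance ratings: an all-draw result pins the TPR to the opponent's average rating. A minimal sketch (my reading of the quoted number, assuming FF2 sits about 87 elo above Rebel on the list):
Code :
import math

def tpr(avg_opponent_elo, points, games):
    """Tournament performance rating: the rating at which the achieved
    score would be the expected score against the average opponent."""
    s = points / games
    return avg_opponent_elo - 400.0 * math.log10(1.0 / s - 1.0)

# Six draws = 3/6 = 50%, against an opponent rated +87
# relative to the tested engine:
print(tpr(avg_opponent_elo=87, points=3, games=6))  # 87.0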
I assume CCRL does this kind of -- at first sight -- nonsense testing to ensure that 1) much stronger engines also effectively deal with much weaker engines, and 2) top-notch engines (like SF and Komodo) play enough games, because the difference with the rest of the pack is pretty big. Hopefully Graham has something to say about it.
....The other test opponents should be in the range of +/- 150 Elo.
Just a short addition: 150 plus/minus would be great and we try to reach this. But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish. So we normally increase the number of games from 100 to 200 per match and/or test vs. opponents running on 4 or even 8 Threads .....
Quote :
The CEGT 40/4 list includes more than 3.500.000 games.
Not yet (3.15 million actually) but we are working hard on it... so your contribution is much appreciated...
Best Wolfgang
Admin, Dio and Wolfgang like this post
Admin Admin
Posts : 2608 Join date : 2020-11-17 Location : Netherlands
I agree, I had the same problem every time I tested a new SF version: you wanted at least 2000 games and the usual poor victims were Ethereal, RubiChess, Pedone, Nemorino, etc. But in the end ORDO did the magic and the result was not much different from other rating lists.
Quote :
....But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish.
At least at long time control, 150 plus/minus is not impossible to achieve with the big guns. Here is the top of the CCRL 40/15 list:
I can add that I think the gap between the best and the rest of the field is getting smaller, and I guess we are close to seeing 100% draws in computer chess if we do not do something.
Maybe it is better to have a different list that is not ranked based on elo, where the target is to get the maximal number of wins against the leaders of the rating list and draws and losses count the same. So an engine that gets 20% wins and 70% losses against the top 10 of the rating list would be ranked higher than an engine that gets 10% wins and no losses.
Quote :
....an engine that gets 20% wins and 70% losses against the top 10 of the rating list would be ranked higher than an engine that gets 10% wins and no losses.
Very good point! Unfortunately, I don't think it will work. We seem to be getting to a level where it's prohibitively difficult to get a win.
Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins! Try selling an investment tool which only profits on 20% of its recommendations, returns your stake on 10%, and loses your money on 70%!
Quote :
....Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins!
For analysis of games I prefer a tool that wins 20% even if it loses 70%, over a tool that wins only 10% and draws 90%. I do not suggest this list instead of a rating list, but as an additional list that is not a rating list but a winner list.
Chris Whittington
Posts : 1254 Join date : 2020-11-17 Location : France
Quote :
....Maybe it is better to have a different list that is not ranked based on elo, where the target is to get the maximal number of wins against the leaders of the rating list and draws and losses count the same.
W:L:D 20:70:10 = 20 + 10/2 = 25 pts; W:L:D 10:0:90 = 10 + 90/2 = 55 pts.
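To make the two orderings concrete, a minimal sketch (my own illustration; no existing list works this way) of standard scoring versus the proposed wins-only "winner list":
Code :
def standard_points(wins, losses, draws):
    return wins + 0.5 * draws   # a draw is worth half a point

def winner_list_points(wins, losses, draws):
    return wins                 # draws and losses both count as zero

for w, l, d in [(20, 70, 10), (10, 0, 90)]:
    print(f"W:L:D {w}:{l}:{d} -> standard {standard_points(w, l, d):g} pts, "
          f"winner list {winner_list_points(w, l, d)} pts")
# Standard scoring ranks 10:0:90 (55 pts) above 20:70:10 (25 pts);
# the winner list reverses the order (20 wins beat 10 wins).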
Basically you want the engine to go all out for a win, and otherwise not care; a draw is the same as a loss. In extremis this means it might as well throw drawn endgames. For example KRKB: Black may as well just put his bishop en prise. White will take, because he gets an “unexpected” win, and Black loses but doesn't care. Or KRKR: which side throws the rook, Black or White? I'd guess that if you were to train a NN evaluator on wins only, there are going to be some mighty weird move choices made by the trained net.