ProDeo

Computer Chess
 

 

 Understanding rating lists

Admin


Posts : 2608
Join date : 2020-11-17
Location : Netherlands

PostSubject: Understanding rating lists (Sun May 08, 2022 8:11 am)

I am not an expert at running a rating list. I have some experience running the GRL, but I can't stand in the shadow of those who have been doing it for years; from memory, both CEGT and CCRL started in 2005/6. So the below is no criticism at all, I just want to understand the different systems that are in use.

So Dio, Graham and Stefan, if you read this, I am interested in your comments.

This is a fragment of the ongoing testing of Rebel 15 for the CCRL 40/15 list. As we can see, Rebel has to play all the top engines, including multiple SF derivatives.

Code:
RANK ENGINE                                    GAMES  POINTS
1.   REBEL 15 64-BIT                            329   138.5  
2.   DRAGON 3 BY KOMODO 64-BIT                   12    10.0  
3.   BERSERK 8.5.1 64-BIT                        12     9.5  
4.   SHASHCHESS 21.1 64-BIT                      10     9.5  
5.   FAT FRITZ 2 (IN SF) 64-BIT                  12     9.0  
6.   STOCKFISH 15 64-BIT                         10     9.0  
7.   ARASAN 23.3 64-BIT                          12     8.0  
8.   ETHEREAL 13.50 64-BIT                       12     8.0  
9.   RUBICHESS 20220223 64-BIT                   10     8.0  
10.  SLOWCHESS BLITZ 2.8 64-BIT                  10     8.0  
11.  HOUDINI 6.03 64-BIT                         12     7.5  
12.  SUGAR AI 2.50 64-BIT                        10     7.5  
13.  FIRE 8 64-BIT                               12     7.0  
14.  REVENGE 2.0 64-BIT                          10     7.0  
15.  SEER 2.5.0 64-BIT                           10     7.0  
16.  CLOVER 3.1 64-BIT                           12     6.5  
17.  IGEL 3.0.5 64-BIT                           11     6.5  
18.  KOIVISTO 8.0 64-BIT                         10     6.5  
19.  NEMORINO 6.00 64-BIT                        10     6.5  
20.  FRITZ 18 NEURONAL 64-BIT                    12     6.0  
21.  MINIC 3.18 64-BIT                           10     5.5  
22.  DANASAH 9.0 64-BIT                          12     5.0  
23.  TUCANO 10.00 64-BIT                         10     5.0  
24.  COMBUSKEN 2.0.0 64-BIT                      12     4.5  
25.  DEEP SHREDDER 13 64-BIT                     12     4.5  
26.  WASP 5.50 64-BIT                            10     4.5  
27.  CHIRON 5 64-BIT                             12     4.0  
28.  VELVET 3.3.0 64-BIT                         10     3.5  
29.  MARVIN 5.2.0 64-BIT                         10     3.0  
30.  LC0 0.28.2 W752187 64-BIT                   10     2.5  
31.  HIARCS 15.0 64-BIT                          12     1.5  
Total games = 329

Next, CEGT:

Code:
Rebel 15.0NN x64 1CPU

LC0 0.28.2 DX12 Vega11 771721  3500  24,5  3305 [W.B.]
SlowChess Blitz 2.6NN x64      3435  34,0  3320 [W.B.]
Komodo 13.02 x64 1CPU          3344  42,0  3288 [W.B.]
Clover 3.1NN x64 1CPU          3300  49,5  3297 [W.B.]
Wasp 5.50NN x64 1CPU           3263  47,5  3246 [W.B.]
Velvet 3.3.0NN x64 1CPU        3203  60,5  3275 [W.B.]
Weiss 2.0 x64 1CPU             3175  58,0  3231 [W.B.]
Combusken 2.0.0NN x64 1CPU     3174  57,0  3222 [W.B.]
Black Marlin 5.0NN x64 1CPU    3155  69,0  3294 [W.B.]
Ginkgo 2.18 x64 1CPU           3152  63,5  3248 [W.B.]
Fritz 18NN x64 1CPU            3255  47,0  3234 [J.B.]
Tucano 10.00NN x64 1CPU        3282  52,5  3299 [J.B.]
Arasan 23.3NN x64 1CPU         3344  40,0  3274 [J.B.]
Hiarcs 15 x64 1CPU             3174  61,0  3252 [J.B.]
Booot 6.5 x64 1CPU             3224  52,0  3238 [J.B.]
Komodo 12.1.1 x64 1CPU         3335  42,0  3279 [J.B.]
Xiphos 0.6 x64 1CPU            3234  51,0  3241 [J.B.]
Ethereal 12.75 x64 1CPU        3324  42,0  3268 [J.B.]
Revenge 1.0NN x64 1CPU         3367  40,5  3300 [J.B.]
Rubichess 20220223NN x64 1CPU  3437  27,5  3269 [J.B.]
LCZero 0.28.2 771721 DNNL CPU  3214  57,0  3263 [J.B.]

CEGT takes a totally different path, a bit like I did for the GRL: you create an Elo pool based on the Elo gain predicted by the programmer, with a not-too-big Elo margin, certainly not 200-300 Elo as seen in the above example.
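
The three numeric columns in the CEGT fragment are not labelled. A minimal sketch, assuming they are the opponent's Elo, Rebel's score in percent (with a decimal comma), and the resulting performance rating, and assuming the standard logistic Elo formula (CEGT's actual tooling is not described in this thread):

```python
import math


def performance_rating(opponent_elo: float, score_pct: float) -> float:
    """Performance rating from one opponent's Elo and the score (in percent) against it.

    Uses the standard logistic Elo model; this is an assumption, not CEGT's stated method.
    """
    p = score_pct / 100.0
    if not 0.0 < p < 1.0:
        raise ValueError("score must be strictly between 0% and 100%")
    return opponent_elo + 400.0 * math.log10(p / (1.0 - p))


# Rows copied from the CEGT fragment: (opponent, opponent Elo, score %, listed value)
rows = [
    ("LC0 0.28.2 DX12", 3500, 24.5, 3305),
    ("SlowChess Blitz 2.6NN", 3435, 34.0, 3320),
    ("Black Marlin 5.0NN", 3155, 69.0, 3294),
]

for name, opp_elo, pct, listed in rows:
    computed = performance_rating(opp_elo, pct)
    print(f"{name:24s} computed {computed:7.1f}   listed {listed}")
```

For these three rows the computed value lands within one Elo point of the last column, which supports that reading of the columns. It also shows why a 200-300 Elo gap is awkward: a few half points more or less against a much stronger opponent move the performance figure a lot.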

Next, GRL, for instance the testing of Zahak 9.0:

Code:
No. Name            Win Draw Loss Unf.  Score Games       %
-----------------------------------------------------------
  1 Zahak 9.0      +821 =692 -487   *0 1167.0  2000   58.4%
  2 Weiss 2.0       +66  =79  -55   *0  105.5   200   52.8%
  3 Minic 3.17      +56  =84  -60   *0   98.0   200   49.0%
  4 Berserk 4.3.0   +65  =64  -71   *0   97.0   200   48.5%
  5 Stockfish 5     +57  =66  -77   *0   90.0   200   45.0%
  6 Clover 2.4      +48  =76  -76   *0   86.0   200   43.0%
  7 Seer 2.1.0      +55  =60  -85   *0   85.0   200   42.5%
  8 Beef 0.3.6      +41  =65  -94   *0   73.5   200   36.8%
  9 Wasp 4.50       +39  =64  -97   *0   71.0   200   35.5%
 10 Halogen 10      +24  =85  -91   *0   66.5   200   33.2%
 11 Toga III-03.12  +36  =49 -115   *0   60.5   200   30.2%

I want to ask Stefan if he is willing to contribute his way of testing and how he composes a pool of engines.

http://rebel13.nl/
pohl4711



Posts : 159
Join date : 2022-03-01
Location : Berlin

PostSubject: Re: Understanding rating lists (Sun May 08, 2022 12:36 pm)

SPCC

Normal way of SPCC testing: the engine plays 7 opponents (1000 games each, from 500 balanced HERT openings) that are not too much weaker or stronger than the engine being tested. The result should be in a range of 30%-70% if possible (vs. Stockfish, most opponents score below 30%...).
When a tested engine is an update, the older version of the engine (and its played games vs. other engines) is deleted in the following weeks. If that leaves any other engine in the rating list with fewer than 7000 played games, that engine must be an opponent for the updated engine in its test run, because each rated engine should have at least 7000 games. So, sometimes, an engine plays more than 7000 games...
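
The 30%-70% rule can be sketched as a filter on expected score. This is only an illustration under the logistic Elo model, where 30%-70% corresponds to a gap of roughly 147 Elo either way; the function names, the engine names and the ratings below are all hypothetical, and SPCC's real selection is done by hand as described above.

```python
def expected_score(own_elo: float, opp_elo: float) -> float:
    """Expected score (0..1) of the tested engine under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((opp_elo - own_elo) / 400.0))


def pick_opponents(own_elo, pool, lo=0.30, hi=0.70, count=7):
    """Return up to `count` (name, elo) pairs whose expected result lies in [lo, hi].

    Opponents closest in rating to the tested engine come first.
    """
    eligible = [
        (name, elo)
        for name, elo in pool.items()
        if lo <= expected_score(own_elo, elo) <= hi
    ]
    eligible.sort(key=lambda item: abs(item[1] - own_elo))
    return eligible[:count]


# Hypothetical pool: names and ratings are made up for the example.
pool = {
    "Engine A": 3000,
    "Engine B": 3100,
    "Engine C": 3200,  # expected score about 24%: too strong, excluded
    "Engine D": 2860,
    "Engine E": 2840,  # expected score about 72%: too weak, excluded
}

for name, elo in pick_opponents(3000, pool):
    print(f"{name}: {elo}  expected score {expected_score(3000, elo):.1%}")
```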

https://www.sp-cc.de
Dio




Posts : 222
Join date : 2021-08-28

PostSubject: Re: Understanding rating lists (Sun May 08, 2022 6:31 pm)

In CEGT, usually 50 per cent of the test opponents have a comparable playing strength to the engine to be tested. The other test opponents should be in the range of +/- 150 Elo. I try to ensure that the tested engine has a score of about 50 per cent at the end of the test. A first test usually includes at least 1000 test games; currently this value is significantly higher. In our experience, the Elo value is already quite stable after 1000 games and changes only slightly thereafter. The CEGT 40/4 list includes more than 3.500.000 games.
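
The remark that the Elo value is "quite stable after 1000 games" can be made concrete with a rough error-margin estimate. This is a sketch under simplifying assumptions: games are treated as independent, the whole test is treated as one match, and the win/draw/loss counts are hypothetical. Rating tools compute their error bars in their own way.

```python
import math


def elo_margin(wins: int, draws: int, losses: int, z: float = 1.96) -> float:
    """Approximate +/- Elo margin (about 95% for z=1.96) of a match result."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    std_err = math.sqrt(variance / n)
    # Slope of the logistic Elo curve at this score, in Elo per unit of score.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * std_err * slope


# Hypothetical 50% results with half of the games drawn.
for games in (100, 1000, 4000):
    quarter = games // 4
    margin = elo_margin(quarter, games - 2 * quarter, quarter)
    print(f"{games:5d} games: about +/- {margin:.1f} Elo")
```

The margin shrinks with the square root of the number of games, so going from 1000 to 4000 games only halves it. That fits the observation that the value changes only slightly after the first 1000 games.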

Admin


Posts : 2608
Join date : 2020-11-17
Location : Netherlands

PostSubject: Re: Understanding rating lists (Mon May 09, 2022 9:14 pm)

Thanks, guys, for your contributions. I can see that CEGT and SPCC have much in common with the way I composed the Elo pool for the GRL regarding the upper limit, and the only exception is CCRL. I guess in the end it does not matter much, judging by the individual TPRs. It seems that 6 draws against FF2 are enough to get a positive TPR of 87 Elo 😄

I assume CCRL does this kind of (at first sight) nonsense testing to ensure 1) that much stronger engines also deal effectively with much weaker engines and 2) that top-notch engines (like SF and Komodo) play enough games, because the gap to the rest of the pack is pretty big. Hopefully Graham has something to say about it.

http://rebel13.nl/
Wolfgang




Posts : 6
Join date : 2022-05-10

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 3:12 pm)

Dio wrote:
....The other test opponents should be in the range of +/- 150 Elo.

Just a short addition:
150 plus/minus would be great and we try to reach this. But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish.
So we normally increase the number of games from 100 to 200 per match and/or test vs. opponents running on 4 or even 8 threads.
.....
Quote :
The CEGT 40/4 list includes more than 3.500.000 games.

Not yet (3.15 million actually) but we work hard on it.... 😄. Therefore your contribution is much appreciated...😉

Best
Wolfgang

Admin


Posts : 2608
Join date : 2020-11-17
Location : Netherlands

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 4:40 pm)

I agree. I had the same problem every time I tested a new SF version: you want at least 2000 games, and the usual poor victims were Ethereal, RubiChess, Pedone, Nemorino etc. But in the end ORDO did the magic and the result was not much different from other rating lists.

http://rebel13.nl/
Uri Blass




Posts : 207
Join date : 2020-11-28

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 6:38 pm)

Wolfgang wrote:
Dio wrote:
....The other test opponents should be in the range of +/- 150 Elo.

Just a short addition:
150 plus/minus would be great and we try to reach this. But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish.
So we normally increase the number of games from 100 to 200 per match and/or test vs. opponents running on 4 or even 8 Threads
.....
Quote :
The CEGT 40/4 list includes more than 3.500.000 games.

Not yet (3.15 million actually) but we work hard on it.... 😄. Therefore your contribution is much appreciated...😉

Best
Wolfgang

At least for long time controls, plus/minus 150 is not impossible to achieve with the big guns.
Here is the top of the CCRL 40/15 list:

Code:
Rank   Name                               Rating    +    −   Score   Avg. Opp.   Draws   Games
1      Stockfish 13 64-bit 4CPU             3537  +17  −17   75.7%      −168.0   48.3%    1172
2      Dragon by Komodo 3 64-bit 4CPU       3531  +24  −23   62.0%       −67.8   75.5%     466
3      Fat Fritz 2 (in SF) 64-bit 4CPU      3516  +12  −12   65.6%       −93.8   66.5%    2053
4      Koivisto 8.0 64-bit 4CPU             3473  +19  −19   53.0%       −16.8   74.1%     722
5      Berserk 8.5.1 64-bit 4CPU            3465  +18  −18   55.5%       −34.9   74.9%     840
6      SlowChess Blitz 2.8 64-bit 4CPU      3462  +15  −15   53.8%       −20.8   81.1%    1061
7      Ethereal 13.50 64-bit 4CPU           3458  +17  −17   51.1%        −5.5   78.5%     898
8      Revenge 2.0 64-bit 4CPU              3450  +17  −17   49.9%        −0.9   77.0%     840
9-10   RubiChess 20220223 64-bit 4CPU       3441  +18  −17   51.1%        −7.7   68.9%     872
9-10   Seer 2.5.0 64-bit 4CPU               3441  +32  −32   50.8%        −6.0   72.8%     250
11     Arasan 23.3 64-bit 4CPU              3416  +19  −19   47.8%        +9.8   69.6%     740
12     Igel 3.0.5 64-bit 4CPU               3405  +12  −12   45.5%       +25.0   73.5%    1991
13-14  Houdini 6 64-bit 4CPU                3387   +8   −8   52.2%       −15.5   62.5%    4322
13-14  RofChade 2.321 64-bit                3387  +31  −31   50.0%        +1.0   74.8%     270
15     Nemorino 6.00 64-bit 4CPU            3386  +12  −12   45.2%       +25.5   65.5%    2153
Uri Blass




Posts : 207
Join date : 2020-11-28

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 6:52 pm)

I can add that I think the gap between the best and the rest of the field is getting smaller, and I guess we are close to seeing 100% draws in computer chess if we do not do something.

Maybe it would be better to have a separate list that is not ranked by Elo, where the target is the maximal number of wins against the leaders of the rating list and draws count the same as losses. An engine that gets 20% wins and 70% losses against the top 10 of the rating list would then be ranked higher than an engine that gets 10% wins and no losses.

TheSelfImprover



Posts : 3110
Join date : 2020-11-18

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 8:22 pm)

Uri Blass wrote:
I can add that I think that the gap between the best and the rest of the field get smaller and I guess we are close to see 100% draws in computer chess if we do not do something.

Maybe it is better to have a different list that is not ranked based on elo when the target is to get maximal number of wins against the leaders of the rating list when draws and losses are the same so an engine that can get 20% wins and 70% losses against the top 10 leaders of the rating list is going to be ranked higher than an engine  that get 10% wins and no losses.


Very good point! Unfortunately, I don't think it will work. We seem to be getting to a level where it's prohibitively difficult to get a win.

Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins! Try selling an investment tool which only profits on 20% of its recommendations, returns your stake on 10%, and loses your money on 70%!
Uri Blass




Posts : 207
Join date : 2020-11-28

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 8:56 pm)

TheSelfImprover wrote:
Uri Blass wrote:
I can add that I think that the gap between the best and the rest of the field get smaller and I guess we are close to see 100% draws in computer chess if we do not do something.

Maybe it is better to have a different list that is not ranked based on elo when the target is to get maximal number of wins against the leaders of the rating list when draws and losses are the same so an engine that can get 20% wins and 70% losses against the top 10 leaders of the rating list is going to be ranked higher than an engine  that get 10% wins and no losses.


Very good point! Unfortunately, I don't think it will work. We seem to be getting to a level where it's prohibitively difficult to get a win.

Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins! Try selling an investment tool which only profits on 20% of its recommendations, returns your stake on 10%, and loses your money on 70%!

For game analysis I prefer a tool that wins 20% even if it loses 70% over a tool that wins only 10% and draws 90%.
I do not suggest this list as a replacement for a rating list, but as an additional list: not a rating list but a winner list.
Chris Whittington




Posts : 1254
Join date : 2020-11-17
Location : France

PostSubject: Re: Understanding rating lists (Tue May 10, 2022 10:46 pm)

Uri Blass wrote:
I can add that I think that the gap between the best and the rest of the field get smaller and I guess we are close to see 100% draws in computer chess if we do not do something.

Maybe it is better to have a different list that is not ranked based on elo when the target is to get maximal number of wins against the leaders of the rating list when draws and losses are the same so an engine that can get 20% wins and 70% losses against the top 10 leaders of the rating list is going to be ranked higher than an engine  that get 10% wins and no losses.


W:L:D
20:70:10 = 25 pts
10:0:90 = 55 pts

Basically you want the engine to go all out for a win and otherwise not care: a draw is the same as a loss.
In extremis this means it might as well throw drawn endgames. In KRKB, for example, Black may as well just put his bishop en prise: White will take, because he gets an “unexpected” win, and Black loses but doesn’t care. Or KRKR: which side throws the rook, Black or White?
I’d guess that if you were to train an NN evaluator on W only, there are going to be some mighty weird move choices made by the trained net.
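
The two scoring schemes discussed here can be put side by side in a small sketch. The function names are made up for the illustration, and the records are the two W:L:D examples from this thread, taken per 100 games.

```python
def classical_points(wins: int, losses: int, draws: int) -> float:
    """Usual chess scoring: 1 per win, 0.5 per draw, 0 per loss."""
    return wins + 0.5 * draws


def wins_only_points(wins: int, losses: int, draws: int) -> float:
    """Proposed 'winner list' scoring: only wins count, a draw is worth as little as a loss."""
    return float(wins)


# (wins, losses, draws) per 100 games, as in the W:L:D lines above.
records = {
    "20:70:10": (20, 70, 10),
    "10:0:90": (10, 0, 90),
}

for label, (w, l, d) in records.items():
    print(
        f"{label:9s} classical = {classical_points(w, l, d):4.1f} pts   "
        f"wins only = {wins_only_points(w, l, d):4.1f} pts"
    )
```

Classical scoring gives 25 and 55 points, as in the post. Counting wins only gives 20 and 10, so the order of the two engines flips, which is exactly the trade-off being argued about.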
