Posts : 2608 Join date : 2020-11-17 Location : Netherlands
Subject: Understanding rating lists Sun May 08, 2022 8:11 am
I am not an expert at running a rating list. I have some experience running the GRL, but I can't stand in the shadow of those who have been doing it for years; from memory, both CEGT and CCRL started in 2005/06. So the below is no criticism at all, I just want to understand the different systems that are in use.
So Dio, Graham and Stefan, if you read this, I am interested in your comments.
This is a fragment of the ongoing testing of Rebel 15 for the CCRL 40/15 list. As we can see, Rebel has to play all the top engines, including multiple SF derivatives.
CEGT takes a totally different path, a bit like I did for the GRL: you create an elo pool based on the programmer's predicted elo gain, with a not-too-big elo margin, certainly not the 200-300 elo seen in the above example.
I want to ask Stefan if he is willing to describe his way of testing and how he composes a pool of engines.
adminx and Dio like this post
pohl4711
Posts : 159 Join date : 2022-03-01 Location : Berlin
Subject: Re: Understanding rating lists Sun May 08, 2022 12:36 pm
SPCC
Normal way of SPCC testing: the new engine plays 7 opponents (1000 games each, from 500 balanced HERT openings) which are not too much weaker or stronger than the engine being tested; the result should be in a range of 30%-70% if possible (vs. Stockfish, most opponents score below 30%...). When a tested engine is an update, the older version of the engine (and its played games vs. other engines) is deleted in the following weeks. If that leaves any other engine in the rating list with fewer than 7000 played games, that engine must be an opponent in the updated engine's test run, because each rated engine should have at least 7000 games. So, sometimes, an engine plays more than 7000 games...
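As an aside, the 30%-70% window translates directly into an Elo band via the logistic expectation curve. A minimal sketch (my own illustration, not Stefan's actual tooling; engine names and ratings are made up) of filtering a pool for opponents inside that window:
Code :
import math

def expected_score(test_elo, opp_elo):
    """Logistic Elo model: expected score of the tested engine."""
    return 1.0 / (1.0 + 10 ** ((opp_elo - test_elo) / 400.0))

def pick_opponents(test_elo, pool, lo=0.30, hi=0.70):
    """Keep pool entries whose expected score vs. the tested engine
    falls inside the 30%-70% window."""
    return [name for name, elo in pool.items()
            if lo <= expected_score(test_elo, elo) <= hi]

# Hypothetical engines and ratings, for illustration only
pool = {"EngineA": 3550, "EngineB": 3470, "EngineC": 3390, "EngineD": 3240}
print(pick_opponents(3400, pool))  # ['EngineB', 'EngineC']
# A 30%-70% score window corresponds to roughly +/-147 Elo, so
# EngineA (+150) and EngineD (-160) fall just outside it.
Incidentally, that ~147 Elo band is close to the +/-150 Elo rule CEGT describes below.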
Admin and Dio like this post
Dio
Posts : 222 Join date : 2021-08-28
Subject: Re: Understanding rating lists Sun May 08, 2022 6:31 pm
In CEGT, usually 50 per cent of the test opponents have a comparable playing strength to the engine to be tested. The other test opponents should be in the range of +/- 150 Elo. I try to ensure that the tested engine has a score of about 50 per cent at the end of the test. A first test usually includes at least 1000 test games, currently this value is significantly higher. In our experience, the Elo value is already quite stable after 1000 games and changes only slightly thereafter. The CEGT 40/4 list includes more than 3.500.000 games.
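For what it's worth, the "1000 games" observation lines up with the statistics. A minimal back-of-the-envelope sketch (my own, not CEGT's actual tooling) of the error margin of a measured Elo difference after n games:
Code :
import math

def elo_from_score(s):
    """Convert a score fraction (0 < s < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_interval(wins, losses, draws, z=1.96):
    """Approximate 95% confidence interval of the measured Elo difference."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0)
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    margin = z * math.sqrt(var / n)
    return elo_from_score(s - margin), elo_from_score(s + margin)

# 1000 games at a 50% score with a 60% draw rate:
lo, hi = elo_interval(wins=200, losses=200, draws=600)
print(f"{lo:+.1f} .. {hi:+.1f} Elo")  # about -13.6 .. +13.6
The margin shrinks only with the square root of the game count, so going from 1000 to 4000 games merely halves it, which matches the observation that the value changes only slightly after 1000 games.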
Admin likes this post
Admin Admin
Posts : 2608 Join date : 2020-11-17 Location : Netherlands
Thanks guys for the contributions. I can see that CEGT and SPCC have much in common with the way I composed the elo pool for the GRL regarding the upper limit; the only exception is CCRL. I guess in the end it does not matter much, judging from the individual TPRs. It seems that 6 draws against FF2 are enough to get a positive TPR of 87 elo.
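That TPR is simply a property of performance ratings: an all-draw result pins the TPR to the opponent's average rating. A minimal sketch (my reading of the quoted number, assuming FF2 sits about 87 elo above Rebel on the list):
Code :
import math

def tpr(avg_opponent_elo, points, games):
    """Tournament performance rating: the rating at which the achieved
    score would be the expected score against the average opponent."""
    s = points / games
    return avg_opponent_elo - 400.0 * math.log10(1.0 / s - 1.0)

# Six draws = 3/6 = 50%, against an opponent rated +87
# relative to the tested engine:
print(tpr(avg_opponent_elo=87, points=3, games=6))  # 87.0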
I assume CCRL does this kind of -- at first sight -- nonsense testing to ensure that 1) much stronger engines also effectively deal with much weaker engines, and 2) top-notch engines (like SF and Komodo) play enough games, because the difference with the rest of the pack is pretty big. Hopefully Graham has something to say about it.
....The other test opponents should be in the range of +/- 150 Elo.
Just a short addition: 150 plus/minus would be great and we try to reach this. But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish. So we normally increase the number of games from 100 to 200 per match and/or test vs. opponents running on 4 or even 8 Threads .....
Quote :
The CEGT 40/4 list includes more than 3.500.000 games.
Not yet (3.15 million actually) but we are working hard on it... so your contribution is much appreciated...
Best Wolfgang
Admin, Dio and Wolfgang like this post
Admin Admin
Posts : 2608 Join date : 2020-11-17 Location : Netherlands
I agree, I had the same problem every time I tested a new SF version: you wanted at least 2000 games and the usual poor victims were Ethereal, RubiChess, Pedone, Nemorino, etc. But in the end ORDO did the magic and the result was not much different from other rating lists.
Quote :
....But especially for the "Big Guns" (Stockfish, Komodo and LCZero) this is nearly impossible to achieve. Too few opponents, at least for Stockfish.
At least at long time control, 150 plus/minus is not impossible to achieve with the big guns. Here is the top of the CCRL 40/15 list:
I can add that I think the gap between the best and the rest of the field is getting smaller, and I guess we are close to seeing 100% draws in computer chess if we do not do something.
Maybe it is better to have a different list that is not ranked based on elo, where the target is to get the maximal number of wins against the leaders of the rating list and draws and losses count the same. So an engine that gets 20% wins and 70% losses against the top 10 of the rating list would be ranked higher than an engine that gets 10% wins and no losses.
Quote :
....an engine that gets 20% wins and 70% losses against the top 10 of the rating list would be ranked higher than an engine that gets 10% wins and no losses.
Very good point! Unfortunately, I don't think it will work. We seem to be getting to a level where it's prohibitively difficult to get a win.
Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins! Try selling an investment tool which only profits on 20% of its recommendations, returns your stake on 10%, and loses your money on 70%!
Quote :
....Also, a rating list made this way would probably not be popular: you won't often see people picking something from a rating list that loses more often than it wins!
For analysis of games I prefer a tool that wins 20% even if it loses 70%, over a tool that wins only 10% and draws 90%. I do not suggest this list instead of a rating list, but as an additional list that is not a rating list but a winner list.
Chris Whittington
Posts : 1254 Join date : 2020-11-17 Location : France
Quote :
....Maybe it is better to have a different list that is not ranked based on elo, where the target is to get the maximal number of wins against the leaders of the rating list and draws and losses count the same.
W:L:D 20:70:10 = 20 + 10/2 = 25 pts; W:L:D 10:0:90 = 10 + 90/2 = 55 pts.
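To make the two orderings concrete, a minimal sketch (my own illustration; no existing list works this way) of standard scoring versus the proposed wins-only "winner list":
Code :
def standard_points(wins, losses, draws):
    return wins + 0.5 * draws   # a draw is worth half a point

def winner_list_points(wins, losses, draws):
    return wins                 # draws and losses both count as zero

for w, l, d in [(20, 70, 10), (10, 0, 90)]:
    print(f"W:L:D {w}:{l}:{d} -> standard {standard_points(w, l, d):g} pts, "
          f"winner list {winner_list_points(w, l, d)} pts")
# Standard scoring ranks 10:0:90 (55 pts) above 20:70:10 (25 pts);
# the winner list reverses the order (20 wins beat 10 wins).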
Basically you want the engine to go all out for a win, and otherwise not care; a draw is the same as a loss. In extremis this means it might as well throw drawn endgames. For example KRKB: Black may as well just put his bishop en prise. White will take, because he gets an “unexpected” win, and Black loses but doesn't care. Or KRKR: which side throws the rook, Black or White? I'd guess that if you were to train a NN evaluator on wins only, there are going to be some mighty weird move choices made by the trained net.