My idea is simply to take a fixed list of deterministic engines and run a tournament in which everybody plays against everybody.
The complexity of the position can then be measured by the rating gap between the player in place 14 out of 56 and the player in place 42 out of 56.
I got the following table from a tournament run with Banksia GUI (rating first, points in parentheses):
1) Stockfish 16.1, 65536 nodes per move: 679 (105.5 points out of 110)
2) Stockfish 16.1, 32768 nodes per move: 501 (96.5)
3) Obsidian 11.0, 65536 nodes per move: 472 (93.5)
4) Caissa 1.17, 65536 nodes per move: 455 (93)
5) RubiChess 2024, 65536 nodes per move: 449 (92.5)
6) Berserk 12, 65536 nodes per move: 445 (92)
7) Clover 6.1, 65536 nodes per move: 427 (90.5)
8-9) Alexandria 6.1, 65536 nodes per move: 410 (89)
8-9) Seer 2.8, 65536 nodes per move: 414 (89)
Note that I do not understand why the ratings differ when the engines scored the same number of points.
14) Seer 2.8, 32768 nodes per move: 362 (84)
41-42) Alexandria 6.1, 2048 nodes per move: -301 (28.5)
41-42) Caissa 1.17, 2048 nodes per move: -310 (28.5)
The first problem is that the ratings are not reliable. But assuming I solve this problem with a better GUI, I can take the rating difference between place 14 and place 42 as a complexity score for any chess position.
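To make the proposed score concrete, here is a minimal Python sketch of the gap calculation. The function name `complexity_score` and the placeholder rating list are my own illustration, not output from any GUI. Only three rows of the real table are used (place 14 and the shared place 41-42), and treating the lower of the two tied ratings as place 42 is an assumption.

```python
def complexity_score(ratings, upper_place=14, lower_place=42):
    """Rating gap between two 1-indexed places in the final standings."""
    ordered = sorted(ratings, reverse=True)  # best rating first
    if lower_place > len(ordered):
        raise ValueError("not enough players for the requested places")
    return ordered[upper_place - 1] - ordered[lower_place - 1]


# Placeholder ratings for 56 players (550, 540, ..., 0). NOT real results,
# only here to show the function running on a full standings list.
placeholder_ratings = list(range(550, -10, -10))
placeholder_gap = complexity_score(placeholder_ratings)

# From the table: place 14 is rated 362 and the shared place 41-42 is
# rated -301 / -310. Assumption: the lower rating counts as place 42.
table_gap = 362 - (-310)
print(placeholder_gap, table_gap)
```

With the table values this gives a gap of 672 for the tournament above (663 if the other tied engine is taken as place 42).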
Note that the tournament had 3080 games, but they did not take long and I finished all of them in less than 2 hours. So it is possible to get a complexity score for any position in less than 2 hours, and it may be interesting to find positions with a bigger gap than the opening position.
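The game count can be checked with a little arithmetic. The sketch below assumes 56 players and two games per pairing (one with each color), which is what 3080 games and "points out of 110" imply. The function name is just for illustration.

```python
def round_robin_games(n_players, games_per_pairing=2):
    """Total games when every player meets every other player."""
    pairings = n_players * (n_players - 1) // 2
    return pairings * games_per_pairing


total_games = round_robin_games(56)   # 1540 pairings * 2 games
games_per_player = (56 - 1) * 2       # matches "points out of 110"
# Upper bound on average wall-clock time per game if all games fit in 2 hours.
seconds_per_game = 2 * 3600 / total_games
print(total_games, games_per_player, round(seconds_per_game, 2))
```

So 2 hours for the whole tournament means at most about 2.3 seconds of wall-clock time per game on average.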
I am not sure biased positions are best for complexity. I am afraid that with biased positions you may get fewer wins for the weaker side, and as a result a smaller gap between engines, because 1.5-0.5 is a smaller gap than 2-0.
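To put a number on "1.5-0.5 is a smaller gap than 2-0", here is a small sketch that converts a score fraction into a rating difference. It assumes the standard logistic Elo formula. I do not know which rating model Banksia GUI uses, so the numbers are only indicative.

```python
import math


def elo_gap(score_fraction):
    """Rating difference implied by a score fraction (logistic Elo model)."""
    if score_fraction <= 0.0:
        return -math.inf
    if score_fraction >= 1.0:
        return math.inf  # a 100% score has no finite rating difference
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


gap_75 = elo_gap(1.5 / 2)   # stronger engine scores 1.5-0.5
gap_100 = elo_gap(2.0 / 2)  # stronger engine scores 2-0
print(round(gap_75, 1), gap_100)
```

A 75% score corresponds to roughly 191 Elo under this model, while a clean 2-0 is unbounded, so anything that turns wins into draws pulls the measured gap down.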