how to add floats?

Posts : 46 Join date : 2020-12-01

I'm seeing problematic behaviour in my small neural network engine. After hitting depth 21 or so, scores get badly out of sync: the engine (https://github.com/nescitus/LizardBrain) shows scores like -100, then -300, all the while proposing reasonable moves and a reasonable principal variation. Resetting the accumulator every million nodes or so stops it. I suspect a rounding error from adding floats. Is there a known technique to avoid this problem?

Subject: Re: how to add floats? Fri May 17, 2024 11:32 pm

It's best to use SIMD integer arithmetic instead of floats.

https://en.algorithmica.org/hpc/simd/intrinsics/
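
For what it's worth, here is a minimal sketch of why an integer accumulator sidesteps the drift. Float addition rounds after every step, so adding a weight and later subtracting it does not always restore the old value, while integer addition is exact. The scale factor kScale and the helper names are hypothetical, not taken from LizardBrain, and the magnitudes are exaggerated so the effect shows up in three operations.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical fixed-point scale; an assumption, not the engine's value.
constexpr int kScale = 64;

// Convert a float weight to a fixed-point integer (round to nearest).
inline std::int32_t Quantize(float w) {
    return static_cast<std::int32_t>(std::lround(w * kScale));
}

// Add `small`, then add and remove `big`, in a float accumulator.
// volatile forces the value to be rounded to float after every step.
inline float FloatRoundTrip(float small, float big) {
    volatile float acc = 0.0f;
    acc = acc + small;
    acc = acc + big;   // low bits of `small` can be rounded away here
    acc = acc - big;   // ...and subtracting does not bring them back
    return acc;
}

// The same sequence in a quantized integer accumulator.
inline std::int32_t IntRoundTrip(float small, float big) {
    std::int32_t acc = 0;
    acc += Quantize(small);
    acc += Quantize(big);
    acc -= Quantize(big);  // exact inverse of the previous line
    return acc;
}
```

With small = 0.1f and big = 16777216.0f (2^24), the float version returns 0.0f because 0.1 is lost when added to 2^24, while the integer version returns Quantize(0.1f), which is 6 at this scale. In a real search the errors per update are far smaller, but they have millions of updates in which to pile up.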

Posts : 46 Join date : 2020-12-01

OK, I now have something like this. The speed gain is negligible, about half a percent, and I think the way quantized_vals is gathered is at fault. Any hints on how to improve it?

Code::
const int hiddenLayerSize = 32;
alignas(32) int quantized[hiddenLayerSize][768]; // in the Network object that holds all the weights
alignas(32) int hidden[hiddenLayerSize];         // in the cAccumulator class

void cAccumulator::Add(int cl, int pc, int sq) {

    // get piece-on-square index
    int idx = Idx(cl, pc, sq);

    // process 8 elements at a time using AVX2 (256-bit)
    for (int i = 0; i < hiddenLayerSize; i += 8) {

        // load 8 integers from the hidden and quantized arrays
        __m256i hidden_vals = _mm256_load_si256((__m256i*)&hidden[i]);
        __m256i quantized_vals = _mm256_set_epi32(
            Network.quantized[i + 7][idx], Network.quantized[i + 6][idx],
            Network.quantized[i + 5][idx], Network.quantized[i + 4][idx],
            Network.quantized[i + 3][idx], Network.quantized[i + 2][idx],
            Network.quantized[i + 1][idx], Network.quantized[i][idx]
        );

        // add the hidden and quantized values together
        __m256i result = _mm256_add_epi32(hidden_vals, quantized_vals);

        // store the result back into the hidden array
        _mm256_store_si256((__m256i*)&hidden[i], result);
    }
}

Posts : 46 Join date : 2020-12-01

Come to think of it, I could flatten Network.quantized so that the indices are contiguous.
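
A minimal sketch of that flattening step, assuming the 32 x 768 layout from the earlier post. The function name Flatten and the constants kHidden and kInputs are made up for illustration. The idea is to transpose the weights once at load time so that the 32 values belonging to one piece-on-square index sit next to each other in memory.

```cpp
#include <cstddef>
#include <vector>

constexpr int kHidden = 32;   // assumed equal to hiddenLayerSize
constexpr int kInputs = 768;  // piece-on-square inputs

// Transpose weights[kHidden][kInputs] into a flat, input-major array:
// flat[idx * kHidden + i] == weights[i][idx].
inline std::vector<int> Flatten(const int (&weights)[kHidden][kInputs]) {
    std::vector<int> flat(static_cast<std::size_t>(kHidden) * kInputs);
    for (int idx = 0; idx < kInputs; ++idx) {
        for (int i = 0; i < kHidden; ++i) {
            flat[static_cast<std::size_t>(idx) * kHidden + i] = weights[i][idx];
        }
    }
    return flat;
}
```

One caveat: std::vector does not guarantee 32-byte alignment, so with this container the SIMD load would have to be _mm256_loadu_si256, or the storage would need to be an alignas(32) array instead.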

Posts : 46 Join date : 2020-12-01

Flattening does indeed work: I went from 12.5 seconds on the benchmark (naive implementation) down to 10.1 seconds (Add, Del and Move functions using the flattened array). Any other ideas for optimization?

Code::
void cAccumulator::Add(int cl, int pc, int sq) {

    // get piece-on-square index
    int idx = Idx(cl, pc, sq);

    // process 8 elements at a time using AVX2 (256-bit)
    for (int i = 0; i < hiddenLayerSize; i += 8) {

        // load 8 integers from the hidden and quantized arrays
        __m256i hidden_vals = _mm256_load_si256((__m256i*)&hidden[i]);
        __m256i quantized_vals = _mm256_load_si256((__m256i*)&Network.flat_quantized[idx * hiddenLayerSize + i]);

        // add the hidden and quantized values together
        __m256i result = _mm256_add_epi32(hidden_vals, quantized_vals);

        // store the result back into the hidden array
        _mm256_store_si256((__m256i*)&hidden[i], result);
    }
}
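
One thing that may help when tuning further is a plain scalar reference for Add, Del and Move to compare the SIMD versions against. The sketch below is hypothetical and not the engine's code: RefAccumulator, kHidden and kInputs are invented names, and it takes an already computed piece-on-square index rather than (cl, pc, sq). Its Move applies the removed and added weights in a single pass over the hidden array, which is one possible way to avoid reading and writing the accumulator twice; whether that is faster in practice would need measuring.

```cpp
#include <array>
#include <vector>

constexpr int kHidden = 32;   // assumed equal to hiddenLayerSize
constexpr int kInputs = 768;  // piece-on-square inputs

// Scalar reference accumulator over input-major weights
// (flat[idx * kHidden + i]), intended only for checking SIMD code.
struct RefAccumulator {
    std::array<int, kHidden> hidden{};
    const std::vector<int>* flat = nullptr;  // size kInputs * kHidden

    // Add the weights of one piece-on-square index.
    void Add(int idx) {
        const int* w = flat->data() + idx * kHidden;
        for (int i = 0; i < kHidden; ++i) hidden[i] += w[i];
    }

    // Remove the weights of one piece-on-square index.
    void Del(int idx) {
        const int* w = flat->data() + idx * kHidden;
        for (int i = 0; i < kHidden; ++i) hidden[i] -= w[i];
    }

    // Remove `from` and add `to` in one pass over the hidden array.
    void Move(int from, int to) {
        const int* wf = flat->data() + from * kHidden;
        const int* wt = flat->data() + to * kHidden;
        for (int i = 0; i < kHidden; ++i) hidden[i] += wt[i] - wf[i];
    }
};
```

Because the arithmetic is integer, Move(from, to) has to give exactly the same result as Del(from) followed by Add(to), and undoing every update has to return the accumulator to all zeros. Both properties make cheap sanity checks for an intrinsics version.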