
RL Experiments


V-learning

The idea is to train a model that predicts the value V(s) of any game field, and then use it to choose the best action at state s_t by computing V(s_{t+1}) for every possible successor state s_{t+1}.
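As a rough illustration, here is a minimal sketch of this action-selection rule; `value_fn`, `legal_moves(s)`, and `apply_move(s, a)` are hypothetical helpers, not names from this project.

```python
# Sketch of greedy action selection with a learned V.
# Hypothetical helpers: legal_moves(s) lists the available moves,
# apply_move(s, a) returns the successor state s_{t+1}.
def choose_action(state, value_fn, legal_moves, apply_move):
    """Pick the move whose successor state has the highest predicted value."""
    return max(legal_moves(state),
               key=lambda a: value_fn(apply_move(state, a)))
```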

Linear model with handcrafted features

Let's try to approximate V(s) with linear regression:
V(s) = w_1 * f_1(s) + w_2 * f_2(s) + ... + w_k * f_k(s),
where f_i(s) is some fixed feature of game state s.

In my experiments I used binary features: the pure cell positions, plus 2x1 and 2x2 convolutions with fixed kernels that output 1 for a specific configuration of cells. In total, ~2350 features.
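A minimal sketch of such a linear approximator, assuming a stand-in feature extractor (the board size, cell encoding, and one-hot features below are assumptions for illustration, not the actual ~2350 features used here):

```python
import numpy as np

# Stand-in feature extractor: one binary indicator per (cell, value) pair.
# The real features also include 2x1 and 2x2 cell-pattern convolutions.
N_CELLS = 64          # hypothetical board size
N_CELL_VALUES = 3     # hypothetical number of distinct cell states

def features(state):
    """Binary feature vector f(s); state is a flat iterable of cell values."""
    f = np.zeros(N_CELLS * N_CELL_VALUES)
    for i, v in enumerate(state):
        f[i * N_CELL_VALUES + v] = 1.0
    return f

w = np.zeros(N_CELLS * N_CELL_VALUES)   # weights w_1 .. w_k

def V(state):
    """V(s) = w_1 * f_1(s) + ... + w_k * f_k(s)."""
    return float(np.dot(w, features(state)))
```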

Temporal difference updates

To learn the linear approximation I sampled self-play games with the existing value approximator V and updated the weights of V to minimize the difference between V(s_t) and V(s_{t+1}), with V(end_game_state) fixed to +/- 1. To speed up learning I first trained the net on endgame positions only, to predict +/- 1, and only then sampled games with V.
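A rough sketch of this kind of temporal-difference update for the linear model, assuming a feature extractor like the one sketched above and a learning rate `alpha` (both assumptions, not values from the experiments):

```python
import numpy as np

# TD-style update for the linear V: nudge V(s_t) toward V(s_{t+1}),
# or toward the fixed +/- 1 target when s_{t+1} ends the game.
def td_update(w, s_t, s_next, features, alpha=0.01, terminal_value=None):
    """Return updated weights after one TD step on the pair (s_t, s_{t+1})."""
    f_t = features(s_t)
    if terminal_value is not None:            # end-of-game target: +1 or -1
        target = terminal_value
    else:
        target = float(np.dot(w, features(s_next)))
    error = target - float(np.dot(w, f_t))    # TD error
    return w + alpha * error * f_t            # gradient step on the squared TD error
```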

I trained the net for 400 epochs with SGD before resampling games with the updated weights. To sample games I used an e-greedy policy with e = 0.1.
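A minimal sketch of the e-greedy sampling step, reusing the hypothetical helpers from the earlier sketches (`legal_moves`, `apply_move`, `choose_action`); `value_fn` is the current linear V.

```python
import random

EPSILON = 0.1   # exploration rate used when sampling games

def e_greedy_move(state, value_fn, legal_moves, apply_move):
    """With probability e pick a random legal move, otherwise the greedy one."""
    if random.random() < EPSILON:
        return random.choice(legal_moves(state))                    # explore
    return choose_action(state, value_fn, legal_moves, apply_move)  # exploit
```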

Training converged in some sense, but the net did not learn meaningful weights or play better than a simple cell-counting strategy.
