In the case of a learnable codebook, it seems to me that the encoder outputs are not completely detached before the codebook loss is computed: they are still connected to the loss indirectly, via the distance matrix and the quantized vectors. The encoder therefore still accumulates gradients (even though the optimizer in use does not apply them). Shouldn't the encoder outputs also be detached before they are used to compute the distances?
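To make the concern concrete, here is a minimal sketch (assuming a PyTorch-style VQ layer; the names `z_e`, `codebook`, and `beta`-style losses are illustrative, not taken from any particular implementation). The distance matrix is computed from the non-detached encoder outputs, while the codebook loss uses `z_e.detach()`, so one can inspect directly which tensors end up with gradients after `backward()`:

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs and learnable codebook (names are illustrative).
z_e = torch.randn(8, 4, requires_grad=True)        # (N, D) encoder outputs
codebook = torch.nn.Parameter(torch.randn(16, 4))  # (K, D) learnable codes

# Distance matrix between encoder outputs and codebook entries.
# Note: z_e is NOT detached here, as in the setup the question describes.
d = torch.cdist(z_e, codebook)                     # (N, K)
idx = d.argmin(dim=1)                              # nearest-code indices
z_q = codebook[idx]                                # quantized vectors

# Codebook loss: pull selected codes toward the (detached) encoder outputs.
codebook_loss = (z_q - z_e.detach()).pow(2).mean()
codebook_loss.backward()

# Inspect where gradients actually landed. The argmin produces integer
# indices, so the backward pass does not traverse the distance matrix.
print("z_e.grad:", z_e.grad)
print("codebook has grad:", codebook.grad is not None)
```

Running this shows that, in this particular formulation, the codebook receives gradients while `z_e.grad` remains unset, because `argmin` is non-differentiable and the distance matrix `d` is only used to select indices, not as an input to the loss itself. Whether that holds for the implementation the question refers to depends on whether its quantized vectors are produced by hard index selection or by some differentiable function of the distances.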