I am using agentlightning 0.3.0 to train an agent with RL. The node has 250 GB of memory in total, and memory usage keeps increasing during training until the program shuts down at around 220 GB. I have attached my logs for analysis, including stdout, stderr, the config, TensorBoard data, and Ray's session_latest, among others.
debug_log.tar.gz
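
In case it helps narrow down which process is growing, here is a minimal sketch of how per-process resident memory could be sampled on Linux by reading `/proc/<pid>/status`. The helper names `parse_vmrss_kb` and `read_rss_kb` are hypothetical and are not part of agentlightning or Ray; this only assumes a Linux node with `/proc` mounted.

```python
import os
import re
from typing import Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the resident memory of a process on Linux; None if unavailable."""
    try:
        with open(f"/proc/{pid}/status", "r") as f:
            return parse_vmrss_kb(f.read())
    except OSError:
        # Process does not exist, /proc is not available, or access is denied.
        return None


if __name__ == "__main__":
    # Hypothetical usage: sample the current process; in practice the PIDs of
    # the trainer and worker processes would be passed in instead.
    pid = os.getpid()
    print(f"pid={pid} rss_kb={read_rss_kb(pid)}")
```

Calling `read_rss_kb` periodically for each training-related PID and logging the result would show whether the growth is concentrated in one process or spread across workers.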