Reinforcement Learning
In this project, we implemented a Proximal Policy Optimization (PPO) agent that plays the first level of Sonic the Hedgehog. To create the environment, we used the Gymnasium library, along with preprocessing wrappers from stable-baselines3 to support the actor-critic neural network feedback loop, reduce feature dimensionality, and allow for concurrent agent training. Ultimately, we achieved a best completion rate of 78% overall, and 83% when ignoring the first 20% of training runs. We engineered a three-pass method in which each pass used a slightly different reward function, but ultimately recognized that the focus should not be on hand-coding human ingenuity, but on enabling the learning process itself. Our project implementation and presentation received high remarks from the professor.
Machine Learning Engineer
Darroll Saddi
Ryan Li
Mar 2024 - Jun 2024
Gymnasium
Python
PyTorch
PPO is an actor-critic algorithm, which means that it uses two neural networks to approximate the policy and value functions. In actor-critic models, the actor decides which action the agent should take by executing the policy, while the critic returns a value representing how good that action was in the state where it was taken.
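The snippet below is a minimal PyTorch sketch of this interplay; the dimensions and single-layer networks are illustrative placeholders, not the networks used in this project.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 1024, 8            # placeholder sizes for illustration
actor = nn.Linear(obs_dim, n_actions)   # policy: maps a state to action logits
critic = nn.Linear(obs_dim, 1)          # value function: maps a state to a scalar score

state = torch.randn(1, obs_dim)         # stand-in for an encoded game frame
dist = Categorical(logits=actor(state))
action = dist.sample()                  # the actor chooses the action
value = critic(state)                   # the critic estimates how good the state is
log_prob = dist.log_prob(action)        # saved for the probability ratio used later
```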
The Surrogate Objective Function's main purpose is to keep policy updates within a trust region (to ensure stable training) by clipping the probability ratio to the range [1 - ϵ, 1 + ϵ], where ϵ is the hyperparameter that determines the allowed deviation between the new and old policy. The Probability Ratio is how the algorithm compares the current policy, πθ, to the previous policy, πθ old: by dividing the probability that the current policy chooses action a_t in some state s_t by the probability that the old policy chooses the same action in that state, we can quantify how much the update has shifted the policy toward or away from that action. A resultant value greater than 1 indicates that the current policy is more likely to take a_t in s_t than the old policy was; conversely, a value less than 1 means it is less likely. Generalized Advantage Estimation (GAE) is used to calculate the advantage, a measure of how much better (or worse) an action is in comparison to the expected return under the current policy, AKA the average action. At a high level, GAE significantly reduces the variance of these advantage estimates while keeping their bias at a tolerable level, which prevents policy updates from swinging too far in any one direction.
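As a concrete reference, here is a sketch of the clipped surrogate objective and a GAE computation matching the description above; the function names and default hyperparameters (ϵ = 0.2, γ = 0.99, λ = 0.95) are common choices rather than necessarily the ones we used.

```python
import torch

def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective (maximized; negate to use as a loss)."""
    ratio = torch.exp(new_log_probs - old_log_probs)               # probability ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                    # keeps updates inside the trust region

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one collected trajectory."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae            # exponentially weighted sum of TD errors
        advantages[t] = gae
    return advantages
```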
To preprocess game frames, we passed them through the Frame Stacker, Frame Observer, Action Mapper, and Frame Skipper.
The Frame Stacker was used so the agent could recognize a time dimension across consecutive frames. The Frame Observer was used to feed game snapshots into a Convolutional Neural Network (CNN), reducing feature dimensionality and thereby speeding up training. The Action Mapper was used to create a discrete one-hot encoded action space. The Frame Skipper was used to periodically skip frames, allowing for faster training and less overfitting.
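The sketches below show two of these wrappers (frame skipping and action mapping) in Gymnasium style; the skip count, button combinations, and the `buttons` attribute are assumptions for illustration rather than our exact implementation.

```python
import gymnasium as gym
import numpy as np

class FrameSkip(gym.Wrapper):
    """Repeat each chosen action for `skip` frames and accumulate the reward."""
    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self._skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info

class ActionMapper(gym.ActionWrapper):
    """Expose a small discrete action set over the console's multi-button space."""
    def __init__(self, env, combos):
        super().__init__(env)
        self._combos = combos                          # e.g. [["RIGHT"], ["RIGHT", "B"], ["B"]]
        self.action_space = gym.spaces.Discrete(len(combos))

    def action(self, act):
        buttons = self.env.unwrapped.buttons           # assumes a retro-style env exposing `buttons`
        pressed = np.zeros(len(buttons), dtype=np.uint8)
        for button in self._combos[act]:
            pressed[buttons.index(button)] = 1
        return pressed
```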
The preprocessed frames were passed to a Convolutional Neural Network to extract features. These features were passed to the fully connected layers, which output a probability distribution over the actions and an estimated value of the current state. The stochastic nature of the model outputs helped to balance the exploration-exploitation tradeoff.
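The model below is a sketch of this architecture; the channel counts, layer sizes, and 84x84 input assumption are illustrative rather than our exact configuration.

```python
import torch.nn as nn

class ActorCriticCNN(nn.Module):
    def __init__(self, n_stacked_frames=4, n_actions=8):
        super().__init__()
        # Convolutional feature extractor over the stacked frames
        self.features = nn.Sequential(
            nn.Conv2d(n_stacked_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 3136 = 64 * 7 * 7 for 84x84 inputs with the strides above
        self.fc = nn.Sequential(nn.Linear(3136, 512), nn.ReLU())
        self.policy_head = nn.Linear(512, n_actions)   # logits for the action distribution
        self.value_head = nn.Linear(512, 1)            # estimated value of the current state

    def forward(self, frames):
        x = self.fc(self.features(frames))
        return self.policy_head(x), self.value_head(x)
```

Sampling actions from a Categorical distribution over the policy logits, rather than always taking the argmax, is what produces that stochastic behavior.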
Due to our initial hyperparameters, the agent got stuck on the loop obstacle. This meant that it was entering an exploitation phase too early, so to encourage more initial exploration, we increased the entropy coefficient and reduced the learning rate.
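Roughly, the adjustment looked like the configuration below, written in the style of stable-baselines3's PPO arguments; the specific numbers are placeholders rather than our tuned values.

```python
# Hypothetical before/after hyperparameters (placeholder values)
initial_config = {"learning_rate": 3e-4, "ent_coef": 0.01}
tuned_config = {
    "learning_rate": 1e-4,   # lower learning rate: smaller, more cautious policy updates
    "ent_coef": 0.05,        # higher entropy bonus: keeps the policy exploring longer
}
```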
Training with a reward function that did not punish lack of progress yielded an initial winrate of 43%.
To further control the exploration-exploitation tradeoff, we used a three-pass method. The first pass did not punish lack of progress, the second did punish lack of progress, and the third added additional reward for faster completion times. This three-pass method yielded a 78% winrate.
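A sketch of the three passes as a single reward wrapper is shown below; the `x` and `level_complete` info keys and the penalty/bonus magnitudes are assumptions for illustration, not our exact reward shaping.

```python
import gymnasium as gym

class ShapedReward(gym.Wrapper):
    """Pass 1: neither flag set. Pass 2: punish_stalling=True. Pass 3: both flags True."""
    def __init__(self, env, punish_stalling=False, reward_speed=False):
        super().__init__(env)
        self.punish_stalling = punish_stalling
        self.reward_speed = reward_speed
        self.max_x = 0
        self.steps = 0

    def reset(self, **kwargs):
        self.max_x, self.steps = 0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.steps += 1
        x = info.get("x", 0)                          # horizontal progress (assumed info key)
        if self.punish_stalling and x <= self.max_x:
            reward -= 0.01                            # pass 2: small penalty for no progress
        if self.reward_speed and info.get("level_complete", False):
            reward += 1000.0 / max(self.steps, 1)     # pass 3: bonus scaled by completion speed
        self.max_x = max(self.max_x, x)
        return obs, reward, terminated, truncated, info
```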
To validate our results, we retrained agents using each of the reward functions individually. We found that the pass 3 reward function alone yielded results comparable to our three-pass method.
Using only the pass 3 reward function, we achieved a winrate similar to our three-pass result.
We want AI agents that can discover like we can, not which contain what we have already discovered. Through the work conducted in this project, we learned that implementing human ingenuity should not be the focus of practicing AI; rather, the focus should be on the learning process itself.