Bike Environment with PPO Agent

This project implements a bike environment using the OpenAI Gym interface and trains a Proximal Policy Optimization (PPO) agent to navigate the bike towards a goal position while avoiding obstacles.

Environment

The BikeEnv class defines the bike environment. It has the following key components:

  • Action Space: The action space is continuous and represented by a 2D vector. The first element is the normalized steering angle (-1 to 1), and the second element is the speed (0 to 5). The normalized angle is rescaled to the range [-0.9, 0.9] before it is applied to the bike's angular velocity, and the speed is clamped to be non-negative so that only forward motion is possible.

  • Observation Space: The observation space is a continuous space represented by a 30-dimensional vector. It includes the following information:

  • Bike's position (2D vector)

  • Bike's velocity (2D vector)

  • Bike's angle and angular velocity (2 scalar values)

  • Wheel angles (2 scalar values)

  • Goal position (2D vector)

  • Relative positions of the nearest obstacles (10D vector, 2D position for each of the 5 nearest obstacles)

  • x and y distances to the target (2 scalar values)

  • State Update: The environment updates the bike's state from the provided action. The angle delta is applied to the angular velocity, and the speed is clamped to be non-negative. Acceleration is computed from gravity and the sine of the bike's angle, the velocity is derived from the speed and angle, the position is advanced using the velocity and acceleration, and the angle and wheel angles are advanced using the angular velocity. A condensed sketch of the environment appears after this list.

  • Collision Detection: The environment checks for collisions between the bike and the obstacles. It calculates the distances between the bike's position and each obstacle and compares them with the obstacle radius. If the distance is less than or equal to the obstacle radius, a collision is detected.

  • Reward: The reward is calculated based on the distance to the goal position. The Euclidean distance between the bike's position and the goal position is computed, and the negative of this distance is used as the reward. If the bike collides with an obstacle, a large negative reward (-10.0) is given. Additionally, if the bike falls down (i.e., the absolute value of the bike's angle exceeds 120 degrees), an additional negative reward (-20.0) is given.

  • Termination: The episode terminates if any of the following conditions are met:

  • The bike collides with an obstacle.

  • The bike falls down (i.e., the absolute value of the bike's angle exceeds 120 degrees).

  • The maximum number of steps per episode is reached (not explicitly shown in the code provided).
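
The pieces above can be condensed into a short Gym-style sketch. This is not the project's implementation: `BikeEnvSketch` is a hypothetical name, and the obstacle radius, time step, episode cap, and physics constants are assumptions chosen for illustration.

```python
import numpy as np
import gym
from gym import spaces


class BikeEnvSketch(gym.Env):
    """Condensed, illustrative sketch of the BikeEnv described above."""

    def __init__(self, obstacles, goal, obstacle_radius=0.5, dt=0.1, max_steps=500):
        super().__init__()
        self.obstacles = np.asarray(obstacles, dtype=np.float32)
        self.goal = np.asarray(goal, dtype=np.float32)
        self.obstacle_radius = obstacle_radius   # assumed value
        self.dt = dt                             # assumed integration step
        self.max_steps = max_steps               # assumed episode cap
        # Action: [normalized steering angle in [-1, 1], speed in [0, 5]].
        self.action_space = spaces.Box(
            low=np.array([-1.0, 0.0]), high=np.array([1.0, 5.0]), dtype=np.float32)
        # Observation: concatenation of the features listed above.
        obs_dim = self.reset().shape[0]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)

    def reset(self):
        self.pos = np.zeros(2, dtype=np.float32)
        self.vel = np.zeros(2, dtype=np.float32)
        self.angle = 0.0
        self.angular_vel = 0.0
        self.wheel_angles = np.zeros(2, dtype=np.float32)
        self.steps = 0
        return self._observation()

    def step(self, action):
        angle_delta = float(np.clip(action[0], -1.0, 1.0)) * 0.9  # map to [-0.9, 0.9]
        speed = max(float(action[1]), 0.0)                        # forward motion only

        # Simplified stand-in for the physics update described above.
        self.angular_vel += angle_delta
        self.angle += self.angular_vel * self.dt
        heading = np.array([np.cos(self.angle), np.sin(self.angle)], dtype=np.float32)
        accel = 9.81 * np.sin(self.angle) * heading   # gravity * sin(angle)
        self.vel = speed * heading
        self.pos += self.vel * self.dt + 0.5 * accel * self.dt ** 2
        self.wheel_angles += self.angular_vel * self.dt
        self.steps += 1

        # Reward: negative distance to the goal, plus penalties (assumed additive).
        reward = -float(np.linalg.norm(self.pos - self.goal))
        collided = bool(np.any(
            np.linalg.norm(self.obstacles - self.pos, axis=1) <= self.obstacle_radius))
        fell = abs(np.degrees(self.angle)) > 120.0
        if collided:
            reward -= 10.0
        if fell:
            reward -= 20.0

        done = collided or fell or self.steps >= self.max_steps
        return self._observation(), reward, done, {}

    def _observation(self):
        # Relative positions of the 5 nearest obstacles, padded if fewer exist.
        rel = self.obstacles - self.pos
        nearest = rel[np.argsort(np.linalg.norm(rel, axis=1))[:5]].flatten()
        nearest = np.pad(nearest, (0, max(0, 10 - nearest.size)))
        return np.concatenate([
            self.pos, self.vel, [self.angle, self.angular_vel],
            self.wheel_angles, self.goal, nearest, self.goal - self.pos,
        ]).astype(np.float32)
```

The sketch derives the observation size from the concatenation rather than hard-coding it, since the real feature layout may differ from this simplified version.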

PPO Agent

The PPOAgent class implements the Proximal Policy Optimization algorithm to train the agent. It consists of two neural networks: a policy network and a value network.

  • Policy Network: The policy network is a multilayer perceptron (MLP) that takes the state as input and outputs the parameters of the action distributions. It has two hidden layers with 64 units each and ReLU activation. The output layer is divided into three parts:

  • Angle Mean: A single output unit with tanh activation, representing the mean of the angle action distribution.

  • Angle Standard Deviation: A single output unit with exponential activation, representing the standard deviation of the angle action distribution. The output is clipped to the range [1e-6, 1.0] to ensure numerical stability.

  • Speed Logits: An output unit for each discrete speed action, representing the logits of the speed action distribution.

  • Value Network: The value network is also an MLP that takes the state as input and estimates the value of the state. It has two hidden layers with 64 units each and ReLU activation. The output layer is a single unit without any activation function.

  • Action Selection: The agent selects actions by sampling from the angle and speed distributions generated by the policy network. The angle action is sampled from a Normal distribution with the mean and standard deviation obtained from the policy network. The speed action is sampled from a Categorical distribution based on the logits obtained from the policy network.

  • Experience Buffer: The agent stores the experiences (state, action, log probability, reward, next state, done) in a buffer for training. The buffer is implemented as a deque with a maximum size specified by the buffer_size parameter. When storing experiences, the rewards are normalized by subtracting the mean and dividing by the standard deviation of the rewards in the buffer.

  • Update: The agent updates the policy and value networks using the PPO algorithm (a condensed sketch follows the numbered steps). It performs the following steps:

  1. Sample a batch of experiences from the buffer.
  2. Calculate the advantages using the value estimates from the value network.
  3. Calculate the value targets by adding the rewards and the discounted next state values.
  4. Update the policy network by minimizing the PPO objective function, which includes the clipped surrogate objective for both the angle and speed actions.
  5. Update the value network by minimizing the mean squared error between the estimated values and the value targets.
  6. Optimize the networks using the RMSprop optimizer.
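
A minimal PyTorch sketch of the networks and the update step is shown below. `PolicyNetwork`, `ValueNetwork`, `select_action`, and `ppo_update` are illustrative names and the hyperparameters are assumptions; for brevity the angle and speed log-probabilities are combined into a single ratio, whereas the description above clips the angle and speed terms as separate objectives.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Categorical


class PolicyNetwork(nn.Module):
    """Sketch of the policy head layout described above."""

    def __init__(self, state_dim, num_speeds, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.angle_mean = nn.Linear(hidden, 1)     # tanh applied in forward
        self.angle_log_std = nn.Linear(hidden, 1)  # exp applied in forward
        self.speed_logits = nn.Linear(hidden, num_speeds)

    def forward(self, state):
        h = self.body(state)
        mean = torch.tanh(self.angle_mean(h))
        std = torch.clamp(torch.exp(self.angle_log_std(h)), 1e-6, 1.0)
        return mean, std, self.speed_logits(h)


class ValueNetwork(nn.Module):
    """Sketch of the value estimator described above."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)


def select_action(policy, state):
    """Sample an angle from a Normal and a speed index from a Categorical."""
    mean, std, logits = policy(state)
    angle_dist, speed_dist = Normal(mean, std), Categorical(logits=logits)
    angle, speed = angle_dist.sample(), speed_dist.sample()
    log_prob = angle_dist.log_prob(angle).sum(-1) + speed_dist.log_prob(speed)
    return angle, speed, log_prob


def ppo_update(policy, value_net, optimizers, batch, clip_eps=0.2, gamma=0.99):
    """One clipped-surrogate update over a sampled batch (assumed tensor layout)."""
    states, angles, speeds, old_log_probs, rewards, next_states, dones = batch
    with torch.no_grad():
        targets = rewards + gamma * value_net(next_states) * (1 - dones)
        advantages = targets - value_net(states)

    mean, std, logits = policy(states)
    angle_dist, speed_dist = Normal(mean, std), Categorical(logits=logits)
    new_log_probs = angle_dist.log_prob(angles).sum(-1) + speed_dist.log_prob(speeds)

    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    policy_loss = -surrogate.mean()
    value_loss = nn.functional.mse_loss(value_net(states), targets)

    for opt, loss in zip(optimizers, (policy_loss, value_loss)):
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In line with the description above, both networks would be optimized with `torch.optim.RMSprop`; the training sketch at the end of the Training section shows one way to wire this together.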

Bike Simulator

The BikeSimulator class provides a visual representation of the bike environment using Pygame. It renders the bike, obstacles, and goal position on the screen. The simulator window has a size of 800x600 pixels, and the positions of the bike, obstacles, and goal are scaled and translated to fit within the window. The bike is represented by a red circle, the obstacles by blue circles, and the goal by a larger green circle. The simulator updates the display at each step to show the current state of the environment.
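
The rendering logic can be sketched as follows. `BikeSimulatorSketch`, the scale factor, and the circle radii are assumptions; only the scaling, translation, and drawing described above are shown.

```python
import pygame


class BikeSimulatorSketch:
    """Minimal Pygame rendering sketch for an environment like the one above."""

    WIDTH, HEIGHT = 800, 600

    def __init__(self, env, scale=30.0):
        pygame.init()
        self.env = env
        self.scale = scale   # assumed world-to-pixel scale
        self.screen = pygame.display.set_mode((self.WIDTH, self.HEIGHT))

    def _to_screen(self, pos):
        # Scale world coordinates and translate the origin to the window centre.
        x = int(pos[0] * self.scale + self.WIDTH / 2)
        y = int(self.HEIGHT / 2 - pos[1] * self.scale)   # flip y for screen space
        return x, y

    def render(self):
        self.screen.fill((255, 255, 255))
        for obstacle in self.env.obstacles:                        # blue obstacles
            pygame.draw.circle(self.screen, (0, 0, 255), self._to_screen(obstacle), 10)
        pygame.draw.circle(self.screen, (0, 255, 0), self._to_screen(self.env.goal), 15)  # green goal
        pygame.draw.circle(self.screen, (255, 0, 0), self._to_screen(self.env.pos), 10)   # red bike
        pygame.display.flip()
```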

Training

The training process involves the following steps (a condensed sketch of the loop follows the list):

  1. Create an instance of the BikeEnv with the specified obstacle positions and goal position.
  2. Create an instance of the PPOAgent with the appropriate state and action dimensions, learning rate, discount factor, clipping parameter, buffer size, and batch size.
  3. For each episode:
  • Reset the environment to the initial state.
  • Initialize an empty list to store the episode rewards.
  • For each step:
    • Select an action using the agent's policy network.
    • Execute the action in the environment and obtain the next state, reward, done flag, and info dictionary.
    • Store the experience (state, action, log probability, reward, next state, done) in the agent's buffer.
    • Update the state to the next state.
    • Append the reward to the episode rewards list.
    • If the episode is terminated (done is True), break the step loop.
  • Sample a batch of experiences from the buffer.
  • If a batch is obtained, update the agent's policy and value networks using the sampled batch.
  • Print the episode number and the sum of episode rewards every 10 episodes.
  4. Periodically run simulations to visualize the agent's performance:
  • Create an instance of the BikeSimulator with the current environment.
  • Reset the environment to the initial state.
  • Initialize the total reward to zero.
  • While the episode is not done:
    • Handle Pygame events.
    • Select an action using the agent's policy network.
    • Execute the action in the environment and obtain the next state, reward, done flag, and info dictionary.
    • Update the state to the next state.
    • Add the reward to the total reward.
    • Render the current state of the environment using the simulator.
    • Calculate the x and y distances to the target and provide feedback.
    • Add a small delay to visualize each step.
  • Print the simulation reward.
  • Display the reason for the end of the simulation (collision, falling over, or reaching the maximum number of steps).
  • Close the simulator.
  5. Close the environment.
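
The loop can be sketched as follows, reusing the hypothetical `BikeEnvSketch`, `PolicyNetwork`, `ValueNetwork`, `select_action`, and `ppo_update` from the earlier sketches. The obstacle layout, goal, buffer size, batch size, learning rates, and number of discrete speed levels are assumptions, and the periodic Pygame visualization is omitted for brevity.

```python
import random
from collections import deque

import numpy as np
import torch

env = BikeEnvSketch(obstacles=[(3.0, 1.0), (5.0, -2.0), (7.0, 3.0)], goal=(10.0, 0.0))
state_dim = env.observation_space.shape[0]
num_speeds = 6                                   # assumed number of discrete speed levels
policy = PolicyNetwork(state_dim, num_speeds)
value_net = ValueNetwork(state_dim)
optimizers = (torch.optim.RMSprop(policy.parameters(), lr=3e-4),
              torch.optim.RMSprop(value_net.parameters(), lr=1e-3))
buffer = deque(maxlen=10_000)
batch_size, num_episodes = 64, 5000

for episode in range(num_episodes):
    state, done, episode_rewards = env.reset(), False, []
    while not done:
        with torch.no_grad():
            s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            angle, speed, log_prob = select_action(policy, s)
        # The real code maps the speed index to a speed value; used directly here.
        next_state, reward, done, _ = env.step([angle.item(), speed.item()])
        buffer.append((state, angle.item(), speed.item(), log_prob.item(),
                       reward, next_state, float(done)))
        episode_rewards.append(reward)
        state = next_state

    if len(buffer) >= batch_size:
        sample = random.sample(buffer, batch_size)
        states, angles, speeds, old_lp, rewards, next_states, dones = (
            torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*sample))
        ppo_update(policy, value_net, optimizers,
                   (states, angles.unsqueeze(-1), speeds.long(), old_lp,
                    rewards, next_states, dones))

    if (episode + 1) % 10 == 0:
        print(f"Episode {episode + 1}: reward {sum(episode_rewards):.2f}")
```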

Requirements

  • Python 3.x
  • OpenAI Gym
  • NumPy
  • PyTorch
  • Pygame

Usage

  1. Install the required dependencies mentioned above.
  2. Run the script main.py to start the training process.
  3. The agent will be trained for the specified number of episodes (num_episodes).
  4. During training, the agent's performance will be evaluated periodically (every 2000 episodes) through simulations using the BikeSimulator.
  5. The script will display the reward achieved in each episode and provide feedback on the distance to the target during simulations.
