Supervised Learning Gym CarRacing

This project compares imitation learning (behavioral cloning and DAgger) with reinforcement learning for autonomous racing in OpenAI Gym's CarRacing environment. Imitation learning took more effort and scored lower than RL, but it remains a useful technique when no reward function is available.


Project overview

The OpenAI Gym CarRacing environment is a game where keyboard input is used to drive a car around a racetrack. The environment provides a reward at each timestep based on how much new track is covered, with penalties for going off-track. The goal is to maximize the total reward by completing laps as quickly as possible while staying on the track.

RL is much easier to implement, but the real world does not come with reward functions, so most of my effort has gone into improving the imitation learning network with DAgger. The RL agent achieves an average score of about 710 out of a maximum possible 1000 (state-of-the-art scores are around 800-900). The imitation learning agent scores about 560 and continues to improve as I refine the demonstration data. Improving an imitation model with DAgger is slow because each iteration requires collecting new data with the expert in the loop.

Reinforcement Learning (PPO)


Performance: The trained agent navigates racing tracks using only visual input, showing learned steering, acceleration, and braking behaviors. It achieves an average score of about 710 out of a maximum possible 1000.

Network Architecture

Actor-Critic Network

The reinforcement learning agent uses a shared CNN backbone that feeds into separate actor (policy) and critic (value) heads:

flowchart TB
    I["Input State<br/>(B, 4, 84, 84)"]
    C1["Conv2d 4→32<br/>k=8 s=4<br/>ReLU"]
    C2["Conv2d 32→64<br/>k=4 s=2<br/>ReLU"]
    C3["Conv2d 64→64<br/>k=3 s=1<br/>ReLU"]
    F["Flatten"]
    S["Shared FC<br/>Linear → 512<br/>ReLU"]
    A1["Actor FC<br/>512 → 256<br/>ReLU"]
    AM["Action Mean<br/>256 → 3"]
    ALS["action_log_std<br/>(learned param)"]
    D["Gaussian + Tanh"]
    ACT["Action<br/>[-1,1]^3"]
    C4["Critic FC<br/>512 → 256<br/>ReLU"]
    V["Value<br/>256 → 1"]
    I --> C1 --> C2 --> C3 --> F --> S
    S --> A1 --> AM --> D --> ACT
    ALS --> D
    S --> C4 --> V
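
The diagram does not state the input width of the shared `Linear → 512` layer. As a rough check, the sketch below applies the standard convolution output-size formula to the three layers shown. It assumes no padding and no dilation, which the diagram does not specify; `conv_out` and `flatten_features` are names made up for this example.

```python
# Sketch: size of the flattened feature vector entering the shared FC layer.
# Assumption: padding=0 and dilation=1 for every Conv2d in the diagram.

def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# (out_channels, kernel, stride) for the three conv layers in the diagram
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def flatten_features(height=84, width=84):
    """Number of features after the conv stack and Flatten."""
    channels = 4  # stacked input frames
    for channels, kernel, stride in LAYERS:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return channels * height * width

print(flatten_features())  # 84 -> 20 -> 9 -> 7 per side, so 64 * 7 * 7 = 3136
```

Under these assumptions the shared layer would map 3136 features to 512. With non-zero padding the number changes, so it should be checked against the actual model definition.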

Supervised Learning (Imitation)

Recording Expert Demonstrations

An interactive recording mode collects training data while a human drives:

# Record expert demonstrations
python main.py teach

# Controls:
# Arrow keys: Steer left/right, gas, brake
# SPACE: Restart on a new track
# TAB: Save the current run
# ESC: Quit

# Data is saved as:
# observation_00000.npy, observation_00001.npy, ...
# action_00000.npy, action_00001.npy, ...

Each demonstration run captures synchronized observation-action pairs.
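
Because the pairs are stored as separate files, it can help to verify that every observation has a matching action before training. The following is a minimal sketch that relies only on the file naming shown above; `pair_demo_files` is a hypothetical helper, not part of the project.

```python
# Sketch: match observation_XXXXX.npy files with action_XXXXX.npy files.
# The helper name and the error behaviour are assumptions for illustration.
import re
from pathlib import Path

_PATTERN = re.compile(r"^(observation|action)_(\d+)\.npy$")

def pair_demo_files(folder):
    """Return [(index, observation_path, action_path), ...] sorted by index."""
    found = {"observation": {}, "action": {}}
    for path in Path(folder).iterdir():
        match = _PATTERN.match(path.name)
        if match:
            kind, index = match.group(1), int(match.group(2))
            found[kind][index] = path
    obs, act = found["observation"], found["action"]
    # Indices present on one side only mean the recording is out of sync.
    unmatched = sorted(set(obs) ^ set(act))
    if unmatched:
        raise ValueError(f"indices without a matching pair: {unmatched}")
    return [(i, obs[i], act[i]) for i in sorted(obs)]
```

Each returned path could then be read with `numpy.load`. Failing loudly on unmatched indices is a design choice here, since a silent mismatch would pair frames with the wrong actions.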

Training & Evaluation

Training Pipeline

# Train on collected demonstrations
python main.py train

# Training process:
# 1. Load observation-action pairs from data folder
# 2. Convert continuous actions to 9-class labels
# 3. Train CNN with cross-entropy loss
# 4. Save trained model as train.t7
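
Step 2 can be illustrated with a small sketch. The exact class layout used by the project is not described here, so the code below assumes one plausible scheme: three steering bins (left, straight, right) times three pedal states (coast, gas, brake). The function names and the `threshold` value are hypothetical.

```python
# Sketch: map a continuous (steer, gas, brake) action to one of 9 classes
# and back. The 3 x 3 layout is an assumption, not the project's confirmed one.

def action_to_class(action, threshold=0.1):
    """Return a label in 0..8 computed as steering_bin * 3 + pedal_bin."""
    steer, gas, brake = action
    if steer < -threshold:
        steering_bin = 0      # left
    elif steer > threshold:
        steering_bin = 2      # right
    else:
        steering_bin = 1      # straight
    if brake > threshold:
        pedal_bin = 2         # brake wins if both pedals are pressed
    elif gas > threshold:
        pedal_bin = 1         # gas
    else:
        pedal_bin = 0         # coast
    return steering_bin * 3 + pedal_bin

def class_to_action(label):
    """Inverse mapping, used to drive the car from a predicted class."""
    steering_bin, pedal_bin = divmod(label, 3)
    steer = (-1.0, 0.0, 1.0)[steering_bin]
    gas = 1.0 if pedal_bin == 1 else 0.0
    brake = 1.0 if pedal_bin == 2 else 0.0
    return (steer, gas, brake)
```

Treating the action as a single class is what allows the cross-entropy loss in step 3. The cost is that the policy can only output these nine fixed actions.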

Evaluation

# Evaluate trained agent
python main.py test

# Calculate leaderboard score (10 episodes)
python main.py score

# Metrics:
# - Average reward per episode
# - Track completion percentage
# - Action distribution analysis
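
The first and last metrics can be computed from logged episode returns and predicted class labels. The sketch below is a rough illustration: `summarize` is a hypothetical helper, and the 0..8 labels assume the nine-class action encoding used in training.

```python
# Sketch: average reward per episode and action distribution.
# Inputs are plain lists; how the project logs them is not shown here.
from collections import Counter
from statistics import mean, stdev

def summarize(episode_returns, action_labels):
    """Return mean/std of episode returns and the share of each action label."""
    counts = Counter(action_labels)
    total = sum(counts.values())
    return {
        "mean_return": mean(episode_returns),
        "std_return": stdev(episode_returns) if len(episode_returns) > 1 else 0.0,
        "action_share": {label: n / total for label, n in sorted(counts.items())},
    }

report = summarize([500.0, 600.0, 700.0], [0, 0, 1, 3])
print(report["mean_return"])  # 600.0
```

A heavily skewed action distribution, for example one dominated by a single class, can be a hint that the demonstration data is imbalanced.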

DAgger Implementation

The project includes scaffolding for DAgger (Dataset Aggregation), an iterative imitation learning algorithm that improves upon behavioral cloning by:

  • Training an initial policy on expert demonstrations
  • Rolling out the learned policy to collect new states
  • Querying the expert for optimal actions at visited states
  • Aggregating the new data with the original dataset
  • Retraining to handle distribution shift
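
The steps above can be sketched on a toy problem. This is not the project's implementation: the one-dimensional "lane offset" state, the lookup-table policy, and every function name are assumptions chosen so that the loop runs without a simulator or a neural network.

```python
# Sketch of the DAgger loop on a toy 1-D task: state = offset from the
# centre line, action in {-1, 0, +1}. All names here are hypothetical.
from collections import Counter, defaultdict

def expert_policy(state):
    """Toy expert: always step back toward the centre line at state 0."""
    return (state < 0) - (state > 0)

def fit(dataset):
    """'Train' a lookup-table policy by majority vote per state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda state: table.get(state, 0)  # unseen state: do nothing

def rollout(policy, start, steps=5):
    """Run the learned policy and record the states it visits."""
    visited, state = [], start
    for _ in range(steps):
        visited.append(state)
        state += policy(state)
    return visited

def dagger(initial_states, starts, iterations):
    # 1. Initial policy from expert demonstrations (behavioral cloning)
    dataset = [(s, expert_policy(s)) for s in initial_states]
    policy = fit(dataset)
    for _ in range(iterations):
        # 2. Roll out the learned policy to collect the states it reaches
        new_states = [s for start in starts for s in rollout(policy, start)]
        # 3 + 4. Query the expert at those states and aggregate
        dataset += [(s, expert_policy(s)) for s in new_states]
        # 5. Retrain on the aggregated dataset
        policy = fit(dataset)
    return policy, dataset

policy, dataset = dagger(initial_states=[-1, 0, 1], starts=[3, -3], iterations=3)
print(rollout(policy, 3))  # [3, 2, 1, 0, 0]
```

In this toy setting the cloned policy never saw offsets of 2 or 3, so it stalls there until the expert relabels those states. That is the distribution shift the final step is meant to address.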

Status: The DAgger scripts and their integration are in progress.