Supervised Learning Gym CarRacing

A comparison of imitation learning (behavioral cloning and DAgger) with reinforcement learning for autonomous racing in OpenAI Gym's CarRacing environment. Imitation learning proved less efficient and scored lower than RL, but it remains a useful technique when no reward function is available.


Role

Author / ML Engineer

Tech

• Python (PyTorch)

• AI Architecture

• Dataset Curation


Project overview

The OpenAI Gym CarRacing environment is a game where keyboard input is used to drive a car around a racetrack. The environment provides a reward at each timestep based on how much new track is covered, with penalties for going off-track. The goal is to maximize the total reward by completing laps as quickly as possible while staying on the track.
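To make the scoring concrete, here is a small worked example of the per-episode return. It assumes the reward rule given in the Gym/Gymnasium documentation for CarRacing (+1000/N for each newly visited tile on a track of N tiles, -0.1 per frame) and ignores the extra penalty for leaving the playfield. The function name and the tile count of 300 are illustrative, not taken from this project.

```python
def episode_return(tiles_visited: int, total_tiles: int, frames: int) -> float:
    """Return for one CarRacing episode under the documented reward rule.

    Assumes +1000/N per newly visited tile (N = tiles on the track) and
    -0.1 per frame. The penalty for leaving the playfield is not modeled.
    """
    if total_tiles <= 0:
        raise ValueError("total_tiles must be positive")
    tile_reward = 1000.0 * tiles_visited / total_tiles
    time_penalty = 0.1 * frames
    return tile_reward - time_penalty


# A full lap finished in 732 frames; the tile count (300) is a made-up value.
full_lap = episode_return(tiles_visited=300, total_tiles=300, frames=732)
print(round(full_lap, 1))  # 926.8
```

Under this rule 1000 is an upper bound rather than an attainable score, since every frame spent driving costs 0.1 points.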

RL was much easier to implement here because the environment supplies a reward, but real-world tasks rarely come with a ready-made reward function, so much of my effort has gone into improving the imitation learning network with DAgger. The RL agent achieves an average score of about 710 out of a maximum possible 1000 (state-of-the-art scores are roughly 800-900). The imitation learning agent scores about 560 and keeps improving as I improve the demonstration data. Improving an imitation model with DAgger is slow because every iteration requires collecting new data with the expert in the loop.

Reinforcement Learning (PPO)

Demo Video

Performance: The trained agent successfully navigates racing tracks using only visual input, demonstrating learned steering, acceleration, and braking behaviors. It achieves an average score of about 710 out of a maximum possible 1000.

Network Architecture

Actor-Critic Network

The reinforcement learning agent uses a shared CNN backbone that feeds into separate actor (policy) and critic (value) heads:

flowchart TB
    I["Input State<br/>(B, 4, 84, 84)"]

    C1["Conv2d 4→32<br/>k=8 s=4<br/>ReLU"]
    C2["Conv2d 32→64<br/>k=4 s=2<br/>ReLU"]
    C3["Conv2d 64→64<br/>k=3 s=1<br/>ReLU"]

    F["Flatten"]
    S["Shared FC<br/>Linear → 512<br/>ReLU"]

    A1["Actor FC<br/>512 → 256<br/>ReLU"]
    AM["Action Mean<br/>256 → 3"]
    ALS["action_log_std<br/>(learned param)"]
    D["Gaussian + Tanh"]
    ACT["Action<br/>[-1,1]^3"]

    C4["Critic FC<br/>512 → 256<br/>ReLU"]
    V["Value<br/>256 → 1"]

    I --> C1 --> C2 --> C3 --> F --> S
    S --> A1 --> AM --> D --> ACT
    ALS --> D
    S --> C4 --> V
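The diagram does not state the input width of the shared fully connected layer. It can be derived from the convolution parameters, as the short calculation below shows. It assumes zero padding on every layer (the diagram does not mention padding), and the helper names are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a Conv2d layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride) for the three conv layers in the diagram.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def trace_shapes(height: int = 84, width: int = 84, channels: int = 4):
    """Return the (C, H, W) shape after each conv layer, assuming padding=0."""
    shapes = []
    for out_channels, kernel, stride in CONV_LAYERS:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        channels = out_channels
        shapes.append((channels, height, width))
    return shapes


shapes = trace_shapes()
c, h, w = shapes[-1]
flat_features = c * h * w  # input width of the shared Linear layer
print(shapes)         # [(32, 20, 20), (64, 9, 9), (64, 7, 7)]
print(flat_features)  # 3136
```

Under these assumptions the flattened feature vector has 3136 entries, so the shared layer would map 3136 → 512.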

Supervised Learning (Imitation)

Recording Expert Demonstrations

Interactive system for collecting training data:

# Record expert demonstrations
python main.py teach

# Controls:
# Arrow keys: Steer left/right, gas, brake
# SPACE: Restart on a new track
# TAB: Save the current run
# ESC: Quit

# Data is saved as:
# observation_00000.npy, observation_00001.npy, ...
# action_00000.npy, action_00001.npy, ...

Each demonstration run captures synchronized observation-action pairs.
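A minimal sketch of how the saved files could be read back as matched pairs, based only on the file naming shown above. The function name is hypothetical, and what each file contains (a single frame or a whole run) is not stated, so the loader makes no assumption about array shapes.

```python
from pathlib import Path

import numpy as np


def load_demonstrations(data_dir):
    """Load observation_XXXXX.npy / action_XXXXX.npy files as matched pairs.

    Files are paired by their numeric suffix. An observation without a
    matching action file raises an error instead of being skipped silently.
    """
    data_dir = Path(data_dir)
    observations, actions = [], []
    for obs_path in sorted(data_dir.glob("observation_*.npy")):
        suffix = obs_path.stem.split("_", 1)[1]  # e.g. "00000"
        act_path = data_dir / f"action_{suffix}.npy"
        if not act_path.exists():
            raise FileNotFoundError(f"missing action file for {obs_path.name}")
        observations.append(np.load(obs_path))
        actions.append(np.load(act_path))
    return observations, actions
```

Pairing by suffix rather than by list position keeps observations and actions synchronized even if a file is missing or the directory listing is unordered.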

Training & Evaluation

Training Pipeline

# Train on collected demonstrations
python main.py train

# Training process:
# 1. Load observation-action pairs from data folder
# 2. Convert continuous actions to 9-class labels
# 3. Train CNN with cross-entropy loss
# 4. Save trained model as train.t7
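Step 2 is the least obvious part of the pipeline, so here is one possible way to map continuous actions to 9 classes. The project does not document its class layout. The scheme below (three steering bins times three pedal states), the dead-zone threshold, and the pedal magnitudes are all assumptions for illustration. It relies on the CarRacing action format [steer, gas, brake] with steer in [-1, 1] and both pedals in [0, 1].

```python
STEER_THRESHOLD = 0.1  # hypothetical dead zone around "straight"


def action_to_class(action) -> int:
    """Map a continuous [steer, gas, brake] action to one of 9 class labels.

    Assumed layout: label = 3 * steer_bin + pedal_bin, with
    steer_bin in {0: left, 1: straight, 2: right} and
    pedal_bin in {0: coast, 1: gas, 2: brake}.
    Braking wins if both pedals are pressed.
    """
    steer, gas, brake = (float(x) for x in action)
    if steer < -STEER_THRESHOLD:
        steer_bin = 0
    elif steer > STEER_THRESHOLD:
        steer_bin = 2
    else:
        steer_bin = 1
    if brake > 0.0:
        pedal_bin = 2
    elif gas > 0.0:
        pedal_bin = 1
    else:
        pedal_bin = 0
    return 3 * steer_bin + pedal_bin


def class_to_action(label: int):
    """Inverse mapping for driving at test time; pedal values are illustrative."""
    steer_bin, pedal_bin = divmod(label, 3)
    steer = (-1.0, 0.0, 1.0)[steer_bin]
    gas = 0.5 if pedal_bin == 1 else 0.0
    brake = 0.8 if pedal_bin == 2 else 0.0
    return [steer, gas, brake]
```

With integer labels like these, the network can end in a 9-way output layer trained with cross-entropy loss, and `class_to_action` turns the predicted class back into a command the environment accepts.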

Evaluation

# Evaluate trained agent
python main.py test

# Calculate leaderboard score (10 episodes)
python main.py score

# Metrics:
# - Average reward per episode
# - Track completion percentage
# - Action distribution analysis
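The three metrics listed above could be computed along these lines. This is a sketch with hypothetical function names, not the project's scoring code. It assumes per-episode returns, tile counts, and predicted class labels are already being collected during evaluation.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_scores(episode_returns):
    """Mean and population standard deviation of per-episode returns."""
    returns = [float(r) for r in episode_returns]
    if not returns:
        raise ValueError("need at least one episode")
    return {"episodes": len(returns), "mean": mean(returns), "std": pstdev(returns)}


def track_completion(tiles_visited: int, total_tiles: int) -> float:
    """Percentage of track tiles the agent reached in one episode."""
    return 100.0 * tiles_visited / total_tiles


def action_distribution(class_labels):
    """Fraction of timesteps spent in each predicted action class."""
    counts = Counter(class_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}
```

Reporting the spread alongside the mean is worthwhile with only 10 episodes, because CarRacing tracks are randomly generated and single-episode scores can vary widely. A heavily skewed action distribution is also a quick way to spot a policy that has collapsed onto one class.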

DAgger Implementation

The project includes scaffolding for DAgger (Dataset Aggregation), an iterative imitation learning algorithm that improves upon behavioral cloning by:

1. Training an initial policy on expert demonstrations
2. Rolling out the learned policy to collect new states
3. Querying the expert for optimal actions at visited states
4. Aggregating new data with the original dataset
5. Retraining to handle distribution shift

Status: DAgger implementation scripts and integration in progress.
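Since the project's own DAgger scripts are not finished, the loop described above is sketched here in environment-agnostic form. This is not the project's code: every callable (`reset`, `step`, `expert_action`, `fit`) is a hypothetical placeholder supplied by the caller. It also uses a common simplification in which the learner alone drives after the first iteration, rather than mixing expert and learner actions.

```python
def dagger(reset, step, expert_action, fit, n_iterations=5, horizon=50):
    """Minimal DAgger loop over caller-supplied callables (hypothetical names).

    reset() -> state
    step(state, action) -> (next_state, done)
    expert_action(state) -> action
    fit(states, actions) -> policy, where policy(state) -> action
    """
    states, actions = [], []

    # Step 1: behavioral cloning on an expert-driven rollout.
    state = reset()
    for _ in range(horizon):
        action = expert_action(state)
        states.append(state)
        actions.append(action)
        state, done = step(state, action)
        if done:
            break
    policy = fit(states, actions)

    for _ in range(n_iterations):
        # Step 2: the learner drives, so we see the states it actually reaches.
        state = reset()
        for _ in range(horizon):
            # Step 3: the expert labels the visited state; its action is
            # recorded but not executed.
            states.append(state)
            actions.append(expert_action(state))
            state, done = step(state, policy(state))
            if done:
                break
        # Steps 4-5: retrain on the aggregated dataset.
        policy = fit(states, actions)
    return policy, states, actions
```

The key difference from plain behavioral cloning is in the inner loop: the state sequence comes from the learner's own mistakes, while the labels still come from the expert. That is also why each iteration is costly with a human expert, who has to label states they did not drive into themselves.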

Tech stack

• Python 3.x

• PyTorch (deep learning)

• OpenAI Gym / Gymnasium

• NumPy (data processing)

• OpenCV (preprocessing)

• Pyglet (keyboard input)

Key Concepts

Imitation Learning
Learning from expert demonstrations

Behavioral Cloning
Supervised learning approach to policy learning

DAgger
Iterative algorithm for handling distribution shift

CNN
Convolutional networks for visual perception

Learning Approaches

Behavioral Cloning

Fast training with expert data, limited by demonstration quality

PPO (rl_approach)

Learns from rewards, no demonstrations needed, longer training

DAgger (in progress)

Iteratively improves behavioral cloning by having the expert label the states the learned policy actually visits

Development Status

✅ Expert demonstration recording

✅ CNN policy network

✅ Training pipeline

✅ Evaluation system

✅ PPO alternative implementation

✅ Score: 710/1000 (PPO agent, average)

🚧 DAgger algorithm
