Framework Architecture

This section introduces the framework architecture. The design is heavily inspired by MALib and RLlib.

Workflow 

The framework has five major components, each with a distinct role: the Rollout Manager, Training Manager, Data Buffer, Agent Manager, and Task Scheduler.

Components 

Rollout Manager 

light_malib/rollout/rollout_manager.py

The Rollout Manager establishes multiple parallel rollout workers and delegates rollout tasks to each worker. Each rollout task includes environment settings, policy distributions for simulation, and information pertaining to the Episode Server.
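
The delegation described above can be sketched in a few lines. This is a minimal illustration, not the code in light_malib/rollout/rollout_manager.py: the names RolloutTask, run_rollout, and dispatch are hypothetical, the field names are assumptions, and a standard-library thread pool stands in for whatever distributed runtime the framework actually uses.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutTask:
    """Hypothetical task description; field names are illustrative only."""
    env_settings: dict          # environment settings
    policy_distributions: dict  # agent id -> {policy id: sampling probability}
    episode_server: str         # where the worker should submit its episodes


def run_rollout(worker_id: int, task: RolloutTask) -> dict:
    """Stand-in worker: sample one policy per agent and report the choice."""
    rng = random.Random(worker_id)  # seeded so the sketch is reproducible
    chosen = {}
    for agent_id, dist in task.policy_distributions.items():
        policy_ids = list(dist)
        weights = [dist[p] for p in policy_ids]
        chosen[agent_id] = rng.choices(policy_ids, weights=weights, k=1)[0]
    return {"worker": worker_id, "policies": chosen, "sent_to": task.episode_server}


def dispatch(task: RolloutTask, num_workers: int) -> list:
    """Delegate the same task to several parallel workers and gather the results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(run_rollout, i, task) for i in range(num_workers)]
        return [f.result() for f in futures]


task = RolloutTask(
    env_settings={"scenario": "example", "max_steps": 100},
    policy_distributions={"agent_0": {"p0": 0.5, "p1": 0.5}, "agent_1": {"q0": 1.0}},
    episode_server="episode_server",
)
results = dispatch(task, num_workers=4)
```

Bundling everything a worker needs into one immutable task object keeps the workers stateless, so any worker can run any task.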

Training Manager 

light_malib/training/training_manager.py

The Training Manager sets up multiple distributed trainers and assigns training tasks to each trainer. Training task descriptions consist of training configurations and details regarding the Policy and Episode buffers.

Data Buffer 

light_malib/buffer/

The Data Buffer serves as a repository for episodes and policies. The Episode Server saves new episodes submitted by the rollout workers, while trainers retrieve sampled episodes from the Episode Server for training. The Policy Server, on the other hand, stores updated policies submitted by the Training Manager. Rollout workers subsequently fetch these updated policies from the Policy Server for simulation.
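
The two data flows above (episodes from rollout workers to trainers, policies from training back to rollout workers) can be illustrated with a small in-process sketch. The classes below are hypothetical stand-ins rather than the classes under light_malib/buffer/; the method names push, sample, and pull, the bounded capacity, and the version counter are all assumptions made for illustration.

```python
import random
import threading
from collections import deque


class EpisodeServer:
    """Hypothetical bounded episode store: workers push, trainers sample."""

    def __init__(self, capacity: int, seed: int = 0):
        self._episodes = deque(maxlen=capacity)  # oldest episodes are evicted first
        self._lock = threading.Lock()
        self._rng = random.Random(seed)

    def push(self, episode: dict) -> None:
        with self._lock:
            self._episodes.append(episode)

    def sample(self, batch_size: int) -> list:
        with self._lock:
            k = min(batch_size, len(self._episodes))
            return self._rng.sample(list(self._episodes), k)


class PolicyServer:
    """Hypothetical policy store that keeps the latest version per policy id."""

    def __init__(self):
        self._policies = {}
        self._lock = threading.Lock()

    def push(self, policy_id: str, params: dict) -> int:
        with self._lock:
            version = self._policies.get(policy_id, (0, None))[0] + 1
            self._policies[policy_id] = (version, dict(params))
            return version

    def pull(self, policy_id: str) -> tuple:
        with self._lock:
            version, params = self._policies[policy_id]
            return version, dict(params)  # copy so callers cannot mutate the store


episodes = EpisodeServer(capacity=3)
policies = PolicyServer()
policies.push("p0", {"w": 0.0})            # initial policy
for step in range(5):                      # rollout side: submit new episodes
    episodes.push({"id": step, "reward": float(step)})
batch = episodes.sample(batch_size=2)      # training side: draw a batch
new_version = policies.push("p0", {"w": 0.1})  # training side: publish the update
latest = policies.pull("p0")               # rollout side: fetch the updated policy
```

Routing both flows through servers decouples the two sides: rollout workers and trainers never talk to each other directly, so either side can be scaled independently.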

Population-based Training Workflow

1. The Evaluation Manager conducts simulations between each pair of policies in the current population. These simulations evaluate the performance of each policy against the others, providing information about their relative strengths.
2. The Policy Data Manager updates the payoff table based on the simulation results. The payoff table captures the performance metrics and outcomes of the policy interactions. Using this information, the manager computes the Nash equilibrium.
3. The Agent Manager records the simulation results and generates the Nash mixture distribution of opponent policies.
4. Training and rollout processes are executed according to the framework illustrated in Figure 12. The rollout process simulates matches between the policies, while the training process updates the policies using the collected data. This process is monitored and terminated by the Stopper component. The Prefetcher component preloads data to expedite the training process.
5. The trained policy for the current generation is stored in the population. The procedure then returns to step one, initiating the next generation of evaluation and training.
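
The evaluation and Nash-mixture steps of this loop can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the function names are hypothetical, the game is assumed to be symmetric and zero-sum, and fictitious play is swapped in as a simple approximate solver because the text does not say how the Nash equilibrium is computed. A rock-paper-scissors population serves as toy data.

```python
def evaluate_pairs(population, simulate):
    """Build an antisymmetric payoff table from pairwise simulations."""
    n = len(population)
    payoff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            result = simulate(population[i], population[j])  # payoff to the row policy
            payoff[i][j] = result
            payoff[j][i] = -result  # zero-sum assumption
    return payoff


def fictitious_play(payoff, iterations=20000):
    """Approximate a symmetric zero-sum Nash mixture by fictitious play."""
    n = len(payoff)
    counts = [0] * n
    counts[0] = 1  # arbitrary initial play
    for _ in range(iterations):
        # Best response against the empirical mixture played so far.
        values = [sum(payoff[i][j] * counts[j] for j in range(n)) for i in range(n)]
        best = max(range(n), key=lambda i: values[i])
        counts[best] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Toy population: each "policy" always plays one move.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def simulate(a, b):
    """Return +1 if a beats b, -1 if b beats a, 0 for a draw."""
    if BEATS[a] == b:
        return 1.0
    if BEATS[b] == a:
        return -1.0
    return 0.0


population = ["rock", "paper", "scissors"]
payoff = evaluate_pairs(population, simulate)  # evaluation and payoff table
mixture = fictitious_play(payoff)              # opponent mixture, close to uniform here
```

In a full generation, the resulting mixture would be handed to the rollout side as the opponent distribution, and the newly trained policy would be appended to population before the next round of evaluation.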
