College of Engineering University of Wisconsin-Madison
Electrical and Computer Engineering The Fountain
Integrated gem5 + GPGPU-Sim Simulator



The integrated gem5 + GPGPU-Sim simulator is a CPU-GPU simulator for heterogeneous computing.

The integrated simulator infrastructure is built on gem5 and GPGPU-Sim, which run as two separate processes and communicate through shared memory in the Linux OS.

gem5 models the CPU cores and the memory subsystem, in which Ruby supports a MOESI directory coherence protocol; GPGPU-Sim models the streaming multiprocessors (SMs) and the on-chip interconnect within the GPU. The memory subsystem and DRAM model on the GPGPU-Sim side are completely removed, leaving only a set of request and response queues per memory controller (MC). GPGPU-Sim services its memory accesses by communicating with the gem5 memory subsystem through shared memory structures.

Lock-Step Execution

To keep both simulators running in lock-step, gem5 provides periodic SM-blocking ticks and memory ticks to GPGPU-Sim (configured through the GPU core and memory clock multipliers). gem5 issues one blocking tick for all SMs, and one memory tick per MC in GPGPU-Sim. gem5 triggers the SMs or an MC in GPGPU-Sim by setting a flag in the shared memory structure; gem5 then blocks until GPGPU-Sim completes the execution of a GPU cycle and resets the flag, which resumes gem5.
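The handshake described above can be sketched as a small model. This is only an illustration under stated assumptions: two threads stand in for the two simulator processes, a condition variable stands in for the flag in the shared memory structure, and the names TickFlag and run_lockstep are hypothetical rather than taken from the simulator source. Only a single flag is modeled, not the separate per-MC memory ticks.

```python
import threading


class TickFlag:
    """Hypothetical stand-in for one tick flag in the shared memory structure."""

    def __init__(self):
        self._pending = False
        self._cond = threading.Condition()

    def raise_and_wait(self):
        # gem5 side: set the flag, then block until GPGPU-Sim resets it.
        with self._cond:
            self._pending = True
            self._cond.notify_all()
            self._cond.wait_for(lambda: not self._pending)

    def wait_for_tick(self):
        # GPGPU-Sim side: block until gem5 has set the flag.
        with self._cond:
            self._cond.wait_for(lambda: self._pending)

    def reset(self):
        # GPGPU-Sim side: the GPU cycle is done, so release gem5.
        with self._cond:
            self._pending = False
            self._cond.notify_all()


def run_lockstep(num_ticks):
    """Run num_ticks handshakes and return the observed event order."""
    flag = TickFlag()
    log = []

    def gpgpu_sim():
        for cycle in range(num_ticks):
            flag.wait_for_tick()
            log.append(("gpu_cycle", cycle))  # one GPU cycle per tick
            flag.reset()

    worker = threading.Thread(target=gpgpu_sim)
    worker.start()
    for tick in range(num_ticks):
        log.append(("gem5_tick", tick))
        flag.raise_and_wait()
    worker.join()
    return log
```

Because gem5 only continues after the flag is reset, the log strictly alternates between a gem5 tick and the GPU cycle it triggered, which is the lock-step property the text describes.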

Shared Memory System

On the GPGPU-Sim side, each memory tick received for a particular MC causes that MC to push a pending request from its internal queue into the request queue in the shared memory structure, in FIFO order. Similarly, it pops any pending read responses in FIFO order from the response queue in the shared memory structure and pushes them into its internal response queue, to be returned to the appropriate SM.
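A minimal sketch of this per-MC behavior follows. The class and attribute names (SharedMcChannel, GpuMemoryPort, on_memory_tick) are hypothetical, and ordinary in-process deques stand in for the queues in the shared memory structure. The sketch also assumes one request is pushed per tick while all available responses are drained, which is one reading of the description above.

```python
from collections import deque


class SharedMcChannel:
    """Hypothetical per-MC queue pair standing in for the shared memory structure."""

    def __init__(self):
        self.requests = deque()   # GPGPU-Sim -> gem5
        self.responses = deque()  # gem5 -> GPGPU-Sim


class GpuMemoryPort:
    """GPGPU-Sim side of one memory controller (illustrative only)."""

    def __init__(self, channel):
        self.channel = channel
        self.pending_requests = deque()    # internal queue filled by the SMs
        self.returned_responses = deque()  # internal queue drained toward the SMs

    def on_memory_tick(self):
        # Push the oldest pending request, if any, into the shared request queue.
        if self.pending_requests:
            self.channel.requests.append(self.pending_requests.popleft())
        # Pop every available read response, oldest first, into the internal queue.
        while self.channel.responses:
            self.returned_responses.append(self.channel.responses.popleft())
```

Using popleft on one queue and append on the other preserves FIFO order in both directions.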

On the gem5 side, once GPGPU-Sim resets a pending memory tick, gem5 resumes and executes its portion of the memory tick. At the front end, an arbiter selects a request from either the CPU or the GPU to push into the front-end queue for scheduling. If the GPU wins the arbitration, gem5 pops a GPU memory request from the shared memory. Currently, an FR-FCFS policy is applied to the front-end queue to schedule a request and push it into the back-end command queue. The back end scans the command queue and queries the DRAM banks to issue commands. When a read/write command is issued, the request is pushed into a response queue, with its ready time set according to the CAS latency. Any response intended for the GPU is popped from gem5's response queue when it is ready and pushed into the response queue in the shared memory structure.
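To make the scheduling step concrete, here is a small sketch of FR-FCFS selection and of the ready-time computation. It is not the simulator's code: the Request fields, the open_rows mapping, and the function names are assumptions made for illustration, and "first ready" is approximated as "hits the currently open row of its bank".

```python
from collections import namedtuple

# Hypothetical request record; the real simulator's fields may differ.
Request = namedtuple("Request", ["source", "bank", "row", "arrival"])


def fr_fcfs_pick(queue, open_rows):
    """Return the index of the request FR-FCFS would schedule next, or None.

    open_rows maps a bank number to the row currently open in that bank.
    """
    if not queue:
        return None
    # First-ready: prefer requests that hit the open row of their bank.
    hits = [i for i, r in enumerate(queue) if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else range(len(queue))
    # First-come, first-served: the oldest candidate wins.
    return min(candidates, key=lambda i: queue[i].arrival)


def issue(request, now, cas_latency):
    """Return the response-queue entry (ready_time, request) for a command issued at `now`."""
    return (now + cas_latency, request)
```

For example, with an older request that misses and a younger request that hits an open row, fr_fcfs_pick returns the younger one; when no request hits, it falls back to the oldest.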

Note that the above procedures happen in reverse order in the code to model real hardware behavior.
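One common reason for evaluating stages back to front in a cycle-based simulator is that a request then advances by at most one stage per tick, as it would through hardware registers. The toy model below illustrates that effect only; the three-slot pipeline and the tick function are hypothetical and much simpler than the simulator's queues.

```python
def tick(stages, reverse=True):
    """Advance a pipeline of single-entry stages by one tick.

    stages[i] holds an item or None; items move from index i to index i + 1.
    With reverse=True the last stage is evaluated first.
    """
    last_source = len(stages) - 2
    order = range(last_source, -1, -1) if reverse else range(last_source + 1)
    for i in order:
        if stages[i] is not None and stages[i + 1] is None:
            stages[i + 1], stages[i] = stages[i], None
    return stages


# Stage 0: front-end queue, stage 1: command queue, stage 2: response queue.
back_to_front = tick(["req", None, None], reverse=True)
front_to_back = tick(["req", None, None], reverse=False)
```

Evaluated back to front, the request moves one stage; evaluated front to back, it falls through two stages within a single tick, which no real pipeline would allow.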

 Simulation Flow:

  1. gem5 starts with AtomicSimpleCPU to create a checkpoint right before the Region of Interest (ROI).
  2. gem5 restores from the checkpoint with the detailed O3CPU and the Ruby memory system.
  3. The integration-related code in gem5 is activated in the "dumpresetStats" pseudo-instruction if the "activate_gsim" option is set, so a "dumpresetStats" pseudo-instruction is inserted at the beginning of the ROI.
  4. gem5 and GPGPU-Sim run separately and communicate with each other through shared memory.
  5. If the GPU simulation finishes first, GPGPU-Sim notifies gem5 to stop providing ticks; if the CPU simulation finishes first, gem5 disables the "m5exit" pseudo-instruction, and thus the rcS script should keep trying "exit". See Running Simulator for details.

 Package Layout:

 Build Simulator:

gem5 and GPGPU-Sim are still built separately, and no additional requirements or steps are needed. For a quick start, below are some brief instructions. Please refer to the gem5 site and the README file in the GPGPU-Sim distribution for detailed instructions.

 Running Simulator:

  1. Configure
  2. Prepare
  3. Run
  4. Please refer to the run script in the run_example/ directory for a quick start. Below is a brief explanation of the script.

    1. Set the paths to the Rodinia input data for the CUDA binary and to the gem5 simulator binary.
    2. Clear the shared memory segments in case a previous simulation did not exit correctly and left shared memory in the system.
    3. Run gem5 with AtomicSimpleCPU and the classic memory system to create a checkpoint right before the ROI.
    4. Run gem5 with O3CPU and the Ruby memory system to restore from the checkpoint, setting the "activate_gsim" option and the GPU clock multiplier.
    5. Wait several seconds to ensure that shared memory creation completes, and then launch the GPGPU-Sim simulation.
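The cleanup in step 2 could look roughly like the sketch below. It assumes the simulators use System V shared memory segments, which the text does not state explicitly, and it relies on the standard Linux `ipcs -m` and `ipcrm -m` commands with the usual util-linux column layout (key, shmid, owner, ...). The function names are hypothetical, and the actual run script may do this differently.

```python
import getpass
import subprocess


def parse_shm_ids(ipcs_output, owner):
    """Extract the shmids owned by `owner` from `ipcs -m` output."""
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key: key, shmid, owner, perms, bytes, nattch.
        if len(fields) >= 6 and fields[0].startswith("0x") and fields[2] == owner:
            ids.append(int(fields[1]))
    return ids


def clear_stale_segments():
    """Remove every System V shared memory segment owned by the current user."""
    listing = subprocess.run(
        ["ipcs", "-m"], capture_output=True, text=True, check=True
    ).stdout
    for shmid in parse_shm_ids(listing, getpass.getuser()):
        # check=False: a segment may already have been removed.
        subprocess.run(["ipcrm", "-m", str(shmid)], check=False)
```

Note that clear_stale_segments removes all of the user's segments, including any that belong to unrelated programs, so it is best run on a machine dedicated to the simulation.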

 Configuration Notices:


If you use our Integrated gem5+GPGPU-Sim Simulator in your work, please cite:


For any technical questions, please send an email to hwang223 AT
