Integrated gem5 + GPGPU-Sim Simulator

Overview
Simulation Flow
Package Layout
Package Download
Build Simulator
Running Simulator
Configuration Notices
Publication
Contact

Overview:

The integrated gem5 + GPGPU-Sim simulator is a CPU-GPU simulator for heterogeneous computing.

The integrated simulator infrastructure is built on gem5 and GPGPU-Sim. gem5 and GPGPU-Sim run as two separate processes and communicate through shared memory in the Linux OS.

gem5 models the CPU cores and the memory subsystem, in which a MOESI directory coherence protocol is supported by Ruby. GPGPU-Sim models the streaming multiprocessors (SMs) and the on-chip interconnect within the GPU. The memory subsystem and DRAM model on the GPGPU-Sim side are completely removed, leaving only a set of request and response queues per memory controller (MC); GPGPU-Sim services its memory accesses by communicating with the memory subsystem of gem5 through shared memory structures.

Lock-Step Execution

To keep both simulators running in lock-step, gem5 provides periodic SM-blocking ticks and memory ticks (configured through the GPU core and memory clock multipliers) to GPGPU-Sim. gem5 issues one blocking tick for all SMs, and one memory tick per MC in GPGPU-Sim. gem5 triggers the SMs or an MC in GPGPU-Sim by setting a flag in the shared memory structure; gem5 then blocks until GPGPU-Sim completes the execution of a GPU cycle and resets the flag, which resumes gem5.
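
The handshake described above can be illustrated with a minimal sketch. It is not the simulator's code: two Python threads stand in for the two processes, and the names TickFlag, gem5_loop and gpgpusim_loop are hypothetical. A plain attribute stands in for the flag in the shared memory structure.

```python
import threading
import time


class TickFlag:
    """Hypothetical stand-in for one flag in the shared memory structure."""

    def __init__(self):
        self.pending = False


def gpgpusim_loop(flag, cycles, log):
    # GPGPU-Sim side: wait for a tick, run one GPU cycle, reset the flag.
    done = 0
    while done < cycles:
        if flag.pending:
            log.append(("gpu_cycle", done))
            done += 1
            flag.pending = False  # resetting the flag resumes gem5
        else:
            time.sleep(0)  # yield while no tick is pending


def gem5_loop(flag, cycles, log):
    # gem5 side: set the flag, then block until GPGPU-Sim resets it.
    for tick in range(cycles):
        log.append(("gem5_tick", tick))
        flag.pending = True
        while flag.pending:
            time.sleep(0)


def run_lockstep(cycles):
    flag = TickFlag()
    log = []
    gpu = threading.Thread(target=gpgpusim_loop, args=(flag, cycles, log))
    gpu.start()
    gem5_loop(flag, cycles, log)
    gpu.join()
    return log
```

Calling run_lockstep(3) returns a log in which every gem5 tick is followed by exactly one GPU cycle, which is the lock-step property the flag is meant to enforce.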

Shared Memory System

On the GPGPU-Sim side, on each memory tick received for a particular MC, GPGPU-Sim pushes a pending request from its internal queue into the request queue in the shared memory structure, in FIFO order. Similarly, it pops any pending read responses in FIFO order from the response queue in the shared memory structure and pushes them into its internal response queue, to be returned to the appropriate SM.

On the gem5 side, once a pending memory tick is reset by GPGPU-Sim, gem5 resumes and executes its portion of the memory tick. At the front end, an arbiter selects a request from either the CPU or the GPU to push into the front-end queue for scheduling. If the GPU wins the arbitration, gem5 pops a GPU memory request from the shared memory. Currently, an FR-FCFS policy is applied to the front-end queue to schedule a request and push it into the back-end command queue. At the back end, gem5 scans the command queue and queries the DRAM banks to issue commands. When a read/write command is issued, the request is pushed into a response queue, with its ready time set according to the CAS latency. Any response intended for the GPU is popped from gem5's response queue when it is ready and pushed into the response queue in the shared memory structure.

Note that the above procedures happen in reverse order in the code to model real hardware behavior.
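
The queue traffic for one MC can be sketched as follows. This is a simplified illustration, not the simulator's data structures: SharedMC and the function names are hypothetical, requests are plain dictionaries, and the arbiter, the FR-FCFS scheduler and the DRAM bank model are left out, so a request goes straight from the shared request queue to gem5's response queue.

```python
from collections import deque


class SharedMC:
    """Hypothetical stand-in for the per-MC queues in shared memory."""

    def __init__(self):
        self.requests = deque()
        self.responses = deque()


def gpgpusim_memory_tick(internal_requests, internal_responses, shared):
    # Push one pending request, oldest first.
    if internal_requests:
        shared.requests.append(internal_requests.popleft())
    # Pop any pending read responses in FIFO order.
    while shared.responses:
        internal_responses.append(shared.responses.popleft())


def gem5_issue(gem5_responses, request, now, cas_latency):
    # The ready time is set according to the CAS latency.
    gem5_responses.append((now + cas_latency, request))


def gem5_return_ready(gem5_responses, shared, now):
    # Hand ready responses that belong to the GPU back through shared memory.
    while gem5_responses and gem5_responses[0][0] <= now:
        _, request = gem5_responses.popleft()
        if request["source"] == "gpu":
            shared.responses.append(request)
```

A request issued at time 10 with a CAS latency of 5 stays in gem5's response queue until time 15, and reaches GPGPU-Sim's internal response queue on the next memory tick after that.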

Simulation Flow:

gem5 starts with AtomicSimpleCPU to create a checkpoint right before the Region of Interest (ROI).
gem5 restores from the checkpoint with the detailed O3CPU and the Ruby memory system.
The integration-related code in gem5 is activated in the "dumpresetStats" pseudo-instruction if the "activate_gsim" option is set, so a "dumpresetStats" pseudo-instruction is inserted at the beginning of the ROI.
gem5 and GPGPU-Sim run separately and communicate with each other through shared memory.
If the GPU simulation finishes first, GPGPU-Sim notifies gem5 to stop providing ticks; if the CPU simulation finishes first, gem5 disables the "m5exit" pseudo-instruction, so the rcS script should keep retrying "exit". See Running Simulator for details.

Package Layout:

/alpha-ruby/codebase_alpha.tar.bz2
gem5 and GPGPU-Sim package. This version is tested with Alpha ISA and Ruby memory system.
/arm-classic/codebase_arm.tar.bz2
gem5 and GPGPU-Sim package. This version is tested with ARM ISA and classic memory system. (It may need more testing.)
/utils-alpha/disk-image-alpha.tar.bz2
Alpha full system files: pre-compiled Linux kernel, PAL/Console binaries and a file system from gem5 site.

A set of pre-compiled OpenMP binaries of the Rodinia benchmark suite is installed under /rodinia/bin.ckpt/, with the ROI tagged by m5 pseudo-instructions.
/utils-alpha/run_alpha_example.tar
A sample simulation directory to run Hotspot benchmark.
- Simulation configuration files for GPGPU-Sim: gpgpusim.config and icnt_config.txt.
- CUDA binary: hotspot.
- rcS script for gem5 simulation: hotspot.rcS.
- Keys for shared memory structures: keys.txt. See Running Simulator for details.
- A run script: run_alpha.sh. Users may go through the script to change various paths accordingly.
/utils-alpha/bench_build.tar
Hooks tool and hotspot source code.
- hooks/: Provides a C interface to gem5 pseudo-instructions, so that a benchmark program can interact with the simulator, e.g. creating a checkpoint, or dumping and resetting statistics.
- hotspot_omp/: The OpenMP version of the Hotspot benchmark with pseudo-instructions inserted. Search for "wangh" to find the modifications.
Users may go through the Makefile to set the path to the Alpha cross-compiler. A pre-compiled Alpha cross-compiler can be downloaded from gem5 site.
/utils-arm/disk-image-arm.tar.bz2
ARM full system files from gem5 site. A simple C test program (vector add) is installed under /wangh/bin/test.

Linux 3.3 VExpress_EMM kernel is used to support 1GB memory.
/utils-arm/run_arm_example.tar
A sample simulation directory. Note that the arm-classic package uses the classic memory system, so ignore the Ruby-related material throughout this page.
- Simulation configuration files for GPGPU-Sim: gpgpusim.config and icnt_config.txt.
- CUDA binary: hotspot.
- rcS script for gem5 simulation: test.rcS.
- Keys for shared memory structures: keys.txt. See Running Simulator for details.
- A run script: run_arm.sh. Users may go through the script to change various paths accordingly.

Build Simulator:

gem5 and GPGPU-Sim are still built separately, and no additional requirements or steps are needed. For a quick start, brief instructions are given below. Please refer to gem5 site and the README file in the GPGPU-Sim distribution for detailed instructions.

gem5:
See gem5 site for dependencies.

Type "scons build/ALPHA_FS_MOESI_CMP_directory/gem5.opt" in gem5_integ/ for the Alpha-Ruby version.

Type "scons build/ARM/gem5.opt" in gem5/ for the ARM-Classic version.
GPGPU-Sim:
Set the CUDA_INSTALL_PATH environment variable. The simulator is built on an older version of GPGPU-Sim, so CUDA Toolkit v3.1 is recommended.

Type "make" in gpgpu-sim/.

Running Simulator:

Configure

Set the path to disk-image/ in /gem5_integ/configs/Syspath.py.
Set the image file name in /gem5_integ/configs/common/Benchmarks.py.

Prepare

Place the CUDA binary, the GPGPU-Sim configuration files, and the rcS file for gem5 in the simulation directory.
Place a file called keys.txt in the simulation directory. See /run_example/keys.txt for reference.

Run

Please refer to the run script in the run_example/ directory for a quick start. Below is a brief explanation of the script.

Set the paths to the input data of the Rodinia package for the CUDA binary, and to the gem5 simulator binary.
Clear the shared memory segments, in case a previous simulation did not exit correctly and left shared memory in the system.
Run gem5 with AtomicSimpleCPU and the classic memory system to create a checkpoint right before the ROI.

$GEM5_ROOT/build/gem5.opt $GEM5_ROOT/configs/example/fs.py --num-cpus=4 --clock=4GHz --script=hotspot.rcS

Run gem5 with O3CPU and the Ruby memory system to restore from the checkpoint, setting the "activate_gsim" option and the GPU and memory clock multipliers.

$GEM5_ROOT/build/gem5.opt $GEM5_ROOT/configs/example/ruby_fs.py --checkpoint-restore=1 --num-cpus=4 --clock=4GHz --gpu_l2_clock=10 --mem_clock_multiplier=5 --activate_gsim

Wait several seconds to ensure shared memory creation completes, and then launch the GPGPU-Sim simulation.

./hotspot 64 64 1 1 temp_64 power_64

Configuration Notices:

Frequencies:

The CPU frequency is set by the gem5 option "--clock". The memory frequency is set through "--mem_clock_multiplier", and the GPU frequency is set through the clock multiplier option "--gpu_l2_clock". The frequency values set in the GPGPU-Sim configuration files are deprecated.

For example, --clock=4.0GHz, --mem_clock_multiplier=5.0, --gpu_l2_clock=10.0 sets the CPU frequency to 4GHz, the memory frequency to 800MHz, and the GPU L2 cache frequency to 400MHz; the core frequency is half of the L2 frequency, i.e. 200MHz.
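
The arithmetic in this example can be reproduced with a short sketch. The function name derive_clocks_mhz is hypothetical; the division of the CPU clock by each multiplier and the halving of the L2 frequency are inferred from the example figures above, not taken from the simulator source.

```python
def derive_clocks_mhz(cpu_clock_mhz, mem_clock_multiplier, gpu_l2_clock):
    """Derive memory and GPU frequencies (in MHz) from the gem5 options."""
    # Each multiplier divides the CPU clock (inferred from the example).
    memory = cpu_clock_mhz / mem_clock_multiplier
    gpu_l2 = cpu_clock_mhz / gpu_l2_clock
    # The GPU core runs at half the L2 frequency.
    gpu_core = gpu_l2 / 2
    return {"cpu": cpu_clock_mhz, "memory": memory,
            "gpu_l2": gpu_l2, "gpu_core": gpu_core}


clocks = derive_clocks_mhz(4000.0, 5.0, 10.0)
```

With the values from the example, clocks holds 800 MHz for memory, 400 MHz for the GPU L2 cache and 200 MHz for the GPU core.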

Note that in GPGPU-Sim the width of the pipeline is equal to the warp size. To compensate for this, SMs run at 1/4 of the frequency reported in the product specification. For example, the 1.3GHz shader clock rate of NVIDIA's Quadro FX 5800 corresponds to a 325MHz SIMT core clock in GPGPU-Sim. See the GPGPU-Sim Manual for details.
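
As a quick check of this conversion, the following sketch applies the 1/4 factor to a product shader clock. The function name simt_core_clock_mhz is hypothetical.

```python
def simt_core_clock_mhz(shader_clock_mhz):
    """Convert a product-specification shader clock to the SIMT core clock."""
    # SMs run at 1/4 of the shader clock in the product specification.
    return shader_clock_mhz / 4


quadro_fx_5800 = simt_core_clock_mhz(1300.0)
```

For the 1.3GHz shader clock quoted above, this gives 325 MHz.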
Memory Settings:

The number of memory channels is set by "--num-dirs" on the gem5 side, and by "gpgpu_n_mem" in the configuration file on the GPGPU-Sim side.

Note that num-dirs, numa-high-bit, ranks_per_dimm, dimms_per_channel and mem_addr_map_mask on the gem5 side must be consistent with gpgpu_n_mem, gpgpu_mem_addr_mapping and the nbk field of gpgpu_dram_timing_opt in the GPGPU-Sim configuration file.

*numa-high-bit denotes the position of the highest channel bit in the DRAM address map. Search for "m_numa_bit_high" in /gem5_integ/src/mem/ruby/system/MemoryControl.cc for details.
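
A minimal sketch of these two constraints follows. It rests on assumptions that should be checked against MemoryControl.cc: the channel count is a power of two, and the channel index occupies log2(num_dirs) contiguous address bits ending at numa-high-bit. The function names are hypothetical.

```python
def check_channel_count(num_dirs, gpgpu_n_mem):
    """Raise if the two simulators disagree on the number of channels."""
    if num_dirs != gpgpu_n_mem:
        raise ValueError(
            f"--num-dirs={num_dirs} does not match gpgpu_n_mem={gpgpu_n_mem}")


def channel_index(address, numa_high_bit, num_dirs):
    """Extract the channel index from an address (assumed bit layout)."""
    if num_dirs <= 0 or num_dirs & (num_dirs - 1):
        raise ValueError("num_dirs is assumed to be a power of two")
    bits = num_dirs.bit_length() - 1  # log2(num_dirs)
    if bits == 0:
        return 0  # a single channel needs no channel bits
    low = numa_high_bit - bits + 1  # lowest channel bit
    return (address >> low) & ((1 << bits) - 1)
```

Under these assumptions, four channels with numa-high-bit set to 7 select the channel from address bits 7 and 6.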

The provided example (Alpha) includes the above settings explicitly on the command line as a reference.

Publication:

If you use our Integrated gem5+GPGPU-Sim Simulator in your work, please cite:

Hao Wang, Vijay Sathish, Ripudaman Singh, Michael Schulte, Nam Sung Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors", IEEE/ACM Int. Conf. on Parallel Architecture and Compilation Techniques (PACT), Sep. 2012

Contact:

For any technical questions, please send an email to hwang223 AT wisc.edu

Personal Homepage