Integrated gem5 + GPGPU-Sim Simulator
Last modified on: CST
The integrated gem5 + GPGPU-Sim simulator is a CPU-GPU simulator for heterogeneous computing.
The integrated simulator infrastructure is built on gem5 and GPGPU-Sim. gem5 and GPGPU-Sim run as two separate processes and communicate through shared memory under Linux.
gem5 models the CPU cores and the memory subsystem, in which a MOESI directory coherence protocol is supported by Ruby. GPGPU-Sim models the streaming multiprocessors (SMs) and the on-chip interconnect within the GPU. The memory subsystem and DRAM model on the GPGPU-Sim side are completely removed, leaving only a set of request and response queues per memory controller (MC); GPGPU-Sim communicates with the memory subsystem of gem5 through shared memory structures to service its memory accesses.
To ensure that both simulators run in lock-step, gem5 provides periodic SM-blocking ticks and memory ticks (configured through the GPU core and memory clock multipliers) to GPGPU-Sim. gem5 issues one blocking tick for all SMs, and one memory tick per MC in GPGPU-Sim. gem5 triggers the SMs or an MC in GPGPU-Sim by setting a flag in the shared memory structure; gem5 then blocks until GPGPU-Sim completes the execution of a GPU cycle and resets the flag, which lets gem5 resume.
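The handshake described above can be sketched as follows. This is only an illustration, not the simulator's code: two Python threads stand in for the two processes, a `threading.Condition` stands in for the flag in the shared memory structure, and the names `TickFlag` and `run_lockstep` are hypothetical.

```python
import threading


class TickFlag:
    """Hypothetical stand-in for one tick flag in the shared memory structure."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = False

    def trigger_and_block(self):
        """gem5 side: set the flag, then block until the GPU side resets it."""
        with self._cond:
            self._pending = True
            self._cond.notify_all()
            self._cond.wait_for(lambda: not self._pending)

    def wait_for_trigger(self):
        """GPGPU-Sim side: block until gem5 sets the flag."""
        with self._cond:
            self._cond.wait_for(lambda: self._pending)

    def reset(self):
        """GPGPU-Sim side: clear the flag so that gem5 can resume."""
        with self._cond:
            self._pending = False
            self._cond.notify_all()


def run_lockstep(num_ticks):
    """Run num_ticks handshakes and return the observed order of events."""
    flag = TickFlag()
    events = []

    def gpgpu_sim():
        for i in range(num_ticks):
            flag.wait_for_trigger()
            events.append(f"gpgpu-sim: GPU cycle {i}")
            flag.reset()

    gpu = threading.Thread(target=gpgpu_sim)
    gpu.start()
    for i in range(num_ticks):
        events.append(f"gem5: issue tick {i}")
        flag.trigger_and_block()
        events.append(f"gem5: resume after tick {i}")
    gpu.join()
    return events
```

Because gem5 blocks until the flag is reset, each GPU cycle is always recorded between the matching "issue" and "resume" events, whichever thread the scheduler happens to favor.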
On the GPGPU-Sim side, on each memory tick received for a particular MC, GPGPU-Sim pushes a pending request from its internal queue into the request queue in the shared memory structure, in FIFO order. Similarly, it pops any pending read responses in FIFO order from the response queue in the shared memory structure and pushes them into its internal response queue, to be returned to the appropriate SM.
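A minimal sketch of one such memory tick for a single MC is shown below. It assumes that one request is forwarded per tick and that all waiting responses are drained, which is one reading of the description above; the class `MemoryControllerPort` and its field names are hypothetical, and plain `deque` objects stand in for the queues in the shared memory structure.

```python
from collections import deque


class MemoryControllerPort:
    """Hypothetical model of the queues GPGPU-Sim keeps for one MC."""

    def __init__(self):
        self.internal_requests = deque()   # requests produced by the SMs
        self.internal_responses = deque()  # responses to hand back to the SMs
        self.shared_requests = deque()     # stand-in for the shared request queue
        self.shared_responses = deque()    # stand-in for the shared response queue

    def memory_tick(self):
        """Handle one memory tick received from gem5 for this MC."""
        # Forward the oldest pending request, if any, in FIFO order.
        if self.internal_requests:
            self.shared_requests.append(self.internal_requests.popleft())
        # Pull back every read response that gem5 has made available.
        while self.shared_responses:
            self.internal_responses.append(self.shared_responses.popleft())


port = MemoryControllerPort()
port.internal_requests.extend(["read A", "read B"])
port.shared_responses.extend(["data X", "data Y"])
port.memory_tick()
```

After this single tick, `read A` has moved to the shared request queue, `read B` is still waiting, and both responses sit in the internal response queue in their original order.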
On the gem5 side, once a pending memory tick is reset by GPGPU-Sim, gem5 resumes and executes its portion of the memory tick. At the front end, an arbiter selects a request from either the CPU or the GPU to push into the front-end queue for scheduling. If the GPU wins the arbitration, gem5 pops a GPU memory request from the shared memory. Currently, an FR-FCFS policy is applied to the front-end queue to schedule a request and push it into the back-end command queue. At the back end, gem5 scans the command queue and queries the DRAM banks to issue commands. When a read/write command is issued, the request is pushed into a response queue, with its ready time set according to the CAS latency. Any response intended for the GPU is popped from gem5's response queue when it is ready and pushed into the response queue in the shared memory structure.
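For readers unfamiliar with FR-FCFS (first-ready, first-come-first-served), the selection rule can be sketched as below. This is the textbook form of the policy, not code taken from the simulator: the oldest request that hits a currently open row is served first, and the oldest request overall is served when nothing hits. The `Request` type and the `open_rows` mapping are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # arrival order in the front-end queue
    bank: int     # target DRAM bank
    row: int      # target DRAM row
    source: str   # "cpu" or "gpu"


def fr_fcfs_pick(front_end_queue, open_rows):
    """Remove and return the next request under FR-FCFS.

    front_end_queue is a list kept in arrival order; open_rows maps a bank
    index to the row currently open in that bank.
    """
    # First ready: the oldest request whose row is already open.
    for i, req in enumerate(front_end_queue):
        if open_rows.get(req.bank) == req.row:
            return front_end_queue.pop(i)
    # Otherwise first come, first served.
    return front_end_queue.pop(0) if front_end_queue else None


queue = [
    Request(arrival=0, bank=0, row=1, source="cpu"),
    Request(arrival=1, bank=0, row=5, source="gpu"),
    Request(arrival=2, bank=1, row=7, source="cpu"),
]
first = fr_fcfs_pick(queue, open_rows={0: 5})
second = fr_fcfs_pick(queue, open_rows={0: 5})
```

Here `first` is the GPU request that arrived second, because it hits the open row 5 in bank 0; `second` falls back to the oldest remaining request.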
Note that the above procedures happen in reverse order in the code, to model real hardware behavior.
gem5 and GPGPU-Sim package. This version is tested with Alpha ISA and Ruby memory system.
gem5 and GPGPU-Sim package. This version is tested with ARM ISA and classic memory system. (may need more testing)
Alpha full system files, pre-compiled Linux kernel, PAL/Console binaries and a file system from gem5 site.
A set of pre-compiled OpenMP binaries of Rodinia benchmark suite is installed under /rodinia/bin.ckpt/ with ROI tagged by m5 pseudo-instructions.
A sample simulation directory to run Hotspot benchmark.
Hooks tool and hotspot source code.
ARM full system files from gem5 site. A simple C test program (vector add) is installed under /wangh/bin/test.
Linux 3.3 VExpress_EMM kernel is used to support 1GB memory.
A sample simulation directory. Note that the arm-classic package uses classic memory, so ignore the Ruby-specific material throughout this page.
gem5 and GPGPU-Sim are still built separately, and no additional requirements or steps are needed. For a quick start, brief instructions are given below. Please refer to the gem5 site and the README file in the GPGPU-Sim distribution for detailed instructions.
See gem5 site for dependencies.
Type "scons build/ALPHA_FS_MOESI_CMP_directory/gem5.opt" in gem5_integ/ for Alpha-Ruby version.
Type "scons build/ARM/gem5.opt" in gem5/ for ARM - Classic version.
Set the CUDA_INSTALL_PATH environment variable. The simulator is built on an older version of GPGPU-Sim, so CUDA Toolkit v3.1 is recommended.
Type "make" in gpgpu-sim/.
Please refer to the run script in run_example/ directory to help with a quick start. Below is a brief explanation of the script.
CPU frequency is set by the gem5 option "--clock". Memory frequency is set through "--mem_clock_multiplier". GPU frequency is set through the clock multiplier option "--gpu_l2_clock"; the frequency values set in the GPGPU-Sim configuration files are deprecated.
For example, --clock=4.0GHz, --mem_clock_multiplier=5.0, --gpu_l2_clock=10.0 sets the CPU frequency to 4GHz, the memory frequency to 800MHz, and the GPU L2 cache frequency to 400MHz; the core frequency is half of the L2 frequency, i.e. 200MHz.
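The arithmetic in this example can be written out as a small helper. It assumes, as the numbers above imply, that each multiplier divides the CPU clock and that the GPU core runs at half the L2 frequency; the function name `derive_clocks_mhz` is hypothetical and not part of either simulator.

```python
def derive_clocks_mhz(cpu_clock_mhz, mem_clock_multiplier, gpu_l2_clock):
    """Derive memory and GPU frequencies (in MHz) from the CPU clock.

    Assumption: each multiplier acts as a divisor of the CPU clock, and the
    GPU core clock is half of the GPU L2 clock.
    """
    memory = cpu_clock_mhz / mem_clock_multiplier
    gpu_l2 = cpu_clock_mhz / gpu_l2_clock
    return {
        "cpu": cpu_clock_mhz,
        "memory": memory,
        "gpu_l2": gpu_l2,
        "gpu_core": gpu_l2 / 2,
    }


# --clock=4.0GHz, --mem_clock_multiplier=5.0, --gpu_l2_clock=10.0
clocks = derive_clocks_mhz(4000.0, 5.0, 10.0)
```

With these inputs, `clocks` holds 800 MHz for memory, 400 MHz for the GPU L2 cache and 200 MHz for the GPU core.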
Note that in GPGPU-Sim the width of the pipeline is equal to the warp size. To compensate for this, SMs run at 1/4 of the frequency reported in the product specification. For example, the 1.3GHz shader clock rate of NVIDIA's Quadro FX 5800 corresponds to a 325MHz SIMT core clock in GPGPU-Sim. See the GPGPU-Sim Manual for details.
The number of memory channels is set by "--num-dirs" on the gem5 side, and by "gpgpu_n_mem" in the configuration file on the GPGPU-Sim side.
Note that num-dirs, numa-high-bit, ranks_per_dimm, dimms_per_channel and mem_addr_map_mask on the gem5 side must be consistent with gpgpu_n_mem, gpgpu_mem_addr_mapping and the nbk field of gpgpu_dram_timing_opt in the GPGPU-Sim configuration file.
*numa-high-bit denotes the position of the highest channel bit in the DRAM address map. Search for "m_numa_bit_high" in /gem5_integ/src/mem/ruby/system/MemoryControl.cc for details.
The provided example (Alpha) spells out the above settings on the command line as a reference.
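To make the role of numa-high-bit concrete, the sketch below extracts a channel index from a physical address. It assumes that the channel bits form one contiguous field whose topmost bit is numa-high-bit and that the channel count is a power of two; the actual mapping is defined in MemoryControl.cc, and the function name `channel_index` is hypothetical.

```python
def channel_index(address, numa_high_bit, num_channels):
    """Select the memory channel for an address.

    Assumption: the channel field is contiguous, ends at numa_high_bit
    (bit 0 is the least significant bit), and num_channels is a power of two.
    """
    if num_channels < 1 or num_channels & (num_channels - 1):
        raise ValueError("num_channels must be a power of two")
    channel_bits = num_channels.bit_length() - 1
    if channel_bits == 0:
        return 0  # a single channel needs no address bits
    low_bit = numa_high_bit - channel_bits + 1
    return (address >> low_bit) & (num_channels - 1)


# Four channels with numa-high-bit = 9: the channel field is bits 8..9.
idx = channel_index(0x300, numa_high_bit=9, num_channels=4)
```

In this example bits 8 and 9 of `0x300` are both set, so the address maps to channel 3.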
If you use our Integrated gem5+GPGPU-Sim Simulator in your work, please cite:
For any technical questions, please send an email to hwang223 AT wisc.edu