5.1.2012

Multi-GPU CUDA stress test

I work with GPUs a lot and have seen them fail in a variety of ways: memory or cores overclocked too far (often at the factory), unstable when hot, unstable only at ambient temperature (not kidding), partially unreliable memory, and so on. What's more, GPUs that are just a little unstable often fail silently and produce incorrect results, and I have seen such GPUs consistently produce correct results with some applications and incorrect results with others.

What I needed in my toolbox was a stress test for multi-GPU setups that uses all of the GPUs' memory and checks the results while keeping the GPUs burning. There are not a lot of tools that can do this, let alone on Linux, so I hacked together my own. It runs on Linux and uses the CUDA driver API.

My program forks one process for each GPU on the machine, one process for keeping track of the GPU temperatures where available (e.g. Fermi Teslas don't have temperature sensors), and one process for reporting the progress. Each GPU process allocates 90% of the free GPU memory, initializes two random 1024*1024 matrices, and continuously multiplies them with CUBLAS matrix-matrix multiplication routines, storing the results across the allocated memory. Correctness is checked by comparing the results of new calculations against a previous one -- on the GPU. This way the GPUs are 100% busy all the time while the CPUs stay idle. The number of erroneous calculations is brought back to the CPU and reported to the user along with the number of operations performed so far and the GPU temperatures.
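
The sizing and checking steps described above can be sketched on the CPU in plain Python. This is an illustration under assumptions, not the gpu_burn source: the 4-byte element size, the helper names (plan_buffers, count_mismatches, run_rounds, make_flaky) and the injected fault are all hypothetical, and the real tool does the multiplication with CUBLAS and the comparison on the GPU.

```python
import random

MIB = 1 << 20


def plan_buffers(free_bytes, n=1024, elem_size=4):
    """Return (bytes_to_use, result_slots) when 90% of free memory is used.

    elem_size=4 (single precision) is an assumption; two n*n input
    matrices are assumed to share the allocation with the result slots.
    """
    use_bytes = free_bytes * 9 // 10
    matrix_bytes = n * n * elem_size
    slots = max((use_bytes - 2 * matrix_bytes) // matrix_bytes, 0)
    return use_bytes, slots


def random_matrix(n, rng):
    """Build an n*n matrix of random floats in [0, 1)."""
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Naive matrix product; stands in for the CUBLAS call."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def count_mismatches(reference, result):
    """Count the elements that differ from the reference result."""
    return sum(r != c
               for ref_row, res_row in zip(reference, result)
               for r, c in zip(ref_row, res_row))


def run_rounds(a, b, rounds, multiply=matmul):
    """Multiply repeatedly and compare every result against the first one."""
    reference = multiply(a, b)
    return sum(count_mismatches(reference, multiply(a, b)) for _ in range(rounds))


def make_flaky(every):
    """Hypothetical unstable device: corrupts one element on every `every`-th call."""
    calls = 0

    def flaky(a, b):
        nonlocal calls
        calls += 1
        c = matmul(a, b)
        if calls % every == 0:
            c[0][0] += 1.0  # injected fault, purely for illustration
        return c

    return flaky


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = random_matrix(8, rng), random_matrix(8, rng)
    use_bytes, slots = plan_buffers(2525 * MIB)
    print(use_bytes // MIB, "MiB used,", slots, "result slots")
    print("healthy errors:", run_rounds(a, b, 6))
    print("flaky errors:", run_rounds(a, b, 6, multiply=make_flaky(3)))
```

With 2525 MiB free, plan_buffers arrives at 2272 MiB to use, which agrees with the "using 2272 MB" figure printed for device 0 in the sample run. A healthy multiply repeats bit-for-bit, so any mismatch against the stored result points at the hardware rather than at the arithmetic.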

Real-time progress is printed continuously, with a summary roughly every 10%, as shown below. The matrices-processed counts are cumulative, whereas the error counts are for that summary only. Values for different GPUs are separated by slashes. The program exits with a conclusion after it has been running for the number of seconds given as the only command line parameter. The example below is from a machine with one working GPU and one faulty one (too much factory overclocking made it slightly unstable, though you wouldn't have noticed it during gaming):

% ./gpu_burn 120
GPU 0: GeForce GTX 580 (UUID: N/A)
GPU 1: GeForce GTX 580 (UUID: N/A)
Initialized device 0 with 3071 MB of memory (2525 MB available, using 2272 MB of it)
Initialized device 1 with 3071 MB of memory (2987 MB available, using 2688 MB of it)
10.8%  proc'd: 3396 / 3180   errors: 0 / 0   temps: 53 C / 49 C 
	Summary at:   Thu Jan  5 01:12:01 EET 2012

22.5%  proc'd: 7358 / 6890   errors: 2264 (WARNING!) / 0   temps: 57 C / 58 C 
	Summary at:   Thu Jan  5 01:12:15 EET 2012

34.2%  proc'd: 10754 / 10653   errors: 1703 (WARNING!) / 0   temps: 62 C / 64 C 
	Summary at:   Thu Jan  5 01:12:29 EET 2012

45.8%  proc'd: 14716 / 14363   errors: 3399 (WARNING!) / 0   temps: 65 C / 68 C 
	Summary at:   Thu Jan  5 01:12:43 EET 2012

56.7%  proc'd: 18678 / 17808   errors: 3419 (WARNING!) / 0   temps: 67 C / 72 C 
	Summary at:   Thu Jan  5 01:12:56 EET 2012

68.3%  proc'd: 22640 / 22110   errors: 5715 (WARNING!) / 0   temps: 69 C / 75 C 
	Summary at:   Thu Jan  5 01:13:10 EET 2012

83.3%  proc'd: 25470 / 26130   errors: 7428 (WARNING!) / 0   temps: 71 C / 77 C 
	Summary at:   Thu Jan  5 01:13:28 EET 2012

98.3%  proc'd: 30564 / 30150   errors: 9183 (WARNING!) / 0   temps: 71 C / 79 C 
	Summary at:   Thu Jan  5 01:13:46 EET 2012

100.0%  proc'd: 32828 / 32160   errors: 9219 (WARNING!) / 0   temps: 72 C / 80 C 
Killing processes.. done

Tested 2 GPUs:
	0: FAULTY
	1: OK
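
If you want to post-process a log of such a run, the progress lines are regular enough to parse. The following is a minimal sketch based only on the line layout visible in the transcript above; the function names parse_progress and gpus_with_errors are made up for this example, and flagging a GPU on any nonzero error count mirrors the verdict shown here rather than necessarily the tool's own logic.

```python
import re

# Layout assumed from the sample output:
#   22.5%  proc'd: 7358 / 6890   errors: 2264 (WARNING!) / 0   temps: 57 C / 58 C
LINE_RE = re.compile(
    r"^\s*(?P<pct>\d+(?:\.\d+)?)%\s+proc'd:\s*(?P<procd>.*?)"
    r"\s+errors:\s*(?P<errors>.*?)\s+temps:\s*(?P<temps>.*?)\s*$"
)


def _ints(field):
    """Extract every integer from a slash-separated field."""
    return [int(x) for x in re.findall(r"\d+", field)]


def parse_progress(line):
    """Parse one progress line into a dict, or return None for other lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "percent": float(m.group("pct")),
        "processed": _ints(m.group("procd")),
        "errors": _ints(m.group("errors")),
        "temps_c": _ints(m.group("temps")),
    }


def gpus_with_errors(lines):
    """Return the sorted indices of GPUs that reported errors on any line."""
    bad = set()
    for line in lines:
        parsed = parse_progress(line)
        if parsed is None:
            continue  # e.g. "Summary at: ..." or blank lines
        bad.update(i for i, n in enumerate(parsed["errors"]) if n > 0)
    return sorted(bad)


if __name__ == "__main__":
    sample = [
        "10.8%  proc'd: 3396 / 3180   errors: 0 / 0   temps: 53 C / 49 C ",
        "\tSummary at:   Thu Jan  5 01:12:01 EET 2012",
        "22.5%  proc'd: 7358 / 6890   errors: 2264 (WARNING!) / 0   temps: 57 C / 58 C ",
    ]
    print(gpus_with_errors(sample))
```

Since the error counts are per summary while the processed counts are cumulative, the sketch only asks whether a GPU ever reported an error and does not try to add the counts up.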

With this tool I've been able to spot unstable GPUs that performed well under every other load they were subjected to. So far it has also never missed a GPU that was known to be unstable. *knocks on wood*

Grab it here: gpu_burn.tar.gz. To burn for an hour: make && ./gpu_burn 3600
You might have to point the Makefile to your CUDA installation if it's not in the default path, and also to a version of gcc that your nvcc can work with. The program expects to find nvidia-smi on your default path.
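
Before building, it can save a moment to check that the needed programs are actually reachable. A small sketch, assuming the build wants make, gcc and nvcc on the PATH in addition to the nvidia-smi mentioned above (the function name missing_tools and the default tool list are assumptions for this example, not something the tarball provides):

```python
import shutil


def missing_tools(tools=("nvidia-smi", "nvcc", "gcc", "make")):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```

If nvcc shows up as missing even though CUDA is installed, that is the case where the Makefile needs to be pointed at the CUDA installation by hand.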

Comments

21.1.2012

Gahh! You gave me a tar-bomb! Stop that!
- Iesos

24.1.2012

You didn't literally burn your card, did you? ;-)
- wili

27.3.2012

Hi,
I was trying to use your tool to stress test one of our older CUDA systems (Intel(R) Core(TM)2 Q9650, 8 GiB RAM, GTX 285 cards). When I run the tool I get the following output:
./gpu_burn 1
GPU 0: GeForce GTX 285 (UUID: N/A)
Initialized device 0 with 1023 MB of memory (967 MB available, using 871 MB of it)
Couldn't init a GPU test: Error in "load module": CUDA_ERROR_NO_BINARY_FOR_GPU
100.0%  proc'd: 0   errors: 164232 (WARNING!)   temps: 46 C 
        Summary at:   Tue Mar 27 16:24:16 CEST 2012

100.0%  proc'd: 0   errors: 354700 (WARNING!)   temps: 46 C 
Killing processes.. done

Tested 1 GPUs:
        0: FAULTY

I guess the card is not properly supported; at the very least it is weird that proc'd is always 0.
Any hints on that?
- Ulli Brennenstuhl

27.3.2012

Well... I could have figured that out faster. I had to make a change in the Makefile, as the GTX 285 cards only have compute capability 1.3 (-arch=compute_13).
- Ulli Brennenstuhl

