
[[_TOC_]]



# Debugging and Performance Engineering


**Lecture**
   - [Slides of the lecture](slides.pdf)
   - Don't hesitate to refer to the slides to complete the exercises


**Practical Session**
   - Instructions: https://gitlab.uni.lu/SC-Camp/2021/debugging-and-profiling
   - First, check the part *0 - Pre-requisites*
   - If you're not familiar with GDB or Valgrind, go through part *1 - GDB Tutorial* and/or part *2 - Valgrind Tutorial*
   - Exercises 3, 4 and 5 are independent and can be done in any order
   - Solutions and explanations will be pushed in the repository at the end of the day



<br/><br/><br/><br/>

## 0 - Pre-requisites

For this practical session, we will use the GUANE cluster.

Unless indicated otherwise, you should connect to a computing node for the tutorials and exercises.


### Connect to a computing node of the GUANE cluster

Before connecting to a computing node, you need to reserve a resource from the cluster frontend.
First, reach the GUANE access node via toctoc:
```
[username@laptop ~]$ ssh username@167.249.40.26
[username@toctoc ~]$ ssh guane
```

Start an interactive session with 1 task and 4 cores:
```
[username@guane ~]$ srun -n 1 -c 4 --time 8:0:0 --pty bash
```
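
To confirm that you ended up on a computing node and not on the frontend, a quick check such as the following may help. It is only a sketch: it assumes that Slurm exports `SLURM_JOB_ID` and `SLURM_CPUS_PER_TASK` inside the session (it normally does when `-c` is given), and the messages are illustrative.

```shell
# Name of the machine this shell runs on (should be a computing node, not guane)
node=$(uname -n)
echo "Running on: $node"

# Slurm sets these variables inside a job; outside a job they are empty
echo "Job ID:         ${SLURM_JOB_ID:-none (not inside a Slurm job)}"
echo "Cores per task: ${SLURM_CPUS_PER_TASK:-unknown}"
```

With the `srun` command above, the last line should report 4 cores.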


### Download the practical session materials

On the cluster, run the following command to download all the exercises:
```
git clone https://gitlab.uni.lu/SC-Camp/2021/debugging-and-profiling.git
```

Alternatively, if the download is too slow, you can make a copy from a local directory on GUANE and update to the latest version:
```
cp -r /home/xbesseron/debugging-and-profiling .
cd debugging-and-profiling
git pull
```


<br/><br/><br/><br/>

## 1 - GDB Tutorial

### Objective

- Learn the basic commands of GDB

### Instructions

1. Read carefully the page [A GDB Tutorial with Examples](http://www.cprogramming.com/gdb.html)
2. Follow and run the step-by-step example *An Example Debugging Session*

### Notes

- The example program `main.cpp` is available in the `tutorial_gdb` directory.
- The example program waits for you to enter a number as input. If the program or the GDB session appears to be stuck, enter a number (e.g. `3`) and press `Enter`.

### More help on GDB

To get help about the GDB commands:
- in GDB prompt, use `help` or `help <command>`
- in the shell, use `man gdb`
- online [GDB documentation](https://sourceware.org/gdb/current/onlinedocs/gdb/)



<br/><br/><br/><br/>

## 2 - Valgrind Tutorial

### Objective

- Run Valgrind on simple examples and understand the error messages


### Setup

To use Valgrind on GUANE, we need to load the corresponding module:
```
# search for Valgrind
module avail valgrind

# load the module
module load valgrind/3.15.0
```

### Instructions

1. Read carefully the page [Using Valgrind to Find Memory Leaks and Invalid Memory Use](http://www.cprogramming.com/debugging/valgrind.html)
2. Reproduce the execution and the analyses of the tutorials
3. Check the documentation and explanations about the reported errors

### Notes

- If you successfully loaded the Valgrind module as described above, you can skip the first *Getting Valgrind* part.
- The example programs of the tutorial are available in the `tutorial_valgrind` directory.
- Don't forget to compile the example programs: you can just type `make` for that.
- If you get the error `example1: command not found`, just use `./example1` instead.

### More help on Valgrind

- For more info on the command line options of Valgrind, use `man valgrind`
- [Valgrind User Manual](https://valgrind.org/docs/manual/manual.html)
- [Documentation of the Memcheck tool](https://valgrind.org/docs/manual/mc-manual.html)
- [Explanation of error messages from Memcheck](https://valgrind.org/docs/manual/mc-manual.html#mc-manual.errormsgs)




<br/><br/><br/><br/>

## 3 - Profiling with Callgrind

### Objective

- Profile a program with Valgrind and optimize it


### Setup

We will use Valgrind.

```
# Remove any previously loaded module
module purge

# Load Valgrind 
module load valgrind/3.15.0
```


### Instructions

1. Compile and run the program

An example program is available in the directory `profiling`.
Let's compile it and run it.

```
# Compile the program
cd profiling
make

# Test the program
time ./main
```

It is a bit slow to execute. Can we optimize it?


2. Profile with Valgrind

Let's use Valgrind to profile it:

```
valgrind --tool=callgrind ./main
```

The profiling with Valgrind is slow (about 20-25x slower than the original) and should last around 2 minutes on GUANE for this example.

Valgrind will generate a trace file named `callgrind.out.XXXXX`.
One of the best ways to look at it is to use **KCacheGrind**.
This tool is not installed on GUANE, but you can download the tracefile and visualize it on your laptop.
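
For a quick text-only look before downloading anything, Valgrind also ships a command-line tool, `callgrind_annotate`. The sketch below assumes it is provided by the Valgrind module loaded above (usual for Valgrind installations, but not checked on GUANE) and that you run it in the directory containing the trace file.

```shell
# Pick the most recent Callgrind output file in the current directory
# (`|| true` keeps the script going when no such file exists)
latest=$(ls -t callgrind.out.* 2>/dev/null | head -n 1 || true)

if [ -n "$latest" ] && command -v callgrind_annotate >/dev/null 2>&1; then
    # Print the beginning of the report: functions sorted by instruction count
    callgrind_annotate "$latest" | head -n 30
else
    echo "No callgrind.out.* file found, or callgrind_annotate is not available"
fi
```

The functions listed first account for most of the executed instructions and are the first candidates for optimization.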


3. Download the tracefile

Use `scp` (or any `sftp` client) to download the tracefile `callgrind.out.XXXXX` to your computer.


4. Install KCacheGrind


- To install KCacheGrind on Linux, use your package manager. For example on Ubuntu, run `sudo apt install kcachegrind`
- For KCacheGrind on Windows, you have to install [QCacheGrind](https://sourceforge.net/projects/qcachegrindwin/) and [Visual C++ Redistributable for Visual Studio 2012 Update 4](https://www.microsoft.com/en-us/download/details.aspx?id=30679).


5. Visualize the tracefile with KCacheGrind

Open the tracefile with KCacheGrind. You should obtain something similar to this:

![Visualization with KCacheGrind](profiling/callgrind.png)

You can also download the source files to visualize the source code in KCacheGrind.


6. Optimize the program

This program contains a beginner C++ mistake that makes it slow.

**Can you figure out what is wrong and improve the performance of the program?**

Tip: 



<br/><br/><br/><br/>

## 4 - Bug Hunting

### Objective

- Encounter different types of bugs and experiment with various debugging tools

### Instructions

A set of programs demonstrating the different kinds of bugs is available in the `exercises` directory.
Try the different debugging tools on every example to see how they behave and find the bugs.

**Can you exterminate all the bugs?**




### Notes

- You can compile each program manually using `gcc` or `icc`. You are encouraged to try both to see how differently they behave. Example: `gcc program.c -o program`. Add any additional parameter you might need.
  - To use `gcc` you need to load the module `devtools/gcc/9.2.0`
  - To use `icc` you need to load the module `devtools/intel/oneAPI`
- The files are named according to the type of bug they trigger. You can refer to the [slides of the lecture](slides.pdf) for help.
- Look at the comment at the beginning of each `.c` file for tips or specific compilation options.
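
As a starting point, a debug-friendly compilation could look like the sketch below. The file name `program.c` is a placeholder, and the flags are generic GCC options rather than the ones required by a given exercise, so the comment at the top of each `.c` file takes precedence.

```shell
# Placeholder file name: replace it with one of the files in `exercises`
src=program.c

if [ -f "$src" ]; then
    # -g keeps debug symbols for GDB and Valgrind, -O0 disables optimizations,
    # -Wall -Wextra enable the common compiler warnings
    gcc -g -O0 -Wall -Wextra "$src" -o "${src%.c}"
else
    echo "$src not found: run this from the exercises directory"
fi
```

The resulting executable can then be run under `gdb` or `valgrind` as in the tutorials above.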
  
  

<br/><br/><br/><br/>
 
## 5 - Roofline with Intel Advisor

This exercise compares 3 implementations of matrix multiplication:
- Naive algorithm
- Block algorithm
- Using the Eigen library

with 2 different sets of compilation options:
- without vectorization instructions
- with vectorization instructions

### Objectives

- Compare the performance of different implementations of the same algorithm
- Use Intel Advisor for the Roofline Analysis

### Setup

To use the graphical interface of Intel Advisor, we need to enable X-forwarding with the `-X` option of SSH.

```
[username@laptop ~]$ ssh -X username@167.249.40.26
[username@toctoc ~]$ ssh -X guane
```

Then allocate a job and use this trick to connect to its first node with `ssh -X`, so that X-forwarding also reaches the computing node:
```
[username@guane ~]$ salloc -n 1 -c 4 --time=8:00:00 bash -c 'ssh -X $(scontrol show hostnames | head -n 1)'
```
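
Before starting any graphical tool, it can be worth checking that X-forwarding survived all the SSH hops. The sketch below relies only on the `DISPLAY` variable, which SSH sets on the remote side when forwarding is active. The messages are illustrative.

```shell
# On a cluster node, DISPLAY is only set when ssh -X forwarding is active
if [ -n "${DISPLAY:-}" ]; then
    x_status="active (DISPLAY=$DISPLAY)"
else
    x_status="inactive: reconnect with ssh -X at every hop"
fi
echo "X forwarding is $x_status"
```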

Once connected to the computing node, load the required modules (Intel compiler, a recent GCC and Eigen):
```
module purge
module load devtools/gcc/9.2.0 devtools/intel/oneAPI libraries/eigen3/3.3.7
```

Load the Intel Advisor module:
```
module load advisor/2021.4.0
```


**Note:**
- If using Windows, you need to install an X server, for example with [MobaXterm](https://mobaxterm.mobatek.net/).


### Instructions

1. Compile and run the program

An example program is available in the directory `roofline`.
Let's compile it and run it.

```
# Compile the program
cd roofline
make
```

Two executables are compiled:
- `matmul_all_novec` without SIMD instructions
- `matmul_all_vec` with SIMD instructions

Run the two executables and compare the performance:

```
./matmul_all_novec
```

```
./matmul_all_vec
```


2. Start Intel Advisor

Start the GUI:
```
advisor-gui &
```

The Intel Advisor interface will appear after some time (the connection might be slow).


3. Profile one executable

Run the roofline analysis in the Intel Advisor GUI
- Do *Create Project*, set the project name
- Select one of the executables above as the *Application* and click *OK*
- Select *CPU / Memory Roofline Insights* and click *Choose*

![Select analysis in Intel Advisor](roofline/advisor_select_analysis.png)

- Click the *Play* (triangle) / *Start Survey* button

![Roofline in Intel Advisor](roofline/advisor.png)

The analysis takes a bit of time. Advisor will run the program twice: once to collect timing data, and once to count the data accesses and floating-point operations.


4. Explore the plot

- Identify the *roof* lines of the plot: 
    - Maximum floating-point operation (FLOP) for scalar/vectorized instructions?
    - Maximum bandwidth for RAM and cache accesses?
- Identify the different loops for the different algorithms: not that easy :-)
- Theoretical comparison of the *naive* matrix multiplication:
    - How much data is accessed by the algorithm? (read and write)
    - How many floating-point operations are performed?
    - What is the arithmetic intensity? Does it match the one found by Intel Advisor?
    - What appears to be the bottleneck for this algorithm on this machine?
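
As a reminder for the theoretical comparison, the roofline model can be summarized with the two relations below. The symbols are our own notation for this README, not names used by Intel Advisor: `W` is the number of floating-point operations (FLOP), `Q` the amount of data moved (bytes), `I` the arithmetic intensity (FLOP/byte), `B` the bandwidth of the memory level considered (bytes/s) and `P_peak` the peak compute rate (FLOP/s).

```math
I = \frac{W}{Q}, \qquad P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B\right)
```

A loop whose point lies under the slanted part of a roof (`I × B < P_peak`) is limited by memory bandwidth, while a loop under the flat part is limited by compute.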