# Optimization and GPU Offloading Workflow with Intel oneAPI

## **oneAPI: A Smart Way to Develop Accelerated Computing**

2021. 10. 28. **MOASYS** 

### **Content**

- **oneAPI Compilers and Analytics Tools**
- **Intel Optimization Workflow:**
	- I. Compiler Optimization Report
	- II. Application Performance Snapshot
	- III. Memory Access Analysis
	- IV. CPU Roofline Analysis
	- V. GPU Offload Modeling
	- VI. GPU Roofline Analysis
	- VII. Minimization of Analysis Overhead
- **Conclusion**



Open, Standards-Based **Unified Software Stack** 

Freedom from proprietary programming models


Full performance from the hardware

Peace of mind for developers





### oneAPI: One Programming Model for Multiple Architectures and Vendors

#### **• Freedom to Make Your Best Choice**

- Choose the best accelerated technology; the software doesn't decide for you
- **Realize all the Hardware Value** 
	- **Performance across CPU, GPUs, FPGAs, and other accelerators**
- **Develop & Deploy Software with Peace of Mind** 
	- Open industry standards provide a safe, clear path to the future
	- Compatible with existing languages and programming models including C++, Python, SYCL, OpenMP, Fortran, and MPI




### Intel Xe Architecture: Building the Foundation for Exascale Computing












- Intel Architecture Day 2020:
	- <https://newsroom.intel.com/wp-content/uploads/sites/11/2020/08/Intel-Architecture-Day-2020-Presentation-Slides.pdf>
	- Xe-HP can scale up to 4 tiles with a peak FP32 performance of 42 TFLOPS

### A New Era of Accelerated Computing



**• Roofline Analysis:** get insights about performance headroom against hardware limitations.

**• Offload Advisor**: get your code ready for efficient GPU offload before buying the hardware.


### Heterogeneous Computing with Intel Compilers



- **icc/icpc/ifort: classic Intel HPC compilers**
- **icx/ifx: next-generation compilers based on Clang/LLVM with Intel proprietary technologies**
	- Support for OpenMP offloading to Intel GPUs
- **dpcpp: implementation of the SYCL standard**
	- <https://www.khronos.org/sycl/>
	- SYCL is a high-level C++ abstraction on top of the OpenCL runtime for targeting heterogeneous architectures.
- **intel-llvm: open-source development version of dpcpp**
	- <https://github.com/intel/llvm>
	- Experimental support for NVIDIA devices using the CUDA PTX backend



### Optimization Workflow I: Compiler Optimization Report

- Use the compiler option `-qopt-report=5`
	- Detailed information regarding optimizations done by Intel compilers (-O2)

```
void mat_mul(float *A, float *B, float *C,
             int m, int n, int p) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++)
                C[i*n+j] += A[i*p+k] * B[k*n+j];
}

LOOP BEGIN at matmul_baseline.c(89,5)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at matmul_baseline.c(90,9)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at matmul_baseline.c(92,13)
         remark #15388: vectorization support: reference A[i*p+k] has aligned access [ matmul_baseline.c(93,29) ]
         remark #15328: vectorization support: non-unit strided load was emulated for the variable <B[k*n+j]>, stride is unknown to compiler [ matmul_baseline.c(93,40) ]
         remark #15305: vectorization support: vector length 4
         remark #15309: vectorization support: normalized vectorization overhead 0.250
         remark #15355: vectorization support: *(C+(i*n+j)*4) is float type reduction [ matmul_baseline.c(93,17) ]
         remark #15300: LOOP WAS VECTORIZED
         remark #15442: entire loop may be executed in remainder
         remark #15448: unmasked aligned unit stride loads: 1
         remark #15452: unmasked strided loads: 1
         remark #15475: --- begin vector cost summary ---
         remark #15476: scalar cost: 11
         remark #15477: vector cost: 10.000
         remark #15478: estimated potential speedup: 1.090
         remark #15488: --- end vector cost summary ---
      LOOP END
   LOOP END
LOOP END
```
### Optimization Workflow II: Application Performance Snapshot (APS)

#### **Analyzing Shared Memory Applications**



- Command-line interface that generates an HTML report: easy to use, low overhead, and highly scalable
- **For shared memory applications:**

```
aps <my app> <app parameters>
```
- **For MPI applications:** `<mpi launcher> <mpi parameters> aps <my app> <app parameters>`
- HTML report: `aps_result_<date>`


#### **Application Performance Snapshot**

- Application: matmul\_baseline.x
- Report creation date: 2021-10-24 19:39:38
- HW Platform: Intel(R) Xeon(R) Processor code named Cascadelake
- Frequency: 2.99 GHz
- Logical Core Count per node: 32
- Collector type: Driverless Perf per-process counting

Your application might underutilize the available logical CPU cores because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel® VTune™ Profiler to discover why the CPU is underutilized.

#### `<Memory Level>` Stalls definition:

- Percentage of cycles when the CPU is stalled, waiting for data to come from `<Memory Level>`

### In-depth Analysis with oneAPI Toolkits



- **Trace Analyzer and Collector: understand MPI applications for weak and strong scaling optimization**
- **VTune Profiler: CPU/GPU hotspot analysis, OpenMP threading efficiency, and memory access efficiency**
- **Advisor: vectorization efficiency, roofline analysis, and GPU offloading advisor**

### Optimization Workflow III: Memory Access Analysis

- The following command-line options are recommended for the best experience with Advisor:
	- `-g`: full debug information
	- `-O2`: moderate optimization
	- `-no-ipo`: disable Intel's inter-procedural optimization during offload modeling
- Perform a survey with Advisor: `advisor -collect survey -project-dir ./result -- ./matmul.x`
	- This shows loop hotspots and the corresponding degree of vectorization
- Select the loop on line 92 for memory access pattern (MAP) analysis: `advisor -collect map -select matmul.c:92 -project-dir ./result -- ./matmul.x`
	- This shows whether an array has contiguous memory access, i.e. unit stride
	- Unit stride allows the compiler to vectorize the loop effectively
- **Perform memory access analysis with VTune Profiler:** `vtune -collect memory-access -knob analyze-mem-objects=true -result-dir ./mem -- ./matmul.x`
	- This shows the number of loads, stores, and LLC misses
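
The unit-stride point above can be checked by hand for the matmul kernel. The following is a minimal sketch, not part of Advisor: the helper names (`Strides`, `inner_loop_strides_k`, `inner_loop_strides_j`) are hypothetical. It differences the index expressions of `C[i*n+j] += A[i*p+k] * B[k*n+j]` to show how far each array moves per iteration of the innermost loop.

```cpp
#include <cassert>
#include <cstddef>

// Index expressions taken from C[i*n+j] += A[i*p+k] * B[k*n+j].
inline std::ptrdiff_t idx_A(std::ptrdiff_t i, std::ptrdiff_t k, std::ptrdiff_t p) { return i * p + k; }
inline std::ptrdiff_t idx_B(std::ptrdiff_t k, std::ptrdiff_t j, std::ptrdiff_t n) { return k * n + j; }
inline std::ptrdiff_t idx_C(std::ptrdiff_t i, std::ptrdiff_t j, std::ptrdiff_t n) { return i * n + j; }

// Distance, in elements, between two consecutive innermost-loop iterations.
struct Strides {
    std::ptrdiff_t a, b, c;
};

// Innermost loop runs over k (i-j-k loop order).
Strides inner_loop_strides_k(std::ptrdiff_t n, std::ptrdiff_t p) {
    assert(n > 0 && p > 0);
    return {idx_A(0, 1, p) - idx_A(0, 0, p),   // A: unit stride
            idx_B(1, 0, n) - idx_B(0, 0, n),   // B: stride n (non-unit)
            idx_C(0, 0, n) - idx_C(0, 0, n)};  // C: loop-invariant
}

// Innermost loop runs over j (i-k-j loop order).
Strides inner_loop_strides_j(std::ptrdiff_t n, std::ptrdiff_t p) {
    assert(n > 0 && p > 0);
    return {idx_A(0, 0, p) - idx_A(0, 0, p),   // A: loop-invariant
            idx_B(0, 1, n) - idx_B(0, 0, n),   // B: unit stride
            idx_C(0, 1, n) - idx_C(0, 0, n)};  // C: unit stride
}
```

With `k` innermost, `B` advances by `n` elements per iteration, which is the non-unit strided load a MAP analysis would flag; with `j` innermost, every moving access is unit stride.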


### Optimization Workflow III: Cache Optimization to Improve Vectorization

```
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < p; k++)
            C[i*n+j] += A[i*p+k] * B[k*n+j];
```

Figure: Top Time-Consuming Loops
### Optimization Workflow IV: Arithmetic Intensity



- In a *first-order approximation*, the performance of an application is assumed to be bound by:
	- The machine's theoretical double-precision/single-precision peaks (FLOP/s)
	- **Memory bandwidth of DRAM and the L1, L2, and L3 caches (byte/s)**
- Q: How can we combine the machine's theoretical FLOP/s and memory bandwidth in a single model?
- A: Arithmetic intensity (AI)
	- Ratio of total floating-point operations to total data movement (FLOP/byte)
	- AI is an intrinsic property of the algorithm, reflecting how effectively data in cache is reused:
		- BLAS3 can achieve higher AI via cache optimization techniques such as loop tiling and low-level optimizations (oneMKL)
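
As a worked illustration of the FLOP/byte ratio above, the sketch below estimates AI for the matmul kernel under two idealized assumptions that are not from the slides: perfect reuse (each matrix moves between memory and cache once) and no reuse (every operand of every update comes from memory). The function names are hypothetical.

```cpp
#include <cassert>

// FLOPs of C += A*B with A (m x p), B (p x n), C (m x n):
// one multiply and one add per (i, j, k) triple.
double matmul_flops(double m, double n, double p) {
    return 2.0 * m * n * p;
}

// Assumption: perfect cache reuse. A and B are read once,
// C is read once and written once.
double matmul_bytes_perfect_reuse(double m, double n, double p, double elem_bytes) {
    return elem_bytes * (m * p + p * n + 2.0 * m * n);
}

// Assumption: no reuse at all. Each update loads A, B, C and stores C.
double matmul_bytes_no_reuse(double m, double n, double p, double elem_bytes) {
    return elem_bytes * 4.0 * m * n * p;
}

// Arithmetic intensity in FLOP/byte.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}
```

For 1024 x 1024 x 1024 single-precision matrices these bounds give 128 FLOP/byte with perfect reuse and 0.125 FLOP/byte with none, which is the gap that cache optimizations such as loop tiling try to close.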

### Optimization Workflow IV: Roofline Model



- The product of AI (software-intrinsic) and memory bandwidth (hardware-intrinsic) has units of FLOP/s
	- **Performance increases linearly as a function of AI (sloped roof)**
	- Performance is also bound by the machine's theoretical peaks (horizontal roof)
- The roofline graph is drawn on a log-log scale:
	- Increasing memory bandwidth shifts the sloped roof vertically
	- **The hierarchical structure of the cache can be represented in a single roofline graph**
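
The two roofs described above combine into a single bound: attainable performance is the smaller of the compute peak and AI times bandwidth. A minimal sketch follows; the function names and the numbers used in the usage note are hypothetical, not measurements from the slides.

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound in GFLOP/s for a kernel with arithmetic intensity `ai`
// (FLOP/byte) on a machine with the given compute peak (GFLOP/s) and
// memory bandwidth (GB/s).
double attainable_gflops(double ai, double peak_gflops, double bw_gbs) {
    assert(ai > 0.0 && peak_gflops > 0.0 && bw_gbs > 0.0);
    return std::min(peak_gflops, ai * bw_gbs);  // sloped roof vs horizontal roof
}

// AI at which the sloped roof meets the horizontal roof. Kernels left of
// this point are memory bound, kernels right of it are compute bound.
double ridge_point(double peak_gflops, double bw_gbs) {
    assert(peak_gflops > 0.0 && bw_gbs > 0.0);
    return peak_gflops / bw_gbs;
}
```

For an assumed machine with a 1000 GFLOP/s peak and 100 GB/s of bandwidth, the ridge point is 10 FLOP/byte: a kernel at 0.125 FLOP/byte is capped at 12.5 GFLOP/s, while one at 128 FLOP/byte is capped by the 1000 GFLOP/s peak. Each cache level adds its own sloped roof with its own bandwidth.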



### Optimization Workflow IV: Hierarchical Roofline Model



- **Each dot represents a loop:**
	- Bigger dots are more time-consuming loops: red > yellow > green
	- Best candidate loops for optimization: A and G
	- Vectorization and threading move dots vertically (higher GFLOPS):
		- `#pragma omp simd`
		- `#pragma vector aligned`
	- Optimization of memory access moves dots horizontally (higher AI)



### Optimization Workflow IV: Optimization Guides

### **Next Steps**

#### If under or near a memory roof...

- Try a MAP analysis. Make any appropriate cache optimizations.
- If cache optimization is impossible, try reworking the algorithm to have a higher AI.

#### If Under the Vector Add Peak

Check "Traits" in the Survey to see if FMAs are used. If not, try altering your code or compiler flags to **induce FMA** usage.




#### If just above the **Scalar Add Peak**

**Check vectorization** efficiency in the Survey. Follow the recommendations to improve it if it's low.

#### If under the **Scalar Add Peak...**

**Check the Survey Report** to see if the loop vectorized. If not, try to get it to vectorize if possible. This may involve running Dependencies to see if it's safe to force it.


### Optimization Workflow IV: Intel® Advisor Roofline Analysis

- **Generate performance survey and code analytics:** `advisor -collect survey -project-dir ./result -- ./matmul.x`
- Generate roofline graph: `advisor -collect tripcounts -flop -enable-cache-simulation -project-dir ./result -- ./matmul.x`
- Generate roofline report in HTML format: `advisor -report roofline -project-dir ./result -report-output ./roofline.html`
- View the result with the Advisor GUI: `advisor-gui result/result.advixeproj`



### Optimization Workflow IV: Cache Optimization Roofline



- What are the machine's theoretical peak FLOPS and memory bandwidth?
- Is the application mainly memory bound or compute bound?

### Optimization Workflow IV: Data Aligned for Vectorization

```
/* Baseline: i-j-k order, B[k*n+j] is accessed with stride n in the innermost loop */
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < p; k++)
            C[i*n+j] += A[i*p+k] * B[k*n+j];

/* Loop interchange: i-k-j order, B and C are accessed with unit stride */
for (i = 0; i < m; i++)
    for (k = 0; k < p; k++)
        for (j = 0; j < n; j++)
            C[i*n+j] += A[i*p+k] * B[k*n+j];

/* Aligned allocation plus vectorization hints */
float* A = (float*) _mm_malloc(sizeof(float)*m*p, 64);
...
for (i = 0; i < m; i++)
    for (k = 0; k < p; k++)
        /* each j updates a distinct C element, so no reduction clause is needed */
        #pragma vector aligned
        #pragma omp simd
        for (j = 0; j < n; j++)
            C[i*n+j] += A[i*p+k] * B[k*n+j];
...
_mm_free(A);
```

- *remark #15388: vectorization support: reference C[i\*n+j] has aligned access [ matmul\_aligned.c(96,17) ]*
- *remark #15388: vectorization support: reference C[i\*n+j] has aligned access [ matmul\_aligned.c(96,17) ]*
- *remark #15388: vectorization support: reference B[k\*n+j] has aligned access [ matmul\_aligned.c(96,40) ]*

- Use Intel intrinsics (`_mm_malloc`, `_mm_free`) to align arrays on a 64-byte boundary for AVX-512 vectorization
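
The loop interchange shown on this slide changes only the order in which the same updates are applied, so both orders must produce the same `C`. The sketch below is a portable, standard C++ illustration of that check. It uses `std::vector` instead of `_mm_malloc` and omits the Intel-specific pragmas, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// i-j-k order: the innermost loop walks B with stride n.
void matmul_ijk(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * p && B.size() == p * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; i++)
        for (std::size_t j = 0; j < n; j++)
            for (std::size_t k = 0; k < p; k++)
                C[i * n + j] += A[i * p + k] * B[k * n + j];
}

// i-k-j order: the innermost loop walks B and C with unit stride.
void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * p && B.size() == p * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; i++)
        for (std::size_t k = 0; k < p; k++) {
            const float a = A[i * p + k];  // invariant in the j loop
            for (std::size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Running both on the same small integer-valued inputs and comparing the outputs is a cheap sanity check before profiling. With general floating-point data the two orders sum each `C` element in the same `k` order, so results should still agree.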

### Optimization Workflow V: GPU Offload Modeling

- **The following command-line options are recommended for the best experience with Advisor:**
	- `-g`: full debug information
	- `-O2`: moderate optimization
	- `-no-ipo`: disable Intel's inter-procedural optimization during offload modeling

- Modeling performance on an Intel GPU (DG1 and other devices are selected with `--config`; the command below uses `gen9_gt4`):

```
advisor-python $APM/run_oa.py \
     result_gen9 \
     --config gen9_gt4 \
     --collect basic \
     --no-assume-dependencies \
     -- ./matmul.x
```
- Legacy HTML report:
	- result\_gen9/rank.0/pp000/data.0/report.html
- Arguments: the `<string>` passed to `--config` is one of the supported device configurations.



### Optimization Workflow V: Modeling Performance on GPU



### Optimization Workflow V: Gen9 Offload Modelling



### Optimization Workflow V: Gen12 Offload Modelling



### Optimization Workflow V: Customized GPU with Configuration Slider



#### Use the configuration slider to model a custom GPU:

- **Execution units (EU): 96 → 192**
- HBW: 54 GB/s → 96 GB/s
- Save the new config file as *scalers.toml*
- Redo offload modeling
- Results:
	- 8x performance gain vs 4.6x (default)
	- For example, Xe-HP can support up to 512 EUs
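
To put the slider settings above in perspective, the sketch below computes how much each roof moves. It is a first-order roofline estimate, not what Offload Advisor computes internally. It assumes peak compute scales linearly with EU count at a fixed clock, which the slides do not state, and the type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct GpuConfig {
    int eus;        // number of execution units
    double bw_gbs;  // memory bandwidth in GB/s
};

// Assumption: peak compute is proportional to the EU count.
double compute_scale(const GpuConfig& base, const GpuConfig& custom) {
    assert(base.eus > 0);
    return static_cast<double>(custom.eus) / base.eus;
}

double bandwidth_scale(const GpuConfig& base, const GpuConfig& custom) {
    assert(base.bw_gbs > 0.0);
    return custom.bw_gbs / base.bw_gbs;
}

// Under a pure roofline model, a kernel cannot speed up by more than the
// larger of the two scale factors.
double max_extra_speedup(const GpuConfig& base, const GpuConfig& custom) {
    return std::max(compute_scale(base, custom), bandwidth_scale(base, custom));
}
```

For 96 → 192 EUs and 54 → 96 GB/s the factors are 2.0 and about 1.78, so the ceiling is 2.0. The modeled gains on the slide (8x vs 4.6x, a ratio of about 1.74) sit below that ceiling.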




### Estimation of Performance Gain on Xe HPC with Offload Advisor



<https://www.intel.com/content/www/us/en/newsroom/resources/press-kit-architecture-day-2021.html>

- With Offload Advisor, you can estimate the performance gain of your code on new GPUs before buying them.
- The configuration slider can be used to simulate higher-tier GPUs such as Xe HP and Xe HPC

### Optimization Workflow VI: DPCPP and OpenMP Offloading

#### DPCPP port

```
sycl::device device(sycl::default_selector{});
sycl::queue queue(device);
...
float *A_USM = sycl::malloc_shared<float>(m * p, queue);
float *B_USM = sycl::malloc_shared<float>(p * n, queue);
float *C_USM = sycl::malloc_shared<float>(m * n, queue);
...
// One work-item per element of C
queue.parallel_for(sycl::range<2>(m, n), [=](auto index) {
    auto i = index[0];
    auto j = index[1];
    for (int k = 0; k < p; k++)
        C_USM[i*n+j] += A_USM[i*p+k] * B_USM[k*n+j];
}).wait();  // wait for the kernel before the host reads C_USM
```
#### OpenMP offloading

```
// Map the inputs to the device and copy C back when the region ends
#pragma omp target data map(to: A[0:m*p], B[0:p*n]) map(tofrom: C[0:m*n])
#pragma omp target teams distribute parallel for
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < p; k++)
            C[i * n + j] += A[i * p + k] * B[k * n + j];
```
### Optimization Workflow VI: GPU Roofline Analysis of DPCPP Code



- **Generate roofline for Gen9 graphics:**
	- `advisor -collect survey -profile-gpu -project-dir ./gen9_result -- ./matmul_sycl.x`
	- `advisor -collect tripcounts -profile-gpu -stacks -flop -project-dir ./gen9_result -- ./matmul_sycl.x`

### Optimization Workflow VII: Minimization of Analysis Overhead



- **Techniques to minimize overhead:**
	- Collection controls:
		- Pause/resume long analysis
		- Stop collection after a specific time
		- Skip unimportant phases of code execution such as initialization
	- Loop markup:
		- Skip unimportant loops and focus only on important ones
	- Filtering:
		- Skip unimportant functions and focus only on important ones
	- Execution Speed/Duration/Scope properties:
		- Disable stack collection, increase the sampling interval, etc.
- <https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/minimize-analysis-overhead.html>

```
#include "advisor-annotate.h"

void mat_mul(float *A, float *B, float *C,
             int m, int n, int p) {
    /* The site and task names are arbitrary labels chosen for this example */
    ANNOTATE_SITE_BEGIN(matmul_site);
    for (int i = 0; i < m; i++) {
        ANNOTATE_ITERATION_TASK(matmul_task);
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++)
                C[i*n+j] += A[i*p+k] * B[k*n+j];
    }
    ANNOTATE_SITE_END();
}
```


### Conclusion

- oneAPI helps developers achieve the best performance on heterogeneous platforms:
	- **Easy to use, with well-designed user interfaces**
	- Memory access analysis to improve the efficiency of vectorization
	- Automated roofline analysis to understand hardware limitations
	- Offload simulation to gauge the potential performance gain on Intel GPUs before purchase

