PDN Course: HPC & Networking

Module 0: Introduction to Performance

Before we begin, note two key concepts: latency and throughput.

[Interactive diagram: one Master (M) and its Workers (W)]

Calculate the speedup if the serial time is 10^6 and the parallel time is 2*10^3.
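A minimal sketch of the arithmetic, assuming the usual definition of speedup as serial time divided by parallel time. The function name `speedup` is illustrative, and no time unit is assumed because it cancels in the ratio.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel (both in the same time unit)."""
    return t_serial / t_parallel


s = speedup(10**6, 2 * 10**3)
print(s)  # 500.0
```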

What happens to efficiency when the number of processors p increases while the problem size n is fixed?

What does Throughput measure?

In the SISD model:

What happens to the efficiency when p is fixed and n is increased?

Module 1 Ch 1: OpenMP Basics

Let's dive into Module 1 Ch 1: OpenMP Basics.

Which compiler directive is used to parallelize a loop in OpenMP?

What happens to an OpenMP directive if the code is compiled with a compiler that does not support OpenMP?

What is the purpose of the reduction clause?
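The idea behind a reduction can be imitated outside OpenMP. The sketch below is a Python analogy, not OpenMP code: each worker accumulates into its own private variable and the partial results are combined once at the end. The names `partial_sum` and `reduce_sum` and the choice of four threads are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Sum one chunk into a private accumulator (no shared state is touched)."""
    local = 0
    for x in chunk:
        local += x
    return local


def reduce_sum(data, num_threads=4):
    """Split the data, sum each part independently, then combine the partials."""
    chunks = [data[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # the single combining step


print(reduce_sum(list(range(1, 101))))  # 5050
```

Because no thread ever writes to a shared accumulator, the result does not depend on the order in which the threads run.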

In a parallel region, the thread execution order is:

If y is declared outside an OpenMP parallel for loop and updated inside it without reduction, what occurs?

Module 1 Ch 2: Embarrassingly Parallel

Let's dive into Module 1 Ch 2: Embarrassingly Parallel.

What variable scope must be used for a loop index inside a parallel for loop?

How does OpenMP handle the loop index automatically in `#pragma omp parallel for`?

What happens if diff is not declared as a private variable inside the loop?

Why might a critical section be used when computing the index of a minimum value?

How can you improve scalability when finding the minimum index?
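One common way to avoid a shared critical section is to let every thread find the minimum of its own slice and compare only the per-thread winners at the end. The sketch below shows that pattern in Python as an analogy. The helper names, the thread count, and the tie-breaking rule (lowest index wins) are assumptions, and the input is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor


def local_min(data, start, stop):
    """Index of the smallest value in data[start:stop] (first one on ties)."""
    best = start
    for i in range(start + 1, stop):
        if data[i] < data[best]:
            best = i
    return best


def parallel_argmin(data, num_threads=4):
    """Each worker scans its own slice; only the final comparison is shared."""
    n = len(data)
    bounds = [(t * n // num_threads, (t + 1) * n // num_threads)
              for t in range(num_threads)]
    bounds = [(lo, hi) for lo, hi in bounds if lo < hi]  # drop empty slices
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        candidates = list(pool.map(lambda b: local_min(data, *b), bounds))
    # Combine step: one comparison per thread instead of one per element.
    return min(candidates, key=lambda i: (data[i], i))


data = [7, 3, 9, 1, 4, 1, 8]
print(parallel_argmin(data))  # 3
```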

Module 1 Ch 3: Mutual Exclusion

Let's dive into Module 1 Ch 3: Mutual Exclusion.

Why is using an array of locks more scalable than a single critical section?
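To make the array-of-locks idea concrete, here is a small Python sketch of a histogram with one lock per bin, so that threads only wait on each other when they update the same bin. It illustrates the locking structure only, not OpenMP performance. The binning rule `value % num_bins` and all names are hypothetical.

```python
import threading


def parallel_histogram(data, num_bins, num_threads=4):
    """Count values per bin, protecting each bin with its own lock."""
    counts = [0] * num_bins
    locks = [threading.Lock() for _ in range(num_bins)]

    def worker(chunk):
        for value in chunk:
            b = value % num_bins  # hypothetical binning rule
            with locks[b]:        # only threads hitting bin b contend here
                counts[b] += 1

    threads = [threading.Thread(target=worker, args=(data[t::num_threads],))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


print(parallel_histogram(list(range(100)), 10))
# [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```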

What is the advantage of using `#pragma omp atomic` over locks?

What is a limitation of the `#pragma omp atomic` directive?

When doing a histogram count in parallel, what should the variable 'x' be?

Which of these has the best performance if no mutual exclusion is used at all?

Module 2 Ch 1: GPU/CUDA

Let's dive into Module 2 Ch 1: GPU/CUDA.

If n = 2000 and the block size is 256, how many blocks will be launched?
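The block count is normally obtained by ceiling division, so that the launched threads cover all n elements. A minimal Python sketch of that arithmetic, with the illustrative helper name `num_blocks`:

```python
def num_blocks(n, block_size):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + block_size - 1) // block_size


blocks = num_blocks(2000, 256)
threads = blocks * 256
print(blocks, threads, threads - 2000)  # 8 2048 48
```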

If 2048 threads are launched for 2000 additions, what happens to the extra 48 threads?

What is the thread index formula for a 1D grid and block?
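The usual 1D global index, `blockIdx.x * blockDim.x + threadIdx.x`, can be imitated on the CPU. The sketch below is a Python simulation, not CUDA: the two `for` loops stand in for the hardware launching every thread, and the function `vector_add_thread` is a hypothetical stand-in for a kernel body with a bounds guard.

```python
def vector_add_thread(block_idx, thread_idx, block_dim, a, b, c, n):
    """Work of one simulated thread; returns True if it performed an addition."""
    i = block_idx * block_dim + thread_idx  # global index in a 1D grid
    if i < n:                               # guard: surplus threads do nothing
        c[i] = a[i] + b[i]
        return True
    return False


n, block_dim = 2000, 256
grid_dim = (n + block_dim - 1) // block_dim
a = list(range(n))
b = [2 * x for x in a]
c = [0] * n

idle = 0
for block_idx in range(grid_dim):        # stands in for the grid of blocks
    for thread_idx in range(block_dim):  # stands in for the threads in a block
        if not vector_add_thread(block_idx, thread_idx, block_dim, a, b, c, n):
            idle += 1

print(idle)     # 48
print(c[1999])  # 5997
```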

Why is there no loop in a typical vector addition CUDA kernel?

What happens if there are not enough physical cores for the launched threads?

Module 2 Ch 2: CUDA Organization

Let's dive into Module 2 Ch 2: CUDA Organization.

For 3D data of size 2000x1500x1800 and 8x8x8 block size, what is the grid dimension in X?
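The same ceiling division applies per dimension in 3D. A small Python sketch, with the illustrative helper name `grid_dims`:

```python
def grid_dims(data_shape, block_shape):
    """Blocks needed along each axis so the grid covers the whole data set."""
    return tuple((d + b - 1) // b for d, b in zip(data_shape, block_shape))


print(grid_dims((2000, 1500, 1800), (8, 8, 8)))  # (250, 188, 225)
```

Along Y, 188 blocks of 8 threads cover 1504 positions, so the last block has 4 positions with no data.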

In an iterative stencil computation such as image blurring, what synchronization is typically needed between iterations?

What does `__syncthreads()` do?

In a 3D kernel, how do you access the z-dimension of the block index?

Why might some threads have no computation to do in a block?

Module 2 Ch 3: GPU Memory & Reduction

Let's dive into Module 2 Ch 3: GPU Memory & Reduction.

How many iterations does a parallel tree reduction take for an array of 1024 elements?

For a block size of 1024, how many warps are launched?
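A sketch of the warp count, assuming the warp size of 32 threads used by NVIDIA GPUs. The helper name `warps_per_block` is illustrative.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(block_size, warp_size=WARP_SIZE):
    """Number of warps needed for one block (a partial warp still counts)."""
    return (block_size + warp_size - 1) // warp_size


print(warps_per_block(1024))  # 32
print(warps_per_block(100))   # 4
```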

In tree reduction with interleaved addressing, what causes poor performance at later iterations?

What is the theoretical efficiency of parallel tree reduction with N=1024?
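The result depends on how efficiency is defined. The sketch below assumes efficiency = speedup / p, a serial cost of N - 1 additions, and a parallel cost of log2(N) steps. The processor count p is left as a parameter because the course's convention (N/2 threads or N threads) is not stated here.

```python
def tree_reduction_efficiency(n, p):
    """Efficiency = speedup / p for summing n values (n a power of two)."""
    t_serial = n - 1                 # additions in a serial sum
    t_parallel = n.bit_length() - 1  # log2(n) tree levels, one step each
    speedup = t_serial / t_parallel
    return speedup / p


print(round(tree_reduction_efficiency(1024, 512), 3))   # 0.2
print(round(tree_reduction_efficiency(1024, 1024), 3))  # 0.1
```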

In tree reduction with sequential addressing, how many elements remain to be combined after the 5th iteration (N=1024)?
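A tree reduction with sequential addressing can be traced on the CPU. In this Python sketch the inner loop stands in for the threads that would run in parallel on a GPU. The function name and the power-of-two input length are assumptions.

```python
def tree_reduce_sequential(values):
    """Sum by repeated halving: element i absorbs element i + half."""
    data = list(values)
    active = len(data)  # assumed to be a power of two
    remaining = []      # partial sums left after each iteration
    while active > 1:
        half = active // 2
        for i in range(half):  # parallel threads on a GPU
            data[i] += data[i + half]
        active = half
        remaining.append(active)
    return data[0], remaining


total, remaining = tree_reduce_sequential(range(1024))
print(total)           # 523776
print(len(remaining))  # 10
print(remaining[4])    # 32
```

The number of partial sums halves in every iteration, which is why the iteration count grows with log2(N).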

Module 3 Ch 1: MPI Intro

Let's dive into Module 3 Ch 1: MPI Intro.

What does SPMD stand for?

In an MPI point-to-point communication, which functions are primarily used?

Why does process 0 need to put its own value into the sum before receiving values from the other processes?

After process 0 computes the total sum, how do other processes get it in a naive point-to-point implementation?

What happens to variables like 'recv_value' in processes that do not use them?

Module 3 Ch 2: Foster's Method

Let's dive into Module 3 Ch 2: Foster's Method.

What is the main difference between CUDA threads and MPI processes regarding memory?

Does MPI tree reduction using MPI_Recv require an explicit MPI_Barrier?

What is the complexity of computing a global sum serially in MPI with N processes?

What is the complexity of computing a global sum using parallel tree reduction in MPI?

In an MPI point-to-point tree reduction, how are send and receive operations matched?
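The two global-sum strategies can be compared by counting steps. The sketch below is a pure-Python simulation with no MPI calls: list positions play the role of ranks, and in each tree round the rank at a multiple of `2 * stride` is paired with the rank `stride` above it. This pairing rule is one common choice and is an assumption here.

```python
def serial_sum_steps(values):
    """Rank 0 receives and adds the other values one at a time."""
    total = values[0]
    steps = 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum_rounds(values):
    """Pairwise tree: in each round a receiver adds its partner's partial sum."""
    partial = list(values)
    n = len(partial)
    stride, rounds = 1, 0
    while stride < n:
        for rank in range(0, n, 2 * stride):  # receivers in this round
            partner = rank + stride           # matching sender
            if partner < n:
                partial[rank] += partial[partner]
        stride *= 2
        rounds += 1
    return partial[0], rounds


values = list(range(1, 17))      # one value per simulated process
print(serial_sum_steps(values))  # (136, 15)
print(tree_sum_rounds(values))   # (136, 4)
```

With 16 simulated processes, the serial version needs 15 sequential steps and the tree version needs 4 rounds.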

Module 3 Ch 3: MPI Matrix-Vector

Let's dive into Module 3 Ch 3: MPI Matrix-Vector.

Which collective communication function is used to collect data from all processes to a single array on the root process?

What is the communication time for MPI_Gather?

When checking if numbers are sorted locally in parallel, which function allows simultaneously sending and receiving?

If each process determines a local boolean result, how can you determine if ALL processes are true?

If process 0 gathers numbers from P processes and sorts them serially, what is the computation time?

Module 4: MapReduce

Let's dive into Module 4: MapReduce.

In MapReduce, how are elements usually processed in the Mapper?

Why might calculating an average be done with a Combiner, but not a median?
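A small Python sketch of the difference, using made-up input splits. For the average, a combiner can emit a (sum, count) pair per split and the reducer can merge the pairs exactly. Taking the median of per-split medians, by contrast, does not in general give the overall median. The function names are illustrative.

```python
from statistics import median


def combine(values):
    """Combiner output for one split: a (sum, count) pair."""
    return (sum(values), len(values))


def reduce_average(partials):
    """Reducer: merge the (sum, count) pairs, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


splits = [[1, 2, 3], [10, 20], [100]]  # made-up data on three mappers
all_values = [v for split in splits for v in split]

avg = reduce_average([combine(s) for s in splits])
print(avg == sum(all_values) / len(all_values))  # True

median_of_medians = median([median(s) for s in splits])
print(median_of_medians, median(all_values))  # 15.0 6.5
```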

What is the main drawback of NOT using a combiner?

In a MapReduce weather program parsing YYYY-MM-DD to find monthly stats, what should be the Key?
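A sketch of one possible mapper and reducer, assuming each input record is a `date,temperature` line (a hypothetical format) and that statistics are wanted per calendar month of each year, so the key is the `YYYY-MM` prefix. If months were to be pooled across years, the key would be the `MM` part alone.

```python
from collections import defaultdict


def mapper(line):
    """Emit (year-month, temperature) for one 'YYYY-MM-DD,temp' record."""
    date, temp = line.split(",")
    return date[:7], float(temp)  # 'YYYY-MM' is the grouping key


def reducer(key, temps):
    """Monthly statistics: (key, min, max, mean)."""
    return key, min(temps), max(temps), sum(temps) / len(temps)


records = ["2021-01-03,4.0", "2021-01-17,6.0", "2021-02-01,10.0"]  # made up

groups = defaultdict(list)  # stands in for the shuffle phase
for line in records:
    key, temp = mapper(line)
    groups[key].append(temp)

results = [reducer(key, groups[key]) for key in sorted(groups)]
for row in results:
    print(row)
# ('2021-01', 4.0, 6.0, 5.0)
# ('2021-02', 10.0, 10.0, 10.0)
```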