OpenCL in trading - page 9

 

EECE.6540 Heterogeneous Computing (University of Massachusetts Lowell) - 46. Basic FPGA concepts



46. Basic FPGA concepts

This video covers basic concepts of field programmable gate arrays (FPGAs). Unlike CPUs, FPGAs have no fixed datapath; their hardware resources can be configured for a specific design, which makes them highly customizable. The video discusses the importance of latency in circuit design and how it must be balanced against maximizing fmax. It introduces pipeline design as a way to increase the frequency at which a computation can be performed, discusses the data path and control path of a circuit, and finally explains circuit occupancy in an FPGA and how decreasing bubbles and increasing occupancy can increase fmax.

  • 00:00:00 In this section, we learn about the basic concepts of field programmable gate arrays (FPGAs) and how they differ from other processor architectures. Unlike CPUs, FPGAs do not have a fixed datapath and can be programmed to fit specific hardware resources to accommodate different numbers of registers, computational logic, and their connections, making them highly customizable. FPGAs contain various components, such as adaptive logic, RAM blocks, DSP blocks, and programmable routing switches, which can be connected to implement different designs. Additionally, the maximum operation frequency of an FPGA is determined by the physical propagation delay of the combinational logic, which affects the circuit's latency and operation speed.

  • 00:05:00 In this section, the concept of latency and its importance in circuit design is explained. Latency is the time it takes for a circuit to complete one or more operations, and it can be measured in time or clock cycles. Lowering latency is usually the goal, but it must be carefully balanced against maximizing fmax. An example is given of a design that is optimized for latency and one that is optimized for fmax through pipelining. The importance of pipeline registers in breaking up long critical paths and improving speed is emphasized.

  • 00:10:00 In this section, the video explains the concept of pipeline design and how it increases the frequency at which a computation can be performed. By inserting a pipeline register, the critical-path delay is shortened, allowing for higher clock speeds and an increase in throughput. The video then introduces the idea of the data path and control path in a circuit: the data path is the chain of registers and combinational logic that performs the computation, and the control path is everything outside the data path that controls the circuit's operation. Various logic is added to the control path for handshaking flow control, loop control, and branch control, among others, and occupancy is introduced to describe how fully the stages of the data path are utilized.

  • 00:15:00 In this section, the speaker explains the concept of circuit occupancy in an FPGA. The circuit has several registers, one of which is a valid register that indicates whether the input data is valid. Each cycle is counted to see how resources are being used, and unoccupied portions of the circuit are called bubbles. The goal is to minimize bubbles throughout the circuit and thereby increase occupancy; by decreasing bubbles and increasing occupancy, fmax can be increased.
Basic FPGA concepts
  • 2021.04.07
  • www.youtube.com
This video introduces basic concepts of FPGA and design for FPGA
 

47. Design Analysis (I): Analyze FPGA Early Images



47. Design Analysis (I): Analyze FPGA Early Images

This section of the video focuses on the process of analyzing FPGA early images for a DPC++ design. The speaker explains the steps involved, such as compiling the program, generating FPGA binary, and running profiling. The video includes a demo of how to generate reports and interpret the various information panels provided in the reports. The speaker also analyzes the FPGA early images of a b2 module and discusses the various logic blocks, loops, load unit, and unroll factor. They also discuss how the design of a kernel function can significantly impact the internal design on FPGA and provide examples of how the inner and outer loops can be unrolled to increase throughput. The examples illustrate the flexibility of high-level language programming in influencing the FPGA's hardware resources.

  • 00:00:00 In this section of the video, the speaker discusses the process of analyzing a DPC++ design, specifically focusing on analyzing FPGA early images. This involves several steps such as compiling the program to an FPGA emulator to ensure functional correctness, generating the FPGA binary, executing it on an FPGA card, and running profiling to determine runtime performance. The second stage involves generating reports and looking for bottlenecks before revising the design if necessary. The video includes a demo of how to generate reports and analyze them in a browser, and explains how to interpret the various information panels and summaries provided in the reports.

  • 00:05:00 In this section, the speaker gives an overview of the design analysis tool and its features. They explain how the resource utilization panel displays information about the resources that will be used by the design and how the system viewers panel allows users to view different components of the design, such as memory components and pipes. The graph viewer within the system viewers panel shows a list of components used in the design and how they are interconnected. The speaker also demonstrates how users can zoom into individual modules to view their corresponding segments in the source code.

  • 00:10:00 In this section, the speaker analyzes the FPGA early images of a b2 module and discusses the various logic blocks and loops that make up the design. They also explain the use of the load unit to load data from the memory component and the unroll factor to increase the efficiency of loop operations. The speaker also compares parallel_for and single_task designs and recommends using single_task for FPGA designs to optimize memory access.

  • 00:15:00 In this section, the speaker discusses how the design of a kernel function can significantly impact the internal design on FPGA. The speaker provides examples of how the inner and outer loops can be unrolled to make use of the FPGA's hardware resources, thereby increasing the throughput. The examples illustrate the flexibility of high-level language programming in influencing the FPGA's hardware resources.
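
To make the unrolling idea concrete, here is a minimal DPC++-style sketch (not the kernel from the video): a single_task kernel whose inner loop carries an unroll pragma, so the compiler replicates the loop body in hardware and the generated report shows a correspondingly larger unroll factor. All names and sizes are illustrative assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

// Sums a vector inside a single_task kernel; the unroll pragma asks the
// compiler to instantiate 8 copies of the loop body, trading area for throughput.
static void vector_sum(sycl::queue &q, sycl::buffer<float, 1> &in,
                       sycl::buffer<float, 1> &out, int n) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor a{in, h, sycl::read_only};
    sycl::accessor r{out, h, sycl::write_only, sycl::no_init};
    h.single_task([=]() {
      float sum = 0.0f;
      #pragma unroll 8
      for (int i = 0; i < n; ++i)
        sum += a[i];
      r[0] = sum;
    });
  });
}

int main() {
  std::vector<float> host(1024, 1.0f);
  sycl::queue q;
  sycl::buffer<float, 1> in{host.data(), sycl::range<1>(host.size())};
  sycl::buffer<float, 1> out{sycl::range<1>(1)};
  vector_sum(q, in, out, static_cast<int>(host.size()));
  sycl::host_accessor result{out};
  std::cout << "sum = " << result[0] << "\n";   // expect 1024
}
```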
Design Analysis (I): Analyze FPGA Early Images
  • 2021.04.07
  • www.youtube.com
This video introduces the design workflow and how to analyze the early image using the DPC++ compiler report.
 

48. DPC++ FPGA Design Analysis (II): Runtime Profiling



48. DPC++ FPGA Design Analysis (II): Runtime Profiling

In this video, the presenter discusses the process of analyzing a program's runtime performance using tools that collect performance data by adding profiling instrument registers to the FPGA bitstreams. They demonstrate how to compile for profiling and interpret the collected profiling results using the Intel FPGA dynamic profiler with user-added performance counters. They show how the VTune profiler displays the kernel functions and executables used for analyzing runtime profiling results, and how to identify performance bottlenecks and optimize them. The example used is a matrix modification kernel with many accesses to global memory, which was optimized by using local memory to reduce communication with global memory and improve design efficiency.

  • 00:00:00 In this section, the presenter discusses analyzing a program's runtime performance. They explain how to use tools to collect performance data by adding profiling instrument registers to the FPGA bitstreams. The Intel FPGA dynamic profiler is used with user-added performance counters to collect kernel performance data. The performance data is stored in a file called profile.mon at the end of the execution, and a JSON file is generated to allow the Intel VTune profiler to read the files. Both files are needed when loading the results folder into the Intel VTune toolkit, and importing the data from the directory is important when loading data into VTune.

  • 00:05:00 In this section, the presenters demonstrate how to compile for profiling and interpret the collected profiling results. They show how the VTune profiler displays the kernel functions and executables used for analyzing runtime profiling results. The panel shows memory operation functions and loop operations, their time, percentage of stall, occupancy, idle time, activity percentage, data transfer size, and average bandwidth. The bottom portion of the panel shows in more detail the FPGA utilization and other metrics for global memory bandwidth and occupancy. These detailed statistics help designers understand individual kernels and their functions, which helps them optimize and improve their designs.

  • 00:10:00 In this section of the video, the speaker discusses how to identify performance bottlenecks and optimize them. The example used is a matrix modification kernel that has many memory accesses to global memory, resulting in 15.57 GB of data transfer. The kernel's design is memory-bound, and the solution is to optimize memory access by using local memory to reduce communication with global memory and improve design efficiency. This optimization will be discussed in the next lecture.
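
The optimization previewed here, staging data in on-chip local memory to cut global-memory traffic, can be sketched in SYCL as follows. This is a generic illustration, not the matrix modification kernel from the demo, and all sizes are assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024, TILE = 64;
  std::vector<float> data(N, 1.0f);
  sycl::queue q;
  {
    sycl::buffer<float, 1> buf{data.data(), sycl::range<1>(N)};
    q.submit([&](sycl::handler &h) {
      sycl::accessor g{buf, h, sycl::read_write};
      sycl::local_accessor<float, 1> tile{sycl::range<1>(TILE), h};
      h.parallel_for(sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(TILE)},
                     [=](sycl::nd_item<1> it) {
        size_t gid = it.get_global_id(0);
        size_t lid = it.get_local_id(0);
        tile[lid] = g[gid];                      // one global load per work item
        sycl::group_barrier(it.get_group());     // wait until the tile is staged
        float acc = 0.0f;
        for (size_t k = 0; k < TILE; ++k)        // repeated reads now hit local memory
          acc += tile[k];
        g[gid] = acc;                            // one global store per work item
      });
    });
  } // buffer destructor waits for the kernel and copies results back into data
}
```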
DPC++ FPGA Design Analysis (II): Runtime Profiling
  • 2021.04.07
  • www.youtube.com
This video introduces DPC++ profiling and how to analyze run-time profiling reports using Intel VTune tool.
 

EECE.6540 Heterogeneous Computing (University of Massachusetts Lowell) - 49. OpenCL Examples



49. OpenCL Examples (I)

The YouTube video "OpenCL Examples (I)" covers the implementation of matrix multiplication using nested loops in C programming, and its implementation as an OpenCL kernel. The lecturer explains how to use two levels of nested loops for the dot product calculation of the resulting element in the matrix, and how each output element of matrix C is treated as a separate work item in OpenCL. The video also covers the steps required to prepare the OpenCL kernel for execution and retrieve the resulting matrix from a device to a host, as well as setting work group sizes and executing the kernel with modified kernel arguments. Additionally, a sample code for matrix multiplication is provided, and the speaker demonstrates the process of obtaining device and platform IDs on a Mac OS and creating a program object on different platforms. Lastly, the video explains buffer management, tracing the resources allocated on the host side and OpenCL resources used, and provides a simple multiplication kernel example.

This video covers various examples of using OpenCL, including matrix multiplication, image rotation, and image filtering. For image rotation, the speaker explains how to break down the problem using input decomposition and demonstrates the kernel function used to identify the original and new location of each pixel. For image filtering, the speaker discusses the concept of creating image objects on the device side and the use of OpenCL sampler to define how to access the image. They also present a sample implementation of the image convolution function with two nested for loops. The video concludes with a demonstration of using OpenCL to perform a convolution filter on an image and verifying the results.

  • 00:00:00 In this section, the lecturer introduces matrix multiplication, a classical computing example, and explains how it can be implemented with nested loops in C. They also explain the dot product calculation of a resulting element in the matrix, which is the product of a row from matrix A and a column from matrix B. The lecturer explains that the iterations of the two nested loops are independent of one another and can therefore be executed in any order.

  • 00:05:00 In this section, the concept of a work item and how it can be applied to implement matrix multiplication in an OpenCL kernel is discussed. Each output element of matrix C is treated as a separate work item, and with the help of FPGA or GPU processing elements, a two-dimensional range of work items can replace the two outer for loops in the hardware implementation. To implement matrix multiplication, a kernel function named "simple multiply" is defined with a list of arguments that includes all the necessary input matrices and their dimensions. The body of the kernel function uses global IDs to calculate the two-dimensional position of the work item and initializes the sum used to calculate the resulting element of matrix C.

  • 00:10:00 In this section, the speaker explains the kernel function for matrix multiplication using the OpenCL programming framework. The kernel function performs a dot product, using a for loop to iterate through the elements of the row vector from A and the column vector from B. The indices into the two-dimensional input matrices are calculated from the row and column numbers to find the right element in the row vector and column vector. Once the dot product is calculated, the resulting element is assigned to the corresponding element of C. The environment for computation is also discussed, which is platform-dependent and involves understanding the available resources and important parameters of the platform.
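
A minimal kernel along the lines described, in which each work item uses its two-dimensional global ID to compute one element of C as a dot product, could look like the following OpenCL C source, held here as a C++ raw string for later use with clCreateProgramWithSource. The exact argument order is an assumption, not the lecture's code.

```cpp
// OpenCL C kernel source, kept as a C++ raw string so it can be passed to
// clCreateProgramWithSource in the host program.
static const char *kKernelSource = R"CLC(
__kernel void simple_multiply(__global float *C,
                              const int widthA, const int heightA,
                              const int widthB, const int heightB,
                              __global const float *A,
                              __global const float *B)
{
    int row = get_global_id(1);          // 2D global ID: which output row
    int col = get_global_id(0);          // and which output column
    float sum = 0.0f;
    for (int k = 0; k < widthA; ++k)     // dot product of row of A and column of B
        sum += A[row * widthA + k] * B[k * widthB + col];
    C[row * widthB + col] = sum;
}
)CLC";
```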

  • 00:15:00 In this section, the speaker outlines the steps required to prepare an OpenCL kernel to work, starting with creating a context and a command queue to instantiate kernels. Then the input data is prepared by allocating buffers on the host side, copying data from the host memory to the device memory, and dispatching the kernel. The program then waits for kernel completion to collect the results by reading the device memory. The OpenCL application has two layers, the platform layer and the runtime layer, and the kernel program must be compiled into a binary that can be executed on the accelerator device, either an FPGA or a GPU. These steps differ depending on the device: compiling an FPGA binary can take hours, while GPU compilation is usually quick.

  • 00:20:00 In this section, the video discusses how to set up the environment for OpenCL programming. The first step involves getting the platform ID, which is done with the clGetPlatformIDs function that returns the number of platforms available in the system. Next, the video explains how to get a specific device within the platform based on user preference, and how to create an OpenCL context, which is the enclosure for all resources such as command queues and buffers. The tutorial recommends checking the return value of each call to ensure the operation succeeded.
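
Condensed into code, the setup steps just described (clGetPlatformIDs, clGetDeviceIDs, clCreateContext, and a command queue, each followed by a return-value check) look roughly like this sketch:

```cpp
#define CL_TARGET_OPENCL_VERSION 120   // the 1.2-era API used in the lecture
#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_int err;

  // 1. Discover an OpenCL platform.
  cl_platform_id platform;
  cl_uint num_platforms = 0;
  err = clGetPlatformIDs(1, &platform, &num_platforms);
  if (err != CL_SUCCESS || num_platforms == 0) { std::puts("no platform"); return 1; }

  // 2. Pick a device on that platform (first GPU here; CPU or accelerator also possible).
  cl_device_id device;
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
  if (err != CL_SUCCESS) { std::puts("no GPU device"); return 1; }

  // 3. Create the context that owns queues, buffers, and programs.
  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  if (err != CL_SUCCESS) { std::puts("context creation failed"); return 1; }

  // 4. Create a command queue on which kernels and copies will be enqueued.
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);
  if (err != CL_SUCCESS) { std::puts("queue creation failed"); return 1; }

  // ... create buffers, build the program, launch kernels ...

  clReleaseCommandQueue(queue);
  clReleaseContext(context);
  return 0;
}
```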

  • 00:25:00 In this section, the video explains how to create and move data for input matrices A and B and output matrix C by declaring buffers and using OpenCL API functions. It is assumed that matrices A, B, and C have already been declared as float-type arrays whose data is stored in a linear address space in physical memory. The video demonstrates how to use the clCreateBuffer function to declare buffers for matrices A and B, and the clEnqueueWriteBuffer function to copy the initial data from matrices A and B into the buffers, which reside on the device. The next step is to allocate space for matrix C, which is declared with the CL_MEM_WRITE_ONLY flag since the device writes the calculation results to it.
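
A sketch of that buffer step, with A and B copied into read-only device buffers and C allocated with CL_MEM_WRITE_ONLY; the helper function and square matrix layout are assumptions:

```cpp
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

// Create device buffers for A, B (inputs) and C (output) and copy A and B over.
// n is the matrix dimension; the cl_mem handles are returned through pointers.
cl_int create_matrix_buffers(cl_context context, cl_command_queue queue,
                             const float *A, const float *B, size_t n,
                             cl_mem *bufA, cl_mem *bufB, cl_mem *bufC) {
  const size_t bytes = n * n * sizeof(float);
  cl_int err;
  *bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, nullptr, &err);
  *bufB = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, nullptr, &err);
  *bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);  // device writes results

  // Blocking writes copy the initial host data for A and B into the device buffers.
  err = clEnqueueWriteBuffer(queue, *bufA, CL_TRUE, 0, bytes, A, 0, nullptr, nullptr);
  err = clEnqueueWriteBuffer(queue, *bufB, CL_TRUE, 0, bytes, B, 0, nullptr, nullptr);
  return err;
}
```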

  • 00:30:00 In this section of the YouTube video "OpenCL Examples (I)", the speaker explains the process of retrieving results from a device and copying the resulting matrix from buffer C back to the host. The C API definition is shown, with an explanation of the five arguments of clCreateBuffer: the context, flags, size, host pointer, and error return value. The speaker then goes on to explain the third major step in an OpenCL program, kernel compilation, using a simple compilation process for FPGA devices. The process involves creating and building a program and selecting the right kernel function from the source code. Finally, the speaker discusses how to initialize kernel arguments before running the kernel program, using the clSetKernelArg OpenCL API.
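
The build-and-select step might be wrapped up as follows (the clCreateProgramWithSource path used on macOS; the FPGA flow described later loads a precompiled binary instead), with the build log printed on failure. The function and parameter names are illustrative.

```cpp
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Build a program from kernel source and select a kernel by name, printing the
// build log if compilation fails.
cl_kernel build_and_select(cl_context context, cl_device_id device,
                           const char *source, const char *kernel_name) {
  cl_int err;
  cl_program program = clCreateProgramWithSource(context, 1, &source, nullptr, &err);
  err = clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
  if (err != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, nullptr, &log_size);
    std::vector<char> log(log_size + 1, '\0');
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log.data(), nullptr);
    std::fprintf(stderr, "build failed:\n%s\n", log.data());
    return nullptr;
  }
  return clCreateKernel(program, kernel_name, &err);
}
```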

  • 00:35:00 In this section, the speaker discusses the process of initializing kernel arguments, setting work-group sizes, executing the kernel, and obtaining results in an OpenCL program. For each argument, the user calls the clSetKernelArg API, specifying the argument index, its size, and a pointer to the actual value. The speaker emphasizes the importance of setting argument indices accurately and modifying them on each line when copying and pasting. The local and global work-group sizes must be set to define the number of work items and work groups. Finally, clEnqueueReadBuffer is used to read the output buffer back into the host's memory, indicating the synchronization required for proper execution.
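
And the final steps, setting the seven kernel arguments (matching the simple_multiply signature sketched earlier), launching one work item per element of C, and reading the result back, could be sketched as:

```cpp
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

cl_int run_simple_multiply(cl_command_queue queue, cl_kernel kernel,
                           cl_mem bufA, cl_mem bufB, cl_mem bufC,
                           int n, float *C_host) {
  // Each call gives the argument index, its size, and a pointer to the value.
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC);
  clSetKernelArg(kernel, 1, sizeof(int),    &n);   // widthA
  clSetKernelArg(kernel, 2, sizeof(int),    &n);   // heightA
  clSetKernelArg(kernel, 3, sizeof(int),    &n);   // widthB
  clSetKernelArg(kernel, 4, sizeof(int),    &n);   // heightB
  clSetKernelArg(kernel, 5, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, 6, sizeof(cl_mem), &bufB);

  // A 2D global range of n x n work items, in 16 x 16 work groups
  // (the local size must evenly divide the global size).
  size_t global[2] = {(size_t)n, (size_t)n};
  size_t local[2]  = {16, 16};
  cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global, local,
                                      0, nullptr, nullptr);
  if (err != CL_SUCCESS) return err;

  // Blocking read copies the result matrix from the device buffer into C_host.
  return clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, n * n * sizeof(float),
                             C_host, 0, nullptr, nullptr);
}
```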

  • 00:40:00 In this section, the speaker introduces an example of matrix multiplication using OpenCL. The source code for the matrix multiplication example consists of several files, including the host-side program, the kernel program, and a makefile to help compile the project. The host-side program is written in C and includes standard libraries and specific header files for the OpenCL framework. The example includes input matrices and declarations for variables, including the number of platforms and devices, the context, the OpenCL program, and the OpenCL kernel. The speaker also explains how to compile the kernel source code and describes the sizes of the input matrices and the resulting output matrix.

  • 00:45:00 In this section of the video, the presenter demonstrates how to obtain device and platform IDs for OpenCL on macOS. By calling various OpenCL functions such as clGetPlatformIDs and creating command queues, the speaker creates an OpenCL context and compiles the program. They also explain that the code shown supports both macOS and the Altera OpenCL SDK, and will report an error if it is run on a different platform.

  • 00:50:00 In this section, the video demonstrates how to create a program object using OpenCL on different platforms. On Mac OS, the program object is created from a kernel source code file, while on the Altera FPGA OpenCL SDK, it is created from a binary file generated through compilation. Once the program object is created, the kernel program can be built and the specific kernel function can be selected from that program object. By the end of this section, the necessary objects and functions are ready for the next section of the program.

  • 00:55:00 In this section, the video discusses the buffer management process, including allocating a buffer to store the matrix results and using clCreateBuffer to create buffers on the device side. The video also highlights the importance of checking the return value of clEnqueueNDRangeKernel to ensure successful execution, especially when using FPGAs. Additionally, the video explains the process of verifying the results by printing them out, freeing the resources allocated on the host side and the OpenCL resources used, and walks through a simple multiplication kernel whose seven arguments are used to perform the dot product operation through iteration.

  • 01:00:00 In this section of the video, the speaker explains two examples of using OpenCL. The first one is matrix multiplication: the program takes two matrices, multiplies them, and stores the result in a third matrix. The second example is image rotation, where the program relocates the pixels of an image based on certain formulas. These formulas take into account the original and new coordinates of each pixel and the rotation angle.

  • 01:05:00 In this section, the speaker discusses how to break down an image rotation problem into smaller ones using input decomposition. They explain that the image's pixel information will be copied to a new location through independent calculations of the x and y dimensions. Work groups are assigned to calculate the new location of each pixel using its global ID. The speaker also details how to determine work item groups and dimensions and the kernel function needed to complete this operation. The goal is to create a more efficient and scalable method for image rotation calculations.

  • 01:10:00 In this section, the video presenter explains how to use OpenCL to rotate an image. The kernel function is used to identify the original location of a pixel, calculate the new location of the pixel using the rotation parameters, perform boundary checking to ensure the new coordinates fall within the original image size, and copy the pixel information from the original location to the new location. The code also uses the C++ bindings for the OpenCL API and includes steps to query platforms, acquire devices, and declare buffers to move data from host memory to the device buffers. A read-only buffer is also created to protect the original data.

  • 01:15:00 In this section, the speaker explains the steps required to perform an image rotation using OpenCL. First, the original image must be copied into an image buffer. Then the kernel is compiled and executed by initializing the destination buffer and setting the right kernel arguments, including the dimensions of the original picture and the rotation parameters. The kernel is executed with the global and local work-group sizes. Finally, the result is read back to the host using clEnqueueReadBuffer. The speaker also walks through the example source code for image rotation, which includes header files, utility functions, platform and device IDs, command queues, program and kernel objects, and input/output buffers for the original and rotated images.

  • 01:20:00 In this section, the video covers the process of rotating an image using OpenCL. The host reads the image in BMP format and converts it into an array of floating-point numbers stored in the input image buffer. The output buffer on the host is created and initialized with random numbers. The platform is queried to find the devices on the platform and create a context and command queue. The program and kernel objects are created, and device side buffers are created to store the original and rotated image. The original image is copied to the buffer on the device side, and the kernel arguments are set. The kernel is executed by instantiating it with the global and local workgroup sizes. The return value is checked to ensure the kernel ran successfully.

  • 01:25:00 In this section, the speaker gives an overview of image rotation using OpenCL. After completing the kernel, the output data is read back to the host using the pointer to the global memory on the device side, and a host buffer is provided for storing the image. BMP formatting is involved in the process, and a utility function called write BMP float is used for creating a new BMP file that shows the result. The kernel function is described in detail, where the destination and source buffer pointers are passed along with the image dimensions and rotation parameters. The formula for calculating new coordinates of each pixel is used, and a boundary check is applied before copying the pixel information from the original location to the new location. The process is demonstrated with an example of rotating a cat image by 45 degrees.
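
A rotation kernel along the lines described might look like the following OpenCL C source (kept as a C++ raw string for use with clCreateProgramWithSource). The rotation-about-the-centre formula and all names are illustrative assumptions, not the lecture's exact code.

```cpp
// OpenCL C rotation kernel: each work item locates its source pixel, computes
// the rotated destination, bounds-checks it, and copies the pixel value.
static const char *kRotateSource = R"CLC(
__kernel void img_rotate(__global float *dst, __global const float *src,
                         const int W, const int H,
                         const float sinTheta, const float cosTheta)
{
    int ix = get_global_id(0);
    int iy = get_global_id(1);
    float x0 = W / 2.0f, y0 = H / 2.0f;          // rotate about the image centre
    float xOff = ix - x0, yOff = iy - y0;
    int xNew = (int)( xOff * cosTheta + yOff * sinTheta + x0);
    int yNew = (int)(-xOff * sinTheta + yOff * cosTheta + y0);
    if (xNew >= 0 && xNew < W && yNew >= 0 && yNew < H)   // boundary check
        dst[yNew * W + xNew] = src[iy * W + ix];
}
)CLC";
```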

  • 01:30:00 In this section, the speaker explains the concept of image filtering using OpenCL. He describes the process of using a 3x3 filter to multiply and sum the values of neighboring pixels to obtain the new filtered pixel value. He also emphasizes the need to be careful when dealing with pixels near the boundary that have fewer neighboring pixels to apply the filter. The speaker then demonstrates different types of image filters that can be applied to an original image using OpenCL. Afterwards, he presents a sample implementation of the image convolution function with two nested for loops that go through every pixel in the image and a third loop that goes through the elements in the filter.

  • 01:35:00 In this section, the speaker talks about the image data structure in OpenCL, which is an opaque type maintained as a multi-dimensional structure and used for image data. Unlike integer or pointer types, images cannot be accessed directly through pointers on the device, and their pixel values can be specified as float or integer values. Creating image objects on the device side allows the OpenCL compute units to read and write pixels of the image objects, and it benefits from long, optimized instruction sequences specific to image processing on graphics processors. The speaker also explains how to create a source image buffer, an output image object, and a filter by copying the image and filter data to the device using APIs like clEnqueueWriteImage and clCreateBuffer.

  • 01:40:00 In this section, the presenter introduces the OpenCL sampler, an object used to describe how to access an image. The sampler is created using an API function that takes the context as an argument and defines whether the coordinates will be normalized or not. The addressing mode is also defined, which specifies how image coordinates are treated when they fall out of range, and the filtering mode specifies the filter to apply when coordinates fall between pixels. A kernel function named convolution is also introduced, which takes input and output 2D image objects, a constant float filter array to store the filter values, and the sampler object. The kernel function reads data items from the image object as a vector of four floating-point numbers so it can perform arithmetic on the image data.

  • 01:45:00 In this section, the speaker explains how to perform operations on the image data using four-element floating-point vectors. They go through the process of initializing the filter index, declaring variables for the two-element coordinates, iterating through the filter rows, and calculating the coordinates in two dimensions. Each pixel is read from the image object using the read_imagef function and multiplied by the filter value, with the accumulated result stored in the output image. Lastly, the image is read back using the clEnqueueReadImage function.
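
Putting these pieces together, a convolution kernel of the kind described (image objects, a sampler, read_imagef returning a float4) might be sketched as below; argument order and names are assumptions, not the lecture's exact code.

```cpp
// OpenCL C convolution kernel using image objects and a sampler: read_imagef
// returns a float4, which is multiplied by the filter weight and accumulated.
// The filter is assumed to be square with odd width.
static const char *kConvSource = R"CLC(
__kernel void convolution(__read_only  image2d_t src,
                          __write_only image2d_t dst,
                          __constant   float    *filter,
                          int filterWidth,
                          sampler_t sampler)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    int halfW = filterWidth / 2;
    float4 sum = (float4)(0.0f);
    int idx = 0;                                  // running index into the filter
    for (int i = -halfW; i <= halfW; ++i) {       // filter rows
        for (int j = -halfW; j <= halfW; ++j) {   // filter columns
            int2 coord = (int2)(col + j, row + i);
            sum += read_imagef(src, sampler, coord) * filter[idx++];
        }
    }
    write_imagef(dst, (int2)(col, row), sum);
}
)CLC";
```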

  • 01:50:00 In this section, the code for OpenCL Examples (I) is discussed, which provides filters for use in image processing. The program assigns different sizes and values to each type of filter, and uses helper functions to initialize filter values and read BMP image data from a file. Platform and device discovery is performed before creating the input and output images, as well as the filter buffer. Once initialized, the sampler sets up how pixels falling out of the boundary will be processed before executing the kernel with the appropriate filter parameters. The global size is set to the number of columns and rows in the image.

  • 01:55:00 In this section, the speaker demonstrates an example of using OpenCL to perform a convolution filter on an image. The process involves setting up a kernel that processes the entire image and uses a local size of 8 by 8 work items per group across two data dimensions. The output image is stored on the device side and can be read back to the host using clEnqueueReadImage. The results are then compared to a filtered reference image that was generated by performing the filter on the host side. The two images are visually identical, verifying the results. Finally, the resources on both the host and device side are freed.
OpenCL Examples (I)
  • 2017.09.29
  • www.youtube.com
Lectures on OpenCL Examples (I)
 

A Comparison of SYCL, OpenCL, CUDA, & OpenMP for Massively Parallel Support Vector Classification (IWOCL / SYCLcon 2022)



A Comparison of SYCL, OpenCL, CUDA, & OpenMP for Massively Parallel Support Vector Classification

The video compares the performance of SYCL, OpenCL, CUDA, and OpenMP on different hardware platforms for massively parallel support vector machine classification. The speaker explains the parallelization of matrix-vector multiplication in their implementation, PLSSVM, which supports multi-GPU execution but only binary classification and dense calculations. The hardware used for testing includes Nvidia A100 and RTX 3080 GPUs, an AMD Radeon VII GPU, and an Intel Core i9-10920X CPU. Results show that CUDA is the fastest backend for Nvidia GPUs, while OpenCL is the fastest backend for CPUs. SYCL is user-friendly, and hipSYCL is faster than DPC++ and OpenCL on GPUs. Additionally, the speaker discusses future work, such as investigating performance on FPGAs, adding support for distributed systems via MPI, and using mixed-precision calculations and special machine learning hardware like NVIDIA's tensor cores.

  • 00:00:00 In this section of the video, the speaker introduces a comparison of different parallel programming languages, including SYCL, OpenCL, CUDA, and OpenMP, with a focus on their usage for massively parallel support vector machine (SVM) classification on multi-vendor hardware. The speaker introduces the support vector machine and outlines its use in supervised machine learning for binary classification. However, conventional support vector machines have the problem that they solve a convex quadratic programming problem in a sequential manner. To solve this problem, the speaker uses the least squares support vector machine formulation, which reduces the problem to solving a system of linear equations. The speaker also discusses the implementation details of their library, which is called PLSSVM.

  • 00:05:00 In this section, the speaker explains PLSSVM, which is written in modern C++. Using a single template parameter, it is possible to switch between single- and double-precision floating-point types. The speaker also talks about the parallelization of matrix-vector multiplication in the CG algorithm, as it is the most computationally expensive part of the algorithm. They implemented four different backends (OpenMP, CUDA, OpenCL, SYCL) and support multi-GPU execution. However, currently only binary classification and dense calculations are supported, and multiclass classification is not supported out of the box. Additionally, the OpenMP backend diverges strongly from the other implementations, while for the GPU backends (CUDA, OpenCL, and SYCL) the CG solver was implemented once and reused across all three backends to reduce code duplication and potential bugs.
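
For context, the computational core being parallelized here is a matrix-vector product inside the CG solver; a generic SYCL sketch of that pattern (one work item per output row) is shown below. It is an illustration only, not PLSSVM's actual kernels, and all sizes are assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 512;
  std::vector<float> A(N * N, 1.0f), x(N, 1.0f), y(N, 0.0f);
  sycl::queue q;
  {
    sycl::buffer<float, 1> bufA{A.data(), sycl::range<1>(N * N)};
    sycl::buffer<float, 1> bufX{x.data(), sycl::range<1>(N)};
    sycl::buffer<float, 1> bufY{y.data(), sycl::range<1>(N)};
    q.submit([&](sycl::handler &h) {
      sycl::accessor a{bufA, h, sycl::read_only};
      sycl::accessor v{bufX, h, sycl::read_only};
      sycl::accessor r{bufY, h, sycl::write_only, sycl::no_init};
      // One work item per row of the matrix: each computes a single dot product.
      h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
        const size_t row = idx[0];
        float sum = 0.0f;
        for (size_t k = 0; k < N; ++k)
          sum += a[row * N + k] * v[k];
        r[row] = sum;
      });
    });
  } // y now holds A * x (every entry should equal N)
}
```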

  • 00:10:00 In this section of the video, the hardware used and the methodology for the tests are explained. The focus is on four different platforms, namely the Nvidia A100 and RTX 3080 GPUs, the AMD Radeon VII GPU, and the Intel Core i9-10920X CPU, and the results for these are discussed. Data point scaling for the Nvidia A100 and RTX 3080 and data point and feature scaling for the AMD Radeon VII are examined, and it was found that the runtimes increase in a similar way with the number of data points on both NVIDIA GPUs. Among the backends, CUDA is the fastest, followed by OpenCL, and the roofline model generated with Nsight Compute showed that the hierarchical kernel formulations tend to be more memory-bound than their ND-range counterparts. Overall, the runtimes for AMD are higher than those for NVIDIA.

  • 00:15:00 In this section, the video discusses the performance comparison of SYCL, OpenCL, CUDA, and OpenMP on different hardware platforms. The Nvidia GPUs showed no runtime increase, and the fastest backend was OpenCL. However, the AMD GPU had worse performance than expected, possibly because the blocking sizes were not fine-tuned. The Intel Core i9 CPU behaved similarly to the Nvidia GPUs, with OpenCL being the fastest backend; DPC++ was the fastest except for small data sets, where OpenMP was faster. The DPC++ hierarchical kernel formulation was slower than its ND-range counterpart on all hardware platforms, indicating potential for optimization. Lastly, the OpenCL JIT compilation overhead was lowest on Nvidia GPUs and highest on the Intel Iris Xe Max GPU, but with built-in caching the overhead can be reduced in subsequent executions.

  • 00:20:00 In this section of the transcript, the presenter discusses the results of their testing of various parallel programming languages and frameworks for targeting hardware from different vendors, such as NVIDIA, AMD, and Intel. They note that if you only need to target NVIDIA GPUs, CUDA is still the best option, as it had the fastest performance in their tests. For targeting only CPUs, OpenMP is a good start, although it did not have the best performance in their tests. If you need to target hardware from different vendors, OpenCL or SYCL is recommended, with SYCL the better choice if you are implementing a new algorithm from scratch because it is more user-friendly. hipSYCL was faster than DPC++ and OpenCL on GPUs, and they plan to optimize their OpenMP backend and investigate other SYCL implementations like ComputeCpp in the future.

  • 00:25:00 In this section, the speaker concludes the video by discussing future work and improvements to their support vector classification implementation using various parallel computing frameworks. They plan to investigate performance on different hardware such as FPGAs, add support for distributed systems via MPIs, and explore the impact of using mixed precision calculations and special machine learning hardware like NVIDIA’s tensor cores. They believe these improvements will increase the speed and efficiency of their implementation on larger datasets.
A Comparison of SYCL, OpenCL, CUDA, & OpenMP for Massively Parallel Support Vector Classification
  • 2022.05.22
  • www.youtube.com
Presented at: IWOCL / SYCLcon 2022.Additional Information and Slides: https://www.iwocl.org/iwocl-2022/programIWOCL NewsletterSignup to receive regular updat...
 

Reaching Even Richer C++ in OpenCL Kernels with use of libclcxx (IWOCL / SYCLcon 2022)



Reaching Even Richer C++ in OpenCL Kernels with use of libclcxx

The video discusses the use of libclcxx to enable the integration of C++ libraries into OpenCL kernel development. The project integrates type traits, an essential library for metaprogramming in C++, with the goal of exposing more C++ functionality to kernel developers. The video showcases how the type traits library can optimize the performance of OpenCL kernels through its ability to manipulate address space and vector types. The video encourages developers to experiment with the library and contribute, reducing development cycles while obtaining maximum compatibility with C++. The library provides Doxygen documentation in a style similar to the C++ reference pages, making it easy for developers to navigate the new functionality.

  • 00:00:00 In this section, Anastasia Stulova discusses the use of libclcxx to enable the use of C++ libraries in OpenCL kernel development. While the C++ for OpenCL kernel language has C++ capabilities, it lacks library support, making it important to address this limitation. As a result, the libclcxx project was created, integrating libcxx, with the goal of exposing more C++ functionality to OpenCL kernel developers. Additionally, Stulova argues that type traits are an essential library for enabling full metaprogramming in C++, and the project extends namespace std to provide specializations of existing traits while adding new traits for OpenCL vector types, among others. The new library provides Doxygen documentation in a style similar to the C++ reference pages, making it easier for developers to navigate the new functionality.

  • 00:05:00 In this section, the video discusses how the use of the type traits library can enhance the performance of OpenCL kernels, specifically with regard to address space and vector traits. The video provides examples demonstrating how the library can be used to create a template function for different pointer types, and how removing the address space from a type can solve problems in the OpenCL environment. Additionally, the video shows how the inclusion of vector size traits can make computations more efficient and highlights how the implementation of reduction algorithms can be adapted for vector types. Overall, the use of type traits in OpenCL kernels enables even richer C++ programming.

  • 00:10:00 In this section, the speaker explains how to define an element-addition function in OpenCL kernels using the vector size as a condition. They clarify that a different implementation is chosen for different vector sizes, and if the type passed in is not a vector type, a single item from the buffer is returned. The speaker also invites developers to experiment and contribute in order to obtain maximum compatibility with C++ and reduce development cycles. They request feedback on missing features or bugs and encourage joining the discussion on an existing issue on the project page.
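
As a plain ISO C++ illustration of that dispatch pattern, one can branch on a vector-size trait with if constexpr. The trait names is_vector_v and vector_size_v below are hypothetical stand-ins, not libclcxx's real identifiers.

```cpp
#include <type_traits>

// Hypothetical traits standing in for the vector-type traits described in the talk.
template <typename T> struct is_vector : std::false_type {};
template <typename T> struct vector_size : std::integral_constant<int, 1> {};
template <typename T> inline constexpr bool is_vector_v = is_vector<T>::value;
template <typename T> inline constexpr int  vector_size_v = vector_size<T>::value;

// Choose a different implementation depending on the vector size; if the type
// is not a vector type, just return one item from the buffer.
template <typename T>
T add_elems(const T *buf) {
  if constexpr (!is_vector_v<T>) {
    return buf[0];                                 // scalar case
  } else if constexpr (vector_size_v<T> == 4) {
    return buf[0] + buf[1] + buf[2] + buf[3];      // 4-wide specialisation
  } else {
    return buf[0] + buf[1];                        // 2-wide specialisation
  }
}
```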
Reaching Even Richer C++ in OpenCL Kernels with use of libclcxx
  • 2022.05.22
  • www.youtube.com
Presented at: IWOCL / SYCLcon 2022.Additional Information and Slides: https://www.iwocl.org/iwocl-2022/programIWOCL NewsletterSignup to receive regular updat...
 

SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL (IWOCL / SYCLcon 2020)



SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL

The hipSYCL project is an open-source implementation of SYCL that targets GPUs through the HIP programming model instead of OpenCL. It consists of a compiler component, the SYCL interface, and the SYCL runtime. The SYCL compiler identifies kernels, handles local memory allocation, and implements a signaling mechanism. The dispatch function creates specific items based on user-provided kernels, and optimized functions can be defined with rocPRIM. The future direction is to allow multiple backends to be active and to remove restrictions imposed by the static compilation model. The operation submission model is transitioning to batch submission for higher task throughput, and hipSYCL is interoperable at the source code level, enabling mixing and matching with HIP and CUDA. As an open-source project, contributors are welcome.

  • 00:00:00 In this section, the speaker discusses the motivation behind the hipSYCL project, which is an open-source SYCL implementation that directly targets GPUs through the HIP programming model instead of using OpenCL. The aim is to optimize code and make it easier to use vendor-provided profilers and debuggers while avoiding the adoption friction that may occur with other programming models. The speaker also compares hipSYCL to other solutions available for SYCL and CUDA interoperability, placing hipSYCL at the CUDA interoperability end of the scale due to its use of the HIP programming model.

  • 00:05:00 In this section, the video explains how hipSYCL works and its three main components: a compiler component, the SYCL interface, and the SYCL runtime. The compiler component allows the compiler to understand both CUDA and SYCL, making interoperability at the source code level possible. The SYCL runtime takes care of data management and scheduling of tasks, while the SYCL interface consists of the classes and functions in the sycl namespace. Additionally, the video mentions that SYCL is a flexible standard that allows implementations covering many different use cases to be built. To target accelerators, a dedicated compiler component is needed, which identifies kernels and compiles them for the accelerator.

  • 00:10:00 In this section of the video, the speaker discusses how the hipSYCL compiler component functions. They explain that the compiler must identify the kernels and determine which code needs to be emitted for the device, then handle how local memory is allocated for the kernels. SYCL-specific diagnostics are mentioned as a priority for future development. The speaker explains that using the hipSYCL compiler component is relatively simple thanks to a compiler wrapper called syclcc, which hides the complexity of invoking and linking the compiler correctly and setting include paths. They discuss how invoking kernels requires a bit of trickery and explain how it is done. Additionally, hipSYCL currently uses a signaling mechanism based on coroutines and HIP events for dynamic out-of-order processing, and the downside of this approach is discussed.

  • 00:15:00 In this section, the speaker discusses how the dispatch function is used to create a specific item based on the user-provided kernel, and how parallel_for in SYCL can be implemented by instantiating the dispatch function with the user-provided kernel. Initially, however, all the code is parsed as host code, where the user-provided kernel is a host lambda that cannot be invoked directly, so a dummy HIP kernel attribute is added, which is only replaced by the proper kernel attribute once the initial parsing is complete and the hipSYCL clang plug-in has taken over. They achieve good memory performance for both hipSYCL and CUDA, and using hipSYCL they can achieve HIP and CUDA interoperability at the source code level.

  • 00:20:00 In this section, the speaker discusses how to implement an optimized reduction using rocPRIM with hipSYCL. They suggest defining an optimized function guarded by the HIPSYCL_PLATFORM_CUDA or HIPSYCL_PLATFORM_ROCM macros and marked as host and device. When compiling for the target platform, the optimized function is called, while a fallback function is called otherwise. The speaker explains that vendor-optimized libraries like rocPRIM achieve faster performance because they have more knowledge about the target hardware, and although hipSYCL is still pre-conformance and missing a few features like images and OpenCL interoperability, it is already usable for real-world applications. However, an nd_range parallel_for on the CPU backend is slow due to an issue inherent to pure-library SYCL implementations.
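
The specialization pattern described can be sketched as follows. The platform macro names follow the talk's description of hipSYCL and should be treated as assumptions, and the "optimized" branch is only a placeholder for a call into a vendor library such as rocPRIM.

```cpp
// Pick a vendor-specific code path when compiling for a GPU backend and fall
// back to a portable implementation otherwise. Macro names are assumptions.
template <typename T>
T block_reduce(const T *data, int n) {
#if defined(HIPSYCL_PLATFORM_CUDA) || defined(HIPSYCL_PLATFORM_ROCM)
  // Compiling for a GPU backend: this is where a rocPRIM/CUB reduction
  // primitive would be invoked for hardware-specific performance.
  T sum = T{0};
  for (int i = 0; i < n; ++i) sum += data[i];
  return sum;
#else
  // Portable fallback used on all other backends (e.g. the CPU backend).
  T sum = T{0};
  for (int i = 0; i < n; ++i) sum += data[i];
  return sum;
#endif
}
```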

  • 00:25:00 In this section, the speaker discusses the performance differences between a basic parallel_for or hierarchical parallel_for and an nd_range parallel_for with hipSYCL on a CPU. The latter results in a massive performance loss because it requires launching as many threads as there are work items in each work group. The speaker then talks about the future direction of hipSYCL, which is to create a new runtime that allows arbitrary backends to be active simultaneously and removes restrictions imposed by the static compilation model. They are also transitioning to an n-to-m mapping in which n SYCL queues map to m backend queues to optimize hardware utilization. Additionally, there will be a strict separation between the new runtime and the existing SYCL interface for easier maintenance and experimentation.

  • 00:30:00 In this section, the speaker discusses the improvements being made to the operation submission model in hipSYCL. They are transitioning from a signal-based submission model to a batch submission model, where the signal to the runtime that things have completed only happens once per batch of operations, allowing for higher task throughput. The speaker explains the process in which the operations are submitted and then processed by the DAG builder, which collects and orders them. The DAG scheduler then assigns execution queues to the operations, which then go to the backend executors to execute the kernels and determine what synchronization operations are necessary. The cost estimate of this configuration then goes back to the DAG scheduler to optimize further or submit the operations as they are. The speaker also provides information on how to obtain hipSYCL through their package repositories and installation scripts.

  • 00:35:00 In this section, it is explained that hipSYCL is an implementation of SYCL for CPUs, NVIDIA GPUs, and AMD GPUs. It is built on top of the low-level vendor APIs HIP and CUDA, which makes it interoperable at the source code level. This allows developers to mix and match with HIP and CUDA, making it suitable for a range of HPC and other use cases that require access to the latest low-level hardware optimizations or vendor-optimized libraries. Additionally, it allows the creation of highly optimized code paths for specific hardware, and kernel performance is expected to be on par with regular HIP or CUDA. As an open-source project, contributors are always welcome, and interested individuals can learn more on the GitHub page.
SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL
  • 2020.04.28
  • www.youtube.com
This video was presented at the online version of IWOCL / SYCLcon 2020.Authors: Aksel Alpay and Vincent Heuveline (Heidelberg University) Additional Informat...
 

SYCL: the future is open, parallel and heterogeneous (Core C++ 2022)



SYCL: the future is open, parallel and heterogeneous

In this video about SYCL programming, the speaker highlights the need to go up the abstraction level to increase productivity and attract more developers, as complex models require increased compute power, which is met by accelerator systems. The importance of software portability and OneAPI is emphasized, as it allows devices to work on CPUs, GPUs, and other devices. The benefits of SYCL, an open, parallel, and heterogeneous programming model, are also discussed, with the speaker highlighting the numerous online resources and tools available to optimize code and improve performance. The speaker encourages viewers to visit oneapi.io and their YouTube channel for resources and support.

  • 00:00:00 In this section, the speaker discusses the need to go up the abstraction level to increase productivity and attract more developers. As models become more complex, the demand for compute power increases rapidly. The speaker mentions the ninja gap, which refers to the difficulty of finding and hiring lower-level experts such as assembly or CUDA developers. Going up the level of abstraction typically costs some performance, which is why AI accelerators such as GPUs and Gaudi are necessary to meet the increasing demand for compute power.

  • 00:05:00 In this section, the speaker discusses the need for accelerators to achieve the fastest performance, but notes that one accelerator is not enough to cover the full range of applications. Heterogeneous systems are required, combining CPUs with accelerators such as GPUs and ASIC-like devices. The speaker emphasizes the importance of software portability and the ability of code to run on any machine or device, regardless of the hardware used, without needing to recode, recompile, or rebuild for each platform or operating system. OneAPI is an industry effort to streamline software stacks, unifying libraries and tools to ensure software portability that is open, free, and cross-device, which means software stacks can work on CPUs, GPUs, and other devices. OneAPI offers a base toolkit that has everything needed to get started.

  • 00:10:00 In this section, the speaker discusses the power of the oneAPI base toolkit and the Data Parallel C++ (DPC++) compiler, which is designed to add heterogeneity to C++. By using predefined policies, you can easily target the CPU or GPU without needing to know many of the lower-level details of OpenCL or CUDA. The compiler provides the ability to manage the disjoint memories involved, handle exceptions, and express parallel computation.

  • 00:15:00 In this section of the video, the speaker explains that good heterogeneous computing support requires three things. The first is the ability to discover a device and obtain information about it; here the speaker shows a simple piece of code that detects and lists all the devices connected to the system. The second requirement is real-time information about the status of the devices, which allows utilization and temperature monitoring and also enables users to switch between the CPU and GPU. The third requirement is the ability to exchange memory efficiently and seamlessly between the device and the host, which is achieved in two main ways in SYCL: buffers and unified shared memory.
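
The device-discovery step can be reproduced with standard SYCL 2020 queries, for example as below; this is a generic sketch, not the speaker's exact code.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// List every SYCL device visible on the system along with a few properties.
int main() {
  for (const auto &dev : sycl::device::get_devices()) {
    std::cout << dev.get_info<sycl::info::device::name>() << "\n"
              << "  vendor:        " << dev.get_info<sycl::info::device::vendor>() << "\n"
              << "  compute units: " << dev.get_info<sycl::info::device::max_compute_units>() << "\n"
              << "  global memory: "
              << dev.get_info<sycl::info::device::global_mem_size>() / (1024 * 1024)
              << " MiB\n";
  }
}
```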

  • 00:20:00 In this section, the speaker explains the benefits of using SYCL, an open, parallel, and heterogeneous programming model. By adding SYCL to C++, one can write code that can run on multiple devices, including GPUs, CPUs, ARM, and FPGAs. The speaker mentions that there are numerous online resources and examples of how to make SYCL work with multiple devices. Intel Advisor is a tool that the speaker recommends, which can help optimize code and provides the option to offload specific functions to a GPU. The speaker emphasizes the importance of using this tool, which can make the code run much faster and improve the overall performance of the program.

  • 00:25:00 In this section, the speaker promotes using SYCL as the fastest way to make code available on multiple devices by multiple vendors, and encourages viewers to visit the oneapi.io website and his YouTube channel for resources and support. He also mentions the possibility of SYCL being faster than CUDA in certain examples, but emphasizes the main advantage of SYCL being its portability, as it allows for coding to a single platform that can then run on various devices, freeing up the need to make multiple coding decisions for different devices. Additionally, the speaker offers to answer any questions and provide resources, such as Jupyter notebooks and access to the Intel Devcloud, to help users get started with SYCL.
Core C++ 2023
  • Oleh Zasadnyy, GDG Lviv
  • corecpp.org
Core C++ 2023
 

GPU acceleration in Python



GPU acceleration in Python

The video explains how to achieve GPU acceleration in Python by leveraging the power of graphics processing units, which can provide a speedup of up to 10x with data parallelism. The two standards for GPU computing, OpenCL and CUDA, are briefly introduced, and the video demonstrates the use of PyOpenCL and PyCUDA for matrix multiplication in Python. The speaker explains the use of global memory and the kernel for matrix multiplication, and also discusses the algorithm used for computing one element of the matrix-matrix product. The code for GPU acceleration in C and Python is discussed, with emphasis on understanding the internal representation of matrices and memory allocation. The exercises in the lecture provide a basis for further exploration of GPU computing.

  • 00:00:00 In this section, the video introduces GPU computing as a way to achieve data parallelism and accelerate programs by leveraging the power of graphics processing units, which can process billions of floating-point operations per second and provide a speedup of a factor of 10. The two standards for GPU computing, OpenCL and CUDA, are briefly introduced, with examples of high-end GPUs such as Kepler, Pascal, and Volta from Nvidia. The massively parallel aspect of GPU computing is emphasized as a way to keep the GPU occupied, with scheduling of sufficiently many threads often required. The video also mentions the potential applications of hardware accelerators in scientific and engineering fields.

  • 00:05:00 In this section of the video, the speaker discusses the evolution of GPU acceleration, from Kepler, which had a peak performance of about one teraflop, to the current generation, which exceeds 7.9 teraflops. The programming model of massively parallel computing follows a single instruction, multiple data approach, and the data is divided into blocks of threads, with every block running at least one thread. The speaker touches upon the Open Computing Language (OpenCL), an open standard for parallel programming that covers multi-core and multi-threaded computing in addition to GPU computing.

  • 00:10:00 In this section, the speaker discusses the use of OpenCL and PyOpenCL for GPU acceleration in Python. OpenCL is a general standard that was originally supported on NVIDIA graphics cards but whose support there has since been abandoned; however, it works well on MacBooks, as it was initiated by Apple. PyOpenCL simplifies OpenCL programming by reducing the boilerplate code and allowing easier focus on the kernel. It also supports NumPy arrays, although the usable data structures are more limited because of the data parallelism. The speaker demonstrates the use of PyOpenCL for matrix multiplication on two integer matrices for testing purposes.

  • 00:15:00 In this section, the speaker explains how GPUs can be used for matrix multiplication in Python using OpenCL. They start by importing the necessary libraries, including PyOpenCL and NumPy. The speaker also notes that the graphics card used did not support 64-bit arithmetic, so they opted for 32-bit floating-point arithmetic. They then define the matrices, generate random integers, and convert them to 32-bit float matrices. The speaker then explains the boilerplate code required to define the counterparts of the matrices on the device and to create queues. Finally, the speaker defines the kernel for matrix multiplication, which is compiled when the program is run, and demonstrates how to multiply the matrices on the GPU.

  • 00:20:00 In this section, the speaker explains the meaning of "global" in the context of GPU acceleration in Python: it indicates that the matrices reside in the graphics card's global memory, giving every thread access to the data. The dimensions are passed as short integers, and every thread has a unique identification number. The matrix multiplication benefits from GPU acceleration because almost every part can be done independently through matrix row and column indexing. The matrices are stored row-wise, C style, as one long array, and pointers determine their locations in memory.
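
The indexing described here, row-major storage with one thread per output element, looks like this in plain C/C++ terms; square n-by-n matrices are assumed for brevity.

```cpp
#include <cstddef>

// One element of C = A * B, as a single GPU thread would compute it; row-major
// ("C-wise") storage means element (i, j) of an n-by-n matrix sits at offset i*n + j.
float dot_element(const float *A, const float *B, std::size_t n,
                  std::size_t i, std::size_t j) {
  float sum = 0.0f;
  for (std::size_t k = 0; k < n; ++k)
    sum += A[i * n + k] * B[k * n + j];   // walk row i of A and column j of B
  return sum;
}
```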

  • 00:25:00 In this section, the speaker explains the C algorithm for computing one element of the matrix-matrix product and the potential speedup for matrix-matrix multiplication, which is in general a cubic operation in the dimensions of the matrices. With GPUs and kernel launches, however, the work performed per thread reduces to a linear operation, leading to a massive reduction in cost and significant speedups. The speaker also mentions that while the simplest way to perform the operation is through Python, without the need to compile explicitly, the actual algorithms used on supercomputers make use of shared memory on the GPUs and a compilation process that goes beyond what is discussed in the video. The speaker emphasizes that PyCUDA and PyOpenCL enable programmers to develop code at a higher level, without having to worry about the lower-level compiling and linking processes.

  • 00:30:00 In this section, the video talks about installing CUDA for GPU acceleration in Python. To use CUDA, a user must have an NVIDIA GPU and drivers installed. The lecture walks through instructions to check whether the system is set up properly, and the presenter notes that the technique allows highly interactive parallel computing. The lecturer explains that one can get good performance out of a high-end laptop with a good graphics card. The course then showcases matrix multiplication as an example. The presenter notes that one typically has a program running on the CPU, with the GPU accelerating only the portions that are computationally intensive. Finally, the lecture discusses the allocation of memory for the corresponding matrices on the GPU and the initialization of the result matrix, noting that these allocations are simpler with NumPy than they would be in C. Additionally, no compilation is needed at this stage.

  • 00:35:00 In this section, the code for GPU acceleration written in C is discussed. The matrices in C are stored in a row-wise manner, and the code exploits this fact. The syntax for launching blocks of threads in a two-dimensional structure is used, with the thread indices calculated explicitly. A loop with explicit bracketing is used to avoid pointer arithmetic. The function takes the dimensions and pointers to the data, which include matrices A and B for the input and the result matrix C_gpu. The memory copies to the device and back to the host must complete before the results are printed, because printing from within kernel functions executed on the GPU may not be possible. Finally, the section closes with a brief comparison of PyCUDA and PyOpenCL.

  • 00:40:00 In this section, the speaker discusses GPU acceleration in Python, which is geared towards CUDA but also has efforts underway to run on other GPUs. It takes care of both the compilation and execution, making subsequent runs much faster. It is possible to develop GPU kernels in Python while staying within the scripting environment; however, one must understand how GPUs work and how matrices are represented internally using C syntax. The exercises in the lecture are open-ended and can provide a basis for a second project exploring GPU computing. Overall, this was an introductory section that aimed to give an idea of how programmers and software developers can develop functions that run on the GPU.
GPU acceleration in Python
  • 2022.02.10
  • www.youtube.com
This lecture introduces PyOpenCL and PyCUDA to define and run functions on General Purpose Graphics Processing Units (GPUs). The running example is a basic ...
 

OpenCL 3.0 Launch Presentation (IWOCL / SYCLcon 2020)



OpenCL 3.0 Launch Presentation

The launch of OpenCL 3.0 is discussed in this video, with a focus on its importance for low-level parallel programming in the industry. OpenCL 3.0 does not add new functionality to the API, but provides an ecosystem realignment to enable OpenCL to reach more developers and devices. The presenter also discusses the addition of extensions for DSP-like processors, the roadmap for future functionality, and the growing ecosystem of open-source kernel language compilers that can generate SPIR-V kernels for OpenCL and Vulkan. Feedback from users is encouraged to help finalize the spec as the working group prepares for the first wave of implementations over the next few months.

  • 00:00:00 In this section, Neil Trevett of NVIDIA and the Khronos Group discusses the launch of OpenCL 3.0 and the importance of the standard for low-level parallel programming in the industry. OpenCL is widely used by GPU vendors and increasingly used by applications, engines, and libraries. The launch of OpenCL 3.0 provides an ecosystem realignment rather than adding new functionality to the API, with the intent of enabling OpenCL to reach even more developers and devices. OpenCL 3.0 makes all 2.x functionality beyond OpenCL 1.2 optional, allowing vendors to focus on shipping the functionality their customers need, and resets the opportunity to raise the bar on core functionality.

  • 00:05:00 In this section, it is explained that OpenCL 3.0 is a new specification that ships with a unified API designed to query all OpenCL 2.x functionality, with added extensions for DSP-like processors to transfer 2D and 3D data between global and local memories flexibly and asynchronously via Direct Memory Access (DMA) transactions. Although OpenCL 3.0 does not include the OpenCL C++ kernel language specification, implementers are encouraged to use the open-source C++ for OpenCL front-end compiler to generate SPIR-V kernels by mixing OpenCL C with much of C++17. The roadmap for OpenCL is to ship new functionality first as extensions for industry adoption, allowing them to mature and be proven before being folded into future core specifications. The OpenCL working group also sees profiles as a vital tool to balance implementation flexibility with application portability and avoid fragmentation in target markets.
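
In practice, making 2.x functionality optional means applications query the device before relying on it. A hedged API sketch follows; CL_DEVICE_GENERIC_ADDRESS_SPACE_SUPPORT is one of the OpenCL 3.0 feature queries and is guarded here in case older headers lack it.

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

  // The device version string tells you which core version the driver reports.
  char version[128] = {0};
  clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, nullptr);
  std::printf("Device version: %s\n", version);

#ifdef CL_DEVICE_GENERIC_ADDRESS_SPACE_SUPPORT
  // Example of an optional-feature query introduced with OpenCL 3.0.
  cl_bool generic_as = CL_FALSE;
  clGetDeviceInfo(device, CL_DEVICE_GENERIC_ADDRESS_SPACE_SUPPORT,
                  sizeof(generic_as), &generic_as, nullptr);
  std::printf("Generic address space: %s\n", generic_as ? "yes" : "no");
#endif
  return 0;
}
```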

  • 00:10:00 In this section, the presenter discusses the growing ecosystem of open-source kernel language compilers, built on Clang and LLVM, that can generate SPIR-V kernels for OpenCL or Vulkan, or for further translation into shaders that run on other APIs such as Metal. This would enable OpenCL applications on Apple platforms without needing to use OpenCL drivers. The presenter also mentions the OpenCLOn12 project, which translates LLVM-generated SPIR-V kernels through an open-source conversion pipeline to DXIL, enabling language compilers to innovate independently of runtimes. The OpenCL 3.0 spec is provisional, and feedback from users is encouraged to help finalize it as the working group prepares for the first wave of implementations over the next few months.
OpenCL 3.0 Launch Presentation
  • 2020.05.07
  • www.youtube.com
This video was presented as part of the panel discussion at the online version of IWOCL / SYCLcon 2020, and was presented by Neil Trevett, Khronos Group Pres...