
 

Lecture 10. Measurement and Timing

In this video, the speakers discuss various aspects of measurement and timing in computing, including external factors that can affect accuracy. They emphasize the need for models and a clear understanding of the data, reducing variance to compensate for errors, and using appropriate timing calls to ensure reliability. They also discuss challenges in measuring software performance accurately, such as dynamic frequency and voltage scaling and non-deterministic factors, while highlighting tools like performance modeling, timing calls, hardware counters, and simulators. Finally, they caution against relying on a single tool, recommending triangulation and adaptation as techniques to ensure accurate results.

The speaker explains the importance of reliable performance measurement in performance software engineering. He recommends using statistics to measure performance accurately and discusses various summary statistics, such as the mean, that can provide useful measures of performance in different contexts. He points out that taking the arithmetic mean of ratios is not a valid approach and suggests using the geometric mean instead. The speaker emphasizes the importance of choosing the correct way to aggregate data when comparing programs, and suggests taking a low-order statistic, such as the minimum or the 10th percentile, from multiple runs to compare performance more accurately. He also discusses different methodologies for measuring and comparing performance, including head-to-head comparison and fitting a model to derive statistics, but warns about the issue of over-fitting in modeling. Overall, the video stresses the importance of reliable performance measurement for improving program performance.

  • 00:00:00 In this section, the speaker discusses a study on sorting code that was done by one of his former students. The code uses the time.h header file to get access to the clock_gettime routine for timing measurements. A sorting routine is timed, and the elapsed time is computed and printed out. The graph of the measured run times shows that the sorting routine mostly follows N log N growth, but some measured times deviate from that trend, even for the largest arrays.
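
A minimal sketch of the timing pattern described above, using clock_gettime from time.h; the array setup and the sort_routine placeholder are illustrative assumptions, not the student's actual code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Placeholder for the routine being measured.
static void sort_routine(int *a, size_t n) {
  // ... sorting code under test ...
}

int main(void) {
  size_t n = 1 << 20;
  int *a = malloc(n * sizeof(int));
  for (size_t i = 0; i < n; i++) a[i] = rand();

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);   // timestamp before the work
  sort_routine(a, n);
  clock_gettime(CLOCK_MONOTONIC, &end);     // timestamp after the work

  // Elapsed seconds = whole-second difference + nanosecond difference.
  double elapsed = (end.tv_sec - start.tv_sec)
                 + (end.tv_nsec - start.tv_nsec) * 1e-9;
  printf("elapsed: %f s\n", elapsed);
  free(a);
  return 0;
}
```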

  • 00:05:00 In this section, a professor discusses the importance of having a model for the data being observed. He presents an example where the measured data deviates significantly from what was expected and asks the students to suggest possible hypotheses for this deviation. While some thought it might be an issue with the cache or precision with the timing, the professor points out that there could be unrelated processes running on the machine which are causing the deviation. In this case, the issue is with multi-tenancy, where the machine is being used by more than one process simultaneously. The professor emphasizes the need for quality measurements and having a clear understanding of what the data means.

  • 00:10:00 In this section, the speakers discuss measurement and timing in computing, and how external factors can affect the accuracy of measurements. They specifically focus on how clock-frequency changes can happen due to system temperature and power-saving techniques, which can significantly affect measurement results. The speakers introduce dynamic frequency and voltage scaling, a technique that reduces power by adjusting the clock frequency and the supply voltage of the transistors. They emphasize that it is crucial to consider external factors when taking measurements to avoid inaccurate results.

  • 00:15:00 In this section, the speaker discusses the challenges of measuring software performance that arise from the power equation electrical engineers use: power equals C·V²·F, where C is the dynamic capacitance, V is the supply voltage, and F is the clock frequency. To reduce power and heat, one can reduce both the frequency and the voltage, resulting in roughly a cubic reduction in power. This presents a challenge for measuring software performance reliably, and the speaker suggests quiescing systems by getting rid of some of the noise to take accurate measurements. They also discuss tools for measuring software performance, such as performance modeling.
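
Written out, the relationship the speaker describes is the dynamic-power equation below; the cubic reduction follows from the additional assumption that the supply voltage scales roughly with the clock frequency.

```latex
P_{\mathrm{dyn}} \;\approx\; C\,V^{2} f,
\qquad V \propto f \;\Longrightarrow\; P_{\mathrm{dyn}} \propto f^{3}.
```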

  • 00:20:00 In this section, the speaker discusses the importance of reducing variance when it comes to measurement and timing, especially in the context of performance engineering. By reducing variability, it is possible to compensate for systematic and random measurement errors and to require fewer trials to compare programs. The speaker also points out various sources of variability in computer systems, including background jobs, interrupts, and thread placement, among others. Finally, he emphasizes the importance of focusing on reducing variance first before attempting to make changes to the system, as measuring during high variance may lead to unreliable results.

  • 00:25:00 In this section, the speaker talks about the concept of hyper-threading in processors and how it can lead to inaccurate measurements in cloud systems. Hyper-threading runs two instruction streams through one functional unit, appearing like two processors, but actually provides only about a 20% speedup rather than two separate processors' worth. The speaker also discusses other techniques such as Turbo Boost and quiescing a system, which can affect measurement outcomes significantly. By turning off hyper-threading and Turbo Boost and quiescing all the daemons, the speaker and his group were able to achieve remarkably reliable measurements that were less than 1% slower than the minimum runtime.

  • 00:30:00 In this section, the speaker talks about various factors that can affect the determinism of a program running on modern hardware. While there are measures one can take to make the program more deterministic, it is impossible to eliminate the non-deterministic factors completely. Many factors that arise during program execution, such as caching, interrupts, threading, and branch prediction, are actually deterministic processes. However, randomness in the hardware, such as a memory error that must be corrected, is outside the control of the user. The speaker also notes that code alignment can make a difference: any change made to the code may shift the cache alignment of everything that follows, affecting the program's measured performance.

  • 00:35:00 In this section, the speaker discusses how small changes can have a big impact on a program's performance. For example, inserting one byte shifts the address of everything that follows it, and changing the order in which .o files appear on the linker command line can have a greater effect than the difference between -O2 and -O3. To address these issues, compilers are recognizing the problem and doing more alignment. LLVM has switches that can align all functions and all blocks in a function, but alignment can also slow the program down. In addition, a program's name can affect its speed, because the executable's name ends up in an environment variable that lands on the call stack.

  • 00:40:00 In this section, the speaker discusses various tools and methods for measuring software performance, including measuring the program externally with the time command, putting timing calls into the program with clock_gettime or other methods, interrupting the program with gdb to identify which routine is taking the most time, exploiting hardware and OS support such as that used by the perf toolset, or simulating the program to get a deeper understanding, though at a slower speed. The speaker explains the difference between elapsed time, user time, and system time when using the time command, and how they may not add up to the total time due to processor allocation.

  • 00:45:00 In this section, the speaker discusses the different types of timing and measurement, including wall-clock time, processor time spent in user mode, and time spent in the kernel. The timing call recommended for use is clock_gettime with the CLOCK_MONOTONIC option, which is guaranteed never to run backwards. While it may take a non-deterministic amount of time to run, it is more reliable than other timers such as rdtsc, which may give different answers on different cores and may effectively run backwards. The speaker warns against using gettimeofday.

  • 00:50:00 In this section, the speaker discusses various ways to measure and time events in programming, including using clock_gettime and measuring with the time command. They caution that over time, techniques may change and engineers may need to adapt. The speaker also suggests interrupting code at random moments as a simple profiling technique and mentions that large companies such as Facebook use this method. Additionally, the speaker introduces the idea of hardware counters and the library libpfm4, which allows access to these counters on a per-process basis. They emphasize that an engineer may not always need surgically precise tools and that sometimes a simple tool can suffice for the job.

  • 00:55:00 In this section, the speaker discusses the different ways of collecting measurements, including hardware counters and simulators. They caution that many hardware counters are poorly documented and that the hardware may time-multiplex the counters if more than four or five are used. Simulators are praised for their repeatability but may not model everything, such as the details of the cache. The speaker recommends triangulation, using at least two different measurements to ensure accuracy. They also stress the importance of having a performance model to guide data interpretation.

  • 01:00:00 In this section, the speaker explains the process of performance software engineering, which involves taking a program, making changes to it, and measuring its performance to produce a faster program. However, reliable performance measurement is vital for making small changes that add up and accurately drawing conclusions. Therefore, the speaker recommends using statistics to achieve an accurate picture of what is going on. He also offers a puzzle involving measuring the raw performance of a deterministic program, and concludes that using the minimum is the best way to reject noise and determine the program's underlying hardware performance.

  • 01:05:00 In this section, various types of summary statistics are discussed, including the mean, which can provide useful measures of performance in different contexts. When evaluating web servers, CPU utilization and total task-completion time can be summarized using the arithmetic mean, while wall-clock time and percentile behavior may be more relevant for optimizing request satisfaction. However, when summarizing ratios, such as comparing the performance of programs A and B, taking the arithmetic mean is not a valid approach, even though it is commonly misused. This is because the arithmetic mean of the ratios can give contradictory answers depending on which program is used as the baseline, as illustrated in the example given in the video.

  • 01:10:00 In this section, the speaker discusses the issues with comparing ratios and means when measuring performance. They demonstrate that taking the arithmetic mean of the ratios can lead to suspect results, and that it is better to use the geometric mean. Additionally, they note that when comparing rates, the harmonic mean is often useful. The importance of choosing the correct way to aggregate data when comparing programs is emphasized, and the speaker suggests taking a low-order statistic, such as the minimum or the 10th percentile, from multiple runs to compare performance more accurately.
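
A small illustration of the difference between the two aggregations; the speedup ratios below are made-up numbers, not the lecture's example. Each slowdown is balanced by an equal speedup, which the geometric mean reports as 1.0 while the arithmetic mean suggests a large slowdown.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
  // Hypothetical A-over-B time ratios for four benchmarks.
  double ratio[] = {0.5, 2.0, 0.25, 4.0};
  int n = sizeof(ratio) / sizeof(ratio[0]);

  double arith = 0.0, log_sum = 0.0;
  for (int i = 0; i < n; i++) {
    arith += ratio[i] / n;        // arithmetic mean of the ratios
    log_sum += log(ratio[i]);     // accumulate logs for the geometric mean
  }
  double geo = exp(log_sum / n);  // geometric mean = exp(mean of logs)

  // The arithmetic mean (1.6875) suggests A is much slower than B,
  // even though every slowdown is exactly offset by an equal speedup;
  // the geometric mean (1.0) reports the balanced result.
  printf("arithmetic mean of ratios: %f\n", arith);
  printf("geometric mean of ratios:  %f\n", geo);
  return 0;
}
```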

  • 01:15:00 In this section, the speaker discusses different methodologies for measuring and comparing performance. One approach is to compare a low-order statistic, such as the cheapest 10% of runs, and do noise reduction before comparing the means, although this may not provide conclusive evidence. Alternatively, a head-to-head comparison can be conducted between A and B to see which one wins more frequently. Through this, the null hypothesis can be tested, and the p-value can be calculated to determine whether the deviation is significant. This method is commonly used in the social sciences and can be useful in noisy environments. Finally, the speaker briefly touches on the idea of fitting a model to derive statistics and discusses the use of least-squares approximation.

  • 01:20:00 In this section, the speaker discusses the issue of over-fitting in modeling and how adding more basis functions can improve data fit but also lead to over-fitting. The solution is to remove a basis function and check if it affects the model's quality. The speaker also mentions Lord Kelvin, known as the Guru of measurement, who stated that to measure is to know and if one cannot measure, one cannot improve. The video concludes with the speaker wishing luck on a quiz on Tuesday.
10. Measurement and Timing
  • 2019.09.23
  • www.youtube.com
  • MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18
 

Lecture 11. Storage Allocation

The lecturer discusses storage allocation in this video, covering both memory allocation and freeing. Different types of storage are addressed, including the stack and the heap, as well as fixed and variable size allocation schemes. External fragmentation caused by blocks of memory being spread out is also discussed, with solutions such as free lists or bitmaps per disk page. The concept of garbage collection is also introduced, including reference counting and the limitations of this algorithm. The "mark and sweep" and "stop and copy" garbage collection procedures are also explained, with a focus on how the latter addresses fragmentation and the possible correctness issue created by pointer updates. The video concludes with notes on the time and space complexity of the stop and copy algorithm and its elimination of external fragmentation.

This video covers various topics related to dynamic storage allocation, including the Buddy system, mark and sweep procedures, optimizations, generational and real-time garbage collection, multi-threaded storage allocation, and parallel garbage collection. The lecturer explains that generational garbage collection is based on the idea that younger objects are scanned more frequently, while real-time garbage collection runs in the background during program execution, but can lead to correctness issues if the object and pointer graph keeps changing. Finally, the lecture summarizes different types of storage allocation, including stack and heap, and different methods of garbage collection, such as reference counting, mark-and-sweep, and stop-and-copy. The lecturer mentions that students will get to try some of these methods in their upcoming homework assignment.

  • 00:00:00 In this section, the lecturer discusses storage allocation, which includes both memory allocation and freeing. The simplest form of storage is a stack, where memory is allocated by incrementing the stack pointer and freed by decrementing the stack pointer. However, the stack has limited applicability because it can only free the last thing that was allocated and cannot free anything in the middle of the used region. The lecturer also notes that for efficiency reasons, stack overflow is not usually checked by compilers.

  • 00:05:00 In this section, the lecturer talks about two types of storage: the stack and the heap. The stack is a fixed size storage that is efficient but cannot be used for everything, while the heap is more general but difficult to organize and use efficiently. For heap allocation, malloc and free C functions can be used, but since C and C++ do not provide a garbage collector, programmers have to manage memory themselves, which can lead to memory leaks, dangling pointers, and double freeing. The lecturer suggests using memory checkers like address sanitizer and Valgrind to reduce the number of memory bugs in the program.

  • 00:10:00 In this section, the speaker discusses different ways of allocating storage, beginning with fixed-size allocation, in which every piece of storage has the same size and some blocks are used while others are unused. The unused blocks are kept in a list called the free list, and each block in the free list has a pointer to the next block in the list. To allocate one object from the free list, the program saves a pointer to the first object in the free list, sets the free pointer to the next object in the list, and returns the saved pointer. To deallocate an object, one sets the object's next pointer to the current free pointer and then sets free equal to the object being freed. With the free list, allocating and freeing take constant time, and one also gets good temporal locality.
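
A minimal sketch of the fixed-size free-list operations described above; the pool layout, block size, and function names are illustrative assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

// Each free block stores a pointer to the next free block in its first word.
typedef struct block {
  struct block *next;
} block_t;

static block_t *free_list = NULL;

// Carve a pool into fixed-size blocks and thread them onto the free list.
static void pool_init(void *pool, size_t block_size, size_t nblocks) {
  char *p = pool;
  for (size_t i = 0; i < nblocks; i++) {
    block_t *b = (block_t *)(p + i * block_size);
    b->next = free_list;
    free_list = b;
  }
}

// Allocate: pop the first block off the free list (constant time).
static void *pool_alloc(void) {
  if (free_list == NULL) return NULL;   // out of blocks
  block_t *x = free_list;
  free_list = free_list->next;
  return x;
}

// Free: push the block back onto the front of the free list (constant time).
static void pool_free(void *ptr) {
  block_t *b = ptr;
  b->next = free_list;
  free_list = b;
}

int main(void) {
  enum { BLOCK = 64, N = 128 };
  void *pool = malloc(BLOCK * N);
  pool_init(pool, BLOCK, N);
  void *a = pool_alloc();
  pool_free(a);
  free(pool);
  return 0;
}
```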

  • 00:15:00 In this section, the concept of external fragmentation is introduced, which occurs when blocks of used memory are spread out across the space of all memory. This can lead to a larger page table, which can cause disk thrashing and make it less efficient to look up translations. To mitigate external fragmentation, a free list or bitmap per disk page can be used, and one can allocate from the fullest page, free blocks of storage, and page out empty pages. Fixed-size heap allocation can also be useful for reducing external fragmentation. Finally, variable-size heap allocation can be handled with binned free lists, which leverage the efficiency of free lists for different sizes of allocated memory.

  • 00:20:00 In this section, a scheme for storing blocks of memory of varying sizes in a memory allocation system is presented. The scheme involves dividing the blocks into bins based on their size, with each bin holding blocks of a specified size that are powers of two. When allocating memory, the appropriate bin is selected based on the requested size, and if it is empty, a larger non-empty bin is split up into smaller chunks to obtain the desired size. However, this scheme can result in internal fragmentation of memory, which is wasteful, so the number of bins used is limited, and small blocks are grouped into pages to reduce overhead. The operating system can be used to provide more memory if needed, and there are system calls available for this purpose.
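
One way the bin-selection step could be expressed, assuming bin k holds blocks of exactly 2^k bytes; the rounding up to the next power of two is the source of the internal fragmentation mentioned above.

```c
#include <stddef.h>
#include <stdio.h>

// Return the bin index for a request of `size` bytes, assuming bin k
// holds blocks of exactly 2^k bytes: round the request up to the next
// power of two and take the exponent.
static unsigned bin_index(size_t size) {
  unsigned k = 0;
  size_t cap = 1;
  while (cap < size) {   // smallest power of two >= size
    cap <<= 1;
    k++;
  }
  return k;
}

int main(void) {
  printf("bin for 24 bytes: %u\n", bin_index(24));  // 24 rounds up to 32 = 2^5, bin 5
  return 0;
}
```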

  • 00:25:00 In this section, the video discusses how memory allocation works and notes that standard implementations of malloc rely on system calls such as brk/sbrk to get memory from the operating system. The OS gives a large chunk of memory, and it's up to the storage allocation system to break it down into smaller blocks. This section also covers variants of the memory allocator, including different schemes for storing block sizes and how the virtual memory address space is laid out in a program. It notes that the stack and heap grow towards each other but never intersect. Finally, it is mentioned that pre-computing large tables of constants in a program can increase loading time, as it requires reading from the data segment.

  • 00:30:00 In this section, the issue of fragmentation with virtual memory is discussed, which can lead to external fragmentation, degraded performance of the page table, disk thrashing, and low TLB hit rates if the memory is spread out. The goal of storage allocation is to use as little virtual memory as possible while keeping used portions of memory compact. The bin free list allocation scheme is analyzed, with a theorem stating that the amount of virtual memory consumed by the heap storage is upper bounded by M log M. Coalescing smaller blocks into larger ones can improve the bin free list scheme, but the overhead of this method can be high compared to the standard bin free list algorithm, which is only a constant factor worse than the optimal allocator, according to a paper by Charles Leiserson.

  • 00:35:00 In this section, the speaker explains the concept of storage allocation and how it can reduce fragmentation when dealing with heap storage. He also discusses the idea of garbage collection, which frees the programmer from having to manually free objects. The speaker explains the three types of memory objects - roots, live objects, and dead objects - and how garbage collection can recycle the latter. Lastly, he describes reference counting, a simple form of garbage collection that counts the number of pointers referencing an object to determine if it can be freed.

  • 00:40:00 In this section, the speaker introduces the concept of reference counting as a garbage collection scheme and explains how the reference count algorithm works. However, a limitation of the algorithm is pointed out, where cycles in the reference graph cannot be collected. The speaker then discusses the use of strong and weak pointers in some programming languages to avoid this limitation. Finally, the speaker previews two other garbage collection schemes, "mark-and-sweep" and "stop and copy," that avoid the limitation of reference counting when dealing with cycles in the reference graph.
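
A minimal sketch of the reference-counting bookkeeping described above; the object layout and function names are illustrative assumptions, and a full collector would also release the object's outgoing references when the count reaches zero.

```c
#include <stdlib.h>

typedef struct obj {
  int refcount;
  // ... payload fields and pointers to other objects ...
} obj_t;

static obj_t *obj_new(void) {
  obj_t *o = calloc(1, sizeof(obj_t));
  o->refcount = 1;              // the creating reference
  return o;
}

// Call when a new pointer to the object is created.
static void obj_retain(obj_t *o) {
  o->refcount++;
}

// Call when a pointer to the object is destroyed; free at zero.
static void obj_release(obj_t *o) {
  if (--o->refcount == 0) {
    // A full implementation would release every object this one points to
    // before freeing it (which is where cycles cause the known limitation).
    free(o);
  }
}

int main(void) {
  obj_t *o = obj_new();   // refcount = 1
  obj_retain(o);          // refcount = 2 (a second reference was created)
  obj_release(o);         // refcount = 1
  obj_release(o);         // refcount = 0, object freed
  return 0;
}
```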

  • 00:45:00 In this section, the speaker discusses the use of a breadth-first search algorithm to find all live objects in memory. An array with two pointers is used to represent a FIFO queue for the search, and the algorithm marks each live object that is reachable from the roots. After the search is complete, the unmarked objects are available to be garbage-collected. The mark-and-sweep procedure involves two stages: the mark stage, which runs the breadth-first search, and the sweep stage, which scans over memory to free unmarked objects. However, there are limitations to this scheme, such as the need to scan over all of memory and the overhead associated with looking at unreferenced objects.
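
A sketch of the mark stage as described, using an array-based FIFO queue for the breadth-first search; the object layout, the roots array, and the queue handling are illustrative assumptions.

```c
#include <stddef.h>

#define MAX_EDGES 8

typedef struct gcobj {
  int marked;
  struct gcobj *edges[MAX_EDGES];  // outgoing pointers, NULL marks the end
} gcobj_t;

// Mark every object reachable from the roots with a breadth-first search.
// `queue` is caller-provided scratch space with room for every live object.
static void mark_live(gcobj_t **roots, size_t nroots,
                      gcobj_t **queue, size_t qcap) {
  size_t head = 0, tail = 0;          // FIFO queue: two indices into an array
  for (size_t i = 0; i < nroots; i++) {
    if (roots[i] && !roots[i]->marked) {
      roots[i]->marked = 1;
      if (tail < qcap) queue[tail++] = roots[i];
    }
  }
  while (head < tail) {
    gcobj_t *o = queue[head++];       // dequeue
    for (int j = 0; j < MAX_EDGES && o->edges[j]; j++) {
      gcobj_t *c = o->edges[j];
      if (!c->marked) {               // first time this object is reached
        c->marked = 1;
        if (tail < qcap) queue[tail++] = c;
      }
    }
  }
  // The sweep stage would now scan all of memory and free unmarked objects.
}

int main(void) {
  gcobj_t a = {0}, b = {0};
  a.edges[0] = &b;                    // a points to b
  gcobj_t *roots[] = { &a };
  gcobj_t *queue[8];
  mark_live(roots, 1, queue, 8);      // both a and b end up marked
  return 0;
}
```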

  • 00:50:00 In this section, the video introduces the "stop and copy" garbage collection procedure, which deals with fragmentation. This procedure is similar to the mark-and-sweep algorithm but takes a different approach to ensure that live objects are contiguous in memory. The procedure uses two separate memory spaces, the from space and the to space, and allocates and frees objects from the from space until it becomes full. Then the garbage collection algorithm is run, and the to space is used as the queue for the breadth-first search that identifies live objects, which are placed contiguously in memory in the to space. However, this algorithm can create a correctness issue: pointers that point to objects in the from space may no longer be correct. To address this, the procedure stores a forwarding pointer in the corresponding object in the from space and updates all pointers by following these forwarding pointers when removing an object from the queue during the breadth-first search.

  • 00:55:00 In this section, the speaker discusses the stop and copy garbage collection algorithm, its time complexity and how it deals with external fragmentation. The stop and copy algorithm involves copying objects from the from space to the to space and then updating the pointers of these objects in the to space. This algorithm is linear in time and space, which makes it a more efficient and effective way of garbage collection. When the from space becomes full, the user can request a new heap space and double the size of the from space. The virtual memory space required by this scheme is constant times optimal, and this algorithm eliminates external fragmentation.

  • 01:00:00 In this section, the speaker discusses more topics related to dynamic storage allocation such as the Buddy system for doing coalescing, variants of the mark and sweep procedure, optimizations to improve performance, generational garbage collection, real-time garbage collection, multi-threaded storage allocation, and parallel garbage collection. Generational garbage collection is based on an idea that many objects are short-lived, thus only younger objects are scanned most of the time. Real-time garbage collection works in the background as the program is running, but it could lead to correctness issues if the graph of objects and pointers is changing. Multi-threaded storage allocation and parallel garbage collection have multiple threads running, which makes the algorithms trickier to deal with races and other correctness issues. The speaker also summarizes the different types of storage allocation, including the stack and heap, and different ways to do garbage collection, such as reference counting, mark-and-sweep, and stop-and-copy.

  • 01:05:00 In this section, the instructor mentioned that they will delve further into storage allocation schemes and that students will have the opportunity to try some of them out in their upcoming homework assignment. The instructor then ended the lecture and opened the floor for any further questions.
11. Storage Allocation
  • 2019.09.23
  • www.youtube.com
  • MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18
 

Lecture 12. Parallel Storage Allocation

The video discusses various memory allocation strategies and their trade-offs. The use of malloc and aligned allocation, as well as the importance of proper memory deallocation with free, is explained. The use of mmap for virtual memory allocation is also covered, along with the issues of external fragmentation and slow performance. The concept of stacks in C and C++ programming is explored, with emphasis placed on a heap-based cactus stack approach for allocating stack frames, which can lead to better space-bound proofs and upper bounds. The use of a pool of linear stacks is also discussed, along with the importance of optimizing for small blocks to minimize fragmentation. The video concludes with a discussion of global and local heap approaches and their potential issues, along with approaches such as bin free lists and periodic memory rebalancing to address these issues.

The video also discusses solutions for parallel storage allocation and introduces the Hoard allocator as a solution. The Hoard allocator uses a combination of local and global heaps and large superblocks that can be moved between heaps to reduce false sharing, improve scalability, and reduce external fragmentation. The maximum storage allocated in the global heap is at most the maximum storage allocated in the local heaps, and the footprint is upper bounded by the user footprint plus the allocated but unused storage in the local heaps. The video also briefly discusses other allocators such as jemalloc and SuperMalloc, with benchmark results indicating that SuperMalloc performed better than jemalloc and Hoard. The discussion concludes with a reference to the online slides for garbage collection details.

  • 00:00:00 In this section, the lecturer reviews memory allocation primitives, including malloc and aligned allocation. Aligned allocation can be used to align memory to cache lines so that accessing an object within a cache line requires only one cache access, reducing the number of cache misses. The lecturer also discusses the importance of properly deallocating memory with the free function and avoiding memory leaks and double freeing. Finally, the lecturer introduces the mmap system call, which can be used to allocate virtual memory without a backing file.

  • 00:05:00 In this section, the speaker explains the process of allocating memory using mmap, which is a system call that finds a contiguous unused region in the address space of the application large enough to hold the requested memory size and updates the page table to contain entries for the allocated pages. Unlike malloc, which is a library call and part of the heap-management code of the C library, mmap doesn't immediately allocate physical memory for the requested memory but populates the page table with entries pointing to a special zero page, which is modified and translated into physical memory only when the user first writes to it. Moreover, mmap is responsible for getting memory from the operating system, while malloc is responsible for reusing previously allocated memory to reduce fragmentation and improve temporal locality.
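
A minimal example of the mmap call described here, requesting anonymous (non-file-backed) virtual memory; the size is arbitrary.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 1 << 20;  // 1 MiB of anonymous, zero-filled virtual memory
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  // Physical pages are committed lazily: the first write to a page
  // triggers a page fault and replaces the shared zero page.
  ((char *)p)[0] = 42;

  munmap(p, len);
  return 0;
}
```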

  • 00:10:00 In this section, the video discusses the differences between using mmap and malloc for storage allocation, emphasizing that mmap is relatively heavyweight and allocates entire pages even for small allocations, leading to external fragmentation and slow performance. The video also reviews the process of address translation, where the virtual address is used to access memory and the hardware looks up the appropriate physical address in the page table. The video explains how TLBs work by caching recent page-table lookups, and notes that programs with spatial or temporal locality will have many recent accesses stored in the TLB, leading to faster performance.

  • 00:15:00 In this section, the speaker discusses how address translation works in modern machines and also delves into the concept of stacks in C and C++ programming. Stacks are used to keep track of function calls and local variables and follow a linear order. The speaker emphasizes one rule of pointers in traditional linear stacks: a parent can pass pointers to its stack variables down to its children, but not the other way around, to avoid overwriting variables. The speaker also suggests options, such as allocating memory on the heap or on the parent's stack, for passing memory from a child function back to the parent.

  • 00:20:00 In this section, the discussion turns to what is needed to make parallel heap allocation correct. The potential issue with using a heap-based cactus stack is memory fragmentation, where there may not be enough contiguous memory left over for future allocations, leading to wasted space and, potentially, to running out of memory or crashing the program. While early systems for parallel programming used this strategy, it is important to optimize the code to minimize the performance impact and to consider the consequences of memory fragmentation.

  • 00:25:00 In this section, the video discusses the use of a heap-based cactus stack approach for allocating stack frames without putting an upper bound on the maximum depth. However, interoperability is an issue when trying to use traditional linear-stack code with this heap-based cactus stack. This can be fixed using an approach called thread-local memory mapping, but that approach requires changes to the operating system. Despite this, the heap-based cactus stack approach is still good in practice and can be used to prove a space bound for a Cilk program that uses it. This space bound shows that the stack space of a P-worker execution using a heap-based cactus stack is upper bounded by P times S1, where S1 is the stack space required by a serial execution of the Cilk program. This is a nice feature of the heap-based cactus stack approach.

  • 00:30:00 In this section, the divide-and-conquer matrix multiplication code is analyzed to see how the space-bound theorem applies to it. The code makes eight recursive calls, each of size N/2, and before each call it does a malloc to get temporary space of size Θ(N²), which is freed right before the function returns. Since the allocations follow a stack discipline, even when allocating off the heap, the space bound from the previous slide can be used to upper bound the P-worker space. The work is Θ(N³) and the span is Θ(log² N). The recurrence for the space is S(N) = S(N/2) + Θ(N²), which solves to Θ(N²). Therefore, the busy-leaves property and the theorem for the space bound show that on P processors the space is bounded by P · N².
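
The quantities described in this segment, written out in the lecture's notation for work, span, and space:

```latex
T_1(n) = \Theta(n^{3}), \qquad T_\infty(n) = \Theta(\log^{2} n),
\qquad
S_1(n) = S_1(n/2) + \Theta(n^{2}) \;\Rightarrow\; S_1(n) = \Theta(n^{2}),
\qquad
S_P \;\le\; P \cdot S_1 \;=\; \Theta(P\,n^{2}).
```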

  • 00:35:00 In this section, the speaker walks the audience through how to prove a stronger space bound for the algorithm discussed in the previous section. By analyzing the structure of the algorithm and noting that it branches as much as possible near the top of the recursion tree, the speaker is able to demonstrate a space bound of Θ(P^(1/3) · N²), which is better than the previous upper bound. The speaker notes that analyzing the structure of an algorithm can lead to stronger space-bound proofs than simply applying a general theorem. The section concludes by discussing how parallel functions fail to interoperate with legacy and third-party serial binaries.

  • 00:40:00 In this section, the speaker discusses the use of a pool of linear stacks in storage allocation, which can be used to maintain a pool of stacks for legacy code. The number of stacks in the pool should be constant times the number of processors so that the space bound is preserved. However, this strategy can impact the time bound of the work-stealing algorithm because the worker may not be able to steal if there are no more stacks available. The speaker then talks about basic properties of heap storage allocators, including allocator speed and user footprint, emphasizing the importance of optimizing for small blocks due to the potential for fragmentation and overhead in allocation. Fragmentation is defined as the allocator footprint divided by the user footprint.

  • 00:45:00 In this section of the video, the speaker discusses how to keep the allocator footprint as close as possible to the user footprint in order to minimize the ratio of the two. One solution is to use the madvise system call, which tells the operating system that a certain page of memory is no longer needed but should be kept in virtual memory. The speaker also mentions the theorem from the previous lecture that the fragmentation for binned free lists is O(lg U), where U is the total amount of memory allocated, and explains the differences between space overhead, internal fragmentation, and external fragmentation. Finally, the speaker notes that modern 64-bit processors provide about 2^48 bytes of virtual address space, which is enough for most programs but still far more than the physical memory on a typical machine.
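
A hedged sketch of the madvise call mentioned above; whether an allocator uses MADV_DONTNEED or another advice flag is an implementation choice, not something specified in the lecture summary.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 1 << 20;
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  ((char *)p)[0] = 1;                   // commit a physical page by writing to it

  // Tell the OS the contents are no longer needed; the virtual range stays
  // mapped, so the allocator can hand the same addresses out again later.
  if (madvise(p, len, MADV_DONTNEED) != 0) perror("madvise");

  munmap(p, len);
  return 0;
}
```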

  • 00:50:00 In this section, the video explores a parallel heap allocation strategy using a global heap with mutex protection, which is how the default C++ allocator works. The blow-up for this strategy is one, as it doesn't require additional space compared to a serial allocator. However, the potential issue with this approach is the performance hit, as acquiring the lock for every allocation and deallocation is slow and gets worse with increasing processors. Lock contention is the primary reason for poor scalability, which is more problematic for small allocations due to higher request rates, and large block allocations require more work.

  • 00:55:00 In this section, the video discusses the potential issue with using a local heap approach, which is that it can lead to memory drift and unbounded blow-up. Essentially, if you allocate all of your memory in one thread but free it in another, there can be unused memory sitting around in the system that the allocator isn't smart enough to reuse. As a result, your program running on multiple processors could eventually run out of memory, but it won't happen if it's only run on a single core. The video suggests using a bin free list approach to address this issue, which involves setting pointers in the bin free list data structure so that the block that is freed can appear in one of the linked lists. Another strategy is to periodically rebalance the memory, but a simpler approach is also discussed.

  • 01:00:00 In this section, the video discusses how to change the memory allocator to achieve the desired behavior of having each thread accessing its own heap without the need for a global heap. One approach is to label each object with an owner and return it to the owner's heap when it's freed. This way, local objects get fast allocation and freeing, while remote objects require some synchronization but not as much as with a global heap. The maximum amount of memory that can be allocated is bounded by X, the amount used by the sequential program, and the blow-up ratio is upper bounded by P, the number of heaps. This approach also has resilience to false sharing, which occurs when multiple processors access different memory locations but on the same cache line.

  • 01:05:00 In this section, the speaker discusses how parallel heap allocation can lead to false sharing and introduces the Hoard allocator as a solution. The Hoard allocator uses a combination of local and global heaps, organizing memory into large superblocks that can be moved between heaps. This approach is fast, scalable, and resilient to false sharing. When an allocation is made, the allocator checks for a free object in the local heap and takes it if one exists. If not, it goes to the global heap to get more memory.

  • 01:10:00 In this section, the speaker explains how the Hoard allocator reduces external fragmentation by allocating a free object from the fullest non-full superblock, which densifies the superblocks. If the object is not found in the local heap, the allocator checks the global heap, and if the global heap is empty, it requests a new superblock from the OS. The Hoard allocator maintains an invariant that keeps local heaps from becoming too under-utilized: if it detects that a heap is under-utilized, it moves a half-empty superblock to the global heap. The speaker then presents a lemma stating that the maximum storage allocated in the global heap is at most the maximum storage allocated in the local heaps. Finally, the speaker proves the theorem that the Hoard allocator's footprint is upper bounded by O(u + SP), where u is the user footprint of the program and SP is the allocated but unused storage in the local heaps. The blow-up is then calculated to be 1 + O(SP/u).

  • 01:15:00 In this section, the speaker discusses different allocators, including jemalloc and SuperMalloc. jemalloc has separate global locks for different allocation sizes and releases empty pages using madvise, which makes it robust to different allocation traces. SuperMalloc, on the other hand, has very few lines of code and performs very well. The speaker shares benchmark results in which SuperMalloc performed better than jemalloc, which performed better than Hoard, while the default allocator had poor performance. The discussion also extends to garbage collection; however, the speaker defers the details to the online slides.
12. Parallel Storage Allocation
  • 2019.09.23
  • www.youtube.com
  • MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18
 

Lecture 13. The Cilk Runtime System

The Cilk runtime system is responsible for scheduling and load-balancing computations on parallel processors, using a randomized work-stealing scheduler to map computations onto processors at runtime. The system is designed to optimize the serial execution of the program, even at the expense of some additional cost to steals. Each worker maintains a separate deque data structure that contains pointers to the stack frames and uses head and tail pointers. The frames that are available to be stolen have an additional local structure that contains the information necessary for stealing to occur, while the deque is external to the call stack. The section explains how the system enables processors to start executing from the middle of a running function, and how synchronization between processors becomes complicated when reaching a sync statement that cannot be executed past because specific processors are still waiting for computations to finish. Additionally, the speaker addresses the performance of the system, design considerations and data structures, and how the system optimizes execution using the THE protocol, which involves two sets of rules, one for the worker executing work and one for the thief.

The Cilk runtime system uses a setjmp/longjmp protocol to handle the theft of computations from victim workers' deques. The cactus stack abstraction allows the thief to have its own call stack, preventing corruption of the victim's stack. The system's synchronization protocol uses a tree of computations and the cactus stack to ensure that synchronization only occurs between nested computations within a function. The full-frame tree maintains parent-child relationships between computations with outstanding sub-computations to simplify the synchronization process. Additionally, the runtime system optimizes the common case where the current function has no outstanding child computations and all workers are busy. Other features supported include C++ exceptions, reducer hyperobjects, and pedigrees.

  • 00:00:00 In this section, TB (Tao B. Schardl) explains the Cilk runtime system, which is responsible for scheduling and load-balancing computations on parallel processors. The runtime system uses a randomized work-stealing scheduler to map computations onto processors at runtime, ensuring an efficient execution. TB notes that the Cilk runtime system is a complex piece of software, but its underlying magic is implemented through the collaboration of the Cilk compiler and the runtime library. Additionally, he shows the pseudocode for a simple Cilk program after compiling, which highlights the complex process underlying Cilk's runtime system.

  • 00:05:00 In this section, the speaker explains the functionality required of the Cilk runtime system, as well as performance considerations. He uses a parallelized Fibonacci routine to illustrate how a Cilk program executes a computation dag that unfolds dynamically as the program runs on one processor. When the cilk_spawn statement is encountered, a new frame is created for fib(3), and another strand becomes available to execute in parallel. The processor then starts executing fib(3) as if it were an ordinary function call. The process repeats itself when the instruction pointer returns to the beginning of the fib routine.

  • 00:10:00 In this section, we see how multiple processors execute the fib routine in parallel with the help of the Cilk runtime system. Each processor jumps into the middle of an executing function and starts running it, despite having independent registers, which begs the question of how the Cilk system enables processors to start executing from the middle of a running function. Furthermore, the synchronization between processors becomes complicated when reaching a sync statement that cannot be executed past because specific processors are still waiting for computations to finish, which requires fine-grained synchronization within the nested pattern.

  • 00:15:00 In this section, the speaker discusses the implementation considerations for the Cilk runtime system. They mention three primary considerations: a single worker must be able to execute the program, processors must be able to jump into the middle of executing functions and pick up where other processors left off, and synchronization must be handled. Additionally, Cilk has a notion of a cactus stack, which needs to be implemented correctly so that all views of the stack are available and consistent. Finally, the system needs to allow for work stealing by letting processors steal frames from each other.

  • 00:20:00 In this section, the speaker discusses the functionality required of the Cilk runtime system, which includes serial execution, thieves jumping into the middle of running functions, syncs that synchronize in a nested, fine-grained way, a cactus stack visible to all workers, and dealing with the mixtures of spawned frames and called frames that may be present when workers steal computation. The section then moves on to the performance of the system: we need ample parallelism, T1 over T-infinity should be much larger than P, and we want linear speedup when increasing the number of processors running the program. The section also covers the expected running time TP on P processors, which is proportional to the work of the computation divided by the number of processors, plus a term on the order of the span of the computation.
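
The performance statements in this segment, written out (T_1 is the work, T_infinity the span, and P the number of processors):

```latex
T_P \;\approx\; \frac{T_1}{P} \;+\; O(T_\infty),
\qquad\text{with near-linear speedup when}\quad \frac{T_1}{T_\infty} \gg P.
```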

  • 00:25:00 In this section, we learn about the Cilk runtime system and its aim of maintaining high work efficiency in programs with sufficient parallelism. The Cilk runtime system abides by the work-first principle by optimizing the serial execution of the program, even at the expense of some additional cost to steals. The runtime system library handles issues with parallel execution and handles the slower paths of execution when steals occur. Each worker maintains a separate deque data structure that contains pointers to the stack frames and uses head and tail pointers. The frames that are available to be stolen have an additional local structure that contains the information necessary for stealing to occur, while the deque is external to the call stack.

  • 00:30:00 In this section, we learn about the design of the Cilk runtime system and its basic data structures. The compiler generates a spawn-helper function for each spawn, and the runtime maintains a worker structure and a Cilk stack-frame structure for each instantiation of a spawning function and of a spawn helper, respectively. The Cilk RTS stack frame contains fields such as a buffer and a flags integer that summarizes the state of the Cilk stack frame, as well as a pointer to the parent Cilk stack frame in the call stack. The worker structure maintains a deque and a pointer to the current Cilk RTS stack frame. These data structures are the essentials of the Cilk runtime system, which the Intel Cilk Plus runtime system elaborates on.

  • 00:35:00 In this section, the video explores the code for the spawning function "foo" and its spawn-helper function. The code for "foo" includes a call to initialize the stack frame, setup for a spawn with the setjmp routine, a call to the spawn helper "spawn bar", a conditional code block for a sync in the Cilk RTS, and calls to the pop-frame and leave-frame routines for cleanup. The code for the spawn helper includes a call to initialize the stack frame and a call to the Cilk RTS detach routine, with additional cleanup functions for the stack structure. The setjmp routine is described as a function that allows thieves to steal the continuation of the function, taking as an argument a buffer in which to store the information necessary for resuming the function at that location.

  • 00:40:00 In this section, the speaker discusses how to restrict the set of registers and save them into a predetermined set for a function call. The responsibility of saving the registers falls on the function, but the instruction pointer and stack pointers are saved into the buffer. The section goes on to discuss a spawn helper function and how it updates worker structures and current stack frames. Finally, the section explains how the enter frame fast routine establishes a parent pointer in the call stack and updates the worker's current stack frame to point at the bottom.

  • 00:45:00 In this section, the transcript of the YouTube video titled "The Cilk Runtime System" explains how the Cilk RTS detach function allows computation to be stolen: once the detach routine has finished executing, a thief can come and steal the continuation of the cilk_spawn. The pop-frame routine is responsible for cleaning up the stack-frame structure and updating the current stack frame to point to the parent. This function call may or may not return, and which of these two cases is more important for the runtime system to optimize is the success case, based on the work-first principle.

  • 00:50:00 In this section, the speaker discusses the Cilk runtime system's optimization of worker execution in case one, where workers are assumed to do regular serial execution. They also explain how stealing works, with the thief targeting the top of the victim's deque, dequeuing the item, and updating the head of the deque and the current stack-frame pointer. The result of the spawned function is returned to the parent stack frame as in the original Cilk code.

  • 00:55:00 In this section, the speaker explains the approach of the Cilk runtime system to handling concurrent accesses involving multiple processors, using a protocol called the THE protocol. The protocol involves two sets of rules, one for the worker executing work and one for the thief. The worker's protocol is optimized by trying to pop something from the bottom of the deque, and only on failure does it acquire a lock on the deque, while the thief always grabs a lock before performing any deque operations. The thief then magically resumes a stolen continuation by calling the longjmp function, passing the stack-frame buffer that was filled in by the setjmp call, and setting its register state to that of the stolen continuation.

  • 01:00:00 In this section, the speaker discusses the interaction between the setjmp and longjmp calls within the Cilk runtime system. They explain how, when the longjmp routine is called, it effectively returns from the setjmp a second time, this time with the value specified in longjmp's second argument. Using the appropriate buffer and second argument, the thief can execute a longjmp that skips over a certain computation in the victim's code. The speaker notes that it would be possible to statically compute the distance needed to jump over a call and use a different approach, but the setjmp/longjmp protocol serves as a more flexible method for the Cilk runtime system. Overall, the speaker highlights the various techniques available to the thief for stealing computations from the victim's deque.
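
A minimal, self-contained illustration of the underlying setjmp/longjmp mechanism: setjmp returns 0 when the context is saved and returns a second time, with the value passed to longjmp, when that context is resumed. This shows only the C-level mechanism, not the Cilk runtime's actual stealing protocol.

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf ctx;

static void jump_back(void) {
  // Resume the context saved by setjmp, making setjmp return a second time
  // with the value 7 instead of 0.
  longjmp(ctx, 7);
}

int main(void) {
  int rc = setjmp(ctx);       // saves the instruction and stack pointers in ctx
  if (rc == 0) {
    printf("first return from setjmp\n");
    jump_back();              // never returns here
  } else {
    printf("second return from setjmp, value %d\n", rc);
  }
  return 0;
}
```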

  • 01:05:00 In this section, the need for a cactus-stack abstraction in the Cilk runtime system is discussed. It is explained that using the victim's stack can corrupt the victim's stack and cause problems with maintaining consistency across all workers. Therefore, a separate call stack for the thief is needed. The implementation of the cactus stack involves the thief stealing the continuation and setting its stack pointer to its own stack, while still being able to access the state of the function on the victim's stack through offsets from RBP.

  • 01:10:00 In this section, the speaker explains how the Cilk runtime system handles synchronization of computation on different processors. The system uses the cactus stack and a tree of computations to ensure that synchronization happens only among the nested sub-computations under a function, not across the program in general. When a worker reaches a cilk_sync statement before all sub-computations have returned, it becomes a thief and continues working on the stack frame of a stolen computation. This allows the system to avoid wasting workers and to ensure that the nested computations are properly synchronized. The system specifically instructs the compiler not to use an optimization that could conflict with this approach.

  • 01:15:00 In this section, the Cilk runtime system is described as maintaining a tree of states called full frames, which keeps track of which computations are waiting on which other computations to finish. This full-frame tree keeps track of a lot of the information needed for parallel execution, including pointers to parent frames, potentially to child frames, and how outstanding sub-computations relate to each other. When a worker encounters a sync, if it has an outstanding child computation, it cannot synchronize and must suspend its full frame, at which point it becomes a thief and steals more work. This design works well for programs with ample parallelism and simplifies the synchronization protocols.

  • 01:20:00 In this section, the speaker discusses how the Cilk runtime system optimizes the common case where the executing function has no outstanding children and all workers on the system are busy with their own tasks. The runtime system uses bits from the associated stack frame's flags field to evaluate whether any synchronization is necessary for a cilk_sync. The speaker also mentions several other features supported by the runtime system, including C++ exceptions, reducer hyperobjects, and pedigrees.
13. The Cilk Runtime System
  • 2019.09.23
  • www.youtube.com
  • MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Tao B. Schardl. View the complete course: https://ocw.mit.edu/6-172F18
 

Lecture 14. Caching and Cache-Efficient Algorithms

In the video on caching and cache-efficient algorithms, the instructor explains the cache hierarchy of modern machines, the differences between fully associative and direct mapped caches, and the advantages and disadvantages of set associative caches. The video also introduces the ideal cache model and the concept of cache-efficient algorithms. The speaker discusses the cache miss lemma, the tall cache assumption, and the sub-matrix caching lemma. They also analyze cache misses incurred when reading a sub-matrix and during matrix multiplication. The video concludes by introducing the concept of tiled matrix multiplication and how it can significantly improve performance, but also notes that it is not portable and can be expensive to optimize for multiple levels of cache.

The lecture covers caching and cache-efficient algorithms, using recursive matrix multiplication as an example. The speaker emphasizes the importance of analyzing both work and cache misses separately and notes that cache-aware algorithms may not be optimal for all machines due to differing cache sizes. The speaker also discusses cache-oblivious algorithms that auto-tune for any cache size and mentions developments in parallel cache-oblivious code. Finally, the goal is to design algorithms that are either cache-aware or cache-oblivious, and the discussion on cache-oblivious algorithm design will continue in the following lecture.

  • 00:00:00 In this section on caching and cache-efficient algorithms, the instructor discusses the cache hierarchy of modern machines, which includes private L1 data and instruction caches for each processor, a private L2 cache, and a last-level cache that is shared among all processors. The sizes of these caches increase as one moves up the memory hierarchy, with DRAM being the slowest and largest. The associativity and the latency of the caches also increase as one moves up the memory hierarchy, with the L1 cache being the fastest and least associative. The instructor also mentions the importance of cache-coherence protocols in ensuring consistency of memory accesses for parallel processing. Understanding how to take advantage of locality in programs can help make efficient use of the faster memory caches.

  • 00:05:00 In this section, the differences between fully associative and direct-mapped caches are discussed. In a fully associative cache a block can go anywhere, so finding a block can require searching the entire cache, which can be slow and inefficient. A direct-mapped cache, on the other hand, has only one possible location for each block, which makes lookups much faster but makes the cache more prone to conflicts. The three fields of the address (the offset, the set, and the tag) are also explained: the virtual memory address is divided into these fields to determine the location of the cache block. The tag identifies which memory block the cache block corresponds to, and the set determines which position in the cache the block goes into, with the full address being w bits wide.
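
A sketch of that three-way address split for an assumed cache geometry (64-byte lines and 64 sets, which are example parameters, not the machine discussed in the lecture):

```c
#include <stdint.h>
#include <stdio.h>

// Example geometry: 64-byte cache lines (6 offset bits), 64 sets (6 set bits).
enum { OFFSET_BITS = 6, SET_BITS = 6 };

int main(void) {
  uintptr_t addr = (uintptr_t)0x7ffc1234abcdULL;  // an arbitrary example address

  uintptr_t offset = addr & ((1u << OFFSET_BITS) - 1);                 // byte within the line
  uintptr_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);   // which set it maps to
  uintptr_t tag    = addr >> (OFFSET_BITS + SET_BITS);                 // identifies the memory block

  printf("tag=%#lx set=%#lx offset=%#lx\n",
         (unsigned long)tag, (unsigned long)set, (unsigned long)offset);
  return 0;
}
```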

  • 00:10:00 In this section, the video discusses the advantages and disadvantages of direct mapped caches, which are fast because only one location in the cache needs to be searched, but are prone to conflict misses where cache blocks keep evicting each other, reducing performance. The video then introduces set associative caches, a hybrid solution where each set contains multiple lines, allowing more than one possible location in the cache for each cache block. The video also mentions a taxonomy of different types of cache misses, including cold misses that occur the first time a cache block is accessed and cannot be avoided.

  • 00:15:00 In this section, the video discusses different types of cache misses, including capacity misses, conflict misses, and sharing misses. Capacity misses occur when the cache is full and cannot fit all of the cache blocks that need to be accessed, while conflict misses happen in set-associative caches when too many blocks map to the same set. Finally, sharing misses occur in a parallel context when multiple processors access the same cache line and at least one of them is writing to it. The video also provides an example of a bad case for conflict misses: accessing a sub-matrix within a larger matrix stored in row-major order.

  • 00:20:00 In this section, the speaker continues the conflict-miss example of accessing a submatrix within a larger matrix, where the cache is only 4-way associative and the submatrix's column elements, spaced exactly 2^15 bytes apart, all map to the same cache sets. The suggested fix is to pad each row of the matrix by a constant amount so that the row length is no longer exactly 2^15 bytes, which spreads the accesses across different sets and reduces the conflict misses.
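A minimal sketch of the padding idea, assuming a 4096 x 4096 matrix of doubles; the pad of 8 doubles per row is an illustrative choice, not a value from the lecture.

```c
#include <stdlib.h>

// Hedged sketch: rows of a 4096 x 4096 double matrix are exactly 2^15 bytes
// apart, so the elements of one submatrix column all map to the same cache
// sets. Allocating each row with a small constant pad breaks that alignment.
#define N   4096
#define PAD 8   // illustrative pad, in doubles

double *alloc_padded_matrix(void) {
  // Row stride is N + PAD elements; only the first N of each row are used.
  return malloc(sizeof(double) * (size_t)N * (N + PAD));
}

static inline double *elem(double *m, int i, int j) {
  return &m[(size_t)i * (N + PAD) + j];   // index with the padded stride
}
```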

  • 00:25:00 In this section, the speaker introduces the ideal-cache model, which is used to analyze the cache performance of algorithms. The model assumes a two-level hierarchy, a fully associative cache, and an optimal, omniscient replacement policy. The two performance measures that matter are the work (the ordinary operation count) and the number of cache misses; the ideal outcome is a low number of cache misses without increasing the work over the traditional algorithm. The LRU lemma, proven by Sleator and Tarjan in 1985, is also mentioned: an algorithm that incurs Q cache misses on an ideal cache of size M incurs at most 2Q cache misses on a fully associative cache of size 2M that uses the LRU policy.

  • 00:30:00 In this section, the speaker introduces cache-efficient algorithms and how they improve program performance by minimizing the number of cache misses. He explains the least recently used (LRU) replacement policy and the associated lemma: a fully associative cache of twice the ideal cache's size, using LRU, incurs no more than twice the number of cache misses, which makes asymptotic analysis of cache misses much easier. The speaker also presents a cache-miss lemma: if a program reads a set of data segments whose total size is less than one-third of the cache size and whose average size is at least the size of a cache line, then the number of cache misses to read them all is at most three times the total size of the segments divided by the size of a cache line.

  • 00:35:00 In this section, the video discusses two assumptions related to caching: the cache-miss lemma and the tall-cache assumption. The cache-miss lemma implies that as long as the average length of the data segments is at least a cache block, reading them costs only a constant factor more cache misses than reading contiguous data of the same total size. The tall-cache assumption says the cache has more lines than each line has bytes (it is "taller than it is wide"), which is usually satisfied in practice. The video also explains the submatrix caching lemma, which shows why short caches make it hard to fit submatrices into cache lines and why the tall-cache assumption resolves this problem.

  • 00:40:00 In this section, the speaker discusses caching and cache-efficient algorithms. They analyze the number of cache misses incurred when reading a square sub-matrix into cache, and use the cache miss lemma to show that the number of cache misses required to read all elements of the matrix into cache is at most 3n^2/B. They then analyze the number of cache misses incurred in the standard cubic work matrix multiplication algorithm, assuming the matrix is in row major order and they satisfy the tall cache assumption. They consider three cases and assume LRU for the second and third cases, ultimately showcasing the importance of cache optimization in algorithm design.

  • 00:45:00 In this section, the speaker discusses caching and cache-efficient algorithms. They analyze two cases for matrix multiplication, where n is greater than M over B and when n is less than C times M over B. They use an LRU replacement policy and calculate the number of cache misses incurred when accessing matrix B. In the first case, they find that they need theta of n cubed cache misses, resulting in no gain from having locality in the algorithm. In the second case, they can exploit spatial locality and reduce the number of cache misses by a factor of B, resulting in theta of n cubed over B cache misses, which is more efficient.

  • 00:50:00 In this section of the video, the speaker discusses caching and cache-efficient algorithms. They first analyze the case where the entire matrix fits into the cache, resulting in a total number of cache misses of theta of n squared over B. They then discuss an optimization by swapping the order of the inner two loops to benefit from spatial locality and reduce the total number of cache misses to theta of n cubed over B. However, they note that it is possible to do better by using an optimization called tiling, where six for loops are used to loop over tiles and computation is done for each sub-matrix before moving on to the next. The work for a sub-matrix of size s by s is then s cubed.

  • 00:55:00 In this section, tiled matrix multiplication is introduced and discussed in detail. The goal of the algorithm is to fit the submatrices into cache so that all computation on a tile can be done in cache with no further cache misses. The number of cache misses is analyzed as Θ((n/s)^3 · s^2/B); choosing the tile size s = Θ(√M) gives a total of Θ(n^3/(B√M)) cache misses, a very significant improvement in the performance of the matrix multiplication code. However, a problem arises with the algorithm: it is not portable, because the tuning parameter depends heavily on the cache sizes of the machine it runs on. Furthermore, multi-dimensional tuning becomes much more expensive and the code becomes messier when optimizing for multiple levels of cache.
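A minimal sketch of the six-loop tiled multiplication described above, assuming row-major n x n matrices; the tile size S would be tuned so that three S x S tiles fit in cache, and the min() helper (for n not divisible by S) is an illustrative addition, not the lecture's exact code.

```c
// Hedged sketch of tiled matrix multiplication C += A * B.
static inline int min_int(int a, int b) { return a < b ? a : b; }

void matmul_tiled(double *restrict C, const double *restrict A,
                  const double *restrict B, int n, int S) {
  for (int ih = 0; ih < n; ih += S)              // outer loops over tiles
    for (int jh = 0; jh < n; jh += S)
      for (int kh = 0; kh < n; kh += S)
        for (int i = ih; i < min_int(ih + S, n); ++i)   // inner loops inside a tile
          for (int k = kh; k < min_int(kh + S, n); ++k)
            for (int j = jh; j < min_int(jh + S, n); ++j)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```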

  • 01:00:00 In this section, the lecture explores caching and cache-efficient algorithms. The speaker discusses the challenges of tuning the parameters for a cache-efficient algorithm and optimizing it for a particular machine. They introduce a recursive matrix multiplication algorithm where the input matrices are divided into four sub-matrices or quadrants. For each quadrant of the output matrix, a sum of two matrix multiplies occurs recursively. Finally, the speaker analyses the work of this algorithm and concludes that it's a simpler design that still maintains good cache performance.
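A minimal sketch of the recursive (divide-and-conquer) multiplication described above, assuming n is a power of two and the matrices are stored row-major with leading dimension ld; the THRESHOLD coarsening constant is an illustrative addition, not prescribed by the lecture.

```c
// Hedged sketch of recursive matrix multiplication C += A * B on quadrants.
#define THRESHOLD 32   // illustrative coarsening cutoff

void matmul_rec(double *restrict C, const double *restrict A,
                const double *restrict B, int n, int ld) {
  if (n <= THRESHOLD) {                       // base case: plain triple loop
    for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
          C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    return;
  }
  int h = n / 2;                              // quadrant size
  #define QUAD(M, r, c) ((M) + (r) * h * ld + (c) * h)
  // Each quadrant of C is the sum of two recursive products.
  matmul_rec(QUAD(C,0,0), QUAD(A,0,0), QUAD(B,0,0), h, ld);
  matmul_rec(QUAD(C,0,0), QUAD(A,0,1), QUAD(B,1,0), h, ld);
  matmul_rec(QUAD(C,0,1), QUAD(A,0,0), QUAD(B,0,1), h, ld);
  matmul_rec(QUAD(C,0,1), QUAD(A,0,1), QUAD(B,1,1), h, ld);
  matmul_rec(QUAD(C,1,0), QUAD(A,1,0), QUAD(B,0,0), h, ld);
  matmul_rec(QUAD(C,1,0), QUAD(A,1,1), QUAD(B,1,0), h, ld);
  matmul_rec(QUAD(C,1,1), QUAD(A,1,0), QUAD(B,0,1), h, ld);
  matmul_rec(QUAD(C,1,1), QUAD(A,1,1), QUAD(B,1,1), h, ld);
  #undef QUAD
}
```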

  • 01:05:00 In this section, the speaker uses recursive matrix multiplication to explain the difference between analyzing work and analyzing cache misses. The speaker draws a recursion tree: each problem spawns eight subproblems of half the size, so the number of levels down to the leaves is log base 2 of n and the number of leaves is 8^(log2 n) = n^3, giving Θ(n^3) work. The cache misses are analyzed with a different base case: when the submatrices fit in cache, a subproblem costs Θ(n^2/B) cache misses. Because the recursion stops at that base case, the number of levels depends on the cache size, giving Θ(n^3/M^(3/2)) leaves and, overall, Θ(n^3/(B√M)) cache misses.
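Written out, the cache-miss analysis above amounts to the following recurrence in the ideal-cache model (c is an unspecified constant; this is a sketch of the standard argument, not a transcription of the lecture's slide):

```latex
\[
Q(n) =
\begin{cases}
\Theta(n^2/B) & \text{if } n^2 < cM \text{ (the submatrices fit in cache)},\\[2pt]
8\,Q(n/2) + \Theta(1) & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
Q(n) = \Theta\!\left(\frac{n^3}{B\sqrt{M}}\right).
\]
```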

  • 01:10:00 In this section, the speaker explains how to analyze the number of cache misses of the recursive matrix multiplication code. The code is cache-oblivious: it has no explicit knowledge of the caches and passively auto-tunes itself to the cache sizes of the machine it runs on. The speaker also mentions that the best cache-oblivious codes today work on arbitrary rectangular matrices and perform binary splitting instead of eight-way splitting. The speaker then turns to the parallel setting and explains how to analyze the number of cache misses of a deterministic Cilk computation running on multiple processors with private caches.

  • 01:15:00 In this section, the speaker wraps up the discussion of caching and cache-efficient algorithms. For low-span algorithms, minimizing cache misses in the serial elision (the single-processor execution) essentially minimizes them in the parallel execution as well. The speaker gives a cache-miss bound for the parallel recursive matrix multiplication algorithm, which matches the serial bound. The overall goal is to design algorithms that either have explicit knowledge of the cache (cache-aware) or are cache-oblivious; the speaker gives examples of both and notes that cache-oblivious algorithm design continues in the following lecture.
14. Caching and Cache-Efficient Algorithms
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18

Lecture 15. Cache-Oblivious Algorithms

The video discusses cache-oblivious algorithms, which can automatically tune for the cache size on a machine, achieve good cache efficiency and do not require knowledge of the cache parameters of the machine. The video shows how to write code for simulating heat diffusion through differential equations using the stencil method on a 2D matrix. The code has both looping and trapezoidal versions, with the latter being more cache-efficient but not significantly faster in a 2D simulation due to the predictability of the looping code's access pattern. The video also discusses the interplay between caching and parallelism and the diagnosis of potential parallel speed-up bottlenecks. Finally, the speaker explains stencil computations and how each point in an array is updated using a fixed pattern called a stencil, which can suffer from cache misses and has storage requirements that can be reduced using efficient algorithms exploiting temporal and spatial locality.

The second part of the video discusses cache-oblivious algorithms for sorting and merging. Specifically, the video covers the cache complexity of merge sort, introduces the concept of multi-way merging, and explains the cache complexity of the multi-way merge algorithm. The video also discusses funnel sort, a cache-oblivious sorting algorithm built around the K-funnel, a recursively constructed merger (made of square-root-of-K funnels) that merges K sorted lists containing K^3 elements in total; its cache bound is optimal. The video explains that there are many other cache-oblivious algorithms and data structures, such as B-trees, ordered file maintenance, and priority queues. Overall, the video provides an introduction to cache-oblivious algorithms for those interested in learning more about the topic.

  • 00:00:00 In this section, the video introduces cache-oblivious algorithms, which automatically tune for the cache size of a machine and achieve good cache efficiency without the code needing any knowledge of the machine's cache parameters. The video uses the example of simulating heat diffusion: scientific computing often relies on differential equations to describe physical processes and then must write efficient code to simulate them. The video demonstrates how to write code based on the finite-difference approximation method for simulating the 1D heat equation.

  • 00:05:00 In this section, the speaker covers how to turn the heat equation into code using a finite-difference approximation and the stencil method. The equation involves partial derivatives of U with respect to T and X, and the speaker shows how to approximate them with finite differences. The 2D space-time grid is represented as a matrix, with X varying horizontally and T vertically; the speaker explains the boundary conditions of the matrix and computes U from the update equation shown below.
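For reference, the standard explicit finite-difference scheme for the 1D heat equation gives the update the code implements (a sketch of the usual derivation, assuming the notation u[t][x] for the grid values):

```latex
\[
\frac{\partial u}{\partial t} = \alpha\,\frac{\partial^2 u}{\partial x^2}
\;\approx\;
\frac{u[t{+}1][x]-u[t][x]}{\Delta t}
= \alpha\,\frac{u[t][x{-}1]-2u[t][x]+u[t][x{+}1]}{(\Delta x)^2},
\]
\[
u[t{+}1][x] = u[t][x] + \frac{\alpha\,\Delta t}{(\Delta x)^2}
\bigl(u[t][x{-}1]-2u[t][x]+u[t][x{+}1]\bigr).
\]
```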

  • 00:10:00 In this section, the presenter explains stencil computations, a method widely used in scientific computing. Each point in an array is updated using a fixed pattern called a stencil; here each update depends on three neighboring values, which is known as a three-point stencil. For large values of X the performance can suffer from cache misses, and the storage requirement can be reduced by keeping only two rows — the current one and the previous one — when updating the point values.

  • 00:15:00 In this section, the presenter explains how the two-row scheme works: in each time step, only the values for a single row are updated, using the values from the previous row, so the code can alternate which row is being written and which is being read and keep around only one extra row of values. However, this code is not very cache-efficient, and it can be made more so using tiling or cache-oblivious algorithms. The ideal-cache model assumes a fully associative cache with an optimal or LRU replacement policy; it captures capacity misses but not conflict misses in a serial execution. Nevertheless, it is still useful for designing efficient algorithms that exploit temporal and spatial locality. A sketch of the looping version appears below.
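A minimal sketch of the looping 1D stencil with the two-row storage trick; the grid size, step count, and the ALPHA constant (standing in for alpha*dt/dx^2) are illustrative assumptions, and boundary cells are simply left fixed.

```c
// Hedged sketch: u[2][N] keeps only the current and previous time steps,
// alternating rows via t & 1.
#define N     1000000
#define T     1000
#define ALPHA 0.1   // stands in for alpha * dt / dx^2

static double u[2][N];

void heat_1d_loop(void) {
  for (int t = 0; t < T; ++t) {
    double *cur  = u[t & 1];          // row holding time step t
    double *next = u[(t + 1) & 1];    // row being computed for t+1
    for (int x = 1; x < N - 1; ++x)
      next[x] = cur[x] + ALPHA * (cur[x - 1] - 2 * cur[x] + cur[x + 1]);
  }
}
```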

  • 00:20:00 In this section, the speaker explains how a cache-oblivious algorithm can compute all the points inside a trapezoidal region of the space-time grid without looking outside of it. The computation uses a kernel function that takes a pointer into u[t mod 2][x] and returns w[0] + alpha * (w[-1] - 2*w[0] + w[1]). The caching behavior of the looping code is analyzed assuming an LRU replacement policy: each time step (row) incurs Θ(N/B) cache misses when the row does not fit in cache, for Θ(NT/B) in total. The speaker notes that tiling would do better, but proceeds to the cache-oblivious version. A trapezoid has a top base at time t1 and a bottom base at time t0, and the algorithm computes all the points inside it using inequality conditions involving t, t0, t1, x, x0, x1, dx0, and dx1, where dx0 and dx1 are -1, 0, or 1 and represent the slopes of the sides.

  • 00:25:00 In this section, the speaker describes the divide-and-conquer structure of the trapezoidal algorithm, sketched below. The base case is a trapezoid of height 1, where a simple loop computes all the values in order. There are two recursive cases, the space cut and the time cut, chosen according to the trapezoid's width and height. When the trapezoid is too wide, a space cut slices it with a line of slope -1 through its center; when it is too tall, a time cut slices it with a horizontal line through its center. In each case the algorithm makes two recursive calls — left piece then right piece, or bottom piece then top piece — in a specific order so that every point is computed before any point that depends on it.
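A sketch of the 1D trapezoidal recursion, following Frigo and Strumpen's published cache-oblivious stencil code rather than the lecture's exact listing; kernel(t, x) is assumed to perform the one-point update into u[(t+1) mod 2][x].

```c
// Hedged sketch of the cache-oblivious trapezoid recursion in one dimension.
void kernel(int t, int x);   // assumed: one-point stencil update

void trapezoid(int t0, int t1, int x0, int dx0, int x1, int dx1) {
  int h = t1 - t0;                         // height of the trapezoid
  if (h == 1) {                            // base case: one time step
    for (int x = x0; x < x1; ++x) kernel(t0, x);
  } else if (h > 1) {
    if (2 * (x1 - x0) + (dx1 - dx0) * h >= 4 * h) {
      // Space cut: slice with a line of slope -1 through the center;
      // the left piece can be computed before the right piece.
      int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * h) / 4;
      trapezoid(t0, t1, x0, dx0, xm, -1);
      trapezoid(t0, t1, xm, -1, x1, dx1);
    } else {
      // Time cut: slice horizontally; compute the bottom half, then the top.
      int s = h / 2;
      trapezoid(t0, t0 + s, x0, dx0, x1, dx1);
      trapezoid(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
    }
  }
}
```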

  • 00:30:00 In this section, the speaker analyzes the cache complexity of the cache-oblivious trapezoidal algorithm using a recursion tree. The algorithm splits into two subproblems at each level until it reaches a base-case trapezoid of Θ(hw) points with h = Θ(w). Each leaf incurs Θ(w/B) cache misses, the number of leaves is Θ(NT/(hw)), and the internal nodes contribute only a constant factor. The cache complexity generalizes to more than one dimension; for d = 1 it comes to Θ(NT/(MB)).

  • 00:35:00 In this section, the speaker compares the looping code with the trapezoidal code, which has much better cache complexity, saving a factor of M (the cache size) over the looping code. The simulation demonstrates that the trapezoidal code incurs far fewer cache misses and therefore finishes the computation much faster. However, the speaker notes that this simulation was for one dimension only and goes on to show a demo of what happens in two dimensions.

  • 00:40:00 In this section, the presenter demonstrates a real-time simulation of heat diffusion in a 2D space, where the colors on the screen correspond to temperatures. The presenter compares the performance of the looping code and the trapezoid code, and reveals that although the trapezoid code incurs far fewer cache misses, it is only marginally faster than the looping code. It is suggested that this could be due to the hardware prefetching helping the looping code, as its access pattern is regular and easy to predict.

  • 00:45:00 In this section, the speaker discusses the interplay between caching and parallelism. They explain a theorem that states minimizing cache misses in the serial execution can essentially minimize them in the parallel execution assuming a low span algorithm. They then demonstrate how the trapezoidal algorithm can work in parallel using a V cut, where two independent trapezoids execute in parallel, and the gray trapezoid executes afterwards. They also mention the upside down V cut that is used for inverted trapezoids.

  • 00:50:00 In this section, the speaker compares the performance of the parallel looping code and the parallel trapezoidal code to their serial counterparts. The parallel looping code achieves less than half of its potential speedup despite having more potential parallelism than the trapezoidal code, because in the parallel case memory bandwidth becomes the bottleneck and prefetching can no longer help the looping code the way it does in the serial case, where bandwidth is plentiful. In contrast, the trapezoidal code achieves near-linear speedup, which is consistent with cache efficiency mattering most for large inputs, where the cache is small compared to the input size.

  • 00:55:00 In this section, the speaker discusses how to diagnose a parallel speedup bottleneck and identifies several possible causes: insufficient parallelism, scheduling overhead, lack of memory bandwidth, and contention. They suggest ways to diagnose these issues, such as using Cilkscale to measure the parallelism in the code and using hardware counters to measure memory bandwidth. Diagnosing contention is particularly challenging, so the speaker advises ruling out the first three causes before suspecting contention. Finally, the speaker moves on to sorting.

  • 01:00:00 In this section, the speaker analyzes the cache complexity of merge sort, starting with the complexity of merging two sorted input arrays, sketched below. The time to merge is proportional to the sum of the sizes of the input arrays, and the number of cache misses incurred while merging n elements is Θ(n/B), where B is the number of bytes in a cache line. Merge sort makes two recursive calls on inputs of half the size and then merges their sorted outputs. The overall work of merge sort is Θ(n log n), analyzed with the recursion-tree method. The cache complexity of merge sort is then shown to be Θ((n/B) log(n/M)), where M is the cache size.
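The standard two-way merge referred to above, as a sketch (not necessarily the lecture's exact code): each of the two inputs and the output is scanned sequentially exactly once, which is why merging n total elements incurs Θ(n/B) cache misses.

```c
// Merge two sorted int arrays A (length na) and B (length nb) into C.
void merge(const int *A, int na, const int *B, int nb, int *C) {
  int i = 0, j = 0, k = 0;
  while (i < na && j < nb)
    C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];   // take the smaller head
  while (i < na) C[k++] = A[i++];                // copy any leftovers of A
  while (j < nb) C[k++] = B[j++];                // copy any leftovers of B
}
```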

  • 01:05:00 In this section, the speaker sets up the recurrence for the cache complexity of merge sort. The base case occurs when the problem fits in cache, incurring Θ(n/B) cache misses; otherwise there are two recursive calls of size n/2 plus Θ(n/B) cache misses to merge the results. The recursion tree has Θ(log(n/(cM))) levels for a constant c, each leaf incurs Θ(M/B) cache misses, and there are n/(cM) leaves, giving Θ((n/B) log(n/(cM))) cache misses overall. Compared to the work, this saves a factor of B in the first term; for large problem sizes we save only a factor of B, while for small problem sizes that fit in cache we save a factor of B log n. The speaker asks whether we can do better, and the answer is yes.
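Written out, the recurrence described above is (a sketch in the ideal-cache model, with c an unspecified constant):

```latex
\[
Q(n) =
\begin{cases}
\Theta(n/B) & \text{if } n \le cM,\\[2pt]
2\,Q(n/2) + \Theta(n/B) & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
Q(n) = \Theta\!\left(\frac{n}{B}\,\log_2\frac{n}{M}\right).
\]
```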

  • 01:10:00 In this section, the speaker explains how to merge R sorted sub-arrays with a multi-way merge, using a tournament tree to repeatedly extract the minimum element of the sub-arrays. They also analyze the work and cache complexity of this approach, a stepping stone toward cache-oblivious sorting: the work matches binary merge sort, while the cache complexity is determined by the misses needed to keep the tournament tree and the heads of the input arrays in cache, which works out well as long as R is less than cM/B for a small constant c.

  • 01:15:00 In this section, the speaker analyzes the cache complexity of R-way (multi-way) merge sort and compares it to binary merge sort. The algorithm divides the problem into subproblems of size n/R and pays Θ(n/B) cache misses to merge them. The number of levels of the recursion tree is log base R of n/(cM), and the leaves have size cM. Setting R = Θ(M/B), as large as possible, gives a cache complexity of Θ(n log n / (B log M)), saving a factor of log M in cache misses over binary merge sort. However, the algorithm is not cache-oblivious: R must be tuned for a particular machine, which is a problem if other jobs are running and competing for the cache.

  • 01:20:00 In this section, the speaker discusses funnel sort, a cache-oblivious sorting algorithm built around the K-funnel, which merges K sorted lists containing a total of K^3 elements. The cost of this merge matches that of the multi-way merge, but funnel sort is cache-oblivious and its bound is optimal. The speaker shows a picture of a K-funnel and explains that it is constructed recursively from square-root-of-K funnels that feed elements into buffers, which in turn feed the final output square-root-of-K funnel that produces the output of the K-funnel. The speaker also mentions that there are many other cache-oblivious algorithms and data structures, such as B-trees, ordered file maintenance, and priority queues, and invites viewers to learn more about them online.
15. Cache-Oblivious Algorithms
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18

Lecture 16. Nondeterministic Parallel Programming

This video discusses various concepts related to deterministic and non-deterministic parallel programming. The speaker emphasizes the importance of avoiding non-determinism as it can lead to anomalous behaviors and difficult debugging. Strategies for managing non-determinism include using fixed values, record replay, analysis tools, encapsulation, and unit testing. The use of mutexes is explored in detail, including spinning vs. yielding, reentrant vs. non-reentrant, and fair vs. unfair. The speaker also touches on the concept of context switching and the "ski rental problem" in the context of parallel programming. The segment concludes by discussing basic principles of performance engineering in multi-core processors.

The second part of the video covers the problem of deadlock in parallel programming and offers solutions to avoid it, such as imposing a linear ordering on mutexes, which guarantees there is no cycle of waiting. The speaker then introduces transactional memory, which ensures atomicity by representing a critical region as a transaction that commits all of its memory updates at once. The video presents a lock-based implementation that uses a finite ownership array and a release-sort-reacquire protocol to prevent deadlock and starvation without needing a global lock. Under this protocol, a transaction that cannot greedily acquire a lock releases the conflicting locks and reacquires them in sorted index order, so locks are always taken in a consistent order and deadlock is avoided.

  • 00:00:00 In this section, the lecturer introduces the concept of determinism in programming and how it relates to parallel computing. A program is deterministic if every memory location is updated with the same sequence of values in every execution, and the program always behaves the same. Deterministic programs have the advantage of being repeatable, making them easier to debug. The lecturer emphasizes the importance of never writing non-deterministic parallel programs as they can exhibit anomalous behaviors that are difficult to debug. However, in practice, it can be challenging to avoid non-determinism in parallel computing.

  • 00:05:00 In this section, the speaker discusses non-deterministic parallel programming and the fact that it can sometimes lead to better performance, but should still be avoided unless necessary. The speaker recommends devising a test strategy to manage the non-determinism if you must write a program that way. Examples of strategies include turning off the non-determinism, using the same seed for random numbers or providing fixed values for program inputs that could otherwise change randomly. The speaker emphasizes the importance of having a way to handle bugs in the program caused by non-determinism.

  • 00:10:00 In this section, the speaker talks about strategies for dealing with non-deterministic programming, such as record replay, encapsulating non-determinism, substituting a deterministic alternative, using analysis tools, and unit testing. The speaker emphasizes the importance of controlling non-determinism to improve the chances of catching bugs in the code. The speaker also provides an example of using mutual exclusion and atomicity in a hash table to illustrate how to handle non-deterministic programming.

  • 00:15:00 In this section, the speaker discusses how parallel instructions accessing the same location can lead to race bugs and destroy the integrity of a data structure. The standard solution is to make certain instruction sequences atomic, meaning the rest of the system sees them as either fully executed or not executed at all. To achieve this, a mutual-exclusion (mutex) lock is used, which is an object with lock and unlock member functions. In the hash-table example, each slot becomes a struct containing a mutex and a head pointer to the slot's linked list, so the lock and unlock calls can bracket every access to the list and preserve correctness; a sketch follows.
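A minimal pthreads sketch of the per-slot locking scheme; the node and slot types and the insert routine are illustrative assumptions, not the lecture's exact definitions.

```c
#include <pthread.h>

typedef struct node {
  int key;
  struct node *next;
} node_t;

typedef struct {
  pthread_mutex_t mtx;   // protects this slot's list; initialize with
                         // PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init
  node_t *head;          // pointer to the head of the slot's list
} slot_t;

void slot_insert(slot_t *slot, node_t *x) {
  pthread_mutex_lock(&slot->mtx);    // make the list update atomic
  x->next = slot->head;
  slot->head = x;
  pthread_mutex_unlock(&slot->mtx);
}
```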

  • 00:20:00 In this section, the video explores the use of mutexes to implement atomicity and how it relates to determinacy races. Mutex locks can ensure that critical sections are atomic, but the resulting code is non-deterministic due to a determinacy race which may be what is desired in some cases. The video emphasizes the importance of understanding the difference between data races and determinacy races, and notes that simply eliminating data races does not necessarily mean that there are no bugs present in the code. It is important to have a way to detect non-determinism in order to avoid false positives or missing race bugs in the code.

  • 00:25:00 In this section, the speaker explains that having no data races does not necessarily mean that the code has no bugs. While no data races are positive aspects of code, the example of locking to provide atomicity can lead to an atomicity violation. In addition, benign races can occur, such as when two parallel processes are accessing the same value, which may result in the same outcome, but it could also lead to incorrect values on some architectures. The speaker argues that while some people consider benign races as harmless, it is still essential to identify and avoid them.

  • 00:30:00 In this section, the speaker explains the dangers of non-deterministic programming, particularly the race conditions that occur when multiple processes set values at the same location. The speaker notes that Cilk allows intentional races, but that they are dangerous if not used carefully: the complexity of races makes debugging difficult and can confound the tools meant to help verify correctness. The speaker also discusses how the implementation of mutexes and their parameters affect their behavior, such as whether they are yielding or spinning.

  • 00:35:00 In this section, the video explains three basic properties of mutexes in parallel programming: spinning vs. yielding, reentrant vs. non-reentrant, and fair vs. unfair. A spinning mutex keeps checking the lock in a tight loop, while a yielding mutex gives control back to the operating system so that something else can run. A reentrant mutex allows a thread that already holds the lock to acquire it again, while a non-reentrant mutex deadlocks if the thread attempts to reacquire a mutex it already holds. Lastly, a fair mutex grants the lock to the thread that has been waiting the longest, preventing starvation. The video also walks through an implementation of a simple spinning mutex in assembly language.

  • 00:40:00 In this section, the video discusses why the spinning mutex spins on an ordinary read of the lock instead of repeatedly executing the exchange instruction. An exchange behaves like a write: to perform it, the cache line must be invalidated in the other caches and held in a modified or exclusive state, which creates traffic on the memory network and slows everything down. By spinning on a read and only issuing the exchange when the lock appears free, the cache line can stay in the shared state, keeping the spin loop fast and generating far less coherence traffic; a C11-style sketch of this idea follows.
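A hedged C11-atomics sketch of the test-and-test-and-set idea described above (the lecture shows the mutex in assembly; this is an equivalent-in-spirit C version, not a transcription).

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool spinlock_t;      // true = held, false = free

void spin_lock(spinlock_t *l) {
  for (;;) {
    while (atomic_load_explicit(l, memory_order_relaxed))
      ;                              // spin on a read: line stays shared
    if (!atomic_exchange(l, true))   // lock looked free: try to grab it
      return;                        // old value was false, so we own it
  }
}

void spin_unlock(spinlock_t *l) {
  atomic_store(l, false);            // release the lock
}
```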

  • 00:45:00 In this section, the speaker discusses the difference between a spinning mutex and a yielding mutex. In a spinning mutex, the program keeps checking for the mutex to be unlocked, while in a yielding mutex, the program yields control to the operating system, which allows it to schedule something else. The speaker notes that there is also another kind of mutex called a competitive mutex, which achieves both goals of a spinning mutex and a yielding mutex. The speaker also explores using message-passing or interrupts to inform waiting threads but notes that the simpler solution would be to use a mutual exclusion mechanism.

  • 00:50:00 In this section, the speaker explains context switching, i.e., how often the operating system switches threads on the available processors. Typically the system context-switches on the order of 100 times per second, so a descheduled thread may wait around 10 milliseconds before running again. That is many orders of magnitude longer than the execution time of a simple instruction, so a huge number of instructions can execute before a yielded thread gets a chance to come back and grab the lock. This tension can be resolved by combining spinning and yielding, which the speaker relates to the "ski rental problem" from the theory literature.

  • 00:55:00 In this section, the video walks through the decision of whether to buy or rent equipment for a task, using sports equipment as the example: rent until what you have spent equals the purchase price, then buy, which guarantees you pay at most twice the optimum. Applied to mutexes, this means spinning for about as long as a context switch would take and then yielding (see the sketch below). The video also touches on context-switch time, disk access time, and the issue of deadlock when holding multiple locks at once. Overall, this segment covers basic principles of performance engineering on multi-core processors.
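A hedged sketch of the spin-then-yield strategy; SPIN_LIMIT stands in for "roughly the cost of a context switch" and is an illustrative tuning constant, not a value from the lecture.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 1000   // illustrative: roughly one context switch's worth

void competitive_lock(atomic_bool *l) {
  for (;;) {
    for (int i = 0; i < SPIN_LIMIT; ++i) {
      if (!atomic_load_explicit(l, memory_order_relaxed) &&
          !atomic_exchange(l, true))
        return;            // got the lock while spinning
    }
    sched_yield();         // spun long enough; let the OS run something else
  }
}
```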

  • 01:00:00 In this section, the speaker explains deadlock, which occurs when threads hold resources that other threads need while waiting for resources those threads hold, bringing progress to a halt. There are three necessary conditions for deadlock: mutual exclusion (exclusive control over resources), non-preemption (resources are held until their holder releases them), and circular waiting (a cycle of threads each waiting for a resource held by the next). Removing any one of these conditions prevents deadlock. The speaker illustrates the problem with "Dining Philosophers," a story told by Tony Hoare based on an examination question by Edsger Dijkstra: philosophers seated around a table each need the two chopsticks adjacent to their plate of noodles to eat, and if every philosopher picks up the chopstick on their left first, they all end up waiting forever and starve.

  • 01:05:00 In this section, the speaker presents a solution that avoids deadlock while keeping non-preemption: impose a linear ordering on the mutexes. By numbering each mutex and always acquiring locks in increasing numerical order, programmers guarantee that a cycle of waiting (the necessary condition for deadlock) cannot occur; a sketch appears below. However, he cautions programmers to be aware of the locks the Cilk runtime system uses internally, since introducing additional nondeterminism around them can cause trouble, and he shares an example where deadlock can occur with just one lock, emphasizing the importance of careful design in parallel programs.
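A minimal sketch of ordered acquisition for a pair of mutexes; ordering the locks by address is one convenient convention for assigning the "numbers," not something prescribed by the lecture.

```c
#include <pthread.h>

// Always take the lower-ordered lock first, so no cycle of waiting can form.
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
  if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }  // sort by address
  pthread_mutex_lock(a);   // everyone takes the "smaller" lock first...
  pthread_mutex_lock(b);   // ...so nobody holds b while waiting for a
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
  pthread_mutex_unlock(a);
  pthread_mutex_unlock(b);
}
```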

  • 01:10:00 In this section, the speaker delves into the topic of transactional memory, a recent research-level technique that involves having database transactions where atomicity happens implicitly, without the need for specifying locks. The speaker gives an example of how transactional memory is useful in concurrent graph computation, namely Gaussian elimination, where having two nodes eliminated simultaneously causes atomicity violations. The idea behind transactional memory is to represent the critical region as a transaction, and upon committing, all memory updates in the critical region appear as if they happened at once.

  • 01:15:00 In this section, the speaker discusses conflict resolution, contention resolution, forward progress, and throughput in transactional memory. They introduce an algorithm that uses a lock-based approach with a finite ownership array and a release-sort-reacquire method to prevent deadlock and starvation without needing a global lock. The algorithm logs reads and writes so a transaction can be rolled back when it aborts or committed atomically when it succeeds. The ownership array holds the lock for each location a hash function maps to, allowing locks to be acquired fairly. This approach allows concurrent transactions without sacrificing performance or risking deadlock.

  • 01:20:00 In this section, the speaker explains the release-sort-reacquire protocol. The algorithm greedily tries to acquire the lock for each memory location it touches; if a conflict arises, the transaction rolls back without releasing the locks it already holds, releases all held locks with indexes greater than the index of the conflicting location, acquires the desired lock, reacquires the released locks in sorted order, and then restarts the transaction. This repeats until the transaction succeeds. Because locks are always taken in sorted index order, no cycle of waiting can form, which is how the protocol avoids deadlock without hurting performance.
16. Nondeterministic Parallel Programming
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18

Lecture 17. Synchronization Without Locks

Charles Leiserson explores the concept of synchronization without locks in a YouTube video. He provides an example which demonstrates the need for a global linear order of instructions to ensure sequential execution, and discusses how mutual exclusion can be achieved through sequential consistency, avoiding the difficulties and potential problems of using locks. Leiserson presents Peterson's algorithm as a solution that uses only load and store operations to access critical sections without interference from concurrent processes. The video also covers the challenges of synchronization through memory due to hardware reordering, and the concept of memory fences to maintain relative ordering with other instructions. Leiserson argues that supporting sequential consistency for parallel machines is beneficial but difficult to achieve for hardware designers.

The second part of the video discusses the use of memory fences and synchronization instructions to prevent errors in parallel programs. The speaker explores different ways of implementing memory fences, implicitly or explicitly, and the importance of careful engineering and coordination between the teams working on different parts of a processor. The lecturer then turns to the compare-and-swap (CAS) instruction in the C11 language standard, which makes it possible to implement an n-thread deadlock-free mutual-exclusion algorithm using only constant space. Charles Leiserson explains the problem with locks in a multi-threaded system — a thread holding a lock for a long time can block every other thread waiting on the same resource — and how CAS-based lock-free updates avoid it. Finally, the video highlights the ABA problem with compare-and-swap and encourages those interested to learn more about lock-free algorithms.

  • 00:00:00 In this section, the lecturer introduces sequential consistency: the instructions of the different processors appear to be interleaved into a single global linear order of all instructions. This is the most important memory model from a theoretical point of view, and understanding memory models matters because, in the absence of locks, the behavior of a parallel program depends on the memory model. One example from the lecture asks whether two processors can both end up with the value 0 after executing their code in parallel, and the answer depends on the memory model being used.

  • 00:05:00 In this section, Charles Leiserson discusses synchronization without locks, where loads and stores must appear to obey some global linear order for an execution to be sequentially consistent. He walks through the possible interleavings of the example to show the different execution paths and notes that, although the hardware can interleave operations in many ways, the result must look as if the loads and stores followed a linear order. He concludes that no sequentially consistent execution ends with both values being zero, and points out that, interestingly, no modern computer actually provides sequential consistency.

  • 00:10:00 In this section, sequential consistency is examined more formally in terms of the happens-before relation, which combines each processor's program order with the global order in which the loads and stores take effect. The lecture then shows that mutual exclusion can be accomplished without heavy-duty atomic instructions or locks, by reasoning about sequential consistency and implementing the shared data structure using only loads and stores. The lecture notes describe methods that use specialized operations such as test-and-set or compare-and-swap, but the solution presented here avoids the deadlocks and other problems that can come with locks.

  • 00:15:00 In this section, Charles Leiserson explains Peterson's algorithm for achieving mutual exclusion between two concurrent processes, Alice and Bob, using only load and store operations (a sketch follows below). The algorithm uses a "wants to enter" flag for each process plus a turn variable, and it ensures that only one process at a time can enter the critical section and modify the shared resource. Unlike locks, Peterson's algorithm does not rely on acquiring or releasing anything; mutual exclusion emerges from the loads and stores themselves.
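A hedged sketch of Peterson's algorithm for two threads. Sequentially consistent C11 atomics stand in for the plain loads and stores the algorithm assumes; on real hardware, plain variables would need memory fences, as the later part of the lecture explains.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool want[2];   // want[i]: thread i wants to enter
static atomic_int  turn;      // whose turn it is to wait

void enter_critical(int me) {          // me is 0 (Alice) or 1 (Bob)
  int other = 1 - me;
  atomic_store(&want[me], true);       // announce intent
  atomic_store(&turn, other);          // politely yield the turn
  while (atomic_load(&want[other]) && atomic_load(&turn) == other)
    ;                                  // wait while the other wants in
}                                      //   and it is the other's turn

void leave_critical(int me) {
  atomic_store(&want[me], false);      // give up the critical section
}
```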

  • 00:20:00 In this section, the speaker discusses Peterson's algorithm for achieving mutual exclusion in the critical section without using locks. He explains that only one person at a time can enter the critical section, and the algorithm ensures that someone who wants to enter the critical section can do so if they're the only one who wants to go. The speaker then presents a proof of the algorithm, which involves assuming both Alice and Bob find themselves in the critical section together and deriving a contradiction. The proof relies on the "happens before" relationship and the program order of the instructions executed.

  • 00:25:00 In this section, the speaker continues the correctness argument for synchronizing without locks. They establish chains of instructions related by happens-before and show that those instructions must occur in a particular order, using Bob and Alice contending for a shared resource as the running example, and they note that engineers reasoning this way need to check the happens-before relationships carefully to be sure the required orderings actually hold.

  • 00:30:00 In this section, Charles Leiserson discusses the concept of model checking and its use in analyzing network, security, and cache protocols. He then goes on to explain the properties of Peterson's algorithm, which guarantees starvation freedom but does not work with more than two processes. Leiserson also explores the challenges of synchronizing through memory and the lack of sequential consistency in modern processors, which leads to relaxed memory consistency and difficulty in getting synchronization correct.

  • 00:35:00 In this section, the speaker discusses when it is safe for the hardware or compiler to reorder instructions in order to hide load latency. Reordering is safe when there is no data dependency between the instructions, i.e., they do not share data or touch the same memory location; issuing the load earlier lets other useful work proceed while waiting for the result, improving performance. Understanding these hardware-level rules helps in reasoning about software and optimizing performance.

  • 00:40:00 In this section, Charles Leiserson explains how such reordering plays out in synchronization without locks. He clarifies that reordering is safe as long as there is no concurrency, and that it pays off when it fills what would otherwise be bubbles in the instruction stream. In modern processors, the hardware holds stores in a store buffer and lets loads bypass them by going directly to the memory system; this bypassing can lead to correctness problems if one of the buffered stores is to the very location being loaded.

  • 00:45:00 In this section, Charles Leiserson explains how hardware reordering arises, why it does not satisfy sequential consistency, and how x86 instead provides a memory-consistency model called total store order (TSO), which is weaker than sequential consistency. Among TSO's rules: loads are never reordered with other loads; stores are seen in a consistent order by external processors; stores are never reordered with other stores; and a load may be reordered with a prior store to a different location, but not with a prior store to the same location. Locked instructions and the memory ordering as a whole respect a total order.

  • 00:50:00 In this section, the speaker discusses how instruction reordering can violate sequential consistency and produce incorrect values. Reordering can happen both in hardware and in the compiler. Locked instructions will not be moved ahead of other instructions, but loads can move ahead of stores when they are to a different address. The danger is illustrated with Peterson's algorithm, which breaks completely if certain reorderings occur, which is why great care is needed when synchronizing through memory.

  • 00:55:00 In this section, Charles Leiserson discusses the issue of reordering when writing parallel and concurrent code, where it is important to avoid a particular load occurring before a store. To handle such scenarios, engineers introduce memory fences, which maintain relative ordering with other instructions, but they come at the cost of increased complexity and potential bugs. Leiserson also argues that it is beneficial for parallel machines to support sequential consistency, but it is a challenge for hardware designers to accomplish.

  • 01:00:00 In this section, the speaker discusses the importance of memory fences and synchronization instructions in preventing parallel programs from encountering errors. The speaker describes different ways in which memory fences can be implemented, such as explicitly as an instruction or implicitly through other synchronization instructions such as locking and exchanging. However, there are cases in which a memory fence can actually slow down a program, showing the importance of careful engineering and coordination between different teams working on different aspects of a processor. Additionally, the speaker advises using volatile variables and compiler fences to prevent the compiler from optimizing away variables and causing errors in parallel code.

  • 01:05:00 In this section, the lecturer discusses synchronization without locks in the C11 language standard. The standard defines a weak memory model and lets you declare variables as atomic, which requires using relatively expensive operations such as memory fences or atomic compare-and-swap for a deadlock-free mutual-exclusion algorithm. The CAS (compare-and-swap) instruction is explored as the building block of the lock-free approach: it checks whether the value in memory still equals the expected old value and, if so, swaps in the new value, all atomically. Using CAS, an n-thread deadlock-free mutual-exclusion algorithm can be implemented in constant space. A lock can be built from it by spinning until the CAS succeeds in swapping in true, indicating that the caller now holds the lock.

  • 01:10:00 In this section, Professor Charles Leiserson explains how to use compare-and-swap (CAS) to solve synchronization problems. He demonstrates how to use CAS as a lock and fixes a bug in the code presented by a student. He then presents an incorrect code used to compute the sum of values in an array, which leads to a race condition. He introduces mutex locks as a typical solution and explains the potential problem of a thread being swapped out after acquiring the lock, leading to other threads waiting for the lock, thus hindering progress.

  • 01:15:00 In this section, Charles Leiserson explains the problem with using locks in a multi-threaded system and the CAS-based alternative. With locks, a thread can hold the lock for a long time and block every other thread waiting for the same resource. With CAS, the update is instead expressed as: load the old value of x, compute the new value from it, and atomically swap the new value in only if x still holds the old value, retrying otherwise; a sketch appears below. Charles also mentions the ABA problem with compare-and-swap and encourages those interested in the topic to learn more about lock-free algorithms.
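A hedged C11 sketch of the CAS retry loop described above, using a long accumulator for simplicity (the lecture's example sums array values; the exact types here are an assumption). On failure, compare_exchange reloads the current value into expected, so the loop simply retries.

```c
#include <stdatomic.h>

// Lock-free accumulation: add delta to *x without ever holding a lock.
void cas_add(_Atomic long *x, long delta) {
  long expected = atomic_load(x);            // snapshot the current value
  while (!atomic_compare_exchange_weak(x, &expected, expected + delta))
    ;   // expected now holds the fresh value of *x; retry with it
}
```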
17. Synchronization Without Locks
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18

Lecture 18. Domain Specific Languages and Autotuning

In this video, Professor Saman Amarasinghe from the EECS department at MIT discusses the benefits of using domain-specific languages (DSLs) and autotuning in performance engineering. He emphasizes the importance of DSLs, which capture specific domains that are difficult to describe in general-purpose languages, allowing programmers to take advantage of domain experts' knowledge for better performance. He discusses optimization of graph processing using DSLs, including graph partitioning and the importance of the shape of the graph in computation. He introduces the DSL Halide for image processing, which enables fast code optimization and portability across machines, and discusses its use at companies like Google and Adobe. Ultimately, he highlights the value of experimenting with different approaches when optimizing code, while focusing on parallelism, locality, and redundant work.

The video also talks about the challenges of performance engineering and of finding the parameters that make a program run efficiently. The speaker suggests that autotuning can address this problem by automatically searching the large parameter space for good values. He notes that exhaustive search is usually impractical and that heuristic-based solutions may not be optimal, whereas autotuning — defining a space of acceptable values and iterating based on measured performance — can find good solutions much faster. The speaker also describes applying autotuning to generate schedules for the graph DSL, which produced schedules more efficiently, and sometimes better, than exhaustive search.

  • 00:00:00 In this section, Professor Saman Amarasinghe, a professor in the EECS department at MIT, talks about domain-specific languages and autotuning. He explains that DSLs have engineering benefits because they capture domain-specific patterns that are hard to describe in a general-purpose language; they are much easier to understand and maintain, and they let the programmer take advantage of domain experts' knowledge to get very good performance. Additionally, if built correctly, a domain-specific language can leave the lower-level decisions to the compiler, making optimization much simpler.

  • 00:05:00 In this section, the speaker discusses the use of domain-specific languages (DSLs) in performance engineering. They encourage leaving performance to the compiler whenever possible, and introduce three languages/frameworks: GraphIt for graph processing, Halide, and OpenTuner. They point out that graphs are everywhere, from Google searches to recommendation engines, and use Google's PageRank algorithm as an example: the algorithm looks at a page's neighbors and computes a new rank based on their influence, showing the importance of graph processing in performance engineering.

  • 00:10:00 In this section, the speaker discusses graph algorithms, which can involve processing huge amounts of data and iterating computations over the entire graph. A DSL can be used to optimize such code for performance. Graph-processing algorithms include topology-driven algorithms, where the entire graph participates in the computation, and data-driven algorithms, where processing happens on a small region or part of the graph. There are also multiple ways of performing graph traversals, each with a different set of trade-offs.

  • 00:15:00 In this section, graph partitioning is discussed, where a large graph is divided into smaller pieces. Partitioning enables parallelism and also provides good locality, since the nodes being worked on are small enough to fit in cache. Partitioning matters especially for social-network graphs, whose degrees follow a power-law distribution: certain nodes have far more connections, and therefore far more impact, than others, so those connections and nodes need special attention when the graph is processed.

  • 00:20:00 In this section, the speaker discusses the importance of the shape of a graph in computation, specifically how the size and locality of a graph can affect the efficiency of parallelism algorithms. Factors such as parallelism, locality, and extra work needed must be carefully balanced to achieve the best performance for a given algorithm, and the right method must be selected based on the type of graph, type of algorithm, and hardware being utilized. A domain-specific language was developed to separate the constant algorithm from the processing methods to better optimize these factors.

  • 00:25:00 In this section, the speaker discusses how domain-specific languages (DSLs) allow code to be written at a higher level, making it simpler and more elegant. They give examples of different algorithms and how a DSL expresses them concisely. Additionally, the speaker explains how the DSL's scheduling can be optimized to get the best possible speed, including parallel execution: the schedule can be varied as the graph or the machine changes, while the algorithm itself stays the same. The speaker emphasizes that DSLs provide simplicity in programming while being powerful enough to generate optimized schedules.

  • 00:30:00 In this section, the speaker discusses another domain-specific language, Halide, which focuses on image processing with dense regular structures. Halide helps reduce the amount of programming required to achieve optimized performance and increases the program's portability across different machines. The speaker provides a three by three blur example to demonstrate how Halide works. Halide has a similar pattern to the graph language discussed earlier in terms of optimizing performance through different combinations of various techniques.
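For context, here is a plain C sketch of the separable 3x3 box blur that the Halide example computes (horizontal pass, then vertical pass); the row-major float layout and the two-pass structure are illustrative assumptions, and this baseline is what Halide's scheduling then optimizes for locality and parallelism.

```c
// Baseline 3x3 box blur: blur_x into tmp, then blur_y into out.
void blur_3x3(const float *in, float *tmp, float *out, int w, int h) {
  for (int y = 0; y < h; ++y)                   // horizontal blur
    for (int x = 1; x < w - 1; ++x)
      tmp[y * w + x] =
          (in[y * w + x - 1] + in[y * w + x] + in[y * w + x + 1]) / 3.0f;
  for (int y = 1; y < h - 1; ++y)               // vertical blur
    for (int x = 1; x < w - 1; ++x)
      out[y * w + x] =
          (tmp[(y - 1) * w + x] + tmp[y * w + x] + tmp[(y + 1) * w + x]) / 3.0f;
}
```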

  • 00:35:00 In this section, the speaker discusses the use of domain-specific languages (DSLs) and autotuning to optimize code performance. He gives the example of an image filter that runs far faster when written in the DSL Halide than as straightforward C code. Halide decouples the algorithm from the schedule, which makes it simple to optimize a pipeline of pure functions. Lastly, the speaker emphasizes the trade-off among locality, parallelism, and redundant work needed to get the best performance out of the available computing resources.

  • 00:40:00 In this section, the speaker discusses the importance of locality when it comes to image processing. When processing a large image, it's not efficient to apply filters that operate on the entire image at once because it won't fit in the cache. Instead, the speaker suggests breaking down the image into smaller sections and applying filters to each section, focusing on locality and parallelism. This can be achieved by using a scheduling framework and optimizing computation bandwidth and storage granularity. It may also require some redundant work to achieve better locality and parallelism.

  • 00:45:00 In this section, the speaker discusses Domain Specific Languages and autotuning, focusing on the use of Halide. Halide can fuse library functions together, which is faster than calling libraries because there is more locality. The speaker shows how Halide can optimize computational processes such as bilateral filter computation and segmenting algorithms. In one example, a segmenting algorithm written in MATLAB, which had been calling hand-optimized libraries, was 70 times faster when written in Halide. Halide is an important part of Google, and it has been integrated into Android phones and was used in Google Glass.

  • 00:50:00 In this section, the speaker discusses the effectiveness of using Halide for front-end image processing and how it is becoming widely used across the industry. Halide delivered a 4-5% speed improvement over the previous implementation, which is significant for Google given the volume of video it processes; Adobe has announced that entire Photoshop filters are written in Halide, and the Qualcomm Snapdragon image processor and Intel are also starting to use it. The speaker also notes that Halide and the graph DSL are similar in automating optimization, making the performance engineer's job easier. Algorithmic optimization, however, is a higher-level change that requires specific contextual knowledge and is therefore difficult to automate; nonetheless, scheduling languages give users the option to try different approaches and achieve better performance.

  • 00:55:00 In this section, the speaker discusses the complexity of modern computer systems and how there is no one right way to optimize code. They emphasize the importance of trying out different approaches and the significance of caches, locality, and parallelism. They also discuss how people in various domains, such as biology and physics, spend a lot of time optimizing code rather than pursuing research because they are unable to write programs quickly enough due to code complexity. The speaker suggests that finding domains where people spend most of their time hacking and coming up with abstraction can be an interesting area to explore for research. Overall, the space to operate on to optimize programs includes parallelism, locality, and redundant work, and playing in this space is crucial for performance engineers.

  • 01:00:00 In this section, the speaker discusses the challenges of performance engineering, which involve finding the right parameters for a program to run optimally. He explains that there are numerous parameters that can be adjusted, such as memory allocation and block sizes, but that it can be difficult to determine the right value of each parameter for each program. Auto-tuning can address this problem by searching the large parameter space automatically for the optimal values. Heuristic-based solutions, which hard-code rules for certain situations, work most of the time but are not always optimal, and building models of the system is also problematic because a model rarely accounts for all of the important factors, which can lead to suboptimal results.

  • 01:05:00 In this section, the speaker discusses the challenge of finding optimal solutions in the face of changing technology and evolving requirements. He notes that programmers rely on a variety of heuristics, often based on guidelines or rules of thumb that may no longer apply. Exhaustive search is one option, but it is usually impractical given the sheer number of possible configurations. The speaker instead recommends autotuning as a way to prune the search space and find good solutions faster: define a space of acceptable values, choose a candidate value to test (for example, at random), measure, and iterate based on the performance results (a minimal search-loop sketch in this spirit follows this list). OpenTuner is one example of a framework that uses an ensemble of techniques, such as hill climbers and random search, to speed up this iterative process.

  • 01:10:00 In this section, the speaker discusses how auto-tuning can be applied to generating schedules, such as Halide schedules. Rather than exhaustively enumerating every possibility, the autotuner searches the space of candidate schedules and converges on good ones far more efficiently. This method was able to produce schedules in less than two hours, and in some cases the results were even better than what had been thought to be the best possible schedule.
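
To make the algorithm/schedule split from the 00:35:00 and 00:40:00 sections concrete, here is a minimal sketch in the spirit of Halide's well-known 3x3 blur example (not the exact filter from the lecture); the tile sizes and vector widths are illustrative assumptions.

```cpp
// A minimal Halide-style sketch (C++): algorithm vs. schedule.
// Assumes the Halide library is available; tile/vector sizes are illustrative.
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);          // a 2-D grayscale image
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // Algorithm: a pipeline of pure functions (a 3x3 box blur).
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: decide locality / parallelism / redundant work separately.
    blur_y.tile(x, y, xi, yi, 256, 32)      // tiles that fit in cache (locality)
          .vectorize(xi, 8)                 // SIMD within a tile
          .parallel(y);                     // tiles across cores (parallelism)
    blur_x.compute_at(blur_y, x)            // recompute per tile: a little
          .vectorize(x, 8);                 // redundant work traded for locality

    blur_y.compile_jit();                   // compile; realize() would run it
    return 0;
}
```

The point is that the two definitions at the top never change; only the schedule lines below them are edited (or autotuned) to trade locality, parallelism, and redundant work against one another.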
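
The autotuning loop described in the 01:00:00-01:10:00 sections can be pictured as below. This is not OpenTuner (which is a Python framework) and not the lecture's code; it is a hypothetical random-search loop over a tiny parameter space, where measure_runtime() stands in for actually building and timing the program.

```cpp
// Hypothetical random-search autotuner sketch in C++ (not OpenTuner).
// measure_runtime() is a stand-in for building and timing the real program.
#include <cstdio>
#include <limits>
#include <random>

struct Config { int block_size; int unroll; };

static double measure_runtime(const Config& c) {
    // Placeholder cost model; in practice, compile and run the program here.
    double miss_penalty = (c.block_size < 64 || c.block_size > 512) ? 2.0 : 1.0;
    return miss_penalty * (100.0 / c.unroll + c.unroll);
}

int main() {
    std::mt19937 rng(42);
    const int blocks[]  = {16, 32, 64, 128, 256};
    const int unrolls[] = {1, 2, 4, 8};
    std::uniform_int_distribution<int> block_choices(0, 4);
    std::uniform_int_distribution<int> unroll_choices(0, 3);

    Config best{};
    double best_time = std::numeric_limits<double>::infinity();

    // Randomly sample the space instead of exhaustively enumerating it.
    for (int trial = 0; trial < 100; ++trial) {
        Config c{blocks[block_choices(rng)], unrolls[unroll_choices(rng)]};
        double t = measure_runtime(c);
        if (t < best_time) { best_time = t; best = c; }
    }
    std::printf("best: block=%d unroll=%d time=%.2f\n",
                best.block_size, best.unroll, best_time);
    return 0;
}
```

OpenTuner layers smarter techniques (hill climbers, evolutionary search, and so on) on top of a loop like this and shifts trials toward whichever technique is currently performing best.
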
18. Domain Specific Languages and Autotuning
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Saman Amarasinghe. View the complete course: https://ocw.mit.edu/6-172F18

Lecture 19. Leiserchess Codewalk



Lecture 19. Leiserchess Codewalk

In this YouTube video titled "19. Leiserchess Codewalk", Helen Xu explains the rules of Leiserchess, a game played by two teams in which the game is decided by shooting a king with the laser: you win by shooting the opposing team's king and lose if your own king is shot. The game has two types of moves, basic and swat moves, and a Ko rule that makes a move illegal if it undoes the opponent's most recent move. Xu covers various aspects of the game and the codebase, including the Fischer time control method, Forsyth-Edwards Notation, the cloud autotester and project organization, evaluating and comparing bots using Elo ratings, move generation, static evaluation, and search techniques such as alpha-beta, principal variation search, subtree pruning, and transposition tables. She also discusses the importance of test infrastructure for optimizing the program, and the eval.c file, which contains heuristics that evaluate each square on the board based on the type of piece and its color.

The speaker also delves into the laser coverage aspect of the evaluation, explaining how the code generates all possible moves for a position based on the player's color using a while-true loop. They also explain the importance of correctness and how to test for it, suggesting that when changing the board representation, the code should convert between the old and new representations and check that they agree before optimizing for performance. The speaker also notes the flexibility provided by the Leiserchess UCI interface, which allows users to edit tables and commands to their liking, reassures viewers that the remaining dead code will be cleaned up, and asks that any other bugs be reported so they can be fixed.

  • 00:00:00 In this section, Helen walks through the rules of Leiserchess and how to play it. The game is played by two teams, tangerine and lavender, each with seven pawns and a king. The objective is to shoot the opposing team's king before your own king is shot. The game has two types of moves, basic and swat moves, and a Ko rule that makes a move illegal if it undoes the opponent's most recent move by simply moving back to the previous position. A draw occurs if there have been 50 moves by each team without a pawn being zapped, if the same position repeats, or if the two teams agree to a draw.

  • 00:05:00 In this section, the speaker explains the Fischer time control method used in Leiserchess. Each player starts with an initial time budget and receives a time increment after each move. In the example used, each player starts with 60 seconds and gets 0.5 extra seconds whenever they finish a move, though the actual limits can be anything (a small sketch of the time-budget arithmetic appears after this list). The speaker then demonstrates how to play Leiserchess by asking the audience to suggest moves for tangerine to zap the pawn on F5 while avoiding counter-attacks, and warns about the subtleties of the game: naively zapping pieces can open up lines of attack for the opponent.

  • 00:10:00 In this section, Helen Xu discusses Forsyth-Edwards Notation as a way of representing a position as a string, which is useful for debugging because it lets you jump back to a specific position easily. She also explains how to read the logs of games, breaking down each move and its corresponding annotations. Furthermore, she demonstrates how to use the scrimmage server to test bots against other teams in the class, how to challenge other teams, and how to view the matches played.

  • 00:15:00 In this section, Helen Xu discusses the cloud autotester and the project organization for Leiserchess. While the scrimmage server only allows one game at a time, the cloud autotester can run many games and binaries under a time control that each user can customize. The repository includes the doc, auto tester, pgnstates, tests, player, and webgui directories. The auto tester is a Java-based local autotester used to test changes locally, and configurations for it can be specified in the tests directory. Helen Xu also introduces the Universal Chess Interface (UCI), the communication protocol Leiserchess uses to interface with the autotester.

  • 00:20:00 In this section, the speaker discusses how the bots will be measured and compared using Elo ratings, a rating system based on relative skill level in zero-sum games. The Elo rating is not based solely on time but also on the nodes per second searched, reported through the UCI. The speaker then goes into more detail about playing the game, including move generation, board representation, and the struct used to store a position. A move is represented in 28 bits, which include the piece type, orientation, from square, intermediate square, and to square (a hypothetical bit-packing sketch appears after this list). The reference implementation generates all the moves for whoever's turn it is by iterating over the board and generating every move for each piece. The speaker mentions using the debugging function perft to check that a modified move generator returns the same moves from each position.

  • 00:25:00 In this section, the speaker discusses how to judge whether a move is good using static evaluation, which generates a score for a position based on heuristics. The heuristics include ones focused on the king, the pawns, and distance, and help determine whether a position is favorable. The speaker also talks about how game-playing programs use search trees to pick the best move based on evaluation. To reduce the number of nodes that must be evaluated, the speaker goes into detail about min-max search, and also discusses quiescence search, which improves the quality of leaf evaluations and leads to better play.

  • 00:30:00 In this section, the speaker discusses the alpha-beta algorithm, in which a search from a node is performed with a window [alpha, beta]. If the value of a move falls below alpha, the move is not good enough and you keep searching. One side is trying to maximize the score and the other to minimize it, so if the value rises above beta you generate a beta cutoff: the opponent would never allow that line, because it would be too good for you. The speaker then explains principal variation search, which assumes that the first move is the best one and runs a scout search, also called a zero-window search, on the remaining moves to verify that they are worse. This way of searching is a variation of the alpha-beta algorithm.

  • 00:35:00 In this section, the concept of subtree pruning in minimax search is discussed. The idea is to skip subtrees whose scores cannot beat the best score found so far, which reduces the number of nodes evaluated and speeds up the search. The opponent is also searching for their best move and trying to minimize the player's score. The balance between maximizing the player's score and minimizing the opponent's score is crucial: the goal is to find a move that is good for the player and that the opponent would actually allow.

  • 00:40:00 In this section, Helen Xu explains principal variation pruning, in which the leftmost subtree is assumed to be the best line, and early exits are taken when that assumption holds. If the assumption is false, the entire subtree must be searched. Scout search applies this recursively to lower subtrees, with initial passes to verify the assumption. This method improves pruning and processes most of the game tree with zero-window searches (a compact negamax sketch of this idea appears after this list).

  • 00:45:00 In this section, we learn about search optimizations for the Leiserchess program. One of the most important is the transposition table, which stores previously encountered positions and saves time by avoiding repeated searches. Other optimizations include Zobrist hashing to generate compact, nearly unique hashes for positions (a small sketch follows this list), the killer-move table to remember moves that proved good at a given depth so they are tried early, and futility pruning to skip exploring moves that cannot raise alpha. Implementing a better opening book, which stores pre-computed positions for the beginning of the game, is also recommended to save search time.

  • 00:50:00 In this section, the speaker discusses some useful tools for optimizing a chess program, including opening books and endgame databases that can be pre-computed. It is important to have good test infrastructure so you can innovate and make progress without getting bogged down in debugging. Because transposition tables introduce non-determinism, especially in parallel searches, it is important to be able to turn them off for testing, and fixed-depth searches are recommended when testing. Overall, having a good test infrastructure is crucial for making progress and avoiding significant debugging issues.

  • 00:55:00 In this section, the speaker discusses the eval.c file in the Leiserchess project and how it can seem overwhelming at first because of its size and complexity. As users become more familiar with it, however, they will gain confidence handling a decent-sized piece of code. The eval.c file contains heuristics such as p between, p central, k face, and k aggressive, which evaluate each square on the board based on the type of piece and its color. The p between heuristic determines whether a pawn lies between the two kings, while p central gives a bonus based on how close a pawn is to the center of the board. The k face and k aggressive heuristics give bonuses based on the orientation of the king and its distance from the opponent and from the edges of the board.

  • 01:00:00 In this section, the speaker explains laser coverage, one of the more complicated aspects of the evaluation. The code generates all the possible moves for a position, depending on the player's color, and then builds a coverage map using a while-true loop: the laser path bounces off certain pawns as it is traced, and the distance is incremented for each square it crosses. After the map is generated, the process is repeated, and the value is divided by the actual distance from the laser. Because the evaluation has to iterate over the board to find each piece, which slows things down, the speaker adds that representing the board differently, and asserting that both conversion functions yield the same result, is a good way to make optimization decisions safely.

  • 01:05:00 In this section of the video, the speakers discuss the importance of maintaining correctness in code and how to test for it. Converting from one representation to another can slow things down, but it is necessary to establish correctness before optimizing for performance. One way to test for correctness is to keep track of node counts and make sure they remain the same after a change. They also demonstrate how to run the code and show the UCI interface in leiserchess.c, which allows you to set values for various parameters without recompiling the binary each time. Lastly, they go over a table of heuristics and explain how to specify their values for the autotester.

  • 01:10:00 In this section, the speaker discusses the flexibility that Leiserchess provides for editing tables and commands through the UCI interface. Users can add and implement any commands they like, including new heuristics, and have full control over parsing and implementation. Although some dead code exists, the speaker reassures viewers that it will be cleaned up, and any other bugs should be reported so they can be fixed. Lastly, she notes that while the project may not be fun every minute, it provides plenty of enjoyment overall.
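
As a small aside on the 00:05:00 section, the Fischer time-control arithmetic is just budget minus time spent plus increment; the helper below is a hypothetical illustration using the example numbers from the lecture (60 s budget, 0.5 s increment).

```cpp
// Fischer (increment) time control: a tiny worked example.
// The helper name and numbers are illustrative, not from the Leiserchess code.
#include <cstdio>

double budget_after_move(double budget_s, double elapsed_s, double increment_s) {
    return budget_s - elapsed_s + increment_s;  // you lose on time if the budget runs out mid-move
}

int main() {
    double budget = 60.0;                            // initial budget in seconds
    budget = budget_after_move(budget, 2.3, 0.5);    // spent 2.3 s on a move, then gained 0.5 s
    std::printf("remaining: %.1f s\n", budget);      // prints 58.2 s
    return 0;
}
```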
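
The 00:20:00 section mentions that a move fits in 28 bits. The field names and widths in this sketch are assumptions chosen only to add up to 28 bits; the real Leiserchess encoding may differ. It just shows how such bit-packing works.

```cpp
// Hypothetical 28-bit move packing sketch; field widths are assumptions,
// not the actual Leiserchess layout.
#include <cassert>
#include <cstdint>

using move_t = uint32_t;

// Assumed layout (low to high bits):
//   piece type (2) | orientation (2) | from square (8) |
//   intermediate square (8) | to square (8)  = 28 bits total
constexpr move_t make_move(uint32_t ptype, uint32_t orient,
                           uint32_t from, uint32_t inter, uint32_t to) {
    return (ptype & 0x3u)
         | ((orient & 0x3u) << 2)
         | ((from  & 0xFFu) << 4)
         | ((inter & 0xFFu) << 12)
         | ((to    & 0xFFu) << 20);
}

constexpr uint32_t move_from(move_t m) { return (m >> 4)  & 0xFFu; }
constexpr uint32_t move_to(move_t m)   { return (m >> 20) & 0xFFu; }

int main() {
    move_t m = make_move(/*ptype=*/1, /*orient=*/2,
                         /*from=*/0x34, /*inter=*/0x34, /*to=*/0x35);
    assert(move_from(m) == 0x34 && move_to(m) == 0x35);
    (void)m;
    return 0;
}
```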
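
The 00:30:00 through 00:40:00 sections describe alpha-beta and principal variation (scout) search. The sketch below is a generic negamax formulation over a toy Position type, not the Leiserchess code: the first child gets the full window, the rest get a zero-window scout search, and a full re-search happens only when the scout search fails high.

```cpp
// Generic alpha-beta with principal variation (scout) search, negamax style.
// `Position` is a toy type for illustration, not the Leiserchess structs.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Position {
    int eval = 0;                       // static score from the side to move's view
    std::vector<Position> kids;         // legal successor positions
    const std::vector<Position>& children() const { return kids; }
    int static_eval() const { return eval; }
};

int pvs(const Position& pos, int depth, int alpha, int beta) {
    if (depth == 0 || pos.children().empty())
        return pos.static_eval();       // (a real engine would run quiescence here)

    bool first = true;
    for (const Position& child : pos.children()) {
        int score;
        if (first) {
            // Assume the leftmost move is the principal variation: full window.
            score = -pvs(child, depth - 1, -beta, -alpha);
            first = false;
        } else {
            // Zero-window scout search: only prove the move is no better.
            score = -pvs(child, depth - 1, -alpha - 1, -alpha);
            if (score > alpha && score < beta)                  // scout failed high:
                score = -pvs(child, depth - 1, -beta, -alpha);  // re-search fully
        }
        alpha = std::max(alpha, score);
        if (alpha >= beta) break;       // beta cutoff: opponent won't allow this line
    }
    return alpha;
}

int main() {
    // A tiny 2-ply tree: the root has two replies, each with two leaf evaluations.
    Position root;
    root.kids = { Position{0, {Position{3}, Position{5}}},
                  Position{0, {Position{-2}, Position{7}}} };
    std::printf("best score: %d\n", pvs(root, 2, -1000000, 1000000));  // prints 3
    return 0;
}
```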
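
The 00:45:00 section mentions transposition tables and Zobrist hashing. Here is a generic sketch of the idea, with board size, piece kinds, table size, and entry layout all chosen arbitrarily for illustration: each (piece, square) pair gets a random 64-bit key, a position's hash is the XOR of the keys of its pieces, and the hash indexes a table of previously searched positions.

```cpp
// Generic Zobrist hashing + transposition table sketch (not the Leiserchess code).
// Board size, piece kinds, and table layout are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

constexpr int kSquares = 64;       // assumed board size
constexpr int kPieceKinds = 12;    // assumed piece-kind/orientation count
constexpr size_t kTableSize = 1u << 20;

struct TTEntry { uint64_t key = 0; int depth = -1; int score = 0; };

// One random 64-bit number per (piece kind, square) pair.
std::vector<uint64_t> make_zobrist_keys() {
    std::mt19937_64 rng(2018);
    std::vector<uint64_t> keys(kPieceKinds * kSquares);
    for (auto& k : keys) k = rng();
    return keys;
}
const std::vector<uint64_t> zobrist = make_zobrist_keys();
std::vector<TTEntry> table(kTableSize);

// XOR the key in or out: the same call adds or removes a piece incrementally.
uint64_t toggle_piece(uint64_t hash, int piece_kind, int square) {
    return hash ^ zobrist[piece_kind * kSquares + square];
}

int main() {
    uint64_t h = 0;
    h = toggle_piece(h, /*piece_kind=*/0, /*square=*/12);   // place a piece
    h = toggle_piece(h, /*piece_kind=*/0, /*square=*/12);   // remove it again
    std::printf("empty-board hash restored: %s\n", h == 0 ? "yes" : "no");

    uint64_t pos = toggle_piece(0, 3, 28);
    TTEntry& slot = table[pos % kTableSize];
    slot = {pos, /*depth=*/4, /*score=*/17};                // store a search result
    bool hit = (table[pos % kTableSize].key == pos);        // probe it later
    std::printf("transposition table hit: %s\n", hit ? "yes" : "no");
    return 0;
}
```

Because the hash is an XOR, making or unmaking a move only toggles a few keys, so the hash can be maintained incrementally during search instead of being recomputed from scratch.
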
19. Leiserchess Codewalk
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Helen Xu. View the complete course: https://ocw.mit.edu/6-172F18