_Lecture 1_

We looked at the ways computer performance can be improved: increasing processor speeds, designing better algorithms, and applying parallel processing. We discussed the difference between massive and lowly parallelism. The course will focus on the latter, due to current trends. These trends include powerful microprocessors, high-availability applications, and faster, open networking. We surveyed the types of lowly parallel computers (SMPs, NUMAs, clusters) and their associated memory access models: shared memory, distributed shared memory, and message passing. We considered the effect of network communication parameters: bandwidth, latency, reliability, homogeneity, security, and scalability.

_Lecture 2_

We looked in some detail at the basis of the message-passing model of communication. We showed that the general problem of distributed consensus is impossible to solve. From this it followed that completely reliable communication, in which no messages are lost or duplicated, is likewise impossible. We considered the design of protocols for network control processes (NCPs) that reduce the risk of lost and/or duplicated messages; this is done through the use of acknowledgment messages.

_Lecture 3_

We gave a definition of a cluster: a collection of interconnected computer systems that is used as a single computational resource. We looked at some examples of the application of clusters in the workplace: compute clusters, data management clusters, word processing clusters, and web serving clusters. We looked at the advantages enjoyed by clusters: better price/performance ratios, software pricing, system administration, and better use of tools and open standards.

_Lecture 4_

We looked at the difference between clusters and SMP computers. They differ along the lines of the following issues: scalability, availability, system management, and software licensing. We also looked at the difference between clusters and other parallel computers, e.g., massively parallel systems (SIMD, MIMD) and NUMA systems.

_Lecture 5_

We considered the question of how to program in parallel, and the question of what a programming model is. We considered the difference between descriptive and prescriptive models. We talked about the PRAM model and its lack of communication and synchronization costs. The three basic models considered in the text are the shared-memory (SMP), distributed-shared-memory (NUMA), and message-passing (cluster) models.

_Lecture 6_

To illustrate the differences between the models and the underlying issues, we used a running example: an iterative solution to a linear differential equation (the Laplacian). This example computes local averages in a 2-D array and iterates the process until a max-change value is sufficiently small. We took the obvious serial solution and looked at ways to parallelize it. In the SMP model we saw that we could parallelize the loops over the array to compute the averages, but that updating the shared max-change value resulted in a race condition. We looked at ways to resolve this with locks. However, locks result in serialization, and we need to look further to reduce this effect. One solution was to reduce the amount of parallelization by using row-oriented parallelism. This significantly reduced the effects of locking; however, the final output values were not consistent, which is another race condition. To achieve consistency, whole rows could be locked, but this can lead to a circular wait (deadlock). Consistency could instead be achieved by using alternation schemes, either row alternation or checkerboard alternation.
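To make the shared max-change race and its lock-based fix concrete, here is a minimal sketch, assuming an OpenMP shared-memory system and a Jacobi-style update into a second array; the grid size and the names (grid, new_grid, N) are illustrative and not taken from the course materials.

```c
/* Minimal OpenMP sketch of the shared max-change update from Lecture 6.
 * Assumptions (not from the course notes): a Jacobi-style update into a
 * second array, an N x N interior grid, and OpenMP as the SMP mechanism. */
#include <math.h>
#include <omp.h>

#define N 512

double grid[N + 2][N + 2], new_grid[N + 2][N + 2];

double relax_once(void)
{
    double max_change = 0.0;          /* shared across all threads */

    #pragma omp parallel for collapse(2)
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
            double change = fabs(new_grid[i][j] - grid[i][j]);

            /* An unsynchronized "if (change > max_change) max_change = change;"
             * is the race condition; a lock (critical section) fixes it but
             * serializes the threads at this point. */
            #pragma omp critical
            if (change > max_change)
                max_change = change;
        }
    }
    return max_change;   /* caller iterates until this is small enough */
}
```

Keeping a per-thread maximum and merging it once at the end (the chunking idea mentioned below) removes most of this serialization.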
Either alternation scheme, however, changes the original algorithm (from Gauss-Seidel to Jacobi). We saw a way to parallelize the original algorithm by processing along a wavefront across the array. We looked at a technique called chunking to optimize the computation of the shared max-change value. Looking back, we saw that chunking (or blocking) could also be used on the wavefront to improve cache locality. We did an analysis to show the best choice of block size, based on the size of the cache and the number of rows that must fit in it to avoid cache misses. In addition to cache locality, we saw that the choice of block size can also impact load balance. Load imbalance can result in idle processors and reduced speedup. Block size and task granularity can have a big effect in the SMP model, where global task queues are used. The conclusion is that, contrary to many industry claims, SMP programming can be quite complex, requiring careful orchestration to achieve good performance.

_Lecture 7_

In this lecture we looked at the message-passing (MP) model. We saw that race conditions still exist, caused by variations in message delivery speed (rather than by shared variables). There are no global queues, so task assignment and load balancing must be explicit. Universal connectivity is assumed. We looked at ways to rewrite the solution of the previous problem (Gauss-Seidel/Jacobi relaxation) to fit the MP model. As in the shared-memory solution, tasks can be formed by organizing the data, breaking it into blocks. In the MP model we must map the blocks to nodes and use messages to exchange boundary (edge) values. Computing the max-change must be done through a reduction operation and a broadcast. Here MPI support can be quite useful for simplifying programming: it provides addressing, deadlock-free messaging, and collective communication operations like reduction and broadcast. We looked at system-level optimizations for broadcasting, based on recursive doubling (or a binary tree structure). We looked at some of the complexity issues surrounding data layout and the lack of predictable message transfer times. We considered the communication overheads typical of message-passing systems. Although bandwidths seem to be converging, there are still great gaps between SMPs and MP systems in terms of latency. This forces the granularity of MP systems, i.e., the computation time between synchronization points, to be huge. A big research question is to find standard communication protocols that operate at sub-microsecond latencies. We looked at the problems caused by trying to overlap computation and communication, and considered a model where these operations are separated into phases. The BSP model delays sends and receives (puts and gets), bundles them together to reduce overhead, and makes use of software (and randomization) techniques to reduce network congestion. Proponents claim effective implementation on all current programming models.

_Lecture 8_

Here we looked at commercial programming models, which essentially seek to include the costs of I/O. We find two model flavors: a small-N model, which is an SMP with threads, and a large-N model, which is message passing with either global or local I/O. We saw how user-mode threads can effectively relieve I/O bottlenecks. Kernel threads are what is needed for parallel processing and must rely on operating system support. Typically, many user-mode threads will run on fewer kernel threads. We also considered the differences between global and local I/O models.
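Returning to Lecture 7's message-passing version of the relaxation, the sketch below shows the boundary exchange and the reduction-plus-broadcast of the max-change using MPI. The one-dimensional row-block decomposition, the fixed rank count, and the variable names are illustrative assumptions, not details taken from the course.

```c
/* Minimal MPI sketch of Lecture 7's message-passing relaxation step:
 * each rank owns a block of rows, exchanges boundary rows with its
 * neighbors, and the global max-change is computed with a reduction.
 * The row-block decomposition and the names here are illustrative. */
#include <math.h>
#include <mpi.h>

#define N    512             /* global interior size (illustrative)  */
#define ROWS (N / 4)         /* rows per rank, assuming 4 ranks      */

static double grid[ROWS + 2][N + 2], new_grid[ROWS + 2][N + 2];

double relax_once(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange boundary (edge) rows with neighbors; MPI_Sendrecv avoids
     * the send/receive ordering deadlocks mentioned in the lecture. */
    MPI_Sendrecv(grid[1],        N + 2, MPI_DOUBLE, up,   0,
                 grid[ROWS + 1], N + 2, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(grid[ROWS],     N + 2, MPI_DOUBLE, down, 1,
                 grid[0],        N + 2, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    /* Local Jacobi-style update and local maximum change. */
    double local_max = 0.0;
    for (int i = 1; i <= ROWS; i++)
        for (int j = 1; j <= N; j++) {
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
            double change = fabs(new_grid[i][j] - grid[i][j]);
            if (change > local_max)
                local_max = change;
        }

    /* Reduction plus broadcast of the max-change in one collective call. */
    double global_max;
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return global_max;
}
```

MPI_Allreduce combines the reduction and the broadcast in a single collective; implementations typically use recursive-doubling or tree schemes of the kind discussed in the lecture.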
_Lecture 9_

We presented a talk on high-level parallel programming languages used in academic settings for instruction and research. We covered in some detail the NESL language developed at CMU and the BSP language standard. We saw how certain language constructs are used to express parallelism, and we considered how the associated costs of using such constructs can be analyzed. The slides of this talk were made available to students.

_Lecture 10_

We considered the issues involved in creating a true cluster operating system. We looked at the general notion of a single system image (SSI) and how it can appear at many different layers: applications, middleware subsystems, file systems, over-kernels, and kernels. We considered where SSI boundaries exist and the difficulties inherent in widening them. We considered open standards and costs, distinctions from distributed operating systems, micro-kernel structuring, and the general impact on system administration.

_Lecture 11_

We considered the issues involved in evaluating the performance of the different architectural designs of clusters, NUMAs, and SMPs. We looked at specifying workload characteristics along the two-dimensional axes of bandwidth and latency requirements. Next time we will look at the costs involved in these architectural designs.
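One common way to reason about Lecture 11's two axes is the standard linear cost model T(n) = alpha + n/beta, where alpha is the per-message latency and beta the bandwidth. The sketch below simply applies this model; the parameter values are hypothetical placeholders for an SMP-like and a cluster-like network, not measurements from the course.

```c
/* Illustrative helper for Lecture 11's bandwidth/latency axes: the
 * standard linear cost model T(n) = alpha + n / beta, where alpha is
 * per-message latency (seconds) and beta is bandwidth (bytes/second).
 * The parameter values below are hypothetical placeholders, not
 * measurements from the course. */
#include <stdio.h>

static double transfer_time(double alpha, double beta, double bytes)
{
    return alpha + bytes / beta;
}

int main(void)
{
    double msg = 8.0 * 1024;                      /* an 8 KB message */

    /* Hypothetical parameter sets, chosen only to show how the two
     * axes combine into a single transfer-time estimate. */
    double t_smp     = transfer_time(1e-6,   1e9, msg);  /* 1 us,   1 GB/s   */
    double t_cluster = transfer_time(100e-6, 1e8, msg);  /* 100 us, 100 MB/s */

    printf("8 KB transfer: SMP-like %.1f us, cluster-like %.1f us\n",
           t_smp * 1e6, t_cluster * 1e6);
    return 0;
}
```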