_Lecture 1_

We looked at the ways computer performance can be improved: increasing processor speeds, designing better algorithms, and applying parallel processing. We discussed the difference between massive and lowly parallelism. The course will focus on the latter, due to current trends. These trends include powerful microprocessors, high-availability applications, and faster, open networking. We surveyed the types of lowly parallel computers (SMPs, NUMAs, clusters) and their associated memory access models: shared memory, distributed shared memory, and message passing. We considered the effect of network communication parameters: bandwidth, latency, reliability, homogeneity, security, and scalability.

_Lecture 2_

We looked in some detail at the basis of the message-passing model of communication. We showed that the general problem of distributed consensus is impossible to solve. From this it followed that completely reliable communication, in which no messages are lost or duplicated, is likewise impossible. We considered the design of protocols for network control processes (NCPs) that reduce the risk of lost and/or duplicated messages; this is done through the use of acknowledgment messages.

_Lecture 3_

We gave a definition of a cluster: a collection of interconnected computer systems that is used as a single computational resource. We looked at some examples of the application of clusters in the workplace: compute clusters, data management clusters, word processing clusters, and web serving clusters. We looked at the advantages enjoyed by clusters: better price/performance ratios, software pricing, system administration, and better use of tools and open standards.

_Lecture 4_

We looked at the difference between clusters and SMP computers. They differ along the lines of the following issues: scalability, availability, system management, and software licensing. We also looked at the difference between clusters and other parallel computers, e.g., massively parallel systems (SIMD, MIMD) and NUMA systems.

_Lecture 5_

We considered the question of how to program in parallel, and the question of what a programming model is. We considered the difference between descriptive and prescriptive models. We talked about the PRAM model and its lack of communication and synchronization costs. The three basic models considered in the text are the shared-memory (SMP), distributed-shared-memory (NUMA), and message-passing (cluster) models.

_Lecture 6_

To illustrate the differences between the models and the underlying issues, we used a running example: an iterative solution to a linear differential equation (the Laplacian). This example computes local averages in a 2-D array and iterates the process until a max-change value is sufficiently small. We took the obvious serial solution and looked at ways to parallelize it. In the SMP model we saw that we could parallelize the loops over the array to compute the averages, but that updating the shared max-change value resulted in a race condition. We looked at ways to resolve this with locks. However, locks result in serialization, and we need to look further to reduce this effect. One solution was to reduce the amount of parallelization by using row-oriented parallelism. This significantly reduced the effects of locking; however, the final output values were not consistent, which is another race condition. To achieve consistency, whole rows could be locked, but this can lead to a circular wait (deadlock). Consistency could instead be achieved by using alternation schemes, either row alternation or checkerboard alternation.
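To make the shared max-change race and its lock-based fix concrete, here is a minimal sketch, assuming an OpenMP shared-memory system and a Jacobi-style update into a second array; the grid size and the names (grid, new_grid, N) are illustrative and not taken from the course materials.

```c
/* Minimal OpenMP sketch of the shared max-change update from Lecture 6.
 * Assumptions (not from the course notes): a Jacobi-style update into a
 * second array, an N x N interior grid, and OpenMP as the SMP mechanism. */
#include <math.h>
#include <omp.h>

#define N 512

double grid[N + 2][N + 2], new_grid[N + 2][N + 2];

double relax_once(void)
{
    double max_change = 0.0;          /* shared across all threads */

    #pragma omp parallel for collapse(2)
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
            double change = fabs(new_grid[i][j] - grid[i][j]);

            /* An unsynchronized "if (change > max_change) max_change = change;"
             * is the race condition; a lock (critical section) fixes it but
             * serializes the threads at this point. */
            #pragma omp critical
            if (change > max_change)
                max_change = change;
        }
    }
    return max_change;   /* caller iterates until this is small enough */
}
```

Keeping a per-thread maximum and merging it once at the end (the chunking idea mentioned below) removes most of this serialization.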
Either alternation scheme, however, changes the original algorithm (from Gauss-Seidel to Jacobi). We saw a way to parallelize the original algorithm by processing along a wavefront across the array. We looked at a technique called chunking to optimize the computation of the shared max-change value. Looking back, we saw that chunking (or blocking) could also be used on the wavefront to improve cache locality. We did an analysis to show the best choice of block size, based on the size of the cache and the number of rows that must fit in it to avoid cache misses. In addition to cache locality, we saw that the choice of block size can also impact load balance. Load imbalance can result in idle processors and reduced speedup. Block size and task granularity can have a big effect in the SMP model, where global task queues are used. The conclusion is that, contrary to many industry claims, SMP programming can be quite complex, requiring careful orchestration to achieve good performance.

_Lecture 7_

In this lecture we looked at the message-passing (MP) model. We saw that race conditions still exist, caused by variations in message delivery speed (rather than by shared variables). There are no global queues, so task assignment and load balancing must be explicit. Universal connectivity is assumed. We looked at ways to rewrite the solution of the previous problem (Gauss-Seidel/Jacobi relaxation) to fit the MP model. As in the shared-memory solution, tasks can be formed by organizing the data, breaking it into blocks. In the MP model we must map the blocks to nodes and use messages to exchange boundary (edge) values. Computing the max-change must be done through a reduction operation and a broadcast. Here MPI support can be quite useful for simplifying programming: it provides addressing, deadlock-free messaging, and collective communication operations like reduction and broadcast. We looked at system-level optimizations for broadcasting, based on recursive doubling (or a binary tree structure). We looked at some of the complexity issues surrounding data layout and the lack of predictable message transfer times. We considered the communication overheads typical of message-passing systems. Although bandwidths seem to be converging, there are still great gaps between SMPs and MP systems in terms of latency. This forces the granularity of MP systems, i.e., the computation time between synchronization points, to be huge. A big research question is to find standard communication protocols that operate at sub-microsecond latencies. We looked at the problems caused by trying to overlap computation and communication, and considered a model where these operations are separated into phases. The BSP model delays sends and receives (puts and gets), bundles them together to reduce overhead, and makes use of software (and randomization) techniques to reduce network congestion. Proponents claim effective implementation on all current programming models.

_Lecture 8_

Here we looked at commercial programming models, which essentially seek to include the costs of I/O. We find two model flavors: a small-N model, which is an SMP with threads, and a large-N model, which is message passing with either global or local I/O. We saw how user-mode threads can effectively relieve I/O bottlenecks. Kernel threads are what is needed for parallel processing and must rely on operating system support. Typically, many user-mode threads will run on fewer kernel threads. We also considered the differences between global and local I/O models.
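Returning to Lecture 7's message-passing version of the relaxation, the sketch below shows the boundary exchange and the reduction-plus-broadcast of the max-change using MPI. The one-dimensional row-block decomposition, the fixed rank count, and the variable names are illustrative assumptions, not details taken from the course.

```c
/* Minimal MPI sketch of Lecture 7's message-passing relaxation step:
 * each rank owns a block of rows, exchanges boundary rows with its
 * neighbors, and the global max-change is computed with a reduction.
 * The row-block decomposition and the names here are illustrative. */
#include <math.h>
#include <mpi.h>

#define N    512             /* global interior size (illustrative)  */
#define ROWS (N / 4)         /* rows per rank, assuming 4 ranks      */

static double grid[ROWS + 2][N + 2], new_grid[ROWS + 2][N + 2];

double relax_once(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange boundary (edge) rows with neighbors; MPI_Sendrecv avoids
     * the send/receive ordering deadlocks mentioned in the lecture. */
    MPI_Sendrecv(grid[1],        N + 2, MPI_DOUBLE, up,   0,
                 grid[ROWS + 1], N + 2, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(grid[ROWS],     N + 2, MPI_DOUBLE, down, 1,
                 grid[0],        N + 2, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    /* Local Jacobi-style update and local maximum change. */
    double local_max = 0.0;
    for (int i = 1; i <= ROWS; i++)
        for (int j = 1; j <= N; j++) {
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
            double change = fabs(new_grid[i][j] - grid[i][j]);
            if (change > local_max)
                local_max = change;
        }

    /* Reduction plus broadcast of the max-change in one collective call. */
    double global_max;
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return global_max;
}
```

MPI_Allreduce combines the reduction and the broadcast in a single collective; implementations typically use recursive-doubling or tree schemes of the kind discussed in the lecture.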
_Lecture 9_

We presented a talk on high-level parallel programming languages used in academic settings for instruction and research. We covered in some detail the NESL language developed at CMU and the BSP language standard. We saw how certain language constructs are used to express parallelism, and we considered how the associated costs of using such constructs can be analyzed. The slides of this talk were made available to students.

_Lecture 10_

We considered the issues involved in creating a true cluster operating system. We looked at the general notion of a single system image (SSI) and how it can appear at many different layers: applications, middleware subsystems, file systems, over-kernels, and kernels. We considered where SSI boundaries exist and the difficulties inherent in widening them. We considered open standards and costs, distinctions from distributed operating systems, micro-kernel structuring, and the general impact on system administration.

_Lecture 11_

We considered the issues involved in evaluating the performance of the different architectural designs of clusters, NUMAs, and SMPs. We looked at specifying workload characteristics along the two-dimensional axes of bandwidth and latency requirements. Next time we will look at the costs involved in these architectural designs.
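One common way to reason about Lecture 11's two axes is the standard linear cost model T(n) = alpha + n/beta, where alpha is the per-message latency and beta the bandwidth. The sketch below simply applies this model; the parameter values are hypothetical placeholders for an SMP-like and a cluster-like network, not measurements from the course.

```c
/* Illustrative helper for Lecture 11's bandwidth/latency axes: the
 * standard linear cost model T(n) = alpha + n / beta, where alpha is
 * per-message latency (seconds) and beta is bandwidth (bytes/second).
 * The parameter values below are hypothetical placeholders, not
 * measurements from the course. */
#include <stdio.h>

static double transfer_time(double alpha, double beta, double bytes)
{
    return alpha + bytes / beta;
}

int main(void)
{
    double msg = 8.0 * 1024;                      /* an 8 KB message */

    /* Hypothetical parameter sets, chosen only to show how the two
     * axes combine into a single transfer-time estimate. */
    double t_smp     = transfer_time(1e-6,   1e9, msg);  /* 1 us,   1 GB/s   */
    double t_cluster = transfer_time(100e-6, 1e8, msg);  /* 100 us, 100 MB/s */

    printf("8 KB transfer: SMP-like %.1f us, cluster-like %.1f us\n",
           t_smp * 1e6, t_cluster * 1e6);
    return 0;
}
```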