Virtual Machines for Fine-Grained Parallelism

Over the years, numerous studies have pursued the development of application-specific operating systems and run-time environments for performance-critical applications, especially in parallel computing. Unfortunately, such solutions require that the (expensive) parallel hardware be dedicated entirely to that run-time environment. Virtualization presents an interesting opportunity to develop a fully custom solution without requiring a fully dedicated hardware platform: a shared hardware platform can be safely configured with a standard setup alongside multiple, distinct application-specific environments. However, for virtualization to be usable by performance-critical applications, the overheads of the virtualization services those applications rely on must be minimal or, ideally, nonexistent.

Network Latency in Virtual Environments

For our work with parallel simulation on clusters, network latency in the virtual environment must be as good as or better than native. Network virtualization does, however, come at a cost. Our initial studies with para-virtualized network drivers show that latency between virtual machines reaches no better than about 70% of native performance. Furthermore, even with PCI passthrough, our studies show that performance rarely exceeds 95% of native. Therefore, before a virtual machine can be deployed for fine-grained parallel computing, additional performance gains must be found.
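
To make these comparisons concrete, latency is typically measured with a round-trip ping-pong benchmark. The sketch below shows such a measurement in standard MPI; the message size and iteration count are illustrative choices, not the parameters of our study.

    /* Ping-pong latency microbenchmark: a minimal sketch of the kind of
     * measurement used to compare virtual and native network latency.
     * Run with two ranks, e.g.: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS    10000
    #define MSG_SIZE 8      /* small message to expose per-message latency */

    int main(int argc, char *argv[])
    {
        char buf[MSG_SIZE] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)  /* one-way latency = half the round-trip time */
            printf("one-way latency: %.2f us\n",
                   elapsed / ITERS / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }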

The software overheads contributing to network latency have been widely studied. Examples include Active Messages and related lightweight messaging layers (GAMMA, U-Net, Fast Messages) as well as InfiniBand. While the former are mostly academic studies, InfiniBand has been widely adopted by the high-performance computing community, although at a very high per-node cost. InfiniBand achieves its performance through a combination of hardware and software elements. Thus, because we are working within a virtual machine framework, replacing the TCP/IP virtual machine device drivers with InfiniBand over Ethernet (IBoE) drivers may allow us to raise performance to, and possibly above, native. Fortunately, such drivers already exist: they run the InfiniBand protocol over standard Ethernet in place of TCP/IP. Using these drivers, we measured message latency slightly better than native (TCP/IP) performance. We are encouraged, but not entirely satisfied.

Our current attack on virtual network performance couples PCI passthrough (or Direct Connect) with polling to achieve very low message latency. While this requires that the Virtual Machine Manager (VMM) assign a dedicated network card to the virtual machine, the assignment is temporary: the NIC is returned to the VMM when the application's execution completes. Furthermore, adding an extra network card to a compute node is a very low-cost expenditure (on the order of $10), so we believe it is a worthwhile cost in many cluster deployments. Using this model, we have been able to reduce message latency between applications in virtual machines to 60% below native.
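
Much of the latency gain comes from replacing interrupt-driven receive processing with busy polling: with a dedicated NIC, the guest can spin on the receive descriptor ring and avoid interrupts, VM exits, and context switches on the receive path. The sketch below illustrates the pattern; the descriptor layout and all names (rx_desc, rx_ring, DESC_DONE) are hypothetical placeholders, not LlamaOS or driver code.

    /* Sketch of a polling receive loop over a passthrough NIC's descriptor
     * ring. All names (rx_desc, rx_ring, RX_RING_SIZE, DESC_DONE) are
     * hypothetical placeholders, not actual LlamaOS or driver code. */
    #include <stdint.h>

    #define RX_RING_SIZE 256
    #define DESC_DONE    0x1  /* NIC sets this bit when a frame arrives */

    struct rx_desc {
        volatile uint32_t status;  /* written by the NIC via DMA */
        uint32_t length;
        void     *buffer;
    };

    static struct rx_desc rx_ring[RX_RING_SIZE];
    static unsigned rx_head;

    /* Spin until the next descriptor completes. Because the guest owns
     * the whole (dedicated) NIC, burning a core here trades CPU for
     * latency: no interrupt, no VM exit, and no context switch on the
     * receive path. */
    static void *poll_receive(uint32_t *len)
    {
        struct rx_desc *d = &rx_ring[rx_head];

        while (!(d->status & DESC_DONE))
            ;  /* busy-wait; the NIC flips the bit when the frame lands */

        *len = d->length;
        d->status = 0;                          /* recycle the descriptor */
        rx_head = (rx_head + 1) % RX_RING_SIZE;
        return d->buffer;
    }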

Low-LAtency Minimal Appliance Operating System (LlamaOS)

Leveraging our studies of low-latency messaging in virtual machines, we have developed a general-purpose Low-LAtency Minimal Appliance Operating System (LlamaOS). LlamaOS uses the standard Linux tool chain and supports development in C, C++, Fortran, and MPI. Many applications can be built directly into LlamaOS with little or no modification. The system is still under active development and is changing rapidly.
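
For example, an ordinary MPI program of the following form is the kind of code we expect to build into a LlamaOS image unchanged (the program is standard MPI; the LlamaOS build integration is not shown).

    /* A standard, unmodified MPI program of the kind that can be
     * compiled directly into LlamaOS (build integration not shown). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }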

The software architecture of LlamaOS is organized into two types of virtual machines: LlamaNet and LlamaApp. LlamaNet handles the networking interface between the nodes of the cluster; one instance of LlamaNet runs on each node. Each parallel process of the user application is compiled as a LlamaApp, and one or more LlamaApp instances are run on each node. LlamaNet exchanges messages between nodes, while within a node, messages pass between the LlamaApp instances and the local LlamaNet instance through the node's shared memory (a sketch of this intra-node path appears below). Support for threads within a LlamaApp is currently being developed.
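
One way to picture the intra-node path is as a single-producer/single-consumer ring in a shared-memory page, with a LlamaApp on one end and the local LlamaNet instance on the other. The sketch below illustrates that pattern under our assumptions; the structure and names are hypothetical, not LlamaOS internals.

    /* Single-producer/single-consumer ring in a shared-memory page, as
     * an illustration of the LlamaApp <-> LlamaNet intra-node path. The
     * structure and names are hypothetical, not LlamaOS internals. */
    #include <stdint.h>
    #include <string.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 2048

    struct msg_ring {
        volatile uint32_t head;        /* written only by the producer */
        volatile uint32_t tail;        /* written only by the consumer */
        uint8_t  slot[RING_SLOTS][SLOT_BYTES];
        uint32_t len[RING_SLOTS];
    };

    /* Producer side (e.g., a LlamaApp posting an outbound message). */
    static int ring_send(struct msg_ring *r, const void *msg, uint32_t len)
    {
        uint32_t head = r->head;
        if (((head + 1) % RING_SLOTS) == r->tail)
            return -1;                 /* ring full; caller retries */
        memcpy(r->slot[head], msg, len);
        r->len[head] = len;
        __sync_synchronize();          /* publish data before the index */
        r->head = (head + 1) % RING_SLOTS;
        return 0;
    }

    /* Consumer side (e.g., LlamaNet polling for outbound messages). */
    static int ring_recv(struct msg_ring *r, void *msg, uint32_t *len)
    {
        uint32_t tail = r->tail;
        if (tail == r->head)
            return -1;                 /* ring empty */
        *len = r->len[tail];
        memcpy(msg, r->slot[tail], *len);
        __sync_synchronize();          /* finish the copy, then recycle */
        r->tail = (tail + 1) % RING_SLOTS;
        return 0;
    }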

Finally, we are beginning to examine file I/O support and performance in a virtual machine through experiments with LlamaOS. This performance will become critical before LlamaOS can be used for big-data applications.