卷2:第15章 Open MPI

现有版本还是比较粗的翻译...

原文地址:http://www.aosabook.org/en/openmpi.html

作者:Jeffrey M. Squyres

15.1. Background

Open MPI [GFB+04] is an open source software implementation of The Message Passing Interface (MPI) standard. Before the architecture and innards of Open MPI will make any sense, a little background on the MPI standard must be discussed.

15.1. 背景

Open MPI [GFB+04] 是一个消息传递接口 (Message Passing Interface, MPI) 标准的开源软件实现。在深入Open MPI架构和内部结构之前,需要先介绍一些MPI标准的背景知识。

The Message Passing Interface (MPI)

The MPI standard is created and maintained by the MPI Forum, an open group consisting of parallel computing experts from both industry and academia. MPI defines an API that is used for a specific type of portable, high-performance inter-process communication (IPC): message passing. Specifically, the MPI document describes the reliable transfer of discrete, typed messages between MPI processes. Although the definition of an "MPI process" is subject to interpretation on a given platform, it usually corresponds to the operating system's concept of a process (e.g., a POSIX process). MPI is specifically intended to be implemented as middleware, meaning that upper-level applications call MPI functions to perform message passing.

消息传递接口 (MPI)

MPI标准是MPI论坛创立和维护的,该论坛是来自工业界和学术界的并行计算专家组成的开放团体。MPI定义一组API,用于一种可移植、高性能的进程间通信(IPC):消息传递。特别的,MPI文档描述了MPI进程间可靠传输离散的、类型化的消息。尽管“MPI进程”的定义是由特定平台自行解释,但是通常来说都对应于操作系统中进程的概念(比如,POSIX进程)。MPI的目的是成为一种中间件,上层应用通过调用MPI函数可以进行消息传递。

MPI defines a high-level API, meaning that it abstracts away whatever underlying transport is actually used to pass messages between processes. The idea is that sending-process X can effectively say "take this array of 1,073 double precision values and send them to process Y". The corresponding receiving-process Y effectively says "receive an array of 1,073 double precision values from process X." A miracle occurs, and the array of 1,073 double precision values arrives in Y's waiting buffer.

MPI定义一组高层次的API,对进程间传递消息的底层传输进行抽象。大致上,发送进程X可以高效的说:“将此数组的1073个双精度数值发送到进程Y”。对应的接受进程Y高效的说:“从进程X接受1073个双精度数值的数组”。奇迹发生了,这个含有1073个浮点数值的数组就会到达Y的等待缓冲中。

Notice what is absent in this exchange: there is no concept of a connection occurring, no stream of bytes to interpret, and no network addresses exchanged. MPI abstracts all of that away, not only to hide such complexity from the upper-level application, but also to make the application portable across different environments and underlying message passing transports. Specifically, a correct MPI application is source-compatible across a wide variety of platforms and network types.

请注意在此交换中未出现的东西:没有连接建立的概念,没有需要解释的字节流,也没有网络地址的交换。MPI将这些都进行了抽象,不只是对上层应用隐藏了这些复杂性,而且可以使应用在不同环境和底层的消息传输上可移植。特别的,一个正确的MPI应用可以在广泛的平台和网络类型上源代码级兼容。

MPI defines not only point-to-point communication (e.g., send and receive), it also defines other communication patterns, such as collective communication. Collective operations are where multiple processes are involved in a single communication action. Reliable broadcast, for example, is where one process has a message at the beginning of the operation, and at the end of the operation, all processes in a group have the message. MPI also defines other concepts and communications patterns that are not described here. (As of this writing, the most recent version of the MPI standard is MPI-2.2 [For09]. Draft versions of the upcoming MPI-3 standard have been published; it may be finalized as early as late 2012.)

MPI不只定义了点到点的通信(比如:发送和接受),而且定义了其他的通信模式,比如集合(collective)通信。集合操作是指一次通信中包含了多个进程。比如说,可靠的广播,即开始时只有一个进程有一个消息,广播操作后这个组内的所有进程都有了这个消息。MPI还定义了一些其他的概念和通信模式,没有在此处讨论。(在本文撰写中,最新的MPI标准是MPI-2.2[For09]。新的MPI-3标准的草稿版已经发布,可能最早在2012年末就会定稿。)

Uses of MPI

There are many implementations of the MPI standard that support a wide variety of platforms, operating systems, and network types. Some implementations are open source, some are closed source. Open MPI, as its name implies, is one of the open source implementations. Typical MPI transport networks include (but are not limited to): various protocols over Ethernet (e.g., TCP, iWARP, UDP, raw Ethernet frames, etc.), shared memory, and InfiniBand.

使用MPI

MPI标准具有许多实现,支持大量不同的平台,操作系统和网络类型。一些实现是开源的,一些是闭源的。Open MPI正如其名字所暗示的,是一个开源的实现。典型的MPI传输网络包括(但不限于):以太网上的多种协议(比如:TCP,iWARP,UDP,原始以太网帧等),共享内存和InfiniBand。

MPI implementations are typically used in so-called "high-performance computing" (HPC) environments. MPI essentially provides the IPC for simulation codes, computational algorithms, and other "big number crunching" types of applications. The input data sets on which these codes operate typically represent too much computational work for just one server; MPI jobs are spread out across tens, hundreds, or even thousands of servers, all working in concert to solve one computational problem.

典型的,MPI是在高性能计算(High Performance Computing, HPC)中使用。MPI本质上为模拟、计算算法和其他的大型数值计算应用提供IPC。通常来说,这些应用操作的输入数据代表了大量的计算,不适合于一台服务器。所以,MPI作业都是分布在几十个,几百个,甚至几千个服务器上,所有作业都是合作解决一个计算问题。

That is, the applications using MPI are both parallel in nature and highly compute-intensive. It is not unusual for all the processor cores in an MPI job to run at 100% utilization. To be clear, MPI jobs typically run in dedicated environments where the MPI processes are the only application running on the machine (in addition to bare-bones operating system functionality, of course).

所以,使用MPI的应用都是含有并行性并且是高度计算密集的。MPI作业中的所有处理器核都是100%的利用率并不是罕见的。很明显,MPI作业通常是运行在专门的环境中,即机器上只有MPI进程这唯一的应用运行(当然,还有基础的操作系统)。

As such, MPI implementations are typically focused on providing extremely high performance, measured by metrics such as:

  • Extremely low latency for short message passing. As an example, a 1-byte message can be sent from a user-level Linux process on one server, through an InfiniBand switch, and received at the target user-level Linux process on a different server in a little over 1 microsecond (i.e., 0.000001 second).
  • Extremely high message network injection rate for short messages. Some vendors have MPI implementations (paired with specified hardware) that can inject up to 28 million messages per second into the network.
  • Quick ramp-up (as a function of message size) to the maximum bandwidth supported by the underlying transport.
  • Low resource utilization. All resources used by MPI (e.g., memory, cache, and bus bandwidth) cannot be used by the application. MPI implementations therefore try to maintain a balance of low resource utilization while still providing high performance.

因此,MPI实现通常关注于提供非常高的性能,从以下尺度测量: - 短消息传递中非常低的延迟。例如,服务器上用户级Linux进程发送一条1字节的消息,通过InfiniBand交换机,被另外一台服务器的目标用户级Linux进程接受,整个过程只需1毫秒中很少一部分(比如:0.000001秒) - 短消息的极高网络注入率。一些制造商的MPI实现(配合专门的硬件)可以达到每秒向网络注入2千8百万条消息。 - 在消息大小增长时,可以快速达到底层传输支持的最大带宽。 - 低资源占用。所有MPI使用的资源(比如:内存,缓存和总线带宽)都不能被应用使用。所以,MPI实现尝试保持低资源占用和同样提供高性能的平衡。

Open MPI

The first version of the MPI standard, MPI-1.0, was published in 1994 [Mes93]. MPI-2.0, a set of additions on top of MPI-1, was completed in 1996 [GGHL+96].

Open MPI

1994年发布MPI标准第一个版本MPI-1.0[Mes93]。1996年,通过在MPI-1的基础上附加一些功能,完成了MPI-2.0[GGHL+96]。

In the first decade after MPI-1 was published, a variety of MPI implementations sprung up. Many were provided by vendors for their proprietary network interconnects. Many other implementations arose from the research and academic communities. Such implementations were typically "research-quality," meaning that their purpose was to investigate various high-performance networking concepts and provide proofs-of-concept of their work. However, some were high enough quality that they gained popularity and a number of users.

在MPI-1发布后的第一个十年内,涌现出许多MPI实现。很多实现来自于私有互连网络的厂商。另外一些实现来自于科研和学术界。这些实现是典型的“研究品”,它们的目标是探索各种高性能网络系统的想法,以及提供他们工作的概念验证。然而,其中有一些具有足够高的质量,获得一定数量的用户而普及起来。

Open MPI represents the union of four research/academic, open source MPI implementations: LAM/MPI, LA/MPI (Los Alamos MPI), and FT-MPI (Fault-Tolerant MPI). The members of the PACX-MPI team joined the Open MPI group shortly after its inception.

Open MPI融合了4个科研、学术界的开源MPI实现:LAM/MPI,LA/MPI(Los Alamos MPI)和FT-MPI(Fault-Tolerant MPI)。在PACX-MPI成立之后,其成员也立刻加入了Open MPI组。

The members of these four development teams decided to collaborate when we had the collective realization that, aside from minor differences in optimizations and features, our software code bases were quite similar. Each of the four code bases had their own strengths and weaknesses, but on the whole, they more-or-less did the same things. So why compete? Why not pool our resources, work together, and make an even better MPI implementation?

这4个开发团队决定进行合作,是因为共同认识到除了细微的优化和功能上的差别,我们的软件代码都是很类似的。其中各个软件实现都有各自的强项和弱项,但是总体上,它们或多或少都是在做同样的事情。所以,为什么要竞争?为什么不整合我们的资源,一起合作达到一个更好的MPI实现?

After much discussion, the decision was made to abandon our four existing code bases and take only the best ideas from the prior projects. This decision was mainly predicated upon the following premises:

在经过很多讨论后,决定放弃我们已有的4个代码,只从之前的项目中借用最好的思想。这个决定主要基于以下的考虑:

  • Even though many of the underlying algorithms and techniques were similar among the four code bases, they each had radically different implementation architectures, and would be incredible difficult (if not impossible) to merge.
  • Each of the four also had their own (significant) strengths and (significant) weaknesses. Specifically, there were features and architecture decisions from each of the four that were desirable to carry forward. Likewise, there were poorly optimized and badly designed code in each of the four that were desirable to leave behind.
  • The members of the four developer groups had not worked directly together before. Starting with an entirely new code base (rather than advancing one of the existing code bases) put all developers on equal ground.

  • 尽管这4个项目中很多底层和算法和技术都是类似的,但是它们的实现架构彻底不同,所以基本不可能合并在一起。

  • 这4个项目每一个都有各自明显的强项和弱项。特别地,每个项目都有一些希望发扬的功能和架构设计。类似地,每个项目也有一些期望丢弃的未优化和不好的设计。
  • 4个开发团队的成员之前从来没有直接合作过。从一个全新的项目开始(而不是基于某个已有的代码),可以让所有的开发者都处在同样的起跑线上。

Thus, Open MPI was born. Its first Subversion commit was on November 22, 2003.

最终,Open MPI诞生了。2003年11月22日产生了第一次Subversion提交。

15.2. Architecture

For a variety of reasons (mostly related to either performance or portability), C and C++ were the only two possibilities for the primary implementation language. C++ was eventually discarded because different C++ compilers tend to lay out structs/classes in memory according to different optimization algorithms, leading to different on-the-wire network representations. C was therefore chosen as the primary implementation language, which influenced several architectural design decisions.

15.2 架构

基于多种原因(绝大部分关系到性能或者可移植性),C 和 C++是两种可能的主要实现语言。最终,放弃C++是因为不同的C++编译器倾向于根据不同的优化算法进行数据结构或类的内存布局,这就导致了线上网络表现中的差异。因此,选择 C 作为主要的实现语言,进而影响了很多架构设计的决定。

When Open MPI was started, we knew that it would be a large, complex code base:

  • In 2003, the current version of the MPI standard, MPI-2.0, defined over 300 API functions.
  • Each of the four prior projects were large in themselves. For example, LAM/MPI had over 1,900 files of source code, comprising over 300,000 lines of code (including comments and blanks).
  • We wanted Open MPI to support more features, environments, and networks than all four prior projects put together.

在Open MPI开始的时候,我们就知道它会是一个大型和复杂的代码项目:

  • 在2003年,当时的MPI标准MPI-2.0已经定义了超过300个API函数。
  • 这4个先前的项目都是较大的。比如,LAM/MPI源代码文件超过1900个,总共超过300000行代码(包括注释和空行)。
  • 我们期望Open MPI能够比之前4个项目的总和支持更多的特性,环境和网络。

We therefore spent a good deal of time designing an architecture that focused on three things:

  1. Grouping similar functionality together in distinct abstraction layers.
  2. Using run-time loadable plugins and run-time parameters to choose between multiple different implementations of the same behavior.
  3. Not allowing abstraction to get in the way of performance.

因此,我们花费了一大堆时间进行架构设计,关注以下3件事情:

  1. 将类似的功能特性汇总到不同的抽象层次。
  2. 使用运行时可装载的插件和运行时参数,从而可以在多个具有相同行为的不同实现中进行选择
  3. 不让抽象妨碍性能

Abstraction Layer Architecture

Open MPI has three main abstraction layers, shown in Figure 15.1:

  • Open, Portable Access Layer (OPAL): OPAL is the bottom layer of Open MPI's abstractions. Its abstractions are focused on individual processes (versus parallel jobs). It provides utility and glue code such as generic linked lists, string manipulation, debugging controls, and other mundane—yet necessary—functionality.
    OPAL also provides Open MPI's core portability between different operating systems, such as discovering IP interfaces, sharing memory between processes on the same server, processor and memory affinity, high-precision timers, etc.

  • Open MPI Run-Time Environment (ORTE) (pronounced "or-tay"): An MPI implementation must provide not only the required message passing API, but also an accompanying run-time system to launch, monitor, and kill parallel jobs. In Open MPI's case, a parallel job is comprised of one or more processes that may span multiple operating system instances, and are bound together to act as a single, cohesive unit.
    In simple environments with little or no distributed computational support, ORTE uses rsh or ssh to launch the individual processes in parallel jobs. More advanced, HPC-dedicated environments typically have schedulers and resource managers for fairly sharing computational resources between many users. Such environments usually provide specialized APIs to launch and regulate processes on compute servers. ORTE supports a wide variety of such managed environments, such as (but not limited to): Torque/PBS Pro, SLURM, Oracle Grid Engine, and LSF.

  • Open MPI (OMPI): The MPI layer is the highest abstraction layer, and is the only one exposed to applications. The MPI API is implemented in this layer, as are all the message passing semantics defined by the MPI standard.
    Since portability is a primary requirement, the MPI layer supports a wide variety of network types and underlying protocols. Some networks are similar in their underlying characteristics and abstractions; some are not.

抽象层次架构

Open MPI具有3个主要的抽象层次,如图15.1所示:

  • 开放、可移植的访问层(Open Portable Access Layer, OPAL):OPAL位于Open MPI抽象的最底层,它关注于各个进程个体(相对于并行作业)。它提供了一些工具和集成代码,包括:链接列表,字符串操作,debug控制和其他平凡但是必须的功能。
    OPAL使得Open MPI的核心在不同操作系统间可移植。比如,发现IP网卡,在同一个服务器的进程间共享内存,处理器和内存的亲和性,高精度的计时器等。

  • Open MPI运行时环境(Open MPI Run-Time Environment, ORTE,发音为“or-tay”):MPI实现不只是提供必要的消息传递API,还必须提供辅助的运行时系统以发起,监视和杀死并行作业。在Open MPI中,并行作业是指一个或者多个可能跨多个操作系统的进程,组合在一起作为一个紧密耦合的单元。
    在只有很少或者没有分布式计算支持的简单环境下,ORTE使用rsh或者ssh启动并行作业中各个进程。更高级的情况下,专门的HPC环境通常会提供调度和资源管理器,以公平的在多个用户间共享计算资源。这种环境通常会提供特殊的API以在计算节点上发起和控制进程。ORTE支持众多这样的管理环境,包括(单不限于)Torque/PBS Pro, SLURM, Oracle Grid Engine和LSF。

  • Open MPI(OMPI):MPI层是最上面的抽象层次,唯一暴露给应用程序的层次。在这层次内,实现了MPI API并且所有的消息传递语意符合MPI标准的定义。
    因为可移植性是主要的需求,MPI层次支持广泛的网络类型和底层的协议。其中,一些网络在底层的特征和抽象上是类似的,有一些是不同的。

Open MPI 抽象层次 图15.1 Open MPI的抽象层次架构视图,包含3个主要层次:OPAL,ORTE和OMPI

Although each abstraction is layered on top of the one below it, for performance reasons the ORTE and OMPI layers can bypass the underlying abstraction layers and interact directly with the operating system and/or hardware when needed (as depicted in Figure 15.1). For example, the OMPI layer uses OS-bypass methods to communicate with certain types of NIC hardware to obtain maximum networking performance.

尽管各个层次都是依次上下分层排布,但是为了性能因素,如果必需时,ORTE和OMPI层可以越过下面的层次直接与所需的操作系统或硬件交互(如图15.1所示)。比如,OMPI层可以用绕过OS的方法直接与某种类型的NIC硬件通信,从而获得最大的网络性能。

Each layer is built into a standalone library. The ORTE library depends on the OPAL library; the OMPI library depends on the ORTE library. Separating the layers into their own libraries has acted as a wonderful tool for preventing abstraction violations. Specifically, applications will fail to link if one layer incorrectly attempts to use a symbol in a higher layer. Over the years, this abstraction enforcement mechanism has saved many developers from inadvertently blurring the lines between the three layers.

每一层都是一个独立的库,ORTE库依赖于OPAL库,OMPI库依赖于ORTE库。将各个层次分开到各自的库中,可以防止抽象的破坏。特别的,当一个层次不正确地试图使用高层的符号时,应用程序会链接失败。这么多年来,此种抽象的强制机制防止了很多开发人员不经意间混淆了三个层次间的界限。