\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}

% Title Page
\title{Note for MPI Endpoints}
\author{}


\begin{document}
\maketitle

%\begin{abstract}
%\end{abstract}

\section{Purpose}

Use threads as if they were MPI processes. Each thread is assigned a rank and associated with an endpoints communicator
(EP\_Comm). Convention: one OpenMP thread corresponds to one endpoint.

\section{MPI Endpoints Semantics}

\begin{center}
\includegraphics[scale=0.4]{scheme.png}
\end{center}

Endpoints are created from one MPI communicator and the number of available threads:
\begin{center}
\begin{verbatim}
int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int num_ep,
                              MPI_Info info, MPI_Comm out_comm_hdls[])
\end{verbatim}
\end{center}

``In this collective call, a single output communicator is created, and an array of \verb|num_ep| handles to this new communicator
are returned, where the $i^{th}$ handle corresponds to the $i^{th}$ rank requested by the caller of \verb|MPI_Comm_create_endpoints|. Ranks
in the output communicator are ordered sequentially and in the same order as the parent communicator. After it has been created, the
output communicator behaves as a normal communicator, and MPI calls on each endpoint (i.e., communicator handle) behave as though
they originated from a separate MPI process. In particular, collective calls must be made once per endpoint.''\cite{Dinan:2013}

``Once created, endpoints behave as MPI processes. For example, all ranks in an endpoints communicator must participate in
collective operations. A consequence of this semantic is that endpoints also have MPI process progress requirements; that operations on
that endpoint are required to make progress only when an MPI operation (e.g. \verb|MPI_Test|) is performed on that endpoint. This semantic
enables an MPI implementation to logically separate endpoints, treat them independently within the progress engine, and eliminate
synchronization in updating their state.''\cite{Sridharan:2014}
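
As an illustration, here is a minimal usage sketch, assuming the proposed \verb|MPI_Comm_create_endpoints| routine is available and
OpenMP provides the threads (error handling omitted):

\begin{verbatim}
/* Minimal sketch: one endpoint handle per OpenMP thread.
   Assumption: MPI_Comm_create_endpoints exists as proposed. */
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int num_ep = omp_get_max_threads();
    std::vector<MPI_Comm> ep_comms(num_ep);
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, num_ep,
                              MPI_INFO_NULL, ep_comms.data());

    #pragma omp parallel
    {
        MPI_Comm my_comm = ep_comms[omp_get_thread_num()];
        int rank, size;
        MPI_Comm_rank(my_comm, &rank);  /* EP rank */
        MPI_Comm_size(my_comm, &size);  /* total number of endpoints */
        MPI_Barrier(my_comm);           /* collectives: once per endpoint */
    }

    MPI_Finalize();
    return 0;
}
\end{verbatim}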

\section{EP types}
\subsection{MPI\_Comm}
\verb|MPI_Comm| is composed of:
\begin{itemize}
\item[$\bullet$] \verb|bool is_ep|: true $\implies$ EP, false $\implies$ classic MPI;
\item[$\bullet$] \verb|int mpi_comm|: handle to the parent MPI communicator;
\item[$\bullet$] \verb|OMPbarrier *ep_barrier|: OpenMP barrier, used for in-process synchronization; it is distinct from the
\verb|omp barrier| directive;
\item[$\bullet$] \verb|int[2] size_rank_info[3]|: topology information of the current endpoint:
\begin{itemize}
\item rank of the parent MPI process;
\item size of the parent MPI communicator;
\item rank of the endpoint, returned by \verb|MPI_Comm_rank|;
\item size of the EP communicator, returned by \verb|MPI_Comm_size|;
\item in-process rank of the endpoint;
\item in-process size of the EP communicator, i.e. the number of endpoints in one MPI process.
\end{itemize}
\item[$\bullet$] \verb|MPI_Comm *comm_list|: pointer to the first endpoint communicator of one process;
\item[$\bullet$] \verb|Message_list *message_queue|: location of incoming messages for each endpoint;
\item[$\bullet$] \verb|RANK_MAP *rank_map|: a map from an integer to a pair of integers. The integer key is the rank of an
endpoint. The mapped pair gives the in-process rank of the endpoint and the rank of its parent MPI process:
\begin{center}
\begin{verbatim}
rank_map->at(ep_rank)=(ep_rank_local, mpi_rank)
\end{verbatim}
\end{center}
\item[$\bullet$] \verb|BUFFER *ep_buffer|: buffer (of type \verb|int|, \verb|float|, \verb|double|, \verb|char|, \verb|long|,
or \verb|unsigned long|) used for in-process communication.
\end{itemize}
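
For illustration, \verb|RANK_MAP| can be thought of as the following C++ type (a sketch; the actual definition in the library may
differ):

\begin{verbatim}
// Sketch (assumed layout): RANK_MAP maps a global EP rank to the
// pair (in-process rank of the endpoint, rank of its parent MPI
// process).
#include <map>
#include <utility>

typedef std::map<int, std::pair<int, int> > RANK_MAP;

// Example: 2 MPI processes with 2 endpoints each.
// EP rank:       0      1      2      3
// (local, mpi): (0,0)  (1,0)  (0,1)  (1,1)
void fill_example(RANK_MAP &rank_map)
{
    rank_map[0] = std::make_pair(0, 0);
    rank_map[1] = std::make_pair(1, 0);
    rank_map[2] = std::make_pair(0, 1);
    rank_map[3] = std::make_pair(1, 1);
}
\end{verbatim}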

\subsection{MPI\_Request}
\verb|MPI_Request| is composed of:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_request|: handle to the MPI request;
\item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
\item[$\bullet$] \verb|MPI_Comm comm|: handle to the EP communicator;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication;
\item[$\bullet$] \verb|int type|: type of the communication:
\begin{itemize}
\item 1 $\implies$ non-blocking send;
\item 2 $\implies$ pending non-blocking receive;
\item 3 $\implies$ non-blocking matching receive.
\end{itemize}

\end{itemize}

\subsection{MPI\_Status}
\verb|MPI_Status| consists of:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_status|: handle to the MPI status;
\item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}

\subsection{MPI\_Message}
\verb|MPI_Message| includes:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_message|: handle to the MPI message;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}
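
Rendered as C++, these wrappers could look as follows (a sketch of the members listed above; the actual definitions in the library
may differ):

\begin{verbatim}
// Sketch of the EP wrapper types (assumed layout).
// Placeholder for the EP communicator type described above (only
// the handle member is shown here).
struct MPI_Comm { int mpi_comm; /* ... other members ... */ };

struct MPI_Request {
    int      mpi_request;  // handle to the underlying MPI request
    int      ep_datatype;  // data type of the communication
    MPI_Comm comm;         // the EP communicator
    int      ep_src;       // rank of the source endpoint
    int      ep_tag;       // tag of the communication
    int      type;         // 1: Isend, 2: pending Irecv,
                           // 3: matching Imrecv
};

struct MPI_Status {
    int mpi_status;        // handle to the MPI status
    int ep_datatype;       // data type of the communication
    int ep_src;            // rank of the source endpoint
    int ep_tag;            // tag of the communication
};

struct MPI_Message {
    int mpi_message;       // handle to the MPI message
    int ep_src;            // rank of the source endpoint
    int ep_tag;            // tag of the communication
};
\end{verbatim}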

Other types, such as \verb|MPI_Info|, \verb|MPI_Aint|, and \verb|MPI_Fint|, are defined in the same way.

\section{P2P communication}

All EP point-to-point communications use the tag to distinguish the source and destination endpoints. To be able to encode this extra
information in the tag, we require that the tag value be representable using 31 bits in the underlying MPI implementation.

\includegraphics[scale=0.4]{tag.png}

EP\_tag is user defined. MPI\_tag is computed internally and used inside MPI calls. Because of this extension of the tag, wildcards such as
\verb|MPI_ANY_SOURCE| and \verb|MPI_ANY_TAG| cannot be used directly. An extra step of tag analysis is needed, which leads to the
message dequeuing mechanism.
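
One plausible encoding is sketched below; the actual bit layout is the one shown in the figure above, so the field widths used here
are assumptions for illustration only:

\begin{verbatim}
/* Sketch of a tag encoding: 8 bits per in-process endpoint rank
   and 15 bits for the user tag, i.e. 31 bits in total
   (assumed widths, for illustration only). */
#define EP_BITS  8
#define TAG_BITS 15

static int encode_tag(int src_local, int dest_local, int user_tag)
{
    return (src_local  << (EP_BITS + TAG_BITS))
         | (dest_local << TAG_BITS)
         | (user_tag & ((1 << TAG_BITS) - 1));
}

static void decode_tag(int mpi_tag, int *src_local,
                       int *dest_local, int *user_tag)
{
    *user_tag   =  mpi_tag & ((1 << TAG_BITS) - 1);
    *dest_local = (mpi_tag >> TAG_BITS) & ((1 << EP_BITS) - 1);
    *src_local  =  mpi_tag >> (EP_BITS + TAG_BITS);
}
\end{verbatim}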

\begin{center}
\includegraphics[scale = 0.5]{sendrecv.png}
\end{center}


In an MPI environment, each MPI process has an incoming message queue. In the EP case, messages for all threads inside one MPI process are stored
in this MPI queue. With the MPI 3 standard, we use the \verb|MPI_Improbe| routine to inquire the message queue and relocate each incoming
message into the local message queue of the corresponding thread/endpoint.
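
A minimal sketch of this dequeuing step follows; \verb|Message_list|, \verb|message_list_push|, and \verb|decode_tag| (from the
previous sketch) are assumptions, not the library's literal code, and here \verb|MPI_Comm| denotes the underlying MPI type:

\begin{verbatim}
/* Sketch: drain the MPI queue and dispatch matched messages to the
   per-endpoint local queues. */
void ep_dequeue(MPI_Comm mpi_comm, Message_list *local_queues)
{
    int flag = 1;
    while (flag) {
        MPI_Message msg;
        MPI_Status  status;
        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, mpi_comm,
                    &flag, &msg, &status);
        if (flag) {
            int src_local, dest_local, user_tag;
            decode_tag(status.MPI_TAG, &src_local, &dest_local,
                       &user_tag);
            /* queue the message handle at its destination endpoint */
            message_list_push(&local_queues[dest_local], msg,
                              status.MPI_SOURCE, src_local, user_tag);
        }
    }
}
\end{verbatim}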


\includegraphics[scale=0.3]{dequeue.png}

% Any EP calls will trigger the message dequeuing and the probing, (matched-)receiving operations are performed upon the local message queue.
%
% Example: \verb|EP_Recv(src=2, tag=10, comm1)|:
% \begin{itemize}
% \item[1.] Dequeue MPI message queue;
% \item[2.] call \verb|EP_Improb(src=2, tag=10, comm1, message)|;
% \item[3.] if find corresponding triple (src, tag, comm1), call \verb|EP_Mrecv(src=2, tag=10, comm1, message)|;
% \item[4.] else, repeat from step 2.
% \end{itemize}


\paragraph{Messages are \textit{non-overtaking}} The order of incoming messages is important! If one thread receives multiple messages from
the same source with the same tag, they must be received in the same order in which they were sent. That is to say, the n-th sent
message should be the n-th received message.


\paragraph{Progress} ``If a pair of matching send and receives have been initiated on two processes, then at least one of these
two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is
satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching
receive that was posted at the same destination process.'' \cite{MPI}


When an \verb|EP_Irecv| is issued, we first dequeue the MPI incoming message queue and distribute all incoming messages to the local
queues according to the destination identifier. Next, the non-blocking receive request is appended to the request pending list.
Third, the pending list is checked and requests with matching source, tag, and communicator are completed.
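
This flow can be sketched as follows, reusing \verb|ep_dequeue| from the previous sketch; the pending-list helpers are assumptions
and handle conversions are omitted:

\begin{verbatim}
/* Sketch of EP_Irecv following the three steps above. */
int EP_Irecv(void *buf, int count, int datatype, int src, int tag,
             MPI_Comm comm, MPI_Request *request)
{
    /* step 1: drain the MPI queue into the local queues */
    ep_dequeue(comm.mpi_comm, comm.message_queue);

    /* step 2: append the request to the pending list */
    request->type        = 2;   /* pending non-blocking receive */
    request->ep_src      = src;
    request->ep_tag      = tag;
    request->ep_datatype = datatype;
    request->comm        = comm;
    pending_list_append(request, buf, count);

    /* step 3: match queued messages against pending requests;
       matched requests call MPI_Imrecv and switch to type 3 */
    pending_list_check(comm);
    return 0;
}
\end{verbatim}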


Because of the importance of message order, some communication completion functions must be discussed here, such as \verb|MPI_Test| and
\verb|MPI_Wait|. ``The functions \verb|MPI_Wait| and \verb|MPI_Test| are used to complete a nonblocking communication. The completion of a
send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the
content of the send buffer unchanged). It does not indicate that the message has been received, rather, it may have been buffered by
the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a
matching receive was initiated, and that the message will eventually be received by this matching receive. The completion of a
receive operation indicates that the receive buffer contains the received message, the receiver is now free to access it, and that the
status object is set. It does not indicate that the matching send operation has completed (but indicates, of course, that the send
was initiated).'' \cite{MPI}

\paragraph{Example 1}
\verb|MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)|
\begin{itemize}
\item[1.] If \verb|request->type == 1|, the communication to be tested was issued from a non-blocking send. The completion status is
returned by:
\begin{center}
\begin{verbatim}
MPI_Test(& request->mpi_request, flag, & status->mpi_status)
\end{verbatim}
\end{center}
\item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. All incoming messages are probed once again and all pending requests are
checked. If the matching message is found after this second check, an \verb|MPI_Imrecv| is called and the type is set to 3.
Otherwise, the type remains 2 and \verb|flag = false| is returned.
\item[3.] If \verb|request->type == 3|, the request was issued from a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. The
completion result is returned by:
\begin{center}
\begin{verbatim}
MPI_Test(& request->mpi_request, flag, & status->mpi_status)
\end{verbatim}
\end{center}
\end{itemize}
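
Combined, the three cases can be sketched as follows (the probing and matching helpers are the assumed ones from the previous
sketches; handle conversions are omitted):

\begin{verbatim}
/* Sketch of EP_Test dispatching on request->type. */
int EP_Test(MPI_Request *request, int *flag, MPI_Status *status)
{
    if (request->type == 1 || request->type == 3) {
        /* send, or receive whose message is already matched:
           defer to the underlying MPI test */
        return MPI_Test(&request->mpi_request, flag,
                        &status->mpi_status);
    }

    /* type == 2: receive still pending; probe and re-check */
    ep_dequeue(request->comm.mpi_comm, request->comm.message_queue);
    pending_list_check(request->comm);  /* may call MPI_Imrecv and
                                           set request->type = 3 */
    if (request->type == 3)
        return MPI_Test(&request->mpi_request, flag,
                        &status->mpi_status);

    *flag = 0;  /* still pending */
    return 0;
}
\end{verbatim}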

\paragraph{Example 2}
\verb|MPI_Wait(MPI_Request *request, MPI_Status *status)|
\begin{itemize}
\item[1.] If \verb|request->type == 1|, the communication to be tested was issued from a non-blocking send. Jump to step 4.
\item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. We repeat the incoming message probing and the pending request checking until the
matching message is found; then an \verb|MPI_Imrecv| is called and the type is set to 3. Jump to step 4.
\item[3.] If \verb|request->type == 3|, the request was issued from a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. Jump to step 4.
\item[4.] We force the completion by calling:
\begin{center}
\begin{verbatim}
MPI_Wait(& request->mpi_request, & status->mpi_status)
\end{verbatim}
\end{center}
\end{itemize}
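
\verb|EP_Wait| can then be sketched as a loop over the same logic:

\begin{verbatim}
/* Sketch of EP_Wait: spin on the probing/matching steps until the
   message is matched, then block in the underlying MPI_Wait. */
int EP_Wait(MPI_Request *request, MPI_Status *status)
{
    while (request->type == 2) {  /* steps 2-3: probe and re-check */
        ep_dequeue(request->comm.mpi_comm,
                   request->comm.message_queue);
        pending_list_check(request->comm);
    }
    /* step 4: force the completion */
    return MPI_Wait(&request->mpi_request, &status->mpi_status);
}
\end{verbatim}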

\section{Collective communication}

All classic MPI collective communications are performed following the same pattern:

\begin{itemize}
\item[1.] Intra-process communication using OpenMP, \textit{e.g.} collect data from the slave threads to the master thread.
\item[2.] Inter-process communication using MPI collective calls on the master threads.
\item[3.] Intra-process communication using OpenMP, \textit{e.g.} distribute data from the master thread to the slave threads.
\end{itemize}

\paragraph{Example 1}
\verb|EP_Bcast(buffer, count, datatype, root = 4, comm)| with \verb|comm| composed of 4 MPI
processes and 3 threads per process:

We can consider the communicator as $\{\underbrace{(0,1,2)}_\textrm{proc 0} \quad \underbrace{(3,\textcolor{red}{4},5)}_\textrm{proc
1}\quad \underbrace{(6,7,8)}_\textrm{proc 2}\quad \underbrace{(9,10,11)}_\textrm{proc 3}\}$.

This collective communication is performed by the following steps (a sketch follows the figure below):

\begin{itemize}
%\item[1.] EP process with rank 4 send the buffer to EP process rank 3 which is a master thread.
\item[1.] We call \verb|MPI_Bcast(buffer, count, datatype, mpi_root = 1, mpi_comm)| with EP processes of rank 0, 4, 6, and 9.
\item[2.] EP processes of rank 0, 4, 6, and 9 send the buffer to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{bcast.png}
\end{center}
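
A sketch of this pattern, using the topology fields of \verb|MPI_Comm| described earlier (\verb|ep_share_buffer| is an assumed
in-process helper built on \verb|ep_buffer| and \verb|ep_barrier|; handle conversions omitted):

\begin{verbatim}
/* Sketch of EP_Bcast: the root's master thread broadcasts at the
   MPI level, then each master shares the buffer with its slaves. */
int EP_Bcast(void *buffer, int count, MPI_Datatype datatype,
             int root, MPI_Comm comm)
{
    int ep_rank_local = comm.size_rank_info[2][0]; /* in-process rank */
    int mpi_root      = comm.rank_map->at(root).second;

    /* 1. if the root is not a master thread, it first hands the
          buffer to the master thread of its process (omitted); */
    /* 2. the master threads perform the MPI-level broadcast; */
    if (ep_rank_local == 0)
        MPI_Bcast(buffer, count, datatype, mpi_root, comm.mpi_comm);

    /* 3. each master distributes the buffer to its slave threads
          through the shared ep_buffer, guarded by ep_barrier. */
    ep_share_buffer(comm, buffer, count, datatype);
    return 0;
}
\end{verbatim}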


\paragraph{Example 2}
\verb|EP_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)| with \verb|comm| the same
as in example 1.

This collective communication is performed by the following three steps (a sketch follows the figure below):

\begin{itemize}
\item[1.] We perform an intra-process ``allreduce'' operation: master threads collect data from their slaves and perform the reduce
operation.
\item[2.] Master threads call the classic \verb|MPI_Allreduce| routine.
\item[3.] All master threads send the updated reduced data to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{allreduce.png}
\end{center}
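
Sketched with the same conventions (\verb|ep_intra_reduce| and \verb|ep_intra_bcast| are assumed in-process helpers):

\begin{verbatim}
/* Sketch of EP_Allreduce following the three steps above. */
int EP_Allreduce(const void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    /* 1. in-process reduction into the master thread's recvbuf */
    ep_intra_reduce(comm, sendbuf, recvbuf, count, datatype, op);

    /* 2. master threads reduce across processes */
    if (comm.size_rank_info[2][0] == 0)  /* in-process rank 0 */
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, datatype, op,
                      comm.mpi_comm);

    /* 3. master threads share the result with their slaves */
    ep_intra_bcast(comm, recvbuf, count, datatype);
    return 0;
}
\end{verbatim}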



Other collective communications follow a similar execution pattern.

\section{Inter-communicator}

In XIOS, the inter-communicator is a very important component. Thus, our EP library must support inter-communications.

\subsection{The splitting of an intra-communicator}

Before talking about the inter-communicator, we start with the splitting of an intra-communicator.
The C prototype of the splitting routine is
\begin{center}
\begin{verbatim}
int MPI_Comm_split(MPI_Comm comm, int color, int key,
                   MPI_Comm *newcomm)
\end{verbatim}
\end{center}

``This function partitions the group associated with \verb|comm| into disjoint subgroups, one for
each value of \verb|color|. Each subgroup contains all processes of the same color. Within each
subgroup, the processes are ranked in the order defined by the value of the argument
\verb|key|, with ties broken according to their rank in the old group. A new communicator is
created for each subgroup and returned in \verb|newcomm|. A process may supply the color value
\verb|MPI_UNDEFINED|, in which case \verb|newcomm| returns \verb|MPI_COMM_NULL|. This is a collective
call, but each process is permitted to provide different values for color and key.''\cite{MPI}

By definition of the routine, in the EP case, each thread participating in the split operation has exactly one color
(\verb|MPI_UNDEFINED| is also considered a color). However, from the process's point of view, a process can have multiple colors, as shown
in the following figure.

\begin{center}
\includegraphics[scale=0.4]{split.png}
\end{center}

This figure shows the result of the EP communicator splitting. Here we used the EP rank as the key to assign the new rank of each thread in the
resulting split intra-communicator. If the key is anything other than the EP rank, we follow the convention that the key takes effect only
inside a process. This means that the threads are first ordered by the MPI process rank and then by the value of the key.

Because one process can have multiple colors among its threads, the splitting operation is executed by the following steps (a sketch
follows the figure below):

\begin{itemize}
\item[1.] Master threads collect all colors from their slaves and communicate with each other to determine the total number of colors across
the communicator.
\item[2.] For each color, the master thread checks all its slave threads to obtain the number of threads having that color.
\item[3.] If at least one of the slave threads holds the color, then the master thread takes this color. If not, the master thread takes
the color \verb|MPI_UNDEFINED|. All master threads call the classic communicator splitting routine with key $=$ MPI rank.
\item[4.] For master threads holding a defined color, we execute the endpoint creation routine according to the number of slave
threads holding the same color. The resulting EP communicators are then assigned to these slave threads.
\end{itemize}

\begin{center}
\includegraphics[scale=0.4]{split2.png}
\end{center}
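
The master-thread part of these steps can be sketched as follows (the gathering layout is an assumption; the endpoint-assignment
step is only commented):

\begin{verbatim}
/* Sketch of the master-thread part of the EP split.  Each master
   gathers the colors of its local endpoints in colors[]. */
#include <mpi.h>
#include <set>
#include <vector>

void ep_split_master(MPI_Comm parent_mpi, int mpi_rank, int mpi_size,
                     std::vector<int> colors /* one per local thread */)
{
    /* 1. build the global color set: pad to a common length, then
          allgather every process's colors */
    int n = colors.size(), n_max = 0;
    MPI_Allreduce(&n, &n_max, 1, MPI_INT, MPI_MAX, parent_mpi);
    colors.resize(n_max, MPI_UNDEFINED);
    std::vector<int> all(n_max * mpi_size);
    MPI_Allgather(colors.data(), n_max, MPI_INT,
                  all.data(), n_max, MPI_INT, parent_mpi);
    std::set<int> color_set(all.begin(), all.end());
    color_set.erase(MPI_UNDEFINED);

    /* 2-3. one classic split per global color, key = MPI rank */
    for (std::set<int>::iterator c = color_set.begin();
         c != color_set.end(); ++c) {
        int nthreads = 0;  /* local threads holding color *c */
        for (size_t i = 0; i < colors.size(); ++i)
            if (colors[i] == *c) ++nthreads;
        MPI_Comm sub;
        MPI_Comm_split(parent_mpi, nthreads ? *c : MPI_UNDEFINED,
                       mpi_rank, &sub);
        /* 4. if sub != MPI_COMM_NULL, create nthreads endpoints on
           sub and assign them to the slave threads with color *c */
    }
}
\end{verbatim}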

\subsection{The creation of an inter-communicator}

In XIOS, the inter-communicators are created by the routine \verb|MPI_Intercomm_create|, which is used to bind two intra-communicators into
an inter-communicator. The C prototype is
\begin{center}
\begin{verbatim}
int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
                         MPI_Comm peer_comm, int remote_leader,
                         int tag, MPI_Comm *newintercomm)
\end{verbatim}
\end{center}


According to the MPI standard, ``an inter-communication is a point-to-point communication between processes in
different groups''. ``All inter-communicator constructors are blocking except for \verb|MPI_COMM_IDUP| and
require that the local and remote groups be disjoint.''

As in EP the threads are considered as processes, the non-overlapping condition translates to ``non-overlapping'' at the thread
level, which means that one thread cannot belong to both the local group and the remote group. However, the parent processes of the threads can
overlap. As the EP library is built upon an existing MPI implementation, which enforces the non-overlapping condition at the process
level, we can run into an issue in this case.

Before digging into this issue, we shall first look at the case where the non-overlapping condition is perfectly respected.
\begin{center}
\includegraphics[scale=0.3]{intercomm.png}
\end{center}

As shown in the figure, we have two intra-communicators A and B, which are totally disjoint at both the thread and process levels. Each of
the communicators has a local leader. We also assume that both leaders belong to a peer communicator and have ranks 4 and 9, respectively.

To create the inter-communicator, all threads of the left intra-communicator call:
\begin{verbatim}
MPI_Intercomm_create(commA, local_leader = 2, peer_comm,
                     remote_leader = 9, tag, inter_comm)
\end{verbatim}
and all threads of the right intra-communicator call:
\begin{verbatim}
MPI_Intercomm_create(commB, local_leader = 3, peer_comm,
                     remote_leader = 4, tag, inter_comm)
\end{verbatim}

To perform the inter-communicator creation, we follow three steps (a sketch follows the figure below):
\begin{itemize}
\item[1.] Determine the leaders and ranks at the process level;
\item[2.] Call the classic \verb|MPI_Intercomm_create|;
\item[3.] Create endpoints from the processes and assign them to the threads.
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{intercomm_step.png}
\end{center}
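
At the master-thread level, these steps can be sketched as follows (\verb|ep_assign_endpoints| is an assumed helper, and
\verb|remote_leader_mpi| is assumed to be already translated to a process-level rank in the peer communicator):

\begin{verbatim}
/* Sketch: EP inter-communicator creation, master-thread part. */
void ep_intercomm_create_master(MPI_Comm local_mpi, int local_leader_ep,
                                MPI_Comm peer_mpi, int remote_leader_mpi,
                                int tag, RANK_MAP *rank_map)
{
    /* 1. translate the EP leader rank to a process-level rank */
    int local_leader_mpi = rank_map->at(local_leader_ep).second;

    /* 2. classic creation at the process level */
    MPI_Comm inter_mpi;
    MPI_Intercomm_create(local_mpi, local_leader_mpi, peer_mpi,
                         remote_leader_mpi, tag, &inter_mpi);

    /* 3. create one endpoint per local thread on inter_mpi and
          assign the handles to the threads (assumed helper) */
    ep_assign_endpoints(inter_mpi);
}
\end{verbatim}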

If processes overlap in the creation of the inter-communicator, we add a \textit{priority check} to assign each shared process to only
one intra-communicator.
Several possibilities arise:
\begin{itemize}
\item[1.] A process is shared and contains no local leader $\implies$ the process belongs to the group with the higher rank in the peer communicator;
\item[2.] A process is shared and contains one local leader $\implies$ the process belongs to the group with the leader;
\item[3.] A process is shared and contains both local leaders: a leader change is performed, the peer communicator is
\verb|MPI_COMM_WORLD|, and we denote by ``group A'' the group with the smaller peer rank and by ``group B'' the group with the higher peer rank.
\begin{itemize}
\item[3a.] If group A has at least two processes, the leader of group A is changed to the master thread of the process with the smallest
rank, excluding the overlapped process. The overlapped process belongs to group B.
\item[3b.] If group A has only one process, and group B has at least two processes, then the leader of group B is changed to the
master thread of the process with the smallest rank, excluding the overlapped process. The overlapped process belongs to group A.
\item[3c.] If both group A and group B have only one process, then a one-process intra-communicator is created, though it will be
considered (labeled) as an inter-communicator.
\end{itemize}
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{intercomm2.png}
\end{center}

\subsection{The merge of inter-communicators}
\verb|MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)| creates an intra-communicator by merging the
local and remote groups of an inter-communicator. All processes should provide the same \verb|high| value within each of the two groups. If
processes in one group provided the value \verb|high=false| and processes in the other group provided the value \verb|high=true| then the
union orders the ``low'' group before the ``high'' group. If all processes provided the same high argument then the order of the union
is arbitrary. This call is blocking and collective within the union of the two groups. \cite{MPI}

This routine can be considered as the inverse of \verb|MPI_Intercomm_create|. In the inter-communicator creation function, all 5 cases are
eventually transformed into the case where no MPI process is shared by the two groups. It is from this case that the merge function takes place.

\begin{itemize}
\item[1.] The classic \verb|MPI_Intercomm_merge| is called and an MPI intra-communicator is created from the two disjoint groups; MPI
processes are ordered by the \verb|high| value of their local leader.
\item[2.] Endpoints are created based on the MPI intra-communicator and the new EP ranks are ordered first according to the \verb|high| value of
each thread and then according to the original EP ranks in the inter-communicators.
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{merge.png}
\end{center}
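
Sketched at the master-thread level (\verb|ep_assign_endpoints| is the assumed helper used before):

\begin{verbatim}
/* Sketch: EP inter-communicator merge, master-thread part. */
void ep_intercomm_merge_master(MPI_Comm inter_mpi, int high)
{
    /* 1. classic merge of the two disjoint process groups */
    MPI_Comm intra_mpi;
    MPI_Intercomm_merge(inter_mpi, high, &intra_mpi);

    /* 2. create one endpoint per local thread on intra_mpi; EP
          ranks are ordered first by each thread's high value, then
          by the original EP ranks (assumed helper) */
    ep_assign_endpoints(intra_mpi);
}
\end{verbatim}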

\section{P2P communication on inter-communicators}

In the case of inter-communicators, the \verb|MPI_Comm| class has 3 additional members to determine the topology, along with the original
\verb|rank_map|:
\begin{itemize}
\item \verb|RANK_MAP local_rank_map[size of commA]|: composed of the EP ranks in commA' or commB';
\item \verb|RANK_MAP remote_rank_map[size of commB]|: = \verb|local_rank_map| of the remote group;
\item \verb|RANK_MAP intercomm_rank_map[size of commB']|: = \verb|rank_map| of the remote group;
\item \verb|RANK_MAP rank_map|: rank map of commA' or commB'.
\end{itemize}

For example, in the following configuration:
\begin{center}
\includegraphics[scale = 0.3]{ranks.png}
\end{center}

For all endpoints in commA,
\begin{verbatim}
local_rank_map={(rank in commA' or commB',
                 rank of leader in MPI_COMM_WORLD)}
              ={(1,0), (0,1), (2,1), (4,1)}

remote_rank_map={(remote endpoints' rank in commA' or commB',
                  rank of remote leader in MPI_COMM_WORLD)}
               ={(0,0), (1,1), (3,1), (5,1)}
\end{verbatim}

For all endpoints in commA',
\begin{verbatim}
intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                     remote endpoints' MPI rank in commA' or commB')}
                  ={(0,0), (1,0)}
rank_map={(local rank in commA', mpi rank in commA')}
        ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
\end{verbatim}


For all endpoints in commB,
\begin{verbatim}
local_rank_map={(rank in commA' or commB',
                 rank of leader in MPI_COMM_WORLD)}
              ={(0,0), (1,1), (3,1), (5,1)}

remote_rank_map={(remote endpoints' rank in commA' or commB',
                  rank of remote leader in MPI_COMM_WORLD)}
               ={(1,0), (0,1), (2,1), (4,1)}
\end{verbatim}

For all endpoints in commB',
\begin{verbatim}
intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                     remote endpoints' MPI rank in commA' or commB')}
                  ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
rank_map={(local rank in commB', mpi rank in commB')}
        ={(0,0), (1,0)}
\end{verbatim}

When calling a P2P communication on an inter-communicator, we should:
\begin{itemize}
\item[1.] Determine whether the source and the destination endpoints are in the same group by checking the ``labels'':
\begin{itemize}
\item[$\bullet$] \verb|src_label = local_rank_map->at(src).second|
\item[$\bullet$] \verb|dest_label = remote_rank_map->at(dest).second|
\end{itemize}


\item[2.] If \verb|src_label == dest_label|, then the communication is in fact an intra-communication. The new source rank and
destination rank, as well as the local ranks, are deduced by:
\begin{verbatim}
src_rank = local_rank_map->at(src).first
dest_rank = remote_rank_map->at(dest).first
src_rank_local = rank_map->at(src_rank).first
dest_rank_local = rank_map->at(dest_rank).first
\end{verbatim}
\item[3.] If \verb|src_label != dest_label|, then an inter-communication is required. The new ranks are obtained by:
\begin{verbatim}
src_rank = local_rank_map->at(src).first
dest_rank = remote_rank_map->at(dest).first
src_rank_local = intercomm_rank_map->at(src_rank).first
dest_rank_local = rank_map->at(dest_rank).first
\end{verbatim}
\item[4.] Call the MPI P2P function to start the communication:
\begin{itemize}
\item[$\bullet$] if intra-communication, \verb|mpi_comm = commA'_mpi or commB'_mpi|;
\item[$\bullet$] if inter-communication, \verb|mpi_comm = inter_comm_mpi|.
\end{itemize}
\end{itemize}

\begin{center}
\includegraphics[scale = 0.3]{sendrecv2.png}
\end{center}
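
These steps can be sketched as follows (C++-style, using the maps described above; the actual send is only commented):

\begin{verbatim}
/* Sketch: rank translation for a P2P call on an inter-communicator. */
void ep_inter_p2p(MPI_Comm comm, int src, int dest)
{
    int src_label  = comm.local_rank_map->at(src).second;
    int dest_label = comm.remote_rank_map->at(dest).second;

    int src_rank  = comm.local_rank_map->at(src).first;
    int dest_rank = comm.remote_rank_map->at(dest).first;

    if (src_label == dest_label) {
        /* intra-communication inside commA' or commB' */
        int src_rank_local  = comm.rank_map->at(src_rank).first;
        int dest_rank_local = comm.rank_map->at(dest_rank).first;
        /* ... MPI P2P call on commA'_mpi or commB'_mpi ... */
    } else {
        /* genuine inter-communication */
        int src_rank_local  = comm.intercomm_rank_map->at(src_rank).first;
        int dest_rank_local = comm.rank_map->at(dest_rank).first;
        /* ... MPI P2P call on inter_comm_mpi ... */
    }
}
\end{verbatim}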

\section{One-sided communications}

One-sided communication is a type of communication in which only one process specifies all communication parameters, both for the
sending side and the receiving side \cite[Chapter~11]{MPI}. Extending this type of communication to the endpoint context, we encounter
some limitations. In the current work, one-sided communication can only be used in client-server mode, which means that
RMA (remote memory access) can occur only between a server and a client.

The construction of RMA windows is illustrated by the following figure:
\begin{center}
\includegraphics[scale=0.5]{RMA_schema.pdf}
\end{center}

\begin{itemize}
\item we determine the maximum number of threads N in the endpoint environment (N=3 in the example);
\item on the server side, N windows are declared and associated with the same memory address;
\item we start a loop over i = 0, ..., N-1 (see the sketch after this list):
\begin{itemize}
\item each endpoint with thread number i declares an RMA window;
\item the link between the windows on the client side and the i-th window on the server side is created via \verb|MPI_Win_create|;
\item if the number of threads on a certain process is less than N, then a \verb|NULL| pointer is used as the memory address.
\end{itemize}
\end{itemize}
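
One process-level view of this loop is sketched below (in the EP library the per-thread declarations are funneled through the
process; \verb|comm| is assumed to contain the server and the clients, and window sizes are simplified):

\begin{verbatim}
/* Sketch: N collective MPI_Win_create calls.  On the server side,
   bufs[0..N-1] all hold the same address; on a client, bufs[i] is
   thread i's buffer, or NULL if the process has no thread i. */
void ep_win_create(int N, void *bufs[], MPI_Aint size,
                   MPI_Comm comm, MPI_Win *wins /* N entries */)
{
    for (int i = 0; i < N; ++i) {
        void    *base = bufs[i];
        MPI_Aint len  = base ? size : 0;
        MPI_Win_create(base, len, 1 /* disp_unit */, MPI_INFO_NULL,
                       comm, &wins[i]);
    }
}
\end{verbatim}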

With the RMA windows created, we can then perform communications: \verb|MPI_Put|, \verb|MPI_Get|, \verb|MPI_Accumulate|,
\verb|MPI_Get_accumulate|, \verb|MPI_Fetch_and_op|, \verb|MPI_Compare_and_swap|, \textit{etc}.

The main idea of any of these communications is to identify the threads involved in the connection. For example, suppose we want
to perform a put operation from EP 2 to the server. We know that EP 2 is thread 0 of process 1, so the 0-th window (win A) on the
server side should be used. Once the sender and the receiver are identified, the \verb|MPI_Put| communication can be established.
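
For the example above, the client-side call could be sketched as follows (assuming passive-target synchronization and the window
array from the previous sketch):

\begin{verbatim}
/* Sketch: EP 2 (thread 0 of process 1) puts count doubles into
   the server's buffer through the 0-th window. */
void ep_put_example(double *src, int count, int server_mpi_rank,
                    MPI_Win *wins)
{
    MPI_Win win = wins[0];  /* EP 2 is thread 0 of its process */
    MPI_Win_lock(MPI_LOCK_SHARED, server_mpi_rank, 0, win);
    MPI_Put(src, count, MPI_DOUBLE, server_mpi_rank,
            0 /* target displacement */, count, MPI_DOUBLE, win);
    MPI_Win_unlock(server_mpi_rank, win);
}
\end{verbatim}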

Other RMA functions, such as \verb|MPI_Win_allocate|, \verb|MPI_Win_fence|, and \verb|MPI_Win_free|, remain nearly the same, and we will
skip the details in this document.



\bibliographystyle{plain}
\bibliography{reference}

\end{document}