\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{amsmath}

% Title Page
\title{Note for MPI Endpoints}
\author{}


\begin{document}
\maketitle

%\begin{abstract}
%\end{abstract}

\section{Purpose}

Use threads as if they were MPI processes. Each thread is assigned a rank and associated with an endpoints communicator
(EP\_Comm). Convention: one OpenMP thread corresponds to one endpoint.

\section{MPI Endpoints Semantics}

\begin{center}
\includegraphics[scale=0.4]{scheme.png}
\end{center}

Endpoints are created from one MPI communicator and the number of available threads:
\begin{center}
\begin{verbatim}
int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int num_ep,
                              MPI_Info info, MPI_Comm out_comm_hdls[])
\end{verbatim}
\end{center}

``In this collective call, a single output communicator is created, and an array of \verb|num_ep| handles to this new communicator
are returned, where the $i^{th}$ handle corresponds to the $i^{th}$ rank requested by the caller of \verb|MPI_Comm_create_endpoints|. Ranks
in the output communicator are ordered sequentially and in the same order as the parent communicator. After it has been created, the
output communicator behaves as a normal communicator, and MPI calls on each endpoint (i.e., communicator handle) behave as though
they originated from a separate MPI process. In particular, collective calls must be made once per endpoint.''\cite{Dinan:2013}

``Once created, endpoints behave as MPI processes. For example, all ranks in an endpoints communicator must participate in
collective operations. A consequence of this semantic is that endpoints also have MPI process progress requirements; that operations on
that endpoint are required to make progress only when an MPI operation (e.g. \verb|MPI_Test|) is performed on that endpoint. This semantic
enables an MPI implementation to logically separate endpoints, treat them independently within the progress engine, and eliminate
synchronization in updating their state.''\cite{Sridharan:2014}
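
As an illustration, here is a minimal usage sketch, assuming the proposed \verb|MPI_Comm_create_endpoints| routine is available and
OpenMP provides the threads (error handling omitted):

\begin{verbatim}
/* Minimal sketch: one endpoint handle per OpenMP thread.
   Assumption: MPI_Comm_create_endpoints exists as proposed. */
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int num_ep = omp_get_max_threads();
    std::vector<MPI_Comm> ep_comms(num_ep);
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, num_ep,
                              MPI_INFO_NULL, ep_comms.data());

    #pragma omp parallel
    {
        MPI_Comm my_comm = ep_comms[omp_get_thread_num()];
        int rank, size;
        MPI_Comm_rank(my_comm, &rank);  /* EP rank */
        MPI_Comm_size(my_comm, &size);  /* total number of endpoints */
        MPI_Barrier(my_comm);           /* collectives: once per endpoint */
    }

    MPI_Finalize();
    return 0;
}
\end{verbatim}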

\section{EP types}
\subsection{MPI\_Comm}
\verb|MPI_Comm| is composed of:
\begin{itemize}
\item[$\bullet$] \verb|bool is_ep|: true $\implies$ EP, false $\implies$ classic MPI;
\item[$\bullet$] \verb|int mpi_comm|: handle to the parent MPI communicator;
\item[$\bullet$] \verb|OMPbarrier *ep_barrier|: OpenMP barrier, used for in-process synchronization; it is distinct from the
\verb|omp barrier| directive;
\item[$\bullet$] \verb|int[2] size_rank_info[3]|: topology information of the current endpoint:
\begin{itemize}
\item rank of the parent MPI process;
\item size of the parent MPI communicator;
\item rank of the endpoint, returned by \verb|MPI_Comm_rank|;
\item size of the EP communicator, returned by \verb|MPI_Comm_size|;
\item in-process rank of the endpoint;
\item in-process size of the EP communicator, i.e. the number of endpoints in one MPI process.
\end{itemize}
\item[$\bullet$] \verb|MPI_Comm *comm_list|: pointer to the first endpoint communicator of one process;
\item[$\bullet$] \verb|Message_list *message_queue|: location of incoming messages for each endpoint;
\item[$\bullet$] \verb|RANK_MAP *rank_map|: a map from an integer to a pair of integers. The integer key is the rank of an
endpoint. The mapped pair gives the in-process rank of the endpoint and the rank of its parent MPI process:
\begin{center}
\begin{verbatim}
rank_map->at(ep_rank)=(ep_rank_local, mpi_rank)
\end{verbatim}
\end{center}
\item[$\bullet$] \verb|BUFFER *ep_buffer|: buffer (of type \verb|int|, \verb|float|, \verb|double|, \verb|char|, \verb|long|,
or \verb|unsigned long|) used for in-process communication.
\end{itemize}
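
For illustration, \verb|RANK_MAP| can be thought of as the following C++ type (a sketch; the actual definition in the library may
differ):

\begin{verbatim}
// Sketch (assumed layout): RANK_MAP maps a global EP rank to the
// pair (in-process rank of the endpoint, rank of its parent MPI
// process).
#include <map>
#include <utility>

typedef std::map<int, std::pair<int, int> > RANK_MAP;

// Example: 2 MPI processes with 2 endpoints each.
// EP rank:       0      1      2      3
// (local, mpi): (0,0)  (1,0)  (0,1)  (1,1)
void fill_example(RANK_MAP &rank_map)
{
    rank_map[0] = std::make_pair(0, 0);
    rank_map[1] = std::make_pair(1, 0);
    rank_map[2] = std::make_pair(0, 1);
    rank_map[3] = std::make_pair(1, 1);
}
\end{verbatim}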

\subsection{MPI\_Request}
\verb|MPI_Request| is composed of:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_request|: handle to the MPI request;
\item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
\item[$\bullet$] \verb|MPI_Comm comm|: handle to the EP communicator;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication;
\item[$\bullet$] \verb|int type|: type of the communication:
\begin{itemize}
\item 1 $\implies$ non-blocking send;
\item 2 $\implies$ pending non-blocking receive;
\item 3 $\implies$ non-blocking matching receive.
\end{itemize}

\end{itemize}

\subsection{MPI\_Status}
\verb|MPI_Status| consists of:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_status|: handle to the MPI status;
\item[$\bullet$] \verb|int ep_datatype|: data type of the communication;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}

\subsection{MPI\_Message}
\verb|MPI_Message| includes:
\begin{itemize}
\item[$\bullet$] \verb|int mpi_message|: handle to the MPI message;
\item[$\bullet$] \verb|int ep_src|: rank of the source endpoint;
\item[$\bullet$] \verb|int ep_tag|: tag of the communication.
\end{itemize}
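
Rendered as C++, these wrappers could look as follows (a sketch of the members listed above; the actual definitions in the library
may differ):

\begin{verbatim}
// Sketch of the EP wrapper types (assumed layout).
// Placeholder for the EP communicator type described above (only
// the handle member is shown here).
struct MPI_Comm { int mpi_comm; /* ... other members ... */ };

struct MPI_Request {
    int      mpi_request;  // handle to the underlying MPI request
    int      ep_datatype;  // data type of the communication
    MPI_Comm comm;         // the EP communicator
    int      ep_src;       // rank of the source endpoint
    int      ep_tag;       // tag of the communication
    int      type;         // 1: Isend, 2: pending Irecv,
                           // 3: matching Imrecv
};

struct MPI_Status {
    int mpi_status;        // handle to the MPI status
    int ep_datatype;       // data type of the communication
    int ep_src;            // rank of the source endpoint
    int ep_tag;            // tag of the communication
};

struct MPI_Message {
    int mpi_message;       // handle to the MPI message
    int ep_src;            // rank of the source endpoint
    int ep_tag;            // tag of the communication
};
\end{verbatim}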

Other types, such as \verb|MPI_Info|, \verb|MPI_Aint|, and \verb|MPI_Fint|, are defined in the same way.

\section{P2P communication}

All EP point-to-point communications use the tag to distinguish the source and destination endpoints. To be able to encode this extra
information in the tag, we require that the tag value be representable using 31 bits in the underlying MPI implementation.

\includegraphics[scale=0.4]{tag.png}

EP\_tag is user defined. MPI\_tag is computed internally and used inside MPI calls. Because of this extension of the tag, wildcards such as
\verb|MPI_ANY_SOURCE| and \verb|MPI_ANY_TAG| cannot be used directly. An extra step of tag analysis is needed, which leads to the
message dequeuing mechanism.
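
One plausible encoding is sketched below; the actual bit layout is the one shown in the figure above, so the field widths used here
are assumptions for illustration only:

\begin{verbatim}
/* Sketch of a tag encoding: 8 bits per in-process endpoint rank
   and 15 bits for the user tag, i.e. 31 bits in total
   (assumed widths, for illustration only). */
#define EP_BITS  8
#define TAG_BITS 15

static int encode_tag(int src_local, int dest_local, int user_tag)
{
    return (src_local  << (EP_BITS + TAG_BITS))
         | (dest_local << TAG_BITS)
         | (user_tag & ((1 << TAG_BITS) - 1));
}

static void decode_tag(int mpi_tag, int *src_local,
                       int *dest_local, int *user_tag)
{
    *user_tag   =  mpi_tag & ((1 << TAG_BITS) - 1);
    *dest_local = (mpi_tag >> TAG_BITS) & ((1 << EP_BITS) - 1);
    *src_local  =  mpi_tag >> (EP_BITS + TAG_BITS);
}
\end{verbatim}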

\begin{center}
\includegraphics[scale = 0.5]{sendrecv.png}
\end{center}


In an MPI environment, each MPI process has an incoming message queue. In the EP case, messages for all threads inside one MPI process are stored
in this MPI queue. With the MPI 3 standard, we use the \verb|MPI_Improbe| routine to inquire the message queue and relocate each incoming
message into the local message queue of the corresponding thread/endpoint.
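
A minimal sketch of this dequeuing step follows; \verb|Message_list|, \verb|message_list_push|, and \verb|decode_tag| (from the
previous sketch) are assumptions, not the library's literal code, and here \verb|MPI_Comm| denotes the underlying MPI type:

\begin{verbatim}
/* Sketch: drain the MPI queue and dispatch matched messages to the
   per-endpoint local queues. */
void ep_dequeue(MPI_Comm mpi_comm, Message_list *local_queues)
{
    int flag = 1;
    while (flag) {
        MPI_Message msg;
        MPI_Status  status;
        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, mpi_comm,
                    &flag, &msg, &status);
        if (flag) {
            int src_local, dest_local, user_tag;
            decode_tag(status.MPI_TAG, &src_local, &dest_local,
                       &user_tag);
            /* queue the message handle at its destination endpoint */
            message_list_push(&local_queues[dest_local], msg,
                              status.MPI_SOURCE, src_local, user_tag);
        }
    }
}
\end{verbatim}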


\includegraphics[scale=0.3]{dequeue.png}

% Any EP calls will trigger the message dequeuing and the probing, (matched-)receiving operations are performed upon the local message queue.
%
% Example: \verb|EP_Recv(src=2, tag=10, comm1)|:
% \begin{itemize}
% \item[1.] Dequeue MPI message queue;
% \item[2.] call \verb|EP_Improb(src=2, tag=10, comm1, message)|;
% \item[3.] if find corresponding triple (src, tag, comm1), call \verb|EP_Mrecv(src=2, tag=10, comm1, message)|;
% \item[4.] else, repeat from step 2.
% \end{itemize}


\paragraph{Messages are \textit{non-overtaking}} The order of incoming messages is important! If one thread receives multiple messages from
the same source with the same tag, they must be received in the same order in which they were sent. That is to say, the n-th sent
message should be the n-th received message.


\paragraph{Progress} ``If a pair of matching send and receives have been initiated on two processes, then at least one of these
two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is
satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching
receive that was posted at the same destination process.'' \cite{MPI}


When an \verb|EP_Irecv| is issued, we first dequeue the MPI incoming message queue and distribute all incoming messages to the local
queues according to the destination identifier. Next, the non-blocking receive request is appended to the request pending list.
Third, the pending list is checked and requests with matching source, tag, and communicator are completed.
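
This flow can be sketched as follows, reusing \verb|ep_dequeue| from the previous sketch; the pending-list helpers are assumptions
and handle conversions are omitted:

\begin{verbatim}
/* Sketch of EP_Irecv following the three steps above. */
int EP_Irecv(void *buf, int count, int datatype, int src, int tag,
             MPI_Comm comm, MPI_Request *request)
{
    /* step 1: drain the MPI queue into the local queues */
    ep_dequeue(comm.mpi_comm, comm.message_queue);

    /* step 2: append the request to the pending list */
    request->type        = 2;   /* pending non-blocking receive */
    request->ep_src      = src;
    request->ep_tag      = tag;
    request->ep_datatype = datatype;
    request->comm        = comm;
    pending_list_append(request, buf, count);

    /* step 3: match queued messages against pending requests;
       matched requests call MPI_Imrecv and switch to type 3 */
    pending_list_check(comm);
    return 0;
}
\end{verbatim}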


Because of the importance of message order, some communication completion functions must be discussed here, such as \verb|MPI_Test| and
\verb|MPI_Wait|. ``The functions \verb|MPI_Wait| and \verb|MPI_Test| are used to complete a nonblocking communication. The completion of a
send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the
content of the send buffer unchanged). It does not indicate that the message has been received, rather, it may have been buffered by
the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a
matching receive was initiated, and that the message will eventually be received by this matching receive. The completion of a
receive operation indicates that the receive buffer contains the received message, the receiver is now free to access it, and that the
status object is set. It does not indicate that the matching send operation has completed (but indicates, of course, that the send
was initiated).'' \cite{MPI}

\paragraph{Example 1}
\verb|MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)|
\begin{itemize}
\item[1.] If \verb|request->type == 1|, the communication to be tested was issued from a non-blocking send. The completion status is
returned by:
\begin{center}
\begin{verbatim}
MPI_Test(& request->mpi_request, flag, & status->mpi_status)
\end{verbatim}
\end{center}
\item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. All incoming messages are probed once again and all pending requests are
checked. If the matching message is found after this second check, an \verb|MPI_Imrecv| is called and the type is set to 3.
Otherwise, the type remains 2 and \verb|flag = false| is returned.
\item[3.] If \verb|request->type == 3|, the request was issued from a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. The
completion result is returned by:
\begin{center}
\begin{verbatim}
MPI_Test(& request->mpi_request, flag, & status->mpi_status)
\end{verbatim}
\end{center}
\end{itemize}
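
Combined, the three cases can be sketched as follows (the probing and matching helpers are the assumed ones from the previous
sketches; handle conversions are omitted):

\begin{verbatim}
/* Sketch of EP_Test dispatching on request->type. */
int EP_Test(MPI_Request *request, int *flag, MPI_Status *status)
{
    if (request->type == 1 || request->type == 3) {
        /* send, or receive whose message is already matched:
           defer to the underlying MPI test */
        return MPI_Test(&request->mpi_request, flag,
                        &status->mpi_status);
    }

    /* type == 2: receive still pending; probe and re-check */
    ep_dequeue(request->comm.mpi_comm, request->comm.message_queue);
    pending_list_check(request->comm);  /* may call MPI_Imrecv and
                                           set request->type = 3 */
    if (request->type == 3)
        return MPI_Test(&request->mpi_request, flag,
                        &status->mpi_status);

    *flag = 0;  /* still pending */
    return 0;
}
\end{verbatim}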

\paragraph{Example 2}
\verb|MPI_Wait(MPI_Request *request, MPI_Status *status)|
\begin{itemize}
\item[1.] If \verb|request->type == 1|, the communication to be tested was issued from a non-blocking send. Jump to step 4.
\item[2.] If \verb|request->type == 2|, a non-blocking receive was called but the corresponding message has not yet been probed.
The request is in the pending list and thus not yet completed. We repeat the incoming message probing and the pending request checking until the
matching message is found; then an \verb|MPI_Imrecv| is called and the type is set to 3. Jump to step 4.
\item[3.] If \verb|request->type == 3|, the request was issued from a non-blocking receive call and the
matching message has been probed, so the status of the communication lies in the status of the \verb|MPI_Imrecv| function. Jump to step 4.
\item[4.] We force the completion by calling:
\begin{center}
\begin{verbatim}
MPI_Wait(& request->mpi_request, & status->mpi_status)
\end{verbatim}
\end{center}
\end{itemize}
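
\verb|EP_Wait| can then be sketched as a loop over the same logic:

\begin{verbatim}
/* Sketch of EP_Wait: spin on the probing/matching steps until the
   message is matched, then block in the underlying MPI_Wait. */
int EP_Wait(MPI_Request *request, MPI_Status *status)
{
    while (request->type == 2) {  /* steps 2-3: probe and re-check */
        ep_dequeue(request->comm.mpi_comm,
                   request->comm.message_queue);
        pending_list_check(request->comm);
    }
    /* step 4: force the completion */
    return MPI_Wait(&request->mpi_request, &status->mpi_status);
}
\end{verbatim}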

\section{Collective communication}

All classic MPI collective communications are performed following the same pattern:

\begin{itemize}
\item[1.] Intra-process communication using OpenMP, \textit{e.g.} collect data from the slave threads to the master thread.
\item[2.] Inter-process communication using MPI collective calls on the master threads.
\item[3.] Intra-process communication using OpenMP, \textit{e.g.} distribute data from the master thread to the slave threads.
\end{itemize}

\paragraph{Example 1}
\verb|EP_Bcast(buffer, count, datatype, root = 4, comm)| with \verb|comm| composed of 4 MPI
processes and 3 threads per process:

We can consider the communicator as $\{\underbrace{(0,1,2)}_\textrm{proc 0} \quad \underbrace{(3,\textcolor{red}{4},5)}_\textrm{proc
1}\quad \underbrace{(6,7,8)}_\textrm{proc 2}\quad \underbrace{(9,10,11)}_\textrm{proc 3}\}$.

This collective communication is performed by the following steps (a sketch follows the figure below):

\begin{itemize}
%\item[1.] EP process with rank 4 send the buffer to EP process rank 3 which is a master thread.
\item[1.] We call \verb|MPI_Bcast(buffer, count, datatype, mpi_root = 1, mpi_comm)| with EP processes of rank 0, 4, 6, and 9.
\item[2.] EP processes of rank 0, 4, 6, and 9 send the buffer to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{bcast.png}
\end{center}
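
A sketch of this pattern, using the topology fields of \verb|MPI_Comm| described earlier (\verb|ep_share_buffer| is an assumed
in-process helper built on \verb|ep_buffer| and \verb|ep_barrier|; handle conversions omitted):

\begin{verbatim}
/* Sketch of EP_Bcast: the root's master thread broadcasts at the
   MPI level, then each master shares the buffer with its slaves. */
int EP_Bcast(void *buffer, int count, MPI_Datatype datatype,
             int root, MPI_Comm comm)
{
    int ep_rank_local = comm.size_rank_info[2][0]; /* in-process rank */
    int mpi_root      = comm.rank_map->at(root).second;

    /* 1. if the root is not a master thread, it first hands the
          buffer to the master thread of its process (omitted); */
    /* 2. the master threads perform the MPI-level broadcast; */
    if (ep_rank_local == 0)
        MPI_Bcast(buffer, count, datatype, mpi_root, comm.mpi_comm);

    /* 3. each master distributes the buffer to its slave threads
          through the shared ep_buffer, guarded by ep_barrier. */
    ep_share_buffer(comm, buffer, count, datatype);
    return 0;
}
\end{verbatim}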


\paragraph{Example 2}
\verb|EP_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)| with \verb|comm| the same
as in example 1.

This collective communication is performed by the following three steps (a sketch follows the figure below):

\begin{itemize}
\item[1.] We perform an intra-process ``allreduce'' operation: master threads collect data from their slaves and perform the reduce
operation.
\item[2.] Master threads call the classic \verb|MPI_Allreduce| routine.
\item[3.] All master threads send the updated reduced data to their slaves.
\end{itemize}

\begin{center}
\includegraphics[scale=0.3]{allreduce.png}
\end{center}
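
Sketched with the same conventions (\verb|ep_intra_reduce| and \verb|ep_intra_bcast| are assumed in-process helpers):

\begin{verbatim}
/* Sketch of EP_Allreduce following the three steps above. */
int EP_Allreduce(const void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    /* 1. in-process reduction into the master thread's recvbuf */
    ep_intra_reduce(comm, sendbuf, recvbuf, count, datatype, op);

    /* 2. master threads reduce across processes */
    if (comm.size_rank_info[2][0] == 0)  /* in-process rank 0 */
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, datatype, op,
                      comm.mpi_comm);

    /* 3. master threads share the result with their slaves */
    ep_intra_bcast(comm, recvbuf, count, datatype);
    return 0;
}
\end{verbatim}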



Other collective communications follow a similar execution pattern.

\section{Inter-communicator}

In XIOS, the inter-communicator is a very important component. Thus, our EP library must support inter-communications.

\subsection{The splitting of an intra-communicator}

Before talking about the inter-communicator, we start with the splitting of an intra-communicator.
The C prototype of the splitting routine is
\begin{center}
\begin{verbatim}
int MPI_Comm_split(MPI_Comm comm, int color, int key,
                   MPI_Comm *newcomm)
\end{verbatim}
\end{center}

``This function partitions the group associated with \verb|comm| into disjoint subgroups, one for
each value of \verb|color|. Each subgroup contains all processes of the same color. Within each
subgroup, the processes are ranked in the order defined by the value of the argument
\verb|key|, with ties broken according to their rank in the old group. A new communicator is
created for each subgroup and returned in \verb|newcomm|. A process may supply the color value
\verb|MPI_UNDEFINED|, in which case \verb|newcomm| returns \verb|MPI_COMM_NULL|. This is a collective
call, but each process is permitted to provide different values for color and key.''\cite{MPI}

By definition of the routine, in the EP case, each thread participating in the split operation has exactly one color
(\verb|MPI_UNDEFINED| is also considered a color). However, from the process's point of view, a process can have multiple colors, as shown
in the following figure.

\begin{center}
\includegraphics[scale=0.4]{split.png}
\end{center}

This figure shows the result of the EP communicator splitting. Here we used the EP rank as the key to assign the new rank of each thread in the
resulting split intra-communicator. If the key is anything other than the EP rank, we follow the convention that the key takes effect only
inside a process. This means that the threads are first ordered by the MPI process rank and then by the value of the key.

Because one process can have multiple colors among its threads, the splitting operation is executed by the following steps (a sketch
follows the figure below):

\begin{itemize}
\item[1.] Master threads collect all colors from their slaves and communicate with each other to determine the total number of colors across
the communicator.
\item[2.] For each color, the master thread checks all its slave threads to obtain the number of threads having that color.
\item[3.] If at least one of the slave threads holds the color, then the master thread takes this color. If not, the master thread takes
the color \verb|MPI_UNDEFINED|. All master threads call the classic communicator splitting routine with key $=$ MPI rank.
\item[4.] For master threads holding a defined color, we execute the endpoint creation routine according to the number of slave
threads holding the same color. The resulting EP communicators are then assigned to these slave threads.
\end{itemize}

\begin{center}
\includegraphics[scale=0.4]{split2.png}
\end{center}
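
The master-thread part of these steps can be sketched as follows (the gathering layout is an assumption; the endpoint-assignment
step is only commented):

\begin{verbatim}
/* Sketch of the master-thread part of the EP split.  Each master
   gathers the colors of its local endpoints in colors[]. */
#include <mpi.h>
#include <set>
#include <vector>

void ep_split_master(MPI_Comm parent_mpi, int mpi_rank, int mpi_size,
                     std::vector<int> colors /* one per local thread */)
{
    /* 1. build the global color set: pad to a common length, then
          allgather every process's colors */
    int n = colors.size(), n_max = 0;
    MPI_Allreduce(&n, &n_max, 1, MPI_INT, MPI_MAX, parent_mpi);
    colors.resize(n_max, MPI_UNDEFINED);
    std::vector<int> all(n_max * mpi_size);
    MPI_Allgather(colors.data(), n_max, MPI_INT,
                  all.data(), n_max, MPI_INT, parent_mpi);
    std::set<int> color_set(all.begin(), all.end());
    color_set.erase(MPI_UNDEFINED);

    /* 2-3. one classic split per global color, key = MPI rank */
    for (std::set<int>::iterator c = color_set.begin();
         c != color_set.end(); ++c) {
        int nthreads = 0;  /* local threads holding color *c */
        for (size_t i = 0; i < colors.size(); ++i)
            if (colors[i] == *c) ++nthreads;
        MPI_Comm sub;
        MPI_Comm_split(parent_mpi, nthreads ? *c : MPI_UNDEFINED,
                       mpi_rank, &sub);
        /* 4. if sub != MPI_COMM_NULL, create nthreads endpoints on
           sub and assign them to the slave threads with color *c */
    }
}
\end{verbatim}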

\subsection{The creation of an inter-communicator}

In XIOS, the inter-communicators are created by the routine \verb|MPI_Intercomm_create|, which is used to bind two intra-communicators into
an inter-communicator. The C prototype is
\begin{center}
\begin{verbatim}
int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
                         MPI_Comm peer_comm, int remote_leader,
                         int tag, MPI_Comm *newintercomm)
\end{verbatim}
\end{center}


According to the MPI standard, ``an inter-communication is a point-to-point communication between processes in
different groups''. ``All inter-communicator constructors are blocking except for \verb|MPI_COMM_IDUP| and
require that the local and remote groups be disjoint.''

As in EP the threads are considered as processes, the non-overlapping condition translates to ``non-overlapping'' at the thread
level, which means that one thread cannot belong to both the local group and the remote group. However, the parent processes of the threads can
overlap. As the EP library is built upon an existing MPI implementation, which enforces the non-overlapping condition at the process
level, we can run into an issue in this case.

Before digging into this issue, we shall first look at the case where the non-overlapping condition is perfectly respected.
\begin{center}
\includegraphics[scale=0.3]{intercomm.png}
\end{center}

As shown in the figure, we have two intra-communicators A and B, which are totally disjoint at both the thread and process levels. Each of
the communicators has a local leader. We also assume that both leaders belong to a peer communicator and have ranks 4 and 9, respectively.

To create the inter-communicator, all threads of the left intra-communicator call:
\begin{verbatim}
MPI_Intercomm_create(commA, local_leader = 2, peer_comm,
                     remote_leader = 9, tag, inter_comm)
\end{verbatim}
and all threads of the right intra-communicator call:
\begin{verbatim}
MPI_Intercomm_create(commB, local_leader = 3, peer_comm,
                     remote_leader = 4, tag, inter_comm)
\end{verbatim}

To perform the inter-communicator creation, we follow three steps (a sketch follows the figure below):
\begin{itemize}
\item[1.] Determine the leaders and ranks at the process level;
\item[2.] Call the classic \verb|MPI_Intercomm_create|;
\item[3.] Create endpoints from the processes and assign them to the threads.
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{intercomm_step.png}
\end{center}
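
At the master-thread level, these steps can be sketched as follows (\verb|ep_assign_endpoints| is an assumed helper, and
\verb|remote_leader_mpi| is assumed to be already translated to a process-level rank in the peer communicator):

\begin{verbatim}
/* Sketch: EP inter-communicator creation, master-thread part. */
void ep_intercomm_create_master(MPI_Comm local_mpi, int local_leader_ep,
                                MPI_Comm peer_mpi, int remote_leader_mpi,
                                int tag, RANK_MAP *rank_map)
{
    /* 1. translate the EP leader rank to a process-level rank */
    int local_leader_mpi = rank_map->at(local_leader_ep).second;

    /* 2. classic creation at the process level */
    MPI_Comm inter_mpi;
    MPI_Intercomm_create(local_mpi, local_leader_mpi, peer_mpi,
                         remote_leader_mpi, tag, &inter_mpi);

    /* 3. create one endpoint per local thread on inter_mpi and
          assign the handles to the threads (assumed helper) */
    ep_assign_endpoints(inter_mpi);
}
\end{verbatim}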

If processes overlap in the creation of the inter-communicator, we add a \textit{priority check} to assign each shared process to only
one intra-communicator.
Several possibilities arise:
\begin{itemize}
\item[1.] A process is shared and contains no local leader $\implies$ the process belongs to the group with the higher rank in the peer communicator;
\item[2.] A process is shared and contains one local leader $\implies$ the process belongs to the group with the leader;
\item[3.] A process is shared and contains both local leaders: a leader change is performed, the peer communicator is
\verb|MPI_COMM_WORLD|, and we denote by ``group A'' the group with the smaller peer rank and by ``group B'' the group with the higher peer rank.
\begin{itemize}
\item[3a.] If group A has at least two processes, the leader of group A is changed to the master thread of the process with the smallest
rank, excluding the overlapped process. The overlapped process belongs to group B.
\item[3b.] If group A has only one process, and group B has at least two processes, then the leader of group B is changed to the
master thread of the process with the smallest rank, excluding the overlapped process. The overlapped process belongs to group A.
\item[3c.] If both group A and group B have only one process, then a one-process intra-communicator is created, though it will be
considered (labeled) as an inter-communicator.
\end{itemize}
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{intercomm2.png}
\end{center}

\subsection{The merge of inter-communicators}
\verb|MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)| creates an intra-communicator by merging the
local and remote groups of an inter-communicator. All processes should provide the same \verb|high| value within each of the two groups. If
processes in one group provided the value \verb|high=false| and processes in the other group provided the value \verb|high=true| then the
union orders the ``low'' group before the ``high'' group. If all processes provided the same high argument then the order of the union
is arbitrary. This call is blocking and collective within the union of the two groups. \cite{MPI}

This routine can be considered as the inverse of \verb|MPI_Intercomm_create|. In the inter-communicator creation function, all 5 cases are
eventually transformed into the case where no MPI process is shared by the two groups. It is from this case that the merge function takes place.

\begin{itemize}
\item[1.] The classic \verb|MPI_Intercomm_merge| is called and an MPI intra-communicator is created from the two disjoint groups; MPI
processes are ordered by the \verb|high| value of their local leader.
\item[2.] Endpoints are created based on the MPI intra-communicator and the new EP ranks are ordered first according to the \verb|high| value of
each thread and then according to the original EP ranks in the inter-communicators.
\end{itemize}

\begin{center}
\includegraphics[scale=0.25]{merge.png}
\end{center}
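
Sketched at the master-thread level (\verb|ep_assign_endpoints| is the assumed helper used before):

\begin{verbatim}
/* Sketch: EP inter-communicator merge, master-thread part. */
void ep_intercomm_merge_master(MPI_Comm inter_mpi, int high)
{
    /* 1. classic merge of the two disjoint process groups */
    MPI_Comm intra_mpi;
    MPI_Intercomm_merge(inter_mpi, high, &intra_mpi);

    /* 2. create one endpoint per local thread on intra_mpi; EP
          ranks are ordered first by each thread's high value, then
          by the original EP ranks (assumed helper) */
    ep_assign_endpoints(intra_mpi);
}
\end{verbatim}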

\section{P2P communication on inter-communicators}

In the case of inter-communicators, the \verb|MPI_Comm| class has 3 additional members to determine the topology, along with the original
\verb|rank_map|:
\begin{itemize}
\item \verb|RANK_MAP local_rank_map[size of commA]|: composed of the EP ranks in commA' or commB';
\item \verb|RANK_MAP remote_rank_map[size of commB]|: = \verb|local_rank_map| of the remote group;
\item \verb|RANK_MAP intercomm_rank_map[size of commB']|: = \verb|rank_map| of the remote group;
\item \verb|RANK_MAP rank_map|: rank map of commA' or commB'.
\end{itemize}

For example, in the following configuration:
\begin{center}
\includegraphics[scale = 0.3]{ranks.png}
\end{center}

For all endpoints in commA,
\begin{verbatim}
local_rank_map={(rank in commA' or commB',
                 rank of leader in MPI_COMM_WORLD)}
              ={(1,0), (0,1), (2,1), (4,1)}

remote_rank_map={(remote endpoints' rank in commA' or commB',
                  rank of remote leader in MPI_COMM_WORLD)}
               ={(0,0), (1,1), (3,1), (5,1)}
\end{verbatim}

For all endpoints in commA',
\begin{verbatim}
intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                     remote endpoints' MPI rank in commA' or commB')}
                  ={(0,0), (1,0)}
rank_map={(local rank in commA', mpi rank in commA')}
        ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
\end{verbatim}


For all endpoints in commB,
\begin{verbatim}
local_rank_map={(rank in commA' or commB',
                 rank of leader in MPI_COMM_WORLD)}
              ={(0,0), (1,1), (3,1), (5,1)}

remote_rank_map={(remote endpoints' rank in commA' or commB',
                  rank of remote leader in MPI_COMM_WORLD)}
               ={(1,0), (0,1), (2,1), (4,1)}
\end{verbatim}

For all endpoints in commB',
\begin{verbatim}
intercomm_rank_map={(remote endpoints' local rank in commA' or commB',
                     remote endpoints' MPI rank in commA' or commB')}
                  ={(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)}
rank_map={(local rank in commB', mpi rank in commB')}
        ={(0,0), (1,0)}
\end{verbatim}

When calling a P2P communication on an inter-communicator, we should:
\begin{itemize}
\item[1.] Determine whether the source and the destination endpoints are in the same group by checking the ``labels'':
\begin{itemize}
\item[$\bullet$] \verb|src_label = local_rank_map->at(src).second|
\item[$\bullet$] \verb|dest_label = remote_rank_map->at(dest).second|
\end{itemize}


\item[2.] If \verb|src_label == dest_label|, then the communication is in fact an intra-communication. The new source rank and
destination rank, as well as the local ranks, are deduced by:
\begin{verbatim}
src_rank = local_rank_map->at(src).first
dest_rank = remote_rank_map->at(dest).first
src_rank_local = rank_map->at(src_rank).first
dest_rank_local = rank_map->at(dest_rank).first
\end{verbatim}
\item[3.] If \verb|src_label != dest_label|, then an inter-communication is required. The new ranks are obtained by:
\begin{verbatim}
src_rank = local_rank_map->at(src).first
dest_rank = remote_rank_map->at(dest).first
src_rank_local = intercomm_rank_map->at(src_rank).first
dest_rank_local = rank_map->at(dest_rank).first
\end{verbatim}
\item[4.] Call the MPI P2P function to start the communication:
\begin{itemize}
\item[$\bullet$] if intra-communication, \verb|mpi_comm = commA'_mpi or commB'_mpi|;
\item[$\bullet$] if inter-communication, \verb|mpi_comm = inter_comm_mpi|.
\end{itemize}
\end{itemize}

\begin{center}
\includegraphics[scale = 0.3]{sendrecv2.png}
\end{center}
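
These steps can be sketched as follows (C++-style, using the maps described above; the actual send is only commented):

\begin{verbatim}
/* Sketch: rank translation for a P2P call on an inter-communicator. */
void ep_inter_p2p(MPI_Comm comm, int src, int dest)
{
    int src_label  = comm.local_rank_map->at(src).second;
    int dest_label = comm.remote_rank_map->at(dest).second;

    int src_rank  = comm.local_rank_map->at(src).first;
    int dest_rank = comm.remote_rank_map->at(dest).first;

    if (src_label == dest_label) {
        /* intra-communication inside commA' or commB' */
        int src_rank_local  = comm.rank_map->at(src_rank).first;
        int dest_rank_local = comm.rank_map->at(dest_rank).first;
        /* ... MPI P2P call on commA'_mpi or commB'_mpi ... */
    } else {
        /* genuine inter-communication */
        int src_rank_local  = comm.intercomm_rank_map->at(src_rank).first;
        int dest_rank_local = comm.rank_map->at(dest_rank).first;
        /* ... MPI P2P call on inter_comm_mpi ... */
    }
}
\end{verbatim}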

\section{One-sided communications}

One-sided communication is a type of communication in which only one process specifies all communication parameters, both for the
sending side and the receiving side \cite[Chapter~11]{MPI}. Extending this type of communication to the endpoint context, we encounter
some limitations. In the current work, one-sided communication can only be used in client-server mode, which means that
RMA (remote memory access) can occur only between a server and a client.

The construction of RMA windows is illustrated by the following figure:
\begin{center}
\includegraphics[scale=0.5]{RMA_schema.pdf}
\end{center}

\begin{itemize}
\item we determine the maximum number of threads N in the endpoint environment (N=3 in the example);
\item on the server side, N windows are declared and associated with the same memory address;
\item we start a loop over i = 0, ..., N-1 (see the sketch after this list):
\begin{itemize}
\item each endpoint with thread number i declares an RMA window;
\item the link between the windows on the client side and the i-th window on the server side is created via \verb|MPI_Win_create|;
\item if the number of threads on a certain process is less than N, then a \verb|NULL| pointer is used as the memory address.
\end{itemize}
\end{itemize}
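
One process-level view of this loop is sketched below (in the EP library the per-thread declarations are funneled through the
process; \verb|comm| is assumed to contain the server and the clients, and window sizes are simplified):

\begin{verbatim}
/* Sketch: N collective MPI_Win_create calls.  On the server side,
   bufs[0..N-1] all hold the same address; on a client, bufs[i] is
   thread i's buffer, or NULL if the process has no thread i. */
void ep_win_create(int N, void *bufs[], MPI_Aint size,
                   MPI_Comm comm, MPI_Win *wins /* N entries */)
{
    for (int i = 0; i < N; ++i) {
        void    *base = bufs[i];
        MPI_Aint len  = base ? size : 0;
        MPI_Win_create(base, len, 1 /* disp_unit */, MPI_INFO_NULL,
                       comm, &wins[i]);
    }
}
\end{verbatim}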

With the RMA windows created, we can then perform communications: \verb|MPI_Put|, \verb|MPI_Get|, \verb|MPI_Accumulate|,
\verb|MPI_Get_accumulate|, \verb|MPI_Fetch_and_op|, \verb|MPI_Compare_and_swap|, \textit{etc}.

The main idea of any of these communications is to identify the threads involved in the connection. For example, suppose we want
to perform a put operation from EP 2 to the server. We know that EP 2 is thread 0 of process 1, so the 0-th window (win A) on the
server side should be used. Once the sender and the receiver are identified, the \verb|MPI_Put| communication can be established.
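
For the example above, the client-side call could be sketched as follows (assuming passive-target synchronization and the window
array from the previous sketch):

\begin{verbatim}
/* Sketch: EP 2 (thread 0 of process 1) puts count doubles into
   the server's buffer through the 0-th window. */
void ep_put_example(double *src, int count, int server_mpi_rank,
                    MPI_Win *wins)
{
    MPI_Win win = wins[0];  /* EP 2 is thread 0 of its process */
    MPI_Win_lock(MPI_LOCK_SHARED, server_mpi_rank, 0, win);
    MPI_Put(src, count, MPI_DOUBLE, server_mpi_rank,
            0 /* target displacement */, count, MPI_DOUBLE, win);
    MPI_Win_unlock(server_mpi_rank, win);
}
\end{verbatim}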

Other RMA functions, such as \verb|MPI_Win_allocate|, \verb|MPI_Win_fence|, and \verb|MPI_Win_free|, remain nearly the same, and we will
skip the details in this document.



\bibliographystyle{plain}
\bibliography{reference}

\end{document}