[/============================================================================
  Boost.odeint

  Copyright 2013 Karsten Ahnert
  Copyright 2013 Pascal Germroth
  Copyright 2013 Mario Mulansky

  Use, modification and distribution is subject to the Boost Software License,
  Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt)
=============================================================================/]


[section Parallel computation with OpenMP and MPI]

Parallelization is a key feature of modern numerical libraries, since
multi-core machines, and even laptops, are now ubiquitous.
odeint currently supports parallelization with OpenMP and MPI, as described in
the following sections.
However, it should be clear from the beginning that the difficulty of
efficiently distributing ODE integration across many cores or machines lies in
parallelizing the system function, which remains the user's responsibility.
Simply using a parallel odeint backend without parallelizing the system function
will yield almost no performance gain.

[section OpenMP]

[import ../examples/openmp/phase_chain.cpp]

odeint's OpenMP support is implemented as an external backend that needs to be
included manually. Depending on the compiler, additional flags may be
required, e.g. [^-fopenmp] for GCC.
[phase_chain_openmp_header]

In the simplest OpenMP parallelization approach we use a standard `vector`
as the state type:
[phase_chain_vector_state]

We initialize the state with some random data:
[phase_chain_init]
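
For orientation, such an initialization might look roughly like the following
sketch, which fills the phases with uniformly distributed random values. The
chain length and the use of the standard `<random>` facilities are purely
illustrative and differ in detail from the imported example.
```
#include <vector>
#include <random>

typedef std::vector< double > state_type;

int main()
{
    const size_t N = 1024;                 // illustrative chain length
    state_type x( N );

    // fill the state with random phases between 0 and 2*pi
    std::mt19937 rng( 1234 );
    std::uniform_real_distribution< double > dist( 0.0 , 2.0 * 3.14159265358979 );
    for( size_t i = 0 ; i < N ; ++i )
        x[i] = dist( rng );
}
```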

Now we have to configure the stepper to use the OpenMP backend.
This is done by explicitly providing the `openmp_range_algebra` as a template
parameter to the stepper.
This algebra requires the state type to be a model of Random Access Range and
accesses it from multiple threads.
[phase_chain_stepper]
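
A minimal sketch of such a stepper definition, assuming the `state_type` from
above and a classical fourth-order Runge-Kutta scheme, could read as follows;
only the explicit `openmp_range_algebra` parameter is the essential part.
```
#include <vector>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using namespace boost::numeric::odeint;

typedef std::vector< double > state_type;

// explicitly select the OpenMP range algebra as the stepper's algebra
runge_kutta4< state_type , double , state_type , double ,
              openmp_range_algebra > stepper;
```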

In addition to providing the stepper with OpenMP parallelization, we also need
a parallelized system function to exploit the available cores.
Here this is shown for a simple one-dimensional chain of phase oscillators with
nearest-neighbour coupling:
[phase_chain_rhs]
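
The imported snippet contains the actual implementation; as a rough sketch, an
OpenMP-parallelized right-hand side for such a chain might look like the
following, where the functor name and the `coupling` member are illustrative.
```
#include <cmath>
#include <vector>

typedef std::vector< double > state_type;

struct phase_chain
{
    double coupling;

    void operator()( const state_type &x , state_type &dxdt , double /*t*/ ) const
    {
        const size_t N = x.size();
        // distribute the bulk of the chain across the available threads
        #pragma omp parallel for schedule(runtime)
        for( size_t i = 1 ; i < N - 1 ; ++i )
            dxdt[i] = coupling * ( std::sin( x[i+1] - x[i] )
                                 + std::sin( x[i-1] - x[i] ) );
        // one-sided coupling at the two ends of the chain
        dxdt[0]   = coupling * std::sin( x[1]   - x[0] );
        dxdt[N-1] = coupling * std::sin( x[N-2] - x[N-1] );
    }
};
```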

[note In the OpenMP backends the system function will always be called
sequentially from the thread used to start the integration.]

Finally, we perform the integration using one of odeint's integrate functions.
As you can see, the parallelization is completely hidden in the stepper and the
system function.
OpenMP takes care of distributing the work among the threads and joins them
automatically.
[phase_chain_integrate]
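
Put together, the integration call itself can be as simple as the following
sketch, assuming the `stepper`, the `phase_chain` functor, and the state `x`
from the sketches above are in scope; the coupling constant, step size, and
number of steps are made-up values.
```
phase_chain chain = { 1.0 };               // illustrative coupling constant

// advance the chain over 1000 steps of size dt = 0.01, starting at t = 0
integrate_n_steps( stepper , chain , x , 0.0 , 0.01 , 1000 );
```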

After integrating, the data can be accessed immediately and processed further.
Note that you can select the OpenMP scheduling by calling `omp_set_schedule`
at the beginning of your program:
[phase_chain_scheduling]
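
For example, requesting a static schedule could look like this; the chunk size
0 leaves the choice to the runtime, and `omp_set_schedule` is part of the
OpenMP runtime API declared in `<omp.h>`.
```
#include <omp.h>

int main()
{
    // use a static schedule for all loops marked schedule(runtime);
    // a chunk size of 0 lets the runtime pick a default
    omp_set_schedule( omp_sched_static , 0 );

    // ... set up state, stepper and system function, then integrate ...
}
```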

See [github_link examples/openmp/phase_chain.cpp
openmp/phase_chain.cpp] for the complete example.

[heading Split state]

[import ../examples/openmp/phase_chain_omp_state.cpp]

For advanced cases odeint offers another approach to using OpenMP that allows
more precise control of the parallelization.
For example, for odd-sized data where OpenMP's thread boundaries don't match
cache lines and hurt performance, it might be advisable to copy the data from
the contiguous `vector<T>` into separate, individually aligned vectors.
For this, odeint provides the `openmp_state<T>` type, essentially an alias for
`vector<vector<T>>`.

Here, the initialization is done with a `vector<double>`, but then we use
odeint's `split` function to fill an `openmp_state`.
The splitting is done such that the sizes of the individual regions differ by
at most 1, making the computation as uniform as possible.
[phase_chain_state_init]
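
A rough sketch of this step, assuming a random initialization as before and one
chunk per OpenMP thread, might look like the following; the chain length is
again an illustrative value.
```
#include <vector>
#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using namespace boost::numeric::odeint;

int main()
{
    std::vector< double > x( 1024 , 0.0 );      // contiguous input data
    // ... fill x with initial phases as before ...

    // one chunk per thread; split copies nearly equal-sized regions into it
    openmp_state< double > x_split( omp_get_max_threads() );
    split( x , x_split );
}
```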

Of course, the system function has to be adapted to the `openmp_state`.
Note that each sub-region of the state is computed in a single task, but at the
borders read access to the neighbouring regions is required.
[phase_chain_state_rhs]
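
The imported snippet shows the actual implementation; the following is only a
simplified sketch of the idea, with an outer OpenMP loop over the chunks and
explicit reads across chunk borders. The functor name and the `coupling`
member are illustrative.
```
#include <cmath>
#include <vector>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using boost::numeric::odeint::openmp_state;

struct phase_chain_split
{
    double coupling;

    void operator()( const openmp_state< double > &x ,
                     openmp_state< double > &dxdt , double /*t*/ ) const
    {
        const size_t chunks = x.size();
        // each chunk is handled by one thread
        #pragma omp parallel for schedule(runtime)
        for( size_t c = 0 ; c < chunks ; ++c )
        {
            const std::vector< double > &xc = x[c];
            std::vector< double > &dc = dxdt[c];
            const size_t n = xc.size();
            for( size_t i = 0 ; i < n ; ++i )
            {
                double sum = 0.0;
                // left neighbour: inside this chunk, in the previous chunk,
                // or absent at the start of the chain
                if( i > 0 )
                    sum += std::sin( xc[i-1] - xc[i] );
                else if( c > 0 )
                    sum += std::sin( x[c-1].back() - xc[i] );
                // right neighbour: inside this chunk, in the next chunk,
                // or absent at the end of the chain
                if( i + 1 < n )
                    sum += std::sin( xc[i+1] - xc[i] );
                else if( c + 1 < chunks )
                    sum += std::sin( x[c+1].front() - xc[i] );
                dc[i] = coupling * sum;
            }
        }
    }
};
```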

Using the `openmp_state<T>` state type automatically selects `openmp_algebra`,
which executes odeint's internal computations in parallel regions.
Hence, no manual configuration of the stepper is necessary.
At the end of the integration, we use `unsplit` to concatenate the sub-regions
back into a single vector.
[phase_chain_state_integrate]
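
Continuing the sketches from above, with the automatic algebra selection the
integration and the final gathering step could look like this; the stepper
type, system functor, and step parameters are illustrative.
```
// no algebra needs to be specified: openmp_algebra is picked automatically
// for openmp_state< double >
runge_kutta4< openmp_state< double > > stepper;
phase_chain_split chain = { 1.0 };

integrate_n_steps( stepper , chain , x_split , 0.0 , 0.01 , 1000 );

// gather the chunks back into the contiguous vector x
unsplit( x_split , x );
```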

[note You don't actually need to use `openmp_state<T>` for advanced use cases:
`openmp_algebra` is simply an alias for `openmp_nested_algebra<range_algebra>`,
which accepts any model of Random Access Range as the outer, parallel state
type and applies the given algebra to its elements.]

See [github_link examples/openmp/phase_chain_omp_state.cpp
openmp/phase_chain_omp_state.cpp] for the complete example.

[endsect]

[section MPI]

[import ../examples/mpi/phase_chain.cpp]

To extend the parallel computation across multiple machines we can use MPI.

The system function implementation is similar to the OpenMP variant with split
data. The main difference is the execution model: OpenMP uses a spawn/join
model in which everything not explicitly parallelized runs only in the main
thread, whereas in MPI every node enters `main()` independently, diverges based
on its rank, and synchronizes through message passing and explicit barriers.

odeint's MPI support is also implemented as an external backend.
Depending on the MPI implementation, the code might need to be compiled with a
wrapper such as [^mpic++].
[phase_chain_mpi_header]

Instead of reading another thread's data, we asynchronously send and receive
the relevant data from neighbouring nodes, performing some computation in the
interim to hide the latency.
[phase_chain_mpi_rhs]
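
The imported snippet shows the real system function; the following fragment
only sketches the communication pattern with __boost_mpi, exchanging the
border phases with the neighbouring ranks while the interior can be computed
in between. All names are illustrative.
```
#include <vector>
#include <boost/mpi.hpp>

// exchange border values with the neighbouring ranks; other work can be
// done between posting the requests and waiting for them
void exchange_borders( boost::mpi::communicator &world ,
                       const std::vector< double > &chunk ,
                       double &left , double &right )
{
    std::vector< boost::mpi::request > reqs;
    if( world.rank() > 0 ) {
        reqs.push_back( world.isend( world.rank() - 1 , 0 , chunk.front() ) );
        reqs.push_back( world.irecv( world.rank() - 1 , 0 , left ) );
    }
    if( world.rank() + 1 < world.size() ) {
        reqs.push_back( world.isend( world.rank() + 1 , 0 , chunk.back() ) );
        reqs.push_back( world.irecv( world.rank() + 1 , 0 , right ) );
    }

    // ... compute the interior of the chunk here to hide the latency ...

    boost::mpi::wait_all( reqs.begin() , reqs.end() );
    // now left/right hold the neighbouring border values, where they exist
}
```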

Analogous to `openmp_state<T>` we use `mpi_state< InnerState<T> >`, which
automatically selects `mpi_nested_algebra` and the appropriate MPI-oblivious
inner algebra (since our inner state is a `vector`, the inner algebra will be
`range_algebra`, as in the OpenMP example).
[phase_chain_state]

In the main program we construct a `communicator`, which tells us the `size` of
the cluster and the current node's `rank` within it.
We generate the input data on the master node only, avoiding unnecessary work on
the other nodes.
Instead of simply copying chunks, `split` acts as an MPI collective function
here and sends/receives regions from the master to each slave.
The input argument is ignored on the slaves, but the master node also receives
a region in its output and participates in the computation.
[phase_chain_mpi_init]
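
A condensed sketch of this setup, assuming __boost_mpi and odeint's MPI backend
header are available, might look like the following; the chain length and the
data generation are placeholders.
```
#include <vector>
#include <boost/mpi.hpp>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>

using namespace boost::numeric::odeint;

int main( int argc , char **argv )
{
    boost::mpi::environment env( argc , argv );
    boost::mpi::communicator world;

    const size_t N = 1024;                     // illustrative chain length
    std::vector< double > x;
    if( world.rank() == 0 ) {
        x.resize( N );
        // ... fill x with initial phases on the master only ...
    }

    // collective call: scatters nearly equal-sized regions to all nodes
    mpi_state< std::vector< double > > x_split( world );
    split( x , x_split );

    // ... integrate and gather the result, see below ...
}
```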

Now that `x_split` contains (only) the local chunk on each node, we start the
integration.

To print the result on the master node, we send the processed data back using
`unsplit`.
[phase_chain_mpi_integrate]
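
Continuing the sketch, the integration and the gathering step might read as
follows; `sys` stands for an MPI-aware system function like the one sketched
above, and the stepper type and step parameters are again illustrative.
```
// mpi_nested_algebra is selected automatically for mpi_state
runge_kutta4< mpi_state< std::vector< double > > > stepper;

integrate_n_steps( stepper , sys , x_split , 0.0 , 0.01 , 1000 );

// collective call: gathers the chunks back into x on the master
unsplit( x_split , x );

if( world.rank() == 0 ) {
    // ... print or post-process x here ...
}
```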

[note `mpi_nested_algebra::for_each`[~N] doesn't use any MPI constructs; it
simply calls the inner algebra on the local chunk. The system function is not
guarded by any barriers either, so if you don't place any manually (for example
in parameter studies where the elements are completely independent), the nodes
may diverge and return from this call at different times.]

See [github_link examples/mpi/phase_chain.cpp
mpi/phase_chain.cpp] for the complete example.

[endsect]

[section Concepts]

[section MPI State]
As used by `mpi_nested_algebra`.
[heading Notation]
[variablelist
[[`InnerState`] [The inner state type]]
[[`State`] [The MPI-state type]]
[[`state`] [Object of type `State`]]
[[`world`] [Object of type `boost::mpi::communicator`]]
]
[heading Valid Expressions]
[table
  [[Name] [Expression] [Type] [Semantics]]
  [[Construct a state with a communicator]
    [`State(world)`] [`State`] [Constructs the State.]]
  [[Construct a state with the default communicator]
    [`State()`] [`State`] [Constructs the State.]]
  [[Get the current node's inner state]
    [`state()`] [`InnerState`] [Returns a (const) reference.]]
  [[Get the communicator]
    [`state.world`] [`boost::mpi::communicator`] [See __boost_mpi.]]
]
[heading Models]
* `mpi_state<InnerState>`

[endsect]

[section OpenMP Split State]
As used by `openmp_nested_algebra`, essentially a Random Access Container with
`ValueType = InnerState`.
[heading Notation]
[variablelist
[[`InnerState`] [The inner state type]]
[[`State`] [The split state type]]
[[`state`] [Object of type `State`]]
]
[heading Valid Expressions]
[table
  [[Name] [Expression] [Type] [Semantics]]
  [[Construct a state for `n` chunks]
    [`State(n)`] [`State`] [Constructs underlying `vector`.]]
  [[Get a chunk]
    [`state[i]`] [`InnerState`] [Accesses underlying `vector`.]]
  [[Get the number of chunks]
    [`state.size()`] [`size_type`] [Returns size of underlying `vector`.]]
]
[heading Models]
* `openmp_state<ValueType>` with `InnerState = vector<ValueType>`

[endsect]

[section Splitter]
[heading Notation]
[variablelist
[[`Container1`] [The contiguous-data container type]]
[[`x`] [Object of type `Container1`]]
[[`Container2`] [The chunked-data container type]]
[[`y`] [Object of type `Container2`]]
]
[heading Valid Expressions]
[table
  [[Name] [Expression] [Type] [Semantics]]
  [[Copy chunks of input to output elements]
    [`split(x, y)`] [`void`]
    [Calls `split_impl<Container1, Container2>::split(x, y)`, splits `x` into
     `y.size()` chunks.]]
  [[Join chunks of input elements to output]
    [`unsplit(y, x)`] [`void`]
    [Calls `unsplit_impl<Container2, Container1>::unsplit(y, x)`, assumes `x`
     is of the correct size ['__sigma `y[i].size()`], does not resize `x`.]]
]
[heading Models]
* defined for `Container1` = __boost_range and `Container2` = `openmp_state`
* defined for `Container1` = __boost_range and `Container2` = `mpi_state`

To implement splitters for containers incompatible with __boost_range,
specialize the `split_impl` and `unsplit_impl` types:
```
template< class Container1, class Container2 , class Enabler = void >
struct split_impl {
    static void split( const Container1 &from , Container2 &to );
};

template< class Container2, class Container1 , class Enabler = void >
struct unsplit_impl {
    static void unsplit( const Container2 &from , Container1 &to );
};
```
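
For illustration, a specialization for a hypothetical chunked container
`my_chunks` might look like the sketch below. The container itself is made up,
and the specializations are placed in `boost::numeric::odeint`, which is where
the built-in splitters live; check the actual headers for your odeint version.
```
#include <algorithm>
#include <vector>

// hypothetical user-defined chunked container; data must already be sized
// to the desired number of chunks before calling split
struct my_chunks {
    std::vector< std::vector< double > > data;
    size_t size() const { return data.size(); }
};

namespace boost { namespace numeric { namespace odeint {

// copy a contiguous vector into nearly equal-sized chunks
template<>
struct split_impl< std::vector< double > , my_chunks >
{
    static void split( const std::vector< double > &from , my_chunks &to )
    {
        const size_t chunks = to.size();
        size_t begin = 0;
        for( size_t i = 0 ; i < chunks ; ++i ) {
            // distribute the remainder over the first chunks
            const size_t len = from.size() / chunks
                             + ( i < from.size() % chunks ? 1 : 0 );
            to.data[i].assign( from.begin() + begin , from.begin() + begin + len );
            begin += len;
        }
    }
};

// concatenate the chunks back into the pre-sized vector
template<>
struct unsplit_impl< my_chunks , std::vector< double > >
{
    static void unsplit( const my_chunks &from , std::vector< double > &to )
    {
        size_t begin = 0;
        for( size_t i = 0 ; i < from.size() ; ++i ) {
            std::copy( from.data[i].begin() , from.data[i].end() ,
                       to.begin() + begin );
            begin += from.data[i].size();
        }
    }
};

} } } // namespace boost::numeric::odeint
```
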
[endsect]

[endsect]

[endsect]