Namespaces
	distributed

	internal

Classes
struct	ParallelForInteger

class	Triangulation

Functions
template<typename InputIterator , typename OutputIterator , typename Predicate >
void	transform (const InputIterator &begin_in, const InputIterator &end_in, OutputIterator out, Predicate &predicate, const unsigned int grainsize)

template<typename InputIterator1 , typename InputIterator2 , typename OutputIterator , typename Predicate >
void	transform (const InputIterator1 &begin_in1, const InputIterator1 &end_in1, InputIterator2 in2, OutputIterator out, Predicate &predicate, const unsigned int grainsize)

template<typename InputIterator1 , typename InputIterator2 , typename InputIterator3 , typename OutputIterator , typename Predicate >
void	transform (const InputIterator1 &begin_in1, const InputIterator1 &end_in1, InputIterator2 in2, InputIterator3 in3, OutputIterator out, Predicate &predicate, const unsigned int grainsize)

template<typename RangeType , typename Function >
void	apply_to_subranges (const RangeType &begin, const typename identity< RangeType >::type &end, const Function &f, const unsigned int grainsize)

template<typename ResultType , typename RangeType , typename Function >
ResultType	accumulate_from_subranges (const Function &f, const RangeType &begin, const typename identity< RangeType >::type &end, const unsigned int grainsize)

Detailed Description

A namespace in which we define classes and algorithms that deal with running in parallel on shared memory machines when deal.II is configured to use multiple threads (see Parallel computing with multiple processors accessing shared memory), as well as running things in parallel on distributed memory machines (see Parallel computing with multiple processors using distributed memory).

Author: Wolfgang Bangerth, 2008, 2009

Function Documentation

template<typename InputIterator , typename OutputIterator , typename Predicate >

void parallel::transform	(	const InputIterator &	begin_in,
		const InputIterator &	end_in,
		OutputIterator	out,
		Predicate &	predicate,
		const unsigned int	grainsize
	)

An algorithm that performs the action *out++ = predicate(*in++) where the in iterator ranges over the given input range.

This algorithm does pretty much what std::transform does. The difference is that the function can run in parallel when deal.II is configured to use multiple threads.

If running in parallel, the iterator range is split into several chunks that are each packaged up as a task and given to the Threading Building Blocks scheduler to work on as compute resources are available. The function returns once all chunks have been worked on. The last argument denotes the minimum number of elements of the iterator range per task; the number must be large enough to amortize the startup cost of new tasks, and small enough to ensure that tasks can be reasonably load balanced.

For a discussion of the kind of problems to which this function is applicable, see the Parallel computing with multiple processors module.

Definition at line 173 of file parallel.h.

template<typename InputIterator1 , typename InputIterator2 , typename OutputIterator , typename Predicate >

void parallel::transform	(	const InputIterator1 &	begin_in1,
		const InputIterator1 &	end_in1,
		InputIterator2	in2,
		OutputIterator	out,
		Predicate &	predicate,
		const unsigned int	grainsize
	)

An algorithm that performs the action *out++ = predicate(*in1++, *in2++) where the in1 iterator ranges over the given input range, using the parallel for operator of tbb.

This algorithm does pretty much what std::transform does. The difference is that the function can run in parallel when deal.II is configured to use multiple threads.

If running in parallel, the iterator range is split into several chunks that are each packaged up as a task and given to the Threading Building Blocks scheduler to work on as compute resources are available. The function returns once all chunks have been worked on. The last argument denotes the minimum number of elements of the iterator range per task; the number must be large enough to amortize the startup cost of new tasks, and small enough to ensure that tasks can be reasonably load balanced.

For a discussion of the kind of problems to which this function is applicable, see the Parallel computing with multiple processors module.

Definition at line 229 of file parallel.h.

template<typename InputIterator1 , typename InputIterator2 , typename InputIterator3 , typename OutputIterator , typename Predicate >

void parallel::transform	(	const InputIterator1 &	begin_in1,
		const InputIterator1 &	end_in1,
		InputIterator2	in2,
		InputIterator3	in3,
		OutputIterator	out,
		Predicate &	predicate,
		const unsigned int	grainsize
	)

An algorithm that performs the action *out++ = predicate(*in1++, *in2++, *in3++) where the in1 iterator ranges over the given input range.

This algorithm does pretty much what std::transform does. The difference is that the function can run in parallel when deal.II is configured to use multiple threads.

If running in parallel, the iterator range is split into several chunks that are each packaged up as a task and given to the Threading Building Blocks scheduler to work on as compute resources are available. The function returns once all chunks have been worked on. The last argument denotes the minimum number of elements of the iterator range per task; the number must be large enough to amortize the startup cost of new tasks, and small enough to ensure that tasks can be reasonably load balanced.

For a discussion of the kind of problems to which this function is applicable, see the Parallel computing with multiple processors module.

Definition at line 288 of file parallel.h.

template<typename RangeType , typename Function >

void parallel::apply_to_subranges	(	const RangeType &	begin,
		const typename identity< RangeType >::type &	end,
		const Function &	f,
		const unsigned int	grainsize
	)

This function applies the given function argument f to all elements in the range [begin,end) and may do so in parallel.

However, in many cases it is not efficient to call a function on each element, so this function calls the given function object on sub-ranges. In other words: if the given range [begin,end) is smaller than grainsize or if multithreading is not enabled, then we call f(begin,end); otherwise, we may execute, possibly in parallel, a sequence of calls f(b,e) where [b,e) are subintervals of [begin,end) and the collection of calls we do to f(.,.) will happen on disjoint subintervals that collectively cover the original interval [begin,end).

Oftentimes, the called function will of course have to get additional information, such as the object to work on for a given value of the iterator argument. This can be achieved by binding certain arguments. For example, here is an implementation of a matrix-vector multiplication \(y=Ax\) for a full matrix \(A\) and vectors \(x,y\):

void matrix_vector_product (const FullMatrix &A,
                            const Vector     &x,
                            Vector           &y)
{
  parallel::apply_to_subranges
     (0, A.n_rows(),
      std::bind (&mat_vec_on_subranges,
                 std::placeholders::_1, std::placeholders::_2,
                 std::cref(A),
                 std::cref(x),
                 std::ref(y)),
      50);
}
void mat_vec_on_subranges (const unsigned int begin_row,
                           const unsigned int end_row,
                           const FullMatrix &A,
                           const Vector     &x,
                           Vector           &y)
{
  for (unsigned int row=begin_row; row!=end_row; ++row)
    for (unsigned int col=0; col<x.size(); ++col)
      y(row) += A(row,col) * x(col);
}

Note how we use the std::bind function to convert mat_vec_on_subranges from a function that takes 5 arguments to one taking 2 by binding the remaining arguments (the modifiers std::ref and std::cref make sure that the enclosed variables are actually passed by reference and constant reference, rather than by value). The resulting function object requires only two arguments, begin_row and end_row, with all other arguments fixed.

The code, if in single-thread mode, will call mat_vec_on_subranges on the entire range [0,n_rows) exactly once. In multi-threaded mode, however, it may be called multiple times on subranges of this interval, possibly allowing more than one CPU core to take care of part of the work.

The grainsize argument (50 in the example above) makes sure that subranges do not become too small, to avoid spending more time on scheduling subranges to CPU resources than on doing actual work.

For a discussion of the kind of problems to which this function is applicable, see also the Parallel computing with multiple processors module.

Definition at line 412 of file parallel.h.

template<typename ResultType , typename RangeType , typename Function >

ResultType parallel::accumulate_from_subranges	(	const Function &	f,
		const RangeType &	begin,
		const typename identity< RangeType >::type &	end,
		const unsigned int	grainsize
	)

This function works a lot like the apply_to_subranges(), but it allows to accumulate numerical results computed on each subrange into one number. The type of this number is given by the ResultType template argument that needs to be explicitly specified.

An example of use of this function is to compute the value of the expression \(x^T A x\) for a square matrix \(A\) and a vector \(x\). The sum over rows can be parallelized and the whole code might look like this:

void matrix_norm (const FullMatrix &A,
                  const Vector     &x)
{
  return
   std::sqrt
    (parallel::accumulate_from_subranges<double>
     (0, A.n_rows(),
      std::bind (&mat_norm_sqr_on_subranges,
                  std::placeholders::_1, std::placeholders::_2,
                  std::cref(A),
                  std::cref(x)),
      50);
}
double
mat_norm_sqr_on_subranges (const unsigned int begin_row,
                           const unsigned int end_row,
                           const FullMatrix &A,
                           const Vector     &x)
{
  double norm_sqr = 0;
  for (unsigned int row=begin_row; row!=end_row; ++row)
    for (unsigned int col=0; col<x.size(); ++col)
      norm_sqr += x(row) * A(row,col) * x(col);
  return norm_sqr;
}

Here, mat_norm_sqr_on_subranges is called on the entire range [0,A.n_rows()) if this range is less than the minimum grainsize (above chosen as 50) or if deal.II is configured to not use multithreading. Otherwise, it may be called on subsets of the given range, with results from the individual subranges accumulated internally.

Warning: If ResultType is a floating point type, then accumulation is not an associative operation. In other words, if the given function object is called three times on three subranges, returning values \(a,b,c\), then the returned result of this function is \((a+b)+c\). However, depending on how the three sub-tasks are distributed on available CPU resources, the result may also be \((a+c)+b\) or any other permutation; because floating point addition is not associative (as opposed, of course, to addition of real numbers), the result of invoking this function several times may differ on the order of round-off.

For a discussion of the kind of problems to which this function is applicable, see also the Parallel computing with multiple processors module.

Definition at line 652 of file parallel.h.

Namespaces

Classes

Functions

Detailed Description

Function Documentation