This tutorial depends on step-25, step-37, step-40.

Table of contents

Introduction
The commented program

Results
The plain program

This program was contributed by Katharina Kormann and Martin Kronbichler.

The algorithm for the matrix-vector product is based on the article A generic interface for parallel cell-based finite element operator application by Martin Kronbichler and Katharina Kormann, Computers and Fluids 63:135–147, 2012, and the paper "Parallel finite element operator application: Graph partitioning and coloring" by Katharina Kormann and Martin Kronbichler in: Proceedings of the 7th IEEE International Conference on e-Science, 2011.

Introduction

This program demonstrates how to use the cell-based implementation of finite element operators with the MatrixFree class, first introduced in step-37, to solve nonlinear partial differential equations. Moreover, we demonstrate how the MatrixFree class handles constraints, an issue shortly mentioned in the results section of step-37. Finally, we will use an explicit time-stepping method to solve the problem and introduce Gauss-Lobatto finite elements that are very convenient in this case since their mass matrix can be accurately approximated by a diagonal, and thus trivially invertible, matrix. The two ingredients to this property are firstly a distribution of the nodal points of Lagrange polynomials according to the point distribution of the Gauss-Lobatto quadrature rule. Secondly, the quadrature is done with the same Gauss-Lobatto quadrature rule. In this formula, the integrals \(\int_K \varphi_i \varphi_j dx\approx \sum_q \varphi_i \varphi_j \mathrm{det}(J) \big |_{x_q}\) are approximated to zero whenever \(i\neq j\), because on the points defining the Lagrange polynomials exactly one function \(\varphi_j\) is one and all others zero. Moreover, the Gauss-Lobatto distribution of nodes of Lagrange polynomials clusters the nodes towards the element boundaries. This results in a well-conditioned polynomial basis for high-order discretization methods. Indeed, the condition number of an FE_Q elements with equidistant nodes grows exponentially with the degree, which destroys any benefit for orders of about five and higher. For this reason, Gauss-Lobatto points are the default distribution for FE_Q (but at degrees one and two, those are equivalent to the equidistant points).

Problem statement and discretization

As an example, we choose to solve the sine-Gordon soliton equation

\begin{eqnarray*} u_{tt} &=& \Delta u -\sin(u) \quad\mbox{for}\quad (x,t) \in \Omega \times (t_0,t_f],\\ {\mathbf n} \cdot \nabla u &=& 0 \quad\mbox{for}\quad (x,t) \in \partial\Omega \times (t_0,t_f],\\ u(x,t_0) &=& u_0(x). \end{eqnarray*}

that was already introduced in step-25. As a simple explicit time integration method, we choose leap frog scheme using the second-order formulation of the equation. Then, the scheme reads in weak form

\begin{eqnarray*} (v,u^{n+1}) = (v,2 u^n-u^{n-1} - (\Delta t)^2 \sin(u^n)) - (\nabla v, (\Delta t)^2 \nabla u^n), \end{eqnarray*}

where v denotes a test function and the index n stands for the time step number.

For the spatial discretization, we choose FE_Q elements with basis functions defined to interpolate the support points of the Gauss-Lobatto quadrature rule. Moreover, when we compute the integrals over the basis functions to form the mass matrix and the operator on the right hand side of the equation above, we use the Gauss-Lobatto quadrature rule with the same support points as the node points of the finite element to evaluate the integrals. Since the finite element is Lagrangian, this will yield a diagonal mass matrix on the left hand side of the equation, making the solution of the linear system in each time step trivial.

Using this quadrature rule, for a pth order finite element, we use a (2p-1)th order accurate formula to evaluate the integrals. Since the product of two pth order basis functions when computing a mass matrix gives a function with polynomial degree 2p in each direction, the integrals are not computed exactly. However, considering the fact that the interpolation order of finite elements of degree p is p+1, the overall convergence properties are not disturbed by the quadrature error, in particular not when we use high orders.

Apart from the fact that we avoid solving linear systems with this type of elements when using explicit time-stepping, they come with two other advantages. When we are using the sum-factorization approach to evaluate the finite element operator (cf. step-37), we have to evaluate the function at the quadrature points. In the case of Gauss-Lobatto elements, where quadrature points and node points of the finite element coincide, this operation is trivial since the value of the function at the quadrature points is given by its one-dimensional coefficients. In this way, the complexity of a finite element operator evaluation is further reduced compared to equidistant elements.

The third advantage is the fact that these elements are better conditioned than equidistant Lagrange polynomials for increasing order so that we can use higher order elements for an accurate solution of the equation. Lagrange elements FE_Q with equidistant points should not be used for polynomial degrees four and higher.

To sum up the discussion, by using the right finite element and quadrature rule combination, we end up with a scheme where we in each time step need to compute the right hand side vector corresponding to the formulation above and then multiply it by the inverse of the diagonal mass matrix. In practice, of course, we extract the diagonal elements and invert them only once at the beginning of the program.

Implementation of constraints

The usual way to handle constraints in deal.II is to use the AffineConstraints class that builds a sparse matrix storing information about which degrees of freedom (DoF) are constrained and how they are constrained. This format uses an unnecessarily large amount of memory since there are not so many different types of constraints: for example, in the case of hanging nodes when using linear finite element on every cell, constraints most have the form \(x_k = \frac 12 x_i + \frac 12 x_j\) where the coefficients \(\frac 12\) are always the same and only \(i,j,k\) are different. While storing this redundant information is not a problem in general because it is only needed once during matrix and right hand side assembly, it becomes a problem when we want to use the matrix-free approach since there this information has to be accessed every time we apply the operator. Thus, instead of a AffineConstraints, we use a variable that we call constraint_pool that collects the weights of the different constraints. Then, we only have to store an identifier of each constraint in the mesh instead of all the weights. Moreover, we do not want to apply the constraints in a pre- and postprocessing step but want to take care of constraints as we evaluate the finite element operator. Therefore, we embed the constraint information into the variable indices_local_to_global that is used to extract the cell information from the global vector. If a DoF is constrained, the indices_local_to_global variable contains the global indices of the DoFs that it is constrained to. Then, we have another variable constraint_indicator at hand that holds, for each cell, the local indices of DoFs that are constrained as well as the identifier of the type of constraint. Actually, you will not see these data structures in the example program since the class FEEvaluation takes care of the constraints without user interaction.

Parallelization

The MatrixFree class comes with the option to be parallelized on three levels: MPI parallelization on clusters of distributed nodes, thread parallelization scheduled by the Threading Building Blocks library, and finally with a vectorization by clustering of two (or more) cells into a SIMD data type for the operator application. As we have already discussed in step-37, you will get best performance by using an instruction set specific to your system, e.g. with the cmake variable -DCMAKE_CXX_FLAGS="-march=native". The MPI parallelization was already exploited in step-37. Here, we additionally consider thread parallelization with TBB. This is fairly simple, as all we need to do is to tell the initialization of the MatrixFree object about the fact that we want to use a thread parallel scheme through the variable MatrixFree::AdditionalData::thread_parallel_scheme. During setup, a dependency graph similar to the one described in the workstream_paper , which allows the code to schedule the work of the local_apply function on chunks of cell without several threads accessing the same vector indices. As opposed to the WorkStream loops, some additional clever tricks to avoid global synchronizations as described in Kormann and Kronbichler (2011) are also applied.

Note that this program is designed to be run with a distributed triangulation (parallel::distributed::Triangulation), which requires deal.II to be configured with p4est as described in the deal.II ReadMe file. However, a non-distributed triangulation is also supported, in which case the computation will be run in serial.

The test case

In our example, we choose the initial value to be

\begin{eqnarray*} u(x,t) = \prod_{i=1}^{d} -4 \arctan \left( \frac{m}{\sqrt{1-m^2}}\frac{\sin\left(\sqrt{1-m^2} t +c_2\right)}{\cosh(mx_i+c_1)}\right) \end{eqnarray*}

and solve the equation over the time interval [-10,10]. The constants are chosen to be \(c_1=c_1=0\) and m=0.5. As mentioned in step-25, in one dimension u as a function of t is the exact solution of the sine-Gordon equation. For higher dimension, this is however not the case.

The commented program

The necessary files from the deal.II library.

#include <deal.II/base/logstream.h>
#include <deal.II/base/utilities.h>
#include <deal.II/base/function.h>
#include <deal.II/base/conditional_ostream.h>
#include <deal.II/base/timer.h>
#include <deal.II/lac/vector.h>
#include <deal.II/grid/tria.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/lac/affine_constraints.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/numerics/vector_tools.h>
#include <deal.II/numerics/data_out.h>
#include <deal.II/distributed/tria.h>

This includes the data structures for the efficient implementation of matrix-free methods.

#include <deal.II/lac/la_parallel_vector.h>
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <fstream>
#include <iostream>
#include <iomanip>
namespace Step48
{
  using namespace dealii;

We start by defining two global variables to collect all parameters subject to changes at one place: One for the dimension and one for the finite element degree. The dimension is used in the main function as a template argument for the actual classes (like in all other deal.II programs), whereas the degree of the finite element is more crucial, as it is passed as a template argument to the implementation of the Sine-Gordon operator. Therefore, it needs to be a compile-time constant.

const unsigned int dimension = 2;

const unsigned int fe_degree = 4;

SineGordonOperation

The SineGordonOperation class implements the cell-based operation that is needed in each time step. This nonlinear operation can be implemented straight-forwardly based on the MatrixFree class, in the same way as a linear operation would be treated by this implementation of the finite element operator application. We apply two template arguments to the class, one for the dimension and one for the degree of the finite element. This is a difference to other functions in deal.II where only the dimension is a template argument. This is necessary to provide the inner loops in FEEvaluation with information about loop lengths etc., which is essential for efficiency. On the other hand, it makes it more challenging to implement the degree as a run-time parameter.

template <int dim, int fe_degree>
class SineGordonOperation
{
public:
  SineGordonOperation(const MatrixFree<dim, double> &data_in,
                      const double                   time_step);
  void apply(LinearAlgebra::distributed::Vector<double> &dst,
             const std::vector<LinearAlgebra::distributed::Vector<double> *>
               &src) const;
private:
  const MatrixFree<dim, double> &            data;
  const VectorizedArray<double>              delta_t_sqr;
  LinearAlgebra::distributed::Vector<double> inv_mass_matrix;
  void local_apply(
    const MatrixFree<dim, double> &                                  data,
    LinearAlgebra::distributed::Vector<double> &                     dst,
    const std::vector<LinearAlgebra::distributed::Vector<double> *> &src,
    const std::pair<unsigned int, unsigned int> &cell_range) const;
};

SineGordonOperation::SineGordonOperation

This is the constructor of the SineGordonOperation class. It receives a reference to the MatrixFree holding the problem information and the time step size as input parameters. The initialization routine sets up the mass matrix. Since we use Gauss-Lobatto elements, the mass matrix is a diagonal matrix and can be stored as a vector. The computation of the mass matrix diagonal is simple to achieve with the data structures provided by FEEvaluation: Just loop over all (macro-) cells and integrate over the function that is constant one on all quadrature points by using the integrate function with true argument at the slot for values. Finally, we invert the diagonal entries since we have to multiply by the inverse mass matrix in each time step.

template <int dim, int fe_degree>
SineGordonOperation<dim, fe_degree>::SineGordonOperation(
  const MatrixFree<dim, double> &data_in,
  const double                   time_step)
  : data(data_in)
  , delta_t_sqr(make_vectorized_array(time_step * time_step))
{
  VectorizedArray<double> one = make_vectorized_array(1.);
  data.initialize_dof_vector(inv_mass_matrix);
  FEEvaluation<dim, fe_degree> fe_eval(data);
  const unsigned int           n_q_points = fe_eval.n_q_points;
  for (unsigned int cell = 0; cell < data.n_macro_cells(); ++cell)
    {
      fe_eval.reinit(cell);
      for (unsigned int q = 0; q < n_q_points; ++q)
        fe_eval.submit_value(one, q);
      fe_eval.integrate(true, false);
      fe_eval.distribute_local_to_global(inv_mass_matrix);
    }
  inv_mass_matrix.compress(VectorOperation::add);
  for (unsigned int k = 0; k < inv_mass_matrix.local_size(); ++k)
    if (inv_mass_matrix.local_element(k) > 1e-15)
      inv_mass_matrix.local_element(k) =
        1. / inv_mass_matrix.local_element(k);
    else
      inv_mass_matrix.local_element(k) = 0;
}

SineGordonOperation::local_apply

This operator implements the core operation of the program, the integration over a range of cells for the nonlinear operator of the Sine-Gordon problem. The implementation is based on the FEEvaluation class as in step-37. Due to the special structure in Gauss-Lobatto elements, certain operations become simpler, in particular the evaluation of shape function values on quadrature points which is simply the injection of the values of cell degrees of freedom. The MatrixFree class detects possible structure of the finite element at quadrature points when initializing, which is then used by FEEvaluation for selecting the most appropriate numerical kernel.

The nonlinear function that we have to evaluate for the time stepping routine includes the value of the function at the present time current as well as the value at the previous time step old. Both values are passed to the operator in the collection of source vectors src, which is simply a std::vector of pointers to the actual solution vectors. This construct of collecting several source vectors into one is necessary as the cell loop in MatrixFree takes exactly one source and one destination vector, even if we happen to use many vectors like the two in this case. Note that the cell loop accepts any valid class for input and output, which does not only include vectors but general data types. However, only in case it encounters a LinearAlgebra::distributed::Vector<Number> or a std::vector collecting these vectors, it calls functions that exchange data at the beginning and the end of the loop. In the loop over the cells, we first have to read in the values in the vectors related to the local values. Then, we evaluate the value and the gradient of the current solution vector and the values of the old vector at the quadrature points. Then, we combine the terms in the scheme in the loop over the quadrature points. Finally, we integrate the result against the test function and accumulate the result to the global solution vector dst.

template <int dim, int fe_degree>
void SineGordonOperation<dim, fe_degree>::local_apply(
  const MatrixFree<dim> &                                          data,
  LinearAlgebra::distributed::Vector<double> &                     dst,
  const std::vector<LinearAlgebra::distributed::Vector<double> *> &src,
  const std::pair<unsigned int, unsigned int> &cell_range) const
{
  AssertDimension(src.size(), 2);
  FEEvaluation<dim, fe_degree> current(data), old(data);
  for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
    {
      current.reinit(cell);
      old.reinit(cell);
      current.read_dof_values(*src[0]);
      old.read_dof_values(*src[1]);
      current.evaluate(true, true, false);
      old.evaluate(true, false, false);
      for (unsigned int q = 0; q < current.n_q_points; ++q)
        {
          const VectorizedArray<double> current_value = current.get_value(q);
          const VectorizedArray<double> old_value     = old.get_value(q);
          current.submit_value(2. * current_value - old_value -
                                 delta_t_sqr * std::sin(current_value),
                               q);
          current.submit_gradient(-delta_t_sqr * current.get_gradient(q), q);
        }
      current.integrate(true, true);
      current.distribute_local_to_global(dst);
    }
}

SineGordonOperation::apply

This function performs the time stepping routine based on the cell-local strategy. First the destination vector is set to zero, then the cell-loop is called, and finally the solution is multiplied by the inverse mass matrix. The structure of the cell loop is implemented in the cell finite element operator class. On each cell it applies the routine defined as the local_apply() method of the class SineGordonOperation, i.e., this. One could also provide a function with the same signature that is not part of a class.

template <int dim, int fe_degree>
void SineGordonOperation<dim, fe_degree>::apply(
  LinearAlgebra::distributed::Vector<double> &                     dst,
  const std::vector<LinearAlgebra::distributed::Vector<double> *> &src) const
{
  dst = 0;
  data.cell_loop(&SineGordonOperation<dim, fe_degree>::local_apply,
                 this,
                 dst,
                 src);
  dst.scale(inv_mass_matrix);
}

Equation data

We define a time-dependent function that is used as initial value. Different solutions can be obtained by varying the starting time. This function has already been explained in step-25.

template <int dim>
class ExactSolution : public Function<dim>
{
public:
  ExactSolution(const unsigned int n_components = 1, const double time = 0.)
    : Function<dim>(n_components, time)
  {}
  virtual double value(const Point<dim> & p,
                       const unsigned int component = 0) const override;
};
template <int dim>
double ExactSolution<dim>::value(const Point<dim> &p,
                                 const unsigned int / * component * /) const
{
  double t = this->get_time();
  const double m  = 0.5;
  const double c1 = 0.;
  const double c2 = 0.;
  const double factor =
    (m / std::sqrt(1. - m * m) * std::sin(std::sqrt(1. - m * m) * t + c2));
  double result = 1.;
  for (unsigned int d = 0; d < dim; ++d)
    result *= -4. * std::atan(factor / std::cosh(m * p[d] + c1));
  return result;
}

SineGordonProblem class

This is the main class that builds on the class in step-25. However, we replaced the SparseMatrix<double> class by the MatrixFree class to store the geometry data. Also, we use a distributed triangulation in this example.

  template <int dim>
  class SineGordonProblem
  {
  public:
    SineGordonProblem();
    void run();
  private:
    ConditionalOStream pcout;
    void make_grid_and_dofs();
    void output_results(const unsigned int timestep_number);
#ifdef DEAL_II_WITH_P4EST
    parallel::distributed::Triangulation<dim> triangulation;
#else
    Triangulation<dim> triangulation;
#endif
    FE_Q<dim>                 fe;
    DoFHandler<dim>           dof_handler;
    AffineConstraints<double> constraints;
    IndexSet                  locally_relevant_dofs;
    MatrixFree<dim, double> matrix_free_data;
    LinearAlgebra::distributed::Vector<double> solution, old_solution,
      old_old_solution;
    const unsigned int n_global_refinements;
    double             time, time_step;
    const double       final_time;
    const double       cfl_number;
    const unsigned int output_timestep_skip;
  };

SineGordonProblem::SineGordonProblem

This is the constructor of the SineGordonProblem class. The time interval and time step size are defined here. Moreover, we use the degree of the finite element that we defined at the top of the program to initialize a FE_Q finite element based on Gauss-Lobatto support points. These points are convenient because in conjunction with a QGaussLobatto quadrature rule of the same order they give a diagonal mass matrix without compromising accuracy too much (note that the integration is inexact, though), see also the discussion in the introduction. Note that FE_Q selects the Gauss-Lobatto nodal points by default due to their improved conditioning versus equidistant points. To make things more explicit, we choose to state the selection of the nodal points nonetheless.

  template <int dim>
  SineGordonProblem<dim>::SineGordonProblem()
    : pcout(std::cout, Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)
    ,
#ifdef DEAL_II_WITH_P4EST
    triangulation(MPI_COMM_WORLD)
    ,
#endif
    fe(QGaussLobatto<1>(fe_degree + 1))
    , dof_handler(triangulation)
    , n_global_refinements(10 - 2 * dim)
    , time(-10)
    , time_step(10.)
    , final_time(10)
    , cfl_number(.1 / fe_degree)
    , output_timestep_skip(200)
  {}

SineGordonProblem::make_grid_and_dofs

As in step-25 this functions sets up a cube grid in dim dimensions of extent \([-15,15]\). We refine the mesh more in the center of the domain since the solution is concentrated there. We first refine all cells whose center is within a radius of 11, and then refine once more for a radius 6. This simple ad hoc refinement could be done better by adapting the mesh to the solution using error estimators during the time stepping as done in other example programs, and using parallel::distributed::SolutionTransfer to transfer the solution to the new mesh.

  template <int dim>
  void SineGordonProblem<dim>::make_grid_and_dofs()
  {
    GridGenerator::hyper_cube(triangulation, -15, 15);
    triangulation.refine_global(n_global_refinements);
    {
      for (const auto &cell : triangulation.active_cell_iterators())
        if (cell->is_locally_owned())
          if (cell->center().norm() < 11)
            cell->set_refine_flag();
      triangulation.execute_coarsening_and_refinement();
      for (const auto &cell : triangulation.active_cell_iterators())
        if (cell->is_locally_owned())
          if (cell->center().norm() < 6)
            cell->set_refine_flag();
      triangulation.execute_coarsening_and_refinement();
    }
    pcout << "   Number of global active cells: "
#ifdef DEAL_II_WITH_P4EST
          << triangulation.n_global_active_cells()
#else
          << triangulation.n_active_cells()
#endif
          << std::endl;
    dof_handler.distribute_dofs(fe);
    pcout << "   Number of degrees of freedom: " << dof_handler.n_dofs()
          << std::endl;

We generate hanging node constraints for ensuring continuity of the solution. As in step-40, we need to equip the constraint matrix with the IndexSet of locally relevant degrees of freedom to avoid it to consume too much memory for big problems. Next, the MatrixFree for the problem is set up. Note that we specify the MPI communicator which we are going to use, and that we also want to use shared-memory parallelization (hence one would use multithreading for intra-node parallelism and not MPI; note that we here choose the standard option — if we wanted to disable shared memory parallelization, we would choose none). Finally, three solution vectors are initialized. MatrixFree stores the layout that is to be used by distributed vectors, so we just ask it to initialize the vectors.

  DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant_dofs);
  constraints.clear();
  constraints.reinit(locally_relevant_dofs);
  DoFTools::make_hanging_node_constraints(dof_handler, constraints);
  constraints.close();
  QGaussLobatto<1>                         quadrature(fe_degree + 1);
  typename MatrixFree<dim>::AdditionalData additional_data;
  additional_data.tasks_parallel_scheme =
    MatrixFree<dim>::AdditionalData::partition_partition;
  matrix_free_data.reinit(dof_handler,
                          constraints,
                          quadrature,
                          additional_data);
  matrix_free_data.initialize_dof_vector(solution);
  old_solution.reinit(solution);
  old_old_solution.reinit(solution);
}

SineGordonProblem::output_results

This function prints the norm of the solution and writes the solution vector to a file. The norm is standard (except for the fact that we need to accumulate the norms over all processors for the parallel grid), and the second is similar to what we did in step-40 or step-37. Note that we can use the same vector for output as the one used during computations: The vectors in the matrix-free framework always provide full information on all locally owned cells (this is what is needed in the local evaluations, too), including ghost vector entries on these cells. This is the only data that is needed in the integrate_difference function as well as in DataOut. The only action to take at this point is then to make sure that the vector updates its ghost values before we read from them. This is a feature present only in the LinearAlgebra::distributed::Vector class. Distributed vectors with PETSc and Trilinos, on the other hand, need to be copied to special vectors including ghost values (see the relevant section in step-40). If we also wanted to access all degrees of freedom on ghost cells (e.g. when computing error estimators that use the jump of solution over cell boundaries), we would need more information and create a vector initialized with locally relevant dofs just as in step-40. Observe also that we need to distribute constraints for output - they are not filled during computations (rather, they are interpolated on the fly in the matrix-free method read_dof_values).

template <int dim>
void
SineGordonProblem<dim>::output_results(const unsigned int timestep_number)
{
  constraints.distribute(solution);
  Vector<float> norm_per_cell(triangulation.n_active_cells());
  solution.update_ghost_values();
  VectorTools::integrate_difference(dof_handler,
                                    solution,
                                    Functions::ZeroFunction<dim>(),
                                    norm_per_cell,
                                    QGauss<dim>(fe_degree + 1),
                                    VectorTools::L2_norm);
  const double solution_norm =
    VectorTools::compute_global_error(triangulation,
                                      norm_per_cell,
                                      VectorTools::L2_norm);
  pcout << "   Time:" << std::setw(8) << std::setprecision(3) << time
        << ", solution norm: " << std::setprecision(5) << std::setw(7)
        << solution_norm << std::endl;
  DataOut<dim> data_out;
  data_out.attach_dof_handler(dof_handler);
  data_out.add_data_vector(solution, "solution");
  data_out.build_patches();
  const std::string filename =
    "solution-" + Utilities::int_to_string(timestep_number, 3);
  std::ofstream output(
    filename + "." +
    Utilities::int_to_string(Utilities::MPI::this_mpi_process(MPI_COMM_WORLD),
                             4) +
    ".vtu");
  data_out.write_vtu(output);
  if (Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)
    {
      std::vector<std::string> filenames;
      for (unsigned int i = 0;
           i < Utilities::MPI::n_mpi_processes(MPI_COMM_WORLD);
           ++i)
        filenames.push_back("solution-" +
                            Utilities::int_to_string(timestep_number, 3) +
                            "." + Utilities::int_to_string(i, 4) + ".vtu");
      std::ofstream master_output((filename + ".pvtu"));
      data_out.write_pvtu_record(master_output, filenames);
    }
}

SineGordonProblem::run

This function is called by the main function and calls the subroutines of the class.

The first step is to set up the grid and the cell operator. Then, the time step is computed from the CFL number given in the constructor and the finest mesh size. The finest mesh size is computed as the diameter of the last cell in the triangulation, which is the last cell on the finest level of the mesh. This is only possible for Cartesian meshes, otherwise, one needs to loop over all cells. Note that we need to query all the processors for their finest cell since the not all processors might hold a region where the mesh is at the finest level. Then, we readjust the time step a little to hit the final time exactly.

template <int dim>
void SineGordonProblem<dim>::run()
{
  make_grid_and_dofs();
  const double local_min_cell_diameter =
    triangulation.last()->diameter() / std::sqrt(dim);
  const double global_min_cell_diameter =
    -Utilities::MPI::max(-local_min_cell_diameter, MPI_COMM_WORLD);
  time_step = cfl_number * global_min_cell_diameter;
  time_step = (final_time - time) / (int((final_time - time) / time_step));
  pcout << "   Time step size: " << time_step
        << ", finest cell: " << global_min_cell_diameter << std::endl
        << std::endl;

Next the initial value is set. Since we have a two-step time stepping method, we also need a value of the solution at time-time_step. For accurate results, one would need to compute this from the time derivative of the solution at initial time, but here we ignore this difficulty and just set it to the initial value function at that artificial time.

We create an output of the initial value. Then we also need to collect the two starting solutions in a std::vector of pointers field and to set up an instance of the SineGordonOperation class based on the finite element degree specified at the top of this file.

VectorTools::interpolate(dof_handler,
                         ExactSolution<dim>(1, time),
                         solution);
VectorTools::interpolate(dof_handler,
                         ExactSolution<dim>(1, time - time_step),
                         old_solution);
output_results(0);
std::vector<LinearAlgebra::distributed::Vector<double> *>
  previous_solutions;
previous_solutions.push_back(&old_solution);
previous_solutions.push_back(&old_old_solution);
SineGordonOperation<dim, fe_degree> sine_gordon_op(matrix_free_data,
                                                   time_step);

Now loop over the time steps. In each iteration, we shift the solution vectors by one and call the apply function of the SineGordonOperator . Then, we write the solution to a file. We clock the wall times for the computational time needed as wall as the time needed to create the output and report the numbers when the time stepping is finished.

Note how this shift is implemented: We simply call the swap method on the two vectors which swaps only some pointers without the need to copy data around. Obviously, this is a more efficient way to update the vectors during time stepping. Let us see what happens in more detail: First, we exchange old_solution with old_old_solution, which means that old_old_solution gets old_solution, which is what we expect. Similarly, old_solution gets the content from solution in the next step. Afterward, solution holds old_old_solution, but that will be overwritten during this step.

    unsigned int timestep_number = 1;
    Timer  timer;
    double wtime       = 0;
    double output_time = 0;
    for (time += time_step; time <= final_time;
         time += time_step, ++timestep_number)
      {
        timer.restart();
        old_old_solution.swap(old_solution);
        old_solution.swap(solution);
        sine_gordon_op.apply(solution, previous_solutions);
        wtime += timer.wall_time();
        timer.restart();
        if (timestep_number % output_timestep_skip == 0)
          output_results(timestep_number / output_timestep_skip);
        output_time += timer.wall_time();
      }
    timer.restart();
    output_results(timestep_number / output_timestep_skip + 1);
    output_time += timer.wall_time();
    pcout << std::endl
          << "   Performed " << timestep_number << " time steps." << std::endl;
    pcout << "   Average wallclock time per time step: "
          << wtime / timestep_number << "s" << std::endl;
    pcout << "   Spent " << output_time << "s on output and " << wtime
          << "s on computations." << std::endl;
  }
} // namespace Step48

The `main` function

As in step-40, we initialize MPI at the start of the program. Since we will in general mix MPI parallelization with threads, we also set the third argument in MPI_InitFinalize that controls the number of threads to an invalid number, which means that the TBB library chooses the number of threads automatically, typically to the number of available cores in the system. As an alternative, you can also set this number manually if you want to set a specific number of threads (e.g. when MPI-only is required).

int main(int argc, char **argv)
{
  using namespace Step48;
  using namespace dealii;
  Utilities::MPI::MPI_InitFinalize mpi_initialization(
    argc, argv, numbers::invalid_unsigned_int);
  try
    {
      SineGordonProblem<dimension> sg_problem;
      sg_problem.run();
    }
  catch (std::exception &exc)
    {
      std::cerr << std::endl
                << std::endl
                << "----------------------------------------------------"
                << std::endl;
      std::cerr << "Exception on processing: " << std::endl
                << exc.what() << std::endl
                << "Aborting!" << std::endl
                << "----------------------------------------------------"
                << std::endl;
      return 1;
    }
  catch (...)
    {
      std::cerr << std::endl
                << std::endl
                << "----------------------------------------------------"
                << std::endl;
      std::cerr << "Unknown exception!" << std::endl
                << "Aborting!" << std::endl
                << "----------------------------------------------------"
                << std::endl;
      return 1;
    }
  return 0;
}

Results

Comparison with a sparse matrix

In order to demonstrate the gain in using the MatrixFree class instead of the standard deal.II assembly routines for evaluating the information from old time steps, we study a simple serial run of the code on a nonadaptive mesh. Since much time is spent on evaluating the sine function, we do not only show the numbers of the full sine-Gordon equation but also for the wave equation (the sine-term skipped from the sine-Gordon equation). We use both second and fourth order elements. The results are summarized in the following table.

	wave equation			sine-Gordon
	MF	SpMV	dealii	MF	dealii
2D, \(\mathcal{Q}_2\)	0.0106	0.00971	0.109	0.0243	0.124
2D, \(\mathcal{Q}_4\)	0.0328	0.0706	0.528	0.0714	0.502
3D, \(\mathcal{Q}_2\)	0.0151	0.0320	0.331	0.0376	0.364
3D, \(\mathcal{Q}_4\)	0.0918	0.844	6.83	0.194	6.95

It is apparent that the matrix-free code outperforms the standard assembly routines in deal.II by far. In 3D and for fourth order elements, one operator application is also almost ten times as fast as a sparse matrix-vector product.

Parallel run in 3D

To demonstrate how the example scales for a parallel run and to demonstrate that hanging node constraints can be handled in an efficient way, we run the example in 3D with \(\mathcal{Q}_4\) elements. First, we run it on a notebook with 2 cores (Sandy Bridge CPU) at 2.7 GHz.

$ make debug-mode=off run
   Number of global active cells: 17592
   Number of degrees of freedom: 1193881
   Time step size: 0.0117233, finest cell: 0.46875
   Time:     -10, solution norm:  29.558
   Time:   -7.66, solution norm:  129.13
   Time:   -5.31, solution norm:  67.753
   Time:   -2.97, solution norm:  79.245
   Time:  -0.621, solution norm:  123.52
   Time:    1.72, solution norm:  43.525
   Time:    4.07, solution norm:  93.285
   Time:    6.41, solution norm:  97.722
   Time:    8.76, solution norm:  36.734
   Time:      10, solution norm:  94.115
   Performed 1706 time steps.
   Average wallclock time per time step: 0.038261s
   Spent 11.977s on output and 65.273s on computations.

It takes 0.04 seconds for one time step on a notebook with more than a million degrees of freedom (note that we would need many processors to reach such numbers when solving linear systems). If we run the same 3D code on a cluster with 2 nodes and each node runs 8 threads, we get the following times:

$ mpirun --bynode -n 2 ./step-48
...
   Performed 1706 time steps.
   Average wallclock time per time step: 0.0123188s
   Spent 6.74378s on output and 21.0158s on computations.

We observe a considerable speedup over the notebook (16 cores versus 2 cores; nonetheless, one notebook core is considerably faster than one core of the cluster because of a newer processor architecture). If we run the same program on 4 nodes with 8 threads on each node, we get:

$ mpirun --bynode -n 4 ./step-48
...
   Performed 1706 time steps.
   Average wallclock time per time step: 0.00689865s
   Spent 3.54145s on output and 11.7691s on computations.

By comparing the times for two nodes and four nodes, we observe the nice scaling behavior of the implementation. Of course, the code can also be run in MPI-mode only by disabling the multithreading flag in the code. If we use the same 32 cores as for the hybrid parallelization above, we observe the following run-time:

$ mpirun -n 32 ./step-48
...
   Performed 1706 time steps.
   Average wallclock time per time step: 0.0189041s
   Spent 0.968967s on output and 32.2504s on computations.

We observe slower speed for computations, but faster output (which makes sense, as output is only parallelized by MPI and not threads), whereas the computations are faster if we use hybrid parallelism in the given case.

Possibilities for extensions

There are several things in this program that could be improved to make it even more efficient (besides improved boundary conditions and physical stuff as discussed in step-25):

Faster evaluation of sine terms: As becomes obvious from the comparison of the plain wave equation and the sine-Gordon equation above, the evaluation of the sine terms dominates the total time for the finite element operator application. There are a few reasons for this: Firstly, the deal.II sine computation of a VectorizedArray field is not vectorized (as opposed to the rest of the operator application). This could be cured by handing the sine computation to a library with vectorized sine computations like Intel's math kernel library (MKL). By using the function vdSin in MKL, the program uses half the computing time in 2D and 40 percent less time in 3D. On the other hand, the sine computation is structurally much more complicated than the simple arithmetic operations like additions and multiplications in the rest of the local operation.
Higher order time stepping: While the implementation allows for arbitrary order in the spatial part (by adjusting the degree of the finite element), the time stepping scheme is a standard second-order leap-frog scheme. Since solutions in wave propagation problems are usually very smooth, the error is likely dominated by the time stepping part. Of course, this could be cured by using smaller time steps (at a fixed spatial resolution), but it would be more efficient to use higher order time stepping as well. While it would be straight-forward to do so for a first-order system (use some Runge–Kutta scheme of higher order, probably combined with adaptive time step selection like the Dormand–Prince method), it is more challenging for the second-order formulation. At least in the finite difference community, people usually use the PDE to find spatial correction terms that improve the temporal error.

The plain program

/* ---------------------------------------------------------------------
 *
 * Copyright (C) 2011 - 2017 by the deal.II authors
 *
 * This file is part of the deal.II library.
 *
 * The deal.II library is free software; you can use it, redistribute
 * it, and/or modify it under the terms of the GNU Lesser General
 * Public License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 * The full text of the license can be found in the file LICENSE.md at
 * the top level directory of deal.II.
 *
 * ---------------------------------------------------------------------
 *
 * Author: Katharina Kormann, Martin Kronbichler, Uppsala University, 2011-2012
 */
#include <deal.II/base/logstream.h>
#include <deal.II/base/utilities.h>
#include <deal.II/base/function.h>
#include <deal.II/base/conditional_ostream.h>
#include <deal.II/base/timer.h>
#include <deal.II/lac/vector.h>
#include <deal.II/grid/tria.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/lac/affine_constraints.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/numerics/vector_tools.h>
#include <deal.II/numerics/data_out.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/lac/la_parallel_vector.h>
#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <fstream>
#include <iostream>
#include <iomanip>
namespace Step48
{
  using namespace dealii;
  const unsigned int dimension = 2;
  const unsigned int fe_degree = 4;
  template <int dim, int fe_degree>
  class SineGordonOperation
  {
  public:
    SineGordonOperation(const MatrixFree<dim, double> &data_in,
                        const double                   time_step);
    void apply(LinearAlgebra::distributed::Vector<double> &dst,
               const std::vector<LinearAlgebra::distributed::Vector<double> *>
                 &src) const;
  private:
    const MatrixFree<dim, double> &            data;
    const VectorizedArray<double>              delta_t_sqr;
    LinearAlgebra::distributed::Vector<double> inv_mass_matrix;
    void local_apply(
      const MatrixFree<dim, double> &                                  data,
      LinearAlgebra::distributed::Vector<double> &                     dst,
      const std::vector<LinearAlgebra::distributed::Vector<double> *> &src,
      const std::pair<unsigned int, unsigned int> &cell_range) const;
  };
  template <int dim, int fe_degree>
  SineGordonOperation<dim, fe_degree>::SineGordonOperation(
    const MatrixFree<dim, double> &data_in,
    const double                   time_step)
    : data(data_in)
    , delta_t_sqr(make_vectorized_array(time_step * time_step))
  {
    VectorizedArray<double> one = make_vectorized_array(1.);
    data.initialize_dof_vector(inv_mass_matrix);
    FEEvaluation<dim, fe_degree> fe_eval(data);
    const unsigned int           n_q_points = fe_eval.n_q_points;
    for (unsigned int cell = 0; cell < data.n_macro_cells(); ++cell)
      {
        fe_eval.reinit(cell);
        for (unsigned int q = 0; q < n_q_points; ++q)
          fe_eval.submit_value(one, q);
        fe_eval.integrate(true, false);
        fe_eval.distribute_local_to_global(inv_mass_matrix);
      }
    inv_mass_matrix.compress(VectorOperation::add);
    for (unsigned int k = 0; k < inv_mass_matrix.local_size(); ++k)
      if (inv_mass_matrix.local_element(k) > 1e-15)
        inv_mass_matrix.local_element(k) =
          1. / inv_mass_matrix.local_element(k);
      else
        inv_mass_matrix.local_element(k) = 0;
  }
  template <int dim, int fe_degree>
  void SineGordonOperation<dim, fe_degree>::local_apply(
    const MatrixFree<dim> &                                          data,
    LinearAlgebra::distributed::Vector<double> &                     dst,
    const std::vector<LinearAlgebra::distributed::Vector<double> *> &src,
    const std::pair<unsigned int, unsigned int> &cell_range) const
  {
    AssertDimension(src.size(), 2);
    FEEvaluation<dim, fe_degree> current(data), old(data);
    for (unsigned int cell = cell_range.first; cell < cell_range.second; ++cell)
      {
        current.reinit(cell);
        old.reinit(cell);
        current.read_dof_values(*src[0]);
        old.read_dof_values(*src[1]);
        current.evaluate(true, true, false);
        old.evaluate(true, false, false);
        for (unsigned int q = 0; q < current.n_q_points; ++q)
          {
            const VectorizedArray<double> current_value = current.get_value(q);
            const VectorizedArray<double> old_value     = old.get_value(q);
            current.submit_value(2. * current_value - old_value -
                                   delta_t_sqr * std::sin(current_value),
                                 q);
            current.submit_gradient(-delta_t_sqr * current.get_gradient(q), q);
          }
        current.integrate(true, true);
        current.distribute_local_to_global(dst);
      }
  }
  template <int dim, int fe_degree>
  void SineGordonOperation<dim, fe_degree>::apply(
    LinearAlgebra::distributed::Vector<double> &                     dst,
    const std::vector<LinearAlgebra::distributed::Vector<double> *> &src) const
  {
    dst = 0;
    data.cell_loop(&SineGordonOperation<dim, fe_degree>::local_apply,
                   this,
                   dst,
                   src);
    dst.scale(inv_mass_matrix);
  }
  template <int dim>
  class ExactSolution : public Function<dim>
  {
  public:
    ExactSolution(const unsigned int n_components = 1, const double time = 0.)
      : Function<dim>(n_components, time)
    {}
    virtual double value(const Point<dim> & p,
                         const unsigned int component = 0) const override;
  };
  template <int dim>
  double ExactSolution<dim>::value(const Point<dim> &p,
                                   const unsigned int /* component */) const
  {
    double t = this->get_time();
    const double m  = 0.5;
    const double c1 = 0.;
    const double c2 = 0.;
    const double factor =
      (m / std::sqrt(1. - m * m) * std::sin(std::sqrt(1. - m * m) * t + c2));
    double result = 1.;
    for (unsigned int d = 0; d < dim; ++d)
      result *= -4. * std::atan(factor / std::cosh(m * p[d] + c1));
    return result;
  }
  template <int dim>
  class SineGordonProblem
  {
  public:
    SineGordonProblem();
    void run();
  private:
    ConditionalOStream pcout;
    void make_grid_and_dofs();
    void output_results(const unsigned int timestep_number);
#ifdef DEAL_II_WITH_P4EST
    parallel::distributed::Triangulation<dim> triangulation;
#else
    Triangulation<dim> triangulation;
#endif
    FE_Q<dim>                 fe;
    DoFHandler<dim>           dof_handler;
    AffineConstraints<double> constraints;
    IndexSet                  locally_relevant_dofs;
    MatrixFree<dim, double> matrix_free_data;
    LinearAlgebra::distributed::Vector<double> solution, old_solution,
      old_old_solution;
    const unsigned int n_global_refinements;
    double             time, time_step;
    const double       final_time;
    const double       cfl_number;
    const unsigned int output_timestep_skip;
  };
  template <int dim>
  SineGordonProblem<dim>::SineGordonProblem()
    : pcout(std::cout, Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)
    ,
#ifdef DEAL_II_WITH_P4EST
    triangulation(MPI_COMM_WORLD)
    ,
#endif
    fe(QGaussLobatto<1>(fe_degree + 1))
    , dof_handler(triangulation)
    , n_global_refinements(10 - 2 * dim)
    , time(-10)
    , time_step(10.)
    , final_time(10)
    , cfl_number(.1 / fe_degree)
    , output_timestep_skip(200)
  {}
  template <int dim>
  void SineGordonProblem<dim>::make_grid_and_dofs()
  {
    GridGenerator::hyper_cube(triangulation, -15, 15);
    triangulation.refine_global(n_global_refinements);
    {
      for (const auto &cell : triangulation.active_cell_iterators())
        if (cell->is_locally_owned())
          if (cell->center().norm() < 11)
            cell->set_refine_flag();
      triangulation.execute_coarsening_and_refinement();
      for (const auto &cell : triangulation.active_cell_iterators())
        if (cell->is_locally_owned())
          if (cell->center().norm() < 6)
            cell->set_refine_flag();
      triangulation.execute_coarsening_and_refinement();
    }
    pcout << "   Number of global active cells: "
#ifdef DEAL_II_WITH_P4EST
          << triangulation.n_global_active_cells()
#else
          << triangulation.n_active_cells()
#endif
          << std::endl;
    dof_handler.distribute_dofs(fe);
    pcout << "   Number of degrees of freedom: " << dof_handler.n_dofs()
          << std::endl;
    DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant_dofs);
    constraints.clear();
    constraints.reinit(locally_relevant_dofs);
    DoFTools::make_hanging_node_constraints(dof_handler, constraints);
    constraints.close();
    QGaussLobatto<1>                         quadrature(fe_degree + 1);
    typename MatrixFree<dim>::AdditionalData additional_data;
    additional_data.tasks_parallel_scheme =
      MatrixFree<dim>::AdditionalData::partition_partition;
    matrix_free_data.reinit(dof_handler,
                            constraints,
                            quadrature,
                            additional_data);
    matrix_free_data.initialize_dof_vector(solution);
    old_solution.reinit(solution);
    old_old_solution.reinit(solution);
  }
  template <int dim>
  void
  SineGordonProblem<dim>::output_results(const unsigned int timestep_number)
  {
    constraints.distribute(solution);
    Vector<float> norm_per_cell(triangulation.n_active_cells());
    solution.update_ghost_values();
    VectorTools::integrate_difference(dof_handler,
                                      solution,
                                      Functions::ZeroFunction<dim>(),
                                      norm_per_cell,
                                      QGauss<dim>(fe_degree + 1),
                                      VectorTools::L2_norm);
    const double solution_norm =
      VectorTools::compute_global_error(triangulation,
                                        norm_per_cell,
                                        VectorTools::L2_norm);
    pcout << "   Time:" << std::setw(8) << std::setprecision(3) << time
          << ", solution norm: " << std::setprecision(5) << std::setw(7)
          << solution_norm << std::endl;
    DataOut<dim> data_out;
    data_out.attach_dof_handler(dof_handler);
    data_out.add_data_vector(solution, "solution");
    data_out.build_patches();
    const std::string filename =
      "solution-" + Utilities::int_to_string(timestep_number, 3);
    std::ofstream output(
      filename + "." +
      Utilities::int_to_string(Utilities::MPI::this_mpi_process(MPI_COMM_WORLD),
                               4) +
      ".vtu");
    data_out.write_vtu(output);
    if (Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)
      {
        std::vector<std::string> filenames;
        for (unsigned int i = 0;
             i < Utilities::MPI::n_mpi_processes(MPI_COMM_WORLD);
             ++i)
          filenames.push_back("solution-" +
                              Utilities::int_to_string(timestep_number, 3) +
                              "." + Utilities::int_to_string(i, 4) + ".vtu");
        std::ofstream master_output((filename + ".pvtu"));
        data_out.write_pvtu_record(master_output, filenames);
      }
  }
  template <int dim>
  void SineGordonProblem<dim>::run()
  {
    make_grid_and_dofs();
    const double local_min_cell_diameter =
      triangulation.last()->diameter() / std::sqrt(dim);
    const double global_min_cell_diameter =
      -Utilities::MPI::max(-local_min_cell_diameter, MPI_COMM_WORLD);
    time_step = cfl_number * global_min_cell_diameter;
    time_step = (final_time - time) / (int((final_time - time) / time_step));
    pcout << "   Time step size: " << time_step
          << ", finest cell: " << global_min_cell_diameter << std::endl
          << std::endl;
    VectorTools::interpolate(dof_handler,
                             ExactSolution<dim>(1, time),
                             solution);
    VectorTools::interpolate(dof_handler,
                             ExactSolution<dim>(1, time - time_step),
                             old_solution);
    output_results(0);
    std::vector<LinearAlgebra::distributed::Vector<double> *>
      previous_solutions;
    previous_solutions.push_back(&old_solution);
    previous_solutions.push_back(&old_old_solution);
    SineGordonOperation<dim, fe_degree> sine_gordon_op(matrix_free_data,
                                                       time_step);
    unsigned int timestep_number = 1;
    Timer  timer;
    double wtime       = 0;
    double output_time = 0;
    for (time += time_step; time <= final_time;
         time += time_step, ++timestep_number)
      {
        timer.restart();
        old_old_solution.swap(old_solution);
        old_solution.swap(solution);
        sine_gordon_op.apply(solution, previous_solutions);
        wtime += timer.wall_time();
        timer.restart();
        if (timestep_number % output_timestep_skip == 0)
          output_results(timestep_number / output_timestep_skip);
        output_time += timer.wall_time();
      }
    timer.restart();
    output_results(timestep_number / output_timestep_skip + 1);
    output_time += timer.wall_time();
    pcout << std::endl
          << "   Performed " << timestep_number << " time steps." << std::endl;
    pcout << "   Average wallclock time per time step: "
          << wtime / timestep_number << "s" << std::endl;
    pcout << "   Spent " << output_time << "s on output and " << wtime
          << "s on computations." << std::endl;
  }
} // namespace Step48
int main(int argc, char **argv)
{
  using namespace Step48;
  using namespace dealii;
  Utilities::MPI::MPI_InitFinalize mpi_initialization(
    argc, argv, numbers::invalid_unsigned_int);
  try
    {
      SineGordonProblem<dimension> sg_problem;
      sg_problem.run();
    }
  catch (std::exception &exc)
    {
      std::cerr << std::endl
                << std::endl
                << "----------------------------------------------------"
                << std::endl;
      std::cerr << "Exception on processing: " << std::endl
                << exc.what() << std::endl
                << "Aborting!" << std::endl
                << "----------------------------------------------------"
                << std::endl;
      return 1;
    }
  catch (...)
    {
      std::cerr << std::endl
                << std::endl
                << "----------------------------------------------------"
                << std::endl;
      std::cerr << "Unknown exception!" << std::endl
                << "Aborting!" << std::endl
                << "----------------------------------------------------"
                << std::endl;
      return 1;
    }
  return 0;
}