A little about speeding up the program: parallelization (manual or automatic) based on super-optimistic calculations

Hello dear readers. In this publication, we will talk about such (already become familiar) thing as speeding up the program by using parallel computations. The technologies for organizing such computations are known - this is both ordinary multithreaded programming and the use of special interfaces: OpenMP, OpenAcc, MPI, DVM and many others (in this case, loops are parallelized, vectorization or pipelining is used, lazy computations are organized, independent program blocks are allocated that can be run in parallel, etc.).



In this case, they usually proceed from the idea that parallelization should not somehow affect the results of program execution. This is a tough requirement, but fair in many cases. However, if we try to parallelize a program that performs some calculations by numerical methods (we train a neural network, simulate the dynamics of a fluid or a molecular system, solve ordinary differential equations or optimization problems), then the result will (in any case) have some error. Therefore, why not apply “risky” parallelization technologies, which can introduce a small additional error in the mathematical solution,but allow you to get some more additional acceleration? One of such technologies - the splitting of loop bodies with predicting intermediate results and rolling back in case of unsuccessful prediction (in fact, this is "over-optimistic" calculations in partially transactional memory) will be discussed.



Parallelization idea



Suppose we have a cycle, the body of which consists of two consecutive parts, and the second part depends on the first. Let the individual loops of the cycle depend on each other. For instance:



for (int i = 0; i < N; i++) {
	x = f(y);
	y = g(x);
}


At first glance, such a loop cannot be parallelized. However, we will try. Let's try to execute the first and second operators of the loop body in parallel. The problem is that at the time of calculating g (x) x must be known, but it will only be calculated at the end of the first part. Well, let's introduce some scheme, which at the beginning of the second part will try to predict the new value of x. This can be done, for example, with the help of linear predictive, which "learns" to predict a new value of x, based on the "history" of its change. Then the second part can be read in parallel with the first (this is "over-optimism"), and when both are calculated, compare the predicted value of x with the real one obtained at the end of the first part. If they are approximately equal, then the result of calculations of both parts can be accepted (and go to the next iteration of the loop).And if they are very different, then only the second part will need to be recounted. With this scheme, in some part of the cases, we will get pure parallelization, in the rest - the actual sequential count. The loop execution algorithm is as follows:



for (int i = 0; i < N; i++) {
	    {
		  1 –  x = f(y).        x;
		  2 –   x*   y* = g(x*).   x        x*.   ,  y = y*    .   ,     : y = g(x). 
	}
}


The basic algorithm is clear. The theoretical acceleration is two times, but in practice it will, of course, be less, because: a) part of the time is spent on predictions and coordination; b) not all iterations will be executed in parallel; c) the first and second parts of the cycle body can have different labor intensity (ideally, equal is required). Let's move on to the implementation.



Parallelization Implementation - Over-Optimistic Computing



Since the parallelization algorithm deals with canceling some of the calculations (in case of failure) and re-executing them, there is clearly something from the idea of ​​working in transactional memory. Better - in a partially transactional one, where certain variables work according to the transactional memory scheme, and the rest of the variables work as usual. The transfer of data from the first part to the second can be organized using some special channel. Let this channel be predictive: a) if at the time of reception the data has already been transmitted to the channel, then they are read from it, b) if at the time of reception the data has not yet arrived in the channel, then it tries to predict this data and returns the prediction result. This channel will work according to a scheme that is somewhat unusual for conventional transactional memory:If at the end of the transaction of the second part of the cycle there is a discrepancy between the data received into the channel and the data predicted by it, then the transaction of the second part of the cycle is canceled and re-executed, and not “predictions” will be read from the channel, but the data actually received. The cycle will look like:



for (int i = 0; i < N; i++) {
	   ,     {
		 1 ( 1):
			x = f(y);
			_.put(x);
		 2 ( 2):
			_.get(x);
			y = g(x);
	}
}


Done. The channel took care of the data prediction, while the transaction memory took care of canceling calculations in case of an overly optimistic prediction.



Some useful applications: neural networks, particle-in-cell method



Such a scheme for parallelizing the loop body can be used in a number of mathematical algorithms, for example, when modeling an electrostatic lens using the particle-in-cells method, as well as when training a feedforward neural network using the backpropagation method. The first task is very special, so I will not discuss it here, I will only say that the described approach to parallelization gave an acceleration of 10-15%. But the second task is already more popular, so it is simply necessary to say a few words about it.



The training cycle of the neural network includes a sequential pass through the training pairs, and for each pair, a forward move (calculation of the network output) and a reverse move (correction of weights and biases) are performed. These are the two parts of the body of the loop for training pairs, and to parallelize them, you can apply the above approach (by the way, it can also be applied with a parallel pass through training pairs, with minor changes). As a result, on a typical neural network training task, I got a 50% gain in performance.



Automation of parallelization of C programs



The idea of ​​super-optimistic computations is not very difficult, so a special translator program was written that deals with automatic parallelization - it finds loops in the original C-program for which such parallelization can give a positive result and splits their bodies into two parts, inserting the necessary OpenMP directives, finding potential variables for channels, connecting a library for working with partially transactional memory and predictive channels and, ultimately, generating an output parallelized program.



In particular, such a translator was applied to an electrostatic lens simulation program. I will give both programs - the original one (which includes a directive indicating the parallelization of loops) and the one obtained after translation.



Original program:



#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#pragma auto parallelize
#pragma auto pure(malloc,fabs,free,sizeof,omp_get_wtime)
#define theta 1.83
#define NX 40
#define NY 40
#define h 0.1
#define NP 15000
//   
#define U1 200
#define U2 5000
#define e -1.5E-13
#define m 1E-11
#define e0 8.85E-12
#define V (h*h)
#define tau 0.000015
#define T 0.09
#define POISSON_EPS 0.01
#define TOL_EPS 0.25

int main() {
        double * U  = (double *)malloc(NY*NX*sizeof(double));
        double * UU = (double *)malloc(NY*NX*sizeof(double));
        double * EX = (double *)malloc(NY*NX*sizeof(double));
        double * EY = (double *)malloc(NY*NX*sizeof(double));
	double * PX = (double *)malloc(NP*sizeof(double));
	double * PY = (double *)malloc(NP*sizeof(double));
	int * X = (int *)malloc(NP*sizeof(int));
	int * Y = (int *)malloc(NP*sizeof(int));

	double ro[NY][NX];

	split_private double t;
	split_private double tm;
	split_private int i, j;

	for (i = 0; i < NY; i++)
		for (j = 0; j < NX; j++) {
			UU[i*NX+j] = j == NX-1 ? U2 : j == NX/2 && (i < NY/4 || i > 3*NY/4) ? U1 : 0.0;
			EX[i*NX+j] = 0.0;
			EY[i*NX+j] = 0.0;
		}
	for (i = 0; i < NP; i++) {
		int x, y;

		PX[i] = 0.5*NX*h*rand()/RAND_MAX;
		PY[i] = NY*h*rand()/RAND_MAX;

		x = PX[i]/h;
		y = PY[i]/h;
		if (x < 0) x = 0;
		else if (x > NX-1) x = NX-1;
		if (y < 0) y = 0;
		else if (y > NY-1) y = NY-1;

		X[i] = x;
		Y[i] = y;
	}

	tm = omp_get_wtime();

	for (t = 0.0; t < T; t += tau) {
		unsigned int n[NY][NX] = { 0 };
		double err;
		int ptr = 0;
		for (i = 0; i < NY; i++)
    			for (j = 0; j < NX; j++, ptr++)
				U[ptr] = UU[ptr];
		for (i = 1; i < NY - 1; i++)
			for (j = 1; j < NX - 1; j++) {
				EX[i*NX+j] = -(U[i*NX+j+1]-U[i*NX+j-1])/2.0/h;
				EY[i*NX+j] = -(U[(i+1)*NX+j]-U[(i-1)*NX+j])/2.0/h;
			}
						
		for (i = 0; i < NP; i++) {
			PX[i] += tau*e*EX[Y[i]*NX+X[i]]/m;
			PY[i] += tau*e*EY[Y[i]*NX+X[i]]/m;
		}

		for (i = 0; i < NP; i++) {
			int x = PX[i]/h;
			int y = PY[i]/h;
			if (x < 0) x = 0;
			else if (x > NX-1) x = NX-1;
			if (y < 0) y = 0;
			else if (y > NY-1) y = NY-1;

			Y[i] = y;
			X[i] = x;
			n[y][x]++;
		}

		for (i = 0; i < NY; i++)
			for (j = 0; j < NX; j++)
				ro[i][j] = n[i][j]*e/V;

		do {
			err = 0.0;
	
			for (i = 1; i < NY - 1; i++)
				for (j = 1+(i-1)%2; j < NX - 1; j+=2) {
				  int ptr = i*NX + j;
				  if (!(j == NX/2 && (i < NY/4 || i > 3*NY/4))) {
					double _new = (1-theta)*UU[ptr] + theta/4.0*(UU[ptr-1]+UU[ptr+1]+UU[ptr+NX]+UU[ptr-NX]-h*h*ro[i][j]/e0);
					double loc_err = fabs(UU[ptr] - _new);
					if (loc_err > err) err = loc_err;
					UU[ptr] = _new;
				  }
				}
			for (i = 1; i < NY - 1; i++)
				for (j = 1+i%2; j < NX - 1; j+=2) {
				  int ptr = i*NX + j;
				  if (!(j == NX/2 && (i < NY/4 || i > 3*NY/4))) {
					double _new = (1-theta)*UU[ptr] + theta/4.0*(UU[ptr-1]+UU[ptr+1]+UU[ptr+NX]+UU[ptr-NX]-h*h*ro[i][j]/e0);
					double loc_err = fabs(UU[ptr] - _new);
					if (loc_err > err) err = loc_err;
					UU[ptr] = _new;
				  }
				}
			for (j = 0; j < NX; j++) {
				UU[j] = UU[NX + j];
				UU[(NY-1)*NX + j] = UU[(NY-2)*NX + j];
			}
		} while (err > POISSON_EPS);
	}

	for (i = 0; i < NY; i++) {
		for (j = 0; j < NX; j++)
			printf("%lf\t", UU[i*NX+j]);
		printf("\n");
	}

	return 0;
}


Automatically parallel program



#include "transact.h"
#define split_private /* split-private */
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define theta 1.83
#define NX 40
#define NY 40
#define h 0.1
#define NP 15000
#define U1 200
#define U2 5000
#define e -1.5E-13
#define m 1E-11
#define e0 8.85E-12
#define V (h*h)
#define tau 0.000015
#define T 0.09
#define POISSON_EPS 0.01
#define TOL_EPS 0.25

int  main(  ){
  double * U  = (double *)malloc(NY*NX*sizeof(double));
  double * UU = (double *)malloc(NY*NX*sizeof(double));
  double * EX = (double *)malloc(NY*NX*sizeof(double));
  double * EY = (double *)malloc(NY*NX*sizeof(double));
  double * PX = (double *)malloc(NP*sizeof(double));
  double * PY = (double *)malloc(NP*sizeof(double));
  int * X = (int *)malloc(NP*sizeof(int));
  int * Y = (int *)malloc(NP*sizeof(int));
  double ro[NY][NX];
  split_private double t;
  split_private double tm;
  split_private int i, j;
  for ( i = 0; i < NY; i++ )
    for ( j = 0; j < NX; j++ )
      {
        UU[i*NX+j] = j == NX-1 ? U2 : j == NX/2 && (i < NY/4 || i > 3*NY/4) ? U1 : 0.0;
        EX[i*NX+j] = 0.0;
        EY[i*NX+j] = 0.0;
      }
  for ( i = 0; i < NP; i++ )
    {
      int x, y;
      PX[i] = 0.5*NX*h*rand()/RAND_MAX;
      PY[i] = NY*h*rand()/RAND_MAX;
      x = PX[i]/h;
      y = PY[i]/h;
      if ( x < 0 )
        x = 0;
      else
        if ( x > NX-1 )
          x = NX-1;
      if ( y < 0 )
        y = 0;
      else
        if ( y > NY-1 )
          y = NY-1;
      X[i] = x;
      Y[i] = y;
    }
  tm = omp_get_wtime();
#pragma omp parallel num_threads(2) private(t,tm,i,j) 
  {
    int __id__ = omp_get_thread_num();
    TOut<double > * out_ro = __id__ == 0 ? new TOut<double >("ro63", (NY)*(NX), 2, 0.01, -1, "63") : NULL;
    TIn<double > * in_ro = __id__ == 1 ? new TIn<double >("ro63", (NY)*(NX), 2, 0.01, -1, "63") : NULL;
    for ( t = 0.0; t < T; t += tau )
      {
        unsigned int n[NY][NX] = { 0 };
        double err;
        int ptr = 0;
        if ( __id__ == 0 )
          {
            for ( i = 0; i < NY; i++ )
              for ( j = 0; j < NX; j++, ptr++ )
                U[ptr] = UU[ptr];
          }
transaction_atomic("63")
        {
          if ( __id__ == 0 )
            {
              for ( i = 1; i < NY - 1; i++ )
                for ( j = 1; j < NX - 1; j++ )
                  {
                    EX[i*NX+j] = -(U[i*NX+j+1]-U[i*NX+j-1])/2.0/h;
                    EY[i*NX+j] = -(U[(i+1)*NX+j]-U[(i-1)*NX+j])/2.0/h;
                  }

              for ( i = 0; i < NP; i++ )
                {
                  PX[i] += tau*e*EX[Y[i]*NX+X[i]]/m;
                  PY[i] += tau*e*EY[Y[i]*NX+X[i]]/m;
                }

              for ( i = 0; i < NP; i++ )
                {
                  int x = PX[i]/h;
                  int y = PY[i]/h;
                  if ( x < 0 )
                    x = 0;
                  else
                    if ( x > NX-1 )
                      x = NX-1;
                  if ( y < 0 )
                    y = 0;
                  else
                    if ( y > NY-1 )
                      y = NY-1;
                  Y[i] = y;
                  X[i] = x;
                  n[y][x]++;
                }
              for ( i = 0; i < NY; i++ )
                for ( j = 0; j < NX; j++ )
                  ro[i][j] = n[i][j]*e/V;
              out_ro->put((double  *)ro);
            }
          else
            {
              double  ro[NY][NX];
              in_ro->get((double  *)ro, 0);
              do
                {
                  err = 0.0;

                  for ( i = 1; i < NY - 1; i++ )
                    for ( j = 1+(i-1)%2; j < NX - 1; j+=2 )
                      {
                        int ptr = i*NX + j;
                        if ( !(j == NX/2 && (i < NY/4 || i > 3*NY/4)) )
                          {
                            double _new = (1-theta)*UU[ptr] + theta/4.0*(UU[ptr-1]+UU[ptr+1]+UU[ptr+NX]+UU[ptr-NX]-h*h*ro[i][j]/e0);
                            double loc_err = fabs(UU[ptr] - _new);
                            if ( loc_err > err )
                              err = loc_err;
                            UU[ptr] = _new;
                          }
                      }
                  for ( i = 1; i < NY - 1; i++ )
                    for ( j = 1+i%2; j < NX - 1; j+=2 )
                      {
                        int ptr = i*NX + j;
                        if ( !(j == NX/2 && (i < NY/4 || i > 3*NY/4)) )
                          {
                            double _new = (1-theta)*UU[ptr] + theta/4.0*(UU[ptr-1]+UU[ptr+1]+UU[ptr+NX]+UU[ptr-NX]-h*h*ro[i][j]/e0);
                            double loc_err = fabs(UU[ptr] - _new);
                            if ( loc_err > err )
                              err = loc_err;
                            UU[ptr] = _new;
                          }
                      }
                  for ( j = 0; j < NX; j++ )
                    {
                      UU[j] = UU[NX + j];
                      UU[(NY-1)*NX + j] = UU[(NY-2)*NX + j];
                    }
                }
              while ( err > POISSON_EPS )
                ;
            }
        }
      }
    delete in_ro;
    delete out_ro;
  }
  for ( i = 0; i < NY; i++ )
    {
      for ( j = 0; j < NX; j++ )
        printf("%lf\t", UU[i*NX+j]);
      printf("\n");
    }
  return 0;
}


Outcome



So, sometimes you can try to parallelize the program even in cases when it consists of strictly sequential fragments, and even get positive results in speeding up (in my experiments, the speedup is increased from 15 to 50%). I hope that this short article will be useful to someone.



All Articles