assignment_3

Ciaran Cox (1115773)

MA5605: Financial Computing 3 Assignment

0.1 Explicit Time Stepping (C File Appendix .1)

Revisiting the explicit time stepping problem from task 2 of assignment 2, the code can be parallelised in multiple places. Each entry of the updated vector along x is computed from 3 values of the vector at the previous time level only. The iterations of this for loop are therefore independent, so the loop can be run in parallel and the order in which the elements of the new time vector are filled in does not matter.

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=1;i<N;i++)

Unew[i]=alpha*Uold[i-1]+(1-2*alpha)*Uold[i]

+alpha*Uold[i+1]+k*Function(a+i*h,t-k);

The same technique can be used for the initial condition and for copying the new x vector over the old one at each step up through time. A sum reduction can be used for the error summation.

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:sum)

for(i=0;i<=N;i++)

sum+=pow(check(a+i*h,T)-Uold[i],2);

Each thread receives its own part of the summation, and the reduction clause combines the partial sums at the end of the for loop. Computation times were measured for 1 thread and 2 threads. The tabulated times are averages over 3 runs and are listed together with the convergence of the error.

Table 1: Times (s) and errors for the parallelised algorithm

M         N      Error         1 thread    2 threads
131072    128    8.41137e-05   1.90976     1.36499
524288    256    2.97356e-05   15.158      9.24518
2097152   512    1.05128e-05   119.0797    70.27753
8388608   1024   3.71683e-06   956.022     684.3453

As M and N increase the error converges, while the computation time increases. Two threads are quicker than one thread for every problem size, by a factor of roughly 1.4 to 1.7, because a portion of the formerly serial code now runs in parallel. The algorithm was then run on a cluster; the average times over 3 runs, along with the speed-ups, are shown below.


Table 2: Times (s) and speed-ups on the cluster

M         N      Quantity    1 thread   2 threads   4 threads   8 threads
131072    128    Time        2.2872     1.82623     1.4034      1.49945
                 Speed-up               1.2524      1.6298      1.5254
524288    256    Time        17.3948    10.9538     6.8831      7.0257
                 Speed-up               1.588       2.5272      2.4759
2097152   512    Time        134.877    77.9357     46.2854     36.0823
                 Speed-up               1.7306      2.914       3.738
8388608   1024   Time        1074.44    581.364     330.6903    215.02
                 Speed-up               1.8481      3.2491      4.9969

For the first two rows the speed-up falls when moving from 4 threads to 8 threads, so the maximum speed-up for these problem sizes is reached at 4 threads. Adding further threads decreases the speed-up because too little work is distributed to each individual thread, and the threads spend their time waiting on one another. For the larger values of M the speed-up keeps growing up to 8 threads, because each thread has more to do.

0.2 Implicit Time Stepping - Serial Case (C File Appendix .2)

The initial condition gives the first time vector for 0 < n < N. Advancing one time level implicitly gives an (N − 1) × (N − 1) tri-diagonal system, in which the forcing term is added to each value of the previous time vector. The left and right boundary conditions must also be added to the first and last (the (N − 1)th) entries respectively. The corresponding matrix system is shown below.

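\[
\begin{pmatrix}
1+2\alpha & -\alpha & 0 & \cdots & 0 \\
-\alpha & 1+2\alpha & -\alpha & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -\alpha & 1+2\alpha & -\alpha \\
0 & \cdots & 0 & -\alpha & 1+2\alpha
\end{pmatrix}
\begin{pmatrix}
U^{m}_{1} \\ U^{m}_{2} \\ \vdots \\ U^{m}_{N-2} \\ U^{m}_{N-1}
\end{pmatrix}
=
\begin{pmatrix}
U^{m-1}_{1} + k f(x_1, t_m) + \alpha U_L(t_m) \\
U^{m-1}_{2} + k f(x_2, t_m) \\
\vdots \\
U^{m-1}_{N-2} + k f(x_{N-2}, t_m) \\
U^{m-1}_{N-1} + k f(x_{N-1}, t_m) + \alpha U_R(t_m)
\end{pmatrix}
\]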

The linear system is solved using the CG-algorithm with tolerance ε = 1e−10. After each step forward through time, the right-hand side is rebuilt from the previous solution together with the updated forcing term and boundary conditions. The initial guess passed into the CG-algorithm is the zero vector. The errors and the average number of CG iterations are shown below.


Table 3: Implicit time stepping errors, times (s) and average CG iterations

M         N      Error         Time       Avg. CG iterations
131072    256    3.11153e-05   21.654     23.9947
524288    512    1.09378e-05   156.757    22.9984
2097152   1024   3.7316e-06    1267.71    22.99
8388608   2048   9.6833e-07    10413.3    22.9993

The error converges strongly as N and M are increased, while the average number of CG iterations stays at about 23. Computation times are considerably larger than for the previous explicit method, reaching just under three hours (10413.3 s) for the final computation.

0.3 Implicit Time Stepping - OpenMP (C File Appendix .3)

The previous implicit algorithm was parallelised in multiple places, including the CG-algorithm. Setting the initial conditions and adding the forcing term to each component of the r.h.s. of the system at each time step are each done with a simple parallel for pragma. Inside the CG-algorithm, the initial residual and the two squared norms (of the r.h.s. and of the residual) are computed in parallel. Inside the while loop, the pq and rr computations use the reduction clause, and the updates of p, x and r, along with the calculation of q, are also done in parallel. Once out of the function, the old vector is replaced by the new solution vector in parallel. Finally, reduction clauses are used for the average of the iterations and for the error summation.

Table 4: Times (s) and errors for the parallelised algorithm

M         N      Error         1 thread   2 threads
131072    256    3.11153e-05   25.2604    24.9901
524288    512    1.09378e-05   189.347    142.821
2097152   1024   3.7316e-06    1456.87    990.64
8388608   2048   9.6833e-07    11413.5    7180.39

With these changes the errors remain the same. The benefit of a second thread is negligible for the smallest problem (24.9901 s against 25.2604 s) and grows with the problem size. Because of the extra lines of code for parallelising, the computation time on 1 thread is longer than running the program in serial. On the cluster, the bigger problems ran quicker as more threads were used, because more time was spent on processing than on communication between threads. The speed-ups for different numbers of threads on the cluster are tabulated below; each reading is an average of three runs.

Table 5: Times (s) and speed-ups on the cluster

M         N      Quantity    1 thread   2 threads   4 threads   8 threads
131072    256    Time        27.4566    50.8358     61.1607     89.3802
                 Speed-up               0.5401      0.4489      0.3072
524288    512    Time        199.298    249.5213    258.126     348.0613
                 Speed-up               0.7987      0.7721      0.5726
2097152   1024   Time        1535.14    1357.613    1251.647    1511.127
                 Speed-up               1.1308      1.2265      1.0159
8388608   2048   Time        11542.6    8407.076    6232.403    6782.21
                 Speed-up               1.373       1.852       1.702

The speed-up clearly decreases with the number of threads for the smaller problems (it is below 1, so these runs are slower than on 1 thread), but it increases for the bigger problems. For M = 2097152, N = 1024 a speed-up is achieved up to 4 threads, but with 8 threads the time returns to roughly the original 1 thread time. For the biggest problem the speed-up increases up to 4 threads and then decreases on moving to 8 threads. In conclusion, 8 threads are inefficient for this problem.

Going beyond OpenMP

In the CG-algorithm the two initial norms, the residual squared and the r.h.s. squared, are independent of each other and could be computed on separate machines. In the while loop the updates of x and r could likewise be done on separate machines. The average number of iterations and the error summation do not rely on each other, so they too could be done on separate machines. These few changes may speed up the algorithm, but communication time between machines then becomes another factor. An alternative is to partition the initial condition and formulate intermediate boundary conditions between the partitions, so that each partition becomes a separate boundary value problem. Each partition could then be computed on a separate machine and the partitions brought back together at maturity. An explicit method would need to be used, to avoid solving a tri-diagonal linear system of equations at each point in time. The partitioning boundary conditions would need to be computed first, and each partition then solved explicitly in a for loop at each point in time.


.1 Task 1

/*Ciaran Cox (1115773)*/

/*relevant libraries*/

#include<stdio.h>

#include<stdlib.h>

#include<math.h>

#include<omp.h>

#define PI 3.14159265358979323846264338327950288

#define NUM_THREADS 1

/*Function prototypes*/

double Function(double, double);

double leftbound(double);

double rightbound(double);

double initial(double);

double check(double, double);

/*main function*/

int main()

{

/*Defining variables and parameters*/

long int M,N,i,j;

double a,b,h,T,k,alpha,*Unew,*Uold,error,sum=0,t1,t2;

int status1, status2;

/*Input from user for N and M*/

printf("Enter N:\n");

status1=scanf("%ld",&N);

printf("Enter M:\n");

status2=scanf("%ld",&M);

/*checking input is valid*/

if(status1!=1 || status2!=1)

{

printf("incorrect input...exiting\n");

exit(1);

}

printf("M\tN\terror\t\ttime\n");


t1=omp_get_wtime();

/*Dynamically allocating memory*/

if((Unew=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);

if((Uold=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);

a=0; b=1; T=2; h=(b-a)/N; k=T/M; alpha=k*(1/pow(h,2));

/*Initial conditions*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=0;i<=N;i++)

{

Uold[i]=initial(a+i*h);

}

/*Iterating up through time*/

for(j=1;j<=M;j++)

{

double t = j*k;

Unew[0]=leftbound(t);

Unew[N]=rightbound(t);

/*solving explicitly*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=1;i<N;i++)

{

Unew[i]=alpha*Uold[i-1]+(1-2*alpha)*Uold[i]

+alpha*Uold[i+1]+k*Function(a+i*h,t-k);

}

/*replacing old vector with new vector*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=0;i<=N;i++)

{

Uold[i]=Unew[i];

}

}

/*Computing error*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:sum)

for(i=0;i<=N;i++)


{

sum+=pow(check(a+i*h,T)-Uold[i],2);

}

error=sqrt(sum);

/*freeing memory*/

free(Unew); free(Uold);

t2=omp_get_wtime();

/*prints results for required input from user*/

printf("%ld\t%ld\t%lg\t%lg\n\n\n",M,N,error,t2-t1);

}

/*Function for the forcing term*/

double Function(double x, double t)

{

double out;

out=exp(-2*t)*cos(2*PI*x)*(-PI*cos(PI*t)

+(4*pow(PI,2)-2)*(1-sin(PI*t)));

return out;

}

/*Function for the left boundary condition*/

double leftbound(double t)

{

double out;

out=check(0.0,t);

return out;

}

/*Function for the right boundary condition*/

double rightbound(double t)

{

double out;

out=check(1.0,t);

return out;

}

/*Function for the initial condition*/

double initial(double x)


{

double out;

out=check(x,0.0);

return out;

}

/*Function for the exact solution*/

double check(double x, double t)

{

double out;

out=(1-sin(PI*t))*exp(-2*t)*cos(2*PI*x);

return out;

}

.2 Task 2



#include<stdio.h>

#include<stdlib.h>

#include<math.h>

#include<omp.h>

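#define PI (4.0*atan(1.0))

/*Function prototypes*/

double Function(double, double);

double leftbound(double);

double rightbound(double);

double initial(double);

double check(double, double);

int cg(double, double, int, double*, double*, double);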

/*main function*/

int main()

{


long int M,N,i,j;


double a,b,h,T,k,alpha,error,sum=0.0,t1,t2,*xx,

*it=NULL,*bb0=NULL,*sol=NULL,itav=0.0,eps;

int status1, status2;


printf("Enter N:\n");

status1=scanf("%ld",&N);

printf("Enter M:\n");

status2=scanf("%ld",&M);

/*checking input is valid*/

if(status1!=1 || status2!=1)

{

printf("incorrect input...exiting\n");

exit(1);

}

printf("M\tN\terror\t\ttime\tCG-iterations\n");

t1=omp_get_wtime();


if((xx=(double*)malloc((N-1)*sizeof(double)))==NULL) exit(1);

if((bb0=(double*)malloc((N-1)*sizeof(double)))==NULL) exit(1);

if((it=(double*)malloc(M*sizeof(double)))==NULL) exit(1);

if((sol=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);

a=0.0; b=1.0; T=2.0; h=(b-a)/N; k=T/M; alpha=k/h/h; eps=1e-10;

/*initial conditions*/

for(i=1;i<=N-1;i++)

{

bb0[i-1]=initial(a+i*h);

}


for(j=1;j<=M;j++)

{

/*setting guess to zero*/

for(i=0;i<N-1;i++)

{

xx[i]=0;


}

/*adjusting for forcing term*/

for(i=1;i<=N-1;i++)

{

bb0[i-1]+=k*Function(a+i*h,j*k);

}

/*incorporating boundary conditions*/

bb0[0]+=leftbound(j*k)*alpha;

bb0[N-2]+=rightbound(j*k)*alpha;

/*CG-Algorithm, solving linear system*/

it[j-1]=cg(1+2*alpha,-alpha,N-1,&*xx,bb0,eps);

/*replacing old vector with new solution vector*/

for(i=0;i<N-1;i++)

{

bb0[i]=xx[i];

}

}

/*creating solution vector for error computation*/

for(i=1;i<=N-1;i++)

{

sol[i]=bb0[i-1];

}

sol[0]=leftbound(T); sol[N]=rightbound(T);

/*average of CG iterations*/

for(j=0;j<M;j++)

{

itav+=it[j];

}

itav=itav/M;

/*Computing error*/

for(i=0;i<=N;i++)

{

sum+=pow(check(a+i*h,T)-sol[i],2);

}


error=sqrt(sum);

/*freeing memory*/

free(xx); free(bb0); free(it); free(sol);

t2=omp_get_wtime();

/*prints results for required input from user*/

printf("%ld\t%ld\t%lg\t%lg\t%lg\n\n\n",M,N,error,t2-t1,itav);

}



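/*Function for the forcing term*/

double Function(double x, double t)

{

double out;

out=exp(-2*t)*cos(2*PI*x)*(-PI*cos(PI*t)

+(4*pow(PI,2)-2)*(1-sin(PI*t)));

return out;

}

/*Function for the left boundary condition*/

double leftbound(double t)

{

double out;

out=check(0.0,t);

return out;

}

/*Function for the right boundary condition*/

double rightbound(double t)

{

double out;

out=check(1.0,t);

return out;

}

/*Function for the initial condition*/

double initial(double x)

{

double out;

out=check(x,0.0);

return out;

}

/*Function for the exact solution*/

double check(double x, double t)

{

double out;

out=(1-sin(PI*t))*exp(-2*t)*cos(2*PI*x);

return out;

}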

/*Function for CG-Algorithm*/

int cg(double A,double B,int n,double *x,double *b,double eps)

{

int i,j,k;

double rr,pq,bb,alpha,beta,rrold;

double *r=NULL,*p=NULL,*q=NULL;

k=0;

if (n<=2) { return 0; }

if( (r=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

if( (p=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

if( (q=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

r[0]=b[0]-(A*x[0]+B*x[0+1]);

for (i=1;i<n-1;i++)

{

r[i]=b[i]-(B*x[i-1]+A*x[i]+B*x[i+1]);

}

r[n-1]=b[n-1]-(B*x[n-2]+A*x[n-1]);

bb=0;

for (i=0;i<n;i++)

{

bb+=b[i]*b[i];

}

rr=0;

for (i=0;i<n;i++)

{


rr+=r[i]*r[i];

}

k=0;

while (sqrt(rr)>eps*sqrt(bb))

{

k=k+1;

if (k==1)

{

for (i=0;i<n;i++)

{

p[i]=r[i];

}

beta=0;

}

else

{

beta=rr/rrold;

for (i=0;i<n;i++)

{

p[i]=r[i]+beta*p[i];

}

}

q[0]=A*p[0]+B*p[0+1];

for (i=1;i<n-1;i++)

{

q[i]=B*p[i-1]+A*p[i]+B*p[i+1];

}

q[n-1]=B*p[n-2]+A*p[n-1];

pq=0;

for (i=0;i<n;i++)

{

pq+=p[i]*q[i];

}

alpha=rr/pq;


for (i=0;i<n;i++)

{

x[i]=x[i]+alpha*p[i];

r[i]=r[i]-alpha*q[i];

}

rrold=rr;

rr=0;

for (i=0;i<n;i++)

{

rr+=r[i]*r[i];

}

}

free(p); free(q); free(r);

return k;

}

.3 Task 3



#include<stdio.h>

#include<stdlib.h>

#include<math.h>

#include<omp.h>

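#define PI (4.0*atan(1.0))

#define NUM_THREADS 2

/*Function prototypes*/

double Function(double, double);

double leftbound(double);

double rightbound(double);

double initial(double);

double check(double, double);

int cg(double, double, int, double*, double*, double);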

/*main function*/


int main()

{


long int M,N,i,j;

double a,b,h,T,k,alpha,error,sum=0.0,t1,t2,*xx,

*it=NULL,*bb0=NULL,*sol=NULL,itav=0.0,eps;


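/*Input from user for N and M, checking each read succeeded*/

printf("Enter N:\n");

if(scanf("%ld",&N)!=1)

{

printf("Input not integer\n");

exit(1);

}

printf("Enter M:\n");

if(scanf("%ld",&M)!=1)

{

printf("Input not integer\n");

exit(1);

}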

printf("M\tN\terror\t\ttime\tcg-iterations\n");

t1=omp_get_wtime();


if((xx=(double*)malloc((N-1)*sizeof(double)))==NULL) exit(1);

if((bb0=(double*)malloc((N-1)*sizeof(double)))==NULL) exit(1);

if((it=(double*)malloc(M*sizeof(double)))==NULL) exit(1);

if((sol=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);

a=0.0; b=1.0; T=2.0; h=(b-a)/N; k=T/M; alpha=k/h/h; eps=1e-10;

/*Initial conditions*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=1;i<=N-1;i++)

{

bb0[i-1]=initial(a+i*h);

}


for(j=1;j<=M;j++)

{

/*setting guess to zero*/

for(i=0;i<N-1;i++)

{


xx[i]=0;

}

/*Adjusting for forcing term*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=1;i<=N-1;i++)

{

bb0[i-1]+=k*Function(a+i*h,j*k);

}

/*boundary conditions*/

bb0[0]+=leftbound(j*k)*alpha;

bb0[N-2]+=rightbound(j*k)*alpha;

/*solving system of equations by CG-algorithm*/

it[j-1]=cg(1+2*alpha,-alpha,N-1,&*xx,bb0,eps);

/*replacing old vector with new solution vector*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=0;i<N-1;i++)

{

bb0[i]=xx[i];

}

}

/*setting up solution vector for error computation*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for(i=1;i<=N-1;i++)

{

sol[i]=bb0[i-1];

}

sol[0]=leftbound(T); sol[N]=rightbound(T);

/*average of CG-algorithm iterations*/

#pragma omp parallel for num_threads(NUM_THREADS) private(j) reduction(+:itav)

for(j=0;j<M;j++)

{

itav+=it[j];

}

itav=itav/M;


/*Computing error*/

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:sum)

for(i=0;i<=N;i++)

{

sum+=pow(check(a+i*h,T)-sol[i],2);

}

error=sqrt(sum);

/*freeing memory*/

free(xx); free(bb0); free(it); free(sol);

t2=omp_get_wtime();

/*printing results for required input from user*/

printf("%ld\t%ld\t%lg\t%lg\t%lg\n\n\n",M,N,error,t2-t1,itav);

}



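/*Function for the forcing term*/

double Function(double x, double t)

{

double out;

out=exp(-2*t)*cos(2*PI*x)*(-PI*cos(PI*t)

+(4*pow(PI,2)-2)*(1-sin(PI*t)));

return out;

}

/*Function for the left boundary condition*/

double leftbound(double t)

{

double out;

out=check(0.0,t);

return out;

}

/*Function for the right boundary condition*/

double rightbound(double t)

{

double out;

out=check(1.0,t);

return out;

}

/*Function for the initial condition*/

double initial(double x)

{

double out;

out=check(x,0.0);

return out;

}

/*Function for the exact solution*/

double check(double x, double t)

{

double out;

out=(1-sin(PI*t))*exp(-2*t)*cos(2*PI*x);

return out;

}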

/*Function for CG-Algorithm*/

int cg(double A,double B,int n,double *x,double *b,double eps)

{

int i,j,k;

double rr,pq,bb,alpha,beta,rrold;

double *r=NULL,*p=NULL,*q=NULL;

k=0;

if (n<=2) { return 0; }

if( (r=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

if( (p=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

if( (q=(double*)malloc(n*sizeof(double)))==NULL) exit(1);

r[0]=b[0]-(A*x[0]+B*x[0+1]);

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for (i=1;i<n-1;i++)

{

r[i]=b[i]-(B*x[i-1]+A*x[i]+B*x[i+1]);

}

r[n-1]=b[n-1]-(B*x[n-2]+A*x[n-1]);

bb=0;


#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:bb)

for (i=0;i<n;i++)

{

bb+=b[i]*b[i];

}

rr=0;

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:rr)

for (i=0;i<n;i++)

{

rr+=r[i]*r[i];

}

k=0;

while (sqrt(rr)>eps*sqrt(bb))

{

k=k+1;

if (k==1)

{

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for (i=0;i<n;i++)

{

p[i]=r[i];

}

beta=0;

}

else

{

beta=rr/rrold;

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for (i=0;i<n;i++)

{

p[i]=r[i]+beta*p[i];

}

}

q[0]=A*p[0]+B*p[0+1];

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for (i=1;i<n-1;i++)

{

q[i]=B*p[i-1]+A*p[i]+B*p[i+1];

}

q[n-1]=B*p[n-2]+A*p[n-1];

pq=0;

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:pq)

for (i=0;i<n;i++)

{

pq+=p[i]*q[i];

}

alpha=rr/pq;

#pragma omp parallel for num_threads(NUM_THREADS) private(i)

for (i=0;i<n;i++)

{

x[i]=x[i]+alpha*p[i];

r[i]=r[i]-alpha*q[i];

}

rrold=rr;

rr=0;

#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:rr)

for (i=0;i<n;i++)

{

rr+=r[i]*r[i];

}

}

free(p); free(q); free(r);

return k;

}
