Last Updated: 2016-02-23 Tue 18:41

CS 499 Homework 2: MPI Programming

CHANGELOG:

Tue Feb 23 18:26:47 EST 2016
Major updates to include details of how to report timings, what to put in the HW report, and some guidelines on what to expect for the grader interview. Ensure that your mpi_heat program can disable output to improve your timing reporting. Review the following updated/new sections.

You will also want to get the following new/updated files which are also in the updated code distribution.

Tue Feb 16 18:43:30 EST 2016
Minor update to show how to run batch jobs with a number of processors not divisible by 4 on medusa.

1 Overview

The assignment involves programming in MPI and answering some questions (TBD) about your programs. It is a programming assignment so dust off your C skills. We will attempt to spend some class time discussing issues related to the assignment as the due date approaches, but it will pay to start early to get oriented. Debugging parallel programs can be difficult and requires time.

There are two main problems to solve, parallelizing the heat program from HW1 and parallelizing a provided pagerank algorithm. Both problems involve coding and analyzing the programs you eventually construct.

In addition to providing code and a short report, you will be required to meet with the GTA for the course to demonstrate your program running. These meetings will occur after the assignment is due, during the week of 2/29/2016. Details of what will go on in this meeting and what questions the GTA might ask will be made clearer at a later time. For the moment, get cracking on the code.

2 A Sample MPI Program with Some Utilities

A simple sample program called mpi_hello.c is provided as part of the code distribution. This program includes two useful utilities:

  • pprintf(fmt,...) will have any processor running it print a message like printf does, but the message will be prefixed with the processor ID. It will be useful for debugging to track which proc is doing what.
  • rprintf(fmt,...) also works like printf but has only the root processor print messages. Usually duplicated output is not desired so the root processor should do most output for human consumption.
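
A minimal sketch of how such helpers might be implemented is shown below, assuming they are thin variadic wrappers around vprintf; the actual definitions in mpi_hello.c may differ.

#include <mpi.h>
#include <stdio.h>
#include <stdarg.h>

// Print like printf, but prefix the message with the calling processor's rank.
void pprintf(const char *fmt, ...){
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("P%2d: ", rank);
  va_list args;
  va_start(args, fmt);
  vprintf(fmt, args);
  va_end(args);
}

// Print like printf, but only on the root processor (rank 0).
void rprintf(const char *fmt, ...){
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if(rank == 0){
    va_list args;
    va_start(args, fmt);
    vprintf(fmt, args);
    va_end(args);
  }
}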

Compiling and running the program can be done locally on any machine with MPI installed as follows.

lila [ckauffm2-hw2]% mpicc -o mpi_hello mpi_hello.c
lila [ckauffm2-hw2]% mpirun -np 4 mpi_hello
P 0: Hello world from process 0 of 4 (host: lila)
P 1: Hello world from process 1 of 4 (host: lila)
P 2: Hello world from process 2 of 4 (host: lila)
P 3: Hello world from process 3 of 4 (host: lila)
Hello from the root processor 0 of 4 (host: lila)

3 The Medusa Cluster

3.1 Summary

  • medusa.vsnet.gmu.edu is the MPI cluster available for use in the class. Use ssh to log in. You must have a VPN connection set up. Some instructions are here for setting up VPN.
  • Use mpicc to compile on medusa.
  • Use mpirun to test interactively. Interactive runs execute only on the login node, so programs will not typically get much speedup.
  • When your code is in good shape, create a small batch script and submit it to the batch queue using sbatch. Check the status of the job queue with squeue.

3.2 Details

The main platform we will utilize is the medusa.vsnet.gmu.edu cluster. All students in CS 499 have been authorized to use the cluster. Log into it using your favorite ssh tool with your standard Mason NetID and password.

lila [~]% ssh ckauffm2@medusa.vsnet.gmu.edu
Password: 
Last login: Tue Feb 16 09:33:37 2016 from lila.vsnet.gmu.edu
Use the command 'module avail' to list available modules.
Use the command 'module add <module_name>' to use module <module_name>.
Default Modules: 
  1) dot              4) boost/1.60.0     7) munge/0.5.11
  2) openmpi/1.10.1   5) SimGrid/3.12     8) dmtcp/2.4.4
  3) java/1.8.0_66    6) slurm/15.08.7    9) medusa-default

medusa [~]% which mpicc
/usr/local/openmpi/1.10.1/bin/mpicc

medusa [~]% which mpirun
/usr/local/openmpi/1.10.1/bin/mpirun

medusa [~]% cd cs499/ckauffm2-hw2

medusa [ckauffm2-hw2]% mpicc -o mpi_hello mpi_hello.c

medusa [ckauffm2-hw2]% mpirun -np 2 mpi_hello 
P 0: Hello world from process 0 of 2 (host: medusa)
Hello from the root processor 0 of 2 (host: medusa)
P 1: Hello world from process 1 of 2 (host: medusa)

medusa [ckauffm2-hw2]% mpirun -np 8 mpi_hello 
P 0: Hello world from process 0 of 8 (host: medusa)
Hello from the root processor 0 of 8 (host: medusa)
P 3: Hello world from process 3 of 8 (host: medusa)
P 5: Hello world from process 5 of 8 (host: medusa)
P 6: Hello world from process 6 of 8 (host: medusa)
P 2: Hello world from process 2 of 8 (host: medusa)
P 1: Hello world from process 1 of 8 (host: medusa)
P 4: Hello world from process 4 of 8 (host: medusa)
P 7: Hello world from process 7 of 8 (host: medusa)

medusa is a cluster of 14 or so computers, and a careful observer will note that mpirun seems to be running everything on a single node, host medusa, which is the login node. To gain access to other compute nodes, you will need to request that a computation be run via the job queue.

The typical approach is to create a small shell script which sets up the parameters for the job and runs it. Any text editor can be used to create such a script. It is then submitted to the job queue using the sbatch command. One can check on queued jobs using squeue, which displays running and waiting jobs.

# show the job being submitted
medusa [ckauffm2-hw2]% cat mpi_hello.sh
#!/bin/bash
# 
#SBATCH --output hello.out  # output file where printed text will be saved
#SBATCH --ntasks 8          # how many processors to request

mpirun mpi_hello


# submit the job to the queue
medusa [ckauffm2-hw2]% sbatch mpi_hello.sh
Submitted batch job 261

# check the status of the queue
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               253    Medusa pagerank ckauffm2  R   10:17:45      4 medusa-node[08-11]
               261    Medusa mpi_hell ckauffm2  R       0:02      2 medusa-node[12-13]
# job mpi_hello is running as there is an R for ^ running associated with it

# check the queue again
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               253    Medusa pagerank ckauffm2  R   10:17:49      4 medusa-node[08-11]
# job is gone so must be finished

# check that the output file exists
medusa [ckauffm2-hw2]% ls hello.out
hello.out

# show contents of the output file
medusa [ckauffm2-hw2]% cat hello.out
P 1: Hello world from process 1 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 4: Hello world from process 4 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)

Options can also be specified on the command line; these are identical to the options given in the job file itself. This is convenient for running the same script with different numbers of processors or output files, as shown below.

# show a job script which has no options
medusa [ckauffm2-hw2]% cat no_options.sh
#!/bin/bash
mpirun mpi_hello

# submit running on 8 processors
medusa [ckauffm2-hw2]% sbatch --output output.8.out --ntasks 8 no_options.sh
Submitted batch job 267

# submit running on 12 processors
medusa [ckauffm2-hw2]% sbatch --output output.12.out --ntasks 12 no_options.sh
Submitted batch job 268

# Check: both jobs done
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               253    Medusa pagerank ckauffm2  R   10:31:46      4 medusa-node[08-11]

# show output of both job files
medusa [ckauffm2-hw2]% cat output.8.out
P 4: Hello world from process 4 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 1: Hello world from process 1 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)

medusa [ckauffm2-hw2]% cat output.12.out
P 9: Hello world from process 9 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 4: Hello world from process 4 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 8: Hello world from process 8 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P11: Hello world from process 11 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P10: Hello world from process 10 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 1: Hello world from process 1 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)

It is more complex to run jobs on a number of processors that is not evenly divisible by 4, so initially restrict your job runs to 4, 8, 12, and 16 processors.

When you want to run on an arbitrary number of processors, use the following job script as a template. Adjust the --ntasks option to the desired number of processors and the --output option as needed but leave the other two options unchanged.

#!/bin/bash
# 
#SBATCH --output hello.out  # output file where printed text will be saved
#SBATCH --ntasks 9          # how many processors to request
#SBATCH --cpus-per-task 1   # use as is
#SBATCH --ntasks-per-node 4 # use as is
#
# Demonstrate running on 9 processors

mpirun mpi_hello

4 Problem 1: Parallel Heat (50%)

A slightly modified version of the heat propagation simulation from HW1 and in-class discussion is in the code pack and called heat.c. In this problem, create an MPI version of this program which divides the calculation of the heat of each simulation cell over time among many processors. Key features of this parallelization are as follows.

  • Name your source file
    mpi_heat.c 
    

    It will need to be a C program and run with the MPI library provided on the medusa cluster.

  • The serial version of the program provided accepts the number of time steps and the width of the rod as command line arguments. Make sure to preserve this interface so that commands like the following work:
    mpicc -o mpi_heat mpi_heat.c
    mpirun -np 4 mpi_heat 10 40
    
  • Divide the problem data so that each processor owns only a portion of the columns of the heat matrix as discussed in class.
  • Utilize sends and receives, or the combined MPI_Sendrecv, to allow processors to communicate with neighbors (a sketch of this exchange appears after the table below).
  • Utilize a collective communication operation at the end of the computation to gather all results on the root processor 0 and have it print out the entire results matrix.
  • Verify that the output of your MPI version is identical to the output of the serial version which is provided.
  • Your MPI version is only required to work correctly in the following situations:
    • The width of the rod in cells is evenly divisible by the number of processors being run
    • The width of the rod is at least three times the number of processors so that each processor would have at least 3 columns associated with it.

    That means the following configurations should work or not work as indicated.

    # procs  width  works?  Notes
    1        1      no      not enough cols
    1        2      no      not enough cols
    1        3      yes     take special care for 1 proc
    4        4      no      only 1 column per proc
    4        8      no      only 2 columns per proc
    4        12     yes     at least 3 cols per proc
    4        16     yes     at least 3 cols per proc
    4        15     no      uneven cols
    3        9      yes     3 cols per proc, evenly divisible
    4        40     yes     evenly divisible, >= 3 cols per proc
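
As a reference for the neighbor communication mentioned above, the following is a minimal sketch of one time step's ghost-column exchange using MPI_Sendrecv. It assumes each processor stores its slice of one time step in an array row of length local_width+2, with row[0] and row[local_width+1] acting as ghost cells; the names row, local_width, and exchange_ghosts are hypothetical and not part of the provided code.

#include <mpi.h>

// Exchange ghost cells with the left and right neighbors for one time step.
void exchange_ghosts(double *row, int local_width){
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  // Processors at the ends of the rod have no neighbor on one side;
  // MPI_PROC_NULL turns the corresponding send/receive into a no-op.
  int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

  // Send my leftmost owned cell left; receive the right neighbor's leftmost
  // owned cell into my right ghost cell.
  MPI_Sendrecv(&row[1],               1, MPI_DOUBLE, left,  0,
               &row[local_width + 1], 1, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  // Send my rightmost owned cell right; receive the left neighbor's rightmost
  // owned cell into my left ghost cell.
  MPI_Sendrecv(&row[local_width],     1, MPI_DOUBLE, right, 1,
               &row[0],               1, MPI_DOUBLE, left,  1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Using MPI_PROC_NULL for the missing neighbors at the ends of the rod turns those sends and receives into no-ops, so the boundary processors need no special-case code.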

4.1 Written Summary of the Parallel Heat Results

The following script can be used to submit jobs to run mpi_heat on various numbers of processors with different widths. It is required that the program take a 3rd command-line argument which can disable printing. Ensure that

> mpirun -np 2 mpi_heat 100 32 0

runs and produces no output. See the updated section on disabling output in heat for additional details.

The script is here: submit-heat-jobs.sh

After running the script, jobs will be submitted to the batch queue to run your mpi_heat on 1, 2, 4, 8, 10, and 16 processors using different widths (6400, 25600, and 102400 columns). Execution time is saved in files named after the width and number of processors. When all jobs are complete, timing info will be in files whose names start with ht. The time for each job is easily accessible using the grep command as demonstrated below.

medusa [ckauffm2-hw2]% ./submit-heat-jobs.sh
Submitted batch job 1063
Submitted batch job 1064
Submitted batch job 1065
Submitted batch job 1066
...
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1081    Medusa submit.s ckauffm2 PD       0:00      1 (Resources)
              1082    Medusa submit.s ckauffm2 PD       0:00      1 (Priority)
              1092    Medusa submit.s ckauffm2 PD       0:00      4 (Priority)
              1074    Medusa submit.s ckauffm2  R       0:05      4 medusa-node[02-05]
              1077    Medusa submit.s ckauffm2  R       0:04      1 medusa-node08
              1078    Medusa submit.s ckauffm2  R       0:04      2 medusa-node[09-10]
              1079    Medusa submit.s ckauffm2  R       0:02      3 medusa-node[11-13]
              1080    Medusa submit.s ckauffm2  R       0:01      4 medusa-node[00-01,06-07]
...
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
medusa [ckauffm2-hw2]% grep 'real' ht.*.out
ht.006400.01.out:real 1.13
ht.006400.02.out:real 1.25
ht.006400.04.out:real 1.58
ht.006400.08.out:real 3.16
ht.006400.10.out:real 3.16
ht.006400.16.out:real 3.25
ht.025600.01.out:real 1.76
ht.025600.02.out:real 1.79
...

In your written report, include the following table with execution times filled based on the processors and widths of your runs using the script above.

Procs\Width   6400   25600   102400
 1            ?      ?       ?
 2            ?      ?       ?
 4            ?      ?       ?
 8            ?      ?       ?
10            ?      ?       ?
16            ?      ?       ?

Comment on whether you achieve any speedup using more processors. Describe any trends or anomalies you see in the timings and speculate on their causes.

4.2 Grader Interview for Problem 1

You will need to demonstrate your code for this problem to the GTA for the course. Expectations of what will go on in this meeting are in the Grader Interview Specifics Section.

4.3 (50 / 50) Grading Criteria for Problem 1   grading

  • (20 / 20) Demonstration of your code to the GTA and ability to explain how it works during an in-person meeting.
  • (10 / 10) Cleanly written code with good documentation
  • (10 / 10) Correct execution for the given configurations of processors vs columns.
  • (10 / 10) Written report which includes timings described above and discussion of them.

4.4 Adjustments to heat.c to omit output

Input and output can occupy a tremendous amount of execution time and often mask the real performance of a program. To that end, make some adjustments to your mpi_heat.c program to disable printing of the final output matrix. An updated version of heat.c is provided which shows how to do this in the serial context and can largely be copied over to the parallel context. This involves accepting and parsing an additional command-line argument that turns printing on/off. This is handled at the beginning of the program.

int main(int argc, char **argv){
  if(argc < 4){
    printf("usage: %s max_time width print\n",argv[0]);
    printf("  max_time: int\n");
    printf("  width: int\n");
    printf("  print: 1 print output, 0 no printing\n");
    return 0;
  }

  int max_time = atoi(argv[1]); // Number of time steps to simulate
  int width = atoi(argv[2]);    // Number of cells in the rod
  int print = atoi(argv[3]);    // CONTROLS PRINTING

Later, output of the final heat matrix is conditioned on the variable print:

  if(print==1){
    // Print results
    printf("Temperature results for 1D rod\n");
    printf("Time step increases going down rows\n");
    printf("Position on rod changes going accross columns\n");
    ...
    // Row headers and data
    for(t=0; t<max_time; t++){
      printf("%3d| ",t);
      for(p=0; p<width; p++){
        printf("%5.1f ",H[t][p]);
      }
      printf("\n");
    }
  }

This allows one to run the program without output.

medusa [ckauffm2-hw2]% gcc -o heat heat.c
medusa [ckauffm2-hw2]% ./heat
usage: ./heat max_time width print
  max_time: int
  width: int
  print: 1 print output, 0 no printing
medusa [ckauffm2-hw2]% ./heat 5 10 1
Temperature results for 1D rod
Time step increases going down rows
Position on rod changes going accross columns
   |     0     1     2     3     4     5     6     7     8     9 
---+-------------------------------------------------------------
  0|  20.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  10.0 
  1|  20.0  35.0  50.0  50.0  50.0  50.0  50.0  50.0  30.0  10.0 
  2|  20.0  35.0  42.5  50.0  50.0  50.0  50.0  40.0  30.0  10.0 
  3|  20.0  31.2  42.5  46.2  50.0  50.0  45.0  40.0  25.0  10.0 
  4|  20.0  31.2  38.8  46.2  48.1  47.5  45.0  35.0  25.0  10.0 
medusa [ckauffm2-hw2]% ./heat 5 10 0
medusa [ckauffm2-hw2]%

While running the program without output seems useless, we are primarily interested in this mode of operation to time program execution without the interference of output time, and this is the easiest way to get at those timings.

5 Page Rank

5.1 Overview of Computing Pageranks

A key to Google's early success was the ability of its search engine to identify web pages which seemed important to user search queries. A key component of their engine was, and remains, an importance metric called Pagerank, so named both because it ranks web pages and because the author of the algorithm is Larry Page (history has a splendid sense of irony). Pagerank has a beautiful theory behind it which involves modeling web users as random walkers through hyperlinked pages. On arriving at a page, a user randomly selects a link and visits it. This process is repeated on the next page, and the next page, and so forth. With a small probability, a user may randomly jump to some arbitrary other page which is not linked to the present one. According to this formalism, pagerank represents the probability of finding a user on a given page at a particular moment in time. A page with many incoming links has a higher probability of being visited as many other pages have "voted" for its importance. A page with many outgoing links contributes little to the importance of any linked page: its votes are spread very thin. This rough sort of voting turned out to be a good measure of the importance of a page, at least in the early 2000s before web denizens learned to manipulate the algorithm.

It turns out that if the network of links between web pages is represented as a certain matrix, the page ranks are identical to a particular eigenvector of that matrix. There are several interesting facets to this relationship for the mathematically inclined, and good reading on the subject comes from a survey by Berkhin. The bottom line is that any algorithm for computing an eigenvector of a matrix can be used to compute page ranks. A classical iterative technique to compute eigenvectors is the Power Method which involves repeatedly multiplying a vector by a matrix. Matrix-vector multiplication is a ripe operation for parallelization and your primary task will be to parallelize this process for the pagerank computation.

A code called dense_pagerank.c is provided which performs the pagerank computation serially. In high-level terms the computation breaks down as follows.

  1. Load data for a matrix of web page links (link matrix). Each page is numbered 0 to N-1 where N is the total number of pages. The file format is simply pairs of page numbers, each pair indicating a link from one page to another. Loading the file involves allocating memory for the entire matrix, zeroing each entry, then filling a 1 into each row/col entry indicated by the file.
  2. Normalize columns by summing each column in the matrix, then dividing each entry in a column by the sum of the column.
  3. Apply a damping factor which allows random warping from one page to another. The math on this is a little funky, but the intent is to make each nonzero entry in the matrix a little smaller and each zero entry nonzero so there is a chance of jumping to an arbitrary page. See the code for the specific math involved with the update. A typical damping factor is 0.85: an 85% chance of visiting a link on the page and a 15% chance of jumping to an arbitrary unlinked page.
  4. Initialize pageranks to be equal for each page and so that the pageranks sum to 1. If there are 10 pages, each page initially has a pagerank of 0.1; with 100 pages each has 0.01. Only the relative size between ranks is important.
  5. Multiply the link matrix by the pageranks according to the standard matrix-vector multiplication algorithm. Store the results in a second array of numbers. This second array of numbers is now the new pageranks. Assign this back to the array of old pageranks after checking for convergence.
  6. Repeat step 5 of creating new pageranks by multiplying the link matrix by the old pageranks. Continue repeating this until there is very little change between new and old pageranks. At this point, the solution has converged.

This algorithm is a good example of iterative algorithms: it is not known ahead of time how many steps will be required to converge but steady progress should be made as indicated by the old and new pagerank vectors being closer and closer together.
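
Concretely, the iteration described in steps 5 and 6 can be sketched in serial form as follows. This is only a sketch: the names mat, ranks, and tol are hypothetical, and the provided dense_pagerank.c computes its DIFF and organizes its loop somewhat differently.

#include <math.h>
#include <stdlib.h>

// Repeatedly multiply the damped, column-normalized link matrix (stored
// row-major in mat[i*n + j]) by the pagerank vector until the change between
// iterations falls below tol.
void pagerank_iterate(double *mat, double *ranks, int n, double tol){
  double *new_ranks = malloc(n * sizeof(double));
  double diff = tol + 1.0;
  while(diff > tol){
    // new_ranks = mat * ranks (standard matrix-vector multiply)
    for(int i = 0; i < n; i++){
      new_ranks[i] = 0.0;
      for(int j = 0; j < n; j++){
        new_ranks[i] += mat[i*n + j] * ranks[j];
      }
    }
    // Measure the change between old and new pageranks, then copy new over old.
    diff = 0.0;
    for(int i = 0; i < n; i++){
      diff += fabs(new_ranks[i] - ranks[i]);
      ranks[i] = new_ranks[i];
    }
  }
  free(new_ranks);
}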

Note that because the columns of the link matrix and the vector of pageranks are positive and sum to 1, the results of their multiplication should also sum to 1 (i.e. the new pageranks also sum to 1). The code presently reports the norm of the vector as this sum and it should remain 1 throughout the computation.

It should be mentioned that the provided code is a dense version of the pagerank: every element of the link matrix has memory allocated to it. Unsurprisingly, a production version of the code would use sparse matrices instead where the many zero entries of the matrix are represented implicitly to save a tremendous amount of memory. While the dense algorithm is easier to parallelize than the sparse, the dense version is woefully inappropriate for the enormous size of Google-scale pagerank computations involving 30,000,000,000,000+ web pages. It is a computation that necessitates parallelism at a sickening scale but is reasonably approximated by the present code.

Take some time to examine the code provided carefully.

5.2 Sample Runs of dense_pagerank.c

Part of the code distribution includes some graph files which you can use for experimentation and timing analysis of your code. Each graph is named after its size and content. The notredame graphs are derived from a real dataset of web sites in the Notre Dame domain. The full set is available here, though it will require a bit of processing to be used with this code and is extremely large for a dense pagerank calculation.

Start by experimenting with the small graphs like tiny-20.txt which has only 20 nodes in it and 200 links between pages.

lila [ckauffm2-hw2]% ls graphs/*.txt
graphs/notredame-100.txt    graphs/notredame-2000.txt  graphs/notredame-8000.txt
graphs/notredame-16000.txt  graphs/notredame-501.txt   graphs/tiny-20.txt

lila [ckauffm2-hw2]% dense_pagerank graphs/tiny-20.txt 0.85 
Loaded graphs/tiny-20.txt: 20 rows, 200 nonzeros
Beginning Computation

ITER     DIFF     NORM
  1: 1.78e-01 1.00e+00
  2: 3.85e-02 1.00e+00
  3: 7.27e-03 1.00e+00
  4: 1.32e-03 1.00e+00
  5: 2.12e-04 1.00e+00
CONVERGED

PAGE RANKS
0.04779640
0.04147775
0.04912589
0.03965692
0.05845908
0.04394957
0.02513647
0.04369224
0.05522195
0.07147504
0.05889092
0.06569723
0.05264261
0.03913282
0.05423814
0.05833793
0.04308603
0.06827848
0.03697897
0.04672553

The progress at each iteration is reported: the DIFF column should get progressively smaller while the NORM column should remain 1 throughout. After convergence, the pageranks of the 20 pages are printed.

The largest graph you should work with is notredame-8000.txt which has 8000 web sites involved in it leading to an 8000 by 8000 link matrix. Running this through the serial code looks like the following. Note that the output will be long (8000+ lines) so it is put into the file output.txt and examined using the head command to display the first few lines.

lila [ckauffm2-hw2]% ls graphs/*.txt
graphs/notredame-100.txt    graphs/notredame-2000.txt  graphs/notredame-8000.txt
graphs/notredame-16000.txt  graphs/notredame-501.txt   graphs/tiny-20.txt

lila [ckauffm2-hw2]% dense_pagerank graphs/notredame-8000.txt 0.85 > output.txt
lila [ckauffm2-hw2]% head -50 output.txt
Loaded graphs/notredame-8000.txt: 8000 rows, 27147 nonzeros
Beginning Computation

ITER     DIFF     NORM
  1: 1.26e+00 1.00e+00
  2: 7.92e-01 1.00e+00
  3: 4.24e-01 1.00e+00
  4: 2.48e-01 1.00e+00
  5: 1.50e-01 1.00e+00
  6: 9.45e-02 1.00e+00
  7: 6.23e-02 1.00e+00
  8: 4.11e-02 1.00e+00
  9: 2.73e-02 1.00e+00
 10: 1.91e-02 1.00e+00
 11: 1.31e-02 1.00e+00
 12: 9.24e-03 1.00e+00
 13: 6.74e-03 1.00e+00
 14: 4.91e-03 1.00e+00
 15: 3.75e-03 1.00e+00
 16: 2.81e-03 1.00e+00
 17: 2.16e-03 1.00e+00
 18: 1.64e-03 1.00e+00
 19: 1.27e-03 1.00e+00
 20: 9.79e-04 1.00e+00
CONVERGED

PAGE RANKS
0.00227804
0.00044506
0.00001875
0.00051994
0.00156742
0.00015092
0.00087703
0.00111392
0.00123884
0.00081005
0.00252026
0.00359624
0.00007052
0.00005559
0.00001959
0.00107474
0.00075570
0.00015412
0.00011205
0.00395254
0.02658639
0.00023358
0.00009175

6 Problem 2: Parallel PageRank (50%)

Parallelize the provided pagerank code. It is suggested that you start with the provided serial code and take small steps towards parallelizing it.

6.1 Reading Data Files

The program starts with reading input from a file which should be done only on the root processor. After reading the whole matrix into the root processor, send chunks of the matrix to each processor for the main part of the algorithm.

The serial code uses a densemat structure to store the matrix. This structure uses a trick.

  • All elements are stored in a linear array called all. This allows linear index access via mat->all[i]
  • An array of pointers called data points to the beginning of each row in the matrix. This allows row/col access via mat->data[r][c].
  • As a consequence of the linear array, sequential rows are stored in adjacent memory. In a 10 by 10 matrix, rows 0, 1, and 2 are stored in elements 0-29 of mat->all. This makes it possible to send multiple adjacent rows with single communications.
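
A minimal sketch of this layout and of sending several adjacent rows with a single call is shown below; the field and function names follow the bullet points above, but the exact definitions in the provided code may differ.

#include <mpi.h>

// Layout following the description above; the actual definition in
// dense_pagerank.c may differ slightly.
typedef struct {
  int nrows, ncols;
  double *all;    // all nrows*ncols elements in one contiguous block
  double **data;  // data[r] points at &all[r*ncols], giving data[r][c] access
} densemat;

// Send `count` whole rows starting at row `first` to processor `dest` in a
// single message; possible only because consecutive rows are adjacent in all.
void send_rows(densemat *mat, int first, int count, int dest){
  MPI_Send(&mat->all[first * mat->ncols], count * mat->ncols,
           MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}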

6.2 Row Partitioning Woes

The main source of parallelism is obtained by dividing up the link matrix so that each processor owns a collection of whole rows. This is effective as matrix vector multiplication relies on multiplying a whole row by a column vector (the pageranks in this case).

Do not assume that the number of rows in the link matrix is evenly divisible by the number of processors. Make your code more flexible than that. This, unfortunately, means dealing with some minutiae as not every processor will send or receive the same number of elements. As a suggested approach, do the following:

  • First, assume the number of rows is evenly divisible by the number of processors and use simple MPI calls like MPI_Scatter and MPI_Allgather which assume every processor will receive the same number of elements. Make sure that this version works on some of the input graphs for numbers of processors that evenly divide the size.
  • When you are confident in your code above, make a backup copy of it for safekeeping.
  • Now take the plunge and switch to the MPI vector calls which allow one to specify the number of elements each processor will receive: functions like MPI_Scatterv and MPI_Allgatherv (notice the v at the end) take additional parameter arrays of the counts of elements for each processor and the offsets into storage arrays where those elements reside. These more complex invocations may seem tedious, but all that is really required is to set up arrays indicating the counts of elements on each processor and pass those in. Establish these arrays near the beginning of the program and use them throughout, as sketched below.
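
As a reference, here is a minimal sketch of setting up such count and displacement arrays when n rows of width ncols are split as evenly as possible over nprocs processors; all names here are hypothetical.

// Split n rows of width ncols as evenly as possible over nprocs processors;
// fill per-processor row counts, element counts, and element displacements
// for use with MPI_Scatterv / MPI_Allgatherv.
void make_counts(int n, int ncols, int nprocs,
                 int *row_counts, int *elem_counts, int *elem_displs){
  int offset = 0;
  for(int p = 0; p < nprocs; p++){
    // The first (n % nprocs) processors get one extra row.
    row_counts[p]  = n / nprocs + (p < n % nprocs ? 1 : 0);
    elem_counts[p] = row_counts[p] * ncols;
    elem_displs[p] = offset;
    offset += elem_counts[p];
  }
}

The element counts and displacements can be passed to MPI_Scatterv when distributing matrix rows; a second pair of arrays built from row_counts (and the corresponding row offsets) serves MPI_Allgatherv when gathering pieces of the pagerank vector.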

6.3 Parallelizing Column Normalization and Damping

It is suggested that you initially let the root processor read the whole matrix, normalize the columns, apply the damping factor, then scatter the matrix rows to each processor. That way the serial code can be used to ensure normalization and damping are correct.

Later, revisit the column normalization and damping to parallelize it.

  • Scatter the unnormalized link matrix rows to each process
  • Have each process compute an array of its own column sums
  • Use an all-to-all reduction so that every processor has the sums of all columns. Investigate a good MPI function for this all-to-all reduction and potentially use the MPI_IN_PLACE constant to save yourself some allocations of buffers (the manual pages for relevant MPI functions describe this option). A sketch appears after this list.
  • Have each processor divide each of its elements by the appropriate column sum.
  • Have each processor apply the damping factor adjustment to each of its elements.
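
A minimal sketch of the column-sum reduction using MPI_Allreduce with MPI_IN_PLACE is shown below, assuming each processor holds local_rows rows of width ncols in a contiguous array local; the names are hypothetical and the damping step is omitted.

#include <mpi.h>
#include <stdlib.h>

// Each processor holds local_rows rows of width ncols in the contiguous
// array local. Compute global column sums with an in-place all-reduce and
// divide each locally owned element by its column's sum.
void normalize_columns(double *local, int local_rows, int ncols){
  double *colsums = calloc(ncols, sizeof(double));
  // Sum the columns of the locally owned rows...
  for(int r = 0; r < local_rows; r++)
    for(int c = 0; c < ncols; c++)
      colsums[c] += local[r*ncols + c];
  // ...then combine so every processor ends up with the global column sums.
  MPI_Allreduce(MPI_IN_PLACE, colsums, ncols, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  // Divide each locally owned element by its global column sum.
  for(int r = 0; r < local_rows; r++)
    for(int c = 0; c < ncols; c++)
      if(colsums[c] != 0.0)
        local[r*ncols + c] /= colsums[c];
  free(colsums);
}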

6.4 Parallelizing the Repeated Matrix-vector Multiplication

The main computation loop involves repeatedly multiplying the link matrix by the vector of pageranks. In the parallel version, each processor has some whole rows of the link matrix. Note the consequences of this decomposition.

  • Each processor has some link matrix rows but must have the whole vector of old pageranks to do the multiplication
  • After completing the multiplication, each processor will contain only part of the new pagerank vector and must communicate its portion to all other processors for the next multiplication to occur.
  • After each multiplication, each processor must also share how much its new pageranks differ from the equivalent portion of the old pagerank vector so that all processors can determine if the algorithm has converged.

This will involve several collective communication operations at each iteration to share these pieces of information; a sketch of one such iteration appears below.
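
The following is a minimal sketch of one such iteration. It assumes each processor owns my_rows rows of the link matrix starting at global row my_first_row, and that counts and displs hold the per-processor pagerank-vector counts and offsets (one entry per owned row, not the matrix element counts used for scattering); all names are hypothetical.

#include <math.h>
#include <mpi.h>

// One parallel iteration: multiply my rows by the full old pagerank vector,
// gather every processor's slice of the new vector, and combine the local
// convergence measures so all processors see the same total difference.
// Assumes displs[rank] == my_first_row.
double pagerank_step(double *local_mat, int my_rows, int my_first_row, int n,
                     double *old_ranks, double *new_ranks,
                     int *counts, int *displs){
  // Local matrix-vector multiply over my rows of the link matrix.
  for(int i = 0; i < my_rows; i++){
    double sum = 0.0;
    for(int j = 0; j < n; j++)
      sum += local_mat[i*n + j] * old_ranks[j];
    new_ranks[my_first_row + i] = sum;
  }
  // Each processor contributes the slice it just computed (in place) and
  // receives everyone else's so all hold the complete new pagerank vector.
  MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 new_ranks, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
  // Sum each processor's local change so all agree on convergence.
  double local_diff = 0.0, diff = 0.0;
  for(int i = 0; i < my_rows; i++)
    local_diff += fabs(new_ranks[my_first_row + i] - old_ranks[my_first_row + i]);
  MPI_Allreduce(&local_diff, &diff, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return diff;
}

After each call the caller would copy new_ranks over old_ranks and stop once the returned difference falls below the convergence tolerance.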

6.5 Written Summary of Parallel Pagerank Results

The following script can be used to submit jobs to run mpi_dense_pagerank on the provided graphs.

Timing jobs script: submit-pagerank-jobs.sh

After running the script, jobs will be submitted to the batch queue to run your mpi_dense_pagerank on 1-16 processors saving output in files along with the execution time. When all jobs are complete, timing info will be in files that start with pr.*. The time for the jobs is accessible easily using the grep command as demonstrated below.

medusa [ckauffm2-hw2]% ./submit-pagerank-jobs.sh
Submitted batch job 541
Submitted batch job 542
Submitted batch job 543
Submitted batch job 544
...
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               549    Medusa time-pag ckauffm2 PD       0:00      3 (Resources)
               550    Medusa time-pag ckauffm2 PD       0:00      3 (Priority)
               551    Medusa time-pag ckauffm2 PD       0:00      3 (Priority)
               556    Medusa time-pag ckauffm2 PD       0:00      4 (Priority)
               541    Medusa time-pag ckauffm2  R       0:14      1 medusa-node00
               542    Medusa time-pag ckauffm2  R       0:14      1 medusa-node01
...
medusa [ckauffm2-hw2]% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
medusa [ckauffm2-hw2]% grep 'real' pr.notredame-8000.*
pr.notredame-8000.txt.01.out:real 16.92
pr.notredame-8000.txt.02.out:real 10.81
pr.notredame-8000.txt.03.out:real 11.97
pr.notredame-8000.txt.04.out:real 10.93
pr.notredame-8000.txt.05.out:real 9.92
...

The number in the last column is the execution time of the run while the file name indicates the number of processors used (1 to 5 are shown). Note that you can adjust parameters in the script to change the input graph, which is required for the second set of timings below.

In your HW report, include the following.

  • Timings of mpi_dense_pagerank for 1-16 processors on the notredame-8000.txt graph (default)
  • Timings of mpi_dense_pagerank for 1-16 processors on the notredame-16000.txt graph (change the input file in the script)
  • A brief discussion of how scalable your implementation appears to be based on those timings and any irregularities you see.

Present your results in a table with the following format.

Procs\Graph   8000   16000
 1            ?      ?
 2            ?      ?
 3            ?      ?
 4            ?      ?
 5            ?      ?
 6            ?      ?
 7            ?      ?
 8            ?      ?
 9            ?      ?
10            ?      ?
11            ?      ?
12            ?      ?
13            ?      ?
14            ?      ?
15            ?      ?
16            ?      ?

6.6 Grader Interview for Problem 2

You will need to demonstrate your code for this problem to the GTA for the course. Expectations of what will go on in this meeting are in the Grader Interview Specifics Section.

6.7 (50 / 50) Grading Criteria for Problem 2   grading

  • (20 / 20) Demonstration of your code to the GTA and ability to explain how it works during an in-person meeting.
  • (10 / 10) Clean parallelization of the main matrix-vector multiplication loop along with checks for convergence.
  • (5 / 5) Effective use of MPI's collective communication operations to spread and gather data between processors
  • (5 / 5) Clean parallelization of the computation of column sums and application of the damping factor in the link matrix.
  • (5 / 5) Correct execution for a matrix with any size on any number of processors.
  • (5 / 5) Report on timings of parallel pagerank generated using the script provided, along with discussion of those results.

7 Submission and Evaluation

7.1 Submitting your work

Submit the following to Blackboard for this HW.

Zip file of program code
Should contain at a minimum
  • mpi_heat.c for problem 1
  • mpi_dense_pagerank.c for problem 2
  • graphs/ directory with graph data files in it

May contain various other scripts that were provided or you developed to aid you.

DOCX or PDF Report
  • Names of all group members: Include full names, NetIDs and G#s such as the below for a group of 2.
    CS 499 HW 2
    Group of 2
    Turanga Leela tleela4 G07019321
    Philip J Fry pfry99 G00000001
    
  • Problem 1: Timings table generated using the script provided and discussion of the timings.
  • Problem 2: Timings table generated using the script provided and discussion of those timings.

Grading criteria for submitted work are described in previous sections.

7.2 Grader Interview Specifics

40% of your grade is based on an interview with the course grader in which you will be asked to demonstrate execution of your code and describe its approach. Briefly, this interview may include the following elements.

  • 20-minute interview
  • Demonstrate compiling and running a parallel program interactively on medusa
  • Demonstrate submitting a parallel job on the batch queue with a certain number of processors
  • Outline how the Problem 1: Heat program was parallelized
  • Give a brief walk-through of code for Problem 1: Heat
  • Explain some MPI calls as they appear in the Heat program
  • Outline how the Problem 2: Pagerank program was parallelized
  • Explain some MPI calls as they appear in the Pagerank program
  • Describe timing results associated with parallel Pagerank runs with different numbers of processors and input sizes
  • For groups of 2, the interviewer may direct questions at individual group members to assess that both members understand the content

Some sample questions are provided below. These are meant as a guide; the actual questions may be variants of them.

  • "Here you called MPI_XXX(...) in your Pagerank code. What is being accomplished there and why is it necessary?"
  • "What kind of decomposition did you use for your parallel Heat code? What kind of communication did it require?"
  • "Show me the timing results for running your Pagerank code on 4, 8, and 16 processors for the notredame-8000.txt graph."
  • "Show me how you would run your parallel Heat program with 8 processors and width 64 interactively."
  • "Submit a job to the batch queue which runs your parallel Heat program with 8 processors and width 64 and puts the output in testout.txt."
  • "At the end of your Pagerank program, where are is the entire array of Pageranks stored? Show me where this happens in your code."

Author: Chris Kauffman (kauffman@cs.gmu.edu)
Date: 2016-02-23 Tue 18:41