CS 499 Homework 2: MPI Programming
- Due: Monday 2/29/2016 by 11:59 pm
- Approximately 10% of total grade
- Submit to Blackboard zip of code with PDF report
- You may work in groups of 2 and submit one assignment per group.
- Code Distribution: distrib-hw2.zip (Updated Tue Feb 23 18:26:36 EST 2016)
CHANGELOG:
- Tue Feb 23 18:26:47 EST 2016
  - Major updates to include details of how to report timings, what to put in the HW report, and some guidelines on what to expect for the grader interview. Ensure that your mpi_heat program can disable output to improve your timing reporting. Review the following updated/new sections:
    - Disabling output in heat and mpi_heat
    - Problem 1 Scripts for Timing Results
    - Problem 2 Scripts for Timing Results
    - Submission and Evaluation
  - You will also want to get the new/updated files which are in the updated code distribution.
- Tue Feb 16 18:43:30 EST 2016
  - Minor update to show how to run batch jobs with a number of processors not divisible by 4 on medusa.
1 Overview
The assignment involves programming in MPI and answering some questions (TBD) about your programs. It is a programming assignment so dust off your C skills. We will attempt to spend some class time discussing issues related to the assignment as the due date approaches, but it will pay to start early to get oriented. Debugging parallel programs can be difficult and requires time.
There are two main problems to solve, parallelizing the heat program from HW1 and parallelizing a provided pagerank algorithm. Both problems involve coding and analyzing the programs you eventually construct.
In addition to providing code and a short report, you will be required to meet with the GTA for the course to demonstrate your program running. These meetings will occur after the assignment is due during the week of 2/29/2016. Details of what will go on in this meeting and what questions the GTA might ask will be made more clear at a later time. For the moment, get cracking on the code.
2 A Sample MPI Program with Some Utilities
A simple sample program called mpi_hello.c is provided as part of the code distribution. This program includes two useful utilities.
- pprintf(fmt,...) will have any processor running it print a message like printf does, but the message will be prefixed with the processor ID. It will be useful for debugging to track which proc is doing what.
- rprintf(fmt,...) also works like printf but has only the root processor print messages. Usually duplicated output is not desired, so the root processor should do most output for human consumption.
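The two utilities above can be sketched in plain C as follows. This is our own illustration, not the code from mpi_hello.c: here procid stands in for the rank obtained from MPI_Comm_rank after MPI_Init, and the exact prefix format in the provided file may differ.

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for the value set from MPI_Comm_rank in the real program. */
int procid = 0;
#define ROOT 0

/* Like printf, but prefixes each message with the processor id.
   Returns the number of characters of the message itself. */
int pprintf(const char *fmt, ...){
  printf("P %d: ", procid);
  va_list args;
  va_start(args, fmt);
  int n = vprintf(fmt, args);
  va_end(args);
  return n;
}

/* Like printf, but only the root processor prints; other procs
   print nothing and return 0. */
int rprintf(const char *fmt, ...){
  if(procid != ROOT) return 0;
  va_list args;
  va_start(args, fmt);
  int n = vprintf(fmt, args);
  va_end(args);
  return n;
}
```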
Compiling and running the program can be done locally on any machine with MPI installed as follows.
lila [ckauffm2-hw2]% mpicc mpi_hello.c
lila [ckauffm2-hw2]% mpirun -np 4 mpi_hello
P 0: Hello world from process 0 of 4 (host: lila)
P 1: Hello world from process 1 of 4 (host: lila)
P 2: Hello world from process 2 of 4 (host: lila)
P 3: Hello world from process 3 of 4 (host: lila)
Hello from the root processor 0 of 4 (host: lila)
3 The Medusa Cluster
3.1 Summary
medusa.vsnet.gmu.edu is the MPI cluster available for use in the class. Use ssh to log in. You must have a VPN connection set up. Some instructions are here for setting up VPN.
- Use mpicc to compile on medusa.
- Use mpirun to test interactively. The login node will be used for this only, so programs will not typically get much speedup.
- When your code is in good shape, create a small batch script and submit it to the batch queue using sbatch. Check the status of the job queue with squeue.
3.2 Details
The main platform we will utilize is the medusa.vsnet.gmu.edu cluster. All students in CS 499 have been authorized to use the cluster. Log into it with your favorite ssh tool using your standard Mason NetID and password.
lila [~]% ssh ckauffm2@medusa.vsnet.gmu.edu
Password:
Last login: Tue Feb 16 09:33:37 2016 from lila.vsnet.gmu.edu
Use the command 'module avail' to list available modules.
Use the command 'module add <module_name>' to use module <module_name>.
Default Modules:
  1) dot              4) boost/1.60.0    7) munge/0.5.11
  2) openmpi/1.10.1   5) SimGrid/3.12    8) dmtcp/2.4.4
  3) java/1.8.0_66    6) slurm/15.08.7   9) medusa-default
medusa [~]% which mpicc
/usr/local/openmpi/1.10.1/bin/mpicc
medusa [~]% which mpirun
/usr/local/openmpi/1.10.1/bin/mpirun
medusa [~]% cd cs499/ckauffm2-hw2
medusa [ckauffm2-hw2]% mpicc -o mpi_hello mpi_hello.c
medusa [ckauffm2-hw2]% mpirun -np 2 mpi_hello
P 0: Hello world from process 0 of 2 (host: medusa)
Hello from the root processor 0 of 2 (host: medusa)
P 1: Hello world from process 1 of 2 (host: medusa)
medusa [ckauffm2-hw2]% mpirun -np 8 mpi_hello
P 0: Hello world from process 0 of 8 (host: medusa)
Hello from the root processor 0 of 8 (host: medusa)
P 3: Hello world from process 3 of 8 (host: medusa)
P 5: Hello world from process 5 of 8 (host: medusa)
P 6: Hello world from process 6 of 8 (host: medusa)
P 2: Hello world from process 2 of 8 (host: medusa)
P 1: Hello world from process 1 of 8 (host: medusa)
P 4: Hello world from process 4 of 8 (host: medusa)
P 7: Hello world from process 7 of 8 (host: medusa)
medusa is a cluster of 14 or so computers, and a careful observer will note that mpirun seems to be running everything on a single node, host medusa, which is the login node. To gain access to other compute nodes, you will need to request that a computation be run via the job queue.
The typical approach to this is to create a small shell script which runs the program and sets up the parameters for the job. Any text editor can be used to create such a script. It is then submitted to the job queue using the sbatch command. One can check on queued jobs using squeue, which displays running and waiting jobs.
# show the job being submitted
medusa [ckauffm2-hw2]% cat mpi_hello.sh
#!/bin/bash
#
#SBATCH --output hello.out  # output file where printed text will be saved
#SBATCH --ntasks 8          # how many processors to request
mpirun mpi_hello

# submit the job to the queue
medusa [ckauffm2-hw2]% sbatch mpi_hello.sh
Submitted batch job 261

# check the status of the queue
medusa [ckauffm2-hw2]% squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   253    Medusa pagerank ckauffm2  R   10:17:45      4 medusa-node[08-11]
   261    Medusa mpi_hell ckauffm2  R       0:02      2 medusa-node[12-13]
# job mpi_hello is running as there is an R for running associated with it

# check the queue again
medusa [ckauffm2-hw2]% squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   253    Medusa pagerank ckauffm2  R   10:17:49      4 medusa-node[08-11]
# job is gone so must be finished

# check that the output file exists
medusa [ckauffm2-hw2]% ls hello.out
hello.out

# show contents of the output file
medusa [ckauffm2-hw2]% cat hello.out
P 1: Hello world from process 1 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 4: Hello world from process 4 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
Options can also be specified on the command line, and these are identical to the options given above in the job file itself. This is convenient for quick tests with different settings.
# show a job script which has no options
medusa [ckauffm2-hw2]% cat no_options.sh
#!/bin/bash
mpirun mpi_hello

# submit running on 8 processors
medusa [ckauffm2-hw2]% sbatch --output output.8.out --ntasks 8 no_options.sh
Submitted batch job 267

# submit running on 12 processors
medusa [ckauffm2-hw2]% sbatch --output output.12.out --ntasks 12 no_options.sh
Submitted batch job 268

# Check: both jobs done
medusa [ckauffm2-hw2]% squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   253    Medusa pagerank ckauffm2  R   10:31:46      4 medusa-node[08-11]

# show output of both job files
medusa [ckauffm2-hw2]% cat output.8.out
P 4: Hello world from process 4 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 8 (host: medusa-node13.Medusa.vsnet.gmu.edu)
P 1: Hello world from process 1 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 8 (host: medusa-node12.Medusa.vsnet.gmu.edu)
medusa [ckauffm2-hw2]% cat output.12.out
P 9: Hello world from process 9 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 4: Hello world from process 4 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P 2: Hello world from process 2 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 8: Hello world from process 8 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 6: Hello world from process 6 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P 7: Hello world from process 7 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P11: Hello world from process 11 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 0: Hello world from process 0 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
Hello from the root processor 0 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 5: Hello world from process 5 of 12 (host: medusa-node01.Medusa.vsnet.gmu.edu)
P10: Hello world from process 10 of 12 (host: medusa-node02.Medusa.vsnet.gmu.edu)
P 3: Hello world from process 3 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
P 1: Hello world from process 1 of 12 (host: medusa-node00.Medusa.vsnet.gmu.edu)
It is more complex to run jobs on a number of processors that is not evenly divisible by 4, so initially restrict your job runs to 4, 8, 12, and 16 processors.
When you want to run on an arbitrary number of processors, use the
following job script as a template. Adjust the --ntasks
option to
the desired number of processors and the --output
option as needed
but leave the other two options unchanged.
#!/bin/bash
#
#SBATCH --output hello.out      # output file where printed text will be saved
#SBATCH --ntasks 9              # how many processors to request
#SBATCH --cpus-per-task 1       # use as is
#SBATCH --ntasks-per-node 4     # use as is
#
# Demonstrate running on 9 processors
mpirun mpi_hello
4 Problem 1: Parallel Heat (50%)
A slightly modified version of the heat propagation simulation from HW1 and in-class discussion is in the code pack and called heat.c.
In this problem, create an MPI version of this program which will
divide calculation of the heat of each simulation cell over time up
among many processors. Key features of this parallelization are as
follows.
- Name your source file mpi_heat.c. It will need to be a C program and run with the MPI library provided on the medusa cluster.
- The serial version of the program provided accepts the number of time steps and the width of the rod as command line arguments. Make sure to preserve this interface so that commands like the following work.

  mpicc -o mpi_heat mpi_heat.c
  mpirun -np 4 mpi_heat 10 40
- Divide the problem data so that each processor owns only a portion of the columns of the heat matrix as discussed in class.
- Utilize sends and receives or the combined MPI_Sendrecv to allow processors to communicate with neighbors.
- Utilize a collective communication operation at the end of the computation to gather all results on the root processor 0 and have it print out the entire results matrix.
- Verify that the output of your MPI version is identical to the output of the serial version which is provided.
- Your MPI version is only required to work correctly in the following
situations:
- The width of the rod in cells is evenly divisible by the number of processors being run
- The width of the rod is at least three times the number of processors so that each processor would have at least 3 columns associated with it.
That means the following configurations should work or not work as indicated.
procs | width | works? | Notes |
---|---|---|---|
1 | 1 | no | not enough cols |
1 | 2 | no | not enough cols |
1 | 3 | yes | take special care for 1 proc |
4 | 4 | no | only 1 column per proc |
4 | 8 | no | only 2 columns per proc |
4 | 12 | yes | at least 3 cols per proc |
4 | 16 | yes | at least 3 cols per proc |
4 | 15 | no | uneven cols |
3 | 9 | yes | 3 cols per proc, evenly divisible |
4 | 40 | yes | evenly divisible, >= 3 cols per proc |
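The supported configurations amount to a small legality check. A sketch of that check in plain C (the function name is our own; it is not part of the provided code):

```c
#include <stdbool.h>

/* Sketch of the supported-configuration rule: the width must divide
   evenly among the processors and leave each processor at least 3
   columns. */
bool config_ok(int procs, int width){
  if(width % procs != 0) return false;  // uneven columns
  return (width / procs) >= 3;          // at least 3 cols per proc
}
```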
4.1 Written Summary of the Parallel Heat Results
The following script can be used to submit jobs to run mpi_heat
on
various numbers of processors with different widths. It is required
that the program take a 3rd commandline argument which will disable
printing. Ensure that
> mpirun -np 2 mpi_heat 100 32 0
runs and produces no output. See the updated section on disabling
output in heat
for additional details.
The script is here: submit-heat-jobs.sh
After running the script, jobs will be submitted to the batch queue to run your mpi_heat on 1, 2, 4, 8, 10, and 16 processors using different widths (6400, 25600, and 102400 columns). Execution time is saved in files named after the width and number of processors. When all jobs are complete, timing info will be in files that start with ht.*. The time for the jobs is easily accessible using the grep command as demonstrated below.
medusa [ckauffm2-hw2]% ./submit-heat-jobs.sh
Submitted batch job 1063
Submitted batch job 1064
Submitted batch job 1065
Submitted batch job 1066
...
medusa [ckauffm2-hw2]% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1081 Medusa submit.s ckauffm2 PD 0:00 1 (Resources)
1082 Medusa submit.s ckauffm2 PD 0:00 1 (Priority)
1092 Medusa submit.s ckauffm2 PD 0:00 4 (Priority)
1074 Medusa submit.s ckauffm2 R 0:05 4 medusa-node[02-05]
1077 Medusa submit.s ckauffm2 R 0:04 1 medusa-node08
1078 Medusa submit.s ckauffm2 R 0:04 2 medusa-node[09-10]
1079 Medusa submit.s ckauffm2 R 0:02 3 medusa-node[11-13]
1080 Medusa submit.s ckauffm2 R 0:01 4 medusa-node[00-01,06-07]
...
medusa [ckauffm2-hw2]% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
medusa [ckauffm2-hw2]% grep 'real' ht.*.out
ht.006400.01.out:real 1.13
ht.006400.02.out:real 1.25
ht.006400.04.out:real 1.58
ht.006400.08.out:real 3.16
ht.006400.10.out:real 3.16
ht.006400.16.out:real 3.25
ht.025600.01.out:real 1.76
ht.025600.02.out:real 1.79
...
In your written report, include the following table with execution times filled in based on the processors and widths of your runs using the script above.
Procs\Width | 6400 | 25600 | 102400 |
---|---|---|---|
1 | ? | ? | ? |
2 | ? | ? | ? |
4 | ? | ? | ? |
8 | ? | ? | ? |
10 | ? | ? | ? |
16 | ? | ? | ? |
Comment on whether you achieve any speedup using more processors. Describe any trends or anomalies you see in the timings and speculate on their causes.
4.2 Grader Interview for Problem 1
You will need to demonstrate your code for this problem to the GTA for the course. Expectations of what will go on in this meeting are in the Grader Interview Specifics Section.
4.3 (50 / 50) Grading Criteria for Problem 1
- (20 / 20) Demonstration of your code to the GTA and ability to explain how it works during an in-person meeting.
- (10 / 10) Cleanly written code with good documentation
- (10 / 10) Correct execution for the given configurations of processors vs columns.
- (10 / 10) Written report which includes timings described above and discussion of them.
4.4 Adjustments to heat.c to Omit Output
Input and output can occupy a tremendous amount of execution time and often mask the real performance of a program. To that end, make some adjustments to your mpi_heat.c program to disable printing of the final output matrix. An updated version of heat.c is provided which shows how to do this in the serial context and can largely be copied over to the parallel context. This involves accepting and parsing an additional command-line argument that turns printing on/off. This is handled at the beginning of the program.
int main(int argc, char **argv){
  if(argc < 4){
    printf("usage: %s max_time width print\n",argv[0]);
    printf("  max_time: int\n");
    printf("  width: int\n");
    printf("  print: 1 print output, 0 no printing\n");
    return 0;
  }
  int max_time = atoi(argv[1]); // Number of time steps to simulate
  int width    = atoi(argv[2]); // Number of cells in the rod
  int print    = atoi(argv[3]); // CONTROLS PRINTING
Later, output of the final heat matrix is conditioned on the variable print:
if(print==1){
  // Print results
  printf("Temperature results for 1D rod\n");
  printf("Time step increases going down rows\n");
  printf("Position on rod changes going accross columns\n");
  ...
  // Row headers and data
  for(t=0; t<max_time; t++){
    printf("%3d| ",t);
    for(p=0; p<width; p++){
      printf("%5.1f ",H[t][p]);
    }
    printf("\n");
  }
}
This allows one to run the program without output.
medusa [ckauffm2-hw2]% gcc -o heat heat.c
medusa [ckauffm2-hw2]% ./heat
usage: ./heat max_time width print
  max_time: int
  width: int
  print: 1 print output, 0 no printing
medusa [ckauffm2-hw2]% ./heat 5 10 1
Temperature results for 1D rod
Time step increases going down rows
Position on rod changes going accross columns
   |     0     1     2     3     4     5     6     7     8     9
---+-------------------------------------------------------------
  0|  20.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  10.0
  1|  20.0  35.0  50.0  50.0  50.0  50.0  50.0  50.0  30.0  10.0
  2|  20.0  35.0  42.5  50.0  50.0  50.0  50.0  40.0  30.0  10.0
  3|  20.0  31.2  42.5  46.2  50.0  50.0  45.0  40.0  25.0  10.0
  4|  20.0  31.2  38.8  46.2  48.1  47.5  45.0  35.0  25.0  10.0
medusa [ckauffm2-hw2]% ./heat 5 10 0
medusa [ckauffm2-hw2]%
While running the program without output seems useless, we are primarily interested in this mode of operation to time program execution without the interference of output time, and this is the easiest way to get at those timings.
5 Page Rank
5.1 Overview of Computing Pageranks
A key to Google's early success was the ability of its search engine to identify web pages which seemed important to user search queries. A key component of their engine was, and remains, an importance metric called Pagerank, so named both because it ranks web pages and because the author of the algorithm is Larry Page (history has a splendid sense of irony). Pagerank has a beautiful theory behind it which involves modeling web users as random walkers through hyperlinked pages. On arriving at a page, a user randomly selects a link and visits it. This process is repeated on the next page, and the next, and so forth. With a small probability, a user may randomly jump to some arbitrary other page which is not linked to the present one. Under this formalism, pagerank represents the probability of finding a user on a given page at a particular moment in time. A page with many incoming links has a higher probability of being visited, as many other pages have "voted" for its importance. A page with many outgoing links contributes little to the importance of any linked page: its votes are spread very thin. This rough sort of voting turned out to be a good measure of the importance of a page, at least in the early 2000s before web denizens learned to manipulate the algorithm.
It turns out that if the network of links between web pages is represented as a certain matrix, the pageranks are identical to a particular eigenvector of that matrix. There are several interesting facets to this relationship for the mathematically inclined, and good reading on the subject comes from a survey by Berkhin. The bottom line is that any algorithm for computing an eigenvector of a matrix can be used to compute pageranks. A classical iterative technique to compute eigenvectors is the Power Method, which involves repeatedly multiplying a vector by a matrix. Matrix-vector multiplication is a ripe operation for parallelization, and your primary task will be to parallelize this process for the pagerank computation.
A serial code called dense_pagerank.c is provided which performs the pagerank computation. In high-level terms the computation breaks down as follows.
1. Load data for a matrix of web page links (link matrix). Each page is numbered 0 to N-1 where N is the total number of pages. The file format is simply pairs of numbers of one page pointing to another. Loading the file involves allocating memory for the entire matrix, zeroing each entry, then filling a 1 into each row/col entry indicated by the file.
2. Normalize columns by summing each column in the matrix, then dividing each entry in a column by the sum of the column.
3. Apply a damping factor which allows random warping from one page to another. The math on this is a little funky, but the intent is to make each nonzero entry in the matrix a little smaller and each zero entry nonzero so there is a chance of jumping to an arbitrary page. See the code for the specific math involved with the update. A typical damping factor is 0.85: 85% chance of visiting a link on the page and 15% chance of jumping to an arbitrary unlinked page.
4. Initialize pageranks to be equal for each page and so that the pageranks sum to 1. If there are 10 pages, each page initially has a pagerank of 0.1; with 100 pages each has 0.01. Only the relative size between ranks is important.
5. Multiply the link matrix by the pageranks according to the standard matrix-vector multiplication algorithm. Store the results in a second array of numbers. This second array is now the new pageranks. Assign it back to the array of old pageranks after checking for convergence.
6. Repeat step 5 of creating new pageranks by multiplying the link matrix by the old pageranks. Continue repeating this until there is very little change between new and old pageranks. At that point, the solution has converged.
This algorithm is a good example of an iterative algorithm: it is not known ahead of time how many steps will be required to converge, but steady progress should be made, as indicated by the old and new pagerank vectors growing closer and closer together.
Note that due to the columns of the link matrix and the vector of pageranks being positive and summing to 1, the results of their multiplication should also sum to 1 (e.g. the new pageranks also sum to 1). The code presently reports the norm of the vector as this sum and it should remain 1 throughout the computation.
It should be mentioned that the provided code is a dense version of the pagerank: every element of the link matrix has memory allocated to it. Unsurprisingly, a production version of the code would use sparse matrices instead where the many zero entries of the matrix are represented implicitly to save a tremendous amount of memory. While the dense algorithm is easier to parallelize than the sparse, the dense version is woefully inappropriate for the enormous size of Google-scale pagerank computations involving 30,000,000,000,000+ web pages. It is a computation that necessitates parallelism at a sickening scale but is reasonably approximated by the present code.
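As a concrete picture of steps 5 and 6 above, here is a serial sketch of the power-method loop over a row-major dense matrix. The names, the convergence measure, and the structure are ours; dense_pagerank.c will differ in its details.

```c
#include <math.h>
#include <stdlib.h>

/* Repeatedly multiply the (already normalized and damped) n-by-n link
   matrix M by the old pageranks until old and new pageranks differ by
   less than tol. M is row-major: element (r,c) is M[r*n + c]. */
void power_iterate(const double *M, double *ranks, int n, double tol){
  double *next = malloc(n * sizeof(double));
  double diff = tol + 1.0;
  while(diff > tol){
    for(int r = 0; r < n; r++){        // matrix-vector product
      next[r] = 0.0;
      for(int c = 0; c < n; c++){
        next[r] += M[r*n + c] * ranks[c];
      }
    }
    diff = 0.0;                        // convergence check
    for(int r = 0; r < n; r++){
      diff += fabs(next[r] - ranks[r]);
    }
    for(int r = 0; r < n; r++){        // new pageranks become old
      ranks[r] = next[r];
    }
  }
  free(next);
}
```

Note that because the columns of M and the rank vector each sum to 1, the new ranks also sum to 1 at every iteration, which is the NORM the provided code reports.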
Take some time to examine the code provided carefully.
5.2 Sample Runs of dense_pagerank.c
Part of the code distribution includes some graph files which you can
use for experimentation and timing analysis of your code. Each graph
is named after its size and content. The notredame
graphs are
derived from a real dataset of web sites in the Notre Dame domain. The
full set is available here though will require a bit of processing to
be used with this code and is extremely large for a dense pagerank
calculation.
Start by experimenting with the small graphs like tiny-20.txt
which
has only 20 nodes in it and 200 links between pages.
lila [ckauffm2-hw2]% ls graphs/*.txt
graphs/notredame-100.txt    graphs/notredame-2000.txt  graphs/notredame-8000.txt
graphs/notredame-16000.txt  graphs/notredame-501.txt   graphs/tiny-20.txt
lila [ckauffm2-hw2]% dense_pagerank graphs/tiny-20.txt 0.85
Loaded graphs/tiny-20.txt: 20 rows, 200 nonzeros
Beginning Computation
ITER    DIFF     NORM
   1: 1.78e-01 1.00e+00
   2: 3.85e-02 1.00e+00
   3: 7.27e-03 1.00e+00
   4: 1.32e-03 1.00e+00
   5: 2.12e-04 1.00e+00
CONVERGED
PAGE RANKS
0.04779640
0.04147775
0.04912589
0.03965692
0.05845908
0.04394957
0.02513647
0.04369224
0.05522195
0.07147504
0.05889092
0.06569723
0.05264261
0.03913282
0.05423814
0.05833793
0.04308603
0.06827848
0.03697897
0.04672553
The progress at each iteration is reported: the DIFF
column should
get progressively smaller while the NORM
column should remain 1
throughout. After convergence, the pageranks of the 20 pages are
printed.
The largest graph you should work with is notredame-8000.txt
which has 8000 web sites involved in it leading to an 8000 by 8000
link matrix. Running this through the serial code looks like the
following. Note that the output will be long (8000+ lines) so it is
put into the file output.txt
and examined using the head
command
to display the first few lines.
lila [ckauffm2-hw2]% ls graphs/*.txt
graphs/notredame-100.txt    graphs/notredame-2000.txt  graphs/notredame-8000.txt
graphs/notredame-16000.txt  graphs/notredame-501.txt   graphs/tiny-20.txt
lila [ckauffm2-hw2]% dense_pagerank graphs/notredame-8000.txt 0.85 > output.txt
lila [ckauffm2-hw2]% head -50 output.txt
Loaded graphs/notredame-8000.txt: 8000 rows, 27147 nonzeros
Beginning Computation
ITER    DIFF     NORM
   1: 1.26e+00 1.00e+00
   2: 7.92e-01 1.00e+00
   3: 4.24e-01 1.00e+00
   4: 2.48e-01 1.00e+00
   5: 1.50e-01 1.00e+00
   6: 9.45e-02 1.00e+00
   7: 6.23e-02 1.00e+00
   8: 4.11e-02 1.00e+00
   9: 2.73e-02 1.00e+00
  10: 1.91e-02 1.00e+00
  11: 1.31e-02 1.00e+00
  12: 9.24e-03 1.00e+00
  13: 6.74e-03 1.00e+00
  14: 4.91e-03 1.00e+00
  15: 3.75e-03 1.00e+00
  16: 2.81e-03 1.00e+00
  17: 2.16e-03 1.00e+00
  18: 1.64e-03 1.00e+00
  19: 1.27e-03 1.00e+00
  20: 9.79e-04 1.00e+00
CONVERGED
PAGE RANKS
0.00227804
0.00044506
0.00001875
0.00051994
0.00156742
0.00015092
0.00087703
0.00111392
0.00123884
0.00081005
0.00252026
0.00359624
0.00007052
0.00005559
0.00001959
0.00107474
0.00075570
0.00015412
0.00011205
0.00395254
0.02658639
0.00023358
0.00009175
6 Problem 2: Parallel PageRank (50%)
Parallelize the provided pagerank code. It is suggested that you start with the provided serial code and take small steps toward parallelizing it.
6.1 Reading Data Files
The program starts with reading input from a file which should be done only on the root processor. After reading the whole matrix into the root processor, send chunks of the matrix to each processor for the main part of the algorithm.
The serial code uses a densemat structure to store the matrix. This structure uses a trick.
- All elements are stored in a linear array called all. This allows linear index access via mat->all[i].
- An array of pointers called data points to the beginning of each row in the matrix. This allows row/col access via mat->data[r][c].
- As a consequence of the linear array, sequential rows are stored in adjacent memory. In a 10 by 10 matrix, rows 0, 1, and 2 are stored in elements 0-29 of mat->all. This makes it possible to send multiple adjacent rows with single communications.
6.2 Row Partitioning Woes
The main source of parallelism is obtained by dividing up the link matrix so that each processor owns a collection of whole rows. This is effective as matrix-vector multiplication relies on multiplying a whole row by a column vector (the pageranks in this case).
Do not assume that the number of rows in the link matrix is evenly divisible by the number of processors. Make your code more flexible than that. This, unfortunately, means dealing with some minutiae as not every processor will send or receive the same number of elements. As a suggested approach, do the following.
- First, assume the number of rows is evenly divisible by the number of processors and use simple MPI calls like MPI_Scatter and MPI_Allgather which assume every processor will receive the same number of elements. Make sure that this version works on some of the input graphs for numbers of processors that evenly divide the size.
- When you are confident in your code above, make a backup copy of it for safekeeping.
- Now take the plunge and switch to the MPI vector calls which allow one to specify the number of elements each processor will receive: functions like MPI_Scatterv and MPI_Allgatherv (notice the v at the end) take additional parameter arrays of the counts of elements for each processor and the offsets into the storage arrays where those elements reside. These more complex invocations may seem tedious, but all that is really required is to set up arrays indicating the counts of elements on each processor and pass those in. Establish these arrays near the beginning of the program and use them throughout.
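One common way to set up those count and offset arrays is sketched below with our own names: when the division is uneven, early processors take one extra row, and counts are measured in matrix elements (rows times columns) so whole rows can be scattered straight from the linear storage.

```c
/* Fill counts[p] (elements owned by proc p) and displs[p] (offset of
   proc p's first element in the row-major linear array) for nrows rows
   of ncols elements split over nprocs processors. */
void partition(int nrows, int ncols, int nprocs,
               int *counts, int *displs){
  int base   = nrows / nprocs;   // rows every proc gets
  int extra  = nrows % nprocs;   // leftover rows
  int offset = 0;
  for(int p = 0; p < nprocs; p++){
    int myrows = base + (p < extra ? 1 : 0);  // early procs take one extra
    counts[p] = myrows * ncols;
    displs[p] = offset;
    offset += counts[p];
  }
}
```

These two arrays can then be passed directly as the sendcounts and displs parameters of MPI_Scatterv, and reused (with counts divided by ncols) for gathering pagerank vector pieces with MPI_Allgatherv.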
6.3 Parallelizing Column Normalization and Damping
It is suggested that you initially let the root processor read the whole matrix, normalize the columns, apply the damping factor, then scatter the matrix rows to each processor. That way the serial code can be used to ensure normalization and damping are correct.
Later, revisit the column normalization and damping to parallelize it.
- Scatter the unnormalized link matrix rows to each processor.
- Have each processor compute an array of its own column sums.
- Use an all-to-all reduction so that every processor has the sums of all columns. Investigate a good MPI function for this all-to-all reduction and potentially use the MPI_IN_PLACE constant to save yourself some buffer allocations (the manual pages for relevant MPI functions describe this option).
- Have each processor divide each of its elements by the appropriate column sum.
- Have each processor apply the damping factor adjustment to each of its elements.
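Serially, the normalization and damping steps above amount to the following sketch over a row-major n-by-n matrix. The handling of all-zero columns and the exact damping formula in the provided code may differ, so treat this as illustrative: with damping factor d, each normalized entry m becomes d*m + (1-d)/n, so zero entries become (1-d)/n.

```c
/* Normalize each column to sum to 1, then apply the damping factor d.
   M is row-major: element (r,c) is M[r*n + c]. */
void normalize_and_damp(double *M, int n, double d){
  for(int c = 0; c < n; c++){
    double sum = 0.0;
    for(int r = 0; r < n; r++){          // column sum
      sum += M[r*n + c];
    }
    for(int r = 0; r < n; r++){          // normalize, then damp
      double m = (sum > 0.0) ? M[r*n + c] / sum : 0.0;
      M[r*n + c] = d * m + (1.0 - d) / n;
    }
  }
}
```

In the parallel version, the column-sum loop runs over only the rows a processor owns, and the per-processor partial sums are combined with the all-to-all reduction before the normalize-and-damp loop runs.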
6.4 Parallelizing the Repeated Matrix-vector Multiplication
The main computation loop involves repeatedly multiplying the link matrix by the vector of pageranks. In the parallel version, each processor has some whole rows of the link matrix. Note the consequences of this decomposition.
- Each processor has some link matrix rows but must have the whole vector of old pageranks to do the multiplication
- After completing the multiplication, each processor will contain only part of the new pagerank vector and must communicate its portion to all other processors for the next multiplication to occur.
- After each multiplication, each processor must also share how much its new pageranks differ from the equivalent portion of the old pagerank vector so that all processors can determine if the algorithm has converged.
This will involve several collective communication operations at each iteration to share this information.
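The convergence part of this sharing is just a partial sum of absolute differences over each processor's slice of the vector; in the MPI version those partial sums would then be combined with a single sum-reduction (e.g. MPI_Allreduce with MPI_SUM) so every processor sees the global difference. A sketch of the local piece, with our own names:

```c
#include <math.h>

/* Each processor computes the absolute difference between old and new
   pageranks over its own index range [lo, hi). The global difference
   is the sum of these values across all processors. */
double local_diff(const double *oldr, const double *newr,
                  int lo, int hi){
  double diff = 0.0;
  for(int i = lo; i < hi; i++){
    diff += fabs(newr[i] - oldr[i]);
  }
  return diff;
}
```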
6.5 Written Summary of Parallel Pagerank Results
The following script can be used to submit jobs to run mpi_dense_pagerank on the provided graphs.
Timing jobs script: submit-pagerank-jobs.sh
After running the script, jobs will be submitted to the batch queue to run your mpi_dense_pagerank on 1-16 processors, saving output in files along with the execution time. When all jobs are complete, timing info will be in files that start with pr.*. The time for the jobs is easily accessible using the grep command as demonstrated below.
medusa [ckauffm2-hw2]% submit-pagerank-jobs.sh
Submitted batch job 541
Submitted batch job 542
Submitted batch job 543
Submitted batch job 544
...
medusa [ckauffm2-hw2]% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
549 Medusa time-pag ckauffm2 PD 0:00 3 (Resources)
550 Medusa time-pag ckauffm2 PD 0:00 3 (Priority)
551 Medusa time-pag ckauffm2 PD 0:00 3 (Priority)
556 Medusa time-pag ckauffm2 PD 0:00 4 (Priority)
541 Medusa time-pag ckauffm2 R 0:14 1 medusa-node00
542 Medusa time-pag ckauffm2 R 0:14 1 medusa-node01
...
medusa [ckauffm2-hw2]% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
medusa [ckauffm2-hw2]% grep 'real' pr.notredame-8000.*
pr.notredame-8000.txt.01.out:real 16.92
pr.notredame-8000.txt.02.out:real 10.81
pr.notredame-8000.txt.03.out:real 11.97
pr.notredame-8000.txt.04.out:real 10.93
pr.notredame-8000.txt.05.out:real 9.92
...
The number in the last column is the execution time of the run, while the file name indicates the number of processors used (1 to 5 are shown). Note that you can adjust parameters in the script to change the input graph, which is required for the second set of timings.
In your HW report, include the following.
- Timings of `mpi_dense_pagerank` for 1-16 processors on the `notredame-8000.txt` graph (default)
- Timings of `mpi_dense_pagerank` for 1-16 processors on the `notredame-16000.txt` graph (change the input file in the script)
- A brief discussion of how scalable your implementation appears to be based on those timings and any irregularities you see.
Present your results in a table with the following format.
Procs\Graph | 8000 (sec) | 16000 (sec) |
---|---|---|
1 | ? | ? |
2 | ? | ? |
3 | ? | ? |
4 | ? | ? |
5 | ? | ? |
6 | ? | ? |
7 | ? | ? |
8 | ? | ? |
9 | ? | ? |
10 | ? | ? |
11 | ? | ? |
12 | ? | ? |
13 | ? | ? |
14 | ? | ? |
15 | ? | ? |
16 | ? | ? |
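For the scalability discussion, speedup and parallel efficiency are the standard measures to compute from the times in the table:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```

Ideal scaling gives S(p) = p and E(p) = 1. For instance, the sample output earlier shows T(1) = 16.92s and T(2) = 10.81s on the `notredame-8000.txt` graph, so S(2) ≈ 1.57 and E(2) ≈ 0.78. Falling efficiency as p grows usually reflects communication overhead, which is worth relating to the collective operations in your implementation.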
6.6 Grader Interview for Problem 2
You will need to demonstrate your code for this problem to the GTA for the course. Expectations of what will go on in this meeting are in the Grader Interview Specifics Section.
6.7 (50 / 50) Grading Criteria for Problem 2
- (20 / 20) Demonstration of your code to the GTA and ability to explain how it works during an in-person meeting.
- (10 / 10) Clean parallelization of the main matrix-vector multiplication loop along with checks for convergence.
- (5 / 5) Effective use of MPI's collective communication operations to spread and gather data between processors.
- (5 / 5) Clean parallelization of the computation of column sums and application of the damping factor in the link matrix.
- (5 / 5) Correct execution for a matrix of any size on any number of processors.
- (5 / 5) Report on timings of parallel pagerank generated using the script provided, along with discussion of those results.
7 Submission and Evaluation
7.1 Submitting your work
Submit the following to Blackboard for this HW.
- Zip file of program code
  - Should contain at a minimum:
    - `mpi_heat.c` for problem 1
    - `mpi_dense_pagerank.c` for problem 2
    - `graphs/` directory with graph data files in it
  - May contain various other scripts that were provided or that you developed to aid you.
- DOCX or PDF Report
  - Names of all group members: Include full names, NetIDs, and G#s, as in the example below for a group of 2.

    CS 499 HW 2
    Group of 2
    Turanga Leela tleela4 G07019321
    Philip J Fry pfry99 G00000001

  - Problem 1: Timings table generated using the script provided and discussion of the timings.
  - Problem 2: Timings table generated using the script provided and discussion of those timings.
Grading criteria for submitted work are described in previous sections.
7.2 Grader Interview Specifics
40% of your grade is based on an interview with the course grader in which you will be asked to demonstrate execution of your code and describe its approach. Briefly, this interview may include the following elements.
- 20-minute interview
- Demonstrate compiling and running a parallel program interactively on medusa
- Demonstrate submitting a parallel job on the batch queue with a certain number of processors
- Outline how the Problem 1: Heat program was parallelized
- Give a brief walk-through of code for Problem 1: Heat
- Explain some MPI calls as they appear in the Heat program
- Outline how the Problem 2: Pagerank program was parallelized
- Explain some MPI calls as they appear in the Pagerank program
- Describe timing results associated with parallel Pagerank runs with different numbers of processors and input sizes
- For groups of 2, the interviewer may direct a question at individual group members to assess that both members understand the content.
Some sample questions are provided below. These are meant to guide your preparation; the actual questions may be variants of them.
- "Here you called `MPI_XXX(...)` in your Pagerank code. What is being accomplished there and why is it necessary?"
- "What kind of decomposition did you use for your parallel Heat code? What kind of communication did it require?"
- "Show me the timing results for running your Pagerank code on 4, 8, and 16 processors for the `notredame-8000.txt` graph."
- "Show me how you would run your parallel Heat program with 8 processors and width 64 interactively."
- "Submit a job to the batch queue which runs your parallel Heat program with 8 processors and width 64 and puts the output in `testout.txt`."
- "At the end of your Pagerank program, where is the entire array of Pageranks stored? Show me where this happens in your code."