Results: 25x Gain on 32 core, 4 Node Cluster
Singular value decomposition (SVD) based algorithms are used by scientists and engineers to recognize data in noisy environments, reduce data storage requirements, and for regularization of pseudo-inverse calculations. SVD is a linear algebra technique that decomposes matrices into constituent parts, a left-hand and right-hand matrix separated by a descriptive diagonal matrix, the singular values, as their weighting. In many applications, the singular values of a matrix decrease quickly with increasing rank. This property allows the user to reduce noise or compress the matrix data by eliminating the small singular values or the higher ranks.
Application areas for the SVD include digital signal processing (DSP), image processing, bioinformatics and physical simulation. With technology advances, these applications increasingly work with larger datasets—and many more of them. The computational intensity of the SVD algorithm limits throughput when processing many matrices, and limits usefulness when processing large matrices on a serial computational platform. When handling many small matrices, parallel SVD computation on a cluster of processors would be ideal—but how does one easily distribute the work? Similarly challenging is the implementation of the SVD algorithm in MPI or coding the application in C/C++ or Fortran with an MPI SVD library to process large data sets in parallel.
With Star-P, scientists and engineers can quickly take advantage of both data and task-parallel SVD computation. Star-P lets users develop and optimize their applications in the familiar MATLAB® environment, while seamlessly computing interactively on a multiprocessor server. With the native MATLAB® SVD compatibility, Star-P readily enables simple conversion of existing serial MATLAB® code for parallel computation shaving precious execution and development time.
In a recent application of Star-P, the SVD algorithm was used for image compression. The algorithm consisted of the following three steps: 1) load the data, 2) compress the data, and 3) decompress the data to view the results. The compression algorithm simply retains the SVD data up to the rank specified - in other words, keeping only matrix u and v columns one through the compression rank, and singular values one through compression rank.
The original image shows snow in red and bare ground in green. This image is processed into an image showing the relative amount of snow and then that resulting image is compressed with SVD. The four images show the original image, the resulting relative amount of snow image and then two compression experiments. The first, performed with rank 10 is highly compressed, but with extensive loss of fidelity. A second image is compressed with rank of 256, has a lower compression ratio but is close to indistinguishable from the original. An advantage of Star-P is that the user can easily try multiple configurations to see which settings are best for the data in use.
Only one change was made to the original MATLAB® code to run in data parallel: data is loaded onto the server with the command ppload. The original algorithm is shown in Figures 1 and 2. The difference between the serial and parallel algorithms is shown in Figure 3. With a single image of 1600 by 2400 pixels, running Star-P on a four node cluster with 32 processor cores, the execution time was accelerated 25 times. In addition to running the SVD compression/decompression algorithm in data parallel, one could very easily run many smaller images in task parallel.
Figure 1:
SVD Image Compression Algorithm
function [uout sout vout] =
SVD_compress(a,rank)
rankvec = 1:rank;
%Compress
image by SVD algorithm
[u,s,v] = svd(a,0);
%
Collect output
vout = v(:,rankvec);
uout = u(:,rankvec);
fulls = diag(s);
sout = fulls(rankvec);
Figure 2:
SVD Image Decompression Algorithm
function o = SVD_decompress(uout,sout,vout)
%Reconstruct
the image from uout, sout, vout
o =
uout*diag(sout)*vout';
Figure 3:
Difference Between
Data-Parallel and Serial
%%
Read image from disk (Serial)
image_file = 'EastUSA.mat';
%%
Read image from disk (Parallel)
image_file = 'EastUSA.mat';
ppload(image_file,'a',2);
As can be seen from the graph of execution times, considerable speed up is obtained for very little effort. The graph showing the efficiency of execution for the different numbers of cores shows that efficiency is great through 16 cores, but declines at 32 cores. This is a common effect which occurs when too many processors are being used for relatively little data. An analogy is painting a room: It is quite easy to imagine four painters working together, one per wall, and if the room is reasonably large, two painters per wall can be used efficiently. But it would be very hard to efficiently paint a room with 32 painters.