Interactive SuperComputing


 

Success Stories

Genomic Correlation at National Cancer Institute


Case Study
from Itanium Solutions


The Challenge

Researchers at the National Cancer Institute (NCI) explore a vast public database of genomic information for potential discoveries. This genomic profiling effort may help researchers better understand genetic risk factors for cancer and develop new procedures for testing the genomic profiles of tumors. Such procedures might advance the cause of personalized medicine, in which a patient’s genetic information may be used to customize the detection, treatment, or prevention of disease.

NCI researchers calculate the correlation between genes in microarray gene expression studies, using a database with more than 40,000 probe entries. The correlations help researchers better understand the relationships between genes, but as the amount of available genomic data exploded, computation time on the desktop grew with it. And in the race to understand how genetics and cancer are linked, time is precious.

The specialized, interactive, MATLAB®-based application used to compute correlations was constrained by the amount of memory supported on the desktop. As a result, extensive correlations proved problematic for some projects. Turnaround times of up to a week had proven to be the practical limit for researchers. And with bioinformatics on a steep growth curve, the problem was only going to grow worse.

By using Star-P software running on a multi-processor SGI Altix® server with large memory, researchers at NCI were able to overcome the limitations of serial desktop computing while continuing to work in the familiar, interactive, MATLAB®-based environment. The answers to some researchers’ questions are now arriving faster than ever before, up to 200 times faster.

Star-P Solution

The problem and its Star-P solution at NCI are representative of a wide range of statistical analysis projects. As the problem sizes increase, the researchers need tools enabling rapid computation of cross-correlation between experimentally obtained samples. Typically, the algorithm consists of three main steps: 1) Read the data from a file. 2) Compute cross-correlation results. 3) Select results of interest and write them out to a file.
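The three steps above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB® application used at NCI; the function names `pearson` and `correlate_file`, the CSV layout (one sample per row), and the 0.9 threshold are all assumptions made for the example. It uses plain loops over pairs of samples, which keeps memory use low at the price of speed.

```python
import csv
import io
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def correlate_file(infile, outfile, threshold):
    # Step 1: read the data, one sample per row.
    rows = [[float(v) for v in row] for row in csv.reader(infile) if row]
    writer = csv.writer(outfile)
    kept = 0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            # Step 2: compute the correlation for this pair of samples.
            r = pearson(rows[i], rows[j])
            # Step 3: keep only strongly correlated pairs and write them out.
            if abs(r) >= threshold:
                writer.writerow([i, j, f"{r:.4f}"])
                kept += 1
    return kept

# In-memory stand-ins for the input and output files (hypothetical data).
source = io.StringIO("1,2,3,4\n2,4,6,8\n4,3,2,1\n1,3,2,5\n")
result = io.StringIO()
kept = correlate_file(source, result, threshold=0.9)
```

Real file objects opened with `open(...)` can be passed in place of the `io.StringIO` buffers.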

The main problem is that the size of the data and intermediate results may vastly exceed the desktop memory. In the NCI case, as often happens, the correlations had to be computed in parts, and full-scale vectorization of the computation was not possible. Although vectorized computations run much faster, in both serial and parallel modes, the speedup comes at the cost of increased demand for memory.
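Computing the correlations "in parts" might look like the sketch below. It is an assumption about the general approach, not NCI's actual code, and it is written in Python with NumPy; `normalize_rows`, `correlation_in_blocks`, and the block size are hypothetical. Each pass holds only a `block_rows` x N slice of the result instead of the full N x N matrix.

```python
import numpy as np

def normalize_rows(data):
    """Center each row and scale it to unit length (assumes no constant rows)."""
    centered = data - data.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def correlation_in_blocks(data, block_rows):
    """Yield (start, stop, block), where block holds rows start..stop-1
    of the N x N correlation matrix."""
    unit = normalize_rows(np.asarray(data, dtype=float))
    n = unit.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        yield start, stop, unit[start:stop] @ unit.T

rng = np.random.default_rng(0)
data = rng.standard_normal((10, 50))  # 10 samples, 50 values each (made-up data)

# Track the largest off-diagonal |r| without ever holding the full matrix.
best = 0.0
for start, stop, block in correlation_in_blocks(data, block_rows=4):
    # Zero out each sample's correlation with itself.
    block[np.arange(stop - start), np.arange(start, stop)] = 0.0
    best = max(best, float(np.abs(block).max()))
```

Smaller blocks need less memory but mean more passes, which is the trade-off described above.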

The use of Star-P on a server removes the desktop memory size limitation. With Star-P’s parallel I/O, data can be loaded quickly, distributed and computed on the server, and written out in parallel. It is now possible to do the entire computation in one step, using full-scale vectorization. Star-P’s ability to perform the required parallel and vectorized computations on the distributed matrix accounts for the bulk of the speedup achieved at NCI.

Although the I/O operations are typically problem-specific, the computation of cross-correlations is a core operation in a wide variety of statistical analysis projects. A generic example of vectorized code for computing cross-correlation is shown below (where "data" is organized as an N x M matrix, with N the number of samples and M the count of numbers in each sample):

[Figure: example of generic vectorized code]

Using Star-P, it is possible to overcome the scaling limits of the desktop. Without Star-P, for larger data sizes, the correlation would have to be computed using FOR loops, without vectorization, taking dramatically longer to complete.

This code segment can run serially in MATLAB®, on the desktop, or (without any changes) in parallel, in the Star-P with MATLAB® environment.

The only step needed to run this computation in parallel in Star-P is to distribute the "data" matrix on the server, using one of the simple methods provided by Star-P, such as parallel read from file. Then all operations are automatically parallel.
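The listing referred to above is not reproduced in this text. As a rough stand-in, the sketch below shows what a fully vectorized cross-correlation can look like in Python with NumPy. It is not the MATLAB® code from the case study and it does not use Star-P; `correlation_matrix` is a hypothetical name. As in the description above, `data` is an N x M array with one sample per row.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation between every pair of rows of an N x M array.

    Assumes no row is constant (a constant row would cause a division by zero).
    """
    data = np.asarray(data, dtype=float)
    # Subtract each sample's mean, then scale each sample to unit length.
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1, keepdims=True))
    unit = centered / norms
    # One matrix product yields all N x N correlations at once.
    return unit @ unit.T

rng = np.random.default_rng(0)
data = rng.standard_normal((5, 20))  # 5 samples, 20 values each (made-up data)
corr = correlation_matrix(data)
```

No loop over pairs of samples appears anywhere. The cost is that the centered copy of `data` and the full N x N result must both fit in memory at once.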

Summary & Metrics

The National Cancer Institute (NCI) succeeded in migrating from serial computation in MATLAB®, which was no longer sufficient, to parallel computation in the combined Star-P + MATLAB® environment. As a result, NCI researchers obtain their results up to 200 times faster than previously possible.

With a more powerful parallel system at their disposal, researchers may also try even more complex searches that previously weren't an option. For example, the group has estimated that, using today's database, the largest potential correlation (with a data matrix of 100,000 by 100,000) would require more than 256GB of memory to solve. Star-P fundamentally transforms the workflow, giving researchers the ability to run more samples and to approach problems differently than they would have before.
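A back-of-the-envelope check gives a sense of the scale involved. The sketch below assumes 8-byte double-precision values (MATLAB®'s default numeric type) and decimal gigabytes; `matrix_bytes` is a hypothetical helper. How the group arrived at its 256GB estimate is not stated here, so the numbers only show that a handful of working arrays of this size is enough to reach it.

```python
import math

def matrix_bytes(rows, cols, bytes_per_value=8):
    """Memory needed to hold one dense rows x cols matrix."""
    return rows * cols * bytes_per_value

GB = 10**9  # decimal gigabyte (assumption)

# One dense 100,000 x 100,000 matrix of doubles.
one_matrix_gb = matrix_bytes(100_000, 100_000) / GB               # 80.0

# How many such matrices fit in 256 GB.
matrices_in_256_gb = 256 * GB // matrix_bytes(100_000, 100_000)   # 3

# Largest square matrix of doubles that fits in 2 GB, ignoring
# everything else that must share that memory.
max_side_2gb = math.isqrt(2 * GB // 8)                            # 15811
```

Under these assumptions, a fourth array of that size would already push the total past 256 GB, while a 2 GB desktop cannot hold even one square matrix much larger than about 15,800 on a side.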

The generic example code segment shown above was run both serially and in parallel to illustrate the impact of using Star-P. Serially, it ran on a single-processor Pentium-D desktop with 2 GB of memory. In parallel, it ran in a Star-P client-server configuration, with the client running on a single-processor Pentium-D desktop and the Star-P server running on an SMP server with four dual-core Opteron processors (8 cores total) and 32 GB of memory.

[Figure: cross-correlation computation graph]
  • 200X Speedup on 8-processor server
  • Ability to process larger data sets (>256GB) from desktop MATLAB®
  • Transformation of research workflow