Genomic Correlation at National Cancer Institute
The Challenge
Researchers at the National Cancer Institute (NCI) explore a vast public database of
genomic information for potential discoveries. This genomic profiling effort may help
researchers better understand genetic risk factors for cancer, and help them develop
new procedures for testing the genomic profiles of tumors. Such procedures might advance
the cause of personalized medicine, in which a patient’s genetic information may be used
to customize the detection, treatment, or prevention of disease.
NCI researchers calculate the correlation between genes in microarray gene expression
studies, using a database with more than 40,000 probe entries. The correlations help
researchers better understand the relationships between genes. With the explosion in the
amount of available genomic data, however, computation time on the desktop grew as well.
And in the race to understand how genetics and cancer are linked, time is precious.
The specialized, interactive, MATLAB®-based application used to compute correlations
was performance-constrained by the amount of memory supported in the desktop. As a result,
extensive correlations proved problematic for some projects. Turnaround times of up to a
week had proven to be the practical limit for researchers. And with bioinformatics on a
steep growth curve, the problem was only going to grow worse.
By using Star-P software running on a multi-processor SGI Altix® server with large
memory, researchers at NCI were able to overcome the limitations of serial desktop
computing while continuing to work in the familiar, interactive, MATLAB®-based
environment. And the answers to some researchers’ questions are arriving faster than
ever before: up to 200 times faster.
Star-P Solution
The problem and its Star-P solution at NCI are representative of a wide range of
statistical analysis projects. As problem sizes increase, researchers need tools that
enable rapid computation of cross-correlations between experimentally obtained samples.
Typically, the algorithm consists of three main steps: 1) Read the data from a file.
2) Compute cross-correlation results. 3) Select results of interest and write them out
to a file.
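The three steps can be sketched as follows. This is a minimal illustration in plain Python, not the NCI application: the CSV file layout, the function names (`read_matrix`, `pearson`, `correlate_rows`, `write_strong_pairs`), and the 0.9 selection threshold are all assumptions made for the example, and each row of the file is treated as one sample.

```python
import csv
import math


def read_matrix(path):
    """Step 1: read a numeric matrix, one sample per row, from a CSV file."""
    with open(path, newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle) if row]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlate_rows(data):
    """Step 2: correlate every pair of rows (i < j) in the matrix."""
    return {
        (i, j): pearson(data[i], data[j])
        for i in range(len(data))
        for j in range(i + 1, len(data))
    }


def write_strong_pairs(correlations, path, threshold=0.9):
    """Step 3: keep pairs with |r| >= threshold and write them to a CSV file."""
    selected = [(i, j, r) for (i, j), r in correlations.items() if abs(r) >= threshold]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row_i", "row_j", "r"])
        writer.writerows(selected)
    return selected
```

A typical call chain would be `write_strong_pairs(correlate_rows(read_matrix("in.csv")), "out.csv")`, where both file names are placeholders. The pair-by-pair loop in step 2 is the slow, low-memory form of the computation.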
The main problem is that the size of the data and intermediate results may vastly
exceed the desktop memory. In the NCI case, as often happens, the correlations had to be
computed in parts, and full-scale vectorization of the computation was not possible.
Although vectorized computations run much faster in both serial and parallel modes, the
speedup comes at the cost of increased demand for memory.
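One way to compute correlations "in parts" is to produce the result matrix one tile at a time, so that the full N x N result never has to sit in memory. The plain-Python sketch below illustrates that idea only; it is not the scheme the NCI application used, and the names `standardize`, `correlation_tiles`, and `block_size` are hypothetical. It assumes each row is one non-constant sample.

```python
import math


def standardize(row):
    """Center a row and scale it to unit length, so dot products equal Pearson r."""
    mean = sum(row) / len(row)
    centered = [value - mean for value in row]
    norm = math.sqrt(sum(value * value for value in centered))
    return [value / norm for value in centered]


def correlation_tiles(data, block_size):
    """Yield (row_start, col_start, tile) pieces of the row-by-row correlation matrix.

    Only one tile of at most block_size x block_size results exists at a time.
    The standardized input itself is still held in memory in full.
    """
    z = [standardize(row) for row in data]
    n = len(z)
    for row_start in range(0, n, block_size):
        for col_start in range(0, n, block_size):
            tile = [
                [sum(a * b for a, b in zip(z[i], z[j]))
                 for j in range(col_start, min(col_start + block_size, n))]
                for i in range(row_start, min(row_start + block_size, n))
            ]
            yield row_start, col_start, tile
```

Each tile can be filtered and written out before the next one is computed. That keeps memory bounded, at the price of extra bookkeeping and many smaller operations in place of one large one.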
The use of Star-P on a server removes the desktop memory size limitation. With Star-P’s
parallel I/O, data can be loaded quickly, distributed and computed on the server, and written
out in parallel. It’s now possible to do the entire computation in one step, using full-scale
vectorization. Star-P’s ability to perform the required parallel and vectorized computations on
the distributed matrix accounts for the bulk of the speedup achieved at NCI.
Although the I/O operations are typically problem-specific, the computation of
cross-correlations is a core operation in a wide variety of statistical analysis projects. An
example of generic vectorized code for computing cross-correlations is shown below, where
"data" is organized as an N x M matrix, N is the number of samples, and M is the count of
numbers in each sample:
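As a stand-in illustration, the same whole-matrix formulation can be sketched in plain Python. This is not the MATLAB® listing from the NCI project. It assumes that the N rows of "data" are the samples being correlated, which gives an N x N result, and that no row is constant. The helpers `transpose`, `matmul`, and `correlation_matrix` are names chosen for the example. In an array language, each helper would be a single built-in operation.

```python
import math


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, column)) for column in b_columns]
            for row in a]


def correlation_matrix(data):
    """Correlate the N rows of an N x M matrix; returns an N x N matrix."""
    m = len(data[0])
    # Center every sample (row) on its own mean.
    centered = []
    for row in data:
        mean = sum(row) / m
        centered.append([value - mean for value in row])
    # Covariance of all row pairs from one matrix product: C = Xc * Xc' / (M - 1).
    product = matmul(centered, transpose(centered))
    cov = [[value / (m - 1) for value in row] for row in product]
    # Normalize by the standard deviations found on the diagonal of C.
    std = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    n = len(cov)
    return [[cov[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]
```

The whole result is produced by a handful of matrix-level operations with no loop over pairs. That is the source of the speed, and also the reason the centered data, the product, and the result must all fit in memory together.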
Using Star-P, it is possible to overcome the scaling limits of the desktop. Without
Star-P, the correlation for larger data sizes would have to be computed using FOR loops,
without vectorization, taking dramatically longer to complete.
This MATLAB® code segment can run serially on the desktop or, without any
changes, in parallel in the Star-P with MATLAB® environment.
The only step needed to run this computation in parallel in Star-P is to distribute
the "data" matrix on the server, using one of the simple methods provided by Star-P,
such as parallel read from file. Then all operations are automatically parallel.
Summary & Metrics
The National Cancer Institute (NCI) succeeded in migrating from the previously used, but no
longer sufficient, serial computation in MATLAB® to parallel computation in the combined
Star-P + MATLAB® environment. As a result, NCI researchers obtain their results up to
200 times faster than previously possible.
With a more powerful parallel system at their disposal, researchers may also try even
more complex searches that previously weren't an option. For example, the group has
estimated that, using today's database, the largest potential correlation, with a data
matrix of 100,000 by 100,000, would require more than 256 GB of memory to solve. Star-P
fundamentally transforms the workflow, giving researchers the ability to run more samples
and approach problems differently than they would have before.
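The scale of these memory figures can be checked with simple arithmetic. The sketch below assumes dense matrices of 8-byte double-precision values (MATLAB's default numeric type) and decimal gigabytes. The helper names are made up for the example, and the source does not break down how the group arrived at its 256 GB estimate.

```python
def matrix_bytes(rows, cols, bytes_per_value=8):
    """Bytes needed to hold a dense rows x cols matrix of fixed-size values."""
    return rows * cols * bytes_per_value


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


largest = matrix_bytes(100_000, 100_000)  # the 100,000 x 100,000 case
probes = matrix_bytes(40_000, 40_000)     # pairwise results for 40,000 probes

print(f"100,000 x 100,000 matrix: {to_gb(largest):.0f} GB")   # 80 GB
print(f"40,000 x 40,000 matrix:   {to_gb(probes):.1f} GB")    # 12.8 GB
print(f"fits in a 2 GB desktop:   {probes <= 2 * 10**9}")     # False
```

Under these assumptions a single 100,000 x 100,000 matrix already takes 80 GB, so a computation that holds the input, the result, and temporary arrays of similar size at once would plausibly exceed 256 GB. Even the result matrix for 40,000 probes is several times larger than the 2 GB desktop used in the serial runs.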
The generic example code segment was run both serially and in parallel to
illustrate the impact of using Star-P. Serially, it ran on a single-processor Pentium-D
desktop with 2 GB of memory. In parallel, it ran in a Star-P client-server configuration, with
the client running on a single-processor Pentium-D desktop and the Star-P server running on an
SMP server with four dual-core Opteron processors (8 cores total) and 32 GB of memory.
- 200X speedup on an 8-core server
- Ability to process larger data sets (>256 GB) from desktop MATLAB®
- Transformation of research workflow