Science-Driven Systems
NERSC deploys five major types of hardware systems—computational systems, filesystems, storage systems, networks, and analytics and visualization systems—all of which must be carefully designed and integrated to maximize productivity for the entire computational community supported by the DOE Office of Science. In 2005 NERSC implemented major upgrades or improvements in all five categories, as described below.
System specifications must meet requirements that are constantly evolving because of technological progress and the changing needs of scientists. Therefore, NERSC is constantly planning for the future, not just by tracking trends in science and technology and planning new system procurements, but also by actively influencing the direction of technological development through efforts such as the Science-Driven System Architecture collaborations.
Two New Clusters: Jacquard and Bassi
In August 2005 NERSC accepted a 722-processor Linux Networx Evolocity cluster system named “Jacquard” for full production use (Figure 7). The acceptance test included a 14-day availability test, during which a select group of NERSC users were given full access to the Jacquard cluster to thoroughly test the entire system in production operation. Jacquard achieved 99 percent availability during the test period while these users ran a variety of codes and jobs on the system.
Figure 7. Jacquard is a 722-processor Linux Networx Evolocity cluster system with a theoretical peak performance of 2.8 teraflop/s.
The Jacquard system is one of the largest production InfiniBand-based Linux cluster systems and met rigorous acceptance criteria for performance, reliability, and functionality that are unprecedented for an InfiniBand cluster. Jacquard is the first system to deploy Mellanox 12x InfiniBand uplinks in its fat-tree interconnect, reducing network hot spots and improving reliability by dramatically reducing the number of cables required.
Of the system’s 2.2 GHz AMD Opteron processors, 640 are devoted to computation, with the rest used for I/O, interactive work, testing, and interconnect management. Jacquard has a theoretical peak performance of 2.8 teraflop/s. A DataDirect Networks storage system provides 30 TB of globally available formatted disk.
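The peak figure is consistent with the processor count and clock rate if one assumes the Opteron’s nominal two floating-point operations per clock cycle (a standard figure for this processor, though not stated here):

    640 processors × 2.2 GHz × 2 flops/cycle ≈ 2,816 gigaflop/s ≈ 2.8 teraflop/s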
Following the tradition at NERSC, the system was named for someone who has had an impact on science or computing. In 1801, Joseph-Marie Jacquard invented the Jacquard loom, which was the first programmable machine. The Jacquard loom used punched cards and a control unit that allowed a skilled user to program detailed patterns on the loom.
In January 2006, NERSC launched an 888-processor IBM cluster named “Bassi” into production use (Figure 8). Earlier, during the acceptance testing, users reported that codes ran from 3 to 10 times faster on Bassi than on NERSC’s other IBM supercomputer, Seaborg, leading one tester to call the system the “best machine I have seen.”
Figure 8. Bassi is an 888-processor IBM p575 POWER5 system with a theoretical peak performance of 6.7 teraflop/s.
Bassi is an IBM p575 POWER5 system, and each processor has a theoretical peak performance of 7.6 gigaflop/s. The processors are distributed among 111 compute nodes with eight processors per node, and the processors on each node share a 32 GB pool of memory, making each Bassi node a shared-memory multiprocessor (SMP).
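As a quick illustrative check, the system’s aggregate peak follows directly from these per-processor numbers:

    111 nodes × 8 processors × 7.6 gigaflop/s ≈ 6,749 gigaflop/s ≈ 6.7 teraflop/s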
The compute nodes are connected to each other with a high-bandwidth, low-latency switching network. Each node runs its own full instance of the standard AIX operating system. The disk storage system is a distributed, parallel I/O system called GPFS (IBM’s General Parallel File System). Additional nodes serve exclusively as GPFS servers. Bassi’s network switch is the IBM “Federation” HPS switch, which is connected to a two-link network adapter on each node.
One of the test users for NERSC’s two new clusters was Robert Duke of the University of North Carolina, Chapel Hill, the author of the PMEMD code, which is the parallel workhorse in modern versions of the popular chemistry code AMBER. PMEMD is widely used for molecular dynamics simulations and is also part of NERSC’s benchmark applications suite. Duke has worked with NERSC’s David Skinner to port and improve the performance of PMEMD on NERSC systems.
“I have to say that both of these machines are really nothing short of fabulous,” Duke wrote to Skinner. “While Jacquard is perhaps the best-performing commodity cluster I have seen, Bassi is the best machine I have seen, period.”
Other early users during the acceptance testing included the INCITE project team “Direct Numerical Simulation of Turbulent Nonpremixed Combustion.” “Our project required a very long stretch of using a large fraction of Bassi processors—512 processors for essentially an entire month,” recounted Evatt Hawkes. “During this period we experienced only a few minor problems, which is exceptional for a pre-production machine, and enabled us to complete our project against a tight deadline. We were very impressed with the reliability of the machine.”
Hawkes noted that their code also ported quickly to Bassi, starting with a code already ported to Seaborg’s architecture. “Bassi performs very well for our code. With Bassi’s faster processors we were able to run on far fewer processors (512 on Bassi as opposed to 4,096 on Seaborg) and still complete the simulations more rapidly,” Hawkes added. “Based on scalar tests, it is approximately 7 times faster than Seaborg and 1½ times faster than a 2.0 GHz Opteron processor. Also, the parallel efficiency is very good. In a weak scaling test, we obtained approximately 78 percent parallel efficiency using 768 processors, compared with about 70 percent on Seaborg.”
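For reference, the weak-scaling parallel efficiency cited here is conventionally computed as

    E(N) = T(baseline) / T(N),

where the problem size grows in proportion to the processor count N and T is wall-clock time (this is the standard definition; Hawkes does not spell out his exact baseline). An efficiency of 78 percent on 768 processors thus means the run took roughly 1.3 times (1/0.78) as long as it would under ideal scaling.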
The machine is named in honor of Laura Bassi, a noted Newtonian physicist of the eighteenth century. Appointed a professor at the University of Bologna in 1731, Bassi was the first woman to officially teach at a European university.
New Visual Analytics Server: DaVinci
In mid-August, NERSC put into production a new server specifically tailored to data-intensive visualization and analysis. The 32-processor SGI Altix, called DaVinci (Figure 9), offers interactive access to large amounts of memory and high performance I/O capabilities well suited for analyzing large-scale data produced by the NERSC high performance computing systems (Bassi, Jacquard, and Seaborg).
With its 192 gigabytes (GB) of RAM and 25 terabytes (TB) of disk, DaVinci’s system balance is biased toward memory and I/O, which is different from the other systems at NERSC. This balance favors data-intensive analysis and interactive visualization. DaVinci has 6 GB of memory per processor, compared to 4 GB per processor on Jacquard and Bassi and 1 GB on Seaborg.
Users can obtain interactive access to 80 GB of memory from a single application (or all 192 GB of memory by prior arrangement), whereas the interactive limits on production NERSC supercomputing systems restrict interactive tasks to a smaller amount of memory (256 MB on login nodes). While DaVinci is available primarily for interactive use, the system is also configured to run batch jobs, especially those jobs that are data intensive.
Figure 9. DaVinci is a 32-processor SGI Altix with 6 GB of memory per processor and 25 TB of disk memory, a configuration designed for data-intensive analysis and interactive visualization.
The new server runs a number of visualization, statistics, and mathematics applications, including IDL, AVS/Express, CEI EnSight, VisIt (a parallel visualization application from Lawrence Livermore National Laboratory), Maple, Mathematica, and MATLAB. Many users depend on IDL and MATLAB to process or reorganize data in preparation for visualization. The large memory is particularly beneficial for these types of jobs.
DaVinci is connected to the NERSC Global Filesystem (see below), High Performance Storage System (HPSS), and ESnet networks by two independent 10 gigabit Ethernet connections.
With DaVinci now in production, NERSC has retired the previous visualization server, Escher, and the math server, Newton.
NERSC Global Filesystem
In early 2006, NERSC deployed the NERSC Global Filesystem (NGF) into production, providing seamless data access from all of the Center’s computational and analysis resources. NGF is intended to facilitate sharing of data between users and/or machines. For example, if a project has multiple users who must all access a common set of data files, NGF provides a common area for those files. Alternatively, when sharing data between machines, NGF eliminates the need to copy large datasets from one machine to another. For example, because NGF has a single unified namespace, a user can run a highly parallel simulation on Seaborg, followed by a serial or modestly parallel post-processing step on Jacquard, and then perform a data analysis or visualization step on DaVinci—all without having to explicitly move a single data file.
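A short sketch illustrates the pattern. The path, file names, and analysis step below are hypothetical stand-ins; the point is simply that the same NGF path is visible from Seaborg, Jacquard, and DaVinci, so the data never needs to be copied between machines.

    # Hypothetical post-processing step run on Jacquard, reading simulation
    # output that a Seaborg job wrote to a shared NGF project directory.
    # The same path would also be visible to a DaVinci visualization session.
    import glob
    import numpy as np

    NGF_RUN_DIR = "/project/myproject/run42"   # hypothetical NGF project path

    def average_field(pattern):
        """Average a scalar field over all simulation dumps in the run directory."""
        total, count = None, 0
        for fname in sorted(glob.glob(f"{NGF_RUN_DIR}/{pattern}")):
            data = np.load(fname)              # e.g., a density field from one dump
            total = data if total is None else total + data
            count += 1
        return total / count if count else None

    if __name__ == "__main__":
        mean_density = average_field("density_*.npy")
        if mean_density is not None:
            # Write the reduced result back to NGF, where a DaVinci session can
            # pick it up for visualization without any explicit file transfer.
            np.save(f"{NGF_RUN_DIR}/density_mean.npy", mean_density)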
NGF’s single unified namespace makes it easier for users to manage their data across multiple systems (Figure 10). Users no longer need to keep track of multiple copies of programs and data, and they no longer need to copy data between NERSC systems for pre- and post-processing. NGF provides several other benefits as well: storage utilization is more efficient because of decreased fragmentation; computational resource utilization is more efficient because users can more easily run jobs on an appropriate resource; NGF provides improved methods of backing up user data; and NGF improves system security by eliminating the need for collaborators to use “group” or “world” permissions.
“NGF stitches all of our systems together,” said Greg Butler, leader of the NGF project. “When you go from system to system, your data is just there. Users don’t have to manually move their data or keep track of it. They can now see their data simultaneously and access the data simultaneously.”
Figure 10. NGF is the first production global filesystem spanning five platforms (Seaborg, Jacquard, PDSF, DaVinci and Bassi), three architectures, and four different vendors.
NERSC staff began adding NGF to computing systems in October 2005, starting with the DaVinci visualization cluster and finishing with the Seaborg system in December. To help test the system before it entered production, a number of NERSC users were given preproduction access to NGF. Early users helped identify problems with NGF so they could be addressed before the filesystem was made available to the general user community.
“I have been using the NGF for some time now, and it’s made my work a lot easier on the NERSC systems,” said Martin White, a physicist at Berkeley Lab. “I have at times accessed files on NGF from all three compute platforms (Seaborg, Jacquard, and Bassi) semi-simultaneously.”
NGF also makes it easier for members of collaborative groups to access data, and it helps ensure data consistency by eliminating multiple copies of critical data. Christian Ott, a Ph.D. student and member of a team studying core-collapse supernovae, wrote that “the project directories make our collaboration much more efficient. We can now easily look at the output of the runs managed by other team members and monitor their progress. We are also sharing standard input data for our simulations.”
NERSC General Manager Bill Kramer said that as far as he knows, NGF is the first production global filesystem spanning five platforms (Seaborg, Bassi, Jacquard, DaVinci, and PDSF), three architectures, and four different vendors. While other centers and distributed computing projects such as the National Science Foundation’s TeraGrid may also have shared filesystems, Butler said he thinks NGF is unique in its heterogeneity.
The heterogeneous approach of NGF is a key component of NERSC’s five-year plan. This approach is important because NERSC typically procures a major new computational system every three years, then operates it for five years to support DOE research. Consequently, NERSC operates in a heterogeneous environment with systems from multiple vendors, multiple platforms, different system architectures, and multiple operating systems. The deployed filesystem must operate in the same heterogeneous client environment throughout its lifetime.
Butler noted that the project, which is based on IBM’s proven GPFS technology (in which NERSC was a research partner), started about five years ago. While the computing systems, storage, and interconnects were mostly in place, deploying a shared filesystem among all the resources was a major step beyond a parallel filesystem, and there were different system architectures and different operating systems to contend with. Despite these challenges, the last servers and storage have now been deployed. To keep everything running and ensure a graceful shutdown in the event of a power outage, a large uninterruptible power supply has been installed in the basement of the Oakland Scientific Facility.
While NGF is a significant change for NERSC users, it also “fundamentally changes the Center in terms of our perspective,” Butler said. For example, when the staff needs to do maintenance on the filesystem, the various groups need to coordinate their efforts and take all the systems down at once.
Storage servers, accessing the consolidated storage using the shared-disk filesystems, provide hierarchical storage management, backup, and archival services. The first phase of NGF is focused on function rather than raw performance, but to be effective, NGF must perform comparably to native cluster filesystems. The current capacity of NGF is approximately 70 TB of user-accessible storage and 50 million inodes (the data structures for individual files). Default project quotas are 1 TB and 250,000 inodes. The system sustains a bandwidth of 3 GB/sec for streaming I/O, although actual performance for user applications will depend on a variety of factors. Because NGF is a distributed network filesystem, its performance will be slightly lower than that of filesystems local to the NERSC compute platforms, but this should only be an issue for applications whose performance is I/O bound.
NGF will grow in both capacity and bandwidth over the next several years, eventually replacing or dwarfing the amount of local storage on systems. NERSC is also working to seamlessly integrate NGF with the HPSS data archive to create much larger “virtual” data storage for projects. Once NGF is completely operational within the NERSC facility, Butler said, users at other centers, such as the National Center for Atmospheric Research and NASA Ames Research Center, could be allowed to remotely access the NERSC filesystem, allowing users to read and visualize data without having to execute file transfers. Eventually, the same capability could be extended to experimental research sites, such as accelerator labs.
NGF was made possible by IBM’s decision to make its GPFS software available across mixed-vendor supercomputing systems. This strategy was a direct result of IBM’s collaboration with NERSC. “Thank you for driving us in this direction,” wrote IBM Federal Client Executive Mike Henesy to NERSC General Manager Bill Kramer when IBM announced the project in December 2005. “It’s quite clear we would never have reached this point without your leadership!”
NERSC’s Mass Storage Group collaborated with IBM and the San Diego Supercomputer Center to develop a Hierarchical Storage Manager (HSM) that can be used with IBM’s GPFS. The HSM capability provides a GPFS filesystem that is transparent to users and fully backed up to, and recoverable from, NERSC’s multi-petabyte HPSS archive. GPFS and HPSS are both cluster storage systems: GPFS is a shared-disk filesystem, while HPSS supports both disk and tape, moving less-used data to tape while keeping current data on disk.
One of the key capabilities of the GPFS/HPSS HSM is that users’ files are automatically backed up to HPSS as they are created. Additionally, files in GPFS that have not been accessed for a specified period of time are automatically migrated off online storage as space is needed for files currently in use. Since the purged files are already backed up in HPSS, they are retrieved automatically when needed, and users do not need to know where the files are stored to access them. “This gives the user the appearance of almost unlimited disk storage space without the cost,” said NERSC’s Mass Storage Group Leader Nancy Meyer.
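The migrate-and-purge policy can be sketched conceptually as follows. This is only an illustration of the idea, not the actual GPFS/HPSS HSM code; the paths, threshold, and archive checks are hypothetical stubs.

    # Conceptual sketch of the migrate-and-purge policy described above.
    import os
    import time

    NGF_ROOT = "/project/myproject"   # hypothetical project directory
    PURGE_AGE_DAYS = 90               # hypothetical "not recently accessed" threshold
    SECONDS_PER_DAY = 86400.0

    def archive_copy_exists(path):
        """Stub: in the real system, HPSS already holds a backup made at creation time."""
        return True

    def release_online_copy(path):
        """Stub: the real HSM frees the disk blocks but keeps the file visible,
        so a later access triggers a transparent recall from the archive."""
        print("would purge online copy of", path)

    def purge_candidates(root, now=None):
        """Yield files whose online copies could be released to free disk space."""
        now = now or time.time()
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                idle_days = (now - os.stat(path).st_atime) / SECONDS_PER_DAY
                if idle_days > PURGE_AGE_DAYS and archive_copy_exists(path):
                    yield path

    if __name__ == "__main__":
        for path in purge_candidates(NGF_ROOT):
            release_online_copy(path)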
This capability was demonstrated in the Berkeley Lab and IBM booths at the SC05 conference. Bob Coyne of IBM, the industry co-chair of the HPSS executive committee, said, “There are at least ten institutions at SC05 who are both HPSS and GPFS users, many with over a petabyte of data, who have expressed interest in this capability. HPSS/GPFS will not only serve these existing users but will be an important step in simplifying the storage tools of the largest supercomputer centers and making them available to research institutions, universities, and commercial users.”
“Globally accessible data is becoming the most important part of Grid computing,” said Phil Andrews of the San Diego Supercomputer Center. “The immense quantity of information demands full vertical integration from a transparent user interface via a high performance filesystem to an enormously capable archival manager. The integration of HPSS and GPFS closes the gap between the long-term archival storage and the ultra high performance user access mechanisms.”
The GPFS/HPSS HSM will be included in the release of HPSS 6.2 in spring 2006.
Integrating HPSS into Grids
NERSC’s Mass Storage Group is currently involved in another development collaboration, this one with Argonne National Laboratory and IBM, to integrate HPSS accessibility into the Globus Toolkit for Grid applications.
At Argonne, researchers are adding functionality to the Grid file transfer daemon so that the appropriate class of service can be requested from HPSS. IBM is contributing an easy-to-call library of parallel I/O routines that work with HPSS structures and are straightforward to integrate into the file transfer daemon. This library will ensure that Grid file transfer requests to HPSS movers are handled correctly.
NERSC is providing the HPSS platform and testbed system for IBM and Argonne to do their respective development projects. As pieces are completed, NERSC tests the components and works with the developers to help identify and resolve problems.
The public release of this capability is scheduled for HPSS 6.2 and for future releases of the Globus Toolkit.
Bay Area MAN Inaugurated
On August 23, 2005, the NERSC Center became the first of six DOE research sites to go into full production on the Energy Sciences Network’s (ESnet’s) new San Francisco Bay Area Metropolitan Area Network (MAN). The new MAN provides dual connectivity at 20 to 30 gigabits per second (10 to 50 times the previous site bandwidths, depending on the site using the ring) while significantly reducing the overall cost.
The connection to NERSC consists of two 10-gigabit Ethernet links. One link is used for production scientific computing traffic, while the second is dedicated to special networking needs, such as moving terabyte-scale datasets between research sites or transferring large datasets which are not TCP-friendly.
“What this means is that NERSC is now connected to ESnet at the same speed as ESnet’s backbone network,” said ESnet engineer Eli Dart.
The new architecture is designed to meet the increasing demand for network bandwidth and advanced network services as next-generation scientific instruments and supercomputers come on line. Through a contract with Qwest Communications, the San Francisco Bay Area MAN provides dual connectivity to six DOE sites—the Stanford Linear Accelerator Center, Lawrence Berkeley National Laboratory, the Joint Genome Institute, NERSC, Lawrence Livermore National Laboratory, and Sandia National Laboratories/California (Figure 11). The MAN also provides high-speed access to California’s higher education network (CENIC), NASA’s Ames Research Center, and DOE’s R&D network, Ultra Science Net. The Bay Area MAN connects to both the existing ESnet production backbone and the first segments of the new Science Data Network backbone.
The connection between the MAN and NERSC was formally inaugurated on June 24 by DOE Office of Science Director Raymond Orbach and Berkeley Lab Director Steven Chu (Figure 12).
Figure 11. ESnet’s new San Francisco Bay Area Metropolitan Area Network provides dual connectivity at 20 to 30 gigabits per second to six DOE sites and NASA Ames Research Center.
Figure 12. DOE Office of Science Director Raymond Orbach (left) and Berkeley Lab Director Steven Chu made the ceremonial connection between NERSC and ESnet in June. After testing, the full production connection was launched in August.
Another Checkpoint/Restart Milestone
On the weekend of June 11 and 12, 2005, IBM personnel used NERSC’s Seaborg supercomputer for dedicated testing of IBM’s latest HPC Software Stack, a set of tools for high performance computing. To maximize system utilization for NERSC users, instead of “draining” the system (letting running jobs continue to completion) before starting this dedicated testing, NERSC staff checkpointed all running jobs at the start of the testing period. “Checkpointing” means stopping a program in progress and saving the current state of the program and its data—in effect, “bookmarking” where the program left off so it can start up later in exactly the same place.
This is believed to be the first full-scale use of the checkpoint/restart software with an actual production workload on an IBM SP, as well as the first checkpoint/restart on a system with more than 2,000 processors. It is the culmination of a collaborative effort between NERSC and IBM that began in 1999. Of the 44 jobs that were checkpointed, approximately 65% checkpointed successfully. Of the 15 jobs that did not checkpoint successfully, only 7 jobs were deleted from the queuing system, while the rest were requeued to run again at a later time. This test enabled NERSC and IBM staff to identify some previously undetected problems with the checkpoint/restart software, and they are now working to fix those problems.
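The checkpointing concept described above can be illustrated with a simple application-level sketch. The Seaborg capability operates at the system level on entire parallel jobs; this sketch, with hypothetical file and variable names, only shows the basic “save state, resume where you left off” pattern, not the IBM software itself.

    # Application-level illustration of checkpoint/restart.
    import os
    import pickle

    CHECKPOINT_FILE = "state.ckpt"    # hypothetical checkpoint file

    def save_checkpoint(state):
        """Save the current state of the run, bookmarking where it left off."""
        with open(CHECKPOINT_FILE, "wb") as f:
            pickle.dump(state, f)

    def load_checkpoint():
        """Resume from the last saved state, or start fresh if no checkpoint exists."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "total": 0.0}

    if __name__ == "__main__":
        state = load_checkpoint()
        for step in range(state["step"], 1000):
            state["total"] += step * 0.001     # stand-in for real computation
            state["step"] = step + 1
            if state["step"] % 100 == 0:
                save_checkpoint(state)         # periodic bookmark
        print("final total:", state["total"])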
In 1997 NERSC made history by being the first computing center to achieve successful checkpoint/restart on a massively parallel system, the Cray T3E.
Science-Driven System Architecture
The creation of NERSC’s Science-Driven System Architecture (SDSA) Team formalizes an ongoing effort to monitor and influence the direction of technology development for the benefit of computational science. NERSC staff are collaborating with scientists and computer vendors to refine computer systems under current or future development so that they will provide excellent sustained performance per dollar for the broadest possible range of large-scale scientific applications.
While the goal of SDSA may seem ambitious, the actual work that promotes that goal deals with the nitty-gritty of scientific computing—for example, why a particular algorithm performs well on one system but poorly on another—at a level of detail that some people might find tedious or overwhelming, but which the SDSA team finds fascinating and challenging.
“All of our architectural problems would be solvable if money were no object,” said SDSA Team Leader John Shalf, “but that’s never the case, so we have to collaborate with the vendors in a continuous, iterative fashion to work towards more efficient and cost-effective solutions. We’re not improving performance for its own sake, but we are improving system effectiveness.”
Much of the SDSA work involves performance analysis: how fast do various scientific codes run on different systems, how well do they scale to hundreds or thousands of processors, what kinds of bottlenecks can slow them down, and how can performance be improved through hardware development. A solid base of performance data is particularly useful when combined with workload analysis, which considers what codes and algorithms are common to NERSC’s diverse scientific workload. These two sets of data lay a foundation for assessing how that workload would perform on alternative system architectures. Current architectures may be directly analyzed, while future architectures may be tested through simulations or predictive models.
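As an illustration of the kind of low-level probe such performance analysis relies on, the following sketch estimates effective memory bandwidth with a STREAM-style triad. It is a generic example, not one of NERSC’s SSP or NERSC-5 benchmark codes.

    # Illustrative STREAM-style "triad" memory bandwidth probe.
    import time
    import numpy as np

    def triad_bandwidth(n=10_000_000, trials=5):
        """Estimate effective memory bandwidth (GB/s) of a = b + scalar * c."""
        b = np.random.rand(n)
        c = np.random.rand(n)
        scalar = 3.0
        best = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            a = b + scalar * c
            best = min(best, time.perf_counter() - start)
        # STREAM-style accounting: read b, read c, write a (8-byte doubles);
        # NumPy temporaries add some extra traffic not counted here.
        return 3 * n * 8 / best / 1e9

    if __name__ == "__main__":
        print("triad bandwidth: %.2f GB/s" % triad_bandwidth())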
The SDSA Team is investigating a number of different performance modeling frameworks, such as the San Diego Supercomputer Center’s Memory Access Pattern Signature (MAPS), in order to assess their accuracy in predicting performance for the NERSC workload. SDSA team members are working closely with San Diego’s Performance Modeling and Characterization Laboratory to model the performance of the NERSC-5 SSP benchmarks and compare the performance predictions to the benchmark results collected on existing and proposed HPC systems.
Another important part of the SDSA team’s work is sharing performance and workload data, along with benchmarking and performance monitoring codes, with others in the HPC community. Benchmarking suites, containing application codes or their algorithmic kernels, are widely used for system assessment and procurement. NERSC has recently shared its SSP benchmarking suite with National Science Foundation (NSF) computer centers. With the Defense Department’s HPC Modernization Program, NERSC has shared benchmarks and jointly developed a new one.
Seemingly mundane activities like these can have an important cumulative impact: as more research institutions set specific goals for application performance in their system procurement specifications, HPC vendors have to respond by offering systems that are specifically designed and tuned to meet the needs of scientists and engineers, rather than proposing strictly off-the-shelf systems. By working together and sharing performance data with NERSC and other computer centers, the vendors can improve their competitive position in future HPC procurements, refining their system designs to redress any architectural bottlenecks discovered through the iterative process of benchmarking and performance modeling. The end result is systems better suited for scientific applications and a better-defined niche market for scientific computing that is distinct from the business and commercial HPC market.
The SDSA Team also collaborates on research projects in HPC architecture. One key project, in which NERSC is collaborating with Berkeley Lab’s Computational Research Division and computer vendors, is ViVA, or Virtual Vector Architecture. The ViVA concept involves hardware and software enhancements that would coordinate a set of commodity scalar processors to function like a single, more powerful vector processor. ViVA would enable much faster performance for certain types of widely used scientific algorithms, but without the high cost of specialized processors. The research is proceeding in phases. ViVA-1 is focused on a fast synchronization register to coordinate processors on a node or multicore chip. ViVA-2 is investigating a vector register set that hides latency to memory using vector-like semantics. Benchmark scientific kernels are being run on an architectural simulator with ViVA enhancements to assess the effectiveness of those enhancements.
Perhaps the most ambitious HPC research project currently under way is the Defense Advanced Research Projects Agency’s (DARPA’s) High Productivity Computer Systems (HPCS) program. HPCS aims to develop a new generation of hardware and software technologies that will take supercomputing to the petascale level and increase overall system productivity ten-fold by the end of this decade. NERSC is one of several “mission partners” participating in the review of proposals and milestones for this project.
Proposals for New System Evaluated
As part of NERSC’s regular computational system acquisition cycle, the NERSC-5 procurement team was formed in October 2004 to develop an acquisition plan, select and test benchmarks, and prepare a request for proposals (RFP). The RFP was released in September 2005; proposals were submitted in November and are currently being evaluated. The RFP set the following general goals for the NERSC-5 system:
- Support the entire NERSC workload, specifically addressing the DOE Greenbook recommendations.
- Integrate with the NERSC environment, including the NERSC Global Filesystem, HPSS, Grid software, security and networking systems, and the user environment (software tools).
- Provide the optimal balance of the following system components:
- computational: CPU speed, memory bandwidth, and latency
- memory: aggregate and per parallel task
- global disk storage: capacity and bandwidth