The persistent digitization of business and social life is generating unprecedented demand for compute resources. Expressed in terms of data growth, we have entered the “zettabyte” era, with only one prefix (the yotta) left available to measure the exponential growth of data that is now underway. To respond to new demand for data services, many organizations have chosen to access additional compute capacity through the development of cloud efficiencies, and via the use of public cloud services offered by third-party service providers. But this transition is not yet complete: for the foreseeable future, analyst firms, including InsightaaS, envision a hybrid deployment model combining on-premise and cloud-based service delivery, which in turn requires that businesses work to optimize on-premise data centre service delivery. As a source of guidance in this enterprise, many organizations look to data centre innovators – to the hyperscale Internet companies and to supercomputer clusters – for inspiration on design, platform and operational techniques. But can the wisdom of the hyperscaler and the supercomputer translate to the enterprise data centre, and in what ways are their computing models different? The recent update of SciNet’s supercomputer offers a fascinating window into the business and technical decisions that are driving the deployment of infrastructure that can support large, compute-intensive workloads.
Dubbed “Canada’s most powerful research supercomputer” and the “fastest supercomputer in the country,” Niagara was jointly funded by the Canada Foundation for Innovation, the Government of Ontario, and the University of Toronto, and is operated by the university’s high-performance computing division, SciNet. Together with Cedar at Simon Fraser University, Arbutus at the University of Victoria, and Graham at the University of Waterloo, Niagara forms part of a national advanced research computing infrastructure that Canadian researchers who rely on high-performance computing can access to manage investigations involving Big Data.
Niagara represents a $17 million new systems upgrade in a facility that was built back in 2008 by the IBM Global Services team. Located in Vaughan, Ontario, where real estate costs are more competitive than in Toronto centre, the data centre is housed in a repurposed industrial building that is serviced by its own transformer (for constant power) and by two 10 gigabit per second Ethernet fibre links that connect to research users at U of T and York University; these links will soon be upgraded to 100 gigabits per second. The facility is water cooled through an open loop system that relies on evaporative techniques and brings cooling to rack doors – even air handlers run off water, hence the facility has no need for large, energy-intensive CRAC units. While initial deployment of the water cooling system was not inexpensive, according to SciNet hardware operations and applications analyst Scott Northrup, use of cold winter temperatures to cool the water means reduced operating costs, resulting in a lower TCO than would be the case with other cooling technologies. In 2008, facility design and infrastructure produced a PUE of 1.18 for the data centre when running a set point temperature of 16 degrees C on the data centre’s primary cooling loop; this temperature may be raised once operators have had an opportunity to evaluate the performance of new equipment.
To prepare for the Niagara equipment upgrade, SciNet decommissioned its original supercomputer TCS, a 104 node, 3328 core system built on IBM’s POWER6 server technology, and half of its GPC, a 30,000 core IBM iDataPlex system. The Niagara cluster is an end-to-end Lenovo solution featuring 1,500 ultra-dense ThinkSystem SD530 compute nodes (servers) that provide more than three petaflops of processing power. Each node contains 40 cores that share the node’s memory, for a total of 60,000 cores, and the cluster is configured in a way that allows simultaneous computation on all 60,000 cores. Each node has 192 gigabytes of RAM (288 TB in total available for use during calculations), and the cluster has 12 PB of disk space for storing data from calculations, a total of 12,000,000 GB of storage. The new system is also very power efficient: while TCS consumed 400kW of power to deliver a speed of 70TF and GPC consumed approximately 1000kW for 360TF, Niagara will use 650kW to achieve ten times the performance (4.3PF) and a fifteen-fold increase in energy efficiency.
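The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below recomputes the core, memory and storage totals and the performance-per-kilowatt ratios from the numbers quoted above. It assumes decimal units (1 TB = 1,000 GB) and assumes that the baseline for comparison is TCS and the full GPC combined. The article does not say how the fifteen-fold figure was derived, so the gain is shown for both the 4.3PF figure and a round 3PF standing in for “more than three petaflops.”

```python
# Figures quoted in the article.
NODES = 1_500
CORES_PER_NODE = 40
RAM_PER_NODE_GB = 192
DISK_PB = 12

total_cores = NODES * CORES_PER_NODE            # 60,000 cores
total_ram_tb = NODES * RAM_PER_NODE_GB / 1_000  # 288 TB (assumes 1 TB = 1,000 GB)
disk_gb = DISK_PB * 1_000_000                   # 12,000,000 GB

# Assumed baseline: TCS plus the full GPC, as quoted (teraflops, kilowatts).
old_tf = 70 + 360      # 430 TF
old_kw = 400 + 1_000   # 1,400 kW
niagara_kw = 650


def tf_per_kw(teraflops, kilowatts):
    """Energy efficiency expressed as teraflops delivered per kilowatt."""
    return teraflops / kilowatts


speedup = 4_300 / old_tf  # 4.3 PF against 430 TF: ten times the performance

baseline_efficiency = tf_per_kw(old_tf, old_kw)
gain_at_4_3_pf = tf_per_kw(4_300, niagara_kw) / baseline_efficiency  # about 21.5
gain_at_3_pf = tf_per_kw(3_000, niagara_kw) / baseline_efficiency    # about 15.0

print(f"cores={total_cores}, RAM={total_ram_tb} TB, disk={disk_gb} GB")
print(f"speedup={speedup:.1f}x")
print(f"efficiency gain: {gain_at_4_3_pf:.1f}x at 4.3 PF, {gain_at_3_pf:.1f}x at 3 PF")
```

Under these assumptions the ten-fold performance figure follows directly from 4.3PF, while the fifteen-fold efficiency figure lines up with the 3PF figure; that pairing is an inference from the arithmetic, not something the article states.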
SciNet’s technology decision was based on an open bid RFP designed to maximize investment: “it’s all about maximizing how much infrastructure I can afford to run,” Northrup explained. “So within that context of afford to run, how do I get the biggest bang per dollar per unit?” The RFP was “classic HPC,” he added; 11 responders to the bid, including Cray, HPE SGI, Huawei, Super Micro, IBM and Lenovo, were given a set of scientific codes and asked to demonstrate how much work could be done with 200 nodes on their systems. Northrup described the evaluation, a performance metric that measures time or throughput on the specified workload, as “a normalized way [based on a five factor speed up] of presenting how much work could be done… we said, ‘we want you to maximize how much throughput and performance you can give us per dollar on the equipment using these codes’.” Approximately half of the RFP was based on these codes, and part of this evaluation was based on a power metric; in other words, how much power is used per unit of normalized performance – a watts per flop calculation.
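One way to picture a “performance per dollar” and “power per unit of performance” comparison is sketched below. This is an illustration only: the article does not give SciNet’s scoring formula, so the use of a geometric mean of speedups against a reference runtime is an assumption, and the vendor labels, benchmark names, runtimes, prices and power figures are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Bid:
    vendor: str        # hypothetical label, not a real bidder
    runtimes_s: dict   # benchmark name -> seconds on the 200-node test (hypothetical)
    price_usd: float   # hypothetical price of the tested configuration
    power_kw: float    # hypothetical power draw under load


# Hypothetical reference runtimes used to normalize each benchmark code.
REFERENCE_S = {"code_a": 1000.0, "code_b": 2000.0}


def normalized_performance(bid):
    """Geometric mean of speedups over the reference (an assumed scoring rule)."""
    ratios = [REFERENCE_S[name] / bid.runtimes_s[name] for name in REFERENCE_S]
    return math.prod(ratios) ** (1 / len(ratios))


def performance_per_dollar(bid):
    return normalized_performance(bid) / bid.price_usd


def kw_per_unit_performance(bid):
    """Power metric: kilowatts consumed per unit of normalized performance."""
    return bid.power_kw / normalized_performance(bid)


bids = [
    Bid("vendor_x", {"code_a": 500.0, "code_b": 1000.0}, 2_000_000.0, 100.0),
    Bid("vendor_y", {"code_a": 250.0, "code_b": 2000.0}, 1_500_000.0, 80.0),
]

best = max(bids, key=performance_per_dollar)
for bid in bids:
    print(bid.vendor, normalized_performance(bid), kw_per_unit_performance(bid))
print("best value:", best.vendor)
```

In this made-up example both bids reach the same normalized performance, so the cheaper and less power-hungry one scores better on both metrics. A real evaluation would also weigh the qualitative criteria described in the next paragraph.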
The ultimate decision for Lenovo was also based on a number of factors, including the company’s reversion to a smaller, standard form factor – a shorter, skinnier rack with doors that were compatible with existing cooling infrastructure. Beyond physical attributes and quantifiable metrics, the decision also rested on qualitative criteria, such as SciNet’s confidence in the company’s ability to execute, its experience with integrated systems implementations (as opposed to component implementations), timelines, and deployment plans. According to Northrup, the Lenovo win was based on performance as opposed to compatibility with existing infrastructure; however, the company’s ability to integrate with the brownfield facility’s electrical system delivered a cost advantage over vendor bids that would have had to include new electrical infrastructure. Since Lenovo essentially acquired IBM technology, its equipment could plug into circuits that were already there, enabling the reuse of power infrastructure. Companies like Cray, on the other hand, have their own dedicated power infrastructure and would have had to pay for new transformer and electrical work as part of the bid, leaving less budget for IT equipment.
Hyperscale vs supercomputer
In the hyperscale data centre, the use of commodity server equipment, open source software and cutting edge operational techniques is a recognized way to keep capital and operating costs low, and efficiency high. Describing the early days of supercomputing at U of T, Northrup explained: “In 1994 when they first started building Beowulf clusters, you took the cheapest hardware you could find and lashed this together because you didn’t want to pay the big mainframe costs. The whole industry is shifting to that approach now that even the foundation of hyperscalers, of cloud computing, is open source infrastructure. But the hyperscalers have taken this to the nth degree – if I don’t need that screw, I’ll build a motherboard that doesn’t have it. And I’ll run my own version of Linux because I don’t want that overhead. Google and Facebook have gone to that level in building open platforms – and this now even extends into hardware.”
However, this approach was not feasible for Niagara, which Northrup described as a “mid range” data centre that sits between the hyperscaler and the enterprise models. “When you are at the scale of Google or Facebook, and you start doing the TCO, it makes sense to create your own servers,” he explained. “When you go to Google, it’s a scale up; if you look at ODM [original design manufacturer], Google is in the top ten, and buys almost as many chips from Intel as Dell does. So why should they pay Dell? Even the DoE sites that pay IBM $290 million to build custom – if they save five cents on the cost of a server, it’s worth it for them.” But with 14 employees managing a one megawatt, 3,000 square foot facility, SciNet does not have the size or the staff to push to that level: “we do the middle ground, and still want the vendor to come in and provide a warranty,” Northrup explained. “We cross the line, making a usable system but from an operating perspective, and one that is relatively straightforward. In HPC, we’re pretty cheap about software, but we’ll pay for the hardware.”
For smaller clusters, vendor support may be even more critical in Northrup’s view: “if you’re only going to run one rack of equipment, you could buy it off the shelf from a white box vendor. But then you would have to do a lot more leg work, and you wouldn’t get an integrated solution.” Cray is a totally integrated platform, he added, but a Dell, Lenovo or HPE (Intel box) solution that has a few more integrated features, a few more nodes, reliability and warranties offers a middle ground alternative.
In its custom work, SciNet does not focus on hardware, but does leverage academic research to manage parallel processing. The Niagara nodes run the Linux operating system and an MPI (Message Passing Interface) library – a standard open source library for parallel scientific computing – as well as the open source Slurm workload management scheduler developed at the Lawrence Livermore National Lab. According to Northrup, a lot of the community code is open source, as is the cluster management code in HPC, because there are simply too many cores to pay for software by the seat. The SciNet team is well equipped to provide any support that is needed.
SciNet also uses the proprietary GPFS (General Parallel File System), now sold as IBM’s Spectrum Scale, a cluster file system that provides concurrent access to a single storage file system from multiple nodes. This enables administrators to mount the file system on all of the nodes at once, giving users access to all data in the same file system on every node at login. Delivered through license from IBM by Lenovo, the GPFS software runs on a server that resides in a remote, centralized storage appliance and sits on the InfiniBand network, supporting high-performance scale that would not be available through the more traditional NFS distributed file system protocol.
The secret sauce: DragonFly + InfiniBand
Niagara’s efficiency improvements over SciNet’s TCS and GPC may be attributed in large part to generational advances in server technologies that are available to adopters across industries. But what sets Niagara apart is the ability to process workloads across infrastructure simultaneously: parallel processing that is coordinated by the software noted above but, more importantly, enabled by high speed networking that delivers the speed and throughput needed to support data-intensive scientific research. Niagara’s network is based on a Dragonfly+ topology and uses fibre optic cable with high speed InfiniBand from Mellanox to connect nodes and the storage system. Built on adaptive routing and grouping that reduces the number of global (intercabinet) channels that a packet must traverse, Dragonfly topologies can produce higher speed and lower latency for less cost. In general, packing infrastructure close together delivers cost savings as less infrastructure – racks, power supplies, cooling loops – is needed to achieve the same performance. This packing principle is especially apparent in networking: “high speed cables are expensive and the longer they get, the more expensive they get,” Northrup observed. As an example, he noted that copper for shorter links is $100, but $800 when the links are longer, so as the system scales up, the goal is to have links be “dense and short.” For short links inside the rack, Niagara uses copper (electrically, copper cannot be used for distances over 3 metres), and for longer links relies on fast fibre optic InfiniBand. An open communications protocol, InfiniBand differs from Ethernet TCP/IP, which carries a lot of routing information vital to web services that must be interpreted by large switches at each stop along a packet’s journey.
In contrast, InfiniBand is a networking communications standard used for data interconnects among and within servers in the data centre; it relies on direct links to deliver high bandwidth and very low latency. Northrup estimates he will see a 10x improvement in latency and a 5x improvement in bandwidth by moving from DDR to EDR, the new generation of InfiniBand, which has increased the frequency with which messages are sent to improve switching rates.
“InfiniBand was originally developed at a time when the PCI bus was becoming an interconnect bottleneck,” he explained. “It uses a switched fabric network topology that allows computers to communicate directly memory to memory using remote direct memory access (RDMA). The network may not be as flexible as standard Ethernet as one cannot simply route it; however, with less protocol overhead than TCP/IP, the reduced latency is significant, allowing programs that need tight integration to scale to large numbers and run as fast as possible, making it the go-to in HPC environments.”
The Niagara upgrade means that SciNet has additional compute capacity and more space, as the new equipment has a smaller footprint. While the team is considering offering excess capacity to dealers for bursting by organizations outside the academic community, there are some issues that would have to be resolved. For example, the facility has many privacy controls in place, but is not really set up for compliance or industrial requirements. According to Northrup, the facility is secure at an academic research level, but not at a commercial level. So while SciNet does collaborate with other research organizations – for example, it has industrial research partnerships in the aerospace industry – it is not equipped to do clinical hospital research because that data would need to be anonymized.
SciNet’s real mandate, and the source of its funding, is support for academic researchers. Clients do not actually pay for capacity; rather, they apply to Compute Canada for access and once a year are granted an allocation. In cases where the researcher wants more than the default access, the request for additional hours is evaluated by a science panel. The SciNet team works to provide uptime; however, there is no contractual penalty for downtime as there typically is in commercial relationships, and the facility has single points of failure. As Northrup explained, maintenance is coordinated to minimize disruption, and the team devotes most of its resources to making high performance computing available for the whole year, without guaranteeing uptime.
Service delivery for this community has been consistent over several years: researchers log in to a development node, compile and build their codes, and submit a batch script specifying nodes and time; the script is then run and the results are returned. The traditional codes are written in old high-performance languages – C++ and Fortran – but SciNet also supports Python and Ruby, as well as science codes that have been developed in specific communities like electrodynamics, helping researchers to optimize these for scale out. According to Northrup, scientists put a lot of work into these codes and some of them have legacies that are 20 years old. Users of Niagara are graduate student level or up, and typically have more knowledge of computer systems than most users: “we expect them to know what they are doing,” Northrup added.
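To make the “submit a batch script, specifying nodes and time” step concrete, the sketch below generates the text of a minimal batch script for the Slurm scheduler mentioned earlier. It is an illustration under assumptions rather than SciNet’s actual template: the job name, executable name and time limit are hypothetical, and a real site may require additional directives. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`) and the `srun` launcher are standard Slurm; the default of 40 tasks per node reflects the 40 cores per node quoted in the article.

```python
def build_batch_script(job_name, nodes, walltime, executable, tasks_per_node=40):
    """Return the text of a minimal Slurm batch script.

    job_name and executable are hypothetical placeholders; tasks_per_node
    defaults to 40 to match the 40 cores per Niagara node.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        "",
        "# Launch one task per requested core across all allocated nodes.",
        f"srun {executable}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: two whole nodes (80 cores) for one hour.
script = build_batch_script(
    job_name="example_job",
    nodes=2,
    walltime="01:00:00",
    executable="./my_simulation",
)
print(script)
```

A researcher would typically save text like this to a file and hand it to the scheduler with Slurm’s `sbatch` command, then collect the results once the job has run.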
What this means is that Niagara is designed for large jobs processed at high speed – when you request a node you get all 40 cores – but operates without a lot of extraneous features. Northrup explained: “This system is bare metal, designed to build a code, run a thousand or 10 thousand processors with a fast, parallel file system – it’s brute force, just raw, raw processing. It is a low-overhead system that is not especially user friendly for the general-purpose user – there are no GUIs on the system – you log in and build your code, put it on line, and it will be back as fast as possible with no frills. It’s a stripped-down race car – there are no windows on it.”