Intel® Parallel Computing Center for Lustre | Efficient Computing and Storage

Intel®, as the leading processor manufacturer, recognizes that future performance gains will come through parallelism and the modernization of key applications, helping to make the next leap in discovery. Intel® therefore supports universities, institutions, and labs that are leaders in their fields in modernizing applications by establishing Intel® Parallel Computing Centers (Intel® PCCs).

Intel® and the Zentrum für Datenverarbeitung of the Johannes Gutenberg University opened the first Intel® Parallel Computing Center for Lustre (IPCC-L), which is dedicated to the modernization of the Lustre file system:

Lustre QoS: Network Request Scheduler and Monitoring Revisited

Lustre is a parallel file system used in many of the TOP500 HPC clusters. It has been designed to support a huge number of parallel applications running concurrently on these clusters. The load of each application is spread over many Object Storage Targets (OSTs), which serve as backend storage devices. Nevertheless, when the stripes of different applications overlap on the same OSTs, the available bandwidth of a Lustre environment can drop significantly.
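To illustrate how stripes of different files can pile up on the same OSTs, here is a minimal sketch. It assumes a simple round-robin placement starting at a given OST; the function names and the three example files are hypothetical and do not reflect Lustre's actual allocator.

```python
from collections import Counter


def stripe_targets(start_ost, stripe_count, num_osts):
    """Return the OST indices a file is striped over.

    Assumption: stripes are placed round-robin, starting at start_ost.
    """
    return [(start_ost + i) % num_osts for i in range(stripe_count)]


def ost_load(files, num_osts):
    """Count how many files place a stripe on each OST."""
    load = Counter()
    for start_ost, stripe_count in files:
        load.update(stripe_targets(start_ost, stripe_count, num_osts))
    return [load[i] for i in range(num_osts)]


# Three hypothetical files on an 8-OST system: (start OST, stripe count).
files = [(0, 4), (2, 4), (3, 2)]
load = ost_load(files, num_osts=8)
print(load)  # [1, 1, 2, 3, 2, 1, 0, 0]
```

In this toy layout, OST 3 serves stripes of all three files while OSTs 6 and 7 stay idle, so the slowest (most contended) target limits the bandwidth each file can reach.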

The Network Request Scheduler (NRS) was introduced into the Lustre mainline in version 2.4.0 to enable quality of service (QoS) options, which had previously been considered mostly in networking. Standard QoS strategies include the token bucket strategy, in which an average bandwidth is assigned to each client. However, even though the NRS can enforce priorities between different applications, it is currently unable to optimize the overall bandwidth delivered by the file system.
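The token bucket strategy mentioned above can be sketched generically as follows. This is a textbook token bucket, not the Lustre NRS implementation; the class name, parameters, and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
class TokenBucket:
    """Generic token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # average tokens (e.g. requests) per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # time of the last refill

    def try_consume(self, now, cost=1.0):
        """Refill for the elapsed time, then admit the request if tokens suffice."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One hypothetical client limited to 2 requests/s with a burst of 2.
bucket = TokenBucket(rate=2.0, capacity=2.0)
burst = [bucket.try_consume(now=0.0) for _ in range(3)]
later = bucket.try_consume(now=0.5)
print(burst, later)  # [True, True, False] True
```

The rate bounds the average throughput of the client, while the capacity bounds how large a burst it may send after an idle period.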

The “Lustre QoS” project will incorporate additional information to improve the quality of the NRS and to optimize overall bandwidth delivery. The main idea is to include information about the striping targets of each client in the token bucket strategy to ensure that no individual OST is overloaded. Additional approaches include reordering requests to fit the strategies of the object storage targets, and data layouts that minimize stripe overhead.
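One way to picture the main idea is an admission check that charges each request against both a per-client budget and a per-OST budget. The sketch below is purely illustrative and is not the project's design: the function `admit`, the one-token-per-request cost, and the fixed budgets per scheduling interval are all assumptions.

```python
def admit(requests, client_budget, ost_budget):
    """Admit requests only while both the client and the target OST have tokens.

    requests: list of (client, ost) tuples in arrival order, one token each.
    client_budget / ost_budget: tokens available in this scheduling interval.
    Returns (admitted, deferred).
    """
    client_left = dict(client_budget)  # copy so the caller's budgets stay intact
    ost_left = dict(ost_budget)
    admitted, deferred = [], []
    for client, ost in requests:
        if client_left.get(client, 0) > 0 and ost_left.get(ost, 0) > 0:
            client_left[client] -= 1
            ost_left[ost] -= 1
            admitted.append((client, ost))
        else:
            # Deferred requests would be retried in a later interval.
            deferred.append((client, ost))
    return admitted, deferred


# Two hypothetical clients whose stripes overlap on OST 0.
requests = [("A", 0), ("B", 0), ("A", 0), ("B", 1), ("A", 1)]
admitted, deferred = admit(
    requests,
    client_budget={"A": 2, "B": 2},
    ost_budget={0: 2, 1: 2},
)
print(admitted)  # [('A', 0), ('B', 0), ('B', 1), ('A', 1)]
print(deferred)  # [('A', 0)]
```

Client A still has a token when its second request for OST 0 arrives, but OST 0 is already saturated, so that request is deferred and A's remaining token is spent on OST 1 instead.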

The project will start with realistic simulations, which already include standard data layouts and access patterns of leadership-class HPC environments. The new architectural approaches will then be transferred into the Lustre NRS and monitoring source code, enabling higher overall storage bandwidth and fine-grained QoS. Intermediate results will be presented to the Lustre and HPC communities to collect continuous feedback throughout the design and implementation process.

Project Partners

Intel® Inc.

Funding Period

01/2016 – 12/2017

Contact

Prof. Dr. André Brinkmann

2019

Yingjin Qian, Xi Li, Shuichi Ihara, Andreas Dilger, Carlos Thomaz, Shilong Wang, Wen Cheng, Chunyan Li, Lingfang Zeng, Fang Wang, Dan Feng, Tim Süß, and André Brinkmann. 2019. LPCC: hierarchical persistent client caching for Lustre. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 88:1–88:14.

2017

Yingjin Qian, Xi Li, Shuichi Ihara, Lingfang Zeng, Jürgen Kaiser, Tim Süß, and André Brinkmann. 2017. A configurable rule based classful token bucket filter network request scheduler for the Lustre file system. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 6:1–6:12.