21. Software Development Team

21.1. Team members
Kazuo MINAMI (Team Head)
Masaaki TERAI (Research & Development Scientist)
Atsuya UNO (Research & Development Scientist)
Akiyoshi KURODA (Research & Development Scientist)
Hitoshi MURAI (Research & Development Scientist)
Kiyoshi KUMAHATA (Research & Development Scientist)
Shunsuke INOUE (Research & Development Scientist)
Yukihiro HASEGAWA (Research & Development Scientist)
Shinji HAMADA (Postdoctoral Researcher)

21.2. Research Activities
Applications that make maximal use of computational resources, together with high system performance, are essential elements of the K computer, which runs applications parallelized on a scale of several tens of thousands of processes. From the development stage of the K computer, our team selected six target applications expected to contribute to advances in the numerical simulation and computer engineering fields, and promoted research and development to increase their performance. Further, making applications perform at a high level also requires improving the functionality and performance of the system and middleware. By evaluating and improving these areas, we are further enhancing the operation of the K computer. During 2012, the following activities took place in line with these principles.
(1) Performance testing of the K computer system as a whole using real applications
Toward the completion of the K computer in June 2012, we performed 10 PFLOPS system performance testing using the aforementioned six target applications.
(2) Systematization of methods for increasing application performance on the K computer
Methods of increasing performance differ according to the features of the application, such as memory access patterns, required B/F values, and parallel characteristics. We organized these features and systematized the corresponding methods for improving performance.
(3) Support for consultations on usage by the registered institutions
We fielded questions regarding application performance from the registered institutions and the AICS research departments, and provided support with the methods needed to increase performance, in order to draw out the CPU performance and parallel performance of the K computer.
(4) Increased sophistication of the compiler tools of the K computer and procedures for methods to increase performance
We increased the efficiency of the compiler tools in order to make (1), (2), and (3) more efficient. In addition, we established methods such as kernel segmentation, performance prediction at high levels of parallelization, and debugging methods, and created procedures and tools for these.

21.3. Research Results and Achievements

21.3.1. Performance testing of the K computer system as a whole using real applications
In order to carry out performance testing on the K computer system as a whole, we selected six applications from each application field, with different computational and scientific characteristics (Fig 1).
Figure 1 Six applications
Focusing on unified use at the 10 PFLOPS scale, we used these applications to evaluate five items: computational performance, inter-node communication performance, I/O performance, staging performance, and performance-hindering factors.
A) Computational performance
We evaluated per-node performance, which determines the computational performance of the K computer. From weak-scaling measurements, in which the computational volume per node is held constant, we confirmed that there was no performance degradation when executing at 10 PFLOPS. The applications used were NICAM and QCD. With NICAM, an increase in execution time and performance degradation were observed when running 81,920 parallel processes. Investigation revealed that the cause was the read processing of data within the evaluation region, and we confirmed that the degradation could be avoided by excluding this processing. B)
Inter-node communication performance
We evaluated communication performance between nodes, which strongly affects the parallel performance of processes. We compared the MPI communication values measured while running the applications with the fundamental performance (ideal values obtained through benchmarks), and confirmed that the designed inter-node communication performance can be achieved at the 10 PFLOPS scale. The applications used were RSDFT, Seism3D, and FFB. With RSDFT, the Tofu-dedicated algorithm Trinaryx3 is used for the collective communications Allreduce, Bcast, and Reduce, and we confirmed that the bandwidth specified in the fundamental performance was achieved. With Seism3D and FFB, we confirmed that the measured values for adjacent isend/irecv communication reached the fundamental performance. C) I/O performance
We evaluated I/O performance with the local file system during execution of the applications. We compared the values measured while executing the applications against the fundamental performance, and confirmed that adequate I/O performance can be obtained. The application used was NICAM. For file input, the measured throughput was high, but we estimated that only a portion of the data was actually read and judged the performance adequate. For file output, as the measured throughput matches the fundamental performance, performance is considered adequate. D) Staging performance
We evaluated the file transfer between the global file system and the local file system performed before and after execution of the applications. We compared the measured staging values with the fundamental performance and confirmed that the same level of performance was achieved. The applications used were RSDFT, PHASE, and NICAM. With RSDFT and PHASE, where the transfer size is small, staging was inefficient and performance was low. However, with NICAM, where the transfer size was larger at 32 GB, adequate performance was obtained, at the same level as the fundamental performance. E)
Performance hindering factors
We confirmed whether there was any performance degradation caused by OS jitter or any external
factors. Using FFB, Seism3D, QCD, and RSDFT, we confirmed the load balance between processes and found no issues.

21.3.2. Systematization of methods of increasing application performance on the K computer
The six applications used in 21.3.1 are characterized along the following two axes of computational science characteristics.
In regard to high parallelization: can high parallel performance be achieved with a comparatively simple parallelization method, or is high parallelization impossible without a complex parallelization method?
In regard to improving unit (single-node) performance: does the application tend to make high unit performance difficult on current computers because it requires a high ratio of memory bandwidth to floating-point operations (a high B/F value), or comparatively easy because it requires only a low B/F value?
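As a concrete illustration (a hypothetical kernel, not from the report), the required B/F value can be estimated by counting the bytes moved per floating-point operation and comparing it against the hardware's B/F value:

```python
def required_bf(bytes_loaded, bytes_stored, flops):
    """Required B/F value: bytes moved per floating-point operation."""
    return (bytes_loaded + bytes_stored) / flops

# Hypothetical stencil update a(i) = b(i-1) + b(i) + b(i+1), double precision:
# each iteration loads one new 8-byte element of b (the others are reused
# from cache), stores 8 bytes of a, and performs 2 additions.
bf = required_bf(bytes_loaded=8, bytes_stored=8, flops=2)
print(bf)  # 8.0, far above a hardware B/F of about 0.5, so memory-bound
```

A kernel whose required B/F greatly exceeds the hardware value is limited by memory bandwidth rather than by floating-point throughput, which is the distinction drawn on this axis.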
Figure 2 Behavior in Computational Science
In Fig 2, each application is plotted on these two evaluation axes. Based on these computational science features, the methods for improving the performance of the applications were systematized as described below.
A) NICAM, Seism3D group
In regard to high parallelization, a simple parallelization method based on adjacent communication is used, making it easy to achieve high parallel performance. The K computer's network topology resembles a structured grid, with 1-hop communication guaranteed for adjacent communication in 6 directions, and it is designed so that the associated cost does not increase with higher parallelization. No increase in communication cost was seen when validating on the K computer's full node set using Seism3D, in which adjacent communication occurs in 4 horizontal directions. Further, in regard to improving unit performance, although memory access is continuous, the required B/F is high, so achieving high unit performance is difficult. With this type, it is first important to confirm that the hardware's practical memory bandwidth is achieved in the hotspot loops. Causes of failing to achieve it include L1 cache thrashing and register spills, which arise when many stream variables and scalar variables appear in the innermost loop. These can be avoided effectively using loop division and array merging. B)
PHASE, RSDFT group
With the original code, application-imposed restrictions meant that parallelization on a scale of several tens of thousands was not possible for this group. RSDFT, in addition to the traditional space division, implemented division in the direction of the energy band. PHASE, in addition to the traditional energy band division, was rewritten to handle parallel execution on a scale of several tens of thousands through the addition of division in the wave number direction. Further, as these applications make heavy use of collective communication and the traffic volume is high, the focus at high levels of parallelization is how the K computer's network topology and high-speed libraries are used. With RSDFT, as the volume of each communication is large, the Tofu-dedicated collective communication algorithm Trinaryx3 is used. With PHASE, high speed is achieved by confining all the traffic within small partial spaces through bi-axial parallelization, thereby limiting the share of large-scale total traffic. Further, from the viewpoint of improving unit performance, since the computation can in principle be rewritten as matrix-matrix products, tuning by rewriting the code to use matrix-matrix product libraries (DGEMM) is effective. C) FFB group
As this type is based on list access and has a high required B/F, the key to improving unit performance is the extent to which the list data can be kept in cache. With FFB, the analysis space is divided into small subspaces, and tuning was performed so that the vector data stays in cache by renumbering the nodes within each subspace. In regard to high parallelization, attention must be paid to how the cost of Allreduce grows as parallelization increases on the unstructured grid. However, when using the hardware barrier function of the Tofu interconnect, a value close to the fundamental performance was confirmed even at full-node scale.
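The renumbering within a subspace can be sketched as follows; this is a hypothetical illustration of the idea, not FFB's actual algorithm. Nodes are renumbered in the order the element list first touches them, so that indirect (list) accesses to nodal data walk memory nearly sequentially and stay in cache:

```python
def renumber_first_touch(elements):
    """Renumber nodes in first-touch order of the element connectivity
    list, so indirect accesses to nodal arrays become nearly sequential
    (a hypothetical stand-in for the subspace renumbering described)."""
    new_id = {}
    for elem in elements:
        for node in elem:
            if node not in new_id:
                new_id[node] = len(new_id)
    return [[new_id[n] for n in elem] for elem in elements]

# Elements referencing scattered original node numbers:
elements = [(40, 7, 93), (7, 93, 12), (12, 93, 5)]
print(renumber_first_touch(elements))
# -> [[0, 1, 2], [1, 2, 3], [3, 2, 4]]
```

After renumbering, consecutive elements reference nearby indices, which is what lets the small subspace's vector data remain cache-resident.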
D) LatticeQCD group
Although the required B/F is traditionally high, this group can operate on-cache, so unit performance is easier to achieve than with the FFB group or the Seism3D/NICAM group. However, with QCD, we observed an increase in L1 cache misses and in waits for integer load cache access, and a sluggish rise in the SIMD instruction rate. For these issues, the expected performance could be achieved by improving the compiler or by using SIMD intrinsic functions. Through the use of these methods, the performance described in Fig 3 was achieved.
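The loop division described above for the NICAM/Seism3D group can be sketched as follows (hypothetical code, not taken from the applications themselves): a loop that streams many arrays at once is split into loops that each touch fewer arrays, which on real hardware relieves register pressure and L1 cache conflicts while preserving the results.

```python
# Hypothetical fused loop: four output streams computed in one pass,
# so six arrays are live in every iteration.
def fused(a, b, c, d, x, y):
    n = len(x)
    for i in range(n):
        a[i] = x[i] + y[i]
        b[i] = x[i] - y[i]
        c[i] = x[i] * y[i]
        d[i] = x[i] / y[i]

# Loop division: the same work split into two loops, each with fewer
# live streams per iteration. The arithmetic is identical, so the
# transformation is safe; only the memory access pattern changes.
def divided(a, b, c, d, x, y):
    n = len(x)
    for i in range(n):
        a[i] = x[i] + y[i]
        b[i] = x[i] - y[i]
    for i in range(n):
        c[i] = x[i] * y[i]
        d[i] = x[i] / y[i]
```

In a compiled Fortran or C hotspot the same transformation reduces the number of stream variables per innermost loop, which is the mechanism by which cache thrashing and register spills are avoided.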
Figure 3 Features of the six applications and achieved performance

21.3.3. Support for consultations on usage by registered institutions
A) Exchanges of opinions with registered institutions on increasing the sophistication of applications
Meetings for the exchange of opinions were held with the registered institutions with the aim of sharing skills related to increasing the sophistication of applications (10/2, 10/31). The content mainly involved the introduction of examples, such as methods of analyzing cost on the K computer, methods of identifying tuning spots, and tuning methods, and those in charge of increasing speed on both sides found the discussions beneficial. B)
Support for the use of the K computer by the software development team
In the operational technology department, a user support service named the K support desk has been provided since the test usage period, and it continues to be used after the start of shared use. The K support desk handles the technically specialized questions among the general usage issues transferred from the registered institutions, as well as framework issues related to the adjustment and increased sophistication of the K computer, mainly for the research departments of AICS. Further, as part of the operation, the status of responses to inquiries is checked periodically, in order to promote the smooth handling of questions and issues and to identify issues relevant to future sophistication. In the six months from the start of shared use until the end of March 2013, the desk dealt with approximately 1400 inquiries.

21.3.4. Sophistication of compiler tools and procedures for methods to improve performance on the K computer
A) Advancements in the compiler and runtime system
We discovered performance issues in the compiler and runtime system from the perspective of the aforementioned six applications and of speeding up the supported user applications, and carried out improvement testing. These items will be reflected in the actual operating environment within this year.
1.
Fixed an issue in which automatic parallelization was not performed even when the optimization control specifier "norecurrence" is given, under the following conditions: there are operations between blocks of a nested loop, or the loop index variable depends on a variable used outside the loop body.
2. Removed the restriction that software pipelining does not work when XFILL optimization is applied.
3. Fixed SIMD optimization so that it is performed with the "-Ksimd=2" option even when there is a reduction operation in an if clause.
4. Applied SIMD optimization to the ibits function.
5. Suppressed relocation by the dynamic linker in order to reduce the dispersion of execution times.
B) Increased sophistication of the profiler
We provide a profiler (the PA function), which uses the CPU event counter, to K computer users, and we have worked to increase its sophistication. With the PA function up until 2011, resource usage information at the measurement location (memory bandwidth, peak performance ratio, cache miss rate, SIMD rate, etc.) was displayed as numerical data or as ratios on a graph, but it was difficult for each user to judge where tuning should be applied. In the present advancement, we improved the legibility of the PA output through visualization: coloring the cells on which the user should concentrate, adding indicators, and clarifying bottleneck points, thus guiding the user to the next tuning step. C)
Procedures for methods of improving performance
In the work of generally improving the speed of applications, grasping the logical structure of the code took a lot of effort. To resolve this, we developed a static analysis tool (K-scope) with the aim of providing code-reading support for tuning. K-scope is written in Java and runs in an ordinary PC environment. It adopts the interface shown in Fig 4: the logical structure of the program is shown as a tree on the left-hand side, the source code of the selected area is shown at the top right, and a list of variable attributes, floating-point operation counts, etc. is shown at the bottom right. The plan is to publish this software as open source in April 2013.
Figure 4 Snapshot of K-scope

21.4. Schedule and Future Plan
We will continue to analyze and evaluate the K computer's hardware and compiler features and the computational science characteristics of the applications, and to systematize methods of increasing their sophistication. Further, aiming to provide an environment in which the K computer is easy to use, we will continue to promote advancements in the middleware used by applications.

21.5. Publication, Presentation and Deliverables
(1) Journal Papers
- None
(2) Conference Papers
1.
M. Terai, E. Tomiyama, H. Murai, K. Minami and M. Yokokawa, “K-scope: a Java-based Fortran Source Code Analyzer with Graphical User Interface for Performance Improvement”, 41st International Conference on Parallel Processing Workshops, September, 2012.
2. M. Terai, K. Ishikawa, Y. Sugisaki, K. Minami, F. Shoji, Y. Nakamura, Y. Kuramashi and M. Yokokawa, “Performance Tuning of a Lattice QCD code on a node of the K computer”, High Performance Computing Symposium 2013, January, 2013. (in Japanese)
3. S. Inoue, S. Tsutsumi, T. Maeda and K. Minami, “Performance optimization of seismic wave simulation code on the K computer”, High Performance Computing Symposium 2013, January, 2013. (in Japanese)
4. K. Kumahata, S. Inoue and K. Minami, “Performance Tuning for Gradient Kernel of the FrontFlow/blue on the K computer”, High Performance Computing Symposium 2013, January, 2013. (in Japanese)
5. A. Kuroda, Y. Sugisaki, S. Chiba, K. Kumahata, M. Terai, S. Inoue and K. Minami, “Performance Impact of TLB on the K computer Applications”, High Performance Computing Symposium 2013, January, 2013. (in Japanese)
(3) Invited Talks
1.
K. Minami, “Massively Parallelization and Performance Improvement of Applications on the K computer”, Symposium on Advanced Computing Systems and Infrastructures 2012, May, 2012. (in Japanese)
2. K. Minami, “Overview of Development of the K computer and Prospects for the Future”, Kure Medical Association lecture, November, 2012. (in Japanese)
3. A. Kuroda, K. Minami, T. Yamazaki, J. Nara, J. Kouga, T. Uda and T. Ono, “Can we speed up of FFT on Massively-parallel architecture?”, The 3rd Society for Computational Materials Science Initiative, December, 2012. (in Japanese)
4. Y. Hasegawa, “Toward Petaflops Applications - First-principles electronic structure calculation program RSDFT -”, HPC Strategic Program Field 5 “The origin of matter and the universe” Symposium, March, 2013. (in Japanese)
5. A. Kuroda, “Example of Utilization of the K computer - With Optimization of PHASE”, The Society of Polymer Science, Research Group on Computational Polymer Science, March, 2013. (in Japanese)
(4) Posters and presentations
1. K. Minami, “Parallelization and Performance Improvement of Applications on the K computer”, RIKEN Seminar, July, 2012. (in Japanese)
2. A. Kuroda, “Can we speed up of FFT on the K computer? -- With Performance Optimization of PHASE --”, RIKEN Seminar, July, 2012. (in Japanese)
3. Y. Hasegawa, “Development of Petaflops Application RSDFT”, RIKEN Seminar, July, 2012. (in Japanese)
4. K. Minami, “Optimization II”, AICS Summer School, August, 2012. (in Japanese)
5. K. Minami, “Performance of Real Applications on the K computer”, Computational Science Seminar, August, 2012. (in Japanese)
6. A. Kuroda, K. Minami, T. Yamasaki, J. Nara, J. Koga, T. Uda, and T. Ohno, “Planewave-based first-principles MD calculation on 80,000-node K-computer”, SC'12, November, 2012.
7. K. Minami, “Performance Improvement of Applications on the K computer”, SC'12, RIKEN AICS Booth Short Lectures, November, 2012.
8. Y. Hasegawa, “Effective use of collective communications tuned for the K computer in the real-space density functional theory code”, SC'12, RIKEN AICS Booth Short Lectures, November, 2012.
9. M. Terai, E. Tomiyama, H. Murai, K. Kumahata, S. Hamada, S. Inoue, A. Kuroda, Y. Hasegawa, K. Minami and M. Yokokawa, “Development of supporting tool “K-scope” to tune of Fortran code”, High Performance Computing Symposium 2013, January, 2013. (in Japanese)
(5) Patents and Deliverables
- None