EBookClubs

Read Books & Download eBooks Full Online

Book Performance Analysis and Evaluation of Dynamic Loop Scheduling Techniques in a Competitive Runtime Environment for Distributed Memory Architectures

Download or read book Performance Analysis and Evaluation of Dynamic Loop Scheduling Techniques in a Competitive Runtime Environment for Distributed Memory Architectures written by Mahadevan Balasubramaniam and published by . This book was released in 2003. Available in PDF, EPUB and Kindle. Book excerpt: Parallel computing offers immense potential to solve large, complex scientific problems. Load imbalance is a major impediment to obtaining high performance on a parallel system. One principal form of parallelism found in scientific applications is data parallelism; loops without dependencies are data parallel. During the execution of large parallel loops, computational requirements vary due to problem, algorithmic, and systemic characteristics. These factors lead to load imbalance, which in turn degrades the performance of an application. Over the years, a number of dynamic loop scheduling techniques have been proposed to address one or more of these factors. However, no single strategy works well across different problem domains and system characteristics. Moreover, load balancing at runtime is complicated by the need for dynamic data redistribution. Therefore, there is a distinct need to integrate the dynamic loop scheduling techniques into a single package and provide them as an application programming interface (API) to the application developer. In recent years, a number of dynamic loop scheduling techniques have been integrated into compiler technologies for shared memory environments; no such integrated approach exists for distributed memory applications. The purpose of this thesis is to present the design, implementation, and effectiveness of an integrated approach: the dynamic loop scheduling techniques are integrated into a runtime system for distributed memory architectures. For this purpose, we choose the newly developed parallel runtime environment for multicomputer architecture (PREMA) with its main components: the data movement and control substrate (DMCS) and the mobile object layer (MOL). This runtime system has been demonstrated to be one of the most competitive runtime systems for distributed memory architectures. The significance of this work is that the proposed API will enhance the performance of parallel applications by reducing the load imbalance among processors caused by a wide range of factors, and will reduce the software development cost required for load balancing. With the integration of the scheduling capabilities into the runtime system, its applicability has been expanded. The performance of the API has been evaluated qualitatively and quantitatively, and its overhead has been studied analytically and measured experimentally. Three parallel benchmarks representing scientific applications of general interest (N-body simulations, an automatic quadrature routine, and an unstructured grid heat solver) were considered for experimentation. Based on the experiments conducted, a cost improvement of up to 76% over the straightforward parallel benchmark was obtained. For certain application characteristics, the overhead of the runtime system was found to be within 10% of the underlying messaging layer. These results demonstrate that, in large scientific applications, it is possible and desirable to combine the rich functionality of a runtime system with the advantages of scheduling techniques to achieve high performance.
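As a rough illustration of what such a loop-scheduling API can look like, the sketch below shows a chunk-dispensing scheduler offering a few classic techniques. All names are hypothetical; the thesis's actual API and the PREMA/DMCS/MOL interfaces are not reproduced here, and the chunk-size rules are the textbook forms of self-scheduling, guided self-scheduling, and (simplified) factoring.

```cpp
// Hypothetical chunk-dispensing loop scheduler; names and structure are
// illustrative and do not reproduce the thesis's API or PREMA's interfaces.
#include <algorithm>
#include <cstdio>

enum class Technique { SelfScheduling, Guided, Factoring };

class LoopScheduler {
public:
    LoopScheduler(long first, long last, int nprocs, Technique t)
        : next_(first), last_(last), nprocs_(nprocs), tech_(t) {}

    // Hand the caller its next chunk [begin, end); returns false when the
    // iteration space is exhausted.
    bool nextChunk(long& begin, long& end) {
        if (next_ > last_) return false;
        const long remaining = last_ - next_ + 1;
        long size = 1;
        switch (tech_) {
            case Technique::SelfScheduling:
                size = 1; break;                                  // one iteration per request
            case Technique::Guided:
                size = std::max(1L, remaining / nprocs_); break;  // remaining/P
            case Technique::Factoring:                            // simplified: true factoring
                size = std::max(1L, remaining / (2 * nprocs_));   // assigns batches of P chunks
                break;
        }
        begin = next_;
        end = next_ + size;
        next_ = end;
        return true;
    }

private:
    long next_, last_;
    int nprocs_;
    Technique tech_;
};

int main() {
    LoopScheduler sched(0, 99, 4, Technique::Guided);
    long b, e;
    while (sched.nextChunk(b, e)) std::printf("[%ld,%ld) ", b, e);
    std::printf("\n");
}
```

In a distributed memory setting, a call like nextChunk would typically be served by a foreman process via the runtime's messaging layer rather than by a local object.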

Book Performance Analysis and Evaluation of Divisible Load Theory and Dynamic Loop Scheduling Algorithms in Parallel and Distributed Environments

Download or read book Performance Analysis and Evaluation of Divisible Load Theory and Dynamic Loop Scheduling Algorithms in Parallel and Distributed Environments written by Mahadevan Balasubramaniam and published by . This book was released in 2015 with 185 pages. Available in PDF, EPUB and Kindle. Book excerpt: High performance parallel and distributed computing systems are used to solve large, complex, data parallel scientific applications that require enormous computational power. Data parallel workloads, which require performing similar operations on different data objects, are present in a large number of scientific applications, such as N-body and Monte Carlo simulations, and are expressed in the form of loops. Data parallel workloads that lack precedence constraints are called arbitrarily divisible workloads and are amenable to easy parallelization. Load imbalance, which arises from application, algorithmic, and systemic characteristics during the execution of scientific applications, degrades performance. Scheduling arbitrarily divisible workloads to address load imbalance, and thereby obtain better utilization of computing resources, is a major area of research. Divisible load theory (DLT) and dynamic loop scheduling (DLS) algorithms are two algorithmic approaches employed in the scheduling of arbitrarily divisible workloads. Despite sharing the goal of achieving load balancing, the two approaches are fundamentally different: divisible load theory algorithms are linear, deterministic, and platform dependent, whereas dynamic loop scheduling algorithms are probabilistic and platform agnostic. Divisible load theory algorithms have traditionally been used for performance prediction in environments characterized by known or expected variation in the system characteristics at runtime. Dynamic loop scheduling algorithms are designed to simultaneously address all the sources of load imbalance that stochastically arise at runtime from application, algorithmic, and systemic characteristics. In this dissertation, an analysis and performance evaluation of DLT and DLS algorithms are presented in the form of a scalability study and a robustness investigation, and the effect of network topology on their performance is studied. A hybrid scheduling approach is also proposed that integrates DLT and DLS algorithms. The hybrid approach combines the strengths of DLT and DLS algorithms, improves the performance of scientific applications running in large-scale parallel and distributed computing environments, and delivers performance superior to that obtained by applying DLT algorithms in isolation. The range of conditions for which the hybrid approach is useful is also identified and discussed.
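To make the deterministic flavor of DLT concrete, the minimal sketch below computes a DLT-style load partition under the simplest linear cost model, where each worker's time is its per-unit compute cost times its fraction and communication is ignored; the dissertation's DLT formulations also model network topology and communication delays, which this toy version does not.

```cpp
// Minimal divisible load theory (DLT) sketch: split one arbitrarily divisible
// workload so all workers finish simultaneously, assuming a linear cost model
// (time = w_i * fraction) and negligible communication cost.
#include <cstdio>
#include <vector>

std::vector<double> dltFractions(const std::vector<double>& w) {
    // Equal finish times: alpha_i * w_i = const  =>  alpha_i proportional to 1/w_i.
    double sumInv = 0.0;
    for (double wi : w) sumInv += 1.0 / wi;
    std::vector<double> alpha;
    for (double wi : w) alpha.push_back((1.0 / wi) / sumInv);
    return alpha;
}

int main() {
    std::vector<double> w = {1.0, 2.0, 4.0};                   // per-unit compute times
    for (double a : dltFractions(w)) std::printf("%.3f ", a);  // 0.571 0.286 0.143
    std::printf("\n");
}
```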

Book Dynamic Task Execution on Shared and Distributed Memory Architectures

Download or read book Dynamic Task Execution on Shared and Distributed Memory Architectures written by Asim Yarkhan and published by . This book was released in 2012 with 122 pages. Available in PDF, EPUB and Kindle. Book excerpt: Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. However, standard programming techniques are not well adapted to this change in computer architecture design. In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines. This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution. We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile-based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-of-the-art commercial and research libraries. We then develop a runtime environment for distributed memory multicore architectures, extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment.
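The dataflow firing rule behind such task-based runtimes is easy to state: a task may run once all tasks it depends on have completed. Below is a minimal, serial sketch of that rule; real runtimes like the one described here additionally distribute ready tasks across cores and nodes and manage data coherency.

```cpp
// Toy data-driven task graph executor: a task becomes ready when all of its
// input dependencies are satisfied. Serial, for illustration only.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    std::function<void()> body;
    std::vector<int> successors;  // tasks that consume this task's output
    int unmetDeps = 0;            // remaining unsatisfied inputs
};

void execute(std::vector<Task>& dag) {
    std::queue<int> ready;
    for (size_t i = 0; i < dag.size(); ++i)
        if (dag[i].unmetDeps == 0) ready.push((int)i);   // seed with source tasks
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        dag[t].body();
        for (int s : dag[t].successors)                  // fire newly ready successors
            if (--dag[s].unmetDeps == 0) ready.push(s);
    }
}

int main() {
    // A tiny tile-style DAG: one factorization task unblocks two solves.
    std::vector<Task> dag(3);
    dag[0] = {[] { std::puts("factor tile A"); }, {1, 2}, 0};
    dag[1] = {[] { std::puts("solve tile B"); }, {}, 1};
    dag[2] = {[] { std::puts("solve tile C"); }, {}, 1};
    execute(dag);
}
```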

Book Evaluation of Loop Scheduling Algorithms on Distributed Memory Systems

Download or read book Evaluation of Loop Scheduling Algorithms on Distributed Memory Systems written by Teebu Philip and published by . This book was released in 1996 with 18 pages. Available in PDF, EPUB and Kindle. Book excerpt: Abstract: "Loops are the largest source of parallelism in many applications. All prior DOALL loop scheduling algorithms such as Self-Scheduling, Guided Self-Scheduling, Trapezoid Self-Scheduling, and Factoring try to achieve workload balance through decreasing chunk sizes. Moreover, they have been analyzed only for shared memory platforms. In this work, the prior loop scheduling methods will be evaluated on two distributed memory machines using realistic workloads from the NAS Parallel benchmark suite and the Livermore Loop Series. The distributed memory platforms are a 16-node IBM SP2 and a 16-node nCUBE 2. The experimental results show that these decreasing chunk size methods tend to increase the communication time in distributed memory models by assigning more chunks. In view of these results, two new schemes, called Fixed Increase and Variable Increase, are introduced. Contrary to the earlier techniques, these schemes increase the chunk sizes in order to minimize the scheduling overhead by reducing interprocessor communication. The new algorithms can be implemented by parallel compilers and are scalable over large numbers of processors and iterations. Extensive measurements on both machines indicate that the increasing chunk size methods can provide better performance than the existing algorithms for almost all workload patterns."
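The decreasing- versus increasing-chunk contrast can be sketched in a few lines. Guided self-scheduling assigns each requester remaining/P iterations, so chunks shrink and the number of messages grows; an increasing scheme starts small and grows each chunk, sending fewer, larger messages late in the loop. The starting size and step below are illustrative guesses, since the report's exact Fixed Increase parameters are not given here.

```cpp
// Chunk-size sequences: guided self-scheduling (decreasing) versus a simple
// increasing-chunk scheme in the spirit of Fixed Increase (parameters assumed).
#include <algorithm>
#include <cstdio>

void guided(long n, int p) {
    long rem = n;
    while (rem > 0) {
        long c = std::max(1L, rem / p);   // remaining/P, shrinking toward 1
        std::printf("%ld ", c);
        rem -= c;
    }
    std::printf("\n");
}

void fixedIncrease(long n, int p, long step) {
    long rem = n;
    long c = std::max(1L, n / (4 * p));   // illustrative starting size
    while (rem > 0) {
        long take = std::min(c, rem);
        std::printf("%ld ", take);
        rem -= take;
        c += step;                        // chunks grow, so fewer messages are sent
    }
    std::printf("\n");
}

int main() {
    guided(100, 4);           // 25 18 14 10 8 6 4 3 3 2 then 1s: many small chunks
    fixedIncrease(100, 4, 4); // 6 10 14 18 22 26 4: far fewer chunks
}
```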

Book Handbook of Scheduling

Download or read book Handbook of Scheduling written by Joseph Y-T. Leung and published by CRC Press. This book was released on 2004-04-27 with 1215 pages. Available in PDF, EPUB and Kindle. Book excerpt: This handbook provides full coverage of the most recent and advanced topics in scheduling, assembling researchers from all relevant disciplines to facilitate new insights. Presented in six parts, it opens with introductory material, complete with tutorials and algorithms, and then examines classical scheduling problems. Part 3 explores scheduling models that originate in areas such as computer science and operations research. The following part examines scheduling problems that arise in real-time systems. Part 5 discusses stochastic scheduling and queueing networks, and the final part covers a range of applications in a variety of areas, from airlines to hospitals.

Book Automatic Selection of Dynamic Loop Scheduling Algorithms for Load Balancing Using Reinforcement Learning

Download or read book Automatic Selection of Dynamic Loop Scheduling Algorithms for Load Balancing Using Reinforcement Learning written by Sumithra Dhandayuthapani and published by . This book was released in 2004. Available in PDF, EPUB and Kindle. Book excerpt: Scientific applications are large, complex, irregular, and computationally intensive, and are characterized by data parallel loops. The prevalence of independent iterations in these loops makes parallel computing the natural choice for solving these applications. The computational requirements of these problems vary due to variations in problem, algorithmic, and systemic characteristics during parallelization, leading to performance degradation. A considerable amount of research has been dedicated to the development of dynamic scheduling techniques based on probabilistic analysis that address the predictable and unpredictable factors leading to severe load imbalance. The mathematical foundations of these scheduling algorithms have been previously developed and published in the literature, and the techniques have been successfully integrated into scientific applications as well as into runtime systems. Recently, efforts have also been directed at integrating these techniques into dynamic load balancing libraries for scientific applications. Identifying the optimal scheduling algorithm to load balance a specific scientific application in a dynamic parallel computing environment is very difficult without exhaustively testing all the scheduling techniques. This is a time-consuming process, and therefore there is a need for an automatic mechanism for selecting dynamic scheduling algorithms. In recent years, extensive work has been dedicated to the development of reinforcement learning, and some of its techniques have addressed load-balancing problems. However, they do not cover a number of aspects regarding the performance of scientific applications. First, these previously developed techniques address the load balancing problem only at a coarse granularity level (for example, job scheduling), and the reinforcement learning techniques used for load balancing are based on learning from training datasets obtained prior to the execution of the application. Moreover, scientific applications contain parameters whose variations are so irregular that training sets cannot accurately capture the entire spectrum of possible characteristics. Finally, algorithm selection using reinforcement learning has only been used for simple sequential problems. This thesis addresses these limitations and provides a novel integrated approach for automating the selection of dynamic scheduling algorithms at a finer granularity level to improve the performance of scientific applications using reinforcement learning. The integrated approach is experimentally tested on a scientific application that involves a large number of time steps: the Quantum Trajectory Method (QTM). A qualitative and quantitative analysis of the effectiveness of this novel approach is presented to underscore its significance in improving the performance of large-scale scientific applications.
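As a toy model of the selection problem, the sketch below uses an epsilon-greedy bandit to pick a DLS algorithm at each time step and learn online from the measured step time. This is an illustrative stand-in under assumed rewards and simulated timings, not the thesis's actual reinforcement learning formulation.

```cpp
// Epsilon-greedy selection among DLS algorithms across application time steps;
// the reward (negative step time) and simulated timings are assumptions.
#include <cstdio>
#include <random>
#include <vector>

// Stand-in for "run one time step with scheduling algorithm a" returning its
// measured duration; here, simulated noisy timings favoring algorithm 0.
double runTimeStepWith(int a) {
    static std::mt19937 gen(7);
    std::normal_distribution<double> d(1.0 + 0.1 * a, 0.05);
    return d(gen);
}

int main() {
    const int nAlgs = 4;                       // e.g., SS, GSS, TSS, factoring
    std::vector<double> avgReward(nAlgs, 0.0);
    std::vector<int> count(nAlgs, 0);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, nAlgs - 1);

    for (int step = 0; step < 1000; ++step) {
        int a = 0;
        if (coin(rng) < 0.1) {
            a = pick(rng);                     // explore a random algorithm
        } else {
            for (int i = 1; i < nAlgs; ++i)    // exploit the best average so far
                if (avgReward[i] > avgReward[a]) a = i;
        }
        double reward = -runTimeStepWith(a);   // faster step => higher reward
        ++count[a];
        avgReward[a] += (reward - avgReward[a]) / count[a];  // incremental mean
    }
    for (int i = 0; i < nAlgs; ++i)
        std::printf("alg %d picked %d times\n", i, count[i]);
}
```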

Book Static and Dynamic Scheduling for Effective Use of Multicore Systems

Download or read book Static and Dynamic Scheduling for Effective Use of Multicore Systems written by Fengguang Song and published by . This book was released in 2009 with 158 pages. Available in PDF, EPUB and Kindle. Book excerpt: Multicore systems have increasingly gained importance in high performance computers. Compared to traditional microarchitectures, multicore architectures have a simpler design, a higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems. We propose an analytical cache model to predict the number of cache misses on the shared L2 cache of a multicore processor. The model provides insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build a framework for affinity-based thread scheduling to determine optimized thread schedules that improve data reuse at all levels of a complex memory hierarchy. The affinity-based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback-directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures. We also study a dynamic data-availability-driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system whose design is focused on the scalability metric: at any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications.
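A simplified stand-in for the schedule-cost idea: treat thread pairs as weighted affinity edges and charge a schedule for every edge whose endpoints land on cores that do not share a cache level. The framework's three submodels are considerably richer; this sketch only conveys the flavor of such a cost function, with an assumed flat core-to-L2 grouping.

```cpp
// Toy schedule cost: total affinity (shared data volume) lost between thread
// pairs mapped to cores in different L2 domains. Assumes cores are grouped
// into L2 domains of equal size; all numbers are illustrative.
#include <cstdio>
#include <vector>

double scheduleCost(const std::vector<std::vector<double>>& affinity,
                    const std::vector<int>& coreOf, int coresPerL2) {
    double cost = 0.0;
    const int n = (int)affinity.size();
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (coreOf[i] / coresPerL2 != coreOf[j] / coresPerL2)
                cost += affinity[i][j];  // reuse lost across L2 domains
    return cost;
}

int main() {
    // Threads 0 and 1 share a lot of data; mapping them to cores 0 and 1
    // (same L2 domain of size 2) avoids paying their affinity weight.
    std::vector<std::vector<double>> aff = {{0, 5, 1}, {5, 0, 2}, {1, 2, 0}};
    std::printf("cost = %.1f\n", scheduleCost(aff, {0, 1, 2}, 2));  // 3.0
}
```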

Book An API for Adaptive Loop Scheduling in Shared Address Space Architectures

Download or read book An API for Adaptive Loop Scheduling in Shared Address Space Architectures written by Kirthilakshmi Govindaswamy and published by . This book was released in 2003. Available in PDF, EPUB and Kindle. Book excerpt: The parallelization of complex, irregular scientific applications with varying computational requirements often results in severe load imbalance. Load balancing increases the efficient utilization of available resources in parallel and distributed applications, thereby reducing overall processor completion times. Loops are a rich source of parallelism in data parallel applications. In recent years, several loop scheduling schemes that balance processor workloads have been proposed and successfully implemented in data parallel applications. If the workload on processors is balanced, the overall efficiency of a computation increases, which in turn reduces the computation run-time. Therefore, loop scheduling routines are incorporated into applications to ensure that the workload is balanced across all the available processors. Significant research effort has been made toward embedding the most competitive loop scheduling algorithms into specific scientific applications; however, each time a new application is developed, the developer has to rewrite the algorithm to incorporate it. Certain compilers take advantage of loops present in the application and parallelize them automatically, but automatic parallelization does not address all sources of algorithmic and systemic variance in heterogeneous environments. These limitations raise a compelling need for an application programming interface (API) for adaptive loop scheduling algorithms that can be incorporated into any scientific application. This thesis presents an API for various adaptive loop scheduling strategies for data parallel applications in a shared address space architecture, which allows for parallelization as well as adaptive load balancing of a scientific application. The API has been incorporated into a few scientific applications in order to evaluate the performance of each application using the adaptive loop scheduling routines on shared address space parallel machines against the automatic loop scheduling offered by present parallelizing compiler technology.
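The basic mechanism such an API builds on can be shown in a few lines of shared-memory code: worker threads claim chunks of the iteration space from a shared atomic counter, so faster threads naturally take more work. The thesis's adaptive strategies vary the chunk size and react to measured performance; this minimal sketch uses a fixed chunk size for simplicity.

```cpp
// Minimal shared-memory self-scheduling: threads repeatedly grab fixed-size
// chunks of the iteration space from a shared atomic counter.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const long n = 1000, chunk = 16;
    std::atomic<long> next{0};
    std::vector<double> out(n, 0.0);

    auto worker = [&] {
        for (;;) {
            long begin = next.fetch_add(chunk);   // claim the next chunk
            if (begin >= n) break;
            long end = std::min(begin + chunk, n);
            for (long i = begin; i < end; ++i)
                out[i] = (double)i * i;           // stand-in loop body
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < 4; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("%f\n", out[n - 1]);
}
```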

Book Performance Issues in Parallel Loop Scheduling for Multiprogrammed Multiprocessors

Download or read book Performance Issues in Parallel Loop Scheduling for Multiprogrammed Multiprocessors written by Kelvin Kam-Suen Yue and published by . This book was released in 1996 with 306 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Documentation Abstracts

Download or read book Documentation Abstracts written by and published by . This book was released in 1995 with 628 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book The Engineering Index Annual

Download or read book The Engineering Index Annual written by and published by . This book was released in 1994 with 2398 pages. Available in PDF, EPUB and Kindle. Book excerpt: Since its creation in 1884, Engineering Index has covered virtually every major engineering innovation from around the world. It serves as the historical record of virtually every major engineering innovation of the 20th century. Recent content is a vital resource for current awareness, new product information, technological forecasting and competitive intelligence. The world's most comprehensive interdisciplinary engineering database, Engineering Index contains over 10.7 million records. Each year, over 500,000 new abstracts are added from over 5,000 scholarly journals, trade magazines, and conference proceedings. Coverage spans over 175 engineering disciplines from over 80 countries. Updated weekly.

Book Performance Analysis and Tuning on Modern CPUs

Download or read book Performance Analysis and Tuning on Modern CPUs written by and published by Independently Published. This book was released on 2020-11-16 with 238 pages. Available in PDF, EPUB and Kindle. Book excerpt: Performance tuning is becoming more important than it has been for the last 40 years. Read this book to understand the performance of applications running on modern CPUs and learn how to improve it. The 170+ page guide combines the knowledge of many optimization experts from different industries.

Book Dynamic Scheduling

Download or read book Dynamic Scheduling written by Siu Kee Ng and published by . This book was released in 1996 with 74 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Dynamic Task Discovery in a Data-flow Task-based Runtime System

Download or read book Dynamic Task Discovery in a Data-flow Task-based Runtime System written by Reazul Hoque and published by . This book was released in 2019. Available in PDF, EPUB and Kindle. Book excerpt: The successful utilization of modern heterogeneous many-core architectures with complex memory hierarchies is a challenge for many application developers. Portability and performance of existing and new applications are the key challenges scientific application developers continuously face. Many evolutionary solutions have been proposed, including ones that seek to extend the capabilities of the current message passing paradigm with intra-node features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute for both local and distributed data dependency management. The method of programming such a runtime is important, as it directly affects the productivity of the developers and the performance of the applications. This work extends the capability of one such runtime, the Parallel Runtime Scheduling and Execution Controller (PaRSEC), with a novel programming approach that allows users to insert tasks into the runtime by writing sequential code. This programming model is called Dynamic Task Discovery (DTD); it discovers tasks dynamically at runtime and uses optimized graph unrolling techniques to accommodate applications with large task graphs. In this work, the bottlenecks of this programming model are identified and solutions to overcome its limitations are proposed. The performance of the implementation of DTD on dense linear algebra workloads is analyzed at scale, where DTD has shown excellent results in distributed memory: 2.3x--1.3x better performance at 128 nodes for QR factorization compared to ScaLAPACK, and, in shared memory, 4x--5x better performance for Cholesky factorization compared to other runtimes, StarPU and QUARK. DTD was also evaluated via the coupled-cluster method of the state-of-the-art quantum chemistry application NWCHEM, where it performed remarkably well among all considered runtimes at a scale of 128 nodes. The hope is that the concept and development of DTD, the detailed evaluation of its practical performance at scale, the analysis of its theoretical limitations, the thorough study and classification of various task-based runtime systems, and the design, implementation, and evaluation of the chosen runtimes on micro-benchmarks will help the broad scientific application developer community.
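The idea of dynamic task discovery, inserting tasks in sequential program order and letting the runtime infer dependences from the data each task touches, can be caricatured as follows. This is an invented mini-runtime for illustration only; PaRSEC's actual DTD interface and dependence analysis differ, and it executes ready tasks in parallel rather than in insertion order.

```cpp
// Toy dynamic task discovery: tasks are inserted sequentially, and each new
// task gains a read-after-write dependence on the last writer of its inputs.
// Names are invented for this sketch and do not reflect PaRSEC's API.
#include <cstdio>
#include <functional>
#include <map>
#include <vector>

struct MiniRuntime {
    struct Node { std::function<void()> body; std::vector<int> deps; };
    std::vector<Node> tasks;
    std::map<const void*, int> lastWriter;  // data address -> producing task id

    int insertTask(std::function<void()> body,
                   std::vector<const void*> reads, const void* writes) {
        Node n{std::move(body), {}};
        for (const void* r : reads)         // discover RAW dependences
            if (auto it = lastWriter.find(r); it != lastWriter.end())
                n.deps.push_back(it->second);
        int id = (int)tasks.size();
        tasks.push_back(std::move(n));
        lastWriter[writes] = id;            // this task is the new last writer
        return id;
    }

    void run() {                            // insertion order respects deps here;
        for (auto& t : tasks) t.body();     // a real runtime runs ready tasks in parallel
    }
};

int main() {
    MiniRuntime rt;
    double a = 0, b = 0;
    rt.insertTask([&] { a = 1; std::puts("produce a"); }, {}, &a);
    rt.insertTask([&] { b = a + 1; std::puts("consume a, produce b"); }, {&a}, &b);
    rt.run();
}
```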