EBookClubs

Read Books & Download eBooks Full Online

Book Checkpointing in a Virtual Shared Memory System

Download or read book Checkpointing in a Virtual Shared Memory System written by Tony P. Ng and published by . This book was released on 1991 with total page 36 pages. Available in PDF, EPUB and Kindle. Book excerpt: Abstract: "In this paper we describe several checkpointing algorithms for backward error recovery in virtual shared memory systems. Counterparts to some of these algorithms can be found in message-passing systems, but a shared memory system allows a significant optimization. The read-write semantics of the shared memory can be used to distinguish network messages that do not create dependencies from those that do, whereas all messages passed in a message-passing system are assumed to create dependencies. We measure the performance of the checkpointing algorithms using a trace-driven simulation of several shared memory parallel applications."

Book Checkpointing in Distributed Virtual Memory by Utilizing Local Virtual Memory

Download or read book Checkpointing in Distributed Virtual Memory by Utilizing Local Virtual Memory written by F. X. Nursalim Hadi and published by . This book was released on 1995 with total page pages. Available in PDF, EPUB and Kindle. Book excerpt: This study explores a recovery strategy using checkpointing in a distributed shared virtual memory (DVM) system. DVM shares virtual memory in a loosely-coupled multi-computer system and is implemented at the software-level. The goal of this recovery strategy is to obtain a consistent recovery line that is close to the time of failure. Therefore the system could be rolled back from the time of failure to the closest possible state of normal execution. In order to achieve the objective, this thesis proposes a checkpointing strategy that utilizes virtual memory (VM) as transient checkpoint storage in addition to commonly-used stable storage. In controllable checkpoint intervals, these additional checkpoints make checkpoint intervals shorter; in turn making the recovery line closer to the time of failure. Compared to the cost of taking checkpoints to stable storage, taking these additional checkpoints does not cost much since they are saved to virtual memory. This thesis will show that the additional cost of these transient storage checkpoints is very low, while the benefit of reducing the rollback cost is very high. The utilization of VM will be applied to commonly-used independent checkpointing and coordinated checkpointing strategies. The checkpointing protocols of both strategies are changed to accommodate additional checkpointing to VM. This thesis will show that the modified protocols still guarantee state consistency after recovery. Simulations on trace data and experiments on the Choices operating system are conducted to measure the performance of the proposed checkpointing strategies. 
We compare independent checkpointing strategies with and without VM utilization; we also compare coordinated checkpointing strategies with and without VM utilization. The simulations and experiments demonstrate that in the independent checkpointing strategy, utilizing VM reduces rollback costs at only a small fraction of additional checkpoint cost. The same result also applies to the coordinated checkpointing strategy utilizing VM.

Book An Evolutionary Approach to Concurrent Checkpointing

Download or read book An Evolutionary Approach to Concurrent Checkpointing written by and published by . This book was released on 1992 with total page 33 pages. Available in PDF, EPUB and Kindle. Book excerpt: This paper describes an evolutionary approach to establishing a consistent global recovery line for concurrent processes. Unlike globally synchronized schemes, our approach uses no agreement protocols and thus no rounds of messages to decide upon a recovery line. Unlike logging-based schemes, our approach neither stores the messages exchanged between concurrent processes, nor constructs message dependence graphs to determine a recovery line. In contrast to communication synchronized schemes, our technique reduces overhead by not always synchronizing computation with checkpointing and by allowing a potentially inconsistent recovery line temporarily. Evolutionary concurrent checkpointing periodically starts a checkpointing session by checkpointing each process locally. As the checkpointing session progresses, the initial checkpoints are updated according to the communication between the concurrent processes. This local checkpoint updating guarantees that the recovery line evolves into a consistent line. Evolutionary concurrent checkpointing can be applied to message-based multicomputer systems, shared virtual memory systems, and shared memory multiprocessors. We evaluate the performance of our approach using execution traces from a hypercube multicomputer and a shared-memory multiprocessor. Keywords: fault tolerant computing, checkpointing, and rollback error recovery.

Book Checkpointing a Multithreaded Distributed Shared Memory Computer System

Download or read book Checkpointing a Multithreaded Distributed Shared Memory Computer System written by William R. Dieter and published by . This book was released on 2001 with total page 218 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Checkpointing Shared Memory Programs at the Application level

Download or read book Checkpointing Shared Memory Programs at the Application level written by M. Schulz and published by . This book was released on 2004 with total page 8 pages. Available in PDF, EPUB and Kindle. Book excerpt: Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.

Book Using Lightweight Checkpoint/Recovery to Improve the Availability and Designability of Shared Memory Multiprocessors

Download or read book Using Lightweight Checkpoint/Recovery to Improve the Availability and Designability of Shared Memory Multiprocessors written by Daniel J. Sorin and published by . This book was released on 2002 with total page 194 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book OpenSHMEM and Related Technologies: Enhancing OpenSHMEM for Hybrid Environments

Download or read book OpenSHMEM and Related Technologies: Enhancing OpenSHMEM for Hybrid Environments written by Manjunath Gorentla Venkata and published by Springer. This book was released on 2016-12-14 with total page 244 pages. Available in PDF, EPUB and Kindle. Book excerpt: This book constitutes the proceedings of the Third OpenSHMEM Workshop, held in Baltimore, MD, USA, in August 2016. The 14 full papers and 3 short papers presented were carefully reviewed and selected from 25 submissions. The papers discuss a variety of ideas for extending the OpenSHMEM specification and making it efficient for current and next generation systems. These include active messages, non-blocking APIs, fault tolerance capabilities, exploring implementation of OpenSHMEM using communication layers such as OFI and UCX, and implementing OpenSHMEM for heterogeneous architectures.

Book The Performance of Consistent Checkpointing in Distributed Shared Memory Systems

Download or read book The Performance of Consistent Checkpointing in Distributed Shared Memory Systems written by Gilbert Cabillic and published by . This book was released on 1995 with total page 26 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Checkpoint in Distributed Virtual Memory by Using Local Virtual Memory

Download or read book Checkpoint in Distributed Virtual Memory by Using Local Virtual Memory written by F. X. Nursalim Hadi and published by . This book was released on 1995 with total page 134 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Checkpointing and Rollback Recovery in Distributed Shared Memory Systems

Download or read book Checkpointing and Rollback Recovery in Distributed Shared Memory Systems written by and published by . This book was released on 1994 with total page 24 pages. Available in PDF, EPUB and Kindle. Book excerpt: Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory systems (DSM) is expensive because of high frequency of communication. In this paper we show that, because of information redundancy, not all message-passing dependences need to be considered to roll back to a consistent state in DSM systems, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop a model of execution where client processes running an application interact atomically with a set of shared-memory server processes on every access to shared data. We show that under this model, dependences are significantly reduced over the message-passing model. We use results from simulation with multiprocessor address traces to demonstrate the reduction in dependences.

Book Virtual Shared Memory for Distributed Architectures

Download or read book Virtual Shared Memory for Distributed Architectures written by Eva Kühn and published by Nova Publishers. This book was released on 2001 with total page 138 pages. Available in PDF, EPUB and Kindle. Book excerpt: Virtual Shared Memory for Distributed Architecture

Book Fault Tolerant Parallel and Distributed Systems

Download or read book Fault Tolerant Parallel and Distributed Systems written by Dimiter R. Avresky and published by Springer Science & Business Media. This book was released on 2012-12-06 with total page 396 pages. Available in PDF, EPUB and Kindle. Book excerpt: The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and everything is inter-networked. The application will be dominated by storage, search, retrieval, analysis, exchange and updating of information in a wide variety of forms. Heavy demands will be placed on systems by many simultaneous requests. And, fundamentally, all this shall be delivered at much higher levels of dependability, integrity and security. Increasingly, large parallel computing systems and networks are providing unique challenges to industry and academia in dependable computing, especially because of the higher failure rates intrinsic to these systems. The challenge in the last part of this decade is to build a system that is both inexpensive and highly available. A machine cluster built of commodity hardware parts, with each node running an OS instance and a set of applications extended to be fault resilient can satisfy the new stringent high-availability requirements. The focus of this book is to present recent techniques and methods for implementing fault-tolerant parallel and distributed computing systems. Section I, Fault-Tolerant Protocols, considers basic techniques for achieving fault-tolerance in communication protocols for distributed systems, including synchronous and asynchronous group communication, static total causal ordering protocols, and fail-aware datagram service that supports communications by time.

Book Parallel I/O for High Performance Computing

Download or read book Parallel I/O for High Performance Computing written by John M. May and published by Morgan Kaufmann. This book was released on 2001 with total page 392 pages. Available in PDF, EPUB and Kindle. Book excerpt: "I enjoyed reading this book immensely. The author was uncommonly careful in his explanations. I'd recommend this book to anyone writing scientific application codes." -Peter S. Pacheco, University of San Francisco "This text provides a useful overview of an area that is currently not addressed in any book. The presentation of parallel I/O issues across all levels of abstraction is this book's greatest strength." -Alan Sussman, University of Maryland Scientific and technical programmers can no longer afford to treat I/O as an afterthought. The speed, memory size, and disk capacity of parallel computers continue to grow rapidly, but the rate at which disk drives can read and write data is improving far less quickly. As a result, the performance of carefully tuned parallel programs can slow dramatically when they read or write files, and the problem is likely to get far worse. Parallel input and output techniques can help solve this problem by creating multiple data paths between memory and disks. However, simply adding disk drives to an I/O system without considering the overall software design will not significantly improve performance. To reap the full benefits of a parallel I/O system, application programmers must understand how parallel I/O systems work and where the performance pitfalls lie. Parallel I/O for High Performance Computing directly addresses this critical need by examining parallel I/O from the bottom up. This important new book is recommended to anyone writing scientific application codes as the best single source on I/O techniques and to computer scientists as a solid up-to-date introduction to parallel I/O research. 
Features: An overview of key I/O issues at all levels of abstraction-including hardware, through the OS and file systems, up to very high-level scientific libraries. Describes the important features of MPI-IO, netCDF, and HDF-5 and presents numerous examples illustrating how to use each of these I/O interfaces. Addresses the basic question of how to read and write data efficiently in HPC applications. An explanation of various layers of storage - and techniques for using disks (and sometimes tapes) effectively in HPC applications.

Book Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory

Download or read book Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory written by Golden G. Richard and published by . This book was released on 1993 with total page 20 pages. Available in PDF, EPUB and Kindle. Book excerpt: The proposed scheme supports local process recovery without forcing rollback of operational processes during recovery. Our method is particularly useful in environments where taking process checkpoints is expensive (e.g., in some UNIX [trademark] environments).

Book Dependable Computing

    Book Details:
  • Author : Ravishankar K. Iyer
  • Publisher : John Wiley & Sons
  • Release : 2024-05-29
  • ISBN : 1118709446
  • Pages : 852 pages

Download or read book Dependable Computing written by Ravishankar K. Iyer and published by John Wiley & Sons. This book was released on 2024-05-29 with total page 852 pages. Available in PDF, EPUB and Kindle. Book excerpt: Dependable Computing: Covering dependability from software and hardware perspectives. Dependable Computing: Design and Assessment looks at both the software and hardware aspects of dependability. This book:
  • Provides an in-depth examination of dependability/fault tolerance topics
  • Describes dependability taxonomy, and briefly contrasts classical techniques with their modern counterparts or extensions
  • Walks up the system stack from the hardware logic via operating systems up to software applications with respect to how they are hardened for dependability
  • Describes the use of measurement-based analysis of computing systems
  • Illustrates technology through real-life applications
  • Discusses security attacks and unique dependability requirements for emerging applications, e.g., smart electric power grids and cloud computing

Finally, using critical societal applications such as autonomous vehicles, large-scale clouds, and engineering solutions for healthcare, the book illustrates the emerging challenges faced in making artificial intelligence (AI) and its applications dependable and trustworthy. This book is suitable for those studying in the fields of computer engineering and computer science. Professionals who are working within the new reality to ensure dependable computing will find helpful information to support their efforts. With the support of practical case studies and use cases from both academia and real-world deployments, the book provides a journey of developments that include the impact of artificial intelligence and machine learning on this ever-growing field.
This book offers a single compendium that spans the myriad areas in which dependability has been applied, providing theoretical concepts and applied knowledge with content that will excite a beginner, and rigor that will satisfy an expert. Accompanying the book is an online repository of problem sets and solutions, as well as slides for instructors, that span the chapters of the book.