Domino Effect Protection on Dataflow Error Detection and Recovery

Tiago A. O. Alves∗, Leandro A. J. Marzulo†, Sandip Kundu‡ and Felipe M. G. França∗

∗ Programa de Engenharia de Sistemas e Computação - COPPE, Universidade Federal do Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil. Email: {tiagoaoa, felipe}@cos.ufrj.br
† Instituto de Matemática e Estatística, Universidade do Estado do Rio de Janeiro (UERJ), Rio de Janeiro, Brazil. Email: [email protected]
‡ Department of Electrical and Computer Engineering, University of Massachusetts Amherst. Email: [email protected]

Abstract—Dataflow Error Detection and Recovery (DFER) has been shown to be a good approach for addressing errors in parallel programming. Previous work showed that this technique performs well, imposing reduced overhead on error-free executions. However, in the presence of errors excessive rollbacks may occur, characterizing the Domino Effect. In this paper we propose a scheme that protects execution from the Domino Effect. Our experimental results show that, without adding any significant overhead to the original DFER version, we are able to reduce the total execution time by up to 40% in situations where errors are detected. Furthermore, since there is no significant overhead, the execution time in error-free situations remains the same as in the baseline.

I. INTRODUCTION

As processor manufacturing companies shifted to chips with an ever-increasing number of cores, creating a tangible way for average programmers to exploit parallelism became imperative. The scientific community is in a quest to create programming models that make it easier to describe tasks and the interactions between them. In the dataflow model [1], programs are described as a graph where nodes are instructions (or tasks) and edges represent dependencies between instructions. Program execution follows data dependencies, instead of using a program counter. Dataflow has been shown to be a good abstraction for achieving high performance in these parallel architectures [2], [3], [4], [5], [6], [7], [8], [9], [10].

On the other hand, as the number of cores increases, so does the chance of having a fault in a core. These faults can be caused by variability in the components, by soft or transient errors, and by permanent errors caused by device degradation [11]. Since current multicore processors are manufactured in unreliable technologies [12], [13], dependability for these processors is an important issue. Transient errors occur due to external events, such as capacitive cross-talk, power supply noise, cosmic particles or α-particle radiation. Unlike permanent faults, transient faults may remain silent throughout program execution, as long as no OS trap is triggered by the fault. In these cases, program execution will not be aborted, but the output produced will potentially be wrong. This is especially hazardous in the context of High Performance Computing (HPC), since HPC programs usually run for long periods of time, increasing both the probability of an error during the execution and the overall cost of having to re-execute the entire program due to a faulty output caused by the error.

Dataflow Error Recovery (DFER) [14] is an online error detection and recovery mechanism based on dataflow execution. In DFER, tasks where error detection is desired are replicated, and a Commit instruction is created that receives copies of all input and output operands from both the original task and its replica. The Commit instruction compares the outputs of both tasks and triggers the re-execution of the original one if the outputs are not the same. Unlike traditional error detection and recovery mechanisms (as in [12], [15], [16], [17], [18], [19], [20] and [21]), DFER is based solely on data dependencies between tasks, does not require any kind of global synchronization between processors, reduces the amount of data that needs to be buffered in order to perform error recovery, only re-executes the specific tasks that used faulty data, and allows distribution of Commit and redundant tasks through static and dynamic scheduling mechanisms.

In DFER, an error can trigger an entire chain of re-executions (the Domino Effect) of tasks that depend on the one where the error was detected. Although it is possible to reduce the impact of the Domino Effect by explicitly inserting new dependencies in the dataflow graph, this approach is suboptimal in error-free scenarios. Typically, this explicit technique consists of adding edges going from Commit instructions to speculative tasks in order to guarantee that the data consumed by the latter is error-free; this lengthens the critical path, which imposes overhead even in error-free executions.


In this paper we introduce Domino Effect Protection (DEP) for DFER, an implicit technique implemented in the runtime that aims at reducing the impact of the Domino Effect while keeping error-free execution unaffected. Results show that in scenarios where a single error happens during the execution, DEP was 100% effective regardless of the point where the error is injected and performed up to 44% better than the baseline. In scenarios where errors are injected at different rates, results show that for small error rates (below 2.5%) DEP greatly outperforms the baseline version, while reducing the number of wasteful executions.

The rest of this work is organized as follows: (i) Section II explains how dataflow guided execution works and why it is a good model for exposing parallelism; (ii) Section III discusses our Dataflow Error Detection and Recovery (DFER) mechanism; (iii) Section IV presents our Domino Effect Protection solution; (iv) Section V presents and discusses experimental results; (v) Section VI concludes and discusses future work.

II. DATAFLOW GUIDED EXECUTION

In traditional (Von Neumann) machines, programs are described as sequences of instructions and execution is guided by control flow, i.e., a program counter (PC) points to the next instruction to be executed and branch/jump instructions can change the PC to point to other portions of code (as in an if-then-else statement, loop or function call). Those machines also rely on global state, such as register files. This means that, in those machines, execution is sequential by definition, although techniques like pipelining and dynamic out-of-order execution are used to extract instruction-level parallelism.

The dataflow model [1] uses a completely different approach. In dataflow machines, programs are described as graphs, where nodes are instructions (or tasks) and edges represent data dependencies between tasks. Execution in this model is guided by the flow of data, i.e., when all input operands destined to an instruction are available, an operand match occurs and the instruction is dispatched for execution. This is a natural way of exploiting parallelism, since independent instructions (instructions that have no path between them in the graph) will be able to execute in parallel if there are enough computational resources in the system. In this model, there is no need for a program counter and no need for global state, since operands are sent directly from the producer task to its consumers.

Since there is no PC in dataflow, control branches in a program must change the flow of data in the dataflow graph during execution. For example, in an if-then-else statement, the instructions of the if and the else branches will be in different subgraphs. At runtime, according to the evaluation of a logic expression, operands should be sent to the correct subgraph. A classic solution is to use steer nodes, which receive a data operand and a control operand that directs the data operand to the desired subgraph. Steer instructions are also used to implement loops in dataflow, but they are not enough on their own. In dynamic dataflow, an instruction may have multiple instances, one per iteration. During the execution of a loop, independent portions of an iteration may run faster than others, reaching the next iteration before the current one has completed. Therefore, operands are tagged with their associated instance number, which is incremented by a special node (that we call inctag) when an operand reaches the next iteration. Hence, the definition of the basic dataflow firing rule has to be extended: when all input operands with the same tag destined to an instruction are available, an operand match occurs and the instruction is dispatched for execution. A third control problem in dataflow is function calling, which can be solved in a similar fashion to loops, i.e., with the use of tags [5].
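To make the tagged firing rule concrete, the sketch below shows the operand-matching logic a PE could implement. It is a minimal illustration in C, not code taken from TALM or Trebuchet; the names (match_entry, receive_operand, dispatch) and the fixed-size tables are our own simplifying assumptions.

    /* Minimal sketch of the tagged-token firing rule (illustrative;
       not taken from TALM/Trebuchet). One matching row per in-flight
       tag; the instruction fires only when every input port holds an
       operand carrying the same tag. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_INPUTS 4
    #define MAX_TAGS   16

    typedef struct {
        int present[MAX_INPUTS];  /* which input ports have arrived */
        int value[MAX_INPUTS];    /* operand values, one per port   */
    } match_entry;

    static match_entry match_table[MAX_TAGS]; /* one row per tag */

    static void dispatch(int tag, const int *inputs, int n) {
        printf("firing instance with tag %d\n", tag);
    }

    /* called whenever an operand arrives at this instruction */
    void receive_operand(int tag, int port, int value, int n_inputs) {
        match_entry *e = &match_table[tag % MAX_TAGS];
        e->present[port] = 1;
        e->value[port]   = value;
        for (int i = 0; i < n_inputs; i++)
            if (!e->present[i])
                return;                    /* match incomplete: wait */
        dispatch(tag, e->value, n_inputs); /* operand match: fire    */
        memset(e, 0, sizeof *e);           /* free the row for reuse */
    }

    int main(void) {
        receive_operand(0, 0, 7, 2); /* tag 0, port 0: waits          */
        receive_operand(1, 0, 9, 2); /* next iteration arrives early  */
        receive_operand(0, 1, 3, 2); /* completes tag 0: fires        */
        receive_operand(1, 1, 4, 2); /* completes tag 1: fires        */
        return 0;
    }

Note how the early arrival of the tag-1 operand does not disturb the pending match for tag 0, which is exactly what the extended firing rule permits.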

Dataflow runtime environments are becoming more popular as a solution to ease parallelism exploitation in the multi- and many-core era (as in [2], [3], [4], [5], [6], [7], [8]). They can be implemented on top of traditional multicore machines, where each core runs a virtual processing element (PE) that implements the dataflow firing rule. Dataflow PEs trigger the execution of blocks of code (or even functions) on the host machine when all their inputs become present. If two blocks have no dependencies between them and are mapped to distinct PEs, they may potentially run in parallel.

III. DATAFLOW ERROR DETECTION AND RECOVERY

Dataflow Error Recovery (DFER) [14] is a mechanism for online error detection and recovery based on dataflow execution. The basic idea of DFER consists in the addition of a redundant task where error detection is desired, along with a Commit instruction that receives, from both the original (primary) and the redundant (secondary) tasks, their input and output operands, the ID of the PE where each task is mapped, their static ID (task number) and their dynamic ID (each task may have multiple instances, since we can have multiple executions in a loop or multiple re-executions due to error detection). Upon receiving those messages, the Commit instruction compares the data produced by the primary and the secondary executions and, in case a discrepancy is found, a re-execution of the primary instance is fired. Re-executions are triggered by the Commit instruction, which simply sends a message to the PE of the primary task containing the unique ID of the task and the input data. Upon re-execution, new versions of the task's output operands will be produced (including the message for the Commit). No architectural state needs to be saved in order to recover from an error, just the data consumed by an execution of the task. However, when memory containing the input data is overwritten by the task's execution, memory rollback needs to be applied, which is achieved by logging the stores made during task execution.
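As a rough illustration of the Commit step just described, the sketch below compares the outputs reported by the primary and secondary instances and fires a re-execution of the primary on a mismatch. The message layout (task_report) and the function names are hypothetical, chosen only to mirror the description above.

    /* Hedged sketch of the Commit comparison (names and message
       layout are illustrative, not DFER's actual structures). */
    #include <stdio.h>
    #include <string.h>

    #define OPERAND_BYTES 64

    typedef struct {
        int pe_id;                        /* PE the instance ran on      */
        int static_id;                    /* task number in the graph    */
        int dynamic_id;                   /* instance (iteration/re-run) */
        unsigned char in[OPERAND_BYTES];  /* copy of input operands      */
        unsigned char out[OPERAND_BYTES]; /* copy of output operands     */
    } task_report;

    /* stand-in for the runtime's messaging layer */
    static void send_reexec(const task_report *t) {
        printf("re-execute task %d (instance %d) on PE %d\n",
               t->static_id, t->dynamic_id, t->pe_id);
    }

    /* compare primary and secondary outputs; on mismatch, trigger a
       re-execution of the primary using its buffered input data */
    void commit(const task_report *primary, const task_report *secondary) {
        if (memcmp(primary->out, secondary->out, OPERAND_BYTES) != 0)
            send_reexec(primary);
        /* on a match the outputs are error-free and the buffered
           inputs for this instance can be discarded */
    }

    int main(void) {
        task_report p = {0, 3, 1, {0}, {1}};
        task_report s = {1, 3, 1, {0}, {2}};
        commit(&p, &s); /* outputs differ: re-execution is fired */
        return 0;
    }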

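Similarly, the memory rollback mentioned above, needed when a task overwrites its own input memory, can be pictured as a simple undo log. The sketch below is our own minimal illustration of logging stores and undoing them; it is not the actual Trebuchet logging code.

    /* Sketch of memory rollback via store logging: each store first
       records the old value so a detected error can undo the writes. */
    #include <stdio.h>

    #define LOG_CAP 128

    typedef struct { int *addr; int old; } store_rec;
    static store_rec store_log[LOG_CAP];
    static int log_len = 0;

    /* every store a monitored task performs goes through here */
    void logged_store(int *addr, int value) {
        store_log[log_len].addr = addr;  /* remember the location  */
        store_log[log_len].old  = *addr; /* and its previous value */
        log_len++;
        *addr = value;
    }

    /* undo the stores in reverse order, restoring the input state */
    void rollback(void) {
        while (log_len > 0) {
            log_len--;
            *store_log[log_len].addr = store_log[log_len].old;
        }
    }

    int main(void) {
        int x = 1;
        logged_store(&x, 42);  /* the task overwrites its input */
        rollback();            /* error detected: restore it    */
        printf("x = %d\n", x); /* prints x = 1 */
        return 0;
    }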

In traditional error recovery, the detection of an error in a task causes the re-execution of all tasks between checkpoints. In the worst-case scenario, where there are recovery points only at the beginning and the end of the program, the program will be entirely re-executed. In DFER, re-execution caused by an error can trigger an entire chain of re-executions of tasks that depend on the task where the error was detected (which, in the worst case, could also cause the re-execution of the entire program). We call that the Domino Effect.

In certain programs it may be desirable to avoid the possibility of the Domino Effect for the sake of performance. This situation can be addressed by inserting new edges in the graph going from Commit instructions to any other tasks where we want an error-free guarantee, creating a dependency between the Commit instruction and the target one. This way, a task that has an input edge coming from a Commit instruction is assured to only start executing after the Commit instruction has checked the data and found it to be error-free.

Consider the example of Figure 1, where the program is divided in two stages, comprised of tasks A and B and tasks C and D, respectively. In order to assure that the tasks in the second stage will only start executing after the data produced in the first stage has been checked to be error-free, edges were added from the Commit tasks of the first stage to the actual tasks of the second stage. These edges also forward the checked data produced by A and B to C and D. In panes (a) and (b) we present the original dataflow graph and the one with error detection/checkpointing, respectively.


Fig. 1. Example of checkpointing in DFER. In (a) we have the original dataflow graph, without error detection/checkpointing, and in (b) this mechanism is added. The edges from CommitA and CommitB to C and D guarantee that the tasks in the second stage of the program will only start executing after the data from the first stage is checked, thus reducing the possible number of re-executions in the program.

It is important to observe that even though this explicit protection against the Domino Effect may reduce the re-execution penalty in case of an error, the error-free execution will likely be worse in terms of performance, because the error detection is placed on the critical path. In [14], the addition of these edges from Commit tasks for explicit protection against the Domino Effect added up to 6% overhead, in error-free executions, to applications that are not memory bound. For applications that were memory intensive this overhead went up to 70%. Moreover, this approach may complicate the process of developing or compiling the code to dataflow. Therefore, a transparent and more efficient mechanism is required.

In some programs it may be important to allow the execution of tasks to begin even if the data consumed has not been committed yet. Usually this is the case for programs where performance is important, since postponing error checking by allowing execution to flow is likely to yield better performance (in an error-free scenario) than checkpointing at each individual step. Consider the graph of Figure 2. Since there are no edges coming from Commit tasks into regular ones, the execution of the tasks does not get blocked to check for errors. Instead, tasks B and C will start executing even if CommitA has not finished checking the data produced by A. The obvious side effect of this approach is that B and C might execute with input data that contains errors. Re-executions in the model can be caused by the re-execution of a preceding task (where errors were detected) or by an error detected in the execution of the task itself (triggered by the corresponding Commit instruction). In a scenario where an error occurs in the execution of A, both kinds of re-executions could be triggered for B and C, since A itself would be re-executed and CommitB and CommitC could detect errors in the data produced by B and C, respectively.

To address this issue, wait edges are added between Commit instructions. The wait edge from CommitA to CommitB sends a tag used to identify the last execution of A, i.e., the execution that was checked and is error-free. Every time a task produces data (faulty or not), it propagates this data with a unique tag that identifies that execution. This tag is then used by the corresponding Commit task to inform the other ones which execution is error-free (through the wait edges). In the case of the graph of Figure 2, CommitA informs CommitB of the tag of the last execution of A, so it knows that an execution of B using data sent by A with errors should not even be checked, preventing the second case of re-execution described. The same happens for CommitB and CommitC.

Fig. 2. Example where wait edges (labeled with w) are necessary to prevent unnecessary re-executions. The redundant tasks are omitted for readability, but there are implicit replications of A, B and C.
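The following sketch illustrates the role of the wait edge in this example: CommitB keeps the tag of the last verified execution of A (delivered over the wait edge) and skips checking any execution of B whose input came from a superseded execution of A. The names are ours, not DFER's.

    /* Illustrative sketch of wait-edge filtering (hypothetical names). */
    #include <stdbool.h>
    #include <stdio.h>

    static int last_good_tag_A = -1; /* updated via the wait edge */

    /* the wait edge from CommitA delivers the tag of A's last
       error-free execution */
    void on_wait_edge(int tag) { last_good_tag_A = tag; }

    /* tag_from_A identifies which execution of A produced B's input;
       executions of B based on superseded data will be re-executed
       anyway, so CommitB does not even check them */
    bool commitB_should_check(int tag_from_A) {
        return tag_from_A == last_good_tag_A;
    }

    int main(void) {
        on_wait_edge(2); /* A's execution tagged 2 was verified  */
        printf("%d\n", commitB_should_check(1)); /* 0: stale, skip */
        printf("%d\n", commitB_should_check(2)); /* 1: check this  */
        return 0;
    }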

The strongest features of DFER are:

• It is based solely on the data dependencies between tasks. The state of the processing elements where each task executes plays no role in it (here, the data in the shared memory is not considered part of the architectural state of a specific processor).

• It does not require any kind of global synchronization between processors, only between processors whose tasks are dependent on the faulty data.

• The reduced amount of data that needs to be buffered in order to perform error recovery (i.e., just the input data of the task executions that have yet to be checked).

• Every time an error is detected, the only processing elements involved in the recovery are the ones to which tasks dependent on the faulty data were mapped. In prior approaches for Backwards Error Recovery (BER), it was either necessary to establish global synchronous checkpoints or to asynchronously store logical time vectors in every processor (which are all compared during error recovery) [19], [20].

• The recovery mechanism only re-executes the specific tasks that used faulty data (prior approaches re-execute all the work in between checkpoints).

• Since Commit and redundant instructions can be independently mapped (as any other instruction), static and dynamic scheduling mechanisms could be used to balance the computation of those instructions.

IV. DOMINO EFFECT PROTECTION

In order to protect against the Domino Effect, we must first analyze why it happens in our execution model. By definition, the Domino Effect occurs when a single detected error triggers a chain of re-executions of all instructions that depend on the one where the error was encountered. Although we refer to the Domino Effect as the problem, the executions caused by the rollback are not the real problem. The real issue we want to address is the executions that are wasted because they used faulty data. Typically, wasteful executions occur when the error detection latency is big enough to allow faulty data to be consumed and instructions to be scheduled with that data. As described in Section III, DFER has a mechanism to detect when faulty data was consumed by an execution (and thus guarantee correctness), but it does not have a mechanism to avoid such executions taking place to begin with.

In TALM [5], our dataflow execution model, instructions that are ready to execute are pushed into a FIFO queue, and re-executions are inserted in that same queue. Therefore, what naturally happens is that if executions with faulty data are scheduled, they will happen before the rollbacks. We can thus conclude that the scheduling policies have to be modified in order to address this issue.

The first measure adopted was to change the behaviour of the processing element when it receives a message triggering a re-execution due to an error encountered by the Commit instruction. The original treatment for a re-execution message was to push the re-execution to the end of the FIFO queue, as with any execution triggered by an operand match, i.e., all operands of an instruction becoming ready. In our new version, this first re-execution of a rollback is dispatched as soon as the message is received, without being appended to the queue.

This measure alone is not sufficient, since the re-executions of instructions dependent on the one where the error was detected would still be placed at the end of the queue. Therefore we also had to change the scheduling policy for those. Before dispatching the first re-execution of the rollback operation, the PE raises a flag indicating that it is in re-execution mode and also piggybacks a message on every operand sent to other PEs (while in this mode) indicating that they must also enter re-execution mode. In re-execution mode, all new executions caused by operand matches are inserted in a separate FIFO queue that has priority over the default one, and the PE only leaves re-execution mode once this new queue becomes empty. The PE keeps track of instructions executed during re-execution mode and is thus able to simply ignore the executions with faulty data that were pushed into the default queue. This way, the number of wasteful executions gets drastically reduced, and its major remaining cause becomes the latency it takes for a PE to detect an error and enter re-execution mode.

Figure 3 shows an example of how the re-execution and default FIFOs are used to implement DEP. In pane (a) we have tasks A, B and C, which have wrong inputs. In pane (b) an error is detected: new operands for tasks B and C were received and the PE went into re-execution mode, forcing the re-execution FIFO to be used. In pane (c) task C was dispatched for execution, and in pane (d) the same happened with task B. In pane (e), since the re-execution FIFO became empty, the old instances of tasks B and C can be removed from the default FIFO and the PE goes back to normal mode.
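The sketch below condenses this policy: a rollback message dispatches the re-execution immediately and raises re-execution mode; while the mode is active, operand matches go to a prioritized FIFO, and stale entries left in the default FIFO are skipped. It is a simplified illustration with hypothetical names and bounded, non-wrapping queues, not Trebuchet's actual data structures.

    /* Simplified sketch of DEP's dual-FIFO scheduling (illustrative). */
    #include <stdbool.h>
    #include <stdio.h>

    #define QCAP    64
    #define N_INSTR 64

    typedef struct { int items[QCAP]; int head, tail; } fifo;
    static bool fifo_empty(const fifo *q) { return q->head == q->tail; }
    static void fifo_push(fifo *q, int v) { q->items[q->tail++] = v; }
    static int  fifo_pop(fifo *q)         { return q->items[q->head++]; }

    typedef struct {
        fifo def_q, reexec_q;     /* default and prioritized queues  */
        bool reexec_mode;         /* raised on a rollback message    */
        bool superseded[N_INSTR]; /* instructions re-run in the mode */
    } pe_state;

    /* operand match: route to the prioritized FIFO while recovering */
    void on_match(pe_state *pe, int instr) {
        if (pe->reexec_mode) {
            fifo_push(&pe->reexec_q, instr);
            pe->superseded[instr] = true; /* old queued copy is stale */
        } else {
            fifo_push(&pe->def_q, instr);
        }
    }

    /* rollback message from a Commit: dispatch at once, enter mode */
    void on_reexec_msg(pe_state *pe, int instr) {
        pe->reexec_mode = true;
        printf("dispatch %d (immediate re-execution)\n", instr);
    }

    /* one scheduling step of the PE */
    void schedule(pe_state *pe) {
        if (!fifo_empty(&pe->reexec_q)) {
            printf("dispatch %d\n", fifo_pop(&pe->reexec_q));
            if (fifo_empty(&pe->reexec_q))
                pe->reexec_mode = false;       /* drained: normal mode */
        } else if (!fifo_empty(&pe->def_q)) {
            int instr = fifo_pop(&pe->def_q);
            if (pe->superseded[instr])
                pe->superseded[instr] = false; /* skip the stale entry */
            else
                printf("dispatch %d\n", instr);
        }
    }

    int main(void) {
        pe_state pe = {0};
        on_match(&pe, 1);      /* B becomes ready with faulty data   */
        on_reexec_msg(&pe, 0); /* error in A: A re-executes at once  */
        on_match(&pe, 1);      /* B's new operands: prioritized FIFO */
        schedule(&pe);         /* dispatches the re-execution of B   */
        schedule(&pe);         /* silently skips B's stale entry     */
        return 0;
    }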

Fig. 3. Example of how re-execution and default FIFOs are used to implement DEP.

Although in a single-error scenario the Domino Effect Protection described above can be very efficient, that may not be the case when there are multiple errors. If a second (or third, fourth...) error occurs when the PEs are already in re-execution mode, the protection becomes useless, because the wasteful executions are going to be inserted in the prioritized queue just as they would be inserted in the default queue. Consequently, if errors are detected while in re-execution mode, the protection is simply ineffective and the system resumes the same inefficient behaviour of the original version of DFER.

One possible way of dealing with this issue would be to instantiate a new FIFO queue every time an error is detected and have multiple recursive re-execution procedures. Another option would be to substitute the FIFO queue with a data structure that allows access with priority levels, like a binary heap. This, of course, would increase the complexity of the scheduling process, as the worst-case complexity for operations in a binary heap is greater than in a FIFO queue, and therefore would only be justified in scenarios where the error rate is high enough. If errors are rare events in a system, there is no point in including these additional measures, since it is unlikely that a second error will occur while the PEs are in re-execution mode.
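As a sketch of the second alternative, the max-heap below orders ready instructions by a priority level (for instance, one level per detected error), so the re-executions triggered by a later error can overtake an earlier wave. This is our own illustration of the idea, not an implemented part of DFER.

    /* Illustrative max-heap ready queue keyed by priority level. */
    #include <stdio.h>

    #define HEAP_CAP 64

    typedef struct { int prio; int instr; } entry;
    static entry heap[HEAP_CAP];
    static int n = 0;

    static void swap_entries(entry *a, entry *b) {
        entry t = *a; *a = *b; *b = t;
    }

    /* O(log n) insertion: higher prio is dispatched first */
    void heap_push(int prio, int instr) {
        int i = n++;
        heap[i].prio  = prio;
        heap[i].instr = instr;
        while (i > 0 && heap[(i - 1) / 2].prio < heap[i].prio) {
            swap_entries(&heap[(i - 1) / 2], &heap[i]);
            i = (i - 1) / 2;
        }
    }

    /* O(log n) removal of the highest-priority ready instruction */
    int heap_pop(void) {
        int top = heap[0].instr;
        heap[0] = heap[--n];
        int i = 0;
        for (;;) {
            int l = 2 * i + 1, r = 2 * i + 2, m = i;
            if (l < n && heap[l].prio > heap[m].prio) m = l;
            if (r < n && heap[r].prio > heap[m].prio) m = r;
            if (m == i) break;
            swap_entries(&heap[i], &heap[m]);
            i = m;
        }
        return top;
    }

    int main(void) {
        heap_push(0, 10); /* normal execution            */
        heap_push(1, 20); /* re-execution for 1st error  */
        heap_push(2, 30); /* re-execution for 2nd error  */
        printf("%d\n", heap_pop()); /* 30: latest error wave first */
        printf("%d\n", heap_pop()); /* 20: first-error wave        */
        printf("%d\n", heap_pop()); /* 10: normal execution        */
        return 0;
    }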


V. EXPERIMENTAL RESULTS

In order to evaluate our mechanism, we implemented DFER and DEP in Trebuchet [5], a software runtime that provides dataflow execution on multicore machines. Experiments were executed on a host machine with an AMD Opteron Six-Core processor, using our version of Trebuchet with DFER and DEP. To validate our claims, we developed a benchmark that applies a sequence of linear transformations to a set of vectors and prints to a file the result for all vectors after each linear transformation is applied. We chose this benchmark because the linear transformations have to be applied sequentially (i.e., one after the other) to the vectors, which makes each transformation dependent on the result of the previous one; therefore, the chain of re-executions can be critically long. The benchmark has a main loop where, at each iteration, one transformation is applied to the vectors.

In our experiments we were primarily interested in three aspects: (i) the effect of a single error inserted at different times during execution, since an error at the beginning of the execution can potentially cause many more re-executions than one near the end; (ii) the behaviour of DFER when errors are inserted at different rates; and (iii) the overhead imposed on error-free execution. Preliminary experiments, where we executed the benchmark multiple times without injecting any errors, showed that DEP does not add any extra overhead to error-free executions, so we are left with studying just the other two aspects.

For the first aspect of our investigation, we executed the benchmark multiple times, inserting a single error at different times during the execution, for the previous version of DFER and then for the new version with Domino Effect Protection (DEP). The results of Figure 4 show that the earlier in the execution the error happens, the more important it is to have DEP. As the position of the error moves toward the end of the execution, the number of cascading re-executions the error causes diminishes, and thus the improvement obtained by DEP also gets smaller. We can also see that the version with DEP maintains good performance no matter at which iteration of the loop the error is inserted, which shows that for executions with a single error DEP works perfectly. Moreover, there were no wasteful executions with DEP, meaning the technique completely eliminated the Domino Effect for single-error executions. Overall, the DEP version performed up to 44% better than the baseline.

Fig. 4. Results for a single error injected in different iterations of the main loop. The x-axis is the iteration in which the error is injected, while the y-axes are the number of wasteful executions (on the left) and the total execution time (on the right). The bars represent the number of wasteful executions and the lines are the execution times. Notice that there are no wasteful executions for the version with DEP, meaning the technique perfectly eliminated the Domino Effect for single-error executions.

To evaluate the impact of multiple errors, we executed both versions (baseline and DEP) injecting errors at different rates. The error injection in this experiment is a Poisson process, so the interval between errors is an exponential random variable and the unit of time is an iteration of the loop. Therefore, the λ of this Poisson process represents the average number of errors inserted per iteration. In Figure 5 we show the experiments for this second scenario, where the x-axis is λ, the average number of errors inserted per iteration of the main loop. These results show that for small error rates the version with DEP greatly outperforms the baseline version, presenting a much smaller number of wasteful executions. However, as λ grows, DEP loses its effectiveness because, as described in Section IV, errors start to occur when the PEs are in re-execution mode, which basically nullifies DEP. In those situations, where λ is big, the solutions discussed in Section IV should be implemented.
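For reference, the error-injection process can be sketched as below: inter-error intervals are drawn from an exponential distribution with rate λ, measured in main-loop iterations. This is a schematic reconstruction of the setup described above, not the actual experiment harness.

    /* Sketch of Poisson error injection: exponential inter-arrival
       times with rate lambda (average errors per loop iteration). */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* sample an exponential interval with mean 1/lambda */
    static double exp_sample(double lambda) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* in (0,1) */
        return -log(u) / lambda;
    }

    int main(void) {
        double lambda = 0.025; /* average errors per iteration */
        double next_error = exp_sample(lambda);
        for (int iter = 0; iter < 1000; iter++) {
            if ((double)iter >= next_error) {
                printf("inject error at iteration %d\n", iter);
                next_error += exp_sample(lambda);
            }
            /* ... apply one linear transformation to the vectors ... */
        }
        return 0;
    }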


Fig. 5. Results for multiple errors injected as a Poisson process. The x-axis is λ, while the y-axes are the number of wasteful executions (on the left) and the total execution time (on the right). The bars represent the number of wasteful executions and the lines are the execution times.

VI. CONCLUSION AND FUTURE WORK

In this paper we introduced Domino Effect Protection (DEP) for DFER, an implicit technique implemented in the runtime that aims at reducing the impact of the Domino Effect while keeping error-free execution unaffected. We implemented DFER with DEP in Trebuchet [5], a dataflow runtime environment for multicore machines.

Experiments show that in scenarios where a single error happens during the execution, DEP was 100% effective regardless of the point where the error was injected. Moreover, the earlier in the execution the error happens, the more important it is to have DEP, since the number of cascading re-executions will likely be greater. Overall, the DEP version performed up to 44% better than the baseline in a scenario with just one error. In scenarios where errors are injected at different rates (as a Poisson process), results show that for small error rates the version with DEP greatly outperforms the baseline version, presenting a much smaller number of wasteful executions. However, as the error rate grows, DEP loses its effectiveness because errors start to occur when the PEs are in re-execution mode, which basically nullifies DEP.

DFER is the first error recovery mechanism for dataflow execution models, opening several research opportunities. Domino Effect Protection is an important mechanism to make error detection more efficient and transparent to programmers. It is necessary to evaluate DEP with more benchmarks, varying task granularity and error detection/correction latency, since both factors can impact the number of wasteful executions.

REFERENCES

[1] J. B. Dennis and D. P. Misunas, "A preliminary architecture for a basic data-flow processor," SIGARCH Comput. Archit. News, vol. 3, no. 4, pp. 126–132, 1974.
[2] S. Balakrishnan and G. Sohi, "Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs," in 33rd International Symposium on Computer Architecture (ISCA'06). Washington, DC, USA: IEEE, 2006, pp. 302–313.
[3] G. Bosilca, A. Bouteiller, A. Danalis, T. Hérault, P. Lemarinier, and J. Dongarra, "DAGuE: A generic distributed DAG engine for high performance computing," Parallel Computing, vol. 38, no. 1-2, pp. 37–51, 2012.
[4] K. Stavrou, D. Pavlou, M. Nikolaides, P. Petrides, P. Evripidou, P. Trancoso, Z. Popovic, and R. Giorgi, "Programming abstractions and toolchain for dataflow multithreading architectures," in 2009 Eighth International Symposium on Parallel and Distributed Computing, Jun. 2009, pp. 107–114.
[5] T. A. Alves, L. A. Marzulo, F. M. França, and V. S. Costa, "Trebuchet: exploring TLP with dataflow virtualisation," International Journal of High Performance Systems Architecture, vol. 3, no. 2/3, p. 137, 2011.
[6] L. A. J. Marzulo, T. A. Alves, F. M. G. França, and V. S. Costa, "TALM: A hybrid execution model with distributed speculation support," in International Symposium on Computer Architecture and High Performance Computing Workshops, 2010, pp. 31–36.
[7] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "OmpSs: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, no. 2, pp. 173–193, 2011.
[8] M. Solinas, R. M. Badia, F. Bodin, A. Cohen, P. Evripidou, P. Faraboschi, B. Fechner, G. R. Gao, A. Garbade, S. Girbal, D. Goodman, B. Khan, S. Koliai, F. Li, M. Luján, L. Morin, A. Mendelson, N. Navarro, A. Pop, P. Trancoso, T. Ungerer, M. Valero, S. Weis, I. Watson, S. Zuckerman, and R. Giorgi, "The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices," in DSD. IEEE, 2013, pp. 272–279.
[9] K. M. Kavi, R. Giorgi, and J. Arul, "Scheduled dataflow: Execution paradigm, architecture, and performance evaluation," IEEE Transactions on Computers, vol. 50, no. 8, pp. 834–846, 2001.
[10] J. C. Meyer, T. B. Martinsen, and L. Natvig, "Implementation of an energy-aware OmpSs task scheduling policy."
[11] S. Borkar, "Designing reliable systems from unreliable components: The challenges of transistor variability and degradation," IEEE Micro, pp. 10–16, 2005.
[12] D. Gizopoulos, M. Psarakis, S. V. Adve, P. Ramachandran, S. K. S. Hari, D. Sorin, A. Meixner, A. Biswas, and X. Vera, "Architectures for online error detection and recovery in multicore processors," in 2011 Design, Automation & Test in Europe (DATE), Mar. 2011, pp. 1–6.
[13] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," in Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2002, pp. 389–398.
[14] T. A. O. Alves, S. Kundu, L. A. J. Marzulo, and F. M. G. França, "Online error detection and recovery in dataflow execution," in IOLTS (to appear). IEEE, 2014.
[15] G. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. August, "SWIFT: Software implemented fault tolerance," in International Symposium on Code Generation and Optimization (CGO), 2005, pp. 243–254.
[16] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error detection by duplicated instructions," Center for Reliable Computing technical report, no. 2, 2000.
[17] N. Aggarwal, N. P. Jouppi, and J. E. Smith, "Configurable isolation: Building high availability systems with commodity multi-core processors," in Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.
[18] D. J. Sorin, Fault Tolerant Computer Architecture, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool, Jan. 2009, vol. 4, no. 1.
[19] D. Sorin, M. Martin, M. Hill, and D. Wood, "SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, pp. 123–134.
[20] M. Prvulovic, Z. Zhang, and J. Torrellas, "ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02). Washington, DC, USA: IEEE Computer Society, 2002, pp. 111–122.
[21] E. Rotenberg, "AR-SMT: A microarchitectural approach to fault tolerance in microprocessors," in Digest of Papers, Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999, pp. 84–91.