Platforms, Applications and Demonstrators (Theme 4)

Initial Objectives

4a. Understand the impact and interplay of fault tolerance and energy management in many-core systems through experimental studies with existing platforms, in the context of both traditional computing platforms and SpiNNaker, allowing the insight gained to be applied to the research of Themes 1-3.
4b. Research new heterogeneous FPGA-based many-core embedded systems. Such systems will be based upon augmenting currently adopted computing paradigms, and will represent systems for the next-generation embedded devices envisaged within the next decade.
4c. Investigate run-time optimisations for energy management and fault tolerance in state-of-the-art homogeneous architectures such as SpiNNaker, focussing research on constrained systems with a lightweight OS.
4d. Understand the impact and influence of different application requirements and computation patterns on reliability and energy management, selecting appropriate benchmarks, computational patterns/motifs, and applications to enable relevant validation of research outputs.
4e. Develop technology platforms and demonstrators which validate the research from Themes 1-3. Demonstrators will be developed for the architectures of both objectives 4b and 4c, thus incorporating many of PRiME’s outputs, and will use appropriate applications and patterns highlighted in objective 4d, as shown in the figure below. The demonstrator(s) will implement one or more real and industrially relevant application case studies, selected in consultation with the Advisory Board, showcasing the improvements made possible through the research in real-world applications.

Introduction

This research theme is investigating energy-reliability trade-offs in existing highly parallel systems, and developing future platforms for many-core embedded systems. These platforms will be used to integrate and validate research outputs of Themes 1-3, including run-time optimisation for energy and resilience in the presence of hardware failures. This theme will also deliver programme demonstrators to showcase next-generation applications and benchmarks. A key feature is to benchmark our novel technologies against systems that do not actively manage the energy-reliability trade-off, both in ‘traditional’ multi-core systems (approach 1) and in ‘non-traditional’ bio-inspired systems (approach 2) being developed by members of the consortium.

The first approach will use Field Programmable Gate Arrays (FPGAs) to implement Reduced Instruction Set Computer (RISC) soft-cores, which will then be instrumented for energy/reliability management. Existing examples without such active management include the BERI FPGA soft-core and the RAMP Blue platform: an FPGA system containing 1008 Xilinx MicroBlaze cores (32-bit, 90MHz) running off-the-shelf applications and scientific benchmarks. Existing platforms, soft-cores and many-core OSes such as these will be used to evaluate energy efficiency, fault-tolerance parameters and their interplay. Systems which consider these parameters, such as the pioneering work of Clearspeed and the XMOS Xcore, are beginning to emerge; these will inform PRiME’s research. Using this understanding, and as outputs from PRiME’s other research themes become available, a novel FPGA-based platform will be developed to drive progress and support benchmarking. This platform will incorporate architectures and cross-layer collaborative mechanisms which are modelled, verified and optimised using the methods and tools developed in Themes 2 and 3. It is envisaged that the platform will contain heterogeneous and scalable (up to 1024 cores) many-core processors, representing the requirements of high-performance embedded systems and future applications over the next 5-10 years. The platform will take advantage of new technologies that become available during the project, utilising IP from our industrial collaborators (for example ARM’s big.LITTLE, Altera’s Nios II) where available. Where suitable, the platforms will run appropriately identified off-the-shelf benchmarks and applications, expressed using off-the-shelf parallel programming models, and compiled and scheduled using existing compilers and embedded OSes. Observations from the platform and its validation of research outputs will iteratively feed back to Themes 1-3 and inform further development.

The second approach, considering ‘non-traditional’ computing platforms, will investigate novel computer architectures, for example those inspired by biology and the workings of the human brain. While other research has built systems simulating neural networks using interconnected FPGAs, SpiNNaker (University of Manchester) is based on a custom Multi-Processor System-on-Chip (MPSoC) that incorporates 18 ARM968 processor cores. Its novelty lies in the inter-processor communication mechanism that enables very high numbers of very small packets, each representing a neural ‘spike’, to propagate across the machine in much less than a millisecond (the requirement for biological real time). The machine will ultimately scale up to a system with over a million ARM processors, and at this scale both fault tolerance and energy efficiency are significant engineering concerns. PRiME will use SpiNNaker, as an existing state-of-the-art platform, to identify and develop the new run-time approaches to reliability and energy efficiency that the platform needs. One potential option for reliability is that, with cores considered cheap, some can be left uncommitted to provide fault-tolerant redundancy. Further, unused cores could be clock gated to minimise power dissipation and protect against the more ‘permanent’ faults. While inter-chip links are fault-tolerant, more pertinent to PRiME is the ability of the system to recover from software crashes. In order to permit run-time recovery, we will consider forms of checkpointing, for example storing redundant data in non-local memory to reduce the chance of corruption. State restoration will also be investigated, migrating tasks to another core or, in the worst case, a different chip. Introducing elasticity into timing constraints enables a simpler mapping of a problem onto a message-passing machine such as SpiNNaker. We anticipate that this will give freedom for fault tolerance through load-shedding in software. However, the frequency and depth of saving checkpoint data are parameters which can be tuned, and run-time trade-offs between energy and reliability such as these require investigation.
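
To make the checkpoint-frequency trade-off concrete, the following is a minimal sketch using a standard first-order model (Young's approximation), which is not taken from PRiME's own work. It assumes faults arrive at a constant average rate, that a fault costs on average half an interval of recomputation, and that time overhead is a reasonable proxy for energy at roughly constant power. The function names and the numeric values are hypothetical.

```python
import math


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of run time lost to checkpointing and rework.

    checkpoint_cost_s / interval_s : time spent saving checkpoints
    interval_s / (2 * mtbf_s)      : expected recomputation after a fault,
                                     assuming it lands mid-interval on average
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def young_interval(checkpoint_cost_s, mtbf_s):
    """Interval that minimises overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    cost_s = 0.5     # hypothetical time to save one checkpoint
    mtbf_s = 3600.0  # hypothetical mean time between faults
    best = young_interval(cost_s, mtbf_s)
    # Checkpointing too often or too rarely both increase the overhead.
    for interval in (best / 4, best, best * 4):
        lost = overhead_fraction(interval, cost_s, mtbf_s)
        print(f"interval={interval:7.1f}s  overhead={lost:.4f}")
```

In this model a deeper checkpoint (more state saved) raises the checkpoint cost and so lengthens the best interval, while a less reliable platform (shorter mean time between faults) shortens it. A real run-time manager would need measured costs and fault rates rather than these assumed constants.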

This theme will be led by Steve Furber (Manchester), with individual workpackages led by investigators across institutions. The theme’s research will be tackled by Post-Doctoral Research Associates at Manchester, Imperial and Southampton.

Many-Core Applications and Benchmarks

One of the challenges of this research theme is in identifying suitable and representative applications for the technology platforms and demonstrators. Recent research has highlighted that, to benchmark many-core systems, representative workloads are required which use a mix of concurrent and heterogeneous applications. Furthermore, the high-performance and embedded computing communities have been exploring the types of application that will gain major performance benefit through many-core parallelisation. Many-core application domains are rapidly changing and developing, and there is increasing recognition that it is better to capture patterns of computation and communication common across an array of application domains. To this end, Asanovic et al. proposed the capture of such patterns in the form of “motifs”, ranging from sparse linear algebra to dynamic programming. The 13 motifs proposed represent common computational patterns that are essential to next-generation many-core applications with varying levels of parallelism, without needing to consider specific applications. Furthermore, as the motifs are not closely coupled to specific code or language implementations, they encourage innovation.

The technology platform will be validated using appropriately selected computation patterns (for example those represented by Asanovic’s ‘motifs’) and established benchmarks for embedded computing platforms. This will allow the effectiveness and scope of PRiME’s research outputs to be profiled and better understood. In addition, a number of applications will be incorporated into the research demonstrators, to maximise impact and facilitate dissemination of the project’s outputs. As applications for many-core systems are continually evolving, and can be expected to be vastly different in 5 years, they will be selected and refined throughout PRiME’s duration. This selection will be made in consultation with PRiME’s industrial experts and external stakeholders, and with regard to priority areas identified by RCUK (e.g. digital economy, lifelong health and wellbeing). Novel platforms, such as SpiNNaker, have differing target applications to be considered. Simple, non-neural, massively parallel applications have already been demonstrated (such as the modelling of heat dissipation), and new scenarios such as other finite-element problems or neural simulations will be considered in this research.

Figure: Interrelation of Theme 4’s workpackages (WPs) and output objectives

Methodology and Workplan

Homogeneous Many-Core Platform. The SpiNNaker platform will be used to evaluate, compare and demonstrate run-time optimisations for energy efficiency and reliability on a highly novel constrained platform. Run-time fault monitoring and recovery mechanisms for a many-core kernel will be developed and implemented, supported by the outputs of Theme 2. This will facilitate software recovery from crashes across both individual and multiple cores. In addition, the kernel will exchange information with network neighbours to allow routing around temporary and permanent network faults, including congestion hot-spots. The workpackage will deliver a generalised fault-tolerant software layer that operates transparently for existing and future SpiNNaker applications. While SpiNNaker will be used as a platform that is already available, new homogeneous many-core platforms will also be researched.
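
As an illustration of the recovery path described above, combined with the uncommitted spare cores suggested in the Introduction, the sketch below moves the task of a failed core to a spare, preferring a spare on the same chip and falling back to a different chip. It is a toy model under stated assumptions and not SpiNNaker's kernel or API: the `Core` record, `find_spare` and `recover` are hypothetical names, fault detection is assumed to have already happened, and state restoration from a checkpoint is only marked by a comment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    chip: int
    index: int
    task: Optional[str] = None  # None marks an uncommitted spare core
    alive: bool = True


def find_spare(cores: List[Core], preferred_chip: int) -> Optional[Core]:
    """Pick a live spare core, preferring one on the same chip as the failure."""
    spares = [c for c in cores if c.alive and c.task is None]
    # False sorts before True, so same-chip spares come first.
    spares.sort(key=lambda c: c.chip != preferred_chip)
    return spares[0] if spares else None


def recover(cores: List[Core], failed: Core) -> Optional[Core]:
    """Mark a core as failed and move its task to a spare, if one exists."""
    failed.alive = False
    task, failed.task = failed.task, None
    if task is None:
        return None  # a spare failed; there is nothing to migrate
    spare = find_spare(cores, failed.chip)
    if spare is not None:
        spare.task = task  # a real system would restore checkpointed state here
    return spare  # None means no spare was left and the task is shed


if __name__ == "__main__":
    cores = [Core(0, 0, "A"), Core(0, 1, "B"), Core(0, 2),
             Core(1, 0, "C"), Core(1, 1)]
    print(recover(cores, cores[0]))  # task A moves to the spare on chip 0
    print(recover(cores, cores[1]))  # task B must move to the spare on chip 1
```

Returning `None` when no spare remains leaves the policy decision (for example load-shedding) to the caller, which keeps the sketch independent of any particular application.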

Heterogeneous Many-Core Platform. Novel FPGA-based many-core platforms will be researched which use heterogeneous IP cores and synthesis/compilation methods that are developed in Theme 3. Existing architectures will be evaluated to inform the design of the new platform, the major novelty being the incorporation of the necessary system-level support to enable the research outputs from Themes 2-3. Cores developed in Theme 3 will be integrated into a board-level test-bed, allowing execution of benchmarks and evaluation of architectures using run-time instrumentation.

Technology Platforms and Applications. Outputs from the above workpackages will be consolidated into many-core platforms that can be used by Themes 2-3. These platforms will enable investigation of energy-reliability trade-offs, which are anticipated to vary widely depending on the application, fault model and architecture, especially if faults require the migration of processes around the system. These platforms will also be used to develop technology demonstrators to showcase PRiME’s achievements to a wide audience. Benchmarks, computation patterns and applications will be evaluated and carefully considered for use with the platforms and demonstrators. Demonstrators will be delivered which exhibit fault tolerance and recovery, operating across a variety of applications and architectures, together with recommendations on how they could be used in other circumstances.

Key Outputs and Connections to Other Themes

The outputs from the Platforms, Applications and Demonstrators theme include: a demonstration of the impact of energy efficiency and reliability in SpiNNaker, supporting Themes 1-3; run-time fault monitoring and recovery mechanisms for SpiNNaker, supported by outputs of Theme 2 and embedded into a generalised fault-tolerant software layer for existing and future SpiNNaker applications; a publication presenting the trade-offs between energy, reliability and performance in existing heterogeneous many-core platforms, supporting Themes 1-3; a heterogeneous FPGA-based many-core platform capable of integrating research outputs from Themes 2-3; identification of suitable benchmarks, applications and ‘motifs’ for evaluating/demonstrating PRiME’s research; and initial demonstrator(s) of technology developed over PRiME’s first 30 months, followed by final demonstrator(s) of PRiME’s research outputs.

This research theme is inherently linked to PRiME’s other themes. Its objectives and workpackages serve to bring the research themes together, validating outputs and translating them to other themes. The technology platforms and demonstrators will be developed such that they can be easily and economically reproduced as required across PRiME’s different institutions, themes and workpackages. Specifically, the research of this theme will deliver a greater understanding of energy efficiency and fault tolerance in many-core systems; this information will be used by Theme 1 (Cross-Layer Theory and Models). It will implement and evaluate the new architectures and run-time approaches to energy efficiency and fault tolerance developed in Themes 2 (High Integrity Run-time Management and Optimisation) and 3 (Heterogeneous Many-Core Architectures and Hardware Reconfiguration). It will also deliver technology platforms, benchmarks, computation patterns and motifs for Themes 1-3, and incorporate research outputs of Themes 2-3 into PRiME demonstrators.