Cellular Multiprocessors

Supported by:


(#GR/S79572/01)

(#IP 27648 (FP6))
(#STREP 249059 (FP7))

(#443570)

Principal Investigator

Marcelo Cintra - School of Informatics - University of Edinburgh

Research Assistants

Christian Fensch (Ph.D. 2008) - School of Informatics - University of Edinburgh
Pedro Díaz (Ph.D. 2011) - School of Informatics - University of Edinburgh
Vasileios Porpodas (Ph.D. 2013) - School of Informatics - University of Edinburgh
Konstantina Mitropoulou (Ph.D. 2014) - School of Informatics - University of Edinburgh
Andrew McPherson (Ph.D. 2015) - School of Informatics - University of Edinburgh

Tom Ashby (Post-doc - 2005-2007) - School of Informatics - University of Edinburgh

Research Collaborators

Michael O'Boyle - School of Informatics - University of Edinburgh
Vijay Nagarajan - School of Informatics - University of Edinburgh
Wolfgang Karl - Faculty of Informatics - University of Karlsruhe
Manuel E. Acacio - Department of Computer Engineering and Technology - University of Murcia

Project Objectives

To investigate the use of single-chip multiprocessor architectures based on the replication of very simple processor-memory "cells". This investigation follows two main lines of research. The first is the development of novel compilation techniques to extract larger degrees of thread-level parallelism (TLP) from traditionally hard-to-parallelise applications. The second is the development of improved architectures through the systematic study of architectural design alternatives under various performance, complexity, and power dissipation constraints.

Project Contributions

In [16] we proposed a hybrid software-hardware mechanism for cache coherence in tiled CMPs with scalable on-chip interconnects. The mechanism relies on hardware to perform remote cache accesses and moves the responsibility for data mapping and coherence to the OS.
In [15] we presented a tool that can be used to identify the breakdown of cache miss types and, thus, assist programmers in improving the data locality behavior of their programs.
In [14] we proposed a novel class of hardware prefetchers that allow the global miss address stream to be first localized according to different correlation criteria and later chained following their original temporal behavior. This mechanism allows for the simultaneous exploitation of different types of correlation while maintaining timeliness.
In [13] we proposed a new OS-managed policy for mapping memory blocks to caches in a tiled NUCA. This policy addresses the trade-off between cache access latency and number of off-chip accesses and also introduces an upper bound on the deviation of the distribution of memory pages among cache tiles.
In [12] we presented a compiler scheme to automatically insert calls to run-time system functions to instrument code in order to generate performance models of parallel applications. The approach closes the gap between compiler-supported automatic model construction and the manual analytical modeling of workloads.
In [11] we proposed a novel hybrid software-hardware coherence mechanism, where software is responsible for triggering the coherence actions - self-invalidations and writebacks - at appropriate times while hardware uses Bloom filters to perform more selective self-invalidations.
In [10] we extended the work in [11] by investigating in detail the impact of system design parameters and extending the system to support multi-level cache hierarchies.
In [9] we proposed a simple technique that dynamically adjusts the bits used for cache indexing so as to disperse the working set more evenly across the available sets. This technique minimizes conflict misses, leading to improved energy efficiency.
In [8] we proposed a new scheduling algorithm for heterogeneous clustered VLIW processors with software DVFS control that performs cluster assignment, instruction scheduling, and fast frequency selection simultaneously. The proposed algorithm solves the phase-ordering problem present in existing algorithms.
In [7] we proposed a new error code generator that adaptively distributes the error detection overhead to the available resources across multiple cores, fully exploiting the abundant ILP of these architectures and adapting to a wide range of architecture configurations (issue-width, inter-core delay).
In [6] we proposed a new unified cluster assignment and instruction-scheduling algorithm that adapts to the inter-cluster latency by performing fine-grain switching between two clustering heuristics. The approach generates better performing code for a wide range of inter-cluster latency values.
In [5] we proposed a novel instruction scheduling algorithm for clustered VLIW processors that combines cluster assignment, instruction scheduling, and inter-cluster communication reuse. The proposed algorithm improves performance by eliminating phase-ordering issues among these three code generation and optimization steps.
In [4] we proposed a novel instruction scheduling algorithm for VLIW processors with non-blocking caches that copes better with unpredictable cache-memory latencies. Aligned Scheduling exploits the VLIW-specific cache-miss semantics to efficiently align cache misses on the same scheduling cycle.
In [3] we proposed a new error code generator that reduces error detection overhead by reducing the impact of basic-block fragmentation. The approach breaks the synchronized execute-check-confirm-execute cycle, thus generating more scheduler-friendly code with more ILP.
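
As a rough illustration of the idea behind [9], the sketch below shows how the choice of address bits used for cache indexing can change how evenly a working set spreads over the cache sets. It is a toy model and not the mechanism of [9]: the function names, the per-bit balance heuristic in pick_index_bits, and the strided working set are all assumptions made for this example.

```python
from collections import Counter


def set_index(block_addr, index_bits):
    """Build a set index by gathering the chosen address bits (LSB first)."""
    index = 0
    for position, bit in enumerate(index_bits):
        index |= ((block_addr >> bit) & 1) << position
    return index


def max_set_occupancy(block_addrs, index_bits):
    """Largest number of distinct blocks that map to any single set."""
    counts = Counter(set_index(a, index_bits) for a in set(block_addrs))
    return max(counts.values())


def pick_index_bits(block_addrs, num_index_bits, candidate_bits):
    """Hypothetical heuristic (an assumption, not taken from [9]):
    prefer the bits that split the working set closest to half and half."""
    blocks = set(block_addrs)

    def imbalance(bit):
        ones = sum((a >> bit) & 1 for a in blocks)
        return abs(2 * ones - len(blocks))

    ranked = sorted(candidate_bits, key=lambda b: (imbalance(b), b))
    return sorted(ranked[:num_index_bits])


# Assumed workload: 16 blocks accessed with a stride of 16 blocks,
# in a cache with 16 sets (4 index bits).
working_set = [i * 16 for i in range(16)]
conventional = [0, 1, 2, 3]          # low-order bits, as in a standard cache
chosen = pick_index_bits(working_set, 4, range(12))

print("conventional bits:", conventional,
      "max blocks per set:", max_set_occupancy(working_set, conventional))
print("chosen bits:", chosen,
      "max blocks per set:", max_set_occupancy(working_set, chosen))
```

With this strided working set the low-order bits are identical for every block, so conventional indexing places all 16 blocks in one set, while bits 4 to 7 give each block its own set. Ranking bits independently ignores correlation between bits, so this heuristic can pick poorly on other access patterns; it is meant only to make the notion of dispersing the working set concrete.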

Publications (sorted by date)


Theses

  • Ensuring Performance and Correctness for Legacy Parallel Programs.
    Andrew McPherson
    Ph.D., School of Informatics, University of Edinburgh, 2015.
  • Performance Optimizations for Compiler-based Error Detection.
    Konstantina Mitropoulou
    Ph.D., School of Informatics, University of Edinburgh, 2014.
  • Instruction Scheduling Optimizations for Energy Efficient VLIW Processors.
    Vasileios Porpodas
    Ph.D., School of Informatics, University of Edinburgh, 2013.
  • Mechanisms to Improve the Efficiency of Hardware Data Prefetchers.
    Pedro Díaz
    Ph.D., School of Informatics, University of Edinburgh, 2011.
  • An OS-Based Alternative to Full Hardware Coherence on Tiled Chip-Multiprocessors.
    Christian Fensch
    Ph.D., School of Informatics, University of Edinburgh, 2008.

Talks

  • DRIFT: Decoupled compileR-based Instruction-level Fault-Tolerance (presentation)
    Intl. Wksp. on Languages and Compilers for Parallel Computing, San Jose, USA, September 2013
  • Aligned Scheduling: Cache-efficient Instruction Scheduling for VLIW Processors (presentation)
    Intl. Wksp. on Languages and Compilers for Parallel Computing, San Jose, USA, September 2013
  • CAeSaR: Unified Cluster-Assignment Scheduling and communication Reuse for clustered VLIW processors (presentation)
    Intl. Conf. on Compilers, Architecture and Synthesis for Embedded Systems, Montreal, Canada, September 2013
  • LUCAS: Latency-adaptive Unified Cluster Assignment and Instruction Scheduling (presentation)
    Conf. on Languages, Compilers, and Tools for Embedded Systems, Seattle, USA, June 2013
  • CASTED: Core-Adaptive Software Transient Error Detection for Tightly Coupled Cores (presentation)
    Intl. Parallel and Distributed Processing Symp., Boston, USA, May 2013
  • UCIFF: Unified Cluster Assignment, Instruction Scheduling, and Fast Frequency Selection for Heterogeneous Clustered VLIW Cores (presentation)
    Intl. Wksp. on Languages and Compilers for Parallel Computing, Tokyo, Japan, September 2012
  • ASCIB: Adaptive Selection of Cache Indexing Bits for Removing Conflict Misses (presentation)
    Intl. Symp. on Low Power Electronics and Design, Redondo Beach, USA, August 2012
  • Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
    Department of Computer Science, University of Cambridge, Cambridge, UK, May 2011
  • Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
    Department of Informatics Engineering, University of Porto, Porto, Portugal, November 2010
  • Fair Prefetching in Multi-Cores with Resizable Prefetch Heaps
    Intel European Research and Innovation Conference
    Intel Germany Research Laboratory, Braunschweig, Germany, September 2010
  • Compiler-Directed Performance Model Construction for Parallel Programs (presentation)
    Intl. Conf. on Architecture of Computing Systems, Hanover, Germany, February 2010
  • Distance-Aware Round-Robin Mapping for Large NUCA Caches (presentation)
    Intl. Conf. on High Performance Computing, Kochi, India, December 2009
  • Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
    Department of Informatics, University of Munich, Munich, Germany, November 2009
  • Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching (presentation)
    Intl. Symp. on Computer Architecture, Austin, USA, June 2009
  • Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
    Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA, May 2009
  • An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
    Department of Electrical and Computer Engineering, University of Rochester, Rochester, USA, March 2009
  • An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
    Intel Germany Research Laboratory, Braunschweig, Germany, July 2008
  • An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
    Department of Computer Engineering, Universidad de Murcia, Murcia, Spain, May 2008
  • An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs (presentation)
    Intl. Symp. on High-Performance Computer Architecture, Salt Lake City, USA, February 2008
  • Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs (presentation)
    IBM T. J. Watson Research Center, Yorktown Heights, USA, July 2007
  • Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
    Department of Informatics, University of Erlangen-Nuremberg, Erlangen, Germany, June 2007
  • Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
    Oak Ridge National Laboratory, Oak Ridge, USA, March 2007
  • Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
    ARM Ltd., Cambridge, UK, February 2007

Related Projects (in alphabetical order)

  • Blue Gene at IBM Research (USA)
  • IACOMA at University of Illinois (USA)
  • RATS at University of Southern California (USA)
  • RAW at MIT (USA)
  • SCALE at MIT (USA)
  • Smart Memories at Stanford University (USA)
  • TRIPS at University of Texas (USA)
  • Wavescalar at University of Washington (USA)

Last modified: September 1 2015