This project investigates single-chip multiprocessor architectures based on the replication of very simple processor-memory "cells". The investigation follows two main lines of research. The first is the development of novel compilation techniques to extract larger degrees of thread-level parallelism (TLP) from applications that are traditionally hard to parallelise. The second is the development of improved architectures through the systematic study of architectural design alternatives under various performance, complexity, and power dissipation constraints.
Project Contributions
In [7] we proposed a hybrid software-hardware mechanism for cache coherence in tiled chip multiprocessors (CMPs) with scalable on-chip interconnects. The mechanism relies on hardware to perform remote cache accesses and moves the responsibility for data mapping and coherence to the OS.
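The division of labour described for [7] can be illustrated with a toy model. This is a minimal sketch, not the mechanism from the paper: it assumes that the OS gives every page exactly one home tile (here with a hypothetical first-touch policy) and that an access from any other tile is forwarded to that home cache. The class name `TiledCMP`, the page size, and the counters are all illustrative assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class TiledCMP:
    """Toy model: every page has a single home tile chosen by the OS."""

    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.page_home = {}       # page number -> home tile (the OS mapping)
        self.local_accesses = 0
        self.remote_accesses = 0

    def map_page(self, page, requesting_tile):
        # Hypothetical first-touch policy: the first tile to touch a page
        # becomes its home. Later calls return the existing mapping.
        return self.page_home.setdefault(page, requesting_tile)

    def access(self, tile, address):
        page = address // PAGE_SIZE
        home = self.map_page(page, tile)
        if home == tile:
            self.local_accesses += 1
        else:
            # The data is not copied into the requester's cache in this
            # model; the access is served remotely by the home tile.
            self.remote_accesses += 1
        return home


chip = TiledCMP(num_tiles=4)
chip.access(0, 0)        # tile 0 touches page 0 first and becomes its home
chip.access(1, 100)      # tile 1 reads page 0 remotely from tile 0
chip.access(1, 4096)     # tile 1 touches page 1 first and becomes its home
```

In this simplified model a line can live in only one cache, so no hardware coherence protocol is needed for it; the cost is the extra latency of remote accesses.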
In [6] we presented a tool that can be used to identify the breakdown of cache miss types and, thus, assist programmers in improving the data locality behavior of their programs.
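To give a flavour of what a breakdown of cache miss types can look like, the sketch below classifies misses in a block-address trace. It is not the tool from [6]: the source does not say which miss categories that tool reports, so the sketch assumes the textbook split into compulsory, capacity, and conflict misses, and it models a direct-mapped cache alongside a fully associative LRU cache of the same size. The function name `classify_misses` is hypothetical.

```python
from collections import Counter, OrderedDict


def classify_misses(block_trace, num_lines):
    """Classify accesses of a direct-mapped cache with num_lines lines."""
    seen = set()             # every block referenced so far
    direct = {}              # set index -> block held by the direct-mapped cache
    full = OrderedDict()     # fully associative LRU cache of the same capacity
    counts = Counter()

    for block in block_trace:
        index = block % num_lines
        hit_direct = direct.get(index) == block
        hit_full = block in full

        if hit_direct:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1   # first reference to this block
        elif not hit_full:
            counts["capacity"] += 1     # would miss even with full associativity
        else:
            counts["conflict"] += 1     # caused by the limited associativity

        # Update both cache models and the reference history.
        seen.add(block)
        direct[index] = block
        if hit_full:
            full.move_to_end(block)
        else:
            full[block] = True
            if len(full) > num_lines:
                full.popitem(last=False)   # evict the least recently used block
    return counts
```

A trace that alternates between two blocks mapping to the same set, such as `[0, 4, 0, 4]` with four lines, yields conflict misses after the initial compulsory ones, which is the kind of pattern a programmer can remove by changing the data layout.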
In [5] we proposed a novel class of hardware prefetchers that allow the global miss address stream to be first localized according to different correlation criteria and later chained following their original temporal behavior. This mechanism allows for the simultaneous exploitation of different types of correlation while maintaining timeliness.
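The idea of localizing the miss stream and then chaining the localized streams can be sketched in a few lines. This is an illustration under stated assumptions, not the prefetcher from [5]: it assumes that misses are localized by the program counter (PC) of the missing instruction, that each localized stream is predicted with a simple stride, and that streams are chained by remembering which PC missed next in time. The class `ChainedPrefetcher` and its methods are hypothetical names.

```python
class ChainedPrefetcher:
    """Toy model: per-PC stride streams linked in their observed temporal order."""

    def __init__(self):
        self.last_addr = {}   # pc -> last miss address of that stream
        self.stride = {}      # pc -> last observed stride of that stream
        self.next_pc = {}     # pc -> pc of the stream that missed next in time
        self.prev_pc = None

    def observe(self, pc, addr):
        # Localize: update the stream that belongs to this PC.
        if pc in self.last_addr:
            self.stride[pc] = addr - self.last_addr[pc]
        self.last_addr[pc] = addr
        # Chain: link the previously missing stream to this one.
        if self.prev_pc is not None and self.prev_pc != pc:
            self.next_pc[self.prev_pc] = pc
        self.prev_pc = pc

    def predict(self, pc, depth):
        """Follow the chain from pc and return up to depth predicted addresses."""
        predictions = []
        ahead = {}   # pc -> number of strides already predicted for that stream
        current = self.next_pc.get(pc)
        while current is not None and len(predictions) < depth:
            if current not in self.stride:
                break   # stream has no stride yet, so stop following the chain
            ahead[current] = ahead.get(current, 0) + 1
            predictions.append(
                self.last_addr[current] + ahead[current] * self.stride[current]
            )
            current = self.next_pc.get(current)
        return predictions


prefetcher = ChainedPrefetcher()
for pc, addr in [("A", 100), ("B", 1000), ("A", 108), ("B", 1064), ("A", 116)]:
    prefetcher.observe(pc, addr)
print(prefetcher.predict("A", 3))   # prints [1128, 124, 1192]
```

Because the predictions are emitted in the order in which the streams are expected to miss, the prefetches for the two interleaved streams come out in the same temporal order as the original misses, which is the property the paragraph above refers to as timeliness.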
In [4] we proposed a new OS-managed policy for mapping memory blocks to caches in a tiled non-uniform cache architecture (NUCA). This policy addresses the trade-off between cache access latency and the number of off-chip accesses, and it also introduces an upper bound on the deviation of the distribution of memory pages among cache tiles.
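The trade-off between placing a page close to the requester and keeping pages evenly spread can be made concrete with a small sketch. It is only an illustration and should not be read as the policy from [4]: it assumes a 2D mesh with Manhattan distances, and it interprets the upper bound as a limit on how many more pages any tile may hold than the least loaded tile. The function `choose_tile` and its parameters are hypothetical.

```python
def choose_tile(requesting_tile, pages_per_tile, mesh_width, max_deviation):
    """Pick the nearest tile whose load stays within max_deviation of the minimum.

    pages_per_tile is updated in place. max_deviation must be at least 1.
    """
    def distance(a, b):
        # Manhattan distance between two tiles on a mesh of width mesh_width.
        ax, ay = a % mesh_width, a // mesh_width
        bx, by = b % mesh_width, b // mesh_width
        return abs(ax - bx) + abs(ay - by)

    least_loaded = min(pages_per_tile)
    # A tile is eligible only if taking one more page keeps it within the bound.
    candidates = [
        tile for tile, count in enumerate(pages_per_tile)
        if count - least_loaded < max_deviation
    ]
    # Prefer the closest eligible tile; break ties by the lower tile number.
    best = min(candidates, key=lambda tile: (distance(requesting_tile, tile), tile))
    pages_per_tile[best] += 1
    return best


loads = [0, 0, 0, 0]   # 2x2 mesh, all requests issued by tile 0
placements = [choose_tile(0, loads, mesh_width=2, max_deviation=2) for _ in range(7)]
print(placements)      # prints [0, 0, 1, 1, 2, 2, 3]
```

With a large `max_deviation` every page lands on the requesting tile, which minimises latency but concentrates pressure on one cache; with a small value the pages spread out, which uses the aggregate capacity better at the cost of longer average distance.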
In [3] we presented a compiler scheme to automatically insert calls to run-time system functions to instrument code in order to generate performance models of parallel applications. The approach closes the gap between compiler-supported automatic model construction and the manual analytical modeling of workloads.
In [2] we proposed a novel hybrid software-hardware coherence mechanism, where software is responsible for triggering the coherence actions (self-invalidations and writebacks) at appropriate times, while hardware uses Bloom filters to perform more selective self-invalidations.
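How a Bloom filter can make self-invalidation more selective is sketched below. This is a simplified illustration, not the hardware design from [2]: it assumes the filter summarises the line addresses that may have been written elsewhere since the last synchronization point, and that a self-invalidation drops only the cached lines that hit in the filter. The filter size, the use of SHA-256 as a hash, and the names `BloomFilter` and `self_invalidate` are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # bit vector stored in a Python integer

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


def self_invalidate(cached_lines, written_filter):
    """Return the cached lines that survive a selective self-invalidation."""
    return {line for line in cached_lines if not written_filter.might_contain(line)}


written = BloomFilter()
written.add(0x40)   # lines assumed to have been written by another core
written.add(0x80)
survivors = self_invalidate({0x40, 0x80, 0xC0, 0x100}, written)
```

A line that was really written is always invalidated, because a Bloom filter has no false negatives. A false positive only causes an unnecessary invalidation, which costs performance but not correctness, so the filter can be kept small.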
Publications (sorted by date)
[1] An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs
Christian Fensch and Marcelo Cintra
Intl. Journal of Parallel Programming (IJPP), Special Issue on High Performance and Embedded Architecture and Compilation, vol. 39, no. 3, pp. 271-295, June 2011
Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
Department of Computer Science, University of Cambridge, Cambridge, UK, May 2011
Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
Department of Informatics Engineering, University of Porto, Porto, Portugal, November 2010
Fair Prefetching in Multi-Cores with Resizable Prefetch Heaps
Intel European Research and Innovation Conference (ERIC)
Intel Germany Research Laboratory, Braunschweig, Germany, September 2010
Compiler-Directed Performance Model Construction for Parallel Programs (presentation)
Intl. Conf. on Architecture of Computing Systems (ARCS), Hanover, Germany, February 2010
Distance-Aware Round-Robin Mapping for Large NUCA Caches (presentation)
Intl. Conf. on High Performance Computing (HiPC), Kochi, India, December 2009
Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
Department of Informatics, University of Munich, Munich, Germany, November 2009
Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching (presentation)
Intl. Symp. on Computer Architecture, Austin, USA, June 2009
Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA, May 2009
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
Department of Electrical and Computer Engineering, University of Rochester, Rochester, USA, March 2009
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
Intel Germany Research Laboratory, Braunschweig, Germany, July 2008
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs
Department of Computer Engineering, Universidad de Murcia, Murcia, Spain, May 2008
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs (presentation)
Intl. Symp. on High-Performance Computer Architecture, Salt Lake City, USA, February 2008
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs (presentation)
IBM T. J. Watson Research Center, Yorktown Heights, USA, July 2007
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
Department of Informatics, University of Erlangen-Nuremberg, Erlangen, Germany, June 2007
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
Oak Ridge National Laboratory, Oak Ridge, USA, March 2007
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs
ARM Ltd., Cambridge, UK, February 2007