Apple-CORE: harnessing general-purpose many-cores with hardware concurrency management.
Poss, R.; Lankamp, M.; Yang, Q.; Fu, J.; van Tol, M. W.; Uddin, I.; and Jesshope, C.
Microprocessors and Microsystems, 37(8): 1090–1101. November 2013.
@article{poss13micpro,
Abstract = {To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. Its implementation in hardware provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional ``accelerator'' approach, Microgrids are components in distributed systems on chip that consider both clusters of small cores and optional, larger sequential cores as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate irregular long latencies on chip, a scale-invariant programming model, a distributed chip resource model, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This article describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. This article also presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilisation of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.},
Author = {Raphael Poss and Mike Lankamp and Qiang Yang and Jian Fu and Michiel W. {van Tol} and Irfan Uddin and Chris Jesshope},
Doi = {10.1016/j.micpro.2013.05.004}, Urldoi = {http://dx.doi.org/10.1016/j.micpro.2013.05.004},
Issn = {0141-9331},
Journal = {Microprocessors and Microsystems},
Month = {November},
Number = {8},
Pages = {1090--1101},
Read = {1},
Title = {{Apple-CORE}: harnessing general-purpose many-cores with hardware concurrency management},
Urllocal = {pub/poss.13.micpro.pdf},
Volume = {37},
Year = {2013},
}
To harness the potential of CMPs for scalable, energy-efficient performance in general-purpose computers, the Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. Its implementation in hardware provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional “accelerator” approach, Microgrids are components in distributed systems on chip that consider both clusters of small cores and optional, larger sequential cores as system services shared between applications. The key aspects of the design are asynchrony, i.e. the ability to tolerate irregular long latencies on chip, a scale-invariant programming model, a distributed chip resource model, and the transparent performance scaling of a single program binary code across multiple cluster sizes. This article describes the execution model, the core micro-architecture, its realization in a many-core, general-purpose processor chip and its software environment. This article also presents cycle-accurate simulation results for various key algorithmic and cryptographic kernels. The results show good efficiency in terms of the utilisation of hardware despite the high-latency memory accesses and good scalability across relatively large clusters of cores.
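For readers unfamiliar with the create/sync style of the SVP interface mentioned above, the bulk creation and synchronisation of a family of index-parameterised threads can be loosely pictured with ordinary POSIX threads. The sketch below is only an analogy in plain C, not the SVP/SL primitives themselves; FAMILY_SIZE, family_member and the partial array are invented for illustration.

#include <pthread.h>
#include <stdio.h>

#define FAMILY_SIZE 8            /* hypothetical number of worker threads */

static double partial[FAMILY_SIZE];

/* Each "family member" works on its own index, as an SVP thread would. */
static void *family_member(void *arg)
{
    long i = (long)arg;
    partial[i] = (double)i * i;  /* stand-in for real per-index work */
    return NULL;
}

int main(void)
{
    pthread_t family[FAMILY_SIZE];

    /* "create": start the whole family of index-parameterised threads. */
    for (long i = 0; i < FAMILY_SIZE; i++)
        pthread_create(&family[i], NULL, family_member, (void *)i);

    /* "sync": the parent blocks until every family member has finished. */
    double sum = 0.0;
    for (long i = 0; i < FAMILY_SIZE; i++) {
        pthread_join(family[i], NULL);
        sum += partial[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}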
Machines are benchmarked by code, not algorithms.
Poss, R.
Computing Research Repository. September 2013.
@article{poss13bench,
Abstract = {This article highlights how small modifications to either the source code of a benchmark program or the compilation options may impact its behavior on a specific machine. It argues that for evaluating machines, benchmark providers and users be careful to ensure reproducibility of results based on the machine code actually running on the hardware and not just source code. The article uses color to grayscale conversion of digital images as a running example.},
Author = {{Raphael~`kena'} Poss},
Journal = {Computing Research Repository},
Month = {September},
Read = {1},
Title = {Machines are benchmarked by code, not algorithms},
Url = {http://arxiv.org/abs/1309.0534},
Urllocal = {pub/poss.13.bench.pdf},
Year = {2013},
}
This article highlights how small modifications to either the source code of a benchmark program or the compilation options may impact its behavior on a specific machine. It argues that for evaluating machines, benchmark providers and users be careful to ensure reproducibility of results based on the machine code actually running on the hardware and not just source code. The article uses color to grayscale conversion of digital images as a running example.
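The running example, colour-to-greyscale conversion, makes the point easy to see: two source-level variants of the same algorithm, or the same source compiled with different optimisation flags (for example -O2 versus -O3 -ffast-math), produce different machine code with different rounding behaviour and performance. The fragment below is a generic illustration in C, not code taken from the paper; gray_float and gray_fixed are invented names and the weights are the standard ITU-R BT.601 luma coefficients.

#include <stdint.h>
#include <stddef.h>

/* Floating-point variant: the compiler may or may not vectorise this,
 * depending on the optimisation flags in effect. */
void gray_float(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = (uint8_t)(0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i]);
}

/* Fixed-point variant: the same algorithm on paper, but different machine
 * code, different rounding and usually different performance. */
void gray_fixed(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = (uint8_t)((77 * r[i] + 150 * g[i] + 29 * b[i]) >> 8);
}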
Optimizing for confidence—Costs and opportunities at the frontier between abstraction and reality.
Poss, R.
Computing Research Repository. August 2013.
@article{poss13iocosts,
Abstract = {Is there a relationship between computing costs and the confidence people place in the behavior of computing systems? What are the tuning knobs one can use to optimize systems for human confidence instead of correctness in purely abstract models? This report explores these questions by reviewing the mechanisms by which people build confidence in the match between the physical world behavior of machines and their abstract intuition of this behavior according to models or programming language semantics. We highlight in particular that a bottom-up approach relies on arbitrary trust in the accuracy of I/O devices, and that there exist clear cost trade-offs in the use of I/O devices in computing systems. We also show various methods which alleviate the need to trust I/O devices arbitrarily and instead build confidence incrementally "from the outside" by considering systems as black box entities. We highlight cases where these approaches can reach a given confidence level at a lower cost than bottom-up approaches.},
Author = {{Raphael~`kena'} Poss},
Journal = {Computing Research Repository},
Month = {August},
Read = {1},
Title = {Optimizing for confidence---Costs and opportunities at the frontier between abstraction and reality},
Url = {http://arxiv.org/abs/1308.1602},
Urllocal = {pub/poss.13.iocosts.pdf},
Year = {2013},
}
Is there a relationship between computing costs and the confidence people place in the behavior of computing systems? What are the tuning knobs one can use to optimize systems for human confidence instead of correctness in purely abstract models? This report explores these questions by reviewing the mechanisms by which people build confidence in the match between the physical world behavior of machines and their abstract intuition of this behavior according to models or programming language semantics. We highlight in particular that a bottom-up approach relies on arbitrary trust in the accuracy of I/O devices, and that there exist clear cost trade-offs in the use of I/O devices in computing systems. We also show various methods which alleviate the need to trust I/O devices arbitrarily and instead build confidence incrementally “from the outside” by considering systems as black box entities. We highlight cases where these approaches can reach a given confidence level at a lower cost than bottom-up approaches.
On-demand Thread-level Fault Detection in a Concurrent Programming Environment.
Fu, J.; Yang, Q.; Poss, R.; Jesshope, C.; and Zhang, C.
In Proc. Intl. Conf. on Embedded Computer Systems: Architectures, MOdeling and Simulation (SAMOS XIII), pages 255–262, July 2013. IEEE.
@inproceedings{fu13samos,
Abstract = {The vulnerability of multi-core processors is increasing due to tighter design margins and greater susceptibility to interference. Moreover, concurrent programming environments are the norm in the exploitation of multi-core systems. In this paper, we present an on-demand thread-level fault detection mechanism for multi-cores. The main contribution is on-demand redundancy, which allows users to set the redundancy scope in the concurrent code. To achieve this we introduce intelligent redundant thread creation and synchronization, which manages concurrency and synchronization between the redundant threads via the master. This framework was implemented in an emulation of a multi-threaded, many-core processor with single, in-order issue cores. It was evaluated by a range of programs in image and signal processing, and encryption. The performance overhead of redundancy is less than 11% for single core execution and is always less than 100% for all scenarios. This efficiency derives from the platform's hardware concurrency management and latency tolerance.},
Author = {Jian Fu and Qiang Yang and Raphael Poss and Chris Jesshope and Chunyuan Zhang},
Booktitle = {Proc. Intl. Conf. on Embedded Computer Systems: Architectures, MOdeling and Simulation (SAMOS XIII)},
Doi = {10.1109/SAMOS.2013.6621132}, Urldoi = {http://dx.doi.org/10.1109/SAMOS.2013.6621132},
Month = {July},
Pages = {255--262},
Publisher = {IEEE},
Read = {1},
Title = {On-demand Thread-level Fault Detection in a Concurrent Programming Environment},
Urllocal = {pub/fu.13.samos.pdf},
Year = {2013},
}
The vulnerability of multi-core processors is increasing due to tighter design margins and greater susceptibility to interference. Moreover, concurrent programming environments are the norm in the exploitation of multi-core systems. In this paper, we present an on-demand thread-level fault detection mechanism for multi-cores. The main contribution is on-demand redundancy, which allows users to set the redundancy scope in the concurrent code. To achieve this we introduce intelligent redundant thread creation and synchronization, which manages concurrency and synchronization between the redundant threads via the master. This framework was implemented in an emulation of a multi-threaded, many-core processor with single, in-order issue cores. It was evaluated by a range of programs in image and signal processing, and encryption. The performance overhead of redundancy is less than 11% for single core execution and is always less than 100% for all scenarios. This efficiency derives from the platform’s hardware concurrency management and latency tolerance.
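The notion of an on-demand redundancy scope can be pictured in software: within a region selected by the user, the computation is executed twice, once by the master and once by a redundant thread, and the two results are compared before execution continues. The sketch below uses plain C and POSIX threads as an analogy only; the paper's mechanism creates and synchronises the redundant threads through the platform's hardware concurrency management, and checked_compute, struct job and the fault return code are invented names.

#include <pthread.h>
#include <stdio.h>

/* The computation whose result we want to check. */
static void compute(int *out, const int *in, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += in[i] * in[i];
    *out = acc;
}

struct job { const int *in; int n; int out; };

static void *redundant_worker(void *arg)
{
    struct job *j = arg;
    compute(&j->out, j->in, j->n);   /* redundant copy of the same region */
    return NULL;
}

/* Hypothetical "redundancy scope": run the region twice in parallel and
 * report a detected fault if the two results disagree. */
int checked_compute(int *out, const int *in, int n)
{
    pthread_t t;
    struct job shadow = { in, n, 0 };
    pthread_create(&t, NULL, redundant_worker, &shadow);

    int master = 0;
    compute(&master, in, n);         /* the master executes the region itself */
    pthread_join(t, NULL);

    if (master != shadow.out)
        return -1;                   /* mismatch: fault detected */
    *out = master;
    return 0;
}

int main(void)
{
    int data[4] = { 1, 2, 3, 4 };
    int result;
    if (checked_compute(&result, data, 4) == 0)
        printf("result = %d\n", result);
    else
        printf("fault detected\n");
    return 0;
}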
Characterizing traits of coordination.
Poss, R.
Computing Research Repository. July 2013.
@article{poss13ctc,
Abstract = {How can one recognize coordination languages and technologies? As this report shows, the common approach that contrasts coordination with computation is intellectually unsound: depending on the selected understanding of the word "computation", it either captures too many or too few programming languages. Instead, we argue for objective criteria that can be used to evaluate how well programming technologies offer coordination services. Of the various criteria commonly used in this community, we are able to isolate three that are strongly characterizing: black-box componentization, which we had identified previously, but also interface extensibility and customizability of run-time optimization goals. These criteria are well matched by Intel's Concurrent Collections and AstraKahn, and also by OpenCL, POSIX and VMWare ESX. },
Author = {{Raphael~`kena'} Poss},
Journal = {Computing Research Repository},
Month = {July},
Read = {1},
Title = {Characterizing traits of coordination},
Url = {http://arxiv.org/abs/1307.4827},
Urllocal = {pub/poss.13.ctc.pdf},
Year = {2013},
}
How can one recognize coordination languages and technologies? As this report shows, the common approach that contrasts coordination with computation is intellectually unsound: depending on the selected understanding of the word “computation”, it either captures too many or too few programming languages. Instead, we argue for objective criteria that can be used to evaluate how well programming technologies offer coordination services. Of the various criteria commonly used in this community, we are able to isolate three that are strongly characterizing: black-box componentization, which we had identified previously, but also interface extensibility and customizability of run-time optimization goals. These criteria are well matched by Intel’s Concurrent Collections and AstraKahn, and also by OpenCL, POSIX and VMWare ESX.
Extrinsically adaptable systems.
Poss, R.
Computing Research Repository. June 2013.
@article{poss13exadapt,
Abstract = {Are there qualitative and quantitative traits of system design that
contribute to the ability of people to further innovate? We propose that
extrinsic adaptability, the ability given to secondary parties to change a
system to match new requirements not envisioned by the primary provider, is
such a trait. "Extrinsic adaptation" encompasses the popular concepts of
"workaround", "fast prototype extension" or "hack", and extrinsic adaptability
is thus a measure of how friendly a system is to tinkering by curious minds. In
this report, we give "hackability" or "hacker-friendliness" scientific
credentials by formulating and studying a generalization of the concept. During
this exercise, we find that system changes by secondary parties fall on a
subjective gradient of acceptability, with extrinsic adaptations on one side
which confidently preserve existing system features, and invasive modifications
on the other side which are perceived to be disruptive to existing system
features. Where a change is positioned on this gradient is dependent on how an
external observer perceives component boundaries within the changed system. We
also find that the existence of objective cost functions can alleviate but not
fully eliminate this subjectiveness. The study also enables us to formulate an
ethical imperative for system designers to promote extrinsic adaptability.},
Author = {{Raphael~`kena'} Poss},
Journal = {Computing Research Repository},
Month = {June},
Read = {1},
Title = {Extrinsically adaptable systems},
Url = {http://arxiv.org/abs/1306.5445},
Urllocal = {pub/poss.13.exadapt.pdf},
Year = {2013},
}
Are there qualitative and quantitative traits of system design that contribute to the ability of people to further innovate? We propose that extrinsic adaptability, the ability given to secondary parties to change a system to match new requirements not envisioned by the primary provider, is such a trait. “Extrinsic adaptation” encompasses the popular concepts of “workaround”, “fast prototype extension” or “hack”, and extrinsic adaptability is thus a measure of how friendly a system is to tinkering by curious minds. In this report, we give “hackability” or “hacker-friendliness” scientific credentials by formulating and studying a generalization of the concept. During this exercise, we find that system changes by secondary parties fall on a subjective gradient of acceptability, with extrinsic adaptations on one side which confidently preserve existing system features, and invasive modifications on the other side which are perceived to be disruptive to existing system features. Where a change is positioned on this gradient is dependent on how an external observer perceives component boundaries within the changed system. We also find that the existence of objective cost functions can alleviate but not fully eliminate this subjectiveness. The study also enables us to formulate an ethical imperative for system designers to promote extrinsic adaptability.
The essence of component-based design and coordination.
Poss, R.
Computing Research Repository. June 2013.
@article{poss13coord,
Abstract = {Is there a characteristic of coordination languages that makes them qualitatively different from general programming languages and deserves special academic attention? This report proposes a nuanced answer in three parts. The first part highlights that coordination languages are the means by which composite software applications can be specified using components that are only available separately, or later in time, via standard interfacing mechanisms. The second part highlights that most currently used languages provide mechanisms to use externally provided components, and thus exhibit some elements of coordination. However not all do, and the availability of an external interface thus forms an objective and qualitative criterion that distinguishes coordination. The third part argues that despite the qualitative difference, the segregation of academic attention away from general language design and implementation has non-obvious cost trade-offs. },
Author = {{Raphael~`kena'} Poss},
Institution = {University of Amsterdam},
Journal = {Computing Research Repository},
Month = {June},
Read = {1},
Title = {The essence of component-based design and coordination},
Url = {http://arxiv.org/abs/1306.3375},
Urllocal = {pub/poss.13.coord.pdf},
Year = {2013},
}
Is there a characteristic of coordination languages that makes them qualitatively different from general programming languages and deserves special academic attention? This report proposes a nuanced answer in three parts. The first part highlights that coordination languages are the means by which composite software applications can be specified using components that are only available separately, or later in time, via standard interfacing mechanisms. The second part highlights that most currently used languages provide mechanisms to use externally provided components, and thus exhibit some elements of coordination. However not all do, and the availability of an external interface thus forms an objective and qualitative criterion that distinguishes coordination. The third part argues that despite the qualitative difference, the segregation of academic attention away from general language design and implementation has non-obvious cost trade-offs.
On whether and how D-RISC and Microgrids can be kept relevant (self-assessment report).
Poss, R.
Technical Report arXiv:1303.4892v1 [cs.AR], University of Amsterdam, March 2013.
@techreport{poss13mg,
Abstract = {This report lays flat my personal views on D-RISC and Microgrids as of March 2013. It reflects the opinions and insights that I have gained from working on this project during the period 2008-2013. This report is structured in two parts: deconstruction and reconstruction. In the deconstruction phase, I review what I believe are the fundamental motivation and goals of the D-RISC/Microgrids enterprise, and identify what I judge are shortcomings: that the project did not deliver on its expectations, that fundamental questions are left unanswered, and that its original motivation may not even be relevant in scientific research any more in this day and age. In the reconstruction phase, I start by identifying the merits of the current D-RISC/Microgrids technology and know-how taken at face value, re-motivate its existence from a different angle, and suggest new, relevant research questions that could justify continued scientific investment.},
Author = {{Raphael~`kena'} Poss},
Institution = {University of Amsterdam},
Month = {March},
Number = {arXiv:1303.4892v1 [cs.AR]},
Read = {1},
Title = {On whether and how {D-RISC} and {Microgrids} can be kept relevant (self-assessment report)},
Url = {http://arxiv.org/abs/1303.4892},
Urllocal = {pub/poss.13.mg.pdf},
Year = {2013},
}
This report lays flat my personal views on D-RISC and Microgrids as of March 2013. It reflects the opinions and insights that I have gained from working on this project during the period 2008-2013. This report is structured in two parts: deconstruction and reconstruction. In the deconstruction phase, I review what I believe are the fundamental motivation and goals of the D-RISC/Microgrids enterprise, and identify what I judge are shortcomings: that the project did not deliver on its expectations, that fundamental questions are left unanswered, and that its original motivation may not even be relevant in scientific research any more in this day and age. In the reconstruction phase, I start by identifying the merits of the current D-RISC/Microgrids technology and know-how taken at face value, re-motivate its existence from a different angle, and suggest new, relevant research questions that could justify continued scientific investment.
On-Chip Traffic Regulation to Reduce Coherence Protocol Cost on a Micro-threaded Many-Core Architecture with Distributed Caches.
Yang, Q.; Fu, J.; Poss, R.; and Jesshope, C.
ACM Trans. Embed. Comput. Syst., 13(3s): 103:1–103:21. March 2013.
@article{yang13tecs,
Abstract = {When hardware cache coherence scales to many cores on chip, the coherence protocol of the shared memory system may offset the benefit from massive hardware concurrency. In this article, we investigate the cost of a write-update policy in terms of on-chip memory network traffic and its adverse effects on the system performance based on a multi-threaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure without changing the protocol. We find that in the context of massive concurrency, by introducing a write-merging buffer with 0.46% area overhead to each core, applications with good locality and concurrency are boosted up by 18.74% in performance on average. Other applications also benefit from this addition and even achieve a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiencies compared to other solutions.},
Acmid = {2567931},
Address = {New York, NY, USA},
Author = {Qiang Yang and Jian Fu and Raphael Poss and Chris Jesshope},
Doi = {10.1145/2567931}, Urldoi = {http://dx.doi.org/10.1145/2567931},
Issn = {1539-9087},
Journal = {ACM Trans. Embed. Comput. Syst.},
Month = {March},
Number = {3s},
Pages = {103:1--103:21},
Publisher = {ACM},
Title = {On-Chip Traffic Regulation to Reduce Coherence Protocol Cost on a Micro-threaded Many-Core Architecture with Distributed Caches},
Volume = {13},
Year = {2013},
}
When hardware cache coherence scales to many cores on chip, the coherence protocol of the shared memory system may offset the benefit from massive hardware concurrency. In this article, we investigate the cost of a write-update policy in terms of on-chip memory network traffic and its adverse effects on the system performance based on a multi-threaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure without changing the protocol. We find that in the context of massive concurrency, by introducing a write-merging buffer with 0.46% area overhead to each core, applications with good locality and concurrency are boosted up by 18.74% in performance on average. Other applications also benefit from this addition and even achieve a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiencies compared to other solutions.
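A write-merging buffer of the kind evaluated in this work can be pictured as a small per-core table of pending cache-line updates: successive stores to the same line are coalesced locally and leave the core as a single message on the on-chip memory network instead of one message per store. The sketch below is a simplified software model for illustration only; LINE_SIZE, WMB_ENTRIES, the eviction policy and send_update_to_network are assumptions and do not reflect the hardware parameters reported in the article.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define LINE_SIZE   64          /* bytes per cache line (assumed) */
#define WMB_ENTRIES 4           /* number of merge entries (assumed) */

struct wmb_entry {
    bool     valid;
    uint64_t line_addr;         /* cache-line address of the pending update */
    uint8_t  data[LINE_SIZE];   /* merged bytes for that line */
    uint64_t dirty_mask;        /* one bit per dirty byte */
};

static struct wmb_entry wmb[WMB_ENTRIES];

/* Placeholder for the network message a real design would send. */
static void send_update_to_network(const struct wmb_entry *e) { (void)e; }

/* Merge a one-byte store into the buffer; only when no entry matches (and an
 * older entry must be evicted) does a message leave the core. */
void wmb_write(uint64_t addr, uint8_t byte)
{
    uint64_t line = addr / LINE_SIZE;
    unsigned off  = addr % LINE_SIZE;

    for (int i = 0; i < WMB_ENTRIES; i++) {
        if (wmb[i].valid && wmb[i].line_addr == line) {
            wmb[i].data[off] = byte;            /* merge with pending write */
            wmb[i].dirty_mask |= 1ull << off;
            return;
        }
    }
    /* No matching entry: evict slot 0 (simplistic policy) and reuse it. */
    if (wmb[0].valid)
        send_update_to_network(&wmb[0]);
    memset(&wmb[0], 0, sizeof wmb[0]);
    wmb[0].valid = true;
    wmb[0].line_addr = line;
    wmb[0].data[off] = byte;
    wmb[0].dirty_mask = 1ull << off;
}

Calling wmb_write(0x1000, 1) and then wmb_write(0x1001, 2), for instance, produces a single merged entry and no network traffic until that entry is evicted, which is the traffic reduction the article quantifies.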
Task Migration for S-Net/LPEL.
Verstraaten, M.; Kok, S.; Poss, R.; and Grelck, C.
In Grelck, C.; Hammond, K.; and Scholz, S., editors, Proc. 2nd HiPEAC Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures, January 2013.
@inproceedings{verstraaten13fdcoma,
Abstract = {We propose an extension to S-NET's light-weight parallel execution layer (LPEL): dynamic migration of tasks between cores for improved load balancing and higher throughput of S-NET streaming networks. We sketch out the necessary implementation steps and empirically analyse the impact of task migration on a variety of S-NET applications.},
Author = {Merijn Verstraaten and Stefan Kok and Raphael Poss and Clemens Grelck},
Booktitle = {Proc. 2nd HiPEAC Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures},
Editor = {Clemens Grelck and Kevin Hammond and Sven-Bodo Scholz},
Month = {January},
Read = {1},
Title = {Task Migration for {S-Net/LPEL}},
Url = {http://www.project-advance.eu/wp-content/uploads/2012/07/proceedings.pdf},
Urllocal = {pub/verstraaten.13.fdcoma.pdf},
Year = {2013},
}
We propose an extension to S-NET’s light-weight parallel execution layer (LPEL): dynamic migration of tasks between cores for improved load balancing and higher throughput of S-NET streaming networks. We sketch out the necessary implementation steps and empirically analyse the impact of task migration on a variety of S-NET applications.
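The migration decision itself can be pictured as a periodic balancing pass over per-worker task queues: when one queue grows much longer than another, a task is moved from the busiest worker to the idlest one. The fragment below is a deliberately naive, single-threaded C illustration of that policy, not the S-Net/LPEL implementation; NUM_WORKERS, MAX_TASKS and the imbalance threshold of 2 are invented.

#include <stddef.h>

#define NUM_WORKERS 4
#define MAX_TASKS   64

struct worker {
    int    tasks[MAX_TASKS];   /* task identifiers queued on this core */
    size_t len;                /* current queue length */
};

/* Move the last task of the busiest worker to the idlest worker whenever
 * the imbalance exceeds a fixed threshold. */
void balance(struct worker w[NUM_WORKERS])
{
    size_t busiest = 0, idlest = 0;
    for (size_t i = 1; i < NUM_WORKERS; i++) {
        if (w[i].len > w[busiest].len) busiest = i;
        if (w[i].len < w[idlest].len)  idlest = i;
    }
    if (w[busiest].len > w[idlest].len + 2 && w[idlest].len < MAX_TASKS) {
        int task = w[busiest].tasks[--w[busiest].len];   /* migrate one task */
        w[idlest].tasks[w[idlest].len++] = task;
    }
}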