ACA Unit 8 Hardware and Software for VLIW and EPIC Notes — Unit 8

Appendix G: Hardware and Software for VLIW and EPIC. In this chapter we discuss compiler technology for increasing the amount of parallelism.

We have presented the co-design of four compiler optimizations and architecture features that reduce code size across three generations of the C6X processor.

This company, like Multiflow, failed after a few years.

For most superscalar designs, the instruction width is 32 bits or fewer. On the C6X processor, the load (LD) and branch (B) instructions have five- and six-cycle latencies, respectively. Therefore, it is critical that a VLIW processor be a good compiler target.

By the time the entire loop body has been inserted into the loop buffer, the loop kernel is present and can execute entirely from there. The compiler provides options to select the processor generation and to disable optimization passes that target specific processor features.

Clearly, the MLB reduces code size and improves power efficiency by eliminating the overlapped copies of the instructions in the loop body.

The shortest path through the code now computes only one full iteration of the loop. Only the kernel code is explicitly represented.

Notes for Advanced Computer Architecture – ACA by Tarini Mishra

The C6X processors are supported by an optimizing compiler [4]. Minimizing physical memory requirements reduces total system cost and improves performance and power efficiency.


The || operator denotes instructions that execute in parallel. Example of a software-pipelined loop with all epilog stages collapsed.

Because cache activity occurs for data items that are never used, data cache performance and power efficiency are negatively impacted. For the following results, the baseline configuration is the C6X-1 generation processor compiled with software-pipelined loop collapsing disabled and the speed-or-size option set to speed. This eliminates the NOP that often occurs after a load instruction in control-oriented code.

In contrast, the VLIW method depends on the programs providing all the decisions regarding which instructions to execute simultaneously and how to resolve conflicts. This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original reduced instruction set computing (RISC) designs has been eroded.

Observe that, before loop execution begins, the new predicate register is initialized to one less than the trip counter, so that ins2 is not executed during the last iteration of the kernel.
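The predicate initialization described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `run_collapsed_kernel`, the instruction name `ins1`, and the convention that the predicate guards its own decrement are hypothetical and are not taken from the C6X documentation.

```python
def run_collapsed_kernel(kernel_trips):
    """Simulate a kernel in which ins2 is guarded by a counting predicate.

    Assumption: a predicated instruction executes only while the
    predicate register is nonzero, and the predicate is decremented
    by an instruction that is itself guarded by that predicate.
    """
    pred = kernel_trips - 1          # one less than the trip counter
    trace = []
    for _ in range(kernel_trips):
        executed = ["ins1"]          # hypothetical unguarded instruction
        if pred > 0:
            executed.append("ins2")  # guarded instruction
            pred -= 1                # guarded decrement of the predicate
        trace.append(executed)
    return trace


if __name__ == "__main__":
    for step, executed in enumerate(run_collapsed_kernel(4)):
        print(step, executed)
```

With four kernel iterations, `ins2` appears in the first three steps of the trace and is suppressed in the fourth, which matches the behavior the text describes for the last iteration of the kernel.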

Unlike software-pipelined loop collapsing, the MLB reduces code size without requiring instruction speculation.

Very long instruction word

All of this fits in one instruction. Typical instruction encoding format. The size is the number of execute packets in the kernel; therefore, this limits the maximum II of a software-pipelined loop in the MLB.


This occurs when there are dependencies in the code and the instruction pipelines must be allowed to drain before later operations can proceed. The Cydra 5 architecture was a VLIW system that was designed for optimizing the execution of inner loops using software pipelining. It consists of stages of II cycles each. All of the instructions in an execute packet execute in parallel. The benchmarks were compiled with the TI C6X compiler version 6.
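The view of a software-pipelined loop as a sequence of stages, each II cycles long, can be made concrete with a small schedule generator. This is a minimal sketch, not the compiler's algorithm: the function names and the (iteration, stage) representation are assumptions chosen for illustration.

```python
def pipeline_schedule(num_stages, trip_count):
    """List, for each II-cycle step, the (iteration, stage) pairs in flight.

    Iteration i enters stage s at step i + s, so a new iteration
    starts every II cycles and overlaps with the ones before it.
    """
    steps = []
    for t in range(trip_count + num_stages - 1):
        active = [(t - s, s) for s in range(num_stages)
                  if 0 <= t - s < trip_count]
        steps.append(active)
    return steps


def total_cycles(num_stages, ii, trip_count):
    """Cycles for the whole loop when every step lasts II cycles."""
    return (trip_count + num_stages - 1) * ii


if __name__ == "__main__":
    for t, active in enumerate(pipeline_schedule(3, 5)):
        # A step with every stage occupied is a kernel step.
        kind = "kernel" if len(active) == 3 else "prolog/epilog"
        print(t, kind, active)
```

For three stages and five iterations, the first two steps fill the pipeline, the next three have every stage occupied (the kernel), and the last two drain it.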


The compressor then selects the overlay that packs the most instructions in the new fetch packet.

The Trace compiler did not use software pipelining, but instead used extensive loop unrolling. A zero overhead loop buffer has an additional function to eliminate the need for an explicit branch instruction in the program source code. Compact instruction header format. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the prior VLIW instruction within the program instruction stream.

The 16-bit instructions implement frequently occurring instructions such as addition, subtraction, multiplication, shift, load, and store. For each new fetch packet, the compressor selects a window of instructions and determines for each overlay which instructions may be converted to 16-bit.
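The overlay selection performed by the compressor can be sketched as a small search over candidate overlays. Everything below is an assumption made for illustration: the cost model (a converted instruction takes one unit of space, an unconverted one takes two), the in-order greedy packing, and the names `pack_count` and `select_overlay` are hypothetical and do not describe the actual TI tool.

```python
def pack_count(window, overlay, capacity_units):
    """Count how many leading instructions of the window fit in one packet.

    `window` is a list of sets; each set names the overlays under
    which that instruction can be converted to the compact form.
    """
    used = 0
    count = 0
    for compactable_under in window:
        cost = 1 if overlay in compactable_under else 2
        if used + cost > capacity_units:
            break                    # packet is full; stop packing
        used += cost
        count += 1
    return count


def select_overlay(window, overlays, capacity_units):
    """Pick the overlay that packs the most instructions (first wins ties)."""
    return max(overlays,
               key=lambda ov: pack_count(window, ov, capacity_units))


if __name__ == "__main__":
    window = [{"A"}, {"A"}, {"B"}, {"A"}, {"A"}]
    print(select_overlay(window, ["A", "B"], capacity_units=6))
```

In this example overlay `A` lets five instructions fit in six units, while overlay `B` fits only three, so `A` is selected.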

Co-design of Compiler and Hardware Techniques to Reduce Program Code Size on a VLIW Processor

The predicate field used to signify a fetch packet header occupies four bits. He also developed region scheduling methods to identify parallelism beyond basic blocks. Code-size reduction and performance improvement on all benchmarks. If the branch takes an unexpected path, the compiler has already generated compensating code to discard speculative results to preserve program semantics. As fetch packets are read from program memory, the instruction dispatch logic extracts execute packets from the fetch packets.
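The extraction of execute packets from a fetch packet can be sketched as follows. This is a minimal illustration that assumes each instruction carries a one-bit flag saying whether the next instruction belongs to the same execute packet; the flag name `parallel_next` and the tuple representation are hypothetical.

```python
def split_execute_packets(fetch_packet):
    """Group a fetch packet into execute packets.

    `fetch_packet` is a list of (mnemonic, parallel_next) pairs.
    A true flag chains the following instruction into the same
    execute packet; a false flag closes the current packet.
    """
    packets = []
    current = []
    for mnemonic, parallel_next in fetch_packet:
        current.append(mnemonic)
        if not parallel_next:
            packets.append(current)
            current = []
    if current:
        # Packet still open when the fetch packet ends.
        packets.append(current)
    return packets


if __name__ == "__main__":
    fetch_packet = [("LD", True), ("ADD", True), ("MPY", False),
                    ("B", False), ("SUB", True), ("ST", False)]
    print(split_execute_packets(fetch_packet))
```

Each inner list in the result is one execute packet, whose instructions would issue together in the same cycle.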

