AMD Alchemy Au1550 Security Network Processor Data Book
Au1 Core and System Bus
30283D
2.2.3 Execute Stage
In the execute stage, instructions that do not access memory are processed in hardware (shifters, adders, logical units,
comparators, etc.). Most instructions complete in a single cycle, but a few require multiple cycles (CLO, CLZ, MUL).
The virtual address calculation begins in the decode stage so that physical address calculation can complete in the execute
stage, in time to initiate the access to the data cache in the execute stage. If the address translation misses in the TLB, a TLB
exception is posted.
Multiplies and divides are forwarded to the multiply-accumulate unit. These instructions require multiple cycles and execute
independently of the main five-stage pipeline.
All exception conditions (arithmetic, TLB, interrupt, etc.) are posted by the end of the execute stage so that exceptions can
be signaled in the cache stage.
2.2.4 Cache Stage
In the cache stage, load and store accesses complete.
Loads that hit in the data cache obtain the data in the cache stage. If a load misses in the data cache or is from a
non-cacheable location, the request is sent to the SBUS to be fulfilled. Load data is forwarded to dependent instructions
in the pipeline.
Stores that hit in the data cache are written into the cache array. If a store misses in the data cache or is to a non-cacheable
location, the store is sent to the write buffer.
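The load and store handling described above can be summarized as a small decision function. The following Python sketch is illustrative only: the function name route_access and its string return values are hypothetical and do not come from the Au1 documentation, and a non-cacheable access is assumed never to hit in the data cache.

```python
def route_access(kind: str, cacheable: bool, hit: bool) -> str:
    """Return where a cache-stage access is serviced (illustrative model only)."""
    if kind not in ("load", "store"):
        raise ValueError("kind must be 'load' or 'store'")
    # Assumption: a non-cacheable access is modeled as never hitting in the data cache.
    if cacheable and hit:
        return "data cache"
    # Misses and non-cacheable accesses leave the core:
    # loads are sent to the SBUS, stores are sent to the write buffer.
    return "SBUS" if kind == "load" else "write buffer"
```

For example, route_access("store", cacheable=True, hit=False) returns "write buffer", matching the store-miss case in the text.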
If any exceptions are posted, an exception is signaled and the Au1 core is directed to fetch instructions at the appropriate
exception vector address.
2.2.5 Writeback Stage
In the writeback stage, results are posted to the general purpose register file, and forwarded to other stages as needed.
2.2.6 Multiply-Accumulate Unit
The multiply-accumulate unit (MAC) executes all multiply and divide instructions, except MUL. The MAC is composed of a
32x16 bit pipelined array multiplier that supports early out detection, a divide block, and the HI and LO registers used in
calculations.
The MAC operates in parallel with the main five-stage pipeline. Instructions in the main pipeline that do not have
dependencies on the MAC calculations execute simultaneously with instructions in the MAC unit.
A multiply calculation of 16x16 or 32x16 bits can complete in one cycle. The 32x16 bit multiply must have the sign-extended
16-bit value in register operand rt of the instruction.
32x32 bit multiplies may be started every other CPU cycle. The 32x32 multiplies complete in two cycles if the results are
written to the general purpose registers.
If the results are written to the HI/LO registers, three cycles are required for 16x16 and 32x16 bit multiplies. 32x32 bit
multiplies that use HI/LO complete in four cycles.
Divide instructions complete in a maximum of 35 cycles.
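The cycle counts quoted above can be collected into a small lookup. The Python sketch below is a reading aid, not a timing model of the hardware. It assumes that the one-cycle figure for 16x16 and 32x16 multiplies applies when the result goes to the general purpose registers (the text states this explicitly only for 32x32), and that the short case is recognized by the rt operand being a sign-extended 16-bit value. The names is_sign_extended_16, multiply_cycles, and DIVIDE_MAX_CYCLES are hypothetical.

```python
DIVIDE_MAX_CYCLES = 35  # upper bound quoted for divide instructions


def is_sign_extended_16(value: int) -> bool:
    """True if a 32-bit register value equals the sign extension of its low 16 bits."""
    value &= 0xFFFFFFFF
    low = value & 0xFFFF
    extended = (low | 0xFFFF0000) if (low & 0x8000) else low
    return value == extended


def multiply_cycles(rt: int, to_hi_lo: bool) -> int:
    """Cycle count for a multiply, using only the figures quoted in the text."""
    # Assumption: a sign-extended 16-bit rt selects the 16x16 / 32x16 case.
    short = is_sign_extended_16(rt)
    if to_hi_lo:
        return 3 if short else 4
    return 1 if short else 2
```

Note that 0x00008000 is not a sign-extended 16-bit value (its sign extension would be 0xFFFF8000), so under these assumptions it is treated as a 32x32 multiply.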
2.3 Caches
The Au1 core contains independent, on-chip 16 KB instruction and data caches. As shown in Figure 2-2, each cache
contains 128 sets and is four-way set associative with 32 bytes per way (cache line).
Figure 2-2. Cache Organization
[Figure: 128 sets, each with four ways (Way 0 through Way 3). Each way holds an address tag and state plus eight data words (Word 0 through Word 7).]
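The cache geometry given above (128 sets, four ways, 32-byte lines of eight words) implies a particular split of an address into tag, set index, and byte offset. The following Python sketch works through that arithmetic and checks that the geometry adds up to 16 KB. It is a generic set-associative decomposition: the function name split_address and the constant names are hypothetical, and the sketch does not say whether the hardware indexes with a virtual or a physical address.

```python
LINE_BYTES = 32   # bytes per cache line (one way of one set)
NUM_SETS = 128
NUM_WAYS = 4
WORD_BYTES = 4    # eight 32-bit words per line

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 7 bits select one of 128 sets
CACHE_BYTES = NUM_SETS * NUM_WAYS * LINE_BYTES


def split_address(addr: int):
    """Split a 32-bit address into (tag, set index, word in line, byte offset)."""
    addr &= 0xFFFFFFFF
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset // WORD_BYTES, offset
```

With this split, the address 0x80001234 maps to tag 0x80001, set 17, Word 5, byte offset 20 within the line. The tag would then be compared against the address tags of all four ways in that set.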