Simple Microprocessor Projects
by Krishnan Srinivasan
Many graduate- and undergraduate-level courses in Very Large Scale Integration -- VLSI -- include projects on one or more aspects of microprocessor design. These projects test the student's ability to apply theoretical knowledge in a practical environment, and they help the student gain experience with the software and tools used in VLSI design. The aim should be to keep each project simple enough that the student can complete it within a reasonable time, while still challenging the student to think beyond the textbook.
Simple Reduced Instruction Set Computing -- RISC -- Processor
In this project, the student is expected to create a very simple RISC processor. The processor supports six instructions: add, subtract, multiply, divide, load and store. Assume that there are three registers: R1, R2 and R3. The arithmetic operations -- namely add, subtract, multiply and divide -- are performed on the values stored in registers R1 and R2. Therefore, the registers must be loaded before the operation is performed. The result of the operation is saved in register R3. Before a new operation is performed, the value of R3 should be stored in memory. A memory should also be implemented for this project; it may be partitioned into a data portion and an instruction portion. The RISC processor should read instructions from the instruction memory, starting at address 0x0 and continuing up to a maximum address, and perform the corresponding actions. The memory may be loaded with a set of instructions for testing.
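Before committing to hardware, the intended behavior can be checked against a small software model. The following Python sketch is one possible reference model, not a prescribed design: the tuple instruction encoding, the `run` function, the opcode names and the use of integer division are all assumptions made for illustration.

```python
# Minimal behavioral sketch of the three-register machine.
# The tuple encoding and opcode names below are hypothetical.
def run(program, data):
    """Execute `program` (instruction memory) against `data` (data memory)."""
    regs = {"R1": 0, "R2": 0, "R3": 0}
    alu = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "MUL": lambda a, b: a * b,
        "DIV": lambda a, b: a // b,  # assumption: integer division
    }
    pc = 0  # the program counter starts at address 0x0
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":             # LOAD reg, addr: reg <- data[addr]
            reg, addr = args
            regs[reg] = data[addr]
        elif op == "STORE":          # STORE addr: data[addr] <- R3
            (addr,) = args
            data[addr] = regs["R3"]
        elif op in alu:              # arithmetic: R3 <- R1 op R2
            regs["R3"] = alu[op](regs["R1"], regs["R2"])
        else:
            raise ValueError(f"unknown opcode {op!r} at address {pc:#x}")
        pc += 1
    return regs, data


program = [
    ("LOAD", "R1", 0),
    ("LOAD", "R2", 1),
    ("ADD",),
    ("STORE", 2),   # save R3 before the next operation overwrites it
    ("MUL",),
    ("STORE", 3),
]
regs, memory = run(program, [6, 7, 0, 0])
print(memory)  # [6, 7, 13, 42]
```

The same test program, translated into whatever binary encoding the student chooses, can then be loaded into the instruction memory of the hardware design and the resulting data memory compared against the model.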
Pipelined RISC Processor
In this project, the RISC processor above should be implemented as a pipelined processor. A typical RISC processor has five pipeline stages: "Fetch," "Decode," "Execute," "Memory" and "Write Back." Pipelining allows multiple instructions to be active at the same time, thus improving system performance. Pipelined processors can suffer from bubbles, which are cycles in which no useful operation is performed. In an ideal processor pipeline, it is assumed that every stage completes in one cycle and that no instruction depends on another. However, the result of one instruction is sometimes needed by the next. In such cases, the dependent instruction must wait, and no useful operation is performed, until the previous instruction completes. Students who alter the processor architecture to minimize these dead cycles may be given additional credit.
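The cost of such dependencies can be estimated with a small cycle-counting model. The Python sketch below assumes an in-order five-stage pipeline with no forwarding, in which a dependent instruction may enter Decode only in a cycle after its producer's Write Back. The `schedule` function and the `(dest, sources)` instruction format are hypothetical names chosen for this example; a real design might stall for fewer cycles.

```python
STAGES = 5  # Fetch, Decode, Execute, Memory, Write Back


def schedule(instructions):
    """Return (total_cycles, stall_cycles) for an in-order pipeline, no forwarding.

    Each instruction is (dest, sources); dest is None if no register is written.
    Assumption: a consumer may enter Decode only in a cycle after its producer's
    Write Back. Stalls are modelled by delaying the consumer's Fetch.
    """
    if not instructions:
        return 0, 0
    last_writer_fetch = {}  # register -> fetch cycle of its most recent writer
    prev_fetch = -1
    for dest, sources in instructions:
        fetch = prev_fetch + 1  # in order, at most one instruction per cycle
        for reg in sources:
            if reg in last_writer_fetch:
                # Decode (fetch + 1) must come after Write Back (writer + 4).
                fetch = max(fetch, last_writer_fetch[reg] + STAGES - 1)
        if dest is not None:
            last_writer_fetch[dest] = fetch
        prev_fetch = fetch
    total = prev_fetch + STAGES
    ideal = len(instructions) + STAGES - 1
    return total, total - ideal


# LOAD R1; LOAD R2; ADD (reads R1, R2, writes R3); STORE (reads R3)
dependent = [("R1", []), ("R2", []), ("R3", ["R1", "R2"]), (None, ["R3"])]
independent = [("R1", []), ("R2", []), ("R3", []), (None, [])]
print(schedule(dependent))    # (14, 6)
print(schedule(independent))  # (8, 0)
```

Under these assumptions the four-instruction sequence loses six cycles to bubbles. Adding forwarding paths, one of the architectural changes a student might attempt, would reduce that number.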
Performance Analysis of Arithmetic Units
For this project, different implementations of the arithmetic functions can be studied to compare gate count against performance. For example, the adder could be implemented as a ripple-carry adder or a carry-lookahead adder. A ripple-carry adder ripples the carry from one addition stage to the next, and the final result is available only when the last stage has performed its addition. This adder is slow in the sense that its delay grows with the number of bits, because each stage must wait for the carry from the stage before it. However, the ripple-carry adder can be implemented with a low gate count. A carry-lookahead adder determines the carry into each stage ahead of time, directly from the input bits. Since the carries are calculated ahead of time, the adder can produce the result with less delay. However, the carry-lookahead adder needs more logic to do so, and hence has a higher gate count. The multiplier could be implemented as a Booth multiplier or a shift-add based multiplier. The shift-add based scheme is the regular paper-and-pencil method of shifting and adding repeatedly until the result is finalized. The Booth multiplier recodes the multiplier so that a run of consecutive 1s costs one subtraction and one addition, which can considerably reduce the number of additions required. Hence, it can take fewer cycles to compute the final result.
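The four schemes can be compared at the bit level in software before any gates are drawn. The Python sketch below is illustrative only: the function names, the 8-bit default width and the radix-2 form of Booth recoding are assumptions, and the operation counts it reports stand in for, but are not the same as, hardware cycle or gate counts.

```python
def ripple_carry_add(a, b, width=8):
    """Add bit by bit; each stage waits for the carry from the stage below."""
    carry, total = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_lookahead_add(a, b, width=8):
    """Compute every carry directly from generate/propagate signals."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    carries = [0]                                          # carry-in is 0
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ...
        c, run = 0, 1
        for j in range(i, -1, -1):
            c |= run & g[j]
            run &= p[j]
        carries.append(c)
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, carries[width]


def shift_add_multiply(multiplicand, multiplier, width=8):
    """Unsigned shift-and-add; returns (product, number of additions)."""
    product, additions = 0, 0
    for i in range(width):
        if (multiplier >> i) & 1:
            product += multiplicand << i
            additions += 1
    return product, additions


def booth_multiply(multiplicand, multiplier, width=8):
    """Radix-2 Booth recoding of a `width`-bit two's-complement multiplier.

    Returns (product, number of add/subtract operations).
    """
    product, operations, previous = 0, 0, 0
    for i in range(width):
        current = (multiplier >> i) & 1
        if (current, previous) == (1, 0):    # start of a run of 1s: subtract
            product -= multiplicand << i
            operations += 1
        elif (current, previous) == (0, 1):  # end of a run of 1s: add
            product += multiplicand << i
            operations += 1
        previous = current
    return product, operations


print(ripple_carry_add(200, 100), carry_lookahead_add(200, 100))  # (44, 1) (44, 1)
print(shift_add_multiply(5, 0b01111110))  # (630, 6)
print(booth_multiply(5, 0b01111110))      # (630, 2)
```

The nested loop in `carry_lookahead_add` mirrors the extra logic that the lookahead scheme pays for its speed. The Booth advantage also depends on the operand: for an alternating multiplier such as 0b01010101, this radix-2 recoding needs eight operations where shift-add needs four, which is a useful data point for the analysis.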
Simple Cache Controller
A simple cache controller can be built. The cache controller could be four-way set-associative, with a least-recently-used -- LRU -- based replacement policy. In a four-way set-associative cache, each memory block can be placed in any one of four locations, which together form a set. Compared to a direct-mapped cache, where each memory block can be located at exactly one location in the cache, the four-way set-associative cache provides more flexibility in block placement and, typically, better cache performance. When a cache block has to be replaced, the LRU policy chooses the least recently used block in the set and replaces that block. The cache should be implemented with both write-back and write-through policies. When data in the cache is modified, the write-back policy updates the main memory only when the cache block is replaced. The write-through policy, on the other hand, updates the main memory every time the data in the cache is modified.
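The placement, replacement and write policies described above can be prototyped in a few lines of Python. This is a sketch under stated assumptions rather than a controller design: the class name `SetAssociativeCache`, the four-set default, block-granular addresses and write-allocate behavior on a write miss are all choices made for this example.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Sketch of an N-way set-associative cache with LRU replacement."""

    def __init__(self, num_sets=4, ways=4, write_back=True):
        self.num_sets, self.ways, self.write_back = num_sets, ways, write_back
        # One OrderedDict per set: tag -> dirty flag, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = self.memory_writes = 0

    def access(self, block_address, is_write=False):
        index = block_address % self.num_sets
        tag = block_address // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)             # mark as most recently used
        else:
            self.misses += 1
            if len(lines) == self.ways:        # set is full: evict the LRU line
                _, dirty = lines.popitem(last=False)
                if dirty:
                    self.memory_writes += 1    # write-back happens on eviction
            lines[tag] = False                 # assumption: allocate on any miss
        if is_write:
            if self.write_back:
                lines[tag] = True              # defer the memory update
            else:
                self.memory_writes += 1        # write-through: update memory now


for write_back in (True, False):
    cache = SetAssociativeCache(write_back=write_back)
    for _ in range(3):
        cache.access(0, is_write=True)   # three writes to the same block
    for block in (4, 8, 12, 16):         # same set; the last access evicts block 0
        cache.access(block)
    print(write_back, cache.hits, cache.misses, cache.memory_writes)
# prints "True 2 5 1" and then "False 2 5 3"
```

In this trace the write-back cache reaches main memory once, when the dirty block is evicted, while the write-through cache does so on each of the three writes.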
Simple Cache Coherent System
A simple two-processor cache coherent system with a modified-exclusive-shared-invalid -- MESI -- scheme can be implemented. Each processor will have its own cache memory. In this scheme, a line or block of cache memory can be in one of four states: "modified," "exclusive," "shared" or "invalid." A line is in the "modified" state if the data in that line is valid only in the cache of that processor and has been changed, so the copy in main memory is stale. A line is "exclusive" if the data in that line is present only in the cache of that processor and matches the copy in main memory. A line is "shared" if the data is valid in the caches of both processors and matches main memory. A line is "invalid" if the data is not valid in the cache of the processor. Both directory-based cache coherence and snooping-based cache coherence should be implemented, and the scalability of each algorithm with an increasing number of processors should be studied. A directory-based cache coherence mechanism maintains, alongside the main memory, a directory that records the state of each cache block. This directory is then used to send messages to the processors about the state of each cache block. In a snooping-based scheme, each modification of a cache block is broadcast, so that the caches of the other processors are notified about the change in the cache block.
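The state transitions for a single cache line in the two-processor, snooping case can be sketched as a small Python function. The name `mesi_step` and the `"R"`/`"W"` operation codes are hypothetical, the bus is modelled only implicitly, and data movement (such as a modified line being flushed to memory) is noted in comments rather than simulated.

```python
def mesi_step(states, cpu, op):
    """Apply one read ("R") or write ("W") by `cpu` to a single cache line.

    `states` holds "M", "E", "S" or "I" for processor 0 and processor 1.
    Returns the new state list; the input list is not modified.
    """
    states = list(states)
    other = 1 - cpu
    if op == "R":
        if states[cpu] == "I":           # read miss
            if states[other] == "I":
                states[cpu] = "E"        # no other copy exists: exclusive
            else:
                states[other] = "S"      # an M line is flushed to memory first
                states[cpu] = "S"
        # a read hit in M, E or S changes nothing
    elif op == "W":
        states[cpu] = "M"                # the writer now holds the only valid copy
        states[other] = "I"              # the other cache snoops the invalidate
    else:
        raise ValueError(f"unknown operation {op!r}")
    return states


line = ["I", "I"]
for cpu, op in [(0, "R"), (1, "R"), (0, "W"), (1, "R"), (1, "W")]:
    line = mesi_step(line, cpu, op)
    print(cpu, op, line)
# final state: ['I', 'M']
```

The printed trace passes through exclusive, shared and modified states in turn. A directory-based variant would keep the same per-line states but replace the implicit broadcast with messages sent only to the caches listed in the directory.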