|
Reference Manual
for the RISC-V Verilog Softcore
Processor (rv32i_cpu_core)
Simon Southwell
September 2021
(last updated: 7th December 2021)
Abstract
Here is presented the design and use of the open-source rv32i_cpu_core RISC-V softcore. It is a
Verilog based design targeted at FPGAs, with an example simulation environment
targeting ModelSim and an example synthesis environment targeting the DE10-Nano
with a Cyclone V BA6 FPGA. The purpose of the core is not to compete with
either commercially available cores, or other open-source cores, but to be an
informative example of a simple core design using the RISC-V open-source
architecture. In that aim, the document provides details of the core's design,
and the design choices made, as well as the use of the simulation and platform
environments. The design is a pipelined architecture, and employs some
common rudimentary optimisation techniques to achieve single cycle execution for
all the instructions except branches/jumps and loads, to serve as a working
example of a (hopefully) useful core. The current core design is to the RV32I
standard, and uses approximately 700 ALMs (~1600 LEs) and two M10K RAMs, if configured
for a RAM based register file.
Introduction
This document details the design and use of the open-source rv32i_cpu_core RISC-V softcore. It is a
Verilog based design targeted at FPGAs, but can be targeted at ASICs with
little or no modification.
It is meant as an exercise in implementing
a modern RISC based processor with the aim to demonstrate the operations of
such a device, and some of the design problems encountered in constructing such
a component. The code itself is also designed to be accessible and expandable,
as an education tool and a starting point for customisation. As such, this
document, as well as usage information, will give some details of the internal
design of the source code sufficient to navigate and modify the code, as
desired. At this time it's fixed at implementing the RV32I features, with the
extensions being a possible future expansion.
The implementation trades off between
performance, resource usage and clarity of design. It was designed to be an
efficient implementation running at a usefully high clock frequency, with a
modest FPGA resource requirement, but where clarity of operation would be unnecessarily obfuscated, a
more practical route is taken. The design can be used without the need to
understand the code, but if one desires to delve deeper and make modifications,
then the intent is that the code is as easy to understand as possible,
consistent with implementing the desired features, and design details are well
documented in this manual, along with some of the design decisions that were
made along the way.
This being said, an initial specification,
when targeting the chosen example platform (a DE10-nano with a Cyclone V
5CSEBA6U23I7 FPGA) is for a 100MHz clock, less than 1000 ALMs (~2600 LEs)
for the base implementation and a
pipelined Harvard architecture.
Along with the implementation is an example
simulation test bench, targeting the ModelSim simulator from Mentor Graphics,
and the synthesis setup to target the DE10-nano, using the Intel/Altera Quartus
Prime tools. These two environments are documented within this manual as well.
rv32i_cpu_core Supported Features
The overall features of this RV32I initial delivery are
listed below:
- All RV32I instructions implemented
- Configurable for RV32E
- Single HART
- Separate instruction and data memory interfaces (Harvard architecture)
- 5 deep pipeline architecture
- 1 cycle operations for all instructions except branch, jump and load
- Regfile update bypass feedback employed
- Branch instructions take 1 cycle when not branching, 4 when branching
- 'never take' branch prediction policy employed, with pipeline cancellation on branch
- Jump instruction takes 4 cycles
- Load instructions take a minimum of 3 cycles, plus any additional wait states
- Register file configurable between register or RAM based
- Defaults to RAM based, using 2 × M10K RAM blocks
- Register based costs approximately 700 ALMs (~1750 LEs)
- Example simulation test bench provided
- Example FPGA target platform using the Terasic DE10-nano development board (employing the Intel Cyclone V 5CSEBA6U23I7 FPGA).
- Targeting 100MHz clock operation
- < 1000 ALMs (~2600 LEs) when also employing Zicsr and RV32M extensions (RV32I implementation currently around 700 ALMs, ~1600 LEs).
The initial delivery instantiates the design in a core
component, suitable for adding into the FPGA environment as a "Platform
Designer" component. This core also instantiates some local FPGA RAM for the
instruction memory (imem) and data memory (dmem). A future enhancement is to
add a memory management unit to make use of the up to 1GB of DRAM available on the
DE10-nano.
RV32I Core Implementation
The base core implements the RISC-V RV32I specification (see
[1]). The idea is to implement a 'classic' RISC processor architecture, with
multi-stage pipelining, to allow all instructions that have inputs and outputs
only within the core, and do not alter the flow of the running program, to
execute in a single cycle. That is to say, each instruction takes as many
cycles as the pipeline is deep, but that the pipeline stages each have a
partially executed instruction at that stage, so that the rate of processing is
1 cycle per instruction, with just an initial latency to fill the pipeline.
The instructions that do not have single cycle operations
are jumps, branches (if taken), and memory load instructions. Memory store
instructions will also have wait states when a memory sub-system is added but,
for now, the example design that instantiates the core uses FPGA memory
blocks, which can be written in a single cycle.
Top Level
The implementation of the rv32i_cpu_core
consists of three modules, which implement a 5-stage pipeline. The diagram
below shows a simplified view of these components, which are instantiated
within a top level.
The source code for the CPU can be found in the repository
at HDL/src, and consists of the
following files:
- rv32i_cpu_core.v : Top level CPU core
- rv32i_regfile.v : Register file, containing the CPU's registers and program counter (the first phase of instruction reads)
- rv32i_decode.v : Instruction decoder implementing the next two phases of the pipe, source register pre-fetch and instruction decode
- rv32i_alu.v : Arithmetic logic unit executing the instruction and register write-back (the last two phases)
Pipeline Overview
The 5-stage pipeline allows for the efficient implementation
of instructions, by breaking up the individual calculations required for the
instructions into smaller pieces. It would be possible to execute an
instruction in a single clock cycle, but that clock cycle would have to be long
enough for all the calculations to be completed before allowing the next
instruction to proceed. The aim of the pipeline is to divide the calculations
into smaller 'chunks', with each stage feeding its results to the
next stage. Since each stage's calculations are smaller than the whole, the
clock cycle speed may be increased, limited by the slowest of the stages. In an
ideal design, the stages would all take the same time, and the pipeline would
be balanced, and run as fast as is possible.
Register Hazards
Having a pipeline architecture causes its own problems,
however. If an instruction's input relies on a result of a neighbouring
instruction before it, but it takes several cycles before that result is
available (in a register, say), then that instruction may get stale data if it
does not wait for the result of the instruction in front of it to be available.
This gets worse, the deeper the pipeline. Several options are available.
Conceptually, the simplest system might be that the issuing of instructions is
halted whilst an instruction is in the pipeline, and is only issued when the
writeback phase is reached for the instruction in front. This, though, gets
back to the single cycle operation, as only one instruction is active at a
time. Overheads would probably make this worse than single cycle operations. A
better approach might be to know which registers (or memory locations,
possibly) are to be updated by the instructions in the pipeline, and issue a
new instruction only if there are no hazards; that is, its source registers do
not match any of the destination registers of instructions in the pipe. This
would require keeping and monitoring destination registers (and, possibly,
store addresses), at each stage.
The approach taken in this design is to allow instructions to
proceed regardless. The values of the registers indexed by the instruction's
RS1 and RS2 indexes (as appropriate), are accessed very early on, with the
possibility of retrieving stale values. At each subsequent stage, except
writeback, if a destination register is being written back, the stage will swap
its source value with the writeback value, if its index matches that being
written (and is not x0, which is
never written). Since the ALU's execute phase immediately precedes the
writeback phase, there is opportunity to update the values at all the phases up
to that point. The details of destination feedback will be discussed in the
sections for each block but, in summary, each block has a
circuit similar to the diagram shown below (though possibly with different signal names).
Each module in the pipeline's path is fed back the same RD index and RD value that is being
written to the regfile module. One slight variation is that the decoder is not fed back the register
indexes for the prefetch RS1 and RS2 values, as it has these internally, decoded from the
instruction and used to generate the prefetch register addressing. The index value is non-zero
when a write to the regfile is active, so the circuit inspects for non-zero values, and compares
the index to the index value of the RS registers it is processing (RS1 and RS2, with a circuit for
each). If there is a match it selects the feedback value in preference to the RS values
forwarded from the previous phase.
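As an illustration, the following is a minimal Verilog sketch of the per-RS feedback comparison just described. The signal names are assumptions made for the purpose of the example and are not the core's actual names; one such circuit would exist for each of RS1 and RS2.

  // Illustrative sketch only: signal names are assumed, not the core's actual names.
  module rd_feedback_sketch
  (
    input  [4:0]  rs_idx,     // index of the RS value this stage is using
    input  [31:0] rs_val,     // RS value forwarded from the previous phase
    input  [4:0]  fb_rd_idx,  // index of the RD currently being written back
    input  [31:0] fb_rd_val,  // RD value currently being written back
    output [31:0] rs_fb_val   // value actually used by this stage's logic
  );

    // A non-zero RD index indicates an active regfile write (x0 is never
    // written), so the fed-back value is selected when the indexes match.
    wire match = (fb_rd_idx != 5'b0) && (fb_rd_idx == rs_idx);

    assign rs_fb_val = match ? fb_rd_val : rs_val;

  endmodule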
The diagram below shows the regfile bypass feedback paths.
In this design there are two sources of RD updates: one from the ALU and one from the Zicsr
module (if enabled). It is the mux'ed version of this that is routed back to each of the other
blocks, as well as forward to the regfile module. The feedback logic (shown as fb in the
diagram), takes the RD values and indexes and compares with the module's source of the RS
indexes, forwarding to the internal logic the appropriate value.
This system allows a pipelined architecture to get the correct RS value (i.e. the most
up-to-date) without stalling the path. However, it creates critical paths in the design. In this design,
this is from the registered RD output of the ALU, through the mux with the Zicsr RD output,
back to the RS inputs of the ALU and through the feedback logic. Then it follows the ALU's
logic path to the register for the RD (see the diagram in the Arithmetic Logic Unit section below,
where the RS inputs are marked as a and b, and the RD value output is marked as c). For the
DE10-nano example platform, this is the critical path.
A similar long path exists from the registered ALU's RD output, through the mux into the regfile.
Internally to the regfile module is a short pipeline of the value being written to allow for the
latency of storing and then immediately accessing the data from the register file's memory. A
similar feedback logic circuit is used to choose whether to use the incoming value being
written, the one in the previous cycle, or that from memory. This adds a multiplexer to the path
for selecting the returned prefetch RS values, with the fetch from memory likely to be the longest
of the data paths. The prefetch values then have to go through the decoder's feedback logic
before being processed by the decoder (see diagram in the Decoder section below) and
ultimately registered as the RS outputs. In the example platform this isn't the critical path, but
might easily become so with additional decoding (e.g. for more extension support) or on a
different targeted platform.
The trade-off here is the advantage of issuing instructions once a cycle for maximising
compute cycle efficiency, against the maximum speed at which the logic can be clocked. Given
that the pipeline is 5 phases deep, it is likely that the clocking would not be increased 4 or 5 fold by
removing the bypass feedback functionality.
Branch Hazards
Another issue with a pipeline that allows instructions to
always be issued is the problem of jumps and branches. If a jump is issued, or
a branch that will be taken, the PC is updated from a linear increment to some
arbitrary address to start fetching instructions from that location. When the
jump is issued, though, it will be several cycles before the PC is updated at
the writeback phase. In the meantime, the pipeline is fetching instructions
beyond the jump, as if the jump has not happened. Those are not meant to be
executed, and can alter state, with dire consequences.
In this design, the instructions proceed until an update to the PC is flagged at writeback. At this
point, all the stages in the pipeline are 'cancelled'; effectively they are turned into nops
(no-operation), and the pipeline continues. Since state is only updated at the writeback phase, by
forcing the stages to nop, none of the erroneously fetched instructions have any effect, even
though the pipeline proceeds. What this does do, though, is introduce cycles in the pipeline
that do nothing. This is why, in the sections above, it has been stated that branches take 1
cycle if the branch is not taken but 4 cycles if taken. This is a function of the pipeline depth, and
it will be seen in other implementations as well. Since the jump instructions are effectively a
branch with an 'always taken' status, they are handled no differently, and the pipeline is
'stuffed' with the next instructions. The new PC value will not be known until the writeback
stage, so early decode of jumps to stop issuing instructions until the new PC is available would
just introduce additional logic to handle these. This might have an advantage in a
power-conscious design, where one would not want to clock stages that do nothing, but this design
does not attempt to be that sophisticated.
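To illustrate the principle, the sketch below shows one way the cancellation might be expressed in Verilog, by substituting a nop for the fetched instruction while the PC update is flagged. This is a simplified illustration with assumed signal names; in the actual design all of the in-flight stages are cancelled, not just the issue point.

  module pipeline_cancel_sketch
  (
    input             clk,
    input             update_pc,      // flagged at writeback for a taken branch or jump
    input      [31:0] fetched_instr,  // instruction arriving from IMEM
    output reg [31:0] issued_instr    // instruction passed down the pipeline
  );

    localparam NOP = 32'h00000013;    // addi x0, x0, 0

    // While a PC update is flagged, the erroneously fetched instructions are
    // replaced with nops, so no state is altered when they reach writeback.
    always @(posedge clk)
      issued_instr <= update_pc ? NOP : fetched_instr;

  endmodule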
Stalling Mechanism
Another hazard with a pipeline architecture is for
instructions that can cause wait states or stalls. This boils down to accesses to
memory (and memory mapped I/O). In a general system, a memory management unit
(MMU) might be employed which gives access to the DRAM (say) for a large
storage space. Accessing that space can take tens of cycles (or more if I/O). A
cache can be employed, with local fast access memory, with, perhaps, single
cycle accesses. But, eventually, even the cache will need to be updated from
main memory, and the access by the CPU will have to stall. This applies to both
reading instructions from memory as well as load and store instructions.
The current design treats instruction reads and load/store
instructions differently. The example design using the rv32i_cpu_core has local FPGA memory, with single cycle
writes, and a delayed cycle read. The design, however, has the ability to
handle arbitrary wait-states for both instruction reads and memory access
instructions.
imem reads
For instruction reads, the design policy is to issue nop instructions for all cycles where
instruction read data is not valid. This takes care of situations where the
memory sub-system stalls on instruction fetches, and in the initial latency of
fetching the first instructions.
As mentioned before, this is not perhaps the best policy
from a power consumption point of view (if a design has features to support
this), but, for this design, it is the simplest solution.
dmem loads
For memory accesses, the example design, with local memory
will not stall on store instructions. On load instructions, however, the
reading from memory takes a single cycle. In addition, the address to be read
is not available until the execution phase output, as it is calculated during
this phase. So wait states must be introduced to allow issuing the read, waiting
for the return value, and then writing back to the destination register.
Conceptually, the easiest thing to do might be to stop
issuing new instructions after a load (or stuff the pipe with nops), until the load writeback is reached. Decoding for a
load instruction takes a couple of phases, and so a latency on stopping the PC
increment exists, and so some sort of wind-back might be needed to pick up at
the correct address when the load is complete. The problem with this approach
is that it empties the pipeline behind the load instruction, adding additional
cycles over-and-above the wait states.
In this design, the approach is to allow the pipeline to
proceed behind the load instruction, until the execute phase, when the pipe is
stalled. The writeback phase will hold with the memory read accesses for as
long as necessary to get the returned data, and route it to the destination
register. At this point the pipeline is allowed to proceed, and no added
latency is introduced by stuffing the pipe with nop
instructions.
The spreadsheet fragment below is an attempt to show what
happens for a load access stall (and was used for the signalling design).
The spreadsheet shows how the PC and phases proceed until
stalled. A stall is 'initiated' when a dmem access is activated (read signal =
1), but the wait signal from the memory is also high. This is shown in red,
and this may be the state for a number of cycles. Now it is known that a load
instruction will have at least one wait state, and so early warning on stalling
the PC can be effected when the load reaches the execution phase. This is needed
so as not to push instructions into the pipeline with no space left, and hence
lose them. The 'hold PC' column shows this signal, asserted before the others.
The other stages in the pipeline need to stall on the first
memory access wait state, and continue to stall until valid data is returned
(when 'wait' is de-asserted, shown in green). This is shown in the row below 'hold PC'.
Finally, the next cycle after the memory read valid data is where the writeback
to a register is performed. Since the load instruction is still active, the
dmem 'read' must be cleared in this phase (which would otherwise remain asserted because of
the still active load instruction), so a new access is not started.
After this, the pipeline continues, with the instructions
behind the load ready to go. Thus only two additional cycles are added.
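A minimal sketch of the stall conditions described above is shown below. The signal names (and the exact point at which each is generated in the top level) are assumptions for illustration.

  module load_stall_sketch
  (
    input  dmem_read,     // an active load access to memory
    input  dmem_waitreq,  // memory is signalling wait states
    input  load_in_exec,  // a load instruction has reached the execution phase
    output stall,         // holds the decode, execute and writeback stages
    output hold_pc        // stops the PC incrementing, asserted one phase early
  );

    // Stall the pipeline stages while the memory read is being waited on.
    assign stall   = dmem_read & dmem_waitreq;

    // A load always has at least one wait state, so the PC can be held as soon
    // as the load reaches the execution phase, before the stall itself asserts.
    assign hold_pc = load_in_exec | stall;

  endmodule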
Parameters and Configuration
In general, the logic will work as intended without the need
for configuration. However, the CPU core (rv32i_cpu_core),
and the top level wrapper (core) have
some parameters to alter their configuration, and these are documented below.
Parameters for the rv32i_cpu_core
The following lists the parameters available for the CPU
core module, which are also routed to the core top level to make available to
test benches and synthesis instantiations.
- RV32I_TRAP_VECTOR
- RV32I_RESET_VECTOR
- RV32I_REGFILE_USE_MEM
- RV32I_LOG2_REGFILE_ENTRIES (5 for RV32I, 4 for RV32E)
- RV32I_ENABLE_ECALL
The
initial RV32I softcore implementation does not have CSR registers, as per the
Zicsr extension specification, and so does not have an mtvec register defining
the address for traps. The RV32I_TRAP_VECTOR parameter defines this address and has a default value of 0x00000004.
Since there is no mtvec register, this cannot be changed programmatically. When the Zicsr
extensions are added, this will become the default value of the mtvec register.
In
a similar manner, the reset vector is defined using the RV32I_RESET_VECTOR
parameter, and defaults to 0x00000000. Unlike the trap vector, there is no CSR register to change this,
and the specifications leave this as an implementation choice. The two default
values for the vectors, for this design, have been chosen to match the
riscv-tests expectations.
The
RV32I_REGFILE_USE_MEM parameter is used to select between a register based register file
array when 0, or a RAM based register file array when 1. The default is 1 for
RAM based. The choice is based on whether register or RAM is a limiting factor
in a design using the core. In the current design, with a register based
regfile, the FPGA resources used for the array are around those used for the rest
of the core, doubling the requirements. If RAM is chosen, two RAM blocks
are used. For the example target platform this is two M10K blocks, with only
1024 bits of the 10240 bits used in each block. There are a few hundred blocks
in the target FPGA, however, so the percentage usage is very low. If the core
were targeted at an ASIC development, though, RAM may be more of an overhead
than gates, so both implementations are available.
The
RV32I_LOG2_REGFILE_ENTRIES parameter defines the 'address' width for the register file
register array, which, thus, defines the number of entries. The default value
is 5, giving the RV32I specification of 32 registers (x0 to x31). If an
RV32E specification is required, then this may be altered to 4, for 16 registers
(x0 to x15). When selecting a RAM based register file array, switching to an
RV32E specification will save no RAM bits, and so is less useful.
The
RV32I_ENABLE_ECALL parameter is purely for test purposes. It is there to allow the riscv-tests code to be run without the need for Zicsr extensions. These tests,
as part of initialisation, set up some CSR registers and make tests on others.
Normally, tests are made and will skip part of the initialisation, including
executing an ecall instruction. With the parameter set to 0 (it defaults to 1), the ecall
instruction is treated as a nop. With the implementation's default values, and setting this
parameter, the test code drops through to the main test code with sufficient
initialisation to execute without issue. The intent is to remove this parameter
once Zicsr extensions are added.
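By way of example, the fragment below shows how these parameters might be overridden when instantiating the core, here selecting a register based regfile sized for RV32E. The port names shown are placeholders only; the actual port list is defined in rv32i_cpu_core.v.

  rv32i_cpu_core
  #(
    .RV32I_TRAP_VECTOR          (32'h00000004),
    .RV32I_RESET_VECTOR         (32'h00000000),
    .RV32I_REGFILE_USE_MEM      (0),            // register based regfile
    .RV32I_LOG2_REGFILE_ENTRIES (4),            // 16 registers (RV32E)
    .RV32I_ENABLE_ECALL         (1)
  ) cpu_i
  (
    .clk                        (clk),          // placeholder port names
    .reset_n                    (reset_n)
    // remaining ports not shown
  );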
Additional Parameters for the top level Core
As well as the parameters for the CPU core, which are
available at the top level core, some additional parameters are added, as listed
below.
- RV32I_IMEM_ADDR_WIDTH
- RV32I_DMEM_ADDR_WIDTH
- RV32I_IMEM_INIT_FILE
- RV32I_DMEM_INIT_FILE
- RV32I_IMEM_SHADOW_WR
The
RV32I_IMEM_ADDR_WIDTH and RV32I_DMEM_ADDR_WIDTH parameters specify the address widths of the internal instruction
and data memories, and hence their size. By default they are 12 bits, for
4096 x 32 bits, giving 16 Kbytes for each RAM.
Memory
initialisation files can be specified for the RAMs with the RV32I_IMEM_INIT_FILE and RV32I_DMEM_INIT_FILE
parameters. The format for these are specific to the targeted example platform,
and can either be in Intel hex format (.hex), or in a 'memory
initialisation file' (.mif) format. It is expected that, for different targets, the RAM
modules will be replaced with the appropriate substitutes. The parameters can
still be routed to these new blocks and specify an initialisation file of the
appropriate type, if required. The default value for these parameters is "UNUSED", indicating no initialisation file is specified for the RAM.
The RV32I_IMEM_SHADOW_WR
parameter is meant for test purposes only. By default it is 0, with the logic's
behaviour as normal. When set to 1, logic is enabled where all writes to the
data memory will also be written to the instruction memory. This was added
specifically to allow the riscv-tests fence_i.S test to run, where stores to code
locations are made for self-modifying purposes to test the fence.i
instruction. Since IMEM and DMEM are separate memories, writes to code
locations will be routed to DMEM. By enabling this feature, these writes are
shadowed in the IMEM as well. It is expected that in an implementation this
parameter is left at its default value. The example synthesis environment has
modification of this parameter disabled.
Decoder
The decoder block, implemented in rv32i_decode.v, is
responsible for processing raw instruction values and splitting out the
instruction type and the various relevant fields of the instruction for sending
to the arithmetic logic unit (ALU), as well as instigating source register
prefetching. Thus it implements the pipeline stages 1 (rs prefetch) and 2 (decode). The diagram below shows a
simplified view of the decoder module.
Instructions arrive from IMEM on the instr input, and the logic immediately routes the RS1 and RS2
fields to the register file block to start retrieving the values. This is
regardless of what type the instruction will decode to be. The RISC-V
specification has been constructed in such a way as to position the RS indexes
in the same place for each instruction that specifies a source register. By
routing the bits in phase 1 to the register file block for all instructions,
without further decode, the values become available for phase 2, if required.
Otherwise the returned values are ignored. For a power sensitive design,
additional decode in phase 1 might allow only fetching of data when relevant.
Since the instr input is from memory,
it may arrive late in the cycle, and so needs to be only lightly processed
where possible.
The instruction input is registered for the phase 1
processing. There are three main sections to the decode. Firstly, all possible
opcode/function, register indexing and immediate value fields are extracted
from the instruction. This is just wiring, and so is a convenience only.
The second stage is to use the extracted opcode and function
fields to generate bits flagging the different instruction types, such as
arithmetic, jump, memory accesses etc. With the operation bits, the various
immediate fields are multiplexed to a single immediate value, based on the
decoded instruction type.
The final stage is to generate the outputs for routing to
the ALU. The main outputs are an A and B value for inputs to the main ALU
calculations. The values set on these are dependent on the instruction type. A
can be the RS1 value, the PC or 0. The B value can be the RS2 value, the trap
vector value or the selected immediate value. With appropriate A and B values
the ALU calculations can take these common inputs and generate its output. It
will still need to know the operation to perform on these values, and so the
operation bit values are sent to the ALU (shown as ALU control on the diagram).
In addition to the A and B values, the source register
indexes for the A and B values are sent to the ALU as well for instructions
that use them (not shown in the diagram). This is purely for the pipeline
destination (RD) feedback mechanism to allow for updating stale RS fetches (see
Pipeline Overview section above). The decode block itself monitors for matching RS and
RD indexes, and substitutes the register file returned values for the RD value
update when a match occurs. This is not shown in the diagram. The RD index is
also sent to the ALU to allow the ALU result to be routed to the appropriate
destination register.
In addition to the A and B values, some instructions require
calculations on the PC. For example, branch instructions must do a comparison
of two registers, and A and B are used for these values, but a new PC is
required if the branch is taken, and a calculation is required on the PC value and
an offset value. The PC and the extracted offset are also routed from the
decoder to the ALU.
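The sketch below illustrates the kind of A and B selection described in this section. The control flags and signal names are assumptions for the example; the real decoder derives the equivalent selections from its instruction type decode.

  module ab_select_sketch
  (
    input  [31:0] rs1_val, rs2_val,  // prefetched source register values
    input  [31:0] pc,                // PC of the instruction being decoded
    input  [31:0] imm,               // immediate selected for the decoded type
    input  [31:0] trap_vector,       // trap vector (a parameter in the base design)
    input         a_is_pc,           // e.g. AUIPC/JAL: A = PC
    input         a_is_zero,         // e.g. LUI: A = 0
    input         b_is_rs2,          // register-register operations and branches
    input         b_is_trap,         // system trap: B = trap vector
    output [31:0] a, b
  );

    assign a = a_is_zero ? 32'h0       :
               a_is_pc   ? pc          :
                           rs1_val;

    assign b = b_is_trap ? trap_vector :
               b_is_rs2  ? rs2_val     :
                           imm;

  endmodule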
Arithmetic Logic Unit
The ALU is where all the calculations for an instruction are
executed, and where the results are written back to registers or memory, where
applicable. New address values are also written back to the PC when necessary.
Thus it implements phases 3 and 4 of the pipeline.
The diagram below shows a simplified view of the ALU logic.
At the heart of the design are the main execution
calculations, which take the A and B values from the decoder to provide a result
in a C output. For the most part, A and B are fed to the various possible
operation types, and the appropriate result mux'ed to the C register, selected
from the instruction type bit controls from the decoder. The categories of
instruction in this mux are:
- load
- store
- jump
- comparison (for branches)
- bitwise operations
- arithmetic (add and subtract)
Of these categories, all use the A and B inputs except the
following. The load instruction takes the returned data and aligns the data
dependent on the data width and address. Similarly, the store instructions have
the source data aligned to the appropriate byte lanes, ready for writing to
memory. Jump instructions take the PC value for the instruction and add 4.
This is for the link address that will then be written to the indexed
destination register. The actual new PC value is calculated separately and is
discussed below. All the other types of instruction take the A and B values, as
selected by the decoder, whether from a source register, an immediate value,
or otherwise. Not shown in the diagram is the writeback feedback mechanism where
the A or B values from the decoder are overridden by the C output if their
index matches the destination register index (and not 0). This is the phase 4
stage's 'stale rs data' mechanism.
As mentioned before, some of the instructions require
additional calculations. For loads and stores a memory address is calculated.
This takes an address base value on the A input and an offset value sent
from the decoder and extracted from the instruction's immediate value. Normally
an immediate value appears on the B input, but the main calculation logic is
busy with either aligning memory read data (loads) or aligning write data
(stores). An addr output sends the address to memory, along with byte enables
calculated from the bottom bits.
For jumps and branches, a new PC value must be calculated,
and this comes from the offset mentioned above and the PC value input to
generate a new PC value.
All the c, addr and pc
calculations proceed regardless of the instruction type being executed. Whether
they affect state, or not, depends on the control outputs. A writeback to a
destination register only happens if the rd
index output is non-zero. Also, an update_pc
output flags to write the pc output to the PC register. Load and store outputs
control initiating accesses to memory. All of these are functions of the
instruction type control inputs.
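A simplified sketch of the result multiplexing and the side calculations described above is given below. The control flags, the pre-computed comparison and bitwise sub-results, and the signal names are all assumptions for illustration; load and store alignment is not shown.

  module alu_result_sketch
  (
    input  [31:0] a, b,                    // operands selected by the decoder
    input  [31:0] pc, offset,              // instruction PC and extracted offset
    input         is_jump, is_cmp, is_bitwise, is_sub,
    input  [31:0] cmp_result, bit_result,  // assumed pre-computed sub-results
    output [31:0] c,                       // result routed to the destination register
    output [31:0] addr,                    // load/store address
    output [31:0] next_pc                  // new PC for jumps and taken branches
  );

    // Main result multiplexing (load/store alignment logic not shown).
    assign c = is_jump    ? pc + 32'd4 :   // link address
               is_cmp     ? cmp_result :
               is_bitwise ? bit_result :
               is_sub     ? a - b      :
                            a + b;

    // The address and PC calculations proceed regardless of instruction type;
    // whether they affect state depends on the control outputs.
    assign addr    = a  + offset;
    assign next_pc = pc + offset;

  endmodule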
Stalling
In the pipeline section, stalling was
detailed from an overall perspective. The calculations on when to stall each
stage of the pipeline are done at the top level. For the ALU, a single stall
input controls stalling of the phase 3 and 4 stages. When stall is asserted the
ALU will hold key state at its previous value. This includes the following
outputs (a sketch of the hold behaviour follows the list):
- addr
- load
- store
- rd
- update_rd
- pc
- update_pc
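As a sketch of this hold behaviour (with assumed signal names), the registered outputs simply retain their previous values while stall is asserted:

  module alu_stall_hold_sketch
  (
    input             clk, stall,
    input      [31:0] next_addr, next_pc,
    input      [4:0]  next_rd,
    input             next_load, next_store, next_update_rd, next_update_pc,
    output reg [31:0] addr, pc,
    output reg [4:0]  rd,
    output reg        load, store, update_rd, update_pc
  );

    // Key output state only updates when not stalled, otherwise it holds.
    always @(posedge clk)
      if (!stall)
      begin
        addr      <= next_addr;
        load      <= next_load;
        store     <= next_store;
        rd        <= next_rd;
        update_rd <= next_update_rd;
        pc        <= next_pc;
        update_pc <= next_update_pc;
      end

  endmodule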
Register File
The register file module is where all the main CPU state is
implemented: the main integer registers x0 to x31, and the program counter (PC).
At its simplest, it is an array of 32 bit registers that can be written and
read. As it holds the program counter, it implements phase 0 of the pipeline.
Registers versus RAM
In this implementation, the registers can be configured to
be implemented as an array of logic registers or as memory blocks under the
control of a parameter, RV32I_REGFILE_USE_MEM.
When registers are chosen, an array of 32 bit registers
is instantiated, which can be written by selecting with an rd_idx input and supplying a new_rd value. Two values need to be read
simultaneously, for RS1 and RS2 values, and rs1_idx
and rs2_idx index the registers to be
read, with data returned on rs1 and rs2 outputs in the following cycle. Note
that all three ports can be active simultaneously.
If memory blocks are used, to save on register usage, some
problems arise. A two port memory block can have, at most, two active accesses,
where three are required. Secondly, write-through (where a location being read
that's also being written sees the new value) is not possible without
additional logic. These problems need to be solved if RAM blocks are to be
used.
The multiple accesses issue is solved by instantiating two
RAM blocks, which both get written simultaneously when a new RD value is being
written, thus they always contain the same data. When two reads are required,
one is directed to the first RAM, the second to the other RAM, giving
simultaneous read access. This, however, does not solve the write through
problem.
Pipeline Feedback
When writing to the RAM, the inputs are
latched, and then the value is written in the next cycle. If reading in the
same cycle as the write, these inputs will also be latched, and data read in
the next cycle, but this will be the old data, as the state is updating in the
same cycle. So, the regfile module has a short queue of recently updated values
to bypass the RAM in these cases.
If a read access matches the input write
location, the value is simply passed back as the read value, just as for pipeline
feedback in other stages. If the read index matches the previous write
location, then a stored write value is routed back as the read value, to
compensate for the cycles needed to update memory. In this manner, the correct
value is returned, however close read and write accesses to the same location
are.
This mechanism is employed whether
registers or RAMs are used so as to keep the logic the same in both cases, thus
simplifying it. It also allows the register implementation to register its
returned value outputs, relieving timing for the decoder.
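The following is a simplified sketch of the dual-RAM arrangement and the bypass described above. The names and the exact depth of the bypass are assumptions for the illustration.

  module regfile_ram_sketch
  (
    input             clk,
    input      [4:0]  rd_idx,            // destination index (0 => no write)
    input      [31:0] new_rd,            // value being written back
    input      [4:0]  rs1_idx, rs2_idx,  // prefetch read indexes
    output reg [31:0] rs1, rs2           // values returned in the next cycle
  );

    // Two RAMs, written together so they hold identical data, one per read port.
    reg [31:0] ram1 [0:31];
    reg [31:0] ram2 [0:31];

    // Record of the previous write, bypassing the RAM's update latency.
    reg [4:0]  last_rd_idx;
    reg [31:0] last_rd_val;

    always @(posedge clk)
    begin
      if (rd_idx != 5'b0)
      begin
        ram1[rd_idx] <= new_rd;
        ram2[rd_idx] <= new_rd;
      end

      last_rd_idx <= rd_idx;
      last_rd_val <= new_rd;

      // Read with bypass: use the incoming write if the indexes match, else the
      // value written in the previous cycle, else the RAM contents. x0 reads 0.
      rs1 <= (rs1_idx == 5'b0)        ? 32'h0       :
             (rs1_idx == rd_idx)      ? new_rd      :
             (rs1_idx == last_rd_idx) ? last_rd_val :
                                        ram1[rs1_idx];

      rs2 <= (rs2_idx == 5'b0)        ? 32'h0       :
             (rs2_idx == rd_idx)      ? new_rd      :
             (rs2_idx == last_rd_idx) ? last_rd_val :
                                        ram2[rs2_idx];
    end

  endmodule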
PC and Stalling
For the regfile module, stalling is just a
matter of holding the PC state. Two PC values are kept in the register file
block. The pc output is that used to fetch the instructions from IMEM. An additional
value, last_pc, is the value of the PC in the last cycle (normally PC - 4). This
value is fed to the decoder as the PC value of the instruction sent, since
there is a delay in instruction fetches.
Memory Interfaces
The two memory interfaces, for the instructions (IMEM) and
load/store data (DMEM) are modelled on the Avalon memory mapped master bus
specification from Intel/Altera. This is because the example target platform
uses the Intel/Altera Cyclone V FPGA, and this maps well to solutions based
on these devices. Mapping to a more generic protocol, such as the open source
Wishbone interface should be quite simple, with the main difference being a
'cycle' signal to indicate a bus access, and a write enable indicating a write
(and not read), instead of separate read and write strobes.
The signalling is not fixed cycle but has a wait request
input back to the master buses of the CPU core, to introduce wait states as
discussed before for loads and instruction reads.
The current example design uses local memories, but the
longer term aim is to have a memory sub-system with caches and access to main
DRAM memory. The use of wait request inputs allows for this enhancement, with
arbitrary wait states that may occur, without alteration of the CPU core logic.
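As an illustration of the wait request handshake, the sketch below shows a simple Avalon-MM style read master with variable wait states. The signal names follow Avalon conventions, but the structure is an assumption and not the core's actual bus logic.

  module avalon_read_sketch
  (
    input             clk, reset_n,
    input             start,          // request a read this cycle
    input      [31:0] start_addr,
    output reg [31:0] address,
    output reg        read,
    input      [31:0] readdata,
    input             waitrequest,    // slave holds the access with wait states
    output reg [31:0] data,
    output reg        data_valid
  );

    always @(posedge clk or negedge reset_n)
      if (!reset_n)
      begin
        read       <= 1'b0;
        data_valid <= 1'b0;
      end
      else
      begin
        data_valid <= 1'b0;

        // Start a new access, holding address and read until accepted.
        if (start & ~read)
        begin
          address <= start_addr;
          read    <= 1'b1;
        end

        // The access completes when waitrequest is deasserted.
        if (read & ~waitrequest)
        begin
          read       <= 1'b0;
          data       <= readdata;
          data_valid <= 1'b1;
        end
      end

  endmodule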
Zicsr Extensions
The Zicsr extensions add the following features to the base implementation:
- Clock counter
- Timer
- 1µs resolution
- Scales with the system clock (via the CLK_FREQ_MHZ parameter)
- Timer interrupts
- Counters/timers mapped to user addresses as read-only registers
- Retired instruction counter
- External interrupts
- Vectored and non-vectored interrupts
- Support for software interrupts
- Support for memory mapped timer register access
Configuration Parameters
The Zicsr extensions can be enabled and disabled with the core's RV32_ZICSR_EN parameter.
In addition, certain features of the CSR registers can be disabled to save on area.
The RV32_DISABLE_TIMER parameter excludes the real-time timer counter, when the Zicsr
logic is included. When non-zero, the 64 bit counter, and its 'tick' circuitry, is removed. Since
the processor still needs a real-time clock, the cycle counter is used instead, and it is this that
is read when accessing the user timer CSR registers at 0xC01 and 0xC81. Since the timer can't
stop, the CY bit of the mcountinhibit CSR register ([2], sec 3.1.13) has no effect when set if
configured for no timer. For software to be able to schedule real-time events in this mode, the
value of the CLK_FREQ_MHZ parameter is made available in the custom read-only CSR register
at 0xFFF, to indicate the rate at which the cycle counter is counting.
The RV32_DISABLE_INSTRET parameter excludes the retired instruction 64 bit counter logic
when set. When this parameter is non-zero, reads of the low and high counter bits (at 0xC02
and 0xC82) return 0.
Additional Instructions
With the Zicsr logic enabled, the relevant additional instructions are available, as defined in [1],
sec 9.1:
- csrrw
- csrrs
- csrrc
- csrrwi
- csrrsi
- csrrci
- mret
Supported Control and Status Registers
A subset of all possible machine level CSR registers has been chosen to balance the
requirements of a real-time system or OS, and the need to minimize the resources used in the
logic. See [2], Sec 3.1, 3.2 and [1], Sec 10.1. The supported machine level registers are:
- mstatus
- misa
- mie
- mtvec
- mcounteren
- mcountinhibit
- mscratch
- mepc
- mcause
- mtval
- mip
- mcycle, mcycleh
- minstret, minstreth (reads 0 when RV32_DISABLE_INSTRET non-zero)
- mvendorid (fixed at 0 at present)
- marchid (fixed at 0 at present)
- mimpid (fixed at 0 at present)
- mhartid (fixed at 0)
In addition, the user view of the counters is also added:
- ucycle, ucycleh
- utimer, utimerh (reads cycle counter when RV32_DISABLE_TIMER non-zero)
- uinstret, uinstreth (reads 0 when RV32_DISABLE_INSTRET non-zero)
Finally, a custom machine CSR register is added at 0xFFF for reading the configured timer tick
frequency (see [2], table 2.1), where it is set to 1 (MHz) when the timer is configured, but reads
the CLK_FREQ_MHZ parameter when the timer is not configured, and the cycle counter is
doubled-up and used for the timer instead.
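A sketch of the read value for this custom register is shown below, with assumed names; only the selection of the returned frequency is illustrated.

  module csr_fff_sketch
  #(
    parameter RV32_DISABLE_TIMER = 0,
    parameter CLK_FREQ_MHZ       = 100
  )
  (
    input  [11:0] csr_addr,
    output [31:0] csr_rdata
  );

    // 1 (MHz) when the microsecond timer is configured, else the frequency at
    // which the cycle counter (doubled-up as the timer) increments.
    localparam [31:0] TICK_FREQ_MHZ = (RV32_DISABLE_TIMER == 0) ? 32'd1 : CLK_FREQ_MHZ;

    assign csr_rdata = (csr_addr == 12'hFFF) ? TICK_FREQ_MHZ : 32'h0;

  endmodule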
Logic Design
The logic design for adding the Zicsr functionality, both in terms of the additional instructions and
the CSR registers themselves, was done to balance making the logic configurable for adding or
removing the resources, and making the design simple and efficient.
The diagram below is an updated top level diagram showing the blocks when Zicsr extensions
are included, with the appropriate additional data path connections.
The design adds the decoding of instructions directly to the original base decoder unit, but this
then farms out the processing of the instructions to a separate rv32_zicsr module,
highlighted in the diagram above. This not only executes the instructions, but also includes the
implemented CSR registers, as described in the previous section.
The decode interface to the Zicsr block (marked as Zicsr ctrl in the diagram) is two separate
paths. One flags when a CSRxxx or mret instruction is decoded, with parameters, whilst the
other flags when synchronous exceptions happen, and includes the exception type and PC
address of the instruction. The ALU provides memory access misalignment information to the decoder to route
as exception information, along with the other types of exceptions.
The Zicsr block is the 'ALU' block for the extensions, and can generate changes in the program
counter. So it, like the RV32I ALU, generates destination register and PC update signaling.
These are multiplexed with the ALU before sending to the register file block. To maintain some
of the CSR register state, the Zicsr block needs to know when instructions are completed
(retired), and a signal is sent from the ALU.
Lastly, an external IRQ input is synchronised and sent to the Zicsr block. Not shown on the
diagram, but an external software interrupt input is also sent to the block. This is a pin on the
rv32_zicsr module, but it is expected that this would be connected to a memory mapped
register in the external memory mapped space.
Extensions to Decoder
The decoder block has additional logic added to support the Zicsr extensions, controlled with
the RV32_ZICSR_EN parameter.
Firstly, the detection of illegal instructions is extended, with the invalid_instr signal set if a
shift value of 32 or more is seen.
The system instruction decode signal (system_instr) is altered to remove the now defunct
RV32I_ENABLE_ECALL parameter (which disabled ecall instructions), as ecall is now fully supported.
A decode is added for CSRxxx instructions, with a flag to indicate whether immediate or
not, and for the mret instruction.
The no_writeback flag is updated to be set for system instructions only when the Zicsr
extensions are disabled. In addition, it is set when a Zicsr instruction is decoded.
Within the synchronous process, logic is added to set an exception pulse output (exception)
when synchronous exceptions occur. These conditions are as follows:
- Misaligned IADDR: based on the input PC value not being 32 bit aligned
- Misaligned loads and stores: uses input from the ALU
- Illegal instruction: based on the instr_invalid flag
- ecall or ebreak instruction
The exception pulse is accompanied by an exception type, an address (for load or store
misalignments) and a PC of the instruction. For the load and store faults the PC is based on
the last PC value as the exception was detected in the ALU, which is one phase behind the
decoder. These exception signals go to the Zicsr module.
When a Zicsr instruction is decoded, the type of the instruction (read-write, read-set or
read-clear) is output, along with the instruction destination register index. Similarly, a signal is output
when the mret instruction is decoded. These are also sent to the Zicsr block to take appropriate
action. If the Zicsr instruction is an immediate instruction, then the a value output (usually for
the ALU) is set to the RS1 index value signal, as these bits correspond to the instruction's
immediate bits.
Finally, when the Zicsr extensions are present, updating the PC on system instructions is taken
care of by the rv32_zicsr module, and so the b output, which for the base implementation is
set to the trap vector parameter, does not select this when the Zicsr extensions are configured.
Similarly for the RS index output (a_rs_index), which is zero on a system instruction when
Zicsr is configured. This is so that the ALU processes the system PC update when there are no
Zicsr extensions, but not when there are.
Extensions to ALU
The changes to the ALU are mainly passive, as the new functionality is implemented in the
rv32_zicsr module. The main difference is the detection and output of load and store
misalignment exceptions. These are sent out on a new interface (misaligned_xxx) and routed
to the decoder, which merges these in with its own exception generation.
In addition, a signal is exported that is set for each instruction completed, and sent directly to
the rv32_zicsr module for counting retired instructions.
rv32_zicsr Module
The functionality of this block is to implement the supported CSR registers (as defined in the
Supported Control and Status Registers section, above) and process the CSRxxx instructions
that access and update them. In addition, it amalgamates exceptions, both synchronous from
the decoder block interface, and asynchronous from the external interrupt, the software
interrupt and the timer.
Within the block, there is logic to read and write the CSR registers, either from reads and writes
via the CSRxxx instructions, or updates to the applicable registers when events occur. The
actual indexing for the registers is done via an auto-generated module (zicsr_rv32_regs,
implemented in zicsr_rv32_regs_auto.v, with a header zicsr_auto.vh), generated from a
JSON description. This has a read and a write port, with strobes, addresses (indexes) and data
paths. The address/index decode logic is all generated internally. The implementation of the
registers is configurable, via the JSON description, to be an internal or external register. For
the CSR registers, if they are only updated via CSRxxx instructions, such as mstatus, then
they are internally implemented as part of the logic file auto-generation, and there is no external
port. Similarly, a register may be marked as a constant, such as mhartid, and implemented
internally. If a register may be updated by both CSRxxx instructions and external events, then
it is implemented externally, and the auto-generated logic has a port with a pulse for that
register, which pulses when written, and an output data port with write data. An input data port
is also added for reading the external CSR register value. If a register is described in the JSON
with bit fields, the input and output data ports will be separate for each described bit field. Since
the CSRxxx instructions can be set or clear, as well as write, the rv32_zicsr logic uses the
dual ports to read-modify-write the registers, so the clear and set functionality can be
implemented on the register values.
The counters are implemented in the rv32_zicsr module, controlled via parameters, as
described in the Configuration Parameters sub-section of this Zicsr Extension section. When
the RT timer is enabled a microsecond counter is implemented to count the appropriate clock
ticks. This is scaled dependent on the CLK_FREQ_MHZ parameter. The instruction retired
counter is incremented on a signal from the ALU, but is removed if disabled, and will read 0.
The exceptions are processed, updating the PC based on type and mode and mtvec value,
and updating the appropriate status, cause and interrupt registers, along with disabling further
interrupts etc.
RV32M Extensions
The RV32M extensions add integer multiplication
and division to the RV32I base specifications, as defined in [1], Chapter 7. In
this design there is a separate module, rv32_m, that implements
the functionality, and may be configured in various modes, depending on
priorities, such as area, speed and availability of hardware DSP multiplication
units (e.g. in an FPGA).
The logic adds the following 8 instructions,
for signed and unsigned integer multiplication and division:
- mul: signed/unsigned multiplication, returning lower 32 bits of result
- mulh: signed multiplication, returning upper 32 bits of result
- mulhu: unsigned multiplication, returning upper 32 bits of result
- mulhsu: signed × unsigned multiplication, returning upper 32 bits of result
- div: signed division, returning 32 bit quotient
- divu: unsigned division, returning 32 bit quotient
- rem: signed division, returning 32 bit remainder
- remu: unsigned division, returning 32 bit remainder
Configuration Parameters
The rv32_m module has three
parameters, available at the rv32i_cpu_core top level to configure how it operates, as listed below:
- RV32_M_EN: default 1 (i.e. enabled)
- RV32M_FIXED_TIMING: default 1 (i.e. no optimization logic; always performs the fixed time calculation)
- RV32M_MUL_INFERRED: default 0 (i.e. does not infer DSP multiplication units, and uses shift logic)
The RV32_M_EN parameter is
used to control whether the RV32M functionality is instantiated or not. By
default it is, but if an implementation does not need the features, then area
can be saved by disabling the module. If the module is enabled, then
performance and area can be traded off using two associated parameters.
Firstly the RV32M_FIXED_TIMING
parameter, when non-zero (the default), will always perform a full calculation
(multiply, divide or remainder) for each relevant instruction, fixing the
calculation time for each instruction, as defined in the Performance
section below. This is the cheapest option, in terms of logic resources, but,
since each instruction only returns 32 bits, to get all 64 bits for a
multiplication, or the division and remainder of a divide, two instructions
must be executed. With the parameter non-zero the calculation will be done
twice for each instruction pair (e.g. mul, mulh or div, rem). When the parameter
is zero, optimization logic is instantiated that monitors for the same
operation (i.e. one of the multiplication instructions or one of the
division/remainder instructions) with the same inputs as the last successful
calculation. If there is a match, the stored result is accessed without the
need for running the entire calculation. Therefore, if a mul instruction
is followed by a mulh, or a div is followed by a rem, with the same input values, then only the first instruction takes
the calculation penalty, whilst the second returns in just two cycles. This is
irrespective of the RS1 and RS2 registers used to source the inputs, or the RD
register indexed to store the result.
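The sketch below illustrates the kind of matching logic implied by this optimisation, holding the last inputs and full result so that a paired instruction can return early. The names and exact structure are assumptions for illustration.

  module rv32m_result_cache_sketch
  (
    input         clk,
    input         calc_done,      // a full calculation has just completed
    input         is_mul_class,   // 1 = multiply group, 0 = divide/remainder group
    input  [31:0] a, b,           // instruction input values (from RS1/RS2)
    input  [63:0] full_result,    // 64 bit product, or {remainder, quotient}
    output        match           // stored result can be returned in 2 cycles
  );

    reg        valid = 1'b0;
    reg        last_class;
    reg [31:0] last_a, last_b;
    reg [63:0] last_result;

    // Store the inputs and full result of the last completed calculation.
    always @(posedge clk)
      if (calc_done)
      begin
        valid       <= 1'b1;
        last_class  <= is_mul_class;
        last_a      <= a;
        last_b      <= b;
        last_result <= full_result;
      end

    // Same operation class with the same inputs: skip the full calculation.
    assign match = valid && (is_mul_class == last_class) &&
                   (a == last_a) && (b == last_b);

  endmodule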
Secondly, the RV32M_MUL_INFERRED
parameter is used to control how the multiplication logic operates. By default
this is set to 0, meaning that the multiplication shift circuit is
instantiated, incurring the penalty in logic resource, and in the calculation
execution time. The benefit is that the logic is vanilla RTL and can be synthesized
to any target. If the parameter is non-zero, then the logic instantiated is a
straightforward C <= A * B; Verilog statement. This mode is intended,
primarily, for FPGA targets that have hardware DSP elements with multipliers, usually
with optional add or accumulate
features. In this case the statement will infer one or more of these elements
(depending on the underlying size of the hardware multipliers). For FPGA
targets with DSP elements this saves logic resources and also speeds up the
multiplication calculation time. Since not all targets (ASIC or FPGA) have DSP
elements, the configuration allows the appropriate logic to be used.
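The selection between the two multiplication implementations might be written along the lines of the sketch below, using a generate block on the parameter. This is illustrative only; the names are assumed, the unsigned case is shown, and the multi-cycle shift-and-add logic itself is omitted.

  module mul_select_sketch
  #(
    parameter RV32M_MUL_INFERRED = 0
  )
  (
    input             clk,
    input      [31:0] a, b,
    output reg [63:0] c
  );

    generate
      if (RV32M_MUL_INFERRED != 0)
      begin : g_dsp
        // Infers one or more hardware DSP multiplier elements where available
        // (unsigned shown; signed variants need $signed handling).
        always @(posedge clk)
          c <= a * b;
      end
      else
      begin : g_rtl
        // The vanilla RTL multi-cycle shift-and-add implementation would be
        // instantiated here (omitted from this sketch).
      end
    endgenerate

  endmodule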
Performance
Below are shown the cycle counts for the
execution of the instruction types, as a function of the parameter settings,
matching of inputs to the last calculation and (for the case of multiplication) the
inference of hardware DSP elements:
- Multiplication with RV32M_MUL_INFERRED = 0: 34 cycles
- Multiplication with RV32M_MUL_INFERRED > 0: 3 cycles
- Multiplication with RV32M_FIXED_TIMING = 0: as above if no match, or 2 cycles if match
- Division and remainder with RV32M_FIXED_TIMING = 0: 34 cycles if no match, or 2 cycles if match
- Exceptions (divide-by-zero or divide signed overflow): 2 cycles
Logic Design
The logic design for the RV32M extensions
houses the execution (phase 3) logic in a separate rv32_m module. The decoder
is extended to do initial decode of the added instructions, and the result is
routed to the ALU. This is shown in the updated block diagram below:
The decoder does a partial decode and flags
to the module over a dedicated interface. The calculation is done as a phase 3
execution, and the result routed to the ALU to send out as the phase 4
writeback.
Extensions to Decoder
As mentioned above, the execution of
multiply and divide instructions is housed in the rv32_m module, but the
decoder is extended, when the RV32_M_EN parameter is non-zero, to decode these instructions. A new
interface is added, prefixed with extm_, which flags to the rv32_m module when
an RV32M instruction is decoded (extm_instr), which of the eight instructions
it is (extm_funct) and the RD destination register the result is destined for (extm_rd). These
latter two values are simply extracted from already decoded and registered
values, as the values occupy the same location in the instruction.
For the extm_instr flag, an
internal signal is added, mul_div_instr, which decodes the instr_reg value to uniquely identify the RV32M group of instructions, if RV32_M_EN is
non-zero, and this is then registered as extm_instr. The mul_div_instr
signal is also used in the decoder to ensure that the ALU doesn't perform any
calculation, by setting the no_writeback signal. The rv32_m module has the decoder's A and B outputs routed to it as its data
inputs, along with their matching register indexes (for bypass feedback
purposes). By default, the A input selects RS1, which is what is required, but
the B default selection is the immediate value, whereas RS2 is required. So the mul_div_instr
signal is also used to update the b and b_rs_idx decoder outputs to be the values for RS2.
Note that the decoding to the individual
multiply or divide instructions is done in the rv32_m module and not in
the decoder. This minimizes the changes required to the common decoder, and eases
including or excluding the relevant logic with the RV32_M_EN
parameter.
Extensions to ALU
Some small extensions to the ALU were
needed for the RV32M functionality. As mentioned before, in order not to
introduce additional multiplexing on the RD update to the register file
(as was done for Zicsr), which is a critical path on the feedback bypassing for
all modules, the outputs of the rv32_m are routed to the ALU to re-use its c and rd outputs.
This still adds a selection to the critical path, namely from rd updates to
the ALU result (c) output, but experimentation showed that this gave the most
favourable result.
A new interface (with extm_ prefixes
on the signals) is added with a flag indicating that the rv32_m module
has an active result (extm_update_rd), the RD destination register index (extm_rd_idx), and the
value being written (extm_rd_val). When the update signal is flagged, the internal next_rd signal
is mux'ed to the extm_rd_idx value. This takes priority over the stall signal, as this is set
when the rv32_m module is active to prevent the ALU advancing whilst the
multi-cycle calculations are proceeding. Then, in the synchronous process, the
ALU's c output is updated (again as a priority) with the extm_rd_val if
the extm_update_rd signal is active.
Note that the ALU does not have any
parameters associated with the RV32M functionality. The extm_update_rd
and ext_rd signals in the rv32i_cpu_core are tied off to 0 when RV32_M_EN is
zero. In this case, the synthesis process will optimize away the logic in the
ALU associated with the multiplication and division functionality.
rv32_m Module
The rv32_m module implements the
multiplication and division functionality, based on shift registers doing shift
and add (for multiplication) and long division. These take a fixed number of
cycles (see Performance section above), with the calculations always
being done on unsigned values, and a cycle used at either end to convert the
inputs to unsigned values and the result back to a signed value, as appropriate.
This is not necessarily the most efficient way of doing this, and the two sign
conversion cycles might be saved with more complex logic. The decision was
made, however, to keep with the conversion implementation for the sake of
clarity and simplicity in implementation. This aligns the core with similar
FPGA softcores, such as the LatticeMico32 [3] from Lattice Semiconductor,
which takes 34 cycles for its division instructions, or the NIOS II [4], from
Intel, which takes 35 cycles.
When converting the division result sign
the rule followed is that the dividend and the remainder have the same sign,
whilst the result (quotient) is negative if the signs of the inputs (dividend
and divisor) differ.
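Expressed as a small sketch (with assumed names), applying this rule to an unsigned division result looks as follows:

  module div_sign_sketch
  (
    input  [31:0] dividend, divisor,      // original signed inputs
    input  [31:0] uquotient, uremainder,  // results of the unsigned division
    output [31:0] quotient, remainder
  );

    wire neg_q = dividend[31] ^ divisor[31];  // signs differ => negative quotient
    wire neg_r = dividend[31];                // remainder follows the dividend's sign

    assign quotient  = neg_q ? (~uquotient  + 32'd1) : uquotient;
    assign remainder = neg_r ? (~uremainder + 32'd1) : uremainder;

  endmodule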
For divide-by-zero and overflow conditions,
the logic detects these at the start of the calculation, and sets the outputs
according to Table 7.1 in section 7.2 of [1]. In these cases the calculation is
terminated, with the result returned after 2 cycles.
Verification
The verification strategy of the design follows that of the
instruction set simulator, using the riscv-tests
supplied by RISC-V International, which are categorised into tests for the
various base specifications and extensions. They are self-checking assembly
code, and thus require a minimum amount of functionality in order to run
correctly. These tests, in turn, rely on the riscv-test-env
repository.
Some 'turn-on' test code was written in order to get to a
useful standard, but this was not self-checking and is superseded by the
riscv-tests code, and so is not part of the repository.
An example simulation test environment is provided to run
the riscv-tests, as well as a synthesis environment to build for the DE10-nano
board, with the ability to run these same tests. The following sections
describe these environments and how they are used.
Simulation
The CPU core simulation environment can be found in the
repository in the folder HDL/test.
This contains all the files to compile and run a simulation for the core.
Compiling and Running the Simulation
Prerequisites
A provided makefile
allows for compilation and execution of the example simulation tests. The
current support for simulation is only for ModelSim at this time, and has some
prerequisites before being able to be run. The following tools are expected to
be installed:
- ModelSim Intel FPGA Starter Edition. Version used is 10.5b,
bundled with Quartus Prime 18.1 Lite Edition. The MODEL_TECH environment variable must be set to point to the
simulator binaries directory
- MinGW and Msys 1.0: For the tool chain and for the utility
programs (e.g. make). The PATH
environment variable is expected to include paths to these two tools'
executables
The versions mentioned above are those that have been used
to verify the design, but different versions will likely work just as well. If the MODEL_TECH environment variable is set
correctly, and the MinGW and Msys bin
folders are added to the PATH, then the makefile should just use whichever versions
are installed.
Compiling Test Code
Before running a simulation, a test file must be created to load
into the instruction memory. By default, the memory expects a memory
initialisation file named test.mif.
This will be generated from one of the tests in the riscv-tests
repository. The riscv-tests
repository is expected to be checked out in the same directory as this
repository, as well as the accompanying riscv-test-env
repository, which riscv-tests relies
on.
A make file (makefile.test)
is provided to compile a test to an executable and then convert this to a
memory initialisation file (.mif).
There is a program supplied for the ELF to .mif
conversion, which is in the test folder and called elf2vmif.cpp. The make file takes care of building this if it
hasn't been built already.
The make file itself is executed with some arguments to
override variables to select a test to be compiled and the sub-directory in riscv-tests/isa where it is located. For
example, from a command window, and in the test folder, one can execute, say:
make -f makefile.test SUBDIR=rv32ui FNAME=addi.S
This will compile the addi.S
test to a file in the test folder called addi.exe.
It then converts this to the file test.mif.
If a simulation is now run, the compiled code will be loaded into the
instruction memory before execution begins.
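As a further example, assuming one of the RV32M tests is to be built, a test from the rv32um sub-directory can be selected in the same way:

make -f makefile.test SUBDIR=rv32um FNAME=mul.S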
Running a Simulation
The main make file (makefile)
is used to compile and then run a simulation, either as a batch simulation or a
GUI simulation. Typing make help displays
a usage message describing the available options.
Giving no arguments compiles the Verilog code, and the compile target
does the same thing. The various execution modes are for batch, GUI and batch
with logging. The batch run with logging assumes there is a wave.do file (generated from the waveform
viewer) that it processes into a batch.do
with signal logging commands. When run, a vsim.wlf
file is produced by the simulator, just as it would when running in GUI mode,
with data for the logged signals. The make wave
command can then be used to display the logged data without running a new
simulation. Note that the batch.do
generation is fairly simple at this point, and if the wave.do is constructed with complex mappings the translation
may not work. The make regression
command runs all the regression tests, as per the run32i_tests.bat script. The make clean
command deletes all the artefacts from previous builds to ensure the next compilation starts
from a fresh state.
The riscv-tests
exit with the gp register containing
a test number in bits 1 upwards, with the bottom bit set. If the test number is
0 (and the bottom bit is set), the test has passed. Otherwise, the test
number identifies the first test that failed (the test code halts after the first failure).
At the end of the test, the test bench inspects this value and prints a message
for pass or fail, with some information about the state.
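A sketch of this check is shown below. It is not the actual tb.v code: gp_val and halted are placeholder names for the probed GP register value and the detected halt condition, and the message text is only indicative.

    // Exit convention check (illustrative only): bit 0 of gp marks completion,
    // and gp[31:1] holds 0 for a pass, or the number of the first failing test.
    wire        test_done = gp_val[0];
    wire [30:0] test_num  = gp_val[31:1];

    always @(posedge clk)
      if (halted && test_done)
      begin
        if (test_num == 31'd0)
          $display("Test PASSED");
        else
          $display("Test FAILED (test number %0d)", test_num);
      end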
All of the rv32ui
tests can be run as a regression using the run32i_tests.bat
batch file. This goes through all of the RV32I unprivileged instruction tests,
compiling them, running a batch simulation, and extracting the PASS/FAIL
messages into a test log (test.log),
which is printed to the screen at the end for inspection for failures.
Test Bench
The test bench (tb.v)
is a straightforward Verilog module, in a traditional format. The top level
core is instantiated as the UUT, and a clock and reset are generated locally
and driven to the core. Little else needs to be connected to the core, and the
main other inputs to the design are the parameters. The test bench module has
parameters, some of which are those of the instantiated top level core,
allowing them to be overridden when running a test. The diagram below shows the
general structure of the test bench and instantiated core.
The core has an optionally instantiated test block for
monitoring the imem read bus, and
halting the core when certain conditions are met, as defined by the control
register bits in the local registers. The test bench sets the RV32I_INCL_TEST_BLOCK parameter to 1, so
that this block is present during simulations.
The core has an Avalon memory mapped slave bus for reads and writes to registers/memory
from an external source. In normal usage, this is mainly for loading a program to instruction
memory, and then taking the core out of reset to execute the program. In simulation, the
program is loaded from an initialisation file to save wasting time writing to memory during
simulation. Since it is instantiated, the core's test block is configured using other registers to
select halt conditions (on unimplemented instruction reads, an ecall instruction and/or specific
address reads). When a halt condition occurs, the core will set a signal which can be read via
the registers. The registers in the core are defined as shown below:
Some probes into the core are made from the top level to extract signals for control of
execution: in particular, the GP register (x3) and the PC, along with the test_halt signal
from the test block. The test bench logic monitors these signals and determines when a halt
condition occurs. It is when the test bench detects a halt condition that the test status is inspected
and a PASS/FAIL message is displayed.
In addition, the software interrupt pin can be controlled via a register, and the timer update port
of the rv32i_cpu_core is hooked up to registers to allow its update, along with the comparator,
for test purposes.
Core Parameters
The core has various parameters which configure the logic.
In the test bench, these are configured in such a way as to allow the riscv-tests to be run. The table below
shows the parameters, their default values and the values set by the test
bench:
Core parameter name        | rv32i_cpu_core? | Default      | Test Bench   | TB parameter name
CLK_FREQ_MHZ               | yes             | 100          | 100          | CLK_FREQ_MHZ
RV32I_RESET_VECTOR         | yes             | 32'h00000000 | 32'h00000000 | RESET_ADDR
RV32I_TRAP_VECTOR          | yes             | 32'h00000004 | 32'h00000004 | TRAP_ADDR
RV32I_LOG2_REGFILE_ENTRIES | yes             | 5            | 5            | LOG2_REGFILE_ENTRIES
RV32I_REGFILE_USE_MEM      | yes             | 1            | 1            | REGFILE_USE_MEM
RV32_ZICSR_EN              | yes             | 1            | 1            | ZICSR_EN
RV32_DISABLE_TIMER         | yes             | 0            | 0            | -
RV32_DISABLE_INSTRET       | yes             | 0            | 0            | -
RV32_M_EN                  | yes             | 1            | 1            | M_EN
RV32M_FIXED_TIMING         | yes             | 1            | 1            | M_FIXED_TIMING
RV32M_MUL_INFERRED         | yes             | 1            | 1            | M_MUL_INFERRED
RV32I_IMEM_ADDR_WIDTH      | no              | 16           | 16           | -
RV32I_DMEM_ADDR_WIDTH      | no              | 16           | 16           | -
RV32I_IMEM_INIT_FILE       | no              | "UNUSED"     | "test.mif"   | -
RV32I_DMEM_INIT_FILE       | no              | "UNUSED"     | "UNUSED"     | -
RV32I_ENABLE_ECALL         | no              | 1            | 0            | -
RV32I_IMEM_SHADOW_WR       | no              | 0            | 1            | IMEM_SHADOW_WR
RV32I_INCL_TEST_BLOCK      | no              | 0            | 1            | INCL_TEST_BLOCK
The table shows which parameters are also part of the rv32i_cpu_core parameters, and
which configure the top level core only. Even though the test bench will give default values to
all the core's parameters, as shown, only some are exposed at the test bench top level. These
are shown in the rightmost column with the equivalent test bench parameter name. Thus,
only these parameters can be altered from the command line when simulating.
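As an illustration only (the makefile's own mechanism for passing these overrides is not reproduced here), a parameter such as M_EN could be overridden on a direct simulator invocation using ModelSim's -g option, assuming the compiled test bench top level is tb in the work library:

vsim -c -gM_EN=0 work.tb -do "run -all; quit"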
Results
The simulations were run for each of the riscv-tests under
the rv32ui folder, that is, the unprivileged instruction tests for the RV32I base
specification. The table below shows the tests that have been run and their
status.
sub-test folder | test     | status
rv32ui          | simple.S | Passed
rv32ui          | add.S    | Passed
rv32ui          | addi.S   | Passed
rv32ui          | and.S    | Passed
rv32ui          | andi.S   | Passed
rv32ui          | auipc.S  | Passed
rv32ui          | beq.S    | Passed
rv32ui          | bge.S    | Passed
rv32ui          | bgeu.S   | Passed
rv32ui          | blt.S    | Passed
rv32ui          | bltu.S   | Passed
rv32ui          | bne.S    | Passed
rv32ui          | jal.S    | Passed
rv32ui          | jalr.S   | Passed
rv32ui          | lb.S     | Passed
rv32ui          | lbu.S    | Passed
rv32ui          | lh.S     | Passed
rv32ui          | lhu.S    | Passed
rv32ui          | lui.S    | Passed
rv32ui          | lw.S     | Passed
rv32ui          | or.S     | Passed
rv32ui          | ori.S    | Passed
rv32ui          | sb.S     | Passed
rv32ui          | sh.S     | Passed
rv32ui          | sll.S    | Passed
rv32ui          | slli.S   | Passed
rv32ui          | slt.S    | Passed
rv32ui          | slti.S   | Passed
rv32ui          | sltiu.S  | Passed
rv32ui          | sltu.S   | Passed
rv32ui          | sra.S    | Passed
rv32ui          | srai.S   | Passed
rv32ui          | srl.S    | Passed
rv32ui          | srli.S   | Passed
rv32ui          | sub.S    | Passed
rv32ui          | sw.S     | Passed
rv32ui          | xor.S    | Passed
rv32ui          | xori.S   | Passed
These are all the relevant tests for the base RV32I
implementation. As further extension specifications are added, the additional
tests will be run, as for the instruction set simulator.
Tests for Zicsr Extensions
The simulations were run for each of the riscv-tests under
the rv32mi folder, that is, the machine-mode instruction tests for the RV32I specification.
The table below shows the tests that have been run and their status.
sub-test folder | test         | status
rv32mi          | breakpoint.S | Not yet supported†
rv32mi          | csr.S        | Passed
rv32mi          | illegal.S    | Passed
rv32mi          | ma_addr.S    | Passed
rv32mi          | ma_fetch.S   | Passed
rv32mi          | mcsr.S       | Passed
rv32mi          | sbreak.S     | Passed
rv32mi          | scall.S      | Passed
rv32mi          | shamt.S      | Passed
NOTES:
† Test relies on debug CSR registers to be implemented,
which they currently aren't in this model.
Tests for RV32M extensions
The simulations were run for each of the riscv-tests under
the rv32um folder, that is, the tests for the RV32M extension
specification. The table below shows the tests that have been run and their
status.
sub-test folder | test      | status
rv32um          | mul.S     | Passed
rv32um          | mulh.S    | Passed
rv32um          | mulhu.S   | Passed
rv32um          | mulhsu.S  | Passed
rv32um          | div.S     | Passed
rv32um          | divu.S    | Passed
rv32um          | rem.S     | Passed
rv32um          | remu.S    | Passed
Synthesis
An example target platform environment is provided,
targeting the
DE10-nano board (shown below). Files are provided for the top
level core to be instantiated in a Platform Designer infrastructure (.qsys file) and the setup and constraints
for a Cyclone V BA6 device synthesis, suitable for the chosen platform. These
files can be found in the HDL/de10-nano/build
folder.
The environment assumes that the Quartus Prime tools have been installed on the
host machine. The code has been tested on Quartus Prime Lite 18.1, but later
versions should work just as well.
Core Component
The core logic is specific to the DE10-Nano target platform.
It instantiates the rv32i_cpu_core and
two Intel/Altera on-chip RAM memories. The source code is located in the HDL/de10-nano/src folder, along with a top.v file that is the synthesis top level
for the target FPGA. The core has ports for all the possible interfaces of the
DE10-Nano, but in this example most of these are not used.
The Quartus tools have been used to create a Platform
Designer component from the core block that can be instantiated in the top_system, along with the Hard Processor
System (HPS) and some other required peripherals. It is the top_system that is instantiated in the top.v code, and which contains the core component. In the same folder as the core.v component file, the core_hw.tcl file defines this component for use
in Platform Designer, listing all the files required and other
attributes. The core parameters can be set in Platform Designer by editing the
instantiated core; a window appears, as shown below, with the main
interfaces and the parameters available for altering.
Command Line Build
The design can be synthesised using the Quartus GUI tools in
a standard manner, but a make based
command line build option is provided. In the build directory a makefile can be used to synthesise the
design by simply typing make from a
command window within the build folder.
When this is done, the script will construct the Platform
Designer top_system from the top_system.qsys file, and then proceed
through the synthesis build steps. When complete, the timing report is
scanned for violations, and any found are printed out for inspection.
The reports and FPGA image files are all placed
in an output_files folder. Both a top.sof and a top.rbf file are produced for use, as appropriate, to
configure the FPGA.
Results
Timing
The design was constrained for the target clock frequency of
100MHz via the Platform Designer's instantiated PLL components, with outclk0 being this frequency and connected
to the clk input of the core. No
timing violations were reported and the estimated fmax frequency for the slowest corner was ~117.6MHz when compiled for RV32I only, reducing to 108.0MHz
when the Zicsr extensions are enabled.
Resource Usage
The table below shows the achieved resource usage for the synthesis of the base RV32I
implementation, showing both the actual number of ALMs used and an estimate of the logic
element (LE) equivalent, based on a synthesis run targeting a Cyclone 10LP device
(10CL010YU256C6G). The number of M10K RAM blocks used is also given.
module              | ALMs (~LEs) | M10Ks
rv32i_cpu_core      | 6 (31)      | 0
rv32i_decode        | 133 (412)   | 0
rv32i_alu           | 480 (1055)  | 0
rv32i_regfile (RAM) | 64 (156)    | 2
Totals              | 683 (1654)  | 2
When the
Zicsr extensions are added, the results become as shown in the table below:
module              | ALMs (~LEs) | ALMs (~LEs), timer & instret disabled | M10Ks
rv32i_cpu_core      | 37 (81)     | 37 (80)                               | 0
rv32i_decode        | 174 (458)   | 170 (465)                             | 0
rv32i_alu           | 506 (1120)  | 506 (1102)                            | 0
rv32i_regfile (RAM) | 81 (218)    | 81 (218)                              | 2
rv32_zicsr          | 451 (996)   | 314 (687)                             | 0
Totals              | 1249 (2873) | 1108 (2552)                           | 2
When the RV32M extensions are added, on top
of the RV32I base and Zicsr extensions, the results become as shown in the
table below, with RV32M_FIXED_TIMING set to zero and RV32M_MUL_INFERRED set to non-zero in the first results column, and RV32M_FIXED_TIMING set to non-zero
and RV32M_MUL_INFERRED set to zero in the second. The DSP column gives the DSP
multiplier counts for the two cases, with the Cyclone 10LP counts in brackets.
module              | ALMs (~LEs), MUL_INFERRED | ALMs (~LEs), FIXED_TIMING | M10Ks (M9K) | DSP (C10LP)
rv32i_cpu_core      | 37 (85)                   | 37 (85)                   | 0           | 0/0 (0/0)
rv32i_decode        | 168 (457)                 | 166 (457)                 | 0           | 0/0 (0/0)
rv32i_alu           | 551 (1276)                | 534 (1275)                | 0           | 0/0 (0/0)
rv32i_regfile (RAM) | 79 (218)                  | 75 (219)                  | 2           | 0/0 (0/0)
rv32_zicsr          | 456 (963)                 | 458 (965)                 | 0           | 0/0 (0/0)
rv32_m              | 306 (617)                 | 262 (649)                 | 0           | 3/0 (8/0)
Totals              | 1597 (3616)               | 1532 (3650)               | 2           | 3/0 (8/0)
Platform Testing
Testing on the example platform is based
around repeating the riscv-tests appropriate to the build: at this time, the rv32ui tests, the rv32mi tests if the Zicsr extensions are enabled, and the
rv32um tests if the RV32M extensions are enabled. In order to do this, the DE10-Nano FPGA needs to be
configured with the image generated in synthesis (see the Command Line Build
section above). Some software is also required to load a program to the
internal memory and then run a test.
The core component is configured to
instantiate the test logic (see the diagram in the Test Bench section above), which
monitors for configured halt conditions. It is controlled via local core
registers, accessed over a memory mapped control and status register bus (CSR),
and it also makes available the halt status and the value of the GP register (see
the register table in the Test Bench section above). The test software accesses
these registers to load a program, bring the core out of reset and then monitor
for a halt status. It then determines the exit status and prints a PASS or FAIL
message. This mimics the running of the tests in the simulation test bench, and
the core parameters are set in synthesis to be the same as for simulation,
apart from RV32I_IMEM_INIT_FILE, which is left at the default of "UNUSED", as the code will be loaded by the test program. This is shown in the diagram
in the Core Component section above.
Loading the FPGA Image
This document cannot go through a tutorial
on the use of the Quartus tools, and the procedures are specific to the Intel
FPGAs and the DE10-Nano board. It is assumed that familiarity with these
tools, or their equivalents if targeting a different platform, is sufficient
without further discussion. A brief summary is given here for the example platform
to indicate the steps required.
The DE10-Nano, for these tests, is running a
console version of Ångström Linux. At the time of writing, this image can
be found here. With this setup, there are two ways to load an FPGA image on the
DE10-Nano: over USB-Blaster JTAG, or by copying an image to the FAT32 partition so
that it is loaded at boot time. The latter needs the FAT32 partition mounted
(e.g. /mnt/fat32), and the generated RBF (top.rbf) copied to that folder and named soc_system.rbf.
The advantage of this is that the FPGA will be configured with the image every
time the board is powered on.
Here we will concentrate on loading over
the USB-Blaster. From the Quartus Prime GUI, the Programmer window can be
launched. If connected correctly, the tool will detect the connection. If
connecting for the first time, the Auto Detect button should be pressed to inspect
for the connected FPGA, but a top.cdf file is provided which already defines this chain.
The diagram above shows a Programmer
window with the correct chain configuration, and the top.sof file selected as
the configuration image. Programming is then simply a matter of pressing the
Start button. Programming this way means the image will be lost at power down,
but it allows the image to be updated, without a reboot, when developing the
logic.
Compiling the Test Program
The HDL/de10-nano/test folder
contains the source files to compile a Linux ARM program to run the riscv-tests/rv32ui tests on the DE10-nano platform.
A makefile exists to build the test program. On
a Windows host, it assumes that the following tool chain is installed at the
given location:
c:\Tools\gcc-linaro-4.9.4-2017.01-i686-mingw32_arm-linux-gnueabihf
If a toolchain is installed elsewhere, then
the TOOLPATH variable should be modified, either on the command line or by
updating the makefile. If no toolchain is available, a compiled version of the code
is available in the folder (main.exe), but modifications to the program will not be possible.
Alternatively, if the Linux running on the DE10-Nano has the GNU gcc toolchain
installed, the program may be compiled on the platform. At the time of writing, the
toolchain used in testing is available here.
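For example, the toolchain location can be overridden when invoking make (the path below is a placeholder for wherever the toolchain is installed):

make TOOLPATH=<path to arm-linux-gnueabihf toolchain>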
Uploading the Test Code
The main.exe executable should
be copied to a suitable folder on the DE10-Nano. How this is done depends on
the particular setup. For this development, the DE10-Nano was connected to the
local network (with a fixed MAC address; otherwise the allocated IP address must be discovered), and MobaXterm was used to connect to the board as root. The diagram below
shows a MobaXterm session and a single test run:
When run on the platform, the main.exe
program expects a RISC-V executable file called test.exe, which it will
load to the internal memory and then execute. When the internal test block has
halted, it will extract the state and print a PASS or FAIL message, as shown
above.
Running the Tests
A script (run_tests.sh) is also
provided to run the riscv-tests/rv32ui tests as a regression. A test/ folder should be located in the same
folder as the main.exe and the script, and contain all the compiled rv32ui tests.
The script will run each executable in turn, and log the output in test.log. When
complete all the PASS/FAIL messages are dumped to the screen for inspection.
All the tests that were run in simulation (see the table in the Results sub-section
of the Simulation section) have been run on the current version of the core,
and all pass, verifying the logic on the example platform.
Download
The source code is available for download on GitHub.
As well as the source code, files are provided to support simulation with ModelSim and
synthesis targeting the DE10-Nano development board with a Cyclone V BA6 FPGA device.
Scripts to compile and run the riscv-tests suite are also bundled in the repository.
References
[1] The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation, December 2019.
[2] The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Document Version 20190608-Priv-MSU-Ratified, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation, June 2019.
[3] LatticeMico32 Processor Reference Manual, Lattice Semiconductor, June 2012.
[4] Nios II Processor Reference Guide, Version 2020.10.22, Intel, October 2020.