Wyvern Semiconductors Expert in Digital IP Solutions  
 
 
Articles
 
Article 1: JPEG Concepts and Format
Article 2: JPEG Decoder Implementation
Article 3: JPEG IP Development Process


JPEG Implementation

by Simon Southwell
7th February 2014


 

Abstract

This article is the second paper discussing JPEG topics, and documents an example software and hardware decoder implementation using the concepts outlined in the previous article. A software C model implementation is discussed first, which is ultimately to be used as a reference in testing a Verilog RTL implementation, targeted at, though not limited to, an FPGA. The internal hardware design is discussed in detail, and conclusions drawn.


Introduction

In this article we will look at the implementation of a JPEG decoder in order to solidify the concepts discussed in the previous article. The development of the implementation discussed takes a two step approach. First, a C model (with a C++ core) is implemented, but with an architecture that is hardware-implementation friendly. A C model is easier and quicker to implement than an RTL hardware implementation, and will execute much faster than an RTL simulation. Therefore the C model can be tested with a much larger set of test data for a given test period. The C model then becomes a specification and test reference for the second step of an RTL implementation.


Both the C model and the RTL implementations will be discussed. The C model will be described in less detail than the hardware implementation, as the hardware is the final goal, but a general understanding is still required in order to explain the details of the RTL. The RTL design will be explained in more detail, so that the Verilog source code is fully understood as a piece of IP that can be reused and further developed.


Decoder Implementation for Software Model

The software model for the decoder covers all functionality that will be required in the hardware implementation, plus some peripheral code for a command line executable program interface, debugging information etc. As a whole the model serially processes an input JPEG file and outputs a bitmap file. This basic operation is further broken down into processing the header information, and then looping through the MCUs, decoding them, placing the bitmap data in a buffer and outputting the results. The core is based around a class 'jfif', with a single public method (jpeg_process_jfif()). The class header is defined in jfif_class.h, and the methods are defined in jfif.cpp. The jfif class inherits a base class, jfif_idct, that provides the inverse DCT functionality. This is defined in jfif_idct.cpp and jfif_idct.h. The basic calling structure of the code is shown below in the form of pseudo-code. The function names shown match the actual jfif class methods, and the calling hierarchy is as it is in the model.


   jpeg_process_jfif()             -- Converts input baseline DCT JFIF buffer to 24 bit bitmap
       jpeg_extract_header()       -- Extracts jpeg header information (scan, frame, quantisation/huffman tables, DRI)
           jpeg_dht()              -- Constructs a Huffman decode structure from huffman table data
       jpeg_bitmap_init()          -- Creates space for appropriately sized bitmap and initialises header data
   
      LOOP for each MCU:            -- jpeg_process_jfif() processes an MCU at a time until EOI (or error)
       jpeg_huff_decode()          -- Gets an adjusted Huffman/RLE decoded amplitude value and ZRLs, or marker or end-of-block
           jpeg_dht_lookup()       -- Does huffman lookup on code and extracts amplitude data, or flags a marker
               jpeg_get_bits()     -- Gets top bits from barrel shifter and (optionally) removes. Pulls in extra input if needed.
               jpeg_amp_adjust()   -- Adjusts decoded huffman decoded amplitude to +/- amplitude value
           <dequantise>            -- De-quantisation done in jpeg_huff_decode directly from selected table
       jpeg_idct()                 -- Inverse discrete cosine transform
       jpeg_ycc_to_rgb()           -- Converts YCbCr to RGB on an MCU
       jpeg_bitmap_update()        -- Updates bitmap data buffer with 8x8 RGB values
     ENDLOOP

     return pointer to bitmap


The top level entry point for JFIF decoding is the jpeg_process_jfif() method. This has a very simple interface. The first argument is simply a pointer to a buffer of uint8 bytes (*ibuf), containing the JFIF data to be decoded. The JFIF data is expected to start in the first byte and continue consecutively for the whole image. No assumptions are made by the jpeg_process_jfif() method about the size of the file, or dimensions of the image etc.—all this is calculated from the header.


The second argument is a pointer to a uint8 pointer (**obuf). The pointer referenced by the passed in pointer will be updated to point to a buffer containing the decoded bitmap data (assuming no errors). The memory required for the bitmap is dynamically allocated by jpeg_process_jfif(), and it is the responsibility of the calling process to free this memory when it is no longer required.


The function returns one of several return codes, defined in jfif.h:


  • JPEG_NO_ERROR : Decode completed without error
  • JPEG_FORMAT_ERROR : A format error was encountered when parsing the header
  • JPEG_MEMORY_ERROR : A memory allocation failure occurred
  • JPEG_UNSUPPORTED_ERROR : A valid, but unsupported, format was encountered

Within jpeg_process_jfif() the method calls jpeg_extract_header() to parse the JFIF header, and return various pointers to the header information, such as SOS data (scan_header_t *), SOF0 data (frame_header_t *) and the reset interval. With this data jpeg_bitmap_init() is called, which allocates a buffer for bitmap data (bmp_ptr) and generates a bitmap header for a 24 bit RGB image (even if the image is greyscale). It also calculates, from the header data, the number of MCUs to process (total_mcus). The main body of the method is then a while loop that processes MCUs (and markers) one at a time until an EOI marker is encountered, calling jpeg_huff_decode() to return either a marker or an MCU consisting of an 8 by 8 array of adjusted amplitude values (scan_data_ptr). The only markers expected to be returned during scan data processing are RSTn and EOI, otherwise an error is returned. Some checks are done on the sequence of RSTn markers received, though the actual resetting of DC values is handled by lower level functions. The scan_data_ptr data is then inverse discrete cosine transformed (in jpeg_idct()) and then, if a colour image, converted to RGB (jpeg_ycc_to_rgb()). The resulting RGB data is then sent to jpeg_bitmap_update() for construction into the bitmap buffer. When all MCUs are processed, **obuf is updated to point to the bitmap data, and jpeg_process_jfif() returns with a status.


Header extraction

JFIF header extraction is performed by the jpeg_extract_header() method. This function is based around a single "switch" statement that parses all valid marker types expected before the SOS marker, though it flags an error if any of those markers are from an unsupported file format. The JFIF format mandates some ordering of segments delineated by the markers but, for robustness, the jpeg_extract_header() method only expects the first item to be an SOI marker, and that an SOS marker marks the end of the header information. Any other segment can come in any order, whatever the JFIF specification might say. It is known that not all images follow the JFIF format completely strictly.


All markers fall into one of two categories: those that have no following data (e.g. SOI) and those that do (e.g. APP0). The latter are always followed by 2 bytes of length, which dictates the total number of following bytes (including the 2 byte length) of the segment. The main markers of interest to the jpeg_extract_header() method are SOI, SOF0, SOS, DHT, DQT, APP0 and APP14. The SOI has no information, but marks the start of the image, and is expected first, otherwise an error is returned. The SOF0 segment contains data for the image dimensions, whether colour or greyscale, and a set of values determining sub-sampling (if any), quantisation table selections and component ID—one for each component of a set (see Table 4 in the previous article). A pointer to a frame_header_t structure is set to the input buffer location 2 bytes from the SOF0 marker (to skip the length field) and returned in the **fptr argument. Note that all other start of frame markers (SOF1 to SOF15) are for unsupported formats, and the method returns an error if they are encountered.
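For illustration, the fixed fields at the start of the SOF0 segment body (after the 2 byte length) can be read as shown below. This is a hypothetical sketch using the field layout from the JPEG standard; the names FrameInfo and read_sof0 are illustrative and not the model's actual frame_header_t identifiers. Per-component parameter bytes follow these fields.

```cpp
#include <cstdint>

// Read the fixed fields of an SOF0 segment body: sample precision byte,
// then big-endian Y (height) and X (width), then the component count.
struct FrameInfo { int precision; int height; int width; int ncomp; };

FrameInfo read_sof0(const uint8_t* p)
{
    FrameInfo f;
    f.precision = p[0];                 // always 8 for baseline DCT
    f.height    = (p[1] << 8) | p[2];   // JPEG multi-byte fields are big-endian
    f.width     = (p[3] << 8) | p[4];
    f.ncomp     = p[5];                 // 1 (greyscale) or 3 (colour)
    return f;
}
```

This also shows why the model byte-swaps the 16 bit X and Y fields: they arrive big-endian in the stream.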


The start of scan segment (SOS marker) has information for each component, defining a component selector, a DC Huffman table selector and an AC Huffman table selector. Unlike for SOF0, we cannot simply overlay a structure pointer on the input buffer, as the variable length part of the segment (i.e. the component specific fields, of which there may be 1 set or 3 sets) is followed by another fixed section, whose offset is dependent on the number of component data sets. Therefore some buffer space is allocated, the data is extracted into this buffer, and a pointer of type scan_header_t is returned (*sptr) that points to this extracted data.


The Huffman table segment (DHT) requires a bit more processing, and so a sub-function (jpeg_dht()) is called to process segments of this type, returning a pointer of type DHT_offsets_t which, in turn, is returned by jpeg_extract_header(). The aim of the jpeg_dht() method is to construct a Huffman table with as few bits of information as possible. This information consists of the 16 lengths of the segment (see Figure 7 in the previous article), the offsets within the Vn data where the different bit width codes start, and the end codes for the different bit widths. Note that data for bit width n is placed at index n-1 in the data arrays, so, for example, bit width 3 data is at index 2.


As discussed in the previous article, this is all that is needed to decode the Huffman encoded data. If the input data bits are less than a certain bit width's end code, but bigger than the next smallest width's end code, then that is the number of bits of the code. Doubling the end code of the next smallest bit width gives the start code of the matching bit width; subtracting this from the input data bits gives the offset into the entries for that bit width, and hence the decoded value. Taking the example of Figure 5 from the previous article, we would have a DHT_offsets_t structure filled in as shown:


  • *Ln[] (points directly to Ln values in dht buffer, overlaid on the input data buffer, buf[])
    • 0, 1, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  • *vmn_offset[n] (array of pointers to dht[] indexes)
    • 0 -> NULL
    • 1 -> dht[0] : 01
    • 2 -> dht[1] : 00, 07, 04, 05
    • 3 -> dht[5] : 21, 31, 32
    • others -> NULL
  • row_break_codes[n]
    • 0 = 0b
    • 1 = 01b
    • 2 = 110b
    • 3 = 1111b
    • others don't care

If the input data was a Huffman encoded run-length, and presented, say, ..010101011, then the first bit (using lsb) is bigger than row_break_codes[0] (for 1 bit width), so this is not a match. The first two bits are 11b, which is bigger than row_break_codes[1], but the first three bits are 011b which is less than row_break_codes[2], meaning we have a 3 bit code. Taking the end code for 2 bits (row_break_codes[1] = 01b) and doubling it gives 010b, or 2. Subtracting this from the 3 bits of the input is 011b - 010b = 001b, or 1. The 3 bit vmn_offset pointer (vmn_offset[2]) points to dht[1], and the value at offset 1 from this is 07h, which is hence the decoded value—i.e. the variable code 011b of the input has a byte value of 07h. The jpeg_dht() function returns a pointer to a DHT_offsets_t structure with the values filled in for the Huffman table, having allocated some memory space for the table. The Tc and Th values are extracted from the input, and the method loops through each bit width (smallest to largest), filling in the table entries based on the Ln lengths and Vn values in the DHT segment.


The quantisation table segments (DQT) define the quantisation table values to be used for the different classes (DC or AC) of data, and IDs (which component). Following the 2 byte length are one or more 65 byte value sets. The first byte consists of two nibbles specifying the Pq (always 8) and Tq (destination) values. Then follow the 64 bytes for the 8x8 quantisation table values. These are extracted into the appropriate table (indexed by the Tq value) pointed to by *qptr, of type DQT_t. The values are turned into integer values at this point, as they are pre-scaled with values determined by the Arai et al. method (see [1]), allowing the saving of a multiplication at that point.
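A sketch of parsing one such 65 byte value set might look like this. The names DqtSet and parse_dqt_set are hypothetical, and the pre-scaling for the fast iDCT is omitted here (values are copied raw):

```cpp
#include <cstdint>

// Parse one DQT value set: a Pq/Tq byte followed by 64 table values.
struct DqtSet { int pq; int tq; int q[64]; };

const uint8_t* parse_dqt_set(const uint8_t* p, DqtSet& out)
{
    out.pq = (p[0] >> 4) & 0xf;   // precision nibble (0 = 8-bit, for baseline)
    out.tq =  p[0]       & 0xf;   // destination table ID
    for (int i = 0; i < 64; i++)
        out.q[i] = p[1 + i];      // 8x8 quantisation values, zig-zag order
    return p + 65;                // next set, if bytes remain in the segment
}
```

Repeating this until the segment's byte count is exhausted handles the case of multiple tables under one DQT marker.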


The application sections are really only processed to make some compatibility checks. A JFIF file should have an APP0 segment, and it ought to be the first segment after SOI (but might not be). The first APP0 should begin with the string "JFIF\0", and this is checked. Subsequent APP0 sections, if any, should begin with the string "JFXX\0", and this is also checked. If these criteria are met, then the image is flagged as JFIF (is_JFIF). Some JPEG files that are not JFIF might have an APP14 section, usually associated with Adobe files, and beginning with the string "Adobe". If this is the case, and not flagged as JFIF, then colour space information is plucked from a field in the segment and checked whether it is RGB (if 0) or YCbCr (if 1), and flagged as such (*is_RGB is set TRUE or FALSE, accordingly).


All other markers or forms of segments not covered here are treated as errors, and the jpeg_extract_header() returns the appropriate error code. All being well, the method returns a good status and the header information in the buffers.


Huffman decode

The purpose of the jpeg_huff_decode() method is to return either a marker or a set of luminance and, if a colour image, chrominance MCUs that make up a sub-sampled set—i.e. a YCC set of 3 when there is no sub-sampling, up to a YYYYCC set for 4:2:0 sub-sampling. The method makes use of a sub-function jpeg_dht_lookup() which returns a structure of type rle_amplitude_t, as defined below.



    typedef struct {
        int  marker;         // if non-zero, hit a marker
        bool is_EOB;         // Flag if EOB
        bool is_ZRL;         // Flag if ZRL
        int  ZRL;            // run length of zeros
        int  amplitude;      // adjusted amplitude (if not EOB or ZRL != 16)
    } rle_amplitude_t, *rle_amplitude_pt;


The first field (marker), when non-zero, contains a valid marker value, indicating that a marker was encountered. The is_EOB flag indicates that an EOB code terminated the MCU data early, and the is_ZRL flag that a ZRL code (a full run of 16 zeros) was encountered. The ZRL field gives the run length of zeros that occurred before the associated amplitude value (which could be 0). The actual amplitude, when no EOB or ZRL code is encountered, is returned in the final field, amplitude.


The jpeg_huff_decode() first does some calculations on the header data to determine the expected number of Y and C MCUs (y_arrays, total_arrays), as well as some sub-sampling factors for use later on. The method then loops for as many MCUs as calculated (indexed by array). Within this outer loop, table selectors are calculated based on whether the MCU is for chrominance or luminance, the ID etc. The MCU buffers (mcu[][]) are cleared to 0 before commencing, to allow for zero run-length skipping and early termination on EOB. The DC value is fetched by calling jpeg_dht_lookup(). This might return a marker, in which case it is processed, and jpeg_huff_decode() returns. When a marker is encountered at this point, the jpeg_huff_decode() method returns NULL, and the marker information is returned in the *marker argument. Only one type of marker is expected to be processed in jpeg_huff_decode()—RSTn (others are handled in lower functions, or passed up to the calling method). This is for resetting the DC values (which are a running difference), and the DC counts (current_dc_value[]) are cleared on reception of this marker. If not a marker, the appropriate DC value is updated by adding the returned amplitude value to a running total, and dequantised by multiplying by qptr[Tq][0]. The result is placed in the current mcu[] buffer.


After the DC value, the 63 remaining AC values are similarly retrieved, in an inner loop, via jpeg_dht_lookup(), dequantised and put into the mcu[] buffer. If an AC value has a ZRL run-length greater than 0, then the MCU index (mdx) is incremented by that run-length first, before updating the buffer with the de-quantised amplitude. During AC element processing, markers may still be returned. A flagged EOB simply sets mdx to 64 so that the next loop iteration falls through, effectively skipping to the end of that MCU's processing. The only other marker valid during AC data is EOI. This acts like an EOB, but the marker is passed up to the calling method (in *marker). Any other marker is an error. Assuming no markers, when the MCU is complete (with mdx hitting 64), the outer loop iterates to process the next MCU until all MCUs in the group have finished. The jpeg_huff_decode() then returns a pointer to the mcu[] buffer, containing the data for the whole MCU set.


Huffman lookup

The jpeg_dht_lookup() method referred to in the last section is the procedure that performs the Huffman lookup as described previously, and extracts the following encoded amplitude. As this method deals with variable length codes, it works on a barrel shifter variable (jfif_barrel—defined in the jfif class and passed in to jpeg_dht_lookup()). On calling the method, matches are made on the barrel shifter bits from the smallest bit width upwards until either a match is made, no match is found (an error) or a marker is found. The code is fetched via the jpeg_get_bits() method, which returns the number of bits requested from the barrel. If not enough bits are available on the barrel for the requested size, it will fetch more bits from the input, and add them to the barrel. If it detects a marker (ffh followed by a value byte), it will return the marker value ORed with ffff0000h to distinguish it from a code. Otherwise the 'n' top bits of the barrel are returned. The bits are removed from the barrel if the 'remove_bits' argument is set. Until a code match is found, the bits mustn't be removed from the barrel; only when the bit width of the code is determined is it safe to update the barrel, and delete the bits returned.
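A minimal model of this peek-then-remove behaviour might look like the following. Input refilling and ffh marker detection are omitted, and Barrel is an illustrative name rather than the model's actual type:

```cpp
#include <cstdint>

// Sketch of the jpeg_get_bits() barrel behaviour: peek at the top n
// bits; remove them only once the caller knows the code width.
struct Barrel {
    uint64_t bits = 0;   // held bits, right-aligned within 'count'
    int count = 0;       // number of valid bits held

    void push_byte(uint8_t b) { bits = (bits << 8) | b; count += 8; }

    // Return the top n bits; remove them from the barrel only if asked.
    uint32_t get_bits(int n, bool remove) {
        uint32_t top = (uint32_t)(bits >> (count - n)) & ((1u << n) - 1);
        if (remove) {
            count -= n;
            bits &= count ? ((1ull << count) - 1) : 0;
        }
        return top;
    }
};
```

Peeking the same bits twice (once to match, once to consume) returns the same value, which is exactly the property the variable length match relies on.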


Once the jpeg_dht_lookup() method has found a matching bit width, it extracts the code (and flags to delete it from the barrel) and does the table lookup to get the byte value. The value is checked for being EOB or ZRL, and the return structure (rle_amplitude_t) flags are updated if so. If not EOB or ZRL, the amplitude bit width is extracted from the decoded value, and these amplitude bits are extracted from the barrel (via jpeg_get_bits()) and passed on to jpeg_amp_adjust(), which decodes the value to undo the encoding as per Table 2 in the previous article. Any zero run-length value from the Huffman decoded value is written to the return structure, and this is returned to the calling method.
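The amplitude adjustment is the standard JPEG "extend" operation: an n-bit field whose top bit is clear encodes a negative amplitude. Assuming the model follows the baseline convention, a sketch of what jpeg_amp_adjust() does is:

```cpp
// Undo the JPEG amplitude encoding: for an n-bit field, values with the
// top bit clear map to the negative range -(2^n - 1) .. -2^(n-1), while
// values with the top bit set are taken as-is.
int amp_adjust(int bits, int n)
{
    if (n == 0)
        return 0;                        // no amplitude bits: zero
    if (bits < (1 << (n - 1)))           // top bit clear: negative value
        return bits - ((1 << n) - 1);
    return bits;                         // top bit set: positive value
}
```

For instance, the 3 bit field 011b decodes to -4, while 111b decodes to +7.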


Inverse DCT

For the inverse DCT, a fast implementation after Arai et al., detailed in [1], has been developed and tested, and the discussions in this article assume this method. If JPEG_FAST_INT_IDCT is defined at compile time, this method is selected. In addition, another iDCT method is supplied (jpeg_idct_slow()), which is a simple implementation that follows the equations discussed in the previous article, and is adapted from [2]. A pair of double loops implement a one dimensional iDCT of rows and then columns. This method is used if JPEG_FAST_INT_IDCT is not defined at compile time. The method can be full precision, or integer fixed point (when compiled with JPEG_DCT_INTEGER defined).
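As a reminder of what the slow path computes, a full-precision 8-point 1-D iDCT following the textbook equation can be written as below; applying it to the rows and then the columns of the 8x8 block gives the 2-D transform. This is a generic sketch, not the jpeg_idct_slow() source:

```cpp
#include <cmath>

// 8-point 1-D inverse DCT:
//   s(x) = sum_u C(u)/2 * S(u) * cos((2x+1) * u * pi / 16)
// with C(0) = 1/sqrt(2) and C(u) = 1 otherwise.
void idct_1d(const double in[8], double out[8])
{
    const double pi = std::acos(-1.0);
    for (int x = 0; x < 8; x++) {
        double s = 0.0;
        for (int u = 0; u < 8; u++) {
            double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            s += 0.5 * cu * in[u] * std::cos((2 * x + 1) * u * pi / 16.0);
        }
        out[x] = s;
    }
}
```

A quick sanity check: a DC-only input produces a flat output of DC/(2*sqrt(2)), as expected from the equation.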


Colour Space Conversion and bitmaps

The C reference model is capable of generating a bitmap output from the JPEG input, converting the YCbCr data to RGB (if not RGB already), and constructing a valid 24 bit RGB bitmap. This functionality is not part of the HDL hardware core implementation, but is added to allow test, comparison and validation in a simple raw standard format. As such, only a fragmentary description will be given here.


The YCbCr conversion to RGB is performed by the jpeg_ycc_to_rgb() method. This takes a full set of Y and C MCUs (potentially sub-sampled) and expands them into an N by N block of RGB triplets—which could be 8x8 if no sub-sampling, or 16x8 for 4:2:2 etc. It is passed the final image dimensions, and clips the MCU padded data for non MCU aligned boundaries. At the heart of the method are the conversion functions detailed in the previous article. The calculations are slightly different for full precision versus fixed point integer (if JPEG_DCT_INTEGER is defined), but are essentially the same operation. The jpeg_ycc_to_rgb() method is passed a buffer pointer (*optr) to a block of memory (a 2 dimensional array, big enough for a maximally sub-sampled YCC set), in which to place the RGB data.
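Assuming the standard JFIF conversion equations, the per-pixel operation can be sketched as follows (full-precision variant with clamping to 0..255; the fixed point path would use scaled integer coefficients instead):

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a floating point sample to the 0..255 range, with rounding.
static uint8_t clamp8(double v)
{
    return (uint8_t)std::min(255.0, std::max(0.0, v + 0.5));
}

// Standard JFIF YCbCr-to-RGB conversion for one pixel; Cb and Cr are
// centred on 128.
void ycc_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                uint8_t& r, uint8_t& g, uint8_t& b)
{
    r = clamp8(y + 1.402    * (cr - 128));
    g = clamp8(y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128));
    b = clamp8(y + 1.772    * (cb - 128));
}
```

A mid-grey input (Y = Cb = Cr = 128) maps to mid-grey RGB, and saturated inputs are clipped rather than wrapped.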


The jpeg_ycc_to_rgb() method only works on the actual image data. Two other functions actually construct the full bitmap file; jpeg_bitmap_init() and jpeg_bitmap_update(). The first is called after the header extractor routine, once the image dimensions are known. The X and Y dimensions are passed to the method, which allocates memory big enough for the image, including a header, configures the header appropriately, and returns a pointer to the buffer created. jpeg_bitmap_update() is called immediately after jpeg_ycc_to_rgb(), and takes the buffer returned by that method, and places the data in the appropriate locations of the bitmap buffer, reversing the image from top to bottom to be bottom to top, as for bitmaps. It is this bitmap buffer that is returned by the top level JPEG method jpeg_process_jfif().


The jpeg_process_jfif() method (and its sub-functions described above) is meant to be used as a stand-alone library class, and also has a C linkage function available, jpeg_process_jfif_c(), and can be compiled as such with the provided make scripts. However, a brief mention ought to be made of the file jfif_main.c, which uses jpeg_process_jfif_c() to produce a command line executable for exercising the function for purposes such as verification, generation of bit comparison vectors, etc. Details of the usage of the executable, along with source code download instructions and links, can be found here. The main() function in jfif_main.c provides code for a simple command line interface, file input and output, and also a graphical display window for viewing the images (using the GTK+2 library—not available when compiled with MSVC).


Decoder HDL implementation

This section describes the HDL implementation, in Verilog, of the JPEG decoder, detailing the sub-blocks, code details and functionality.


Decoder Top Level

The decoder top level consists of four major functional blocks, a data input and a data output interface, and some error and debug ports. The four major blocks are: a data aligner (barrel_shifter12x4) for taking the 32 bit inputs and extracting the variable size words required internally; a header extractor (extract_hdr) for parsing the jpeg header and presenting the data for image decoding; a Huffman decoder (huff_decode) for the initial decompression; and an inverse DCT block (jpeg_idct) for the decoding, with normalisation and clipping. The diagram below shows these four main blocks, and their general relationship to each other.



Figure 7: Decoder Top Level

In essence, for a given image input, the header extractor will be active, parsing the header data, whilst the Huffman decoder (and hence the iDCT) will be idle. Once the header data is processed, the Huffman decoder and iDCT are activated to process the image until EOI. The control of this state resides in the header extractor block.


Interfaces

The JPEG decoder top level has a very simple interfacing structure. A single clock and reset input port is defined, and the decoder functions all run from a single clock domain. The resets are configurably synchronous or asynchronous.


The input data port is a simple valid/acknowledge interface, with a 32 bit data input. There are 4 valid bits, one for each byte within the 32 bit word, though the valid bytes must be contiguous and start from bit 0. I.e. a new file must be input as 32 bit data words, except the last word, which may contain an odd number of bytes, from 1 to 3. Data is transferred only when the valid is non-zero and the acknowledge is active (high). Similarly, the output port is a valid/acknowledge interface, this time with a 64 bit data output and a single valid output bit (as the output is always padded to 8 bytes).


The data ports constitute the main functional interfacing to the block, but an error port is also provided. An 8 bit error code is output, which flags an error condition if non-zero. Upon encountering an error, the decoder outputs the error code and halts. The error condition can only be cleared using the i_clr_error input or by resetting the block. Using i_clr_error will place the decoder in a 'searching for SOI' state, so any further data from the error-causing input will be ignored. Note that an error condition can only be generated from processing the header information. Once the decoder state is accepting data input, it will process the data without any further checking. There are 6 possible error states that can be reported:


  • 8'h01 : Invalid code error, set if a marker of invalid value is encountered.
  • 8'h02 : Unsupported code error, set if a marker is encountered that indicates a format not supported by the decoder.
  • 8'h03 : Unexpected code error, set if hunting for a particular marker, and another marker is encountered
  • 8'h04 : Missing EOI error, set if data terminates without seeing an EOI
  • 8'h05 : Expecting SOI error, set if a valid marker seen when not yet seen an SOI
  • 8'hff : Internal error, set if some internal state failure occurred

In addition to the error code port, a debug interface is provided to output internal header state. The ports are very wide and would not normally be used. However, they can be mux'ed down to a narrower debug bus for routing to external pins etc., if required, or left unconnected. The ports are used by the simulation test harness for the decoder, and so must be available at the top level of the core.


Data Aligner

The data aligner block consists of a barrel shifter that is 12 bytes wide, and has 4 byte data input and output interfaces. The input side has a unary 4 bit valid input for adding 1 to 4 bytes, and is connected directly to the main block input port. The shifter returns an ack (o_ack), and data is transferred only when the ack is set and the valid is non-zero. On the output interface, the valid indicates a byte count, which is 0 to 3 when the barrel shifter has 0 to 3 valid bytes remaining, and 4 when the shifter has 4 or more valid bytes. The internal blocks (either the header extractor or the Huffman decoder) return an ack (i_ack), and data is transferred only when the ack is set and the valid is non-zero.


The purpose of the data aligner is to allow a straightforward 32 bit interface on the decoder block boundary, with 32 bits transferred for all inputs during a decode, except for the last word, while the internal circuitry can extract arbitrary numbers of bytes, from 1 to 4, as necessary to account for the various header data alignments etc. This immunises the external circuitry from complex variable length accesses, and also decouples the internal timing from external timings, limiting them only to the interface itself.


Header Extractor

The header extractor block serves as the main control for the decoder. It is responsible for parsing the incoming jpeg header, extracting the relevant header information and storing it for subsequent data decoding. Errors and unsupported formats are handled by the block. The Huffman and quantisation table data are not stored locally, but are written to buffers within the Huffman decode block.


At the heart of the extractor is a state machine that walks through the header decode by searching for markers of appropriate type and in the appropriate sequence. Any variation from valid markers and their order are flagged as errors. The diagram below shows the state diagram for the FSM.


Figure 8: Extractor FSM

After a reset, the FSM sits in state WAIT_SOI, requesting bytes from the data aligner, and consuming them, until an SOI marker is seen at the input. At this point, two bytes are requested (markers are 16 bits), and the state moves to DECODE_MKR. In this state another marker is expected, and logic tests whether the data input is a valid marker, whether it is a supported marker type and whether it is expected at this point (say, not an EOI straight after the SOI, for example). If an error occurs, the state moves to the ERROR state and remains there, and the error code output port is updated with the type of exception. All being well, though, the state moves to PROCESS_HDR, having latched the 16 bits of data length (data_len) associated with this second marker. Data is now continuously requested from the data aligner, counting down until all bytes for the segment are processed. What happens to the data depends on the type of marker currently active (curr_mkr).


The first marker after SOI that is of interest is likely to be DQT, for one or more quantisation tables. The table consists of a Pq/Tq byte followed by 64 bytes of quantisation data. Pq is of no interest (it's fixed at 8) but the Tq value is latched by the extractor. All the following 64 bytes of quantisation data are stored in a buffer in the Huffman decoder (dqt_tbl). The Tq value specifies one of 4 destinations (exported as o_wr_tbl_sel) and, along with an index (o_wr_tbl_idx), forms the address to be stored in that buffer. The interface is a simple memory like interface with a write strobe (o_wr_dqt) that is used to write the data to the buffer. If, after processing the quantisation data, the number of bytes processed is less than the data length seen after the DQT marker, then additional tables are present, and the process is repeated. However, it is valid for each table to have an individual DQT segment as well, and the logic supports both types, or a combination of the two.


After the quantisation table data is likely to be the SOF0 marker. When curr_mkr indicates this, the bytes are extracted and placed in a buffer (SOF0) in byte order, except the 16 bit X and Y fields which are byte swapped for ease of decode. The buffer has enough space for the main part of the header and up to 3 image component parameters. The buffer values are exported on a single wide port (o_sof0) for use by the Huffman decoder, and routing to the debug port.


The next segment is likely to be one or more Huffman tables. When curr_mkr is JPEG_MKR_DHT, the data comprises a Tc/Th byte and 16 length bytes, followed by a variable number of code bytes (the sum of the length byte values). The first byte is extracted and stored locally (dht_Tch), but the length bytes and codes are stored in buffers in the Huffman decoder. Therefore the logic separates the bytes that are lengths (the next 16 after Tch) from those that are codes (i.e. the rest). The data is exported from the block over a RAM-like write interface, flagging writes for the lengths (o_dht_write_indexes) or the codes themselves (o_wr_dht). A set of "endcodes", one for each code bit width, is calculated on the fly whilst processing the length bytes, and is exported with the length data for storage in the Huffman decoder block. With reference to the Huffman table example in the previous article, the "endcodes" are values one greater than the last encoding for a given bit length. For example (referring to the diagram), there are four 3 bit codes, ranging from 010b to 101b, making the endcode for 3 bit codes 110b (or 6). The logic calculates these endcodes by taking the previous bit width's endcode (initially 0), doubling it, and adding the length value for the current bit width. This includes widths whose length is 0, for which just the doubling is applied. These endcodes will be useful in the Huffman decoding, allowing determination of the encoded variable-length code boundaries (more in later sections). The Tc/Th data specifies one of four possible destination tables, and the interface exports this table selection (o_wr_tbl_sel) along with the data and buffer indices. If, after processing a DHT table, the number of bytes processed is less than the length specified with the marker, then there is more than one table, and the process is repeated until all the table data is processed. Note, however, that multiple tables may instead each have their own DHT marker, with a single table following each. The logic will handle both cases, or a combination of the two.
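The running endcode calculation described above can be sketched in C (the function name and interface here are illustrative, not taken from the actual model):

```c
#include <stdint.h>

/* Compute the 16 endcodes, one per code bit width (1 to 16), from the
 * DHT length bytes. Each endcode is double the previous width's endcode
 * plus the count of codes at the current width; widths with a count of
 * zero still double the running value. */
void dht_endcodes(const uint8_t counts[16], uint32_t endcodes[16])
{
    uint32_t endcode = 0;

    for (int i = 0; i < 16; i++) {
        endcode = (endcode << 1) + counts[i];
        endcodes[i] = endcode;
    }
}
```

With the example above (one 2 bit code, then four 3 bit codes starting at 010b), the 3 bit endcode comes out as 110b (6).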


Finally, the last marker of interest is SOS. This defines parameters for each of the image components (either 1 for black-and-white, or 3 for colour images). The values are simply extracted into a local buffer (SOS), but with the last 3 bytes dropped, as they are fixed values and can safely be ignored. The buffer values are exported on a single wide port (o_sos) for use by the Huffman decoder.


Once a marker and its following segment data have been processed, the FSM will move either to DECODE_MKR to process the next marker, if the last was not SOS, or to PROCESS_SCAN to process the image data. At this point control is handed to the Huffman decoder (o_scan_data_active is set), and the state machine waits for that block to inform it that the scan data processing is complete (i_end_of_scan set by the Huffman decoder). If the Huffman decoder reports an error, then the state moves to ERROR and halts. If not, then either the state returns to WAIT_SOI, if the iDCT block is flagging that it is flushed (i_dct_flushed, i.e. it has no pending data processing to complete), or it goes to WAIT_DCT_FLUSH to wait until the iDCT is flushed before returning to WAIT_SOI.


If the FSM finds itself in the ERROR state, then it remains there until an external signal (i_clr_error) is set to clear the error state. Also, a full block reset forces the state back to WAIT_SOI.


Huffman Decoder

The Huffman decoder's job is to decode the incoming scan data into a set of de-quantised and de-zigzagged values, using the data and tables extracted by the header extractor block. We'll consider this function in three steps. The scan data consists of Huffman encoded bytes that have a zero run-length nibble and a bit width. So the first step is to decode these bytes with the Huffman tables extracted previously. The second step is to use the decoded bit width to extract the correct number of following bits, which form the coded amplitude value (as per the table shown in the previous article), and decode this to a true amplitude. Lastly, this amplitude data is de-quantised, de-scaled and de-zigzagged into a swing buffer, before being output towards the iDCT. The following sections describe these functions.


Table Selection

During header extraction, tables were updated for the Huffman decoding of the data, and for quantisation. There are up to four sets of tables, selected on class (i.e. whether for DC data or AC data) and ID (either 0 or 1). When processing scan data, which tables to use at any one time needs to be determined. This depends on whether the 8 by 8 MCU is for luminance (Y data) or one of the chrominance components (Cb or Cr), and whether the MCU data is the DC value at index 0, or the AC values. The matter is further complicated by the fact that sub-sampling means there may be anywhere from 1 to 4 luminance MCUs before the chrominance MCUs.


In order to determine the appropriate tables to use, the logic must determine how many luminance MCUs to expect before the two chrominance MCUs, and flag the incoming MCUs appropriately. The SOF0 data extracted from the header has the Hi and Vi values, and the number of luminance MCUs, 'y_arrays', is calculated from these. An 'array_count' is kept to mark each MCU as it's processed, from which the MCUs are marked as luminance (is_lum) or chrominance (~is_lum, and indexed with chrome_idx). Flags are generated to mark the end of each MCU (array_complete) and the end of the last MCU in a group (final_array_complete).
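This classification can be sketched in C (the struct and function here are hypothetical, though the names follow the signals described above; y_arrays is typically Hi x Vi for the luminance component when the chrominance factors are 1):

```c
/* Classify an MCU within a group by its position in the sequence: the
 * first 'y_arrays' MCUs are luminance (Y) data, followed by one Cb and
 * one Cr MCU. */
typedef struct {
    int is_lum;      /* 1 if a luminance MCU                 */
    int chrome_idx;  /* 0 for Cb, 1 for Cr (when not is_lum) */
} mcu_class_t;

mcu_class_t classify_mcu(int array_count, int y_arrays)
{
    mcu_class_t c;

    c.is_lum     = array_count < y_arrays;
    c.chrome_idx = c.is_lum ? 0 : array_count - y_arrays;

    return c;
}
```

For 4:2:0 sub-sampling (y_arrays of 4), MCUs 0 to 3 classify as luminance, MCU 4 as Cb and MCU 5 as Cr.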


Quantisation table selection (Tq) is a simple matter of selecting the appropriate Tq value from the extracted SOF header data, depending on whether luminance MCU, or the chrominance index. The Tq value selects which quantisation data to use for the MCU being processed. For the Huffman table selection, huff_idx is generated from the chrome_idx and is_lum state. This is used to select the appropriate DC and AC IDs from the extracted SOS data (sos_td, sos_ta). These values, combined with the is_ac state (active for all MCU data except the first amplitude, which is DC), generate the actual Huffman table selector, huf_tbl_sel.


Amplitude Decode

The amplitude decode consists of a two stage process: a "match" phase and a "lookup" phase (marked by the 'match_not_lookup' state). In the match phase the input data is matched against the Huffman bit width endcodes, to find the valid bit width of the code and pull those bits out of the input. The code matching is done using priority encoded matching across the incoming bits (bits_data). The bits are masked both against the incoming valids (bits_valid) and the Ln valids. The Ln valids were extracted from the header, flagging which bit lengths have associated codes. There are sets of valids for each class and ID, and the table used (dht_ln_valid) is selected based on the MCU's ID and whether decoding the DC value or the AC values. Having thus masked the input bits, matches are done for all valid bit widths, selecting the smallest matching code. A match is successful for a given bit width if the value for that bit width is less than the endcode (from the previously extracted header data) of that width. An offset is calculated (vn_offset) from the next smallest bit width's endcode value, multiplied by 2 (i.e. this doubled endcode is subtracted from the extracted input bits). During header extraction, a set of tables of offsets was constructed (dht_vn_offset) which mark the offsets into the Huffman table (dht_tbl) where the codes for each bit width begin. The calculated offset is added to the offset table entry for the matching bit width, to give a table index (vn_index) used during lookup. During this match phase, the barrel shifter is acknowledged (bits_req) for the number of bits in the matched code, which are thus removed.
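The match phase can be illustrated behaviourally in C (a sequential sketch of the RTL's parallel priority compare; the names follow the signals described above, but the interface is illustrative):

```c
#include <stdint.h>

/* Behavioural sketch of the match phase. Given up to 16 MSB-first bits
 * of scan data, find the smallest bit width w whose w-bit prefix is
 * less than that width's endcode (and which has codes at that width),
 * and form the lookup index as the per-width table offset plus the code
 * value minus double the previous width's endcode. Returns the matched
 * width, or 0 if no valid code matches. */
int huff_match(uint32_t bits, int nbits,
               const uint32_t endcodes[16], const int ln_valid[16],
               const int vn_offsets[16], int *vn_index)
{
    uint32_t prev_endcode = 0;

    for (int w = 1; w <= 16; w++) {
        if (w <= nbits && ln_valid[w - 1]) {
            uint32_t code = (bits >> (nbits - w)) & ((1u << w) - 1);
            if (code < endcodes[w - 1]) {
                *vn_index = vn_offsets[w - 1] + (int)(code - 2 * prev_endcode);
                return w;
            }
        }
        prev_endcode = endcodes[w - 1];
    }
    return 0;  /* no match: error, or not enough input bits yet */
}
```

Using the earlier example table (one 2 bit code, four 3 bit codes), the input bits 010b fail the 2 bit match (01b is not below the 2 bit endcode of 1) and match at 3 bits, indexing the second table entry.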


Note that it is possible for there to be no valid code match during the match phase. This is either because there is no possible match (an error condition) or because the number of valid bits at the input is insufficient for a match. For the former case, an error is output (code_error), but for the latter the 'match' state is held until such time as there are enough input bits to determine the valid code (or to flag an error if still none is found).


During the lookup phase, a code is extracted from the selected Huffman table (dht_tbl). This 8 bit value consists of a zero run length (in the upper nibble, 'zeros') and a bit length for the following amplitude code (lower nibble, 'bitsize'). It is also possible that the looked-up code could be an EOB (end-of-block) or a ZRL (zero-run-length) code. Neither of these codes is followed by an amplitude code, but the lower nibble is zero for both of them, and so no bits are removed from the barrel shifter. For the normal case, the amplitude code is extracted from the barrel data bits, adjusted for the encoding (amplitude) and marked as valid (amp_valid). An additional step is made for the DC value, in that the amplitude is a difference value and is added to a running DC count (dc_value_x), with four DC values kept, one for each table type. Along with the amplitude and valid flag, an 'address' is calculated (amp_write_addr). An MCU index (mdx) is kept whilst processing the data of a given MCU. In the normal course of things, this would simply increment for each amplitude processed. However, codes that have a zero run length of 1 or more have this value added to the increment. The amplitude write address thus 'jumps' over the zeros implied in the code. The ZRL code is a special case which indicates 16 zeros (and no amplitude), but since this code has an upper nibble of 0xf (15), adding it to the increment gives the correct jump for 16 zeros. The amplitude valid flag is suppressed for this case. An EOB code also suppresses the valid flag, and resets the MCU index to 0, as it indicates early termination of the current MCU and implies that all the rest of the MCU consists of zero values.


The figure below gives an overview of the match/lookup logic.


Figure 9: Huffman Decoder (to Amplitude Decode)

Dequantise and Buffering

The first part of the decoding process generated a stream of validated amplitudes and their addresses within the received serial sequence. The second part of the process consists of reconstructing an 8x8 array of absolute amplitude values. To do this, the logic must de-quantise the amplitudes generated by the first part, and de-zig-zag the serialised stream.


The de-quantising consists of selecting either the appropriate running DC value (curr_dc_value) or the AC amplitude (amplitude) as the value for dequantising (qamplitude), which is now a signed variable. The quantisation table selected by Tq (dqt_tb) has the appropriate quantisation value extracted (curr_dqt_value), using the amp_write_addr as an index. The actual quantisation table value is a prescaled version of the DQT value from the header, which allows it to be multiplied with the amplitude rather than divided, using implied fixed point arithmetic. This signed 19 bit value is multiplied with the 12 bit qamplitude value to yield a 32 bit result (dequantised_val). This is descaled to a 16 bit value (descaled_value) by simply truncating the 32 bit multiplication result.
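The implied fixed point multiply and descale can be sketched in C (the number of fraction bits here is an assumption for illustration; the RTL's actual bit split may differ):

```c
#include <stdint.h>

#define DQT_FRAC_BITS 13   /* assumed implied fixed point position */

/* Multiply the signed amplitude by the prescaled quantisation value
 * (the quantiser step already multiplied by the iDCT scale factor in
 * fixed point), then descale by truncating back to 16 bits. */
int16_t dequantise(int16_t qamplitude, int32_t dqt_scaled)
{
    int32_t dequantised_val = (int32_t)qamplitude * dqt_scaled;

    return (int16_t)(dequantised_val >> DQT_FRAC_BITS);
}
```

So a table entry holding 2.0 in this fixed point format (2 << 13) doubles the amplitude, with no division required anywhere in the data path.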


The descaled value is then written into the swing buffer block (swing_buff). The address for the write is a de-zig-zagged version of the amp_write_addr, mapping the serial address to the location in the 8x8 matrix. Within the swing buffer, a set of 64 valid bits is kept to mark which locations within the buffer have been written since the start of an MCU. Because of the zero run length 'skips' in amplitude sequences, not all entries will have an update for the current MCU. Instead of sequencing a set of writes for the skipped zeros, the valids flag the updated entries, distinguishing them from those that are implied as zero and will be forced to zero when read over the output. The last entry in the buffer is also not guaranteed to be written (i.e. if the MCU terminated with an EOB). So, to mark the buffer full, a flag (last_mcu_word) is generated at, or just after, the last amplitude write, informing the swing buffer that no more data will be coming for the buffer and to mark it full.
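The de-zig-zag mapping is fixed, and can be captured as the standard JPEG lookup table, indexed by the serial (amp_write_addr) position:

```c
/* Standard JPEG zig-zag order: entry i is the raster position
 * (row * 8 + col) in the 8x8 matrix of the i-th coefficient in the
 * serial stream. */
static const int dezigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Map a serial coefficient address to its swing buffer write address. */
int dezigzag_addr(int amp_write_addr)
{
    return dezigzag[amp_write_addr];
}
```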


A full buffer is available for output towards the iDCT block. The decode logic can fill the next available buffer whilst the full buffer is output to the iDCT. Obviously, if both buffers become full, the decode logic must stall, and a flag is generated when this happens (stall). The swing buffer makes data available to the iDCT block as a set of 64 bits (a column of 8 bytes). The output bits are masked, however, by the buf_valid bits set during input, so that invalid values are forced to 0. The swing buffer block also generates an empty status (sb_empty) for use in flagging when processing of a complete image has finished. The diagram below shows a rough block diagram of what is described here.


Figure 10: Huffman Decoder (Dequantise and Buffer)

Inverse DCT

The inverse DCT block (jpeg_idct) has three major functions. The central function is the one dimensional iDCT (idct_1d), used to process 8 bytes at a time. Peripheral to this is the buffering and co-ordination of data flow, to efficiently process amplitudes through the iDCT twice and then output them. Finally, the processed data must be normalised and clipped before being sent over the output port. The figure below shows an outline sketch of these functions.


Figure 11: iDCT Module

iDCT Control

The philosophy for the iDCT control is to accept as much input data from the upstream logic as possible. Each set of 8x8 data matrices must be pushed twice through the idct_1d block—once for columns and once for rows. The data comes in as columns, passes through idct_1d and ends up in a swing buffer, ready for processing the buffered data as rows. Subsequent input data for the next MCU can start to be processed into the other buffer within the swing buffer block. The completed iDCT processed column data is sent as rows back through idct_1d towards the output, and is not rebuffered (except through a short FIFO at the output of the block).


The above description makes the process appear quite linear, with columns processed, then rows, then output. However, the one-dimensional iDCT block is highly pipelined, and data from the input or the buffers can be interleaved without interfering with each other. Priority is always given to the input data, to move the pipe as far forward as possible and create the fewest 'bubbles' upstream. The input will be stalled by the iDCT block only once both swing buffers are in use and a new 8x8 data set is waiting. However, if the input stalls upstream, and no valid input data is available to the iDCT, the block will process buffer data and send it into the idct_1d module. If input becomes available again (and buffer space is available), even if buffer data is only part way through processing, then the block will start processing input once again. When two buffers are active, preference is given to the buffer filled first, as otherwise the data would become out of order. This also has the advantage that the older data will be further advanced through the iDCT and will clear sooner, freeing a buffer for more input.


As can be seen from Figure 11, the flow of data is controlled via a multiplexer. A signal (select_input[1:0]) is generated which controls the mux, selecting the input data (select_input[1] set) whenever there is input available and a buffer to send it to. If input is not selected, the data is taken from one of the buffers (buf0 or buf1), depending on a selection flag (buf_to_process). This flag simply swaps between buffers as each is completed. Similarly, a register, buf_to_fill, swaps between buffers as a flag for the destination buffer to use for input column data. A buf_full status is also generated to indicate which buffers have data ready for processing, i.e. full of column data and not available for input. The full status is cleared when the last column is sent back into idct_buf.


Since the idct_1d pipeline can have data from multiple sources at any given time, attribute data must be sent along with the input/column data in order to determine what to do with the output from idct_1d. There are five control bits sent in with the data (s_ctl). The first two bits determine the destination of the output data, with 01b being for the output (via the normalisation), 10b for buffer 0 and 11b for buffer 1 (00b is not used). The s_ctl[1:0] bits are generated directly from the select_input signal. Along with the destination bits, three index bits are generated, indicating the column write location in the buffer (and thus not needed for output data). When data emerges from the idct_1d block, the control bits (op_ctl) are used either to generate a valid signal on the output (o_valid), if row data, or to perform writes to the indicated buffer.


One dimensional iDCT

The one dimensional iDCT is a highly pipelined implementation that attempts to minimise the number of complex operations (i.e. multiplications) needed to perform the transformation. It is based around the algorithm of Arai, Agui and Nakajima, via the Independent JPEG Group's library, but pipelined for a hardware implementation. It reduces the iDCT to a set of additions and only 5 multiplications. In order to achieve this, a scaling factor is required after de-quantisation, but this can be absorbed into the quantisation table, folding the scaling factor into the header quantisation values as they are extracted. This requires only one additional multiplier to scale the quantisation data, which is serially processed in the header extractor block and scaled as it is written into the table buffers in the Huffman decoder block.


Details of the algorithm will not be discussed here (see [1], section 4.3.5), but the diagram below shows the seven stages of the pipeline, highlighting the multiplications. The other operations are all additions, subtractions or inversions. The control information is passed through pipeline stages with their associated data.



Figure 12: One Dimensional iDCT Pipeline


The variables shown in Figure 12 refer to the C model implementation, and show variable re-use. In the RTL implementation this re-use is not done, and each stage has its own set of registers to store results, so that each stage is fully independent of the others, taking only the previous stage's results from the previous clock cycle as its input. The hardware implementation is, however, organised in exactly the same manner, and performs the same calculations.


Normalisation and Clipping

The final stage in the processing of scan data is the normalisation and clipping of the data. Until this point, the data processing since the initial scaling has used 16 bit arithmetic, with the iDCT rescaling the multiplications back to 16 bits after each operation. The pixel values need to be 8 bits, and turned back into positive values (0 to 255, rather than -128 to 127). This is done on the data emerging at the output of the iDCT, where the 16 bit values are truncated to 8 bits, and 128 is added. If the values fall below 0, or above 255, then they are clipped (rather than wrapped). The data stream from this normalisation now consists of the fully decoded image pixel information. Between this normalised data and the decoder's output interface is a short FIFO (not shown in Figure 11), placed there simply to make the internal logic timing independent of the external acknowledge, whose timing within the clock cycle cannot be guaranteed.
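The level shift and saturation just described can be sketched in C:

```c
#include <stdint.h>

/* Level shift an iDCT output sample back to an unsigned pixel value,
 * clipping (rather than wrapping) anything outside 0 to 255. */
uint8_t normalise_clip(int16_t idct_out)
{
    int v = (int)idct_out + 128;

    if (v < 0)
        return 0;
    if (v > 255)
        return 255;

    return (uint8_t)v;
}
```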


The actual interface at the output is an 8 byte wide data bus (o_data) with a single valid bit (o_valid). An acknowledge input is used to transfer the data, with data updated when both valid and acknowledge are active (and at no other time). The order of the data emerging from the decoder is row data (starting with DC in o_data[7:0]), followed by the rest of the row's AC values, with subsequent rows following, making up sets of 64 bytes. The order of these sets is the same as in the scan data of the JPEG input. Therefore, this is luminance (Y) data followed by Cb and Cr sets. In the case of sub-sampled images, there will be multiple Y data sets before the chrominance data emerges. The o_sos and o_sof0 status ports of the decoder make available the information needed to discern the nature of the data that will emerge. It is outside the scope of this decoder implementation (though implemented in the C model), but a further processor could be contrived to combine the luminance and chrominance data into a set of RGB values, linearised to form a raw image in tuples of RGB information, in a sequence similar to that of, say, a 24 bit bitmap.


Download

The entire JFIF package can be downloaded as a self-extracting Windows executable. It includes all the HDL, the verification environment, synthesis scripts and the documentation. It is released under the terms of the GPL (the reference model) and the LGPL (HDL source), copies of which are included. The software reference model can be downloaded separately, if desired. See the document here for instructions. The full package is available from the link below:


Conclusions

We have looked at a C model of a decoder (the source code of which is available), and a hardware implementation of the same functionality (the source code of which will be made available in the near future), all based on the concepts discussed in the previous article. This has solidified these abstract concepts into practical implementations to demonstrate how real-world software and hardware implementations are achieved, and provides a case study of such implementations.


Having implemented a decoder, the question arises, how do we know it works for all intended cases? Verification of the C model and HDL, and the flow used to achieve this, is discussed in another article, to provide a further case study in the development of a practical piece of digital IP to a high standard.

References

  • [1] Pennebaker, W.B. and Mitchell, J.L., 1992, "JPEG Still Image Data Compression Standard", 1st Ed., ISBN 0-442-01272-1
  • [2] Nelson, Mark; Gailly, Jean-loup, 1995, "The Data Compression Book", 2nd Ed., ISBN 1-558-51434-1



Copyright © 2014 Simon Southwell
simon@anita-simulators.org.uk