JPEG Implementation
by Simon Southwell
7th February 2014
Abstract
This article is the second paper discussing JPEG topics, and documents an example software and
hardware decoder implementation using the concepts outlined in the previous article.
A software C model implementation is discussed first, which is ultimately to be used as a reference
in testing a Verilog RTL implementation, targeted at, though not limited to, an FPGA. The internal hardware design is discussed
in detail, and conclusions drawn.
Introduction
In this article we will look at the implementation of a JPEG decoder in order to solidify the
concepts discussed in the previous article. The development of the implementation discussed
takes a two step approach. Firstly a C model (with a C++ core) is implemented, but with an architecture that
is hardware implementation friendly. A C model is easier and quicker to implement than an RTL hardware
implementation, and will execute much faster than an RTL simulation. Therefore the C model
can be tested with a much larger set of test data for a given test period. The C model then
becomes a specification and test reference for the second step of an RTL implementation.
Both the C model and the RTL implementations will be discussed. The C model will be described in
less detail than the hardware implementation, as the hardware is the final goal, but
a general understanding is still required in order to explain the details of the RTL. The RTL
design will be explained in more detail such that the Verilog source code is fully understood as
a piece of IP that can be reused and further developed.
Decoder Implementation for Software Model
The software model for the decoder covers all functionality that will be required in the
hardware implementation, plus some peripheral code for a command line executable program
interface and debugging information etc. As a whole the model will serially process an input
JPEG file and output a bitmap file. This basic operation is further broken down into processing
the header information, and then looping through the MCUs, decoding them, and placing the
bitmap data in a buffer and outputting the results. The core is based around a class 'jfif',
with a single public method (jpeg_process_jfif()). The class header is defined in
jfif_class.h, and the methods are defined in jfif.cpp. The jfif class inherits a base
class, jfif_idct that provides the inverse DCT functionality. This is defined in
jfif_idct.cpp and jfif_idct.h. The basic calling structure of the code is as
shown below in the form of pseudo-code. The function names shown match with the actual
jfif class methods, and the calling hierarchy is as it is in the model.
jpeg_process_jfif() -- Converts input baseline DCT JFIF buffer to 24 bit bitmap
jpeg_extract_header() -- Extracts jpeg header information (scan, frame, quantisation/huffman tables, DRI)
jpeg_dht() -- Constructs a Huffman decode structure from huffman table data
jpeg_bitmap_init() -- Creates space for appropriately sized bitmap and initialises header data
LOOP for each MCU: -- jpeg_process_jfif() processes an MCU at a time until EOI (or error)
jpeg_huff_decode() -- Gets an adjusted Huffman/RLE decoded amplitude value and ZRLs, or marker or end-of-block
jpeg_dht_lookup() -- Does huffman lookup on code and extracts amplitude data, or flags a marker
jpeg_get_bits() -- Gets top bits from barrel shifter and (optionally) removes. Pulls in extra input if needed.
jpeg_amp_adjust() -- Adjusts decoded huffman decoded amplitude to +/- amplitude value
<dequantise> -- De-quantisation done in jpeg_huff_decode directly from selected table
jpeg_idct() -- Inverse discrete cosine transform
jpeg_ycc_to_rgb() -- Converts YCbCr to RGB on an MCU
jpeg_bitmap_update() -- Updates bitmap data buffer with 8x8 RGB values
ENDLOOP
return pointer to bitmap
The top level entry point for JFIF decoding is the jpeg_process_jfif() method. This has a very simple
interface. The first argument is simply a pointer to a buffer of uint8 bytes (*ibuf), containing the JFIF data
to be decoded. The JFIF data is expected to start in the first byte and continue consecutively for the whole
image. No assumptions are made by the jpeg_process_jfif() method about the size of the file, or dimensions
of the image etc.—all this is calculated from the header.
The second argument is a pointer to a uint8 pointer (**obuf). The pointer referenced by the passed in pointer
will be updated to point to a buffer containing the decoded bitmap data (assuming no errors). The memory
required for the bitmap is dynamically allocated by jpeg_process_jfif(), and it is the responsibility of
the calling process to free this memory when it is no longer required.
The function returns one of several return codes, defined in jfif.h (a usage sketch follows the list below):
- JPEG_NO_ERROR : Decode completed without error
- JPEG_FORMAT_ERROR : A format error was encountered when parsing the header
- JPEG_MEMORY_ERROR : A memory allocation failure occurred
- JPEG_UNSUPPORTED_ERROR : A valid, but unsupported, format was encountered
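As an illustration of this interface, a caller might be structured along the following lines. This is a minimal sketch only: the prototype of the C linkage function jpeg_process_jfif_c() and the value of JPEG_NO_ERROR are assumed here for illustration, and the real definitions are in jfif.h.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define JPEG_NO_ERROR 0                     /* assumed value; actual definition is in jfif.h */

extern int jpeg_process_jfif_c(uint8_t *ibuf, uint8_t **obuf);   /* assumed prototype */

int decode_file(const char *jpeg_name)
{
    FILE *fp = fopen(jpeg_name, "rb");
    if (fp == NULL)
        return -1;

    /* Read the whole JFIF file into a single contiguous buffer, as required */
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    fseek(fp, 0, SEEK_SET);

    uint8_t *ibuf = malloc(size);
    uint8_t *obuf = NULL;
    if (ibuf == NULL || fread(ibuf, 1, size, fp) != (size_t)size) {
        fclose(fp);
        free(ibuf);
        return -1;
    }
    fclose(fp);

    /* Decode: on success, obuf is updated to point at a dynamically allocated bitmap */
    int status = jpeg_process_jfif_c(ibuf, &obuf);

    if (status == JPEG_NO_ERROR) {
        /* ... use or write out the bitmap pointed to by obuf ... */
    }

    free(obuf);                             /* the caller owns the bitmap memory */
    free(ibuf);
    return status;
}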
Within jpeg_process_jfif() the method calls jpeg_extract_header() to parse the JFIF header,
and return various pointers to the header information, such as SOS data (scan_header_t *),
SOF0 data (frame_header_t *) and reset interval. With this data jpeg_bitmap_init() is called
that allocates a buffer for bitmap data (bmp_ptr) and generates a bitmap header for a 24 bit RGB
image (even if it's greyscale). It also calculates, from the header data, the number of MCUs it
is to process (total_mcus). The main body of the method is then a while loop that processes
MCUs one at a time (and markers) until an EOI marker is encountered. This loop calls jpeg_huff_decode() to return either
a marker or an MCU consisting of an 8 by 8 array of adjusted amplitude values (scan_data_ptr). The only markers
expected to be returned during scan data processing are RSTn and EOI, otherwise an error
is returned. Some checks are done of the sequence of RSTn markers received, though the actual
resetting of DC values is handled by lower level functions. The scan_data_ptr data is then
inverse discrete cosine transformed (in jpeg_idct()) and then, if a colour image, converted
to RGB (jpeg_ycc_to_rgb()). The resulting RGB data is then sent to jpeg_bitmap_update() for
constructing into the bitmap buffer. When all MCUs are processed, the output pointer (**obuf) is updated to point
to the bitmap data, and jpeg_process_jfif() returns with a status.
Header extraction
JFIF header extraction is performed by the jpeg_extract_header() method. This function is based
around a single "switch" statement that parses all valid marker types expected before the SOS marker,
though it flags an error if any of those markers are from an unsupported file format. The JFIF format
mandates some ordering of segments delineated by the markers but, for robustness, the jpeg_extract_header()
method only expects the first item to be an SOI marker, and that an SOS marker marks the end
of the header information. Any other segment can come in any order, whatever the JFIF specification
might say. It is known that not all images follow the JFIF format completely strictly.
All markers fall into one of two categories: ones that have no following data (e.g. SOI) and those
that do (e.g. APP0). The latter are always followed by 2 bytes of length, which dictates the total
number of following bytes (including the 2 byte length) of the segment. The main markers of interest
to the jpeg_extract_header() method are the SOI, SOF0, SOS, DHT, DQT, APP0 and APP14. The SOI has no
information, but marks the start of the image, and is expected first, otherwise an error is
returned. The SOF0 segment contains data for image dimensions, whether colour or greyscale and
a set of values determining sub-sampling (if any), quantisation table selections and component
ID—one for each component of a set (see Table 4 in the previous
article). A pointer to a frame_header_t structure is set to the input buffer location 2 bytes
from the SOF0 marker (to skip the length field) and returned in the **fptr argument. Note that all
other start of frame markers (SOF1 to SOF15) are for unsupported formats, and the method
returns an error if they are encountered.
The start of scan segment (SOS marker) has information for each component, defining a component selector,
a DC Huffman table selector and an AC Huffman table selector. Unlike for SOF0, we cannot simply
overlay a structure pointer on the input buffer, because the variable length part of the segment (i.e.
the component specific fields, which may have 1 set or 3 sets) is followed
by another fixed section, whose offset is dependent on the number of component data sets. Therefore
some buffer space is allocated and the data extracted into this buffer, and a pointer returned (*sptr)
of type scan_header_t, that points to this extracted data.
The Huffman table segment (DHT) requires a bit more processing, and so a sub-function (jpeg_dht)
is called to process segments of this type, returning a pointer of type DHT_offsets_t which, in turn,
is returned by jpeg_extract_header(). The aim of the jpeg_dht() method is to construct a Huffman
table with as few bits of information as possible. This information consists of the 16 lengths
of the segment (see Figure 7 in the previous article), the
offsets within the Vn data where the different bit width codes start, and the end codes for
the different bit widths. Note that data for bitwidth n is placed at index n-1 in the data arrays,
so, for example, bit width 3 data is at index 2.
As discussed in the previous article, this is all that is needed
to decode the Huffman encoded data. If input data bits are less than a certain bit width's end code, but bigger
than the next smallest width's end code, then that is the number of bits of the code. Subtracting
the end code of the next smallest bit width, doubled to give the start code of the matching bit width,
from the input data bits gives the offset into the entries for that bit width, and hence the decoded value.
Taking the example of Figure 5 from the previous article, we
would have a DHT_offsets_t structure filled in as shown:
- *Ln[] (points directly to Ln values in dht buffer, overlaid on the input data buffer, buf[])
- 0, 1, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- *vmn_offset[n] (array of pointers to dht[] indexes)
- 0 -> NULL
- 1 -> dht[0] : 01
- 2 -> dht[1] : 00, 07, 04, 05
- 3 -> dht[5] : 21, 31, 32
- others -> NULL
- row_break_codes[n]
- 0 = 0b
- 1 = 01b
- 2 = 110b
- 3 = 1111b
- others don't care
If the input data was a Huffman encoded run-length, and presented, say, ..010101011, then the first bit
(using lsb) is bigger than row_break_codes[0] (for 1 bit width), so this is not a match. The first two bits are
11b, which is bigger than row_break_codes[1], but the first three bits are 011b which is less
than row_break_codes[2], meaning we have a 3 bit code. Taking the endcode for 2 bits (row_break_codes[1] = 01b)
and doubling it gives 010b, or 2. Subtracting this from the 3 bits of the input is 011b - 010b = 001b, or 1.
The 3 bit vmn_offset pointer (vmn_offset[2]) points to dht[1], and the value at offset 1 from this has
a value 07h, and hence is the decoded value—i.e. the variable code 011b of the input has a byte value of 07h.
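To make the mechanics concrete, the lookup just described can be sketched as follows. This is a simplified stand-in for the actual jpeg_dht_lookup() method, using the field names of the DHT_offsets_t example above (the real code also has to handle markers and insufficient input bits).
typedef struct {
    unsigned char *vmn_offset[16];       /* per bit width pointers into the Vn data */
    unsigned int  row_break_codes[16];   /* "endcode" for each bit width            */
} dht_offsets_sketch_t;

/* 'bits' holds the next 16 input bits, left-aligned in its low 16 bit positions.
 * Returns the decoded byte value and sets *code_len, or -1 if no match is found. */
int dht_lookup_sketch(const dht_offsets_sketch_t *dht, unsigned int bits, int *code_len)
{
    unsigned int prev_end = 0;                   /* endcode of the next smallest width */

    for (int w = 1; w <= 16; w++) {
        unsigned int code = bits >> (16 - w);    /* top w bits of the input            */

        if (dht->vmn_offset[w-1] != NULL && code < dht->row_break_codes[w-1]) {
            *code_len = w;                       /* w bits are then removed from the barrel */
            return dht->vmn_offset[w-1][code - 2 * prev_end];
        }
        prev_end = dht->row_break_codes[w-1];
    }
    return -1;                                   /* no match: an error, or more bits needed */
}
Running this over the example gives the same result: the 3 bit code 011b matches at width 3, the offset is 011b - 010b = 1, and vmn_offset[2][1] is 07h.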
The jpeg_dht() function returns a pointer to a DHT_offsets_t structure with the values filled in for
the Huffman table, having allocated some memory space for the table. The Tc and Th values are extracted
from the input, and the method loops through each bit width (smallest to largest), filling in the
table entries, based on the Ln lengths, and Vn values in the DHT segment.
The quantisation table segments (DQT) define the quantisation table values to be used for the different
classes (DC or AC) of data, and IDs (which component). Following the 2 byte length are one or more
65 byte value sets. The first byte consists of two nibbles specifying the Pq (always 8) and Tq (destination)
value. Then follows the 64 bytes for the 8x8 quantisation table values. These are extracted into
the appropriate table (indexed by the Ptq value) pointed to by *qptr, of type DQT_t. The values
are turned into integer values at this point as they are pre-scaled with values determined by the
Arai et al. method (see [1]), allowing the saving of a multiplication at that point.
The application sections are really only processed to make some compatibility checks. A JFIF file
should have an APP0 segment, and it ought to be the first segment after SOI (but might not
be). The first APP0 should begin with the string "JFIF\0", and this is checked. Subsequent
APP0 sections, if any, should begin with the string "JFXX\0", and this is also checked. If these
criteria are met, then the image is flagged as JFIF (is_JFIF). Some JPEG files that are not JFIF
might have an APP14 section, usually associated with Adobe files, and beginning with the string
"Adobe". If this is the case, and not flagged as JFIF, then colour space information is plucked from
a field in the segment and checked whether it is RGB (if 0) or YCbCr (if 1), and flagged as
such (*is_RGB is set TRUE or FALSE, accordingly).
All other markers or forms of segments not covered here are treated as errors, and the jpeg_extract_header()
returns the appropriate error code. All being well, the method returns a good status and
the header information in the buffers.
Huffman decode
The purpose of the jpeg_huff_decode() method is to return either a marker or a set of luminance
and, if a colour image, chrominance MCUs that make up a sub-sampled set—i.e. a YCC set of
3 when no sub-sampling, up to a YYYYCC when 4:2:0 sub-sampling. The method makes use of a sub-function
jpeg_dht_lookup() which returns a structure of type rle_amplitude_t, as defined below.
typedef struct {
int marker; // if non-zero, hit a marker
bool is_EOB; // Flag if EOB
bool is_ZRL; // Flag if ZRL
int ZRL; // run length of zeros
int amplitude; // adjusted amplitude (if not EOB or ZRL != 16)
} rle_amplitude_t, *rle_amplitude_pt;
The first field (marker) when non-zero contains a valid marker value, indicating that a
marker was encountered. The second is a flag indicating that an EOB marker terminated the MCU data
early. The third and fourth fields flag and give the run-length of zeros that occurred before the associated amplitude value (which could be 0).
The actual amplitude code, when not EOB or ZRL encountered, is returned in the final field, amplitude.
The jpeg_huff_decode() first does some calculations on the header data to determine the expected number of Y and C MCUs
(y_arrays, total_arrays), as well as some sub-sampling factors for use later on. The method
then loops for as many MCUs as calculated (indexed by array). Within this outer loop, table selectors
are calculated based on whether the MCU is for chrominance or luminance, and the ID etc. The
MCU buffers (mcu[][]) are cleared to 0 before commencing to allow for zero run-length skipping and
early termination on EOB. The DC value is fetched by calling jpeg_dht_lookup(). This might return
a marker, in which case it is processed, and jpeg_huff_decode() returns.
When a marker is encountered at this point, the jpeg_huff_decode() method returns
NULL, and the marker information is returned in the *marker argument. Only one type of marker is
expected to be processed in jpeg_huff_decode()—RSTx (others are handled in lower functions, or
passed up to the calling method). This is for resetting the DC values (which are a running difference),
and the DC counts (current_dc_value[]) are cleared on reception of this marker.
If not a marker, the appropriate DC value is updated by adding the returned amplitude value to a running total, and dequantising
by multiplying by qptr[Tq][0]. The result is placed in the current mcu[] buffer. After the DC
value, the 63 remaining AC values are similarly retrieved, in an inner loop, via jpeg_dht_lookup(), dequantised
and put into the mcu[] buffer. If the AC values have a ZRL run-length greater than 0, then the
MCU index (mdx) is incremented by that run-length first, before updating the buffer with the de-quantised
amplitude. During AC element processing, markers may still be returned. A flagged EOB will simply set
mdx to 64 so that the next loop iteration falls through, and effectively skips to the end of that MCU's processing.
The only other marker valid to appear during AC data is EOI. This acts like an EOB, but the marker
is passed up to the calling method (in *marker). Any other marker is an error. Assuming no markers, when the MCU is complete
(with mdx hitting 64), the outer loop iterates to process the next MCU until all MCUs in the group have finished.
The jpeg_huff_decode() then returns a pointer to the mcu[] buffer, containing all the mcu data for the whole
MCU set.
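The flow of that loop can be summarised with the following compilable sketch. It is a simplification: the decoded run-length/amplitude records are supplied as an array rather than fetched via jpeg_dht_lookup(), marker handling is omitted, and the quantisation table access is schematic.
#include <stdbool.h>

typedef struct {                          /* as defined earlier */
    int  marker;
    bool is_EOB;
    bool is_ZRL;
    int  ZRL;
    int  amplitude;
} rle_amplitude_t;

/* Fill one 64-entry MCU from a stream of decoded records, applying the
 * running DC update and de-quantisation described above. */
void fill_mcu_sketch(int mcu[64], const int qtable[64],
                     const rle_amplitude_t *rec, int nrec, int *running_dc)
{
    int r = 0;

    for (int i = 0; i < 64; i++)
        mcu[i] = 0;                       /* zeros implied by skips and EOB    */

    /* DC coefficient: the decoded amplitude is a difference value */
    *running_dc += rec[r++].amplitude;
    mcu[0] = *running_dc * qtable[0];

    /* The 63 AC coefficients */
    for (int mdx = 1; mdx < 64 && r < nrec; ) {
        const rle_amplitude_t *a = &rec[r++];

        if (a->is_EOB) {
            break;                        /* early end of block: rest stay zero */
        } else if (a->is_ZRL) {
            mdx += 16;                    /* sixteen zeros, no amplitude        */
        } else {
            mdx += a->ZRL;                /* zero run preceding this amplitude  */
            if (mdx < 64)
                mcu[mdx] = a->amplitude * qtable[mdx];
            mdx++;
        }
    }
}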
Huffman lookup
The jpeg_dht_lookup() method referred to in the last section is the procedure that performs
the Huffman lookup as described previously, and extracts the
following encoded amplitude. As this method is dealing with variable length codes, it works
on a barrel shifter variable (jfif_barrel—defined in the jfif class and passed in to jpeg_dht_lookup()).
On calling the method, matches are made on the barrel shifter bits from the smallest bit width
upwards until either a match is made, no match is found (an error) or a marker is found. The
code is fetched via the jpeg_get_bits() method, which returns the number of bits requested
from the barrel. If not enough bits are available on the barrel for the requested size it will fetch
more bits from the input, and add them to the barrel. If it detects a marker (ffh followed
by a value byte), it will return the marker value ORed with ffff0000h to distinguish
it from a code. Otherwise the 'n' top bits of the barrel are returned. The bits are removed
from the barrel if the 'remove_bits' argument is set. Until a code match is found, the bits
must not be removed from the barrel. Only when the bit width of the code is determined is it
safe to update the barrel, and delete the bits returned.
Once the jpeg_dht_lookup() method
has found a matching bit width, it extracts the code (and flags to delete from the barrel) and
does the table lookup to get the byte value. The value is checked for being EOB or ZRL, and
the return structure (rle_amplitude_t) flags updated if so. If not EOB or ZRL, the amplitude bitwidth is
extracted from the decoded value, and these amplitude bits extracted from the barrel (via jpeg_get_bits())
and passed on to jpeg_amp_adjust(), which decodes the value to undo the encoding as per
Table 2 in the previous article. Any zero run-length value
from the Huffman decoded value is written to the return structure, and this returned to the
calling method.
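The adjustment itself follows the standard JPEG magnitude convention: a code of a given bit size whose top bit is clear represents a negative value. A stand-in sketch for jpeg_amp_adjust() (not the actual method) is:
/* Undo the JPEG amplitude encoding for a 'bitsize'-bit code. */
int amp_adjust_sketch(int code, int bitsize)
{
    if (bitsize == 0)
        return 0;                            /* no amplitude bits follow        */

    if (code < (1 << (bitsize - 1)))         /* top bit clear: negative value   */
        return code - ((1 << bitsize) - 1);

    return code;                             /* top bit set: value used as-is   */
}
For example, with a bit size of 3, the codes 100b to 111b decode to +4 to +7, and 000b to 011b decode to -7 to -4.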
Inverse DCT
For the inverse DCT, a fast implementation after Arai et al., and detailed in [1],
has been developed and tested, and the discussions in this article assume this method. If
JPEG_FAST_INT_IDCT is defined at compile time, this method is selected.
In addition, another iDCT method is supplied (jpeg_idct_slow()) which is a simple
implementation that follows the equations discussed in the previous
article, and is adapted from [2]. A pair of double loops
are implemented for a one dimensional iDCT of rows and then columns. This method is used
if JPEG_FAST_INT_IDCT is not defined at compile time. The method can be
full precision, or be integer fixed point (when compiled with JPEG_DCT_INTEGER defined).
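For reference, the one dimensional transform used by the slow method is a direct implementation of the textbook equation, along the lines of the floating point sketch below (the naming is illustrative and does not match the model's source).
#include <math.h>

/* Plain 1-D inverse DCT over 8 samples, applied first to rows then columns. */
void idct_1d_sketch(const double in[8], double out[8])
{
    const double pi = 3.14159265358979323846;

    for (int x = 0; x < 8; x++) {
        double sum = 0.0;
        for (int u = 0; u < 8; u++) {
            double cu = (u == 0) ? (1.0 / sqrt(2.0)) : 1.0;   /* C(u) normalisation */
            sum += cu * in[u] * cos((2.0 * x + 1.0) * u * pi / 16.0);
        }
        out[x] = sum / 2.0;
    }
}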
Colour Space Conversion and bitmaps
The C reference model is capable of generating a bitmap output from the JPEG input, converting
the YCbCr data to RGB (if not RGB already), and constructing a valid 24 bit RGB bitmap. This functionality
is not part of the HDL hardware core implementation, but is added to allow test, comparison and
validation in a simple raw standard format. As such, only a fragmentary description will be given here.
The YCbCr conversion to RGB is performed by the jpeg_ycc_to_rgb() method. This takes a full set
of Y and C MCUs (potentially sub-sampled) and expands them into a block of RGB triplets—which
could be 8x8 if no sub-sampling, or 16x8 for 4:2:2 etc. It is passed
the final image dimensions, and clips the MCU padded data for non MCU aligned boundaries. At the heart of the
method is the conversion functions as detailed in the previous
article. The calculations are slightly different for full precision versus fixed point integer
(if JPEG_DCT_INTEGER defined), but are essentially the same operation.
The jpeg_ycc_to_rgb() method is passed a buffer pointer (*optr) to a block of memory (a 2 dimensional array,
big enough for a maximally sub-sampled YCC set), in which to place the RGB data.
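For reference, the conversion at the heart of the method is the standard JFIF mapping, sketched below in floating point form (the model's fixed point variant, used when JPEG_DCT_INTEGER is defined, is the same operation with scaled integer constants).
/* Clip a converted component into the 0..255 pixel range */
static unsigned char clip_rgb(double v)
{
    return (v < 0.0) ? 0 : (v > 255.0) ? 255 : (unsigned char)(v + 0.5);
}

/* Standard JFIF YCbCr to RGB conversion for one pixel */
void ycc_to_rgb_sketch(int y, int cb, int cr, unsigned char rgb[3])
{
    rgb[0] = clip_rgb(y + 1.402    * (cr - 128));                          /* R */
    rgb[1] = clip_rgb(y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128));  /* G */
    rgb[2] = clip_rgb(y + 1.772    * (cb - 128));                          /* B */
}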
The jpeg_ycc_to_rgb() method only works on the actual image data. Two other functions actually construct
the full bitmap file: jpeg_bitmap_init() and jpeg_bitmap_update(). The first is called after the header
extractor routine, once the image dimensions are known. The X and Y dimensions are passed to the
method, which allocates memory big enough for the image, including a header, configures the
header appropriately, and returns a pointer to the buffer created. jpeg_bitmap_update() is called
immediately after jpeg_ycc_to_rgb(), and takes the buffer returned by that method, and places the
data in the appropriate locations of the bitmap buffer, reversing the image from top to bottom to
be bottom to top, as for bitmaps. It is this bitmap buffer that is returned by the top level JPEG
method jpeg_process_jfif().
The jpeg_process_jfif() method (and its sub-functions described above) is meant to be used as a stand-alone
library class, and also has a C linkage function available, jpeg_process_jfif_c(), and can be compiled
as such with the provided make scripts. However, a brief mention of the file jfif_main.c ought to be
made, which uses jpeg_process_jfif_c() to produce a command line based executable for exercising the
function for purposes such as verification, generation of bit comparison vectors, etc. Details of the usage
of the executable, along with source code download instructions and links, can be found here.
The main() function in jfif_main.c provides code for a simple command line interface, file input and output and
also a graphical display window for viewing the images (using the GTK+2 library—not available when compiled with
MSVC).
Decoder HDL implementation
This section describes the HDL implementation, in Verilog, of the JPEG decoder, detailing the
sub-blocks, code details and functionality.
Decoder Top Level
The decoder top level consists of four major functional blocks, a data input and a data output
interface, and some error and debug ports. The four major blocks consist of a data aligner
(barrel_shifter12x4) for taking the 32 bit inputs and extracting the variable size words required internally;
a header extractor (extract_hdr) for parsing the jpeg header and presenting the data for image decoding;
a Huffman decoder (huff_decode) for the initial decompression; and an inverse DCT block (jpeg_idct)
for the decoding, with normalisation and clipping. The diagram below shows these main four blocks,
and their general relationship to each other.
Figure 7: Decoder Top Level
In essence, for a given image input, the header extractor will be active, parsing the
header data, whilst the Huffman decoder (and hence the iDCT) will be idle. Once the
header data is processed, the Huffman decoder and iDCT are activated to process the
image until EOI. The control of this state resides in the header extractor block.
Interfaces
The JPEG decoder top level has a very simple interfacing structure. A single clock and reset input
port is defined, and the decoder functions all run from a single clock domain. The resets
are configurably synchronous or asynchronous.
The input data port is a simple valid/acknowledge interface, with a 32 bit data input. There are
4 valid bits for each byte within the 32 bit word, though the valid must be contiguous and
start from bit 0. That is, a new file must be input as 32 bit data words, except for the last word,
which may contain an odd number of bytes, from 1 to 3.
Data is transferred only when the valid is non-zero and the acknowledge is active (high). Similarly,
the output port is a valid/acknowledge interface, this time with 64 bit data output and a single valid
output bit (as the output is always padded to 8 bytes).
The data ports constitute the main functional interfacing to the block, but an error port is also
provided. An 8 bit error code is output, which flags an error condition if non-zero. Upon
encountering an error, the decoder outputs the error code and halts. The error condition can only
be cleared using the i_clr_error input or by resetting the block. Using i_clr_error will place the
decoder in a 'searching for SOI' state, so any further data from the error causing input will be ignored.
Note that an error condition can only be generated from processing the header information. Once the
decoder state is accepting data input, it will process the data without any further checking.
There are 6 possible error states that can be reported:
- 8'h01 : Invalid code error, set if a marker of invalid value is encountered.
- 8'h02 : Unsupported code error, set if a marker is encountered that indicates a format not supported by the decoder.
- 8'h03 : Unexpected code error, set if hunting for a particular marker, and another marker is encountered
- 8'h04 : Missing EOI error, set if data terminates without seeing an EOI
- 8'h05 : Expecting SOI error, set if a valid marker is seen before an SOI has been seen
- 8'hff : Internal error, set if some internal state failure occurred
In addition to an error code port, a debug interface is provided to output internal header state.
The ports are very wide and would not normally be used. However, they can
be mux'ed down to a narrower debug bus for routing to external pins etc., if required, or left unconnected.
The ports are used by the simulation test harness for the decoder, and so must be available
at the top level of the core.
Data Aligner
The data aligner block consists of a barrel shifter that is 12 bytes wide, and has 4 byte
data input and output interfaces. The input side has a unary 4 bit valid input for adding
1 to 4 bytes, and is connected directly to the main block input port. The shifter returns
an ack (o_ack) and data is transferred only when both the ack is set and the valid is non-zero.
On the output interface, the valid indicates a byte count, which is 0 to 3 when the barrel shifter has
0 to 3 valid bytes remaining, and is 4 when the shifter has 4 or more valid bytes. The internal
blocks (either the header extractor, or the Huffman decoder) return an ack (i_ack), and data
is transferred only when both the ack is set and the valid is non-zero.
The purpose of the data aligner is to allow a straightforward 32 bit interface on the decoder
block boundary, with 32 bits transferred for all inputs during a decode, except for the last
word, and the internal circuitry to extract arbitrary numbers of bytes, from 1 to 4, as necessary
to account for the various header data alignments etc. This immunises the external circuitry from
complex variable length accesses, and also breaks the internal timing from external timings,
limiting them only to the interface itself.
Header Extractor
The header extractor block serves as the main control for the decoder. It is responsible for parsing the
incoming jpeg header, extracting the relevant header information and storing it for subsequent decoding of the scan data.
Errors and unsupported formats are handled by the block. The Huffman and quantisation table data
are not stored locally, but are written to buffers within the Huffman
decode block.
At the heart of the extractor is a state machine that walks through the header decode by searching for
markers of appropriate type and in the appropriate sequence. Any variation from valid markers and their
order is flagged as an error. The diagram below shows the state diagram for the FSM.
Figure 8: Extractor FSM
After a reset, the FSM sits in state WAIT_SOI, requesting bytes from the data aligner, and consuming them, until an SOI
marker is seen at the input. At this point, two bytes are requested (markers are 16 bits), and the state
moves to DECODE_MKR. In this state another marker is expected, and logic tests whether the data input is
a valid marker, whether a supported marker type and whether it is expected at this point (say, not an EOI straight after
the SOI, for example). If an error occurs, the state moves to the ERROR state and remains there, and the error
code output port is updated with the type of exception. All being well, though, the state moves to PROCESS_HDR,
having latched the 16 bits of data length (data_len) associated with this second marker. Data is now continuously
requested from the data aligner, counting down until all bytes for the segment are processed. What happens to the data
depends on the type of marker currently active (curr_mkr).
The first marker after SOI that is of interest is likely to be DQT, for one or more quantisation tables.
The table consists of a Pq/Tq byte followed by 64 bytes of quantisation data. Pq is of no interest (it's fixed at 8)
but the Tq value is latched by the extractor. All the following 64 bytes of quantisation data are stored in
a buffer in the Huffman decoder (dqt_tbl). The Tq value specifies one of 4 destinations (exported as o_wr_tbl_sel)
and, along with an index (o_wr_tbl_idx) forms the address to be stored in that buffer. The interface is a simple
memory like interface with a write strobe (o_wr_dqt) that is used to write the data to the buffer. If, after processing the quantisation
data, the number of bytes processed is less than the data length seen after the DQT marker, then additional
tables are present, and the process is repeated. However, it is valid for each table to have individual
DQT segments as well, and the logic supports both types, or a combination of the two.
After the quantisation table data is likely to be the SOF0 marker. When curr_mkr indicates this, the bytes are extracted
and placed in a buffer (SOF0) in byte order, except the 16 bit X and Y fields which are byte swapped for
ease of decode. The buffer has enough space for the main part of the header and up to 3 image component
parameters. The buffer values are exported on a single wide port (o_sof0) for use by the
Huffman decoder, and routing to the debug port.
The next segment is likely to be one or more Huffman tables. When curr_mkr is JPEG_MKR_DHT, the data comprises
a Tc/Th byte, 16 length bytes, followed by a variable number of code bytes (the sum of the length byte values).
The first byte is extracted and stored locally (dht_Tch), but the length bytes and codes are stored in
buffers in the Huffman decoder. Therefore the logic separates the bytes that are lengths (the next 16 after Tch)
and those that are codes (i.e. the rest). The data is exported from the block over a RAM like write interface,
flagging writes for the lengths (o_dht_write_indexes) or the codes themselves (o_wr_dht).
A set of "endcodes" for
each of the bit widths are calculated on the fly, whilst processing the length bytes, and are exported
with the length data for storage in the Huffman decoder block. With reference to the
Huffman table example in the previous article, the "endcodes" are values one greater than the last encoding for a
given bit length. For example (referring to the diagram), there are four 3 bit codes, ranging from 010b to 101b,
making the endcode for 3 bit codes 110b (or 6). The logic calculates these endcodes by taking the last
bit width's endcode (initially 0) and doubling it, adding the value for the current bit width length. This
includes where lengths are 0, such that just the doubling is applied. These endcodes will be useful in
the Huffman decoding, allowing determination of the encoded variable-length code boundaries—more in
later sections. The Tc/Th data specifies one of four possible
destination tables, and the interface exports this table selection (o_wr_tbl_sel) along with the data,
and buffer indices. If, after processing a DHT table, the bytes processed is less than the length specified
with the marker, then there is more than 1 table, and the process is repeated until all the table data is processed.
Note, however, that multiple tables can each have their own DHT marker, with a single table following. The
logic will handle both cases, or a combination of both.
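The endcode calculation is compact enough to show as a C mirror of the RTL behaviour (illustrative only; the actual logic performs this serially as the length bytes stream past).
/* Endcodes from the 16 DHT length counts: double the previous endcode and
 * add the current bit width's count, including when the count is zero. */
void calc_endcodes_sketch(const unsigned char Ln[16], unsigned int endcode[16])
{
    unsigned int prev = 0;

    for (int w = 0; w < 16; w++) {
        endcode[w] = (prev << 1) + Ln[w];
        prev = endcode[w];
    }
}
Applied to the earlier DHT_offsets_t example (lengths 0, 1, 4, 3, 0, ...), this yields endcodes 0, 1, 6 (110b) and 15 (1111b), matching the row_break_codes shown there.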
Finally, the last marker of interest is SOS. This defines parameters for each of the image components
(either 1 for black-and-white, or 3 for colour images). The values are simply extracted into a local
buffer (SOS), but with the last 3 bytes dropped, as they are fixed values and can safely be ignored.
The buffer values are exported on a single wide port (o_sos) for use by the
Huffman decoder.
Once a marker and its following segment data have been processed, the FSM will move either to DECODE_MKR to
process the next marker, if the last was not SOS, or it will move to PROCESS_SCAN to process the
image data. At this point control is handed to the Huffman decoder (o_scan_data_active is set), and
the state machine waits for the block to inform it that the scan data processing is complete
(i_end_of_scan set by Huffman decoder). If the Huffman decoder reports an error, then the state
moves to ERROR and halts. If not, then either the state returns to WAIT_SOI if the iDCT block
is flagging that it is flushed (i_dct_flushed)—i.e. no pending data processing to complete—or it goes to
WAIT_DCT_FLUSH to wait until the iDCT is flushed, before returning to WAIT_SOI.
If the FSM finds itself in the ERROR state, then it remains there until an external signal (i_clr_error)
is set to clear the error state. Also, a full block reset forces the state back to WAIT_SOI.
Huffman Decoder
The Huffman decoder's job is to extract the incoming scan data, and decode to a set of de-quantised
and de-zigzagged values, using the data and tables extracted by the header extractor block. We'll consider
this function in three steps. The scan data consists of Huffman encoded bytes that have a zero run-length
nibble, and a bit width. So the first step is to decode these bytes with the Huffman tables extracted previously.
The second step is to use the decoded bitwidth to extract the correct number of following bits which is the
coded amplitude value (as per table shown in previous article), which is decoded
to a true amplitude. Lastly this amplitude data is de-quantised, de-scaled and de-zigzagged into a swing buffer,
before being output towards the iDCT. The following sections describe these functions.
Table Selection
During header extraction, tables were updated for the Huffman decoding of the data and Quantisation.
There are up to four sets of tables, selected on class (i.e. whether for DC data or AC data) and ID (either 0 or 1). When
processing scan data, which tables are used at any one time needs to be determined. This depends
on whether the 8 by 8 MCU is for luminance (Y data) or one of the chrominance components (Cb or Cr), and whether the
MCU data is the DC value at index 0, or the AC values. The matter is further complicated by the fact that sub-sampling
means that there may be anywhere from 1 to 4 luminance MCUs before the chrominance MCUs.
In order to determine the appropriate tables to use, the logic must determine how many
luminance MCUs to expect before the two chrominance MCUs, and flag the incoming MCUs
appropriately. The SOF0 data extracted from the header has the Hi and Vi values, and the number of
luminance MCUs, 'y_arrays', is calculated from these. An 'array_count' is kept to mark each MCU as it's
processed, from which the MCUs are marked as luminance (is_lum) or chrominance (~is_lum, and
indexed with chrome_idx). Flags are generated to mark the end of each MCU (array_complete)
and the end of the last MCU in a group (final_array_complete).
Quantisation table selection (Tq) is a simple matter of selecting the appropriate Tq value from
the extracted SOF header data, depending on whether luminance MCU, or the chrominance index.
The Tq value selects which quantisation data to use for the MCU being processed. For the Huffman
table selection, huff_idx is generated from the chrome_idx and is_lum state. This is used to select
the appropriate DC and AC IDs from the extracted SOS data (sos_td, sos_ta). These values,
combined with the is_ac state (active for all MCU data except the first amplitude, which is DC),
generate the actual Huffman table selector, huf_tbl_sel.
Amplitude Decode
The amplitude decode consists of a two stage process: a "match" phase and a "lookup" phase (marked
by the 'match_not_lookup' state). In the match phase
the input data is matched against the Huffman bit width endcodes, to find the valid bitwidth of the
code, and pull out those bits from the input.
The code matching is done using a priority encoded matching across the incoming bits (bits_data). The bits
are masked both against the incoming valids (bits_valid) and the Ln valids. The Ln valids were extracted
from the header, flagging which bit lengths have associated codes. There are sets of valids for each class and ID,
and the table used (dht_ln_valid) is selected based on the MCU's ID and whether the DC value or the AC values.
Having thus masked the input bits, all matches are done for all valid bit widths, selecting the smallest
matching code. A match is successful for a given bit width if the value for that bit width is less than the
endcode (from the previously extracted header data) of that width.
An offset is calculated (vn_offset) from the next smallest bitwidth's
endcode value, multiplied by 2 (i.e. this doubled endcode subtracted from the extracted input bits).
During header extraction, a set of tables of offsets were constructed (dht_vn_offset) which mark the
offsets into the Huffman table (dht_tbl) where the start of the codes for each bit width begin.
The calculated offset is added to the offset table entry for the matching bit width,
to give a table index (vn_index) used during lookup. During this match phase, the barrel shifter is
acknowledged (bits_req) for the number of bits in the matched code, and thus deleted.
Note that it is possible for there to be no valid code match during the match phase. This is either because
there is no possible match (an error condition) or because the number of valid bits at the input
is insufficient for a match. For the former case, an error is output (code_error), but for the latter
the 'match' state is held until such time as there are enough input bits to determine the valid code
(or flag an error if still none found).
During the lookup phase, a code is extracted from the selected Huffman table (dht_tbl). This 8 bit value
consists of a zero run length (in the upper nibble—'zeros') and a bit length for the following amplitude code
(lower nibble—'bitsize'). It is also possible that the looked-up code could be an EOB (end-of-block) or a ZRL
(zero-run-length) code. Neither of these codes is followed by an amplitude code, but the lower nibble is zero
for both of them, and so no bits are removed from the barrel shifter. For the normal case, the amplitude
code is extracted from the barrel data bits, adjusted for the encoding (amplitude) and marked as valid
(amp_valid). An additional step is made for the DC value, in that the amplitude is a difference value
and is added to running DC counts (dc_value_x), with four DC values kept— one for each table type. Along with the
amplitude and valid flag, an 'address' is calculated (amp_write_addr). An MCU index is kept whilst processing
the data of a given MCU (mdx). In the normal course of things, this would simply increment for each amplitude
processed. However, codes that have a zero run length of 1 or more, have this value added to the increment.
The amplitude write address thus 'jumps' over the zeros implied in the code. The ZRL code is a special case
which indicates 16 zeros (and no amplitude), but since its code has an upper nibble of 0xf (15), adding this
to the increment gives the correct jump for 16 zeros. The amplitude valid flag is suppressed for this case.
An EOB code also suppresses the valid, and resets the MCU index to 0, as it indicates early termination
of the current MCU and implies that all the rest of the MCU consists of zero values.
The figure below gives an overview of the match/lookup logic.
Figure 9: Huffman Decoder (to Amplitude Decode)
Dequantise and Buffering
The first part of the decoding process generates a stream of validated amplitudes and their address within
their received serial sequence. The second part of the process consists of reconstructing an 8x8 array of
absolute amplitude values. To do this, the logic must de-quantise the amplitudes generated by the first
part, and de-zig-zag the serialised stream.
The de-quantising consists of selecting either the appropriate running DC value (curr_dc_value) or the
AC amplitudes (amplitude) as the value for dequantising (qamplitude), which is now a signed variable.
The quantisation table selected by Tq (dqt_tbl) has the appropriate quantisation value extracted (curr_dqt_value),
using the amp_write_addr as an index. The actual quantisation table value is a prescaled version of the
actual DQT value from the header which allows it to be multiplied with the amplitude rather than divided,
using implied fixed point arithmetic. This signed 19 bit value is multiplied with the 12 bit qamplitude
value to yield a 32 bit result (dequantised_val). This is descaled to a 16 bit value (descaled_value) by
simply truncating the 32 bit multiplication result.
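The operation amounts to a fixed point multiply and shift, roughly as below. The fraction width QSCALE_BITS is an assumption for illustration; the real position of the binary point is fixed by the pre-scaling applied as the quantisation values are written during header extraction.
#include <stdint.h>

#define QSCALE_BITS 12   /* hypothetical fraction width of the pre-scaled table */

/* Multiply the signed amplitude by the signed pre-scaled quantisation value,
 * then truncate the 32 bit product back to a 16 bit descaled result. */
int16_t dequantise_sketch(int32_t prescaled_q, int32_t qamplitude)
{
    int32_t dequantised_val = prescaled_q * qamplitude;
    return (int16_t)(dequantised_val >> QSCALE_BITS);
}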
The descaled value is then written into the swing buffer block (swing_buff). The address for the
write is a de-zig-zagged version of the amp_write_addr, mapping the serial address to the location on the 8x8
matrix. Within the swing buffer, a set of 64 valid bits are kept to mark which locations within
the buffer have been written since the start of an MCU. Because of the zero run length 'skips' in
amplitude sequences, not all entries will have an update for the current MCU. Instead of sequencing
a set of writes for the skipped zeros, the valids will flag the updated entries with those that
are implied as zero, and will be forced to be zero when read over the output. The last entry in the
buffer is also not guaranteed to be written (i.e. if the MCU terminated with an EOB). So, to mark
the buffer full, a flag 'last_mcu_word' is generated when, or just after, the last amplitude is written,
informing the swing buffer that no more data will be coming for the buffer and to mark it full.
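The de-zig-zag mapping mentioned above is a fixed reordering, and can be generated by walking the anti-diagonals of the 8x8 block, as in the standalone sketch below (the RTL simply uses the equivalent fixed mapping).
/* Build zz[] such that zz[s] is the raster (row-major) index of the s'th
 * value in the serial zig-zag scan order. */
void build_dezigzag_sketch(int zz[64])
{
    int s = 0;

    for (int d = 0; d < 15; d++) {              /* d = row + column             */
        for (int k = 0; k < 8; k++) {
            int row = (d & 1) ? k : d - k;      /* odd diagonals run down-left,
                                                   even diagonals run up-right  */
            int col = d - row;

            if (row >= 0 && row < 8 && col >= 0 && col < 8)
                zz[s++] = row * 8 + col;
        }
    }
}
With this mapping, the swing buffer write address corresponds to zz[amp_write_addr] (under this illustrative naming).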
A full buffer is available for output towards the iDCT block. The decode logic can fill the next available
buffer whilst the full buffer is output to the iDCT. Obviously, if both buffers become full, the decode
logic must stall, and a flag is generated when this happens (stall). The swing buffer makes data available
to the iDCT block as a set of 64 bits (column of 8 bytes). The output bits are masked, however, by the
buf_valid bits set during input, so that invalid values are forced to 0. The swing buffer block also
generates an empty status (sb_empty) for use in flagging when a complete image processing has completed.
The diagram below shows a rough block diagram of what is described here.
Figure 10: Huffman Decoder (Dequantise and Buffer)
Inverse DCT
The inverse DCT block (jpeg_idct) has three major functions. The central function is the 1 dimensional
DCT (idct_1d) used to process 8 bytes at a time. Peripheral to this is the buffering and co-ordinating of data
flow to efficiently process amplitudes through the iDCT twice and then output. Finally, the processed data
must be normalised and clipped before being sent over the output port. The figure below shows an outline
sketch of these functions.
Figure 11: iDCT Module
iDCT Control
The philosophy for the iDCT control is to accept as much input data from the upstream logic
as possible. Each set of 8x8 data matrices must be pushed twice through the idct_1d block—once for
columns and once for rows. The data comes in as columns, passes through idct_1d and ends up in
a swing buffer, ready for processing the buffered data as rows. Subsequent input data for the next MCU
can start to be processed into the other buffer within the swing buffer block. The completed iDCT processed
column data is sent as rows back through idct_1d towards the output, and is not rebuffered (except
through a short FIFO at the output of the block).
The above description appears as if the process is quite linear, with columns processed, then rows and
output. However, the one-dimensional iDCT block is highly pipelined, and data from the input or the
buffers can be mixed up without interfering with each other. Priority is always given to the input data,
to move the pipe as far forward as possible, and create the least amount of 'bubbles' upstream. It will
be stalled by the iDCT block only once both swing buffers are in use, and a new 8x8 data set is waiting.
However, if the input stalls upstream, and no valid input data is available to the iDCT, the block will process
buffer data, and send it into the idct_1d module. If input becomes available again (and has buffer space available),
even if buffer data is only part way through processing, then the block will start processing input once again.
When two buffers are active, preference is given to the buffer filled first, as otherwise the data
becomes out of order. It does have the advantage that the older data will be further advanced through
the iDCT and will clear sooner, freeing a buffer for more input.
As can be seen from Figure 11, the flow of data is controlled via a multiplexer. A signal (select_input[1:0]) is
generated which controls the mux, selecting the input data (select_input[1] set) whenever there is input
available and a buffer to send it to. If input is not selected, the data is taken from one of the buffers
(buf0 or buf1), depending on a selection flag (buf_to_process). This flag simply swaps between buffers
as each is completed. Similarly a register, buf_to_fill, swaps between buffers as a flag for the destination
buffer to use for input column data. A buf_full status is also generated to indicate which buffers have
data ready for processing from the input—i.e. full of column data, and not available for input.
The full status is cleared when the last column is sent back into idct_buf.
Since the idct_1d's pipeline can have data from multiple sources at any given time, attribute data must be sent
along with the input/column data in order to determine what to do with the output from idct_1d. There are
five control bits sent in with the data (s_ctl). The first two bits determine the destination of the output data,
with 01b being for the output (via the normalisation), 10b for buffer 0 and 11b for buffer 1 (00b is not used).
The s_ctl[1:0] bits are generated directly from the select_input signal. Along with the destination bits, three
index bits are generated, indicating the column write location in the buffer (and thus is not needed for
output data). When data emerges from the idct_1d block, the control bits (op_ctl) are used to either
generate a valid signal on the output (o_valid), if row data, or perform writes to the indicated buffer.
One dimensional iDCT
The one dimensional iDCT is a highly pipelined implementation that attempts to minimise the
number of complex operations (i.e. multiplications) to facilitate the transformation. It is based around
the algorithm of Arai, Agui and Nakajima, via the Independent JPEG Group's library, but pipelined
for a hardware implementation. It reduces the iDCT to a set of additions and only 5 multiplications.
In order to achieve this, a scaling factor is required after de-quantisation, but this can be bundled
into the quantisation table, absorbing the scaling factor into the header quantisation values extracted
when processing the header. This requires only one additional multiplier to scale the quantisation data,
which is serially processed in the header extractor block and scaled as it is written into the table buffers
in the Huffman decoder block.
Details of the algorithm will not be discussed here (see [1], section 4.3.5), but
the diagram below shows the seven stages of the pipeline, highlighting the multiplications. The other
operations are all additions, subtractions or inversions. The control information is passed through
pipeline stages with their associated data.
Figure 12: One Dimensional iDCT Pipeline
The variables shown in Figure 12 refer to the C model implementation, and show variable re-use. In
the RTL implementation, this reuse is not done, and each stage has its own set of registers to store
results, so that each stage is fully independent of the others, taking only the previous
stage's results from the previous clock cycle as its input. The hardware implementation is, however,
organised in exactly the same manner, and performs the same calculations.
Normalisation and Clipping
The final stage in the processing of scan data is the normalisation and clipping of the data. Until
this point, the data processing since initial scaling has been using 16 bit arithmetic, with the iDCT
rescaling the multiplications back to 16 bits after each operation. The pixel values need to be
8 bits, and turned back into positive values (0 to 255, rather than -128 to 127). This is done on
the data emerging at the output of the iDCT where the 16 bit values are truncated to 8 bits, and 128 is added.
If the values fall below 0, or above 255, then they are clipped (rather than wrapped). The data stream from this
normalisation now consists of the fully decoded image pixel information. Between this normalised data and the
decoder's output interface is a short FIFO (not shown in Figure 11), placed there to simply make the internal
logic timing independent of the external acknowledge, whose timing within the clock cycle cannot be guaranteed.
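The per-sample operation is simple, and behaves as sketched here (names are illustrative).
#include <stdint.h>

/* Level-shift an iDCT output sample back to the 0..255 range, saturating
 * rather than wrapping. */
uint8_t normalise_clip_sketch(int16_t idct_out)
{
    int v = idct_out + 128;

    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}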
The actual interface at the output is an 8 byte wide data bus (o_data) with a single valid bit (o_valid).
An acknowledge input is used to transfer the data, with data updated when both valid and acknowledge
are active (and at no other time). The order of the data emerging from the decoder is row data (starting with
DC in o_data[7:0]), followed by the rest of the row's AC values, and the subsequent row following, making up
sets of 64 bytes. The order of these sets is the same as in the scan data of the JPEG input. Therefore, this
is luminance (Y) data followed by Cb and Cr sets. In the case of sub-sampled images, there will be multiple
Y data sets before the chrominance data emerges. The o_sos and o_sof0 status ports of the decoder make available
the information to discern the nature of the data that will emerge. It is outside the scope of this decoder
implementation (though implemented in the C model), but a further processor could be contrived to combine the
luminance and chrominance data in a set of RGB values, linearised to form a raw image in tuples of RGB information,
in sequence similar to that of, say, a 24 bit bitmap.
Download
The entire JFIF package can be downloaded as a self extracting windows executable. It includes all the HDL,
the verification environment, synthesis scripts and the documentation. It is released under the terms of the GPL
(the reference model) and the LGPL (HDL source), copies of which are included. The software reference model
can be downloaded separately, if desired. See the document here for instructions.
The full package is available from the link below:
Conclusions
We have looked at a C model of a decoder (the source code of which is available),
and a hardware implementation of the same functionality (the source code of which will be made available
in the near future), all based on the concepts discussed in the previous article.
This has solidified these abstract concepts into practical implementations to demonstrate how real-world
software and hardware implementations are achieved, and provide a case study of such implementations.
Having implemented a decoder, the question arises, how do we know it works for all intended cases? Verification
of the C model and HDL, and the flow used to achieve this, is discussed in another article,
to provide a further case study in the development of a practical piece of digital IP to a high standard.
References
- [1] Pennebaker, W.B. and Mitchell, J.L., 1992, "JPEG Still Image Data Compression Standard", 1st Ed., ISBN 0-442-01272-1
- [2] Nelson, Mark and Gailly, Jean-loup, 1995, "The Data Compression Book", 2nd Ed., ISBN 1-558-51434-1
Copyright © 2014 Simon Southwell
simon@anita-simulators.org.uk