Implementing H.264 on an FPGA is complex at several levels, because real-time video compression demands both high data throughput and heavy computation. Video encoders generally need a lot of logic resources and also fast processing, which becomes a bottleneck in an area- and speed-conscious FPGA design. A high-profile implementation of H.264 inter-frame encoding still poses an engineering challenge even on the latest FPGAs.

Important modules and the complexity estimates

Modules that can exploit data parallelism are referred to as hardware-friendly modules. Intra Processing, SAD and SATD are examples. Not all modules can exploit parallelism, because algorithms such as CAVLC, CAVLD and the reference MB search are highly sequential.

Module / Submodule                   Hardware friendliness (0-5)   Complexity (0-5)
Intra Processing                     4                             3
  Sum of Absolute Difference (SAD)   5                             1
Inter Processing                     2                             5
  Reference frame fetching           1                             2
Quantization                         3                             2
Transformation                       4                             1
Entropy Coding                       2                             4
  CAVLC                              2                             4
  CABAC                              1                             5
Deblocking Filter                    3                             3
Rate Control                         3                             3
Main Controller                      4                             3
MB Controller                        4                             4
NAL generator                        4                             2
  Header generator                   4                             2
  Packetizer                         4                             2
External memory controller           4                             3

Implementation complexity increases as the algorithm becomes more sequential and as the number of processes inside it grows.

Motion estimation

In H.264/AVC encoding, motion estimation can take up to 70% of the computational burden of the complete video compression procedure, making it a bottleneck in terms of power consumption, computing speed and hardware cost. Support for variable block sizes is generally dropped in hardware implementations.
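To make the cost concrete, here is a minimal sketch of an exhaustive (full-search) integer-pixel motion search using SAD as the matching cost. It is an illustration only, not the search strategy of any particular encoder: the function names, the fixed block size and the small search range are assumptions, and frames are assumed to be plain lists of rows of luma samples.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, cx, cy, n=4, search_range=2):
    """Test every integer offset within +/- search_range (hypothetical
    defaults) and return (best_sad, (dx, dy)).
    Candidates that fall outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

The inner SAD loops have no data dependencies between pixels, which is why SAD maps well to parallel hardware, while the number of candidates grows with the square of the search range, which is where the bulk of the computation goes.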

Sub-pixel Resolution

Sub-pixel resolution increases the algorithmic and computational complexity significantly. The decoding portion, which requires performing sub-pixel motion compensation only once per block, takes about 10 to 20 percent of the decoding pipeline time. The bulk of this time is spent interpolating values between pixels to generate the sub-pixel-offset reference blocks. The cost of performing sub-pixel estimation varies with the encoding algorithm, but may require performing motion compensation more than once.

Interpolation Algorithm

The interpolation algorithm to generate offset reference blocks is defined differently for luma and chroma blocks. For luma, interpolation is performed in two steps, half-pixel and then quarter-pixel interpolation. The half-pixel values are created by filtering with this kernel horizontally and vertically:
[1 -5 20 20 -5 1]/32

Quarter-pixel interpolation is then performed by linearly averaging adjacent integer-pixel and half-pixel values. Motion compensation for chroma blocks uses bilinear interpolation with quarter-pixel or eighth-pixel accuracy, depending on the chroma format. Each sub-pixel position is a linear combination of the neighboring pixels.
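The luma steps can be sketched for a single row of samples as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, samples are 8-bit, edge padding is left out, and the rounding offset (add 16 before the shift by 5) and the clip to 0..255 follow the usual integer form of the /32 kernel.

```python
def clip255(value):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].
    Applies the 6-tap kernel [1 -5 20 20 -5 1] / 32 with rounding.
    Assumes indices i - 2 .. i + 3 exist (no edge padding here)."""
    p0, p1, p2, p3, p4, p5 = row[i - 2:i + 4]
    acc = p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5
    return clip255((acc + 16) >> 5)


def quarter_pel(a, b):
    """Quarter-pixel sample as the rounded average of two neighbours."""
    return (a + b + 1) >> 1
```

For example, on the ramp `[0, 10, 20, 30, 40, 50]` the half-pixel sample at `i = 2` lands midway between 20 and 30, and averaging it with the integer sample 20 gives the quarter-pixel value. In hardware the multiplications by 5 and 20 reduce to shifts and adds.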

After interpolating to generate the reference block, the algorithm adds that reference block to the decoded difference information to get the reconstructed block. The encoder executes this step to get reconstructed reference frames, and the decoder executes this step to get the output frames.
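The reconstruction step is small enough to show directly. The sketch below assumes 8-bit samples and blocks stored as lists of rows; the function name is hypothetical.

```python
def reconstruct(pred, residual):
    """Add the decoded residual to the interpolated reference block and
    clamp each result to the 8-bit range 0..255."""
    return [
        [max(0, min(255, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(pred, residual)
    ]
```

Because both encoder and decoder run exactly this step on the same inputs, their reference frames stay identical.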

DCT

Instead of the DCT, the H.264 algorithm uses an integer transform as its primary transform to translate the difference data between the spatial and frequency domains. The transform is an approximation of the DCT that is computationally simpler and exactly invertible in integer arithmetic, so the transform itself introduces no rounding mismatch between encoder and decoder. The core transform can be implemented using only shifts and additions.
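As an illustration of the shift-and-add structure, the following sketch computes the 4x4 forward core transform with a butterfly on each row and then on each column. It covers the core transform only; the scaling that normally follows is folded into quantization and is not shown. The function names are hypothetical.

```python
def transform_1d(x0, x1, x2, x3):
    """One 4-point butterfly: equivalent to multiplying by the rows
    [1 1 1 1], [2 1 -1 -2], [1 -1 -1 1], [1 -2 2 -1]
    using only additions, subtractions and shifts by one."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def forward_transform_4x4(block):
    """Apply the butterfly to every row, then to every column."""
    rows = [transform_1d(*row) for row in block]
    cols = [transform_1d(*[rows[r][c] for r in range(4)]) for c in range(4)]
    # cols[c][k] is the coefficient at row k, column c; transpose back.
    return [[cols[c][k] for c in range(4)] for k in range(4)]
```

A flat block of ones produces a single non-zero coefficient of 16 in the top-left (DC) position, which is a quick sanity check for an RTL implementation as well.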

Quantization

Quantization in H.264 is arithmetically expressed as a two-stage operation. The first stage multiplies each coefficient in the 4x4 block by a fixed, coefficient-specific value. This stage allows the coefficients to be scaled unequally according to importance or information. The second stage divides by an adjustable quantization parameter (QP) value. This stage provides a single “knob” for adjusting the quality and resultant bitrate of the encoding.
The two operations can be combined into a single multiplication and a single shift operation. The QP is expressed as an integer from 0 to 51. This integer is converted to a quantization step size (QStep) nonlinearly. Every 6 steps of QP doubles the step size, and between each pair of power-of-two step sizes N and 2N there are 5 steps: 1.125N, 1.25N, 1.375N, 1.625N, 1.75N.
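The QP-to-QStep mapping and the multiply-and-shift form can be sketched as below. The six base step sizes are the values for QP 0 to 5; everything else here is a simplification: the `scale` argument is a hypothetical stand-in for the coefficient-specific first-stage factor, the multiplier is computed on the fly rather than read from the standard's tables, and the rounding offset is a plain round-to-nearest rather than the offsets an encoder would normally choose.

```python
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]  # QP 0..5


def qstep(qp):
    """Quantization step size for QP in 0..51: the base value for
    qp % 6, doubled once for every 6 steps of QP."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


def quantize(coeff, qp, scale=1.0):
    """Quantize one coefficient with one multiplication and one shift.
    `scale` is an illustrative placeholder, not a value from the standard."""
    qbits = 15 + qp // 6
    multiplier = round((1 << 15) * scale / BASE_QSTEP[qp % 6])
    sign = -1 if coeff < 0 else 1
    return sign * ((abs(coeff) * multiplier + (1 << (qbits - 1))) >> qbits)
```

With these definitions `qstep(4)` is 1.0, `qstep(10)` is 2.0 and `qstep(51)` is 224.0. Only six multipliers per coefficient position need to be stored, since higher QP values reuse them with a larger shift, which keeps the hardware lookup table small.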

Entropy Coding

Using CABAC can improve compression by around 5-7%, but it requires about 30-40% more total processing power.

Rate Distortion Optimization

H.264 video coding is based on the concept of rate distortion optimization (RDO), which means that the encoder has to encode each block using all the mode combinations and choose the one that gives the best rate-distortion performance. Personally, I would not suggest this on an FPGA because of the high memory requirements.
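The mode decision is commonly written as minimizing a Lagrangian cost J = D + lambda * R, where D is the distortion, R the rate in bits and lambda a weighting factor. The sketch below assumes that form; the function name, the mode labels and the numbers in the usage note are hypothetical.

```python
def choose_mode(candidates, lam):
    """Pick the candidate with the lowest cost J = D + lam * R.
    `candidates` is an iterable of (mode_name, distortion, rate_bits)
    tuples, one per trial encoding of the block."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```

For instance, given the candidates `("intra16x16", 400, 20)`, `("inter16x16", 300, 60)` and `("inter8x8", 250, 120)`, a small lambda favours the low-distortion `inter8x8` and a large lambda favours the cheap `intra16x16`. Every candidate has to be fully encoded and its result held until the comparison is done, which is where the memory cost on an FPGA comes from.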

You can consult the author, Tony Gladvin George, about implementing the latest video codecs on FPGA.

   