DiracAlgorithm

From Diracvideo

Jump to: navigation, search

Contents

[edit] Overall Architecture

Overall, the codec is a classic motion-compensated hybrid codec. The coder has the architecture shown below, whilst the decoder performs the inverse operations.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Overall hybrid encoder architecture

There are four main elements or modules to the coder:

   * Transform and scaling involves taking frame data and applying a transform (in this case the wavelet transform) and scaling the coefficients to perform quantisation;
   * Entropy coding is applied to quantised transform coefficients and to motion vector (MV) data and performs lossless compression on them;
   * Motion estimation (ME) involves finding matches for frame data from previously coded frames, trading off accuracy with motion vector bit rate;
   * Motion compensation (MC) involves using the motion vectors to predict the current frame, in such a way as to minimise the cost of encoding the residual data.

The following sections describe these modules in more detail, after first describing the rate-distortion framework used throughout the encoder.

Dirac supports interlaced coding by means of coding whole fields rather than frames. Unlike MPEG codecs there is no complex macroblock-by-macroblock switching. Such complexity is not justified as interlace (an analogue compression technique) becomes a legacy system.

The codec can support any frame dimensions and common chroma formats (444, 422, 420 but not 411) by means of picture padding. The padding ensures that the wavelet transform can be applied properly.

[edit] Rate-Distortion optimisation

The key to making good decisions in compression is to be able to trade off the number of bits used to encode some part of the signal being compressed, with the error that is produced by using that number of bits. There is no point striving hard to compress one feature of the signal if the degradation it produces is much more significant than that of compressing some other feature with fewer bits. In other words, one wishes to distribute the bit rate to get the least possible distortion overall. So how can this be done?

Rate distortion can be described in terms of Lagrangian multipliers. It can also be described by the Principle of Equal Slopes, which states that the coding parameters should be selected so that the rate of change of distortion with respect to bit rate is the same for all parts of the system.

To see why this is so, consider two independent components of a signal. They might be different blocks in a video frame, or different subbands in a wavelet transform. Compress them at various rates using your favourite coding technique, and you tend to get curves like those in the figure below. They show that at low rates, there is high distortion (or error) and at high rates there is low distortion, and there is generally a smooth(ish) curve between these points with a nice convex shape.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Rate-distortion curves for two signal components

Now suppose that we assign B1 bits to component X and B2 bits to component Y. Look at the slope of the rate-distortion curves at these points. At B1 the slope of X's distortion with respect to bit rate is much higher than the slope at B2, which measures the rate of change of Y's distortion with respect to bit rate. It's easy to see that this isn't the most efficient allocation of bits. To see this, increase B1 by a small amount to B1+Δ and decrease B2 to B2-Δ. Then the total distortion has reduced even though the total bit rate hasn't changed, due to the disproportionately greater drop in the distortion of X.

The conclusion is therefore that for a fixed total bit rate, the error or distortion is minimised by selecting bit rates for X and Y at which the rate-distortion curves have the same slope. Likewise, the problem can be reversed and for a fixed level of distortion, the total bitrate can be minimised by finding points with the same slope.

Two questions arise in practise: firstly, how does one find points on these curves with the same slope; and secondly, how does one hit a fixed overall bit budget? The first question can be answered by the succeeding figure: the intercept of the tangent to the rate-distortion curve at the point (R0,D0) to the D-axis is the value D0+λR0 where -λ is the slope at the point (R0,D0). Furthermore it is the smallest value of D+λR for all values of (R,D) that lie on the curve. So in selecting, for example, a quantizer in a given block or subband, one minimises the value D(Q)+λR(Q) over all quantizers Q, where D(Q) is the error produced by quantizing with Q and R(Q) is the rate implied.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: minimisation of the Lagrangian cost function

In order to hit an overall bit budget, one needs to iterate over values of the Lagrangian parameter λ in order to find the one that gives the right rate. In practise, this iteration can be done in slow time given any decent encoding buffer size, and by modelling the overall rate distortion curve based on the recent history of the encoder. Rate-distortion optimisation (RDO) is used throughout Dirac, and it has a very beneficial effect on performance. Control of the example Dirac encoder is by a single parameter ("-qf") that effectively sets Lagrangian parameters for each part of the encoding process.

This description makes RDO sound like a science: in fact it isn't and the reader will be pleased to learn that there is plenty of scope for engineering ad-hoc-ery of all kinds. This is because there are some practical problems in applying the procedure:

1) There may be no common measure of distortion. For example: quantising a high-frequency subband is less visually objectionable than quantising a low-frequency subband, in general. So there is no direct comparison with the significance of the distortion produced in one subband with that produced in another. This can be overcome by perceptual weighting, in which the noise in HF bands is downgraded according to an estimate of the Contrast Sensitivity Function (CSF) of the human eye, and this is what we have done. The problem even occurs in block-based coders, however, since quantisation noise can be successfully masked in some areas but not in others. Perceptual fudge factors are therefore necessary in RDO in all types of coders.

2) Rate and distortion may not be directly measurable. In practice, measuring rate and distortion for, say, every possible quantiser in a coding block or subband cannot mean actually encoding for every such quantiser and counting the bits and measuring MSE. What one can do is estimate the values using entropy calculations or assuming a statistical model and calculating, say, the variance. In this case, the R and D values may well be only roughly proportional to the true values, and some sort of factor to compensate is necessary in using a common multiplier across the encoder.

3) Components of the bitstream will be interdependent. The model describes a situation where the different signals X and Y are fully independent. This is often not true in a hybrid video codec. For example, the rate at which reference frames are encoded affects how noisy the prediction from them will be, and so the quantisation in predicted frames depends on that in the reference frame. Even if elements of the bitstream are logically independent, perceptually they might not be. For example, with Intra coding, each frame could be subject to RDO independently, but this might lead to objectionally large variations in quantisation noise between frames with low bit rates and rapidly changing content.

Incorporating motion estimation into RDO is also tricky, because motion parameters are not part of the content but have an indirect effect on how the content looks. They also have a coupled effect on the rest of the coding process, since the distortion measured by prediction error, say, affects both the bit rate needed to encode the residuals and the distortion remaining after coding. This is discussed in more detail later.

[edit] Transform coding

Once any motion compensation has been performed, motion -compensated residuals are treated almost identically to Intra picture data. In both cases we have three (luminance and two chrominance) components in the form of two-dimensional arrays of data values. The picture (frame or field) component data is coded in three stages. First the data arrays are wavelet- transformed using separable wavelet filters and divided into subbands. Then they are quantised (using RDO quantisers in the reference encoder). Finally, the quantised data is entropy coded.

The architecture of coefficient coding is shown here:

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: coefficient coding architecture

As can be seen, each wavelet subband is coded in turn. Both the quantisation and the entropy coding of each band can depend on the coding of previously coded bands. This does limit parallelisation, but the dependences are limited to parent-child relationships so some parallelisation/multi-threading is still possible.

The only difference between Intra picture coefficient coding and Inter picture residual coefficient coding lies in the use of prediction within the DC wavelet subband of Intra picture components.

At the decoder side, the three stages of the coding process are reversed. The entropy coding is decoded to produce the quantised coefficients, which are then reconstructed to produce the real values. Then, after undoing any prediction, the inverse transform produces the decoded picture component. The reference Dirac encoder has to maintain a local decoder within it, in part so that the result of the compression picture can be viewed at the time of compression, but mainly because compressed pictures must be used as reference frames for subsequent motion compensation else the encoder and the decoder will not remain in sync.

[edit] Wavelet transform

The discrete wavelet transform is now extremely well-known and is described in numerous references. In Dirac it plays the same role of the DCT in MPEG-2 in

   decorrelating data in a roughly frequency-sensitive way, whilst
   having the advantage of preserving fine details better. In one
   dimension it consists of the iterated application of a
   complementary pair of half-band filters followed by subsampling
   by a factor 2:



Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     



Figure: Perfect reconstruction analysis and synthesis

   filter pairs



These filters are termed the analysis filters. Corresponding

   synthesis filters can undo the aliasing introduced by the
   critical sampling and perfectly reconstruct the input. Clearly
   not just any pair of half-band filters can do this, and there
   is an extensive mathematical theory of wavelet filter banks.
   The filters split the signal into a LH and HF part; the
   wavelet transform then iteratively decomposes the LF component
   to produce an octave-band decomposition of the signal.



Applied to two-dimensional images, wavelet filters are

   normally applied in both vertical and horizontal directions to
   each image component to produce four so-called subbands

termed Low-Low (LL), Low-High

   (LH), High-Low (HL) and High-High (HH). In the case of two
   dimensions, only the LL band is iteratively decomposed to
   obtain the decomposition of the two-dimensional spectrum shown
   below:



Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     




Figure: wavelet transform frequency

   decomposition



The number of samples in each resulting subband is as

   implied by the diagram: the critical sampling ensures that after
   each decomposition the resulting bands all have one quarter of
   the samples of the input signal.



photograph of girl (LENA) decomposed by 3-level wavelet transform



Figure: 3-level wavelet transform of LENA



Wavelet filters



The choice of wavelet filters has an impact on compression

   performance, filters having to have both compact impulse
   response in order to reduce ringing artefacts and other
   properties in order to represent smooth areas compactly. It also has an 
   impact on encoding and decoding speed in software. 


There are numerous filters supported by Dirac to allow a trade-off
   between complexity and performance. These are configurable in
   the reference software.
   These filters
   are all defined using the 'lifting scheme' for speed.



The lifting stages are defined as follows. One fiter available in Dirac

   is a lifting approximation of the Daubechies (9,7) 
   wavelet: for this we have (s denoting sum and d denoting difference),





                 s
                 n
                 0
              
              =
                 x
                 
                    2n
                 
              
              
           
        
        
           
              
                 d
                 n
                 0
              
              =
                 x
                 
                    2n+1
                 
              
              
           
        
        
           
              
                 d
                 n
                 1
              
              = 
                 d
                 n
                 0
              
              -(6497⋅( 
                 s
                 n
                 0
              
               +
                 s
                 
                    n+1
                 
                 0
              
              ))/4096
           
        
        
           
              
                 s
                 n
                 1
              
              = 
                 s
                 n
                 0
              
              -(217⋅( 
                 d
                 n
                 1
              
               +
                 d
                 
                    n-1
                 
                 1
              
              ))/4096
           
        
        
           
              
                 d
                 n
                 2
              
              = 
                 d
                 n
                 1
              
              +(3616⋅( 
                 s
                 n
                 1
              
               +
                 s
                 
                    n+1
                 
                 1
              
              ))/4096
           
        
        
           
              
                 s
                 n
                 2
              
              = 
                 s
                 n
                 1
              
              +(1817⋅( 
                 d
                 n
                 2
              
               +
                 d
                 
                    n-1
                 
                 2
              
              ))/4096
           
        
     
     
  




   The magic numbers are integer approximations of the Daubechies lifting ceofficients. This makes the transform fully
   invertible, where a floating point implementation wouldn't quite be. The implementation ignores scaling
   coefficients, since these can be taken into account in quantiser selection by weighting quantiser noise
   appropriately. The problem with this filter is that it has four lifting stages, and so it takes longer
   in software. At the other extreme, there is the (5,3) filter:
   




                 d
                 n
                 1
              
              = 
                 d
                 n
                 0
              
              -( 
                 s
                 n
                 0
              
               +
                 s
                 
                    n+1
                 
                 0
              
              )/2
           
        
        
           
              
                 s
                 n
                 1
              
              = 
                 s
                 n
                 0
              
              +( 
                 d
                 n
                 1
              
               +
                 d
                 
                    n-1
                 
                 1
              
              )/4
           
        
     
     




   We can improve the (5,3) high pass filter, and get an approximating of the Daubechies
   low-pass filter by having more taps in the first lifing stage. This is the Deslauriers-Dubuc
   (9,7) filter:
   




                 d
                 n
                 1
              
              = 
                 d
                 n
                 0
              
              -(- 
                 s
                 
                    n-1
                 
                 0
              
              +9
                 s
                 n
                 0
              
               +9
                 s
                 
                    n+1
                 
                 0
              
              -
                 s
                 
                    n+2
                 
                 0
              
              )/16
           
        
        
           
              
                 s
                 n
                 1
              
              = 
                 s
                 n
                 0
              
              +( 
                 d
                 n
                 1
              
               +
                 d
                 
                    n-1
                 
                 1
              
              )/4
           
        
     
     
  
   


The Deslauriers-Dubuc (13,7) extends this further to provide better frequency selectivity in
   the low pass filter, still only using two lifting stages:
   





                 d
                 n
                 1
              
              = 
                 d
                 n
                 0
              
              -(- 
                 s
                 
                    n-1
                 
                 0
              
              +9
                 s
                 n
                 0
              
               +9
                 s
                 
                    n+1
                 
                 0
              
              -
                 s
                 
                    n+2
                 
                 0
              
              )/16
           
        
        
           
              
                 s
                 n
                 1
              
              = 
                 s
                 n
                 0
              
              +(- 
                 d
                 
                    n+1
                 
                 1
              
               + 9
                 d
                 n
                 1
              
               +9
                 d
                 
                    n-1
                 
                 1
              
              -
                 d
                 
                    n-2
                 
                 1
              
              )/32
           
        
     
     
  



Padding and invertibility



Clearly, applying an N-level wavelet transform requires N levels of

   subsamplings, and so for reversibility, it is necessary that 2N
   divides all the dimensions of each component. 
   So if this condition is not
   met, the input picture components are padded as they are read in, by edge values
   for best compression performance.

[edit] Parent-child relationships

Since each subband represents a filtered and subsampled version of the frame component, coefficients within each subband correspond to specific areas of the underlying picture and hence those that relate to the same area can be related. It is most productive to relate coefficients that also have the same orientation (in terms of combination of high- and low-pass filters). The relationship is illustrated below, showing the situation for HL bands i.e. those that have been high-pass filtered horizontally and low-pass filtered vertically.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Parent-child relationship between subband coefficients

In the diagram it's easy to see that the subsampling structure means that a coefficient (the parent) in the lowest HL band corresponds spatially to a 2x2 block of coefficients (the children) in the next HL band, each coefficient of which itself has a 2x2 block of child coefficients in the next band, and so on. This relationship relates closely to spectral harmonics: when coding image features (edges, especially) significant coefficients are found distributed across subbands, in positions related by the parent-child structure, and corresponding to the original position of the feature. In particular, a coefficient is more likely to be significant if its parent is, and children with zero or small parents or ancestors may have different statistics from children with large parents or ancestors.

These factors suggest that when entropy coding coefficients, it will be helpful to take their parents into account in predicting how likely, say, a zero value is.

By coding from low-frequency subbands to high-frequency ones, and hence by coding parent before child subbands, parent-child dependencies can be exploited in these ways without additional signalling to the decoder.

The wavelet coefficient coding section describes specifically how these relationships are exploited.

[edit] Quantisation

Having transformed the component data, each subband's coefficients

   are quantised by Dirac using so-called uniform dead-zone quantisers. A
   simple uniform quantiser is a division of the real line into
   equal-width bins, of size equal to the quantisation factor Qf:
   the bins are numbered and a reconstruction value is selected
   for each bin. So the bins consist of the intervals


 [
 
   (
   N
   −
   
     1
     2
   
   )
 

   Q
   f
 
 ,
 
   (
   N
   +
   
     1
     2
   
   )
 
 
   Q
   f
 
 ]



for integers N, which are also the labels for the bin, and

   it is the labels that are subsequently encoded. The
   reconstruction value used in the decoder (and for local
   decoding in the encoder) can be any value in each of the bins.
   The obvious, but not necessarily the best, reconstruction value
   is the midpoint NQf. See the diagram, part a, below.



Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     




Figure: Uniform and dead-zone quantisers, with mid

   point reconstruction values.



A uniform dead-zone quantiser is slightly different in that

   the bin containing zero is twice as wide. So the bins consist
   of [-Qf,Qf], with a reconstruction value of 0, together with

other bins of the form



 [
 N
 
   Q
   f
 
 ,
 
   (
   N
   +
   1
   )
 
 
   Q
   f
 
 ]



for N>0 and


 [
 
   (
   N
   −
   1
   )
 
 
   Q
   f
 
 ,
 N
 
   Q
   f
 
 ]



for N<0, with reconstruction points somewhere in the

   intervals. The bin structure is shown in part b above with
   mid-point reconstruction points.



The advantage of the dead-zone quantiser is two-fold.

   Firstly, it applies more severe quantisation of the smallest
   values, which acts as a simple but effective de-noising
   operation. Secondly, it admits a very simple and efficient
   implementation: simply divide by the quantisation factor and

round towards zero. In Dirac, this process is approximated by a multiplication and a bitshift.




    N=Q(v)=⌊ 
     
      
       | v |
      
      
       
        Q
        f
       
       
      
     
     
     ⌋ if v≥0
   
  
  
   
    N=Q(v)=−⌊ 
     
      
       | v |
      
      
       
        Q
        f
       
       
      
     
     
     ⌋ otherwise
   
  
 
 








where the braces ⌊⌋ mean that the

   remainder is to be discarded. The corresponding reconstructed
   value 
   

 
  v
  ˜
  

is given by (an integer approximation to):





     v
     ˜
    
    =0 if N=0
   
  
  
   
    
     v
     ˜
    
    =(N+X)
     Q
     f
    
     if N>0
   
  
  
   
    
     v
     ˜
    
    =(N−X)
     Q
     f
    
     if N<0
   
  
 
 



for a reconstruction point X between 0 and 1. 



A value of X=0.5, giving the mid-point of the interval might

   be the obvious reconstruction point, giving as it does the
   mid-point of the bin. This is indeed what we use for intra pictures.
   For inter pictures (motion-compensated prediction residues) the values of
   transformed coefficients in a wavelet subband have a
   distribution with mean very near zero and which decays pretty
   rapidly and uniformly for larger values. Values are therefore
   more likely to occur in the first half of a bin than in the
   second half and the smaller value of X=0.375 reflects this bias,
   and gives better performance in practice.



This reconstructed value is used by the encoder to produce the

locally decoded component data, which is identical to what the decoder would produce, after decoding the quantised value N.

[edit] Lagrangian parameter control of subband quantisation

Selection of quantisers is a matter for the encoder only.

   The current Dirac encoder uses an RDO technique to pick

a quantiser by minimising a

   Lagrangian combination of rate and distortion.Essentially, lots

of quantisers are tried and the best picked. Rate is estimated

   via a an adaptively-corrected measure of zeroth-order entropy measure Ent(q)
   of the quantised symbols resulting from applying the
   quantisation factor q, calculated as a value of bits/pixel.
   Distortion is measured in terms of the perceptually-weighted
   error fourth-power error E(q,4), resulting from the difference
   between the original and the quantised coefficients:




  E(q,4)=
   
    (
     
      
       
        ∑
        
         i,j
        
       
       
        
         
          | 
           
            p
            
             i,j
            
           
           −Q(i,j)
           |
         
         4
        
        
       
      
     
    )
   
   
    
     
      1
      4
     
    
    
   
  
  
 





The
   total measure for each quantiser q is:




  λ.C.Ent(q)+(
   
    
     
      E
       
        (q,4)
       
       2
      
      
     
     w
    
    
   
  )
 





where w is the perceptual weight associated with the

   subband - higher frequencies having a larger weighting factor -
   and C is a correction factor. Using the square of
   E(q,4) makes it equal to the mean-square error for constant
   values, but in general it gives greater weight to large values 
   than the MSE, for a mixed signal. The correction factor
   compensates for any discrepancy between the measure of entropy
   and the actual cost in terms of bits, based on the actual bit
   rate produced by the corresponding elements of previous
   frames. It is necessary because the entropy measure does not
   take into account dependencies between coefficients that are
   taken into account in the actual 

coefficient entropy coding.



The quantisers are incremented in quarter-powers of 2 - i.e.

   q is an integer approximation of 

 
  
   2
   
    
     
      n
      4
     
    
    
   
  
  
 



for integers n. In
   other words, the quantisers represent the coefficient
   magnitudes to variable fractional-bit accuracies in quarter-bit
   increments.



The Lagrangian parameter λ
   is derived from the encoder quantisation parameter.
   The larger the value, the lower the resulting bit rate, and
   vice-versa.



Clearly, there are a lot of quantisers to search. The current

Dirac encoder implementation speeds things up by splitting the search up into three stages. First, one quarter of the coefficients are used to obtain the best quantiser to bit-accuracy. Secondly, one quarter of the coefficients are again used to refine this estimate to half-bit accuracy. Thirdly, half the coefficients are used to refine the search further to 1/4-bit. In each stage, only a single loop over the coefficients is used to test all the candidate quantisers. The result is much faster than a brute-force search of all the quantisers, and almost as good in performance terms.

[edit] Wavelet Coefficient Coding

Dirac codes subbands from low-frequency to high frequency in a zig-zag
   order. Within each subband, the coefficients are further partitioned 
   spatially divided into code blocks. Coefficients are scanned by coding the
   code blocks in normal raster order and coding the coefficients within each
   code block in raster order within that block. However, each code block
   can be skipped so a skip flag is included in the stream of symbols to be
   coded. The skip flag is interpreted in the decoder as setting all
   coefficients within the block to be zero.



The number of code blocks in a subband depends on the frame type and
   on the subband. Intra frames have a single code block for 7 the 
   lowest-frequency 7 subbands (ie the bottom 2 levels of the wavelet 
   transform) and a 4x3 array of code blocks for the remaining bands. 
   Predicted frames have more code blocks, as coefficients are more likely
   to be zero: 1 block only for the 4 lowest-level subbands; 8x6 for the
   next 3 subbands; 12x8 for the remaining subbands. These values are
   somewhat ad hoc, and probably will ultimately be configurable.



   There are two possible quantisation modes made possible by the code block
   structure. We may either have a single quantiser for the whole subband or
   have different quantisers for different blocks. Currently the software
   only supports the first option, but ultimately both options will be
   implemented, which will allow Region Of Interest coding.



   Entropy coding
   



The entropy coding used by Dirac in wavelet subband coefficient coding is based on three

   stages: binarisation, context modelling and adaptive arithmetic
   coding.



Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     




Figure: Entropy coding block diagram



The purpose of the first stage is to provide a bitstream

   with easily analysable statistics that can be encoded using
   arithmetic coding, which can adapt to those statistics,
   reflecting any local statistical features.



Binarization



Binarization is the process of transforming the multi-valued

   coefficient symbols into bits. The resulting bitstream can then
   be arithmetic coded. The original symbol stream could have been

coded directly, using a multi-symbol arithmetic coder, but this tends to suffer from 'context dilution': most symbols occur very rarely and so only sparse statistics can be gathered, which reduces coding efficiency.



The most obvious way to binarize a symbol is directly: a

   symbol is encoded by encoding the constituent bits of the
   binary representation of its magnitude, followed by a sign bit.
   This is termed bit-plane coding. The problem with this is that
   the number of bit planes depends on the size of the largest value,
   which means that either a lot of unnecessary zero symbols must be sent
   for smaller values or the size (number of bits) of each symbol must be encoded
   somehow.



Dirac uses an interleaved exp-Golomb binarisation. This can be thought of as combining

   a length code with the binary representation. Here's how it works for coding magnitude
   (non-negative) values. 
   



   Firstly, add 1 to your value. Then the binary representation has a
   leading 1: 
   



   N+1 = 1 bK-1...b0
   


   The leading 1 is redundant and doesn't have to be sent. We can prefix each
   data bit with a "follow" bit which tells us there's a new bit to come: 0 means
   there's an additional bit and 1 means terminate. So the code for N will be:
  



  0 bK-1 0 ... 0 b0 1
   



   In coventional exp-Golomb coding, the length code of K zeroes followed by a
   1 would be a prefix to the data bits. By interleaving them we have ensured
   that the value can be decoded in a single loop, two bits at a time, rather
   than in two loops.  
   



Context modelling



The context modelling in Dirac is based on the principle that whether

   a coefficient is small (or zero, in particular) or not is
   well-predicted by its neighbours and its parents. Therefore the
   codec conditions the probabilities used by the arithmetic coder
   for coding the first follow bit on whether the neighbouring
   coefficients or the parent coefficient are zero.



The reason for this approach is that, whereas the wavelet transform largely

   removes correlation between a coefficient and its neighbours,
   they may not be statistically independent even if they are
   uncorrelated. The main reason for this is that small and
   especially zero coefficients in wavelet subbands tend to clump
   together, located at points corresponding to smooth
   areas in the image, and as discussed

elsewhere, are grouped together across

   subbands in the parent-child relationship.



After binarization, a context is selected, and the

   probabilities for 0 and 1 that are maintained in the
   appropriate context will be fed to the arithmetic coding
   function along with the value itself to be coded.



Arithmetic coding



Conceptually, an

   arithmetic coder can be thought of a progressive way of
   producing variable-length codes for entire sequences of symbols
   based on the probabilities of their constituent symbols. For
   example, if we know the probability of 0 and 1 in a binary
   sequence, we also know the probability of the sequence itself
   occurring. So if



P(0)=0.2, P(1)=0.8



then



P(11101111111011110101)=(0.2)3(0.8)17=1.8x10-4

   (assuming independent occurrences).



Information theory then says that optimal entropy coding of

   this sequence requires 

 
  
   
    log�
   
   2
  
  (
   
    
     1
     p
    
    
   
  )=12.4 bits
 



. AC

   produces a code-word very close to this optimal length, and
   implementations can do so progressively, outputting bits when
   possible as more arrive.



All arithmetic coding (AC) requires are estimates of the

   probabilities of symbols as they occur, and this is where
   context modelling fits in. Since AC can, in effect, assign a
   fractional number of bits to a symbol, it is very efficient for
   coding symbols with probabilities very close to 1, without the
   additional complication of run-length coding. The aim of
   context modelling within Dirac is to use information about the
   symbol stream to be encoded to produce accurate probabilities
   as close to 1 as possible.



Dirac computes these estimates for each context by maintaining

   a 16 bit probability word representing the probability of a zero
   value in that context. When a value has been coded, the probability
   is modified by incrementing (if zero was coded) or decrementing
   this value by a delta. 



The delta value itself depends only on the

    probability and is derived from a look-up table. If the probability is near 1 or 0 the delta is approximately
   1/256; if the probability is near 1/2, the delta is approximately 1/32.
   This avoids the need to maintain explicit counts of 1 and 0 and makes
   for very efficient updating. There is no rescaling or division in the computation,
   and the estimate adapts rapidly for balanced probabilities and slowly for
   highly skewed probabilities, as it should.

[edit] Motion Estimation

[edit] Temporal prediction structures

Motion estimation and compensation is the most complex part of any video codec, both conceptually and in terms of computation. The Dirac encoder uses three types of picture. Intra pictures (I pictures) are coded without reference to other pictures in the sequence. Level 1 pictures (L1 pictures) and Level 2 pictures are both inter pictures, that is they are coded with reference to other previously coded pictures. The difference between L1 and L2 pictures is that L1 pictures are forward-predicted only (also known a P-pictures) whereas L2 pictures are B pictures (predicted from both earlier and later references).

The Dirac software employs a picture buffer to manage temporal prediction. Each picture is encoded with a header that specifies the picture number in display order, the picture numbers of any references and how long the picture must stay in the buffer. The decoder then decodes each picture as it arrives, searching the buffer for the appropriate reference pictures and placing the picture in the buffer. The decoder maintains a counter indicating which picture to 'display' (i.e. push out through the picture IO to the application calling the decoder functions, which may be a video player or may be something else). It searches the buffer for the picture with that picture number and displays it. Finally, it goes through the buffer eliminating pictures which have expired.

This decoder process allows for quite arbitrary prediction structures to be employed, not just those of MPEG-like GOPs.

Nevertheless, the encoder operates with standard GOP modes whereby the number of L1 pictures between I pictures, and the separation between L1 pictures, can be specified; and various presets for streaming, SDTV and HDTV imply specific GOP structures. A prediction structure for picture coding using a standard GOP structure is shown below:

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Prediction of L1 and L2 pictures when L1 pictures are P pictures.

The picture buffer structure gives great flexibility, including the ability for the decoder to decode dynamically-varying GOP structures. However, it also brings some dangers, since at least in theory it means that I pictures need not be random access points - that is points where a decoder may start decoding. This is because it is perfectly possible for a subsequent L1 or L2 picture to have as a reference a picture that temporally precedes a preceding I picture, and indeed forms part of a chain of reference right back to the start of the sequence. This will need to be constrained by specifying levels and profiles.

I-picture only coding

Setting the number of L1 pictures to be 0 on the encoder side implies that we don't have a GOP, and that we're doing I-picture only coding. I-picture only coding is useful for editing and other applications where fast random access to all pictures is required, but I-picture only coding is not essential for these applications. with suitable support.

Skipping pictures and global motion

The picture header also contains other goodies, which are not yet fully supported. Firstly it contains a flag indicating whether or not the picture is skipped or not. In this case no picture data is sent at all. What action the decoder takes in this case has yet to be determined, but it is likely that in future versions the decoder will return the most recent decoded picture in temporal order.

The second flag the picture header contains indicates the presence of global motion data, that is a parameterized model of the motion data. Exactly what this means and how the data should be encoded and how the decoder should behave, have yet to be determined either.

When suitable algorithms have been implemented on the encoder side, both these tools should have a powerful impact on compression performance, allowing the picture rate to be scaled down and the motion more heavily compressed, when the encoder is very pushed for bit rate.

Interlace coding

Dirac supports interlace coding by coding sequences of fields, rather than frames.

[edit] Overlapped block-based motion compensation

Motion compensation in Dirac uses Overlapped Block-based

   Motion Compensation (OBMC) to avoid block-edge artefacts which
   would be expensive to code using wavelets.

Pretty much any size blocks can be used, with any degree of

   overlap selected: this is configurable at the encoder and

transmitted to the decoder. One issue is that there should be an exact number of macroblocks horizontally and vertically, where a macroblock is a 4x4 set of blocks. This is achieved by padding

   the data. Further padding may also be needed because after motion
   compensation the wavelet transform
    is applied, which has its own
   requirements for divisibility.



Although Dirac is not specifically designed to be scalable, the size of

   blocks is the only non-scalable feature, and for lower
   resolution frames, smaller blocks can easily be selected.



Dirac's OBMC scheme is based on a separable

   linear ramp mask. This acts as a weight function on the
   predicting block. Given a pixel p=p(x,y,t) in frame t, p may
   fall within only one block or in up to four if it lies at the
   corner of a block (see the figure below).



Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     



Figure: overlapping blocks. The darker-shades areas

   show overlapping areas



Each block that the pixel p is part of has a predicting

   block within the reference frame selected by motion
   estimation. The predictor 

 
  p
  �
 
 



for p is the
   weighted sum of all the corresponding pixels in the predicting
   blocks in frame 

 
  t
  �
 
 



, given by 

 
  p(x�
   V
   i
  
  ,y�
   W
   i
  
  ,
   t
   �
  
  )
 



for motion vectors 

 
  (
   V
   i
  
  ,
   W
   i
  
  )
 



. The
   Raised-Cosine mask has the necessary property that the sum of
   the weights will always be 1:





   p
   �
  
  (x,y,t)=
   
    �
    i
   
   
    
     w
     i
    
    p(x�
     V
     i
    
    ,y�
     W
     i
    
    ,
     t
     �
    
    ),
     
      �
      i
     
     
      
       w
       i
      
      =1
     
    
   
  
 





This may seem complicated but in implementation the only

   additional complexity over standard block-based motion compensation

is to apply the weighting mask to a

   predicting block before subtracting it from the frame. The fact that the weights sum to
   1 automatically takes care of splicing the predictors together across the overlaps.



As explained elsewhere, Dirac provides

motion vectors to sub-pixel accuracy, the software supporting

   accuracy up to 1/8th pixel, although 1/4 pixel is the default. 
   This means upconverting

the reference frame components by a factor of up to 8 in each dimension. The area corresponding to the matching block in the upconverted reference then consists of 64 times more points. These can be thought of as 64 reference blocks on different sub-lattices of points separated by a step of 8 'sub-'pixels, each one corresponding to different sub-pixel offsets.



Sub-pixel motion compensation places a huge load on memory bandwidth if done naively, i.e. by upconverting the reference by a factor 8 in each dimension. In Dirac, however, we just upconvert the reference by a factor of 2 in each dimension and compute the other offsets by linear interpolation on the fly. In other words we throw the load from the bus to the CPU. The 2x2 upconversion filter has been designed to get the best prediction error across all the possible sub-pixel offsets.

[edit] Motion estimation

Motion estimation is specific to the encoder. It's always the most complicated part of the system, and can absorb huge system resources, so methods have to be found to produce short-cuts. Dirac adopts a 3-stage approach. In the first stage, motion vectors are found for every block and each reference to pixel accuracy using hierarchical motion estimation. In the second stage, these vectors are refined to sub-pixel accuracy. In the third stage, we do mode decision, which chooses which predictor to use, and how to aggregate motion vectors by grouping blocks with similar motion together.

Motion estimation is most accurate when all three components are involved, but this is more expensive in terms of computation as well as more complicated. Dirac only uses the luma (Y) component.

Hierarchical motion estimation

Hierarchical ME speeds things up by repeatedly downconverting both the current and the reference frame by a factor of two in both dimensions, and doing motion estimation on smaller pictures. At each stage of the hierarchy, vectors from lower levels (smaller versions of the picture) are used as a guide for searching at higher levels. This dramatically reduces the size of searches for large motions. Dirac has four levels of downconversion. The block size remains constant (and the blocks will still overlap at all resolutions) so that at each level there are only a quarter as many blocks and each block corresponds to 4 blocks at the next higher resolution; and so each block provides a guide motion vector to 4 blocks at the next higher resolution layer. At each resolution, block matching proceeds by searching in a small range around the guide vector for the best match using the RDO metric (which is described below).

Search strategies in hierarchical ME

The hierarchical approach dramatically reduces the computational effort involved in motion estimation for an equivalent search range. However it risks missing small motions and it might not make good decisions when there are a variety of motions near to each other.

To mitigate this, the codec also always uses the zero vector (0,0) as another guide vector - this allows it to track slow- as well as fast-moving objects. Finally, the motion vectors already found in neighbouring blocks can also be used as guide vectors, it they have not already been tried.

Since each layer has twice the horizontal and vertical resolution of the one below it, it would appear to make sense to just search in an area +/-1 pixel of the guide vectors. In fact,the search ranges are always larger than this because this could cause the motion estimator to get trapped in a local minimum.

Sub-pixel refinement and upconversion

Dirac supports variable levels of motion vector accuracy. In the software currently, these are hard-wired in the code at 1/4 pixel but 1/8 is possible with the current software and even higher resolutions could be defined. The MV precision is signalled with each frame.

Sub-pixel refinement operates hierarchically also. Once pixel-accurate motion vectors have been determined, each block will have an associated motion vector (V0,W0) where V0 and W0 are multiples of 4 (for quarter-pel accuracy) or 8 (for eighth-pel accuracy). 1/2-pel accurate vectors are found by finding the best match out of (V0,W0) and its 8 neighbours: (V0+4,W0+4), (V0,W0+4), (V0-4,W0+4), (V0+4,W0), (V0-4,W0), (V0+4,W0-4), (V0,W0-4), (V0-4,W0-4). This in turn produces a new best vector (V1,W1), which provides a guide for 1/4-pel refinement, and so on until the desired accuracy. The process is illustrated in the figure below.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: sub-pixel motion-vector refinement

The sub-pixel matching process is complicated slightly since the reference is only upconverted by a factor of 2 in each dimension, not 8, and so more accurate vectors require frame component values to be calculated on the fly by linear interpolation.This means that the 1/2-pel interpolation filter has a bit of pass-band boost to counteract the sag introduced by doing linear interpolation. It was designed to produce the lowest interpolation error across all the phases. The taps are (scaled to 5 bits):

( -1 , 3 , -7 , 21 , 21 , -7 , 3 , -1 )

[edit] RDO motion estimation metric

The performance of motion-estimation and motion-vector

   coding is absolutely critical to the performance of a video
   coding scheme.With motion vectors at 1/4
   or 1/8th pixel accuracy, a simple-minded strategy of finding
   the best match between frames can greatly inflate the
   resulting bitrate for little or no gain in quality because the

additional accuracy is very sensitive to noise. What is

   required is the ability to trade off the vector bitrate with
   prediction accuracy and hence the bit rate required to code the
   residual frame and the eventual quality of that frame, whilst

at the same time making the estimator more robust.



The simplest way to do this is to incorporate a smoothing

   factor into the metric used for matching blocks. So the metric
   consists of a basic block matching metric, plus some constant
   times a measure of the local motion vector smoothness. The
   basic block matching metric used by Dirac is Sum of Absolute

Differences (SAD). Given two blocks X,Y of samples, this

   is given by:



  SAD(X,Y)=
   
    �
    
     i,j
    
   
   
    | 
     
      X
      
       i,j
      
     
     �
      Y
      
       i,j
      
     
     
     |
   
  
 





The smoothness measure used is based on the difference between the candidate motion vector and the median of the neighbouring previously

   computed motion vectors.
   Since the blocks are estimated in
   raster-scan order then vectors for blocks to the left and above
   are available for calculating the median:




Downloading a GIF rendering as your browser doesn't support SVG.

      Please ignore the "install additional plugins" message if you see it. 
     More details
     Sorry, your browser can't connect to the server to download a GIF substitute.         Either install an SVG-enabled browser or connect to the internet to download the diagram.
     



Figure: neighbouring vectors available in raster-scan

   order for local variance calculation



The vectors chosen for computing the local median

predictor are V2, V3 and V4; this has the merit of being the same predictor as is used in coding the motion vectors.



The total metric is a combination of these two metrics. Given a
   vector V which maps the current frame block X to a block
   Y=V(X) in the reference frame, the metric is given by:




  SAD(X,Y)+
   �
   
    max�
   
  
  (
   
    | 
     
      V
      x
     
     �
      pred
      x
     
     
     |+| 
     
      V
      y
     
     �
      pred
      y
     
     
     |,48
   
  )
 






The value � is a coding parameter used to control the

   trade-off between the smoothness of the motion vector field and
   the accuracy of the match. When � is very large, the
   local variance dominates the calculation and the motion vector
   which gives the smallest metric is simply that which is closest
   to its neighbours. When � is very small, the metric is
   dominated by the SAD term, and so the best vector will simply
   be that which gives the best match for that block. For values
   in between, varying degrees of smoothness can be achieved. The
   parameter � is calculated as a multiple of the RDO

parameters for the L1 and

   L2 frames, so that if the inter frames
   are compressed more heavily then smoother motion vector fields
   will also result.



The limit on the size of the smoothing factor is to 48 1/8ths

   of a pixel. This prevents the
   motion field being smoothed excessively, since where there is a genuine
   motion transition the motion vector will legitimately differ
   from its neighbours and shouldn't be excessively penalised.

[edit] Superblock structures

Dirac uses macroblock structures to introduce a degree of adaption into motion estimation by allowing the size of the blocks used to vary. The motion estimation stage of the encoding is organised by macroblock, and each combination of block size and prediction mode is tried using the RDO block-matching metric - this is called mode decision in the software - and the best solution adopted macroblock by macroblock.

A macroblock consists of a 4x4 array of blocks, and there are three possible ways of splitting a MB:

Splitting level 0: no split, a single MV per reference frame for the MB;

Splitting level 1: split into four sub-macroblocks (sub-MBs), each a 2x2 array of blocks, one MV per reference frame per sub-MB;

Splitting level 2: split into the 16 constituent blocks.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Macroblock splitting modes

The splitting mode is chosen by redoing motion estimation for the sub-MBs and the MB as a whole, again using the RDO metric , suitably scaled to take into account the different sizes of the blocks. At the same time, the best prediction mode for each prediction unit (block, sub-MB or MB) is chosen. Four prediction modes are available:

INTRA: intra coded, predicted by DC value;

REF1_ONLY: only predict from the first reference;

REF2_ONLY: only predict from the second reference (if one exists);

REF1AND2: bi-directional prediction.

The result is a hierarchy of parameters: the splitting level determines what modes, motion vectors and block DC values (in the case of INTRA) need to be present.

In motion estimation, an overall cost for each MB is computed, and compared for each legal combination of these parameters. This is a tricky operation, and has a very significant effect on performance. The decisions interact very heavily with those made in coding the wavelet coefficients of the resulting residuals, and the best results probably depend on picture material, bit rate, the block size and its relationship to the size of the video frames, and the degree of perceptual weighting used in selecting quantisers for wavelet coefficients.

Dirac can use any block sizes, although blocks parameters do have to meet some constraints, so that the overlapping process works properly, especially in conjunction with subsampled chroma components (for which the blocks will be correspondingly smaller). For example, the block separations and corresponding lengths must differ by a multiple of four, so that overlap is symmetric for luma and sub-sampled chroma.

[edit] Block data

Parameters other than the MB splitting level are termed block data, even though they may apply to blocks, sub-MBs or the MB itself depending on the value of the splitting mode. The prediction mode has already been described. The five remaining block parameters are:

REF1_x: horizontal component of motion vector to the first reference frame;

REF1_y: vertical component of motion vector to the first reference frame;

REF2_x: horizontal component of motion vector to the second reference frame;

REF2_y: vertical component of motion vector to the second reference frame.

DC: DC or average value for the prediction unit for each component (Y, U or V) being coded, to 8-bit accuracy.

Clearly not all of these values must be coded. If the prediction mdoe is REF1_ONLY then REF2_x and REF2_y will not be coded, for example, and if the prediction unit is not INTRA, then no DC value need be sent.

Each different type of block data is coded as a separate block of data, allowing for parallelisation of the decoding process.

[edit] Motion vector data coding

Motion vector (MV) data coding is important to the performance of video coding, especially for codecs with a high level of MV accuracy (1/4 or 1/8 pel). For this reason, MV coding and decoding is quite complicated, since significant gains in efficiency can be made by choosing a good prediction and entropy coding structure. The basic format of the MV coding module is similar to the coding of coefficient data: it consists of prediction, followed by binarisation, context modelling and adaptive arithmetic coding:

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: motion vector entropy coding architecture

[edit] Prediction of motion vector data

All the motion vector data is predicted from previously encoded data from nearest neighbours. In predicting the data a number of conventions are observed.

The first convention is that all the so-called block data (prediction modes and the motion vectors themselves, and/or any DC values) is actually associated with the top-left block of the prediction unit to which they refer. This allows for a consistent prediction and coding structure to be adopted.

Example. If splitting level=1 and then the prediction units in a MB are sub-MBs. Nevertheless, the prediction mode and any motion vectors are associated with the top-left block of each sub-MB and values need not be coded for other blocks in the sub-MB.

The second convention is that all MB data is scanned in raster order for encoding purposes. All block data is scanned first by MB in raster order, and then in raster order within each MB. That is, taking each MB in raster order, each block value which needs to be coded within that MB is coded in raster order:

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: block data is scanned in raster order by MB and then in raster order within each MB

The third convention concerns the availability of values for prediction purposes when they may not be coded for every block. Since prediction will be based on neighbouring values, it is necessary to propagate values for the purposes of prediction when the MV data has conspired to ensure that values are not required for every block.

Example. In the next diagram, we can see the effect of this. Suppose we are coding REF1_x. In the first MB, splitting level=0 and so at most only the top-left block needs a value, which can be predicted from values in previously coded MBs. As it happens, the prediction mode REF1_ONLY and so a value is coded. The value v is then deemed to be applied to every block in the MB. In the next MB, splitting level=1, so the unit of prediction is the sub-MB. In the top-left sub-MB the prediction mode is, say, REF1AND2 and so a value x is coded for the top-left block of that sub-MB. It can be predicted from any available values in neighbouring blocks, and in particular the value v is available from the adjacent block.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: For the purposes of prediction, values are deemed to be propagated within MBs or sub-MBs.

Prediction methods

The prediction used depends on the MV data being coded, but in all cases the aperture for the predictor is shown in the figure below. This aperture is interpreted as blocks where block data is concerned and MBs where MB data is concerned.

Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.

Figure: Aperture for MV prediction.

The splitting level is predicted as the mean of the levels of the three MBs in the aperture.

Of the block data, the MV prediction mode is predicted for reference 1 and reference 2 separately. The MV prediction mode is interpreted as a two bit number encoding the four possibilities: INTRA=0, REF1_ONLY=1, REF2_ONLY=2 and REF1AND2=3. In this way the first bit (bit 0) is a flag indicating that reference 1 is used and the second bit (bit 1) that reference 2 is used. Bit 0 and Bit 1 are predicted separately: if most of the 3 predicting blocks use reference 1 (ie their mode is REF1_ONLY or REF1AND2 and bit 0 is set) then it is predicted that bit 0 is set, and likewise for bit 1.

The DC values are predicted by the average of the three values in the aperture. The motion vectors themselves are predicted by predicting the horizontal and vertical components separately. The predictor for motion vector components is the median of the three corresponding values in the prediction aperture for the block.

In many cases MV or DC values are not available from all blocks in the aperture, for example if the prediction mode is different. In this case the blocks are merely excluded from consideration. Where only two values are available, the median motion vector predictor reverts to a mean. Where only one value is available, this is the prediction. Where no value is available, no prediction is made, except for the DC values, where 0 is used by default.

In the case of the MB_split data, the number of possible values is only 3 in the case of MB_split and 2. The prediction therefore can use modulo arithmetic and produces an unsigned prediction residue of 0,1 or 2 in the first case and 0 or 1 in the second. All other predictions produce signed prediction residues.

[edit] Motion vector entropy coding

Entropy coding of the MV prediction residuals uses the same basic architecture as for wavelet coefficient coding: interleaved exp-Golomb binarization, followed by adaptive arithmetic coding with multiple context models. For MV coding there are many different types of data, and these have their own context models.

Personal tools