DiracAlgorithm
From Diracvideo
Contents |
[edit] Overall Architecture
Overall, the codec is a classic motion-compensated hybrid codec. The coder has the architecture shown below, whilst the decoder performs the inverse operations.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Overall hybrid encoder architecture
There are four main elements or modules to the coder:
* Transform and scaling involves taking frame data and applying a transform (in this case the wavelet transform) and scaling the coefficients to perform quantisation; * Entropy coding is applied to quantised transform coefficients and to motion vector (MV) data and performs lossless compression on them; * Motion estimation (ME) involves finding matches for frame data from previously coded frames, trading off accuracy with motion vector bit rate; * Motion compensation (MC) involves using the motion vectors to predict the current frame, in such a way as to minimise the cost of encoding the residual data.
The following sections describe these modules in more detail, after first describing the rate-distortion framework used throughout the encoder.
Dirac supports interlaced coding by means of coding whole fields rather than frames. Unlike MPEG codecs there is no complex macroblock-by-macroblock switching. Such complexity is not justified as interlace (an analogue compression technique) becomes a legacy system.
The codec can support any frame dimensions and common chroma formats (444, 422, 420 but not 411) by means of picture padding. The padding ensures that the wavelet transform can be applied properly.
[edit] Rate-Distortion optimisation
The key to making good decisions in compression is to be able to trade off the number of bits used to encode some part of the signal being compressed, with the error that is produced by using that number of bits. There is no point striving hard to compress one feature of the signal if the degradation it produces is much more significant than that of compressing some other feature with fewer bits. In other words, one wishes to distribute the bit rate to get the least possible distortion overall. So how can this be done?
Rate distortion can be described in terms of Lagrangian multipliers. It can also be described by the Principle of Equal Slopes, which states that the coding parameters should be selected so that the rate of change of distortion with respect to bit rate is the same for all parts of the system.
To see why this is so, consider two independent components of a signal. They might be different blocks in a video frame, or different subbands in a wavelet transform. Compress them at various rates using your favourite coding technique, and you tend to get curves like those in the figure below. They show that at low rates, there is high distortion (or error) and at high rates there is low distortion, and there is generally a smooth(ish) curve between these points with a nice convex shape.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Rate-distortion curves for two signal components
Now suppose that we assign B1 bits to component X and B2 bits to component Y. Look at the slope of the rate-distortion curves at these points. At B1 the slope of X's distortion with respect to bit rate is much higher than the slope at B2, which measures the rate of change of Y's distortion with respect to bit rate. It's easy to see that this isn't the most efficient allocation of bits. To see this, increase B1 by a small amount to B1+Δ and decrease B2 to B2-Δ. Then the total distortion has reduced even though the total bit rate hasn't changed, due to the disproportionately greater drop in the distortion of X.
The conclusion is therefore that for a fixed total bit rate, the error or distortion is minimised by selecting bit rates for X and Y at which the rate-distortion curves have the same slope. Likewise, the problem can be reversed and for a fixed level of distortion, the total bitrate can be minimised by finding points with the same slope.
Two questions arise in practise: firstly, how does one find points on these curves with the same slope; and secondly, how does one hit a fixed overall bit budget? The first question can be answered by the succeeding figure: the intercept of the tangent to the rate-distortion curve at the point (R0,D0) to the D-axis is the value D0+λR0 where -λ is the slope at the point (R0,D0). Furthermore it is the smallest value of D+λR for all values of (R,D) that lie on the curve. So in selecting, for example, a quantizer in a given block or subband, one minimises the value D(Q)+λR(Q) over all quantizers Q, where D(Q) is the error produced by quantizing with Q and R(Q) is the rate implied.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: minimisation of the Lagrangian cost function
In order to hit an overall bit budget, one needs to iterate over values of the Lagrangian parameter λ in order to find the one that gives the right rate. In practise, this iteration can be done in slow time given any decent encoding buffer size, and by modelling the overall rate distortion curve based on the recent history of the encoder. Rate-distortion optimisation (RDO) is used throughout Dirac, and it has a very beneficial effect on performance. Control of the example Dirac encoder is by a single parameter ("-qf") that effectively sets Lagrangian parameters for each part of the encoding process.
This description makes RDO sound like a science: in fact it isn't and the reader will be pleased to learn that there is plenty of scope for engineering ad-hoc-ery of all kinds. This is because there are some practical problems in applying the procedure:
1) There may be no common measure of distortion. For example: quantising a high-frequency subband is less visually objectionable than quantising a low-frequency subband, in general. So there is no direct comparison with the significance of the distortion produced in one subband with that produced in another. This can be overcome by perceptual weighting, in which the noise in HF bands is downgraded according to an estimate of the Contrast Sensitivity Function (CSF) of the human eye, and this is what we have done. The problem even occurs in block-based coders, however, since quantisation noise can be successfully masked in some areas but not in others. Perceptual fudge factors are therefore necessary in RDO in all types of coders.
2) Rate and distortion may not be directly measurable. In practice, measuring rate and distortion for, say, every possible quantiser in a coding block or subband cannot mean actually encoding for every such quantiser and counting the bits and measuring MSE. What one can do is estimate the values using entropy calculations or assuming a statistical model and calculating, say, the variance. In this case, the R and D values may well be only roughly proportional to the true values, and some sort of factor to compensate is necessary in using a common multiplier across the encoder.
3) Components of the bitstream will be interdependent. The model describes a situation where the different signals X and Y are fully independent. This is often not true in a hybrid video codec. For example, the rate at which reference frames are encoded affects how noisy the prediction from them will be, and so the quantisation in predicted frames depends on that in the reference frame. Even if elements of the bitstream are logically independent, perceptually they might not be. For example, with Intra coding, each frame could be subject to RDO independently, but this might lead to objectionally large variations in quantisation noise between frames with low bit rates and rapidly changing content.
Incorporating motion estimation into RDO is also tricky, because motion parameters are not part of the content but have an indirect effect on how the content looks. They also have a coupled effect on the rest of the coding process, since the distortion measured by prediction error, say, affects both the bit rate needed to encode the residuals and the distortion remaining after coding. This is discussed in more detail later.
[edit] Transform coding
Once any motion compensation has been performed, motion -compensated residuals are treated almost identically to Intra picture data. In both cases we have three (luminance and two chrominance) components in the form of two-dimensional arrays of data values. The picture (frame or field) component data is coded in three stages. First the data arrays are wavelet- transformed using separable wavelet filters and divided into subbands. Then they are quantised (using RDO quantisers in the reference encoder). Finally, the quantised data is entropy coded.
The architecture of coefficient coding is shown here:
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: coefficient coding architecture
As can be seen, each wavelet subband is coded in turn. Both the quantisation and the entropy coding of each band can depend on the coding of previously coded bands. This does limit parallelisation, but the dependences are limited to parent-child relationships so some parallelisation/multi-threading is still possible.
The only difference between Intra picture coefficient coding and Inter picture residual coefficient coding lies in the use of prediction within the DC wavelet subband of Intra picture components.
At the decoder side, the three stages of the coding process are reversed. The entropy coding is decoded to produce the quantised coefficients, which are then reconstructed to produce the real values. Then, after undoing any prediction, the inverse transform produces the decoded picture component. The reference Dirac encoder has to maintain a local decoder within it, in part so that the result of the compression picture can be viewed at the time of compression, but mainly because compressed pictures must be used as reference frames for subsequent motion compensation else the encoder and the decoder will not remain in sync.
[edit] Wavelet transform
The discrete wavelet transform is now extremely well-known and is described in numerous references. In Dirac it plays the same role of the DCT in MPEG-2 in
decorrelating data in a roughly frequency-sensitive way, whilst having the advantage of preserving fine details better. In one dimension it consists of the iterated application of a complementary pair of half-band filters followed by subsampling by a factor 2:
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Perfect reconstruction analysis and synthesis
filter pairs
These filters are termed the analysis filters. Corresponding
synthesis filters can undo the aliasing introduced by the critical sampling and perfectly reconstruct the input. Clearly not just any pair of half-band filters can do this, and there is an extensive mathematical theory of wavelet filter banks. The filters split the signal into a LH and HF part; the wavelet transform then iteratively decomposes the LF component to produce an octave-band decomposition of the signal.
Applied to two-dimensional images, wavelet filters are
normally applied in both vertical and horizontal directions to each image component to produce four so-called subbands
termed Low-Low (LL), Low-High
(LH), High-Low (HL) and High-High (HH). In the case of two dimensions, only the LL band is iteratively decomposed to obtain the decomposition of the two-dimensional spectrum shown below:
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: wavelet transform frequency
decomposition
The number of samples in each resulting subband is as
implied by the diagram: the critical sampling ensures that after each decomposition the resulting bands all have one quarter of the samples of the input signal.
photograph of girl (LENA) decomposed by 3-level wavelet transform
Figure: 3-level wavelet transform of LENA
Wavelet filters
The choice of wavelet filters has an impact on compression
performance, filters having to have both compact impulse response in order to reduce ringing artefacts and other properties in order to represent smooth areas compactly. It also has an impact on encoding and decoding speed in software.
There are numerous filters supported by Dirac to allow a trade-off between complexity and performance. These are configurable in the reference software. These filters are all defined using the 'lifting scheme' for speed.
The lifting stages are defined as follows. One fiter available in Dirac
is a lifting approximation of the Daubechies (9,7) wavelet: for this we have (s denoting sum and d denoting difference),
s
n
0
=
x
2n
d
n
0
=
x
2n+1
d
n
1
=
d
n
0
-(6497⋅(
s
n
0
+
s
n+1
0
))/4096
s
n
1
=
s
n
0
-(217⋅(
d
n
1
+
d
n-1
1
))/4096
d
n
2
=
d
n
1
+(3616⋅(
s
n
1
+
s
n+1
1
))/4096
s
n
2
=
s
n
1
+(1817⋅(
d
n
2
+
d
n-1
2
))/4096
The magic numbers are integer approximations of the Daubechies lifting ceofficients. This makes the transform fully invertible, where a floating point implementation wouldn't quite be. The implementation ignores scaling coefficients, since these can be taken into account in quantiser selection by weighting quantiser noise appropriately. The problem with this filter is that it has four lifting stages, and so it takes longer in software. At the other extreme, there is the (5,3) filter:
d
n
1
=
d
n
0
-(
s
n
0
+
s
n+1
0
)/2
s
n
1
=
s
n
0
+(
d
n
1
+
d
n-1
1
)/4
We can improve the (5,3) high pass filter, and get an approximating of the Daubechies low-pass filter by having more taps in the first lifing stage. This is the Deslauriers-Dubuc (9,7) filter:
d
n
1
=
d
n
0
-(-
s
n-1
0
+9
s
n
0
+9
s
n+1
0
-
s
n+2
0
)/16
s
n
1
=
s
n
0
+(
d
n
1
+
d
n-1
1
)/4
The Deslauriers-Dubuc (13,7) extends this further to provide better frequency selectivity in the low pass filter, still only using two lifting stages:
d
n
1
=
d
n
0
-(-
s
n-1
0
+9
s
n
0
+9
s
n+1
0
-
s
n+2
0
)/16
s
n
1
=
s
n
0
+(-
d
n+1
1
+ 9
d
n
1
+9
d
n-1
1
-
d
n-2
1
)/32
Padding and invertibility
Clearly, applying an N-level wavelet transform requires N levels of
subsamplings, and so for reversibility, it is necessary that 2N divides all the dimensions of each component. So if this condition is not met, the input picture components are padded as they are read in, by edge values for best compression performance.
[edit] Parent-child relationships
Since each subband represents a filtered and subsampled version of the frame component, coefficients within each subband correspond to specific areas of the underlying picture and hence those that relate to the same area can be related. It is most productive to relate coefficients that also have the same orientation (in terms of combination of high- and low-pass filters). The relationship is illustrated below, showing the situation for HL bands i.e. those that have been high-pass filtered horizontally and low-pass filtered vertically.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Parent-child relationship between subband coefficients
In the diagram it's easy to see that the subsampling structure means that a coefficient (the parent) in the lowest HL band corresponds spatially to a 2x2 block of coefficients (the children) in the next HL band, each coefficient of which itself has a 2x2 block of child coefficients in the next band, and so on. This relationship relates closely to spectral harmonics: when coding image features (edges, especially) significant coefficients are found distributed across subbands, in positions related by the parent-child structure, and corresponding to the original position of the feature. In particular, a coefficient is more likely to be significant if its parent is, and children with zero or small parents or ancestors may have different statistics from children with large parents or ancestors.
These factors suggest that when entropy coding coefficients, it will be helpful to take their parents into account in predicting how likely, say, a zero value is.
By coding from low-frequency subbands to high-frequency ones, and hence by coding parent before child subbands, parent-child dependencies can be exploited in these ways without additional signalling to the decoder.
The wavelet coefficient coding section describes specifically how these relationships are exploited.
[edit] Quantisation
Having transformed the component data, each subband's coefficients
are quantised by Dirac using so-called uniform dead-zone quantisers. A simple uniform quantiser is a division of the real line into equal-width bins, of size equal to the quantisation factor Qf: the bins are numbered and a reconstruction value is selected for each bin. So the bins consist of the intervals
[
(
N
−
1
2
)
Q
f
,
(
N
+
1
2
)
Q
f
]
for integers N, which are also the labels for the bin, and
it is the labels that are subsequently encoded. The reconstruction value used in the decoder (and for local decoding in the encoder) can be any value in each of the bins. The obvious, but not necessarily the best, reconstruction value is the midpoint NQf. See the diagram, part a, below.
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Uniform and dead-zone quantisers, with mid
point reconstruction values.
A uniform dead-zone quantiser is slightly different in that
the bin containing zero is twice as wide. So the bins consist of [-Qf,Qf], with a reconstruction value of 0, together with
other bins of the form
[ N Q f , ( N + 1 ) Q f ]
for N>0 and
[ ( N − 1 ) Q f , N Q f ]
for N<0, with reconstruction points somewhere in the
intervals. The bin structure is shown in part b above with mid-point reconstruction points.
The advantage of the dead-zone quantiser is two-fold.
Firstly, it applies more severe quantisation of the smallest values, which acts as a simple but effective de-noising operation. Secondly, it admits a very simple and efficient implementation: simply divide by the quantisation factor and
round towards zero. In Dirac, this process is approximated by a multiplication and a bitshift.
N=Q(v)=⌊
| v |
Q
f
⌋ if v≥0
N=Q(v)=−⌊
| v |
Q
f
⌋ otherwise
where the braces ⌊⌋ mean that the
remainder is to be discarded. The corresponding reconstructed value v ˜
is given by (an integer approximation to):
v
˜
=0 if N=0
v
˜
=(N+X)
Q
f
if N>0
v
˜
=(N−X)
Q
f
if N<0
for a reconstruction point X between 0 and 1.
A value of X=0.5, giving the mid-point of the interval might
be the obvious reconstruction point, giving as it does the mid-point of the bin. This is indeed what we use for intra pictures. For inter pictures (motion-compensated prediction residues) the values of transformed coefficients in a wavelet subband have a distribution with mean very near zero and which decays pretty rapidly and uniformly for larger values. Values are therefore more likely to occur in the first half of a bin than in the second half and the smaller value of X=0.375 reflects this bias, and gives better performance in practice.
This reconstructed value is used by the encoder to produce the
locally decoded component data, which is identical to what the decoder would produce, after decoding the quantised value N.
[edit] Lagrangian parameter control of subband quantisation
Selection of quantisers is a matter for the encoder only.
The current Dirac encoder uses an RDO technique to pick
a quantiser by minimising a
Lagrangian combination of rate and distortion.Essentially, lots
of quantisers are tried and the best picked. Rate is estimated
via a an adaptively-corrected measure of zeroth-order entropy measure Ent(q) of the quantised symbols resulting from applying the quantisation factor q, calculated as a value of bits/pixel. Distortion is measured in terms of the perceptually-weighted error fourth-power error E(q,4), resulting from the difference between the original and the quantised coefficients:
E(q,4)=
(
∑
i,j
|
p
i,j
−Q(i,j)
|
4
)
1
4
The total measure for each quantiser q is:
λ.C.Ent(q)+(
E
(q,4)
2
w
)
where w is the perceptual weight associated with the
subband - higher frequencies having a larger weighting factor - and C is a correction factor. Using the square of E(q,4) makes it equal to the mean-square error for constant values, but in general it gives greater weight to large values than the MSE, for a mixed signal. The correction factor compensates for any discrepancy between the measure of entropy and the actual cost in terms of bits, based on the actual bit rate produced by the corresponding elements of previous frames. It is necessary because the entropy measure does not take into account dependencies between coefficients that are taken into account in the actual
coefficient entropy coding.
The quantisers are incremented in quarter-powers of 2 - i.e.
q is an integer approximation of
2
n
4
for integers n. In other words, the quantisers represent the coefficient magnitudes to variable fractional-bit accuracies in quarter-bit increments.
The Lagrangian parameter λ is derived from the encoder quantisation parameter. The larger the value, the lower the resulting bit rate, and vice-versa.
Clearly, there are a lot of quantisers to search. The current
Dirac encoder implementation speeds things up by splitting the search up into three stages. First, one quarter of the coefficients are used to obtain the best quantiser to bit-accuracy. Secondly, one quarter of the coefficients are again used to refine this estimate to half-bit accuracy. Thirdly, half the coefficients are used to refine the search further to 1/4-bit. In each stage, only a single loop over the coefficients is used to test all the candidate quantisers. The result is much faster than a brute-force search of all the quantisers, and almost as good in performance terms.
[edit] Wavelet Coefficient Coding
Dirac codes subbands from low-frequency to high frequency in a zig-zag order. Within each subband, the coefficients are further partitioned spatially divided into code blocks. Coefficients are scanned by coding the code blocks in normal raster order and coding the coefficients within each code block in raster order within that block. However, each code block can be skipped so a skip flag is included in the stream of symbols to be coded. The skip flag is interpreted in the decoder as setting all coefficients within the block to be zero.
The number of code blocks in a subband depends on the frame type and on the subband. Intra frames have a single code block for 7 the lowest-frequency 7 subbands (ie the bottom 2 levels of the wavelet transform) and a 4x3 array of code blocks for the remaining bands. Predicted frames have more code blocks, as coefficients are more likely to be zero: 1 block only for the 4 lowest-level subbands; 8x6 for the next 3 subbands; 12x8 for the remaining subbands. These values are somewhat ad hoc, and probably will ultimately be configurable.
There are two possible quantisation modes made possible by the code block structure. We may either have a single quantiser for the whole subband or have different quantisers for different blocks. Currently the software only supports the first option, but ultimately both options will be implemented, which will allow Region Of Interest coding.
Entropy coding
The entropy coding used by Dirac in wavelet subband
coefficient coding is based on three
stages: binarisation, context modelling and adaptive arithmetic coding.
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Entropy coding block diagram
The purpose of the first stage is to provide a bitstream
with easily analysable statistics that can be encoded using arithmetic coding, which can adapt to those statistics, reflecting any local statistical features.
Binarization
Binarization is the process of transforming the multi-valued
coefficient symbols into bits. The resulting bitstream can then be arithmetic coded. The original symbol stream could have been
coded directly, using a multi-symbol arithmetic coder, but this tends to suffer from 'context dilution': most symbols occur very rarely and so only sparse statistics can be gathered, which reduces coding efficiency.
The most obvious way to binarize a symbol is directly: a
symbol is encoded by encoding the constituent bits of the binary representation of its magnitude, followed by a sign bit. This is termed bit-plane coding. The problem with this is that the number of bit planes depends on the size of the largest value, which means that either a lot of unnecessary zero symbols must be sent for smaller values or the size (number of bits) of each symbol must be encoded somehow.
Dirac uses an interleaved exp-Golomb binarisation. This can be thought of as combining
a length code with the binary representation. Here's how it works for coding magnitude (non-negative) values.
Firstly, add 1 to your value. Then the binary representation has a leading 1:
N+1 = 1 bK-1...b0
The leading 1 is redundant and doesn't have to be sent. We can prefix each data bit with a "follow" bit which tells us there's a new bit to come: 0 means there's an additional bit and 1 means terminate. So the code for N will be:
0 bK-1 0 ... 0 b0 1
In coventional exp-Golomb coding, the length code of K zeroes followed by a 1 would be a prefix to the data bits. By interleaving them we have ensured that the value can be decoded in a single loop, two bits at a time, rather than in two loops.
Context modelling
The context modelling in Dirac is based on the principle
that whether
a coefficient is small (or zero, in particular) or not is well-predicted by its neighbours and its parents. Therefore the codec conditions the probabilities used by the arithmetic coder for coding the first follow bit on whether the neighbouring coefficients or the parent coefficient are zero.
The reason for this approach is that, whereas the wavelet
transform largely
removes correlation between a coefficient and its neighbours, they may not be statistically independent even if they are uncorrelated. The main reason for this is that small and especially zero coefficients in wavelet subbands tend to clump together, located at points corresponding to smooth areas in the image, and as discussed
elsewhere, are grouped together across
subbands in the parent-child relationship.
After binarization, a context is selected, and the
probabilities for 0 and 1 that are maintained in the appropriate context will be fed to the arithmetic coding function along with the value itself to be coded.
Arithmetic coding
Conceptually, an
arithmetic coder can be thought of a progressive way of producing variable-length codes for entire sequences of symbols based on the probabilities of their constituent symbols. For example, if we know the probability of 0 and 1 in a binary sequence, we also know the probability of the sequence itself occurring. So if
P(0)=0.2, P(1)=0.8
then
P(11101111111011110101)=(0.2)3(0.8)17=1.8x10-4
(assuming independent occurrences).
Information theory then says that optimal entropy coding of
this sequence requires
log�
2
(
1
p
)=12.4 bits
. AC
produces a code-word very close to this optimal length, and implementations can do so progressively, outputting bits when possible as more arrive.
All arithmetic coding (AC) requires are estimates of the
probabilities of symbols as they occur, and this is where context modelling fits in. Since AC can, in effect, assign a fractional number of bits to a symbol, it is very efficient for coding symbols with probabilities very close to 1, without the additional complication of run-length coding. The aim of context modelling within Dirac is to use information about the symbol stream to be encoded to produce accurate probabilities as close to 1 as possible.
Dirac computes these estimates for each context by maintaining
a 16 bit probability word representing the probability of a zero value in that context. When a value has been coded, the probability is modified by incrementing (if zero was coded) or decrementing this value by a delta.
The delta value itself depends only on the
probability and is derived from a look-up table. If the probability is near 1 or 0 the delta is approximately 1/256; if the probability is near 1/2, the delta is approximately 1/32. This avoids the need to maintain explicit counts of 1 and 0 and makes for very efficient updating. There is no rescaling or division in the computation, and the estimate adapts rapidly for balanced probabilities and slowly for highly skewed probabilities, as it should.
[edit] Motion Estimation
[edit] Temporal prediction structures
Motion estimation and compensation is the most complex part of any video codec, both conceptually and in terms of computation. The Dirac encoder uses three types of picture. Intra pictures (I pictures) are coded without reference to other pictures in the sequence. Level 1 pictures (L1 pictures) and Level 2 pictures are both inter pictures, that is they are coded with reference to other previously coded pictures. The difference between L1 and L2 pictures is that L1 pictures are forward-predicted only (also known a P-pictures) whereas L2 pictures are B pictures (predicted from both earlier and later references).
The Dirac software employs a picture buffer to manage temporal prediction. Each picture is encoded with a header that specifies the picture number in display order, the picture numbers of any references and how long the picture must stay in the buffer. The decoder then decodes each picture as it arrives, searching the buffer for the appropriate reference pictures and placing the picture in the buffer. The decoder maintains a counter indicating which picture to 'display' (i.e. push out through the picture IO to the application calling the decoder functions, which may be a video player or may be something else). It searches the buffer for the picture with that picture number and displays it. Finally, it goes through the buffer eliminating pictures which have expired.
This decoder process allows for quite arbitrary prediction structures to be employed, not just those of MPEG-like GOPs.
Nevertheless, the encoder operates with standard GOP modes whereby the number of L1 pictures between I pictures, and the separation between L1 pictures, can be specified; and various presets for streaming, SDTV and HDTV imply specific GOP structures. A prediction structure for picture coding using a standard GOP structure is shown below:
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Prediction of L1 and L2 pictures when L1 pictures are P pictures.
The picture buffer structure gives great flexibility, including the ability for the decoder to decode dynamically-varying GOP structures. However, it also brings some dangers, since at least in theory it means that I pictures need not be random access points - that is points where a decoder may start decoding. This is because it is perfectly possible for a subsequent L1 or L2 picture to have as a reference a picture that temporally precedes a preceding I picture, and indeed forms part of a chain of reference right back to the start of the sequence. This will need to be constrained by specifying levels and profiles.
I-picture only coding
Setting the number of L1 pictures to be 0 on the encoder side implies that we don't have a GOP, and that we're doing I-picture only coding. I-picture only coding is useful for editing and other applications where fast random access to all pictures is required, but I-picture only coding is not essential for these applications. with suitable support.
Skipping pictures and global motion
The picture header also contains other goodies, which are not yet fully supported. Firstly it contains a flag indicating whether or not the picture is skipped or not. In this case no picture data is sent at all. What action the decoder takes in this case has yet to be determined, but it is likely that in future versions the decoder will return the most recent decoded picture in temporal order.
The second flag the picture header contains indicates the presence of global motion data, that is a parameterized model of the motion data. Exactly what this means and how the data should be encoded and how the decoder should behave, have yet to be determined either.
When suitable algorithms have been implemented on the encoder side, both these tools should have a powerful impact on compression performance, allowing the picture rate to be scaled down and the motion more heavily compressed, when the encoder is very pushed for bit rate.
Interlace coding
Dirac supports interlace coding by coding sequences of fields, rather than frames.
[edit] Overlapped block-based motion compensation
Motion compensation in Dirac uses Overlapped Block-based
Motion Compensation (OBMC) to avoid block-edge artefacts which would be expensive to code using wavelets.
Pretty much any size blocks can be used, with any degree of
overlap selected: this is configurable at the encoder and
transmitted to the decoder. One issue is that there should be an exact number of macroblocks horizontally and vertically, where a macroblock is a 4x4 set of blocks. This is achieved by padding
the data. Further padding may also be needed because after motion
compensation the wavelet transform
is applied, which has its own
requirements for divisibility.
Although Dirac is not specifically designed to be scalable,
the size of
blocks is the only non-scalable feature, and for lower resolution frames, smaller blocks can easily be selected.
Dirac's OBMC scheme is based on a separable
linear ramp mask. This acts as a weight function on the predicting block. Given a pixel p=p(x,y,t) in frame t, p may fall within only one block or in up to four if it lies at the corner of a block (see the figure below).
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: overlapping blocks. The darker-shades areas
show overlapping areas
Each block that the pixel p is part of has a predicting
block within the reference frame selected by motion estimation. The predictor p �
for p is the weighted sum of all the corresponding pixels in the predicting blocks in frame t �
, given by p(x� V i ,y� W i , t � )
for motion vectors ( V i , W i )
. The Raised-Cosine mask has the necessary property that the sum of the weights will always be 1:
p
�
(x,y,t)=
�
i
w
i
p(x�
V
i
,y�
W
i
,
t
�
),
�
i
w
i
=1
This may seem complicated but in implementation the only
additional complexity over standard block-based motion compensation
is to apply the weighting mask to a
predicting block before subtracting it from the frame. The fact that the weights sum to 1 automatically takes care of splicing the predictors together across the overlaps.
As explained elsewhere, Dirac provides
motion vectors to sub-pixel accuracy, the software supporting
accuracy up to 1/8th pixel, although 1/4 pixel is the default. This means upconverting
the reference frame components by a factor of up to 8 in each dimension. The area corresponding to the matching block in the upconverted reference then consists of 64 times more points. These can be thought of as 64 reference blocks on different sub-lattices of points separated by a step of 8 'sub-'pixels, each one corresponding to different sub-pixel offsets.
Sub-pixel motion compensation places a huge load on
memory bandwidth if done naively, i.e. by upconverting the reference by
a factor 8 in each dimension. In
Dirac, however, we just upconvert the reference by a factor of 2
in each dimension and compute the other offsets by linear
interpolation on the fly. In other words we throw the load
from the bus to the CPU. The 2x2 upconversion filter has been
designed to get the best prediction error across all the
possible sub-pixel offsets.
[edit] Motion estimation
Motion estimation is specific to the encoder. It's always the most complicated part of the system, and can absorb huge system resources, so methods have to be found to produce short-cuts. Dirac adopts a 3-stage approach. In the first stage, motion vectors are found for every block and each reference to pixel accuracy using hierarchical motion estimation. In the second stage, these vectors are refined to sub-pixel accuracy. In the third stage, we do mode decision, which chooses which predictor to use, and how to aggregate motion vectors by grouping blocks with similar motion together.
Motion estimation is most accurate when all three components are involved, but this is more expensive in terms of computation as well as more complicated. Dirac only uses the luma (Y) component.
Hierarchical motion estimation
Hierarchical ME speeds things up by repeatedly downconverting both the current and the reference frame by a factor of two in both dimensions, and doing motion estimation on smaller pictures. At each stage of the hierarchy, vectors from lower levels (smaller versions of the picture) are used as a guide for searching at higher levels. This dramatically reduces the size of searches for large motions. Dirac has four levels of downconversion. The block size remains constant (and the blocks will still overlap at all resolutions) so that at each level there are only a quarter as many blocks and each block corresponds to 4 blocks at the next higher resolution; and so each block provides a guide motion vector to 4 blocks at the next higher resolution layer. At each resolution, block matching proceeds by searching in a small range around the guide vector for the best match using the RDO metric (which is described below).
Search strategies in hierarchical ME
The hierarchical approach dramatically reduces the computational effort involved in motion estimation for an equivalent search range. However it risks missing small motions and it might not make good decisions when there are a variety of motions near to each other.
To mitigate this, the codec also always uses the zero vector (0,0) as another guide vector - this allows it to track slow- as well as fast-moving objects. Finally, the motion vectors already found in neighbouring blocks can also be used as guide vectors, it they have not already been tried.
Since each layer has twice the horizontal and vertical resolution of the one below it, it would appear to make sense to just search in an area +/-1 pixel of the guide vectors. In fact,the search ranges are always larger than this because this could cause the motion estimator to get trapped in a local minimum.
Sub-pixel refinement and upconversion
Dirac supports variable levels of motion vector accuracy. In the software currently, these are hard-wired in the code at 1/4 pixel but 1/8 is possible with the current software and even higher resolutions could be defined. The MV precision is signalled with each frame.
Sub-pixel refinement operates hierarchically also. Once pixel-accurate motion vectors have been determined, each block will have an associated motion vector (V0,W0) where V0 and W0 are multiples of 4 (for quarter-pel accuracy) or 8 (for eighth-pel accuracy). 1/2-pel accurate vectors are found by finding the best match out of (V0,W0) and its 8 neighbours: (V0+4,W0+4), (V0,W0+4), (V0-4,W0+4), (V0+4,W0), (V0-4,W0), (V0+4,W0-4), (V0,W0-4), (V0-4,W0-4). This in turn produces a new best vector (V1,W1), which provides a guide for 1/4-pel refinement, and so on until the desired accuracy. The process is illustrated in the figure below.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: sub-pixel motion-vector refinement
The sub-pixel matching process is complicated slightly since the reference is only upconverted by a factor of 2 in each dimension, not 8, and so more accurate vectors require frame component values to be calculated on the fly by linear interpolation.This means that the 1/2-pel interpolation filter has a bit of pass-band boost to counteract the sag introduced by doing linear interpolation. It was designed to produce the lowest interpolation error across all the phases. The taps are (scaled to 5 bits):
( -1 , 3 , -7 , 21 , 21 , -7 , 3 , -1 )
[edit] RDO motion estimation metric
The performance of motion-estimation and motion-vector
coding is absolutely critical to the performance of a video coding scheme.With motion vectors at 1/4 or 1/8th pixel accuracy, a simple-minded strategy of finding the best match between frames can greatly inflate the resulting bitrate for little or no gain in quality because the
additional accuracy is very sensitive to noise. What is
required is the ability to trade off the vector bitrate with prediction accuracy and hence the bit rate required to code the residual frame and the eventual quality of that frame, whilst
at the same time making the estimator more robust.
The simplest way to do this is to incorporate a smoothing
factor into the metric used for matching blocks. So the metric consists of a basic block matching metric, plus some constant times a measure of the local motion vector smoothness. The basic block matching metric used by Dirac is Sum of Absolute
Differences (SAD). Given two blocks X,Y of samples, this
is given by:
SAD(X,Y)=
�
i,j
|
X
i,j
�
Y
i,j
|
The smoothness measure used is based on the difference between the candidate motion vector and the median of the neighbouring previously
computed motion vectors. Since the blocks are estimated in raster-scan order then vectors for blocks to the left and above are available for calculating the median:
Downloading a GIF rendering as your browser doesn't support SVG.
Please ignore the "install additional plugins" message if you see it.
More details
Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: neighbouring vectors available in raster-scan
order for local variance calculation
The vectors chosen for computing the local median
predictor are V2, V3 and V4; this has the merit of being the same predictor as is used in coding the motion vectors.
The total metric is a combination of these two metrics. Given a vector V which maps the current frame block X to a block Y=V(X) in the reference frame, the metric is given by:
SAD(X,Y)+
�
max�
(
|
V
x
�
pred
x
|+|
V
y
�
pred
y
|,48
)
The value � is a coding parameter used to control the
trade-off between the smoothness of the motion vector field and the accuracy of the match. When � is very large, the local variance dominates the calculation and the motion vector which gives the smallest metric is simply that which is closest to its neighbours. When � is very small, the metric is dominated by the SAD term, and so the best vector will simply be that which gives the best match for that block. For values in between, varying degrees of smoothness can be achieved. The parameter � is calculated as a multiple of the RDO
parameters for the L1 and
L2 frames, so that if the inter frames are compressed more heavily then smoother motion vector fields will also result.
The limit on the size of the smoothing factor is to 48 1/8ths
of a pixel. This prevents the motion field being smoothed excessively, since where there is a genuine motion transition the motion vector will legitimately differ from its neighbours and shouldn't be excessively penalised.
[edit] Superblock structures
Dirac uses macroblock structures to introduce a degree of adaption into motion estimation by allowing the size of the blocks used to vary. The motion estimation stage of the encoding is organised by macroblock, and each combination of block size and prediction mode is tried using the RDO block-matching metric - this is called mode decision in the software - and the best solution adopted macroblock by macroblock.
A macroblock consists of a 4x4 array of blocks, and there are three possible ways of splitting a MB:
Splitting level 0: no split, a single MV per reference frame for the MB;
Splitting level 1: split into four sub-macroblocks (sub-MBs), each a 2x2 array of blocks, one MV per reference frame per sub-MB;
Splitting level 2: split into the 16 constituent blocks.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Macroblock splitting modes
The splitting mode is chosen by redoing motion estimation for the sub-MBs and the MB as a whole, again using the RDO metric , suitably scaled to take into account the different sizes of the blocks. At the same time, the best prediction mode for each prediction unit (block, sub-MB or MB) is chosen. Four prediction modes are available:
INTRA: intra coded, predicted by DC value;
REF1_ONLY: only predict from the first reference;
REF2_ONLY: only predict from the second reference (if one exists);
REF1AND2: bi-directional prediction.
The result is a hierarchy of parameters: the splitting level determines what modes, motion vectors and block DC values (in the case of INTRA) need to be present.
In motion estimation, an overall cost for each MB is computed, and compared for each legal combination of these parameters. This is a tricky operation, and has a very significant effect on performance. The decisions interact very heavily with those made in coding the wavelet coefficients of the resulting residuals, and the best results probably depend on picture material, bit rate, the block size and its relationship to the size of the video frames, and the degree of perceptual weighting used in selecting quantisers for wavelet coefficients.
Dirac can use any block sizes, although blocks parameters do have to meet some constraints, so that the overlapping process works properly, especially in conjunction with subsampled chroma components (for which the blocks will be correspondingly smaller). For example, the block separations and corresponding lengths must differ by a multiple of four, so that overlap is symmetric for luma and sub-sampled chroma.
[edit] Block data
Parameters other than the MB splitting level are termed block data, even though they may apply to blocks, sub-MBs or the MB itself depending on the value of the splitting mode. The prediction mode has already been described. The five remaining block parameters are:
REF1_x: horizontal component of motion vector to the first reference frame;
REF1_y: vertical component of motion vector to the first reference frame;
REF2_x: horizontal component of motion vector to the second reference frame;
REF2_y: vertical component of motion vector to the second reference frame.
DC: DC or average value for the prediction unit for each component (Y, U or V) being coded, to 8-bit accuracy.
Clearly not all of these values must be coded. If the prediction mdoe is REF1_ONLY then REF2_x and REF2_y will not be coded, for example, and if the prediction unit is not INTRA, then no DC value need be sent.
Each different type of block data is coded as a separate block of data, allowing for parallelisation of the decoding process.
[edit] Motion vector data coding
Motion vector (MV) data coding is important to the performance of video coding, especially for codecs with a high level of MV accuracy (1/4 or 1/8 pel). For this reason, MV coding and decoding is quite complicated, since significant gains in efficiency can be made by choosing a good prediction and entropy coding structure. The basic format of the MV coding module is similar to the coding of coefficient data: it consists of prediction, followed by binarisation, context modelling and adaptive arithmetic coding:
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: motion vector entropy coding architecture
[edit] Prediction of motion vector data
All the motion vector data is predicted from previously encoded data from nearest neighbours. In predicting the data a number of conventions are observed.
The first convention is that all the so-called block data (prediction modes and the motion vectors themselves, and/or any DC values) is actually associated with the top-left block of the prediction unit to which they refer. This allows for a consistent prediction and coding structure to be adopted.
Example. If splitting level=1 and then the prediction units in a MB are sub-MBs. Nevertheless, the prediction mode and any motion vectors are associated with the top-left block of each sub-MB and values need not be coded for other blocks in the sub-MB.
The second convention is that all MB data is scanned in raster order for encoding purposes. All block data is scanned first by MB in raster order, and then in raster order within each MB. That is, taking each MB in raster order, each block value which needs to be coded within that MB is coded in raster order:
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: block data is scanned in raster order by MB and then in raster order within each MB
The third convention concerns the availability of values for prediction purposes when they may not be coded for every block. Since prediction will be based on neighbouring values, it is necessary to propagate values for the purposes of prediction when the MV data has conspired to ensure that values are not required for every block.
Example. In the next diagram, we can see the effect of this. Suppose we are coding REF1_x. In the first MB, splitting level=0 and so at most only the top-left block needs a value, which can be predicted from values in previously coded MBs. As it happens, the prediction mode REF1_ONLY and so a value is coded. The value v is then deemed to be applied to every block in the MB. In the next MB, splitting level=1, so the unit of prediction is the sub-MB. In the top-left sub-MB the prediction mode is, say, REF1AND2 and so a value x is coded for the top-left block of that sub-MB. It can be predicted from any available values in neighbouring blocks, and in particular the value v is available from the adjacent block.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: For the purposes of prediction, values are deemed to be propagated within MBs or sub-MBs.
Prediction methods
The prediction used depends on the MV data being coded, but in all cases the aperture for the predictor is shown in the figure below. This aperture is interpreted as blocks where block data is concerned and MBs where MB data is concerned.
Downloading a GIF rendering as your browser doesn't support SVG. Please ignore the "install additional plugins" message if you see it. More details Sorry, your browser can't connect to the server to download a GIF substitute. Either install an SVG-enabled browser or connect to the internet to download the diagram.
Figure: Aperture for MV prediction.
The splitting level is predicted as the mean of the levels of the three MBs in the aperture.
Of the block data, the MV prediction mode is predicted for reference 1 and reference 2 separately. The MV prediction mode is interpreted as a two bit number encoding the four possibilities: INTRA=0, REF1_ONLY=1, REF2_ONLY=2 and REF1AND2=3. In this way the first bit (bit 0) is a flag indicating that reference 1 is used and the second bit (bit 1) that reference 2 is used. Bit 0 and Bit 1 are predicted separately: if most of the 3 predicting blocks use reference 1 (ie their mode is REF1_ONLY or REF1AND2 and bit 0 is set) then it is predicted that bit 0 is set, and likewise for bit 1.
The DC values are predicted by the average of the three values in the aperture. The motion vectors themselves are predicted by predicting the horizontal and vertical components separately. The predictor for motion vector components is the median of the three corresponding values in the prediction aperture for the block.
In many cases MV or DC values are not available from all blocks in the aperture, for example if the prediction mode is different. In this case the blocks are merely excluded from consideration. Where only two values are available, the median motion vector predictor reverts to a mean. Where only one value is available, this is the prediction. Where no value is available, no prediction is made, except for the DC values, where 0 is used by default.
In the case of the MB_split data, the number of possible values is only 3 in the case of MB_split and 2. The prediction therefore can use modulo arithmetic and produces an unsigned prediction residue of 0,1 or 2 in the first case and 0 or 1 in the second. All other predictions produce signed prediction residues.
[edit] Motion vector entropy coding
Entropy coding of the MV prediction residuals uses the same basic architecture as for wavelet coefficient coding: interleaved exp-Golomb binarization, followed by adaptive arithmetic coding with multiple context models. For MV coding there are many different types of data, and these have their own context models.
