Planet Gamedev

Game From Scratch

Blender Video Tutorial/Tip: Modelling organic objects quickly and easily using Splines

by Mike@gamefromscratch.com at December 19, 2014 09:14 PM

 

This is the first in a series of Blender video quick tips showing how to do things in Blender (normally the easy/lazy way) that you may not already know how to do.

 

In this video we look at how to quickly model organic shapes using:

  • splines/curves
  • edge loop bridging
  • solidify
  • grid fill

 

The video is available in full 1080p here.  I am sorry for the lack of on-screen keys; I thought Camtasia would record them, but unfortunately it didn’t.  For future videos of this type I will use some form of on-screen keyboard.  If you have a suggestion, I would love to hear it!

 

LibGDX Video Tutorial: Spritesheets and TextureAtlases

by Mike@gamefromscratch.com at December 19, 2014 06:38 PM

 

In this tutorial we look at the process of creating and using a spritesheet in LibGDX.  This involves creating a series of sprites, packing them together with LibGDX's TexturePacker, then using a TextureAtlas and a TextureRegion to display them with our Sprite.  We also take a quick look at CodeAndWeb's TexturePacker (a different product with the same name) for those who prefer a UI.  Sample code and links to the included assets are below the video.

 

Once again, you can view the video in HD on YouTube by clicking here.

 

 

Example’s Source

package com.gamefromscratch;

import com.badlogic.gdx.ApplicationAdapter;
import com.badlogic.gdx.Gdx;
import com.badlogic.gdx.Input;
import com.badlogic.gdx.InputProcessor;
import com.badlogic.gdx.graphics.GL20;
import com.badlogic.gdx.graphics.g2d.Sprite;
import com.badlogic.gdx.graphics.g2d.SpriteBatch;
import com.badlogic.gdx.graphics.g2d.TextureAtlas;
import com.badlogic.gdx.graphics.g2d.TextureRegion;

public class SpritesheetDemo extends ApplicationAdapter implements InputProcessor {
   SpriteBatch batch;
   TextureAtlas textureAtlas;
   Sprite sprite;
   TextureRegion textureRegion;
   int currentFrame = 1;
   int MAX_FRAMES = 19;
   
   @Override
   public void create () {
      batch = new SpriteBatch();
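      // Load the atlas description generated by TexturePacker; "ss.txt" references the packed .png page(s)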
      textureAtlas = new TextureAtlas(Gdx.files.internal("ss.txt"));
      textureRegion = textureAtlas.findRegion("0001");
      sprite = new Sprite(textureRegion);
      sprite.setPosition(Gdx.graphics.getWidth()/2 - sprite.getWidth()/2,
            Gdx.graphics.getHeight()/2 - sprite.getHeight()/2);

      Gdx.input.setInputProcessor(this);
   }

   @Override
   public void render () {
      Gdx.gl.glClearColor(0, 0, 0, 1);
      Gdx.gl.glClear(GL20.GL_COLOR_BUFFER_BIT);
      batch.begin();
      sprite.draw(batch);
      batch.end();
   }

   @Override
   public boolean keyDown(int keycode) {
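      // Cycle through the packed frames ("0001" .. "0019") with the up/down arrow keys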
      if(keycode == Input.Keys.UP){
         currentFrame++;
         if(currentFrame > MAX_FRAMES)
            currentFrame = 1;
         sprite.setRegion(textureAtlas.findRegion(String.format("%04d",currentFrame)));
      }
      if(keycode == Input.Keys.DOWN){
         currentFrame--;
         if(currentFrame < 1)
            currentFrame = MAX_FRAMES;

         sprite.setRegion(textureAtlas.findRegion(String.format("%04d",currentFrame)));
      }
      return true;
   }

   @Override
   public boolean keyUp(int keycode) {
      return false;
   }

   @Override
   public boolean keyTyped(char character) {
      return false;
   }

   @Override
   public boolean touchDown(int screenX, int screenY, int pointer, int button) {
      return false;
   }

   @Override
   public boolean touchUp(int screenX, int screenY, int pointer, int button) {
      return false;
   }

   @Override
   public boolean touchDragged(int screenX, int screenY, int pointer) {
      return false;
   }

   @Override
   public boolean mouseMoved(int screenX, int screenY) {
      return false;
   }

   @Override
   public boolean scrolled(int amount) {
      return false;
   }
}

 

The sprite art used for this example was rendered using this Blender file.

The texture packing application (near the end) was CodeAndWeb’s TexturePacker.

Icare3D Blog

Maxwell GM204 OpenGL extensions

by Cyril Crassin (noreply@blogger.com) at December 18, 2014 05:06 PM

NVIDIA just launched the second-generation Maxwell architecture with the GM204 GPU, which is, I believe, an incredible chip. The Maxwell 2 architecture is both highly energy efficient (~2x the perf/watt of Kepler in games) and provides a lot of very exciting new graphics features (some of them are exposed in Direct3D). These features are exposed in the form of new OpenGL extensions in the R344 driver that was released today, and the specification for all NVIDIA-supported GL extensions can be found here. NVIDIA also released new SDK samples using these extensions.

Quick description of the new extensions

Multisampling

NV_framebuffer_mixed_samples and EXT_raster_multisample

This feature adds a lot of flexibility to multi-sampled rasterization. It decouples the rasterization sampling frequency (which can be set explicitly) from the actual framebuffer storage, which enables rasterization to operate at a higher sampling frequency than that of the target color render buffers. Depth and stencil testing are supported at this higher frequency, provided the corresponding depth and stencil buffers are sampled accordingly (their sample count must be a multiple of the number of samples in the color buffers).
There are still some constraints: all color buffers must have the same number of samples; the raster sample count must match the depth and stencil buffer sample count if depth or stencil testing is enabled; and it must be greater than or equal to the color buffer sample count.

A new “coverage reduction stage” is introduced in the per-fragment operations (after the fragment shader in early-z mode, after the depth-test in late-z), which converts a set of covered raster/depth/stencil samples to a set of covered color samples. There is an implementation-dependent association of raster samples to color samples. The reduced "color coverage" is computed such that the coverage bit for each color sample is 1 if any of the associated bits in the fragment's coverage is set, and 0 otherwise. This feature can be used in conjunction with the coverage to color feature (cf. below), in order to get the FS output coverage mask automatically transformed into a color by ROP. According to AnandTech, it seems that when rasterizing with explicit multisampling and no render-target, GM204 allows evaluating primitive coverage at 16x MSAA.

Note that EXT_raster_multisample is equivalent to "Target-Independent Rasterization" in Direct3D 11.1, which allows using multiple raster samples with a single color sample, as long as depth and stencil tests are disabled, and it is actually a subset of NV_framebuffer_mixed_samples which is more general and exposes more flexibility.
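To make this a bit more concrete, here is a rough sketch of what turning on target-independent rasterization looks like from the API side. The glRasterSamplesEXT entry point and the GL_RASTER_MULTISAMPLE_EXT enable come from EXT_raster_multisample; the GLEW include and the surrounding framebuffer setup are assumptions made for the sake of the sketch.

#include <GL/glew.h>

/* Sketch: rasterize at 16 samples per pixel while writing to a single-sample
   color attachment. Assumes a framebuffer with a non-multisampled color
   buffer is already bound and that EXT_raster_multisample is supported. */
void enable_target_independent_raster(void)
{
    glEnable(GL_RASTER_MULTISAMPLE_EXT);
    glRasterSamplesEXT(16, GL_TRUE);   /* 16 raster samples, fixed locations */

    /* EXT_raster_multisample on its own requires depth/stencil testing to be
       off; use NV_framebuffer_mixed_samples with matching multisampled
       depth/stencil buffers if you need those tests. */
    glDisable(GL_DEPTH_TEST);
    glDisable(GL_STENCIL_TEST);
}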


NV_fragment_coverage_to_color

This allows the ROP unit to automatically convert the post depth-/stencil-/alpha-test coverage mask into a color and write it into a color render target. This is performed before the new coverage reduction stage (cf. NV_framebuffer_mixed_samples). This can be useful for deferred shading.


EXT_post_depth_coverage

This extension allows the fragment shader to get the post depth-test coverage mask of the current fragment as an input (gl_SampleMaskIn[]) when operating in early-depth mode (in which only the samples passing the depth test are set), unlike the standard GL 4.5 behavior, which provides the pre-depth-test coverage (the actual triangle coverage).


NV_sample_mask_override_coverage

The standard GL behavior for the FS output coverage mask (gl_SampleMask[]) is to AND it with the actual primitive input coverage mask. This extension disables that operation, which allows the fragment shader to fully override the primitive coverage, potentially setting coverage bits that were not set in the input mask. This is actually very nice, because it allows dynamically routing color output values into arbitrary sample locations inside a multisampled render target.


NV_sample_locations

Allows applications to explicitly set the locations of sub-pixel samples for multisample rasterization, providing fully programmable sampling patterns. It seems that the sub-pixel positions are snapped to a 16x16 sub-pixel grid, and sampling patterns can be defined within a grid of 2x2 adjacent pixels.


Rasterization

NV_conservative_raster

This is a really great feature. It allows rasterization to generate fragments for any pixel touched by a triangle, even if no sample location in that pixel is covered. A new control is also provided to modify the window coordinate snapping precision, so that the application can match conservative-rasterization triangle snapping with the snapping that would have occurred at a higher resolution. Polygons with zero area generate no fragments. Any location within a pixel may be used for interpolating attributes, potentially causing attribute extrapolation if it lies outside the triangle. This can be useful for binning purposes, for instance (one pixel per tile).
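For illustration, a minimal sketch of how this would be switched on; the enable and the SubpixelPrecisionBiasNV call are taken from NV_conservative_raster, and the bias values here are placeholders rather than recommendations.

#include <GL/glew.h>

/* Sketch: enable conservative rasterization and add extra sub-pixel snapping
   bits, e.g. when binning geometry at one pixel per tile. Entry points per
   NV_conservative_raster; the bias of 4 extra bits per axis is arbitrary. */
void enable_conservative_raster(void)
{
    glEnable(GL_CONSERVATIVE_RASTERIZATION_NV);
    glSubpixelPrecisionBiasNV(4, 4);
}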


NV_fragment_shader_interlock

This extension exposes a hardware-accelerated critical section for the fragment shader, allowing hazard-free read-modify-write operations on a per-pixel basis. It also allows enforcing primitive ordering for threads entering the critical section. It provides new GLSL functions, beginInvocationInterlockNV() and endInvocationInterlockNV(), that define a critical section which is guaranteed to be executed by only one fragment at a time for a given pixel. The interlock can be done per-pixel or per-sample if multi-sampled rasterization is used. This is useful for algorithms that need to access per-pixel data structures via shader load and store operations while avoiding race conditions. Obvious applications are OIT and programmable blending, for instance.
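To give a feel for the shader side, here is a hedged sketch of a per-pixel critical section used for programmable blending. The begin/end calls are the ones named above; the pixel_interlock_ordered layout qualifier and the image-based blend are assumptions based on how interlock-style extensions are normally used, not code taken from the SDK samples.

/* Sketch: fragment shader source (as a C string) doing a read-modify-write
   blend into an image, guarded by the per-pixel interlock. */
static const char *interlock_frag_src =
    "#version 450\n"
    "#extension GL_NV_fragment_shader_interlock : require\n"
    "layout(pixel_interlock_ordered) in;\n"
    "layout(binding = 0, rgba8) coherent uniform image2D color_buf;\n"
    "in vec4 frag_color;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_FragCoord.xy);\n"
    "    beginInvocationInterlockNV();   // one fragment per pixel past this point\n"
    "    vec4 dst = imageLoad(color_buf, p);\n"
    "    imageStore(color_buf, p, mix(dst, frag_color, frag_color.a));\n"
    "    endInvocationInterlockNV();\n"
    "}\n";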


NV_fill_rectangle

This allows rasterizing the axis-aligned screen-space bounding box of submitted triangles, disregarding the actual triangle edges. This can be useful for drawing a full-screen quad without an internal edge, for instance, or for more efficiently drawing user interfaces.


Geometry processing

NV_geometry_shader_passthrough

This extension allows writing more efficient geometry shaders for the case where there is a one-to-one mapping between input and output primitives, per-vertex attributes are simply copied from the input primitive to the corresponding outputs of the output primitive, and the geometry shader is only used to set per-primitive attributes (like gl_Layer ...).


Texturing

EXT_sparse_texture2

This extension improves on ARB_sparse_texture, which separates the allocation of virtual address space from the allocation of physical memory for textures and provides the ability to sparsely allocate the physical backing store of 2D/3D/2D-array textures on a per-tile basis. This new extension adds the ability to retrieve texture access residency information from GLSL, to specify a minimum allocated LOD for texture fetches, and to return a constant zero value for lookups into unallocated pages. It also adds support for multi-sampled textures.


EXT_texture_filter_minmax

This exposes a new sampler parameter which allows performing a min or max reduction over the values sampled inside a texture filtering footprint, instead of the regular linear interpolation. It is supported for all kinds of textures, as well as for anisotropic filtering.
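In API terms this is just one more sampler parameter; a quick sketch (the enum names follow EXT_texture_filter_minmax and should be treated as assumptions):

#include <GL/glew.h>

/* Sketch: create a sampler that returns the per-footprint minimum instead of
   the filtered average. GL_MAX or GL_WEIGHTED_AVERAGE_EXT (the default) are
   the other reduction modes. */
GLuint make_min_reduction_sampler(void)
{
    GLuint smp;
    glGenSamplers(1, &smp);
    glSamplerParameteri(smp, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
    glSamplerParameteri(smp, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glSamplerParameteri(smp, GL_TEXTURE_REDUCTION_MODE_EXT, GL_MIN);
    return smp;
}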


Atomics 

NV_shader_atomic_fp16_vector

This extension provides a set of new atomic operations operating on 2- and 4-component vectors of 16-bit floating point values, for images, bindless pointers to global memory, and storage buffers.


cbloom rants

12-17-14 - PVQ Vector Distribution Note

by cbloom (noreply@blogger.com) at December 17, 2014 08:40 AM

So, PVQ, as in "Pyramid VQ", is just a way of making a VQ codebook for a unit vector that has a certain probability model (each value is Laplacian (the same Laplacian for all of them) and independent).

You have a bunch of values; you send the length separately, so you're left with a unit vector. Independent Laplacian values are not uniformly distributed on the unit sphere, so you don't want a normal unit-vector quantizer - you want this one, whose codebook lies on scaled squares (L1 spheres, the "pyramids").
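To make that concrete, here's a rough sketch in C of the gain/shape split and the pyramid quantization step. The greedy pulse placement is just the obvious way to do the search, not the exact codebook search any particular codec uses, K is taken as an input rather than derived from the gain, and the helper name is made up for the sketch.

#include <math.h>
#include <stdlib.h>

/* Sketch: split x[0..n-1] into a gain (L2 length, coded separately) and a
   shape vector y with sum(|y[i]|) == K, i.e. a point on the PVQ pyramid.
   Greedy pulse placement; real encoders do something smarter / RD-aware. */
void pvq_quantize(const float *x, int n, int K, float *gain_out, int *y)
{
    int i, placed = 0;
    float g = 0.0f, l1 = 0.0f;

    for (i = 0; i < n; i++) { g += x[i] * x[i]; l1 += fabsf(x[i]); }
    *gain_out = sqrtf(g);              /* sent on its own (possibly companded) */

    /* initial guess: scale so the L1 norm is ~K, truncating toward zero */
    for (i = 0; i < n; i++) {
        y[i] = (l1 > 0.0f) ? (int)(K * x[i] / l1) : 0;
        placed += abs(y[i]);
    }
    /* hand out the remaining pulses one at a time to the coefficient with
       the largest leftover magnitude */
    while (placed < K) {
        int best = 0;
        float best_err = -1.0f;
        for (i = 0; i < n; i++) {
            float want = (l1 > 0.0f) ? fabsf(x[i]) * K / l1 : 0.0f;
            float err = want - (float)abs(y[i]);
            if (err > best_err) { best_err = err; best = i; }
        }
        y[best] += (x[best] >= 0.0f) ? 1 : -1;
        placed++;
    }
}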

Okay, that's all fine, but the application to images is problematic.

In images, we have these various AC residuals. Let's assume 8x8 blocks for now for clarity. In each block, you have ACij (ij in 0-7 and AC00 excluded). The ij are frequency coordinates for each coefficient. You also have spatial neighbors in the adjacent blocks.

The problem is that the AC's are not independent - they're not independent in either frequency or spatial coordinates. AC42 is strongly correlated with AC21 and also with AC42 in the neighboring blocks. They also don't have the same distribution; lower frequency-index AC's have much higher means.

In order to use Pyramid VQ, we need to find a grouping of AC's into a vector, such that the values we are putting in that vector are as uncorrelated and equally distributed as possible.

One paper I found that I linked in the previous email forms a vector by taking all the coefficients in the same frequency slot in a spatial region (for each ij, take all ACij in the 4x4 neighborhood of blocks). This is appealing in the sense that it gathers AC's of the same frequency subband, so they have roughly the same distribution. The problem is that there are strong spatial correlations.

In Daala they form "subbands" of the coefficients by grouping together chunks of ACs that are in similar frequency groups.

The reason why correlation is bad is that it makes the PVQ codebook not optimal. For correlated values you should have more codebook vectors with neighboring values similar. eg. more entries around {0, 2, 2, 0} and fewer around {2, 0, 0, 2}. The PVQ codebook assumes those are equally likely.

You can, however, make up for this a bit with the way you encode the codebook index. It doesn't fix the quantizer, but it does extract the correlation.

In classical PVQ (P = Pyramid) you would simply form an index to the vector and send it with an equiprobable code. But in practice you might do it with binary subdivision or incremental enumeration schemes, and then you need not make all codebook vectors equiprobable.
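For a sense of scale, the size of that equiprobable codebook is easy to compute with the standard counting recurrence, and log2 of it is what a flat code costs. A small sketch (the function name and the example numbers at the bottom are just for illustration):

#include <stdio.h>
#include <math.h>

/* Sketch: N(L,K) = number of integer vectors of dimension L whose absolute
   values sum to exactly K, via N(L,K) = N(L-1,K) + N(L-1,K-1) + N(L,K-1),
   with N(L,0) = 1 and N(0,K>0) = 0. A flat (equiprobable) index then costs
   log2(N(L,K)) bits. */
#define LMAX 64
#define KMAX 64

double pvq_codebook_size(int L, int K)
{
    static double n[LMAX + 1][KMAX + 1];
    for (int l = 0; l <= L; l++)
        for (int k = 0; k <= K; k++)
            n[l][k] = (k == 0) ? 1.0
                    : (l == 0) ? 0.0
                    : n[l-1][k] + n[l-1][k-1] + n[l][k-1];
    return n[L][K];
}

int main(void)
{
    /* e.g. a 15-coefficient subband with K = 8 pulses */
    double cb = pvq_codebook_size(15, 8);
    printf("N(15,8) = %.0f codewords -> %.2f bits for a flat index\n", cb, log2(cb));
    return 0;
}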

For example, in Daala one of the issues for the lower subbands is that vectors with their signal in the low AC's are more probable than vectors with it in the high AC's. eg. for a subband that spans AC10 - AC40, {1,0,0,0} is much more likely than {0,0,0,1}.

Of course this becomes a big mess when you consider Predictive VQ, because the Householder transform scrambles everything up in a way that makes it hard to model these built-in skews. On the other hand, if the Predictive VQ removes enough of the correlation with the neighbors and the subband, then you are left with a roughly evenly distributed vector again, which is what you want.
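The reflection itself is nothing exotic; here's a generic sketch of the step that aligns the (unit-length) prediction with one coordinate axis and applies the same reflection to the input shape. This is just the textbook Householder construction, not Daala's actual implementation, and the fixed-size buffer and function name are assumptions for the sketch.

#include <math.h>

/* Sketch: build the Householder reflection that maps the unit prediction
   vector r onto coordinate axis 'axis' (up to sign), and apply it to x.
   Afterwards most of the predictable energy in x sits in x[axis]. */
void householder_align(float *x, const float *r, int n, int axis)
{
    float v[64];                       /* assumes n <= 64 for this sketch */
    float s = (r[axis] >= 0.0f) ? 1.0f : -1.0f;
    float vv = 0.0f, vx = 0.0f;
    int i;

    /* v = r + sign(r[axis]) * e_axis   (r is assumed to be unit length) */
    for (i = 0; i < n; i++) v[i] = r[i];
    v[axis] += s;

    for (i = 0; i < n; i++) { vv += v[i] * v[i]; vx += v[i] * x[i]; }
    if (vv <= 0.0f) return;            /* degenerate prediction, nothing to do */

    /* x <- x - 2 * (v.x / v.v) * v */
    for (i = 0; i < n; i++) x[i] -= 2.0f * (vx / vv) * v[i];
}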

Game From Scratch

No More waiting for Godot! Godot Engine reaches 1.0 milestone release

by Mike@gamefromscratch.com at December 16, 2014 07:28 PM

 

Sorry… absolutely can’t resist that pun, no matter how obvious it is.  Anyways… the Godot Game Engine has been on my radar since they announced they were going open source at the beginning of the year.  Today they finally announced their 1.0 release.

 

 

Never heard of Godot?  It’s a Unity-esque game engine, except powered by C++ and scripted using a Python-like scripting language.  It includes a surprising number of tools and most importantly, a complete game editor (pictured right).

 

Godot works in both 2D and 3D, with 2D being a first-class citizen, not just 3D minus a D.  Godot runs on Windows, OS X and Linux.  Godot is able to target iOS, Android, desktops, Google's NaCl, PlayStation 3 and Vita, with HTML5 and Windows Phone support coming soon.

 

You can read the complete feature list here.

 

You can browse available documentation here.

 

Godot is an open source project hosted on GitHub.

 

What do you think… are you interested in Godot?  Would you be interested in seeing GameFromScratch do some more in-depth coverage now that it’s reached such a milestone release?

cbloom rants

12-16-14 - Daala PVQ Emails

by cbloom (noreply@blogger.com) at December 16, 2014 05:02 PM

First, I want to note that the PVQ Demo page has good links at the bottom with more details in them, worth reading.

Also, the Main Daala page has more links, including the "Intro to Video" series, which is rather more than an intro and is a good read. It's a broad survey of modern video coding.

Now, a big raw dump of emails between me, ryg, and JM Valin. I'm gonna try to color them to make it a bit easier to follow. Thusly :

cbloom

ryg
valin

And this all starts with me being not very clear on PVQ so the beginning is a little fuzzy.

I will be following this up with a "summary of PVQ as I now understand it" which is probably more useful for most people. So, read that, not this.

(also jebus the internet is rindoculuos. Can I have BBS's back? Like literally 1200 baud text is better than the fucking nightmare that the internet has become. And I wouldn't mind playing a little Trade Wars once a day...)


Hi, I'm the author of the latest Daala demo on PVQ on which you commented recently. Here's some comments on your comments. I wasn't able to submit this to your blog, but feel free to copy them there. > 1. I've long believed that blocks should be categorized into > "smooth", "detail" and "edge". For most of this discussion we're > going to ignore smooth and edge and just talk about detail. That is, > blocks that are not entirely smooth, and don't have a dominant edge > through them (perhaps because that edge was predicted). I agree here, although right now we're treating "smooth" and "detail" in the same way (smooth is just low detail). Do you see any reason to treat those separately? > 2. The most important thing in detail blocks is preserving the > amount of energy in the various frequency subbands. This is something > that I've talked about before in terms of perceptual metrics. This is exactly what I had in mind with this PVQ work. Before Daala, I worked on the CELT part of the Opus codec, which has strict preservation of the energy. In the case of Daala, it looks so far like we want to relax the energy constraint a little. Right now, the codebook has an energy-preserving structure, but the actual search is R/D optimized with the same weight given to the amount of energy and its location. It's pretty easy to change the code to give more weight to energy preservation. I could even show you how to play with it if you're interested. > 3. You can take a standard type of codec and optimize the encoding > towards this type of perceptual metric, and that helps a bit, but > it's the wrong way to go. Because you're still spending bits to > exactly specify the noise in the high frequency area. Correct, hence the gain-shape quantization in my post. > 4. What you really want is a joint quantizer of summed energy and > the distribution of that energy. At max bit rate you send all the > coefficients exactly. As you reduce bitrate, the sum is preserved > pretty well, but the distribution of the lower-right (highest > frequency) coefficients becomes lossy. As you reduce bit rate more, > the total sum is still pretty good and the overall distribution of > energy is mostly right, but you get more loss in where the energy is > going in the lower frequency subbands, and you also get more scalar > quantization of the lower frequency subbands, etc. Well, my experience with both Opus and CELT is that you want the same resolution for the energy as you use for the "details". That being said, having an explicit energy still means you can better preserve it in the quantization process (i.e. it won't increase or decrease too much due to the quantization). > 5. When the energy is unspecified, you'd like to restore in some nice > way. That is, don't just restore to the same quantization vector > every time ("center" of the quantization bucket), since that could > create patterns. I dunno. Maybe restore with some randomness; restore > based on prediction from the neighborhood; restore to maximum > likelihood? (ML based on neighborhood/prediction/image not just a > global ML) I experimented a little bit with adding noise at the correct energy and while it slightly improved the quality on still images, it wasn't clear how to apply it to video because then you have the problem of static vs dynamic noise. > 6. An idea I've tossed around for a while is a quadtree/wavelet-like > coding scheme. Take the 8x8 block of coefficients (and as always > exclude DC in some way). Send the sum of the whole thing. 
Divide into > four children. So now you have to send a (lossy) distribution of that > sum onto the 4 child slots. Go to the upper left (LL band) and do it > again, etc. I considered something along these lines, but it would not be easy to do because the lowest frequencies would kind of drown out the high frequencies. > 7. The more energy you have, the less important its exact > distribution, due to masking. As you have more energy to distribute, > the number of vectors you need goes up a lot, but the loss you can > tolerate also goes up. In terms of the bits to send a block, it > should still increase as a function of the energy level of that > block, but it should increase less quickly than naive > log2(distributions) would indicate. Yes, that is exactly what the PVQ companding does. > 8. Not all AC's are equally likely or equally perceptually important. > Specifically the vector codebook should contain more entries that > preserve values in the upper-left (low frequency) area. This is the equivalent of the quantization matrix, which PVQ has as well (though I didn't really talk about it). > 9. The interaction with prediction is ugly. (eg. I don't know how to > do it right). The nature of AC values after mocomp or > intra-prediction is not the same as the nature of AC's after just > transform (as in JPEG). Specifically, ideas like variance masking and > energy preservation apply to the transformed AC values, *not* to the > deltas that you typically see in video coding. Handling the prediction is exactly what the whole Householder reflection in PVQ is about (see the 6 steps figure). The PVQ gain encoding scheme is always done on the input and not on the prediction. So the activity masking is applied on the input energy and not based on the energy of the residual. > 10. You want to send the information about the AC in a useful order. > That is, the things you send first should be very strong classifiers > of the entropy of that block for coding purposes, and of the masking > properties for quantization purposes. Well, coding the energy first achieves most of this. > You don't want sending the "category" or "masking" information to be > separate side-band data. It should just be the first part of sending > the coefficients. So your category is maybe something like the > bit-vector of which coefficient groups have any non-zero > coefficients. Something like that which is not redundant with sending > them, it's just the first gross bit of information about their > distribution. Well, the masking information is tied to the gain. For now, the category information is only tied to the block size decision (based on the assumption that edges will be 4x4), but it's not ideal and it's something I'd like to improve. On the topic of lapped transform, this has indeed been causing us all sorts of headaches, but it also has interesting properties. Jury's still out on that one, but so far I think we've managed to make reasonably good use of it. Cheers, Jean-Marc

Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!) Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers) Assuming the former - Isn't it just wrong? The various AC's have different laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right. In particular PVQ assumes all coefficients are equally likely and equally distributed. In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order. How is K chosen?

Hi, On 02/12/14 08:52 PM, Charles Bloom wrote: > Thanks for writing! Before I address specific points, maybe you can > teach me a bit about PVQ and how you use it? I can't find any good > resources on the web (your abstract is rather terse). Maybe you can > point me at some relevant reference material. (the CELT paper is rather > terse too!) I'm currently writing a longer paper for a conference in February, but for now there isn't much more than the demo and the abstract I link to at the bottom. I have some notes that describe some of the maths, but it's a bit all over the place right now: http://jmvalin.ca/video/video_pvq.pdf > Are you constructing the PVQ vector from the various AC's within a > single block? Or gathering the same subband from spatial neighbors? (I > think the former, but I've seen the latter in papers) > > Assuming the former - Correct. You can see the grouping (bands) in Fig. 1 of: http://jmvalin.ca/video/spie_pvq_abstract.pdf > Isn't it just wrong? The various AC's have different laplacian > distributions (lower frequencies more likely) so using PVQ just doesn't > seem right. > > In particular PVQ assumes all coefficients are equally likely and > equally distributed. > > > In your abstract you seem to describe a coding scheme which is not a > uniform length codeword like traditional PVQ. It looks like it assigns > shorter codes to vectors that have their values early on in some kind of > z-scan order. One thing to keep in mind if that the P in PVQ now stands for "perceptual". In Daala we are no longer using the indexing scheme from CELT (which does assume identical distribution). Rather, we're using a coding scheme based on Laplace distribution of unequal variance. You can read more about the actual encoding process in another document: http://jmvalin.ca/video/pvq_encoding.pdf > How is K chosen? The math is described (poorly) in section 6.1 of http://jmvalin.ca/video/video_pvq.pdf Basically, the idea is to have the same resolution in the direction of the gain as in any other direction. In the no prediction case, it's roughly proportional to the gain times the square root of the number of dimensions. Because K only depends on values that are available to the decoder, we don't actually need to signal it. Hope this helps, Jean-Marc

Thanks for the responses and the early release papers, yeah I'm figuring most of it out. K is chosen so that distortion from the PVQ (P = Pyramid) quantization is the same as distortion from gain quantization. Presumably under a simple D metric like L2. The actual PVQ (P = Pyramid) part is the simplest and least ambiguous. The predictive stuff is complex. Let me make sure I understand this correctly - You never actually make a "residual" in the classic sense by subtracting the prediction off. You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!) The gain of the current block is sent (for each subband). Not the gain of the delta. The gain of the prediction in the same band is used as coding context? (the delta of the quantized gains could be sent). The big win that you guys were after in sending the gain seems to have been the non-linear quantization levels; essentially you're getting "variance adaptive quantization" without explicitly sending per block quantizers. The Householder reflection is the way that vectors near the prediction are favored. This is the only way that the predicted block is used!? Madness! (presumably if the prediction had detail that was finer than the quantization level of the current block that could be used to restore within the quantization bucket; eg. for "golden frames")

On 03/12/14 12:18 AM, Charles Bloom wrote: > K is chosen so that distortion from the PVQ (P = Pyramid) quantization > is the same as distortion from gain quantization. Presumably under a > simple D metric like L2. Yes, it's an L2 metric, although since the gain is already warped, the distortion is implicitly weighted by the activity masking, which is exactly what we want. > You never actually make a "residual" in the classic sense by subtracting > the prediction off. Correct. > You form the prediction in transformed space. (perhaps by having a > motion vector, taking the pixels it points to and transforming them, > dealing with lapping, yuck!) We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do. > The gain of the current block is sent (for each subband). Not the gain > of the delta. Correct. > The gain of the prediction in the same band is used as > coding context? (the delta of the quantized gains could be sent). Yes, the gain is delta-coded, so coding "same gain" is cheap. Especially, there's a special symbol for gain=0,theta=0, which means "skip this band and use prediction as is". > The big win that you guys were after in sending the gain seems to have > been the non-linear quantization levels; essentially you're getting > "variance adaptive quantization" without explicitly sending per block > quantizers. Exactly. Not only that but it's adaptive based on the variance of the current band, not just an entire macroblock. > The Householder reflection is the way that vectors near the prediction > are favored. This is the only way that the predicted block is used!? > Madness! Well, the reference is used to compute the reflection *and* the gain. In the end, we're using exactly the same amount of information, just in a different space. > (presumably if the prediction had detail that was finer than the > quantization level of the current block that could be used to restore > within the quantization bucket; eg. for "golden frames") Can you explain what you mean here? Jean-Marc

So one thing that strikes me is that at very low bit rate, it would be nice to go below K=1. In the large high-frequency subbands, the vector dimension N is very large, so even at K=1 it takes a lot of bits to specify where the energy should go. It would be nice to be more lossy with that location. It seems that for low K you're using a zero-runlength coder to send the distribution, with a kind of Z-scan order, which makes it very similar to standard MPEG. (maybe you guys aren't focusing on such low bit rates; when I looked at low bit rate video the K=1 case dominated) At 09:42 PM 12/2/2014, you wrote: > (presumably if the prediction had detail that was finer than the > quantization level of the current block that could be used to restore > within the quantization bucket; eg. for "golden frames") Can you explain what you mean here? If you happen to have a very high quality previous block (much better than your current quantizer / bit rate should give you) - with normal mocomp you can easily carry that block forward, and perhaps apply corrections to it, but the high detail of that block is preserved. With the PVQ scheme it's not obvious to me that that works. When you send the quantized gain of the subbands you're losing precision (it looks like you guys have a special fudge to fix this, by offsetting the gain based on the prediction's gain?) But for the VQ part, you can't really "carry forward" detail in the same way. I guess the reflection vector can be higher precision than the quantizer, so in a sense that preserves detail, but it doesn't carry forward the same values, because they drift due to rotation and staying a unit vector, etc.

Some more questions - Is the Householder reflection method also used for Intra prediction? (do you guys do the directional Intra like H26x ?) How much of this scheme is because you believe it's the best thing to do vs. you have to avoid H26x patents? If you're not sending any explicit per-block quantizer, it seems like that removes a lot of freedom for future encoders to do more sophisticated perceptual optimization. (ROI bit allocation or whatever)

On 03/12/14 02:17 PM, Charles Bloom wrote: > So one thing that strikes me is that at very low bit rate, it would be > nice to go below K=1. In the large high-frequency subbands, the vector > dimension N is very large, so even at K=1 it takes a lot of bits to > specify where the energy should go. It would be nice to be more lossy > with that location. Well, for large N, the first gain step already has K>1, which I believe is better than K=1. I've considered adding an extra gain step with K=1 or below, but never had anything that was really worth it (didn't try very hard). > It seems that for low K you're using a zero-runlength coder to send the > distribution, with a kind of Z-scan order, which makes it very similar > to standard MPEG. > > (maybe you guys aren't focusing on such low bit rates; when I looked at > low bit rate video the K=1 case dominated) We're also targeting low bit-rates, similar to H.265. We're not yet at our target level of performance though. > Is the Householder reflection method also used for Intra prediction? > (do you guys do the directional Intra like H26x ?) We also use it for intra prediction, though right now our intra prediction is very limited because of the lapped transform. Except for chroma which we predict from the luma. PVQ makes this particularly easy. We just use the unit vector from luma as chroma prediction and code the gain. > How much of this scheme is because you believe it's the best thing to > do vs. you have to avoid H26x patents? The original goal wasn't to avoid patents, but it's a nice added benefit. > If you're not sending any explicit per-block quantizer, it seems like > that removes a lot of freedom for future encoders to do more > sophisticated perceptual optimization. (ROI bit allocation or > whatever) We're still planning on adding some per-block/macroblock/something quantizers, but we just won't need them for activity masking. Cheers, Jean-Marc

Hi, Just read your "smooth blocks" post and I thought I'd mention on thing we do in Daala to improve the quality of smooth regions. It's called "Haar DC" and the idea is basically to apply a Haar transforms to all the DCs in a superblock. This has the advantage of getting us much better quantization resolution at large scales. Unfortunately, there's absolutely no documentation about it, so you'd have to look at the source code, mostly in od_quantize_haar_dc() and a bit of od_compute_dcts() http://git.xiph.org/?p=daala.git;a=blob;f=src/encode.c;h=879dda;hb=HEAD Cheers, Jean-Marc

Yeah I definitely can't follow that code without digging into it. But this : "much better quantization resolution at large scales." is interesting. When I did the DLI test : http://cbloomrants.blogspot.com/2014/08/08-31-14-dli-image-compression.html something I noticed in both JPEG and DLI (and in everything else, I'm sure) is : Because everyone just does naive scalar quantization on DC's, large regions of solid color will shift in a way that is very visible. That is, it's a very bad perceptual RD allocation. Some bits should be taken away from AC detail and put into making that large region DC color more precise. The problem is that DC scalar quantization assumes the blocks are independent and random and so on. It models the distortion of each block as being independent, etc. But it's not. If you have the right scalar quantizer for the DC when the blocks are in a region of high variation (lots of different DC's) then that is much too large a quantizer for regions where blocks all have roughly the same DC. This is true even when there is a decent amount of AC energy, eg. the image I noticed it in was the "Porsche640" test image posted on that page - the greens of the bushes all color shift in a very bad way. The leaf detail does not mask this kind of perceptual error.

Two more questions - 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, how does that work with gain preservation and the Pyramid VQ unit vector? 2. Do you mind if I post all these mails publicly?

On 11/12/14 02:03 PM, Charles Bloom wrote: > 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, > how does that work with gain preservation and the Pyramid VQ unit vector? Right now, we just set a different quantizer value for each "band", so we can't change resolution on a coefficient-by-coefficient basis, but it still looks like a good enough approximation. If needed we might try doing something fancier at some point. > 2. Do you mind if I post all these mails publicly? I have no problem with that and in fact I encourage you to do so. Cheers, Jean-Marc


ryg: Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit. Re "3 1/2. The normal zig-zag coding schemes we use are really bad." Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? Classic JPEG/MPEG style coding rules (H.264 and later are somewhat different) 1. assume short runs are more probable than long ones and 2. give a really cheap way to end blocks early. The result is that the coder likes blocks with a fairly dense cluster in the first few coded components (and only this is where zig-zag comes in) and truncated past that point. Now take Fischer-style PVQ (original paper is behind a paywall, but this: http://www.nul.com/pbody17.pdf covers what seems to be the proposed coding scheme). You have two parameters, N and K. N is the dimensionality of the data you're coding (this is a constant at the block syntax level and not coded) and K is the number of unit pulses (=your "energy"). You code K and then send an integer (with a uniform model!) that says which of all possible arrangements of K unit pulses across N dimensions you mean. For 16-bit ACs in a 8x8 block so N=63, there's on the order of 2^(63*16) = 2^1008 different values you could theoretically code, so clearly for large K this integer denoting the configuration can get quite huge. Anyway, suppose that K=1 (easiest case). Then the "configuration number" will tell us where the pulse goes and what sign it has, uniformly coded. That's essentially a run length with *uniform* distribution plus sign. K=2: we have two pulses. There's N*2 ways to code +-2 in one AC and the rest zeros (code AC index, code sign), and (N choose 2) * 2^2 ways to code two slots at +-1 each. And so forth for higher K. From there, we can extrapolate what the general case looks like. I think the overall structure ends up being isomorphic to this: 1. You code the number M (=N) of nonzero coefficients using a model derived from the combinatorics given N and K (purely counting-based). (K=1 implies M=1, so nothing to code in that case.) 2. Code the M sign bits. 3. Code the positions of the M nonzero coeffs - (N choose M) options here. 4. Code another number denoting how we split the K pulses among the M coeffs - that's an integer partition of K into exactly M parts, not sure if there's a nice name/formula for that. This is close enough to the structure of existing AC entropy coders that we can meaningfully talk about the differences. 1) and 2) are bog-standard (we use a different model knowing K than a regular codec that doesn't know K would, but that's it). You can view 3) in terms of significance masks, and the probabilities have a reasonably simple form (I think you can adapt the Reservoir sampling algorithm to generate them) - or, by looking at the zero runs, in term of run lengths. And 4) is a magnitude coder constrained by knowing the final sum of everything. So the big difference is that we know K at the start, which influences our choice of models forthwith. But it's not actually changing the internal structure that much! That said, I don't think they're actually doing Fischer-style PVQ of "just send a uniform code". The advantage of breaking it down like above is that you have separate syntax elements that you can apply additional modeling on separately. Just having a giant integer flat code is not only massively unwieldy, it's also a bit of a dead end as far as further modeling is concerned. -Fabian

cbloom: At 01:36 PM 12/2/2014, you wrote: Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit. Re "3 1/2. The normal zig-zag coding schemes we use are really bad." Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? My belief is that for R/D optimization, it's bad when there's a big R step that doesn't correspond to a big D step. You want the prices of things to be "fair". So the problem is cases like : XX00 X01 0 vs XX00 X001 000 00 0 which is not a very big D change at all, but is a very big R step. I think it's easy to see that even keeping something equivalent to the zigzag, you could change it so that the position of the next coded value is sent in a way such that the rates better match entropy and distortion. But of course, really what you want is to send the positions of those later values in a lossy way. Even keeping something zigzagish you can imagine easy ways to do it, like you send a zigzag RLE that's something like {1,2,3-4,5-7,8-13} whatever.

ryg: Actually the Fischer (magnitude enumeration) construction corresponds pretty much directly to a direct coder: from the IEEE paper, l = dim, k = number of pulses, then the number of code words N(l,k) is

   N(l,k) = sum_{i=-k}^k N(l-1, k-|i|)

This is really direct: N(l,k) just loops over all possible values i for the first AC coeff. The remaining uncoded ACs then are l-1 dimensional with k-|i| pulses left. Divide through by N(l,k) and you have a probability distribution for coding a single AC coeff. Splitting out the i=0 case and sign, we get:

   N(l,k) = N(l-1,k) + 2 * sum_{j=1}^k N(l-1,k-j) =: N(l-1,k) + 2 * S(l-1,k)

which corresponds 1:1 to this encoder:

   // While energy (k) left
   for (i = 0; k > 0; i++) {
      assert(i < Ndims); // shouldn't get to N with leftover energy
      int l = N - i;     // remaining dims
      int coeff = coeffs[i];

      // encode significance
      code_binary(coeff == 0, N(l-1,k) / N(l,k));
      if (coeff != 0) {
         // encode sign
         code_binary(coeff < 0, 0.5);

         int mag = abs(coeff);
         // encode magnitude (multi-symbol)
         //   prob(mag=j) = N(l-1,k-j) / S(l-1,k)
         // then:
         k -= mag;
      }
   }

and this is probably how you'd want to implement it given an arithmetic back end anyway. Factoring it into multiple decisions is much more convenient (and as said before, easier to do secondary modeling on) than the whole "one giant bigint" mess you get if you're not low-dimensional. Having the high-dimensional crap in there blows because the probabilities can get crazy. Certainly Ndims=63 would suck to work with directly. Separately, I'd expect that for k "large" (k >= Ndims? Ndims*4? More? Less?) you can use a simpler coder and/or fairly inaccurate probabilities because that's gonna be infrequent. Maybe given k = AC_sum(1,63) = sum_{i=1}^63 |coeff_i|, there's a reasonably nice way to figure out say AC_sum(1,32) and AC_sum(33,63). And if you can do that once, you can do it more than once. Kind of a top-down approach: you start with "I have k energy for this block" and first figure out which subband groups that energy goes into. Then you do the "detail" encode like above within each subband of maybe 8-16 coeffs; with l=Ndim=8 and k=Ndim*small, you would have reasonable (practical) model sizes. -Fabian

cbloom: No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( N ) bits. Your coding scheme is variable length. I need to find the original Fisher paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong. I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neigboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ? (paper attached to next email)

ryg: On 12/2/2014 5:46 PM, Charles Bloom {RAD} wrote: No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( N ) bits. Your coding scheme is variable length. I wasn't stating that Fischer's scheme is variable-length; I was stating that the decomposition as given implies a corresponding way to encode it that is equivalent (in the sense of exact same cost). It's not variable length. It's variable number of symbols but the output length is always the same (provided you use an exact multi-precision arithmetic coder that is, otherwise it can end up larger due to round-off error). log2(N(l,k)) is the number of bits we need to spend to encode which one out of N(l,k) equiprobable codewords we use. The ceil(log2(N)) is what you get when you say "fuck it" and just round it to an integral number of bits, but clearly that's not required. So suppose we're coding to the exact target rate using a bignum rationals and an exact arithmetic coder. Say I have a permutation of 3 values and want to encode which one it is. I can come up with a canonical enumeration (doesn't matter which) and send an index stating which one of the 6 candidates it is, in log2(6) bits. I can send one bit stating whether it's an even or odd permutation, which partitions my 6 cases into 2 disjoint subsets of 3 cases each, and then send log2(3) bits to encode which of the even/odd permutations I am, for a total of log2(2) + log2(3) = log2(6) bits. Or I can get fancier. In the general case, I can (arbitrarily!) partition my N values into disjoint subsets with k_1, k_2, ..., k_m elements, respectively, sum_i k_i = N. To code a number, I then first code the number of the subset it's in (using probability p_i = k_i/N) and then send a uniform integer denoting which element it is, in log2(k_i) bits. Say I want to encode some number x, and it falls into subset j. Then I will spend -log2(p_i) + log2(k_i) = -log2(k_i / N) + log2(k_i) = log2(N / k_i) + log2(k_i) = log2(N) bits (surprise... not). I'm just partitioning my uniform distribution into several distributions over smaller sets, always setting probabilities exactly according to the number of "leaves" (=final coded values) below that part of the subtree, so that the product along each path is still a uniform distribution. I can nest that process of course, and it's easy to do so in some trees but not others meaning I get non-uniform path lengths, but at no point am I changing the size of the output bitstream. That's exactly what I did in the "coder" given below. What's the value of the first AC coefficient? It must obey -k = ac_0 = k per definition of k, and I'm using that to partition our codebook C into 2k+1 disjoint subsets, namely C_x = { c in C | ac0(c) = x } and nicely enough, by the unit-pulse definition that leads to the enumeration formula, each of the C_x corresponds to another PVQ codebook, namely with dimension l-1 and energy k-|x|. Which implies the whole thing decomposes into "send x and then do a PVQ encode of the rest", i.e. the loop I gave. That said, one important point that I didn't cover in my original mail: from the purposes of coding this is really quite similar to a regular AC coder, but of course the values being coded don't mean the same thing. In a JPEG/MPEG style entropy coder, the values I'm emitting are raw ACs. 
PVQ works (for convenience) with code points on an integer lattice Z^N, but the actual AC coeffs coded aren't those lattice points, they're (gain(K) / len(lattice_point)) * lattice_point (len here being Euclidean and not 1-norm!). I need to find the original Fisher paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong. I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neigboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ? (paper attached to next email) The link to the extended abstract for the Daala scheme (which covers this) is on the Xiph demo page: http://jmvalin.ca/video/spie_pvq_abstract.pdf Page 2 has the assignment of coeffs to subbands. They're only using a handful, and notably they treat 4x4 blocks as a single subband. -Fabian

cbloom: Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding. There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem. Basically the idea is something like - divide the block into subbands. Let's say the standard wavelet tree for concreteness - 01447777 23447777 55667777 55667777 8.. Send the sum in each subband ; this is the "gain" ; let's say g_s g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used) For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) Quantize the vector to a PVQ lattice point; send the lattice index So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other? Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context). The slides (reference #4) clear things up a bit.

ryg: On 12/2/2014 8:21 PM, Charles Bloom {RAD} wrote: Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding. There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. Well, the one I linked to has a couple variants already. But it's pretty much besides the point. You can of course turn this into a giant combinatorical circle-jerk, but I don't see the use. For example (that's one of the things in the paper I linked to) if you're actually assigning indexes to values then yeah, the difference between assigning codewords in order { 0, -1, 1, -2, 2, ... } and { -k, -k+1, ..., -1, 0, 1, 2, ..., k } matters, but once you decompose it into several syntax elements most of that incidental complexity just disappears completely. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem. Yeah, agreed. Basically the idea is something like - divide the block into subbands. Let's say the standard wavelet tree for concreteness - 01447777 23447777 55667777 55667777 8.. Yup. Send the sum in each subband ; this is the "gain" ; let's say g_s No, the gain isn't sum (1-norm), it's the Euclidean (2-norm) length. If you used 1-norm you wouldn't deform the integer lattice, meaning you're still just a scalar quantizer, just one with a funky backend. E.g. in 2D, just sending k = q(|x| + |y|) (with q being a uniform scalar quantizer without dead zone for simplicity) and then coding where the pulses go is just using the same rectangular lattice as you would have if you were sending q(x), q(y) directly. (Once you add a dead zone that's not true any more; scalar favors a "+" shape around the origin whereas the 1-norm PVQ doesn't. But let's ignore that for now.) With an ideal vector quantizer you make the "buckets" (=Voronoi regions) approximately equally likely. For general arbitrary 2D points that means the usual hex lattice. The PVQ equivalent of that is the NPVQ pattern: https://people.xiph.org/~jm/daala/pvq_demo/quantizer4.png That's clearly suboptimal (not a real hex lattice at all), but it has the nice gain/shape-separation: the circles are all equal-gain. You unwrap each circle by normalizing the point in the 1-norm, and then sending the corresponding AC pulses. g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used) q_s would come from the rate control, as usual. g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K. CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though! For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). 
The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. You look at circles with a radius in the right neighborhood (most obviously, just floor(g) and ceil(g), though you might want to widen the search if you're doing RDO). You find the closest lattice points on both circles (this is convex, so no risk of getting stuck in a local min). Choose whichever of the two circles is better. (All points *on* the same circle have the same cost, at least with vanilla PVQ. So the only RD trade-off you do is picking which circle.) K_s is the index of the circle you're on. The origin is K_s=0, the first real circle is K_s=1 (and has 2N points where N is your dimensionality), and so forth. Quantize the vector to a PVQ lattice point; send the lattice index Finding that is the convex search. So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. Well, that's the combinatorical part. The actual vector quantizer is the idea of warping the 1-norm diamonds into sensibly-spaced 2-norm circles. The regular structure enables the simplified search. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other? Not sure if they're doing any additional masking beyond that. If they do, they're not talking about it. Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context). As far as I can tell, yeah. And yes, definitely gain of the overall block, not of the residual! Again, you have the separation into gain and shape here. The gains are coded separately, and hence out of the equation. What remains is unit vectors for both your target block and your prediction. That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere. 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). This tells you how good the prediction is. If it's 0.9, you've just removed 90% of the energy to code. You need to quantize this appropriately - you want to make sure the quantizer resolution here is reasonably matched to quantizer resolution of points on your sphere, or you're wasting bits. Now you turn whatever is left of g into K (as above). You can iterate this as necessary. If you do bi-prediction, you can do another Householder reflection to align the energy of pred2 that was orthogonal to pred1 (the rest is gone already!) with the 2nd AC. You code another correlation coefficient and then deal with the residuals. Fade-in / fade-out just kind of fall out when you do prediction like this. It's not a special thing. The "ACs" don't change, just the gains. Handling cross-fades with one predictor is still shitty, but if you're doing bipred they kinda fall out as well. It all sounds pretty cool. 
But I have no clue at all how well it works in practice or where it stands cost/benefit-wise. -Fabian
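To make the gain/shape split concrete: below is a small C++ sketch of the greedy pulse-allocation search that is commonly used to find the closest PVQ codeword, plus the gain-times-normalized-shape reconstruction. This is an illustration under the assumptions discussed above, not Daala's actual implementation, and it omits the RDO over neighboring gain circles.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch (not Daala's code): quantize the "shape" of a band to the PVQ
// codebook, i.e. find an integer vector y with sum(|y[i]|) == K whose direction
// is close to x. Greedy pulse-by-pulse search that maximizes the cosine between
// y and x after each added pulse.
std::vector<int> PvqQuantizeShape(const std::vector<float>& x, int K)
{
  const size_t n = x.size();
  std::vector<int> y(n, 0);

  float corr = 0.0f;     // sum_i y[i] * |x[i]|
  float energy = 0.0f;   // sum_i y[i]^2

  for (int pulse = 0; pulse < K; ++pulse)
  {
    size_t best = 0;
    float bestScore = -1.0f;
    for (size_t i = 0; i < n; ++i)
    {
      // squared cosine between x and y if one more pulse goes into slot i
      const float c = corr + std::fabs(x[i]);
      const float e = energy + 2.0f * y[i] + 1.0f;
      const float score = (c * c) / e;
      if (score > bestScore)
      {
        bestScore = score;
        best = i;
      }
    }
    corr += std::fabs(x[best]);
    energy += 2.0f * y[best] + 1.0f;
    ++y[best];
  }

  // pulses take the sign of the corresponding input coefficient
  for (size_t i = 0; i < n; ++i)
    if (x[i] < 0.0f)
      y[i] = -y[i];

  return y;
}

// Reconstruction: scale the L2-normalized shape by the (dequantized) gain g.
std::vector<float> PvqReconstruct(const std::vector<int>& y, float g)
{
  float norm = 0.0f;
  for (int v : y)
    norm += float(v) * float(v);
  norm = std::sqrt(norm);

  std::vector<float> out(y.size(), 0.0f);
  if (norm > 0.0f)
    for (size_t i = 0; i < y.size(); ++i)
      out[i] = g * float(y[i]) / norm;
  return out;
}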

ryg:

> That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere.

Important note for this and all that follows: for this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This, combined with the reflections etc., makes this pretty pricey, all things considered.

If you weren't splitting by subbands, I believe you could finesse your way around this: (normalized) DCT and Householder reflections are both unitary, so they preserve both the L2 norm and dot products. Which means you could calculate both the overall gain and the correlation coeffs for your prediction *before* you do the DCT (and hence, in the decoder, add that stuff back in post-IDCT, without having to DCT your reference). But with the subband splitting, that no longer works, at least not directly. You could still do it with a custom filter bank that just passes through precisely the DCT coeffs we're interested in for each subband, but eh, somehow I have my doubts that this is gonna be much more efficient than just eating the DCT. It would certainly add yet another complicated mess to the pile.

-Fabian
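For reference, the reflection being discussed is a standard Householder reflection that maps the prediction vector onto a single coordinate axis (the "1st AC" slot). The following is a small C++ sketch of just that step, assuming block and prediction are already unit vectors in the same DCT space; it is an illustration, not Daala's code.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch (not Daala's code): build the Householder reflection that maps
// the prediction vector r onto coordinate axis 'axis' (the "1st AC" slot), and
// apply the same reflection to the block vector x. The reflection is unitary,
// so x stays unit-length if it was unit-length before.
void HouseholderAlign(std::vector<float>& x, const std::vector<float>& r, size_t axis)
{
  const size_t n = r.size();

  float rNorm = 0.0f;
  for (float v : r)
    rNorm += v * v;
  rNorm = std::sqrt(rNorm);

  // u = r + sign(r[axis]) * |r| * e_axis;  H(v) = v - 2 u (u.v) / (u.u)
  // maps r onto the chosen axis (up to sign), with good numerical behavior.
  const float s = (r[axis] < 0.0f) ? -1.0f : 1.0f;
  std::vector<float> u(r);
  u[axis] += s * rNorm;

  float uu = 0.0f;
  float ux = 0.0f;
  for (size_t i = 0; i < n; ++i)
  {
    uu += u[i] * u[i];
    ux += u[i] * x[i];
  }
  if (uu <= 0.0f)
    return;   // r was (numerically) zero, nothing to align against

  const float k = 2.0f * ux / uu;
  for (size_t i = 0; i < n; ++i)
    x[i] -= k * u[i];

  // x[axis] is now (up to sign) the correlation between block and prediction
  // unit vectors - the quantity that gets coded as the "angle".
}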

cbloom: At 10:23 PM 12/2/2014, Fabian Giesen wrote:

> For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This, combined with the reflections etc., makes this pretty pricey, all things considered.

Yeah, I asked Valin about this. They form an entire predicted *image* rather than going block-by-block, because of lapping. They transform the predicted image the same way as the current frame. Each subband gain is sent as a delta from the predicted image's subband gain. Crazy!

His words:

> > You form the prediction in transformed space. (perhaps by having a
> > motion vector, taking the pixels it points to and transforming them,
> > dealing with lapping, yuck!)
>
> We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do.

cbloom: At 09:31 PM 12/2/2014, you wrote:

> q_s would come from the rate control, as usual.

Yeah, I just mean that the details of that are actually one of the most important issues. E.g. how does Q vary for the different subbands? Is there inter-subband masking, etc.? In Daala the Q is non-linear (variance adaptive quantizer).

> g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.

In Daala they send g and derive K.

> CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. the sum of squares of the AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though!

Err, yeah, maybe. What they actually did was take the x264 VAQ and try to reproduce it.

> > For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)
>
> You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer, so your choices of radius are discrete and limited.

No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g), but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to the gain scalar quantization.

> 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period).

I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets.

It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much.

It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

ryg:

> Yeah, I asked Valin about this. They form an entire predicted *image* rather than going block-by-block, because of lapping.

That doesn't have anything to do with the lapping, I think - that's because they don't use regular block-based mocomp. At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation). There's no nice way to do this block-per-block in the first place, not with OBMC in the mix anyway; if you chop it up into tiles you end up doing a lot of work twice.

ryg: On 12/03/2014 10:26 AM, Charles Bloom {RAD} wrote:

> > g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K.
>
> In Daala they send g and derive K.

Ah, my bad.

> > > For each subband, scale the vector to an L1 length K_s (how do you choose K_s?)
> >
> > You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer, so your choices of radius are discrete and limited.
>
> No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g), but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to the gain scalar quantization.

Ah OK, that makes sense.

> > 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period).
>
> I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets.

Didn't check how they send it. I do find thinking of this in terms of cross-correlation between block and pred a lot simpler than phrasing it in terms of angles.

> It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much.

The reflection itself doesn't do anything meaningful. Your normalized points were on a unit sphere before, and still are after. You're just spinning it around.

It does mean that your coefficient numbering really loses all meaning. After one such reflection, you're already scrambled. Overall energy is still the same (because it's unitary) but the direction is completely different. Since PVQ already assumes that the directions are equiprobable (well, more or less, since the PVQ doesn't actually uniformly cover the sphere), they don't care.

> It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

Well, it's just a different goal for the predictors. Regular motion search tries to minimize SAD or similar, as do the H.264 spatial predictors. For this kind of scheme you don't care about differences at all, instead you want to maximize the normalized correlation coeff between image and reference. (You want texture matches, not pixel matches.)

-Fabian

cbloom: The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1, N (the number of slots that could hold that one value) is very large, so sending the index of that one slot is a lot of bits.

At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

ryg: On 12/3/2014 1:04 PM, Charles Bloom {RAD} wrote:

> The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1, N (the number of slots that could hold that one value) is very large, so sending the index of that one slot is a lot of bits.

Yeah, the decision to send a subband *at all* means you have to code gain, theta and your AC index. For N=16 that's gonna be hard to get below 8 bits even for trivial signals. At which point you get a big jump in the RD curve, which is bad.

Terriberry has a few slides that explain how they're doing inter-band activity masking currently:

https://people.xiph.org/~tterribe/daala/pvq201404.pdf

The example image is kind of terrible though. The "rose" dress (you'll see what I mean) is definitely better in the AM variant, but the rest is hard to tell for me unless I zoom in, which is cheating.

> At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

This is only really interesting at low K, where the PVQ codebook is relatively small. So, er, let's just throw this one in: suppose you're actually sending codebook indices. You just have a rate allocation function that tells you how many bits to send, independent of how big the codebook actually is.

If you truly believe that preserving narrowband energy is more important than getting the direction right, then getting a random vector with the right energy envelope is better than nothing. Say K=1, Ndim=16. You have N=32 codewords, so a codebook index stored directly is 5 bits. Rate function says "you get 0 bits". So you don't send an index at all, and the decoder just takes codeword 0. Or rate function says "you get 2 bits", so you send two bits of the codebook index and take the rest as zero.

This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).

For any given block (subband, whatever), this doesn't help you at all. You either guess right or you guess wrong. But statistically, suppose you shaved 2 bits off the codebook IDs for 1000 blocks. Then you'd expect about 250 of these blocks to reconstruct the right ACs. For the rest, you reconstructed garbage ACs, but it's garbage with the right energy levels at least! :)

No clue if this is actually a good idea at all. It definitely allows you to remove a lot of potholes from the RD curve.

-Fabian

ryg: On 12/3/2014 1:48 PM, Fabian Giesen wrote:

> [..] This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id).

Now this is all assuming you either get the right code or you get garbage, and living with whichever one it is.

You can also go in the other direction and try to get the direction at least mostly right. You can try to determine an ordering of the codebook so that distortion more or less smoothly goes down as you add extra bits. (First bit tells you which hemisphere, that kind of thing.) That way, if you get 4 bits out of 5, it's not a 50:50 chance between the right vector and some random other vector, it's either the right vector or another vector that's "close". (Really with K=1 and high dim it's always gonna be garbage, though, because you just don't have any other vector in the codebook that's even close; this is more interesting at K=2 and up.)

This makes the per-block randomization (you want that to avoid systematic bias) harder, though. One approach that would work is to do a Householder reflection with a random vector (again hashed from position or similar).

All that said, I don't believe in this at all. It's "solving" a problem by "reducing" it to a more difficult unsolved problem (in this case, "I want a VQ codebook that's close to optimal for embedded coding"). Of course, even if you do a bad job here, it's still not gonna be worse than the direct "random permutation" stuff. But I doubt it's gonna be appreciably better either, and it's definitely more complex.

-Fabian

cbloom: At 02:17 PM 12/3/2014, Fabian Giesen wrote:

> That way, if you get 4 bits out of 5, it's not a 50:50 chance between the right vector and some random other vector, it's either the right vector or another vector that's "close".

Yes, this is the type of scheme I imagine. Sort of like a wavelet significant-bit thing. As you send fewer bits, the location gets coarser.

The codebook for K=1 is pretty obvious. You're just sending a location; you want the top bits to grossly classify the AC and the bottom bits to distinguish neighbors (H neighbors for H-type ACs, and V neighbors for V-type ACs).

For K=2 and up it's more complex. You could just train them and store them (major over-train risk) up to maybe K=3, but then you have to switch to an algorithmic method.

Really the only missing piece for me is how you get the # of bits used to specify the locations. It takes too many bits to actually send it, so it has to be implicit from some other factors like the block Q and K, and I'm not sure how to get that.



Molecular Musings

molecularmusings

by Stefan Reinalter at December 16, 2014 04:50 PM

Did you know that you can now also follow us on Facebook?
Always stay up-to-date with blog posts and the latest news!

If you like our blog and want to express your support for what we do, hit the Like-button on our Facebook page. We appreciate it!


Filed under: Uncategorized

molecularmusings

by Stefan Reinalter at December 16, 2014 03:51 PM

In the previous part of this series, I’ve talked a bit about how to design the stateless rendering API, but left out a few details. This time, I’m going to cover those details as well as some questions that came up in the comments in the meantime, and even show parts of the current implementation.

Command buckets

In the first part of the series, I introduced the idea that not all layers need to store keys of the same size, e.g. a layer for rendering objects into a shadow map might only need a 16-bit key, while the G-Buffer layer requires 64-bit keys.

In Molecule parlance, the thing responsible for storing draw calls (and their associated key and data) is called a CommandBucket, which is a light-weight class template taking the key-type as a template parameter:

template <typename T>
class CommandBucket
{
  typedef T Key;
  ...
};

A renderer would create all the CommandBuckets it needs for rendering the scene, and e.g. store them as members, populating them with draw calls in its Render() method. Or you could make creation & destruction of the CommandBuckets completely data-driven and configurable.

But what does a CommandBucket need to store, and how do we store it?

  • N keys that are used for sorting the N draw calls
  • Data for N draw calls

Note that the amount and type of data stored for each draw call heavily depends on the type of draw call. Therefore, all data needed by a draw call is stored in a separate memory region, and the CommandBucket only stores a pointer to that region:

template <typename T>
class CommandBucket
{
  typedef T Key;
  ...

private:
  Key* m_keys;
  void** m_data;
};

Note that the keys and their data are stored in two separate arrays. The reason for that is that certain sorting algorithms don’t swap during the sort operation but rather populate a separate array with indices to the sorted entries, so they have to touch less data.
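A minimal sketch of what such an index-based sort could look like (my illustration, not the actual Sort() implementation; 64-bit keys are assumed for simplicity):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sketch only (not the actual Sort() implementation): sort an array of indices
// by key instead of swapping the keys and data pointers themselves. The
// submission loop can then walk m_keys/m_data through the sorted indices.
std::vector<uint32_t> SortIndicesByKey(const uint64_t* keys, size_t count)
{
  std::vector<uint32_t> indices(count);
  std::iota(indices.begin(), indices.end(), 0u);

  std::sort(indices.begin(), indices.end(),
    [keys](uint32_t lhs, uint32_t rhs)
    {
      return keys[lhs] < keys[rhs];
    });

  return indices;
}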

When creating a CommandBucket, it takes care of allocating the arrays of keys and data pointers, and also stores all the render targets as well as the view matrix, projection matrix and viewport to be used when submitting the commands stored in that bucket. The rationale behind this is that it is very likely that you render into a certain layer using the same camera & viewport, so it makes no sense to specify that information in each and every draw call. Furthermore, each CommandBucket can only hold a certain maximum number of draw calls, which is specified when creating the bucket, as in the following example:

CommandBucket<uint64_t> gBufferBucket(2048, rt1, rt2, rt3, rt4, dst, viewMatrix, projMatrix);

Of course, the view and projection matrix would most likely be provided by a camera object, or some kind of camera system – but that is an entirely different topic.

Commands

Now that we have buckets for storing draw calls, how do we add commands to a bucket? And what exactly is a command?

A command is a self-contained piece of information that is understood by the render backend, and is stored in a command bucket. A command could identify any kind of draw call (non-indexed, indexed, instanced, …), or any other operation such as copying data into a constant buffer.

Each command is a simple POD that holds all the data needed by the backend in order to carry out the operation associated with a certain command. The following three structs are all examples of simple commands:

namespace commands
{
  struct Draw
  {
    uint32_t vertexCount;
    uint32_t startVertex;

    VertexLayoutHandle vertexLayoutHandle;
    VertexBufferHandle vertexBuffer;
    IndexBufferHandle indexBuffer;
  };

  struct DrawIndexed
  {
    uint32_t indexCount;
    uint32_t startIndex;
    uint32_t baseVertex;

    VertexLayoutHandle vertexLayoutHandle;
    VertexBufferHandle vertexBuffer;
    IndexBufferHandle indexBuffer;
  };

  struct CopyConstantBufferData
  {
    ConstantBufferHandle constantBuffer;
    void* data;
    uint32_t size;
  };
}

Because each command is a separate POD, putting them into a command bucket becomes simple. We can add a method that takes a key, makes space for storing the command, stores the pointer to the data in the internal array, and hands the POD instance to the user:

template <typename U>
U* CommandBucket::AddCommand(Key key)
{
  U* data = AllocateCommand<U>();

  // store key and pointer to the data
  AddKey(key);
  AddData(data);

  return data;
}

This is still really simple, but there are also a few bits we haven’t talked about yet, such as adding synchronization when accessing the arrays, and how to allocate memory for the command. We will revisit this topic later, because for now there are more important things we need to talk about first.

For now, assume that we have a command bucket which we populate in the following manner:

for (size_t i=0; i < meshComponents.size(); ++i)
{
  MeshComponent* mesh = &meshComponents[i];

  commands::DrawIndexed* dc = gBufferBucket.AddCommand<commands::DrawIndexed>(GenerateKey(mesh->aabb, mesh->material));
  dc->vertexLayoutHandle = mesh->vertexLayout;
  dc->vertexBuffer = mesh->vertexBuffer;
  dc->indexBuffer = mesh->indexBuffer;
  dc->indexCount = mesh->indexCount;
  dc->startIndex = 0u;
  dc->baseVertex = 0u;
}

Compared to the two alternatives presented in the last post, note that there is no need for calling another method after a draw call has been created and inserted into the bucket. After calling AddCommand(), the command completely belongs to you, and you simply fill all its members. That’s it. All store operations would write directly into one contiguous chunk of memory, without any additional copy operations – but more on that later.

The command bucket responsible for draw calls contributing to the G-Buffer now holds an indexed draw call for each mesh component. After all buckets have been populated, we can sort them by their keys:

gBufferBucket.Sort();
lightingBucket.Sort();
deferredBucket.Sort();
postProcessingBucket.Sort();
hudBucket.Sort();

For sorting the commands in the bucket, we can use whatever sorting algorithm we want. The one thing to note here is that each CommandBucket::Sort() can be run on a different thread, sorting all buckets in parallel.
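As an illustration of that last point, here is a minimal sketch of sorting several buckets in parallel (my example, not part of the original implementation; std::async is used for brevity where a real engine would use its own task scheduler):

#include <future>

// Sketch only (not from the original post): each CommandBucket::Sort() call is
// independent, so the buckets can be sorted on worker threads. std::async is
// used here for brevity; a real engine would hand the work to its own task
// scheduler instead.
template <typename BucketA, typename BucketB, typename BucketC, typename BucketD>
void SortBucketsInParallel(BucketA& gBuffer, BucketB& lighting, BucketC& postProcessing, BucketD& hud)
{
  auto a = std::async(std::launch::async, [&] { gBuffer.Sort(); });
  auto b = std::async(std::launch::async, [&] { lighting.Sort(); });
  auto c = std::async(std::launch::async, [&] { postProcessing.Sort(); });

  hud.Sort();   // keep the calling thread busy with one of the buckets

  a.wait();
  b.wait();
  c.wait();
}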

After all buckets have been sorted, we can submit them to the render backend:

gBufferBucket.Submit();
lightingBucket.Submit();
deferredBucket.Submit();
postProcessingBucket.Submit();
hudBucket.Submit();

The submission process has to be done from one thread because it constantly talks to the graphics API (D3D, OGL), submitting work to the GPU. It does not matter whether it is the main thread or a dedicated rendering thread, though.

Submission process

But how do we submit the commands to the graphics API? All we have is a key and a pointer to the associated data. This is clearly not enough, so we need some kind of additional identifier for each command.

One way of implementing this would be to add an identifier (e.g. an enum value) to each command, store it alongside the key and data, and then implement the Submit() method similar to the following piece of code:

void Submit(void)
{
  SetViewMatrix();
  SetProjectionMatrix();
  SetRenderTargets();

  for (unsigned int i=0; i < commandCount; ++i)
  {
    Key key = m_keys[i];
    void* data = m_data[i];
    uint16_t id = m_ids[i];

    // decode the key, and set shaders, textures, constants, etc. if the material has changed.
    DecodeKey();

    switch (id)
    {
      case commands::Draw::ID:
        // extract data for a Draw command, and call the backend
        break;

      case commands::DrawIndexed::ID:
        // extract data for a DrawIndexed command, and call the backend
        break;

      ...;
    }
  }
}

This would be a possible solution; however, I would not recommend it. Why?

First, we kind of do the same thing twice. We identify the command once when storing it into the bucket (e.g. by storing U::ID into our array of IDs, m_ids), and then identify it again in the huge switch statement.

Second, the hardcoded switch statement makes it hard and tedious to add new commands, and impossible to add user-defined commands if we don’t have access to the source code.

There is a better and simpler solution: function pointers.

Backend dispatch

Instead of storing the ID of a command in the bucket, we can directly store a pointer to a function that knows how to deal with a certain command, and forwards it to the render backend. This is what is known as the Backend Dispatch in Molecule.

The backend dispatch is a namespace that consists of simple forwarding functions only:

namespace backendDispatch
{
  void Draw(const void* data)
  {
    const commands::Draw* realData = union_cast<const commands::Draw*>(data);
    backend::Draw(realData->vertexCount, realData->startVertex);
  }

  void DrawIndexed(const void* data)
  {
    const commands::DrawIndexed* realData = union_cast<const commands::DrawIndexed*>(data);
    backend::DrawIndexed(realData->indexCount, realData->startIndex, realData->baseVertex);
  }

  void CopyConstantBufferData(const void* data)
  {
    const commands::CopyConstantBufferData* realData = union_cast<const commands::CopyConstantBufferData*>(data);
    backend::CopyConstantBufferData(realData->constantBuffer, realData->data, realData->size);
  }
}

Each function in the backend dispatch has the same signature, hence we can use a typedef to store a pointer to any of those functions:

typedef void (*BackendDispatchFunction)(const void*);

The functions contained in the backend namespace are still the only ones that talk directly to the graphics API, e.g. by using the D3D device.
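The union_cast helper used in these dispatch functions comes from an earlier post in this series; as a rough sketch of what such a helper might look like (an assumption on my part, not necessarily the exact Molecule implementation):

// Rough sketch of a union_cast-style helper (an assumption about its shape;
// the real helper is defined in an earlier post): it reinterprets one pointer
// type as another through a union, which is how the dispatch functions
// recover the concrete command type from the type-erased void*.
template <typename TO, typename FROM>
TO union_cast(FROM from)
{
  union
  {
    FROM asFrom;
    TO asTo;
  } u;

  u.asFrom = from;
  return u.asTo;
}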

So let’s go back and revisit the CommandBucket and its AddCommand() method. In addition to the command, we now also need to store a pointer to the dispatch function. And there are two more things we haven’t talked about yet that need to be stored as well:

The first is a pointer to any other command that needs to be submitted with this command and has the same key. If we store a pointer to another command we build an intrusive linked list that allows us to handle draw calls and commands that always need to be submitted in a certain order, no matter what key was assigned to them. This came up more than once as a question in the comments, and is needed when submitting draw calls where we e.g. first need to copy data to a constant buffer, and then submit the draw call. The intrusive linked list allows us to chain any number of commands.

The second is that certain commands need auxiliary memory to store intermediate data that is needed when submitting the draw call to the API at a later point in time. The perfect example for this is updating a constant buffer with a few bytes of data, such as lighting information. These bytes are tucked away in the auxiliary memory, and copied from there into the constant buffer when the command is submitted.

Command packets

Because we no longer just store single commands in the bucket, we introduce the concept of command packets. A bucket now stores command packets, and each packet holds the following data:

  • void* : a pointer to the next command packet (if any)
  • BackendDispatchFunction : a pointer to the function responsible for dispatching the call to the backend
  • T : the actual command
  • char[] : auxiliary memory needed by the command (optional)

Whenever the user wants to add a command of type T to the bucket, we need to make space for the other things as well. For that, I simply allocate raw memory which is large enough to hold all the data using appropriate sizeof() operators, and cast the individual parts to their desired types. In order for that to work, a few static_asserts ensure that all commands are POD structs.

Finally, a helper namespace takes care of doing all the offset calculations and casting:

typedef void* CommandPacket;

namespace commandPacket
{
  static const size_t OFFSET_NEXT_COMMAND_PACKET = 0u;
  static const size_t OFFSET_BACKEND_DISPATCH_FUNCTION = OFFSET_NEXT_COMMAND_PACKET + sizeof(CommandPacket);
  static const size_t OFFSET_COMMAND = OFFSET_BACKEND_DISPATCH_FUNCTION + sizeof(BackendDispatchFunction);

  template <typename T>
  CommandPacket Create(size_t auxMemorySize)
  {
    return ::operator new(GetSize<T>(auxMemorySize));
  }

  template <typename T>
  size_t GetSize(size_t auxMemorySize)
  {
    return OFFSET_COMMAND + sizeof(T) + auxMemorySize;
  };

  CommandPacket* GetNextCommandPacket(CommandPacket packet)
  {
    return union_cast<CommandPacket*>(reinterpret_cast<char*>(packet) + OFFSET_NEXT_COMMAND_PACKET);
  }

  template <typename T>
  CommandPacket* GetNextCommandPacket(T* command)
  {
    return union_cast<CommandPacket*>(reinterpret_cast<char*>(command) - OFFSET_COMMAND + OFFSET_NEXT_COMMAND_PACKET);
  }

  BackendDispatchFunction* GetBackendDispatchFunction(CommandPacket packet)
  {
    return union_cast<BackendDispatchFunction*>(reinterpret_cast<char*>(packet) + OFFSET_BACKEND_DISPATCH_FUNCTION);
  }

  template <typename T>
  T* GetCommand(CommandPacket packet)
  {
    return union_cast<T*>(reinterpret_cast<char*>(packet) + OFFSET_COMMAND);
  }

  template <typename T>
  char* GetAuxiliaryMemory(T* command)
  {
    return reinterpret_cast<char*>(command) + sizeof(T);
  }

  void StoreNextCommandPacket(CommandPacket packet, CommandPacket nextPacket)
  {
    *commandPacket::GetNextCommandPacket(packet) = nextPacket;
  }

  template <typename T>
  void StoreNextCommandPacket(T* command, CommandPacket nextPacket)
  {
    *commandPacket::GetNextCommandPacket<T>(command) = nextPacket;
  }

  void StoreBackendDispatchFunction(CommandPacket packet, BackendDispatchFunction dispatchFunction)
  {
    *commandPacket::GetBackendDispatchFunction(packet) = dispatchFunction;
  }

  const CommandPacket LoadNextCommandPacket(const CommandPacket packet)
  {
    return *GetNextCommandPacket(packet);
  }

  const BackendDispatchFunction LoadBackendDispatchFunction(const CommandPacket packet)
  {
    return *GetBackendDispatchFunction(packet);
  }

  const void* LoadCommand(const CommandPacket packet)
  {
    return reinterpret_cast<char*>(packet) + OFFSET_COMMAND;
  }
};

Note that Create() uses the global operator new for allocating raw memory. In a real implementation, we would use our own linear allocator that ensures that all commands are stored in memory contiguously, which is much more cache-friendly when we need to iterate through the commands in the Submit() method.
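As a rough idea of what such a linear allocator could look like, here is a minimal single-threaded sketch (my illustration, not the Molecule implementation; alignment and error handling are omitted):

#include <cstddef>
#include <cstdlib>

// Sketch of a linear allocator (not the Molecule implementation): command
// packets are carved sequentially out of one big block, and the whole block
// is reset once per frame after Submit() has run. No per-packet frees, no
// thread safety, no alignment handling - just the core idea.
class LinearAllocator
{
public:
  explicit LinearAllocator(size_t capacity)
    : m_start(static_cast<char*>(::malloc(capacity)))
    , m_offset(0)
    , m_capacity(capacity)
  {
  }

  ~LinearAllocator(void)
  {
    ::free(m_start);
  }

  void* Allocate(size_t size)
  {
    if (m_offset + size > m_capacity)
      return nullptr;   // out of space for this frame

    void* ptr = m_start + m_offset;
    m_offset += size;
    return ptr;
  }

  void Reset(void)
  {
    m_offset = 0;   // throw everything away at once, e.g. at the end of a frame
  }

private:
  char* m_start;
  size_t m_offset;
  size_t m_capacity;
};

// commandPacket::Create() could then allocate from such an allocator instead
// of calling the global operator new:
//   return g_frameAllocator.Allocate(GetSize<T>(auxMemorySize));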

Revisiting the command bucket

With command packets in place, the actual code for adding commands to a bucket becomes the following:

template <typename U>
U* AddCommand(Key key, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  // store key and pointer to the data
  {
    // TODO: add some kind of lock or atomic operation here
    const unsigned int current = m_current++;
    m_keys[current] = key;
    m_packets[current] = packet;
  }

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Once we take care of the TODO marked above, we can also start adding commands from any number of threads. As a first implementation, we can simply add a critical section to make the code work, but obviously there are better solutions, which is something I would like to write about in one of the next posts in the series.
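One possible lock-free take on that TODO (a sketch on my part, not the article's implementation) is to reserve the slot with an atomic increment, since the key and packet arrays are pre-allocated and each thread then writes only to its own slot:

#include <atomic>

// Sketch only (not the article's implementation): with pre-sized key/packet
// arrays, the slot index can be reserved with a single atomic increment; each
// thread then writes into its own slot without further locking. The packet
// allocation itself still needs a thread-safe allocator.
struct SlotReservation
{
  std::atomic<unsigned int> current{0};

  unsigned int Reserve(void)
  {
    return current.fetch_add(1u, std::memory_order_relaxed);
  }
};

// inside AddCommand(), replacing the TODO:
//   const unsigned int slot = m_reservation.Reserve();
//   m_keys[slot] = key;
//   m_packets[slot] = packet;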

Of course, each command now also needs to store a pointer to its backend dispatch function, exemplified here for the Draw command:

struct Draw
{
  static const BackendDispatchFunction DISPATCH_FUNCTION;

  uint32_t vertexCount;
  uint32_t startVertex;

  VertexLayoutHandle vertexLayoutHandle;
  VertexBufferHandle vertexBuffer;
  IndexBufferHandle indexBuffer;
};
static_assert(std::is_pod<Draw>::value == true, "Draw must be a POD.");

const BackendDispatchFunction Draw::DISPATCH_FUNCTION = &backendDispatch::Draw;

Custom commands

As stated earlier, using function pointers this way allows us to support user-defined commands as well. For example, you can make up your own commands by defining a POD, implementing a completely custom dispatch function for it, and adding that command to any bucket, or even chaining it to other commands.
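As a quick illustration (my example, not from the article; the timer-query names are made up), such a user-defined command only needs a POD, a dispatch function with the usual signature, and a DISPATCH_FUNCTION constant:

// Hypothetical user-defined command (the timer-query names are made up, and
// this builds on the BackendDispatchFunction typedef and union_cast shown
// above): a POD, a dispatch function with the usual signature, and a
// DISPATCH_FUNCTION constant are all that is needed.
namespace myCommands
{
  struct BeginGpuTimer
  {
    static const BackendDispatchFunction DISPATCH_FUNCTION;

    uint32_t timerId;
  };
  static_assert(std::is_pod<BeginGpuTimer>::value == true, "BeginGpuTimer must be a POD.");
}

namespace myDispatch
{
  void BeginGpuTimer(const void* data)
  {
    const myCommands::BeginGpuTimer* realData = union_cast<const myCommands::BeginGpuTimer*>(data);
    myBackend::BeginGpuTimer(realData->timerId);   // hypothetical backend call
  }
}

const BackendDispatchFunction myCommands::BeginGpuTimer::DISPATCH_FUNCTION = &myDispatch::BeginGpuTimer;

// Usage is the same as for the built-in commands:
//   myCommands::BeginGpuTimer* timer = someBucket.AddCommand<myCommands::BeginGpuTimer>(key, 0u);
//   timer->timerId = 0u;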

Chaining commands

Now that our command packet also stores a pointer to the next command packet, we can append commands to other commands:

template <typename U, typename V>
U* AppendCommand(V* command, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  // append this command to the given one
  commandPacket::StoreNextCommandPacket<V>(command, packet);

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Note that in this situation we don’t need to store a new key/value pair into our array, because each command that is appended to another one needs to have the same key anyway.

The following example shows how commands can be chained together using the new command bucket API:

for (unsigned int i=0; i < directionalLights.size(); ++i)
{
  PerDirectionalLightConstants constants =
  { directionalLights[i].diffuse, directionalLights[i].specular };

  commands::CopyConstantBufferData* copyOperation = lightingBucket.AddCommand<commands::CopyConstantBufferData>(someKey, sizeof(PerDirectionalLightConstants));

  copyOperation->data = commandPacket::GetAuxiliaryMemory(copyOperation);
  copyOperation->constantBuffer = directionalLightsCB;
  memcpy(copyOperation->data, &constants, sizeof(PerDirectionalLightConstants));
  copyOperation->size = sizeof(PerDirectionalLightConstants);

  commands::Draw* dc = lightingBucket.AppendCommand<commands::Draw>(copyOperation, 0u);
  dc->vertexCount = 3u;
  dc->startVertex = 0u;
}

Revisiting the submission process

Of course, with the command packets in place, the Submit() method also needs to be adapted. By using the backend dispatch, we can get rid of the switch statement, and the linked list of commands can be walked using a simple loop:

void Submit(void)
{
  // ... same as before
  for (unsigned int i=0; i < m_current; ++i)
  {
    // ... same as before
    CommandPacket packet = m_packets[i];
    do
    {
      SubmitPacket(packet);
      packet = commandPacket::LoadNextCommandPacket(packet);
    } while (packet != nullptr);
  }
}

void SubmitPacket(const CommandPacket packet)
{
  const BackendDispatchFunction function = commandPacket::LoadBackendDispatchFunction(packet);
  const void* command = commandPacket::LoadCommand(packet);
  function(command);
}

Recap

This post is quite long already, much longer than I anticipated. But still, let us recap which concepts this post introduced:

  • Commands: a self-contained piece of information that is handed to the backend dispatch. Each command represents one simple operation such as an indexed draw call, copying data to a constant buffer, etc. Each command is implemented as a POD struct.
  • Backend dispatch: simple forwarding functions that extract the data from a command and forward it to the graphics backend. Each dispatch function deals with a different command.
  • Command packets: a command packet stores a command, along with additional data such as a pointer to a dispatch function, any auxiliary memory a command might need, and an intrusive linked list for chaining commands.
  • Chaining of commands: commands that need to be submitted in a certain order can be chained together.
  • Command bucket: A command bucket stores command packets along with a key of any size.
  • Multi-threaded rendering: Commands can be added to buckets in parallel from multiple threads. The only two points of synchronization are the memory allocation, and storing the key-value pair into the array of command packets.
  • Multi-threaded sorting: Each command bucket can be sorted independently, in parallel.

Even though this is already part 3 of the series, there are still things we haven’t talked about in detail yet:

  • Memory management: How do we allocate the memory for storing the keys and pointers to packets? How do we efficiently allocate memory for individual command packets in the case of multiple threads adding commands to the same bucket? How can we ensure good cache utilization throughout the whole process of storing and submitting command packets? Can we use one contiguous chunk of memory?
  • Key generation: Which information does a key hold? How do we efficiently build a key?

So for now, our rendering process is stateless and layered/bucketized, but its multi-threaded rendering capabilities can still be greatly improved. Until next time!


Filed under: C++, Graphics

OpenGL

OpenCL and OpenGL support in OS X Mavericks and Yosemite

December 16, 2014 12:43 PM

Learn about the OpenGL and OpenCL versions that are supported by your computer in OS X Mavericks and OS X Yosemite. The lowest supported OpenGL version is now 3.3, with all desktops and laptops from 2011 onwards running OpenGL 4.1. Desktops and laptops from 2012 onwards run OpenCL 1.2, with earlier machines running OpenCL 1.0.