May 18, 2005
Fragment Program Reference
The OpenGL Extensions Guide has a great chapter on the ARB fragment program language, I recommend buying it, but there aren't many good references online. The most useful is the official spec, but it's designed as an exhaustive guide, not a quick reference for programmers. Here's a rundown of the instruction set, and some tips and tricks.
Here's a cut-out-and-keep table of all instructions, based on table X.5 in the spec:
| Instruction | Inputs | Output | Description | Pseudocode | Notes |
|---|---|---|---|---|---|
| ABS | v | v | absolute value | fabs(arg1) | |
| ADD | v,v | v | add | arg1+arg2 | |
| CMP | v,v,v | v | compare | if (arg1<0) arg2 else arg3 | |
| COS | s | ssss | cosine | cos(arg1) | Synthesised using 5 instructions on ATI R3xx |
| DP3 | v,v | ssss | 3-component dot product | (arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z) | |
| DP4 | v,v | ssss | 4-component dot product | (arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z)+(arg1.w*arg2.w) | Emulated with 4 native instructions on ATI R3xx |
| DPH | v,v | ssss | homogeneous dot product | (arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z)+arg2.w | |
| DST | v,v | v | distance vector | Funky, see spec | |
| EX2 | s | ssss | exponential base 2 | pow(2,arg1) | Using negative args on NVidia gives different results to ATI |
| FLR | v | v | floor | floor(arg1) | |
| FRC | v | v | fraction | arg1-(int)arg1 | |
| KIL | v | v | kill fragment | if (arg1<0) return | |
| LG2 | s | ssss | logarithm base 2 | log(arg1) | |
| LIT | v | v | compute light coefficients | Funky, see spec | |
| LRP | v,v,v | v | linear interpolation | (arg2*arg1)+(arg3*(1-arg1)) | Order is lerpValue, end, start |
| MAD | v,v,v | v | multiply and add | (arg2*arg3)+arg4 | Really useful to reduce instruction counts |
| MAX | v,v | v | maximum | if (arg1<arg2) arg2 else arg1 | |
| MIN | v,v | v | minimum | if (arg1>arg2) arg2 else arg1 | |
| MOV | v | v | move | arg1 | |
| MUL | v,v | v | multiply | arg1*arg2 | |
| POW | s,s | ssss | exponentiate | pow(arg1,arg2) | |
| RCP | s | ssss | reciprocal | 1/arg1 | |
| RSQ | s | ssss | reciprocal square root | r1/sqrt(arg1) | |
| SCS | s | ss-- | sine/cosine | result.x=sin(arg1) result.y=cos(arg1) | Synthesised using multiple instructions on ATI R3xx |
| SGE | v,v | v | set on greater than or equal | if (arg1>=arg2) 1.0 else 0.0 | |
| SIN | s | ssss | sine | sin(arg1) | Synthesised using five instructions on ATI R3xx |
| SLT | v,v | v | set on less than | if (arg1<arg2) 1.0 else 0.0 | |
| SUB | v,v | v | subtract | arg1-arg2 | |
| SWZ | v | v | extended swizzle | Funky, see spec | Synthesised using multiple instructions on ATI R3xx |
| TEX | v,u,t | v | texture sample | Texture instructions are almost always the performance bottleneck | |
| TXB | v,u,t | v | texture sample with bias | ||
| TXP | v,u,t | v | texture sample with projection | ||
| XPD | v,v | v | cross product | [(arg1.y*arg2.z-arg1.z*arg2.y),(arg1.z*arg2.x-arg1.x*arg2.z),(arg1.x*arg2.y-arg1.y*arg2.x)] |
Always specify if you're only using some components in an instruction, the compilers aren't smart enough generally to figure out if you only use the .x component of the result later on, and both vendors' hardware has clever tricks they can play executing vector and scalar instructions in parallel.
Try and calculate everything you can using arithmetic instructions rather than doing table lookups from textures. Memory access almost always seems to be the limiting factor on the speed of our fragment programs, you've got a lot of free instruction slots that can be filled performing extra calculations while the hardware's waiting on memory.
ATI cards support simple swizzling, either where you're masking out some components in the result register (ADD foo.xy, bob, jim;) or where you're duplicating a single component across the whole register (ADD foo, bob, jim.x;)
Anything more complicated will be emulated using multiple instructions (ADD foo, bob.zyzy, jim; or ADD foo, bob.xxxy, jim;)
On ATI, using GL_TEXTURE_RECTANGLE_EXT textures in TEX instructions (RECT as the target) will generate hidden instructions to convert the coordinates to the 0 to 1 range, from the input range of 0 to
Posted by petewarden at 04:28 PM | Comments (0)
May 17, 2005
Mac PBuffers
If you ever need to do more image processing than you can with a single fragment program pass, you'll need to render to a texture. The best way to do that on OS X is using a pbuffer, here's some example code for creating and using them.
I've included the code inline below, or you can get a zip of the code from PBufferUtils.zip here
To use it, call PBuffer_Create(sharedContext, width, height, 8, ePBufferFlag_ZBuffer) to create a pbuffer with a depth buffer attached, or 0 as the last argument for no z buffer. The 'sharedContext' is a housekeeping device to let texture IDs be shared between multiple contexts, rather than being only usable within a single context which is the default. If you have a single shared context in your app, this means you can create a texture in one pbuffer's context, whether by uploading pixel data or referencing a pbuffer, and use it in any other pbuffer's context in your app. I go into detail on contexts here
Use PBuffer_Begin() before you start drawing into it, and PBuffer_End() once you're done. When you're ready to use it as a texture, call PBuffer_Use() in the same place you'd normally do a
glEnable(GL_TEXTURE_RECTANGLE_EXT);
glBindTexture(GL_TEXTURE_RECTANGLE_EXT, someTextureID);
and do a
glDisable(GL_TEXTURE_RECTANGLE_EXT);
when you're done with it.
PBuffer_Begin() clears the pbuffer, sets up an orthographic view and resets some common gl state, but you need to be careful to reset gl state manually, a lot of our bugs have been caused by state left hanging around.
You'll get an error if you try to create a pbuffer with different pixel attributes than the shared context, even if the difference doesn't seem relevant. For example, you might need to add kCGLPFANoRecovery to the list of attributes if that's what the shared context has been created with.
Creating and destroying pbuffers is expensive, so we tend to carry them over from frame to frame unless they need to change in size. If size changes are needed, we try to over-allocate and create bigger pbuffers than we need, to avoid frequent reallocations for growing objects.
Destroying a pbuffer immediately after using it as a texture also seems to cause rendering problems, as if the surface is destroyed before the renderer uses it. Waiting a little while, for example until the next time you need to render, seems to prevent this.
Here's the code to go into PBufferUtils.h:
------------
#ifndef INCLUDE_PBUFFER_UTILS_H
#define INCLUDE_PBUFFER_UTILS_H
#include
#include
enum EPBufferFlags
{
ePBufferFlag_ZBuffer=(1<<0),
};
typedef struct PetePBuffer_tag {
CGLPBufferObj pbuffer;
CGLContextObj pbufferContext;
CGLContextObj previousContext;
int width;
int height;
GLuint textureID;
bool needsClearing;
bool needsFlush;
int createdWidth;
int createdHeight;
} PetePBuffer;
PetePBuffer* PBuffer_Create(CGLContextObj sharedContext,int nWidth,int nHeight,int colorDepth, int flags);
void PBuffer_Destroy(PetePBuffer* pbuffer);
// Surround rendering with these to draw into the pbuffer, they handle pushing and popping the old
// context for you
void PBuffer_Begin(PetePBuffer* pbuffer);
void PBuffer_End(PetePBuffer* pbuffer);
// Binds the pbuffer as a texture
void PBuffer_Use(PetePBuffer* pbuffer);
#endif // INCLUDE_PBUFFER_UTILS_H
------
and here's the PBufferUtils.cpp code:
------
#include "PBufferUtils.h"
#include
#include
static void checkCGLErrorImplementation(CGLError error, char* sourceFile, int sourceLine)
{
if (error) {
const char* errStr;
errStr = CGLErrorString(error);
fprintf(stderr, "CGL Error: %s at %s:%d\n", errStr, sourceFile, sourceLine);
assert(false);
}
}
#define checkCGLError(error) checkCGLErrorImplementation(error,__FILE__,__LINE__)
PetePBuffer* PBuffer_Create(CGLContextObj sharedContext,int width,int height,int colorDepth, int flags)
{
PetePBuffer* pbuffer=new PetePBuffer;
pbuffer->pbuffer=NULL;
pbuffer->pbufferContext=NULL;
pbuffer->previousContext=NULL;
pbuffer->width=width;
pbuffer->height=height;
pbuffer->textureID=0;
pbuffer->needsClearing=true;
pbuffer->needsFlush=false;
const int bitsPerPixel = (colorDepth*4);
const bool hasZBuffer = (flags&ePBufferFlag_ZBuffer);
int i = 0;
CGLPixelFormatAttribute pixelFormatAttributes[32];
pixelFormatAttributes[i++] = kCGLPFAAccelerated;
pixelFormatAttributes[i++] = kCGLPFAWindow;
pixelFormatAttributes[i++] = kCGLPFAColorSize;
pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)(bitsPerPixel);
if (colorDepth>8)
pixelFormatAttributes[i++] = kCGLPFAColorFloat;
if (hasZBuffer)
{
pixelFormatAttributes[i++] = kCGLPFADepthSize;
pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)(16);
}
pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)0;
long numPixelFormats = 0;
CGLPixelFormatObj pixelFormat = NULL;
CGLError error = CGLChoosePixelFormat(pixelFormatAttributes, &pixelFormat, &numPixelFormats);
checkCGLError(error);
error = CGLCreateContext(pixelFormat, sharedContext, &pbuffer->pbufferContext);
checkCGLError(error);
CGLDestroyPixelFormat(pixelFormat);
error = CGLCreatePBuffer(pbuffer->width,pbuffer->height,GL_TEXTURE_RECTANGLE_EXT,GL_RGBA,0,&pbuffer->pbuffer);
checkCGLError(error);
return pbuffer;
}
void PBuffer_Destroy(PetePBuffer* pbuffer) {
if (pbuffer!=NULL)
{
CGLDestroyPBuffer(pbuffer->pbuffer);
CGLDestroyContext(pbuffer->pbufferContext);
}
delete pbuffer;
}
void PBuffer_Begin(PetePBuffer* pbuffer) {
// Pete- check to ensure begin() hasn't already been called for this pbuffer
assert(pbuffer->previousContext==NULL);
pbuffer->previousContext=CGLGetCurrentContext();
long screen;
CGLError error=CGLGetVirtualScreen(pbuffer->previousContext,&screen);
assert(!error);
error=CGLSetCurrentContext(pbuffer->pbufferContext);
checkCGLError(error);
error=CGLSetPBuffer(pbuffer->pbufferContext,pbuffer->pbuffer,0,0,screen);
checkCGLError(error);
glClearColor(0.0f,0.0f,0.0f,0.0f);
if (pbuffer->needsClearing)
glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);
glDisable(GL_DEPTH_TEST);
glDisable(GL_FRAGMENT_PROGRAM_ARB);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glColorMask(GL_TRUE,GL_TRUE,GL_TRUE,GL_TRUE);
glColor4f(1.0f,1.0f,1.0f,1.0f);
glOrtho(0, pbuffer->width, 0, pbuffer->height, -1.0, 1.0);
glActiveTexture(GL_TEXTURE0);
}
void PBuffer_End(PetePBuffer* pbuffer) {
glFlush();
pbuffer->needsFlush=true;
assert(pbuffer->previousContext!=NULL);
CGLError error=CGLSetCurrentContext(pbuffer->previousContext);
checkCGLError(error);
pbuffer->previousContext=NULL;
}
void PBuffer_Use(PetePBuffer* pbuffer) {
if (pbuffer->needsFlush)
{
CGLContextObj currentContext=CGLGetCurrentContext();
CGLError error=CGLSetCurrentContext(pbuffer->pbufferContext);
checkCGLError(error);
glFlush();
error=CGLSetCurrentContext(currentContext);
checkCGLError(error);
pbuffer->needsFlush=false;
}
if (pbuffer->textureID==0)
{
CGLContextObj currentContext=CGLGetCurrentContext();
glGenTextures(1,&pbuffer->textureID);
glBindTexture(GL_TEXTURE_RECTANGLE_EXT,pbuffer->textureID);
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
CGLError error=CGLTexImagePBuffer(currentContext,pbuffer->pbuffer,GL_FRONT_LEFT);
checkCGLError(error);
}
else
{
glBindTexture(GL_TEXTURE_RECTANGLE_EXT,pbuffer->textureID);
}
glEnable(GL_TEXTURE_RECTANGLE_EXT);
}
Posted by petewarden at 05:25 PM | Comments (0)
Emulating Bilinear Filtering
Most graphics cards that support float textures can't bilinear filter them. We can use a fragment program to do the filtering for us, but it's tricky to get right. Here's an example of how to do it:
!!ARBfp1.0
ATTRIB inputCoords = fragment.texcoord[0];
TEMP sourceCoords;
TEMP sourceCoordsTopLeft, sourceCoordsTopRight, sourceCoordsBottomLeft, sourceCoordsBottomRight, fraction;
TEMP sourceTopLeft,sourceTopRight,sourceBottomLeft,sourceBottomRight;
SUB sourceCoords, inputCoords, {0.5, 0.5, 0.0, 0.0};
FRC fraction.x, sourceCoords.x;
FRC fraction.y, sourceCoords.y;
SUB sourceCoords, sourceCoords, fraction;
ADD sourceCoords, sourceCoords, {0.5, 0.5, 0.0, 0.0};
ADD sourceCoordsTopLeft, sourceCoords, {1,1,0,0};
ADD sourceCoordsTopRight, sourceCoords, {0,1,0,0};
ADD sourceCoordsBottomLeft, sourceCoords, {1,0,0,0};
ADD sourceCoordsBottomRight, sourceCoords, {0,0,0,0};
TEX sourceTopLeft, sourceCoordsTopLeft, texture[0], RECT;
TEX sourceTopRight, sourceCoordsTopRight, texture[0], RECT;
TEX sourceBottomLeft, sourceCoordsBottomLeft, texture[0], RECT;
TEX sourceBottomRight, sourceCoordsBottomRight, texture[0], RECT;
LRP sourceTopLeft, fraction.x, sourceTopLeft, sourceTopRight;
LRP sourceBottomLeft, fraction.x, sourceBottomLeft, sourceBottomRight;
LRP result.color, fraction.y, sourceTopLeft, sourceBottomLeft;
END
Posted by petewarden at 04:32 PM | Comments (0)
May 10, 2005
GPU Optical Flow
Optical flow analysis is heavily used by video and film processing apps, to retime images smoothly, apply motion blur as a post-process, and guide the application of other processes. It's traditionally a slow operation, I wanted to see if a GPU version was possible.
Optical flow analyses frames in a sequence, and trys to map how each pixel is moving. All techniques assume that there's some meaningful movement in the sequence to extract; sequences of unrelated images or random noise won't give useful results.
The results you want are easy to visualize, when we see a moving image, we can tell that there are moving objects there, and roughly how fast and in what direction they're moving. The useful way to represent that is as a vector field, an image where each pixel has a velocity instead of a color, showing what direction the pixel in the underlying image is moving.
Determining Optical Flow by Horn & Shunck lays out the classic method for figuring this out. I took it as a starting point for my implementation, since it seemed pretty simple and robust compared to more recent algorithms.
It took me some time to understand the basic ideas behind their algorithm, but once I did, they're remarkably elegant and straightforward. I'll try an informal explanation here to help anyone else whose eyes glaze over when confronted with a journal paper.
Start with the 1D case. You have a 1D wave that looks like this at time 0:

Here's the same wave at frame 1:

Visually, you can see that the line has moved to the left. If you subtract frame 0 from frame 1, you get this line:

This is just the difference between the first and second frames. The crucial step is to multiply this against the slope of the original wave, to get an estimate of the movement. This gives you this wave:

Why does this work? One way that I found to visualize it is to imagine you've got an infinite straight line, that's constantly sloping and sliding to the right over time. For any point that's on the line at the start of time, the line directly below it drops away at a constant speed:

The difference between two frames is like the distance from the point that was on the line, to where the line is now below it. That distance is determined by three variables; the slope of the line, the time that the line's been sliding, and how fast it's been sliding. Since we know the slope, and how long there's been between frames, we can divide those out and be left with the line's speed of movement.
This simple example should also highlight a couple of limitations of optical flow analysis:
- You need a slope or gradient in the image for the analysis to get a grip on. Flat areas of color in the center of objects can't reveal any movement.
- Movements need to be small, compared to the size of the object. The difference is applied to the gradient at each pixel, if that gradient is from a different object than in the first frame, it will give nonsense results. You need a slope that is roughly the same over the distance that the object moves between frames to get accurate results.
To extend the algorithm to two dimensions, do a difference on the two frames, and then multiply dy and dx gradient vectors for the slope of each pixel by the difference to get the optical flow estimate.
Here's some pseudo-code for the basic process:
(1) CurrentDifference = (Frame1[x][y]-Frame0[x][y]);
(2) GradientX = ((Frame1[x+1][y]-Frame1[x-1][y]) + (Frame0[x+1][y]-Frame0[x-1][y])) / 2.0;
(3) GradientY = ((Frame1[x][y+1]-Frame1[x][y-1]) + (Frame0[x][y+1]-Frame0[x][y-1])) / 2.0;
(4) GradientMagnitude = sqrt((GradientX*GradientX)+(GradientY*GradientY));
(5) VelocityX = CurrentDifference*(GradientX/GradientMagnitude);
(6) VelocityY = CurrentDifference*(GradientY/GradientMagnitude);
This is the heart of the technique. To improve the accuracy of the results, Horn and Schunck do this repeatedly, in order to eliminate spurious errors introduced by noise and to propagate movement across areas of flat color.
They use a couple of ways to do this. The algorithm likes long smooth slopes, and gets confused by short steep ones, and short steep ones are more likely to be the result of high-frequency noise than genuine objects in the sequence. To damp down the effect, a term usually called lambda is added to line 4:
(4) GradientMagnitude = sqrt((GradientX*GradientX)+(GradientY*GradientY)+Lambda);
Since the gradient is divided by this magnitude result, large values of lambda mean that steep gradients give lower velocities than gentle slopes. In practice, it's a noise filter, low values give you a very sensitive but noisy result, high values filter out the noise, but miss small objects.
The other method tries to smooth out the result, removing isolated small areas that are moving in a different direction to their neighbors, and also filling in areas with no gradients with movement data from nearby. This is where the iteration comes in; the vector field result is run through a blur kernel, and then the optical flow calculation is re-run, feeding the previous flow vector back in. This is how it looks in the code:
...
(3.1) PreviousVector = PreviousResultBlurred[x][y];
(3.2) PreviousDotGradient = (PreviousVector.x*GradientX)+(PreviousVector.y*GradientY);
...
(5) VelocityX = PreviousVector.x + (CurrentDifference+PreviousDotGradient)*(GradientX/GradientMagnitude);
(6) VelocityY = PreviousVector.y + (CurrentDifference+PreviousDotGradient)*(GradientY/GradientMagnitude);
Visually, the dot product is high when the vector is going in the same direction as the previous result (which, being blurred, represents the average direction in the local neighborhood) and is negative when the vector is going against the local flow. Iterating over this rewards vectors that are in step with their neighbors, and minimises the vectors that are marching to a different drummer:

To do the blurring, I just run a small gaussian kernel over the vector field between each iteration.
Once I understood the algorithm, it was not too hard to convert it over to a multi-pass sequence of fragment programs. In my implementation I take the two frames I'm interested in, and encode the luminance of each pixel into a 16 bit fixed point value, and store this value in two 8 bit channels in a 32 bit pbuffer, packing both images into a single surface.
Inside my iteration loop, I then run a fragment program that contains the gradient calculation and flow estimation algorithm explained in lines (1) to (6), with the core of it looking like this:
# left, right, bottom, top and centre have the luminance of the neighboring pixels of the first frame
# in the x component, and the luminance from the second in z
SUB left, right, left;
DP4 differential.x, left, {0.5, 0.0, 0.5, 0.0};
SUB bottom, top, bottom;
DP4 differential.y, bottom, {0.5, 0.0, 0.5, 0.0};
SUB differential.z, center.z, center.x;
# velocity.x and .y are the previous velocity for the pixel, both 0 first pass through
MOV velocity.z, 1.0;
DP3 flowError.x, differential, velocity;
MOV differential.z, lambda.x;
DP3 differentialMagnitude.x, differential, differential;
RCP differentialMagnitude.x, differentialMagnitude.x;
MUL differentialScale.x, flowError.x, differentialMagnitude.x;
MUL differentialScale.x, differentialScale.x, -1.0;
MAD velocity, differential, differentialScale.x, velocity;
I run two passes of a seperable gaussian blur convolution over the resulting vector field, one horizontal and one vertical, and then iterate over again. The lambda value I have between 0 and 1, depending on the image, and I normally run between 2 and 10 iterations, again depending on my needs and the image quality.
Posted by petewarden at 06:02 PM | Comments (0)
Random Numbers in Fragment Programs
It's hard to write a fragment program that will calculate pseudo-random numbers. The usual random number algorithms need two things that we don't have on our GPU's:
State
A seed value is usually stored, and updated every time a random number is generated. There's no way to pass this between pixels, with a fragment program, there's no global storage that can be updated to hold that kind of state. This means every pixel has to calculate its value independently.
Logic Operations
Almost all generators rely on bitwise logic operations such as shifts and exclusive-or's in their implementation. These operations can't be done in the ARBFP instruction set, (though I hear there are some NVidia specific extensions that allow a limited number).
Alternatives
The best way around these problems is to off-load as much of the work as we can to the CPU. We can generate a small texture from a table of random values, and do texture reads to get random values. Since our reads must be based on values that are very non-random, such as linearly interpolated texture coordinates, we may need to combine together several levels of indirection to get random looking results over a large area, if we have a small table of random values.
Ken Perlin's classic approach to generating random values at grid points using a 256 entry table is a good example to look at. However, on ATI cards the multiple texture indirections needed can be a problem, since there's a limit of 5 levels in any program.
Do it all on the CPU
Another alternative is to create a larger 2D texture full of random values. This is perfect if the random values do no need to change from frame to frame, but if it does need to animate, the CPU cost of generating and uploading the texture may be a problem. Depending on what it's being used for, the texture may be smaller than the area it needs to cover, and use tiling. This will result in obvious patterns in many cases however, something which is a constant danger with most of these methods. Careful design and tweaking are needed to minimize such ugliness.
Do it all on the GPU
I did manage to create a fragment program that calculated usable random numbers from scratch, using only floating point. I enclose it below, though it failed in its main purpose. It works as intended on NVidia GPU's, because they have 32 bit floating point internal precision, but the 24 bit precision of ATI chips causes visible problems. Since the main goal of this was to avoid texture indirections on ATI cards, it's not very useful to me, but was fun to write, so I thought I should share:
!!ARBfp1.0
# Based on an algorithm described by Francois Grieu, sci.crypt, 5th February 2004
ATTRIB tex0 = fragment.texcoord[0];
PARAM outputMult = program.local[0];
PARAM bounds = program.local[1];
PARAM seed = program.local[2];
PARAM coordsOffset = { -100, 100, 0, 0 };
PARAM cMult = 0.0001002707309736288;
PARAM aSubtract = 0.2727272727272727;
PARAM coordMult0 = { 0.67676, 0.000058758, 0, 0 };
PARAM coordMult1 = { 0.0000696596, 0.797976, 0, 0 };
PARAM coordMult2 = { 0.587976, 0.0000233443, 0, 0 };
TEMP tableCoord, a, b, c, floorA, seedCoords;
ADD seedCoords, tex0, coordsOffset;
# gFastRngA = (((currentX*multX)/(currentY*multY))+
MUL tableCoord, seedCoords, coordMult0;
RCP tableCoord.y, tableCoord.y;
MUL a.x, tableCoord.x, tableCoord.y;
# (((height-currentY)*multX2)/((width-currentX)*multY2))+
SUB tableCoord, bounds, seedCoords;
MUL tableCoord, tableCoord, coordMult1;
RCP tableCoord.x, tableCoord.x;
MAD a.x, tableCoord.x, tableCoord.y, a.x;
# (((height-currentX)*multX3)/((width-currentY)*multY3)));
SUB tableCoord.x, bounds.y, seedCoords.x;
SUB tableCoord.y, bounds.x, seedCoords.y;
MUL tableCoord, tableCoord, coordMult2;
RCP tableCoord.y, tableCoord.y;
MAD a.x, tableCoord.x, tableCoord.y, a.x;
# gFastRngA = fmod(gFastRngA,1);
FRC a.x, a.x;
ADD a.x, a.x, seed;
MOV c.x, 0;
MOV b.x, 0;
# (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA))
FRC floorA.x, a.x;
SUB floorA.x, a.x, floorA.x;
SUB floorA.x, aSubtract.x, floorA.x;
ADD floorA.x, floorA.x, a.x;
MAD a.x, c.x, cMult.x, floorA.x;
# (gFastRngB += (gFastRngA *= gFastRngA))
MUL a.x, a.x, a.x;
ADD b.x, b.x, a.x;
# (gFastRngC += (gFastRngB -= floor(gFastRngB)))
FRC b.x, b.x;
ADD c.x, c.x, b.x;
# (gFastRngC -= floor(gFastRngC))
FRC c.x, c.x;
# (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA))
FRC floorA.x, a.x;
SUB floorA.x, a.x, floorA.x;
SUB floorA.x, aSubtract.x, floorA.x;
ADD floorA.x, floorA.x, a.x;
MAD a.x, c.x, cMult.x, floorA.x;
# (gFastRngB += (gFastRngA *= gFastRngA))
MUL a.x, a.x, a.x;
ADD b.x, b.x, a.x;
# (gFastRngC += (gFastRngB -= floor(gFastRngB)))
FRC b.x, b.x;
ADD c.x, c.x, b.x;
# (gFastRngC -= floor(gFastRngC))
FRC c.x, c.x;
# (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA))
FRC floorA.x, a.x;
SUB floorA.x, a.x, floorA.x;
SUB floorA.x, aSubtract.x, floorA.x;
ADD floorA.x, floorA.x, a.x;
MAD a.x, c.x, cMult.x, floorA.x;
# (gFastRngB += (gFastRngA *= gFastRngA))
MUL a.x, a.x, a.x;
ADD b.x, b.x, a.x;
# (gFastRngC += (gFastRngB -= floor(gFastRngB)))
FRC b.x, b.x;
ADD c.x, c.x, b.x;
# (gFastRngC -= floor(gFastRngC))
FRC c.x, c.x;
MOV result.color, c.x;
MOV result.color.a, 1;
END
Posted by petewarden at 05:06 PM | Comments (1)
Quick and Dirty Vectorization
There aren't many fast techniques for vectorizing an image into polygons that approximate the original, or that take advantage of 3d hardware, so here's one I've used.
Take the original image, blur it to taste to remove noise and leave lots of smooth gradients.
Chose a function that will divide the image into seperate regions. The simplest one would be:
if (brightness<50%)
color = black;
else
color = white;
Split the image into grid squares of some fairly small size. For each corner of a square, work out the brightness value of the pixel underneath it.
Use this value as the horizontal texture coordinate for a 1 dimensional texture that encodes the function you chose. For the example function, this could be a texture 1 pixel high, and 100 pixels wide. The pixels from 0-49 would be black, the ones from 50-99 would be white.
If you then draw each square as two triangles, with the texture coordinates you calculated and with the 1D texture applied, you'll get an image that looks vectorized.

You have to chose a blur amount and grid resolution manually. You have two different ways to split a square into triangles, with a join running from top left to bottom right, or from top right to bottom left. Either one will cause zig-zagging for image edges that go at right angles to the chosen one. It seems like it should be possible to detect this with some clever coding.
This technique is not 'real' vectorization, but is useful for getting that kind of look. Though it's easiest with graphics hardware, the same approach is possible using a very simple software renderer.
Posted by petewarden at 05:05 PM | Comments (0)
Fragment Program Snippets
Here's some instruction sequences for common operations:
Compare against zero:
ABS value.x, value.x;
SUB value.x, 0, value.x;
CMP result, value.x, nonZero, isZero;
Compare against nearly zero (-epsilon<x<epsilon)
MAD value.x, value.x, value.x, negativeEpsilonSquared;
CMP result, value.x, isZero, nonZero;
floor()
FRC floor.x, value.x;
SUB floor.x, value.x, floor.x;
ceil()
FRC ceil.x, value.x;
MUL compare, ceil.x, -1;
SUB ceil.x, 1, ceil.x;
CMP ceil.x, compare, ceil.x, 0;
ADD ceil.x,
The tiling snippets need a couple of common variables, which can be put in the program.local[] parameters:
PARAM bounds = { textureWidth, textureHeight, 0, 0 };
PARAM recipBounds = { 1/textureWidth, 1/textureHeight, 0, 0 };
Tile texture coordinates by repeating:
MUL coords, coords, recipBounds;
FRC coords, coords;
MUL coords, coords, bounds;
Tile texture coordinates by mirroring:
MUL coords, coords, recipTwiceBounds;
FRC coords, coords;
MUL coords, coords, twiceBounds;
SUB coords, bounds, coords;
ABS coords, coords;
SUB coords, bounds, coords;
Clamp to a border:
# check if (x<0) or (y<0)
CMP compareResult, coords, 0, 1;
# Check if ((width-x)<0) or ((height-y)<0)
SUB coords, bounds, coords;
CMP compareResult, coords, 0, compareResult;
# x and y are either 0 if out of range, or 1 if in. Multiply them
# together so that the result is 0 if either are our of range, or
# 1 if they're both inside. Nudge the result down so we can
# compare easily
MAD compareResult.x, compareResult.x, compareResult.y, -0.1
CMP color, compareResult, borderColor, inputColor;
Convert a straight alpha color to premultiplied with black:
MUL premultColor, straightColor, straightColor.a;
MOV premultColor.a, straightColor.a;
Convert from premultiplied with black to straight alpha:
ADD premultColor.a, premultColor.a, 0.0000001;
RCP recipAlpha.x, premultColor.a;
MUL straightColor, premultColor, recipAlpha.x;
MOV straightColor.a, premultColor.a;
Posted by petewarden at 05:04 PM | Comments (0)
Fragment Programs-ATI R3xx vs NVidia NV34
ATI and NVidia cards have some differences it's important to know about when you're writing fragment programs to run on both.
The ATI cards have a limit of four levels of texture indirection.
A texture indirection is a TEX instruction with texture coordinates that are calculated, not just constant or passed in by a glMultiTex call. In practice, a 4 level limit means you can have up to four texture reads in a chain using coordinates from a previous instruction.
TEX firstValue, fragment.texcoord[0], texture[0]; # First indirection
TEX secondValue, firstValue, texture[0]; # Second indirection
TEX thirdValue, secondValue, texture[0]; # Third indirection
TEX fourthValue, thirdValue, texture[0]; # Fourth indirection
The spec to ARB_fragment_program explains in more detail what an indirection is.
When we hit the limit on indirections, we'd see a black output. We'd either try to rewrite to avoid the indirections, or break the shader into multiple passes.
RCP on zero gives different results.
ATI cards return a very large number when you divide by zero. NVidia cards give a plus or minus infinity NaN result, which infects subsequent calculations.
We preferred ATI's behavior, it wasn't mathematically correct, but was usually what we wanted. To emulate that on the NV34 we had two approaches:
- Detect the zero denominator before the operation, and then replace the result with something sensible:
ABS value.x, denominator.x;
SUB value.x, 0, value.x;
RCP reciprocal.x, denominator.x;
CMP reciprocal.x, value.x, reciprocal.x, someValueYouWant.x;
- Don't try to detect the zero, just nudge all numbers up a bit. This only works for inputs you know will be positive (eg texture samples), and loses a little precision, but often this is not significant.
ADD denominator.x, denominator.x, 0.0000001;
Inexact texture coordinates.
It's been hard to get exactly one-to-one texel to pixel mapping on ATI cards, we want to use bilinear filtering in general but the sampling errors can introduce slight blurring to the image. NVidia cards show much smaller errors. Where possible, we switched to nearest neighbor sampling, we also considered griding up each large quad into smaller polys to minimise the errors.
The NV34 doesn't support a floating point pixel format.
For a high precision accumulation buffer, you can get 16bit fixed point precision by using two 8 bit buffers, and storing half of each 16 bit channel in one of the buffer channels.

Since you can only write to one pbuffer at a time, you need to run two passes. Here's a fragment program to implement the red/green pass:
!!ARBfp1.0
# Emulate a 16 bit accumulation buffer for red and green channels
# author - Pete Warden
ATTRIB tex0 = fragment.texcoord[0];
ATTRIB tex1 = fragment.texcoord[1];
PARAM unpackScale = { 255.0, 1.0, 255.0, 1.0 };
PARAM packScale = { 0.0039216, 1.0, 0.0039216, 1.0 };
TEMP s0, s1, s2;
# Fetch the texel to be added
TEX s0, tex0, texture[0], RECT;
# Fetch the current total from the first 8 bit buffer
TEX s1, tex0, texture[1], RECT;
# Multiply the upper half of each channel by 255
MUL s1, s1, unpackScale;
# Add on the lower half to the upper
ADD s1.r, s1.r, s1.g;
ADD s1.g, s1.b, s1.a;
# Do the accumulation of the current input onto the total
ADD s1,s0,s1;
# Put the lower half of each channels into the result
FRC s2.g, s1.r;
FRC s2.a, s1.g;
# Calculate the rounded upper half of both channels
SUB s2.r, s1.r, s2.g;
SUB s2.b, s1.g, s2.a;
# Compress the channels into a 0-1 range to store in the 8 bit buffer
MUL result.color, s2, packScale;
END
The NV34 only passes 4 texture coordinates through to the fragment program.
The solution is to write a vertex program that emulates the fixed-function pipeline, but passes through all 8 texture coordinates, as on ATI:
!!ARBvp1.0
# Passthru vertex program
ATTRIB vertexPosition = vertex.position;
OUTPUT outputPosition = result.position;
DP4 outputPosition.x, state.matrix.mvp.row[0], vertexPosition;
DP4 outputPosition.y, state.matrix.mvp.row[1], vertexPosition;
DP4 outputPosition.z, state.matrix.mvp.row[2], vertexPosition;
DP4 outputPosition.w, state.matrix.mvp.row[3], vertexPosition;
MOV result.color, vertex.color;
MOV result.texcoord, vertex.texcoord;
MOV result.texcoord[1], vertex.texcoord[1];
MOV result.texcoord[2], vertex.texcoord[2];
MOV result.texcoord[3], vertex.texcoord[3];
MOV result.texcoord[4], vertex.texcoord[4];
MOV result.texcoord[5], vertex.texcoord[5];
MOV result.texcoord[6], vertex.texcoord[6];
MOV result.texcoord[7], vertex.texcoord[7];
END
Posted by petewarden at 05:03 PM | Comments (0)
Mac OpenGL contexts
A context is a container for the GL state information, things like the current color, transformation matrix, texture ID and anything else that needs to be remembered by OpenGL.
One of the things the context remembers is which frame buffer you're drawing to, known as the attached surface. The key thing to understand is that a context is not the same as a surface, a context is just a housekeeping structure to hold the information GL needs to remember. A surface is where you actually draw into, and you can change what surface a context draws into (for example, swapping the back and front buffers points the context to a different surface), or even not attach it to any surface.
Normally, texture IDs are local to each context:

if you've created a texture in one context, you can't use it in another. To share textures, you need to use a shared context.
Shared contexts are places to store texture IDs, and other IDs for things like draw lists and programs. Creating a context with a shared context means using that shared context to store those IDs, and information about the objects they point to.

A shared context doesn't even need to be pointing to a surface, since its only purpose is to store all those IDs. The easiest way to use a shared context is to create a single global one when your app starts, don't attach it to any window or pbuffer, and pass it into all create context calls.
Once you've done that, all of the texture IDs you create in any of your contexts will be usable in all of them, and you never need to think about the shared context again.
You can use OpenGL on the Mac in a multithreaded app, but you must never call GL functions on the same context from different threads at the same time.
Think of a GL context like a C++ object, if you call a member function on an object whilst another thread is halfway through another member function on the same object, you may end up with a corrupted object. Corrupting a context by a threading mistake will often cause a kernel panic.
If you're writing a multi-threaded GL app, make sure you periodically run the OpenGL Profiler, and enable 'break on thread errors'. This will stop the app and give a stack trace whenever two threads call into the same context. To find out what the other threads are doing, run the command line gdb with 'gdb -p ' and the process number of your app, and do the command 'thread apply all bt' for a complete thread listing.
CGLMacro.h exposes some of the implementation details of Mac OpenGL calls. Normally, every gl*() call actually ends up doing something like a CGLGetCurrentContext() call internally, and then passing that on to the function that does the actual work.
CGLGetCurrentContext() has to do some magic to figure out which thread it's in, and return the correct context for that thread, so it's a good optimisation to cache the current context if you know it's not changing, and you're making a lot of GL calls. CGLMacros let you do this optimisation but change very little of your code, by letting you specify a variable to use for the CGL context. Then, including the header into a source file will automatically switch all GL calls to use that instead of the standard context system.
Posted by petewarden at 05:02 PM | Comments (0)
Straight Alpha and Bilinear Filtering
There's a subtle problem using graphics hardware to do bilinear filtering on straight alpha images.
Take an image with a single pixel of full white, completely opaque, surrounded by fully transparent pixels, with black in their RGB channels

if you use bilinear filtering to get color and alpha values for a point halfway between the white pixel and a black one, this is what you end up with:

If you render this texture scaled up using the standard blending equation for straight alpha, glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA) then you'll see black fringing around the white pixel.
In games, this is not a problem, because you just make sure any zero alpha pixels in your artwork have the correct straight color values.
Where this does become an issue is when the image is generated as an intermediate stage of an image processing pipeline. For example, if you read in images that are pre-multiplied with black alpha, and convert them to straight alpha for the pipeline, then you have no way of telling what the correct color values for fully transparent pixels are.
Another example is drawing a polygon into a pbuffer or render texture. Nothing's drawn into the fully transparent areas, and there's no definitive way of picking a color to put in those pixels.
We looked at several solutions to this:
- Do a processing pass that averages the rgb values of the non-transparent neighbors of each zero alpha pixel, and replaces the existing color with that, or some other way of 'growing' good rgb values into the zero alpha areas.
Though this would probably do a decent job of removing the fringing in real world images, I could still construct cases where this would fail, such as a zero alpha pixel between black and white opaque pixels:

This would mean an extra pass over all of our intermediate textures, which would hit performance.
- Ignore any pixels with zero alpha when doing the bilinear filtering
Before each lerp in the bilinear process, do a check to see if one of the pixels to be lerped between has a zero alpha, and replace it with the other non-zero alpha pixel rather than lerping between them if so.
In pseudo-code, here's the original bilinear process
topColor = lerp(topLeftNeighbor,topRightNeighbor, fractionalX);
bottomColor = lerp(bottomLeftNeighbour,bottomRightNeighbor, fractionalX);
result = lerp(topColor, bottomColor, fractionalY);
and here's the version that checks for zero alpha
if (topLeftNeighbor.alpha==0)
topColor = topRightNeighbor;
else if (topRightNeighbor.alpha==0)
topColor = topLeftNeighbor;
else
topColor = lerp(topLeftNeighbor,topRightNeighbor, fractionalX);
if (bottomLeftNeighbor.alpha==0)
bottomColor = bottomRightNeighbor;
else if (bottomRightNeighbor.alpha==0)
bottomColor = bottomLeftNeighbor;
else
bottomColor = lerp(bottomLeftNeighbour,bottomRightNeighbor, fractionalX);
if (topColor.alpha==0)
result = bottomColor;
else if (bottomColor.alpha==0)
result = topColor;
else
result = lerp(topColor, bottomColor, fractionalY);
This would fix the problem, but requires 4 texture access's if implemented using OpenGL fragment programs, as well as being rather inelegant.
- Change the pipeline to use premultiplied with black
Using glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) lets you render premultiplied with black alpha textures correctly, and bilinear filtering does the right thing. In the original case, the neighborhood of the white pixel evaluates to a full white (premultiplied by the alpha which drops away), removing the dark fringing.
This is what we did, even though it seems to go against the grain of normally using straight alpha with graphics cards, unlike the other solutions it didn't hurt performance.
Posted by petewarden at 05:00 PM | Comments (0)
Animating Worley Noise
Texturing and Modelling has a chapter by Steve Worley on using Voronoi diagrams to create a procedural noise pattern. I needed to find a way to animate and render this on a 2D plane using OpenGL.
The basic approach is to take a set of points randomly scattered through space, and take the distance to the nearest point as the texture's value. This is the sort of pattern you end up with;
A naive implementation of this would search through all of the scattered points to find the current closest distance, which gets slow very quickly. Steve's implementation cuts down the search space by dividing space into grid cells and placing a single point randomly within each one, like this:
This allows each evaluation to only check against the point in the cell it's currently in, and its neighbours.
This works great for static images, but I really wanted the points to move to get an animated texture. One way to do this would be go back to randomly scattered points, and then sort them into a spatial sorting structure such as a quad-tree each frame to keep the number of neighbours you need to check to a minimum.
I was aiming at a hardware solution, and wanted to draw a simple 2d image of the pattern, rather than procedurally evaluating it at arbitrary points, so I decided to try an idea I first came across in Stylized Video Cubes; using a z-buffer to construct the voronoi diagram.
I draw a quad centered around each point, enable depth testing and use a small pixel shader that outputs the distance from the central point as the z depth, as well as setting the color from a palette indexed on the depth:
The actual size of the quad that's drawn for each point was chosen pretty arbitrarily, to give a nice look for a typical distribution of points.
I then animate each point by giving it a random velocity vector, and wrapping it's position when it heads off the edge of the image (you can either add a slop border the size of a single quad around the edge to wrap to, or if you want a tileable result, render each point multiple times if it's touching a border).
This works pretty nicely, the main issue was what happened if a part of the image isn't covered by any quads, which leaves a gap in the result. I worked around this by clearing the screen to the color indexed by the maximum distance in the palette.
Having a tileable image means that it's possible to do a poor man's fractal noise, by reusing the constructed image for all the sublayers of the fractal pattern. The repetition can be noticeable, especially when its animated, but it's a very low cost operation once you have the basic image.
I experimented doing multiple passes to implement deeper levels of the algorithm, such as using the distance from the second or third nearest neighbor, but this involved rendering a pass that wrote ID values for each point to compare against in the next pass to figure out if the current nearest point was acceptable, and much larger quads, and so I was unable to get decent performance for L2 and higher textures.
I think the best approach for these on hardware would involve a combination of space-sorting strategies to break areas down into a small enough set of possible neighbor points, and then doing the fine-level sorting within a pixel shader. Space sorting would also be a good way to handle the moving points problem with a software implementation, since you don't have the performance advantage you get from an optimized z-buffer in hardware.
Posted by petewarden at 04:38 PM | Comments (0)



