<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>Pete&apos;s GPU Notes</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/" />
<modified>2006-06-28T22:05:16Z</modified>
<tagline></tagline>
<id>tag:petewarden.com,2006:/notes/1</id>
<generator url="http://www.movabletype.org/" version="3.16">Movable Type</generator>
<copyright>Copyright (c) 2006, petewarden</copyright>
<entry>
<title>Running out of VRAM on OS X</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2006/06/running_out_of.html" />
<modified>2006-06-28T22:05:16Z</modified>
<issued>2006-06-28T20:31:25Z</issued>
<id>tag:petewarden.com,2006:/notes/1.15</id>
<created>2006-06-28T20:31:25Z</created>
<summary type="text/plain">OpenGL on OS X virtualizes texture memory. This means you can allocate an almost unlimited number of textures, and the OS will keep them in system memory, and only copy them up to the graphics card when they&apos;re needed. This...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>OpenGL on OS X virtualizes texture memory. This means you can allocate an almost unlimited number of textures: the OS keeps them in system memory and only copies them up to the graphics card when they're needed. This is a lot better than the old model of keeping all textures only in VRAM, where texture creation failed once you ran out of memory there. But there are still cases where you have to worry about how much VRAM is available.</p>]]>
<![CDATA[<p>The OS will swap out any textures that aren't being used in the current drawing operation to make room in VRAM for the currently active ones. The flip side is that if you're multi-texturing, or using a fragment program to read from multiple textures, every texture you've assigned to an enabled texture unit has to stay in VRAM and can't be swapped out. The swapping is very aggressive, and anything it can remove, it will. But if the current operation fetches from a texture, that texture has to stay resident.</p>

<p>This becomes a problem when you have a lot of textures enabled at once, or when they're very large or have a high bit depth. If they won't all fit in VRAM, the current drawing operation fails and you end up with garbage drawn instead. You don't get an error condition or a log message when this happens, so it can be tricky to track down.</p>

<p>To avoid this, you need to work out whether your textures will fit in the currently available VRAM before you draw. Assuming four (RGBA) channels, you can calculate the VRAM usage of an individual texture with</p>

<p>width * height * 4 * bytes per channel</p>

<p>So a 2048x2048 texture with eight bits per channel takes up 2048*2048*4*1 = 16,777,216 bytes, or 16 MB of VRAM.</p>

<p>To work out whether you'll run out of VRAM, add up the memory used by all the active textures. Then add the memory for the area you're drawing to (the pbuffer or window), since that also needs to be on the graphics card while you're drawing.</p>

<p>For a 64 MB graphics card, you can fit four 16 MB surfaces in at once. That means you can use three 2048x2048 input textures if you're drawing into a 2048x2048 pbuffer.</p>
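<p>Here's a minimal sketch of that arithmetic in C. SurfaceDesc, surfaceBytes() and drawFitsInVRAM() are hypothetical names, not part of any Apple API. The code assumes every surface has four channels, as in the formula above, and it only counts the surfaces you pass in.</p>

```c
/* Hypothetical helpers, not from the original post: estimate the VRAM
   one drawing operation needs, assuming 4 channels (RGBA) per surface. */
typedef struct {
    unsigned long width;
    unsigned long height;
    unsigned long bytesPerChannel; /* e.g. 1 for 8-bit, 4 for 32-bit float */
} SurfaceDesc;

/* width * height * 4 * bytes per channel */
unsigned long surfaceBytes(SurfaceDesc s)
{
    return s.width * s.height * 4UL * s.bytesPerChannel;
}

/* Adds up every active texture plus the pbuffer/window being drawn into,
   and returns 1 if the total fits in vramBytes, 0 otherwise. */
int drawFitsInVRAM(const SurfaceDesc* textures, int numTextures,
                   SurfaceDesc target, unsigned long vramBytes)
{
    unsigned long total = surfaceBytes(target);
    int i;
    for (i = 0; i != numTextures; ++i)
        total += surfaceBytes(textures[i]);
    return vramBytes >= total ? 1 : 0;
}
```

<p>With a 64 MB budget, three 8-bit 2048x2048 inputs plus a 2048x2048 target come to exactly 64 MB, so drawFitsInVRAM() returns 1. Adding a fourth input makes it return 0.</p>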

<p>To find out how much VRAM the card has, you can ask CGL about the renderer driving the main display:</p>

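<p><b><br />
#include &lt;ApplicationServices/ApplicationServices.h&gt;<br />
#include &lt;OpenGL/OpenGL.h&gt;<br />
<br />
long getAvailableVRAM(void)<br />
{<br />
    CGOpenGLDisplayMask displayMask = CGDisplayIDToOpenGLDisplayMask(CGMainDisplayID());<br />
    CGLRendererInfoObj info;<br />
    long numRenderers = 0;<br />
    long videoMemory = 0;<br />
    if (CGLQueryRendererInfo(displayMask, &amp;info, &amp;numRenderers) == kCGLNoError)<br />
    {<br />
        // Total video memory in bytes, assuming renderer 0 is the card's hardware renderer<br />
        CGLDescribeRenderer(info, 0, kCGLRPVideoMemory, &amp;videoMemory);<br />
        CGLDestroyRendererInfo(info);<br />
    }<br />
    return videoMemory;<br />
}<br />
</b></p>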

<p>and then divide the result by the number of displays connected to the card, since ATI cards at least split the VRAM evenly between displays.</p>]]>
</content>
</entry>
<entry>
<title>Debugging OpenGL on OS X</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2006/06/debugging_openg_1.html" />
<modified>2006-06-12T21:29:10Z</modified>
<issued>2006-06-12T20:32:37Z</issued>
<id>tag:petewarden.com,2006:/notes/1.14</id>
<created>2006-06-12T20:32:37Z</created>
<summary type="text/plain">It&apos;s tough to debug OpenGL problems, especially with pixel shaders. I&apos;ll cover some tips for any platform, and then some of the OS X built-in tools that can help....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>It's tough to debug OpenGL problems, especially with pixel shaders. I'll cover some tips for any platform, and then some of the OS X built-in tools that can help.</p>]]>
<![CDATA[<p><font size=+1><b>Error Checking</b></font></p>

<p>First, the fundamental rule of debugging OpenGL: <b>call <i>glGetError()</i> early and often</b>. It returns a non-zero error code if any GL call since the last check ran into a problem. I try to call it at the end of every big chunk of GL processing, for example at the end of a model drawing routine. That keeps the clutter down and avoids the performance hit of calling it in an inner loop.</p>

<p>Now, you only get a number out of the function, so you need to call <b><i>gluErrorString()</i></b> to get an understandable description of what went wrong. I usually wrap the checking up into a macro, so I can easily add it, and automatically remove it in release builds. Here's what I use:<br />
<b><br />
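#include &lt;stdio.h&gt;<br />
#include &lt;OpenGL/gl.h&gt;<br />
#include &lt;OpenGL/glu.h&gt;<br />
<br />
#ifdef ENABLE_GL_ERRORS<br />
void Effect_ShowGLErrors(void)<br />
{<br />
    GLenum error;<br />
    // glGetError() returns one error flag per call, so loop until they're all cleared<br />
    while ((error = glGetError()) != GL_NO_ERROR)<br />
    {<br />
        const GLubyte* errStr = gluErrorString(error);<br />
        fprintf(stderr, "OpenGL Error: %s\n", (const char*)errStr);<br />
    }<br />
}<br />
#define EFFECT_SHOW_GL_ERRORS() Effect_ShowGLErrors()<br />
#else // ENABLE_GL_ERRORS<br />
#define EFFECT_SHOW_GL_ERRORS() (void)(0)<br />
#endif // ENABLE_GL_ERRORS<br />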
</b><br />
Sometimes it's not obvious which call caused the error, and then you either need to scatter more checks through your GL code to narrow it down, or use a great feature of Apple's <a href="http://developer.apple.com/graphicsimaging/opengl/profiler_image.html">OpenGL Profiler</a>, <b>Break on Error</b>.</p>

<p>Start up the profiler, found in /Developer/Applications/Graphics Tools/, and either launch your program from it or attach to a copy that's already running. Choose "Views->Breakpoints" from the main menu, and click "Break on Error" at the bottom left of the window. Now, when an error occurs, the Profiler stops the program and shows you the call stack, the arguments, and the exact function that caused it. If you've attached to a program you launched from the debugger, you can hit pause there and examine everything as normal.</p>

<p><font size=+1><b>Black Screen Blues</b></font></p>

<p>One of the hardest things to debug is a black screen in GL. Is the window or pbuffer setup messed up? Is one of the textures you're using bogus? Is your camera pointing at the floor? Is a fragment program having a problem? Is the geometry you're drawing broken, or translated off to infinity? Is your lighting setup busted?</p>

<p>My first test is the programming equivalent of checking pupil dilation with a flashlight: glClear(). If a colored clear at the end of rendering doesn't show up, then the problem's outside the drawing code, and I know I have to fix something in my pbuffer or window setup instead.</p>

<p>Clearing the screen is the only way of getting something on screen that's almost impossible to screw up with dodgy GL state. It doesn't matter what your viewport or matrices are, or what textures or programs are bound: if GL is working, calling glClear() will have an effect. The main exceptions are the scissor test and glColorMask(), which clears do respect. It's especially useful if you have a chain of rendering running through a series of pbuffers. If the end result is blank, it can be tough to figure out where the problem's occurring, and using clear lets you work backwards through the chain.<br />
<b><br />
glClearColor(1,1,0,1);<br />
glClear(GL_COLOR_BUFFER_BIT);<br />
glClearColor(0,0,0,0);<br />
</b><br />
As a bonus, you can change the color (I've got yellow above) if you want to indicate different calls.</p>

<p>If this <b>does</b> give you a yellow screen when you call it after all your rendering, move it to the start of your rendering code, so anything else will be drawn over it. If you then see a black screen, or a silhouette of your model, you know your geometry is getting rendered, but the texturing, lighting or pixel shaders you're using don't work. If you still see a plain yellow screen, then it's the geometry that's the problem.</p>]]>
</content>
</entry>
<entry>
<title>Fragment Program Utilities</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/06/fragment_progra_3.html" />
<modified>2005-06-28T02:15:45Z</modified>
<issued>2005-06-28T01:11:01Z</issued>
<id>tag:petewarden.com,2005:/notes/1.12</id>
<created>2005-06-28T01:11:01Z</created>
<summary type="text/plain">Here are a couple of functions to check that a fragment program can be run, and to load it and return the ID. The code is inline below, or you can download it....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[Here are a couple of functions: one checks that a fragment program can be run, and the other loads it and returns its ID. The code is inline below, or you can <a href="http://petewarden.com/notes/archives/FragmentUtils.cpp">download it</a>.<br>]]>
<![CDATA[<br><br><strong>canRunFragmentProgram()</strong> checks the syntax of the program, and also checks that it can be run on the current graphics card. It prints the compiler's message for any syntax errors. If the syntax is OK but the program is too much for the card to run in hardware, it prints some information about which limit was exceeded. In either case it returns false.<br><br>

<strong>loadFragmentProgram()</strong> loads the fragment program you pass in as a C string, and returns an ID you can bind to use the program.<br><br>
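One caveat about the extension check in canRunFragmentProgram(): strstr() matches substrings. It would also succeed if the string only contained a longer name that starts the same way, such as GL_ARB_fragment_program_shadow. Below is a sketch of a whole-word alternative, hasExtension(). It's hypothetical and not taken from the original code. You'd pass it the result of glGetString(GL_EXTENSIONS) cast to const char*.<br><br>

```c
/* Hypothetical helper, not from the original post. Returns 1 if name is a
   complete, space-separated entry in the extensions string, 0 otherwise,
   so "GL_ARB_fragment_program" won't match "GL_ARB_fragment_program_shadow". */

/* Returns 1 if the entry starting at 'entry' is exactly 'name'. */
static int entryMatches(const char* entry, const char* name)
{
    while (*name != '\0') {
        if (*entry != *name)
            return 0;
        entry++;
        name++;
    }
    /* All of name matched; the entry must end here too */
    return *entry == ' ' || *entry == '\0';
}

int hasExtension(const char* extensions, const char* name)
{
    const char* p = extensions;
    if (*name == '\0')
        return 0;
    for (;;) {
        if (entryMatches(p, name))
            return 1;
        /* Skip the rest of this entry, then the separating spaces */
        while (*p != ' ') {
            if (*p == '\0')
                return 0;
            p++;
        }
        while (*p == ' ')
            p++;
    }
}
```

<br><br>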

<code>
<pre>
<strong>#include &lt;stdio.h&gt;</strong>
<strong>#include &lt;string.h&gt;</strong>

<strong>#include &lt;OpenGL/gl.h&gt;</strong>
<strong>#include &lt;OpenGL/glext.h&gt;</strong>

<strong>bool</strong> canRunFragmentProgram(<strong>const</strong> <strong>char</strong>* programString);
GLuint loadFragmentProgram(<strong>const</strong> <strong>char</strong>* programString);
        
// Checks for errors in the fragment program
<strong>bool</strong> canRunFragmentProgram(<strong>const</strong> <strong>char</strong>* programString)
<strong>{</strong>
	// Make sure the card supports fragment programs at all, by searching for the extension string
	<strong>const</strong> <strong>char</strong> *extensions = <strong>reinterpret_cast</strong>&lt;<strong>const</strong> <strong>char</strong> *&gt;( glGetString( GL_EXTENSIONS ) );
	<strong>const</strong> <strong>bool</strong> cardSupportsARB = ( strstr( extensions, &quot;GL_ARB_fragment_program&quot; ) != NULL );

	// If it doesn't support them, no program can run, so report the problem and return false
	<strong>if</strong> (!cardSupportsARB)
	<strong>{</strong>
		fprintf(stderr,&quot;Card does not support the ARB_fragment_program extension\n&quot;);
		<strong>return</strong> <strong>false</strong>;
	<strong>}</strong>

	// Create a temporary ID to load the shader into
	GLuint tempID;
	glGenProgramsARB( 1, &amp;tempID );
	glBindProgramARB( GL_FRAGMENT_PROGRAM_ARB, tempID );
   
	// Get the driver to load and parse the shader
	glProgramStringARB( GL_FRAGMENT_PROGRAM_ARB,
						GL_PROGRAM_FORMAT_ASCII_ARB,
						strlen( programString ),
						programString );

	GLint	 isUnderNativeLimits;
	glGetProgramivARB( GL_FRAGMENT_PROGRAM_ARB,
					   GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB,
					   &amp;isUnderNativeLimits );

	// If the program is over the hardware's limits, print out some information
	<strong>if</strong> (isUnderNativeLimits!=1)
	<strong>{</strong>
		// Go through the most common limits that are exceeded
		fprintf(stderr, &quot;Fragment program is beyond hardware limits:\n&quot;);

		GLint aluInstructions, maxAluInstructions;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_ALU_INSTRUCTIONS_ARB, &amp;aluInstructions);
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_ALU_INSTRUCTIONS_ARB, &amp;maxAluInstructions);
		<strong>if</strong> (aluInstructions&gt;maxAluInstructions)
			fprintf(stderr, &quot;Compiles to too many ALU instructions (%d, limit is %d)\n&quot;, aluInstructions, maxAluInstructions);

		GLint textureInstructions, maxTextureInstructions;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_TEX_INSTRUCTIONS_ARB, &amp;textureInstructions);
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_TEX_INSTRUCTIONS_ARB, &amp;maxTextureInstructions);
		<strong>if</strong> (textureInstructions&gt;maxTextureInstructions)
			fprintf(stderr, &quot;Compiles to too many texture instructions (%d, limit is %d)\n&quot;, textureInstructions, maxTextureInstructions);

		GLint textureIndirections, maxTextureIndirections;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_TEX_INDIRECTIONS_ARB, &amp;textureIndirections);
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_TEX_INDIRECTIONS_ARB, &amp;maxTextureIndirections);
		<strong>if</strong> (textureIndirections&gt;maxTextureIndirections)
			fprintf(stderr, &quot;Compiles to too many texture indirections (%d, limit is %d)\n&quot;, textureIndirections, maxTextureIndirections);

		GLint nativeTextureIndirections, maxNativeTextureIndirections;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_NATIVE_TEX_INDIRECTIONS_ARB, &amp;nativeTextureIndirections);
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_NATIVE_TEX_INDIRECTIONS_ARB, &amp;maxNativeTextureIndirections);
		<strong>if</strong> (nativeTextureIndirections&gt;maxNativeTextureIndirections)
			fprintf(stderr, &quot;Compiles to too many native texture indirections (%d, limit is %d)\n&quot;, nativeTextureIndirections, maxNativeTextureIndirections);

		GLint nativeAluInstructions, maxNativeAluInstructions;
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_NATIVE_ALU_INSTRUCTIONS_ARB, &amp;nativeAluInstructions);
		glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB, GL_MAX_PROGRAM_NATIVE_ALU_INSTRUCTIONS_ARB, &amp;maxNativeAluInstructions);
		<strong>if</strong> (nativeAluInstructions&gt;maxNativeAluInstructions)
			fprintf(stderr, &quot;Compiles to too many native ALU instructions (%d, limit is %d)\n&quot;, nativeAluInstructions, maxNativeAluInstructions);
	<strong>}</strong>

 	// See if a syntax error was found
	// Often the actual line number won't be given, it will just be zero if there's an error
	// and minus one if it's ok. The error string usually includes the right line number.
	GLint errorLine;
	glGetIntegerv(GL_PROGRAM_ERROR_POSITION_ARB, &amp;errorLine);
	<strong>if</strong> (errorLine!=-1)
	<strong>{</strong>
		<strong>const</strong> GLubyte* errorString = glGetString(GL_PROGRAM_ERROR_STRING_ARB);
		fprintf(stderr,&quot;%s&quot;,errorString);
	<strong>}</strong>
	
	glDeleteProgramsARB( 1, &amp;tempID );

	<strong>const</strong> <strong>bool</strong> result = ((isUnderNativeLimits==1)&amp;&amp;(errorLine==-1));
    
    <strong>return</strong> result;
<strong>}</strong>

GLuint loadFragmentProgram(<strong>const</strong> <strong>char</strong>* programString)
<strong>{</strong>
    GLuint result;
    
    glGenProgramsARB( 1, &amp;result );
    glBindProgramARB( GL_FRAGMENT_PROGRAM_ARB, result );
    glProgramStringARB( GL_FRAGMENT_PROGRAM_ARB,
        GL_PROGRAM_FORMAT_ASCII_ARB,
        strlen( programString ),
        programString );

    <strong>return</strong> result;
<strong>}</strong>

</pre>
</code>]]>
</content>
</entry>
<entry>
<title>Fragment Program Reference</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/fragment_progra_2.html" />
<modified>2005-06-28T00:44:36Z</modified>
<issued>2005-05-19T00:28:52Z</issued>
<id>tag:petewarden.com,2005:/notes/1.11</id>
<created>2005-05-19T00:28:52Z</created>
<summary type="text/plain">The OpenGL Extensions Guide has a great chapter on the ARB fragment program language, and I recommend buying it, but there aren&apos;t many good references online. The most useful is the official spec, but it&apos;s designed as an exhaustive guide, not...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>The <a href="http://www.charlesriver.com/Books/BookDetail.aspx?productID=65132">OpenGL Extensions Guide</a> has a great chapter on the ARB fragment program language, and I recommend buying it, but there aren't many good references online. The most useful is the <a href="http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt">official spec</a>, but it's designed as an exhaustive guide rather than a quick reference for programmers. Here's a rundown of the instruction set, and some tips and tricks.</p>]]>
<![CDATA[<p>Here's a cut-out-and-keep table of all the instructions, based on table X.5 in the spec. In the Inputs and Output columns, v is a vector, s is a scalar, ssss is a scalar result replicated to all four components, ss-- is two scalar results in x and y, u is a texture image unit, t is a texture target, and - means no result.</p>
<TABLE border="2" align=center summary="This table lists the ARB fragment program instructions, their inputs and outputs.">
<CAPTION><EM>ARB fragment program instructions</EM></CAPTION>
<TR><TH>Instruction<TH>Inputs<TH>Output<TH>Description<TH>Pseudocode<TH>Notes
<TR><TH>ABS<TD>v<TD>v<TD><b>absolute value</b><TD><code>fabs(arg1)</code>
<TR><TH>ADD<TD>v,v<TD>v<TD><b>add</b><TD><code>arg1+arg2</code>
<TR><TH>CMP<TD>v,v,v<TD>v<TD><b>compare</b><TD><code>if (arg1&lt;0) arg2 else arg3</code>
<TR><TH>COS<TD>s<TD>ssss<TD><b>cosine</b><TD><code>cos(arg1)</code><TD>Synthesised using 5 instructions on ATI R3xx
<TR><TH>DP3<TD>v,v<TD>ssss<TD><b>3-component dot product</b><TD><code>(arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z)</code>
<TR><TH>DP4<TD>v,v<TD>ssss<TD><b>4-component dot product</b><TD><code>(arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z)+(arg1.w*arg2.w)</code><TD>Emulated with 4 native instructions on ATI R3xx
<TR><TH>DPH<TD>v,v<TD>ssss<TD><b>homogeneous dot product</b><TD><code>(arg1.x*arg2.x)+(arg1.y*arg2.y)+(arg1.z*arg2.z)+arg2.w</code>
<TR><TH>DST<TD>v,v<TD>v<TD><b>distance vector</b><TD>Funky, see spec
<TR><TH>EX2<TD>s<TD>ssss<TD><b>exponential base 2</b><TD><code>pow(2,arg1)</code><TD>Using negative args on NVidia gives different results to ATI
<TR><TH>FLR<TD>v<TD>v<TD><b>floor</b><TD><code>floor(arg1)</code>
<TR><TH>FRC<TD>v<TD>v<TD><b>fraction</b><TD><code>arg1-floor(arg1)</code>
<TR><TH>KIL<TD>v<TD>-<TD><b>kill fragment</b><TD><code>if (any component of arg1&lt;0) discard fragment</code>
<TR><TH>LG2<TD>s<TD>ssss<TD><b>logarithm base 2</b><TD><code>log2(arg1)</code>
<TR><TH>LIT<TD>v<TD>v<TD><b>compute light coefficients</b><TD>Funky, see spec
<TR><TH>LRP<TD>v,v,v<TD>v<TD><b>linear interpolation</b><TD><code>(arg2*arg1)+(arg3*(1-arg1))</code><TD>Order is lerpValue, end, start
<TR><TH>MAD<TD>v,v,v<TD>v<TD><b>multiply and add</b><TD><code>(arg1*arg2)+arg3</code><TD>Really useful to reduce instruction counts
<TR><TH>MAX<TD>v,v<TD>v<TD><b>maximum</b><TD><code>if (arg1&lt;arg2) arg2 else arg1</code>
<TR><TH>MIN<TD>v,v<TD>v<TD><b>minimum</b><TD><code>if (arg1&gt;arg2) arg2 else arg1</code>
<TR><TH>MOV<TD>v<TD>v<TD><b>move</b><TD><code>arg1</code>
<TR><TH>MUL<TD>v,v<TD>v<TD><b>multiply</b><TD><code>arg1*arg2</code>
<TR><TH>POW<TD>s,s<TD>ssss<TD><b>exponentiate</b><TD><code>pow(arg1,arg2)</code>
<TR><TH>RCP<TD>s<TD>ssss<TD><b>reciprocal</b><TD><code>1/arg1</code>
<TR><TH>RSQ<TD>s<TD>ssss<TD><b>reciprocal square root</b><TD><code>1/sqrt(fabs(arg1))</code>
<TR><TH>SCS<TD>s<TD>ss--<TD><b>sine/cosine</b><TD><code>result.x=cos(arg1) result.y=sin(arg1)</code><TD>Synthesised using multiple instructions on ATI R3xx
<TR><TH>SGE<TD>v,v<TD>v<TD><b>set on greater than or equal</b><TD><code>if (arg1&gt;&#61;arg2) 1.0 else 0.0</code>
<TR><TH>SIN<TD>s<TD>ssss<TD><b>sine</b><TD><code>sin(arg1)</code><TD>Synthesised using 5 instructions on ATI R3xx
<TR><TH>SLT<TD>v,v<TD>v<TD><b>set on less than</b><TD><code>if (arg1&lt;arg2) 1.0 else 0.0</code>
<TR><TH>SUB<TD>v,v<TD>v<TD><b>subtract</b><TD><code>arg1-arg2</code>
<TR><TH>SWZ<TD>v<TD>v<TD><b>extended swizzle</b><TD>Funky, see spec<TD>Synthesised using multiple instructions on ATI R3xx
<TR><TH>TEX<TD>v,u,t<TD>v<TD><b>texture sample</b><TD><TD>Texture instructions are almost always the performance bottleneck
<TR><TH>TXB<TD>v,u,t<TD>v<TD><b>texture sample with bias</b><TD>
<TR><TH>TXP<TD>v,u,t<TD>v<TD><b>texture sample with projection</b><TD>
<TR><TH>XPD<TD>v,v<TD>v<TD><b>cross product</b><TD><code>[(arg1.y*arg2.z-arg1.z*arg2.y),(arg1.z*arg2.x-arg1.x*arg2.z),(arg1.x*arg2.y-arg1.y*arg2.x)]</code>
</TABLE>
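<p>To make the operand order of LRP, MAD and CMP concrete, here's a small CPU reference in C. It was written for this note rather than taken from the spec, and Vec4, lrp(), mad() and cmp() are hypothetical names. It follows the spec's definitions component by component and makes no attempt to match any card's precision.</p>

```c
/* Hypothetical CPU reference for three instructions whose operand order
   is easy to get wrong. Works component-wise, like the GPU does. */
typedef struct { float x, y, z, w; } Vec4;

/* LRP dst, t, end, start  ->  t*end + (1-t)*start */
Vec4 lrp(Vec4 t, Vec4 end, Vec4 start)
{
    Vec4 r;
    r.x = t.x * end.x + (1.0f - t.x) * start.x;
    r.y = t.y * end.y + (1.0f - t.y) * start.y;
    r.z = t.z * end.z + (1.0f - t.z) * start.z;
    r.w = t.w * end.w + (1.0f - t.w) * start.w;
    return r;
}

/* MAD dst, a, b, c  ->  a*b + c */
Vec4 mad(Vec4 a, Vec4 b, Vec4 c)
{
    Vec4 r;
    r.x = a.x * b.x + c.x;
    r.y = a.y * b.y + c.y;
    r.z = a.z * b.z + c.z;
    r.w = a.w * b.w + c.w;
    return r;
}

/* CMP dst, a, b, c  ->  per component, b where a is negative, else c */
Vec4 cmp(Vec4 a, Vec4 b, Vec4 c)
{
    Vec4 r;
    r.x = (0.0f > a.x) ? b.x : c.x;
    r.y = (0.0f > a.y) ? b.y : c.y;
    r.z = (0.0f > a.z) ? b.z : c.z;
    r.w = (0.0f > a.w) ? b.w : c.w;
    return r;
}
```

<p>Note that a zero in CMP's first operand selects the third argument, since only negative values pick the second.</p>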

<p>Always use a write mask when you only need some components of a result. The compilers generally aren't smart enough to figure out that you only use the .x component later on, and both vendors' hardware has clever tricks for executing vector and scalar instructions in parallel.</p>

<p>Try to calculate everything you can using arithmetic instructions rather than doing table lookups from textures. Memory access almost always seems to be the limiting factor on the speed of our fragment programs. That leaves a lot of free instruction slots that can be filled with extra calculations while the hardware's waiting on memory.</p>

<p>ATI cards natively support simple swizzling, either where you're masking out some components in the result register (ADD foo.xy, bob, jim;) or where you're duplicating a single component across the whole register (ADD foo, bob, jim.x;).<br />
Anything more complicated will be emulated using multiple instructions (ADD foo, bob.zyzy, jim; or ADD foo, bob.xxxy, jim;).</p>
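<p>As an illustration of both tips, here's a small greyscale program, stored as a C string ready to hand to glProgramStringARB(). It's a hypothetical example written for this note. The DP3 writes only lum.x, and lum.x is then replicated across the color channels with a single-component swizzle. The luminance weights are the common Rec. 601 values, chosen only for the example.</p>

```c
/* Hypothetical example program, not from the original post: converts
   texture unit 0 to greyscale.
   - "DP3 lum.x" uses a write mask, so only one component is computed.
   - "lum.x" as a source replicates x across the written components. */
static const char kGreyscaleProgram[] =
    "!!ARBfp1.0\n"
    "PARAM lumWeights = { 0.299, 0.587, 0.114, 0.0 };\n"
    "TEMP color, lum;\n"
    "TEX color, fragment.texcoord[0], texture[0], 2D;\n"
    "DP3 lum.x, color, lumWeights;\n"
    "MOV result.color.xyz, lum.x;\n"
    "MOV result.color.w, color.w;\n"
    "END\n";
```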

<p>On ATI, using GL_TEXTURE_RECTANGLE_EXT textures in TEX instructions (with RECT as the target) generates hidden instructions. These convert the coordinates from the 0 to &lt;size in pixels&gt; range used by the extension into the 0 to 1 range. This is especially tricky because it adds another, hidden, level of texture indirection.</p>]]>
</content>
</entry>
<entry>
<title>Mac PBuffers</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/mac_pbuffers.html" />
<modified>2005-06-27T22:40:53Z</modified>
<issued>2005-05-18T01:25:12Z</issued>
<id>tag:petewarden.com,2005:/notes/1.10</id>
<created>2005-05-18T01:25:12Z</created>
<summary type="text/plain">If you ever need to do more image processing than you can with a single fragment program pass, you&apos;ll need to render to a texture. The best way to do that on OS X is using a pbuffer, here&apos;s some...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>If you ever need to do more image processing than you can manage in a single fragment program pass, you'll need to render to a texture. The best way to do that on OS X is with a pbuffer. Here's some example code for creating and using them.</p>

<p>I've included the code inline below, or you can get a zip of the code here: <a href="http://petewarden.com/notes/PBufferUtils.zip">PBufferUtils.zip</a>.</p>]]>
<![CDATA[<p>To use it, call <b>PBuffer_Create(sharedContext, width, height, 8, ePBufferFlag_ZBuffer)</b> to create a pbuffer with 8 bits per color channel and a depth buffer attached, or pass 0 as the last argument for no z buffer. The 'sharedContext' is a housekeeping device. By default, texture IDs are only usable within the context that created them, and the shared context lets them be shared between multiple contexts instead. If you have a single shared context in your app, you can create a texture in one pbuffer's context, whether by uploading pixel data or referencing a pbuffer, and use it in any other pbuffer's context in your app. I go into detail on contexts <a href="http://petewarden.com/notes/MacGLContexts.html">here</a>.</p>

<p>Use <b>PBuffer_Begin()</b> before you start drawing into it, and <b>PBuffer_End()</b> once you're done. When you're ready to use it as a texture, call PBuffer_Use() in the same place you'd normally do a<br />
<b>glEnable(GL_TEXTURE_RECTANGLE_EXT);<br />
glBindTexture(GL_TEXTURE_RECTANGLE_EXT, someTextureID);</b><br />
and do a<br />
<b>glDisable(GL_TEXTURE_RECTANGLE_EXT);</b><br />
when you're done with it.</p>

<p><b>PBuffer_Begin()</b> clears the pbuffer, sets up an orthographic view and resets some common GL state. You still need to be careful to reset GL state manually, though, since a lot of our bugs have been caused by state left hanging around.</p>

<p>You'll get an error if you try to create a pbuffer with different pixel attributes than the shared context, even if the difference doesn't seem relevant. For example, you might need to add kCGLPFANoRecovery to the list of attributes if that's what the shared context has been created with.</p>

<p>Creating and destroying pbuffers is expensive, so we tend to carry them over from frame to frame unless they need to change in size. If size changes are needed, we try to over-allocate and create bigger pbuffers than we need, to avoid frequent reallocations for growing objects. </p>
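<p>Here's a sketch of one possible over-allocation policy, using hypothetical helpers that aren't part of PBufferUtils. paddedPBufferSize() rounds a requested size up to the next power of two, clamped to the largest size the card supports. pbufferNeedsRealloc() reports when a pbuffer created at one size is too small for a new request. The created size could be tracked in the createdWidth and createdHeight fields of the PetePBuffer struct below. maxSize is an assumed input you'd query from the card, for example via GL_MAX_RECTANGLE_TEXTURE_SIZE_EXT.</p>

```c
/* Hypothetical sizing helpers, not from the original code. */

/* Smallest power of two that is at least 'requested', clamped to maxSize. */
int paddedPBufferSize(int requested, int maxSize)
{
    int size = 1;
    while (requested > size) {
        if (size >= maxSize)
            return maxSize;
        size *= 2;
    }
    return (size > maxSize) ? maxSize : size;
}

/* Returns 1 if a pbuffer created at createdWidth x createdHeight is too
   small for the width x height area we now want to render, else 0. */
int pbufferNeedsRealloc(int createdWidth, int createdHeight,
                        int width, int height)
{
    return (width > createdWidth) || (height > createdHeight);
}
```

<p>Rounding up to a power of two means each reallocation at least doubles the pbuffer's size. A steadily growing object then only triggers a handful of them.</p>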

<p>Destroying a pbuffer immediately after using it as a texture also seems to cause rendering problems, as if the surface is destroyed before the renderer uses it. Waiting a little while, for example until the next time you need to render, seems to prevent this.</p>
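<p>One way to do that waiting is to queue pbuffers for destruction and only destroy them once a later frame has started. Here's a hypothetical sketch that isn't part of PBufferUtils. queueDestroy() records the frame an object was retired in. popReadyDestroy() hands back objects queued in an earlier frame, which you'd then pass to PBuffer_Destroy(). The frame number is assumed to come from a counter your app increments once per frame.</p>

```c
/* Hypothetical deferred-destruction queue, not from the original code. */
#define MAX_PENDING_DESTROYS 16

typedef struct {
    void* object;
    unsigned long frameQueued;
} PendingDestroy;

typedef struct {
    PendingDestroy entries[MAX_PENDING_DESTROYS];
    int count;
} DestroyQueue;

/* Queues an object retired during 'frame'. Returns 0 if the queue is full,
   leaving it to the caller to decide what to do with the object. */
int queueDestroy(DestroyQueue* q, void* object, unsigned long frame)
{
    if (q->count == MAX_PENDING_DESTROYS)
        return 0;
    q->entries[q->count].object = object;
    q->entries[q->count].frameQueued = frame;
    q->count++;
    return 1;
}

/* Removes and returns one object queued before currentFrame,
   or returns 0 if nothing is ready to be destroyed yet. */
void* popReadyDestroy(DestroyQueue* q, unsigned long currentFrame)
{
    int i;
    for (i = 0; i != q->count; ++i) {
        if (currentFrame > q->entries[i].frameQueued) {
            void* ready = q->entries[i].object;
            /* Swap-remove: move the last entry into this slot */
            q->entries[i] = q->entries[q->count - 1];
            q->count--;
            return ready;
        }
    }
    return 0;
}
```

<p>At the start of each frame, you'd call popReadyDestroy() in a loop until it returns 0, casting each result back to PetePBuffer* before destroying it.</p>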

<p>Here's the code to go into PBufferUtils.h:<br />
------------<br />
#ifndef INCLUDE_PBUFFER_UTILS_H<br />
#define INCLUDE_PBUFFER_UTILS_H</p>

<p>#include &lt;OpenGL/gl.h&gt;<br />
#include &lt;OpenGL/OpenGL.h&gt;</p>

<p>enum EPBufferFlags<br />
{<br />
    ePBufferFlag_ZBuffer=(1<<0),<br />
};</p>

<p>typedef struct PetePBuffer_tag {<br />
    CGLPBufferObj pbuffer;<br />
    CGLContextObj pbufferContext;<br />
    CGLContextObj previousContext;<br />
    int width;<br />
    int height;<br />
    GLuint textureID;<br />
    bool needsClearing;<br />
    bool needsFlush;<br />
	int createdWidth;<br />
	int createdHeight;<br />
} PetePBuffer;</p>

<p>PetePBuffer* PBuffer_Create(CGLContextObj sharedContext,int nWidth,int nHeight,int colorDepth, int flags);<br />
void PBuffer_Destroy(PetePBuffer* pbuffer);</p>

<p>// Surround rendering with these to draw into the pbuffer, they handle pushing and popping the old<br />
// context for you<br />
void PBuffer_Begin(PetePBuffer* pbuffer);<br />
void PBuffer_End(PetePBuffer* pbuffer);</p>

<p>// Binds the pbuffer as a texture<br />
void PBuffer_Use(PetePBuffer* pbuffer);</p>

<p>#endif // INCLUDE_PBUFFER_UTILS_H<br />
------<br />
and here's the PBufferUtils.cpp code:<br />
------<br />
#include "PBufferUtils.h"</p>

<p>#include &lt;assert.h&gt;<br />
#include &lt;stdio.h&gt;</p>

<p>static void checkCGLErrorImplementation(CGLError error, const char* sourceFile, int sourceLine)<br />
{<br />
    if (error) {<br />
        const char* errStr;<br />
        errStr = CGLErrorString(error);<br />
        fprintf(stderr, "CGL Error: %s at %s:%d\n", errStr, sourceFile, sourceLine);<br />
        assert(false);<br />
    }<br />
}</p>

<p>#define checkCGLError(error) checkCGLErrorImplementation(error,__FILE__,__LINE__)</p>

<p>PetePBuffer* PBuffer_Create(CGLContextObj sharedContext,int width,int height,int colorDepth, int flags)<br />
{<br />
    PetePBuffer* pbuffer=new PetePBuffer;<br />
    pbuffer->pbuffer=NULL;<br />
    pbuffer->pbufferContext=NULL;<br />
    pbuffer->previousContext=NULL;<br />
    pbuffer->width=width;<br />
    pbuffer->height=height;<br />
    pbuffer->textureID=0;<br />
    pbuffer->needsClearing=true;<br />
    pbuffer->needsFlush=false;<br />
    pbuffer->createdWidth=width;<br />
    pbuffer->createdHeight=height;<br />
    <br />
    const int bitsPerPixel = (colorDepth*4);<br />
    const bool hasZBuffer = (flags&ePBufferFlag_ZBuffer);<br />
        <br />
    int i = 0;    <br />
    CGLPixelFormatAttribute pixelFormatAttributes[32];<br />
        <br />
    pixelFormatAttributes[i++] = kCGLPFAAccelerated;<br />
    pixelFormatAttributes[i++] = kCGLPFAWindow;<br />
    pixelFormatAttributes[i++] = kCGLPFAColorSize;    <br />
    pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)(bitsPerPixel);</p>

<p>    if (colorDepth>8)<br />
        pixelFormatAttributes[i++] = kCGLPFAColorFloat;</p>

<p>    if (hasZBuffer)<br />
    {<br />
        pixelFormatAttributes[i++] = kCGLPFADepthSize;<br />
        pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)(16);<br />
    }</p>

<p>    pixelFormatAttributes[i++] = (CGLPixelFormatAttribute)0;</p>

<p>    long numPixelFormats = 0;<br />
    CGLPixelFormatObj pixelFormat = NULL;<br />
    CGLError error = CGLChoosePixelFormat(pixelFormatAttributes, &pixelFormat, &numPixelFormats);<br />
    checkCGLError(error);</p>

<p>    error = CGLCreateContext(pixelFormat, sharedContext, &pbuffer->pbufferContext);<br />
    checkCGLError(error);<br />
    <br />
    CGLDestroyPixelFormat(pixelFormat);<br />
        <br />
    error = CGLCreatePBuffer(pbuffer->width,pbuffer->height,GL_TEXTURE_RECTANGLE_EXT,GL_RGBA,0,&pbuffer->pbuffer);<br />
    checkCGLError(error);</p>

<p>    return pbuffer;</p>

<p>}</p>

<p>void PBuffer_Destroy(PetePBuffer* pbuffer) {</p>

<p>    if (pbuffer!=NULL)<br />
    {<br />
        CGLDestroyPBuffer(pbuffer->pbuffer);<br />
        CGLDestroyContext(pbuffer->pbufferContext);<br />
    }</p>

<p>    delete pbuffer;</p>

<p>}</p>

<p>void PBuffer_Begin(PetePBuffer* pbuffer) {</p>

<p>    // Pete- check to ensure begin() hasn't already been called for this pbuffer<br />
    assert(pbuffer->previousContext==NULL); </p>

<p>    pbuffer->previousContext=CGLGetCurrentContext();</p>

<p>    long screen;<br />
    CGLError error=CGLGetVirtualScreen(pbuffer->previousContext,&screen);<br />
    assert(!error);</p>

<p>    error=CGLSetCurrentContext(pbuffer->pbufferContext);<br />
    checkCGLError(error);</p>

<p>    error=CGLSetPBuffer(pbuffer->pbufferContext,pbuffer->pbuffer,0,0,screen);<br />
    checkCGLError(error);</p>

<p>    // Reset the color mask first, since glClear() respects it<br />
    glColorMask(GL_TRUE,GL_TRUE,GL_TRUE,GL_TRUE);<br />
    glClearColor(0.0f,0.0f,0.0f,0.0f); <br />
    if (pbuffer->needsClearing)<br />
        glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);<br />
    glDisable(GL_DEPTH_TEST);<br />
    glDisable(GL_FRAGMENT_PROGRAM_ARB);</p>

<p>    // Set up a pixel-aligned orthographic projection on the projection matrix<br />
    glMatrixMode(GL_PROJECTION);<br />
    glLoadIdentity();<br />
    glOrtho(0, pbuffer->width, 0, pbuffer->height, -1.0, 1.0);<br />
    glMatrixMode(GL_MODELVIEW);<br />
    glLoadIdentity();</p>

<p>    glColor4f(1.0f,1.0f,1.0f,1.0f);<br />
	<br />
    glActiveTexture(GL_TEXTURE0);            </p>

<p>}</p>

<p>void PBuffer_End(PetePBuffer* pbuffer) {<br />
    <br />
    glFlush();</p>

<p>    pbuffer->needsFlush=true;</p>

<p>    assert(pbuffer->previousContext!=NULL);</p>

<p>    CGLError error=CGLSetCurrentContext(pbuffer->previousContext);<br />
    checkCGLError(error);<br />
    <br />
    pbuffer->previousContext=NULL;<br />
    <br />
}</p>

<p>void PBuffer_Use(PetePBuffer* pbuffer) {<br />
    <br />
    if (pbuffer->needsFlush)<br />
    {<br />
        CGLContextObj currentContext=CGLGetCurrentContext();</p>

<p>        CGLError error=CGLSetCurrentContext(pbuffer->pbufferContext);<br />
        checkCGLError(error);<br />
        <br />
        glFlush();<br />
        <br />
        error=CGLSetCurrentContext(currentContext);<br />
        checkCGLError(error);<br />
        <br />
        pbuffer->needsFlush=false;<br />
    }<br />
    <br />
    if (pbuffer->textureID==0)<br />
    {<br />
        CGLContextObj currentContext=CGLGetCurrentContext();<br />
    <br />
        glGenTextures(1,&pbuffer->textureID);<br />
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT,pbuffer->textureID);<br />
        glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);            <br />
        glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);            </p>

<p>        CGLError error=CGLTexImagePBuffer(currentContext,pbuffer->pbuffer,GL_FRONT_LEFT);<br />
        checkCGLError(error);<br />
    <br />
    }<br />
    else<br />
    {<br />
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT,pbuffer->textureID);<br />
    }<br />
    <br />
    glEnable(GL_TEXTURE_RECTANGLE_EXT);</p>

<p>}</p>]]>
</content>
</entry>
<entry>
<title>Emulating Bilinear Filtering</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/emulating_bilin.html" />
<modified>2005-05-18T01:24:12Z</modified>
<issued>2005-05-18T00:32:56Z</issued>
<id>tag:petewarden.com,2005:/notes/1.9</id>
<created>2005-05-18T00:32:56Z</created>
<summary type="text/plain">Most graphics cards that support float textures can&apos;t bilinear filter them. We can use a fragment program to do the filtering for us, but it&apos;s tricky to get right. Here&apos;s an example of how to do it:...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
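<![CDATA[<p>Most graphics cards that support float textures can't bilinear filter them. We can use a fragment program to do the filtering for us, but it's tricky to get right: texel centers sit at half-texel offsets, so the coordinates have to be shifted by half a texel before they're split into a texel position and the fractional blend weights. Here's an example of how to do it, reading from a texture rectangle so the coordinates are in texels:</p>]]>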
<![CDATA[<p><b>!!ARBfp1.0</b></p>

<p><b>ATTRIB inputCoords = fragment.texcoord[0];</b></p>

<p><b>TEMP sourceCoords;<br />
TEMP sourceCoordsTopLeft, sourceCoordsTopRight, sourceCoordsBottomLeft, sourceCoordsBottomRight, fraction;<br />
TEMP sourceTopLeft,sourceTopRight,sourceBottomLeft,sourceBottomRight;</b></p>

<p><b># Shift so texel centers fall on whole numbers, and split off the blend weights<br />
SUB sourceCoords, inputCoords, {0.5, 0.5, 0.0, 0.0};<br />
FRC fraction.xy, sourceCoords;<br />
# Snap to the center of the bottom-left texel of the 2x2 block<br />
SUB sourceCoords.xy, sourceCoords, fraction;<br />
ADD sourceCoords, sourceCoords, {0.5, 0.5, 0.0, 0.0};</b></p>

<p><b># Centers of the four texels around the sample point<br />
ADD sourceCoordsTopRight, sourceCoords, {1,1,0,0};<br />
ADD sourceCoordsTopLeft, sourceCoords, {0,1,0,0};<br />
ADD sourceCoordsBottomRight, sourceCoords, {1,0,0,0};<br />
MOV sourceCoordsBottomLeft, sourceCoords;</b></p>

<p><b>TEX sourceTopLeft, sourceCoordsTopLeft, texture[0], RECT;<br />
TEX sourceTopRight, sourceCoordsTopRight, texture[0], RECT;<br />
TEX sourceBottomLeft, sourceCoordsBottomLeft, texture[0], RECT;<br />
TEX sourceBottomRight, sourceCoordsBottomRight, texture[0], RECT;</b></p>

<p><b># LRP d, a, b, c computes a*b + (1-a)*c<br />
LRP sourceTopLeft, fraction.x, sourceTopRight, sourceTopLeft;<br />
LRP sourceBottomLeft, fraction.x, sourceBottomRight, sourceBottomLeft;<br />
LRP result.color, fraction.y, sourceTopLeft, sourceBottomLeft;</b></p>

<p><b>END</b><br />
</p>]]>
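<![CDATA[<p>When checking the program's output, it can help to have the same arithmetic on the CPU. Here is a minimal C++ sketch. It assumes a single-channel float image stored row by row, texel centers at half-integer coordinates as with a texture rectangle, and clamp-to-edge addressing for texels past the border. FloatImage and sampleBilinear are hypothetical names, not part of any API.</p>

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-channel float image, stored row by row with row 0 at
// the bottom.
struct FloatImage {
    int width;
    int height;
    std::vector<float> pixels;

    // Fetch one texel by integer index, clamping to the nearest edge.
    float texel(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Same steps as the fragment program. (u, v) are texture rectangle
// coordinates, so the center of texel (i, j) is at (i + 0.5, j + 0.5).
float sampleBilinear(const FloatImage& image, float u, float v) {
    // Shift so texel centers fall on whole numbers, then split into the
    // bottom-left texel index and the fractional blend weights.
    float sx = u - 0.5f;
    float sy = v - 0.5f;
    float fx = sx - std::floor(sx);
    float fy = sy - std::floor(sy);
    int x0 = static_cast<int>(std::floor(sx));
    int y0 = static_cast<int>(std::floor(sy));

    float bottomLeft = image.texel(x0, y0);
    float bottomRight = image.texel(x0 + 1, y0);
    float topLeft = image.texel(x0, y0 + 1);
    float topRight = image.texel(x0 + 1, y0 + 1);

    // The three LRPs: blend along x on each row, then along y.
    float top = fx * topRight + (1.0f - fx) * topLeft;
    float bottom = fx * bottomRight + (1.0f - fx) * bottomLeft;
    return fy * top + (1.0f - fy) * bottom;
}
```

<p>For a 2x2 image holding 0 and 1 in the bottom row and 2 and 3 in the top row, sampling at (1.0, 1.0), the corner shared by all four texels, weights each texel equally and gives 1.5. Sampling exactly at a texel center, such as (0.5, 0.5), returns that texel unchanged.</p>]]>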
</content>
</entry>
<entry>
<title>GPU Optical Flow</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/gpu_optical_flo.html" />
<modified>2005-05-18T00:32:04Z</modified>
<issued>2005-05-11T02:02:50Z</issued>
<id>tag:petewarden.com,2005:/notes/1.8</id>
<created>2005-05-11T02:02:50Z</created>
<summary type="text/plain">Optical flow analysis is heavily used by video and film processing apps to retime images smoothly, apply motion blur as a post-process, and guide the application of other processes. It&apos;s traditionally a slow operation, so I wanted to see if a...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>Optical flow analysis is heavily used by video and film processing apps to retime images smoothly, apply motion blur as a post-process, and guide the application of other processes. It's traditionally a slow operation, so I wanted to see if a GPU version was possible.</p>]]>
<![CDATA[<p>Optical flow analyses frames in a sequence and tries to map how each pixel is moving. All techniques assume that there's some meaningful movement in the sequence to extract; sequences of unrelated images or random noise won't give useful results.</p>

<p>The results you want are easy to visualize: when we see a moving image, we can tell that there are moving objects there, and roughly how fast and in what direction they're moving. A useful way to represent that is as a vector field, an image where each pixel has a velocity instead of a color, showing which direction the pixel in the underlying image is moving.</p>

<p><a href="http://www.caam.rice.edu/~zhang/caam699/opt-flow/horn81.pdf">Determining Optical Flow</a> by Horn &amp; Schunck lays out the classic method for figuring this out. I took it as a starting point for my implementation, since it seemed pretty simple and robust compared to more recent algorithms.</p>

<p>It took me some time to understand the basic ideas behind their algorithm, but once I did, they're remarkably elegant and straightforward. I'll try an informal explanation here to help anyone else whose eyes glaze over when confronted with a journal paper.</p>

<p>Start with the 1D case. You have a 1D wave that looks like this at time 0:<br />
<img src="http://petewarden.com/notes/images/Line1.png"><br />
Here's the same wave at frame 1:<br />
<img src="http://petewarden.com/notes/images/Line2.png"><br />
Visually, you can see that the line has moved to the left. If you subtract frame 0 from frame 1, you get this line:<br />
<img src="http://petewarden.com/notes/images/LineDifference.png"><br />
This is just the difference between the first and second frames. The crucial step is to multiply this against the slope of the original wave, to get an estimate of the movement. This gives you this wave:<br />
<img src="http://petewarden.com/notes/images/LineMovement.png"><br />
Why does this work? One way that I found to visualize it is to imagine you've got an infinite straight line, that's constantly sloping and sliding to the right over time. For any point that's on the line at the start of time, the line directly below it drops away at a constant speed: <br />
<img src="http://petewarden.com/notes/images/LineSlide.png"></p>

<p>The difference between two frames is like the distance from the point that was on the line to where the line is now below it. That distance is determined by three variables: the slope of the line, the time that the line's been sliding, and how fast it's been sliding. Since we know the slope, and how much time has passed between frames, we can divide those out and be left with the line's speed of movement.</p>
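<p>The 1D reasoning can be checked numerically. Below is a minimal C++ sketch, assuming a ramp-shaped signal so the slope is the same everywhere; estimateVelocity1D is a hypothetical helper, not code from the original project. The slope is a central difference averaged over both frames and measured per pixel. The sign is flipped because when the pattern moves towards positive x, each pixel takes on the value that was previously to its left.</p>

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Estimates per-pixel velocity for a 1D signal between two frames by
// dividing the frame difference by the slope. End pixels, and pixels where
// the slope is (nearly) flat, get zero: there is nothing to measure there.
std::vector<double> estimateVelocity1D(const std::vector<double>& frame0,
                                       const std::vector<double>& frame1) {
    std::vector<double> velocity(frame0.size(), 0.0);
    for (std::size_t x = 1; x + 1 < frame0.size(); ++x) {
        double difference = frame1[x] - frame0[x];
        // Central-difference slope per pixel, averaged over both frames.
        double slope = ((frame0[x + 1] - frame0[x - 1]) +
                        (frame1[x + 1] - frame1[x - 1])) / 4.0;
        if (std::fabs(slope) > 1e-9) {
            velocity[x] = -difference / slope;
        }
    }
    return velocity;
}
```

<p>For a ramp with slope 0.5 that slides two pixels to the right (frame1[x] == frame0[x - 2]), the difference at every interior pixel is -1 and the slope is 0.5, so the estimate is 2 pixels per frame.</p>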

<p>This simple example should also highlight a couple of limitations of optical flow analysis:<br />
- <b>You need a slope or gradient in the image for the analysis to get a grip on.</b> Flat areas of color in the center of objects can't reveal any movement.<br />
- <b>Movements need to be small compared to the size of the object.</b> The difference is applied to the gradient at each pixel; if that gradient comes from a different object than in the first frame, the result is nonsense. You need a slope that stays roughly the same over the distance the object moves between frames to get accurate results.</p>

<p>To extend the algorithm to two dimensions, take the difference between the two frames as before, calculate the x and y gradients at each pixel, and scale the gradient vector by the difference divided by the squared gradient magnitude. In one dimension this reduces to dividing the difference by the slope.</p>

<p>Here's some pseudo-code for the basic process:</p>

<p><b>(1) CurrentDifference = (Frame1[x][y]-Frame0[x][y]);</b></p>

<p><b>(2) GradientX = ((Frame1[x+1][y]-Frame1[x-1][y]) + (Frame0[x+1][y]-Frame0[x-1][y])) / 2.0;<br />
(3) GradientY = ((Frame1[x][y+1]-Frame1[x][y-1]) + (Frame0[x][y+1]-Frame0[x][y-1])) / 2.0; </b></p>

<p><b>(4) GradientMagnitudeSquared = (GradientX*GradientX)+(GradientY*GradientY); </b></p>

<p><b>(5) VelocityX = -CurrentDifference*(GradientX/GradientMagnitudeSquared);<br />
(6) VelocityY = -CurrentDifference*(GradientY/GradientMagnitudeSquared); </b></p>

<p>This is the heart of the technique. The minus sign is there because when the image moves towards positive x, each pixel takes on the value that was previously to its left, so a rising slope produces a negative difference. To improve the accuracy of the results, Horn and Schunck repeat the calculation, in order to eliminate spurious errors introduced by noise and to propagate movement across areas of flat color.</p>

<p>They use two techniques for this. The first deals with noise. Where the gradient is weak, dividing by it turns a small difference caused by high-frequency noise into a large, spurious velocity. To damp down the effect, a term usually called lambda is added to line 4:</p>

<p><b>(4) GradientMagnitudeSquared = (GradientX*GradientX)+(GradientY*GradientY)+Lambda; </b></p>

<p>Since the velocity is divided by this sum, lambda has little effect where the gradient is strong, but scales the velocity down where the gradient is weak. In practice, it's a noise filter: low values give you a very sensitive but noisy result, while high values filter out the noise but miss small objects.</p>

<p>The second technique tries to smooth out the result, removing isolated small areas that are moving in a different direction from their neighbors, and filling in areas with no gradient with movement data from nearby. This is where the iteration comes in: the vector field result is run through a blur kernel, and then the optical flow calculation is re-run, feeding the previous flow vector back in. This is how it looks in the code:</p>

<p><b>...<br />
(3.1) PreviousVector = PreviousResultBlurred[x][y];<br />
(3.2) PreviousDotGradient = (PreviousVector.x*GradientX)+(PreviousVector.y*GradientY);<br />
...<br />
(5) VelocityX = PreviousVector.x - (CurrentDifference+PreviousDotGradient)*(GradientX/GradientMagnitudeSquared);<br />
(6) VelocityY = PreviousVector.y - (CurrentDifference+PreviousDotGradient)*(GradientY/GradientMagnitudeSquared); </b></p>

<p>Visually, the dot product is high when the vector is going in the same direction as the previous result (which, being blurred, represents the average direction in the local neighborhood) and is negative when the vector is going against the local flow. Iterating over this rewards vectors that are in step with their neighbors, and minimises the vectors that are marching to a different drummer:<br />
<img alt="FlowVectors.png" src="http://petewarden.com/notes/images/FlowVectors.png" width="640" height="240" /></p>

<p>To do the blurring, I just run a small Gaussian kernel over the vector field between iterations.</p>
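<p>Putting the pieces together, here is a minimal CPU sketch of the iteration in C++: blur the previous flow field, then re-run the estimate with the blurred vector fed back in. It illustrates the math rather than the GPU implementation, and it makes a few assumptions of its own. Frames are single-channel float images of the same size, edge pixels are clamped, gradients are central differences averaged over both frames, lambda is added to the squared gradient magnitude, and a separable [1 2 1]/4 kernel stands in for the small Gaussian. All names are hypothetical.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Per-pixel flow vectors, stored row by row like the frames.
struct Flow {
    std::vector<float> vx;
    std::vector<float> vy;
};

// Clamp-to-edge fetch from a w x h image stored row by row.
static float pixelAt(const std::vector<float>& img, int w, int h, int x, int y) {
    x = std::min(std::max(x, 0), w - 1);
    y = std::min(std::max(y, 0), h - 1);
    return img[static_cast<std::size_t>(y) * w + x];
}

// One pass of a [1 2 1] / 4 kernel along (dx, dy): (1, 0) is horizontal,
// (0, 1) is vertical.
static std::vector<float> blurPass(const std::vector<float>& img, int w, int h,
                                   int dx, int dy) {
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            out[static_cast<std::size_t>(y) * w + x] =
                0.25f * pixelAt(img, w, h, x - dx, y - dy) +
                0.5f * pixelAt(img, w, h, x, y) +
                0.25f * pixelAt(img, w, h, x + dx, y + dy);
        }
    }
    return out;
}

// Separable blur: a horizontal pass, then a vertical pass.
static std::vector<float> blurField(const std::vector<float>& field, int w, int h) {
    return blurPass(blurPass(field, w, h, 1, 0), w, h, 0, 1);
}

// Iterative Horn-Schunck style estimate. lambda should be > 0 so that flat
// areas don't divide by zero.
Flow estimateFlow(const std::vector<float>& frame0, const std::vector<float>& frame1,
                  int w, int h, float lambda, int iterations) {
    Flow flow{std::vector<float>(frame0.size(), 0.0f),
              std::vector<float>(frame0.size(), 0.0f)};
    for (int i = 0; i < iterations; ++i) {
        // Blurred previous result: the local average flow (zero at first).
        std::vector<float> avgX = blurField(flow.vx, w, h);
        std::vector<float> avgY = blurField(flow.vy, w, h);
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                std::size_t p = static_cast<std::size_t>(y) * w + x;
                float difference = frame1[p] - frame0[p];
                float gx = (pixelAt(frame0, w, h, x + 1, y) - pixelAt(frame0, w, h, x - 1, y) +
                            pixelAt(frame1, w, h, x + 1, y) - pixelAt(frame1, w, h, x - 1, y)) / 4.0f;
                float gy = (pixelAt(frame0, w, h, x, y + 1) - pixelAt(frame0, w, h, x, y - 1) +
                            pixelAt(frame1, w, h, x, y + 1) - pixelAt(frame1, w, h, x, y - 1)) / 4.0f;
                float magnitudeSquared = gx * gx + gy * gy + lambda;
                if (magnitudeSquared <= 0.0f) {
                    // No gradient and no lambda: keep the neighborhood average.
                    flow.vx[p] = avgX[p];
                    flow.vy[p] = avgY[p];
                    continue;
                }
                float previousDotGradient = avgX[p] * gx + avgY[p] * gy;
                float scale = (difference + previousDotGradient) / magnitudeSquared;
                flow.vx[p] = avgX[p] - scale * gx;
                flow.vy[p] = avgY[p] - scale * gy;
            }
        }
    }
    return flow;
}
```

<p>As a quick check, a horizontal ramp shifted one pixel to the right between frames should come out at interior pixels with a flow of about one pixel per frame in x and zero in y, once a few iterations have cleaned up the first estimate.</p>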

<p>Once I understood the algorithm, it was not too hard to convert it over to a multi-pass sequence of fragment programs. In my implementation I take the two frames I'm interested in, and encode the luminance of each pixel into a 16 bit fixed point value, and store this value in two 8 bit channels in a 32 bit pbuffer, packing both images into a single surface.</p>
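<p>The packing step can be sketched on the CPU like this. It is a minimal C++ illustration, assuming luminance in the 0 to 1 range and a layout where the first frame's high and low bytes go in red and green and the second frame's in blue and alpha. The channel layout and the function names are assumptions; the post doesn't spell them out.</p>

```cpp
#include <cstdint>

// Hypothetical layout: the first frame's luminance in red (high byte) and
// green (low byte), the second frame's in blue and alpha.
struct PackedPixel {
    std::uint8_t r, g, b, a;
};

// Clamps a luminance to 0..1, quantizes it to 16 bit fixed point, and
// splits it into high and low bytes.
void packLuminance16(float luminance, std::uint8_t& high, std::uint8_t& low) {
    if (luminance < 0.0f) luminance = 0.0f;
    if (luminance > 1.0f) luminance = 1.0f;
    std::uint16_t fixed = static_cast<std::uint16_t>(luminance * 65535.0f + 0.5f);
    high = static_cast<std::uint8_t>(fixed >> 8);
    low = static_cast<std::uint8_t>(fixed & 0xFF);
}

// Rebuilds the 0..1 luminance from the two bytes.
float unpackLuminance16(std::uint8_t high, std::uint8_t low) {
    return static_cast<float>((high << 8) | low) / 65535.0f;
}

// Packs both frames' luminance for one pixel into a single RGBA value.
PackedPixel packFramePair(float luminance0, float luminance1) {
    PackedPixel pixel{};
    packLuminance16(luminance0, pixel.r, pixel.g);
    packLuminance16(luminance1, pixel.b, pixel.a);
    return pixel;
}
```

<p>On the GPU side the channels arrive normalized to 0..1, so the 16 bit value can be rebuilt as high * 256/257 + low / 257, since 65535 = 255 * 257.</p>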

<p>Inside my iteration loop, I then run a fragment program that contains the gradient calculation and flow estimation algorithm explained in lines (1) to (6), with the core of it looking like this:</p>

<p># left, right, bottom, top and center hold the luminance of the pixel and its neighbors,<br />
# from the first frame in the x component and from the second frame in z<br />
<b>SUB left, right, left;<br />
DP4 differential.x, left, {0.5, 0.0, 0.5, 0.0};<br />
SUB bottom, top, bottom;<br />
DP4 differential.y, bottom, {0.5, 0.0, 0.5, 0.0};</b></p>

<p><b>SUB differential.z, center.z, center.x;</b></p>

<p># velocity.x and .y are the previous velocity for the pixel, both 0 first pass through<br />
<b>MOV velocity.z, 1.0;<br />
DP3 flowError.x, differential, velocity;</b></p>

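<p># Replace the time difference in z with lambda, so the DP3 adds lambda squared to the squared gradient magnitude<br />
<b>MOV differential.z, lambda.x;<br />
DP3 differentialMagnitude.x, differential, differential;</b></p>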

<p><b>RCP differentialMagnitude.x, differentialMagnitude.x;<br />
MUL differentialScale.x, flowError.x, differentialMagnitude.x;<br />
MUL differentialScale.x, differentialScale.x, -1.0;</b></p>

<p><b>MAD velocity, differential, differentialScale.x, velocity;</b></p>

<p>I run two passes of a separable Gaussian blur convolution over the resulting vector field, one horizontal and one vertical, and then iterate again. I set lambda between 0 and 1, depending on the image, and I normally run between 2 and 10 iterations, again depending on my needs and the image quality. </p>]]>
</content>
</entry>
<entry>
<title>Random Numbers in Fragment Programs</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/random_numbers.html" />
<modified>2005-05-13T01:56:41Z</modified>
<issued>2005-05-11T01:06:37Z</issued>
<id>tag:petewarden.com,2005:/notes/1.7</id>
<created>2005-05-11T01:06:37Z</created>
<summary type="text/plain">It&amp;#039;s hard to write a fragment program that will calculate pseudo-random numbers. The usual random number algorithms need two things that we don&amp;#039;t have on our GPU&amp;#039;s:...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>It&#039;s hard to write a fragment program that will calculate pseudo-random numbers. The usual random number algorithms need two things that we don&#039;t have on our GPUs:</p>]]>
<![CDATA[<p><br /><b>State</b><br />
<br />A seed value is usually stored, and updated every time a random number is generated. There&#039;s no way to pass this between pixels: a fragment program has no global storage that can be updated to hold that kind of state. This means every pixel has to calculate its value independently.<br />
<br /><b>Logic Operations</b><br />
<br />Almost all generators rely on bitwise logic operations such as shifts and exclusive-ORs in their implementation. These operations can&#039;t be done in the ARB_fragment_program instruction set (though I hear there are some NVidia-specific extensions that allow a limited number).<br />
<br /><b>Alternatives</b><br />
<br />The best way around these problems is to off-load as much of the work as we can to the CPU. We can generate a small texture from a table of random values, and do texture reads to get random values. Since our reads must be based on values that are very non-random, such as linearly interpolated texture coordinates, we may need to combine together several levels of indirection to get random looking results over a large area, if we have a small table of random values. <br />
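<br />As a CPU-side illustration of chaining lookups, here is a minimal C++ sketch in the spirit of Perlin&#039;s 256 entry permutation table. The result of the first lookup offsets the second, which on the GPU would be one more dependent texture read. The shuffled table, the seed, and the names are illustrative assumptions, not Perlin&#039;s exact code.<br />

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>

using PermutationTable = std::array<std::uint8_t, 256>;

// A shuffled permutation of 0..255, standing in for the small table of
// random values that would be generated on the CPU and uploaded as a texture.
PermutationTable makePermutation(unsigned seed) {
    PermutationTable perm;
    for (int i = 0; i < 256; ++i) perm[i] = static_cast<std::uint8_t>(i);
    std::mt19937 rng(seed);
    std::shuffle(perm.begin(), perm.end(), rng);
    return perm;
}

// Pseudo-random value in [0, 1) for an integer grid point. The second
// lookup depends on the result of the first, so on the GPU each extra
// dimension costs another level of texture indirection.
float latticeRandom(const PermutationTable& perm, int x, int y) {
    int first = perm[x & 255];
    int second = perm[(first + y) & 255];
    return static_cast<float>(second) / 256.0f;
}
```

<br />Note that the output repeats every 256 grid cells in each direction, one source of the visible patterns these methods are prone to.<br />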
<br />Ken Perlin&#039;s classic approach to generating random values at grid points using a 256 entry table is a good example to look at. However, on ATI cards the multiple texture indirections needed can be a problem, since there&#039;s a limit of four levels in any program.<br />
<br /><b>Do it all on the CPU</b><br />
<br />Another alternative is to create a larger 2D texture full of random values. This is perfect if the random values don&#039;t need to change from frame to frame, but if they do need to animate, the CPU cost of generating and uploading the texture may be a problem. Depending on what it&#039;s being used for, the texture may be smaller than the area it needs to cover, and use tiling. In many cases, however, this will result in obvious patterns, something which is a constant danger with most of these methods. Careful design and tweaking are needed to minimize such ugliness.<br />
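<br />A minimal C++ sketch of the CPU side, assuming an RGBA float texture whose data would then be uploaded with glTexImage2D (the upload is omitted here); makeRandomTexture and tiledLookup are hypothetical names. The tiled lookup shows where the patterns come from: any two points a whole tile apart get the same value.<br />

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fills a width x height RGBA float texture with uniform random values,
// nominally in [0, 1). Regenerating this every frame is the CPU and upload
// cost that makes animation expensive.
std::vector<float> makeRandomTexture(int width, int height, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    std::vector<float> texels(static_cast<std::size_t>(width) * height * 4);
    for (float& value : texels) value = uniform(rng);
    return texels;
}

// Reads the red channel at pixel (x, y) of an area larger than the texture
// by wrapping the coordinates, like GL_REPEAT with nearest filtering.
// Negative coordinates wrap too.
float tiledLookup(const std::vector<float>& texels, int width, int height,
                  int x, int y) {
    int tx = ((x % width) + width) % width;
    int ty = ((y % height) + height) % height;
    return texels[(static_cast<std::size_t>(ty) * width + tx) * 4];
}
```
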
<br /><b>Do it all on the GPU</b><br />
<br />I did manage to create a fragment program that calculated usable random numbers from scratch, using only floating point. I enclose it below, though it failed in its main purpose. It works as intended on NVidia GPUs, because they have 32 bit floating point internal precision, but the 24 bit precision of ATI chips causes visible problems. Since the main goal of this was to avoid texture indirections on ATI cards, it&#039;s not very useful to me, but it was fun to write, so I thought I should share:<br />
<br /><b><br />
!!ARBfp1.0 <br />
# Based on an algorithm described by Francois Grieu, sci.crypt, 5th February 2004<br />
<font color=660000>ATTRIB</font> tex0 = fragment.texcoord[0]; <br />
<font color=660000>PARAM</font> outputMult = program.local[0]; <br />
<font color=660000>PARAM</font> bounds = program.local[1]; <br />
<font color=660000>PARAM</font> seed = program.local[2]; <br />
<font color=660000>PARAM</font> coordsOffset = { -100, 100, 0, 0 }; <br />
<font color=660000>PARAM</font> cMult = 0.0001002707309736288; <br />
<font color=660000>PARAM</font> aSubtract = 0.2727272727272727; <br />
<font color=660000>PARAM</font> coordMult0 = { 0.67676, 0.000058758, 0, 0 }; <br />
<font color=660000>PARAM</font> coordMult1 = { 0.0000696596, 0.797976, 0, 0 }; <br />
<font color=660000>PARAM</font> coordMult2 = { 0.587976, 0.0000233443, 0, 0 }; <br />
<br /><font color=660000>TEMP</font> tableCoord, a, b, c, floorA, seedCoords; <br />
<br /><font color=660000>ADD</font> seedCoords, tex0, coordsOffset; <br />
<br /># gFastRngA = (((currentX*multX)/(currentY*multY))+ <br />
<font color=660000>MUL</font> tableCoord, seedCoords, coordMult0; <br />
<font color=660000>RCP</font> tableCoord.y, tableCoord.y; <br />
<font color=660000>MUL</font> a.x, tableCoord.x, tableCoord.y; <br />
<br />#	(((height-currentY)*multX2)/((width-currentX)*multY2))+ <br />
<font color=660000>SUB</font> tableCoord, bounds, seedCoords; <br />
<font color=660000>MUL</font> tableCoord, tableCoord, coordMult1; <br />
<font color=660000>RCP</font> tableCoord.x, tableCoord.x; <br />
<font color=660000>MAD</font> a.x, tableCoord.x, tableCoord.y, a.x; <br />
<br />#	(((height-currentX)*multX3)/((width-currentY)*multY3))); <br />
<font color=660000>SUB</font> tableCoord.x, bounds.y, seedCoords.x; <br />
<font color=660000>SUB</font> tableCoord.y, bounds.x, seedCoords.y; <br />
<font color=660000>MUL</font> tableCoord, tableCoord, coordMult2; <br />
<font color=660000>RCP</font> tableCoord.y, tableCoord.y; <br />
<font color=660000>MAD</font> a.x, tableCoord.x, tableCoord.y, a.x; <br />
<br /># gFastRngA = fmod(gFastRngA,1); <br />
<font color=660000>FRC</font> a.x, a.x; <br />
<font color=660000>ADD</font> a.x, a.x, seed; <br />
<br /><font color=660000>MOV</font> c.x, 0; <br />
<font color=660000>MOV</font> b.x, 0; <br />
<br /># (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA)) <br />
<font color=660000>FRC</font> floorA.x, a.x; <br />
<font color=660000>SUB</font> floorA.x, a.x, floorA.x; <br />
<font color=660000>SUB</font> floorA.x, aSubtract.x, floorA.x; <br />
<font color=660000>ADD</font> floorA.x, floorA.x, a.x; <br />
<font color=660000>MAD</font> a.x, c.x, cMult.x, floorA.x; <br />
<br /># (gFastRngB += (gFastRngA *= gFastRngA)) <br />
<font color=660000>MUL</font> a.x, a.x, a.x; <br />
<font color=660000>ADD</font> b.x, b.x, a.x; <br />
<br /># (gFastRngC += (gFastRngB -= floor(gFastRngB))) <br />
<font color=660000>FRC</font> b.x, b.x; <br />
<font color=660000>ADD</font> c.x, c.x, b.x; <br />
<br /># (gFastRngC -= floor(gFastRngC)) <br />
<font color=660000>FRC</font> c.x, c.x; <br />
<br /># (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA)) <br />
<font color=660000>FRC</font> floorA.x, a.x; <br />
<font color=660000>SUB</font> floorA.x, a.x, floorA.x; <br />
<font color=660000>SUB</font> floorA.x, aSubtract.x, floorA.x; <br />
<font color=660000>ADD</font> floorA.x, floorA.x, a.x; <br />
<font color=660000>MAD</font> a.x, c.x, cMult.x, floorA.x; <br />
<br /># (gFastRngB += (gFastRngA *= gFastRngA)) <br />
<font color=660000>MUL</font> a.x, a.x, a.x; <br />
<font color=660000>ADD</font> b.x, b.x, a.x; <br />
<br /># (gFastRngC += (gFastRngB -= floor(gFastRngB))) <br />
<font color=660000>FRC</font> b.x, b.x; <br />
<font color=660000>ADD</font> c.x, c.x, b.x; <br />
<br /># (gFastRngC -= floor(gFastRngC)) <br />
<font color=660000>FRC</font> c.x, c.x; <br />
<br /># (gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA)) <br />
<font color=660000>FRC</font> floorA.x, a.x; <br />
<font color=660000>SUB</font> floorA.x, a.x, floorA.x; <br />
<font color=660000>SUB</font> floorA.x, aSubtract.x, floorA.x; <br />
<font color=660000>ADD</font> floorA.x, floorA.x, a.x; <br />
<font color=660000>MAD</font> a.x, c.x, cMult.x, floorA.x; <br />
<br /># (gFastRngB += (gFastRngA *= gFastRngA)) <br />
<font color=660000>MUL</font> a.x, a.x, a.x; <br />
<font color=660000>ADD</font> b.x, b.x, a.x; <br />
<br /># (gFastRngC += (gFastRngB -= floor(gFastRngB))) <br />
<font color=660000>FRC</font> b.x, b.x; <br />
<font color=660000>ADD</font> c.x, c.x, b.x; <br />
<br /># (gFastRngC -= floor(gFastRngC)) <br />
<font color=660000>FRC</font> c.x, c.x; <br />
<br /><font color=660000>MOV</font> result.color, c.x; <br />
<font color=660000>MOV</font> result.color.a, 1; <br />
<br /><font color=660000>END</font></b></p>]]>
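<![CDATA[<p>To compare GPU output against a reference, the program's arithmetic can be ported to C++. This is a sketch, assuming the coordinates, the bounds, and the seed are passed in as plain numbers; fastRandom is a hypothetical name. It uses ordinary division where the program uses RCP and MUL, so results can differ in the last bits. Evaluating it as float and as double gives a rough feel for how sensitive the output is to precision, though neither matches ATI&#039;s 24 bit format exactly.</p>

```cpp
#include <cmath>

// CPU port of the random number fragment program. x and y are the texture
// coordinates, width and height the program.local[1] bounds, and seed the
// program.local[2] value. Templated so it can run at float or double
// precision.
template <typename Real>
Real fastRandom(Real x, Real y, Real width, Real height, Real seed) {
    const Real cMult = Real(0.0001002707309736288);   // ~1/9973
    const Real aSubtract = Real(0.2727272727272727);  // ~3/11

    // ADD seedCoords, tex0, coordsOffset
    const Real sx = x - Real(100);
    const Real sy = y + Real(100);

    // The three coordinate ratios, summed into a.
    Real a = (sx * Real(0.67676)) / (sy * Real(0.000058758));
    a += ((height - sy) * Real(0.797976)) / ((width - sx) * Real(0.0000696596));
    a += ((height - sx) * Real(0.587976)) / ((width - sy) * Real(0.0000233443));

    // FRC a.x, a.x then ADD a.x, a.x, seed
    a = a - std::floor(a) + seed;

    Real b = Real(0);
    Real c = Real(0);
    // The program unrolls these steps three times.
    for (int round = 0; round < 3; ++round) {
        // gFastRngA += gFastRngC*(1./9973)+(3./11)-floor(gFastRngA)
        a = c * cMult + ((aSubtract - std::floor(a)) + a);
        // gFastRngB += (gFastRngA *= gFastRngA)
        a = a * a;
        b = b + a;
        // gFastRngC += (gFastRngB -= floor(gFastRngB))
        b = b - std::floor(b);
        c = c + b;
        // gFastRngC -= floor(gFastRngC)
        c = c - std::floor(c);
    }
    // The program writes c to the color channels, with alpha set to 1.
    return c;
}
```
]]>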
</content>
</entry>
<entry>
<title>Quick and Dirty Vectorization</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/quick_and_dirty.html" />
<modified>2005-05-13T01:58:01Z</modified>
<issued>2005-05-11T01:05:45Z</issued>
<id>tag:petewarden.com,2005:/notes/1.6</id>
<created>2005-05-11T01:05:45Z</created>
<summary type="text/plain">There aren&amp;#039;t many fast techniques for vectorizing an image into polygons that approximate the original, or that take advantage of 3d hardware, so here&amp;#039;s one I&amp;#039;ve used....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>There aren&#039;t many fast techniques for vectorizing an image into polygons that approximate the original, or that take advantage of 3D hardware, so here&#039;s one I&#039;ve used.<br />
</p>]]>
<![CDATA[<p><br />Take the original image, blur it to taste to remove noise and leave lots of smooth gradients.<br />
<br />Choose a function that will divide the image into separate regions. The simplest one would be:<br />
<br /><b>if (brightness&lt;50%)</b><br />
<b>    color = black;</b><br />
<b>else</b><br />
<b>    color = white;</b><br />
<br />Split the image into grid squares of some fairly small size. For each corner of a square, work out the brightness value of the pixel underneath it.<br />
<br />Use this value as the horizontal texture coordinate for a 1 dimensional texture that encodes the function you chose. For the example function, this could be a texture 1 pixel high and 100 pixels wide. The pixels from 0-49 would be black, and the ones from 50-99 white.<br />
<br />If you then draw each square as two triangles, with the texture coordinates you calculated and with the 1D texture applied, you&#039;ll get an image that looks vectorized.<br />
<br /><img src="http://petewarden.com/notes/images/VecDiagram.png" width=320 height=240 border=0 alt=''><br />
<br />You have to choose a blur amount and grid resolution manually. There are two ways to split a square into triangles, with the join running from top left to bottom right, or from top right to bottom left. Either one will cause zig-zagging along image edges that run at right angles to the chosen diagonal. It seems like it should be possible to detect this with some clever coding.<br />
<br />This technique is not &#039;real&#039; vectorization, but is useful for getting that kind of look. Though it&#039;s easiest with graphics hardware, the same approach is possible using a very simple software renderer.</p>]]>
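<![CDATA[<p>As a rough illustration of the software renderer route, here is a minimal C++ sketch. It assumes a grayscale image that has already been blurred, with brightness from 0 to 1, uses the 50% threshold example, and splits every square along its top-left to bottom-right diagonal. Instead of a 1D texture, it applies the region function directly to the brightness interpolated across each triangle, which is what nearest-neighbor sampling of the two-color texture amounts to. Gray, regionColor and vectorize are hypothetical names.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical grayscale image: brightness 0..1, stored row by row from the
// top.
struct Gray {
    int width;
    int height;
    std::vector<float> pixels;

    // Brightness at (x, y), clamped to the image edges so grid corners past
    // the right or bottom border still have a value.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// The region function from the text: black below 50% brightness, white
// otherwise.
float regionColor(float brightness) {
    return brightness < 0.5f ? 0.0f : 1.0f;
}

// Renders the "vectorized" version of image using grid squares of cell
// pixels (cell > 0), each split into two triangles.
Gray vectorize(const Gray& image, int cell) {
    Gray out{image.width, image.height,
             std::vector<float>(image.pixels.size(), 0.0f)};
    for (int py = 0; py < image.height; ++py) {
        for (int px = 0; px < image.width; ++px) {
            // Top-left corner of the grid square containing this pixel.
            int x0 = (px / cell) * cell;
            int y0 = (py / cell) * cell;
            // Brightness under the square's four corners.
            float c00 = image.at(x0, y0);
            float c10 = image.at(x0 + cell, y0);
            float c01 = image.at(x0, y0 + cell);
            float c11 = image.at(x0 + cell, y0 + cell);
            // Pixel center position inside the square, 0..1 on each axis.
            float u = (px - x0 + 0.5f) / cell;
            float v = (py - y0 + 0.5f) / cell;
            // Linear interpolation across whichever triangle holds the
            // pixel; the split runs from top left to bottom right.
            float brightness = (u >= v)
                ? c00 + u * (c10 - c00) + v * (c11 - c10)
                : c00 + v * (c01 - c00) + u * (c11 - c01);
            out.pixels[static_cast<std::size_t>(py) * image.width + px] =
                regionColor(brightness);
        }
    }
    return out;
}
```
]]>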
</content>
</entry>
<entry>
<title>Fragment Program Snippets</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/fragment_progra_1.html" />
<modified>2005-05-13T02:00:55Z</modified>
<issued>2005-05-11T01:04:57Z</issued>
<id>tag:petewarden.com,2005:/notes/1.5</id>
<created>2005-05-11T01:04:57Z</created>
<summary type="text/plain">Here&apos;s some instruction sequences for common operations:...</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>Here are some instruction sequences for common operations:<br />
</p>]]>
<![CDATA[<p>Compare against zero:<br />
<b><br />
<font color=660000>ABS</font> value.x, value.x;<br />
<font color=660000>SUB</font> value.x, 0, value.x;<br />
<font color=660000>CMP</font> result, value.x, nonZero, isZero;<br />
</b><br />
Compare against nearly zero (-epsilon&lt;x&lt;epsilon)<br />
<b><br />
<font color=660000>MAD</font> value.x, value.x, value.x, negativeEpsilonSquared;<br />
<font color=660000>CMP</font> result, value.x, isZero, nonZero;<br />
</b><br />
floor()<br />
<b><br />
<font color=660000>FRC</font> floor.x, value.x;<br />
<font color=660000>SUB</font> floor.x, value.x, floor.x;<br />
</b><br />
ceil()<br />
<b><br />
<font color=660000>FRC</font> ceil.x, value.x;<br />
<font color=660000>MUL</font> compare, ceil.x, -1;<br />
<font color=660000>SUB</font> ceil.x, 1, ceil.x;<br />
<font color=660000>CMP</font> ceil.x, compare, ceil.x, 0;<br />
<font color=660000>ADD</font> ceil.x, value.x, ceil.x;<br />
</b><br />
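<br />To sanity-check sequences like these on the CPU, the two instructions they lean on can be emulated in C++. In ARB_fragment_program, CMP d, a, b, c picks b where a is negative and c otherwise, and FRC returns x - floor(x). This is a minimal sketch with hypothetical helper names, applied to the compare, floor() and ceil() sequences above.<br />

```cpp
#include <cmath>

// CMP d, a, b, c: b where a is negative, c otherwise (per component on the
// GPU; one component here).
float cmp(float a, float b, float c) { return a < 0.0f ? b : c; }

// FRC: x - floor(x), in [0, 1) up to rounding.
float frc(float x) { return x - std::floor(x); }

// "Compare against zero": -|value| is negative for any nonzero value.
float compareZero(float value, float nonZero, float isZero) {
    return cmp(-std::fabs(value), nonZero, isZero);
}

// floor(): FRC then SUB.
float floorViaFrc(float value) { return value - frc(value); }

// ceil(): add (1 - fraction), unless the fraction is already zero.
float ceilViaFrc(float value) {
    float fraction = frc(value);
    float compare = -fraction;
    float step = cmp(compare, 1.0f - fraction, 0.0f);
    return value + step;
}
```
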
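The tiling snippets need a few common variables, which can be put in the program.local[] parameters:<br />
<b><br />
<font color=660000>PARAM</font> bounds = { textureWidth, textureHeight, 0, 0 };<br />
<font color=660000>PARAM</font> recipBounds = { 1/textureWidth, 1/textureHeight, 0, 0 };<br />
<font color=660000>PARAM</font> twiceBounds = { 2*textureWidth, 2*textureHeight, 0, 0 };<br />
<font color=660000>PARAM</font> recipTwiceBounds = { 1/(2*textureWidth), 1/(2*textureHeight), 0, 0 };<br />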
</b><br />
Tile texture coordinates by repeating:<br />
<b><br />
<font color=660000>MUL</font> coords, coords, recipBounds;<br />
<font color=660000>FRC</font> coords, coords;<br />
<font color=660000>MUL</font> coords, coords, bounds;<br />
</b><br />
Tile texture coordinates by mirroring:<br />
<b><br />
<font color=660000>MUL</font> coords, coords, recipTwiceBounds;<br />
<font color=660000>FRC</font> coords, coords;<br />
<font color=660000>MUL</font> coords, coords, twiceBounds;<br />
<font color=660000>SUB</font> coords, bounds, coords;<br />
<font color=660000>ABS</font> coords, coords;<br />
<font color=660000>SUB</font> coords, bounds, coords;<br />
</b><br />
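<br />The two tiling modes, written as plain C++ functions for one coordinate, make the arithmetic easier to follow: mirroring wraps into a range twice the texture size, then folds the second half back. This is a sketch and the names are hypothetical.<br />

```cpp
#include <cmath>

// Repeat tiling for one coordinate: scale to tile units, keep the
// fractional part, scale back. The result is in [0, bound).
float tileRepeat(float coord, float bound) {
    float t = coord / bound;
    return (t - std::floor(t)) * bound;
}

// Mirror tiling: wrap into [0, 2 * bound), then fold the second half back
// so the result runs 0..bound..0.
float tileMirror(float coord, float bound) {
    float twice = 2.0f * bound;
    float t = coord / twice;
    float wrapped = (t - std::floor(t)) * twice;
    return bound - std::fabs(bound - wrapped);
}
```
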
Clamp to a border:<br />
<b><br />
 # check if (x&lt;0) or (y&lt;0)<br />
<font color=660000>CMP</font> compareResult, coords, 0, 1;</p>

<p># Check if ((width-x)&lt;0) or ((height-y)&lt;0)<br />
<font color=660000>SUB</font> coords, bounds, coords;<br />
<font color=660000>CMP</font> compareResult, coords, 0, compareResult; </p>

<p># x and y are either 0 if out of range, or 1 if in. Multiply them<br />
# together so that the result is 0 if either is out of range, or<br />
# 1 if they&#039;re both inside. Nudge the result down so we can<br />
# compare easily, and test only .x so every channel takes the same branch<br />
<font color=660000>MAD</font> compareResult.x, compareResult.x, compareResult.y, -0.1;<br />
<font color=660000>CMP</font> color, compareResult.x, borderColor, inputColor;<br />
</b><br />
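<br />The border test reads more easily as plain C++. This is a sketch with a hypothetical Color type; it shows the final choice being made once, from the combined x and y result, and applied to all four channels.<br />

```cpp
// Hypothetical RGBA color type.
struct Color {
    float r, g, b, a;
};

// Returns borderColor when (x, y) lies outside [0, width] x [0, height],
// otherwise inputColor.
Color clampToBorder(float x, float y, float width, float height,
                    Color inputColor, Color borderColor) {
    // The two CMPs: 1 if inside on an axis, 0 if outside.
    float insideX = (x < 0.0f || width - x < 0.0f) ? 0.0f : 1.0f;
    float insideY = (y < 0.0f || height - y < 0.0f) ? 0.0f : 1.0f;
    // The MAD: product of the flags, nudged down so "outside" is negative.
    float compare = insideX * insideY - 0.1f;
    // The final CMP, using a single scalar for all four channels.
    return compare < 0.0f ? borderColor : inputColor;
}
```
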
Convert a straight alpha color to premultiplied with black:<br />
<b><br />
<font color=660000>MUL</font> premultColor, straightColor, straightColor.a;<br />
<font color=660000>MOV</font> premultColor.a, straightColor.a;<br />
</b><br />
Convert from premultiplied with black to straight alpha:<br />
<b><br />
<font color=660000>ADD</font> premultColor.a, premultColor.a, 0.0000001;<br />
<font color=660000>RCP</font> recipAlpha.x, premultColor.a;<br />
<font color=660000>MUL</font> straightColor, premultColor, recipAlpha.x;<br />
<font color=660000>MOV</font> straightColor.a, premultColor.a;<br />
</b><br clear="all"></p>]]>
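<![CDATA[<p>The two alpha conversions, as a minimal C++ sketch with a hypothetical Rgba type. The small epsilon mirrors the ADD in the second snippet: it keeps the reciprocal finite when alpha is zero, at the cost of a tiny change to alpha itself.</p>

```cpp
// Hypothetical RGBA color type.
struct Rgba {
    float r, g, b, a;
};

// Straight alpha to premultiplied with black: scale the color by alpha.
Rgba premultiply(Rgba straight) {
    return Rgba{straight.r * straight.a, straight.g * straight.a,
                straight.b * straight.a, straight.a};
}

// Premultiplied back to straight alpha. The epsilon keeps the reciprocal
// finite when alpha is zero; as in the snippet, the returned alpha includes it.
Rgba unpremultiply(Rgba premultiplied) {
    float alpha = premultiplied.a + 0.0000001f;
    float recipAlpha = 1.0f / alpha;
    return Rgba{premultiplied.r * recipAlpha, premultiplied.g * recipAlpha,
                premultiplied.b * recipAlpha, alpha};
}
```
]]>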
</content>
</entry>
<entry>
<title>Fragment Programs-ATI R3xx vs NVidia NV34</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/fragment_progra.html" />
<modified>2005-05-13T02:03:50Z</modified>
<issued>2005-05-11T01:03:01Z</issued>
<id>tag:petewarden.com,2005:/notes/1.4</id>
<created>2005-05-11T01:03:01Z</created>
<summary type="text/plain">ATI and NVidia cards have some differences it&apos;s important to know about when you&apos;re writing fragment programs to run on both....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>ATI and NVidia cards have some differences it's important to know about when you're writing fragment programs to run on both.</p>]]>
<![CDATA[<p><b>The ATI cards have a limit of four levels of texture indirection.</b><br />
A texture indirection is a <font color=660000>TEX</font> instruction whose texture coordinates are calculated, rather than constant or passed in through a glMultiTexCoord call. In practice, a four-level limit means you can have a chain of up to four texture reads, each read after the first using coordinates from the previous instruction.<br />
<b><br />
<font color=660000>TEX</font> firstValue, fragment.texcoord[0], texture[0]; # First indirection<br />
<font color=660000>TEX</font> secondValue, firstValue, texture[0]; # Second indirection<br />
<font color=660000>TEX</font> thirdValue, secondValue, texture[0]; # Third indirection<br />
<font color=660000>TEX</font> fourthValue, thirdValue, texture[0]; # Fourth indirection<br />
</b><br />
The spec to <a href="http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt" target="_blank">ARB_fragment_program</a> explains in more detail what an indirection is.</p>

<p>When we hit the limit on indirections, we&#039;d see a black output. We&#039;d either try to rewrite to avoid the indirections, or break the shader into multiple passes.</p>

<p><b><font color=660000>RCP</font> on zero gives different results.</b><br />
ATI cards return a very large number when you divide by zero. NVidia cards return plus or minus infinity, which infects subsequent calculations and can turn into NaN, for example when it is multiplied by zero.</p>

<p>We preferred ATI&#039;s behavior: it wasn&#039;t mathematically correct, but it was usually what we wanted. To emulate it on the NV34 we had two approaches:</p>

<p>- Detect the zero denominator before the operation, and then replace the result with something sensible:<br />
<b><br />
    <font color=660000>ABS</font> value.x, denominator.x;<br />
    <font color=660000>SUB</font> value.x, 0, value.x;<br />
    <font color=660000>RCP</font> reciprocal.x, denominator.x;<br />
    <font color=660000>CMP</font> reciprocal.x, value.x, reciprocal.x, someValueYouWant.x;<br />
</b><br />
- Don&#039;t try to detect the zero, just nudge all numbers up a bit. This only works for inputs you know will be non-negative (e.g. texture samples), and loses a little precision, but often this is not significant.<br />
<b><br />
    <font color=660000>ADD</font> denominator.x, denominator.x, 0.0000001;<br />
</b><br />
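<br />Both workarounds can be tried out on the CPU. Under IEEE 754 arithmetic, 1/0 is infinity and infinity times zero is NaN, which matches the NVidia behavior described above. This C++ sketch mirrors the two snippets; safeReciprocal and nudgedReciprocal are hypothetical names, and fallback plays the role of someValueYouWant. Unlike the fragment program, it only divides when the denominator is nonzero, since dividing by zero is undefined behavior in standard C++.<br />

```cpp
#include <cmath>

// First workaround: -|d| is negative for any nonzero denominator, so the
// comparison picks the real reciprocal there and the fallback at zero.
// On the GPU, RCP runs unconditionally and CMP discards the bad result;
// here the division only happens when it is safe.
float safeReciprocal(float denominator, float fallback) {
    float value = -std::fabs(denominator);
    return value < 0.0f ? 1.0f / denominator : fallback;
}

// Second workaround: nudge a known non-negative denominator away from zero.
float nudgedReciprocal(float denominator) {
    return 1.0f / (denominator + 0.0000001f);
}
```
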
<b>Inexact texture coordinates.</b><br />
It&#039;s been hard to get an exact one-to-one texel-to-pixel mapping on ATI cards. We want to use bilinear filtering in general, but the sampling errors can introduce slight blurring to the image. NVidia cards show much smaller errors. Where possible we switched to nearest-neighbor sampling, and we also considered gridding up each large quad into smaller polygons to minimise the errors.</p>

<p><b>The NV34 doesn&#039;t support a floating point pixel format.</b><br />
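For a high precision accumulation buffer, you can get 16 bit fixed point precision from two 8 bit RGBA buffers: split each 16 bit channel into an upper and a lower half, and store the two halves in adjacent channels of one buffer, so that each buffer holds two of the 16 bit channels.<br />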
<img src="http://petewarden.com/notes/images/VendorDiagram1.png" width=340 height=240 border=0 alt=''><br />
Since you can only write to one pbuffer at a time, you need to run two passes. Here&#039;s a fragment program to implement the red/green pass:</p>

<p><b><br />
!!ARBfp1.0<br />
# Emulate a 16 bit accumulation buffer for red and green channels<br />
# author - Pete Warden<br />
<br /><font color=660000>ATTRIB</font> tex0 = fragment.texcoord[0];<br />
<font color=660000>ATTRIB</font> tex1 = fragment.texcoord[1];<br />
<br /><font color=660000>PARAM</font> unpackScale = { 255.0, 1.0, 255.0, 1.0 };<br />
<font color=660000>PARAM</font> packScale = { 0.0039216, 1.0, 0.0039216, 1.0 };<br />
<br /><font color=660000>TEMP</font> s0, s1, s2;<br />
<br /># Fetch the texel to be added<br />
<font color=660000>TEX</font> s0, tex0, texture[0], RECT;<br />
# Fetch the current total from the first 8 bit buffer<br />
<font color=660000>TEX</font> s1, tex0, texture[1], RECT;<br />
<br /># Multiply the upper half of each channel by 255<br />
<font color=660000>MUL</font> s1, s1, unpackScale;<br />
<br /># Add on the lower half to the upper<br />
<font color=660000>ADD</font> s1.r, s1.r, s1.g;<br />
<font color=660000>ADD</font> s1.g, s1.b, s1.a;<br />
<br /># Do the accumulation of the current input onto the total<br />
<font color=660000>ADD</font> s1,s0,s1;<br />
<br /># Put the lower half of each channel into the result<br />
<font color=660000>FRC</font> s2.g, s1.r;<br />
<font color=660000>FRC</font> s2.a, s1.g;<br />
<br /># Calculate the upper half of both channels, rounded down<br />
<font color=660000>SUB</font> s2.r, s1.r, s2.g;<br />
<font color=660000>SUB</font> s2.b, s1.g, s2.a;<br />
<br /># Compress the channels into a 0-1 range to store in the 8 bit buffer<br />
<font color=660000>MUL</font> result.color, s2, packScale;<br />
<br /><font color=660000>END</font><br />
</b><br />
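</p>

<p><b>With the fixed-function pipeline, the NV34 only passes 4 texture coordinates through to the fragment program.</b><br />
The solution is to write a vertex program that does the fixed-function vertex transform but passes all 8 texture coordinates straight through (no texture matrices are applied), so the fragment program sees all 8, as it does on ATI:<br />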
<b><br />
!!ARBvp1.0<br />
# Passthru vertex program<br />
<font color=660000>ATTRIB</font> vertexPosition  = vertex.position;<br />
<font color=660000>OUTPUT</font> outputPosition  = result.position;<br />
<br /><font color=660000>DP4</font>    outputPosition.x, state.matrix.mvp.row[0], vertexPosition;<br />
<font color=660000>DP4</font>    outputPosition.y, state.matrix.mvp.row[1], vertexPosition;<br />
<font color=660000>DP4</font>    outputPosition.z, state.matrix.mvp.row[2], vertexPosition;<br />
<font color=660000>DP4</font>    outputPosition.w, state.matrix.mvp.row[3], vertexPosition;<br />
<br /><font color=660000>MOV</font>    result.color, vertex.color;<br />
<font color=660000>MOV</font>    result.texcoord[0], vertex.texcoord[0];<br />
<font color=660000>MOV</font>    result.texcoord[1], vertex.texcoord[1];<br />
<font color=660000>MOV</font>    result.texcoord[2], vertex.texcoord[2];<br />
<font color=660000>MOV</font>    result.texcoord[3], vertex.texcoord[3];<br />
<font color=660000>MOV</font>    result.texcoord[4], vertex.texcoord[4];<br />
<font color=660000>MOV</font>    result.texcoord[5], vertex.texcoord[5];<br />
<font color=660000>MOV</font>    result.texcoord[6], vertex.texcoord[6];<br />
<font color=660000>MOV</font>    result.texcoord[7], vertex.texcoord[7];<br />
<br /><font color=660000>END</font><br />
</b></p>]]>
</content>
</entry>
<entry>
<title>Mac OpenGL contexts</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/mac_opengl_cont.html" />
<modified>2005-05-13T02:04:56Z</modified>
<issued>2005-05-11T01:02:30Z</issued>
<id>tag:petewarden.com,2005:/notes/1.3</id>
<created>2005-05-11T01:02:30Z</created>
<summary type="text/plain">A context is a container for the GL state information, things like the current color, transformation matrix, texture ID and anything else that needs to be remembered by OpenGL....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>A <b>context</b> is a container for the GL state information: things like the current color, the transformation matrix, the bound texture ID and anything else that needs to be remembered by OpenGL.<br />
</p>]]>
<![CDATA[<p>One of the things the context remembers is which frame buffer you&#039;re drawing to, known as the attached <b>surface</b>. The key thing to understand is that a context is not the same as a surface: a context is just a housekeeping structure holding the information GL needs to remember, while a surface is what you actually draw into. You can change which surface a context draws into (for example, swapping the back and front buffers points the context to a different surface), or not attach it to any surface at all.</p>

<p>Normally, texture IDs are local to each context:</p>

<p><img src="http://petewarden.com/notes/images/ContextDiagram1.png" width=320 height=240 border=0 alt=''></p>

<p>If you&#039;ve created a texture in one context, you can&#039;t use it in another. To share textures, you need to use a <b>shared context</b>.</p>

<p>Shared contexts are places to store texture IDs, along with the IDs for other objects such as display lists and programs. Creating a context with a shared context means using that shared context to store those IDs, and the information about the objects they refer to.</p>

<p><img src="http://petewarden.com/notes/images/ContextDiagram2.png" width=320 height=240 border=0 alt=''></p>

<p>A shared context doesn&#039;t even need to be attached to a surface, since its only purpose is to store all those IDs. The easiest way to use one is to create a single global shared context when your app starts, leave it unattached to any window or pbuffer, and pass it as the share context whenever you create a context.</p>

<p>Once you&#039;ve done that, all of the texture IDs you create in <i>any</i> of your contexts will be usable in <i>all</i> of them, and you never need to think about the shared context again.</p>

<p>You can use OpenGL on the Mac in a <b>multithreaded</b> app, but you must <i>never</i> call GL functions on the same context from different threads at the same time.</p>

<p>Think of a GL context like a C++ object: if you call a member function on an object whilst another thread is halfway through another member function on the same object, you may end up with a corrupted object. Corrupting a context through a threading mistake will often cause a kernel panic.</p>
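<p>One way to make that rule hard to break is to give each context its own lock and hold it around every batch of GL calls made on that context. The C sketch below illustrates the idea with POSIX threads; GuardedContext, guarded_draw and render_thread are hypothetical names, and the call counter stands in for real context state and real GL calls. (CGL also provides CGLLockContext and CGLUnlockContext for locking a context.)</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a GL context: the counter plays the role of
   the context state that unsynchronized calls would corrupt. */
typedef struct {
    pthread_mutex_t lock;
    long calls;
} GuardedContext;

void guarded_context_init(GuardedContext *ctx) {
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->calls = 0;
}

/* Every batch of "GL calls" happens while holding the context's lock,
   so two threads can never be inside the same context at once. */
void guarded_draw(GuardedContext *ctx, int call_count) {
    pthread_mutex_lock(&ctx->lock);
    for (int i = 0; i < call_count; i++) {
        ctx->calls++;  /* stand-in for a gl*() call on this context */
    }
    pthread_mutex_unlock(&ctx->lock);
}

/* A worker thread that renders 1000 "frames" of 10 calls each. */
void *render_thread(void *arg) {
    GuardedContext *ctx = arg;
    for (int frame = 0; frame < 1000; frame++) {
        guarded_draw(ctx, 10);
    }
    return NULL;
}
```

<p>Locking per batch rather than per call keeps the overhead down; the important part is that the lock belongs to the context, not to a particular thread.</p>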

<p>If you&#039;re writing a multi-threaded GL app, make sure you periodically run the OpenGL Profiler with &#039;break on thread errors&#039; enabled. This will stop the app and give a stack trace whenever two threads call into the same context. To find out what the other threads are doing, attach the command-line gdb to your app with &#039;gdb -p&#039; followed by its process ID, and run &#039;thread apply all bt&#039; to get a backtrace of every thread.</p>

<p><b>CGLMacro.h</b> exposes some of the implementation details of Mac OpenGL calls. Normally, every <b>gl*()</b> call actually ends up doing something like a <b>CGLGetCurrentContext()</b> call internally, and then passing that on to the function that does the actual work.</p>

<p><b>CGLGetCurrentContext()</b> has to do some magic to figure out which thread it&#039;s in and return the correct context for that thread, so if you know the current context isn&#039;t changing and you&#039;re making a lot of GL calls, it&#039;s a good optimisation to cache it. CGLMacro.h lets you do this while changing very little of your code: you declare a local variable holding the CGL context (by default it must be named <b>cgl_ctx</b>), and including the header in a source file automatically switches all GL calls in that file to use that variable instead of the standard current-context lookup.<br clear="all"></p>]]>
</content>
</entry>
<entry>
<title>Straight Alpha and Bilinear Filtering</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/straight_alpha.html" />
<modified>2005-05-13T02:06:43Z</modified>
<issued>2005-05-11T01:00:56Z</issued>
<id>tag:petewarden.com,2005:/notes/1.2</id>
<created>2005-05-11T01:00:56Z</created>
<summary type="text/plain">There&amp;#039;s a subtle problem using graphics hardware to do bilinear filtering on straight alpha images....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p>There&#039;s a subtle problem using graphics hardware to do bilinear filtering on straight alpha images.<br />
</p>]]>
<![CDATA[<p>Take an image with a single pixel of full white, completely opaque, surrounded by fully transparent pixels with black in their RGB channels:</p>

<p><img src="http://petewarden.com/notes/images/BilinearDiagram1.png" width=320 height=240 border=0 alt=''></p>

<p>If you use bilinear filtering to get color and alpha values for a point halfway between the white pixel and one of the transparent black ones, this is what you end up with:</p>

<p><img src="http://petewarden.com/notes/images/BilinearDiagram2.png" width=320 height=240 border=0 alt=''></p>

<p>If you render this texture scaled up using the standard blending equation for straight alpha, glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA), then you&#039;ll see black fringing around the white pixel.</p>
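<p>The arithmetic behind the fringe can be written out directly. This C sketch assumes colors in the 0-1 range and tracks a single gray channel plus alpha for brevity; it filters halfway between the opaque white texel and a transparent black one, then applies the straight-alpha blend against a background. The type and function names are illustrative.</p>

```c
#include <assert.h>

typedef struct { float gray, alpha; } Texel; /* one color channel plus alpha */

/* Linear interpolation of color and alpha, as bilinear filtering does. */
Texel lerp_texel(Texel a, Texel b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    Texel r = { a.gray + (b.gray - a.gray) * t, a.alpha + (b.alpha - a.alpha) * t };
    return r;
}

/* glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) applied to one channel. */
float blend_straight(Texel src, float dst) {
    return src.gray * src.alpha + dst * (1.0f - src.alpha);
}
```

<p>Over a black background, a half-covered white pixel should come out at 0.5, but the straight-alpha result is 0.25; over white it is 0.75 instead of 1.0. Either way the edge ends up darker than it should be, which is the black fringe.</p>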

<p>In games this is usually not a problem, because you can make sure any zero-alpha pixels in your artwork have the correct straight color values.</p>

<p>Where this does become an issue is when the image is generated as an intermediate stage of an image processing pipeline. For example, if you read in images that are premultiplied with black and convert them to straight alpha for the pipeline, you have no way of telling what the correct color values for the fully transparent pixels are.</p>

<p>Another example is drawing a polygon into a pbuffer or render texture. Nothing&#039;s drawn into the fully transparent areas, and there&#039;s no definitive way of picking a color to put in those pixels.</p>

<p>We looked at several solutions to this:</p>

<p>- <b>Do a processing pass that averages the RGB values of the non-transparent neighbors of each zero-alpha pixel and replaces the existing color with that, or some other way of &#039;growing&#039; good RGB values into the zero-alpha areas.</b></p>

<p>Though this would probably do a decent job of removing the fringing in real world images, I could still construct cases where this would fail, such as a zero alpha pixel between black and white opaque pixels:</p>

<p><img src="http://petewarden.com/notes/images/BilinearDiagram3.png" width=320 height=240 border=0 alt=''></p>

<p>This would mean an extra pass over all of our intermediate textures, which would hit performance. </p>
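<p>A minimal version of the growing idea might look like the C sketch below. It simplifies to a single row and a single gray channel (a real pass would look at 2D neighbors), and the type and function names are hypothetical. It also reproduces the failure case above: a transparent pixel between opaque black and opaque white picks up a 50% gray.</p>

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float gray, alpha; } Pixel;

/* For every zero-alpha pixel, replace its color with the average color of
   its non-transparent left/right neighbors. Reads from src and writes to dst
   so that newly grown colors don't feed back into the same pass. */
void grow_colors_row(const Pixel *src, Pixel *dst, size_t count) {
    assert(src != dst);
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
        if (src[i].alpha != 0.0f) continue;
        float sum = 0.0f;
        int n = 0;
        if (i > 0 && src[i - 1].alpha > 0.0f) { sum += src[i - 1].gray; n++; }
        if (i + 1 < count && src[i + 1].alpha > 0.0f) { sum += src[i + 1].gray; n++; }
        if (n > 0) dst[i].gray = sum / (float)n;  /* alpha stays zero */
    }
}
```

<p>In the single white pixel case, the transparent neighbor becomes white with zero alpha, so filtering between them gives white at 0.5 alpha and the straight-alpha blend no longer darkens the edge.</p>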

<p>- <b>Ignore any pixels with zero alpha when doing the bilinear filtering</b></p>

<p>Before each lerp in the bilinear process, check whether either of the two pixels being lerped between has zero alpha, and if so use the other pixel as the result instead of lerping between them.</p>

<p>In pseudo-code, here&#039;s the original bilinear process:<br />
<b><br />
topColor = lerp(topLeftNeighbor,topRightNeighbor, fractionalX);<br />
bottomColor = lerp(bottomLeftNeighbor,bottomRightNeighbor, fractionalX);<br />
result = lerp(topColor, bottomColor, fractionalY);<br />
</b><br />
and here&#039;s the version that checks for zero alpha:<br />
<b><br />
if (topLeftNeighbor.alpha==0)<br />
    topColor = topRightNeighbor;<br />
else if (topRightNeighbor.alpha==0)<br />
    topColor = topLeftNeighbor;<br />
else<br />
    topColor = lerp(topLeftNeighbor,topRightNeighbor, fractionalX);<br />
    <br />
if (bottomLeftNeighbor.alpha==0)<br />
    bottomColor = bottomRightNeighbor;<br />
else if (bottomRightNeighbor.alpha==0)<br />
    bottomColor = bottomLeftNeighbor;<br />
else<br />
    bottomColor = lerp(bottomLeftNeighbor,bottomRightNeighbor, fractionalX);<br />
    <br />
if (topColor.alpha==0)<br />
    result = bottomColor;<br />
else if (bottomColor.alpha==0)<br />
    result = topColor;<br />
else<br />
    result = lerp(topColor, bottomColor, fractionalY);<br />
</b><br />
This fixes the problem, but it needs four separate texture fetches (rather than one hardware-filtered fetch) when implemented as an OpenGL fragment program, as well as being rather inelegant.</p>
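<p>Here is the same zero-alpha-aware filter as compilable C. It is a sketch that treats each texel as a gray value plus alpha for brevity, and the helper names are hypothetical; a fragment program version would instead fetch the four neighbors explicitly.</p>

```c
#include <assert.h>

typedef struct { float gray, alpha; } Texel;

Texel lerp_texel(Texel a, Texel b, float t) {
    Texel r = { a.gray + (b.gray - a.gray) * t, a.alpha + (b.alpha - a.alpha) * t };
    return r;
}

/* Lerp that uses the other end point instead of blending with a fully
   transparent one (same branch order as the pseudo-code above). */
Texel lerp_skip_clear(Texel a, Texel b, float t) {
    if (a.alpha == 0.0f) return b;
    if (b.alpha == 0.0f) return a;
    return lerp_texel(a, b, t);
}

Texel bilinear_skip_clear(Texel top_left, Texel top_right,
                          Texel bottom_left, Texel bottom_right,
                          float frac_x, float frac_y) {
    assert(frac_x >= 0.0f && frac_x <= 1.0f && frac_y >= 0.0f && frac_y <= 1.0f);
    Texel top = lerp_skip_clear(top_left, top_right, frac_x);
    Texel bottom = lerp_skip_clear(bottom_left, bottom_right, frac_x);
    return lerp_skip_clear(top, bottom, frac_y);
}
```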

<p>- <b>Change the pipeline to use premultiplied-with-black alpha</b></p>

<p>Using glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) lets you render textures premultiplied with black correctly, and bilinear filtering then does the right thing. In the original example, the sample halfway between the white pixel and its transparent neighbor comes out as 0.5 in every channel, which is exactly full white premultiplied by an alpha of 0.5, so the dark fringing disappears.</p>
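<p>Working through the white pixel example in premultiplied form shows why. In this C sketch (one gray channel plus alpha, illustrative names), the gray value is already multiplied by alpha, and the filtered sample is blended with glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) semantics.</p>

```c
#include <assert.h>

typedef struct { float gray, alpha; } Texel; /* gray is premultiplied by alpha */

/* Bilinear filtering interpolates premultiplied color and alpha alike. */
Texel lerp_texel(Texel a, Texel b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    Texel r = { a.gray + (b.gray - a.gray) * t, a.alpha + (b.alpha - a.alpha) * t };
    return r;
}

/* glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) applied to one channel. */
float blend_premultiplied(Texel src, float dst) {
    return src.gray + dst * (1.0f - src.alpha);
}
```

<p>Half-covered white over black comes out at 0.5, and over white at 1.0, which is exactly what 50% coverage should give, so there is no dark edge.</p>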

<p>This is what we did. Even though it seems to go against the grain of the straight alpha normally used with graphics cards, unlike the other solutions it didn&#039;t hurt performance.</p>]]>
</content>
</entry>
<entry>
<title>Animating Worley Noise</title>
<link rel="alternate" type="text/html" href="http://petewarden.com/notes/archives/2005/05/testing.html" />
<modified>2005-05-13T02:08:32Z</modified>
<issued>2005-05-11T00:38:33Z</issued>
<id>tag:petewarden.com,2005:/notes/1.1</id>
<created>2005-05-11T00:38:33Z</created>
<summary type="text/plain">Texturing and Modeling has a chapter by Steve Worley on using Voronoi diagrams to create a procedural noise pattern. I needed to find a way to animate and render this on a 2D plane using OpenGL....</summary>
<author>
<name>petewarden</name>
<url>http://petewarden.com/notes</url>
<email>notes@petewarden.com</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://petewarden.com/notes/">
<![CDATA[<p><a href="http://www.texturingandmodeling.com" target="_blank">Texturing and Modeling</a> has a chapter by <a href="http://www.worley.com" target="_blank">Steve Worley</a> on using Voronoi diagrams to create a procedural noise pattern. I needed to find a way to animate and render this on a 2D plane using OpenGL.<br />
</p>]]>
<![CDATA[<p>The basic approach is to take a set of points randomly scattered through space, and use the distance to the nearest point as the texture&#039;s value. This is the sort of pattern you end up with:</p>

<p><a href="javascript:openpopup('http://petewarden.com/notes/images/WorleyGrab.png',800,600,false);"><img src="http://petewarden.com/notes/images/WorleyGrab.png" border=0 alt=''></a></p>

<p>A naive implementation of this would search through all of the scattered points to find the current closest distance, which gets slow very quickly. Steve&#039;s implementation cuts down the search space by dividing space into grid cells and placing a single point randomly within each one, like this:</p>

<p><a href="javascript:openpopup('http://petewarden.com/notes/images/WorleyDiagram1.png',800,600,false);"><img src="http://petewarden.com/notes/images/WorleyDiagram1.png" border=0 alt=''></a></p>

<p>This means each evaluation only needs to check the point in the cell it falls in and the points in the neighbouring cells.</p>

<p>This works great for static images, but I really wanted the points to move to get an animated texture. One way to do this would be to go back to randomly scattered points, and insert them into a spatial data structure such as a quadtree each frame, to keep the number of neighbours you need to check to a minimum.</p>

<p>I was aiming at a hardware solution, and wanted to draw a simple 2D image of the pattern rather than procedurally evaluating it at arbitrary points, so I decided to try an idea I first came across in <a href="http://www.cs.princeton.edu/gfx/pubs/Klein_2002_SVC/index.php" target="_blank">Stylized Video Cubes</a>: using a z-buffer to construct the Voronoi diagram.</p>

<p>I draw a quad centered on each point, enable depth testing, and use a small pixel shader that outputs the distance from the central point as the z depth, as well as setting the color from a palette indexed by that depth:</p>

<p><a href="javascript:openpopup('http://petewarden.com/notes/images/WorleyDiagram2.png',800,600,false);"><img src="http://petewarden.com/notes/images/WorleyDiagram2.png" border=0 alt=''></a></p>

<p>The actual size of the quad that&#039;s drawn for each point was chosen pretty arbitrarily, to give a nice look for a typical distribution of points.</p>

<p>I then animate each point by giving it a random velocity vector, and wrapping its position when it heads off the edge of the image (you can either add a slop border the size of a single quad around the edge to wrap to, or, if you want a tileable result, render each point multiple times if it&#039;s touching a border).</p>

<p><a href="images/WorleyMovie.mov" target="_blank">This</a> works pretty nicely. The main issue was what happens when part of the image isn&#039;t covered by any quad, which leaves a gap in the result. I worked around this by clearing the screen to the color indexed by the maximum distance in the palette.</p>

<p>Having a tileable image means that it&#039;s possible to do a poor man&#039;s fractal noise, by reusing the constructed image for all the sublayers of the fractal pattern. The repetition can be noticeable, especially when it&#039;s <a href="images/FractalMovie2.mov" target="_blank">animated</a>, but it&#039;s a very low-cost operation once you have the basic image.</p>

<p><a href="javascript:openpopup('http://petewarden.com/notes/images/WorleyImage2.png',800,600,false);"><img src="http://petewarden.com/notes/images/WorleyImage2.png" border=0 alt=''></a></p>
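<p>The poor man&#039;s fractal noise amounts to sampling the same tileable image at doubling frequencies with halving amplitudes and summing the layers. A C sketch under those assumptions (nearest-neighbor lookups into a square single-channel image, a 1/2 amplitude falloff, and hypothetical names) follows; wrapping the lookup coordinates is what makes reusing one tile possible.</p>

```c
#include <assert.h>

/* Nearest-neighbor lookup with wrap-around, so the tile repeats seamlessly. */
float sample_tiled(const float *image, int size, int x, int y) {
    int wx = ((x % size) + size) % size;
    int wy = ((y % size) + size) % size;
    return image[wy * size + wx];
}

/* Sum `octaves` layers of the same tile: each layer doubles the frequency
   and halves the amplitude. */
float fractal_sample(const float *image, int size, int x, int y, int octaves) {
    assert(size > 0 && octaves > 0 && octaves <= 16);
    float total = 0.0f, amplitude = 0.5f;
    for (int o = 0; o < octaves; o++) {
        int scale = 1 << o;
        total += amplitude * sample_tiled(image, size, x * scale, y * scale);
        amplitude *= 0.5f;
    }
    return total;
}
```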

<p>I experimented with doing multiple passes to implement deeper levels of the algorithm, such as using the distance to the second or third nearest neighbor. This required a pass that wrote an ID value for each point, which the next pass compared against to figure out whether the current nearest point was acceptable, as well as much larger quads, so I was unable to get decent performance for L2 and higher textures.</p>

<p>I think the best approach for these on hardware would involve a combination of space-sorting strategies to break areas down into a small enough set of possible neighbor points, and then doing the fine-level sorting within a pixel shader. Space sorting would also be a good way to handle the moving points problem with a software implementation, since you don&#039;t have the performance advantage you get from an optimized z-buffer in hardware.</p>]]>
</content>
</entry>

</feed>