SIGGRAPH '97
Course 24: OpenGL and Window System Integration
OpenGL Performance Optimization
Contents
1. Hardware vs. Software
2. Application Organization
2.1 High Level Organization
2.2 Low Level Organization
3. OpenGL Optimization
3.1 Traversal
3.2 Transformation
3.3 Rasterization
3.4 Texturing
3.5 Clearing
3.6 Miscellaneous
3.7 Window System Integration
3.8 Mesa-specific
4. Evaluation and tuning
4.1 Pipeline tuning
4.2 Double buffering
4.3 Test on several implementations
1. Hardware vs. Software
OpenGL may be implemented by any combination of hardware and software. At the high-end, hardware may implement virtually all of OpenGL while at
the low-end, OpenGL may be implemented entirely in software. In between
are combination software/hardware implementations. More money buys more
hardware and better performance.
Intro-level workstation hardware and the recent PC 3-D hardware typically
implement point, line, and polygon rasterization in hardware but implement
floating point transformations, lighting, and clipping in software. This
is a good strategy since the bottleneck in 3-D rendering is usually
rasterization and modern CPUs have sufficient floating point performance
to handle the transformation stage.
OpenGL developers must remember that their application may be used on a
wide variety of OpenGL implementations. Therefore one should consider
using all possible optimizations, even those which have little return on
the development system, since other systems may benefit greatly.
From this point of view it may seem wise to develop your application on a
low-end system. There is a pitfall however; some operations which are
cheap in software may be expensive in hardware. The moral is: test your
application on a variety of systems to be sure the performance is dependable.
2. Application Organization
At first glance it may seem that the performance of interactive OpenGL applications is dominated by the performance of OpenGL itself. This may
be true in some circumstances but be aware that the organization of the
application is also significant.
2.1 High Level Organization
Multiprocessing
Some graphical applications have a substantial computational component other than 3-D rendering. Virtual reality applications must compute
object interactions and collisions. Scientific visualization programs
must compute analysis functions and graphical representations of data.
One should consider multiprocessing in these situations. By assigning
rendering and computation to different threads they may be executed in
parallel on multiprocessor computers.
For many applications, supporting multiprocessing is just a matter of
partitioning the render and compute operations into separate threads
which share common data structures and coordinate with synchronization
primitives.
SGI's Performer is an example of a high level toolkit designed for this
purpose.
Image quality vs. performance
In general, one wants high-speed animation and high-quality images in an OpenGL application.
If you can't have both at once a reasonable compromise may be to render at
low complexity during animation and high complexity for static images.
Complexity may refer to the geometric or rendering attributes of a database.
Here are a few examples.
During interactive rotation (i.e. mouse button held down) render a
reduced-polygon model. When drawing a static image draw the full
polygon model.
During animation, disable dithering, smooth shading, and/or texturing.
Enable them for the static image.
If texturing is required, use GL_NEAREST sampling and glHint( GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST ).
During animation, disable antialiasing. Enable antialiasing for the
static image.
Use coarser NURBS/evaluator tessellation during animation. Use glPolygonMode( GL_FRONT_AND_BACK, GL_LINE ) to inspect tessellation granularity and reduce it if possible.
Level of detail management and culling
Objects which are distant from the viewer may be rendered with a reduced complexity model. This strategy reduces the demands on all stages of the
graphics pipeline. Toolkits such as Inventor and Performer support this
feature automatically.
Objects which are entirely outside of the field of view may be culled.
This type of high level cull testing can be done efficiently with bounding
boxes or spheres and can have a major impact on performance. Again, toolkits
such as Inventor and Performer have this feature.
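The two ideas above can be sketched without any toolkit. The following is a minimal illustration, not how Inventor or Performer implement it: it assumes each object carries a bounding sphere, that the view volume is given as planes with unit-length normals pointing inward, and that level-of-detail switch distances are chosen by the application. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A plane a*x + b*y + c*z + d = 0 whose unit-length normal (a,b,c)
 * points toward the inside of the view volume (an assumption of this sketch). */
struct plane { float a, b, c, d; };

struct sphere { float x, y, z, radius; };

/* Returns 1 if the sphere lies entirely outside at least one plane and can
 * be culled, 0 if it may be visible. */
int sphere_is_culled(const struct sphere *s, const struct plane *planes,
                     size_t num_planes)
{
    size_t i;
    for (i = 0; i < num_planes; i++) {
        /* signed distance from the sphere center to the plane */
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius) {
            return 1;
        }
    }
    return 0;
}

/* Picks a level of detail: 0 = full model, higher = coarser model.
 * thresholds[] holds num_levels-1 increasing switch distances. */
int select_lod(float distance, const float *thresholds, int num_levels)
{
    int level = 0;
    assert(num_levels >= 1);
    while (level < num_levels - 1 && distance > thresholds[level]) {
        level++;
    }
    return level;
}
```

A renderer would call sphere_is_culled once per object before issuing any OpenGL commands for it, and use select_lod to choose which version of the model to draw.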
2.2 Low Level Organization
The objects which are rendered with OpenGL have to be stored in some sort of data structure. Some data structures are more efficient than others
with respect to how quickly they can be rendered.
Basically, one wants data structures which can be traversed quickly
and passed to the graphics library in an efficient manner. For example,
suppose we need to render a triangle strip. The data structure which
stores the list of vertices may be implemented with a linked list or an
array. Clearly the array can be traversed more quickly than a linked list.
The way in which a vertex is stored in the data structure is also significant.
High performance hardware can process vertexes specified by a pointer more
quickly than those specified by three separate parameters.
An Example
Suppose we're writing an application which involves drawing a road map. One of the components of the database is a list of cities specified with
a latitude, longitude and name. The data structure describing a city
may be:
    struct city {
        float latitude, longitude;   /* city location */
        char *name;                  /* city's name */
        int large_flag;              /* 0 = small, 1 = large */
    };
A list of cities may be stored as an array of city structs.
Our first attempt at rendering this information may be:
    void draw_cities( int n, struct city citylist[] )
    {
        int i;
        for (i=0; i < n; i++) {
            if (citylist[i].large_flag) {
                glPointSize( 4.0 );
            }
            else {
                glPointSize( 2.0 );
            }
            glBegin( GL_POINTS );
            glVertex2f( citylist[i].longitude, citylist[i].latitude );
            glEnd();
            glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
            glCallLists( strlen(citylist[i].name), GL_BYTE, citylist[i].name );
        }
    }
This is a poor implementation for a number of reasons:
glPointSize is called for every loop iteration.
Only one point is drawn between glBegin and glEnd.
The vertices aren't being specified in the most efficient manner.
Here's a better implementation:
    void draw_cities( int n, struct city citylist[] )
    {
        int i;

        /* draw small dots first */
        glPointSize( 2.0 );
        glBegin( GL_POINTS );
        for (i=0; i < n; i++) {
            if (citylist[i].large_flag==0) {
                glVertex2f( citylist[i].longitude, citylist[i].latitude );
            }
        }
        glEnd();

        /* draw large dots second */
        glPointSize( 4.0 );
        glBegin( GL_POINTS );
        for (i=0; i < n; i++) {
            if (citylist[i].large_flag==1) {
                glVertex2f( citylist[i].longitude, citylist[i].latitude );
            }
        }
        glEnd();

        /* draw city labels third */
        for (i=0; i < n; i++) {
            glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
            glCallLists( strlen(citylist[i].name), GL_BYTE, citylist[i].name );
        }
    }
In this implementation we're only calling glPointSize twice and we're maximizing the number of vertices specified between glBegin and glEnd.
We can still do better, however. If we redesign the data structures used
to represent the city information we can improve the efficiency of drawing
the city points. For example:
    struct city_list {
        int num_cities;     /* how many cities in the list */
        float *position;    /* pointer to lat/lon coordinates */
        char **name;        /* pointer to city names */
        float size;         /* size of city points */
    };
Now cities of different sizes are stored in separate lists.
Positions are stored sequentially in a dynamically allocated array.
By reorganizing the data structures we've eliminated the need for a conditional inside the glBegin/glEnd loops.
Also, we can render a list of cities using the GL_EXT_vertex_array extension if available, or at least use a more efficient version of glVertex and glRasterPos.
    /* indicates if server can do GL_EXT_vertex_array: */
    GLboolean varray_available;

    void draw_cities( struct city_list *list )
    {
        int i;
        /* fall back to glBegin/glEnd unless the extension path is taken */
        GLboolean use_begin_end = GL_TRUE;

        /* draw the points */
        glPointSize( list->size );

    #ifdef GL_EXT_vertex_array
        if (varray_available) {
            glVertexPointerEXT( 2, GL_FLOAT, 0, list->num_cities, list->position );
            glDrawArraysEXT( GL_POINTS, 0, list->num_cities );
            use_begin_end = GL_FALSE;
        }
    #endif

        if (use_begin_end) {
            glBegin( GL_POINTS );
            for (i=0; i < list->num_cities; i++) {
                glVertex2fv( &list->position[i*2] );
            }
            glEnd();
        }

        /* draw city labels */
        for (i=0; i < list->num_cities; i++) {
            glRasterPos2fv( &list->position[i*2] );
            glCallLists( strlen(list->name[i]), GL_BYTE, list->name[i] );
        }
    }
As this example shows, it's better to know something about efficient rendering
techniques before designing the data structures. In many cases one has to
find a compromise between data structures optimized for rendering and those
optimized for clarity and convenience.
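One way to reach such a compromise is to keep the convenient array of city structs as the master copy and derive the render-friendly lists from it once, at load time. The sketch below is only an illustration under stated assumptions: the helper names build_city_list and free_city_list are hypothetical, positions are packed as (longitude, latitude) pairs so that x comes first as in the glVertex2f calls above, and name pointers are shared with the source array rather than copied.

```c
#include <stdlib.h>

struct city {
    float latitude, longitude;   /* city location */
    char *name;                  /* city's name */
    int large_flag;              /* 0 = small, 1 = large */
};

struct city_list {
    int num_cities;     /* how many cities in the list */
    float *position;    /* packed (longitude, latitude) pairs */
    char **name;        /* pointers to city names (not owned) */
    float size;         /* size of city points */
};

/* Collects the cities whose large_flag matches want_large into a new list.
 * Returns 0 on success, -1 if memory could not be allocated. */
int build_city_list(const struct city *cities, int n, int want_large,
                    float point_size, struct city_list *out)
{
    int i, count = 0, k = 0;

    for (i = 0; i < n; i++) {
        if ((cities[i].large_flag != 0) == (want_large != 0)) {
            count++;
        }
    }

    /* allocate at least one element so an empty list is not mistaken
     * for an allocation failure */
    out->position = malloc(sizeof(float) * 2 * (size_t)(count ? count : 1));
    out->name = malloc(sizeof(char *) * (size_t)(count ? count : 1));
    if (out->position == NULL || out->name == NULL) {
        free(out->position);
        free(out->name);
        out->position = NULL;
        out->name = NULL;
        out->num_cities = 0;
        return -1;
    }

    for (i = 0; i < n; i++) {
        if ((cities[i].large_flag != 0) == (want_large != 0)) {
            out->position[2 * k]     = cities[i].longitude;
            out->position[2 * k + 1] = cities[i].latitude;
            out->name[k] = cities[i].name;
            k++;
        }
    }
    out->num_cities = count;
    out->size = point_size;
    return 0;
}

/* Releases the arrays owned by the list; the name strings are left alone. */
void free_city_list(struct city_list *list)
{
    free(list->position);
    free(list->name);
    list->position = NULL;
    list->name = NULL;
    list->num_cities = 0;
}
```

The application would call build_city_list twice, once for small and once for large cities, and pass each result to the drawing routine.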
In the following sections the techniques for maximizing performance,
as seen above, are explained.
3. OpenGL Optimization
There are many ways to improve OpenGL performance. The impact of any single optimization can vary a great deal depending on the OpenGL
implementation.
Interestingly, items which have a large impact on software renderers may have no effect on hardware renderers, and vice versa! For example, smooth shading can be expensive in software but free in hardware, while glGet* can be cheap in software but expensive in hardware.
After each of the following techniques look for a bracketed list of symbols
which relates the significance of the optimization to your OpenGL
system:
H - beneficial for high-end hardware
L - beneficial for low-end hardware
S - beneficial for software implementations
all - probably beneficial for all implementations
3.1 Traversal
Traversal is the sending of data to the graphics system. Specifically, we want to minimize the time taken to specify primitives to OpenGL.
Use connected primitives
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP require fewer vertices to describe an object than individual line, triangle, or polygon primitives.
This reduces data transfer and transformation workload. [all]
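The saving is easy to quantify. As a small worked example (the function names are hypothetical), the helpers below count how many vertices must be sent for the same number of triangles or line segments, drawn either as independent primitives or as a single strip.

```c
/* Vertices sent for n independent triangles (GL_TRIANGLES). */
long independent_triangle_vertices(long n)
{
    return 3 * n;
}

/* Vertices sent for n triangles in a single GL_TRIANGLE_STRIP:
 * the first triangle needs three vertices, each later one adds one. */
long triangle_strip_vertices(long n)
{
    return n > 0 ? n + 2 : 0;
}

/* Vertices sent for n independent line segments (GL_LINES). */
long independent_line_vertices(long n)
{
    return 2 * n;
}

/* Vertices sent for n connected segments in a single GL_LINE_STRIP. */
long line_strip_vertices(long n)
{
    return n > 0 ? n + 1 : 0;
}
```

For 100 triangles this is 300 vertices as independent triangles versus 102 in one strip, so close to a threefold reduction in the data that has to be transferred and transformed.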
Use the vertex array extension
On some architectures function calls are somewhat expensive so replacing many glVertex/glColor/glNormal calls with the vertex array mechanism may be very beneficial. [all]
Store vertex data in consecutive memory locations
When maximum performance is needed on high-end systems it's
good to store vertex data in contiguous memory to maximize
throughput of data from host memory to graphics subsystem. [H,L]
Use the vector versions of glVertex, glColor, glNormal and glTexCoord
The glVertex, glColor, etc. functions which take a pointer to their arguments such as glVertex3fv(v) may be much faster than those which take individual arguments such as glVertex3f(x,y,z) on systems with DMA-driven graphics hardware. [H,L]
Reduce quantity of primitives
Be careful not to render primitives which are over-tesselated.
Experiment with the GLU primitives, for example,
to determine the best compromise of image quality vs.
tesselation level. Textured objects in particular may still
be rendered effectively with low geometric complexity. [all]
Display lists
Use display lists to encapsulate frequently drawn objects.
Display list data may be stored in the graphics subsystem
rather than host memory thereby eliminating host-to-graphics
data movement.
Display lists are also very beneficial when rendering
remotely. [all]
Don't specify unneeded per-vertex information
If lighting is disabled don't call glNormal. If texturing is disabled don't call glTexCoord, etc.
Minimize code between glBegin/glEnd
For maximum performance on high-end systems it's extremely
important to send vertex data to the graphics system as fast
as possible.
Avoid extraneous code between glBegin/glEnd.
Example:
    glBegin( GL_TRIANGLE_STRIP );
    for (i=0; i < n; i++) {
        if (lighting) {
            glNormal3fv( norm[i] );
        }
        glVertex3fv( vert[i] );
    }
    glEnd();
This is a very bad construct. The following is much better:
    if (lighting) {
        glBegin( GL_TRIANGLE_STRIP );
        for (i=0; i < n; i++) {
            glNormal3fv( norm[i] );
            glVertex3fv( vert[i] );
        }
        glEnd();
    }
    else {
        glBegin( GL_TRIANGLE_STRIP );
        for (i=0; i < n; i++) {
            glVertex3fv( vert[i] );
        }
        glEnd();
    }
Also consider manually unrolling important rendering loops to
maximize the function call rate.
3.2 Transformation
Transformation includes the transformation of vertices from glVertex to window coordinates, clipping and lighting.
Lighting
Avoid using positional lights, i.e. light positions should
be of the form (x,y,z,0) [L,S]
Avoid using spotlights. [all]
Avoid using two-sided lighting. [all]
Avoid using negative material and light color coefficients [S]
Avoid using the local viewer lighting model. [L,S]
Avoid frequent changes to the GL_SHININESS material parameter. [L,S]
Some OpenGL implementations are optimized for the case of
a single light source.
Consider pre-lighting complex objects before rendering, ala
radiosity. You can get the effect of lighting by
specifying vertex colors instead of vertex normals. [S]
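As a rough sketch of pre-lighting, the function below computes one color per vertex on the CPU, once, for a single directional light and a purely diffuse surface. This is a deliberately simplified stand-in for the OpenGL lighting equation, not a reproduction of it: the ambient weighting and the function name are assumptions of this example, and both the normals and the light direction are expected to be unit length.

```c
#include <stddef.h>

/* normals:    n unit vectors stored as x,y,z triples
 * light_dir:  unit vector pointing from the surface toward the light
 * base_color: RGB surface color
 * ambient:    fraction of base_color applied regardless of orientation (0..1)
 * colors:     output, n RGB triples to be sent as vertex colors */
void prelight_diffuse(const float *normals, size_t n,
                      const float light_dir[3], const float base_color[3],
                      float ambient, float *colors)
{
    size_t i;
    int c;

    for (i = 0; i < n; i++) {
        float ndotl = normals[3 * i]     * light_dir[0]
                    + normals[3 * i + 1] * light_dir[1]
                    + normals[3 * i + 2] * light_dir[2];
        float intensity;

        if (ndotl < 0.0f) {
            ndotl = 0.0f;          /* surface faces away from the light */
        }
        intensity = ambient + (1.0f - ambient) * ndotl;

        for (c = 0; c < 3; c++) {
            colors[3 * i + c] = base_color[c] * intensity;
        }
    }
}
```

The resulting array can then be drawn with lighting disabled, sending a color per vertex in place of a normal. The result is only valid while the object and the light keep the same relative orientation.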
Two sided lighting
If you want both the front and back of polygons shaded the
same try using two light sources instead of two-sided
lighting. Position the two light sources on opposite
sides of your object. That way, a polygon will always be
lit correctly whether it's back or front facing.
[L,S]
Disable normal vector normalization when not needed
glEnable/Disable(GL_NORMALIZE) controls whether normal vectors are scaled to unit length before lighting. If you do not use glScale you may be able to disable normalization without ill effects. Normalization is disabled by default. [L,S]
Use connected primitives
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP decrease traversal and transformation load.
glRect usage
If you have to draw many rectangles consider using glBegin(GL_QUADS) ... glEnd() instead. [all]
3.3 Rasterization
Rasterization is the process of generating the pixels which represent points, lines, polygons, bitmaps and the writing of those pixels to the
frame buffer. Rasterization is often the bottleneck in software
implementations of OpenGL.
Disable smooth shading when not needed
Smooth shading is enabled by default. Flat shading doesn't
require interpolation of the four color components and is usually
faster than smooth shading in software implementations. Hardware
may perform flat and smooth-shaded rendering at the same rate
though there's at least one case in which smooth shading is faster
than flat shading (E&S Freedom). [S]
Disable depth testing when not needed
Background objects, for example, can be drawn without depth testing
if they're drawn first. Foreground objects can be drawn
without depth testing if they're drawn last. [L,S]
Disable dithering when not needed
This is easy to forget when developing on a high-end machine.
Disabling dithering can make a big difference in software
implementations of OpenGL on lower-end machines with 8 or 12-bit
color buffers. Dithering is enabled by default. [S]
Use back-face culling whenever possible.
If you're drawing closed polyhedra or other objects for which
back facing polygons aren't visible there's probably no point
in drawing those polygons. [all]
The GL_SGI_cull_vertex extension
SGI's Cosmo GL supports a new culling extension which looks at
vertex normals to try to improve the speed of culling.
Avoid extra fragment operations
Stenciling, blending, stippling, alpha testing and logic ops
can all take extra time during rasterization. Be sure to disable
the operations which aren't needed. [all]
Reduce the window size or screen resolution
A simple way to reduce rasterization time is to reduce the number
of pixels drawn. If a smaller window or reduced display resolution
are acceptable it's an easy way to improve rasterization speed. [L,S]
3.4 Texturing
Texture mapping is usually an expensive operation in both hardware and software.
Only high-end graphics hardware can offer free to low-cost texturing.
In any case there are several ways to maximize texture mapping performance.
Use efficient image formats
The GL_UNSIGNED_BYTE component format is typically the fastest for specifying texture images.
Experiment with the internal texture formats offered by the GL_EXT_texture extension. Some formats are faster than others on some systems (16-bit texels on the Reality Engine, for example). [all]
Encapsulate texture maps in texture objects or display lists
This is especially important if you use several texture
maps. By putting textures into display lists or texture
objects the graphics system can manage their storage and
minimize data movement between the client and graphics
subsystem. [all]
Use smaller texture maps
Smaller images can be moved from host to texture memory faster
than large images. More small textures can be stored simultaneously
in texture memory, reducing texture memory swapping. [all]
Use simpler sampling functions
Experiment with the minification and magnification texture filters
to determine which performs best while giving acceptable results.
Generally, GL_NEAREST is fastest and GL_LINEAR is second fastest.
[all]
Use the same sampling function for minification and magnification
If both the minification and magnification filters are GL_NEAREST or GL_LINEAR then there's no reason OpenGL has to compute the lambda value which determines whether to use minification or magnification sampling for each fragment.
Avoiding the lambda calculation can be a good performance improvement.
Use a simpler texture environment function
Some texture environment modes may be faster than others. For example, the GL_DECAL or GL_REPLACE_EXT functions for 3 component textures are a simple assignment of texel samples to fragments while GL_MODULATE multiplies texel samples with the incoming fragment colors. [S,L]
Combine small textures
If you are using several small textures consider tiling them
together as a larger texture and modify your texture coordinates
to address the subtexture you want.
This technique can eliminate texture bindings.
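A minimal sketch of the coordinate adjustment, assuming the small textures are all the same size and are tiled into a regular grid (tiles numbered left to right, bottom to top). The struct and function names are hypothetical.

```c
#include <assert.h>

/* Texture coordinate rectangle occupied by one tile of the large texture. */
struct tex_rect { float s0, t0, s1, t1; };

/* Returns the rectangle of tile number `tile` in a cols x rows grid. */
struct tex_rect atlas_tile_coords(int tile, int cols, int rows)
{
    struct tex_rect r;
    int col, row;

    assert(cols > 0 && rows > 0);
    assert(tile >= 0 && tile < cols * rows);

    col = tile % cols;
    row = tile / cols;
    r.s0 = (float)col / (float)cols;
    r.t0 = (float)row / (float)rows;
    r.s1 = (float)(col + 1) / (float)cols;
    r.t1 = (float)(row + 1) / (float)rows;
    return r;
}

/* Maps a coordinate (s,t) in [0,1] of the original small texture to the
 * corresponding coordinate inside the tile. */
void atlas_map(const struct tex_rect *r, float s, float t,
               float *out_s, float *out_t)
{
    *out_s = r->s0 + s * (r->s1 - r->s0);
    *out_t = r->t0 + t * (r->t1 - r->t0);
}
```

Note that this only works for coordinates that stay within [0,1]: texture repeat wrapping applies to the whole combined texture, not to a tile, and linear filtering near a tile edge can pick up texels from the neighboring tile.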
Use glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST)
This hint can improve the speed of texturing when perspective-
correct texture coordinate interpolation isn't needed, such as
when using a glOrtho() projection.
Animated textures
If you want to use an animated texture, perhaps live video textures,
don't use glTexImage2D to repeatedly change the texture. Use glTexSubImage2D or glCopyTexSubImage2D.
These functions are standard in OpenGL 1.1 and available as extensions to 1.0.
3.5 Clearing
Clearing the color, depth, stencil and accumulation buffers can be time consuming, especially when it has to be done in software.
There are a few tricks which can help.
Use glClear carefully [all]
Clear all relevant buffers with one glClear.
Wrong:
    glClear( GL_COLOR_BUFFER_BIT );
    if (stenciling) {
        glClear( GL_STENCIL_BUFFER_BIT );
    }
Right:
    if (stenciling) {
        glClear( GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );
    }
    else {
        glClear( GL_COLOR_BUFFER_BIT );
    }
Disable dithering
Disable dithering before clearing the color buffer.
Visually, the difference between dithered and undithered clears
is usually negligible.
Use scissoring to clear a smaller area
If you don't need to clear the whole buffer use glScissor() to restrict clearing to a smaller area. [L]
Don't clear the color buffer at all
If the scene you're drawing opaquely covers the entire window
there is no reason to clear the color buffer.
Eliminate depth buffer clearing
If the scene you're drawing covers the entire window there is a
trick which lets you omit the depth buffer clear. The idea is
to only use half the depth buffer range for each frame and
alternate between using GL_LESS and GL_GREATER as the depth test
function.
Example:
    int EvenFlag;

    /* Call this once during initialization and whenever the window
     * is resized.
     */
    void init_depth_buffer( void )
    {
        glClearDepth( 1.0 );
        glClear( GL_DEPTH_BUFFER_BIT );
        glDepthRange( 0.0, 0.5 );
        glDepthFunc( GL_LESS );
        EvenFlag = 1;
    }

    /* Your drawing function */
    void display_func( void )
    {
        if (EvenFlag) {
            glDepthFunc( GL_LESS );
            glDepthRange( 0.0, 0.5 );
        }
        else {
            glDepthFunc( GL_GREATER );
            glDepthRange( 1.0, 0.5 );
        }
        EvenFlag = !EvenFlag;

        /* draw your scene */
    }
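To see why the stale depth values from the previous frame never interfere, it may help to run the trick on a toy software depth buffer. The following is a simulation for illustration only and makes no OpenGL calls; the struct, the function names and the four-pixel buffer are all hypothetical. Input depths are in [0,1] with 0 nearest, and are mapped the way glDepthRange( 0.0, 0.5 ) and glDepthRange( 1.0, 0.5 ) would map them.

```c
#include <assert.h>

#define NUM_PIXELS 4

struct depth_sim {
    float depth[NUM_PIXELS];  /* simulated depth buffer, never cleared after init */
    int   owner[NUM_PIXELS];  /* id of the object currently visible at each pixel */
    int   even;               /* 1: "less" test over [0,0.5]; 0: "greater" over [1,0.5] */
};

void sim_init(struct depth_sim *s)
{
    int i;
    for (i = 0; i < NUM_PIXELS; i++) {
        s->depth[i] = 1.0f;   /* same as clearing the depth buffer to 1.0 */
        s->owner[i] = -1;
    }
    s->even = 1;
}

/* Maps z in [0,1] to a stored depth: even frames use [0,0.5],
 * odd frames use the reversed range from 1.0 down to 0.5. */
static float window_depth(const struct depth_sim *s, float z)
{
    return s->even ? 0.5f * z : 1.0f - 0.5f * z;
}

/* Draws one fragment of object `id` at `pixel` with depth z. */
void sim_draw(struct depth_sim *s, int pixel, float z, int id)
{
    float zw;
    int pass;

    assert(pixel >= 0 && pixel < NUM_PIXELS);
    zw = window_depth(s, z);
    pass = s->even ? (zw < s->depth[pixel]) : (zw > s->depth[pixel]);
    if (pass) {
        s->depth[pixel] = zw;
        s->owner[pixel] = id;
    }
}

/* Ends the frame: only the flag flips, nothing is cleared. */
void sim_end_frame(struct depth_sim *s)
{
    s->even = !s->even;
}
```

After an even frame every covered pixel holds a value at or below 0.5, so in the following odd frame the first fragment drawn at each pixel passes the "greater" test, and nearer fragments (which map to larger values) win over farther ones. One caveat visible in the simulation: a fragment exactly on the far plane maps to 0.5 in both kinds of frame and would fail against a stored 0.5.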
Avoid glClearDepth( d ) where d!=1.0
Some software implementations may have optimized paths for
clearing the depth buffer to 1.0. [S]
3.6 Miscellaneous
Avoid "round-trip" calls
Calls such as glGetFloatv, glGetIntegerv, glIsEnabled, glGetError, glGetString require a slow, round trip transaction between the application and renderer.
Especially avoid them in your main rendering code.
Note that software implementations of OpenGL may actually perform
these operations faster than hardware systems. If you're developing
on a low-end system be aware of this fact. [H,L]
Avoid glPushAttrib
If only a few pieces of state need to be saved and restored it's often faster to maintain the information in the client program. glPushAttrib( GL_ALL_ATTRIB_BITS ) in particular can be very expensive on hardware systems. This call may be faster in software implementations than in hardware. [H,L]
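Maintaining the information in the client program can be as simple as keeping a shadow copy of the few flags you care about. The sketch below is one possible shape for that, with hypothetical names throughout: apply_fn stands in for a wrapper around glEnable/glDisable, the three capability ids are placeholders, and counting_apply is a stub that only counts calls so the behavior can be observed without a GL context.

```c
#include <assert.h>

/* Hypothetical capability ids; a real program would map these to GL enums. */
enum { CAP_LIGHTING, CAP_TEXTURE, CAP_BLEND, NUM_CAPS };

/* Stand-in for a function that calls glEnable or glDisable. */
typedef void (*apply_fn)(int cap, int enabled);

struct state_shadow {
    int enabled[NUM_CAPS];   /* client-side copy of the current state */
    apply_fn apply;
};

/* Assumes all tracked capabilities start out disabled. */
void shadow_init(struct state_shadow *s, apply_fn apply)
{
    int i;
    for (i = 0; i < NUM_CAPS; i++) {
        s->enabled[i] = 0;
    }
    s->apply = apply;
}

/* Sets a capability, touching the renderer only when the value changes.
 * Returns the previous value so the caller can restore it later. */
int shadow_set(struct state_shadow *s, int cap, int enabled)
{
    int previous;

    assert(cap >= 0 && cap < NUM_CAPS);
    previous = s->enabled[cap];
    if (previous != enabled) {
        s->enabled[cap] = enabled;
        s->apply(cap, enabled);
    }
    return previous;
}

/* Demonstration stub: counts how often the renderer would be called. */
int driver_calls = 0;

void counting_apply(int cap, int enabled)
{
    (void)cap;
    (void)enabled;
    driver_calls++;
}
```

Saving and restoring then becomes `prev = shadow_set(&s, CAP_BLEND, 1); ... shadow_set(&s, CAP_BLEND, prev);` with no glPushAttrib, no glIsEnabled query, and no redundant state changes.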
Check for GL errors during development
During development call glGetError inside your rendering/event loop to catch errors. GL errors raised during rendering can slow down rendering speed. Remove the glGetError call for production code since it's a "round trip" command and can cause delays. [all]
Use glColorMaterial instead of glMaterial
If you need to change a material property on a per vertex basis, glColorMaterial may be faster than glMaterial. [all]
glDrawPixels
glDrawPixels often performs best with GL_UNSIGNED_BYTE color components. [all]
Disable all unnecessary raster operations before calling glDrawPixels. [all]
Use the GL_EXT_abgr extension to specify color components in
alpha, blue, green, red order on systems which were designed
for IRIS GL. [H,L]
Avoid using viewports which are larger than the window
Software implementations may have to do additional clipping
in this situation. [S]
Alpha planes
Don't allocate alpha planes in the color buffer if you don't need them.
Specifically, they are not needed for transparency effects.
Systems without hardware alpha planes may have to resort to a
slow software implementation. [L,S]
Accumulation, stencil, overlay planes
Do not allocate accumulation, stencil or overlay planes if they
are not needed. [all]
Be aware of the depth buffer's depth
Your OpenGL may support several different sizes of depth
buffers- 16 and 24-bit for example. Shallower depth buffers
may be faster than deep buffers both for software and hardware
implementations. However, the precision of a 16-bit depth
buffer may not be sufficient for some applications. [L,S]
Transparency may be implemented with stippling instead of blending
If you need simple transparent objects consider using
polygon stippling instead of alpha blending. The former is
typically faster and may actually look better in some
situations. [L,S]
Group state changes together
Try to minimize the number of GL state changes in your code.
When GL state is changed, internal state may have to be
recomputed, introducing delays. [all]
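One common way to group state changes is to sort the things to be drawn by the state they need, so that each distinct state is set once per frame. The sketch below uses a texture id as the sort key purely as an example; the draw_item struct and the function names are hypothetical, and sorting is only appropriate when the drawing order does not affect the image (for instance, opaque objects with depth testing).

```c
#include <stdlib.h>

/* Hypothetical record for one object to draw. */
struct draw_item {
    int texture_id;   /* state the object needs */
    int object_id;    /* which object to draw */
};

static int compare_by_texture(const void *a, const void *b)
{
    const struct draw_item *ia = a;
    const struct draw_item *ib = b;
    return (ia->texture_id > ib->texture_id) - (ia->texture_id < ib->texture_id);
}

/* Reorders the items so that objects sharing a texture are adjacent. */
void sort_by_state(struct draw_item *items, size_t n)
{
    qsort(items, n, sizeof items[0], compare_by_texture);
}

/* Counts the state changes a renderer would issue when drawing the items
 * in order: one for the first item, plus one whenever the texture differs
 * from the previous item's. */
size_t count_state_changes(const struct draw_item *items, size_t n)
{
    size_t i, changes = 0;

    for (i = 0; i < n; i++) {
        if (i == 0 || items[i].texture_id != items[i - 1].texture_id) {
            changes++;
        }
    }
    return changes;
}
```

For five objects alternating between two textures, drawing in the original order needs five state changes while drawing in sorted order needs two.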
Avoid using glPolygonMode
If you need to draw many polygon outlines or vertex points use glBegin with GL_POINTS, GL_LINES, GL_LINE_LOOP or GL_LINE_STRIP instead as it can be much faster. [all]
3.7 Window System Integration
Minimize calls to the make current call
The glXMakeCurrent call, for example, can be expensive on hardware systems because the context switch may involve moving a large amount of data in and out of the hardware.
Visual / pixel format performance
Some X visuals or pixel formats may be faster than others. On PCs
for example, 24-bit color buffers may be slower to read/write than
12 or 8-bit buffers. There is often a tradeoff between performance
and quality of frame buffer configurations. 12-bit color may not
look as nice as 24-bit color. A 16-bit depth buffer won't have the
precision of a 24-bit depth buffer.
The GLX_EXT_visual_rating extension can help you select visuals based on performance or quality. GLX 1.2's visual caveat attribute can tell you if a visual has a performance penalty associated with it.
It may be worthwhile to experiment with different visuals to determine
if there's any advantage of one over another.
Avoid mixing OpenGL rendering with native rendering
OpenGL allows both itself and the native window system to
render into the same window. For this to be done correctly
synchronization is needed. The GLX glXWaitX and glXWaitGL functions serve this purpose.
Synchronization hurts performance. Therefore, if you need to
render with both OpenGL and native window system calls try to
group the rendering calls to minimize synchronization.
For example, if you're drawing a 3-D scene with OpenGL and displaying
text with X, draw all the 3-D elements first, call glXWaitGL to synchronize, then call all the X drawing functions.
Don't redraw more than necessary
Be sure that you're not redrawing your scene unnecessarily.
For example, expose/repaint events may come in batches describing
separate regions of the window which must be redrawn.
Since one usually redraws the whole window image with OpenGL
you only need to respond to one expose/repaint event.
In the case of X, look at the count field of the XExposeEvent
structure.
Only redraw when it is zero.
Also, when responding to mouse motion events you should skip
extra motion events in the input queue.
Otherwise, if you try to process every motion event and redraw
your scene there will be a noticeable delay between mouse input
and screen updates.
It can be a good idea to put a print statement in your redraw
and event loop function so you know exactly what messages are
causing your scene to be redrawn, and when.
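The event handling described above can be sketched independently of any window system. The event struct below is a hypothetical simplification (it is not the Xlib XEvent); its count field plays the role of the count field of XExposeEvent, the number of further expose events still to come. The function walks a queue and counts how many redraws would actually be performed.

```c
#include <stddef.h>

enum event_type { EV_EXPOSE, EV_MOTION, EV_KEY };

/* Hypothetical, simplified event record. */
struct event {
    enum event_type type;
    int count;   /* for EV_EXPOSE: expose events still to follow */
    int x, y;    /* for EV_MOTION: pointer position */
};

/* Processes the queue and returns the number of redraws performed.
 * Expose events trigger a redraw only when count is zero; a run of
 * consecutive motion events triggers a single redraw for the last one.
 * Key events are assumed not to need a redraw in this sketch. */
size_t count_redraws(const struct event *queue, size_t n)
{
    size_t i = 0, redraws = 0;

    while (i < n) {
        switch (queue[i].type) {
        case EV_EXPOSE:
            if (queue[i].count == 0) {
                redraws++;
            }
            break;
        case EV_MOTION:
            /* skip ahead to the newest motion event in this run */
            while (i + 1 < n && queue[i + 1].type == EV_MOTION) {
                i++;
            }
            redraws++;
            break;
        default:
            break;
        }
        i++;
    }
    return redraws;
}
```

In a real event loop the same logic would be applied to events as they are read from the window system queue rather than to an array.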
SwapBuffer calls and graphics pipe blocking
On systems with 3-D graphics hardware the SwapBuffers call is
synchronized to the monitor's vertical retrace.
Input to the OpenGL command queue may be blocked until the
buffer swap has completed.
Therefore, don't put more OpenGL calls immediately after SwapBuffers.
Instead, put application computation instructions which can
overlap with the buffer swap delay.
3.8 Mesa-specific
Mesa is a free library which implements most of the OpenGL API in a compatible manner. Since it is a software library, performance depends a
great deal on the host computer. There are several Mesa-specific features
to be aware of which can affect performance.
Double buffering
The X driver supports two back color buffer implementations: Pixmaps
and XImages. The MESA_BACK_BUFFER environment variable controls
which is used. Which of the two that's faster depends on the nature
of your rendering. Experiment.
X Visuals
As described above, some X visuals can be rendered into more quickly
than others. The MESA_RGB_VISUAL environment variable can be used to determine the quickest visual by experimentation.
Depth buffers
Mesa may use a 16 or 32-bit depth buffer as specified in the
src/config.h configuration file. 16-bit depth buffers are faster
but may not offer the precision needed for all applications.
Flat-shaded primitives
If one is drawing a number of flat-shaded primitives all of the
same color the glColor command should be put before the glBegin call.
Don't do this:
    glBegin(...);
    glColor(...);
    glVertex(...);
    ...
    glEnd();
Do this:
    glColor(...);
    glBegin(...);
    glVertex(...);
    ...
    glEnd();
glColor*() commands
The glColor[34]ub[v] functions are the fastest versions of the glColor command.
Avoid double precision valued functions
Mesa does all internal floating point computations in single
precision floating point.
API functions which take double precision floating point values
must convert them to single precision.
This can be expensive in the case of glVertex, glNormal, etc.
4. Evaluation and Tuning
To maximize the performance of an OpenGL application one must be able to evaluate the application to learn what is limiting its speed.
Because of the hardware involved it's not sufficient to use ordinary
profiling tools.
Several different aspects of the graphics system must be evaluated.
Performance evaluation is a large subject and only the basics are covered here.
For more information see "OpenGL on Silicon Graphics Systems".
4.1 Pipeline tuning
The graphics system can be divided into three subsystems for the purpose of performance evaluation:
CPU subsystem
- application code which drives the graphics subsystem
Geometry subsystem
- transformation of vertices, lighting, and
clipping
Rasterization subsystem
- drawing filled polygons, line segments and
per-pixel processing
At any given time, one of these stages will be the bottleneck. The
bottleneck must be reduced to improve performance.
The strategy is to isolate each subsystem in turn and evaluate changes
in performance.
For example, by decreasing the workload of the CPU subsystem one can
determine if the CPU or graphics system is limiting performance.
4.1.1 CPU subsystem
To isolate the CPU subsystem one must reduce the graphics workload while preserving the application's execution characteristics.
A simple way to do this is to replace glVertex and glNormal calls with glColor calls.
If performance does not improve then the CPU stage is the bottleneck.
4.1.2 Geometry subsystem
To isolate the geometry subsystem one wants to reduce the number of primitives processed, or reduce the transformation work per primitive
while producing the same number of pixels during rasterization.
This can be done by replacing many small polygons with fewer large
ones or by simply disabling lighting or clipping.
If performance increases then
your application is bound by geometry/transformation speed.
4.1.3 Rasterization subsystem
A simple way to reduce the rasterization workload is to make your window smaller. Other ways to reduce rasterization work are to disable per-pixel
processing such as texturing, blending, or depth testing.
If performance increases, your program is fill limited.
After bottlenecks have been identified the techniques outlined in
section 3 can be applied.
The process of identifying and reducing bottlenecks should be repeated
until no further improvements can be made or your minimum performance
threshold has been met.
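The three experiments can be summarized as a small decision procedure. The sketch below is only one way to organize the measurements: it assumes you have timed a frame (in milliseconds) for the unmodified program and for each of the three reduced-workload variants, and the tolerance parameter, enum and function names are hypothetical.

```c
enum bottleneck { BOUND_CPU, BOUND_GEOMETRY, BOUND_FILL, BOUND_UNKNOWN };

/* A variant counts as improved if its frame time dropped by more than
 * the given fraction of the baseline (to ignore measurement noise). */
static int improved(double base_ms, double test_ms, double tolerance)
{
    return test_ms < base_ms * (1.0 - tolerance);
}

/* base_ms:             frame time of the unmodified program
 * reduced_graphics_ms: glVertex/glNormal replaced by glColor (4.1.1)
 * reduced_geometry_ms: fewer primitives or lighting disabled (4.1.2)
 * reduced_fill_ms:     smaller window or per-pixel work disabled (4.1.3) */
enum bottleneck classify_bottleneck(double base_ms,
                                    double reduced_graphics_ms,
                                    double reduced_geometry_ms,
                                    double reduced_fill_ms,
                                    double tolerance)
{
    /* removing nearly all graphics work did not help: the CPU limits speed */
    if (!improved(base_ms, reduced_graphics_ms, tolerance)) {
        return BOUND_CPU;
    }
    if (improved(base_ms, reduced_fill_ms, tolerance)) {
        return BOUND_FILL;
    }
    if (improved(base_ms, reduced_geometry_ms, tolerance)) {
        return BOUND_GEOMETRY;
    }
    return BOUND_UNKNOWN;
}
```

With frame times of 20, 8, 12 and 20 ms and a 5% tolerance, for example, the function reports a geometry-bound program. After fixing the reported bottleneck the measurements should be repeated, since a different stage may then become the limit.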
4.2 Double buffering
For smooth animation one must maintain a high, constant frame rate. Double buffering has an important effect on this.
Suppose your application needs to render at 60Hz but is
only getting 30Hz. It's a mistake to think that you must
reduce rendering time by 50% to achieve 60Hz. The reason
is the swap-buffers operation is synchronized to occur
during the display's vertical retrace period (at 60Hz for
example). It may be that your application is taking only
a tiny bit too long to meet the 1/60 second rendering time
limit for 60Hz.
Measure the performance of rendering in single buffer mode
to determine how far you really are from your target frame
rate.
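The arithmetic behind this can be written down directly. Assuming the buffer swap always waits for the next vertical retrace and rendering of the next frame does not start until the swap has happened, a frame occupies a whole number of refresh periods. The function name below is hypothetical.

```c
/* Returns the displayed frame rate in Hz for a given rendering time per
 * frame (milliseconds) when swaps are synchronized to a display running
 * at refresh_hz. */
double displayed_rate_hz(double render_ms, double refresh_hz)
{
    double period_ms = 1000.0 / refresh_hz;
    long periods = (long)(render_ms / period_ms);

    /* round up to a whole number of refresh periods */
    if ((double)periods * period_ms < render_ms) {
        periods++;
    }
    if (periods < 1) {
        periods = 1;
    }
    return refresh_hz / (double)periods;
}
```

At 60Hz one refresh period is about 16.7 ms. A frame that takes 17 ms to render misses the first retrace and is displayed at 30Hz, yet shaving roughly 0.3 ms (about 2%) off the rendering time would be enough to reach 60Hz.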
4.3 Test on several implementations
The performance of OpenGL implementations varies a lot. One should measure performance and test OpenGL applications
on several different systems to be sure there are no
unexpected problems.
Last edited on May 16, 1997 by Brian Paul.