Everything you never wanted to know about OpenGL Vertex Buffer Objects but were forced to learn anyway

Post by **jlv** » Fri May 07, 2021 12:37 am

(Edit 2022-05-22: This post is wrong if you care about good performance on AMD's OpenGL. Read my followup to understand why.)

So I'm in the process of removing all the OpenGL stuff MXS uses that is deprecated/removed in OpenGL 3.1 / OpenGL ES and I figured you guys might want to suffer along with me. First up is the immediate mode drawing API.

If you don't know OpenGL, the immediate mode API is a really easy to use way to draw various types of polygon. Here's an example of how you'd draw a square:

Code: Select all

glBegin(GL_QUADS);
glVertex3f(-1.0, 1.0, 1.0);
glVertex3f(-1.0, -1.0, 1.0);
glVertex3f(1.0, -1.0, 1.0);
glVertex3f(1.0, 1.0, 1.0);
glEnd();

For a comparison, here's the more complicated, equivalent code using vertex buffer objects (abbreviated VBO henceforth):

Code: Select all

GLuint vbo = 0;
GLfloat data[16] = {
  -1.0, 1.0, 1.0, 1.0,
  -1.0, -1.0, 1.0, 1.0,
  1.0, -1.0, 1.0, 1.0,
  1.0, 1.0, 1.0, 1.0,
};
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(4, GL_FLOAT, sizeof(GLfloat) * 4, 0);
glEnableClientState(GL_VERTEX_ARRAY);
glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_STREAM_DRAW);
glDrawArrays(GL_QUADS, 0, 4);
glDisableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);

Obviously, it'd be a huge pain to do this every time you want to draw something simple like a square, so I want to emulate the old API using the new API. It'd also be nice if it weren't slower than the original code.

To start I needed a test program to make sure everything works and also to benchmark with. I chose "glutplane" which is a glut example program that draws paper planes. I modified it to be more useful for benchmarking (output FPS, more planes and start with moving planes). Here's the resulting listing:

Code: Select all

/* Copyright (c) Mark J. Kilgard, 1994. */

/* This program is freely distributable without licensing fees 
   and is provided without guarantee or warrantee expressed or 
   implied. This program is -not- in the public domain. */

#include <stdlib.h>
#include <stdio.h>
#ifndef WIN32
#include <unistd.h>
#else
#define random rand
#define srandom srand
#endif
#include <math.h>
#include <time.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/glcorearb.h>
#include <GL/glut.h>

/* Some <math.h> files do not define M_PI... */
#ifndef M_PI
#define M_PI 3.14159265
#endif
#ifndef M_PI_2
#define M_PI_2 1.57079632
#endif

GLboolean moving = GL_TRUE;

#define MAX_PLANES (1024*64)

struct {
  float speed;          /* zero speed means not flying */
  GLfloat red, green, blue;
  float theta;
  float x, y, z, angle;
} planes[MAX_PLANES];

#define v3f glVertex3f  /* v3f was the short IRIS GL name for
                           glVertex3f */

void
draw(void)
{
  GLfloat red, green, blue;
  int i;

  glClear(GL_DEPTH_BUFFER_BIT);
  /* paint black to blue smooth shaded polygon for background */
  glDisable(GL_DEPTH_TEST);
  glShadeModel(GL_SMOOTH);
  glBegin(GL_POLYGON);
  glColor3f(0.0, 0.0, 0.0);
  v3f(-20, 20, -19);
  v3f(20, 20, -19);
  glColor3f(0.0, 0.0, 1.0);
  v3f(20, -20, -19);
  v3f(-20, -20, -19);
  glEnd();
  /* paint planes */
  glEnable(GL_DEPTH_TEST);
  glShadeModel(GL_FLAT);
  for (i = 0; i < MAX_PLANES; i++)
    if (planes[i].speed != 0.0) {
      glPushMatrix();
      glTranslatef(planes[i].x, planes[i].y, planes[i].z);
      glRotatef(290.0, 1.0, 0.0, 0.0);
      glRotatef(planes[i].angle, 0.0, 0.0, 1.0);
      glScalef(1.0 / 3.0, 1.0 / 4.0, 1.0 / 4.0);
      glTranslatef(0.0, -4.0, -1.5);
      glBegin(GL_TRIANGLE_STRIP);
      /* left wing */
      v3f(-7.0, 0.0, 2.0);
      v3f(-1.0, 0.0, 3.0);
      glColor3f(red = planes[i].red, green = planes[i].green,
        blue = planes[i].blue);
      v3f(-1.0, 7.0, 3.0);
      /* left side */
      glColor3f(0.6 * red, 0.6 * green, 0.6 * blue);
      v3f(0.0, 0.0, 0.0);
      v3f(0.0, 8.0, 0.0);
      /* right side */
      v3f(1.0, 0.0, 3.0);
      v3f(1.0, 7.0, 3.0);
      /* final tip of right wing */
      glColor3f(red, green, blue);
      v3f(7.0, 0.0, 2.0);
      glEnd();
      glPopMatrix();
    }
  glutSwapBuffers();
}

void
tick_per_plane(int i)
{
  float theta = planes[i].theta += planes[i].speed;
  planes[i].z = -9 + 4 * cos(theta);
  planes[i].x = 4 * sin(2 * theta);
  planes[i].y = sin(theta / 3.4) * 3;
  planes[i].angle = ((atan(2.0) + M_PI_2) * sin(theta) - M_PI_2) * 180 / M_PI;
  if (planes[i].speed < 0.0)
    planes[i].angle += 180;
}

void
add_plane(void)
{
  int i;

  for (i = 0; i < MAX_PLANES; i++)
    if (planes[i].speed == 0) {

#define SET_COLOR(r,g,b) \
	planes[i].red=r; planes[i].green=g; planes[i].blue=b;

      switch (random() % 6) {
      case 0:
        SET_COLOR(1.0, 0.0, 0.0);  /* red */
        break;
      case 1:
        SET_COLOR(1.0, 1.0, 1.0);  /* white */
        break;
      case 2:
        SET_COLOR(0.0, 1.0, 0.0);  /* green */
        break;
      case 3:
        SET_COLOR(1.0, 0.0, 1.0);  /* magenta */
        break;
      case 4:
        SET_COLOR(1.0, 1.0, 0.0);  /* yellow */
        break;
      case 5:
        SET_COLOR(0.0, 1.0, 1.0);  /* cyan */
        break;
      }
      planes[i].speed = ((float) (random() % 20)) * 0.001 + 0.02;
      if (random() & 0x1)
        planes[i].speed *= -1;
      planes[i].theta = ((float) (random() % 257)) * 0.1111;
      tick_per_plane(i);
      if (!moving)
        glutPostRedisplay();
      return;
    }
}

void
remove_plane(void)
{
  int i;

  for (i = MAX_PLANES - 1; i >= 0; i--)
    if (planes[i].speed != 0) {
      planes[i].speed = 0;
      if (!moving)
        glutPostRedisplay();
      return;
    }
}

void
tick(void)
{
  int i;

  for (i = 0; i < MAX_PLANES; i++)
    if (planes[i].speed != 0.0)
      tick_per_plane(i);
}

static time_t g_seconds = 0;
static int g_frames = 0;

void
animate(void)
{
  time_t t = time(NULL);
  tick();
  glutPostRedisplay();
  if (g_seconds != 0 && g_seconds != t) {
    printf("%d frames in %ld second(s)\n", g_frames, (long)t - (long)g_seconds);
    g_frames = 0;
  }
  g_seconds = t;
  g_frames++;
}

void
visible(int state)
{
  if (state == GLUT_VISIBLE) {
    if (moving)
      glutIdleFunc(animate);
  } else {
    if (moving)
      glutIdleFunc(NULL);
  }
}

/* ARGSUSED1 */
void
keyboard(unsigned char ch, int x, int y)
{
  switch (ch) {
  case ' ':
    if (!moving) {
      tick();
      glutPostRedisplay();
    }
    break;
  case 27:             /* ESC */
    exit(0);
    break;
  }
}

#define ADD_PLANE	1
#define REMOVE_PLANE	2
#define MOTION_ON	3
#define MOTION_OFF	4
#define QUIT		5

void
menu(int item)
{
  switch (item) {
  case ADD_PLANE:
    add_plane();
    break;
  case REMOVE_PLANE:
    remove_plane();
    break;
  case MOTION_ON:
    moving = GL_TRUE;
    glutChangeToMenuEntry(3, "Motion off", MOTION_OFF);
    glutIdleFunc(animate);
    break;
  case MOTION_OFF:
    moving = GL_FALSE;
    glutChangeToMenuEntry(3, "Motion", MOTION_ON);
    glutIdleFunc(NULL);
    break;
  case QUIT:
    exit(0);
    break;
  }
}

int
main(int argc, char *argv[])
{
  int i;
  glutInit(&argc, argv);
  /* use multisampling if available */
  glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH | GLUT_MULTISAMPLE);
  glutCreateWindow("glutplane");
  glutDisplayFunc(draw);
  glutKeyboardFunc(keyboard);
  glutVisibilityFunc(visible);
  glutCreateMenu(menu);
  glutAddMenuEntry("Add plane", ADD_PLANE);
  glutAddMenuEntry("Remove plane", REMOVE_PLANE);
  glutAddMenuEntry("Motion off", MOTION_OFF);
  glutAddMenuEntry("Quit", QUIT);
  glutAttachMenu(GLUT_RIGHT_BUTTON);
  /* setup OpenGL state */
  glClearDepth(1.0);
  glClearColor(0.0, 0.0, 0.0, 0.0);
  glMatrixMode(GL_PROJECTION);
  glFrustum(-1.0, 1.0, -1.0, 1.0, 1.0, 20);
  glMatrixMode(GL_MODELVIEW);
  /* add 1024 initial random planes */
  srandom(getpid());
  for (i = 0; i < 1024; i++) add_plane();
  /* start event processing */
  glutMainLoop();
  return 0;             /* ANSI C requires main to return int. */
}

This gets 454 FPS on my system. My goal is to at least not be slower than that.

First up is the basic VBO implementation. I added this code at the beginning of the listing after the includes and added a call to init_fake_immediate() in main() after the glut setup code.

Code: Select all

#define IMM_CHUNK (1024*16)
#define VERT_ELT 12

static GLuint g_vbo = 0;
static GLenum g_mode = 0;
static int g_count = 0;
static GLfloat g_data[IMM_CHUNK * VERT_ELT];
static GLfloat g_texcoord[] = { 0.0, 0.0, 0.0, 1.0 };
static GLfloat g_color[] = { 1.0, 1.0, 1.0, 1.0 };

void
init_fake_immediate(void)
{
        glGenBuffers(1, &g_vbo);
}

void
mxglBegin(GLenum mode)
{
        g_mode = mode;
        g_count = 0;
}

void
mxglColor3f(GLfloat r, GLfloat g, GLfloat b)
{
        g_color[0] = r;
        g_color[1] = g;
        g_color[2] = b;
}

void
mxglVertex3f(GLfloat x, GLfloat y, GLfloat z)
{
        int i;

        if (g_count == IMM_CHUNK)
                return;

        i = g_count * VERT_ELT;

        g_data[i++] = x;
        g_data[i++] = y;
        g_data[i++] = z;
        g_data[i++] = 1.0;

        g_data[i++] = g_texcoord[0];
        g_data[i++] = g_texcoord[1];
        g_data[i++] = g_texcoord[2];
        g_data[i++] = g_texcoord[3];

        g_data[i++] = g_color[0];
        g_data[i++] = g_color[1];
        g_data[i++] = g_color[2];
        g_data[i++] = g_color[3];

        g_count++;
}

void
mxglEnd(void)
{
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glVertexPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 0));
        glTexCoordPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 4));
        glColorPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 8));
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glEnableClientState(GL_COLOR_ARRAY);

        glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * VERT_ELT * g_count, g_data, GL_STREAM_DRAW);

        glDrawArrays(g_mode, 0, g_count);

        glDisableClientState(GL_VERTEX_ARRAY);
        glDisableClientState(GL_TEXTURE_COORD_ARRAY);
        glDisableClientState(GL_COLOR_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
}

#if 1
#define glBegin mxglBegin
#define glEnd mxglEnd
#define glColor3f mxglColor3f
#define glVertex3f mxglVertex3f
#endif

The result: 449 FPS. That isn't too bad considering how naive the code is. It calls glBufferData() for every glBegin/glEnd pair, which forces the driver to waste a lot of time managing lots of tiny buffers, since glBufferData() allocates new storage for the buffer every time it's called.
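To put a number on "tiny": each vertex in g_data is 12 floats, so one paper plane (8 vertices) is only a few hundred bytes per glBufferData() call. Here's a rough sketch of the same layout written as a struct, which also makes the stride and offsets passed to the gl*Pointer() calls easier to see. ImmVertex and the imm_* helpers are made-up names for illustration (they aren't in the listing above), and the sketch assumes the compiler adds no padding to the struct.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical struct mirroring the 12 floats mxglVertex3f() writes per vertex. */
typedef struct {
    float position[4];  /* x, y, z, w */
    float texcoord[4];  /* s, t, r, q */
    float color[4];     /* r, g, b, a */
} ImmVertex;

/* Sanity check: the struct must be exactly 12 tightly packed floats. */
void imm_check_layout(void)
{
    assert(sizeof(ImmVertex) == 12 * sizeof(float));
}

/* Stride: distance in bytes from one vertex to the next. */
size_t imm_stride(void) { return sizeof(ImmVertex); }

/* Byte offsets of each attribute, as passed to the gl*Pointer() calls. */
size_t imm_position_offset(void) { return offsetof(ImmVertex, position); }
size_t imm_texcoord_offset(void) { return offsetof(ImmVertex, texcoord); }
size_t imm_color_offset(void)    { return offsetof(ImmVertex, color); }

/* Bytes uploaded by one glBegin/glEnd pair that emitted `count` vertices. */
size_t imm_upload_bytes(size_t count) { return count * sizeof(ImmVertex); }
```

With 4-byte floats the stride is 48 bytes, so imm_upload_bytes(8) is 384: every plane in the benchmark is its own 384-byte upload.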

For the next attempt we want to use glBufferSubData to update a single buffer so it doesn't have to keep allocating new buffers.

First the initialization code is changed to allocate the buffer:

Code: Select all

void
init_fake_immediate(void)
{
        glGenBuffers(1, &g_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * VERT_ELT * IMM_CHUNK, NULL, GL_DYNAMIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
}

Next the glBufferData() call in mxglEnd() is replaced with the glBufferSubData() call:

Code: Select all

glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(GLfloat) * VERT_ELT * g_count, g_data);

Test that out and we get 277 FPS. Oh no, we're getting even slower! The problem now is that while we aren't creating thousands of tiny buffers, our one buffer effectively forces the GPU to render only one polygon at a time.

At a minimum, we can't keep overwriting the first elements of the array. We want to rotate through the entire array so the GPU can work on one part of the array while we update it somewhere else.

So a "g_begin" variable is added that points to the next free spot in the array. When we hit the end of the buffer we move the current primitive to the front and reset "g_begin".
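The wrap-around bookkeeping is easy to get wrong, so here's a minimal sketch of just that part with no OpenGL in it. It assumes a tiny 16-slot array where one int stands in for a whole vertex. Ring, ring_begin() and ring_push() are made-up names that mirror what mxglBegin() and mxglVertex3f() do in the real code.

```c
#include <assert.h>

#define RING_CAPACITY 16  /* hypothetical; the real code uses IMM_CHUNK */

typedef struct {
    int begin;                /* first slot of the current primitive */
    int count;                /* vertices emitted so far in the current primitive */
    int wraps;                /* how many times we restarted at the front */
    int data[RING_CAPACITY];  /* one int stands in for a whole vertex */
} Ring;

/* Start a new primitive right after the previous one. */
void ring_begin(Ring *r)
{
    r->begin += r->count;
    r->count = 0;
}

/* Store one vertex. Returns the slot used, or -1 if the current
   primitive alone already fills the whole array. */
int ring_push(Ring *r, int vertex)
{
    int i;

    assert(r->begin + r->count <= RING_CAPACITY);

    if (r->count == RING_CAPACITY)
        return -1;

    if (r->begin + r->count == RING_CAPACITY) {
        /* out of room: slide the partial primitive to the front */
        for (i = 0; i < r->count; i++)
            r->data[i] = r->data[r->begin + i];
        r->begin = 0;
        r->wraps++;
    }

    i = r->begin + r->count;
    r->data[i] = vertex;
    r->count++;
    return i;
}
```

Pushing three 6-vertex primitives into the 16-slot array puts the first two at slots 0-5 and 6-11. The third starts at slot 12, runs out of room after 4 vertices, and gets moved to the front so it ends up at slots 0-5.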

Code: Select all

#define IMM_CHUNK (1024*16)
#define VERT_ELT 12

static GLuint g_vbo = 0;
static GLenum g_mode = 0;
static int g_begin = 0;
static int g_count = 0;
static GLfloat g_data[IMM_CHUNK * VERT_ELT];
static GLfloat g_texcoord[] = { 0.0, 0.0, 0.0, 1.0 };
static GLfloat g_color[] = { 1.0, 1.0, 1.0, 1.0 };

void
init_fake_immediate(void)
{
        glGenBuffers(1, &g_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * VERT_ELT * IMM_CHUNK, NULL, GL_DYNAMIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
}

void
mxglBegin(GLenum mode)
{
        g_mode = mode;
        g_begin += g_count;
        g_count = 0;
}

void
mxglColor3f(GLfloat r, GLfloat g, GLfloat b)
{
        g_color[0] = r;
        g_color[1] = g;
        g_color[2] = b;
}

static void
move_to_beginning(void)
{
        int i;

        for (i = 0; i < g_count * VERT_ELT; i++)
                g_data[i] = g_data[i + g_begin * VERT_ELT];

        g_begin = 0;
}

void
mxglVertex3f(GLfloat x, GLfloat y, GLfloat z)
{
        int i;

        if (g_count == IMM_CHUNK)
                return;

        if (g_begin + g_count == IMM_CHUNK)
                move_to_beginning();

        i = (g_begin + g_count) * VERT_ELT;

        g_data[i++] = x;
        g_data[i++] = y;
        g_data[i++] = z;
        g_data[i++] = 1.0;

        g_data[i++] = g_texcoord[0];
        g_data[i++] = g_texcoord[1];
        g_data[i++] = g_texcoord[2];
        g_data[i++] = g_texcoord[3];

        g_data[i++] = g_color[0];
        g_data[i++] = g_color[1];
        g_data[i++] = g_color[2];
        g_data[i++] = g_color[3];

        g_count++;
}

void
mxglEnd(void)
{
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glVertexPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 0));
        glTexCoordPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 4));
        glColorPointer(4, GL_FLOAT, sizeof(GLfloat) * VERT_ELT, (char *)(sizeof(GLfloat) * 8));
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glEnableClientState(GL_COLOR_ARRAY);

        glBufferSubData(GL_ARRAY_BUFFER, sizeof(GLfloat) * VERT_ELT * g_begin, sizeof(GLfloat) * VERT_ELT * g_count, g_data + VERT_ELT * g_begin);

        glDrawArrays(g_mode, g_begin, g_count);

        glDisableClientState(GL_VERTEX_ARRAY);
        glDisableClientState(GL_TEXTURE_COORD_ARRAY);
        glDisableClientState(GL_COLOR_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
}

#if 1
#define glBegin mxglBegin
#define glEnd mxglEnd
#define glColor3f mxglColor3f
#define glVertex3f mxglVertex3f
#endif

This gets, er, 276 FPS. OK, this is not going well, and I don't know why this isn't faster. But we have one more trick up our sleeve: we can "orphan" the buffer when it gets full. You do this by calling glBufferData() again with the same size and usage but a NULL data pointer. This tells OpenGL you no longer care about the old contents, so the driver can hand you fresh storage while the GPU finishes drawing from the old one.
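If the idea of orphaning is fuzzy, here's a toy model of it in plain C. This is only a sketch of the concept: ToyBuffer and the toy_* functions are made up, there's no OpenGL in it, and a real driver is free to implement orphaning however it likes (or not at all). The assumption is that the buffer name refers to one of several storage blocks, and a write only has to wait if queued draws still read from the current block.

```c
#include <assert.h>

#define POOL_SIZE 4  /* hypothetical number of storage blocks in the toy driver */

typedef struct {
    int pending_draws;  /* queued draws that still read from this block */
} Storage;

typedef struct {
    Storage pool[POOL_SIZE];
    int current;        /* block the buffer name currently refers to */
} ToyBuffer;

/* A draw reads from whatever block is current when it is queued. */
void toy_draw(ToyBuffer *b)
{
    b->pool[b->current].pending_draws++;
}

/* Writing into a block that queued draws still read from has to wait. */
int toy_write_would_stall(const ToyBuffer *b)
{
    return b->pool[b->current].pending_draws > 0;
}

/* Orphan: point the buffer name at an idle block. The old block stays
   untouched for the draws that still need it. Returns 1 on success,
   0 if every other block is still busy. */
int toy_orphan(ToyBuffer *b)
{
    int i;

    for (i = 0; i < POOL_SIZE; i++)
        if (i != b->current && b->pool[i].pending_draws == 0) {
            b->current = i;
            return 1;
        }
    return 0;
}

/* The "GPU" has finished every draw that read from block i. */
void toy_gpu_finish(ToyBuffer *b, int i)
{
    assert(i >= 0 && i < POOL_SIZE);
    b->pool[i].pending_draws = 0;
}
```

In this model, a write right after toy_draw() would stall, but after toy_orphan() it wouldn't, and the old block keeps its pending draw until toy_gpu_finish() is called for it.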

So we change the buffer overflow code to look like this and we're good to go:

Code: Select all

static void
orphan_vbo(void)
{
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * VERT_ELT * IMM_CHUNK, NULL, GL_DYNAMIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
}

void
mxglVertex3f(GLfloat x, GLfloat y, GLfloat z)
{
        int i;

        if (g_count == IMM_CHUNK)
                return;

        if (g_begin + g_count == IMM_CHUNK) {
                orphan_vbo();
                move_to_beginning();
        }

        /* ... the rest of mxglVertex3f() is the same as in the previous listing ... */
}

It now gets 554 FPS and I've had enough!

Post by **TeamHavocRacing** » Tue May 11, 2021 8:34 pm

Is this going to help silence all the whining kids complaining about dropping below 100fps?

Post by **jlv** » Wed May 12, 2021 12:40 am

TeamHavocRacing wrote: ↑Tue May 11, 2021 8:34 pm Is this going to help silence all the whining kids complaining about dropping below 100fps?

No. I mostly use immediate mode for UI and HUD stuff that isn't performance critical. I do use it for the billboards so it might be a bit faster there. The purpose of this is to make it "clean" OpenGL 3.1 which is basically OpenGL ES. I figure if I'm going to require 3.1 I may as well be ES compatible so I can run the game on a Raspberry Pi.

The performance improvements will come when I redo the terrain engine to rely on vertex shaders with no fallback path for OpenGL with no shaders. You might say this is the first step towards that but that's not technically true since OpenGL 3.1 still has immediate mode if the driver supports the compatibility profile which pretty much all of them do.

Post by **TeamHavocRacing** » Wed May 12, 2021 3:46 pm

I think once you can do that, you've taken away most of the reasons for complaints. Very interested in that!

Post by **ddmx** » Thu May 13, 2021 2:42 am

Very interesting! I code quite a bit and am always trying to optimize. All I could think of while reading was this video on creating an NES game with 40kb. Really interesting optimization techniques!

https://youtu.be/ZWQ0591PAxM

Post by **jlv** » Fri May 14, 2021 1:41 am

If I had limitless time I'd love to write a 2600 game. It doesn't even have a framebuffer. You have to move things into place before the raster beam gets there.

You'd probably get a kick out of this kkrieger video. A full FPS in 96k.

Post by **jlv** » Mon May 23, 2022 1:23 am

jlv wrote: ↑Fri May 07, 2021 12:37 am It now gets 554 FPS and I've had enough!

Turns out I hadn't had enough...

After integrating this into the game and posting it, it turned out this gets terrible performance on AMD's OpenGL driver. So what's the problem?

My first thought was AMD OpenGL might be ignoring the ranges in my glDrawRangeElements calls since it was now using indexed triangles instead of the more complex primitives in legacy OpenGL, but converting from indexed triangles to triangles with duplicated vertices didn't help.
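For anyone wondering what "indexed triangles instead of the more complex primitives" means in practice, here's a minimal sketch of turning quads and triangle strips into triangle index lists. This is not the game's actual conversion code. The function names are made up, and it assumes quad vertices are stored four at a time and that odd strip triangles get their first two indices swapped to keep the winding consistent.

```c
#include <assert.h>

/* Writes 6 indices (two triangles) per quad for `nquads` quads whose
   vertices are stored 4 at a time. Returns the number of indices written. */
int quads_to_triangles(int nquads, unsigned int *out)
{
    int q, n = 0;

    assert(nquads >= 0 && out != 0);

    for (q = 0; q < nquads; q++) {
        unsigned int base = (unsigned int)q * 4u;

        out[n++] = base;     out[n++] = base + 1; out[n++] = base + 2;
        out[n++] = base;     out[n++] = base + 2; out[n++] = base + 3;
    }
    return n;
}

/* Expands a triangle strip of `nverts` vertices into (nverts - 2)
   separate triangles. Returns the number of indices written. */
int strip_to_triangles(int nverts, unsigned int *out)
{
    int i, n = 0;

    assert(nverts >= 0 && out != 0);

    for (i = 0; i + 2 < nverts; i++) {
        unsigned int a = (unsigned int)i;

        if (i % 2 == 0) {
            out[n++] = a;     out[n++] = a + 1; out[n++] = a + 2;
        } else {
            /* swap the first two so every triangle winds the same way */
            out[n++] = a + 1; out[n++] = a;     out[n++] = a + 2;
        }
    }
    return n;
}
```

An 8-vertex strip like the paper plane from the first benchmark expands to 6 triangles, or 18 indices.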

So back to the benchmarks.

First the baseline OpenGL immediate mode test. It gets 304 FPS. So right off the bat it's significantly slower than the Mesa OpenGL that I initially tested on which did 454 FPS.

Next up is the one glBufferData call per primitive version. This gets 167 FPS. So AMD OpenGL obviously takes a big hit from using VBOs vs the legacy immediate mode API.

Naive glBufferSubData version: 127 FPS. This version is expected to be slow since it uses lots of big buffers.

glBufferSubData that rotates through the buffer: 127 FPS. This sucks. It has no excuse to be this slow since the buffer writes don't overlap.

Finally, with buffer orphaning: 127 FPS. This is awful. None of the optimizations that work on other OpenGL versions work.

So it seems like no matter what you do, it chokes. What's happening? My guess is whenever you write to a buffer with glBufferSubData, AMD OpenGL waits for all rendering out of that buffer to finish before it does the write. This effectively makes your GPU that can render thousands of triangles simultaneously draw them one at a time instead. This is so bad that things like buffer orphaning to allow parallelism across buffers aren't even on the map.

How do you deal with something this bad? The only solution is to not do partial buffer updates. You have to write the whole buffer at once. That's easy to say but hard to do, since to do the drawing immediately, you'd have to know the future vertex values that will be written in the buffer, and to draw afterwards you need to capture the entire OpenGL state at the time of the drawing so you can repeat the exact same drawing commands with the now full buffer. It's not possible to know the future so the only way is to capture the state and replay it later.
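To make "capture the state and replay it later" a bit more concrete, here's a rough sketch of the shape such a thing could take. It's a simplification, not the game's code: Frame, DrawCmd, Stats and the frame_* functions are made-up names, it only captures a texture and a matrix (real code would need every piece of state the draw depends on), and counters stand in for the actual GL calls so it can run without a context.

```c
#include <assert.h>
#include <string.h>

#define MAX_CMDS 64  /* hypothetical limit for this sketch */

typedef struct {
    int mode;          /* primitive type passed to the fake glBegin() */
    int first, count;  /* vertex range inside the frame's staging array */
    unsigned texture;  /* texture that was bound when the draw was recorded */
    float matrix[16];  /* modelview matrix at that moment */
} DrawCmd;

typedef struct {
    DrawCmd cmds[MAX_CMDS];
    int ncmds;
    int nverts;        /* vertices staged so far this frame */
} Frame;

typedef struct {
    int uploads, draws, verts_uploaded;  /* stand-ins for real GL calls */
} Stats;

/* Record a draw instead of issuing it. Returns 0 if the list is full. */
int frame_record(Frame *f, int mode, int count, unsigned texture,
                 const float *matrix)
{
    DrawCmd *c;

    assert(count > 0);

    if (f->ncmds == MAX_CMDS)
        return 0;

    c = &f->cmds[f->ncmds++];
    c->mode = mode;
    c->first = f->nverts;
    c->count = count;
    c->texture = texture;
    memcpy(c->matrix, matrix, sizeof(c->matrix));
    f->nverts += count;
    return 1;
}

/* One upload for the whole frame, then every recorded draw in order. */
void frame_flush(Frame *f, Stats *s)
{
    int i;

    /* real code: a single buffer upload covering all f->nverts vertices */
    s->uploads++;
    s->verts_uploaded += f->nverts;

    for (i = 0; i < f->ncmds; i++) {
        /* real code: restore cmds[i].texture and cmds[i].matrix, then
           draw cmds[i].count vertices starting at cmds[i].first */
        s->draws++;
    }

    f->ncmds = 0;
    f->nverts = 0;
}
```

The point of the design is that no matter how many glBegin/glEnd pairs get recorded, the buffer is only written once per flush, and no draw is issued until after that write.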

But before starting on something that complex, it's a good idea to confirm it with a simpler test.

Code: Select all

/* This is still loosely based on glutplane so I'm keeping Mark Kilgard's copyright notice */
/* Copyright (c) Mark J. Kilgard, 1994. */

/* This program is freely distributable without licensing fees 
   and is provided without guarantee or warrantee expressed or 
   implied. This program is -not- in the public domain. */

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/glcorearb.h>
#include <GL/glut.h>

static time_t g_seconds = 0;
static int g_frames = 0;
static int g_mode = 0;
static char *g_mode_string = "";

#define CHECKS_SIZE 64
#define CHECKS_VERT_ELT 10
static GLuint g_checks_vbo = 0;
static GLfloat g_checks_data[CHECKS_SIZE * CHECKS_SIZE * 4 * CHECKS_VERT_ELT];
#define NTEXTURES 8
static GLuint g_textures[NTEXTURES];

static void
set_vec2(GLfloat *p, float x, float y)
{
	p[0] = x;
	p[1] = y;
}

static void
set_vec4(GLfloat *p, float x, float y, float z, float w)
{
	p[0] = x;
	p[1] = y;
	p[2] = z;
	p[3] = w;
}

static void
init_checks(void)
{
	int i, x, y, p;
	float c;

	glGenBuffers(1, &g_checks_vbo);
	glBindBuffer(GL_ARRAY_BUFFER, g_checks_vbo);

	for (y = 0; y < CHECKS_SIZE; y++)
		for (x = 0; x < CHECKS_SIZE; x++) {
			p = (y * CHECKS_SIZE + x) * 4 * CHECKS_VERT_ELT;
			c = ((x + y) % 2 == 0) ? 1.0 : 0.0;

			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 0, 0.0, 0.0, 0.0, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 1, 1.0, 0.0, 0.0, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 2, 1.0, 1.0, 0.0, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 3, 0.0, 1.0, 0.0, 1.0);

			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 0 + 4, c, c, c, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 1 + 4, c, c, c, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 2 + 4, c, c, c, 1.0);
			set_vec4(g_checks_data + p + CHECKS_VERT_ELT * 3 + 4, c, c, c, 1.0);

			set_vec2(g_checks_data + p + CHECKS_VERT_ELT * 0 + 8, 0.0, 0.0);
			set_vec2(g_checks_data + p + CHECKS_VERT_ELT * 1 + 8, 1.0, 0.0);
			set_vec2(g_checks_data + p + CHECKS_VERT_ELT * 2 + 8, 1.0, 1.0);
			set_vec2(g_checks_data + p + CHECKS_VERT_ELT * 3 + 8, 0.0, 1.0);
		}

	glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * CHECKS_SIZE * CHECKS_SIZE * 4 * CHECKS_VERT_ELT, g_checks_data, GL_DYNAMIC_DRAW);

	glGenTextures(NTEXTURES, g_textures);
	for (i = 0; i < NTEXTURES; i++) {
		GLubyte pixels[3];

		pixels[0] = 255 - ((i >> 0) & 1) * 255;
		pixels[1] = 255 - ((i >> 1) & 1) * 255;
		pixels[2] = 255 - ((i >> 2) & 1) * 255;

		glBindTexture(GL_TEXTURE_2D, g_textures[i]);
		glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 1, 1, 0, GL_RGB, GL_UNSIGNED_BYTE, pixels);
	}
	glBindTexture(GL_TEXTURE_2D, 0);
}

enum { NO_UPLOAD, BUFFERDATA, BUFFERSUBDATA, BUFFERSUBDATA_CHUNKS, BUFFERSUBDATA_INTERLEAVED };
enum { NO_ORPHAN, ORPHAN };

static void
custom_draw(int upload, int orphan)
{
	int x, y;

	glVertexPointer(4, GL_FLOAT, sizeof(GLfloat) * CHECKS_VERT_ELT, (char *)(sizeof(GLfloat) * 0));
	glColorPointer(4, GL_FLOAT, sizeof(GLfloat) * CHECKS_VERT_ELT, (char *)(sizeof(GLfloat) * 4));
	glTexCoordPointer(2, GL_FLOAT, sizeof(GLfloat) * CHECKS_VERT_ELT, (char *)(sizeof(GLfloat) * 8));
	glEnableClientState(GL_VERTEX_ARRAY);
	glEnableClientState(GL_COLOR_ARRAY);
	glEnableClientState(GL_TEXTURE_COORD_ARRAY);

	glEnable(GL_TEXTURE_2D);
	glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);

	glClearColor(0.5, 0.5, 0.5, 1.0);
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

	if (upload == BUFFERDATA)
		glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * CHECKS_SIZE * CHECKS_SIZE * 4 * CHECKS_VERT_ELT, g_checks_data, GL_DYNAMIC_DRAW);
	else if (upload == BUFFERSUBDATA)
		glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(GLfloat) * CHECKS_SIZE * CHECKS_SIZE * 4 * CHECKS_VERT_ELT, g_checks_data);
	else if (upload == BUFFERSUBDATA_CHUNKS)
		for (y = 0; y < CHECKS_SIZE; y++)
			for (x = 0; x < CHECKS_SIZE; x++) {
				int begin, count;

				begin = (y * CHECKS_SIZE + x) * 4 * CHECKS_VERT_ELT;
				count = 4 * CHECKS_VERT_ELT;

				glBufferSubData(GL_ARRAY_BUFFER, sizeof(GLfloat) * begin, sizeof(GLfloat) * count, g_checks_data + begin);
			}

	for (y = 0; y < CHECKS_SIZE; y++) {
		glBindTexture(GL_TEXTURE_2D, g_textures[y % NTEXTURES]);
		for (x = 0; x < CHECKS_SIZE; x++) {
			if (upload == BUFFERSUBDATA_INTERLEAVED) {
				int begin, count;

				begin = (y * CHECKS_SIZE + x) * 4 * CHECKS_VERT_ELT;
				count = 4 * CHECKS_VERT_ELT;

				glBufferSubData(GL_ARRAY_BUFFER, sizeof(GLfloat) * begin, sizeof(GLfloat) * count, g_checks_data + begin);
			}

			glPushMatrix();
			glTranslatef(0.0, 0.0, -6.0);
			glScalef(1.0 / CHECKS_SIZE * 10.0, 1.0 / CHECKS_SIZE * 10.0, 1.0);
			glTranslatef(x - CHECKS_SIZE / 2.0, y - CHECKS_SIZE / 2.0, 0.0);
			glDrawArrays(GL_QUADS, (x + y * CHECKS_SIZE) * 4, 4);
			glPopMatrix();
		}
	}

	glBindTexture(GL_TEXTURE_2D, 0);

	if (orphan == ORPHAN)
		glBufferData(GL_ARRAY_BUFFER, sizeof(GLfloat) * CHECKS_SIZE * CHECKS_SIZE * 4 * CHECKS_VERT_ELT, NULL, GL_DYNAMIC_DRAW);

	glDisable(GL_TEXTURE_2D);

	glDisableClientState(GL_VERTEX_ARRAY);
	glDisableClientState(GL_COLOR_ARRAY);
	glDisableClientState(GL_TEXTURE_COORD_ARRAY);

	glutSwapBuffers();
}

static void
draw(void)
{
	switch (g_mode) {
	case 0:
	case 1:
	case 2:
		custom_draw(NO_UPLOAD, NO_ORPHAN);
		g_mode_string = "no upload";
		break;
	case 3:
	case 4:
		custom_draw(BUFFERDATA, NO_ORPHAN);
		g_mode_string = "bufferdata";
		break;
	case 5:
	case 6:
		custom_draw(BUFFERSUBDATA, NO_ORPHAN);
		g_mode_string = "buffersubdata";
		break;
	case 7:
	case 8:
		custom_draw(BUFFERSUBDATA_CHUNKS, NO_ORPHAN);
		g_mode_string = "buffersubdata chunks";
		break;
	case 9:
	case 10:
		custom_draw(BUFFERSUBDATA_INTERLEAVED, NO_ORPHAN);
		g_mode_string = "buffersubdata interleaved";
		break;
	case 11:
	case 12:
		custom_draw(BUFFERDATA, ORPHAN);
		g_mode_string = "bufferdata orphan";
		break;
	case 13:
	case 14:
		custom_draw(BUFFERSUBDATA, ORPHAN);
		g_mode_string = "buffersubdata orphan";
		break;
	case 15:
	case 16:
		custom_draw(BUFFERSUBDATA_CHUNKS, ORPHAN);
		g_mode_string = "buffersubdata chunks orphan";
		break;
	default:
		custom_draw(BUFFERSUBDATA_INTERLEAVED, ORPHAN);
		g_mode_string = "buffersubdata interleaved orphan";
		break;
	}
}

static void
tick(void)
{
}

static void
animate(void)
{
	time_t t = time(NULL);
	tick();
	glutPostRedisplay();

	if (g_seconds != 0 && g_seconds != t) {
		if (g_mode > 0)
			printf("%d frames in %ld second(s) %s\n", g_frames, (long)t - (long)g_seconds, g_mode_string);
		g_mode++;
		g_frames = 0;
	}

	g_seconds = t;
	g_frames++;
}

static void
visible(int state)
{
	if (state == GLUT_VISIBLE) {
		glutIdleFunc(animate);
	} else {
		glutIdleFunc(NULL);
	}
}

/* ARGSUSED1 */
static void
keyboard(unsigned char ch, int x, int y)
{
	switch (ch) {
	case ' ':
		tick();
		glutPostRedisplay();
		break;
	case 27:             /* ESC */
		exit(0);
		break;
	}
}

int
main(int argc, char *argv[])
{
	glutInit(&argc, argv);
	/* use multisampling if available */
	glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH | GLUT_MULTISAMPLE);
	glutCreateWindow("vbo test");
	glutDisplayFunc(draw);
	glutKeyboardFunc(keyboard);
	glutVisibilityFunc(visible);
	init_checks();
	/* setup OpenGL state */
	glClearDepth(1.0);
	glClearColor(0.0, 0.0, 0.0, 0.0);
	glMatrixMode(GL_PROJECTION);
	glFrustum(-1.0, 1.0, -1.0, 1.0, 1.0, 20);
	glMatrixMode(GL_MODELVIEW);
	/* start event processing */
	glutMainLoop();
	return 0;
}

This draws a colored check pattern using the following modes of operation:

"no upload" - doesn't update the buffer. This should be the "speed of light" in that it shouldn't be possible for dynamic data to go faster than this static test. 282 FPS.

"bufferdata" - updates the entire buffer with glBufferData before drawing. 275 FPS.

"buffersubdata" - updates the entire buffer with glBufferSubData before drawing. 278 FPS.

"buffersubdata chunks" - updates with glBufferSubData uploading one primitive at a time but drawing them all after the upload. 190 FPS.

"buffersubdata interleaved" - updates with glBufferSubData uploading one primitive at a time interleaved with the draw calls. 34 FPS.

(It also tests all of these modes with buffer orphaning, but AMD OpenGL apparently doesn't notice it, since it didn't affect the frame rates.)

This definitely proves AMD OpenGL falls on its face if you don't update the entire buffer in one call. It's slow even if the updates aren't interleaved with the drawing.
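A quick back-of-the-envelope on the AMD numbers above: the test draws CHECKS_SIZE * CHECKS_SIZE = 4096 quads per frame, so the "chunks" and "interleaved" modes each make 4096 glBufferSubData() calls per frame. Assuming (and it's only an assumption) that the whole slowdown versus "no upload" comes from those calls, you can estimate the extra cost per call. The helper names are made up for this sketch.

```c
#include <assert.h>

/* Milliseconds per frame at a given frame rate. */
double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Extra microseconds each of `ncalls` upload calls adds to a frame,
   going from base_fps (no uploads) down to slow_fps. */
double extra_us_per_call(double base_fps, double slow_fps, int ncalls)
{
    assert(ncalls > 0);
    return (frame_ms(slow_fps) - frame_ms(base_fps)) * 1000.0 / ncalls;
}
```

Plugging in the numbers gives roughly 0.4 microseconds per call for "chunks" (190 vs 282 FPS) and roughly 6.3 microseconds per call for "interleaved" (34 vs 282 FPS), so interleaving the uploads with the draws makes each one about 15 times more expensive. The FPS counter only has one-second resolution, so treat these as ballpark figures.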

For the curious, here are the numbers from Linux/Mesa OpenGL on the same hardware:

no upload 412 FPS
bufferdata 405 FPS
buffersubdata 427 FPS
buffersubdata chunks 152 FPS
buffersubdata interleaved 69 FPS
bufferdata orphan 421 FPS
buffersubdata orphan 402 FPS
buffersubdata chunks orphan 298 FPS
buffersubdata interleaved orphan 277 FPS

(Mesa is *way* faster. If you're a Windows AMD user you should definitely keep an eye on the Mesa/Zink OpenGL on Vulkan library for Windows. I bet it'll be a huge improvement over AMD's OpenGL.)

So what's the lesson here? Update your entire buffer with glBufferSubData and don't interleave drawing with your buffer updates if you care about running on AMD's OpenGL. It's extremely inconvenient but the only way to go fast on everything. Despite what numerous articles on OpenGL optimization say, buffer orphaning is pointless since it doesn't work on one of the most common OpenGL implementations.

Post by **TeamHavocRacing** » Mon May 23, 2022 6:21 am

jlv wrote: ↑Mon May 23, 2022 1:23 am... turned out this gets terrible performance on AMD's OpenGL driver.

Over a decade ago I learned how bad AMD was with GL. It gave me an argument for NVIDIA. I also remember weird texture issues that looked like z-fighting. MX Bikes has it now. Never had to think about it again once I stuck with the switch to NVIDIA.

Post by **jlv** » Tue May 24, 2022 1:31 am

TeamHavocRacing wrote: ↑Mon May 23, 2022 6:21 am Over a decade ago I learned how bad AMD was with GL. It gave me an argument for NVIDIA. I also remember weird texture issues that looked like z-fighting. MX Bikes has it now. Never had to think about it again once I stuck with the switch to NVIDIA.

Nvidia hired all the OpenGL guys from SGI back in the glquake days so it's no wonder that their OpenGL driver is good. I remember seeing Mark Kilgard, Michael Gold and Jon Leech all suddenly have Nvidia email addresses instead of their usual SGI addresses and thinking I should probably buy some Nvidia stock. (I didn't and NVDA is up around 10,000% since then.)

For Linux AMD is actually better since the GPU has documentation which gives you open source drivers that have great performance and endless support. Much better than depending on Nvidia's closed driver that they can end support for at any moment.

Post by **TeamHavocRacing** » Tue May 24, 2022 12:43 pm

Well, it's good to know that there's an argument for AMD. Otherwise, the only difference between them I can notice is the kinda greenish tint Nvidia has vs. AMD's reddish tint.
