7 April 2021

Tiled point lights.

by Tommi

The issues causing less-than-great performance:
Level Design placed hundreds of lights in their levels.
A particular problem area was the long “well”-lit corridors that occasionally popped up.

I quickly realised that half of every point light (or more) was OUTSIDE the corridor.
The lights were placed on walls and ceilings, so if you could use the depth buffer it would filter out almost half of all the pixels that needed lighting.
With the inverted light meshes in use, that was not possible.

A solution: split the point lights up.

Conclusion: Decent performance boost, it’s now down to 9-10ms per frame. Still bad.

What next, then? Googling deferred point light rendering turned up tiled rendering, so here is how I implemented it.
First, setup code to toggle between the existing rendering and the new tiled path at runtime, using Dear ImGui.

static bool RENDER_POINT_LIGHTS_TILED = false;
if (ImGui::TreeNode("Deferred Renderer Lights"))
{
    ImGui::Checkbox("Render Tiled?", &RENDER_POINT_LIGHTS_TILED);
    ImGui::TreePop(); // every opened TreeNode needs a matching TreePop
}

Time to learn compute shaders then. (DirectX11)

No issues so far, it seems pretty straightforward. Then I got hit by this.

Tiled shading works by

  1. Sending all the lights to the shader.
  2. Splitting the screen into smaller tiles, for example 16x16 px.
  3. Culling the lights against each tile’s frustum.
  4. Rendering only the remaining lights on each tile.

First we want to make it work, so we know we’re doing it right.
So let’s skip steps 2 and 3 for now; they’re there for optimization later.

First create a new buffer to store all the lights and their colors. We can define this in 2 float4 variables for the shader.

struct PointLightData {
    float4 myPositionAndRange;
    float4 myColorAndIntensity;
};

And a matching struct for the C++ code.

struct PointLightData
{
    Vector4 myPositionAndRange;
    Color myColorAndIntensity;
};
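
Since the C++ struct gets copied straight into the GPU buffer, the two layouts have to agree byte for byte. A minimal sketch of how that could be checked at compile time, assuming Vector4 and Color are both four tightly packed floats (the stand-in types, kMaxLights and LightBufferSize below are hypothetical, not the engine's own):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the engine's types; assumed to be 4 packed floats each.
struct Vector4 { float x, y, z, w; };
struct Color   { float r, g, b, a; };

struct PointLightData
{
    Vector4 myPositionAndRange;   // assumed from the name: xyz = position, w = range
    Color   myColorAndIntensity;  // assumed from the name: rgb = color, a = intensity
};

// An HLSL float4 is 16 bytes, so the shader-side struct is 32 bytes per light.
static_assert(sizeof(PointLightData) == 32, "must match two HLSL float4s");
static_assert(offsetof(PointLightData, myColorAndIntensity) == 16,
              "second float4 must start at byte 16");

const std::size_t kMaxLights = 256;  // capacity mentioned in the text

// Size in bytes of a light buffer holding lightCount elements.
inline std::size_t LightBufferSize(std::size_t lightCount)
{
    assert(lightCount <= kMaxLights);
    return lightCount * sizeof(PointLightData);
}
```

With 256 lights this comes out at 8 KB, and the static_asserts stop the build if someone later adds a member on only one side.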

A light buffer with space for 256 lights was created, and the framebuffer from the earlier method was extended with a myLightCount variable. There was also a small change to how UV is converted to view space, since compute shaders run per pixel and don’t go through the rasterization step.
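
Without the rasterizer there is no interpolated UV, so it has to be rebuilt from the pixel index. A rough sketch of that conversion, written in C++ for illustration; the helper names are made up, and the half-pixel offset and Y flip are assumptions based on the usual Direct3D conventions, not taken from the actual shader:

```cpp
#include <cassert>

struct Float2 { float x, y; };

// Pixel index -> UV in [0,1], sampled at the pixel centre.
inline Float2 PixelToUV(unsigned px, unsigned py, unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return { (px + 0.5f) / width, (py + 0.5f) / height };
}

// UV -> normalized device coordinates in [-1,1].
// Y is flipped because UV grows downward while NDC Y grows upward in Direct3D.
inline Float2 UVToNdc(Float2 uv)
{
    return { uv.x * 2.0f - 1.0f, 1.0f - uv.y * 2.0f };
}
```

From the NDC point, the rest of the path to view space would go through the projection as before.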

Step 1 is done. Now for step 4: for the moment we just let it run through all the lights, with the usual per-pixel lighting.

We just made it work! Time to make it right then!

Let’s go with step 2 before 3; splitting the screen into smaller parts is not that hard.
Dispatch fewer groups and increase numthreads; I started with groups of 16x16 pixels.
That’s no problem, since we still have SV_DispatchThreadID to get the exact pixel we’re working on.
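
The number of groups to dispatch follows from the screen size and the tile size. A small sketch of that arithmetic, assuming one thread per pixel and rounding up so partial tiles at the screen edge are still covered (GroupCount and TileGroupCount are illustrative names, not from the original code):

```cpp
#include <cassert>

struct GroupCount { unsigned x, y; };

// Number of thread groups needed to cover the screen with tileSize x tileSize tiles.
// Rounds up, so the shader must skip threads that land outside the screen.
inline GroupCount TileGroupCount(unsigned width, unsigned height, unsigned tileSize)
{
    assert(tileSize > 0);
    return { (width + tileSize - 1) / tileSize,
             (height + tileSize - 1) / tileSize };
}
```

At 2560x1440 with 16x16 tiles that gives 160x90 groups, which would be the X and Y arguments to ID3D11DeviceContext::Dispatch with Z set to 1.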

Step 2 turned out to be no problem at all. Let’s go to the culling then with step 3.

We need a groupshared array to hold the lights and a counter for how many entries are filled.
To save space we only store the light indices in the array.

groupshared uint ourLightIndexCounter;
groupshared uint ourLightIndices[MAX_LIGHTS_PER_TILE];

Let each thread construct a frustum. Ignore Z for now.
Calculate the point on the view (-1 to 1 in x, -1 to 1 in y) from the pixel coordinates.
Use the projection matrix to get the direction to the corners.
Then use the view matrix to get the planes in world coordinates.
The planes are never rotated or moved, so we can define them efficiently in Hesse normal form.
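
In Hesse normal form a plane is just a unit normal plus a distance, which makes the point-to-plane distance a single dot product. A minimal sketch in C++, assuming the plane is built from three points (for a frustum side that could be the camera position and two neighbouring tile corners); all names here are illustrative and not the shader's own:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3  Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3  Cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Hesse normal form: every point p on the plane satisfies Dot(normal, p) == distance.
struct Plane { Vec3 normal; float distance; };

// Builds the plane through a, b and c; the winding order decides which way the normal points.
inline Plane PlaneFromPoints(Vec3 a, Vec3 b, Vec3 c)
{
    Vec3 n = Cross(Sub(b, a), Sub(c, a));
    const float length = std::sqrt(Dot(n, n));
    assert(length > 0.0f);  // the three points must not be collinear
    n = { n.x / length, n.y / length, n.z / length };
    return { n, Dot(n, a) };
}

// Positive on the side the normal points to, negative on the other side.
inline float SignedDistance(const Plane& plane, Vec3 p)
{
    return Dot(plane.normal, p) - plane.distance;
}
```

Storing four floats per plane keeps the frustum small, and the culling test only ever needs SignedDistance.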

Next we have to cull the point lights. We can split this work across all the threads like this.
This is not actual code, it just shows how the work is split.

uint startI = SV_GroupThreadID.x + SV_GroupThreadID.y * NUMTHREADS_X;
for (uint i = startI; i < myPointLightCount; i += NUMTHREADS_X * NUMTHREADS_Y)
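
To convince yourself that this split touches every light exactly once, the loop can be replayed on the CPU. A small C++ simulation under that assumption, with made-up function names, where the nested tx/ty loops stand in for the threads of one group:

```cpp
#include <cassert>
#include <vector>

// Counts how often each light index is visited when all threads of one group
// run the strided loop.
inline std::vector<unsigned> CountVisits(unsigned numThreadsX, unsigned numThreadsY,
                                         unsigned lightCount)
{
    assert(numThreadsX > 0 && numThreadsY > 0);
    std::vector<unsigned> visits(lightCount, 0u);
    const unsigned stride = numThreadsX * numThreadsY;
    for (unsigned ty = 0; ty < numThreadsY; ++ty)
        for (unsigned tx = 0; tx < numThreadsX; ++tx)
        {
            const unsigned start = tx + ty * numThreadsX;  // flat index of this thread
            for (unsigned i = start; i < lightCount; i += stride)
                ++visits[i];
        }
    return visits;
}

// True when no light is skipped and none is tested twice.
inline bool EveryLightVisitedOnce(unsigned numThreadsX, unsigned numThreadsY,
                                  unsigned lightCount)
{
    for (unsigned v : CountVisits(numThreadsX, numThreadsY, lightCount))
        if (v != 1)
            return false;
    return true;
}
```

With 16x16 threads and at most 256 lights, each thread ends up testing at most one light.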

In the loop we test whether the distance from the point light’s position to each of the frustum’s planes is less than the radius of the point light.
If that holds for all planes, we increment the counter and push the index into the shared array.
Then we sync all the threads in the group before applying lighting.
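
The test itself is a sphere against a handful of planes. A serial C++ sketch of it, assuming the plane normals point out of the frustum and using a plain vector in place of the groupshared array and counter (all names are illustrative). In the shader the append would presumably need an atomic such as InterlockedAdd on the counter, with GroupMemoryBarrierWithGroupSync as the sync before lighting:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3   { float x, y, z; };
struct Plane  { Vec3 normal; float distance; };  // assumed: unit normal pointing out of the frustum
struct Sphere { Vec3 center; float radius; };    // a point light's position and range

inline float SignedDistance(const Plane& plane, Vec3 p)
{
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z
           - plane.distance;
}

// A light survives when its centre is less than one radius outside every plane.
inline bool SphereTouchesFrustum(const Sphere& light, const std::vector<Plane>& planes)
{
    for (const Plane& plane : planes)
        if (SignedDistance(plane, light.center) >= light.radius)
            return false;
    return true;
}

// Serial stand-in for the group's shared index list and counter.
inline std::vector<unsigned> CullLights(const std::vector<Sphere>& lights,
                                        const std::vector<Plane>& planes,
                                        std::size_t maxLightsPerTile)
{
    assert(maxLightsPerTile > 0);
    std::vector<unsigned> indices;
    for (unsigned i = 0; i < lights.size(); ++i)
    {
        if (indices.size() >= maxLightsPerTile)
            break;  // list is full; remaining lights are dropped for this tile
        if (SphereTouchesFrustum(lights[i], planes))
            indices.push_back(i);
    }
    return indices;
}
```

The test is conservative: a light near a frustum corner can pass every plane test without actually touching the tile, which costs a little performance but never drops a visible light.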

Finally, instead of looping over all lights per pixel while lighting, we loop over the indices in our shared array and pick out the matching lights.

Without culling lights on the CPU, it now takes about 5ms. That’s reasonable.

It’s now done “right”. A few things I would like to do for improvements are.

The specs I’ve been running this on: 2560x1440 resolution, Intel Core i5-8600K 3.6GHz (6 CPUs), Nvidia GeForce GTX 1060 6GB.