Rendering Polygon Scenes With Ray Tracing

JDB
Posts: 41
Joined: August 5th, 2015, 9:50 am
Location: Australia

Rendering Polygon Scenes With Ray Tracing

Post by JDB » April 29th, 2016, 4:36 am

Hey guys, a few months ago I posted a thread about rendering point cloud objects. Since that time my knowledge of 3D rendering has vastly improved and I want to take the time to share some of my insights and discoveries before I forget all of it. I was going to post a long thread but I decided to put it into a PDF document because it turned out much longer than expected. I updated my point cloud engine to implement the ideas discussed in the paper so it's still based on the Chili Engine and it's still a VS 2012 project.

Instead of rendering point cloud objects, Pray Engine now renders traditional polygon objects as well as analytical spheres. It also includes many bug fixes and new features, such as support for normal maps, surface transparency, colored shadows, anti-aliasing, and more. Keep in mind it won't be fast, because it's a prototype engine that doesn't use the GPU at all and is mostly written for readability rather than speed. My next goal is to write OpenCL kernels based on the prototype code to handle all of the parallel processing and make it run at real-time speeds.

It is set up to load a generic scene containing a simple room and a matchbox floating in the center of the room. You can move around using the WASD keys like a normal game, but you can also tilt the camera with Q and E, move up and down using the page up and page down keys, and control some lights with the arrow keys. When I get some free time I'll try to upload specifications for all the engine file formats. For now just check the source code if you wish to know how it works. I also made an app for converting OBJ files to the Pray Engine format if anybody is interested.

The source code can be found on GitHub: Ray-Tracer CPU Prototype

Screenshot of scene rendered by Pray Engine with 4x AA (4 rays per pixel):

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Rendering Polygon Scenes With Ray Tracing

Post by albinopapa » April 29th, 2016, 11:31 pm

Oh, man, I want to SSE and multithread the crap out of that engine so bad. I know you want to port to OpenCL and that would be awesome as well. I like pushing the CPU to get a good idea on just how much of a difference there is between CPU and GPU. I've heard 10x GPU vs CPU, but not sure if that takes into account all the vector acceleration and multithreading a CPU can do or if it's x86 single core performance they are comparing it to.

Obviously, GPU is going to be amazingly fast. They have somewhere between ~500 and ~4000 threads or 8 - 40 compute units.

The PDF is a pretty interesting read. I really like the explanation of Euclideon's "Unlimited Detail" engine and the problems they will have to overcome to make it game-worthy.

(Side note) I wonder, if you used their idea with a triangle-based control mesh, whether it would help with animation.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

JDB
Posts: 41
Joined: August 5th, 2015, 9:50 am
Location: Australia

Re: Rendering Polygon Scenes With Ray Tracing

Post by JDB » April 30th, 2016, 2:15 am

albinopapa wrote:Oh, man, I want to SSE and multithread the crap out of that engine so bad. I know you want to port to OpenCL and that would be awesome as well. I like pushing the CPU to get a good idea on just how much of a difference there is between CPU and GPU. I've heard 10x GPU vs CPU, but not sure if that takes into account all the vector acceleration and multithreading a CPU can do or if it's x86 single core performance they are comparing it to.

Obviously, GPU is going to be amazingly fast. They have somewhere between ~500 and ~4000 threads or 8 - 40 compute units.
The advantage a GPU has over a CPU really depends on how sequential the problem is. If a problem's steps cannot be computed at the same time, then the GPU will not make much difference, because the GPU's strength is parallel processing. I wrote Pray Engine in such a way that every ray fired can be computed at the same time, which means it will scale far better on the GPU. Right now the CPU has to loop through every ray one after the other, and that's obviously going to be much slower.
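To make the per-ray independence concrete, here's a minimal sketch (with made-up names, not the actual engine code): because each pixel's color depends only on its own ray and never on its neighbours, the outer pixel loops map directly onto one OpenCL work-item (or one CPU thread) per pixel.

```cpp
#include <cstddef>
#include <vector>

struct Color { float r, g, b; };

// Hypothetical stand-in for the engine's per-ray work: the color of pixel
// (x, y) is a pure function of that pixel alone.
Color TracePixel( int x,int y,int width,int height )
{
    // placeholder shading (a gradient), standing in for the real ray trace
    return { float( x ) / width,float( y ) / height,0.0f };
}

// Every call is independent, so this double loop can be split across
// threads, or each iteration can become one GPU work-item.
std::vector<Color> RenderFrame( int width,int height )
{
    std::vector<Color> frame( size_t( width ) * height );
    for( int y = 0; y < height; y++ )
        for( int x = 0; x < width; x++ )
            frame[size_t( y ) * width + x] = TracePixel( x,y,width,height );
    return frame;
}
```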
The PDF is a pretty interesting read. I really like the explanation of Euclideon's "Unlimited Detail" engine and the problems they will have to overcome to make it game-worthy.
Thanks for taking the time to read it, definitely well worth a read if you want to mess around with Pray Engine. Even when I do convert it to use OpenCL, there will still be quite a bit of code which will need to run on the CPU.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Rendering Polygon Scenes With Ray Tracing

Post by albinopapa » April 30th, 2016, 6:08 pm

Btw, I ran the code but had to change a few things to get it running, like removing OpenCL.lib from the additional dependencies, and adding #include <algorithm> because the min/max functions weren't being defined.

Not sure you need the precision of doubles, especially for real-time rendering; you could probably get a little boost by using floats instead.

Added a timer to get fps; I'm getting around 0.25 frames per second in 64-bit Release mode on this AMD A10-7870K with 8 GB of DDR3-1333 memory.

Going off the info from the PDF, it might be a good idea, for organization and easier portability, to separate the code into the render passes as you describe. I haven't fully looked through the code yet. For best processing efficiency, you'd want your data contiguous. I wonder if there would be too much of a performance hit to create the lists as described in the pdf, then copy that data to an array/buffer that can be more efficiently processed using SSE. I don't know if this would help with any GPU implementation as well, but if you were going to use HSA I believe it would be a must.

Intel, nVidia and AMD all have their own way of implementing HSA. As far as software support, C++/AMP is one library, though it doesn't take advantage of the benefits of HSA, like being able to pass a pointer to the GPU; instead it treats the GPU and CPU as different entities and forces you to copy buffers between the two. OpenCL 2.0, I believe, does allow you to pass pointers to the GPU, but I haven't gotten far enough through the API yet to determine if this is true.

I'd love to "stay in the loop" as you progress through this project while porting to OpenCL or whatever you decide to do with it.

JDB
Posts: 41
Joined: August 5th, 2015, 9:50 am
Location: Australia

Re: Rendering Polygon Scenes With Ray Tracing

Post by JDB » May 1st, 2016, 3:37 am

albinopapa wrote: Btw, I ran the code but had to change a few things to get it running, like removing OpenCL.lib from the additional dependencies, and adding #include <algorithm> because the min/max functions weren't being defined.
Sorry about that, thought I had removed all the OpenCL dependencies. I did manage to get OpenCL working correctly in VS 2012 and I have a version of Pray Engine which is already set up to use OpenCL. I just need to get around to writing the kernels. I have quite a bit of other work that I need to get done before I can focus on that though, so it'll probably be quite a while before I get around to it. I'll keep you updated though.
Not sure you need the precision of doubles, especially for real time rendering, could probably get a little boost by using floats instead.
I plan to use floats as much as possible when I convert it to OpenCL, but I want to try to keep using doubles for world positions otherwise it will be very difficult to create extremely large game worlds, such as with space sims.
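One common compromise for this (an assumption on my part, not what Pray Engine actually does) is to keep world positions in doubles but subtract the camera position in double precision first, so the float values handed to the renderer stay small and precise even in a huge world:

```cpp
// Camera-relative positioning sketch: the large magnitudes cancel in the
// double-precision subtraction, so the narrowed floats keep their precision.
struct Vec3d { double x, y, z; };
struct Vec3f { float x, y, z; };

Vec3f ToCameraRelative( const Vec3d& world,const Vec3d& camera )
{
    // difference is computed in double, then narrowed to float
    return { float( world.x - camera.x ),
             float( world.y - camera.y ),
             float( world.z - camera.z ) };
}
```

A float on its own loses sub-meter precision around a billion units from the origin, which is why space sims often render everything relative to the camera.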
Added a timer to get fps, am getting around 0.25 frames per second in 64 bit Release mode on this AMD A10-7870K with 8 GB 1333 DDR 3 memory.
Keep in mind that's with 4x AA enabled and it's only using one core of your CPU. You can change the level of AA and some other settings in the Resource.h file.
Going off the info from the PDF, it might be a good idea, for organization and easier portability, to separate the code into the render passes as you describe. I haven't fully looked through the code yet.
I have already tried to do that, the 4 rendering stages are broken up into separate functions, which will become separate kernels for each stage, designed to manipulate the buffers common to all stages.
Intel, nVidia and AMD all of their own way of implementing HSA. As far as software support, C++/AMP is one library, though doesn't take advantage of the benefits of HSA like being able to pass a pointer to the GPU, instead it treats the GPU and CPU as different entities and forces you to copy buffers between the two.
Well it seems to me that treating the CPU and GPU as separate entities is really the right way to go, because you want to avoid communicating with the CPU as much as possible and avoid using the slow system RAM. I'll probably need to hold most of the textures and meshes on the GPU for the fastest processing times. I believe most video games hold all the level textures and meshes in GPU memory, so GPU memory becomes the biggest limiting factor. Many games get around this problem by loading objects on the fly; for example, Skyrim will discard some objects and load new objects when you walk into a new area cell, only keeping data for the closest 36 cells or something like that.

I'm not entirely sure how it works though, I need to do a lot more research on that type of stuff and how OpenCL works before I can begin actually writing the kernels.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Rendering Polygon Scenes With Ray Tracing

Post by albinopapa » May 2nd, 2016, 3:00 pm

JDB wrote: Sorry about that, thought I had removed all the OpenCL dependencies. I did manage to get OpenCL working correctly in VS 2012 and I have a version of Pray Engine which is already set up to use OpenCL. I just need to get around to writing the kernels. I have quite a bit of other work that I need to get done before I can focus on that though, so it'll probably be quite a while before I get around to it. I'll keep you updated though.
No problem. I would like to know if you have actually gotten the OpenCL version to run. I'm having a bit of trouble getting multi-threading working because all the variables are members of Game instead of locally scoped. I know the lists will have to be members, and to modify them you have to set up some sort of lock so that threads can safely add or remove from them, but it doesn't seem necessary for all the variables to be members. I'm slowly making progress in whittling them down, though. Making the obvious ones locally scoped, and some of the less obvious ones locally scoped and passed as reference or pointer parameters to other functions, hasn't affected performance negatively.
JDB wrote: I plan to use floats as much as possible when I convert it to OpenCL, but I want to try to keep using doubles for world positions otherwise it will be very difficult to create extremely large game worlds, such as with space sims.
Changing all the doubles to floats, including the literals, didn't seem to boost the performance really. After changing all to floats and making some of the variables locally scoped I now get 0.26 frames per second, a 10% increase. I understand the desire for doubles, just wanted to test the performance difference. With such slow performance in its current state, it really doesn't make a difference. My guess right now is there are two reasons for the slow performance. I changed the AA down to 1; 0 won't build since you can't declare a 0-element array. Got around 1 fps, so it scaled linearly. The first thing I'm going to attribute the performance to is allocating and deallocating so many things on the fly. The second thing is copying everything, which I believe you call caching.

Allocations and especially deallocations are really slow. Perhaps it would be better to preallocate for things and instead of creating a list of "new" objects, just pass the address of the objects you want in the list. Instead of "deleting" objects from the list, you would just reset the count and overwrite the elements in the list. This might not work since it's a linked list though. The reason I am claiming this as a bottleneck is watching the performance analysis in VS 2015. Each frame shows a rise and fall of memory usage from around 200 MB to over 500 MB.
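The preallocation idea above can be sketched as a simple bump ("arena") allocator. This is a minimal illustration with hypothetical names, not code from the engine: grab one big block up front, hand out slices of it, and "free" everything at the start of each frame by resetting a single offset instead of calling delete per object.

```cpp
#include <cstddef>
#include <vector>

// Minimal per-frame bump allocator sketch.
class FramePool
{
public:
    explicit FramePool( size_t bytes ) : storage( bytes ),used( 0 ) {}
    void* Alloc( size_t bytes )
    {
        if( used + bytes > storage.size() ) return nullptr; // pool exhausted
        void* p = &storage[used];
        used += bytes;
        return p;
    }
    // Start of each frame: everything is "freed" at once, with no
    // per-object deallocation cost and no heap fragmentation.
    void Reset() { used = 0; }
    size_t Used() const { return used; }
private:
    std::vector<unsigned char> storage;
    size_t used;
};
```

Linked-list nodes allocated this way keep their pointer links within the pool; only their lifetimes become per-frame rather than per-node.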

Copying data can be slow and is definitely limited by memory bandwidth and latency. Haven't figured this part of the code out yet to see if there is an alternative, so I can't give any advice here.
JDB wrote: Keep in mind that's with 4x AA enabled and it's only using one core of your CPU. You can change the level of AA and some other settings in the Resource.h file.
As mentioned above, changing the AA to 0 won't build, and changing it to 1 gave 4x more performance, up to 1 fps; didn't try 2 or 3. Can't do multi-threading quite yet, still making changes to make it more efficient so I don't have to block threads to make updates/changes.
JDB wrote: I have already tried to do that, the 4 rendering stages are broken up into separate functions, which will become separate kernels for each stage, designed to manipulate the buffers common to all stages.
Sorry about that, I didn't fully look over the code before posting. I see the separate stages now.
JDB wrote: Well it seems to me that treating the CPU and GPU as separate entities is really the right way to go, because you want to avoid communicating with the CPU as much as possible and avoid using the slow system RAM. I'll probably need to hold most of the textures and meshes on the GPU for the fastest processing times. I believe most video games hold all the level textures and meshes in GPU memory, the GPU memory therefore becomes the biggest limiting factor. In many games they will get around this problem by loading objects on the fly, for example Skyrim will discard some objects and load new objects when you walk into a new area cell, only keeping data for the closest 36 cells or something like that.
Treating the GPU and CPU as separate entities is, I believe, unrelated to the point I was trying to make. From the perspective of making HSA commonplace it doesn't make sense to treat them as separate entities. HSA is supposed to make it easier and more efficient to write code for general processing using a GPU.

Memory bandwidth is an issue using APU-type processors. HBM/HBM2 will really help if AMD or Intel decide to use it. I've read rumors that AMD's Zen APU will have one or the other, and I can't wait to see how it performs. I believe for ray tracing, though, memory bandwidth isn't really the issue; it's going to be the sheer number of calculations. The CPU only handles around 20 gigaflops while the GPU can handle around 4-7 teraflops, hundreds of times more processing power than the CPU. The PS4 or the PS4K (PS4.5), whichever it was, is an APU-based console and handles in the teraflop range of processing power, so I don't think APUs should be counted out.
JDB wrote: I'm not entirely sure how it works though, I need to do a lot more research on that type of stuff and how OpenCL works before I can begin actually writing the kernels.
So, after spending a few hours over the last couple of days with your code, I'm also having trouble finding a good way of implementing SSE. I've tried converting your Vect class, which is one way of handling it, but I keep running into problems since a lot of your code relies heavily on accessing the individual components for condition tests for branching. I'm sure if I knew a lot more there would be ways to reduce the number of branches, but my knowledge is quite limited. Also, I keep getting weird results: in the first run I wound up with a screen of smeared colors, and in the second run I ended up with just a black screen. So I've given up on that route for now and am eventually going to focus on vectorizing the main code instead. Just in case, what I mean by vectorizing is handling four calculations at the same time through the use of the SSE registers, which hold four floats at a time, or two elements in the case of doubles. There is also the issue that you don't have operators overloaded; instead you use functions like vectAdd and addVect.
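For anyone following along, this is the kind of vectorization being described, in a tiny self-contained sketch (VecAdd4 is an illustrative name, not the engine's vectAdd/addVect): one SSE register holds four floats, so four component-wise additions happen in a single instruction.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add four float pairs at once. loadu/storeu are the unaligned variants,
// so the arrays need no special alignment.
void VecAdd4( const float* a,const float* b,float* out )
{
    __m128 va = _mm_loadu_ps( a );
    __m128 vb = _mm_loadu_ps( b );
    _mm_storeu_ps( out,_mm_add_ps( va,vb ) );
}
```

The catch the post runs into is real: once a result has to be pulled back out component-by-component for branching, most of the SIMD win evaporates, which is why contiguous, branch-light data layouts matter so much.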

Not sure if it's something I've done or changed, but it crashes in debug mode when trying to present the frame. Runs fine in release mode though.

JDB
Posts: 41
Joined: August 5th, 2015, 9:50 am
Location: Australia

Re: Rendering Polygon Scenes With Ray Tracing

Post by JDB » May 3rd, 2016, 7:18 am

albinopapa wrote: No problem. I would like to know if you have actually gotten the OpenCL version to run. I'm having a bit of trouble getting multi-threading working because all the variables are members of Game instead of locally scoped.
Yeah, I wanted to avoid juggling a bunch of variables between all the functions so I did it like that, although it's clearly not good coding practice because it's harder to understand what's happening. I also reused many of the variables for different things to avoid declaring a boatload of variables. I'm sure you will find it very hard to convert them to locally scoped variables without introducing bugs. When writing the OpenCL kernels I'll try to keep everything properly scoped; it just didn't seem necessary for the prototype code. Like I said, I do have OpenCL properly configured and it's successfully able to compile some test kernel code.
Changing all the doubles to floats including the literals didn't seem to boost the performance really. After changing all to floats and making some of the variables locally scoped I now get 0.26 frames per second, a 10% increase. I understand the desire for doubles, just wanted to test performance difference.
10% is an appreciable increase in speed actually, and I'm sure that using floats as much as possible will have a much larger impact on the GPU, because GPUs are more optimized for operations on floats.
My guess right now is there are two reasons for the slow performance. I changed the AA down to 1; 0 won't build since you can't declare a 0-element array. Got around 1 fps, so it scaled linearly.
The AA level is really the pixel subray count, meaning how many rays are sent through each pixel, so 0 obviously won't work and it is expected to scale linearly. You can set it to 1, 4, 9, 16, etc. In other words it can be any square number, because the pixel will be broken up into a square grid. For example, an AA level of 9 would mean a 3x3 grid, with one ray for each square on the grid.
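The square-grid rule above can be sketched like this (an illustration of the idea, not the engine's actual sampling code; the real engine may order or position the subrays differently):

```cpp
#include <cmath>
#include <vector>

struct Offset { double x, y; };

// For a square AA level (1, 4, 9, 16, ...), split the pixel into an
// n x n grid and fire one subray through the centre of each cell.
// Offsets are in pixel-local [0,1) coordinates.
std::vector<Offset> SubrayOffsets( int aaLevel )
{
    const int n = int( std::sqrt( double( aaLevel ) ) );
    std::vector<Offset> offsets;
    for( int row = 0; row < n; row++ )
        for( int col = 0; col < n; col++ )
            offsets.push_back( { ( col + 0.5 ) / n,( row + 0.5 ) / n } );
    return offsets;
}
```

The final pixel color is then the average of the subray results, which is why cost scales linearly with the AA level.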
The first thing I'm going to attribute the performance to is allocating and deallocating so many things on the fly. The second thing is copying everything which I believe you call caching.

Allocations and especially deallocations are really slow. Perhaps it would be better to preallocate for things and instead of creating a list of "new" objects, just pass the address of the objects you want in the list. Instead of "deleting" objects from the list, you would just reset the count and overwrite the elements in the list. This might not work since it's a linked list though. The reason I am claiming this as a bottleneck is watching the performance analysis in VS 2015. Each frame shows a rise and fall of memory usage from around 200 MB to over 500 MB.
The vertex caching part of the code is not copying variables, it's storing world space positions which would need to be computed at many different stages if they weren't cached the first time they were computed. The places where I do copy variables are mainly to avoid heavy use of pointers, which can also be very slow. The resolution also has a large impact on speed, if you move a little bit away from the scene so that some rays don't hit anything, you'll see the frame rate increase quite a bit. As you said, memory allocations and deallocations can be very slow, that's one of the main problems with linked lists and it's not really possible to avoid.

Your suggestion would defeat the main purpose of using linked lists in the first place, which is to only use as much memory as we need instead of having a huge preallocated array. Plus, we pretty much have to use a linked list because we never really know how much depth each list will require. Some rays may hit nothing, in which case we don't need to add anything to the list for that ray, but some rays may hit 100 different objects, in which case we need lots of depth; we clearly cannot preallocate such a large array for each pixel, especially when memory is so precious on the GPU.
Treating the GPU and CPU as separate entities is, I believe, unrelated to the point I was trying to make. From the perspective of making HSA commonplace it doesn't make sense to treat them as separate entities. HSA is supposed to make it easier and more efficient to write code for general processing using a GPU.
Yes, I think HSA is where we are headed in the future, but most people don't have an HSA setup and neither do I. Also, I think the engine should be designed to work on as many systems as possible and not rely on a specialized system architecture.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Rendering Polygon Scenes With Ray Tracing

Post by albinopapa » May 3rd, 2016, 3:37 pm

I kind of get where you are coming from on wanting to save memory; however, linked lists don't save memory. You still need to have the memory available when the lists are at their fullest. I understand now why you are using them: you won't know how big to make the buffer. However, the same holds true for linked lists; you won't know how much memory you are going to use and could run out of GPU memory anyway. Then there is the issue of memory fragmentation, which occurs from allocating and deallocating from random places.

I was able to get a significant boost by creating a memory pool of around 1 GB and allocating memory from that pool instead of using operator new. Instead of deleting everything each frame, I kept the SLLinkedList::clear function setting the old pointers to null, but not deleting, and as for the memory pool, I just set its usage back to 0. That way the data can be overwritten and no allocations/deallocations have to be done. I went from 0.25 fps to 0.35 fps, a 40% increase. The amount of the pool actually used by the end of the frame was around 136 MB at the default position, orientation and render settings. Got another 30% by successfully converting the Vect class to use the DirectXMath library, which uses SSE2 instructions, so I'm now getting ~0.45 fps.

This could be a bit higher I believe if the data was laid out differently. The key is going to be finding a way to gather all necessary data in the beginning of the stage, so that you can just run the calculations in the last stage, instead of sprinkling the calculations in with the collection stage.

Right now you don't have a background and a static background wouldn't look good in a game, so you'd probably have a box or sphere to hold your sky texture or you'd be inside where a ray is going to hit something no matter what. So no matter what, your initial ray is going to hit something, whether it be a sky box or ceiling or glass roof then sky box. If I'm missing something let me know, but I believe the rays would collect UV coordinates and materials for each surface hit. I believe it's around 300 bytes of data collected, so you would need (screen_width * screen_height * AA_Level * 300) to be able to render just the initial ray. That's around 768 MB of data collected.

That might be over estimating though, as stated in previous post, the memory usage would fluctuate between 200 and 500 MB using the new/delete route each frame, didn't check Task Manager. With preallocating a 1GB memory pool, I used only 136-137 MB for the lists and total memory usage stayed around 400 MB, apparently Task Manager doesn't update until you actually use the memory and not just allocate. The memory pool wasn't used with the 3 buffers in BufferSet nor the mesh and textures, just for the linked lists.

As far as I know, you can't use pointers on graphics cards so you won't be able to use linked lists anyway. You will need to create buffers of finite size to store your data. Granted, you can use the CPU to create the lists, then create the buffer then create and bind the buffer to the graphics card then copy the buffer from the system RAM to the graphics RAM, but I don't think that's going to work too well either.

Sorry if I seem completely ignorant; there is still a lot I don't understand. My knowledge is pretty narrow and is based solely on personal experience with SSE and limited D3D/DirectCompute. Heck, even the little bit of math I do understand came mostly from Chili's advanced tutorials, so some of your code I don't understand just because I don't understand the math. Right now I'm just fumbling through replacing code with what I do understand. I've also broken the Lambertian lighting, even though the code for it is still there and is being processed, so that isn't where the speedup is coming from.

I think, because of the millions of loops in the code, the best performance the CPU is going to get is around 4 fps, and that's being generous, at least on my poor APU. I'm sure with a newer 8-threaded i7 you might get 6 fps, unless you count using the integrated graphics as still being CPU rendering, though a by-product of that would be that it could be rendered on a dedicated GPU as well. Even to get those frame times, you'd have to reorganize the code so that the variables are locally scoped, to be able to use multi-threading, and the data would have to be laid out contiguously to take advantage of the CPU cache and SSE vectorization.

Thanks for letting me know that the variables are being used for different purposes, that may be helpful in unraveling the code a bit. This really is a neat project and it has lifted my spirits a bit about coding, was getting a little burnt out until you posted this thing. I think it's helped me realize I don't have a passion for making games, just tweaking the engine seems to be about the only thing that gets me going.

JDB
Posts: 41
Joined: August 5th, 2015, 9:50 am
Location: Australia

Re: Rendering Polygon Scenes With Ray Tracing

Post by JDB » May 4th, 2016, 7:49 am

There are 2 main places linked lists are used. The first is to generate the 2D optimization structure. Seeing that we never know how many surfaces a given ray might intersect, it's a very bad idea to preallocate an array with a limited depth because the ray might miss the first few objects in the list and if we limit the size of the list then we might not include the object which the ray actually hit. If we preallocate an array with a large depth then it will consume a lot of memory even though most of the array is never utilized, which is why linked lists do save memory.

The other place linked lists are used is for storing the surfaces intersected by a ray. If we just want to record the closest intersection then we don't need to use linked lists, but if we want to allow transparency then we need to record all the semi-transparent surfaces the ray travelled through before hitting an opaque surface. That is what the 3rd rendering stage does. However in this case it's ok to enforce a limit on the size of the list because it's ok if we can only see through a limited number of semi-transparent surfaces, so we could actually use an array in this case if we didn't mind using the extra memory. Keep in mind, most rays will not hit a semi-transparent surface, let alone multiple semi-transparent surfaces.
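The walk through that hit list can be sketched as front-to-back alpha compositing. This is a hedged illustration of the idea with made-up names, not the engine's stage-3 code: semi-transparent hits are blended in near-to-far order until an opaque hit (alpha of 1) terminates the list.

```cpp
#include <vector>

struct Hit { float r, g, b, alpha; }; // alpha = opacity of the surface

// Blend the surfaces a ray passed through, nearest first. `remaining`
// tracks how much of the pixel is still unoccluded; an opaque hit drives
// it to zero and ends the walk, which is why the list can be short.
Hit CompositeFrontToBack( const std::vector<Hit>& hitsNearToFar )
{
    Hit result{ 0.0f,0.0f,0.0f,0.0f };
    float remaining = 1.0f;
    for( const Hit& h : hitsNearToFar )
    {
        result.r += remaining * h.alpha * h.r;
        result.g += remaining * h.alpha * h.g;
        result.b += remaining * h.alpha * h.b;
        remaining *= ( 1.0f - h.alpha );
        if( remaining <= 0.0f ) break; // opaque surface: stop
    }
    result.alpha = 1.0f - remaining;
    return result;
}
```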

I chose not to use an array simply because linked lists are fast at inserting new items at any point in the list, and I already had a linked list class written. I believe there are ways to use linked lists on the GPU because it's done with order-independent transparency, but I'm guessing it's quite difficult and for that reason I'll probably preallocate an array when I convert it to OpenCL. As for the 2D optimization structure, that kind of needs to use linked lists but I think it could be handled by the CPU since it's a fairly easy job to compute it. The following information is from the order-independent transparency wiki page and should make it a bit clearer to you why linked lists are useful for this:
* The first was storing the fragment data in a 3D array,[4] where fragments are stored along the z dimension for each pixel x/y. In practice, most of the 3D array is unused or overflows, as a scene's depth complexity is typically uneven. To avoid overflow the 3D array requires large amounts of memory, which in many cases is impractical.

Two approaches to reducing this memory overhead exist.

* Packing the 3D array with a prefix sum scan, or linearizing,[5] removed the unused memory issue but requires an additional depth complexity computation rendering pass of the geometry. The "Sparsity-aware" S-Buffer, Dynamic Fragment Buffer,[6] "deque" D-Buffer[citation needed], Linearized Layered Fragment Buffer[7] all pack fragment data with a prefix sum scan and are demonstrated with OIT.

* Storing fragments in per-pixel linked lists[8] provides tight packing of this data and in late 2011, driver improvements reduced the atomic operation contention overhead making the technique very competitive.[7]

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Rendering Polygon Scenes With Ray Tracing

Post by albinopapa » May 4th, 2016, 9:07 am

Linked lists on GPU.
It's not really a linked list like on the CPU, but I see what's going on. It's a group of buffers: one stores the location of the surface, one stores the color, and one stores the location (index) of the previously encountered surface along with the depth value of each surface.

So if I understand correctly, you would run your first stage, which finds the objects that are going to affect the scene, on the CPU, building a list. Then you'd render those objects, storing the location, calculated color, depth, and any previous index stored at that pixel. After you finish gathering all the data, you can just walk the list of indices to get back the color and depth information, so you can correctly sort the transparent surfaces and calculate the color for that pixel.
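That buffer layout can be sketched on the CPU like this (illustrative names, and the atomic counter a real GPU version needs is just a plain counter here): a flat node buffer plus a per-pixel head index, with links stored as indices rather than pointers, which is exactly why it works inside GPU buffers.

```cpp
#include <vector>

// One fragment record; `next` is the index of the fragment previously
// stored at the same pixel, or -1 for end-of-list.
struct FragNode
{
    float depth;
    int   color;   // stand-in for packed RGBA
    int   next;
};

struct FragmentLists
{
    std::vector<int>      head;  // one slot per pixel, -1 = empty list
    std::vector<FragNode> nodes; // shared append buffer for all pixels

    explicit FragmentLists( int pixelCount ) : head( pixelCount,-1 ) {}

    // On the GPU, `idx` would come from an atomic counter so many
    // threads can append concurrently.
    void Push( int pixel,float depth,int color )
    {
        const int idx = int( nodes.size() );
        nodes.push_back( { depth,color,head[pixel] } );
        head[pixel] = idx;
    }
};
```

Walking `head[pixel]` through the `next` indices yields that pixel's fragments newest-first, ready to be depth-sorted for transparency.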

You are right about Stage1 being pretty quick with just the few items that are in the scene, once you get a few hundred or a few thousand in there, it may be a different story though.

Supposedly, the biggest benefit of ray tracing over ray casting/rasterization is that performance doesn't degrade as fast with more polygons; it degrades mainly with resolution, which means more rays having to be calculated.

Anyway, I got about half the variables switched to local scope within functions, passing references, and instead of copying I'm just getting a reference to some of the variables. I've gotten the fps up to 0.5 with 4x AA, from the starting position and orientation. That's up from 0.45 with the other things I've mentioned. Lighting works again, but now the shadow is off; it's on the left wall instead of on the floor. Also, the light bulb is not being rendered correctly, while the matchbox is.

Just thought you'd like an update. If I can figure out how to rearrange the code to store the gathered data in a contiguous buffer, or at least pile all the major calculations together, I might be able to get closer to 1 fps using SSE. As it stands, I think I'm just going to end up breaking it beyond my ability to put it back together, so I may just finish localizing the variables and implement multithreading. If I could get it to scale linearly with core count, I'd get just under 2 fps on my 4-core. Not quite playable frame rates, but I'm sure if I knew what I was doing it could be a bit higher.
