Multithreading

The Partridge Family were neither partridges nor a family. Discuss.
albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Multithreading

Post by albinopapa » June 3rd, 2016, 7:24 pm

I've said this before and I believe it's happening again. I have been loosely following a ray tracing tutorial on youtube, probably the same one that JDB used, the coding style looks the same, but not the point. I wanted to test multithreading and what I'm running into is the performance is about the same as single threaded performance around 7-8 fps and the CPU usage is only at 33 - 40%.

The ray tracer at the moment hasn't come very far, it's not even rendering the image correctly, but it's suppose to be rendering a sphere and a plane with shadows. Without shadows, just getting the color of the objects the ray intersects, I was getting around ~25 fps both before setting up threads and after, even the usage didn't change much. I figured there wasn't enough work being performed between locks and stuff, so I waited until I had gotten the shadow calculations in before considering there being an issue.

So, anyone who knows what might be going on, please have a look.

https://github.com/albinopapa/Multithreaded_RayTracer
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cameron
Posts: 794
Joined: June 26th, 2012, 5:38 pm
Location: USA

Re: Multithreading

Post by cameron » June 3rd, 2016, 10:43 pm

Your biggest problem is that you are locking for too long. All it is doing is getting ready to process the data when you lock. Due to this pretty much only one thread is doing actual work while the others wait for it to finish. I was able to improve fps by 100%.
Original: 17 FPS, 15% CPU usage
New: 38 FPS, 50% CPU usage

You need to come up with a way to get all synchronization done with little overhead. It works much better if you can have all the threads working simultaneously without much synchronization. I used a modified version of the multithreaded code I posted a few weeks ago to get my results.
Computer too slow? Consider running a VM on your toaster.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Multithreading

Post by albinopapa » June 4th, 2016, 4:53 am

Going to have to explain it to me, what I'm doing wrong.

I need the threads to render during the ComposeFrame call, and the main thread can't continue until all threads are done. So it seems logical I need to notify the worker threads when I call ComposeFrame and have the main thread fall asleep, then when they are done, they notify the boss thread and go to sleep while EndFrame is called.

I think I see the problem, the unique_lock is locking during the whole RunThreads function, am I getting close?

Actually, I think I need to make sure all worker threads of checked in and went to sleep before the boss thread can "wake" them. Notifying threads that haven't gone to sleep yet would seem kind of pointless. So, count how many threads have checked in, the last one checking in would then notify the boss thread that they are ready. The boss thread would do what it needs to do, then notify all threads before going to sleep. Does that sound right?
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

User avatar
chili
Site Admin
Posts: 3948
Joined: December 31st, 2011, 4:53 pm
Location: Japan
Contact:

Re: Multithreading

Post by chili » June 4th, 2016, 8:41 am

I took a quick look, and I might be wrong but it looks to me like you have one mutex shared by all the worker threads. You need a separate mutex for each worker, first of all. Separate mutex and cv for each worker thread, or you can make it so that the worker threads don't hold the mutex throughout operation. I don't like that approach though because it means there will be guaranteed contention when you start your workers.
Chili

cameron
Posts: 794
Joined: June 26th, 2012, 5:38 pm
Location: USA

Re: Multithreading

Post by cameron » June 4th, 2016, 3:01 pm

Yea your method of notification seemed good, it was similar to what I did.

Yea the unique lock is the problem, it holds the lock until it loses scope. Because it is locked for almost the entire duration of the thread only one thread can function at a time. The best solution is to reduce the amount of time spent whie locking(take code out the lock). Like I moved all the looping put of the mutex because it does not need synchronized other things like putpixel may or may not need synchronization but I am guessing not based on the code. It took some work but I managed to take pretty much all the code out of the loop. But I mean if the camera was moving around and stuff it would be best to load it before any processing. I didnt know much about the code but from what I can see most of it can be moved out of the lock.

The second biggest problem is how you are dividing the workload, but dont worry about this until the other problem is fixed. Parallel processing works much better if they take workload from a global queue take it out in small chunks depending on the total amount of data to process(CAS loop would ne better but more difficult to impliment). The problem with a fixed workload from the start is other threads will finish while the others still have lots to proccess. In this situation I got the best performance taking items out one by one out using a CAS loop, using a mutex I bet it would work better taking several out a once.

Multithreading and concurrency can be difficult to get right because things like data races and other things may come up where you never would have thought of. Sometimes you have to be creative in order to reduce the amount of locking needed. I updated my parallel processing code that might help.

->Chili:
I never thought of waiting for notifications(starting, exiting etc.) as being a cause of slowdown because when you use parallel processing you usually process a ton of data so that constant overhead of waiting on the same mutex nThread times would be neglectible. But yea the major slowdown is because the entire function is pretty much wrapped by the mutex.
Computer too slow? Consider running a VM on your toaster.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Multithreading

Post by albinopapa » June 4th, 2016, 6:25 pm

Cam, most of what you said makes sense. The only time that would need to be synchronized, is before and after the rendering. RayTracing works on one pixel at a time, so there is virtually no overlap when it comes to writing to memory. I know that even with ray tracing, there might be spots that have less to render than others, so dividing the workload up equally like I did is a naive approach, but for what I have so far, all steps are the same and the only way one thread might finish before another, is all the early outs, return statements. That's not to say, one thread might finish before another because of scheduling from the OS though.

I have looked over your parallel processing code, and I wasn't sure how tailor it to my needs. It seemed too tied to the demo and didn't know which parts were able to be extracted.

@chili, I am not quite sure if having conditional variables and mutexes for each thread would fit this situation, maybe just for the two separate parts of code, one for rendering and one for the master thread?
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cameron
Posts: 794
Joined: June 26th, 2012, 5:38 pm
Location: USA

Re: Multithreading

Post by cameron » June 4th, 2016, 6:51 pm

I am not really familiar with ray tracing, so I may be wrong about doing this. However, I have looked over all the code in the RunThreads function. I think as long as there are no duplicate writes to the same pixel there shouldn't need to be any synchronization within the function itself other than start and end cv's. Because most of these functions work on local lists and variables there shouldn't be any race conditions.

There are several reasons why one thread might finish faster than the other many of which you have mentioned:
-OS scheduling
-Threads not starting at exact same time due to mutex
-Contention between threads

Here is a sample of what I have done:

Game.h

Code: Select all

   #define NTHREADS 4
	friend void RunThreads(uint32_t threadIndex, int& index, Game& game);
	MParallelProcessor<void, int, Game> mp;
	std::vector<float> obj_intersects[NTHREADS], shadow_intersects[NTHREADS];
	int* data;
ctor

Code: Select all

:
mp(NTHREADS, /*1*/1000, &RunThreads, *this)
{
   const int a = screen_w * screen_h;
	data = new int[a];
	for (int i = 0; i < a; i++)
	{
		data[i] = i;
	}
}

RunThreads

Code: Select all

void RunThreads(uint32_t threadIndex, int& index, Game& game)
{
	int y = index / screen_w;
	int x = index % screen_w;
	float y_amount = ((float)y + 0.5f) * rec_sh;

	float x_left = ((float)x + 0.5f) * rec_sw;
	float x_amount = (x_left * aspect) - game.pix_offset_precalc;

	game.obj_intersects[threadIndex].clear();

	int gIndex = game.RenderImage(x_amount, y_amount, game.obj_intersects[threadIndex]);

	if (gIndex < 0)
		return;

	const float intersection_dist = game.obj_intersects[threadIndex][gIndex];
	Vec3 intersection_pos = game.cam.position + (game.cam.direction * intersection_dist);

	if (x == 0 && y == screen_h / 2)
	{
		int a = 0;
	}
	Color intersection_color = game.GetColorAt(intersection_pos, game.cam.direction,
		game.obj_list, gIndex, game.pt_light, game.shadow_intersects[threadIndex], game.ambient);

	game.gfx.PutPixel(x, y, intersection_color);
}
Compose Frame

Code: Select all

	mp.StartProcessAndWait(data, screen_w * screen_h);
Last edited by cameron on June 4th, 2016, 7:51 pm, edited 3 times in total.
Computer too slow? Consider running a VM on your toaster.

cameron
Posts: 794
Joined: June 26th, 2012, 5:38 pm
Location: USA

Re: Multithreading

Post by cameron » June 4th, 2016, 7:36 pm

Upon further examination I can tell there are many skipped pixels when rendering, though it doesn't have anything to due with the thread count, when you set it to 1 it doesn't change anything.

Edit:
I finally found the problem, it was something I tried when fixing the access violation bug with my parallel processing code. It was skipping every other index so there were tons of wholes. After fixing that fps dropped down to 25!
Computer too slow? Consider running a VM on your toaster.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Multithreading

Post by albinopapa » June 4th, 2016, 8:02 pm

The first 120 lines are blank. It's because the plane goes off into infinity there along the horizon, the dark arching bands I haven't figured out yet. I didn't follow the tutorial to the "T" and since I don't understand all the code involved, I'm kind of flying blind. Instead of overloading the math operators, +,-,*,/ the user created a vectAdd, negate and vectMul. This normally wouldn't make it too hard to convert to using the operator overloads, but because of the chaining of function calls, he has to break the steps down into separate lines, or have a really long chain.

Just as an aside, he also uses .at() instead of the [] for vectors, never passing by reference and almost never reuses his vectors when they are going to be used for a different purpose. For instance, he calculates light direction in two or three different places instead of reusing the vector, because it is used for something other than the formula he originally intended it to be used for. In one of the FindIntersection functions, he gets all the individual components of all the vectors used and does all the subtraction and dot products, even though his vect class already had methods to handle all that. All of these little things basically make it a nightmare to confidently convert.

To really drive it home, something like
Vect a,b,c;
c = b - a would look like
c = b.vectAdd(a.negate());

or
float a,b,c
c = -( b / a ) would be
c = -1 * ( b / a );

I know some prefer
c = b + (-a);
but I'm not one of them.


I guess his way is more mathematically accepted or literal, with the -1 * and + (-a) stuff, but the rest seems he probably is more use to the Java or C# languages. Another thing that makes me believe he was new to C++ was his lack of understanding how to inherit functions. Something that even chili does for some reason is adding 'virtual' to the child classes, which is only necessary if those children are going to have derivations of their own. Also, the lack of using = 0 for the base class virtual functions when they are never going to be used.

Anyway, thanks for looking at the project Cameron and Chili. Every hurdle seems to be burning me out faster than before. I've been doing this for about 3.5 years and I feel as though I should have gotten a lot further in my understanding of computer programming. So any straight forward help is greatly appreciated.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cameron
Posts: 794
Joined: June 26th, 2012, 5:38 pm
Location: USA

Re: Multithreading

Post by cameron » June 4th, 2016, 10:13 pm

Yea I know what you mean, I have the difficult time reading some peoples code and takes quite some time to convert it especially if it isn't written in a c++ sort of way. I often find myself getting burnt out when I try to push something too quick. When that happens I usually either just take a break or start from square one lol.

I actually tried out vulkan api the other week and man it was complex. About 1000 lines to initialize it and draw a triangle lol. I decided to take a break from that due to lack of tutorials/info about the api in general.

On a side note though, I got fps up to 60 by maxing out cpu usage by increase number of threads to 20 and changing it to 64 bit CPU usage is now very close to 100%. It is interesting how much better CAS loops scale compared to mutexes/critical sections.
Computer too slow? Consider running a VM on your toaster.

Post Reply