Multithreading: Atomics and you!


Multithreading: Atomics and you!

Post by cyboryxmen » April 15th, 2018, 8:21 am

"Atomicity" refers to a series of changes that, when made in a particular order and then "observed", you will observe the changes happening in that exact same order. These changes as a whole is referred to as an "atomic operation". It seems like a really redundant term to make at first since that's already what happens naturally. However, in the crazy new world of parallelism, what actually happens will not always be what you actually see. Take this example:

main.cpp

Code: Select all

static int x = 0;
Thread 1

Code: Select all

// If thread 2 changed x first.
if ( x == 15 )
{
	std::cout << "Thread 2 executed first!\n";
}
x = 24;
Thread 2

Code: Select all

// If thread 1 changed x first.
if ( x == 24 )
{
	std::cout << "Thread 1 executed first!\n";
}
x = 15;
It's completely possible for thread 1 to see that x is 15 while thread 2 sees that x is 24 at the same time, resulting in the console printing both messages. To the observer, it's almost as if x holds the value 24 and the value 15 at once! (Strictly speaking, two threads accessing a plain int without synchronisation is a data race, which is undefined behaviour in C++, so any outcome is allowed.) There are plenty of examples like this online, each more mind-bending than the last. Rather than trying to understand how this could happen, it's best to just accept the fact and find ways to work around it.

The changes you make in one thread will always be observed, by that same thread, to happen in the order you made them. The problem comes when another thread tries to observe those changes. Not only is it not guaranteed to see the changes happen in the same order as the thread that made them, but the changes it makes in response may in turn be observed in a different order by the original thread.

Fortunately for us, C++ already has a construct that helps us maintain the atomicity of our changes across threads. Meet std::atomic! An atomic is a variable with the guarantee that the changes made to it by any thread are part of a single atomic operation shared among all threads. In other words, every thread will observe the changes made to it in the same order. Not only that, it can be customised to give several additional levels of guarantees. std::memory_order_relaxed is the most basic guarantee an atomic provides, and it's already more than enough to fix our previous example:

main.cpp

Code: Select all

static std::atomic<int> x { 0 };
Thread 1

Code: Select all

// If thread 2 changed x first.
if ( x.load ( std::memory_order_relaxed ) == 15 )
{
	std::cout << "Thread 2 executed first!\n";
}
x.store ( 24, std::memory_order_relaxed );
Thread 2

Code: Select all

// If thread 1 changed x first.
if ( x.load ( std::memory_order_relaxed ) == 24 )
{
	std::cout << "Thread 1 executed first!\n";
}
x.store ( 15, std::memory_order_relaxed );
Now either one of the two messages will be printed, or neither will, depending on which thread stores to x first. No more unreasonable outcome where both get printed! Most of the time, though, std::memory_order_relaxed is not a strong enough guarantee to work with. It only guarantees that the changes made to the atomic variable itself are part of the same atomic operation; it does not extend that guarantee to other changes made around them.

main.cpp

Code: Select all

static int value = 0;
static std::atomic<bool> written { false };
Writer Thread

Code: Select all

value = 15;
// Tell the reader thread that we have written the value.
written.store ( true, std::memory_order_relaxed );
Reader Thread

Code: Select all

// Wait until the value is written by the writer thread.
while ( written.load ( std::memory_order_relaxed ) == false )
{
	continue;
}
std::cout << value << '\n';
By rights, <value> should only ever be observed by the reader thread as 15, but it's completely possible for the reader to observe it as 0 instead. Since std::memory_order_relaxed does not maintain the order of changes made to variables other than the atomic itself, the reader thread could observe the change to <written> before observing the change to <value>. You might wonder whether making <value> an atomic as well would fix the problem. It wouldn't: changes made to one atomic are part of a different atomic operation than changes made to another atomic. In this case, we need a stronger guarantee than std::memory_order_relaxed provides. Enter std::memory_order_acquire and std::memory_order_release. When a thread writes to an atomic with std::memory_order_release, the atomic "releases" all the changes that thread made before the write. A thread that then reads the same atomic with std::memory_order_acquire "acquires" all the changes released through it. Knowing this, we can rewrite the code as such.

Writer Thread

Code: Select all

value = 15;
// Release our changes to the reader thread.
written.store ( true, std::memory_order_release );
Reader Thread

Code: Select all

// Acquire the changes made by the writer thread.
while ( written.load ( std::memory_order_acquire ) == false )
{
	continue;
}
std::cout << value << '\n';
This is a common synchronisation technique in multithreading. Release-acquire pairs are used to "propagate" changes made in one thread to other threads through a single atomic. A "release operation" is useless without an accompanying "acquire operation" and vice versa. You usually start by acquiring changes made by other threads and then releasing the changes you made yourself. If you ever need to acquire and release in a single operation, you can use std::memory_order_acq_rel, as demonstrated:

main.cpp

Code: Select all

static std::vector<Object> objects;
// Must start at the number of worker threads for the countdown below to work.
// (The count of 4 here is an assumption for illustration.)
static constexpr std::size_t num_threads = 4;
static std::atomic<std::size_t> num_threads_left { num_threads };
All threads

Code: Select all

void update_objects ( const std::size_t begin, const std::size_t end )
{
	// Update the part of the object vector assigned to this thread.
	for ( auto index = begin; index != end; ++index )
	{
		objects[ index ].update ( );
	}
	
	// Acquire changes other threads have made. Then, release our changes.
	const auto old_value = num_threads_left.fetch_sub ( 1, std::memory_order_acq_rel );
	const auto current_num_threads_left = old_value - 1;
	
	// If we are the last thread to finish
	if ( current_num_threads_left == 0 )
	{
		std::cout << "All threads have finished! Returning to main thread!\n";
	}
	// Otherwise, wait until all threads have finished.
	else
	{
		while ( num_threads_left.load ( std::memory_order_acquire ) != 0 )
		{
			continue;
		}
	}
}
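For completeness, here's a rough sketch of how these workers might be launched from the main thread. The vector size and the equal-chunk split are assumptions for illustration:

main.cpp

Code: Select all

#include <thread>
#include <vector>

int main ( )
{
	objects.resize ( 1000 );
	
	std::vector<std::thread> workers;
	const std::size_t chunk = objects.size ( ) / num_threads;
	for ( std::size_t i = 0; i < num_threads; ++i )
	{
		const std::size_t begin = i * chunk;
		// The last worker also takes whatever the division left over.
		const std::size_t end = ( i == num_threads - 1 ) ? objects.size ( ) : begin + chunk;
		workers.emplace_back ( update_objects, begin, end );
	}
	for ( auto& worker : workers )
	{
		worker.join ( );
	}
}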
Finally, there is a std::memory_order that gives a stronger guarantee than any other in C++. A change made with std::memory_order_seq_cst not only does an acquire and a release, it also combines the atomic operations of every atomic variable used with std::memory_order_seq_cst into a single global atomic operation. This only works, however, if those atomic variables use std::memory_order_seq_cst for all of their reads and writes; you cannot make half the reads and writes with std::memory_order_relaxed or std::memory_order_release and the other half with std::memory_order_seq_cst. Personally, I haven't encountered a situation where I needed such a strong guarantee, but it is used by people who synchronise threads with multiple atomics instead of one. Here is an example of its use from cppreference:

main.cpp

Code: Select all

#include <thread>
#include <atomic>
#include <cassert>
 
std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};
 
void write_x()
{
    x.store(true, std::memory_order_seq_cst);
}
 
void write_y()
{
    y.store(true, std::memory_order_seq_cst);
}
 
void read_x_then_y()
{
    while (!x.load(std::memory_order_seq_cst))
        ;
    if (y.load(std::memory_order_seq_cst)) {
        ++z;
    }
}
 
void read_y_then_x()
{
    while (!y.load(std::memory_order_seq_cst))
        ;
    if (x.load(std::memory_order_seq_cst)) {
        ++z;
    }
}
 
int main()
{
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);  // this assertion can never fail
}
Then there's std::memory_order_consume...we don't talk about std::memory_order_consume...

If you made it this far, congrats! You are one of the top 5% of programmers who actually understand the niche that is lock-free programming. Most people would just use std::mutex to "block" threads, forcing only one thread at a time to be able to read and write shared values. There's no denying the power std::atomic provides by giving you the ability to make changes without having to block any thread. Used correctly, it can vastly improve the performance of your multithreaded code, since every thread can keep doing work without blocking the others from progressing. Used incorrectly, however, it can be much slower than simply blocking and, more importantly, can cause huge, untraceable bugs. Like all tools in C++, it must be used wisely.
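For contrast, here is roughly what the earlier writer/reader example looks like when std::mutex does the blocking. This is only a sketch for comparison:

Code: Select all

#include <iostream>
#include <mutex>

static int value = 0;
static bool written = false;
static std::mutex mutex;

// Writer thread: unlocking the mutex releases our changes.
void writer ( )
{
	std::lock_guard<std::mutex> lock { mutex };
	value = 15;
	written = true;
}

// Reader thread: we must take the lock on every pass just to peek at <written>.
void reader ( )
{
	while ( true )
	{
		std::lock_guard<std::mutex> lock { mutex };
		if ( written )
		{
			std::cout << value << '\n';
			return;
		}
	}
}

Both threads now serialise on the lock; that's exactly the blocking the release-acquire version avoids.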


Re: Multithreading: Atomics and you!

Post by albinopapa » April 15th, 2018, 9:22 am

Still don't understand how std::atomic is any different than just locking, updating a variable, then unlocking a mutex, or surrounding some event with a condition_variable instead of hiding behind an std::atomic<bool>.


Re: Multithreading: Atomics and you!

Post by cyboryxmen » April 15th, 2018, 11:51 am

Most mutex implementations are built on top of atomic operations. A mutex lock is an atomic acquire operation, while a mutex unlock is an atomic release operation. If a thread hits "contention" by trying to acquire a lock that is already held, the thread is put to sleep. That mechanism evicts the thread from the CPU, and once it's done, the thread costs zero processing power to maintain (and zero memory if it gets paged from RAM to disk). This is partly the reason why simply blocking a thread can be faster than using a lock-free algorithm. When the thread finally does manage to acquire the lock, it gets loaded back onto the CPU and runs again. As you can imagine, this takes time, so the thread won't be able to run immediately after receiving the lock.

Spin locks (a simple atomic exchange run in a while loop until it acquires the lock) conceptually don't have this problem. They just keep spinning in the loop and can respond almost immediately once the lock is released.

Code: Select all

// true means "unlocked"; the lock must start out released.
static std::atomic<bool> cout_lock { true };

// Keep exchanging until we see the old value true, meaning we took the lock.
while ( !cout_lock.exchange ( false, std::memory_order_acq_rel ) ) continue;
std::cout << "I got it!\n";
// Store true again to hand the lock to the next thread.
cout_lock.store ( true, std::memory_order_release );
Ultimately though, the truth is more complicated than this. Some mutex implementations, on hitting contention, actually spin for a while first to handle the cases where contention ends very quickly. Spinlocks, for their part, are unlikely to make the thread sleep, but they can still trigger a context switch that makes the CPU run another thread instead; a pure context switch is usually less costly than putting a thread to sleep. And spinlocks stop making sense once you have more software threads than hardware threads (the extreme case being a single-core machine): the spinlock just wastes CPU time, while a mutex would put the thread to sleep and let another thread execute, making it more likely the lock gets released soon.
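To make the "spin for a while first" idea concrete, here's a sketch of a spin-then-yield lock. The spin count of 1000 is an arbitrary assumption, and true means "unlocked" to match the spinlock above:

Code: Select all

#include <atomic>
#include <thread>

static std::atomic<bool> lock_flag { true };

void lock ( )
{
	while ( true )
	{
		// Optimistically spin in case the contention ends quickly.
		for ( int i = 0; i < 1000; ++i )
		{
			if ( lock_flag.exchange ( false, std::memory_order_acquire ) )
			{
				return;
			}
		}
		// Still contended; hint the OS to run another thread instead of burning CPU.
		std::this_thread::yield ( );
	}
}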

You can easily go lock free by changing your mutexes to spin locks, and in certain situations that alone gives better performance, but you can do better as far as lock-free algorithms go. The strongest form is a wait-free algorithm: threads neither get blocked nor have to wait on other threads in a while loop before continuing to do work. (Strictly speaking, wait-free is a stronger guarantee than lock-free: lock-free only promises that some thread always makes progress, while wait-free promises that every thread does.) These algorithms are very fast, allowing near-instantaneous synchronisation between threads so that they can run continuously without stopping. Think of a highway where flows of traffic merge and split smoothly while all the cars keep driving at high speed.
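For a taste of what wait-free looks like, here's a sketch where each thread claims a unique output slot with a single fetch_add and never waits on anyone. The fixed-size results array is an assumption, and fetch_add being wait-free depends on the hardware (on x86 it's a single lock xadd instruction):

Code: Select all

#include <array>
#include <atomic>
#include <cstddef>

static std::array<int, 1024> results;
static std::atomic<std::size_t> next_slot { 0 };

// Claims a unique slot in one step; no thread ever waits for another.
void publish_result ( const int result )
{
	const auto slot = next_slot.fetch_add ( 1, std::memory_order_relaxed );
	results[ slot ] = result;
}

Whoever eventually reads <results> still needs a release-acquire handoff before it can trust the slots, of course.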

You can argue that an acquire operation still "blocks" the thread, since changes made by other threads need to propagate to it before it can continue. On x86 machines, though, an acquire load is not much different from an ordinary load in a single-threaded program that "blocks" the thread because it needs to read from RAM into cache. The benchmarks don't lie; lock free is the way to go for high-performance multithreading.

Of course, no synchronisation at all is better than both mutexes and lock free. That's why you should try to avoid making changes to shared data in the first place. Just make copies instead of mutating the original data. No changes to shared data means no need for synchronisation. A minimal sketch of the idea follows.
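(The doubling "work" below is just a stand-in for a real computation.)

Code: Select all

#include <thread>
#include <vector>

// Takes its argument by value: the thread works on its own private copy.
std::vector<int> process ( std::vector<int> data )
{
	for ( auto& element : data )
	{
		element *= 2;
	}
	return data;
}

int main ( )
{
	const std::vector<int> input { 1, 2, 3, 4 };
	std::vector<int> output;
	
	// The worker mutates only its copy; main only reads the original.
	std::thread worker { [ & ] ( ) { output = process ( input ); } };
	worker.join ( ); // join() synchronises, so <output> is safe to read now
	
	return 0;
}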


Re: Multithreading: Atomics and you!

Post by Zedtho » April 15th, 2018, 1:37 pm

Thanks for sharing! This stuff is super interesting!


Re: Multithreading: Atomics and you!

Post by Firepath » April 15th, 2018, 9:20 pm

Agreed, good info, I'll have to read again and look into it more. Thanks!


Re: Multithreading: Atomics and you!

Post by albinopapa » April 15th, 2018, 10:02 pm

Yeah, I've been watching YT videos on this stuff every so often ( not very regularly ) and value semantics can be costly for single threaded applications, but the benefit you get in performance of not having to lock threads makes up for the cost of copying data around. The data still needs to be synchronized though, so I'm guessing you would make copies, do your calculations, then send the results back to the main thread where the updated results are copied back?

This idea makes me think of double buffering, where you have a read buffer and a write buffer. You copy data from the read buffer, do the calculations then write to the write buffer. Each frame, you'd swap the buffers through a simple pointer swap once all operations have been completed.
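Something like this minimal sketch is what I'm picturing (the float buffers are just placeholders), with the swap done once all workers have joined:

Code: Select all

#include <utility>
#include <vector>

struct Buffers
{
	std::vector<float> front; // read from this frame
	std::vector<float> back;  // written to this frame
};

// Called once per frame after every worker has joined, so nothing
// is touching either buffer while they swap.
void flip ( Buffers& buffers )
{
	std::swap ( buffers.front, buffers.back ); // cheap: swaps the vectors' internal pointers
}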
