Coding Challenge 5 - Meme Graveyard Keeper

Post by **chili** » November 5th, 2018, 1:50 am

Updated the requirements/conditions. You are now allowed a 3000ms runtime instead of 1000ms.

NightFighter · Post by **NightFighter** » November 5th, 2018, 5:03 pm

Can someone tell me if I'm missing something. with the field size of UINT32_Max in a vector of <unsigned long long>s it uses more than my computer has of ram and when trying was of getting round it, it takes too long. so how are you getting 200 ms with 4,252,267,227 field size. in a vector it should take 34 GB of ram.

Edit:
here are my results with a method which doesn't use a graveyard vector

Code: Select all

PASS (0.00ms) File Test: Buck Futter's Sexy Test 1
PASS (0.00ms) File Test: Buck Futter's Sexy Test 2
PASS (0.00ms) File Test: Buck Futter's Sexy Test 3
PASS (0.00ms) File Test: Buck Futter's Sexy Test 4
PASS (263106.09ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size)

It isn't going well

second Edit:
ok I optimised my code (by removing vector.insert) and got these

Code: Select all

PASS (0.01ms) File Test: Buck Futter's Sexy Test 1
PASS (0.00ms) File Test: Buck Futter's Sexy Test 2
PASS (0.00ms) File Test: Buck Futter's Sexy Test 3
PASS (0.00ms) File Test: Buck Futter's Sexy Test 4
PASS (310.23ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size)

so things are looking up for me (and I guess it works with uint32_max number of graves but it's harder)

Post by **albinopapa** » November 5th, 2018, 7:49 pm

max value of an unsigned int should be 4,294,967,295 not 4,252,267,227.

The InputCommand struct has 3 unsigned int values, equaling 12 bytes total struct size, which would be 48 GiB ( 51,539,607,552 bytes ). So unless you have 64 GiB of RAM, that's not going to happen.

I don't understand how you are able to get the max result if you don't store an array/vector of the graves, that blows my mind lol.

So far, the result I got for Slidy's test using nested for loops was ~2,800 ms. Looks like I won't even bother with this challenge.

NightFighter · Post by **NightFighter** » November 5th, 2018, 8:02 pm

albinopapa wrote:max value of an unsigned int should be 4,294,967,295 not 4,252,267,227.

The InputCommand struct has 3 unsigned int values, equaling 12 bytes total struct size, which would be 48 GiB ( 51,539,607,552 bytes ). So unless you have 64 GiB of RAM, that's not going to happen.

I don't understand how you are able to get the max result if you don't store an array/vector of the graves, that blows my mind lol.

So far, the result I got for Slidy's test using nested for loops was ~2,800 ms. Looks like I won't even bother with this challenge.

the Input Command has a max size of 8000000 not 4,294,967,295
and if you want a tip where to start, you should think of the InputCommands on a number line, with a start and an end and think of how you can split and sort them (std::sort was really useful for me)

Post by **chili** » November 6th, 2018, 3:22 am

It can definitely be done my dudes, it's just requires a bit of thought and ingenuity (like most challenges of this type).

Seldom is the naive (straight-forward) approach the answer.

And this type of thinking is essential if you want to write optimal code in the wild. You can spend 100s of hours writing cache, MT, SIMD, and low level bit twerking optimizations, but if you're working on a naive O(n^2) algorithm when a O(nlogn) algorithm exists, you're just spinning your wheels.

thesmallcreeper · Post by **thesmallcreeper** » November 6th, 2018, 10:08 am

Hey, figured out the trick for big fields in order not to fuck the memory but anyway wanna spin my wheels a little bit so tried playing with AVXs 256-bit and SSEs on a naive solution and have some things to share with you ^_^

So. Each of my graves is a long long and gotta do some tricks because SIMD dont have many instructions for 64-bit integers

. The tricks I did on AVX version and SSEs version are more less the same. The only difference is that my AVX version calculate 4 graves per loop while SSE version 2 graves per loop.

Results on the file Slidy uploaded ( I dont have exact numbers, what I remember more less)
on Piledriver machine: Naive solution: 1200ms , using SSE: 600ms , using AVX: I need AVX 2 that isnt supported
on Haswell machine: Naive solution: 1200ms, using SSE: 1200ms!! , using AVX: 600ms

Well I think I know why Haswell just fuck SSE version

In SSE and AVX version I use VPCMPGTQ and PCMPGTQ accordingly which on Haswell both have a latency of 5 cycles. However on Piledriver PCMPGTQ has latency of just 2 cycles. (kinda interesting...) Futhermore, on Slidy's test AVX version use the VPCMPGTQ about 300 million times while SSE version use PCMPGTQ about 600 million times, so that might explain the results

Also one thing about AVX code. If you try AVX you may have to use load and store instructions. Personally at first I tried to access grave like if they was a array of __m256i. However MSVC on -O2 generated too many richarded move instructions and killed the performance ,about 1200ms ,while GCC -O3 just the needed.

(I feel like my English is pure shit...)

Post by **chili** » November 6th, 2018, 10:24 am

Lel, richarded

What field sizes are you getting these results with btw?

thesmallcreeper · Post by **thesmallcreeper** » November 6th, 2018, 10:29 am

With the file that slidy shared with us.
1,798,592 Commands, 2,645 Field Size

edit: maybe gona check if I can benefit from the thoughput of instructions. .. they are 1 cpi

edit 2: tried doing that by calculating 8 values per loop using AVX 2. Here is my nice results

Code: Select all

PASS (319.89ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // with x2 AVX instructions per loop
PASS (602.10ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // without AVX loop unroll

Gotta try unrolling x4

Post by **albinopapa** » November 6th, 2018, 6:17 pm

However MSVC on -O2 generated too many richarded move instructions

I've noticed looking through disassembly on MSVC the same sometimes. Move from memory to register XMM0, back to memory, then back to XMM0. A lot of the moves seemed duplicated, but I think it's unaligned load/stores to try to get it aligned, which seems ridiculous. From what I've read, AMD's AVX implementation interleaves their 128 bit SSE lanes to simulate AVX instructions and still do on their new Zen architectures instead of doing the two SSE lanes simultaneously.

However on Piledriver PCMPGTQ has latency of just 2 cycles.

I've been looking ( not extensively ) for this info on AMD chips, where did you find this? Intel has the intrinsics guide which has a lot of this info there, but haven't found anything similar for AMD.

Post by **albinopapa** » November 6th, 2018, 7:31 pm

thesmallcreeper wrote:With the file that slidy shared with us.
1,798,592 Commands, 2,645 Field Size

edit: maybe gona check if I can benefit from the thoughput of instructions. .. they are 1 cpi

edit 2: tried doing that by calculating 8 values per loop using AVX 2. Here is my nice results
Code: Select all
PASS (319.89ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // with x2 AVX instructions per loop
PASS (602.10ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // without AVX loop unroll
Gotta try unrolling x4

I personally never got any speed increase from loop unrolling. In most of my trials, the compiler usually unrolled the loops already or there wasn't enough work to be done between the loads/stores, or was probably already memory bandwidth limited. The prefetcher gets 64 bytes ( four SSE lanes worth ) in an array as it is, so four single iteration loops are already cached if you are processing arrays. This is probably the reason why unrolling usually doesn't do much of anything.

Planet Chili

Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper

Re: Coding Challenge 5 - Meme Graveyard Keeper