Coding Challenge 5 - Meme Graveyard Keeper
Re: Coding Challenge 5 - Meme Graveyard Keeper
Updated the requirements/conditions. You are now allowed a 3000ms runtime instead of 1000ms.
Chili
-
- Posts: 8
- Joined: September 22nd, 2018, 6:32 pm
- Location: United Kingdom
Re: Coding Challenge 5 - Meme Graveyard Keeper
Can someone tell me if I'm missing something. with the field size of UINT32_Max in a vector of <unsigned long long>s it uses more than my computer has of ram and when trying was of getting round it, it takes too long. so how are you getting 200 ms with 4,252,267,227 field size. in a vector it should take 34 GB of ram.
Edit:
here are my results with a method which doesn't use a graveyard vector
It isn't going well
second Edit:
ok I optimised my code (by removing vector.insert) and got these
so things are looking up for me (and I guess it works with uint32_max number of graves but it's harder)
Edit:
here are my results with a method which doesn't use a graveyard vector
Code: Select all
PASS (0.00ms) File Test: Buck Futter's Sexy Test 1
PASS (0.00ms) File Test: Buck Futter's Sexy Test 2
PASS (0.00ms) File Test: Buck Futter's Sexy Test 3
PASS (0.00ms) File Test: Buck Futter's Sexy Test 4
PASS (263106.09ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size)
second Edit:
ok I optimised my code (by removing vector.insert) and got these
Code: Select all
PASS (0.01ms) File Test: Buck Futter's Sexy Test 1
PASS (0.00ms) File Test: Buck Futter's Sexy Test 2
PASS (0.00ms) File Test: Buck Futter's Sexy Test 3
PASS (0.00ms) File Test: Buck Futter's Sexy Test 4
PASS (310.23ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size)
-
- Posts: 4373
- Joined: February 28th, 2013, 3:23 am
- Location: Oklahoma, United States
Re: Coding Challenge 5 - Meme Graveyard Keeper
max value of an unsigned int should be 4,294,967,295 not 4,252,267,227.
The InputCommand struct has 3 unsigned int values, equaling 12 bytes total struct size, which would be 48 GiB ( 51,539,607,552 bytes ). So unless you have 64 GiB of RAM, that's not going to happen.
I don't understand how you are able to get the max result if you don't store an array/vector of the graves, that blows my mind lol.
So far, the result I got for Slidy's test using nested for loops was ~2,800 ms. Looks like I won't even bother with this challenge.
The InputCommand struct has 3 unsigned int values, equaling 12 bytes total struct size, which would be 48 GiB ( 51,539,607,552 bytes ). So unless you have 64 GiB of RAM, that's not going to happen.
I don't understand how you are able to get the max result if you don't store an array/vector of the graves, that blows my mind lol.
So far, the result I got for Slidy's test using nested for loops was ~2,800 ms. Looks like I won't even bother with this challenge.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com
-
- Posts: 8
- Joined: September 22nd, 2018, 6:32 pm
- Location: United Kingdom
Re: Coding Challenge 5 - Meme Graveyard Keeper
the Input Command has a max size of 8000000 not 4,294,967,295albinopapa wrote:max value of an unsigned int should be 4,294,967,295 not 4,252,267,227.
The InputCommand struct has 3 unsigned int values, equaling 12 bytes total struct size, which would be 48 GiB ( 51,539,607,552 bytes ). So unless you have 64 GiB of RAM, that's not going to happen.
I don't understand how you are able to get the max result if you don't store an array/vector of the graves, that blows my mind lol.
So far, the result I got for Slidy's test using nested for loops was ~2,800 ms. Looks like I won't even bother with this challenge.
and if you want a tip where to start, you should think of the InputCommands on a number line, with a start and an end and think of how you can split and sort them (std::sort was really useful for me)
Re: Coding Challenge 5 - Meme Graveyard Keeper
It can definitely be done my dudes, it's just requires a bit of thought and ingenuity (like most challenges of this type). Seldom is the naive (straight-forward) approach the answer.
And this type of thinking is essential if you want to write optimal code in the wild. You can spend 100s of hours writing cache, MT, SIMD, and low level bit twerking optimizations, but if you're working on a naive O(n^2) algorithm when a O(nlogn) algorithm exists, you're just spinning your wheels.
And this type of thinking is essential if you want to write optimal code in the wild. You can spend 100s of hours writing cache, MT, SIMD, and low level bit twerking optimizations, but if you're working on a naive O(n^2) algorithm when a O(nlogn) algorithm exists, you're just spinning your wheels.
Chili
-
- Posts: 14
- Joined: September 24th, 2018, 1:20 pm
Re: Coding Challenge 5 - Meme Graveyard Keeper
Hey, figured out the trick for big fields in order not to fuck the memory but anyway wanna spin my wheels a little bit so tried playing with AVXs 256-bit and SSEs on a naive solution and have some things to share with you ^_^
So. Each of my graves is a long long and gotta do some tricks because SIMD dont have many instructions for 64-bit integers . The tricks I did on AVX version and SSEs version are more less the same. The only difference is that my AVX version calculate 4 graves per loop while SSE version 2 graves per loop.
Results on the file Slidy uploaded ( I dont have exact numbers, what I remember more less)
on Piledriver machine: Naive solution: 1200ms , using SSE: 600ms , using AVX: I need AVX 2 that isnt supported
on Haswell machine: Naive solution: 1200ms, using SSE: 1200ms!! , using AVX: 600ms
Well I think I know why Haswell just fuck SSE version
In SSE and AVX version I use VPCMPGTQ and PCMPGTQ accordingly which on Haswell both have a latency of 5 cycles. However on Piledriver PCMPGTQ has latency of just 2 cycles. (kinda interesting...) Futhermore, on Slidy's test AVX version use the VPCMPGTQ about 300 million times while SSE version use PCMPGTQ about 600 million times, so that might explain the results
Also one thing about AVX code. If you try AVX you may have to use load and store instructions. Personally at first I tried to access grave like if they was a array of __m256i. However MSVC on -O2 generated too many richarded move instructions and killed the performance ,about 1200ms ,while GCC -O3 just the needed.
(I feel like my English is pure shit...)
So. Each of my graves is a long long and gotta do some tricks because SIMD dont have many instructions for 64-bit integers . The tricks I did on AVX version and SSEs version are more less the same. The only difference is that my AVX version calculate 4 graves per loop while SSE version 2 graves per loop.
Results on the file Slidy uploaded ( I dont have exact numbers, what I remember more less)
on Piledriver machine: Naive solution: 1200ms , using SSE: 600ms , using AVX: I need AVX 2 that isnt supported
on Haswell machine: Naive solution: 1200ms, using SSE: 1200ms!! , using AVX: 600ms
Well I think I know why Haswell just fuck SSE version
In SSE and AVX version I use VPCMPGTQ and PCMPGTQ accordingly which on Haswell both have a latency of 5 cycles. However on Piledriver PCMPGTQ has latency of just 2 cycles. (kinda interesting...) Futhermore, on Slidy's test AVX version use the VPCMPGTQ about 300 million times while SSE version use PCMPGTQ about 600 million times, so that might explain the results
Also one thing about AVX code. If you try AVX you may have to use load and store instructions. Personally at first I tried to access grave like if they was a array of __m256i. However MSVC on -O2 generated too many richarded move instructions and killed the performance ,about 1200ms ,while GCC -O3 just the needed.
(I feel like my English is pure shit...)
Re: Coding Challenge 5 - Meme Graveyard Keeper
Lel, richarded
What field sizes are you getting these results with btw?
What field sizes are you getting these results with btw?
Chili
-
- Posts: 14
- Joined: September 24th, 2018, 1:20 pm
Re: Coding Challenge 5 - Meme Graveyard Keeper
With the file that slidy shared with us.
1,798,592 Commands, 2,645 Field Size
edit: maybe gona check if I can benefit from the thoughput of instructions. .. they are 1 cpi
edit 2: tried doing that by calculating 8 values per loop using AVX 2. Here is my nice results
Gotta try unrolling x4
1,798,592 Commands, 2,645 Field Size
edit: maybe gona check if I can benefit from the thoughput of instructions. .. they are 1 cpi
edit 2: tried doing that by calculating 8 values per loop using AVX 2. Here is my nice results
Code: Select all
PASS (319.89ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // with x2 AVX instructions per loop
PASS (602.10ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // without AVX loop unroll
-
- Posts: 4373
- Joined: February 28th, 2013, 3:23 am
- Location: Oklahoma, United States
Re: Coding Challenge 5 - Meme Graveyard Keeper
I've noticed looking through disassembly on MSVC the same sometimes. Move from memory to register XMM0, back to memory, then back to XMM0. A lot of the moves seemed duplicated, but I think it's unaligned load/stores to try to get it aligned, which seems ridiculous. From what I've read, AMD's AVX implementation interleaves their 128 bit SSE lanes to simulate AVX instructions and still do on their new Zen architectures instead of doing the two SSE lanes simultaneously.However MSVC on -O2 generated too many richarded move instructions
I've been looking ( not extensively ) for this info on AMD chips, where did you find this? Intel has the intrinsics guide which has a lot of this info there, but haven't found anything similar for AMD.However on Piledriver PCMPGTQ has latency of just 2 cycles.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com
-
- Posts: 4373
- Joined: February 28th, 2013, 3:23 am
- Location: Oklahoma, United States
Re: Coding Challenge 5 - Meme Graveyard Keeper
I personally never got any speed increase from loop unrolling. In most of my trials, the compiler usually unrolled the loops already or there wasn't enough work to be done between the loads/stores, or was probably already memory bandwidth limited. The prefetcher gets 64 bytes ( four SSE lanes worth ) in an array as it is, so four single iteration loops are already cached if you are processing arrays. This is probably the reason why unrolling usually doesn't do much of anything.thesmallcreeper wrote:With the file that slidy shared with us.
1,798,592 Commands, 2,645 Field Size
edit: maybe gona check if I can benefit from the thoughput of instructions. .. they are 1 cpi
edit 2: tried doing that by calculating 8 values per loop using AVX 2. Here is my nice results
Gotta try unrolling x4Code: Select all
PASS (319.89ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // with x2 AVX instructions per loop PASS (602.10ms) File Test: Slidy's Test (1,798,592 Commands, 2,645 Field Size) // without AVX loop unroll
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com