Newest at the top
| 2026-02-14 22:14:40 +0100 | <tomsmeding> | the _mm256 and _mm variants indeed have 3 |
| 2026-02-14 22:14:38 +0100 | <[exa]> | probie: anyway, it might be the case that the compiler just ignores it, but I'd bet this is problem number 1 |
| 2026-02-14 22:14:33 +0100 | <tomsmeding> | [exa]: oh I mistyped, I meant _mm512_add_epi8 |
| 2026-02-14 22:14:20 +0100 | <[exa]> | probie: memory order too strong QQ |
| 2026-02-14 22:14:09 +0100 | <tomsmeding> | yes |
| 2026-02-14 22:14:06 +0100 | <probie> | there can be aliasing |
| 2026-02-14 22:14:04 +0100 | <tomsmeding> | probie: the compiler doesn't know that |
| 2026-02-14 22:13:56 +0100 | <probie> | oh wait, <expletive> |
| 2026-02-14 22:13:52 +0100 | <[exa]> | tomsmeding: where do you read that? intel intrinsics guide says 3 per cycle |
| 2026-02-14 22:13:46 +0100 | <probie> | There isn't really a data dependency though, since memory is never read again after being written |
| 2026-02-14 22:13:28 +0100 | <tomsmeding> | (yes, the throughput label is misleading; I checked that a div_pd has 4 there and add_pd 0.5, so indeed it's CPI = 1/throughput) |
| 2026-02-14 22:12:51 +0100 | <tomsmeding> | and apparently it can even do two of those _mm256_add_epi8 instructions in one cycle, by the CPI of 0.5 |
| 2026-02-14 22:12:10 +0100 | <tomsmeding> | epi is integer stuff |
| 2026-02-14 22:11:56 +0100 | <[exa]> | oh these are the epi8 instructions from the intrinsics guide that I ignored every time |
| 2026-02-14 22:11:39 +0100 | <tomsmeding> | think about that, 64 adds with 1-cycle latency |
| 2026-02-14 22:11:05 +0100 | <tomsmeding> | or the _mm256 version, or _mm512 if you want to use your juicy AVX512 |
| 2026-02-14 22:10:29 +0100 | <tomsmeding> | _mm_add_epi8 is the one you want here (paddb) |
| 2026-02-14 22:10:17 +0100 | <tomsmeding> | https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=epi8 |
| 2026-02-14 22:10:02 +0100 | [exa] | learned today |
| 2026-02-14 22:09:48 +0100 | <tomsmeding> | yes |
| 2026-02-14 22:09:40 +0100 | <[exa]> | are there SIMD instructions for chars? |
| 2026-02-14 22:09:30 +0100 | <tomsmeding> | but 4 should at least get you different assembly |
| 2026-02-14 22:09:27 +0100 | <[exa]> | ok I somehow hoped this is at least floats |
| 2026-02-14 22:09:20 +0100 | <[exa]> | oh |
| 2026-02-14 22:09:11 +0100 | <tomsmeding> | or at least 16x to use 128-bit SSE registers |
| 2026-02-14 22:08:49 +0100 | <tomsmeding> | with this being Word8 you may even want to unroll 32x |
| 2026-02-14 22:08:28 +0100 | <[exa]> | (edited right into pastebin so didn't try it but you see the point I guess) |
| 2026-02-14 22:07:50 +0100 | <[exa]> | probie: try this https://paste.tomsmeding.com/GjpwizwI |
| 2026-02-14 22:07:15 +0100 | <tomsmeding> | lol |
| 2026-02-14 22:07:02 +0100 | <tomsmeding> | yeah probie ^ |
| 2026-02-14 22:06:57 +0100 | <[exa]> | it's writing back to the original vector |
| 2026-02-14 22:06:50 +0100 | <tomsmeding> | oh no |
| 2026-02-14 22:06:45 +0100 | <tomsmeding> | [exa]: isn't this code just zipWith (+) |
| 2026-02-14 22:06:12 +0100 | <[exa]> | probie: man, you're introducing a data dependency there, it can't simd |
| 2026-02-14 22:04:08 +0100 | <fgarcia> | llvm goes to at least 23 now. it could be the SIMD changes haven't made it down |
| 2026-02-14 22:03:56 +0100 | <tomsmeding> | but in theory, either should work here |
| 2026-02-14 22:03:48 +0100 | <tomsmeding> | in general, Storable is more straightforward |
| 2026-02-14 22:03:42 +0100 | <int-e> | Right. I should've known that. |
| 2026-02-14 22:03:28 +0100 | <probie> | int-e: it's not; I omitted the `import qualified Data.Vector.Unboxed.Mutable as V`. Weirdly, I get slightly better llvm if I use `Storable` instead of `Unboxed` |
| 2026-02-14 22:03:10 +0100 | <int-e> | tomsmeding: gah |
| 2026-02-14 22:02:59 +0100 | <[exa]> | int-e: afaik you can import the one from the .unboxed.mutable or .primitive.mutable module |
| 2026-02-14 22:02:52 +0100 | <tomsmeding> | int-e: every mutable vector variant has its own definition of the "IOVector" type synonym |
| 2026-02-14 22:02:35 +0100 | <EvanR> | oh, LLVM |
| 2026-02-14 22:02:22 +0100 | <EvanR> | last I heard ghc didn't have SIMD support |
| 2026-02-14 22:02:09 +0100 | <tomsmeding> | there are a bunch of loop-invariant loads here that I expect llvm to lift out |
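
The `_mm_add_epi8` (paddb) suggestion above can be sketched in C roughly as follows. This is a minimal sketch, not the code from the paste. The function names `add_bytes_scalar` and `add_bytes_sse2` are hypothetical, and the in-place `dst[i] += src[i]` shape is an assumption based on the remark that the loop writes back to the original vector. The SSE2 path is only compiled where the compiler defines `__SSE2__` (GCC and Clang on x86-64 do); elsewhere the function falls back to the plain byte loop. `restrict` in the scalar version is a promise to the C compiler that the two buffers do not alias, which is the guarantee the discussion says the compiler is missing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_add_epi8, _mm_storeu_si128 */
#endif

/* Reference version: dst[i] += src[i], wrapping modulo 256 like Word8.
   `restrict` promises the compiler that dst and src do not overlap,
   so it is free to vectorise the loop on its own. */
static void add_bytes_scalar(uint8_t *restrict dst,
                             const uint8_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Hand-vectorised version (hypothetical helper): 16 byte additions per
   _mm_add_epi8 where SSE2 is available. Assumes dst and src do not overlap. */
static void add_bytes_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads and stores, so no alignment requirement on the buffers. */
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(a, b));
    }
#endif
    /* Scalar tail for the remaining n % 16 bytes (or everything without SSE2). */
    for (; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}
```

With SSE2 available, a call such as `add_bytes_sse2(a, b, 40)` handles the first 32 bytes in two 16-byte steps and the last 8 bytes in the scalar tail. The `_mm256` and `_mm512` variants mentioned in the log follow the same pattern with wider registers, but need AVX2 and AVX-512BW support respectively.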