2026/02/14

Newest at the top

2026-02-14 22:13:46 +0100	<probie>	There isn't really a data dependency though, since memory is never read again after being written
2026-02-14 22:13:28 +0100	<tomsmeding>	(yes, the throughput label is misleading; I checked that a div_pd has 4 there and add_pd 0.5, so indeed it's CPI = 1/throughput)
2026-02-14 22:12:51 +0100	<tomsmeding>	and apparently it can even do two of those _mm256_add_epi8 instructions in one cycle, by the CPI of 0.5
2026-02-14 22:12:20 +0100	peterbecich	(~Thunderbi@71.84.33.135) (Ping timeout: 256 seconds)
2026-02-14 22:12:10 +0100	<tomsmeding>	epi is integer stuff
2026-02-14 22:11:56 +0100	<[exa]>	oh these are the epi8 instructions from the intrinsic guide that I ignored every time
2026-02-14 22:11:39 +0100	<tomsmeding>	think about that, 64 adds with 1-cycle latency
2026-02-14 22:11:05 +0100	<tomsmeding>	or the _mm256 version, or _mm512 if you want to use your juicy AVX512
2026-02-14 22:10:29 +0100	<tomsmeding>	_mm_add_epi8 is the one you want here (paddb)
2026-02-14 22:10:17 +0100	<tomsmeding>	https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=epi8
2026-02-14 22:10:02 +0100	[exa]	learned today
2026-02-14 22:09:48 +0100	<tomsmeding>	yes
2026-02-14 22:09:40 +0100	<[exa]>	are there SIMD instructions for chars?
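
[annotation] The byte-wise SIMD add mentioned in this exchange (_mm_add_epi8, i.e. paddb) can be sketched in C roughly as follows. This is only an illustration, not code from either paste: the helper name add_bytes and its signature are made up here, and the scalar loop is a fallback assumed for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics, including _mm_add_epi8 */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] with wrap-around, for n bytes. */
static void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* 16 bytes per step: one 128-bit register holds 16 Word8-sized lanes. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is not available. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The wider variants named in the chat, _mm256_add_epi8 and _mm512_add_epi8, follow the same pattern with 32 and 64 lanes per step, but need AVX2 and AVX-512BW respectively, which usually means extra compiler flags.
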
2026-02-14 22:09:35 +0100	merijn	(~merijn@host-cl.cgnat-g.v4.dfn.nl) merijn
2026-02-14 22:09:30 +0100	<tomsmeding>	but 4 should at least get you different assembly
2026-02-14 22:09:27 +0100	<[exa]>	ok I somehow hoped this is at least floats
2026-02-14 22:09:20 +0100	<[exa]>	oh
2026-02-14 22:09:11 +0100	<tomsmeding>	or at least 16x to use 128bit SSE4 registers
2026-02-14 22:08:49 +0100	<tomsmeding>	with this being Word8 you may even want to unroll 32x
2026-02-14 22:08:28 +0100	<[exa]>	(edited right into pastebin so didn't try it but you see the point I guess)
2026-02-14 22:07:50 +0100	<[exa]>	probie: try this https://paste.tomsmeding.com/GjpwizwI
2026-02-14 22:07:34 +0100	caubert	(~caubert@user/caubert) caubert
2026-02-14 22:07:15 +0100	<tomsmeding>	lol
2026-02-14 22:07:02 +0100	<tomsmeding>	yeah probie ^
2026-02-14 22:06:57 +0100	<[exa]>	it's writing back to the original vector
2026-02-14 22:06:50 +0100	<tomsmeding>	oh no
2026-02-14 22:06:45 +0100	<tomsmeding>	[exa]: isn't this code just zipWith (+)
2026-02-14 22:06:12 +0100	<[exa]>	probie: man, you're introducing a data dependency there, it can't simd
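
[annotation] The pastes are not reproduced in this log, so the following is only a guess at the shape of the loop being argued about, written in C. Both function names are hypothetical. The first loop writes back into one of its inputs, but each iteration touches only index i, so nothing flows from one iteration to the next. The second loop is included for contrast as a case that does carry a dependency between iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-place elementwise add: acc[i] += src[i].
 * `restrict` states the assumption that the two buffers do not overlap. */
static void add_in_place(uint8_t *restrict acc, const uint8_t *restrict src, size_t n)
{
    assert(acc != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        acc[i] = (uint8_t)(acc[i] + src[i]); /* reads and writes only index i */
}

/* Contrast: a running sum, where iteration i needs the result of i - 1. */
static void running_sum(uint8_t *acc, const uint8_t *src, size_t n)
{
    assert(acc != NULL && src != NULL);
    for (size_t i = 1; i < n; i++)
        acc[i] = (uint8_t)(acc[i - 1] + src[i]); /* loop-carried dependency */
}
```

Whether a given compiler actually vectorises the first loop still depends on what it can prove about aliasing and on the optimisation flags in use.
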
2026-02-14 22:04:25 +0100	infinity0	(~infinity0@pwned.gg) (Ping timeout: 255 seconds)
2026-02-14 22:04:08 +0100	<fgarcia>	llvm goes to at least 23 now. it could be the SIMD changes haven't made it down
2026-02-14 22:03:56 +0100	<tomsmeding>	but in theory, either should work here
2026-02-14 22:03:48 +0100	<tomsmeding>	in general, Storable is more straightforward
2026-02-14 22:03:42 +0100	<int-e>	Right. I should've known that.
2026-02-14 22:03:28 +0100	<probie>	int-e: it's not; I omitted the `import qualified Data.Vector.Unboxed.Mutable as V`. Weirdly, I get slightly better llvm if I use `Storable` instead of `Unboxed`
2026-02-14 22:03:10 +0100	<int-e>	tomsmeding: gah
2026-02-14 22:02:59 +0100	<[exa]>	int-e: afaik you can import the one from the .unboxed.mutable or .primitive.mutable module
2026-02-14 22:02:52 +0100	<tomsmeding>	int-e: every mutable vector variant has its own definition of the "IOVector" type synonym
2026-02-14 22:02:35 +0100	<EvanR>	oh, LLVM
2026-02-14 22:02:22 +0100	<EvanR>	last I heard ghc didn't have SIMD support
2026-02-14 22:02:09 +0100	<tomsmeding>	there are a bunch of loop-invariant loads here that I expect llvm to lift out
2026-02-14 22:02:03 +0100	<int-e>	IOVector is boxed.
2026-02-14 22:01:42 +0100	<tomsmeding>	probie: if it's easy to paste the optimised LLVM IR, that would make it easier to see what's going on, probably
2026-02-14 22:00:22 +0100	<[exa]>	probie: you have unboxed or primitive vectors?
2026-02-14 22:00:09 +0100	L29Ah	(~L29Ah@wikipedia/L29Ah) (Ping timeout: 260 seconds)
2026-02-14 21:59:26 +0100	<probie>	https://paste.tomsmeding.com/NRYKh5Fj
2026-02-14 21:59:21 +0100	<[exa]>	that looks like a lot of indirection
2026-02-14 21:58:23 +0100	caubert	(~caubert@user/caubert) (Ping timeout: 252 seconds)
2026-02-14 21:58:11 +0100	<tomsmeding>	why are there so many loads for only two stores? I assume this is different code than you posted originally?
2026-02-14 21:57:03 +0100	<probie>	I don't think it's LLVM's problem here; GHC is just not generating good code https://paste.tomsmeding.com/8ZYY5Pka
2026-02-14 21:57:02 +0100	merijn	(~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 256 seconds)