2026/02/14

Newest at the top

2026-02-14 22:31:59 +0100	merijn	(~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 245 seconds)
2026-02-14 22:31:44 +0100	machinedgod	(~machinedg@d75-159-126-101.abhsia.telus.net) machinedgod
2026-02-14 22:31:32 +0100	<probie>	Vector still has alignment issues though, so I'm not inclined to bend over backwards to get it to work
2026-02-14 22:26:51 +0100	merijn	(~merijn@host-cl.cgnat-g.v4.dfn.nl) merijn
2026-02-14 22:26:29 +0100	<tomsmeding>	storable vectors are just C-style arrays as you expect
2026-02-14 22:26:10 +0100	<tomsmeding>	so there is indirection between the vector you have in hand and the underlying storage
2026-02-14 22:25:59 +0100	<[exa]>	probie: also highly suggest checking whether repa/massiv can do this and, if they can, copying what they did. IIRC these would still count as "pure" Haskell.
2026-02-14 22:25:55 +0100	<tomsmeding>	the idea of unboxed vectors is that they are struct-of-arrays transformed, i.e. a vector of (Int, Int) is actually two vectors under the hood
2026-02-14 22:25:32 +0100	<tomsmeding>	recommend storable vectors for that though, as they have a sensible withForeignPtr function
2026-02-14 22:25:04 +0100	<[exa]>	probie: there should be some (ugly but working) way to get a pointer to the vector which can be used as the required target type for the primitive op
2026-02-14 22:24:30 +0100	infinity0	(~infinity0@pwned.gg) infinity0
2026-02-14 22:23:55 +0100	<probie>	I think if I want to do this in "pure" Haskell, I'm better off ditching vector, and using `MutableByteArray#`s with the explicit SIMD operations from ghc-prim
2026-02-14 22:21:54 +0100	<[exa]>	:(
2026-02-14 22:21:42 +0100	<probie>	[exa]: It did fail, and the reads are absolutely not aligned
2026-02-14 22:21:18 +0100	tcard	(~tcard@2400:4051:5801:7500:cf17:befc:ff82:5303) (Quit: Leaving)
2026-02-14 22:20:25 +0100	<[exa]>	probie: btw if this fails, try making sure the reads are aligned (no real clue how to help there tho.)
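
The alignment concern raised here can be checked directly on a pointer. The following is a minimal C sketch, not code from the conversation: `is_aligned` is a hypothetical helper name, and the 16- and 32-byte figures mentioned in the comments are the natural alignments of 128-bit and 256-bit vector registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if p is a multiple of `alignment`.
 * `alignment` must be a power of two, e.g. 16 for 128-bit vectors or
 * 32 for 256-bit vectors. */
bool is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, the low bits of the address are the remainder. */
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

On x86, aligned load intrinsics such as `_mm_load_si128` require an address that passes this check for 16 bytes, while `_mm_loadu_si128` accepts any address, so a check like this is one way to decide which path is safe to take.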
2026-02-14 22:19:44 +0100	<tomsmeding>	<3
2026-02-14 22:19:33 +0100	<[exa]>	tomsmeding: cool I'm going to simd my stupid database string-indexing code :D
2026-02-14 22:19:28 +0100	L29Ah	(~L29Ah@wikipedia/L29Ah) L29Ah
2026-02-14 22:17:27 +0100	<geekosaur>	(I typo each of those into the other constantly…)
2026-02-14 22:16:34 +0100	<tomsmeding>	I can't type any more
2026-02-14 22:16:32 +0100	<tomsmeding>	s/ghc/gcc/
2026-02-14 22:16:00 +0100	<tomsmeding>	-ffast-math dances circles around IEEE semantics, but memory semantics aren't broken down as far as I know
2026-02-14 22:15:42 +0100	<tomsmeding>	I would be surprised if ghc does this at any optimisation level
2026-02-14 22:15:23 +0100	<tomsmeding>	ignoring this is a blatant violation of the semantics
2026-02-14 22:15:13 +0100	<tomsmeding>	and I can also assure you that GHC will not tell LLVM that these things do not alias
2026-02-14 22:15:11 +0100	<probie>	Even gcc only ignores it if you pass -O3 IIRC
2026-02-14 22:14:52 +0100	<tomsmeding>	this is DEFINITELY not ignored by llvm
2026-02-14 22:14:40 +0100	<tomsmeding>	the _mm256 and _mm variants indeed have 3
2026-02-14 22:14:38 +0100	<[exa]>	probie: anyway it might be the case that the compiler just ignores it but I'd bet this is the problem number 1
2026-02-14 22:14:33 +0100	<tomsmeding>	[exa]: oh I mistyped, I meant _mm512_add_epi8
2026-02-14 22:14:29 +0100	merijn	(~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 245 seconds)
2026-02-14 22:14:20 +0100	<[exa]>	probie: memory order too strong QQ
2026-02-14 22:14:09 +0100	<tomsmeding>	yes
2026-02-14 22:14:06 +0100	<probie>	there can be aliasing
2026-02-14 22:14:04 +0100	<tomsmeding>	probie: the compiler doesn't know that
2026-02-14 22:13:56 +0100	<probie>	oh wait, <expletive>
2026-02-14 22:13:52 +0100	<[exa]>	tomsmeding: where do you read that? intel intrinsics guide says 3 per cycle
2026-02-14 22:13:46 +0100	<probie>	There isn't really a data dependency though, since memory is never read again after being written
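
The aliasing point made in the lines above can be illustrated with a small C sketch. It is an assumption-laden illustration rather than the code being discussed: `increment_into` is a hypothetical function, and the overlap is set up deliberately. When `dst` and `src` overlap, each store feeds a later load, so the loop does carry a dependency through memory even though it looks like independent element-wise work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: writes src[i] + 1 into dst[i].
 * The pointers are NOT declared `restrict`, so the compiler must assume
 * that dst and src may overlap and has to preserve this element-by-element
 * order of loads and stores. */
void increment_into(uint8_t *dst, const uint8_t *src, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(src[i] + 1);
    }
}
```

With separate buffers of zeros, every output byte is 1. With a single zeroed buffer `buf` and the call `increment_into(buf + 1, buf, 4)`, the buffer ends up as 0, 1, 2, 3, 4, because each iteration reads the byte the previous iteration wrote. In C, `restrict` is how a programmer promises that no such overlap exists; without that promise a vectorising compiler has to either prove it or check for it at run time.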
2026-02-14 22:13:28 +0100	<tomsmeding>	(yes, the throughput label is misleading; I checked that a div_pd has 4 there and add_pd 0.5, so indeed it's CPI = 1/throughput)
2026-02-14 22:12:51 +0100	<tomsmeding>	and apparently it can even do two of those _mm256_add_epi8 instructions in one cycle, by the CPI of 0.5
2026-02-14 22:12:20 +0100	peterbecich	(~Thunderbi@71.84.33.135) (Ping timeout: 256 seconds)
2026-02-14 22:12:10 +0100	<tomsmeding>	epi is integer stuff
2026-02-14 22:11:56 +0100	<[exa]>	oh these are the epi8 instructions from the intrinsic guide that I ignored every time
2026-02-14 22:11:39 +0100	<tomsmeding>	think about that, 64 adds with 1-cycle latency
2026-02-14 22:11:05 +0100	<tomsmeding>	or the _mm256 version, or _mm512 if you want to use your juicy AVX512
2026-02-14 22:10:29 +0100	<tomsmeding>	_mm_add_epi8 is the one you want here (paddb)
2026-02-14 22:10:17 +0100	<tomsmeding>	https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=epi8
2026-02-14 22:10:02 +0100	[exa]	learned today
2026-02-14 22:09:48 +0100	<tomsmeding>	yes
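
For readers unfamiliar with the `_mm_add_epi8` intrinsic mentioned above, here is a minimal C sketch of a byte-wise add over two buffers. It is an illustration under stated assumptions, not code from the conversation: `add_bytes` is a hypothetical name, the SSE2 path is only compiled when the compiler defines `__SSE2__`, and unaligned loads and stores are used so that no alignment is required of the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics, including _mm_add_epi8 */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n bytes, wrapping mod 256. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* Process 16 bytes per iteration; paddb adds each byte lane separately. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on targets without SSE2. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

The addition wraps around rather than saturating, so 250 + 10 yields 4 in each lane; the saturating counterparts are the separate `_mm_adds_epi8` and `_mm_adds_epu8` intrinsics.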