2025/02/25

Newest at the top

2025-02-25 10:36:48 +0100 <Athas> Indeed.
2025-02-25 10:36:45 +0100 <tomsmeding> but with arrays, even just first-order ones, both of those problems should be reduced to negligibility
2025-02-25 10:36:20 +0100 <tomsmeding> we have the disadvantage that being scalar-oriented means spraying the heap with huge amounts of garbage, and also creating long, deep chains that the GC has to traverse
2025-02-25 10:35:31 +0100 <Athas> So 4x to 5x slower than Enzyme (which is a compiler transformation) is doable in C++. And these tools don't actually know what a "matrix" is; they're completely scalar-oriented.
2025-02-25 10:35:00 +0100 <Athas> Here is how well tape based AD can do in C++: https://sigkill.dk/junk/gmm-diff.pdf - 'adept', 'adol-c', and 'cppad' use taping, while everything else is some kind of source transformation.
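As a gloss on "taping" (hypothetical names, not the internals of adept/adol-c/cppad or 'ad'): every scalar operation appends a node recording its input indices and local partial derivatives, and the reverse sweep walks the tape backwards accumulating adjoints. A minimal pure sketch:

    import qualified Data.Map.Strict as M

    -- One recorded operation: (input index, local partial w.r.t. that input).
    newtype Node = Node [(Int, Double)]

    -- Reverse sweep over a tape where node i computed variable i and the
    -- last node is the output; returns the adjoint of every variable.
    backprop :: [Node] -> [Double]
    backprop tape = [M.findWithDefault 0 i final | i <- [0 .. n - 1]]
      where
        n     = length tape
        seed  = M.singleton (n - 1) 1                  -- d out / d out = 1
        final = foldl step seed (reverse (zip [0 ..] tape))
        step adj (i, Node ins) =
          let a = M.findWithDefault 0 i adj
          in  foldl (\m (j, d) -> M.insertWith (+) j (a * d) m) adj ins

    -- e.g. f (x0,x1) = x0*x1 + sin x0 at (2,3); tape: x0, x1, x0*x1, sin x0, sum
    -- backprop [Node [], Node [], Node [(0,3),(1,2)], Node [(0,cos 2)], Node [(2,1),(3,1)]]
    --   == [3 + cos 2, 2, 1, 1, 1]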
2025-02-25 10:34:35 +0100 <tomsmeding> It's fun to see the original performance around 300ms, and then see the array version at 700us. :p
2025-02-25 10:33:55 +0100 <tomsmeding> :)
2025-02-25 10:33:45 +0100 <tomsmeding> the inspiration for this was your question of "is there anything better than 'ad'" and me thinking "we can't possibly have just this as the answer"
2025-02-25 10:33:22 +0100alfiee(~alfiee@user/alfiee) alfiee
2025-02-25 10:33:09 +0100 <tomsmeding> perhaps I should've been fair and used fneural_A for both, but honestly, fneural should be faster because it uses V.map sometimes
2025-02-25 10:32:59 +0100 <Athas> tomsmeding: very promising!
2025-02-25 10:32:38 +0100 <tomsmeding> Athas: I used fneural_A for my library and fneural for 'ad' https://git.tomsmeding.com/ad-dual/tree/examples/Numeric/ADDual/Examples.hs
2025-02-25 10:32:22 +0100ski <https://downloads.haskell.org/ghc/8.4-latest/docs/html/users_guide/parallel.html#data-parallel-has…>,<https://wiki.haskell.org/GHC/Data_Parallel_Haskell>,<https://www.cs.cmu.edu/~scandal/cacm/cacm2.html>,<https://web.archive.org/web/20040806031752/http://cgi.cse.unsw.edu.au/~chak/papers/papers.html#ndp…>,<https://en.wikipedia.org/wiki/NESL>,<https://en.wikipedia.org/wiki/Data_parallelism> )
2025-02-25 10:32:22 +0100ski . o O ( "Data Parallel Haskell" (GHC 6.10 - 8.4)
2025-02-25 10:32:04 +0100 <tomsmeding> I benchmarked this on a simple 2-hidden-layer neural network with relu activation and softmax at the end; when input and hidden layers are all width N, then when N goes from 100 to 2000, the speedup over 'ad' goes from 26x to ~550x
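A hedged reconstruction of the benchmark's shape (the real fneural/fneural_A live in the repo linked above; this uses plain lists, all layers width N):

    relu :: (Num a, Ord a) => a -> a
    relu = max 0

    softmax :: Floating a => [a] -> [a]
    softmax xs = map (/ s) es
      where
        es = map exp xs
        s  = sum es

    -- One dense layer: weight rows, biases, elementwise activation.
    layer :: Num a => (a -> a) -> [[a]] -> [a] -> [a] -> [a]
    layer act w b x = zipWith (\row bi -> act (sum (zipWith (*) row x) + bi)) w b

    -- input -> relu hidden -> relu hidden -> linear output -> softmax;
    -- kept polymorphic so an AD library can instantiate the scalar type.
    network :: (Floating a, Ord a)
            => [[a]] -> [a] -> [[a]] -> [a] -> [[a]] -> [a] -> [a] -> [a]
    network w1 b1 w2 b2 w3 b3 =
      softmax . layer id w3 b3 . layer relu w2 b2 . layer relu w1 b1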
2025-02-25 10:31:54 +0100 <Athas> How close is this to working?
2025-02-25 10:29:37 +0100 <tomsmeding> 'ad' also has unsafePerformIO in the same positions
2025-02-25 10:29:07 +0100ThePenguin(~ThePengui@cust-95-80-24-166.csbnet.se) ThePenguin
2025-02-25 10:28:51 +0100 <tomsmeding> I guess that's called "runST . unsafeCoerce"
2025-02-25 10:28:41 +0100 <tomsmeding> and there isn't even unsafePerformST!
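For reference, the forgery being alluded to: GHC ships no unsafePerformST, but one can be faked by coercing away the state-thread parameter so runST's rank-2 type accepts the action. A sketch of the hack, not a recommendation:

    import Control.Monad.ST (ST, runST)
    import Unsafe.Coerce (unsafeCoerce)

    -- "runST . unsafeCoerce": erase the real s so the action looks
    -- universally quantified to runST. Deeply unsafe; sketch only.
    unsafePerformST :: ST s a -> a
    unsafePerformST st = runST (unsafeCoerce st)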
2025-02-25 10:28:23 +0100 <tomsmeding> Athas: I might be able to switch to ST, but because the individual scalar operations are effectful (they mutate the tape), there's going to be some unsafePerformIO-like thing there anyway
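The pattern in question, as a hypothetical sketch (not the actual ad-dual code): the Num operations on scalars are pure in their types but must push an entry onto a shared tape, so the mutation hides behind unsafePerformIO:

    import Data.IORef
    import System.IO.Unsafe (unsafePerformIO)

    -- One recorded operation: two input indices and the local partials.
    data Cell = Cell { inp1, inp2 :: !Int, d1, d2 :: !Double }

    data Tape = Tape { nextId :: IORef Int, cells :: IORef [(Int, Cell)] }

    -- A scalar carries the tape it lives on plus its index on it.
    data Dual = Dual { tape :: !Tape, idx :: !Int, val :: !Double }

    newTape :: IO Tape
    newTape = Tape <$> newIORef 0 <*> newIORef []

    -- Multiplication must look pure yet record itself; NOINLINE keeps the
    -- hidden side effect from being duplicated or floated out.
    mulD :: Dual -> Dual -> Dual
    mulD (Dual t i x) (Dual _ j y) = unsafePerformIO $ do
      k <- atomicModifyIORef' (nextId t) (\n -> (n + 1, n))
      modifyIORef' (cells t) ((k, Cell i j y x) :)  -- d(x*y)/dx = y, /dy = x
      pure (Dual t k (x * y))
    {-# NOINLINE mulD #-}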
2025-02-25 10:27:08 +0100ThePenguin(~ThePengui@cust-95-80-24-166.csbnet.se) (Remote host closed the connection)
2025-02-25 10:26:40 +0100why(~why@n218250229238.netvigator.com) (Quit: Client closed)
2025-02-25 10:26:40 +0100 <tomsmeding> if so, then exactly the same thing should work here
2025-02-25 10:26:24 +0100 <Athas> Yes, so that ought to work. That is good.
2025-02-25 10:25:48 +0100 <tomsmeding> Athas: I think if you 'diffF' on a function that uses 'grad', you instantiate the 'a' in the type of 'grad' with 'AD s (Forward a)'; does that sound right?
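A concrete instance of that instantiation (hypothetical example function, using the Numeric.AD entry points): forward-mode diffF over a reverse-mode grad, so the scalars on grad's tape are dual numbers of type AD s (Forward Double):

    import Numeric.AD (auto, diffF, grad)

    -- Gradient of t * sum xs^2 in xs, then differentiated in t.
    gradWrtScale :: Double -> [Double]
    gradWrtScale =
      diffF (\t -> grad (\xs -> auto t * sum (map (^ 2) xs)) [1, 2, 3])

    -- The gradient is (2*t*x_i), linear in t, so for any t0:
    -- gradWrtScale t0 == [2.0, 4.0, 6.0]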
2025-02-25 10:24:47 +0100 <tomsmeding> nothing, it seems, semantically?
2025-02-25 10:24:27 +0100 <tomsmeding> I'm not familiar enough with the 'ad' API to figure out what that 'AD' type is supposed to do
2025-02-25 10:22:59 +0100dsrt^(~dsrt@108.192.66.114) (Ping timeout: 260 seconds)
2025-02-25 10:22:50 +0100 <tomsmeding> the fact that that package's name is so short is really inconvenient. :P
2025-02-25 10:22:36 +0100tomsmeding opens the 'ad' docs
2025-02-25 10:22:30 +0100off^(~off@108.192.66.114) (Ping timeout: 268 seconds)
2025-02-25 10:22:19 +0100 <Athas> It's 'diffF' on a function that uses 'grad'.
2025-02-25 10:21:55 +0100 <Athas> Operationally yes. I don't know what the type of that looks like. I was referring to the surface syntax.
2025-02-25 10:21:13 +0100 <tomsmeding> i.e. doing reverse mode, but secretly the scalars are dual numbers?
2025-02-25 10:21:08 +0100califax(~califax@user/califx) califx
2025-02-25 10:21:00 +0100 <tomsmeding> wouldn't that be Reverse s (Forward Double) in 'ad'?
2025-02-25 10:20:49 +0100califax(~califax@user/califx) (Remote host closed the connection)
2025-02-25 10:20:40 +0100 <Athas> It Just Works in 'ad'.
2025-02-25 10:20:27 +0100 <Athas> jvp (\ ... vjp ...), but with the weird 'ad' names instead.
2025-02-25 10:20:17 +0100 <tomsmeding> and I think you may be right that IO is overkill here!
2025-02-25 10:19:57 +0100Smiles(uid551636@id-551636.lymington.irccloud.com) Smiles
2025-02-25 10:19:49 +0100 <tomsmeding> how would that look in 'ad'?
2025-02-25 10:19:31 +0100 <Athas> But this cannot possibly do reverse-then-forward, right?
2025-02-25 10:19:14 +0100 <tomsmeding> surely they could be
2025-02-25 10:19:09 +0100 <Athas> I guess they could be.
2025-02-25 10:19:01 +0100 <Athas> Are dual numbers Storable?
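Assuming a dual number is just a strict pair of Doubles (a hypothetical Dual, not 'ad''s types), a Storable instance is a flat 16-byte layout, which is what unboxed vectors or tapes of duals would need:

    import Foreign.Storable (Storable (..))

    -- Primal and tangent side by side.
    data Dual = Dual !Double !Double

    instance Storable Dual where
      sizeOf _    = 16
      alignment _ = 8
      peek p      = Dual <$> peekByteOff p 0 <*> peekByteOff p 8
      poke p (Dual x dx) = pokeByteOff p 0 x >> pokeByteOff p 8 dx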
2025-02-25 10:18:52 +0100 <Athas> This looks compelling, but why unsafePerformIO directly? Can this not be expressed with ST?
2025-02-25 10:18:43 +0100AlexZenon(~alzenon@178.34.162.44)