Newest at the top
2025-02-25 11:01:10 +0100 | <ski> | [exa] : differentiated as in AD, or say as in differential lambda calculus ? |
2025-02-25 11:00:20 +0100 | <ski> | mainly the parallelism, i suppose, tomsmeding |
2025-02-25 10:59:53 +0100 | divya- | divya |
2025-02-25 10:58:38 +0100 | divya- | (divya@140.238.251.170) divya |
2025-02-25 10:55:43 +0100 | <[exa]> | ski: I was trying to find some such connection before and it doesn't really seem easy. IMO we'd need some very interesting encoding of the array indexes and updates that can be differentiated directly and then reconstructed. Most differentiation formulations I've seen basically deny that. |
2025-02-25 10:53:54 +0100 | kuribas | (~user@ip-188-118-57-242.reverse.destiny.be) kuribas |
2025-02-25 10:46:42 +0100 | <tomsmeding> | in any case, the fact that it doesn't really exist any more makes this a hard proposition :) |
2025-02-25 10:46:27 +0100 | <tomsmeding> | ski: are you thinking of just the parallelism part, or also the distributed computing part of DPH? (IIRC they had distributed execution over multiple machines as a core design goal) |
2025-02-25 10:45:59 +0100 | tzh | (~tzh@c-76-115-131-146.hsd1.or.comcast.net) (Quit: zzz) |
2025-02-25 10:45:24 +0100 | <tomsmeding> | Athas: do you get those nice plots out of gradbench? |
2025-02-25 10:45:18 +0100 | <ski> | Athas : wondering whether there'd be any hope of integrating it with something like the above |
2025-02-25 10:43:03 +0100 | <tomsmeding> | Real-world impact is not particularly correlated with publication-worthiness |
2025-02-25 10:42:36 +0100 | <tomsmeding> | This is at most a workshop talk! |
2025-02-25 10:41:51 +0100 | <tomsmeding> | but then it ends up, in this particular niche of haskell AD libraries, improving the state of the art by >500x. |
2025-02-25 10:41:37 +0100 | <tomsmeding> | "this is trivial application of the stuff we already know" |
2025-02-25 10:41:25 +0100 | <tomsmeding> | it's also in a way unsatisfying. Writing this ad-dual thing is mostly like "this is easy stuff, the interesting research is in tackling harder problems like actually differentiating higher-order code first-class, or as a source-code transformation, or stuff like that" |
2025-02-25 10:40:41 +0100 | <tomsmeding> | well, first I have to revise a journal paper which is due in 2 weeks or so. :P |
2025-02-25 10:40:19 +0100 | <tomsmeding> | it's funny how people put those graphs in papers, but then the underlying implementations are actually kind of crap :P |
2025-02-25 10:40:11 +0100 | <Athas> | No no, I think this is definitely what you should be working on. Keep up! Don't get distracted! |
2025-02-25 10:39:58 +0100 | <tomsmeding> | hah |
2025-02-25 10:39:54 +0100 | <tomsmeding> | anyway this is very much a side-track to what I should actually be working on these weeks. But I wanted to get some ideas from you anyway :) |
2025-02-25 10:39:36 +0100 | <Athas> | The PyTorch code in ADBench was also crap, but a colleague of mine was so disgusted that he rewrote it, when we needed it for a paper a few years back. |
2025-02-25 10:38:56 +0100 | <Athas> | ski: what about DPH? |
2025-02-25 10:38:47 +0100 | <tomsmeding> | also pytorch is like "I'm faster than all you guys but you just need to give me data" |
2025-02-25 10:38:13 +0100 | <tomsmeding> | heh okay |
2025-02-25 10:38:00 +0100 | <tomsmeding> | perhaps, but scalar-level AD is in some sense the limit of cleverness in that respect -- it automatically detects and exploits _all_ sparsity! |
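A small illustration of that sparsity point, using Numeric.AD's reverse mode (a generic sketch, not any of the benchmarked code): the tape only records operations that were actually executed, so zero partials never appear on it at all, and a diagonal interaction structure costs work proportional to the number of operations rather than anything Jacobian-sized.

    import Numeric.AD (grad)

    -- Every term touches exactly one input, so the reverse sweep does O(n)
    -- work; the diagonal sparsity is exploited without ever being represented.
    sumOfSquares :: Num a => [a] -> a
    sumOfSquares = sum . map (\x -> x * x)

    gradSumOfSquares :: [Double] -> [Double]
    gradSumOfSquares = grad sumOfSquares   -- equals map (2*) of the input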
2025-02-25 10:37:50 +0100 | <Athas> | The TF code is crap, I have an issue for it. It is something I took from ADBench but there is no way this is as fast as it should be. |
2025-02-25 10:37:43 +0100 | alfiee | (~alfiee@user/alfiee) (Ping timeout: 252 seconds) |
2025-02-25 10:37:19 +0100 | <tomsmeding> | I find it interesting in your graph that 'tensorflow' is just like "I have about 0.6 seconds overhead, otherwise I have the same graph as you guys" |
2025-02-25 10:37:17 +0100 | <Athas> | And in the limit, with high level array operations, some things can perhaps be differentiated more effectively, using knowledge about what is actually going on. |
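As one concrete instance of what that array-level knowledge could buy (a sketch using hmatrix, not something any of the plotted tools is claimed to do): for a dense layer y = x <> w, the whole reverse-mode rule is two matrix products, instead of one tape entry per scalar multiply-add.

    import Prelude hiding ((<>))
    import Numeric.LinearAlgebra

    -- Vector-Jacobian (reverse) rule for a dense layer: given the cotangent
    -- dy of y = x <> w, the cotangents of x and w are plain matrix products.
    linearVJP :: Matrix Double -> Matrix Double -> Matrix Double
              -> (Matrix Double, Matrix Double)
    linearVJP x w dy = (dy <> tr w, tr x <> dy)   -- (dx, dw)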
2025-02-25 10:36:48 +0100 | <Athas> | Indeed. |
2025-02-25 10:36:45 +0100 | <tomsmeding> | but with arrays, even just first-order ones, both of those problems should be reduced to negligibility |
2025-02-25 10:36:20 +0100 | <tomsmeding> | we have the disadvantage that being scalar-oriented means spraying the heap with huge amounts of garbage, and also creating long, deep chains that the GC has to traverse |
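Purely to illustrate the heap shapes being described (invented types, not 'ad''s or ad-dual's actual internals): a scalar-oriented tape allocates one node per arithmetic operation and links them into exactly the long chains the GC then has to walk, while an array-level tape gets one node per bulk operation.

    -- Hypothetical tape shapes, for illustration only.
    data ScalarTape
      = SVar !Int
      | SAdd ScalarTape ScalarTape   -- one heap node per scalar addition
      | SMul ScalarTape ScalarTape   -- one heap node per scalar multiplication

    data ArrayTape
      = AVar !Int
      | AMap (Double -> Double) ArrayTape                          -- one node per bulk map
      | AZipWith (Double -> Double -> Double) ArrayTape ArrayTape  -- one node per bulk zipWith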
2025-02-25 10:35:31 +0100 | <Athas> | So 4x to 5x slower than Enzyme (which is a compiler transformation) is doable in C++. And these tools don't actually know what a "matrix" is; they're completely scalar-oriented. |
2025-02-25 10:35:00 +0100 | <Athas> | Here is how well tape based AD can do in C++: https://sigkill.dk/junk/gmm-diff.pdf - 'adept', 'adol-c', and 'cppad' use taping, while everything else is some kind of source transformation. |
2025-02-25 10:34:35 +0100 | <tomsmeding> | It's fun to see the original performance around 300ms, and then the array version at 700us. :p |
2025-02-25 10:33:55 +0100 | <tomsmeding> | :) |
2025-02-25 10:33:45 +0100 | <tomsmeding> | the inspiration for this was your question of "is there anything better than 'ad'" and me thinking "we can't possibly have just this as the answer" |
2025-02-25 10:33:22 +0100 | alfiee | (~alfiee@user/alfiee) alfiee |
2025-02-25 10:33:09 +0100 | <tomsmeding> | perhaps I should've been fair and used fneural_A for both, but honestly, fneural should be faster because it uses V.map sometimes |
2025-02-25 10:32:59 +0100 | <Athas> | tomsmeding: very promising! |
2025-02-25 10:32:38 +0100 | <tomsmeding> | Athas: I used fneural_A for my library and fneural for 'ad' https://git.tomsmeding.com/ad-dual/tree/examples/Numeric/ADDual/Examples.hs |
2025-02-25 10:32:22 +0100 | <ski> | . o O ( "Data Parallel Haskell" (GHC 6.10 - 8.4) <https://downloads.haskell.org/ghc/8.4-latest/docs/html/users_guide/parallel.html#data-parallel-has…>, <https://wiki.haskell.org/GHC/Data_Parallel_Haskell>, <https://www.cs.cmu.edu/~scandal/cacm/cacm2.html>, <https://web.archive.org/web/20040806031752/http://cgi.cse.unsw.edu.au/~chak/papers/papers.html#ndp…>, <https://en.wikipedia.org/wiki/NESL>, <https://en.wikipedia.org/wiki/Data_parallelism> ) |
2025-02-25 10:32:04 +0100 | <tomsmeding> | I benchmarked this on a simple 2-hidden-layer neural network with relu activation and softmax at the end; with the input and hidden layers all of width N, the speedup over 'ad' goes from 26x at N=100 to ~550x at N=2000 |
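For reference, a minimal sketch of the benchmark shape described above (two hidden relu layers and a softmax-style loss), written as a scalar-oriented function that 'ad' can differentiate. The names, the list-of-lists representation, and the flat parameter layout here are all invented; the actual fneural/fneural_A in Examples.hs use Data.Vector and differ in detail.

    import Numeric.AD (grad, auto)

    relu :: (Num a, Ord a) => a -> a
    relu = max 0

    -- One dense layer over plain lists of scalars: every multiply-add below
    -- becomes its own node on a scalar-oriented tape.
    dense :: Num a => [[a]] -> [a] -> [a]
    dense w x = [sum (zipWith (*) row x) | row <- w]

    -- log-sum-exp as a stand-in for the softmax-based loss
    logSumExp :: Floating a => [a] -> a
    logSumExp = log . sum . map exp

    network :: (Floating a, Ord a) => [[a]] -> [[a]] -> [[a]] -> [a] -> a
    network w1 w2 w3 x =
      logSumExp (dense w3 (map relu (dense w2 (map relu (dense w1 x)))))

    -- Gradient w.r.t. a flat parameter list, input held fixed; the row-major
    -- w1/w2/w3 layout assumed by 'unflatten' is hypothetical plumbing.
    gradNetwork :: Int -> Int -> [Double] -> [Double] -> [Double]
    gradNetwork nIn nHid x =
      grad (\ps -> let (w1, w2, w3) = unflatten ps
                   in network w1 w2 w3 (map auto x))
      where
        unflatten ps =
          let (p1, rest) = splitAt (nHid * nIn) ps
              (p2, p3)   = splitAt (nHid * nHid) rest
          in (chunksOf nIn p1, chunksOf nHid p2, chunksOf nHid p3)
        chunksOf k ys
          | null ys   = []
          | otherwise = let (c, rest) = splitAt k ys in c : chunksOf k rest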
2025-02-25 10:31:54 +0100 | <Athas> | How close is this to working? |
2025-02-25 10:29:37 +0100 | <tomsmeding> | 'ad' also has unsafePerformIO in the same positions |
2025-02-25 10:29:07 +0100 | ThePenguin | (~ThePengui@cust-95-80-24-166.csbnet.se) ThePenguin |
2025-02-25 10:28:51 +0100 | <tomsmeding> | I guess that's called "runST . unsafeCoerce" |
2025-02-25 10:28:41 +0100 | <tomsmeding> | and there isn't even unsafePerformST! |
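For reference, the missing unsafePerformST can be recovered from base by routing through IO; the "runST . unsafeCoerce" spelling above is the same idea, forging the polymorphic state token instead. A minimal sketch:

    import Control.Monad.ST (ST)
    import Control.Monad.ST.Unsafe (unsafeSTToIO)
    import System.IO.Unsafe (unsafePerformIO)

    -- Not provided by base, but definable; like unsafePerformIO, it is only
    -- sound if the ST action's effects are unobservable from outside.
    unsafePerformST :: ST s a -> a
    unsafePerformST = unsafePerformIO . unsafeSTToIO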