Newest at the top
2025-02-25 11:01:10 +0100 | <ski> | [exa] : differentiated as in AD, or say as in differential lambda calculus ? |
2025-02-25 11:00:20 +0100 | <ski> | mainly the parallelism, i suppose, tomsmeding |
2025-02-25 10:59:53 +0100 | divya- | divya |
2025-02-25 10:58:38 +0100 | divya- | (divya@140.238.251.170) divya |
2025-02-25 10:55:43 +0100 | <[exa]> | ski: I was trying to find some such connection before and it doesn't really seem easy. IMO we'd need some very interesting encoding of the array indexes and updates that can be differentiated directly and then reconstructed. Most differentiation formulations I've seen basically deny that. |
2025-02-25 10:53:54 +0100 | kuribas | (~user@ip-188-118-57-242.reverse.destiny.be) kuribas |
2025-02-25 10:46:42 +0100 | <tomsmeding> | in any case, the fact that it doesn't really exist any more makes this a hard proposition :) |
2025-02-25 10:46:27 +0100 | <tomsmeding> | ski: are you thinking of just the parallelism part, or also the distributed computing part of DPH? (IIRC they had distributed execution over multiple machines as a core design goal) |
2025-02-25 10:45:59 +0100 | tzh | (~tzh@c-76-115-131-146.hsd1.or.comcast.net) (Quit: zzz) |
2025-02-25 10:45:24 +0100 | <tomsmeding> | Athas: do you get those nice plots out of gradbench? |
2025-02-25 10:45:18 +0100 | <ski> | Athas : wondering whether there'd be any hope of integrating it with something like the above |
2025-02-25 10:43:03 +0100 | <tomsmeding> | Real-world impact is not particularly correlated with publication-worthiness |
2025-02-25 10:42:36 +0100 | <tomsmeding> | This is at most a workshop talk! |
2025-02-25 10:41:51 +0100 | <tomsmeding> | but then it ends up, in this particular niche of haskell AD libraries, improving the state of the art by >500x. |
2025-02-25 10:41:37 +0100 | <tomsmeding> | "this is trivial application of the stuff we already know" |
2025-02-25 10:41:25 +0100 | <tomsmeding> | it's also in a way unsatisfying. Writing this ad-dual thing is mostly like "this is easy stuff, the interesting research is in tackling harder problems like actually differentiating higher-order code first-class, or as a source-code transformation, or stuff like that" |
2025-02-25 10:40:41 +0100 | <tomsmeding> | well, first I have to revise a journal paper which is due in 2 weeks or so. :P |
2025-02-25 10:40:19 +0100 | <tomsmeding> | it's funny how people put those graphs in papers, but then the underlying implementations are actually kind of crap :P |
2025-02-25 10:40:11 +0100 | <Athas> | No no, I think this is definitely what you should be working on. Keep up! Don't get distracted! |
2025-02-25 10:39:58 +0100 | <tomsmeding> | hah |
2025-02-25 10:39:54 +0100 | <tomsmeding> | anyway this is very much a side-track to what I should actually be working on these weeks. But I wanted to get some ideas from you anyway :) |
2025-02-25 10:39:36 +0100 | <Athas> | The PyTorch code in ADBench was also crap, but a colleague of mine was so disgusted that he rewrote it, when we needed it for a paper a few years back. |
2025-02-25 10:38:56 +0100 | <Athas> | ski: what about DPH? |
2025-02-25 10:38:47 +0100 | <tomsmeding> | also pytorch is like "I'm faster than all you guys but you just need to give me data" |
2025-02-25 10:38:13 +0100 | <tomsmeding> | heh okay |
2025-02-25 10:38:00 +0100 | <tomsmeding> | perhaps, but scalar-level AD is in some sense the limit of cleverness in that respect -- it automatically detects and exploits _all_ sparsity! |
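A small illustration of that sparsity point, using Numeric.AD's reverse mode (a generic sketch, not any of the benchmarked code): the tape only records operations that were actually executed, so zero partials never appear on it at all, and a diagonal interaction structure costs work proportional to the number of operations rather than anything Jacobian-sized.

    import Numeric.AD (grad)

    -- Every term touches exactly one input, so the reverse sweep does O(n)
    -- work; the diagonal sparsity is exploited without ever being represented.
    sumOfSquares :: Num a => [a] -> a
    sumOfSquares = sum . map (\x -> x * x)

    gradSumOfSquares :: [Double] -> [Double]
    gradSumOfSquares = grad sumOfSquares   -- equals map (2*) of the input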
2025-02-25 10:37:50 +0100 | <Athas> | The TF code is crap, I have an issue for it. It is something I took from ADBench but there is no way this is as fast as it should be. |
2025-02-25 10:37:43 +0100 | alfiee | (~alfiee@user/alfiee) (Ping timeout: 252 seconds) |
2025-02-25 10:37:19 +0100 | <tomsmeding> | I find it interesting in your graph that 'tensorflow' is just like "I have about 0.6 seconds overhead, otherwise I have the same graph as you guys" |
2025-02-25 10:37:17 +0100 | <Athas> | And in the limit, with high level array operations, some things can perhaps be differentiated more effectively, using knowledge about what is actually going on. |
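As one concrete instance of what that array-level knowledge could buy (a sketch using hmatrix, not something any of the plotted tools is claimed to do): for a dense layer y = x <> w, the whole reverse-mode rule is two matrix products, instead of one tape entry per scalar multiply-add.

    import Prelude hiding ((<>))
    import Numeric.LinearAlgebra

    -- Vector-Jacobian (reverse) rule for a dense layer: given the cotangent
    -- dy of y = x <> w, the cotangents of x and w are plain matrix products.
    linearVJP :: Matrix Double -> Matrix Double -> Matrix Double
              -> (Matrix Double, Matrix Double)
    linearVJP x w dy = (dy <> tr w, tr x <> dy)   -- (dx, dw)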
2025-02-25 10:36:48 +0100 | <Athas> | Indeed. |
2025-02-25 10:36:45 +0100 | <tomsmeding> | but with arrays, even just first-order ones, both of those problems should be reduced to negligibility |
2025-02-25 10:36:20 +0100 | <tomsmeding> | we have the disadvantage that being scalar-oriented means spraying the heap with huge amounts of garbage, and also creating long, deep chains that the GC has to traverse |
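Purely to illustrate the heap shapes being described (invented types, not 'ad''s or ad-dual's actual internals): a scalar-oriented tape allocates one node per arithmetic operation and links them into exactly the long chains the GC then has to walk, while an array-level tape gets one node per bulk operation.

    -- Hypothetical tape shapes, for illustration only.
    data ScalarTape
      = SVar !Int
      | SAdd ScalarTape ScalarTape   -- one heap node per scalar addition
      | SMul ScalarTape ScalarTape   -- one heap node per scalar multiplication

    data ArrayTape
      = AVar !Int
      | AMap (Double -> Double) ArrayTape                          -- one node per bulk map
      | AZipWith (Double -> Double -> Double) ArrayTape ArrayTape  -- one node per bulk zipWith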
2025-02-25 10:35:31 +0100 | <Athas> | So 4x to 5x slower than Enzyme (which is a compiler transformation) is doable in C++. And these tools don't actually know what a "matrix" is; they're completely scalar-oriented. |
2025-02-25 10:35:00 +0100 | <Athas> | Here is how well tape based AD can do in C++: https://sigkill.dk/junk/gmm-diff.pdf - 'adept', 'adol-c', and 'cppad' use taping, while everything else is some kind of source transformation. |
2025-02-25 10:34:35 +0100 | <tomsmeding> | It's fun to see the original performance around 300ms, and then the array version at 700us. :p |
2025-02-25 10:33:55 +0100 | <tomsmeding> | :) |
2025-02-25 10:33:45 +0100 | <tomsmeding> | the inspiration for this was your question of "is there anything better than 'ad'" and me thinking "we can't possibly have just this as the answer" |
2025-02-25 10:33:22 +0100 | alfiee | (~alfiee@user/alfiee) alfiee |
2025-02-25 10:33:09 +0100 | <tomsmeding> | perhaps I should've been fair and used fneural_A for both, but honestly, fneural should be faster because it uses V.map sometimes |
2025-02-25 10:32:59 +0100 | <Athas> | tomsmeding: very promising! |
2025-02-25 10:32:38 +0100 | <tomsmeding> | Athas: I used fneural_A for my library and fneural for 'ad' https://git.tomsmeding.com/ad-dual/tree/examples/Numeric/ADDual/Examples.hs |
2025-02-25 10:32:22 +0100 | <ski> | . o O ( "Data Parallel Haskell" (GHC 6.10 - 8.4) <https://downloads.haskell.org/ghc/8.4-latest/docs/html/users_guide/parallel.html#data-parallel-has…>, <https://wiki.haskell.org/GHC/Data_Parallel_Haskell>, <https://www.cs.cmu.edu/~scandal/cacm/cacm2.html>, <https://web.archive.org/web/20040806031752/http://cgi.cse.unsw.edu.au/~chak/papers/papers.html#ndp…>, <https://en.wikipedia.org/wiki/NESL>, <https://en.wikipedia.org/wiki/Data_parallelism> ) |
2025-02-25 10:32:04 +0100 | <tomsmeding> | I benchmarked this on a simple 2-hidden-layer neural network with relu activation and softmax at the end; with the input and hidden layers all of width N, the speedup over 'ad' goes from 26x at N=100 to ~550x at N=2000 |
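For reference, a minimal sketch of the benchmark shape described above (two hidden relu layers and a softmax-style loss), written as a scalar-oriented function that 'ad' can differentiate. The names, the list-of-lists representation, and the flat parameter layout here are all invented; the actual fneural/fneural_A in Examples.hs use Data.Vector and differ in detail.

    import Numeric.AD (grad, auto)

    relu :: (Num a, Ord a) => a -> a
    relu = max 0

    -- One dense layer over plain lists of scalars: every multiply-add below
    -- becomes its own node on a scalar-oriented tape.
    dense :: Num a => [[a]] -> [a] -> [a]
    dense w x = [sum (zipWith (*) row x) | row <- w]

    -- log-sum-exp as a stand-in for the softmax-based loss
    logSumExp :: Floating a => [a] -> a
    logSumExp = log . sum . map exp

    network :: (Floating a, Ord a) => [[a]] -> [[a]] -> [[a]] -> [a] -> a
    network w1 w2 w3 x =
      logSumExp (dense w3 (map relu (dense w2 (map relu (dense w1 x)))))

    -- Gradient w.r.t. a flat parameter list, input held fixed; the row-major
    -- w1/w2/w3 layout assumed by 'unflatten' is hypothetical plumbing.
    gradNetwork :: Int -> Int -> [Double] -> [Double] -> [Double]
    gradNetwork nIn nHid x =
      grad (\ps -> let (w1, w2, w3) = unflatten ps
                   in network w1 w2 w3 (map auto x))
      where
        unflatten ps =
          let (p1, rest) = splitAt (nHid * nIn) ps
              (p2, p3)   = splitAt (nHid * nHid) rest
          in (chunksOf nIn p1, chunksOf nHid p2, chunksOf nHid p3)
        chunksOf k ys
          | null ys   = []
          | otherwise = let (c, rest) = splitAt k ys in c : chunksOf k rest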
2025-02-25 10:31:54 +0100 | <Athas> | How close is this to working? |
2025-02-25 10:29:37 +0100 | <tomsmeding> | 'ad' also has unsafePerformIO in the same positions |
2025-02-25 10:29:07 +0100 | ThePenguin | (~ThePengui@cust-95-80-24-166.csbnet.se) ThePenguin |
2025-02-25 10:28:51 +0100 | <tomsmeding> | I guess that's called "runST . unsafeCoerce" |
2025-02-25 10:28:41 +0100 | <tomsmeding> | and there isn't even unsafePerformST! |
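For reference, the missing unsafePerformST can be recovered from base by routing through IO; the "runST . unsafeCoerce" spelling above is the same idea, forging the polymorphic state token instead. A minimal sketch:

    import Control.Monad.ST (ST)
    import Control.Monad.ST.Unsafe (unsafeSTToIO)
    import System.IO.Unsafe (unsafePerformIO)

    -- Not provided by base, but definable; like unsafePerformIO, it is only
    -- sound if the ST action's effects are unobservable from outside.
    unsafePerformST :: ST s a -> a
    unsafePerformST = unsafePerformIO . unsafeSTToIO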