2025/02/25

Newest at the top

2025-02-25 10:40:41 +0100 <tomsmeding> well, first I have to revise a journal paper which is due in 2 weeks or so. :P
2025-02-25 10:40:19 +0100 <tomsmeding> it's funny how people put those graphs in papers, but then the underlying implementations are actually kind of crap :P
2025-02-25 10:40:11 +0100 <Athas> No no, I think this is definitely what you should be working on. Keep up! Don't get distracted!
2025-02-25 10:39:58 +0100 <tomsmeding> hah
2025-02-25 10:39:54 +0100 <tomsmeding> anyway this is very much a side-track to what I should actually be working on these weeks. But I wanted to get some ideas from you anyway :)
2025-02-25 10:39:36 +0100 <Athas> The PyTorch code in ADBench was also crap, but a colleague of mine was so disgusted that he rewrote it, when we needed it for a paper a few years back.
2025-02-25 10:38:56 +0100 <Athas> ski: what about DPH?
2025-02-25 10:38:47 +0100 <tomsmeding> also pytorch is like "I'm faster than all you guys but you just need to give me data"
2025-02-25 10:38:13 +0100 <tomsmeding> heh okay
2025-02-25 10:38:00 +0100 <tomsmeding> perhaps, but scalar-level AD is in some sense the limit of cleverness in that respect -- it automatically detects and exploits _all_ sparsity!
2025-02-25 10:37:50 +0100 <Athas> The TF code is crap, I have an issue for it. It is something I took from ADBench but there is no way this is as fast as it should be.
2025-02-25 10:37:43 +0100 alfiee (~alfiee@user/alfiee) (Ping timeout: 252 seconds)
2025-02-25 10:37:19 +0100 <tomsmeding> I find it interesting in your graph that 'tensorflow' is just like "I have about 0.6 seconds overhead, otherwise I have the same graph as you guys"
2025-02-25 10:37:17 +0100 <Athas> And in the limit, with high level array operations, some things can perhaps be differentiated more effectively, using knowledge about what is actually going on.
2025-02-25 10:36:48 +0100 <Athas> Indeed.
2025-02-25 10:36:45 +0100 <tomsmeding> but with arrays, even just first-order ones, both of those problems should be reduced to negligibility
2025-02-25 10:36:20 +0100 <tomsmeding> we have the disadvantage that being scalar-oriented means spraying the heap with huge amounts of garbage, and also creating long, deep chains that the GC has to traverse
2025-02-25 10:35:31 +0100 <Athas> So 4x to 5x slower than Enzyme (which is a compiler transformation) is doable in C++. And these tools don't actually know what a "matrix" is; they're completely scalar-oriented.
2025-02-25 10:35:00 +0100 <Athas> Here is how well tape based AD can do in C++: https://sigkill.dk/junk/gmm-diff.pdf - 'adept', 'adol-c', and 'cppad' use taping, while everything else is some kind of source transformation.
2025-02-25 10:34:35 +0100 <tomsmeding> It's fun to see the original performance around 300ms, and then the array version at 700us. :p
2025-02-25 10:33:55 +0100 <tomsmeding> :)
2025-02-25 10:33:45 +0100 <tomsmeding> the inspiration for this was your question of "is there anything better than 'ad'" and me thinking "we can't possibly have just this as the answer"
2025-02-25 10:33:22 +0100 alfiee (~alfiee@user/alfiee) alfiee
2025-02-25 10:33:09 +0100 <tomsmeding> perhaps I should've been fair and used fneural_A for both, but honestly, fneural should be faster because it uses V.map sometimes
2025-02-25 10:32:59 +0100 <Athas> tomsmeding: very promising!
2025-02-25 10:32:38 +0100 <tomsmeding> Athas: I used fneural_A for my library and fneural for 'ad' https://git.tomsmeding.com/ad-dual/tree/examples/Numeric/ADDual/Examples.hs
2025-02-25 10:32:22 +0100 ski . o O ( "Data Parallel Haskell" (GHC 6.10 - 8.4) <https://downloads.haskell.org/ghc/8.4-latest/docs/html/users_guide/parallel.html#data-parallel-has…>, <https://wiki.haskell.org/GHC/Data_Parallel_Haskell>, <https://www.cs.cmu.edu/~scandal/cacm/cacm2.html>, <https://web.archive.org/web/20040806031752/http://cgi.cse.unsw.edu.au/~chak/papers/papers.html#ndp…>, <https://en.wikipedia.org/wiki/NESL>, <https://en.wikipedia.org/wiki/Data_parallelism> )
2025-02-25 10:32:04 +0100 <tomsmeding> I benchmarked this on a simple 2-hidden-layer neural network with relu activation and softmax at the end; when input and hidden layers are all width N, then when N goes from 100 to 2000, the speedup over 'ad' goes from 26x to ~550x
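For orientation, a rough sketch of the kind of network benchmarked above: two dense hidden layers with relu, then softmax, written against plain Num/Floating so the same code can be instantiated at Double, at 'ad''s reverse-mode scalars, or at a dual-number type. The names and representation below are made up for illustration and are not the actual fneural/fneural_A from the linked repository.

    import qualified Data.Vector as V

    -- A dense layer: (weights as a vector of rows, bias)
    type Layer a = (V.Vector (V.Vector a), V.Vector a)

    dense :: Num a => Layer a -> V.Vector a -> V.Vector a
    dense (w, b) x =
      V.zipWith (+) b (V.map (\row -> V.sum (V.zipWith (*) row x)) w)

    relu :: (Num a, Ord a) => V.Vector a -> V.Vector a
    relu = V.map (max 0)

    softmax :: Floating a => V.Vector a -> V.Vector a
    softmax v = let e = V.map exp v in V.map (/ V.sum e) e

    -- Two hidden layers with relu activations, softmax at the end.
    network :: (Floating a, Ord a)
            => Layer a -> Layer a -> Layer a -> V.Vector a -> V.Vector a
    network l1 l2 l3 = softmax . dense l3 . relu . dense l2 . relu . dense l1
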
2025-02-25 10:31:54 +0100 <Athas> How close is this to working?
2025-02-25 10:29:37 +0100 <tomsmeding> 'ad' also has unsafePerformIO in the same positions
2025-02-25 10:29:07 +0100 ThePenguin (~ThePengui@cust-95-80-24-166.csbnet.se) ThePenguin
2025-02-25 10:28:51 +0100 <tomsmeding> I guess that's called "runST . unsafeCoerce"
2025-02-25 10:28:41 +0100 <tomsmeding> and there isn't even unsafePerformST!
2025-02-25 10:28:23 +0100 <tomsmeding> Athas: I might be able to switch to ST, but because the individual scalar operations are effectful (they mutate the tape), there's going to be some unsafePerformIO-like thing there anyway
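To make "the individual scalar operations are effectful" concrete, here is a deliberately naive sketch of tape-based reverse mode: the Num methods have pure types, but each one pushes a cell onto a global tape via unsafePerformIO, and the reverse pass then walks the tape newest-first. Every name below is made up for illustration; real libraries ('ad' included) tie the tape to a type-level region rather than a top-level IORef and are far more careful about sharing, re-entrancy, and thread safety.

    import Control.Exception (evaluate)
    import Data.IORef
    import Data.List (foldl')
    import qualified Data.IntMap.Strict as IM
    import System.IO.Unsafe (unsafePerformIO)

    -- One tape cell: this node's id, its two parents, and the partial
    -- derivative of the node with respect to each parent.
    data Cell = Cell !Int !Int !Double !Int !Double

    -- Global tape and id supply (illustration only).
    theTape :: IORef [Cell]
    theTape = unsafePerformIO (newIORef [])
    {-# NOINLINE theTape #-}

    nextId :: IORef Int
    nextId = unsafePerformIO (newIORef 0)
    {-# NOINLINE nextId #-}

    freshId :: IO Int
    freshId = atomicModifyIORef' nextId (\n -> (n + 1, n))

    -- A reverse-mode scalar: tape id plus primal value.
    data R = R !Int !Double

    -- The "effectful scalar operation": its type is pure, but it pushes a
    -- cell onto the tape as a side effect.
    lift2 :: Double -> Double -> Double -> R -> R -> R
    lift2 z dx dy (R i _) (R j _) = unsafePerformIO $ do
      k <- freshId
      modifyIORef' theTape (Cell k i dx j dy :)
      pure (R k z)

    -- Constants get a fresh id but no tape cell; their adjoints are dropped.
    constR :: Double -> R
    constR c = unsafePerformIO (fmap (`R` c) freshId)

    instance Num R where
      x@(R _ a) + y@(R _ b) = lift2 (a + b) 1 1 x y
      x@(R _ a) - y@(R _ b) = lift2 (a - b) 1 (-1) x y
      x@(R _ a) * y@(R _ b) = lift2 (a * b) b a x y
      negate x@(R _ a)      = lift2 (negate a) (-1) 0 x x
      abs x@(R _ a)         = lift2 (abs a) (signum a) 0 x x
      signum x@(R _ a)      = lift2 (signum a) 0 0 x x
      fromInteger           = constR . fromInteger

    -- Reverse pass: seed the output adjoint with 1 and walk the tape
    -- newest-first, accumulating adjoints per node id.
    gradR :: ([R] -> R) -> [Double] -> [Double]
    gradR f xs = unsafePerformIO $ do
      writeIORef theTape []
      writeIORef nextId 0
      inputs <- mapM (\v -> fmap (`R` v) freshId) xs
      R out _ <- evaluate (f inputs)  -- force the primal so the tape is complete
      cells <- readIORef theTape
      let step adj (Cell k i dx j dy) =
            let a = IM.findWithDefault 0 k adj
            in  IM.insertWith (+) j (a * dy) (IM.insertWith (+) i (a * dx) adj)
          final = foldl' step (IM.singleton out 1) cells
      pure [IM.findWithDefault 0 i final | R i _ <- inputs]

    -- > gradR (\[x, y] -> x * y + x) [3, 4]  ==  [5.0, 3.0]
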
2025-02-25 10:27:08 +0100 ThePenguin (~ThePengui@cust-95-80-24-166.csbnet.se) (Remote host closed the connection)
2025-02-25 10:26:40 +0100 why (~why@n218250229238.netvigator.com) (Quit: Client closed)
2025-02-25 10:26:40 +0100 <tomsmeding> if so, then exactly the same thing should work here
2025-02-25 10:26:24 +0100 <Athas> Yes, so that ought to work. That is good.
2025-02-25 10:25:48 +0100 <tomsmeding> Athas: I think if you 'diffF' on a function that uses 'grad', you instantiate the 'a' in the type of 'grad' with 'AD s (Forward a)'; does that sound right?
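A minimal sketch of that nesting against the 'ad' API as discussed above; the concrete function and the curve it is evaluated along are made up for illustration.

    import Numeric.AD (diffF, grad)

    -- Forward-over-reverse: differentiate, in forward mode with respect to t,
    -- each component of the reverse-mode gradient of f along the curve
    -- t |-> (t, 2t).  Inside 'grad', the scalar type is instantiated to
    -- 'AD s (Forward Double)'.
    gradAlongCurve :: Double -> [Double]
    gradAlongCurve = diffF (\t -> grad (\[x, y] -> x * y + sin x) [t, 2 * t])

    -- > gradAlongCurve 1.0
    -- the gradient along the curve is [2t + cos t, t], so this yields
    -- [2 - sin 1, 1]
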
2025-02-25 10:24:47 +0100 <tomsmeding> nothing, it seems, semantically?
2025-02-25 10:24:27 +0100 <tomsmeding> I'm not familiar enough with the 'ad' API to figure out what that 'AD' type is supposed to do
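For context, the 'AD' wrapper in recent versions of 'ad' appears to be nothing more than a phantom-branded newtype, there so that scalars from different modes and branding regions cannot be mixed up; roughly:

    -- Approximately what the 'ad' internals define (deriving clauses omitted);
    -- the 's' parameter is a phantom brand, so the wrapper has no runtime content.
    newtype AD s a = AD { runAD :: a }
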
2025-02-25 10:22:59 +0100 dsrt^ (~dsrt@108.192.66.114) (Ping timeout: 260 seconds)
2025-02-25 10:22:50 +0100 <tomsmeding> the fact that that package's name is so short is really inconvenient. :P
2025-02-25 10:22:36 +0100 tomsmeding opens the 'ad' docs
2025-02-25 10:22:30 +0100 off^ (~off@108.192.66.114) (Ping timeout: 268 seconds)
2025-02-25 10:22:19 +0100 <Athas> It's 'diffF' on a function that uses 'grad'.
2025-02-25 10:21:55 +0100 <Athas> Operationally yes. I don't know what the type of that looks like. I was referring to the surface syntax.
2025-02-25 10:21:13 +0100 <tomsmeding> i.e. doing reverse mode, but secretly the scalars are dual numbers?
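"Secretly the scalars are dual numbers" refers to forward mode: each scalar carries its primal value together with a tangent. A minimal version of that idea (not the representation 'ad' actually uses):

    -- A dual number: primal value and tangent (derivative along one direction).
    data Dual a = Dual a a
      deriving Show

    instance Num a => Num (Dual a) where
      Dual x x' + Dual y y' = Dual (x + y) (x' + y')
      Dual x x' - Dual y y' = Dual (x - y) (x' - y')
      Dual x x' * Dual y y' = Dual (x * y) (x * y' + x' * y)
      negate (Dual x x')    = Dual (negate x) (negate x')
      abs (Dual x x')       = Dual (abs x) (signum x * x')
      signum (Dual x _)     = Dual (signum x) 0
      fromInteger n         = Dual (fromInteger n) 0

    -- Derivative of a univariate function: seed the tangent with 1.
    diffDual :: Num a => (Dual a -> Dual a) -> a -> a
    diffDual f x = let Dual _ x' = f (Dual x 1) in x'

    -- > diffDual (\x -> x * x + 3 * x) 2  ==  7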