Newest at the top
2025-02-25 10:36:48 +0100 | <Athas> | Indeed. |
2025-02-25 10:36:45 +0100 | <tomsmeding> | but with arrays, even just first-order ones, both of those problems should be reduced to negligibility |
2025-02-25 10:36:20 +0100 | <tomsmeding> | we have the disadvantage that being scalar-oriented means spraying the heap with huge amounts of garbage, and also creating long, deep chains that the GC has to traverse |
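
    [Editor's note: a minimal sketch, not 'ad''s actual representation, of
    the kind of per-scalar tape such libraries build. Every arithmetic
    operation allocates a fresh node, so a large computation becomes
    exactly the long, GC-traversed chain described above.]

    -- Hypothetical per-scalar tape node; each (+), (*), etc. allocates one.
    -- A run over millions of scalars yields millions of these nodes,
    -- linked into deep chains the garbage collector must walk.
    data Node a
      = Leaf
      | Unary  !(Node a) a             -- parent node, partial derivative
      | Binary !(Node a) !(Node a) a a -- two parents, two partials
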
2025-02-25 10:35:31 +0100 | <Athas> | So 4x to 5x slower than Enzyme (which is a compiler transformation) is doable in C++. And these tools don't actually know what a "matrix" is; they're completely scalar-oriented. |
2025-02-25 10:35:00 +0100 | <Athas> | Here is how well tape based AD can do in C++: https://sigkill.dk/junk/gmm-diff.pdf - 'adept', 'adol-c', and 'cppad' use taping, while everything else is some kind of source transformation. |
2025-02-25 10:34:35 +0100 | <tomsmeding> | It's fun to see the original performance at around 300ms, and then see the array version at 700us. :p |
2025-02-25 10:33:55 +0100 | <tomsmeding> | :) |
2025-02-25 10:33:45 +0100 | <tomsmeding> | the inspiration for this was your question of "is there anything better than 'ad'" and me thinking "we can't possibly have just this as the answer" |
2025-02-25 10:33:22 +0100 | alfiee | (~alfiee@user/alfiee) alfiee |
2025-02-25 10:33:09 +0100 | <tomsmeding> | perhaps I should've been fair and used fneural_A for both, but honestly, fneural should be faster because it uses V.map sometimes |
2025-02-25 10:32:59 +0100 | <Athas> | tomsmeding: very promising! |
2025-02-25 10:32:38 +0100 | <tomsmeding> | Athas: I used fneural_A for my library and fneural for 'ad' https://git.tomsmeding.com/ad-dual/tree/examples/Numeric/ADDual/Examples.hs |
2025-02-25 10:32:22 +0100 | ski | <https://downloads.haskell.org/ghc/8.4-latest/docs/html/users_guide/parallel.html#data-parallel-has…>,<https://wiki.haskell.org/GHC/Data_Parallel_Haskell>,<https://www.cs.cmu.edu/~scandal/cacm/cacm2.html>,<https://web.archive.org/web/20040806031752/http://cgi.cse.unsw.edu.au/~chak/papers/papers.html#ndp…>,<https://en.wikipedia.org/wiki/NESL>,<https://en.wikipedia.org/wiki/Data_parallelism> ) |
2025-02-25 10:32:22 +0100 | ski | . o O ( "Data Parallel Haskell" (GHC 6.10 - 8.4) |
2025-02-25 10:32:04 +0100 | <tomsmeding> | I benchmarked this on a simple 2-hidden-layer neural network with relu activation and softmax at the end; when input and hidden layers are all width N, then when N goes from 100 to 2000, the speedup over 'ad' goes from 26x to ~550x |
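
    [Editor's note: a rough, list-based sketch of the shape of the
    benchmarked function; the real fneural/fneural_A live in the repo
    linked above and are vector-based. It is kept polymorphic in the
    scalar so both 'ad' and a dual-number library can differentiate it.]

    relu :: (Num a, Ord a) => a -> a
    relu = max 0

    softmax :: Floating a => [a] -> [a]
    softmax xs = map (/ sum es) es  where es = map exp xs

    -- one dense layer: weight rows dotted with the input, plus bias
    dense :: Num a => [[a]] -> [a] -> [a] -> [a]
    dense w b x = zipWith (+) b [sum (zipWith (*) row x) | row <- w]

    -- two hidden layers of width N with relu, softmax at the end
    network :: (Floating a, Ord a)
            => [[a]] -> [a] -> [[a]] -> [a] -> [[a]] -> [a] -> [a] -> [a]
    network w1 b1 w2 b2 w3 b3 x =
      softmax (dense w3 b3 (relu <$> dense w2 b2 (relu <$> dense w1 b1 x)))
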
2025-02-25 10:31:54 +0100 | <Athas> | How close is this to working? |
2025-02-25 10:29:37 +0100 | <tomsmeding> | 'ad' also has unsafePerformIO in the same positions |
2025-02-25 10:29:07 +0100 | ThePenguin | (~ThePengui@cust-95-80-24-166.csbnet.se) ThePenguin |
2025-02-25 10:28:51 +0100 | <tomsmeding> | I guess that's called "runST . unsafeCoerce" |
2025-02-25 10:28:41 +0100 | <tomsmeding> | and there isn't even unsafePerformST! |
2025-02-25 10:28:23 +0100 | <tomsmeding> | Athas: I might be able to switch to ST, but because the individual scalar operations are effectful (they mutate the tape), there's going to be some unsafePerformIO-like thing there anyway |
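
    [Editor's note: a sketch of the missing unsafePerformST, using
    functions that do exist in base (Control.Monad.ST.Unsafe); the
    "runST . unsafeCoerce" spelling above forges the rank-2 state token
    to the same effect.]

    import Control.Monad.ST (ST)
    import Control.Monad.ST.Unsafe (unsafeSTToIO)
    import System.IO.Unsafe (unsafePerformIO)

    -- base offers no unsafePerformST; routing through IO is one way to
    -- spell it, with all of unsafePerformIO's usual caveats.
    unsafePerformST :: ST s a -> a
    unsafePerformST = unsafePerformIO . unsafeSTToIO
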
2025-02-25 10:27:08 +0100 | ThePenguin | (~ThePengui@cust-95-80-24-166.csbnet.se) (Remote host closed the connection) |
2025-02-25 10:26:40 +0100 | why | (~why@n218250229238.netvigator.com) (Quit: Client closed) |
2025-02-25 10:26:40 +0100 | <tomsmeding> | if so, then exactly the same thing should work here |
2025-02-25 10:26:24 +0100 | <Athas> | Yes, so that ought to work. That is good. |
2025-02-25 10:25:48 +0100 | <tomsmeding> | Athas: I think if you 'diffF' on a function that uses 'grad', you instantiate the 'a' in the type of 'grad' with 'AD s (Forward a)'; does that sound right? |
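
    [Editor's note: a sketch of that instantiation in action, assuming
    the usual Numeric.AD exports (grad, diffF, auto); 'hvp' is an
    illustrative name, not part of 'ad'. Forward mode (diffF) runs over
    a call to grad, so grad's 'a' is instantiated at 'AD s (Forward a)',
    and the composition computes a Hessian-vector product.]

    {-# LANGUAGE RankNTypes #-}
    import Numeric.AD (auto, diffF, grad)

    -- H(x) . v, computed as d/dt of grad f (x + t*v) at t = 0:
    -- reverse mode on the inside, forward mode on the outside.
    hvp :: Num a => (forall b. Num b => [b] -> b) -> [a] -> [a] -> [a]
    hvp f x v =
      diffF (\t -> grad f (zipWith (\xi vi -> auto xi + t * auto vi) x v)) 0
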
2025-02-25 10:24:47 +0100 | <tomsmeding> | nothing, it seems, semantically? |
2025-02-25 10:24:27 +0100 | <tomsmeding> | I'm not familiar enough with the 'ad' API to figure out what that 'AD' type is supposed to do |
2025-02-25 10:22:59 +0100 | dsrt^ | (~dsrt@108.192.66.114) (Ping timeout: 260 seconds) |
2025-02-25 10:22:50 +0100 | <tomsmeding> | the fact that that package's name is so short is really inconvenient. :P |
2025-02-25 10:22:36 +0100 | tomsmeding | opens the 'ad' docs |
2025-02-25 10:22:30 +0100 | off^ | (~off@108.192.66.114) (Ping timeout: 268 seconds) |
2025-02-25 10:22:19 +0100 | <Athas> | It's 'diffF' on a function that uses 'grad'. |
2025-02-25 10:21:55 +0100 | <Athas> | Operationally yes. I don't know what the type of that looks like. I was referring to the surface syntax. |
2025-02-25 10:21:13 +0100 | <tomsmeding> | i.e. doing reverse mode, but secretly the scalars are dual numbers? |
2025-02-25 10:21:08 +0100 | califax | (~califax@user/califx) califx |
2025-02-25 10:21:00 +0100 | <tomsmeding> | wouldn't that be Reverse s (Forward Double) in 'ad'? |
2025-02-25 10:20:49 +0100 | califax | (~califax@user/califx) (Remote host closed the connection) |
2025-02-25 10:20:40 +0100 | <Athas> | It Just Works in 'ad'. |
2025-02-25 10:20:27 +0100 | <Athas> | jvp (\ ... vjp ...), but with the weird 'ad' names instead. |
2025-02-25 10:20:17 +0100 | <tomsmeding> | and I think you may be right that IO is overkill here! |
2025-02-25 10:19:57 +0100 | Smiles | (uid551636@id-551636.lymington.irccloud.com) Smiles |
2025-02-25 10:19:49 +0100 | <tomsmeding> | how would that look in 'ad'? |
2025-02-25 10:19:31 +0100 | <Athas> | But this cannot possibly do reverse-then-forward, right? |
2025-02-25 10:19:14 +0100 | <tomsmeding> | surely they could be |
2025-02-25 10:19:09 +0100 | <Athas> | I guess they could be. |
2025-02-25 10:19:01 +0100 | <Athas> | Are dual numbers Storable? |
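
    [Editor's note: a minimal sketch of such an instance, assuming a
    strict dual-number pair stored as primal-then-tangent; this Dual
    type is hypothetical, not a type from 'ad'.]

    {-# LANGUAGE ScopedTypeVariables #-}
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (Storable (..))

    data Dual a = Dual !a !a  -- primal, tangent

    instance Storable a => Storable (Dual a) where
      sizeOf _    = 2 * sizeOf (undefined :: a)
      alignment _ = alignment (undefined :: a)
      peek p      = Dual <$> peekElemOff (castPtr p) 0
                         <*> peekElemOff (castPtr p) 1
      poke p (Dual x dx) = do
        pokeElemOff (castPtr p) 0 x
        pokeElemOff (castPtr p) 1 dx
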
2025-02-25 10:18:52 +0100 | <Athas> | This looks compelling, but why unsafePerformIO directly? Can this not be expressed with ST? |
2025-02-25 10:18:43 +0100 | AlexZenon | (~alzenon@178.34.162.44) |