https://dev.to/sahilrajput/how-latency-numbers-changes-from-1990-to-2020-173n
What I want from a distributed application programming interface: 1. swap!/deref 2. some useful subset of the Clojure language 3. nothing else
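Roughly this shape, say, where the only new thing is how you get hold of the reference. Purely hypothetical sketch: `connect-cluster` and `remote-atom` are made-up names, not any real library's API.

```clojure
(comment
  (def cluster (connect-cluster "tcp://10.0.0.5:9000")) ; some set of machines
  (def counter (remote-atom cluster ::counter 0))       ; state lives out there

  (swap! counter inc) ; exact same call shape as a local atom
  @counter)           ; deref; maybe a cached or slightly stale read
```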
Do you want this capability in a distributed system, where the failure of one or a few machines does not inhibit the progress of the others? Or do you want it in a system of N machines where if 1 fails, the computation fails and/or locks up, and maybe needs restarting?
A little bit of both. I don't think all distributed system designs should have to worry about "Byzantine fault tolerance" and net splits. Unless you want to distribute work over a network of potentially untrusted peers or something.
ZooKeeper is probably overkill for a lot of "distributed" use cases, I think
And these days, if you fully control two networked computers and one isn't playing nice with the other, it's probably the application programmer's fault and they should fix it.
But anyway, machine redundancy stuff can be somewhat orthogonal to the distributed application programming interface on top that I'm thinking of. It'd probably make sense to have some rudimentary supervisor service running, like with Erlang. But if application logic breaks the system, the fix is for the application logic to stop doing that.
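To sketch what I mean by "rudimentary", something in the spirit of Erlang's restart-on-failure could be as small as this (toy, single-process sketch, not any particular system's implementation):

```clojure
(defn supervise
  "Run (task), restarting it whenever it throws, with a small delay
   between restarts. Stops once the task completes normally."
  [task restart-delay-ms]
  (future
    (loop []
      (let [result (try
                     {:ok (task)}
                     (catch Exception e
                       {:error e}))]
        (if (contains? result :error)
          (do (Thread/sleep restart-delay-ms)
              (recur))
          (:ok result))))))

;; (supervise #(do-some-work) 1000) ; do-some-work is a stand-in
```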
The newer version of Tau has a supervisor that coordinates things and a watchdog service for health monitoring. Not much implemented there yet though.
Failing machines are pretty far removed from worrying about untrusted adversaries.
Well, if a thread fails, that rarely locks up a whole program. And if it does, the programmer needs to build redundancy into that thread logic.
Using network compute resources should be as easy as using threads
And using Clojure, I don't even really want to deal with those networked threads directly. I'll just lean on higher-level abstractions to manage them
Networked futures, agents, atoms, etc
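E.g. a "networked future" could keep exactly the shape of a local one. Local stand-in sketch only, with the actual remote execution hand-waved away:

```clojure
(defn net-future
  "Pretend to run f on node; locally this is just a future."
  [node f]
  (future
    ;; a real version would serialize f, ship it to `node`,
    ;; and block on the reply instead of calling it in-process
    (f)))

;; usage reads exactly like a plain future:
;; (def x (net-future "node-a" #(+ 1 2)))
;; @x ;=> 3
```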
Would each atom be 'owned' by one machine? If so, I don't see how you can write code using atoms where the whole computation doesn't just lock up when the owning machine dies. If the atom's "owning machine" can move, then you've got a fun implementation.
Depends. In some situations I've cached the atom on all subscribed nodes and had a particular "owning" node control all updates and broadcast the results. With that setup, it's pretty easy to change ownership to another node if the owning node stops responding to a watchdog.
In terms of "moving machines", I haven't thought about that much, but we're using "js isolates" here, so there may already be a story for moving js isolates between machines.
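A rough local simulation of that owner-broadcast setup, just to make the shape concrete: every node keeps a cached copy in a plain atom, only the owner applies updates, and "broadcast" is a doseq over the caches. Networking, subscriptions, and the watchdog are all elided.

```clojure
(def caches {:node-a (atom 0) :node-b (atom 0) :node-c (atom 0)})
(def owner (atom :node-a))

(defn owned-swap!
  "Apply f on the owning node's copy, then push the new value to every cache."
  [f & args]
  (let [new-val (apply swap! (caches @owner) f args)]
    (doseq [[_ cache] caches]
      (reset! cache new-val))
    new-val))

(defn transfer-ownership!
  "What the watchdog would trigger if the current owner stops responding."
  [new-owner]
  (reset! owner new-owner))

;; (owned-swap! inc)              ;=> 1, mirrored into all three caches
;; (transfer-ownership! :node-b)
;; (owned-swap! + 10)             ;=> 11
```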
It isn't difficult to have systems that usually work but break under unusual-but-possible circumstances, e.g. the watchdog timer fires and ownership is handed off, and then the old owner starts making progress again at the same time. That's where Paxos/ZooKeeper/etc. shine: avoiding incorrect behavior in those unusual corner cases.
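One well-known way to make that particular corner case harmless, short of full consensus, is an epoch/fencing token: every ownership transfer bumps an epoch, and writes stamped with a stale epoch get rejected. Toy single-process sketch:

```clojure
(def state (atom {:epoch 0 :value 0}))

(defn take-ownership!
  "Bump the epoch; returns the new epoch the caller must stamp on its writes."
  []
  (:epoch (swap! state update :epoch inc)))

(defn fenced-write!
  "Apply the write only if the caller still holds the latest epoch."
  [epoch f & args]
  (swap! state
         (fn [s]
           (if (= epoch (:epoch s))
             (apply update s :value f args)
             s)))) ; stale owner: write silently ignored

;; (def e1 (take-ownership!)) ; old owner
;; (def e2 (take-ownership!)) ; watchdog fired, new owner takes over
;; (fenced-write! e1 inc)     ; old owner resumes -> ignored
;; (fenced-write! e2 + 5)     ; new owner's write applies
```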
But if you want to spread ownership of an atom across multiple nodes, I'd just make sure ownership transfers unidirectionally across whatever shape of nodes you've built to do the work, and then have the result finalized at a final node (the UI, maybe) or at the one that originated the request.
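A toy sketch of that shape: the value moves one way through a fixed pipeline of "nodes" (plain functions here), and only the last stage finalizes it. No networking, just the ownership-flow idea.

```clojure
(defn run-pipeline
  "Each stage owns the value exactly once, transforms it, and hands it on."
  [stages initial]
  (reduce (fn [value stage] (stage value)) initial stages))

;; (run-pipeline [#(assoc % :parsed true)
;;                #(assoc % :scored 0.9)
;;                #(assoc % :finalized true)] ; the last node / UI finalizes
;;               {:request-id 42})
```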
Paxos may or may not be necessary, depending on the reliability of the system underneath. But I'd like a distributed thread abstraction on top where I don't have to care about the underlying coordination, just like I don't with a CPU.
InfiniBand network drives usually "just work".