#+title: Hydra Robustness/Stability/Simplicity
* The situation
Essentially we need to know what is going on with some key components:
- Cardano Node
- Hydra Heads
- Hydra Nodes
The Hydra Head state comes from the Nodes; for the nodes we care about their status and whether they are running.
Ideally we need an interface that can accurately internalize and understand all of the information from
the various parts and prevent a large array of footguns.
Let's start with the lifecycle and gotchas.
High level API (request/response/monitoring):
Let's look at it from a HydraNow-specific lens:
At the Payment Channel Level
I want to:
- Start a payment channel
- Send transactions in this payment channel
- Close the payment channel (which implies funds are given to the respective owners on L1)
The interface should be as simple as this: the Payment Channel API should just be concerned
with these key actions, with failure then represented at the Payment Channel level.
At the Hydra Head level, well, the Hydra Head level doesn't really exist; in Hydra
the Head is essentially the chain data plus the state of the Hydra Head as seen by the nodes.
But from an interface perspective it is nice to be able to have our code speak in Heads and not
necessarily nodes.
So at the Head level we essentially want the following RPC/commands:
Another important aspect of the
** Create
This would be a combination of generating the configurations, running the nodes, and then sending Init. I recommend this be just one concrete action in the API, as it is something we always want to do. The first Commit could likely be incorporated into this command as well, but because you may not be able, ready, or willing to commit right away, we can leave it out.
Creation would then only persist a Head that was able to get through Init. The tradeoff is that gas is spent the second this action is performed, but only persisting valid, working Heads allows us to make a lot of good assumptions about the state of the system.
*** What are the failure states?
Configuration Error: The nodes will fail to start if the configuration is wrong, so we should design our types around "parse, don't validate" so that the process of starting a Head begins with a valid configuration. In our API/implementation the types should not allow for invalid configurations, so configuration errors can likely be handled in the form of functions that take the various parameters and either produce a configuration or not.
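A minimal sketch of that "parse, don't validate" shape, with hypothetical types (~HeadConfig~, ~ConfigError~, and the fields are assumptions, not the real HydraPay types):
#+begin_src haskell
module ConfigSketch where

import Data.Text (Text)

-- Hypothetical validated configuration: if you hold a HeadConfig,
-- the configuration is known-good before any node process starts.
data HeadConfig = HeadConfig
  { headParticipants :: [Text] -- proxy addresses, non-empty
  , headNetworkPort  :: Int    -- within the usable port range
  }
  deriving Show

data ConfigError
  = NoParticipants
  | InvalidPort Int
  deriving Show

-- Smart constructor: the only way to obtain a HeadConfig, so
-- downstream code never sees an invalid configuration.
mkHeadConfig :: [Text] -> Int -> Either ConfigError HeadConfig
mkHeadConfig participants port
  | null participants           = Left NoParticipants
  | port < 1024 || port > 65535 = Left (InvalidPort port)
  | otherwise                   = Right (HeadConfig participants port)
#+end_src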
Node Error: The nodes fail to start. This should be an exceptional circumstance, but while we are figuring this all out we need a good way to track down why a node failed to start and fix it. So while this shouldn't be exposed in the user-level API, we should have something to point us more directly to node errors and crashes; likely this takes the form of the system self-reporting, monitoring, and restarting threads. It should maybe even track RAM and resource usage as a form of logging. Potentially logs from another part of the system also tell us why the node failed. For example: there is a crash when a NewTx is sent to a node that is part of a Head that isn't considered Open. If it doesn't produce logs because of the crash, then we don't know what happened, but if we know the time of the crash, we can likely see around that time in the logs that a NewTx was sent.
Transaction fails to get posted on chain: This is almost eloquently expressed in the Hydra websocket API error PostTxOnChainFailed. The reason is buried in there, and we could likely parse it out to make it more obvious to us during development. Usually this indicates Fuel is missing. We can respond to this error by topping up fuel using HydraNow's faucet, but HydraPay can't automatically handle this, as the response should be user-defined... Maybe we can provide hooks or callbacks for when this happens...
Technically we should be able to see the transactions the node makes, as the node is using the signing key of the proxy address... So worst case we can look directly at the chain to try and glean information. The issue with that is we would then be re-implementing the smart contract interaction layer of Hydra, which sounds like awful work to do.
*** What State needs to be tracked?
We need to track the process handles for the nodes and which node is associated with which person, and we likely need some unique way to refer to a Head and to each node process, to be able to easily parse and find information in the logs, as lots of unforeseen random things may happen.
We need a database of Heads, likely using HeadId as the primary key. We also need to track whether that "Head" is running, meaning all the nodes are running and see each other.
Each node's state is important, as the node state directly influences the Head state. For example, each node replays its state and then reacts to connected peers; if not all nodes have done this, the Head is in an unusable state. Apparently rollbacks can happen, though I don't know
how to fold rollbacks into our representation of state... Perhaps we just consider the Head unusable until a new message is produced that changes state. For example, you may Commit, and then a rollback happens, but likely another message like HeadIsInitializing will arrive after a rollback.
(Maybe we should ask the Hydra team about this?) The API currently doesn't expose anything about rollbacks, and it looks like we would just want to scan the logs. Potentially we only need to scan the logs, and then we don't care about the websocket information?
*** What Events Change the State?
So essentially, to get a Head you give a configuration and get back a unique Head. The HeadId can be used to track this Head, now that Heads are only "created" when they indeed have state on L1 in a smart contract.
So within the API a Create happens, and the result of that Create is either a HeadId and some processes, or an error from one of the above failure states.
The HeadId is given by HeadIsInitializing!
*** Mechanically what needs to happen
- We start all the nodes and ensure they are alive.
- We send an Init via one node's websocket.
- The node creates state on L1 representing the Head; this costs gas.
- We receive a HeadIsInitializing containing the HeadId, indicating success.
What can happen on failure?
For now I think failure is going to be CommandFailed or PostTxOnChainFailed.
So we likely need to check the state of the Node/Head as we see it before allowing such a thing, though if Create simply takes the configuration and produces Either Error HeadId, then we can avoid the CommandFailed part, meaning anything but PostTxOnChainFailed
is exceptional. The issue is we still need to know all the messages we got, to be able to expand our failure logic in case we have missed something...
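A sketch of what that wait could look like, assuming a hypothetical ~NodeMsg~ type and a channel fed by the websocket reader (none of these names are the real HydraPay ones):
#+begin_src haskell
module CreateSketch where

import Control.Concurrent.STM (TChan, atomically, readTChan)
import Data.Text (Text)
import qualified Data.Text.IO as T

newtype HeadId = HeadId Text deriving Show

-- Only the responses this sketch cares about; everything else is
-- kept raw so it can be logged.
data NodeMsg
  = HeadIsInitializing HeadId
  | PostTxOnChainFailed Text
  | OtherMsg Text

-- Block until the Init resolves one way or the other, logging
-- anything unexpected so we can expand the failure logic later.
awaitInit :: TChan NodeMsg -> IO (Either Text HeadId)
awaitInit chan = do
  msg <- atomically (readTChan chan)
  case msg of
    HeadIsInitializing hid  -> pure (Right hid)
    PostTxOnChainFailed err -> pure (Left err)
    OtherMsg raw -> do
      T.putStrLn ("unexpected message during Init: " <> raw)
      awaitInit chan
#+end_src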
** Commit (This needs to talk to the right node, it must be the node that acts on behalf of the participant committing)
Committing is quite the process because of the current limitation of Hydra's commit scheme. Essentially you must commit 0 or 1 UTxOs, and that UTxO must have the exact amount of ADA you would like to commit.
Currently we have proxy addresses, and we Pay (transfer ADA to) these addresses. This has a side optimization/simplification benefit, which is that the UTxO produced can be used to Commit.
Meaning for HydraPay's API, Commit can take a HeadId, an address (of a participant, not the participant's proxy), and a transaction signed by the participant that pays to the proxy. We can submit the transaction and then
use the resulting UTxO directly as the commit input, and also provide a form of Commit that commits 0 UTxOs, for a single-direction payment channel if necessary.
#+begin_src haskell
commitToHead :: MonadHydraPay m => HeadId -> CommitInfo -> m (Either Error ())
#+end_src
When a Head isn't running or a node is dead, we should restart it or something, but do we want commitToHead to wait, or do we want to design our system a different way, where you must explicitly wait?
Usually the failure would likely trigger some other logic, so it makes sense to have commitToHead wait for the Head to either succeed or fail. The issue is that we may wait indefinitely
if we have failed to consider a message that is actually a response to the Commit we have sent.
So in general, then, we have some messages we should always consider, and likely log...
*** What State needs to be tracked
*** What are the failure states
Failure would be an invalid commit, but we can likely detect most of these issues ourselves, avoiding sending to the node at all:
- Nodes crashed or aren't running? We can tell from the process handles before we send anything, the real question is do we restart them right away automatically? I would say probably, as them not running is likely an exceptional circumstance.
- The Head isn't in the right state? Well in that case we can also detect that right away and respond with an error
- The Head is not able to be Committed to? This is also something we can directly control ourselves, it just sucks that we have to.
What could happen at the Websocket API level?
We should always get either Committed /or/ CommandFailed Commit
** SendTx or something
It is kind of wild that the websocket API's NewTx requires an actual CBOR transaction, as the node has your public/private keys and should likely just take a description of how much money you want to send and where.
Our SendTx should do exactly that and simply take a Map like Map Address Amount or something and go from there. There is validation we can do before even sending the websocket message.
NewTx can fail if the transaction is invalid, but there is also some implicit time coupling here, where one transaction has to finish before you can send another.
By decoupling the SendTx call from the actual building of a valid transaction CBOR, we can simply keep a queue of transactions and submit them as the earlier ones succeed. Usually the transactions happen
fast enough that we would never construct two transactions that try to spend the same UTxO, but this way we avoid the potential of that happening altogether.
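A sketch of that queueing shape (~Payment~, ~sendTx~, ~txWorker~, and the ~buildAndSubmit~ callback are all assumptions for illustration):
#+begin_src haskell
module SendTxSketch where

import Control.Concurrent.STM (TQueue, atomically, readTQueue, writeTQueue)
import Control.Monad (forever)
import Data.Map (Map)
import Data.Text (Text)

type Address  = Text
type Lovelace = Integer

-- Hypothetical payment description: who gets how much.
type Payment = Map Address Lovelace

-- Enqueueing is all callers ever do; they never touch UTxOs.
sendTx :: TQueue Payment -> Payment -> IO ()
sendTx queue = atomically . writeTQueue queue

-- A single worker drains the queue, so two in-flight transactions
-- can never race to spend the same UTxO. buildAndSubmit stands in
-- for building the CBOR, sending NewTx, and waiting for confirmation.
txWorker :: TQueue Payment -> (Payment -> IO ()) -> IO ()
txWorker queue buildAndSubmit = forever $ do
  payment <- atomically (readTQueue queue)
  buildAndSubmit payment
#+end_src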
On the Hydra API side I would like to see NewTx completely eliminated, and probably GetUTxO and the UTxO event too, as that is a lower level than anybody using Hydra actually cares about.
** Balance (more importantly, the balances of the individual participants)
I don't know how this should be shaped; I do know dealing with UTxOs when looking at a Head or payment channel seems like the wrong level of granularity (see SendTx or something).
** Contest (We won't use this, but as part of providing an API and extending HydraPay we 100% care about having this in the API available to people to use, and thus we care about it)
So we want to have this in the API eventually, but we would want to be able to test it; for now just providing it and saying "use if you need to" would suffice. The big invariant here is:
each participant can only Contest once, so we should validate that the node doesn't crash or something when you Contest twice...
** Destroy
Destroying a Head is contextual, as we usually don't care about the specifics; at different points in the lifecycle of a Head, we have different ways to shut it down...
So we end up with a Destroy that simply looks at the Head and does the right thing:
- Before all parties have committed? ABORT.
- After the parties have committed? CLOSE + wait for the contestation period + FANOUT.
Contestation periods are also pretty annoying and something we probably don't want to push into HydraPay until we have a use case and something to validate that we can even do it/make it work...
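A sketch of that contextual dispatch (the status and teardown names are assumptions, not Hydra API tags):
#+begin_src haskell
module DestroySketch where

-- On-chain status as tracked by the Head manager (hypothetical).
data HeadStatus
  = Initializing -- Init seen, not all commits in yet
  | Open         -- HeadIsOpen seen
  | Closed       -- HeadIsClosed seen, contestation period running
  deriving Show

-- The teardown paths Destroy can choose between.
data Teardown = Abort | CloseThenFanout | FanoutOnly
  deriving Show

-- Destroy looks at the Head and picks the right path.
teardownFor :: HeadStatus -> Teardown
teardownFor Initializing = Abort
teardownFor Open         = CloseThenFanout
teardownFor Closed       = FanoutOnly
#+end_src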
** Implementation/Mechanics
So now we have state to track, and that state needs to be accurate. We need to know:
- The HeadID
- The status of the Head On-Chain
- The status of the Head when it comes to readiness to receive commands/is running
- The status of the Nodes
- The participants and which Nodes they control
- Potentially how many times each participant has contested.
- The fuel of each Participant's proxy address
How do we get the HeadId? When you receive a HeadIsInitializing, it will come with the HeadId.
How do we get updates to the status of the Head? Through the websocket of the nodes we will receive the following (a sketch mapping these tags to a status type follows the list):
- HeadIsInitializing
- HeadIsOpen
- HeadIsAborted
- ReadyToFanout
- HeadIsFinalized
- HeadIsContested
- HeadIsClosed
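A sketch of that mapping; the tags are the Hydra websocket message tags above, while ~OnChainStatus~ itself is our own assumption:
#+begin_src haskell
module StatusSketch where

-- Our own status type; one constructor per message tag above.
data OnChainStatus
  = StatusInitializing
  | StatusOpen
  | StatusAborted
  | StatusFanoutReady
  | StatusFinalized
  | StatusContested
  | StatusClosed
  deriving Show

statusFromTag :: String -> Maybe OnChainStatus
statusFromTag tag = case tag of
  "HeadIsInitializing" -> Just StatusInitializing
  "HeadIsOpen"         -> Just StatusOpen
  "HeadIsAborted"      -> Just StatusAborted
  "ReadyToFanout"      -> Just StatusFanoutReady
  "HeadIsFinalized"    -> Just StatusFinalized
  "HeadIsContested"    -> Just StatusContested
  "HeadIsClosed"       -> Just StatusClosed
  _                    -> Nothing
#+end_src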
*** Failure states implicit in the implementation:
- The websocket can close and should be restarted, and we need to ignore the replay
- Nodes can crash, and we should likely restart them
*** Footguns in the implementation:
- When connecting to a node via websocket, the node replays history, then sends a Greetings (which implies no nodes are connected), and then sends PeerConnected for each node
*** What am I worried about?
Dropping things on the floor, hanging forever, etc.
When can that happen? Well, let's say we want to create a Head: we would need some setup work in running the nodes, connecting to them via websocket, and then
processing all the messages from those websockets; we also need to be able to send messages through to the websockets.
We need to be able to "know" when the action has succeeded or failed.
How do we signal to an action that success has happened? We probably create a TMVar, send the request, and wait on that TMVar to be filled. We likely also
just want a timeout; the issue with timeouts is that some things, during congestion, can actually just take a lot of time... So maybe the timeout is configurable or something.
The issue is, if we issue an action that takes 66 seconds and we had the timeout at 60 seconds, that action will happen eventually, and so we _need_ to keep waiting so as not to drop those results on the floor.
For example, let's say a Head is created but we stopped waiting; we would end up making another Head if we ran the Create action again. So if we did have timeouts, we would likely want logic to say "find (or create for me) a Head that has these participants", and maybe that is easy enough to do, as that Head would be in the DB with the participant list and on-chain status known (HeadIsInitializing), meaning we could likely pick those up.
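A sketch of the TMVar pattern with a timeout (~request~ and its callback are assumptions; ~timeout~, from System.Timeout, is a real library function taking microseconds):
#+begin_src haskell
module RequestSketch where

import Control.Concurrent.STM (TMVar, atomically, newEmptyTMVarIO, takeTMVar)
import System.Timeout (timeout)

-- Make a fresh TMVar, hand it to whatever sends the request, then
-- block until a reader thread fills it. The timeout is in
-- microseconds and would be configurable.
request :: Int -> (TMVar resp -> IO ()) -> IO (Maybe resp)
request micros send = do
  slot <- newEmptyTMVarIO
  send slot
  -- Nothing means we stopped waiting, NOT that the action failed:
  -- the result may still arrive later and must not be dropped.
  timeout micros (atomically (takeTMVar slot))
#+end_src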
So it seems we really just want a Head-specific message queue and some threads set up to handle messages.
We don't necessarily need request-response; instead we just need the ability to send and read "messages".
Then things are created in terms of messages and waiting for messages, which maps more nicely onto the Hydra API.
We avoid the other issues by simply holding onto state propagated by these messages.
We can add a layer where we are able to inject our own messages, this could allow us to handle a lot of things pretty nicely.
Monadically, then, we are just waiting for events to happen on the system.
This probably makes business logic pretty nice to write.
We just make the API for reading and writing messages work asynchronously, which isn't hard; then the difficulty is just making sure we are waiting for the /right/ messages,
and failing out with bright, exceptional failure when we get something we don't expect.
So the websocket API becomes reading and writing messages; we just make sure the messages are /broadcast/ to all the people potentially looking at this system.
We use these messages to update the state, and then we also use the state to prevent firing dumb messages.
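A sketch of the broadcast part using stm's broadcast channels (the wrapper names are ours; ~newBroadcastTChanIO~, ~dupTChan~, and ~writeTChan~ are real stm functions):
#+begin_src haskell
module BroadcastSketch where

import Control.Concurrent.STM
  (TChan, atomically, dupTChan, newBroadcastTChanIO, writeTChan)

-- One writer (the websocket reader thread) publishes every message.
-- The broadcast channel itself is write-only; readers each hold a
-- duplicate and see every message written after they subscribed.
newBroadcast :: IO (TChan msg)
newBroadcast = newBroadcastTChanIO

publish :: TChan msg -> msg -> IO ()
publish chan = atomically . writeTChan chan

subscribe :: TChan msg -> IO (TChan msg)
subscribe = atomically . dupTChan
#+end_src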
What else needs to be there for simple and stable interfaces?
** General stuff
We could even check fuel before seeing PostTxOnChainFailed, though in general, maybe we have a way to take PostTxOnChainFailed and turn it into a ThisPersonNeedsFuel Address type message, and then just have a handler that handles that.
It might be worth logging/knowing when a node makes a transaction; a node transaction would be a transaction from a proxy address that uses the Fuel UTxO as input?
The Hydra API does actually have a list of ERROR states for the various errors; we could likely find the parallels and make our errors carry more human-readable, less bloated error messages.
* Proposed changes
At the Head level we should track Heads via HeadId. This implies that "creating" a Head means on-chain activity in the form of an Init, which is then persisted and managed by HydraPay (the Head manager). This also means creating a Head is running a Head, though that doesn't have to stay true once we have the HeadId.
Interactions under the Head level, at the Node level, should follow a message-queue-based API where messages can be sent and read. These represent the activity on the websocket, but will come from a single connection and be placed in concurrency-friendly data structures so they can be shared and passed around.
We will then simplify Head interactions like Commit, Init, etc. by simply waiting for a message; we will also indicate in the state whether messages are flowing (i.e. the websocket is open and active).
Head-level actions will first check the status of the node processes, and log when they crash so we can try to ascertain why.
Head-level actions should try to check any invariants they can before they actually commit to sending a message to the message queue.
The API at the Head level simplifies to:
- Create
- Commit
- SendAda
- GetBalance
- Destroy
All will utilize the above guidelines and the send/read message interface.
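A sketch of that surface as one record of actions (all the types here are placeholder assumptions standing in for the real HydraPay ones):
#+begin_src haskell
module HeadApiSketch where

import Data.Map (Map)
import Data.Text (Text)

-- Placeholder types standing in for the real HydraPay ones.
newtype HeadId = HeadId Text
newtype Error  = Error Text
data HeadConfig = HeadConfig
data CommitInfo = CommitInfo
type Address  = Text
type Lovelace = Integer

-- The whole Head-level API; every action goes through the
-- send/read message interface underneath.
data HeadAPI m = HeadAPI
  { create     :: HeadConfig -> m (Either Error HeadId)
  , commit     :: HeadId -> CommitInfo -> m (Either Error ())
  , sendAda    :: HeadId -> Map Address Lovelace -> m (Either Error ())
  , getBalance :: HeadId -> Address -> m (Either Error Lovelace)
  , destroy    :: HeadId -> m (Either Error ())
  }
#+end_src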
Timestamps will be added to logs, and logs will be unified; we will not do anything to persist logs ourselves unless we can set a hard limit on the amount of logs.
The PaymentChannel API will then sit on top of all of this. When creating a payment channel, we DO need to persist some information, so that users have an indication that there
is a payment channel and so we can give immediate feedback.
So the payment channel table should be updated to have a Maybe HeadId pointing to the Head that actually holds all the channel information. Creation can then return immediately, and the task workers can use the above API to actually do their work. Representing the payment channels in the UI can then be based on whether the Head is Just, and if it is, on the status of that Head for success and failure.
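A sketch of that table row shape (field names are hypothetical):
#+begin_src haskell
module ChannelSketch where

import Data.Text (Text)

newtype HeadId = HeadId Text

-- pcHeadId is Nothing until a task worker has actually created the
-- Head; the UI can show the channel immediately and derive status
-- from the Head once the field is Just.
data PaymentChannel = PaymentChannel
  { pcName   :: Text
  , pcHeadId :: Maybe HeadId
  }
#+end_src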
Create and Destroy are a little more heavyweight and will actually change the Head manager database.