Does Datomic have any sort of known issues around creating a number (~300-400) of connections in a very short space of time? We've got an app that on startup spawns 30-40 threads, each of which creates connections to pull data from 10 separate databases (the databases are unique per thread, so only 1 connection per db, but all on the same Datomic cluster). We frequently get a load of 'Datomic Client Timeout' exceptions with category interrupted on startup, and have to delete/recreate the container, even though the AWS metrics (Datomic Cloud production setup) don't show any particular issues with memory, CPU, etc. Once the app has started it seems to be fine and stable, no timeouts (with transact and q calls calling d/connect each time they run), unless we perform an action that requires it to run through and recreate a lot of those connections rapidly again.
I would add exponential retries with some jitter
the exception you receive should be marked with a :cognitect.anomalies/category that should indicate if it's retriable
Yeah, it's interrupted
(namespaced of course)
you don't want to destroy a container just because 1/400 connections fail
It's likely you're getting throttled ops.
(Can check in CW dashboard)
I was going to do some work to add that, but there has been a bit of pushback from some in the team because recommendations/docs from Cognitect elsewhere recommend a retry on unavailable, but don't mention other categories
So having feedback that it would be a good idea is 👍
interrupted, busy and unavailable are the 3 retriable anomalies
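Something along these lines is what I'd sketch — rough illustration only, the `retriable?` / `with-retry` names are made up here and not part of the Datomic client API. The idea is to look at the anomaly category on the exception's ex-data and back off exponentially with some jitter:
```clojure
(def retriable-categories
  "The three anomaly categories generally worth retrying."
  #{:cognitect.anomalies/interrupted
    :cognitect.anomalies/busy
    :cognitect.anomalies/unavailable})

(defn retriable?
  "True if the exception carries a retriable anomaly category in its ex-data."
  [ex]
  (contains? retriable-categories
             (:cognitect.anomalies/category (ex-data ex))))

(defn with-retry
  "Calls (f), retrying retriable anomalies up to max-retries times
   with exponential backoff plus random jitter."
  [f {:keys [max-retries base-ms max-ms]
      :or   {max-retries 5 base-ms 100 max-ms 10000}}]
  (loop [attempt 0]
    (let [result (try
                   {:ok (f)}
                   (catch Exception ex
                     (if (and (retriable? ex) (< attempt max-retries))
                       {:retry ex}
                       (throw ex))))]
      (if (contains? result :ok)
        (:ok result)
        (do
          ;; exponential backoff with jitter, capped at max-ms
          (Thread/sleep (long (min max-ms
                                   (* (Math/pow 2 attempt)
                                      base-ms
                                      (+ 0.5 (rand))))))
          (recur (inc attempt)))))))
```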
@kenny You mean some Datomic-internal throttling, or on the DynamoDB side? We did see a bit of DDB throttling previously, so we changed from provisioned capacity to on-demand (basically, no scaling needed but pay per request), and we don't see it any more. Once we have a better idea of longer-term access patterns we'll probably change that back
@carr0t cloud or onprem?
Cloud
We are going between VPCs though, as the CloudFormation for Datomic Cloud sets up its own VPC rather than being able to 'join' an existing one, and we already had an existing one with EKS in it etc. We're not currently seeing any limits being hit re: inter-VPC comms though
The cloudwatch dashboard should show Throttled Ops
(Dashboard for Datomic)
This is separate from DDB throttling, but could be caused by DDB throttling
Also curious if you're pointing all 300-400 to the primary compute group.
Oh yes, with you. Nothing in the dashboard. Occasional OpsTimeout there too, but no OpsThrottled
@kenny At the moment, yes. We've not deployed any query groups (right terminology? I'm pretty new to Datomic), so the only instances running in the cluster are the 2x i3.large ones that are part of the standard template.
Our access pattern involves a fair bit of writing. In some cases we're 1:1 read:write. There is a small lean towards q requests on startup as it loads initial state, but that is only maybe 10% above the transact requests, so I wasn't sure that query groups would help.
plan for exponential retry/backoff on transact, connect, q
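e.g. something like this, reusing the `with-retry` sketch above — `client`, the db name, and the wrapper names are just placeholders for whatever your setup looks like:
```clojure
(require '[datomic.client.api :as d])

;; Sketch only: wrap the connect/transact/q call sites in the retry helper.
(defn connect-with-retry [client db-name]
  (with-retry #(d/connect client {:db-name db-name}) {}))

(defn transact-with-retry [conn tx-data]
  (with-retry #(d/transact conn {:tx-data tx-data}) {}))

(defn q-with-retry [conn query & args]
  (with-retry #(apply d/q query (d/db conn) args) {}))
```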
👍 Our next challenge we already know is "how do we make this faster?", but that's a good start. Thank you. And we'll have metrics to know when we do retry
look into Query Groups to isolate read load
can scale those independently of the primary compute group
You can consider pre-scaling a query group prior to app deploy.