datomic

Ask questions on the official Q&A site at https://ask.datomic.com!
danm 2021-03-08T15:45:36.118600Z

Does Datomic have any sort of known issue around creating a number (~300-400) of connections in a very short space of time? We've got an app that on startup spawns 30-40 threads, each of which creates connections to pull data from 10 separate databases (the databases are unique per-thread, so only 1 connection per db, but all on the same Datomic cluster). We frequently get a load of 'Datomic Client Timeout' exceptions with anomaly category interrupted on startup, and have to delete/recreate the container, even though the AWS metrics (Datomic Cloud production setup) don't show any particular issues with memory, CPU, etc. Once the app has started it seems to be fine and stable, with no timeouts (transact and q calls call d/connect each time they run), unless we perform an action that requires it to run through and recreate a lot of those connections rapidly again.

ghadi 2021-03-08T15:54:27.119100Z

I would put in exponential retries with some jitter

➕ 1
ghadi 2021-03-08T15:54:52.119500Z

the exception you receive should be marked with a :cognitect.anomalies/category that indicates whether it's retriable

danm 2021-03-08T15:55:10.120200Z

Yeah, it's interrupted (namespaced of course)

ghadi 2021-03-08T15:55:14.120400Z

you don't want to destroy a container just because 1 of 400 connections fails

kenny 2021-03-08T15:55:15.120600Z

It's likely you're getting throttled ops.

kenny 2021-03-08T15:55:57.121200Z

(Can check in CW dashboard)

danm 2021-03-08T15:56:21.121400Z

I was going to do some work to add that, but there has been a bit of pushback from some in the team because recommendations/docs from Cognitect elsewhere recommend a retry on unavailable, but don't mention other categories

danm 2021-03-08T15:56:44.121800Z

So having feedback that it would be a good idea is 👍

ghadi 2021-03-08T15:56:48.122Z

interrupted, busy and unavailable are the 3 retriable anomalies
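
A minimal sketch of that check, assuming the exception is the ExceptionInfo the client API throws, with the anomaly map available via ex-data:

(def retriable-anomalies
  "The three retriable anomaly categories noted above."
  #{:cognitect.anomalies/interrupted
    :cognitect.anomalies/busy
    :cognitect.anomalies/unavailable})

(defn retriable?
  "True when a Datomic client exception carries a retriable
  :cognitect.anomalies/category in its ex-data."
  [ex]
  (contains? retriable-anomalies
             (:cognitect.anomalies/category (ex-data ex))))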

danm 2021-03-08T15:58:47.122600Z

@kenny You mean some Datomic-internal throttling, or on the DynamoDB side? We did use to see a bit of DDB throttling, so we changed from provisioned capacity to on-demand scaling (basically, no scaling needed, just pay per request), and we don't see them any more. Once we have a better idea of longer-term access patterns we'll probably change that back

ghadi 2021-03-08T16:00:12.123800Z

@carr0t cloud or onprem?

danm 2021-03-08T16:04:25.124100Z

Cloud

danm 2021-03-08T16:06:16.124400Z

We are going between VPCs though, as the CloudFormation for Datomic Cloud sets up its own VPC rather than being able to 'join' an existing one, and we already had an existing one with EKS etc. in it. We're not currently finding any limits being hit re: inter-VPC comms though

ghadi 2021-03-08T16:06:35.125200Z

The cloudwatch dashboard should show Throttled Ops

ghadi 2021-03-08T16:06:45.125700Z

(Dashboard for Datomic)

ghadi 2021-03-08T16:07:04.126400Z

This is separate from DDB throttling, but could be caused by DDB throttling

kenny 2021-03-08T16:07:43.126600Z

Also curious if you're pointing all 300-400 to the primary compute group.

danm 2021-03-08T16:08:22.126800Z

Oh yes, with you. Nothing in the dashboard. Occasional OpsTimeout there too, but no OpsThrottled

danm 2021-03-08T16:13:22.127Z

@kenny At the moment, yes. We've not deployed any query groups (right terminology? I'm pretty new to Datomic), so the only instances running in the cluster are the 2x i3.large ones that are part of the standard template. Our access pattern involves a fair bit of writing. In some cases we're 1:1 read:write. There is a small lean towards q requests on startup as it loads initial state, but that is only maybe 10% above the transact requests, so I wasn't sure that query groups would help.

ghadi 2021-03-08T16:18:39.127200Z

plan for exponential retry/backoff on transact, connect, q
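
A sketch of what that wrapper could look like, building on the retriable? predicate above. max-retries, base-ms, and the client/conn/query/tx-data names in the usage comments are illustrative placeholders, not anything from the thread; catching ExceptionInfo specifically keeps non-anomaly errors from being retried blindly.

(require '[datomic.client.api :as d])

(defn with-retry
  "Calls (f). On a retriable anomaly, sleeps with exponential backoff
  plus random jitter and tries again, up to max-retries attempts."
  [f & {:keys [max-retries base-ms] :or {max-retries 5 base-ms 100}}]
  (loop [attempt 0]
    (let [result (try
                   {:value (f)}
                   (catch clojure.lang.ExceptionInfo ex
                     (if (and (retriable? ex) (< attempt max-retries))
                       {:retry ex}
                       (throw ex))))]
      (if (contains? result :value)
        (:value result)
        (do
          ;; exponential backoff: base-ms * 2^attempt, plus up to base-ms of jitter
          (Thread/sleep (+ (* base-ms (bit-shift-left 1 attempt))
                           (rand-int base-ms)))
          (recur (inc attempt)))))))

;; Usage (client, db-name, conn, query, tx-data are placeholders):
;; (with-retry #(d/connect client {:db-name db-name}))
;; (with-retry #(d/transact conn {:tx-data tx-data}))
;; (with-retry #(d/q query (d/db conn)))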

danm 2021-03-08T16:20:22.127500Z

๐Ÿ‘:skin-tone-2: Our next challenge we already know is "how do we make this faster?", but that's a good start. Thank you. And we'll have metrics to know when we do retry

ghadi 2021-03-08T16:44:06.127700Z

look into Query Groups to isolate read load

ghadi 2021-03-08T16:44:22.127900Z

can scale those independently of the primary compute group

kenny 2021-03-08T16:46:14.129200Z

You can consider pre-scaling a query group prior to app deploy.