aws

http://status.aws.amazon.com/ https://www.expeditedssl.com/aws-in-plain-english
kenny 2020-03-25T18:55:17.003500Z

@dchelimsky The deadlock issue reported in https://github.com/cognitect-labs/aws-api/issues/130 was fixed, but I think I'm still seeing something related. We have an SQS message processing service that runs 10 processing threads. All 10 of those threads are currently deadlocked on calls to aws-api. Is there a potential issue with the fix applied for #130 when running lots of requests in parallel? I have a thread dump as well. Four cognitect-aws-send-* threads are paused on the refresh! call at https://github.com/cognitect-labs/aws-api/blob/0dc6c388d45a7ebe09a3f147481c61aeaa6e1846/src/cognitect/aws/credentials.clj#L96.

kenny 2020-03-26T15:58:04.012800Z

I deployed a workaround with async-fetch-pool set to 10 yesterday evening. The service has been healthy since. For those looking for a workaround, I can say with high certainty that this will fix it. Wrapping in async/thread, as @dchelimsky suggests, sounds like another potential solution, though I believe it allows an unbounded number of threads to be created. In this case that number should always be small, I think.

2020-03-26T16:22:48.013Z

It's bound by core.async's threadpool size, which defaults to 8, but you can configure it via the clojure.core.async.pool-size system property: https://github.com/clojure/core.async/blob/1c5efa72ebf65cd23315311d2bc3d890b780696e/src/main/clojure/clojure/core/async/impl/exec/threadpool.clj
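For reference, a minimal sketch of setting that property; the alias name and the value 16 are just illustrations, and the property has to be set before core.async loads (i.e. as a JVM option, not at runtime):

;; deps.edn (illustrative); affects go blocks only, not async/thread
{:aliases
 {:bigger-go-pool {:jvm-opts ["-Dclojure.core.async.pool-size=16"]}}}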

kenny 2020-03-26T16:51:21.013400Z

I don't think async/thread is bound by that. That's for the go blocks.

kenny 2020-03-26T16:52:12.013600Z

thread calls thread-call which uses thread-macro-executor, defined as a CachedThreadPool https://github.com/clojure/core.async/blob/1c5efa72ebf65cd23315311d2bc3d890b780696e/src/main/clojure/clojure/core/async.clj#L471.

kenny 2020-03-26T16:53:00.013900Z

The maximumPoolSize for core.async's call to newCachedThreadPool is Integer.MAX_VALUE.
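In other words, a small sketch of the executor semantics being described (plain JDK interop, not aws-api code):

(import '(java.util.concurrent Executors ThreadPoolExecutor))

;; core.async's thread-macro-executor is built on newCachedThreadPool:
;; corePoolSize 0, maximumPoolSize Integer/MAX_VALUE, 60-second keep-alive,
;; so idle threads are reclaimed after a minute.
(def ^ThreadPoolExecutor cached-pool (Executors/newCachedThreadPool))

(.getMaximumPoolSize cached-pool) ;; => 2147483647 (Integer/MAX_VALUE)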

👍 1
2020-03-26T17:14:50.014400Z

So it is bounded 😉. But yeah, we'll need to constrain that a bit more.

kenny 2020-03-26T18:10:16.016100Z

I think CredentialsProvider fetch needs to return a channel. That would move the problem to the user.

kenny 2020-03-26T18:11:05.017200Z

Also, I think that is effectively the same as unbounded. Fairly certain bad things will happen as you approach the max int value.

2020-03-26T21:02:45.017400Z

Oh yeah, I was making a joke about it being bounded.

kenny 2020-03-26T21:03:22.017600Z

Ohh, got it 🙂

2020-03-26T21:03:28.017800Z

re: returning a channel from fetch, that would be a breaking change so we won't do that, but we could add fetch-async to the CredentialsProvider and return a channel from that (and use that in our implementations).
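A rough sketch of that idea; the protocol and default implementation below are hypothetical, pieced together from this discussion rather than taken from the library:

(require '[clojure.core.async :as a]
         '[cognitect.aws.credentials :as creds])

;; Hypothetical async variant of CredentialsProvider: fetch-async returns a channel.
(defprotocol CredentialsProviderAsync
  (fetch-async [provider]
    "Returns a channel that will receive a credentials map, an anomaly, or nil."))

;; A default implementation could delegate the existing blocking fetch to
;; core.async's thread pool so it never blocks inside a go block:
(defn default-fetch-async [provider]
  (a/thread
    (try
      (creds/fetch provider)
      (catch Throwable t
        {:cognitect.anomalies/category :cognitect.anomalies/fault
         :cognitect.anomalies/message  (.getMessage t)}))))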

kenny 2020-03-26T21:04:43.018Z

It would indeed. Deprecating the current method seems like a good idea since it can result in a deadlock. Deadlocks are one of the worst problems to hit.

2020-03-26T21:04:54.018200Z

I'll have some time to look at this and talk some things over with other contributors tomorrow so we should have at least a direction. I'll follow up on the issue at that point.

2020-03-27T17:16:24.018800Z

@kenny: @ghadi and I reviewed this together and think the simplest solution is https://github.com/cognitect-labs/aws-api/commit/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a

2020-03-27T17:17:39.019Z

Yes, that does use core.async's thread pool, which is effectively unbounded, but threads are reclaimed after 60 seconds, and we think the risk of this causing any real problems is very low. If you disagree, please explain.

2020-03-27T17:18:48.019200Z

re: making fetch async, it already is, effectively. Even though a user implements fetch, it is only ever called via fetch-async. Make sense?

kenny 2020-03-27T17:50:11.019400Z

I think this will probably work for pretty much all use cases, and I don't have any arguments against using an unbounded pool at this time. Although I would much prefer these threads to be named something more relevant, to make for clearer thread dumps. Right, it is async, but it is managed implicitly: the user cannot control the threads used to fetch credentials even if they wanted to. Returning a channel gives the API consumer control over how credential fetching manages its threads. Perhaps no one would ever want that level of control. Not sure.

kenny 2020-03-27T17:56:40.019600Z

Tangentially, could we remove or change this sentence "See log for more details." in the anomaly message https://github.com/cognitect-labs/aws-api/blob/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a/src/cognitect/aws/util.clj#L310?

{:cognitect.anomalies/category :cognitect.anomalies/fault
 :cognitect.anomalies/message (format "Unable to fetch %s. See log for more details." item)}
Although the fetch did fail, it may or may not have logged any information about the error. I suggest something like this: "Failed fetching %s. Fetch result was nil."
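i.e., something along these lines (a sketch of the suggested wording, not the library's current code):

{:cognitect.anomalies/category :cognitect.anomalies/fault
 :cognitect.anomalies/message (format "Failed fetching %s. Fetch result was nil." item)}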

👍 1
kenny 2020-03-27T18:04:32.019900Z

For that matter, why does fetch not support returning an anomaly and nil as failure conditions?

kenny 2020-03-27T18:05:18.020100Z

It would seem the anomaly response from the AssumeRole call would provide far better information than the current anomaly returned.

2020-03-27T18:15:10.020400Z

My guess is that was an effort to keep the implementation of https://github.com/cognitect-labs/aws-api/blob/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a/src/cognitect/aws/credentials.clj#L132 simple, but I agree that anomalies would be better. The problem right now is fetch has "Return the credentials found by this provider, or nil." in the docstring, and changing implementations to return an anomaly instead risks breaking custom implementations that call fetch directly.

2020-03-27T18:15:49.020800Z

That said, an anomaly would definitely give us more control and more information!

kenny 2020-03-27T18:19:02.021Z

Right. I'm thinking we may even want this in the near future to inform a user of our product that we are unable to authenticate using an IAM role they provided and giving them the AWS returned error message. Actually, I'm pretty sure returning an anomaly as is would already result in an anomaly getting returned haha

kenny 2020-03-27T18:19:43.021200Z

... result in the anomaly you returned getting returned to the caller. It just happens as an implementation detail though.

2020-03-27T18:22:08.021400Z

Within the aws-api implementation, yes. And we could always update fetch-async to decorate an anomaly with more info.

2020-03-27T18:23:04.021600Z

My concern is for the end user who wrote anything custom that dispatches on nil coming back directly from fetch.

2020-03-27T18:23:21.021800Z

The likelihood of that is probably very low, but I can't be sure.
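For example, hypothetical user code like this (not from aws-api) would silently misbehave if fetch started returning anomalies instead of nil:

(require '[cognitect.aws.credentials :as creds])

;; Hypothetical custom code that calls fetch directly and treats nil as
;; "no credentials here, try the next provider":
(defn first-credentials [providers]
  (some creds/fetch providers))

;; If fetch returned an anomaly map instead of nil on failure, `some` would
;; hand that anomaly back as if it were a valid credentials map.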

kenny 2020-03-27T18:24:01.022Z

Oh, I see. Oof.

kenny 2020-03-27T18:25:06.022200Z

CredentialsProvider2? 🙂

2020-03-27T18:25:16.022400Z

No joke!

2020-03-27T18:26:43.023100Z

Though that needs to be reviewed through a more holistic lens.

kenny 2020-03-27T18:27:54.024Z

As in other things that could be included in this addition?

2020-03-25T18:59:08.004700Z

@kenny there's always potential! I can't look at this until Friday, but please feel free to post more information here as you learn more and/or submit an issue.

kenny 2020-03-25T19:01:55.004900Z

Pasted the cognitect-aws-send-* threads here

kenny 2020-03-25T19:08:36.005300Z

All the aws-send threads have been waiting almost exactly the same amount of time. I'll bet a repro would be running 3 assume role calls in parallel.

kenny 2020-03-25T19:10:20.005500Z

I could forcibly serialize calls to fetch-creds for now.

kenny 2020-03-25T19:19:01.005900Z

The repro would probably need to be 3 assume role calls with different CredentialsProvider objects, since the code at https://github.com/cognitect-labs/aws-api/blob/aff9028757189f9875e8b3ebed18a8f07b624327/src/cognitect/aws/util.clj#L320 serializes calls for the same creds provider.

kenny 2020-03-25T19:20:41.006200Z

And yes, that could happen in our code. We're making calls to our customers' AWS accounts. Parallel calls to multiple AWS accounts using different creds providers could easily happen.

kenny 2020-03-25T19:35:45.006600Z

Got a repro 🙂

kenny 2020-03-25T19:51:00.006800Z

Repro here: https://github.com/kennyjwilli/aws-api-deadlock

💯 1
kenny 2020-03-25T20:35:00.007100Z

I think the problem is the same and the fix only made it less likely. This is what is happening in each thread:
1. invoke DescribeInstances
2. fetch-creds (AssumeRole)
3. invoke STS AssumeRole
4. fetch-creds (implicit)
The fetch-creds calls go through the async-fetch-pool, which is size 4. If you launch N threads to run the above flow, where N >= the async-fetch-pool size, you will get a deadlock. This occurs because the inner fetch-creds call is sitting in the async-fetch-pool's queue.
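A condensed sketch of that scenario (not taken from the linked repro; the assume-role provider below is illustrative, not part of aws-api):

(require '[cognitect.aws.client.api :as aws]
         '[cognitect.aws.credentials :as creds])

(defn assume-role-provider
  "Illustrative provider whose fetch itself invokes STS AssumeRole via aws-api,
   which triggers a second, nested credentials fetch."
  [role-arn]
  (creds/cached-credentials-with-auto-refresh
    (reify creds/CredentialsProvider
      (fetch [_]
        (let [sts  (aws/client {:api :sts})
              resp (aws/invoke sts {:op      :AssumeRole
                                    :request {:RoleArn         role-arn
                                              :RoleSessionName "repro"}})
              c    (:Credentials resp)]
          {:aws/access-key-id     (:AccessKeyId c)
           :aws/secret-access-key (:SecretAccessKey c)
           :aws/session-token     (:SessionToken c)})))))

;; With the pre-fix async-fetch-pool of size 4, running at least 4 of these in
;; parallel with distinct providers fills the pool with the outer fetch-creds
;; calls while the inner (STS) fetch-creds calls wait in the same pool's queue.
(comment
  ;; role-arns: a sequence of distinct role ARNs
  (->> role-arns
       (map #(aws/client {:api :ec2 :credentials-provider (assume-role-provider %)}))
       (pmap #(aws/invoke % {:op :DescribeInstances}))
       doall))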

kenny 2020-03-25T23:35:27.007500Z

I have created an issue here: https://github.com/cognitect-labs/aws-api/issues/137

kenny 2020-03-25T23:36:20.007900Z

I think the best near-term solution for us is to fork aws-api and set the async-fetch-pool size to 10. I'm sure there is a better, more general solution, but we need something that works for the time being.