@ambrosebs @bjagg @atdixon I did some more digging but I still haven't figured out the root cause. A summary of what I've found so far, based on log files from 2020-02-20 and 2020-02-21:
• The 400s are being returned by the Fastly CDN
• It affects only a few GET requests for maven-metadata.xml files, both existing ones (which should return a 200) and non-existent ones (which should return a 404). No other paths are affected.
• Only six client IPs were affected out of 8,122 that made GET /.../maven-metadata.xml requests
• Only 266 requests were affected out of 107,255 for GET /.../maven-metadata.xml
• Those 266 requests were for 54 different maven-metadata.xml paths
• There is no common client, OS version, or Java version based on the user-agent strings
• There is no common cache host
• I have not been able to recreate it
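A minimal sketch of how counts like the ones above might be tallied from aggregated logs. The `LogEntry` record and its fields are hypothetical; the real Fastly log format isn't shown in this thread, so parsing is left out.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class LogEntry(NamedTuple):
    # Hypothetical, simplified record; not the actual Fastly log schema.
    client_ip: str
    method: str
    path: str
    status: int


def summarize_400s(entries: Iterable[LogEntry]) -> Dict[str, int]:
    """Tally totals and 400s for GET requests to maven-metadata.xml paths."""
    total_requests = 0
    all_ips = set()
    affected_ips = set()
    affected_paths = Counter()
    for entry in entries:
        # Only GETs for maven-metadata.xml are of interest here.
        if entry.method != "GET" or not entry.path.endswith("/maven-metadata.xml"):
            continue
        total_requests += 1
        all_ips.add(entry.client_ip)
        if entry.status == 400:
            affected_ips.add(entry.client_ip)
            affected_paths[entry.path] += 1
    return {
        "total_requests": total_requests,
        "total_ips": len(all_ips),
        "affected_requests": sum(affected_paths.values()),
        "affected_ips": len(affected_ips),
        "affected_paths": len(affected_paths),
    }
```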
Actions taken to remediate:
• There was custom configuration in Fastly to override the cache TTL for maven-metadata.xml files. This is the only configuration that would treat those files differently. It was obviated by https://github.com/clojars/clojars-web/commit/6c081e809a4c2e0c4ddf5359c2fc858d1cd5dc2b and removed ~2020-02-21 01:00 UTC
• This didn't resolve the issue immediately, but it's possible the 400s were cached
• ~2020-02-22 16:30 UTC all 54 paths that had 400s in the logs were purged from Fastly's cache
• It is unknown if this will help with the issue
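A rough sketch of what purging a list of paths could look like, assuming the service accepts Fastly's URL purge (an HTTP request with the PURGE method against the cached URL). The base URL and paths below are placeholders, not the real ones, and whether purges need authentication depends on the service configuration. The function only builds the requests; sending them is left to the caller.

```python
from typing import List
from urllib.parse import urljoin
from urllib.request import Request


def build_purge_requests(base_url: str, paths: List[str]) -> List[Request]:
    """Build one PURGE request per cached path (nothing is sent here)."""
    return [Request(urljoin(base_url, path), method="PURGE") for path in paths]


# Placeholder host and paths, for illustration only.
requests_to_send = build_purge_requests(
    "https://repo.example.invalid",
    ["/group/artifact/maven-metadata.xml", "/group/other/maven-metadata.xml"],
)
```

Each request could then be sent with `urllib.request.urlopen(req)` once the host and any required credentials are in place.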
I've captured this in https://github.com/clojars/clojars-web/issues/744 as well.
Since I just purged all of those paths from Fastly, can y'all test again when you get a chance? I really appreciate your help and your patience.
I'll also analyze the logs again this evening to see how things look (the logs are aggregated every evening from 100s of individual files, so doing analysis before that aggregation occurs is painful)
I also added some more details to that GH issue about the workarounds that y'all have been successful with so far.
@tcrawley wow thanks for such a thorough investigation! tried it again, still same error unfortunately. Reported it to https://github.com/clojars/clojars-web/issues/744
Thanks for testing @ambrosebs! I just added more logging to Fastly to a service where I can see the logs more easily - would you mind trying again when you have a chance? I'll then see if there are any clues in the extended logs
@tcrawley Deploying snapshot worked! Had a checksum error earlier, but that did not occur this last time. Thanks!!
@tcrawley gets a little further! https://github.com/clojars/clojars-web/issues/744#issuecomment-589990610
fixed link ^^
Thanks @ambrosebs @bjagg! With the additional logging, I also see 400s for maven-metadata.xml.sha1 and .md5 files from your requests. Unfortunately, the logs don't provide any additional insights. I have however enabled request logging on the S3 bucket itself. That log is supposed to include an error reason if the 400 is originating from S3. If y'all could try yet again I would be most appreciative, and I'll check on those logs a bit later.
@tcrawley failed with a 400
@tcrawley again: https://github.com/clojars/clojars-web/issues/744#issuecomment-589992556
@ambrosebs @bjagg ok, with S3 logging on, I can see the 400 requests being rejected due to InvalidArgument. The S3 docs for InvalidArgument aren't very helpful, but I did find a reference to that being returned if the Authorization header is set, but not to valid AWS credentials. So I added logging for that header, and I see it set on the 400 requests logged since I enabled it. I then added a rule in Fastly to strip that header from the request before passing it to S3. Would y'all be willing to try again?
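To illustrate the idea behind that rule: the actual fix lives in Fastly's configuration and isn't shown in this thread, so this is only a Python sketch of the same logic, with a hypothetical `strip_authorization` helper that drops the header before a request is forwarded to the origin.

```python
from typing import Dict, Mapping


def strip_authorization(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without Authorization (names are case-insensitive)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "authorization"
    }


# Example: a client sends credentials that are not AWS credentials.
incoming = {"Authorization": "Basic not-aws-credentials", "User-Agent": "example-client"}
forwarded = strip_authorization(incoming)
```

With the header removed, S3 no longer has an Authorization value to reject, so the request is handled as an ordinary unauthenticated GET.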
@tcrawley I think it worked!
I'll paste the output in the issue
thanks @tcrawley!!
@tcrawley that works for me!!
Awesome! I'm glad we finally figured it out! I'll update the issue with the details and clean up my mess of logging. Thanks again for your help and patience @ambrosebs @bjagg @atdixon
you rock @tcrawley thanks so much!
Yeah thank u!