Am I doing something wrong here or is there a bug?
bb repl
Babashka v0.4.0 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (require '[babashka.process :refer [process sh]])
nil
user=> (sh ["youtube-dl" "--dump-json" "--ignore-errors" "--skip-download" "<https://www.youtube.com/user/Misophistful>"])
Segmentation fault (core dumped)
Even more minimal example:
bb repl
Babashka v0.4.0 REPL.
Use :repl/quit or :repl/exit to quit the REPL.
Clojure rocks, Bash reaches.
user=> (require '[babashka.process :refer [process sh]])
nil
user=> (sh ["youtube-dl"])
Segmentation fault (core dumped)
@ben.sless are you using the static Linux binary perhaps?
If that's the case, try the normal one
yes
will do
works
I've linked you to the thread where we are discussing the issues
@ben.sless what OS and kernel versions are you using? I will try to dig a bit deeper
OS: Pop 20.10 groovy
Kernel: x86_64 Linux 5.11.0-7614-generic
Thanks!
Maybe we should enable the musl then? https://github.com/oracle/graal/issues/571#issuecomment-618508346
ah
> By default, --static will statically link your native-image against your host libc (glibc)
https://github.com/oracle/graal/issues/2691#issuecomment-670010352 this could explain the segfaults
i would advocate for using musl too
ok PR welcome for whoever wants to figure out how to do this :)
will try this
Maybe try with pod-babashka-aws. It just got a static build an hour ago
but the tests don't pass there
so it would be a good reason to start doing that
Yeah, this makes sense if we are linking with glibc
I think even when linking statically, glibc still tries to access some system files on runtime
Something related to nss
And some private symbols
And yeah, a glibc version mismatch can cause segmentation faults
musl should work much better for static binaries
@ben.sless Any chance you could test this (musl-linked) static binary for linux with the same program that didn't work for you this morning? https://github.com/babashka/babashka/files/6448129/babashka-0.4.1-SNAPSHOT-linux-amd64-static-musl.tar.gz
Hmm. The static image was compiled to work with Alpine, that was the only reason it was there. If the musl build has issues that the other static image doesn't have, I'm inclined to roll back changes until we sort it out. Cc @rahul080327
I won't have a build that is flaky, that's a constant source of headaches
What kind of issue did you see with data generators, also a stackoverflow?
I think I found the issue. The static binary stackoverflows much sooner than the non-static one. E.g. try:
./bb -e "(defn foo [x] (prn x) (foo (inc x))) (foo 0)"
The non-static one goes to about 20k, while the static one only goes to 186 or so.
> musl allocates memory of very small size for thread stacks. musl documentation says it's only 80KB (now increased to 128KB) by default, which is rather small compared to the 2-10MB of glibc. And that's been said to be the major reason for stack-consuming executables to crash (segmentation fault) in a musl libc environment.
Very interesting find! My opinion here would be that we could make it explicit that we are using musl and point out the caveats. If this is the reason for the flakiness, I think we can go ahead with musl with the warning stickers. The ability to run in most places should outweigh the stack size issues. musl has quite widespread production usage and I think this would be known to people
We could try patching the binary, but again I get shivers when we change the defaults of a well-used lib, especially when it comes to memory regions
things like Rust, Zig and others rely solely on musl for static links, I would say that's a good reason to stick to it
can't we just up the stack size limit?
@rahul080327 Patching the binary with muslstack didn't work btw. Can't we set a higher stack limit during compilation?
This seems to be a similar bug: https://bugs.webkit.org/show_bug.cgi?id=187485
Is there a way to pass stack-size=2097152 to the compilation?
This option needs to be passed to the native-image compiler?
I am trying now with: "-H:CCompilerOption=-Wl,-z,stack-size=2097152"
So far it seems to work... more testing now
@thiagokokada Perhaps you can also try this option locally. I'm running the build a few times over to see if works repeatedly. https://app.circleci.com/pipelines/github/babashka/babashka?branch=stack-size
Doesn't seem to work. The recursion depth is still more limited with this binary, so maybe it was just luck that it worked a couple of times.
Discussing with a GraalVM dev now
> What kind of issue did you see with data generators, also a stackoverflow?
Yeah, but I only saw this issue once. The issue with clj-http-lite seems to occur more frequently (and it also does not seem related to the stack, since it occurs randomly; or maybe it is, I will try it later with the recent stack changes)
> Very interesting find! My opinion here would be that we could make it explicit that we are using musl and point out the caveats. If this is the reason for the flakiness, I think we can go ahead with musl with the warning stickers. The ability to run in most places should outweigh the stack size issues. musl has quite widespread production usage and I think this would be known to people
I concur with @rahul080327 here.
I've gone deep into the weeds with a GraalVM dev already. It's related to what musl does with the stack size. We should resolve this before publishing the new static bb, I find this unacceptable.
You can find some details here: https://github.com/oracle/graal/issues/3398
I am building Babashka with "-H:CCompilerOption=-Wl,-z,stack-size=2097152"
just to run the tests and see what we get
@thiagokokada I already tried this and it doesn't have any effect
You can see this when using the java repro from the issue
https://github.com/oracle/graal/issues/3398#issuecomment-836754269
So there is a workaround, it seems.
@thiagokokada @rahul080327 This seems to be a workaround that works: https://github.com/babashka/babashka/commit/3bc0f3ffad557bd5e89a1fb1337ce5fc1d8716b5
The workaround is to start bb in a thread with a better default for the stack size. The rest of the threads should get their stack size from the stack-size argument. It does incur some extra startup time so I will make that workaround conditional based on some env var
We don't even have to set the stack size in Java, as long as we start everything in a thread and pass a better default for the stack size, things should work again
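For the record, roughly what that workaround looks like, as a minimal Clojure sketch rather than the actual commit linked above (real-main and the 16MB value are made up for illustration):

;; Minimal sketch of the workaround, not the actual bb patch:
;; run the real entry point on a thread created with an explicit stack size,
;; so musl's small default thread stack (~128KB) no longer limits recursion depth.
(defn -main [& args]
  (let [stack-size (* 16 1024 1024)        ;; arbitrary example value: 16MB
        t (Thread. nil                     ;; no thread group
                   #(apply real-main args) ;; real-main is hypothetical
                   "main"
                   stack-size)]
    (.start t)
    (.join t)))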
> It does incur some extra startup time so I will make that workaround conditional based on some env var
So we will have something like BABASHKA_THREAD_STACK_SIZE that can be set by the user if they want it? Is this the idea?
LGTM by the way
Let's continue in #babashka-sci-dev
Thread is getting long
Looks like it works now
This is with a more recent musl: https://github.com/babashka/babashka/files/6448198/babashka-0.4.1-SNAPSHOT-linux-amd64-static-musl.tar.gz
@rahul080327 Seems like musl is really the way to go
:bananadance:
awesome work @thiagokokada great pairing session!
Thanks to you too @rahul080327
Would it make sense that if you run the babashka static binary, then you also download static pods?
or would it be best to control this via an explicit env var:
BABASHKA_PODS_STATIC=true
if the pods are guaranteed to be on the same machine, which i think they are, i would say it makes sense to make it the default
most users may not think of that env var i think?
in this case we must capture the fact that babashka has been compiled statically (not sure if graalvm already provides a system property for this that you can read at compile time)
I'm not sure of it
this is easily done using (def static? (System/getenv "BABASHKA_STATIC")) in the relevant pods library namespace, and using this for downloading the right pod
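As a rough sketch of that idea (the BABASHKA_STATIC convention is from the message above, but the artifact names are just illustrative, and this assumes the namespace's top-level forms get evaluated at image build time):

(def static?
  ;; read at image build time, so the value is baked into the binary
  (boolean (System/getenv "BABASHKA_STATIC")))

(defn pod-artifact-name [pod-name version]
  ;; hypothetical naming scheme, only to show how static? would be used
  (if static?
    (str pod-name "-" version "-linux-amd64-static.zip")
    (str pod-name "-" version "-linux-amd64.zip")))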
Meanwhile, @thiagokokada are you going to do a similar PR to babashka for the static one?
Another question: we can "prefer" a static pod if bb is itself static, but what if there is no static linux pod available, but there is a dynamic one? Should we select that one, or just crash with: no such pod here?
Sorry, didn't understand the question
@thiagokokada My impression was that you were going to update bb static itself with musl too?
or did I misunderstand?
I thought we would use the @rahul080327 approach instead of mine, so I was waiting for his PR
ok, that PR is merged now. but that was for the aws pod.
But I can do it too if @rahul080327 is not interested in porting the approach from pod-babashka-aws
you can go ahead with it @thiagokokada if you want
About the question of whether we should try to download dynamic pods in case the static one is unavailable: no, I don't think so
Too much magic may make things more confusing
maybe automatically selecting a static or dynamic one is also too much magic - that is why I proposed an env var to control this
Yeah, maybe
But I don't like to control those kinds of things with env vars
Could we have opts passed to load-pod instead :thinking_face: ?
That can work, but then you'd have to change the code if you deploy your code from a mac to the cloud for example
Like:
(pods/load-pod 'some-pod "0.0.0" {:static true})
Or maybe
(pods/load-pod 'some-pod "0.0.0" {:linux-opts {:static true}})
This would make it clear that it only applies to Linux (so macOS is unaffected)
It also has the advantage of making something like "static with dynamic fallback" possible
They just need to catch the exception when calling with {:static true} fails
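Something like this, as a sketch against the proposed (not yet existing) {:static true} option:

(require '[babashka.pods :as pods])

(defn load-pod-prefer-static [pod version]
  ;; try the static build first, fall back to the regular (dynamic) pod
  (try
    (pods/load-pod pod version {:static true}) ;; proposed option, does not exist yet
    (catch Exception _
      (pods/load-pod pod version))))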
It would be much simpler if we could just always use the static pod on linux
Maybe let's test this out a bit for a few months and if the static one works out ok, we can just switch over all linux pods to it
Made a new release for the pod-babashka-aws pod
https://github.com/babashka/babashka/pull/828 :reviewplease:
why did you make this change? ARG BABASHKA_XMX="-J-Xmx4500m"
> It would be much simpler if we could just always use the static pod on linux
Maybe we could go even further and eventually simply remove the dynamically compiled version of Babashka. If someone wants to use Babashka with dynamic linking, they can always depend on their distro's packages
How would dynamically linked bb magically appear in "their distro's packages" if we don't provide it
For example, NixOS's Babashka package is dynamically linked. I was imagining the distros themselves packaging Babashka
I am all for it, the only reason I would not do it is if it wouldn't work for someone
> I am all for it, the only reason I would not do it is if it wouldn't work for someone
Yeah, I think we can work on this static version now and ask more people to test, and if everything looks good we could eventually remove the non-static version
I can secretly use the static aws pod in the manifest and see if someone starts yelling?
Seems like a good starting point since it's still a "young" pod
> I can secretly use the static aws pod in the manifest and see if someone starts yelling?
Well, LGTM, I don't know if someone has any objection against this
let's do it, when it breaks certain environments, we will switch back and find another solution
updated the 0.0.6 manifest
https://github.com/babashka/babashka/pull/828 The CI passed
What is the recommended way to have a bb task that consists of a Unix pipeline? I tried {:tasks {clean (-> (shell "find . -type f -name '*~' -print") (shell "xargs rm -f"))}}
but bb clean gives me a java.util.concurrent.ExecutionException.
Merged
@dorab shell doesn't support this with ->, but babashka.process/process does.
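For example, something along these lines (just a sketch; check the babashka.process README for the exact piping semantics):

(require '[babashka.process :refer [process check]])

;; threading the first process into the second pipes stdout -> stdin;
;; check throws if the exit code is non-zero
(-> (process ["find" "." "-type" "f" "-name" "*~"])
    (process ["xargs" "rm" "-f"])
    check)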
I think shell could support this though (made an issue: https://github.com/babashka/babashka/issues/829)
What is also possible:
(let [out (:out (shell {:out :string} "find . -type f -name '*~' -print"))]
(shell {:in out} "xargs rm -f"))
although I think this can all be done using babashka.fs glob and delete as well
does *~ mean you want to find all files ending with a tilde, or does it expand into something else?
{:requires ([babashka.fs :as fs])
:task (run! fs/delete
(fs/glob "." "*~"))}
Thanks @borkdude Just did another PR to make sure that we are using the correct ZLIB sources (avoiding possible security issues): https://github.com/babashka/babashka/pull/830
Very good. We should start adopting this for pods as well.
(I mean, after downloading a pod we should verify it against a sha in the manifest)
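Roughly, that verification could look like this (the helper names are made up, just to sketch the idea of checking a downloaded archive against an expected sha256 from the manifest):

(import 'java.security.MessageDigest)

(defn sha256-hex [^bytes bs]
  ;; hex-encode the SHA-256 digest of the downloaded archive
  (let [digest (.digest (MessageDigest/getInstance "SHA-256") bs)]
    (apply str (map #(format "%02x" %) digest))))

(defn verify-pod! [pod-file expected-sha]
  (let [actual (sha256-hex (java.nio.file.Files/readAllBytes
                            (.toPath (java.io.File. pod-file))))]
    (when-not (= actual expected-sha)
      (throw (ex-info "Pod checksum mismatch"
                      {:expected expected-sha :actual actual})))))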
I tested the musl bb on my VPS and had no problems
Hmm, I just enabled the musl static build on master
and it fails
I'm logging off for tonight, but feel free to take a look and I'll be back tomorrow
which graalvm version did you use to compile locally?
> Hmm, I just enabled the musl static build on master
Strange, it seems to work fine on my machine
> which graalvm version did you use to compile locally?
I used the version included in the Dockerfile
Seems to be version 21.0.0
hmm ok. your version works fine on my VPS with the clj-http-lite library
but it fails on CI with some mysterious stackoverflow
In that previous version I built musl from scratch, while this one uses the musl-tools prebuilt from Debian
But I ran the tests above on my machine with the changes that were merged on master
Did you run lein test or script/test?
on CI there are more tests, notably script/run_lib_tests
and to test the native, you have to set BABASHKA_TEST_ENV=native
For the result above I ran BABASHKA_TEST_ENV="native" ./script/run_lib_tests > result
that should work yeah
I am building it locally again and will run the lib tests again
Just to make sure that I am using the correct versions of everything
Huh... Yeah, now that I built it again it seems to segfault
I will try my approach again
(Building musl from scratch)
Any chance you were using 21.1 the first time around? I just merged into my 21.1 branch and there it seems to work fine...
(GraalVM version 21.1.0 JDK11 to be specific)
Nope because I don't even have GraalVM installed in my system
So if it is building, it should be from Dockerfile
I see. Note that you can bump up the BABASHKA_XMX value if you want to speed up your local compile a bit btw
Can you perhaps also try the graal21.1 branch (pull recent changes from GitHub)?
You mean trying musl from source with graal21.1?
Right now I am building Babashka with musl from source, and will run the tests
No, there is a graal21.1 branch which is the same as master but with 21.1 instead of 21.0
ah ok
if that works, maybe that's the better approach then
Nope, I also got an exception. So maybe the difference is the musl version :thinking_face: ?
Yep, I think the issue is the musl version
Let me try one more thing
Well, downgrading to whatever musl version comes with Debian fixed the clj-http.lite.client-test tests but broke the clojure.data.generators-test :man-shrugging:
I think the 1.2.0 that I used first (compiling from source) passed all tests
But maybe it is just luck
Let me try the GraalVM 21.1
Yeah, with GraalVM 21.1 it seems to work
So, a summary
- musl 1.22+GraalVM 21.0: broke clj-http.lite.client-test
- musl 1.1.21 (I think, this is the musl from Debian Buster but I am not sure that this is the version we use)+GraalVM 21.0: broke clojure.data.generators-test
- musl 1.22+GraalVM 21.1: works fine with all the libs that we have tests for
Actually, forget about it
This test (`clj-http.lite.client-test`) seems flaky
So yeah, very bizarre :man-shrugging:
Maybe report this issue to GraalVM itself :thinking_face: ?
Yeah, got the same flaky behavior even with Babashka compiled from master
So if I run the tests once it works, again it segfaults, again it works, again it segfaults...
@borkdude All I can say for now is that the tests are really, really flaky with those static binaries. We could add a retry and run them a few times, but it doesn't seem to be a good sign considering that this behavior doesn't happen with the old glibc binaries. CC @rahul080327
I do think we should report this to GraalVM itself since it seems to be an issue with them, but I don't know how we could report this (we probably need some way to reproduce this with a smaller example that can trigger this behavior)
My suggestion is maybe releasing this static version with musl separately from the other static binary, at least until we can figure out why these tests are flaky. But I will leave the decision to @borkdude
Just to leave this here: I also tried to build with Debian Buster (we actually seem to use Debian Stretch, since the GCC used by the image is 6.3 instead of 8.3), because Native Image in the end builds everything with GCC and 6.3 is quite an old GCC version.
I ended up using openjdk-17-lein-2.9.6-buster, which also uses updated versions of lein and OpenJDK, since I didn't find a way to only bump the Debian version.
But no dice, same behavior.