uncomplicate

whilo 2017-08-24T14:36:42.000325Z

@blueberry I guess you don't want to announce the MC method you use for bayadera publicly? It is mentioned neither in the slides nor in the source code. I am implementing SGHMC at the moment, which does not need branching but rather works like momentum SGD. I implement it with PyTorch's autograd. I think something like it would fly on neanderthal + autograd.
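
A minimal sketch of what such an SGHMC-style step could look like with PyTorch autograd (the function and parameter names here are illustrative assumptions, not the actual implementation):

```python
# Sketch of an SGHMC-style update (in the spirit of Chen et al., 2014),
# assuming `neg_log_post` returns a minibatch estimate of the negative log
# posterior and `momenta` holds one buffer per parameter. Illustrative only.
import torch

def sghmc_step(params, momenta, neg_log_post, lr=1e-4, alpha=0.01):
    loss = neg_log_post()                     # stochastic estimate of U(theta)
    grads = torch.autograd.grad(loss, params)
    noise_scale = (2.0 * alpha * lr) ** 0.5
    with torch.no_grad():
        for p, v, g in zip(params, momenta, grads):
            # momentum-SGD-like update plus injected Gaussian noise:
            # v <- (1 - alpha) * v - lr * grad + N(0, 2 * alpha * lr)
            v.mul_(1.0 - alpha).add_(g, alpha=-lr)
            v.add_(noise_scale * torch.randn_like(p))
            p.add_(v)                         # theta <- theta + v
    return loss
```

As with momentum SGD, the update is pure elementwise arithmetic plus injected noise, with no branching.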

whilo 2017-08-24T14:37:23.000240Z

It works for very high-dimensional problems like NNs over natural images: https://arxiv.org/abs/1705.09558

2017-08-24T14:41:06.000004Z

I would like to see some numbers before I can form an opinion.

whilo 2017-08-24T14:43:13.000322Z

The paper has some. I will let you know once I can reproduce the results. What particular numbers are you interested in?

whilo 2017-08-24T14:43:55.000530Z

I am mentioning it because I think autograd functionality would be very helpful as an intermediate abstraction on top of neanderthal for building Bayesian statistics toolboxes.

whilo 2017-08-24T14:44:30.000334Z

I don't have as much time as I would like, but I am exploring some clj-autodiff stuff at the moment in https://github.com/log0ymxm/clj-auto-diff

2017-08-24T14:51:26.000220Z

I've skimmed through the paper, but this is the problem: most of those papers require familiarity with concrete problems (vision/classification in this case) to judge the results. There is no anchor I can use to judge this paper. I cannot see the simplest data I'm interested in when I hear about ANY MCMC implementation: how many steps it needs to converge for problems that are easy to understand and compare, and how much time one step takes. For easy problems. If it works poorly for easy problems, I cannot see how it can work great for harder problems. If it works OK for easy problems, then I can look at harder ones and see how it does there. There is so much hand-waving in general that it usually does not surprise me that 99% of those turn out to be vaporware.

2017-08-24T14:53:54.000666Z

I was quite surprised when I saw the Anglican MCMC hello world at EuroClojure. The most basic thing you can use MCMC for took 239 SECONDS for the world's simplest beta-binomial.
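
For context, the beta-binomial has a closed-form conjugate posterior, which is why a sampler taking minutes on it is hard to excuse. A tiny sketch with made-up numbers (not the Anglican demo itself):

```python
# Beta-binomial conjugacy: with a Beta(a, b) prior on the success probability
# and k successes in n trials, the posterior is Beta(a + k, b + n - k).
# The numbers below are illustrative.
from scipy import stats

a, b = 1.0, 1.0          # uniform Beta(1, 1) prior
n, k = 20, 13            # trials and observed successes
posterior = stats.beta(a + k, b + n - k)
print(posterior.mean())  # analytic posterior mean: (a + k) / (a + b + n)
```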

whilo 2017-08-24T14:53:56.000787Z

Generating such artificial images is very, very difficult, even if they are far from perfect. So you definitely need a good method to get there. The original SGHMC paper has more explanations, but there is a whole strand of literature related to this.

whilo 2017-08-24T14:54:11.000058Z

I mention it to you, because I can imagine that you don't have the time to dive into it.

whilo 2017-08-24T14:55:04.000474Z

I understand your concerns. I, on the other hand, am interested in using Clojure again at some point for my optimization stuff. Especially since Anglican represents fairly sophisticated tooling compared to Python's libraries on top of tensorflow or theano.

2017-08-24T14:55:20.000310Z

That might be very useful for that particular problem (or not - I don't know, since I don't do computer vision), but it does not tell me whether SGHMC is worth exploring for the general MCMC that I'm interested in.

whilo 2017-08-24T14:55:52.000069Z

SGHMC samples the whole weight matrices of a 5-layer CNN in their case.

whilo 2017-08-24T14:56:27.000050Z

That is a very high-dimensional problem. In MCMC there are in general only convergence proofs for toy problems, so I cannot tell you how well it explores the distribution.

2017-08-24T14:56:27.000771Z

In what sense? Are those matrices unknown?

whilo 2017-08-24T14:57:00.000379Z

Unlike stochastic gradient descent, the routine does not just find an optimum; it samples from the posterior of the weight matrices given the data.

2017-08-24T14:57:37.000524Z

I'm not aware of ANY method that guarantees MCMC convergence, toy or non-toy! That is the trickiest part of MCMC.

whilo 2017-08-24T14:57:47.000380Z

Yes

2017-08-24T14:58:12.000121Z

So, each sample is the whole CNN?

whilo 2017-08-24T14:58:17.000185Z

Yes.

2017-08-24T14:59:18.000171Z

How does their method compare to the state of the art? You know, the models Google/Facebook/DeepMind or whoever else is the leader publishes?

2017-08-24T15:00:02.000571Z

How much memory does one sample typically take?

2017-08-24T15:00:37.000075Z

And how many samples do they consider enough to fairly represent the posterior?

whilo 2017-08-24T15:02:06.000038Z

They subsample from the chain, but I don't know much about these specifics yet. One sample probably takes a few megabytes.

whilo 2017-08-24T15:02:54.000802Z

In this paper they managed to compete with deep learning GANs and exploit Bayesian features like multiple samples from the posterior.
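
A hypothetical sketch of what using multiple posterior samples looks like at prediction time (`make_model` and `weight_samples` are assumed names, not part of any specific library): the predictive distribution is just an average over sampled networks.

```python
# Bayesian model averaging over posterior weight samples; each sample is a
# full set of network weights (here, a whole CNN). Names are illustrative.
import torch

def posterior_predictive(make_model, weight_samples, x):
    preds = [torch.softmax(make_model(w)(x), dim=-1) for w in weight_samples]
    return torch.stack(preds).mean(dim=0)   # average the per-sample predictions
```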

2017-08-24T15:03:02.000689Z

Hmmm. I'm afraid I do not have enough knowledge to fairly comment on what they do.

whilo 2017-08-24T15:03:44.000355Z

Sure. I thought you might be interested in scalable high-dimensional sampling methods. So far I just wanted to talk a bit about it with you.

whilo 2017-08-24T15:03:56.000325Z

I can tell you more once I can get it to run on more than toy problems.

2017-08-24T15:04:14.000577Z

Do they compete on results only, or also on speed? Because getting the same results in 1000x the time is a bit underwhelming (if that is the case, of course).

whilo 2017-08-24T15:04:47.000227Z

No, that is the cool thing. It is a bit slower than SGD, but not by orders of magnitude.

whilo 2017-08-24T15:05:00.000682Z

You always pay for a Bayesian approach.

whilo 2017-08-24T15:05:23.000282Z

Estimating a full distribution vs. a MAP or MLE estimate is a lot more expensive in general.

2017-08-24T15:05:27.000294Z

Not necessarily 🙂

2017-08-24T15:05:40.000250Z

Of course.

2017-08-24T15:06:04.000658Z

That's why you do it only when you know you'll get qualitatively better answers

2017-08-24T15:06:31.000742Z

If it's only "we got slightly better X" then why bother?

2017-08-24T15:07:14.000196Z

I understand that it can be quite challenging in machine vision because the DL people really pushed the state of the art in the last decade.

whilo 2017-08-24T15:07:22.000136Z

I agree. I think a statistical approach can help you make informed optimization decisions though, even if you go towards an MLE estimate in the end.

whilo 2017-08-24T15:07:25.000370Z

Yes

whilo 2017-08-24T15:08:23.000738Z

But if you can incorporate the strengths of deep nets, then you can also improve statistical methods. This is what a lot of people are trying to do at the moment.

whilo 2017-08-24T15:08:44.000152Z

They use neural networks to approximate internal functions to make their samplers faster, or to get a better variational approximation.

2017-08-24T15:09:06.000425Z

That's what they hope for, at least 😉

2017-08-24T15:09:35.000060Z

But it may also be a case of "what a nice hammer I've got".

whilo 2017-08-24T15:09:41.000633Z

I agree with your emphasis on performance. I'm exploring it at the moment because I am doing research. In a practical project I would probably stick to a CNN for these problems.

whilo 2017-08-24T15:10:15.000003Z

Yes, I have this internal struggle with the Bayesian approach. But so far it helps to stretch in this direction for me.

2017-08-24T15:10:23.000526Z

🙂

whilo 2017-08-24T15:10:38.000218Z

The math is sound and you can borrow a lot of intuition from past experience.

whilo 2017-08-24T15:11:09.000036Z

Have you thought about autograd at some point?

whilo 2017-08-24T15:11:43.000239Z

Because that is really strong in Python, especially with PyTorch. I am really happy with it, despite Python being slow and a big mess under the hood.

whilo 2017-08-24T15:14:47.000255Z

cortex builds directly on layers and I don't understand why, except for business reasons. Autograd has a very strong history in Lisp, and it is ironic that Python is so much better at it than Clojure.

whilo 2017-08-24T15:15:22.000340Z

theano, tensorflow and pytorch are all autograd libs to different degrees.

2017-08-24T15:19:20.000066Z

I'll add something more pragmatic (and effective IMO): vectorized gradients for all standard math functions on vector/matrix/tensor structures in neanderthal, but no general clojure code gradients.
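
A rough illustration of that idea in Python/numpy terms (this is not neanderthal's API, just a sketch of its shape): each standard vectorized function is paired with its elementwise derivative, and the caller applies the chain rule explicitly instead of relying on general autograd.

```python
# Illustrative only: pair every standard vectorized function with its
# elementwise derivative; no tracing of arbitrary user code.
import numpy as np

ops = {
    "tanh": (np.tanh, lambda x: 1.0 - np.tanh(x) ** 2),
    "exp":  (np.exp,  np.exp),
    "sigmoid": (lambda x: 1.0 / (1.0 + np.exp(-x)),
                lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2),
}

x = np.linspace(-2.0, 2.0, 5)
f, df = ops["tanh"]
y, dy_dx = f(x), df(x)   # value and vectorized gradient, one pass over the array
```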

whilo 2017-08-24T15:21:11.000336Z

That is probably sufficient. I agree that general autograd might be too much, but there is a very rich literature and there are implementations in Scheme, and they have tried hard to make it efficient.

whilo 2017-08-24T15:21:31.000241Z

E.g. reverse-mode autograd is essentially backprop.
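
A toy illustration of that correspondence (a hypothetical minimal sketch, not any real library): the forward pass records parents and local gradients, and the backward sweep pushes gradients through them in reverse, exactly like backprop through a network.

```python
# Toy reverse-mode autodiff on scalars, illustrative only.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # accumulate the incoming gradient, then propagate it to the parents
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(2.0), Var(3.0)
z = x * y + x            # z = x*y + x
z.backward()
print(x.grad, y.grad)    # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```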

whilo 2017-08-24T15:26:02.000657Z

Do you have pointers on how you would do it in neanderthal?

2017-08-24T15:29:26.000515Z

I prefer to talk about things only once I am sure I can do it properly. I have a pretty good idea how to do it, but am not sure whether the results would be amazing, so I prefer to shut up for now 🙂

2017-08-24T15:31:02.000564Z

and, of course, there are quite a few things with higher priority now.