data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
orestis 2020-01-09T11:32:20.023600Z

Total newb in anything ML here. I have a corpus of natural language documents, and I’d like to a) automatically extract themes from them, b) apply these themes to new documents as they enter the system, and c) sort them by similarity (so we can show “related documents” for each document, and sort lists by similarity so similar documents appear close together). I had a look at AWS Comprehend, but at least the key phrases functionality didn’t seem to be what I want. I’ll try topic modeling next, but I’m also interested in having full control of the model so humans can help it get more accurate results.

orestis 2020-01-09T11:33:36.025700Z

The amount of data I have is relatively small (hundreds of documents), so hopefully I don’t need to go into big data or anything crazy. Any suggestions on where to start? Ideally Clojure with a robust/mainstream underlying library/framework.

orestis 2020-01-09T11:35:23.026600Z

Oh there’s also a ton of conj talks about ML, any pointers on where to start?

orestis 2020-01-09T11:35:31.026900Z

Many thanks 🙏

2020-01-09T14:09:04.028100Z

I don't know of a Clojure library that handles that. AWS SageMaker looks more like what you want if you need to do this at large scale: https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html

2020-01-09T14:10:23.029600Z

MXNet is under the hood - you may be able to interface with it through Clojure-MXNet. But it might be easier to deal with it via Python interop with https://github.com/cnuernber/libpython-clj. Also, these tasks are most likely not going to be deep learning anyway

2020-01-09T14:13:21.030400Z

There is also an LDA example which might be relevant

2020-01-09T14:14:02.031100Z

Once you have your documents in their final vector form, you can compare them by cosine similarity

2020-01-09T14:15:10.031400Z

(or some other distance measure)
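
To make the cosine-similarity suggestion concrete, here is a minimal sketch in plain Python. It assumes the "vector form" is a simple bag-of-words term count, which is only one possible choice (a topic model such as LDA would produce different vectors). The names `term_vector`, `cosine_similarity`, and `related_documents` are hypothetical and do not come from any library mentioned in this thread.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumption: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def related_documents(query_id, docs):
    """Ids of all other documents, most similar to `query_id` first."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

With a few hundred documents, comparing every pair this way should be cheap enough that nothing more elaborate is needed for a "related documents" list.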

2020-01-09T14:22:36.032100Z

or honestly - you might just be able to use python sklearn with the python interop https://sanjayasubedi.com.np/nlp/nlp-with-python-document-clustering/

2020-01-09T14:29:08.033200Z

But take it all with a grain of salt - I've never done that before. Maybe someone else has a better idea. Just telling you what direction I personally would look in

orestis 2020-01-09T16:34:49.034500Z

Thanks @gigasquid — already my head is spinning. Libpython-clj sure does seem to open the door to many powerful libraries, there also seem to be some Java NLP libraries out there. I’ll continue my search and report back if I find something interesting.

Avichal 2020-01-09T17:28:11.040Z

@orestis: I am also an ML noob here. But I implemented something to find similarity between “Government Forms” a few years back. I used the “Jaccard Similarity Index”. It is a metric that measures similarity between sets. And I used something called “Shingling” (similar to n-grams) to create sets from these documents. It worked reasonably well for our use case. Jaccard: https://en.wikipedia.org/wiki/Jaccard_index Shingling: https://en.wikipedia.org/wiki/W-shingling

orestis 2020-01-09T17:32:49.040800Z

Thanks - I’ll have a look at those. Did you use any particular library for this?

Avichal 2020-01-09T17:34:48.041Z

Back then I implemented it in Python, not in Clojure. But in Python too I wrote a custom implementation without any library. It was not hard to implement.
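
For reference, a minimal sketch of shingling plus the Jaccard index in plain Python. This is not the implementation described above, which isn't shown in the thread; it assumes word-level shingles of width `w`, and the function names `shingles` and `jaccard` are hypothetical.

```python
def shingles(text, w=3):
    """Set of w-word shingles (contiguous word sequences) in `text`."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Shorter than one shingle: keep the whole text as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Both sets empty; returning 0.0 is an arbitrary choice made here
        # to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)
```

Usage would look like `jaccard(shingles(doc_a), shingles(doc_b))`, which gives a score between 0 (no shared shingles) and 1 (identical shingle sets). Larger `w` makes the comparison stricter about word order.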

orestis 2020-01-09T17:36:26.041900Z

I have seen the Jaccard distance somewhere in some Apache project.

mkvlr 2020-01-09T21:09:42.043500Z

@orestis may I shamelessly plug http://nextjournal.com if you’d like to do training on GPUs? You can currently use a GPU on the free plan and @kommen and I are here to help if you need anything

2020-01-09T23:37:37.044100Z

@chris441 - I took libpython-clj out for a spin in python mxnet - worked great! https://gist.github.com/gigasquid/b4bde4588f9da80a55c1516c62bf9660

2020-01-10T20:30:45.071Z

This is great!