Total newb in anything ML here. I have a corpus of natural language documents that I’d like to a) automatically extract themes from, b) apply these themes to new documents as they enter the system, and c) sort by similarity (so we can show “related documents” for each document, and sort lists so similar documents appear close together). I had a look at AWS Comprehend, but at least the key phrases functionality didn’t seem to be what I want. I’ll try topic modeling next, but I’m also interested in having full control of the model so humans can help it get more accurate results.
The amount of data I have is relatively small (hundreds of documents), so hopefully I don’t need to go into big data or anything crazy. Any suggestions on where to start? Ideally Clojure with a robust/mainstream underlying library/framework.
Oh there’s also a ton of conj talks about ML, any pointers on where to start?
Many thanks 🙏
I don't know of a Clojure library to handle that. AWS SageMaker looks more like what you want if you need to do this at large scale https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html
MXNet is under the hood, so you may be able to interface with it through Clojure-MXNet. But it might be easier to deal with it via Python interop with https://github.com/cnuernber/libpython-clj. Also, these tasks are most likely not going to be deep learning anyway
There is also an LDA example which might be relevant
Once you have your documents in their final vector form, you can compare them by cosine similarity
(or some other distance measure)
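For reference, a minimal sketch of the cosine similarity idea in plain Python, assuming simple word-count vectors stand in for the "final vector form" (a topic model or TF-IDF would produce different vectors but be compared the same way). The function names `term_vector`, `cosine_similarity` and `related_documents` and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn a document into a sparse bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_documents(doc_id, docs):
    """Rank every other document by similarity to docs[doc_id], most similar first."""
    vectors = {name: term_vector(text) for name, text in docs.items()}
    target = vectors[doc_id]
    scores = [
        (name, cosine_similarity(target, vec))
        for name, vec in vectors.items()
        if name != doc_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample corpus, only for illustration.
docs = {
    "tax-form": "annual income tax return form for residents",
    "tax-guide": "guide to filing your annual income tax return",
    "recipe": "slow cooked tomato soup with basil",
}
ranking = related_documents("tax-form", docs)
```

With a few hundred documents, computing every pairwise score like this should be cheap enough to do on each request or to precompute once per new document.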
or honestly - you might just be able to use python sklearn with the python interop https://sanjayasubedi.com.np/nlp/nlp-with-python-document-clustering/
But take all of this with a grain of salt - I've never done that before. Maybe someone else has a better idea. Just telling you what direction I personally would look in
Thanks @gigasquid, my head is already spinning. Libpython-clj sure does seem to open the door to many powerful libraries, and there also seem to be some Java NLP libraries out there. I’ll continue my search and report back if I find something interesting.
@orestis: I am also an ML noob here. But I implemented something to find similarity between “Government Forms” a few years back. I used the “Jaccard Similarity Index”. It is a metric to measure similarity between sets. And I used something called “Shingling” (similar to n-grams) to create sets from these documents. It worked reasonably well for our use case. Jaccard: https://en.wikipedia.org/wiki/Jaccard_index Shingling: https://en.wikipedia.org/wiki/W-shingling
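A rough sketch of the shingling plus Jaccard approach in plain Python, not the original implementation (which isn't shown here). It assumes word-level shingles of width 3; the width, the tokenizer regex, the function names and the sample form texts are all illustrative choices.

```python
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < w:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical form snippets, only for illustration.
form_a = "please enter your full legal name and date of birth"
form_b = "please enter your full legal name and current address"
score = jaccard(shingles(form_a), shingles(form_b))
print(score)  # 0.5: 5 shared shingles out of 10 distinct ones
```

Larger shingle widths make the measure stricter about word order, while `w=1` reduces it to plain vocabulary overlap.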
Thanks - I’ll have a look at those. Did you use any particular library for this?
Back then I implemented it in Python, not in Clojure. But even in Python I wrote a custom implementation without any library. It was not hard to implement.
I have seen the Jaccard distance somewhere in some Apache project.
@orestis may I shamelessly plug http://nextjournal.com if you’d like to do training on GPUs? You can currently use a GPU on the free plan and @kommen and I are here to help if you need anything
@chris441 - I took libpython-clj out for a spin in python mxnet - worked great! https://gist.github.com/gigasquid/b4bde4588f9da80a55c1516c62bf9660
This is great!