data-science

Data science, data analysis, and machine learning in Clojure. See https://scicloj.github.io/pages/chat_streams/ for additional discussions.
chrisn 2020-04-08T15:08:18.087800Z

tech.ml.dataset is now at beta version 2.0. It is a dataframe-like library and now includes support for joins and the majority of the java.time datetime API. Column types are autodetected from the data in most cases and can be overridden in multiple ways. Once loaded, you have the full power of the tech.datatype library to transform and clean data into whatever format you wish. It is also very easy to take this data and apply some simple linear regression and categorical inference to it (predict one column from a set of other columns).

Some features:
* Joins work in index space, making them subsecond for even very large datasets.
* Full datetime support, including a set of 'packed' datetime datatypes that are stored as 32 or 64 bit integers but can be operated on and print like their datetime analogues.
* Support for reading tsv, csv, xlsx, xls, and gzipped varieties of those.
* Support for writing tsv, csv, and gzipped varieties of those.
* Strings are loaded into variable width string tables, so if for instance you have a categorical variable with 5 categories, each entry will be stored in a byte.
* Column datatype autodetection with a clear system for overriding and specifying the desired column types.
* Memory efficient: data is stored in primitive arrays, dates are packed, and strings are stored in string tables.

This means you can work with data many times larger than before and stay in Clojure. It also means that sorting, group-by, filtering, and joining a table by a column are extremely fast. All of these operations are done in index space, so they are fast, and if they result in an expansion of the dataset then you only pay for the indexes required, not for duplication of the data.

https://github.com/techascent/tech.ml.dataset
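For anyone wondering what "string tables" and "index space" mean in practice, here is a rough sketch in plain Clojure. It is only an illustration of the two ideas and makes no claim about how tech.ml.dataset implements them; every function name below (`string-table`, `decode`, `filter-indexes`, `select-rows`, `inner-join-indexes`) is hypothetical and not part of the library's API.

```clojure
;; Illustration only: plain clojure.core, NOT tech.ml.dataset's implementation.
;; All names here are hypothetical helpers invented for this sketch.

(defn string-table
  "Dictionary-encode a seq of strings: each distinct string is stored once
  in :table, and the column itself becomes a vector of integer codes."
  [strings]
  (let [table   (vec (distinct strings))
        code-of (zipmap table (range))]
    {:table table
     :codes (mapv code-of strings)}))

(defn decode
  "Rebuild the original strings from a string table."
  [{:keys [table codes]}]
  (mapv table codes))

(defn filter-indexes
  "Return the row indexes whose value in the vector `column` satisfies `pred`.
  No row data is copied; only indexes are produced."
  [pred column]
  (vec (keep-indexed (fn [idx v] (when (pred v) idx)) column)))

(defn select-rows
  "Materialize values of the vector `column` at the given row indexes."
  [column indexes]
  (mapv column indexes))

(defn inner-join-indexes
  "Inner join of two key columns (both vectors). Returns [left-idx right-idx]
  pairs, so an expanding join costs index pairs rather than duplicated rows."
  [left-keys right-keys]
  (let [right-index (reduce-kv (fn [m idx k] (update m k (fnil conj []) idx))
                               {}
                               right-keys)]
    (vec (for [[l-idx k] (map-indexed vector left-keys)
               r-idx     (get right-index k [])]
           [l-idx r-idx]))))

(def species ["cat" "dog" "cat" "bird" "dog"])
(def encoded (string-table species))

(:table encoded)                               ;; => ["cat" "dog" "bird"]
(:codes encoded)                               ;; => [0 1 0 2 1]
(filter-indexes #(= "dog" %) species)          ;; => [1 4]
(inner-join-indexes [:a :b :c] [:b :a :a])     ;; => [[0 1] [0 2] [1 0]]
```

With only three distinct values, each code would fit in a byte if the codes were kept in a primitive array rather than a persistent vector, which is the saving the string-table bullet describes. The join returns index pairs, so the underlying columns are only read when you actually select rows.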

👍 3
💯 4
🎉 5