onyx

FYI: alternative Onyx :onyx: chat is at <https://gitter.im/onyx-platform/onyx> ; log can be found at <https://clojurians-log.clojureverse.org/onyx/index.html>
lucasbradstreet 2018-08-14T02:08:53.000113Z

@lmergen great. Having others step in to fill the gap would be fantastic. Plus my hours for answering questions are limited to work hours now.

rustam.gilaztdinov 2018-08-14T14:21:30.000100Z

hello! I'm trying to start working with onyx-sql, and as an exercise, copying data from one table to another

(def id-column :id)
(def table :generated_data)
(def copy-table :generated_data_onyx)
;; schema of both tables is equal

(def catalog
  [{:onyx/name :partition-keys
    :onyx/plugin :onyx.plugin.sql/partition-keys
    :onyx/type :input
    :onyx/medium :sql
    :sql/classname (:classname config)
    :sql/subprotocol (:subprotocol config)
    :sql/subname (:host config)
    :sql/db-name (:database config)
    :sql/user (:user config)
    :sql/password (:password config)
    :sql/table table
    :sql/id id-column
    :sql/rows-per-segment 1000
    :sql/columns [:*]
    :onyx/batch-size batch-size
    :onyx/max-peers 1
    :onyx/doc "Partitions a range of primary keys into subranges"
    :sql/lower-bound 0
    :sql/upper-bound 100000}

   {:onyx/name :read-rows
    ;; :onyx/tenancy-ident :onyx.plugin.sql/read-rows
    :onyx/fn :onyx.plugin.sql/read-rows
    :onyx/type :function
    :sql/classname (:classname config)
    :sql/subprotocol (:subprotocol config)
    :sql/subname (:host config)
    :sql/db-name (:database config)
    :sql/user (:user config)
    :sql/password (:password config)
    :sql/table table
    :sql/id id-column
    :onyx/batch-size batch-size
    :onyx/doc "Reads rows of a SQL table bounded by a key range"}

   {:onyx/name :identity
    :onyx/fn :sql-data.core/rows
    :onyx/type :function
    :onyx/batch-size batch-size
    :onyx/batch-timeout batch-timeout
    :onyx/doc "identity"}

   {:onyx/name :write-rows
    :onyx/plugin :onyx.plugin.sql/write-rows
    :onyx/type :output
    :onyx/medium :sql
    :sql/classname (:classname config)
    :sql/subprotocol (:subprotocol config)
    :sql/subname (:host config)
    :sql/db-name (:database config)
    :sql/user (:user config)
    :sql/password (:password config)
    :sql/table copy-table
    :sql/copy? false
    ;; :sql/copy-fields [:first :second :third]
    :onyx/batch-size batch-size
    :onyx/doc "Writes segments from the :rows keys to the SQL database"}
   ])
I have this exception in logs:
clojure.lang.ExceptionInfo: Wrong number of args (1) passed to: sql/read-rows
     offending-segment:  {:id 1, :name "name", :price 100, :created_date #inst "2016-04-22T21:00:00.000000000-00:00", :description "description", :in_stock true}
        offending-task: :read-rows
    original-exception: :clojure.lang.ArityException
clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/apply-fn. Killing the job. -> Exception type: clojure.lang.ExceptionInfo. Exception message: Wrong number of args (1) passed to: sql/read-rows
                job-id: #uuid "5fa02f30-675f-8c9c-8e5d-fd27609f2207"
              metadata: {:job-id #uuid "5fa02f30-675f-8c9c-8e5d-fd27609f2207", :job-hash "2e8adc49564869d2ca4536a0b155de9411e5c55d78b576a4cd13411e444aaa"}
     offending-segment: {:id 1, :name "name", :price 100, :created_date #inst "2016-04-22T21:00:00.000000000-00:00", :description "description", :in_stock true}
        offending-task: :read-rows
    original-exception: :clojure.lang.ArityException
               peer-id: #uuid "cb52b552-5c86-04ed-a94f-f874dfd46aca"
             task-name: :read-rows
Which args should I provide?

2018-08-14T15:43:36.000200Z

@rustam.gilaztdinov that's a very weird error, which version of onyx are you using ?

2018-08-14T15:43:44.000100Z

are you sure your version of onyx is compatible with the version of the plugin ?

rustam.gilaztdinov 2018-08-14T15:44:11.000100Z

[org.onyxplatform/onyx "0.13.3-alpha4"]
[org.onyxplatform/onyx-sql "0.13.3.0-alpha4"]

2018-08-14T15:44:24.000100Z

hmm

2018-08-14T15:50:52.000100Z

wait a minute, this doesn't make sense, the docs are not good

lucasbradstreet 2018-08-14T15:51:24.000100Z

It’s probably a missing lifecycle causing an argument not to be injected into an onyx/fn?
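
For context, a minimal sketch of the injection pattern Lucas is referring to, with a hypothetical namespace and connection map (in Onyx, a :lifecycle/before-task-start hook can return :onyx.core/params, which are prepended as extra arguments to the task's :onyx/fn before the segment):

(ns my-app.lifecycles)

(defn inject-conn
  "Hypothetical example: whatever is returned under :onyx.core/params
   is passed to the task's :onyx/fn ahead of each segment."
  [event lifecycle]
  {:onyx.core/params [{:connection-uri "jdbc:postgresql://localhost/db"}]})

(def conn-calls
  {:lifecycle/before-task-start inject-conn})

(def lifecycles
  [{:lifecycle/task :read-rows
    :lifecycle/calls :my-app.lifecycles/conn-calls}])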

rustam.gilaztdinov 2018-08-14T15:52:12.000100Z

but I don’t have any args, just identity

2018-08-14T15:52:38.000100Z

no, the docs are not correct

2018-08-14T15:52:51.000100Z

you're not supposed to call read-rows anymore since I did that refactoring in 0.10

2018-08-14T15:53:11.000100Z

SqlPartitioner now calls read-rows itself

2018-08-14T15:54:33.000200Z

@rustam.gilaztdinov for debugging purposes, could you do something for me? instead of this catalog entry:

{:onyx/name :read-rows
    ;; :onyx/tenancy-ident :onyx.plugin.sql/read-rows
    :onyx/fn :onyx.plugin.sql/read-rows
    :onyx/type :function
replace the :onyx/fn with another function, and (println ..) the output ?

2018-08-14T15:54:54.000100Z

it should be {:id 1, :name "name", :price 100, :created_date #inst "2016-04-22T21:00:00.000000000-00:00", :description "description", :in_stock true}
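
A minimal sketch of that debugging swap, reusing the sql-data.core namespace from the catalog above (the function name is hypothetical):

(defn debug-rows
  "Stand-in for :onyx.plugin.sql/read-rows: prints each incoming
   segment and passes it through unchanged."
  [segment]
  (println "segment:" segment)
  segment)

;; and in the catalog entry:
;; :onyx/fn :sql-data.core/debug-rows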

2018-08-14T15:55:04.000100Z

docs need to be updated I think 🙂

2018-08-14T15:55:51.000100Z

here you can see the tests also do not use read-rows anymore: https://github.com/onyx-platform/onyx-sql/blob/master/test/onyx/plugin/input_test.clj#L43

rustam.gilaztdinov 2018-08-14T15:58:37.000100Z

yes, I removed :read-rows from the catalog and workflow and this works!
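
For anyone landing here with the same error, the fix is to drop the :read-rows task entirely, since :onyx.plugin.sql/partition-keys now reads the rows itself. Assuming the original workflow looked like the first form below (it isn't shown in the thread):

;; before (hypothetical reconstruction)
[[:partition-keys :read-rows]
 [:read-rows :identity]
 [:identity :write-rows]]

;; after: :read-rows removed from both workflow and catalog
[[:partition-keys :identity]
 [:identity :write-rows]]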

2018-08-14T16:07:13.000100Z

👍

rustam.gilaztdinov 2018-08-14T17:09:31.000100Z

actually, another question 🙂 Batch size is equal to 10. After I submit the job, only one batch is processed, then in the log I have this exception:

java.lang.NullPointerException:
    clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/read-batch. Killing the job. -> Exception type: java.lang.NullPointerException. Exception message: null
       job-id: #uuid "7bacb4ad-e7c8-1292-8814-b9e6a9183a7f"
     metadata: {:job-id #uuid "7bacb4ad-e7c8-1292-8814-b9e6a9183a7f", :job-hash "b167932da4fa7ff4b4cff26b921fcf0576cc010fb9fbf5c607022507b3b9f6d"}
      peer-id: #uuid "76fd87b1-64af-254e-1568-96a986a649b7"
    task-name: :partition-keys
If I change the batch size, I still have this exception and only one batch processed. Does that mean I should add a :lifecycle/after-batch function or handle the exception? Sorry, I'm pretty new to onyx, any tips will be super helpful

2018-08-14T17:29:12.000100Z

perhaps there is an SQL error somewhere ? can you look at your database logs to see which queries are being sent ?

2018-08-14T17:29:29.000100Z

NPE doesn't look too good

2018-08-14T17:30:20.000100Z

to be perfectly honest, I don't think the sql plugin is used that much -- I mostly use it as an output plugin, and I think I'm one of the few people actually using it 🙂

2018-08-14T17:30:30.000100Z

so the chances are that you might run into some corner cases

rustam.gilaztdinov 2018-08-14T17:35:39.000100Z

If I change the batch size to 20, all works well. So, no NPE on the data. The NPE comes with the next batch

2018-08-14T18:06:38.000100Z

ok, I wouldn't be able to tell right now. it's most likely a bug in the sql plugin

rustam.gilaztdinov 2018-08-14T18:18:10.000100Z

Oh :(

rustam.gilaztdinov 2018-08-14T18:23:27.000100Z

It's bad. I'm a data analyst, and I'm thinking about writing complex transformations on sql data :( We have a huge postgres database, and this is why I picked onyx. Clojure is so great for working with data, and in combination with onyx it promises so much. Which kind of workflow do you suggest -- produce data to kafka, and work with the kafka plugin plus the sql plugin for output?

2018-08-14T18:50:56.000100Z

it completely depends upon what you want to do with the data. I would say that Onyx shines more in production workloads; for exploratory data analysis you want to keep things in a relational database or data warehouse.

2018-08-14T18:51:31.000100Z

in production, you could use Onyx to implement your actual algorithms, for example indeed sourcing from Kafka and writing into PostgreSQL
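
A rough sketch of that shape, assuming the onyx-kafka plugin and hypothetical task names (:read-messages would be an input task backed by :onyx.plugin.kafka/read-messages, :transform your own :onyx/fn, and :write-rows an SQL output task like the one in the catalog above):

(def workflow
  [[:read-messages :transform]
   [:transform :write-rows]])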

2018-08-14T18:52:34.000100Z

but it all completely depends on what you plan on doing

2018-08-14T18:52:46.000100Z

and it needs to be properly tailored towards your use case

2018-08-14T18:52:58.000100Z

There is no one-size-fits-all solution here :)

rustam.gilaztdinov 2018-08-14T19:10:30.000100Z

Agree :) but when the data is huge, yet not big enough for hadoop and spark, ETL on this size of data on a single machine is slow. I think this is totally an onyx case.