datomic

Ask questions on the official Q&A site at https://ask.datomic.com!
genekim 2021-02-13T21:07:30.201700Z

Hello, Datomic folks — I’m dusting off the code that I wrote to research the Twitter relationship graph (which I referred to in my Conj2019 talk, which I’d like to clean up and publish). But this time, my goal is to archive and organize some notes I’ve been tweeting out. I’m pondering on how to store this in Datomic — I’d like to just store the entire tweet data structure that Twitter returns as an EDN string, and parse out new fields as I need. In fact, if I could just store the entire data structure, I’m not even sure if I want to pull out and store the tweet text as a separate entity. (So why store it? It’s a pain in the butt to query the Twitter API because of rate limits, and retrieval is not quite as easy as one would hope: they don’t even support pagination.) The problem is the size of these data blobs: average is about 2.5K, and max is around 7K. Does anyone have any advice on how you’d tackle storing this in Datomic? I’m not loving the idea of storing it in S3, in aggregate form or individually, mostly because it’s one more thing to access/manage. But I’m open to any/all advice on how to approach this! I’ll post a sample tweet in reply, as well as distribution of

(->> tweets
       (map str)
       (map count)
       sort)
Thanks in advance!

genekim 2021-02-13T21:08:09.201800Z

An approx 2K EDN data structure of a tweet:

{:in_reply_to_screen_name "RealGeneKim",
 :is_quote_status false,
 :coordinates nil,
 :in_reply_to_status_id_str "1360654233322676226",
 :place nil,
 :geo nil,
 :in_reply_to_status_id 1360654233322676226,
 :entities {:hashtags [],
            :symbols [],
            :user_mentions [{:screen_name "QuinnyPig",
                             :name "Corey Quinn",
                             :id 97114171,
                             :id_str "97114171",
                             :indices [0 10]}],
            :urls []},
 :source "&lt;a href=\"<http://tapbots.com/tweetbot>\" rel=\"nofollow\"&gt;Tweetbot for iΟS&lt;/a&gt;",
 :lang "en",
 :in_reply_to_user_id_str "19734656",
 :id 1360654514076852224,
 :contributors nil,
 :truncated false,
 :retweeted false,
 :in_reply_to_user_id 19734656,
 :id_str "1360654514076852224",
 :favorited false,
 :user {:description "WSJ bestselling author: Unicorn Project! DevOps researcher/enthusiast. Coauthor: Phoenix Project, Accelerate. Host of The Idealcast. Tripwire founder. Clojure.",
        :profile_link_color "1DA1F2",
        :profile_sidebar_border_color "C0DEED",
        :is_translation_enabled false,
        :profile_image_url "<http://pbs.twimg.com/profile_images/1208492572420665344/ou-2Ojes_normal.jpg>",
        :profile_use_background_image true,
        :default_profile true,
        :profile_background_image_url "<http://abs.twimg.com/images/themes/theme1/bg.png>",
        :is_translator false,
        :profile_text_color "333333",
        :profile_banner_url "<https://pbs.twimg.com/profile_banners/19734656/1571153817>",
        :name "Gene Kim",
        :profile_background_image_url_https "<https://abs.twimg.com/images/themes/theme1/bg.png>",
        :favourites_count 7546,
        :screen_name "RealGeneKim",
        :entities {:url {:urls [{:url "<https://t.co/v26RwFbMdV>",
                                 :expanded_url "<https://itrevolution.com/idealcast/>",
                                 :display_url "<http://itrevolution.com/idealcast/|itrevolution.com/idealcast/>",
                                 :indices [0 23]}]},
                   :description {:urls []}},
        :listed_count 1826,
        :profile_image_url_https "<https://pbs.twimg.com/profile_images/1208492572420665344/ou-2Ojes_normal.jpg>",
        :statuses_count 42237,
        :has_extended_profile false,
        :contributors_enabled false,
        :following nil,
        :lang nil,
        :utc_offset nil,
        :notifications nil,
        :default_profile_image false,
        :profile_background_color "C0DEED",
        :id 19734656,
        :follow_request_sent nil,
        :url "<https://t.co/v26RwFbMdV>",
        :translator_type "none",
        :time_zone nil,
        :profile_sidebar_fill_color "DDEEF6",
        :protected false,
        :profile_background_tile false,
        :id_str "19734656",
        :geo_enabled false,
        :location "ÜT: 45.527981,-122.670577",
        :followers_count 49973,
        :friends_count 1475,
        :verified false,
        :created_at "Thu Jan 29 21:10:55 +0000 2009"},
 :retweet_count 0,
 :favorite_count 6,
 :created_at "Sat Feb 13 18:18:10 +0000 2021",
 :text "@QuinnyPig I wish it were always easy to find Twitter handle of blog authors. 
        
        Wanted to acknowledge Techy's great work."}

genekim 2021-02-13T21:08:47.202Z

Sorted distribution of string length of these EDN representations.

(2423
 2505
 2506
 2530
 2537
 2558
 2610
 2617
 2618
 2625
 2633
 2640
 2648
 2651
 2652
 2653
 2654
 2655
 2675
 2677
 2678
 2679
 2689
 2702
 2702
 2707
 2707
 2707
 2707
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2708
 2710
 2712
 2712
 2725
 2726
 2727
 2727
 2729
 2730
 2732
 2735
 2735
 2755
 2769
 2776
 2786
 2788
 2798
 2804
 2804
 2804
 2804
 2804
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2805
 2817
 2820
 2831
 2831
 2846
 2848
 2894
 2909
 2914
 2927
 2930
 2930
 3598
 3847
 4240
 4924
 5092
 5262
 5435
 5846
 6656
 6950)

esp1 2021-02-13T22:23:13.226Z

Q: is it possible to have a notion of identity / upsert that is localized to components within a particular entity instance? e.g. say I have entities that may contain some number of named config components:

[{:node/id "N1"
  :node/configs [{:node.config/type "C1"
                  :node.config/value "abc"}
                 {:node.config/type "C2"
                  :node.config/value "def"}]}
 {:node/id "N2"
  :node/configs [{:node.config/type "C1"
                  :node.config/value "aardvark"}]}]
There is a limited set of config types (e.g. C1, C2), so as shown above multiple nodes may have config components with the same type (node N1 and N2 both have a C1 config component). However I'd also like to be able to upsert new values for config components. But if try to enable upsert by adding :db.unique/identity to the :node.config/type attribute, then multiple nodes can no longer have the same config type (I'll get a unique constraint violation). Is it possible to have a notion of unique identity that is local to the components within a given entity? How would I do this?

Joe Lane 2021-02-13T22:24:17.226100Z

Hey @genekim, long time no see! A few questions and recommendations: 1. Are the tweets considered immutable by now or do you need additional updates if they happen in the future (more comments, additional likes, etc.) 2. How many tweets do you have? 3. Unfortunately, I would not recommend storing big string blobs in Datomic. It can turn into a big head for a variety of reasons which I'm happy to go into more detail about if you're curious. 4. Will the tweet data be the only data in the database or is it a larger system? 5. Will you be running analysis over the tweets, or is this simply an archive of the data indexed by id? 6. Would you be interested in generating a schema for the tweet data? There are a few fun ways to analyze an edn datastructure and generate a schema. I could work out a close approximation to get you started if you sent me an edn file of one of the tweets (smartquotes are evil).

esp1 2021-02-13T22:36:46.226300Z

I would also be interested in understanding why storing big string blobs in Datomic is problematic - and also what would be considered 'big'. I'm assuming it may adversely impact indexing performance? I have a similar problem to @genekim and have considered storing the immutable large files in S3 under a uuid path and storing the path references in Datomic, but it would be nice if Datomic could support this use case out of the box. Are there any plans for Datomic to support this kind of thing in the future?

genekim 2021-02-13T22:47:38.226500Z

@lanejo01 Great to hear from you — I hope all is well! And thanks for the thoughtful questions — can’t wait to see where you go with this! 🙂 1. immutable, for sure. 2. about 50K tweets 3. For sure, after some research, I’m thinking this is wise. I suppose any database would do, right? (just store, say, MySQL row id into Datomic, as well as any parsed entities?) 4. I’m adding the tweets to one that was primarily about users, because it seems like it could be fun 5. At this point, no analysis… but I’m discovering processing is more significant than I had thought. I think it’d be super useful to extract all the graphics links, which would be great to store in Datomic. 6. Thanks for the offer — how about I post the schema when I’m a little further along. (After exploring the data, I think I’m going to extract all the images, which I’ve wanted to do for years — there are some pictures from conferences I know I either posted on Twitter, or just retweeted them.) So, I guess I’m going to pour all these 35K JSON entities into a CloudSQL MySQL table that I have laying around, and then transform/load into Datomic. Super helpful, Joe! Thank you!

ghadi 2021-02-13T23:27:38.226700Z

Hi @genekim. Adding to what @lanejo01 said, you've already done the important work of understanding the blob's size distribution. Blobs give you no leverage in Datomic. Extract and transact the information you want to query through, then add a Base64(SHA256(blob.json)) datom alongside, to point back to the blob. (Then you can store the blob on disk, EFS, S3. whatever)

ghadi 2021-02-13T23:29:29.227Z

content-addressing is a nice way to punt on the decision on how to transform the important parts of the source data

genekim 2021-02-13T23:35:54.227300Z

Hey, @ghadi! I did a double-take on your suggestion, because it was so startling. I was thinking that for each tweet, I’d store with it a, say, :mysql-tweet-id , so I could retrieve the original JSON. You’re suggesting that I not do that, and reference it by the hash of the JSON (e.g., tweet-sha) with each tweet, to completely remove any assumption of where that tweet is stored, right? (mysql, S3, etc.) Do I have that right? (my reaction to “content-addressing”: 🤯🤯)

ghadi 2021-02-13T23:36:28.227500Z

yeah just tweet-sha, not mysql-tweet-sha

genekim 2021-02-13T23:36:59.227700Z

Awesome! Thank you, all!

ghadi 2021-02-13T23:37:08.227900Z

otherwise in 5 years you'll wonder why the program is querying Postgres for data with mysql-tweet-id in it 😉

2