jackdaw

https://github.com/FundingCircle/jackdaw
Quest 2020-03-04T10:15:21.029600Z

Fair warning: this is an event sourcing question, not inherently related to Jackdaw. I'm asking here because this channel probably has the best expertise, but I'll redirect chatter elsewhere on request. With that said... I'm building a CQRS+ES system and considering the datastore for commands & events as they flow through Kafka. Traditional advice seems to be (A) "write to an external datastore", for instance what https://github.com/capitalone/cqrs-manager-for-distributed-reactive-services does (Postgres). The reasoning is that you need a way to express "save this event only if the version of the entity is x." However, the more recent arrival of exactly-once delivery semantics [1] and stateful stream processors [2] creates the possibility of (B): keeping your command+event logic entirely within Kafka Streams. [1] https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ [2] https://kafka.apache.org/11/documentation/streams/core-concepts#streams_state
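To make (B) concrete, here's roughly the shape I have in mind as a minimal Java sketch (not Jackdaw): a transformer keeps the latest version per entity in a local state store and drops commands whose expected version doesn't match. The topic names, store name, and the "version:payload" encoding are all made up, and you'd run it with processing.guarantee=exactly_once so the state update and the emitted event commit together:
```
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class CommandHandlerTopology {

  public static StreamsBuilder build() {
    StreamsBuilder builder = new StreamsBuilder();

    // Local (RocksDB-backed, changelog-replicated) store of the latest version per entity.
    builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("entity-versions"),
        Serdes.String(), Serdes.Long()));

    builder.stream("commands", Consumed.with(Serdes.String(), Serdes.String()))
        .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
          private KeyValueStore<String, Long> versions;

          @SuppressWarnings("unchecked")
          @Override
          public void init(ProcessorContext context) {
            versions = (KeyValueStore<String, Long>) context.getStateStore("entity-versions");
          }

          @Override
          public KeyValue<String, String> transform(String entityId, String command) {
            // Command value is assumed to look like "<expectedVersion>:<payload>".
            String[] parts = command.split(":", 2);
            long expected = Long.parseLong(parts[0]);
            Long stored = versions.get(entityId);
            long current = stored == null ? 0L : stored;
            if (expected != current) {
              return null; // stale command: drop it (or forward to a rejections topic)
            }
            versions.put(entityId, current + 1);
            return KeyValue.pair(entityId, parts[1]); // accepted: emit the event
          }

          @Override
          public void close() {}
        }, "entity-versions")
        .to("events", Produced.with(Serdes.String(), Serdes.String()));

    return builder;
  }
}
```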

Quest 2020-03-04T10:23:48.034900Z

Does anyone have thoughts on the viability of option B? Seems nice in theory -- your event datastore is simply Kafka. I question what unknown gotchas might be hiding out there though. Is there a sample repo out there that demonstrates the "pure Kafka" approach?

gklijs 2020-03-09T13:24:57.039900Z

My current company had it all running on Kafka, but is about to switch to Mongo as the datastore. The three most important reasons: first, RocksDB being just a key-value store was a bit limiting, which meant intermediate topics and additional complexity. Second, several global tables take several gigabytes of data on each instance, and it takes about half a minute to restore them from the broker. The last is that the company wants to move to a multi-cloud environment; with Atlas the data is already available from multiple clouds, while with Kafka some replication needs to be put in place.

gklijs 2020-03-09T13:31:55.040100Z

In my hobby bank project I used PostgreSQL to store state, mainly to have easy idempotence.

2020-03-09T16:43:07.040300Z

a workaround we've implemented is putting documents in an external store, and flowing document ids through kafka (with the convention that the documents are versioned and immutable)
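very roughly, the producing side looks like this (Java sketch; the topic name, the id:version encoding, and the documentStore call are placeholders for whatever store and convention you use):
```
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.UUID;

public class DocumentPointerProducer {
  public static void main(String[] args) {
    // 1. Write the immutable, versioned document to the external store first
    //    (hypothetical call; any store with conditional/versioned puts works).
    String docId = UUID.randomUUID().toString();
    int version = 1;
    // documentStore.put(docId, version, payload);

    // 2. Flow only the (id, version) pointer through Kafka.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("document-events", docId, docId + ":" + version));
    }
    // Consumers resolve docId:version against the store; because each version is
    // immutable, re-reading after a Kafka replay always yields the same bytes.
  }
}
```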

1👍
gklijs 2020-03-09T19:46:32.040900Z

That sounds pretty much like Crux, only in that case the data also goes through Kafka. Having the indexes separate from the application does solve some of the issues with using just Kafka.

thom 2020-03-10T09:22:53.042300Z

Any gotchas implementing custom data stores? Is it a widely accepted architecture to have some tables that just grow monotonically forever?

thom 2020-03-10T09:25:55.045200Z

Basically we do updates or retractions potentially months later and so far Kafka Streams seems completely magic compared to other platforms for keeping downstream stuff up to date, as long as we feed it storage and don’t mind new instances taking a while to drink the data in. But we’d perhaps feel more comfortable if it wasn’t just backed by RocksDB.

gklijs 2020-03-10T12:43:49.045400Z

You could also use ksqlDB, which basically lets the RocksDB live in another service, with an API to get data from there. It does solve the restart cost, but you need another piece. And with that other piece you might as well use a more traditional database. You also need to take into account how you want to restore the data in case something happens. Relying just on replication in Kafka might be too risky.
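For example, a pull query against a table ksqlDB has materialized, over its REST API, looks roughly like this (the table name and key are made up, and the endpoint shape is from memory, so treat it as a sketch):
```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlPullQuery {
  public static void main(String[] args) throws Exception {
    // Ask ksqlDB (rather than our own RocksDB) for the current row for one key.
    // "account_balances" is a hypothetical table created with CREATE TABLE ... AS SELECT.
    String body = "{\"ksql\": \"SELECT * FROM account_balances WHERE id = 'abc';\", "
        + "\"streamsProperties\": {}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8088/query"))
        .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON rows; empty if the key isn't present
  }
}
```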

2020-03-10T17:21:15.045600Z

also consider what happens if your data is subject to "right to be forgotten" type rules

tzzh 2020-03-04T11:16:24.035100Z

Hey @cddr just a suggestion: what would you think about also having producer-serde and consumer-serde functions in the jackdaw.serdes.avro.confluent namespace, so you only provide the arguments that are needed if you don’t use both aspects of the serde? E.g. for the producer you’d only need to provide the Avro schema, and the consumer would only need a schema-registry-client?

adamfeldman 2020-03-04T14:49:02.035300Z

I went hunting for recent Confluent-produced resources on event sourcing and found this: https://www.slideshare.net/ConfluentInc/event-sourcing-stream-processing-and-serverless-ben-stopford-confluent-kafka-summit-sf-2019 If you haven’t already, be sure to check out ksqlDB (formerly KSQL). I’ve been watching it as a possible foundation for a CQRS+ES system: https://www.confluent.io/blog/intro-to-ksqldb-sql-database-streaming/

1👍
2020-03-04T16:13:43.035600Z

we do option B in some cases here (at FC, where Jackdaw is written). One gotcha is that a DB or document store has prior art for common maintenance tasks and problems, and Kafka as a store is not as mature or as simple. I don't believe there's anywhere we trust Kafka as the store without some backup, and backing up a proper DB like Postgres is turnkey, while backing up Kafka Streams state is not.

thom 2020-03-04T20:02:06.037300Z

Do you guys just use RocksDB for Kafka Streams state? How long do you keep things around in KTables etc?

2020-03-04T21:49:41.037500Z

we use RocksDB in at least one service, with indefinite storage

2020-03-04T21:51:04.037700Z

(caveat: it's been a while since I worked on that service, I'm not sure how much of that implementation has remained the same, and I probably shouldn't get too specific :D)