architecture

adamfeldman 2020-04-22T18:13:32.169Z

This release from Flink is fascinating to me. In short, it makes it easier to build an event-driven system where Flink manages the event stream and also manages invocations of external stateless services (FaaS or otherwise): https://flink.apache.org/news/2020/04/07/release-statefun-2.0.0.html

“Flink invokes the functions through a service endpoint via HTTP or gRPC based on incoming events, and supplies state access. The system makes sure that only one invocation per entity (`type`+`ID`) is ongoing at any point in time, thus guaranteeing consistency through isolation. By supplying state access as part of the function invocation, the functions themselves behave like stateless applications and can be managed with the same simplicity and benefits: rapid scalability, scale-to-zero, rolling/zero-downtime upgrades and so on.”

It’s so exciting to me to see higher-level distributed systems primitives being born and maturing. I see Flink’s work as part of a pattern where lots of people across the distributed systems community are working on treating (micro|stateless) services/functions as virtual actors:
• Virtual actors in https://dapr.io
• Durable functions on Azure (Microsoft Research invented virtual actors in https://dotnet.github.io/orleans)

There are parallels here to what’s happening in the Kafka ecosystem as well, with Kafka Streams, ksqlDB, etc. And with Apache Pulsar and its Functions… Where do AWS and GCP initiatives fit into all this?

lukasz 2020-04-22T18:23:18.170100Z

That's really interesting - so I can create a pipeline which consumes events from somewhere, and on each event it can call an arbitrary HTTP endpoint to do something with the event record?

lukasz 2020-04-22T18:23:35.170300Z

and push results somewhere else?

lukasz 2020-04-22T18:23:49.170700Z

GCP has Apache Airflow support I believe

lukasz 2020-04-22T18:23:56.170900Z

which seems to be an equivalent

adamfeldman 2020-04-22T18:34:06.172800Z

@lukaszkorecki I think that’s a good description. Flink StateFun will also ensure there is only one concurrent invocation for each unique entity (`type`+`ID`), which I see as actor-like. GCP has Cloud Dataflow, which is based on the Apache Beam programming model (for which Flink is a first-class runner, along with the hosted Dataflow service)
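
To make the actor-like guarantee concrete, here is a minimal in-process sketch of "one invocation per entity at a time, with state handed to a stateless function". It is not the StateFun API: `EntityDispatcher`, `count_events`, and the dict-based state are hypothetical names chosen for illustration, and a real deployment would call a remote HTTP/gRPC endpoint and persist state durably.

```python
import threading
from collections import defaultdict
from typing import Any, Callable, Dict, Tuple

Address = Tuple[str, str]  # (type, id), mirroring the entity key in the quote
State = Dict[str, Any]


class EntityDispatcher:
    """Runs a stateless function with at most one invocation per entity at a time."""

    def __init__(self, fn: Callable[[State, Any], State]) -> None:
        self._fn = fn
        self._state: Dict[Address, State] = {}
        self._locks: Dict[Address, threading.Lock] = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects creation of per-entity locks

    def dispatch(self, address: Address, event: Any) -> State:
        with self._guard:
            lock = self._locks[address]
        # Different entities proceed in parallel; the same entity is serialized.
        with lock:
            current = self._state.get(address, {})
            updated = self._fn(current, event)  # state is supplied with the call
            self._state[address] = updated
            return updated

    def state_of(self, address: Address) -> State:
        return self._state.get(address, {})


def count_events(state: State, event: Any) -> State:
    """Hypothetical stateless function: it only sees the state it is given."""
    return {"count": state.get("count", 0) + 1}
```

Because the read-modify-write for a given `(type, id)` happens under that entity's lock, concurrent events for the same entity cannot lose updates, while unrelated entities never wait on each other.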

lukasz 2020-04-22T18:34:59.173800Z

Right - that's Apache Beam. Too many competing Apache projects ;-)

❕ 1
adamfeldman 2020-04-22T18:35:11.174Z

(GCP has Cloud Composer, which is Airflow, which IIUC isn’t designed to handle a high volume of workflows — that’s why these workflow tools exist https://netflix.github.io/conductor/ and https://github.com/uber/cadence)

lukasz 2020-04-22T18:36:01.175600Z

Need to think about this - I'm trying to figure out how we can create a reliable pipeline of workers to do API fetches, but with rate-limiting per tenant

adamfeldman 2020-04-22T18:36:02.175800Z

I’d say AWS Step Functions is similar in spirit to conductor/cadence

adamfeldman 2020-04-22T18:36:22.176400Z

I’m also interested in the outbound rate limiting problem

adamfeldman 2020-04-22T18:36:33.176800Z

I don’t understand why there isn’t off-the-shelf tooling for that

lukasz 2020-04-22T18:36:41.177100Z

at this point we have a janky rate limiter based on Redis :-/

lukasz 2020-04-22T18:36:56.177600Z

but still no queues per tenant (because we rely on RabbitMQ)

adamfeldman 2020-04-22T18:37:19.178200Z

https://zeebe.io is an OSS-ish workflow tool (from Camunda who do BPM)

adamfeldman 2020-04-22T18:37:27.178600Z

Maybe helpful for building that stateful pipeline

adamfeldman 2020-04-22T18:38:51.180100Z

Perhaps you could limit calls per-tenant by sending all tenant calls through the same worker? Then you don’t need a distributed rate limiting tool, just a way to partition tenants across workers
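
One way to sketch "partition tenants across workers" is rendezvous (highest-random-weight) hashing: every tenant deterministically maps to one worker, and the worker list can change without reshuffling most tenants. This is only an illustration of the idea; `worker_for` and the worker names are made up, and it says nothing about how the messages would actually be routed in a given broker.

```python
import hashlib
from typing import Sequence


def _score(tenant_id: str, worker: str) -> int:
    """Stable pseudo-random weight for a (worker, tenant) pair."""
    digest = hashlib.sha256(f"{worker}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def worker_for(tenant_id: str, workers: Sequence[str]) -> str:
    """Pick the worker with the highest score for this tenant."""
    if not workers:
        raise ValueError("at least one worker is required")
    return max(workers, key=lambda w: _score(tenant_id, w))


if __name__ == "__main__":
    workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical worker names
    for tenant in ["tenant-1", "tenant-2", "tenant-3"]:
        print(tenant, "->", worker_for(tenant, workers))
```

Since all calls for a tenant land on the same worker, that worker can enforce the tenant's limit locally. Removing a worker only moves the tenants that were assigned to it; the worker count is independent of how many tenants exist.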

adamfeldman 2020-04-22T18:39:40.180800Z

Or, similarly, a logical queue per tenant

adamfeldman 2020-04-22T18:40:00.181200Z

ah you mentioned RabbitMQ

lukasz 2020-04-22T18:40:24.181700Z

Logical queues are ok, but only to a point - Rabbit cannot scale well with a large number of queues

lukasz 2020-04-22T18:40:56.182500Z

Same worker could potentially do it, but then we tie our deployment to the number of tenants, which come and go

2020-04-22T18:41:06.182700Z

there is a great talk about rate limiting

2020-04-22T18:41:20.182900Z

https://www.youtube.com/watch?v=m64SWl9bfvk

👍 3
👀 1
lukasz 2020-04-22T18:41:41.183500Z

btw, I don't enforce rate-limits - it's the 3rd party APIs that we need to call

lukasz 2020-04-22T18:41:59.183900Z

and limits vary per tenant (enterprise accounts vs regular plans) etc etc
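
For limits that differ per tenant and plan, a per-tenant token bucket is one common shape. The sketch below is in-process only and assumes hypothetical plan names and numbers (`PLAN_LIMITS`); real third-party quotas would come from each API's documentation, and sharing buckets across several workers would need a shared store.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical limits: plan -> (tokens refilled per second, burst capacity).
PLAN_LIMITS: Dict[str, Tuple[float, float]] = {
    "regular": (1.0, 5.0),
    "enterprise": (10.0, 50.0),
}


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """Keeps one bucket per tenant, sized by the tenant's plan."""

    def __init__(self, plan_of: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._plan_of = plan_of
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def try_acquire(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            # Buckets are created lazily, so tenants can come and go freely.
            rate, capacity = PLAN_LIMITS[self._plan_of(tenant_id)]
            bucket = self._buckets[tenant_id] = TokenBucket(rate, capacity, self._clock)
        return bucket.try_acquire()
```

A worker would call `try_acquire(tenant_id)` before each outbound API call and requeue or delay the job when it returns `False`. The injectable `clock` is there only to make the behaviour easy to test.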

lukasz 2020-04-22T18:42:05.184300Z

it's a multi-variable nightmare

adamfeldman 2020-04-22T18:42:15.184700Z

Salesforce is often the culprit there

lukasz 2020-04-22T18:42:31.185100Z

we solved that by pushing processing to Salesforce itself - they call us ;-)

❤️ 1
lukasz 2020-04-22T18:42:58.185700Z

but it's terribly hard to debug if something fails, so that's a big trade-off

2020-04-22T18:43:49.186200Z

you may also want to look at aws step functions for inspiration

richiardiandrea 2020-04-22T18:51:04.186500Z

I was happy to see this. Apache Pulsar is also super interesting because it can basically store events on different storage backends (like S3). The only thing I would say about all these JVM/Apache tools is the overhead in ops. It's a pity, because you really need to ponder before adopting them.

adamfeldman 2020-04-22T18:59:20.186700Z

That’s also what I most appreciate about Pulsar: having BookKeeper offload to blobstores. Kafka has recently made public moves to support that as well. I expect that eventually everything will have a Kubernetes operator. It doesn’t solve all the problems, but a lot of them

richiardiandrea 2020-04-22T19:08:32.186900Z

Yeah but that's then another piece to hold knowledge about. I don't know, I feel we are getting farther and farther from simplicity here 😄