This release from Flink is fascinating to me. In short, it makes it easier to build an event-driven system where Flink manages the event stream and also manages invocations of external stateless services (FaaS or otherwise). https://flink.apache.org/news/2020/04/07/release-statefun-2.0.0.html. “Flink invokes the functions through a service endpoint via HTTP or gRPC based on incoming events, and supplies state access. The system makes sure that only one invocation per entity (`type`+`ID`) is ongoing at any point in time, thus guaranteeing consistency through isolation. By supplying state access as part of the function invocation, the functions themselves behave like stateless applications and can be managed with the same simplicity and benefits: rapid scalability, scale-to-zero, rolling/zero-downtime upgrades and so on.” It’s so exciting to me to see higher-level distributed systems primitives being born and maturing. I see Flink’s work as part of a pattern where lots of people across the distributed systems community are working on treating (micro|stateless) services/functions as virtual actors:
• Virtual actors in https://dapr.io
• Durable Functions on Azure (Microsoft Research invented virtual actors in https://dotnet.github.io/orleans)
There are parallels here to what’s happening in the Kafka ecosystem as well, with Kafka Streams, ksqlDB, etc., and with Apache Pulsar and its Functions… Where do AWS and GCP initiatives fit into all this?
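To make the quoted model concrete, here's a rough sketch of what a remote function endpoint might look like. This is not the real StateFun SDK or wire protocol, just a minimal Flask illustration (all field names are made up) of a stateless service that is handed the event plus the entity's current state and returns the updated state:
```python
# Sketch only (not the actual StateFun protocol): a stateless HTTP service that
# receives an event together with the entity's current state and returns the
# new state. Field names are invented for illustration.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/greeter", methods=["POST"])
def greeter():
    body = request.get_json()
    entity_id = body["entity_id"]                 # the `type`+`ID` of the entity
    state = body.get("state") or {"seen": 0}      # state is supplied with the invocation
    event = body["event"]

    # All statefulness rides along in the payload; the service itself keeps nothing.
    state["seen"] += 1
    greeting = f"Hello {entity_id}, seen you {state['seen']} time(s): {event}"

    # Hand the new state back; the runtime persists it and guarantees that no
    # other invocation for this entity runs concurrently.
    return jsonify({"state": state, "outputs": [greeting]})

if __name__ == "__main__":
    app.run(port=8000)
```
Because the state travels with the invocation, the service can scale out, scale to zero, or roll without migrating anything of its own, which is exactly the benefit the quote describes.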
That's really interesting - so I can create a pipeline which consumes events from somewhere, and on each event it can call an arbitrary HTTP endpoint to do something with the event record?
and push results somewhere else?
GCP has Apache Airflow support I believe
which seems to be an equivalent
@lukaszkorecki I think that's a good description. Flink StateFun will also ensure there is only one concurrent invocation per entity (`type`+`ID`), which I see as actor-like. GCP has Cloud Dataflow, which is based on the Apache Beam model (for which Flink is a first-class runner, alongside the hosted Dataflow service)
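Here's a rough Beam sketch of the shape described above: consume events, call an HTTP endpoint per event, and push the results somewhere else. The file paths and endpoint URL are placeholders, and a real pipeline would use a streaming source plus batching and retries; the per-entity isolation and managed state are what StateFun adds on top of this:
```python
# Sketch of "consume events -> call an HTTP endpoint -> push results" in Apache Beam.
# Paths and the endpoint URL are placeholders.
import apache_beam as beam
import requests

ENDPOINT = "https://example.com/enrich"  # hypothetical service

def call_service(line: str) -> str:
    # One HTTP call per event record; a production pipeline would batch and retry.
    resp = requests.post(ENDPOINT, data=line, timeout=10)
    resp.raise_for_status()
    return resp.text

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromText("events.jsonl")
        | "CallService" >> beam.Map(call_service)
        | "WriteResults" >> beam.io.WriteToText("results")
    )
```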
Right - that's Apache Beam. Too many competing Apache projects ;-)
(GCP has Cloud Composer, which is Airflow, which IIUC isn’t designed to handle a high volume of workflows — that’s why these workflow tools exist https://netflix.github.io/conductor/ and https://github.com/uber/cadence)
Need to think about this - I'm trying to figure out how we can create a reliable pipeline of workers to do API fetches, but with rate-limiting per tenant
I’d say AWS Step Functions is similar in spirit to conductor/cadence
I’m also interested in the outbound rate limiting problem
I don’t understand why there isn’t off-the-shelf tooling for that
at this point we have a janky rate limiter based on redis :-/
but still no queues per tenant (because we rely on RabbitMQ)
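For what it's worth, the redis-backed version of that usually boils down to something like this - a fixed-window counter keyed per tenant. Sketch only: the key prefix, limit, and window size are made up, and per-plan limits would have to be looked up rather than hard-coded:
```python
# Sketch: fixed-window, per-tenant rate limiter on Redis.
# Key names, limit, and window size are placeholders.
import redis

r = redis.Redis()

def allow_request(tenant_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"ratelimit:{tenant_id}"
    count = r.incr(key)
    if count == 1:
        # First hit in this window: start the clock. (Small race here if the
        # process dies between INCR and EXPIRE, which is part of the jankiness.)
        r.expire(key, window_s)
    return count <= limit
```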
https://zeebe.io is an OSS-ish workflow tool (from Camunda who do BPM)
Maybe helpful for building that stateful pipeline
Perhaps you could limit calls per tenant by sending all of a tenant's calls through the same worker? Then you don't need a distributed rate-limiting tool, just a way to partition tenants across workers
Or, similarly, a logical queue per tenant
ah you mentioned RabbitMQ
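To make the partitioning idea concrete: the routing can be as simple as a stable hash of the tenant id, so every call for a tenant lands on the same worker (or queue/partition) and rate limiting stays local to that worker. Sketch only - the worker count is a placeholder, and resizing the pool needs something like consistent hashing:
```python
# Sketch: route all of a tenant's calls to the same worker via a stable hash,
# so per-tenant rate limiting can be done locally on that worker.
import hashlib

def worker_for(tenant_id: str, num_workers: int) -> int:
    digest = hashlib.sha1(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

# e.g. worker_for("acme-corp", 8) always returns the same index,
# as long as num_workers stays fixed.
```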
Logical queues are ok, but only to a point - Rabbit cannot scale well with a large number of queues
Same worker could potentially do it, but then we tie our deployment to the number of tenants, which come and go
there is a great talk about rate limiting
btw, I don't enforce the rate limits myself - they're imposed by the 3rd-party APIs we need to call
and limits vary per tenant (enterprise accounts vs regular plans) etc etc
it's a multi-variable nightmare
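One way to tame the multi-variable part is a token bucket per (tenant, API), with the refill rate looked up from the tenant's plan. Rough in-process sketch, with plan names and rates invented; if calls are spread across workers you'd still need the shared Redis counter or the same-worker-per-tenant routing from above:
```python
# Sketch: in-process token buckets keyed by (tenant, api), with the refill rate
# looked up from the tenant's plan. Plan names and rates are made up.
import time

PLAN_RATES = {"enterprise": 10.0, "regular": 2.0}  # requests per second

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # refill rate, tokens per second
        self.capacity = capacity    # burst allowance
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict = {}

def allow_call(tenant_id: str, api: str, plan: str) -> bool:
    key = (tenant_id, api)
    if key not in buckets:
        rate = PLAN_RATES.get(plan, 1.0)
        buckets[key] = TokenBucket(rate=rate, capacity=rate * 5)
    return buckets[key].try_acquire()
```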
Salesforce is often the culprit there
we solved that by pushing processing to Salesforce itself - they call us ;-)
but it's terribly hard to debug if something fails, so that's a big trade-off
you may also want to look at AWS Step Functions for inspiration
I was happy to see this. Apache Pulsar is also super interesting because it's basically able to store events on different storage backends (like S3). The only thing I worry about with all these JVM/Apache tools is the ops overhead. It's a pity, because you really need to ponder before adopting them.
That's also what I most appreciate about Pulsar: having BookKeeper offload to blob stores. Kafka has recently made public moves to support that as well. I expect that eventually everything will have a Kubernetes operator; it doesn't solve all the problems, but it solves a lot of them
Yeah, but then that's another piece to hold knowledge about. I don't know, I feel like we're getting farther and farther from simplicity here 😄