The Art and Science of Workflow Systems

Rohit Kumar
11 min read · Mar 10, 2024

--

I have been working with workflow systems a great deal over the past three years, so I would like to summarize what I have learned about designing them.

First, let’s clearly define the workflow system. There is a very simple introduction on Wikipedia. I have open sourced a stateless workflow system that replicates RDS instances using AWS Step Functions, which constitutes a workflow. Within companies I have been exposed to different workflow management systems such as AWS SWF, Camunda, and Temporal. After several twists and turns, I slowly began to think about how to design a workflow system and what important aspects need to be taken into consideration. We will talk through the following aspects of the design below.

Scalability

Basically, no matter what infrastructure you design, scalability is an important consideration. For a workflow system, the horizontal scaling of worker nodes is the most important indicator of scalability. Because worker nodes can be added horizontally, tasks must be actively acquired by worker nodes in a pull model, rather than allocated by a scheduling node in a push model. When assigning tasks, you also need to consider fairness: if multiple worker nodes try to pull tasks, which node should get them? For example, suppose each worker node is allowed to execute 5 tasks concurrently, the total number of tasks that can run simultaneously is only 5, and there are 5 worker nodes. Ideally those 5 tasks would be evenly distributed across the 5 nodes, but a simple pull mechanism cannot guarantee this: all 5 tasks may end up on one machine, because that still does not exceed a single node’s concurrency limit.
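To make the pull model concrete, below is a minimal sketch of a worker loop that only pulls tasks while it has spare capacity. `task_queue.poll()` and `task.run()` are hypothetical placeholders for whatever the real backend provides; note that nothing here prevents one idle node from grabbing every available task, which is exactly the fairness problem described above.

```python
import threading
import time

MAX_CONCURRENT = 5  # per-node concurrency cap from the example above

class Worker:
    """Pull-based worker: it asks for work instead of having work pushed to it."""

    def __init__(self, task_queue):
        self.task_queue = task_queue   # hypothetical shared queue
        self.running = 0
        self.lock = threading.Lock()

    def loop(self):
        while True:
            with self.lock:
                has_capacity = self.running < MAX_CONCURRENT
            if has_capacity:
                task = self.task_queue.poll(timeout=5)  # pull, not push
                if task is not None:
                    with self.lock:
                        self.running += 1
                    threading.Thread(target=self._execute, args=(task,)).start()
                    continue
            time.sleep(1)  # back off when at capacity or the queue is empty

    def _execute(self, task):
        try:
            task.run()
        finally:
            with self.lock:
                self.running -= 1
```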

On the other hand, generally speaking, all tasks should be idempotent; that is, they can be submitted for execution repeatedly, and executing them several times produces the same result as executing them once. Task execution on a worker node can fail at any step, and as the number of nodes grows, such failures become the norm rather than the exception. The health status of worker nodes needs to be maintained and communicated in some way, and the most typical, cheap, and effective mechanism is the heartbeat.
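Here is a minimal sketch of the heartbeat idea, assuming a shared key-value store with `put`/`get` (hypothetical names): workers periodically write a timestamp, and the scheduler treats any node whose timestamp has gone stale as dead and re-queues its tasks, which is only safe because the tasks are idempotent.

```python
import time

HEARTBEAT_INTERVAL = 10   # seconds; illustrative values
HEARTBEAT_TIMEOUT = 30    # scheduler declares a node dead after this

def heartbeat_loop(node_id, store):
    """Worker side: periodically record 'I am alive'."""
    while True:
        store.put(f"heartbeat/{node_id}", time.time())
        time.sleep(HEARTBEAT_INTERVAL)

def find_dead_nodes(store, node_ids):
    """Scheduler side: a node whose last heartbeat is too old is presumed dead,
    and its in-flight tasks can be re-queued (idempotency makes the re-run safe)."""
    now = time.time()
    return [n for n in node_ids
            if now - store.get(f"heartbeat/{n}", 0) > HEARTBEAT_TIMEOUT]
```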

Functional Decoupling

  • Decoupling resource management from task management. I have only seen this in a few workflow systems. Task management is available in almost all of them, but separate resource management is not. For example, I can write a task that executes on EMR, and you can also write a task that executes on EMR. The logic for managing EMR can be shared between us as code, but under this architecture it is difficult for my task and your task to share the same EMR resource safely and efficiently. Whether it is resource creation and destruction, status queries, or throttling, everything becomes troublesome. Similar examples include sharing a database, a printer, or even another workflow system. When a resource is expensive, we often want to manage one or a few instances of it in a unified way at the workflow level while sharing them across a large number of tasks.
  • Decoupling business logic from scheduling logic. This is available in basically all workflow systems. Scheduling logic has nothing to do with the business and is a relatively “dead” thing: it manages the status of the workflow and the success or failure of each task. Business logic is the “living” flesh and blood that makes up the workflow and its tasks. I have not seen any workflow system that writes business code and scheduling logic together.
  • Decoupling status queries from the scheduling system. Scheduling is only one core aspect of a complete workflow system; without a good status query system, the maintenance workload is huge, so the two must be decoupled. For example, the status of workflow and task executions must be persisted in some storage medium, such as a relational database, a NoSQL database, or disk log files. The scheduling system is then the main writer of this information, while reads may come from the scheduling system or from the status query system. The storage format, or schema, must be relatively stable, and the consistency and availability of this storage are core components of the consistency and availability of the entire system.
  • Decoupling the decision-making system from the execution system. The decision-making system decides whether a task meets its conditions and starts its execution; it is the brain of the entire workflow system. The execution system runs the specific tasks; it is the flesh and blood of the entire workflow system.
  • Decoupling the event system from the listening system. Only a few workflow systems cover this. Many have internal event systems, with events such as a task being assigned to a node or a task failing to execute. However, if the system for listening to such events is not independent, it becomes difficult to attach specific follow-up logic to particular events later (a sketch follows this list).
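To illustrate the last point, here is a minimal sketch (all names hypothetical, not any particular system’s API) of an event bus in which the scheduler only publishes events and listeners are registered independently, so reacting to a new event type never requires touching scheduling code.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Event:
    kind: str          # e.g. "task_assigned", "task_failed", "task_timed_out"
    workflow_id: str
    task_id: str
    payload: dict = field(default_factory=dict)

class EventBus:
    """The scheduler only publishes; listeners are registered separately."""

    def __init__(self):
        self._listeners: dict[str, list[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, kind: str, listener: Callable[[Event], None]) -> None:
        self._listeners[kind].append(listener)

    def publish(self, event: Event) -> None:
        for listener in self._listeners[event.kind]:
            listener(event)

# Example listener: alert on task failure without modifying the scheduler.
bus = EventBus()
bus.subscribe("task_failed", lambda e: print(f"ALERT: task {e.task_id} failed"))
bus.publish(Event("task_failed", "wf-1", "task-42"))
```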

Synchronous and asynchronous tasks

  • In fact, once independent resource management is considered, the division into synchronous and asynchronous tasks arises naturally.
  • Many tasks need to be performed on the worker node itself. For example, the node makes a downstream call and then writes the processed result to a database. These tasks consume a lot of memory and CPU and need a dedicated thread for their entire duration; these are synchronous tasks.
  • For other tasks, the worker node is not the actual executor of the work but a client of some resource system; it is only responsible for submitting tasks to that system and for managing and monitoring them. For example, for a print task or an outgoing email, you submit the request to the printer or mail server, and afterwards you only need to poll for the status of the task and delete or resubmit it as needed. These tasks usually do not require long-term ownership of a thread, and one thread can handle many of them in a single cycle; they are asynchronous tasks. A special case is workflow nesting: when a workflow calls a sub-workflow, querying the status of the sub-workflow must be an asynchronous task. Asynchronous tasks involve event notification and monitoring mechanisms, which will be mentioned later (a sketch follows this list).
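A minimal sketch of the asynchronous case, with `client.status()` and `client.resubmit()` as hypothetical stand-ins for whatever the external system exposes: one thread can shepherd many submitted jobs because each cycle only checks status and reacts.

```python
import time

POLL_INTERVAL = 30  # seconds; illustrative

def poll_async_tasks(pending, client):
    """One polling thread watches many asynchronous tasks: the worker is only
    a client of the external system (printer, mail server, EMR, a sub-workflow)."""
    while pending:
        for task in list(pending):
            state = client.status(task.external_id)
            if state == "SUCCEEDED":
                pending.remove(task)
            elif state == "FAILED":
                client.resubmit(task.external_id)   # or mark the task as failed
        time.sleep(POLL_INTERVAL)
```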

Distributed lock

In some cases, distributed locks become a necessity, for example in the resource management mentioned earlier. Many resources require exclusive operations; in other words, two operations cannot be invoked concurrently without risking unpredictable problems. When one node operates on a shared resource, it therefore needs to coordinate with other nodes so that the operations of two worker nodes are ordered and correct and do not conflict.

For example, worker node A wants to query the status of the current EMR cluster and, if it has been idle for 10 minutes, terminate it; worker node B queries the status of the same cluster and, if it has not been terminated, submits new computing tasks to it. Without a distributed lock, a problem arises: node B may query first and find the cluster still alive, node A then terminates it without B knowing, and B submits a computing task to a terminated EMR resource, so the submitted task is bound to fail.
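A minimal sketch of how a lock removes this race, assuming a hypothetical lock service with `acquire`/`release` and a hypothetical `emr` client object: both the “terminate if idle” check and the “check liveness then submit” sequence run inside the same lock, so they can never interleave.

```python
from contextlib import contextmanager

@contextmanager
def resource_lock(lock_service, resource_id, ttl_seconds=60):
    """Hypothetical lock service: acquire() blocks until the lock is held;
    the TTL guards against a crashed holder never releasing it."""
    token = lock_service.acquire(resource_id, ttl=ttl_seconds)
    try:
        yield
    finally:
        lock_service.release(resource_id, token)

# Node A: terminate the cluster only while holding the lock.
def terminate_if_idle(lock_service, emr):
    with resource_lock(lock_service, emr.cluster_id):
        if emr.idle_minutes() >= 10:
            emr.terminate()

# Node B: the liveness check and the submit happen inside the same lock,
# so A's termination can never slip in between them.
def submit_job(lock_service, emr, job):
    with resource_lock(lock_service, emr.cluster_id):
        if emr.is_alive():
            emr.submit(job)
        else:
            raise RuntimeError("cluster already terminated; provision a new one")
```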

There are many ways to implement distributed locks, from simple strongly consistent storage systems to more efficient implementations such as specialised distributed lock services.

Functional scalability

I talked about scalability at the performance level above; the same question exists at the functional level.

  • Custom tasks. Almost all workflow systems consider this, and it follows inevitably from decoupling business logic from scheduling logic: since it is impossible to predict all task types when designing a workflow system, users must be able to define their own execution logic (see the sketch after this list).
  • Custom resources. With resource management, there is a need to customize resources.
  • Custom event listening. Event management is an easily overlooked aspect of workflow systems. For example, if I want a special notification sent to me when a certain task times out, the event-listening mechanism must be extensible.
  • Workflow task execution conditions at runtime. Usually the workflow has a definition (meta file) describing how to execute, but some execution parameters and conditions can only be determined at runtime, may depend on the results of a previous execution, or require running some logic to obtain.
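One common way to support custom tasks is an abstract task interface plus a registry, so the DAG definition can refer to task types by name while the scheduler stays ignorant of business logic. This is a hypothetical sketch, not any particular system’s API.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Extension point: the scheduler only knows this interface; users supply
    concrete subclasses containing their own business logic."""

    @abstractmethod
    def run(self, context: dict) -> dict:
        """Receives runtime parameters, returns outputs for downstream tasks."""

TASK_REGISTRY: dict[str, type] = {}

def register_task(name: str):
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return decorator

# A user-defined task type the workflow designers never anticipated.
@register_task("copy_table")
class CopyTableTask(Task):
    def run(self, context: dict) -> dict:
        # ... business logic lives here, not in the scheduler ...
        return {"rows_copied": 0}
```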

Availability and reliability

Most workflow systems adopt a decentralized node design to ensure there is no single point of failure, and to ensure that when load increases, the latency that characterizes availability stays within the expected range. I will not expand on the rest; articles introducing this aspect are everywhere.

Life cycle management

This refers to both the life cycle management of a workflow execution and the life cycle management of a single task.

Talking about these will inevitably involve the following questions:

  • The separation of workflow definition from workflow execution, and of task definition from task execution. The definition specifies the logic to execute; an execution is tied to the actual environment, time, parameters, and so on. There is usually only one copy of the logic (though not necessarily, depending on whether the workflow supports multiple versions, as mentioned later), but multiple executions are recorded as retries occur.
  • Handling of parameter changes when a workflow retries. Some parameter changes do not affect already-completed tasks; others do.
  • Handling of completed tasks when a workflow retries. In some cases we want completed tasks to be re-executed; in other cases we want them skipped.
  • The number of retries for a task, and the backoff strategy between retries. For example, wait 5 minutes before the first retry and 10 minutes before the second, with a maximum of 2 retries (see the sketch after this list).
  • How to gracefully end task execution on a worker node. In many cases we have to interrupt task execution on a node, for example because the worker node needs to be restarted. This is not a task failure caused by business code; it is closer to a “resource termination”. In this case the task usually needs to be reassigned to another live node, and the reassignment strategy involved has been mentioned earlier.
  • Task weight or priority. I have only seen this feature in a few workflow systems. When allocating resources, more important tasks can be given higher priority, and the failure of an insignificant task may not even affect the status of the workflow.
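The retry count and backoff strategy mentioned above can be expressed very simply; here is a minimal sketch in which `task.run()` is a placeholder and the numbers match the example (roughly 5 minutes before the first retry, 10 before the second, at most 2 retries).

```python
import random
import time

def backoff_delay(attempt: int, base_seconds: int = 300, cap_seconds: int = 3600) -> float:
    """Exponential backoff: 5 minutes for the first retry, 10 for the second,
    capped, with a little jitter so many retrying tasks do not synchronize."""
    delay = min(cap_seconds, base_seconds * (2 ** (attempt - 1)))
    return delay + random.uniform(0, delay * 0.1)

def run_with_retries(task, max_retries: int = 2):
    attempt = 0
    while True:
        try:
            return task.run()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise
            time.sleep(backoff_delay(attempt))
```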

Design and expression of task DAG

This is the flow chart of workflow execution and an expression of the dependencies between all tasks. I have seen many expression formats, including XML, JSON, and various home-grown formats. Some workflow systems offer a graphical tool to help build the flow chart. The design of this DSL determines to a large extent whether the workflow is easy to understand and use. The graphical tool, where it exists, is only an aid; it is not the core of the workflow system.
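Whatever concrete format is chosen, the content is the same: tasks, their parameters, and their dependencies. Below is a hypothetical sketch of such a definition expressed as a Python dict (the field names are illustrative, not a real DSL), together with a topological sort that both orders the tasks and rejects cycles.

```python
# Each task names its type (looked up in a task registry), its parameters,
# and its upstream dependencies.
WORKFLOW_DEF = {
    "name": "nightly_report",
    "tasks": {
        "extract":   {"type": "copy_table", "params": {"table": "orders"}, "depends_on": []},
        "transform": {"type": "spark_job",  "params": {"script": "agg.py"}, "depends_on": ["extract"]},
        "load":      {"type": "copy_table", "params": {"table": "report"}, "depends_on": ["transform"]},
    },
}

def topological_order(definition):
    """A DAG is runnable iff it has a topological order; this also detects cycles."""
    tasks = definition["tasks"]
    order, visited, visiting = [], set(), set()

    def visit(name):
        if name in visited:
            return
        if name in visiting:
            raise ValueError(f"cycle detected at {name}")
        visiting.add(name)
        for dep in tasks[name]["depends_on"]:
            visit(dep)
        visiting.remove(name)
        visited.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

print(topological_order(WORKFLOW_DEF))  # ['extract', 'transform', 'load']
```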

In addition, the status and execution history of workflows, along with their archiving and management, also require integrated tooling. Almost all workflow systems provide this, usually as web tools and command-line tools.

Input and output management

This is also a nice-to-have. Every task has inputs and outputs, and these can be handled entirely by the user, for example stored in a file or written to a database, with the workflow system staying out of it and each task reading the relevant data itself. A better approach is to persist common, simple inputs and outputs along with the workflow and task state. This also makes it easy to place expressions in the workflow definition that decide subsequent execution based on the results of a previous task.
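A hypothetical sketch of that idea: small task outputs are persisted alongside task state, and the workflow definition can reference them through simple expressions (the dotted-path syntax here is invented for illustration).

```python
# In a real system this would live in the workflow's state store.
TASK_OUTPUTS: dict[str, dict] = {}

def record_output(task_id: str, output: dict) -> None:
    TASK_OUTPUTS[task_id] = output

def resolve(expression: str):
    """Resolve a reference like 'extract.rows_copied' against stored outputs."""
    task_id, key = expression.split(".", 1)
    return TASK_OUTPUTS[task_id][key]

# A downstream task could then be gated on an upstream result:
record_output("extract", {"rows_copied": 1200})
should_run_transform = resolve("extract.rows_copied") > 0
```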

In addition, there is a slightly less common use case: the management of inputs and outputs across executions. Workflows are usually executed repeatedly, and the data scale of the inputs and outputs of each execution is often something many people care about. I have not seen any workflow system provide this; many users write their own tools and scripts to obtain the information.

Independent metrics and logging system

For metrics, the core content is node health, CPU, memory, task execution time distribution, failure rate, and so on. In some cases users also want to add their own.

Regarding logs, this mainly means archiving and merging. Archiving means historical logs are not lost, or at least not lost within a retention window, with expired logs overwritten so that disk capacity does not become a problem. Merging means logs can be queried and browsed from a unified view, so that when there is a problem you do not have to search each machine manually. The lack of this feature can be very painful: at work I once had to find the node that had abnormally terminated a resource, and I ended up checking the logs of dozens of nodes. Workflow systems like AWS SWF, Camunda, and Temporal all provide ways to retain history, via tables, CloudWatch, and so on.

Version control and smooth deployment

The reason for putting these two together is that code upgrades are inevitable and happen often. To ensure a smooth deployment, the code on all nodes usually cannot be updated at the same time; it has to be done in parts. For example, first take down 50% of the nodes, deploy the code, activate them and confirm success, then proceed to the remaining 50%. During this period, however, old and new code coexist, which usually leads to many strange problems. I have seen two solutions:

  • One is to deploy to all nodes at the same time, so for a period all nodes are inactive. Task timeouts may occur because of this inactivity, and even workflow execution failures, but since the workflow life cycle is managed by a separate scheduling system, most situations are unaffected apart from the timeouts.
  • The other is partial deployment, which is the smoothest, but then the coexistence of multiple versions has to be managed, which also imposes a new requirement on code quality: backward compatibility.

No matter which is chosen, the mechanism itself is relatively simple to implement, but many problems remain. For example, how should external resources be handled? Suppose a Spark task runs on an external EMR resource, and the old code has already been shipped to EMR for execution. If the worker nodes are updated at this moment, what happens to the tasks still running on EMR? Should they be invalidated or kept? If kept, those executions still rely on the old code, and the subsequent processing of their results may conflict with the newly deployed code. Another example: if the workflow definition changes (say, the DAG changes), how should existing executions be handled? Should they be updated or kept as they are? Usually they are kept as they are, because updating them brings many complex issues.
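One common mitigation, sketched here with hypothetical names rather than any particular system’s API, is version pinning: each execution records the definition version it started with, and a worker only picks up tasks whose version its binary can run, so in-flight executions keep their old logic while new executions use the new one.

```python
# Versions this worker binary is able to execute (illustrative values).
SUPPORTED_VERSIONS = {"1.4.0", "1.5.0"}

def can_run(task) -> bool:
    return task.definition_version in SUPPORTED_VERSIONS

def pull_compatible_task(task_queue):
    task = task_queue.poll(timeout=5)   # hypothetical queue API
    if task is None:
        return None
    if can_run(task):
        return task
    task_queue.release(task)   # leave it for a node that still runs the old version
    return None
```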
