Preemption-Aware Allocation and Deadline Assignment for Conditional DAGs on Partitioned EDF

Heterogeneous hardware platforms are often used for implementing complex critical real-time applications, like advanced driver-assistance systems (ADAS) and autonomous driving. Typically, they are composed of CPU hosts and a set of accelerators. To better support real-time workloads, several hardware accelerators, such as GPUs, have evolved to allow preemption of computationally intensive tasks. However, their preemption costs can be very high compared to classical CPU preemption, and therefore must be taken into account at design time and in the scheduling analysis. In this paper, we address two tightly correlated problems: (i) task allocation for a set of real-time tasks, modeled by conditional directed acyclic graphs (C-DAGs), onto multiprocessor platforms under partitioned preemptive Earliest Deadline First scheduling, assuming a non-negligible cost of preemption, and (ii) the assignment of intermediate deadlines and offsets to real-time C-DAGs, so as to remove unnecessary preemptions and reduce the total preemption overhead. The effectiveness of the proposed technique is evaluated using a large set of synthetic task sets.


I. INTRODUCTION
Recent embedded platforms combine several computing cores with different instruction-set architectures and computation capacities. An example of such an architecture is the NVIDIA Xavier-based board, which comprises different accelerators, like GPUs, DLAs, etc., together with a set of classical ARM cores on the same System on Chip (SoC). These platforms are the preferred choice for modern time-critical applications, like Advanced Driving Assistance Systems (ADAS), which need an ever-increasing amount of computational power for executing complex time-critical tasks.
These applications are typically structured as a set of concurrent tasks, each one modeled by a Directed Acyclic Graph (DAG) of sub-tasks. Moreover, they may exhibit dynamic behavior. For example, when an ADAS detects an obstacle, it may run effective algorithms to precisely identify and avoid the obstacle; otherwise, it may continue running less precise but less time-consuming sub-tasks. The Conditional DAG (C-DAG) [4] has been proposed to efficiently model and analyze such dynamic behavior.
Several difficult challenges are encountered when scheduling real-time applications modeled by C-DAGs on such architectures: the choice of the scheduling algorithm, and how to allocate (sub-)tasks to computational resources. Global scheduling may not always be viable on heterogeneous architectures, and task migration may produce a high overhead. In this paper we restrict ourselves to partitioned scheduling. Our strategy consists in allocating sub-tasks to computational resources, and then using fully preemptive EDF on each resource.

This work has received funding from the European Union's ECSEL JU Programme under the SECREDAS Project, grant agreement N. 783119. 978-1-7281-4403-0/20/$31.00 ©2020 European Union
In most of the literature on real-time scheduling, the cost of preempting a task is considered negligible in the schedulability analysis. Depending on the tightness of the real-time constraints, this assumption can hold for classical CPUs. However, preemption can be a very costly operation for massively parallel accelerators. In some extreme cases, the cost of saving and restoring the context may exceed the task's worst-case execution time; in such scenarios, it is more effective to use a non-preemptive scheduling algorithm. However, in other cases, preemption is necessary to ensure schedulability.
Many techniques for limiting preemption have been proposed in the literature (see Section VII for an overview of related work). In this paper, we investigate reducing preemption with a solution orthogonal to those used in classical real-time systems. We now briefly describe our main idea.
First of all, to correctly schedule the sub-tasks of a C-DAG, we need to assign them artificial scheduling deadlines and offsets such that, if every instance of a sub-task executes within the window defined by its offset and its deadline, then all precedence constraints are respected and the task completes before its end-to-end deadline (see Example 1 in Section II). The problem of optimally assigning offsets and deadlines to sub-tasks is very difficult, and many heuristics have been proposed in the literature. In this paper, we use some EDF properties to assign deadlines and avoid preemption when not necessary.
Further, we propose clustering techniques to define a set of task groups, such that tasks of the same group can avoid preempting each other under preemptive EDF. These techniques use properties of our deadline and offset assignment heuristics.
Summarizing, in this paper we propose a novel methodology for 1) assigning artificial deadlines and offsets to the sub-tasks of a C-DAG, and 2) allocating sub-tasks to computational resources, so as to guarantee schedulability and to reduce the overall preemption cost.

II. SYSTEM MODEL

A. Task model
Let T = {τ_1, τ_2, · · · , τ_n} denote a set of n tasks. Each task τ_i ∈ T is represented by a tuple comprising: (i) a Directed Acyclic Graph (DAG) denoted by G(τ_i), (ii) its period T(τ_i), and (iii) its end-to-end deadline D(τ_i). When not ambiguous, we drop the task index for the sake of simplicity.
Each task graph G = {N, E} is composed of a finite set N of nodes and a finite set E of directed edges representing the precedence order between graph nodes. No cycles are allowed in the graph. A node can be either a sub-task v ∈ V or a condition-control node c ∈ C (N = C ∪ V). A sub-task v is an elementary sequential execution block and can be implemented as a single thread. A condition-control node c is a non-deterministic condition evaluated online. According to the value of the condition, one of the two successors of c is selected. (We can easily express the case of a condition with multiple successors by defining successive branches of two-successor condition-control nodes.)
To simplify the model and the presentation of our work, in this paper we restrict ourselves to identical-core platforms. In fact, addressing heterogeneous cores requires a more complex task model where sub-tasks are tagged with the core on which they are allowed to execute (a GPU sub-task can only execute on GPUs, etc.). A more complex model accounting for heterogeneous multicores has been presented in [29]. Allocating a set of C-DAG tasks onto a multiple-ISA heterogeneous architecture can be converted into multiple single-ISA identical multiprocessor allocation problems, without loss of generality. Thus, this work can easily be integrated in [29] to consider heterogeneous architectures.
A sub-task is characterized by its execution time C(v) and by the overhead to account for when preempting sub-task v, called preemption cost and denoted by pc(v). An edge e(n_i, n_j) ∈ E models a precedence constraint (and related communication) between node n_i and node n_j. n_i is an immediate predecessor of n_j if ∃e(n_i, n_j) ∈ E. pred(n_i) denotes the set of all immediate predecessors of node n_i. n_i is a predecessor of n_j if there exists a path from n_i to n_j. If a sub-task has no predecessor, it is a source node of the graph. In our model we allow a graph to have several source nodes. In the same way, n_i is an immediate successor of n_j if n_j is an immediate predecessor of n_i. n_i is a successor of n_j if there is a path from n_j to n_i. If a node has no successors, it is called a sink node. src(τ) denotes the set of all source nodes in task τ, and src(T) = ∪_{τ∈T} src(τ) denotes all source sub-tasks in task set T. |S| denotes the cardinality of the set S.
We consider a sporadic task model; therefore parameter T represents the minimum inter-arrival time between two instances of the same task. When an instance of a task is activated at time t, all source sub-tasks are simultaneously activated. All subsequent sub-tasks are activated upon completion of their predecessors, and sink sub-tasks must all complete no later than time t + D. We assume constrained-deadline tasks, that is, D ≤ T.

Fig. 1: C-DAG task with its sub-tasks' execution times.
We define an execution pattern p_j(τ) of task τ as one possible combination of the outcomes of all condition-control nodes in G. P(τ) denotes the set of all execution patterns. We define the pattern execution time C(p_j(τ)) as the sum of the execution times of all sub-tasks the pattern contains. We denote by vol(τ) the volume of task τ, computed as the maximum execution time among all its execution patterns: vol(τ) = max{C(p_j(τ)), p_j(τ) ∈ P(τ)}, and we define the utilization of task τ as u(τ) = vol(τ) / T(τ). We also define the task set utilization as the sum of the utilizations of all its tasks: U(T) = Σ_{τ∈T} u(τ).

In this paper, we allow sub-tasks of the same task to be allocated onto different cores. We define by V_k(τ) the set of all sub-tasks of task τ that are allocated on core k. τ_k denotes an isomorphic graph of τ where sub-tasks not belonging to V_k have null execution time and null preemption cost, and the sub-tasks belonging to V_k have the same execution time and preemption cost as those in V.
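As an illustration, the volume and utilization definitions above can be sketched as follows, assuming the execution patterns have already been enumerated (one list of sub-task WCETs per pattern; this encoding is ours, not the paper's):

```python
def volume(patterns):
    """vol(tau): maximum, over execution patterns, of the summed WCETs."""
    return max(sum(wcets) for wcets in patterns)

def utilization(patterns, period):
    """u(tau) = vol(tau) / T(tau)."""
    return volume(patterns) / period

# Two patterns induced by a single condition-control node:
pats = [[4, 6, 3], [4, 10]]
assert volume(pats) == 14
assert utilization(pats, 28) == 0.5
```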

Definition 1 (Null predecessor). Let v_i be an immediate predecessor of v_j. If v_i and v_j are allocated onto different cores, then we say that v_i is a "null predecessor" of v_j with respect to its allocation core.
Let π(τ) denote a complete path of task τ (a complete path is a path from one source node to one sink node), and let Π(τ) denote the set of all paths of task τ. We define the slack Sl(π, D(τ)) along path π as:

Sl(π, D(τ)) = D(τ) − Σ_{v∈π} C(v)    (1)

We denote by π*(τ) the path having the maximum sum of the execution times of its sub-tasks, called the critical path.
Example 1. Consider the task described in Figure 1. It starts its execution with a single source sub-task v_1. Then, according to condition C_2, the task may follow the execution pattern v_4, v_8 to join the sink node v_9, or follow the other pattern and execute v_3, then fork a parallel execution of v_5 and v_6, which both join at v_7 before finally executing the sink node v_9. Notice that in this example the source and sink nodes are unique; however, this is not mandatory.
If sub-task v_3 gets preempted, the preempting sub-task has to account for 3 time units of preemption overhead. If the preempting task preempts sub-task v_5, it has to account for 4 time units. It is important to account for the correct preemption cost to guarantee that deadline constraints are respected. The volume vol(τ) is equal to 34.

III. DEADLINES AND OFFSETS ASSIGNMENT
In this paper, we assume that sub-tasks are scheduled by EDF. Therefore, we need to assign artificial deadlines to sub-tasks. Several techniques have been proposed to enforce the respect of precedence constraints across different cores [12], [21], [22], among them assigning artificial offsets to each sub-task. In fact, offsets and deadlines are assigned so that, if each sub-task executes within its artificial offset and deadline, the execution order is respected and the overall task respects its deadline.
We denote by D(v) the artificial deadline of v. The activation time of a task instance is the absolute time of the arrival of source sub-tasks instances. The artificial offset O(v) is the interval between the activation of the task graph and the activation of the sub-task. The absolute deadline of a sub-task instance is the activation time plus the artificial offset plus the artificial deadline D(v). We also define the local deadline as the interval between the task graph activation and the sub-task absolute deadline: it is computed as the sum of its artificial offset and its artificial deadline (O(v) + D(v)).
Most artificial deadline assignment algorithms distribute the slack computed in Equation (1) among the sub-tasks on every path of a task graph. However, they differ in the way this operation is done (referred to as calculate_share in Equation (2)). In Section VII, we describe the most popular techniques proposed in the literature, and in Section VI-B, we propose our own heuristics.
Once artificial deadlines are computed for all sub-tasks, we can automatically assign offsets as follows. As source nodes are activated as soon as the task is activated, their offset is set to 0. For the other sub-tasks, the offset is computed recursively as the maximum among the local deadlines of their immediate predecessors:

O(v) = max_{v'∈pred(v)} (O(v') + D(v'))    (3)

Sub-task v is feasible if, for each task instance arriving at a_j, sub-task v executes within the interval bounded by its arrival time a(v) = a_j + O(v) and its absolute deadline a(v) + D(v).
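The recursive offset assignment can be sketched as below, processing sub-tasks in topological order over a predecessor map (the dictionary encoding of the graph is an assumption of this sketch, not the paper's representation):

```python
def topo_order(preds):
    """Kahn's algorithm over a node -> list-of-predecessors map."""
    nodes = set(preds) | {p for ps in preds.values() for p in ps}
    indeg = {v: len(preds.get(v, [])) for v in nodes}
    succs = {v: [] for v in nodes}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    ready = [v for v in nodes if indeg[v] == 0]
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for s in succs[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

def assign_offsets(deadlines, preds):
    """O(source) = 0; O(v) = max over predecessors of (O + D)."""
    offsets = {}
    for v in topo_order(preds):
        ps = preds.get(v, [])
        offsets[v] = max((offsets[p] + deadlines[p] for p in ps), default=0)
    return offsets

# Diamond graph: v1 -> {v2, v3} -> v4
preds = {"v2": ["v1"], "v3": ["v1"], "v4": ["v2", "v3"]}
D = {"v1": 3, "v2": 5, "v3": 4, "v4": 6}
assert assign_offsets(D, preds) == {"v1": 0, "v2": 3, "v3": 3, "v4": 8}
```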

IV. PREEMPTION-AWARE ANALYSIS
In this paper, we consider a sporadic task system. We show in [29] how to compute and reduce preemption costs.
Lemma 1 (Worst-case preemption). Let V = {v_1, v_2, · · · , v_K} be a set of sub-tasks to be scheduled by EDF on a single core. Consider V^pc = {v'_1, v'_2, · · · , v'_K}, where v'_i has the same parameters as v_i, except for its WCET, which is computed as C(v'_i) = C(v_i) + max_{j≠i} pc(v_j). If V^pc is schedulable by EDF when considering a null preemption cost, then V is schedulable when considering the cost of preemption.
Lemma 1 accounts for the maximum preemption cost in each sub-task's execution time. If the system is schedulable in this configuration, then it is schedulable when considering preemption. The lemma is safe but very pessimistic. The pessimism can be reduced by using Theorem 1.
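A minimal sketch of this pessimistic accounting follows; the encoding of sub-tasks as (name, WCET, preemption-cost) tuples is ours. Every sub-task is charged the largest cost it could pay for preempting any other sub-task on the same core:

```python
def inflate_worst_case(subtasks):
    """Charge each sub-task the maximum pc among all *other* sub-tasks."""
    out = []
    for name, wcet, _ in subtasks:
        worst = max((pc for n, _, pc in subtasks if n != name), default=0)
        out.append((name, wcet + worst))
    return out

V = [("v1", 5, 2), ("v2", 3, 4), ("v3", 6, 1)]
# v1 inflated by max(pc(v2), pc(v3)) = 4, v2 by 2, v3 by 4
assert inflate_worst_case(V) == [("v1", 9), ("v2", 5), ("v3", 10)]
```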
We highlight here the difference between pc(v i ), which represents the cost to preempt v i , and pc i which is the cost that v i needs to account to preempt other sub-tasks.
Definition 2 (Maximal sequential subset). A maximal sequential subset V^M of task τ is a maximal subset of V_τ such that every sub-task of V_τ is either null and does not belong to V^M, or non-null and belongs to V^M. Further, we denote by v^M the sub-task with the shortest local deadline among all sub-tasks in V^M that are either sources, or have a null predecessor.
Theorem 1. Let V = {v_1, · · · , v_K} be a set of sub-tasks to be scheduled by EDF on a single core, and consider V^pc = {v'_1, · · · , v'_K}, where v'_i has the same parameters as v_i, except for its WCET, which is computed as C(v'_i) = C(v_i) + pc_i, and pc_i is computed as follows: pc_i = max_{v_j∈V} pc(v_j) if v_i is the v^M of some maximal sequential subset, and pc_i = 0 otherwise. If V^pc is schedulable by EDF when considering a null preemption cost, then V is schedulable when considering the cost of preemption.
Proofs of Lemma 1 and Theorem 1 can be found in [29]. For space reasons, they are not reported in this paper.
To adapt the pc_i notation to partitioned scheduling, we extend the symbol to pc_i(k), denoting the preemption cost of sub-task v_i when allocated onto core k.
Theorem 2 (Preemption-aware volume). Let T be a set of tasks whose sub-tasks have already been allocated on a set of cores. Consider task τ ∈ T, and suppose that τ is allocated on a subset of cores denoted by K. Let τ_k denote the subset of sub-tasks of τ allocated on core k ∈ K.
Consider now a second configuration for task τ in which all sub-tasks of τ are allocated on the same core j ∈ K; let us call this configuration τ̂_j. The other tasks maintain the same allocation. Then:

Σ_{k∈K} u^pc(τ_k) ≥ u^pc(τ̂_j)

In plain words, splitting the allocation of a task over several cores costs more in terms of utilization than allocating all its sub-tasks to one single core.
Proof. We start by proving that the following inequality holds:

Σ_{k∈K} vol(τ_k) ≥ vol(τ̂_j)    (6)

We must consider two cases. First, consider a task not containing any condition-control node. According to the definition of τ_k, the sub-tasks of τ not allocated to k have null execution times, thus Σ_k vol(τ_k) = vol(τ). Now consider the case where the task contains conditional branches. The two branches of a conditional node each contribute to the task volume on their own core; however, if they are allocated on the same core, only one of them contributes to the volume. Therefore, the left-hand side of Inequality (6) cannot be smaller than the right-hand side. Now we prove that the following inequality also holds:

Σ_{k∈K} Σ_{V^M⊆τ_k} pc_{v^M} ≥ pc_{v^M}(τ̂_j)    (7)

When all sub-tasks are allocated onto one single core j, the maximal sequential subset V^M contains all sub-tasks, thus the right-hand side reduces to a single term. Let us analyze the left-hand side of Inequality (7).
As the last sum (in Equation (9)) is greater than or equal to 0, the inequality in (7) is proved.
Theorem 3 (0-cost preemption). Let T denote a set of tasks allocated on the same core and scheduled using preemptive EDF. T is scheduled without any preemption if and only if:
• all source sub-tasks have the same artificial deadline D_src;
• all other sub-tasks have a deadline shorter than D_src.
Proof. Only-if. Consider a given sub-task v. From Theorem 1, and from the fact that all sub-tasks are allocated on the same processor, we derive that v can only be preempted by a source sub-task. Since all source sub-tasks have a deadline larger than the deadline of v, no preemption can occur on v.
If. By contradiction, assume that at least one preemption occurs. The preempting sub-task must be a source, thus its absolute deadline must be strictly smaller than the absolute deadline of the preempted sub-task, i.e., D_src must be smaller than the deadline of some non-source sub-task. By assumption, this is not possible, thus proving the second leg of the equivalence.
According to Theorem 3, if it is possible to feasibly assign the same deadline to all source nodes, greater than the deadlines of all other sub-tasks of the task system allocated onto the same core without any null predecessor, then the preemption cost is 0. Theorems 2 and 3 will be used to build our allocation and deadline assignment heuristics.
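The condition of Theorem 3 is easy to check mechanically. A sketch follows; the flat deadline-list encoding (one list for source deadlines, one for the rest) is an assumption of ours:

```python
def zero_cost_preemption(source_deadlines, other_deadlines):
    """True iff sources share one deadline and all others are shorter."""
    if len(set(source_deadlines)) != 1:
        return False  # sources must all carry the same deadline D_src
    d_src = source_deadlines[0]
    return all(d < d_src for d in other_deadlines)

assert zero_cost_preemption([10, 10], [4, 7, 9])        # condition holds
assert not zero_cost_preemption([10, 8], [4, 7])        # sources differ
assert not zero_cost_preemption([10, 10], [4, 12])      # longer non-source
```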
We now present preemption-aware deadline assignment using ILP, followed by allocation and deadline assignment heuristics.

V. ILP PREEMPTION-AWARE DEADLINE ASSIGNMENT
Finding the optimal solution to the overall problem (deadline assignment and allocation that minimize the preemption cost) is very complex due to the extremely large parameter space to explore.
In this section, we propose a model based on Mixed Integer Linear Programming (MILP) to assign artificial deadlines to sub-tasks, assuming a single core. The MILP model proposed here can be used by the heuristics for allocation on multicore systems of Section VI-A. In Section VI we will also propose heuristics for deadline assignment as an alternative to the MILP formulation.

A. Decision variables and objective function
Let seq_l(τ) denote the l-th maximal sequential subset of task τ on a given core, and let v^M_l denote its v^M, selected according to Theorem 1. Let D(v) be an integer decision variable expressing the deadline of sub-task v.
Let p_(v_i,v_j) be a binary decision variable expressing the ability of sub-task v_i to preempt sub-task v_j. According to Theorem 1, only the sub-tasks v^M_l of the maximal sequential subsets (∀l) have to account for maximal preemption costs; all the other sub-tasks account for 0 preemption cost. Thus, the variable p is defined for the combination of the sub-task v^M_l of every maximal sequential subset with every other sub-task, where V = ∪_{τ∈T} V(τ) is the set of all sub-tasks in the task set and τ is the task to which v^M_l belongs. The objective function reduces the preemption cost as much as possible; it is modeled as: minimize Σ_l Σ_{v_j∈V} p_(v^M_l,v_j) · pc(v_j).

B. Preemption and deadline assignment constraints
To express the ability of v^M_l to preempt v_j, we impose D(v^M_l) < D(v_j), which can be linearized with a big-M constant M as follows:

D(v^M_l) − D(v_j) + 1 ≤ M · (1 − p_(v^M_l,v_j))
D(v_j) − D(v^M_l) ≤ M · p_(v^M_l,v_j)    (12)

If D(v^M_l) < D(v_j), p_(v^M_l,v_j) is set to 1 so that both constraints in Equation (12) can be respected; otherwise it is set to 0.
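A possible big-M encoding of this strict-inequality indicator can be checked on concrete integer deadline values, outside any solver (the constant M and the two-constraint shape are assumptions of this sketch):

```python
M = 10**6  # any constant larger than every deadline value

def feasible(p, d_m, d_j):
    """p = 1 forces D(v_M) <= D(v_j) - 1; p = 0 forces D(v_j) <= D(v_M)."""
    c1 = d_m - d_j + 1 <= M * (1 - p)
    c2 = d_j - d_m <= M * p
    return c1 and c2

assert feasible(1, 3, 5) and not feasible(0, 3, 5)   # D(v_M) < D(v_j): p = 1
assert feasible(0, 5, 3) and not feasible(1, 5, 3)   # otherwise: p = 0
assert feasible(0, 4, 4) and not feasible(1, 4, 4)   # equality counts as "no"
```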
Other constraints can be linearized in a similar way. Due to space constraints, we do not report here the rest of the linearized constraints.
For each task, we need to impose that the sum of the deadlines along each complete path does not exceed the task deadline:

Σ_{v∈π} D(v) ≤ D(τ), ∀π ∈ Π(τ)    (13)

We highlight that if sub-task v is present in several paths, it has only one decision variable D(v). Moreover, the deadline of each sub-task needs to be greater than or equal to its execution time (the slack needs to be greater than or equal to 0), thus D(v) ≥ C(v).

C. Feasibility constraints
In addition to the above constraints, we need to impose the schedulability of the system.

Theorem 4 (Single-core feasibility). Let T be a set of task graphs allocated onto a single core. Task set T is schedulable by EDF if and only if:

∀t ≥ 0: Σ_{τ∈T} dbf(τ, t) ≤ t    (14)

where dbf is the demand bound function [6] of task graph τ in an interval of length t. The demand bound function is computed as the maximum cumulative execution time of all jobs (instances of sub-tasks) having both their arrival time and deadline within any interval of time of length t. For a task graph, the dbf can be computed as:

dbf(τ, t) = max_{p∈P(τ)} Σ_{v∈p} max(0, ⌊(t − O(v) − D(v)) / T(τ)⌋ + 1) · C(v)    (15)

(We remind that the remainder of a/b is by definition a positive number r such that a = kb + r.) Schedulability can be tested by applying the constraint in Equation (14) for all values of t between 0 and the tasks' hyper-period, replacing C(v) in Equation (15) by the execution time C(v)^pc computed in Equation (16).
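For intuition, a per-pattern demand bound in the spirit of Equation (15) can be sketched as follows; the (C, O, D) triple encoding and the exact formula shape are our assumptions:

```python
import math

def dbf_subtask(t, C, O, D, T):
    """Jobs of a sub-task with both arrival and deadline inside a window
    of length t: floor((t - O - D)/T) + 1, each contributing C."""
    if t < O + D:
        return 0
    return (math.floor((t - O - D) / T) + 1) * C

def dbf_task(t, patterns, T):
    """Task-graph demand: maximum over execution patterns, where each
    pattern is a list of (C, O, D) triples."""
    return max(sum(dbf_subtask(t, C, O, D, T) for C, O, D in p)
               for p in patterns)

# Single pattern, one sub-task: C=2, O=0, D=4, T=10
assert dbf_task(3, [[(2, 0, 4)]], 10) == 0   # window too short
assert dbf_task(4, [[(2, 0, 4)]], 10) == 2   # first job fits
assert dbf_task(14, [[(2, 0, 4)]], 10) == 4  # two jobs fit
```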
It is time-consuming to compute the exact dbf as in Equation (15). Several dbf approximations have been proposed in the real-time systems literature. One of the simplest, for conditional DAGs, has been presented by Baruah et al. in [3]. We enhance this approximation to take preemption costs into account; our modification is described in Equation (17). First, let us define vol(τ)^pc as the task volume when accounting for preemption overheads, i.e., the maximum pattern execution time computed with the inflated execution times C(v)^pc of Equation (16). The dbf can then be approximated as:

dbf*(τ, t) = vol(τ)^pc + u^pc(τ) · (t − D(τ)) if t ≥ D(τ), 0 otherwise    (17)

where u^pc(τ) = vol(τ)^pc / T(τ) is the task utilization when taking preemption costs into account.
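A linear bound in the spirit of this approximation can be sketched as follows (the exact functional form is our assumption): the whole preemption-inflated volume is charged once, plus the utilization times the remaining window.

```python
def dbf_approx(t, vol_pc, deadline, period):
    """Linear demand bound: 0 before the deadline, then the inflated
    volume plus utilization times the remaining window length."""
    if t < deadline:
        return 0.0
    u_pc = vol_pc / period
    return vol_pc + u_pc * (t - deadline)

assert dbf_approx(5, 8, 10, 20) == 0.0    # before the first deadline
assert dbf_approx(10, 8, 10, 20) == 8.0   # full volume charged once
assert dbf_approx(30, 8, 10, 20) == 16.0  # plus 0.4 utilization * 20
```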
The approximation described here is only valid when all sub-tasks of the same task are allocated on the same processor. If one or more sub-tasks of the same task are allocated on a different processor, the approximation does not guarantee the respect of the artificial deadlines of the sub-tasks, and hence of the precedence constraints. Therefore, if all sub-tasks of a task are allocated onto the same processor, we use the approximation; otherwise we resort to the exact dbf.

VI. ALLOCATION AND DEADLINE ASSIGNMENT HEURISTICS
In this section, we present the different heuristics used in this paper to assign deadlines and allocate sub-tasks onto processors so as to minimize the impact of the preemption cost on the schedulability of the system. As mentioned in Section IV, to achieve optimal and sub-optimal solutions, the deadline assignments of different sub-tasks of different tasks have to be considered at the same time, because the cost of preemption is a function of all sub-tasks allocated on the same core.
Thus, our algorithm consists of 3 steps: (i) group tasks (sub-tasks) using a preemption-aware clustering algorithm, (ii) assign deadlines using either the single-core ILP deadline assignment of Section V or the deadline assignment heuristic described further below, and (iii) re-adjust task groups and allocate each group onto a core. This 3-step approach is described in Algorithm 1.

A. Allocation heuristics
Algorithm 1 starts by clustering tasks (Line 3) into separate groups, such that each group has total utilization strictly greater than 1 (except the last one).
If the number of groups is greater than the number of available cores, the system is not feasible, as will be proved in Lemma 2.
Further, task groups are sorted in non-increasing order of total utilization (Line 7). For each task group, we first assign deadlines using either the ILP, or one of the preemption-aware heuristics described later on.

Algorithm 1 (fragment):
9:  while (!feasible) do
10:    assign_deadlines(T_c, ASSIGN_PARAM)
11:    feasible = test_feasibility(T_c)
12:    if (!feasible) then removed += omit_subtasks(T_c)
13:  end while
14:  insert(removed, T_|clusters|)
15:  if |clusters| > m then return FAIL
16:  end for
17:  return SUCCESS

Further, for each group the algorithm tests the feasibility of the sub-tasks in the group (except when the ILP approach is used, because it always produces a feasible task set). If the system is not feasible, a sub-task is selected to be removed from the current group (Line 12). When a sub-task is removed from the group, it is replaced by a null sub-task with execution time and preemption cost equal to 0. The removed sub-task is inserted into the removed list and will be inserted into the last task group.
omit_subtasks starts by selecting randomly a task in the task group. Then, within the selected task, it selects a sub-task according to one of the following heuristics:
• random heuristic: the sub-task to omit is selected randomly among all sub-tasks of the task;
• preemption-aware heuristic: this heuristic behaves differently when applied to a task with no null sub-tasks and when applied to a task containing at least one null sub-task. In the first case, the algorithm selects the sub-task with the largest execution time among all sub-tasks that do not belong to the critical path of the task. If all sub-tasks of the task belong to the critical path, then the last one in the critical path is chosen. In the second case (presence of at least one null sub-task), the heuristic tries to avoid creating too many maximal sequential subsets. Hence, it looks for null sub-tasks and removes one of their immediate predecessors or successors, thereby reducing the preemption costs. Among all candidates, it gives priority to the one with the largest execution time that does not belong to the critical path.
Notice that, when two consecutive sub-tasks are removed, their deadlines and offsets might later be reassigned when moving them to a different group.
The system iteratively tests schedulability until a feasible schedule is found, by invoking test_feasibility. test_feasibility uses dbf-based tests according to two situations: if all sub-tasks of the same task are allocated on the same core, it uses the dbf approximation described in Equation (17); otherwise it uses the exact dbf described in Equation (15). Since a sub-task is removed at every iteration, the while loop converges, in the worst case to the trivially feasible case of no sub-tasks. The non-allocated sub-tasks contained in the removed list are added to the last task group, to be allocated in future iterations. Further, the algorithm invokes the clustering algorithm to add the removed sub-tasks.
As a consequence, new clusters may be produced. The algorithm fails whenever the number of clusters is greater than the number of available cores.
We now describe the clustering algorithm. First of all, tasks are sorted according to the following order relationship.
Definition 3 (Task order function). For a task τ_i, we denote by γ(τ_i) the average artificial deadline of task τ_i, computed as:

γ(τ_i) = (1 / |Π(τ_i)|) · Σ_{π∈Π(τ_i)} D(τ_i) / |π|

where |π| denotes the number of sub-tasks in path π, and Π(τ_i) is the set of all paths in τ_i.
Let τ_i, τ_j be two tasks. The order relationship τ_i > τ_j is defined as γ(τ_i) > γ(τ_j). Notice that > sorts tasks according to their average deadline. If two tasks have similar average deadlines, then it is likely possible to group them on the same processor: then, as stated by Theorem 3, we can reduce the cost of preemption by assigning the same deadline to their source sub-tasks, and shorter deadlines to the following sub-tasks. On the contrary, if we group tasks with very different average deadlines on the same core, it is unlikely that the same large deadline can be assigned to all source sub-tasks, leading to a large preemption cost.
Once the tasks have been sorted according to their average deadline, the clustering algorithm adds tasks one by one to the current group until its total utilization exceeds 1. As a consequence, tasks having a utilization greater than 1 are put in their own group (we observe that this approach is similar to the federated scheduling framework). When a group has a utilization greater than 1, a new cluster is created.
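The clustering step can be sketched as below, with tasks reduced to (γ, utilization) pairs (an encoding of ours): sort by average deadline, then pack greedily, opening a new group whenever the current one exceeds utilization 1.

```python
def cluster(tasks):
    """Greedy grouping of (gamma, u) pairs sorted by gamma."""
    groups, current, u_cur = [], [], 0.0
    for gamma, u in sorted(tasks):      # non-decreasing average deadline
        current.append((gamma, u))
        u_cur += u
        if u_cur > 1.0:                 # group full: open a new one
            groups.append(current)
            current, u_cur = [], 0.0
    if current:
        groups.append(current)
    return groups

tasks = [(8.0, 0.6), (3.0, 0.5), (5.0, 0.7), (9.0, 0.4)]
groups = cluster(tasks)
assert len(groups) == 2
assert groups[0] == [(3.0, 0.5), (5.0, 0.7)]   # 0.5 + 0.7 > 1 closes it
```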
Lemma 2 (Necessary test). Let T be a task set and M the number of clusters of T obtained by our clustering algorithm.
If M > m, then the task system is not feasible.
Proof. Each cluster, except possibly the last, has a total utilization strictly greater than 1. Thus, if M > m, the total utilization exceeds m, and the system is not schedulable on m cores.

B. Deadline assignment heuristics
The deadline assignment step has a large impact on schedulability and preemption overheads. In fact, a good deadline assignment technique can allow us to avoid costly preemption or even reduce the preemption cost to zero as proven in Theorem 3. In this section, we will show how to assign deadlines while taking into account preemption costs.
First, we define the preemption heaviness of sub-task v as w(v) = pc(v) / C(v). According to their heaviness, we define three classes of sub-tasks:
• Non-preemptive: sub-tasks with preemption heaviness greater than or equal to 1. Preempting these sub-tasks costs more than waiting for their completion, so they must be assigned the same shortest possible deadline.
• Heavy: sub-tasks with preemption heaviness greater than a given threshold α and less than 1.
• Preemptive: sub-tasks with preemption heaviness less than or equal to α.
Algorithm 2 assigns deadlines by taking preemption costs into account. The algorithm has three main steps: the first assigns deadlines to all source sub-tasks, the second re-adjusts, if necessary, the source sub-tasks' deadlines, and the final step assigns deadlines to the other sub-tasks.
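The three-way classification is a one-liner to implement; a sketch follows (the concrete threshold value α = 0.3 is an arbitrary assumption of ours):

```python
def classify(C, pc, alpha=0.3):
    """Partition by heaviness w = pc / C into the three classes above."""
    w = pc / C
    if w >= 1.0:
        return "non-preemptive"
    if w > alpha:
        return "heavy"
    return "preemptive"

assert classify(C=4, pc=6) == "non-preemptive"   # w = 1.5
assert classify(C=10, pc=5) == "heavy"           # w = 0.5
assert classify(C=10, pc=1) == "preemptive"      # w = 0.1
```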
The first step starts by computing the base deadline. It represents the maximum deadline that source sub-tasks may be assigned so as to eliminate all possible preemptions according to Theorem 3. It is computed as the maximum between the largest execution time among all sub-tasks of all tasks in the group, and the maximum average deadline γ(τ) among all tasks in the group.

Algorithm 2: Preemption-aware deadline assignment.
Assigning the base deadline to source sub-tasks does not ensure schedulability; thus a necessary test is applied to quickly eliminate unfeasible solutions.
The necessary test computes the slack in the critical path of every task and checks that it is still positive:

D(τ) − D(v) − Σ_{v'∈π*(τ), v'≠v} C(v') ≥ 0    (19)

where v is the source of the critical path π*(τ) and D(v) is set to the base deadline.
If the test in the previous equation fails, the task group is not feasible when assigning the base deadline to source sub-tasks. Therefore, the base deadline is decremented iteratively until Condition (19) becomes true.
After this iteration, the deadline of each source sub-task is set to the maximum between the base deadline and the sub-task's execution time (the per-sub-task slack cannot be negative). If the execution time of the critical path is less than the task deadline (a necessary condition even on an unlimited number of cores), this second step converges.
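The base-deadline search can be sketched as below. Each task is reduced to a pair (end-to-end deadline, summed WCET of its critical path minus the source); the per-source clamping to the source's own execution time is omitted for brevity, and the integer decrement step is an assumption of this sketch:

```python
def fit_base_deadline(base, tasks):
    """Decrement the candidate base deadline until every task's
    critical-path slack (D - base - rest_of_critical_path) is >= 0."""
    def ok(d):
        return all(D - d - rest >= 0 for D, rest in tasks)
    while base > 0 and not ok(base):
        base -= 1
    return base

# Task 1 needs base <= 20 - 9 = 11; task 2 needs base <= 25 - 12 = 13.
tasks = [(20, 9), (25, 12)]
assert fit_base_deadline(15, tasks) == 11
```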
The third step assigns deadlines to the rest of the sub-tasks. It may use already existing heuristics, such as the fair and proportional deadline assignments described in Section VII, or the preemption-aware deadline assignment described in Algorithm 3. In case a group has no null sub-tasks, all heuristics behave in the same way regarding preemption cost (see Theorem 1). However, in the presence of null sub-tasks, a well-designed heuristic can reduce the preemption cost.
Algorithm 3 is a novel heuristic for assigning artificial deadlines. It starts by selecting all heavy sub-tasks. Then, it selects d_min, the minimum deadline already assigned in a previous step. d_min is an upper bound on heavy sub-task deadlines; otherwise these sub-tasks could be preempted by at least one source sub-task. Then, another upper bound d_b_min is computed as the minimum γ(τ). This step is similar to the source sub-tasks' deadline assignment, except that the minimum is selected instead of the maximum. In fact, if a heavy sub-task's deadline were greater than this value, at least one of the sub-tasks whose deadline is not yet assigned would have a smaller deadline, and hence be able to preempt at least one heavy sub-task.
The minimum of the two upper bounds d_min and d_b_min is selected as the new upper bound. Further, we ensure that this upper bound is greater than the maximum execution time among heavy sub-tasks (Line 6); if it is not, the maximum execution time among heavy sub-tasks is selected instead. Then, heavy sub-task deadlines are reduced in a way similar to the source sub-tasks' deadline assignment. Finally, the deadlines of light non-source sub-tasks are assigned using either fair or proportional distribution (Line 11). We highlight that already assigned deadlines are not reassigned (except when deadlines are canceled inside omit_subtasks in Algorithm 1).
One of the most effective techniques to schedule DAGs on multicore platforms is to assign intermediate deadlines and offsets (in this paper, we refer to them as artificial deadlines and offsets) to sub-tasks in order to enforce precedence constraints. The advantage of such techniques is that a set of dependent sub-tasks is converted into a set of independent sub-tasks with offsets, for which well-known and efficient schedulability analyses exist. However, the optimal assignment of intermediate deadlines and offsets is a difficult problem. The most popular heuristic algorithms are based on the idea of dividing the slack time along each path among all its sub-tasks according to some simple rule, where the share of every sub-task is computed by visiting paths in non-increasing order of cumulative execution time. Two among the many alternative heuristics are: (i) fair distribution, which assigns each sub-task the ratio of the original slack to the number of sub-tasks along the path; and (ii) proportional distribution, which assigns slack proportionally to the contribution of the sub-task execution time to the path. Chetto et al. [12] proposed to schedule sub-tasks according to their original task deadlines; such an approach cannot be used in our context, as tasks are allowed to be partitioned across cores, where every core has its own scheduler. At run-time, we enforce the analyzed schedule by guaranteeing that activation times and deadlines are respected. The authors of [18] studied the deadline assignment problem in distributed real-time systems. They formalized the problem and identified the cases where deadline assignment methods have a strong impact on system performance. They proposed Fair Laxity Distribution (FLD) and Unfair Laxity Distribution (ULD) and studied their impact on schedulability.
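The fair and proportional distributions described above can be sketched as follows (the interface and names are our own, for illustration only):

```python
def distribute_slack(wcets, path_deadline, mode="fair"):
    """Split the slack of a path among its sub-tasks and return the
    resulting artificial relative deadlines (wcet + slack share)."""
    slack = path_deadline - sum(wcets)
    if mode == "fair":
        # equal share of the slack for every sub-task on the path
        shares = [slack / len(wcets)] * len(wcets)
    else:
        # share proportional to the sub-task's contribution to the path
        shares = [slack * c / sum(wcets) for c in wcets]
    return [c + s for c, s in zip(wcets, shares)]
```

For a path with execution times {2, 3, 5} and deadline 20, the slack is 10: fair distribution gives each sub-task 10/3 extra time, while proportional distribution doubles each execution time, yielding deadlines {4, 6, 10}.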
Their work takes into account only schedulability; our techniques additionally consider the cost of preemption, which for some tasks can be larger than their execution time. In [15], the authors analyze the schedulability of a set of DAGs under global EDF, global rate-monotonic (RM), and federated scheduling. Wu et al. [26] propose techniques to set offsets and deadlines using ILP formulations. Qamhieh et al. [24] proposed a sufficient schedulability test for a set of DAG tasks on a multicore platform. They assigned intermediate deadlines and offsets according to path length, using techniques similar to Equation (19).
All the above-cited works consider preemption costs to be negligible. In the presence of high and variable preemption costs, the previous techniques may be ineffective. A radical approach is to consider non-preemptive systems, as in [11], [20]. Bertogna et al. [7]-[9], [27] propose limited-preemption models as a viable alternative between the two extreme cases of fully preemptive and non-preemptive scheduling.
To reduce preemption costs, two main techniques have been proposed. In [25], a task cannot be preempted by tasks up to a given priority, called the preemption threshold. Thus, each task is assigned a priority and a preemption threshold, and preemption takes place only when the priority of the arriving task is higher than the threshold of the running task. On the other hand, Baruah [2] proposed deferred preemption: when a high-priority task is activated on a core where a low-priority task is running, a function is evaluated on-line to determine the longest interval for which the current task can continue to execute non-preemptively without compromising real-time constraints. Finally, fixed preemption points have been introduced to forbid a task from being preempted outside well-defined preemption points specified in the code. A complete survey of such techniques can be found in [10]. All these techniques still consider the preemption cost itself negligible. Some of them can be used to reduce or avoid on-line preemption costs; however, they cannot be used in the case of very high preemption costs.
Phavorin et al. [23] have shown that single-processor EDF is not optimal when preemption costs are considered. This is the closest work to ours. However, they use techniques to build off-line static schedules, whereas our technique is based on EDF. Moreover, we assume C-DAGs, thus our sub-tasks are dependent, whereas Phavorin et al. consider only independent Liu-and-Layland (L&L) tasks. Finally, they consider a single-core platform, while we are interested in partitioned scheduling.

VIII. RESULTS AND DISCUSSIONS
In this section, we evaluate the performance of our deadline assignment heuristics and allocation strategies. We compare the combinations of the heuristics proposed in this paper against fair and proportional deadline assignment combined with the bin-packing heuristics Best Fit (BF) and Worst Fit (WF). We adapted BF and WF to take into account the preemption costs evaluated using Theorem 1. We compare the schedulability rate, the preemption-cost reduction efficiency, and the practical complexity of the techniques cited above on a platform of 4 identical cores.

A. Task Generation
We conducted our experiments on a large number of randomly generated task sets. The generation algorithm starts by producing n utilizations whose sum is equal to x (varied in each experiment) using the UUniFast [13] algorithm, where n is randomly selected in [8, 12].
For each utilization, the algorithm uses UUniFast-discard again to distribute the task utilization among n_v sub-tasks, so as to obtain per-sub-task utilizations; n_v is randomly selected between 7 and 15. The sub-task utilization is multiplied by the task period to compute the sub-task execution time. Two approaches are then used to assign preemption costs and define the task topology: (i) in the first approach, the per-sub-task preemption cost is generated randomly according to a probability P_pc = 0.7: 70% of the sub-tasks are assigned a preemption cost in the interval [0%, 20%] of the sub-task execution time, while the remaining 30% have a preemption cost in the interval [70%, 120%] of their execution time, and sub-tasks are connected randomly; (ii) in the second approach, the per-sub-task preemption cost is computed as a fixed percentage of the sub-task execution time.
In the different experiments, this fixed cost is chosen from the list {0%, 30%, 60%}. Moreover, sub-tasks are randomly assigned to L = 5 layers, and sub-tasks of layer l are randomly connected only to sub-tasks of layer l + 1.
We remark that the second approach has been designed to stress our heuristic. In fact, as the number of sub-tasks is set between 7 and 15 and they are randomly distributed across 5 layers, the number of sub-tasks in a layer can very often be equal to 1. This is unfavorable for our heuristic because, when a task is not feasible on a single processor, the algorithm is forced to split the critical path and generate several maximal sequential subsets. Moreover, as the load is fairly distributed between layers, the fair deadline assignment heuristic has a better chance of achieving a good deadline assignment than our heuristic, which assigns large deadlines to source sub-tasks and over-constrains the sub-tasks in the following layers. The per-task utilization is limited to 60% in the second generation method. Once a DAG has been generated, we transform it into a C-DAG by randomly inserting conditional nodes between sub-tasks to simulate the tasks' dynamic behavior, thus creating new paths without increasing the task utilization. The task period is selected randomly from a predefined list of periods {50, 80, 100, 150, 200, 300, 400, 500, 600, 800, 1200}, so as to bound the hyperperiod. The task deadline is selected randomly in the interval [0.75 · T, 0.85 · T].
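The first stage of this generation process, the UUniFast draw of per-task utilizations, can be sketched as follows (a standard formulation of the algorithm; the function name and seeding interface are our own):

```python
import random

def uunifast(n, total_u, seed=None):
    """UUniFast [13] sketch: draw n utilizations summing to total_u,
    uniformly distributed over the valid utilization simplex."""
    rng = random.Random(seed)
    utilizations, remaining = [], total_u
    for i in range(1, n):
        # peel off one utilization while keeping the remainder uniform
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utilizations.append(remaining - next_remaining)
        remaining = next_remaining
    utilizations.append(remaining)
    return utilizations
```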

B. Simulation results and discussions
We vary the baseline utilization from 0 to 4 in steps of 0.25. Every point in the following figures represents the average value of 100 experiments. In all experiments, the standard deviation is between 2% and 3% of the average value (except for Figure 3c, which is discussed later). The results presented in this section combine several heuristics proposed in this paper and in the literature. Each combination is denoted by 3 letters: (i) the task allocation heuristic, which can be C for clustering, B for best fit, or W for worst fit; (ii) the deadline assignment heuristic, which can be P for the preemption-aware heuristic, F for fair deadline assignment, or I for ILP; (iii) the sub-task selection heuristic (Line 15 in Algorithm 1), which can be R for random selection or P for preemption-aware selection. In Figure 2a, tasks are generated using the first method. The figure shows the schedulability rate of the CPP, CPR, CFR, WFP, BFP, and CIP combinations as a function of the total utilization. CIP uses clustering for allocation, preemption-aware omitted sub-tasks, and an ILP to assign per-group deadlines, thus it presents the highest schedulability rates. CPP and CPR combine the preemption-aware heuristics, and they present very high schedulability rates. CFR takes advantage of the clustering properties that allow grouping sub-tasks with a "fair" deadline distribution. Even when combined with optimal deadline assignment techniques, BF and WF show low performance, because (i) the preemption cost depends on the allocation, and (ii) these heuristics add sub-tasks individually, with no global state considered.
In Figure 2b, tasks are generated using the second method with a preemption cost equal to zero. The goal of the experiments reported in this figure is to study the effectiveness of our approaches in the absence of preemption costs. The figure reports the schedulability rate as a function of the total utilization for CPP, CPR, and WFP (BFP is omitted to avoid cluttering the figure, as it is outperformed by WFP). As the workload increases, heuristics based on fair deadline assignment perform slightly better than the preemption-aware heuristics. In fact, the preemption-aware heuristics try to reduce the number of preemptions even when the preemption cost is null, and they tend to unnecessarily over-constrain the non-source sub-tasks. CIP outperforms all other heuristics, as it combines clustering with optimal deadline assignment.
In Figures 3a and 3b, tasks are generated according to the second method, with a fixed preemption cost of 30% and 60% of the sub-task execution time, respectively. The schedulability rate falls sharply even for low utilizations: as the preemption cost increases, reducing the number of preemptions becomes essential for schedulability. Again, CIP outperforms all other heuristics. We observe that at low workloads CPP and CPR provide performance close to CIP; however, as the workload increases, CIP dominates all the others. In contrast, the BF and WF schedulability rates fall more sharply, and they are not able to achieve more than a 10% schedulability rate even at very low utilization. Figure 3c represents the preemption-cost reduction efficiency as a function of the total utilization for schedulable task sets. The preemption reduction efficiency of vertex v_i is defined as ε(v_i) = pc_i(h) / pc(v_i), i.e., the preemption cost incurred by v_i under heuristic h divided by its preemption cost pc(v_i). In the figure we show the total preemption efficiency (the sum over all sub-tasks). It quantifies the effectiveness of a heuristic in reducing preemption costs: the lower, the better. Only schedulable task sets are considered in this figure. Note that, since the number of schedulable task sets decreases substantially at high utilization, the standard deviation increases from 10% on the left part of the figure to more than 50% on the right part, so those points are less meaningful. As expected, clustering-based approaches are generally more effective in reducing the preemption cost than WF-based approaches. The inversion at very low utilizations can be explained by observing that WF distributes tasks onto different cores first, thus sub-tasks of the same task may be allocated onto the same core with less contention. On the other hand, clustering-based approaches tend to cluster all tasks onto the same core, thus the preemption cost is higher at very low utilizations.
When the load increases, WF fits more tasks on the same core, potentially increasing the number of preemptions compared to our heuristics, which are designed to reduce both the number and the cost of preemptions. CIP is more efficient, as it assigns optimal deadlines with the goal of minimizing preemption costs. Figure 3d shows the execution time of the analysis for schedulable task sets under the clustering-based and WF-based approaches. Even if the theoretical complexity of the clustering-based approaches seems greater than that of the classical bin-packing heuristics, in practice they are more efficient. In fact, the clustering-based approaches group tasks only once and assign them artificial deadlines and offsets before proceeding with the allocation, whereas, when using bin-packing heuristics, the deadline assignment algorithm is invoked again at each allocation. We observe that the good performance shown by CIP does not come for free: its execution time increases considerably as the utilization increases, because the clustering heuristic is called more often to re-assign non-allocated sub-tasks to cores.
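The total efficiency metric plotted in Figure 3c can be sketched as follows (function and parameter names are illustrative):

```python
def total_preemption_efficiency(pc_incurred, pc_vertex):
    """Sum over sub-tasks of pc_i(h) / pc(v_i): the preemption cost
    incurred under heuristic h divided by the sub-task's preemption
    cost. Lower totals mean more preemption overhead was removed."""
    return sum(h / c for h, c in zip(pc_incurred, pc_vertex) if c > 0)
```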

IX. CONCLUSIONS AND FUTURE WORK
In this paper we propose techniques to allocate C-DAG tasks onto identical multicore platforms while accounting for the cost of task preemption. Since the cost of preemption can be very large on modern GPUs, our technique reduces the number of preemptions by setting appropriate artificial deadlines for sub-tasks and by allocating tasks with similar deadlines on the same core. The results of our extensive synthetic experiments show a significant reduction in the total preemption cost when combining preemption-aware allocation with preemption-aware deadline assignment. The work presented in this paper can be extended to heterogeneous platforms with multiple ISAs and multiple capacities, as in [29]. The authors of [29] proposed to convert the problem of allocating a set of DAG tasks onto a heterogeneous-ISA architecture into multiple problems on single-ISA architectures; starting from that level, our approach is complementary to their proposal. We also plan to investigate other preemption-reducing techniques, such as the deferred preemption proposed by Baruah [2], and to extend our approach to take communication overheads into account.