January 25, 2013

Abstract

One of the challenges of real-time, performance-critical multi-core systems is the efficient scheduling of executed tasks. The scheduling problem consists of assigning tasks to the different cores and deciding on their order of execution. A special case is heterogeneous multi-core platforms, where the cost of execution varies among the different processors. In this paper we present static and dynamic scheduling approaches, discuss their pros and cons, and demonstrate a dynamic deadline-oriented scheduling algorithm with a low processing footprint. We apply the two approaches to the OpenRadio real-time Wi-Fi platform operating at high rates and demonstrate that feasible schedules can be obtained. Using our scheduling algorithm we examine the implications of hardware parameters on the feasibility of Wi-Fi processing. We also propose several possible improvements to the dynamic scheduling algorithm.

Definition 2. Optimal Schedule: A schedule that is optimal according to a certain metric (for instance makespan minimization).

Definition 3. Hard Real-Time System (or platform): A system that must operate within the confines of a stringent deadline. Missing the deadline results in a system failure.

Definition 4. Static Scheduling: A schedule is pre-calculated in advance and used later on during runtime as is without the need for any further runtime decisions or computations.

Definition 5. Dynamic (runtime) Scheduling: Task scheduling decisions are made at runtime by the executing platform.

One of the challenges of real-time, performance-critical multi-core platforms is efficient task execution scheduling. We are motivated by the OpenRadio project [2], which aims to run a programmable wireless base-station on widely affordable hardware. The hardware platform currently used by OpenRadio is TI's Keystone, which has 2 DSP chips, each with 2 cores, 4 Viterbi co-processors and 2 FFT co-processors. The Wi-Fi and LTE standards define deadlines for the receiving party to send an acknowledgement, which confronts us with tight scheduling requirements. Wi-Fi (802.11) in particular requires sending an Ack 15μs after the packet has been sent. To process wireless packets in real time we need to use multiple cores in a smart way. A great deal of research has been done on the broad subject of scheduling, both inside and outside the area of computers and hardware. Yet multi-core heterogeneous hard real-time platforms such as OpenRadio present a special case, and while we rely on previous work in this area, we have to adapt it to the specific needs and constraints of our scenarios. Some tasks are better suited to one processor than to another, and communication times between different processors must also be taken into account.

We explore two main approaches to scheduling: static and dynamic (runtime) scheduling. Static scheduling means that we pre-calculate the assignments of the tasks and their start times at design or compilation time. Computing the schedule at compile time has different implications than computing it once during the design phase: compile-time scheduling tolerates longer computation than real-time scheduling, but still requires it to be on the scale of seconds to minutes.

Within the static approach we need to take into account that a Wi-Fi packet can contain a variable number of OFDM symbols. The symbols arrive at intervals of 4μs, and the 802.11 standard requires responding with an ACK 15μs after the last symbol has been transmitted. It is impractical to fully unroll the flow graph for every possible packet size and pre-calculate a schedule for each case. Therefore, in order to deal with a large and variable number of symbols, we adapt the approach of iterative modulo scheduling [4]: we produce a repeatable schedule that can be tiled again after 4000 cycles (which, given the processor clock rate, roughly corresponds to the arrival of a new symbol). This imposes additional constraints, which are discussed in section 2. With dynamic runtime scheduling this is a non-issue: new tasks are scheduled as each new symbol arrives.

Static scheduling has many advantages for relatively deterministic real-time workloads: a static scheduler can gauge feasibility ahead of the workload and provide a theoretically optimal schedule for the task. Wi-Fi baseband processing, as implemented on the OpenRadio platform, is entirely deterministic, making static scheduling a good option. However, static scheduling also has downsides. The biggest one is that static schedules deal poorly with non-determinism [6]. While Wi-Fi is deterministic, other workloads that will run on the OpenRadio platform in the future, such as LTE, are not, and it may be impossible to employ a suitable static schedule. Static schedulers can also face problems when resources are not fixed at runtime. Should the OpenRadio platform implement multi-channel Wi-Fi or other competing workflows, varying resources may be an issue for a static schedule [5]. Altering the hardware would also require regenerating the static schedules, which can be time consuming.

Dynamic scheduling does not have these issues. Because decisions are made at runtime, dynamic schedulers are more agnostic to non-determinism, can easily be configured to work with a wide variety of core configurations, and are able to take resource availability into account in scheduling decisions [6]. However, dynamic schedulers add overhead, which can be significant if the scheduled tasks are very short. Dynamic schedulers may also produce suboptimal schedules, potentially failing where a static scheduler would succeed. Despite this, the adaptability of dynamic scheduling makes it essential for many non-trivial workloads.

Forming optimal schedules for a hard real-time system can be accomplished by solving a mixed-integer linear program (MILP) [8, 7]. The constraints used in the MILP formulation correspond to the physical and temporal limits of the system. The MILP is formally specified with the following parameters: a set of n jobs {J_{1},…,J_{n}}, a set of m processors {M_{1},…,M_{m}}, a job dependency graph G with edges (J_{i},J_{j}) where J_{j} must run after J_{i} is complete, a set of job-processor affinities (processing times) {p_{i,j}}, and processor-to-processor communication latencies {q_{s,t}}. Solving the MILP gives the optimal start times of each job, {x_{1},…,x_{n}}, x_{i} ∈ Z+, and binary processor assignment indicator variables {y_{1,1},…,y_{n,m}}, where y_{i,j} = 1 if J_{i} is assigned to M_{j}.

In the simplest case, we wish to minimize the longest completion time of the jobs. The completion time of each job is given by C_{i} = x_{i} + ∑_{j=1}^{m} y_{ij} p_{ij}. The full description of the MILP is as follows:

Minimize C_{max} such that:

C_{max} ≤ d  (hard deadline)

C_{i} ≤ C_{max}, 1 ≤ i ≤ n  (definition of objective function)

∑_{j=1}^{m} y_{ij} = 1, 1 ≤ i ≤ n  (each job is allocated to exactly one processor)

(2 - y_{ik} - y_{jk})·D + z_{ijk}·D + (x_{i} - x_{j}) ≥ p_{jk}, 1 ≤ i < j ≤ n, 1 ≤ k ≤ m  (tasks do not overlap on a single processor)

(2 - y_{ik} - y_{jk})·D + (1 - z_{ijk})·(p_{jk} + D) + (x_{j} - x_{i}) ≥ p_{ik}, 1 ≤ i < j ≤ n, 1 ≤ k ≤ m

x_{i} + ∑_{k=1}^{m} y_{ik} p_{ik} + q_{st} + (y_{is} + y_{jt} - 2)·D ≤ x_{j}, 1 ≤ s,t ≤ m, if G contains an edge (J_{i},J_{j})  (dependency constraint with communication latency)

(2 - y_{ik} - y_{jk})·D + z_{ijkr}·(p_{jk} + D) + (x_{i} - x_{jr}) ≥ p_{jk}, 1 ≤ i,j ≤ n, j ≠ i, 1 ≤ k ≤ m, 0 < |r| ≤ R  (repeated schedules do not overlap)

(2 - y_{ik} - y_{jk})·D + (1 - z_{ijkr})·(p_{jk} + D) + (x_{jr} - x_{i}) ≥ p_{ik}, 1 ≤ i,j ≤ n, j ≠ i, 1 ≤ k ≤ m, 0 < |r| ≤ R

The value D is a constant chosen to be large enough that constraints involving it are always satisfied unless its coefficient is 0. The variable z_{ijk} is equal to 1 if job i precedes job j on processor k, and 0 if job j precedes job i on processor k; its value does not affect the MILP if jobs i and j are assigned to different processors. In the repetition constraints, x_{jr} denotes the start time of job j offset by r schedule periods, and z_{ijkr} plays the role of z_{ijk} for that copy. The integer R is the number of copies of the schedule that may overlap within the deadline.
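The constraints above can also be checked directly on a candidate schedule. The following Python sketch (our own illustration, not part of the original toolchain; all names are ours) verifies the deadline, single-processor non-overlap, and dependency-with-communication constraints for given start times and processor assignments:

```python
# Sketch: verify that a candidate schedule satisfies the MILP constraints
# (deadline, non-overlap on each processor, dependencies with communication
# latency). All function and variable names are our own illustration.

def schedule_feasible(x, y, p, q, edges, d):
    """x[i]: start time of job i; y[i]: processor assigned to job i;
    p[i][k]: processing time of job i on processor k;
    q[s][t]: communication latency from processor s to processor t;
    edges: (i, j) pairs where job j depends on job i; d: hard deadline."""
    n = len(x)
    # Completion times C_i = x_i + p_{i, y_i}; every job must finish by d.
    C = [x[i] + p[i][y[i]] for i in range(n)]
    if max(C) > d:
        return False
    # Jobs assigned to the same processor must not run concurrently.
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j] and x[i] < C[j] and x[j] < C[i]:
                return False
    # A dependent job starts only after its predecessor completes,
    # plus the inter-processor communication latency.
    for (i, j) in edges:
        if x[j] < C[i] + q[y[i]][y[j]]:
            return False
    return True
```

A feasibility check like this is also what a schedule simulator needs in order to validate runtime-produced schedules against the same constraints the MILP encodes.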

For different radio protocols, we have a different specification of the jobs and job-processor affinities, which we can use to automatically generate the full MILP as specified. We use MATLAB to store and modify the problem specifications and to generate input for the MILP solver. To compute the optimal schedules we used the IBM ILOG CPLEX optimizer [1], a high performance convex optimization toolkit. We configure CPLEX to output all feasible schedules, which we can parse with a simple script and present in MATLAB using the Torsche schedule viewer toolbox [9].

Wi-Fi baseband processing consists of a deterministic set of tasks, and so is an ideal candidate for static scheduling. However, static scheduling presents two limitations that need to be overcome for the future of the OpenRadio platform. First, static schedules can take time to reproduce, and can require tools that are not widely available. The OpenRadio platform we use is merely one example of a variety of potential configurations created from commodity parts to run OpenRadio tasks. A dynamic scheduler can easily be configured using basic tools (MATLAB or even Excel) to work with many platform configurations. Additionally, while Wi-Fi is the only task currently implemented on OpenRadio, the intention is to implement many more wireless tasks such as LTE. These are not deterministic, and it is unclear how to apply a static schedule to them. Therefore, while the Wi-Fi taskset is not ideal for demonstrating the advantages of dynamic scheduling, we look at feasible dynamic scheduling for Wi-Fi as an indicator of the potential feasibility of dynamic scheduling for less deterministic wireless tasks.

Due to hardware restrictions, we were unable to implement a dynamic scheduler on the real OpenRadio platform; however, we made significant progress on how to organize an effective and lightweight scheduler. We also built a schedule simulator to demonstrate the feasibility of schedules produced purely by lightweight runtime decisions (fast functions of only the current state).

An essential part of any dynamic scheduler is how it chooses where to run the variety of tasks available at any given time. A successful algorithm for our purposes must satisfy three criteria:

- The algorithm should have a low computational footprint.
- The algorithm must be suitable for the real time nature of the workload, and focus on meeting deadlines.
- The algorithm should take into account the suitability of various cores to the task at hand, and work on a heterogeneous architecture.

The first and second conditions are a given with Wi-Fi; Wi-Fi is a real-time application, and the deadlines are extremely tight, so any algorithm needs to be tailored towards this purpose. The third is necessary because of the OpenRadio platform’s heterogeneous nature, where some cores are much better suited to some tasks than others; any scheduling algorithm should take the differences in runtime into account.

An example of a suitable algorithm is one that gauges the ‘forward surplus’ of cores and assigns the task to the most suitable one [10]. Forward surplus is a generic term for any metric that gauges the ‘availability’ of a core for a specific task. A variant of the algorithm takes up to three parameters: the task’s runtime (which may vary per core), the task’s deadline, and the task’s start time (which again varies by core, and is the time when the core finishes all other tasks currently running and queued on it). The most basic variant is naïve surplus, which is just the difference between the deadline and the start time. The start time can easily be memoized and updated as we assign tasks to a core, so this algorithm is extremely cheap to compute, consisting of only a single subtraction. However, it does not take the runtime of the task being scheduled into account, making it unsuitable for meeting deadlines and assigning tasks to the most suitable cores. Instead, we investigated two more sophisticated variants. Time-based forward surplus is simply the naïve surplus minus the runtime of the task being scheduled on that core; put simply, this is a measure of the slack a core has in running a task. The other is ratio-based forward surplus, which is the ratio of naïve surplus to the runtime of the task. Ratio-based surplus essentially takes the naïve metric and improves it by weighting surplus scores by the speed of the processor. This has the advantage of valuing processors that are faster for a given task more highly.

1: for i ← 1, number of cores do    ▷ forward surplus for task s
2:     runtime ← processing time of s on i
3:     naive_surplus ← deadline − sum of remaining runtimes of tasks on processor i
4:     forward_surplus(i) ← naive_surplus − runtime    ▷ time-based; replace − with / for ratio-based
5: end for
6: assignment ← max index of array forward_surplus
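As a minimal sketch of this loop in Python (our own illustration; the real scheduler would run on the DSP cores), selecting a core by either time-based or ratio-based forward surplus:

```python
# Sketch of the forward-surplus assignment loop. runtimes[i] is the
# processing time of the task on core i, queued[i] is the sum of the
# remaining runtimes of tasks already on core i. Names are illustrative.

def assign_core(runtimes, queued, deadline, ratio_based=False):
    best_core, best_surplus = None, None
    for i, runtime in enumerate(runtimes):
        naive_surplus = deadline - queued[i]   # slack before the core is free
        if ratio_based:
            surplus = naive_surplus / runtime  # weight surplus by core speed
        else:
            surplus = naive_surplus - runtime  # time-based slack
        if best_surplus is None or surplus > best_surplus:
            best_core, best_surplus = i, surplus
    return best_core
```

Note that the two variants can disagree: a slow core with an empty queue may win on time-based slack while a fast but busier core wins on the ratio metric.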

The exploration in [10] suggests that ratio-based surplus is generally the most suitable for the ‘average’ workload. However, we found that time-based forward surplus was better for our current task set. Ratio-based surplus requires one floating-point division for every core, an operation that is expensive on most architectures. This may be a problem given the need for low overheads, although more thorough profiling of our specific hardware would be required. Additionally, because of the relative linearity of the workflow, ratio-based surplus performed no better than time-based surplus, and actually performed worse in some cases. In a highly competitive and dense task set, it is important to steer tasks to the cores they run best on; for more linear workflows, however, it is also important to parallelize when possible. Ratio-based surplus put too much emphasis on running tasks on their most suitable cores, at the expense of exploiting parallelism. Time-based forward surplus values runtime, but with far less leverage, leading to more parallelism when there is a relative abundance of resources.

When we tested the algorithms with one of the DSPs removed, ratio-based surplus produced a feasible schedule but time-based did not. This suggests that in the likely event of hardware modification, and in particular trimming, investigation of ratio-based surplus will be more fruitful.

Forward surplus is a simple runtime decision algorithm, but some small modifications can be made to leverage static decisions without influencing runtime overhead. The nature of real-time wireless processing is that while a packet is made up of potentially thousands of subtasks, the only deadline is the global deadline for the entire packet. Therefore, even using a simple algorithm such as forward surplus is not trivial, as the deadlines chosen for each task feed into how the algorithm assigns jobs. Currently we give each symbol a deadline of 12μs, and define each subtask’s deadline by its critical path; that is to say, how late we can schedule it and still satisfy all potential jobs that may depend on it. This choice of deadlines encourages success by avoiding situations where it is impossible to finish the packet, and also grants flexibility by not encouraging tasks to be completed earlier than they need to be.
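The critical-path deadline assignment can be sketched as a backward pass over the dependency DAG. The Python below is our own illustration, assuming a single representative runtime per task (on a heterogeneous platform one might use, say, the fastest core's runtime); all names are hypothetical:

```python
# Sketch: derive per-task deadlines from the global symbol deadline by a
# backward pass over the dependency DAG. A task must finish early enough
# that every successor can still meet its own deadline.

def critical_path_deadlines(succs, runtime, global_deadline):
    """succs[t]: tasks that depend on task t;
    runtime[t]: representative runtime of task t."""
    deadlines = {}

    def deadline(t):
        if t in deadlines:
            return deadlines[t]
        if not succs[t]:
            d = global_deadline          # sink tasks get the global deadline
        else:
            # Leave each successor enough time to run before its deadline.
            d = min(deadline(s) - runtime[s] for s in succs[t])
        deadlines[t] = d
        return d

    for t in succs:
        deadline(t)
    return deadlines
```

With a 12μs symbol deadline and a chain fft → demod → viterbi, each task's deadline is the successor's deadline minus the successor's runtime, exactly the "how late can we schedule it" rule described above.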

While Wi-Fi workflows are deterministic, assigning deadlines in less deterministic workflows such as LTE will be an interesting task. For example, some tasks may have long potential paths arising from very unlikely scenarios. Giving a usually non-critical task a very early deadline can reduce scheduler flexibility and may hinder operation. It could be interesting to see how performance is affected by gambling with a deadline: setting a deadline such that there exist paths, albeit very unlikely ones, where finishing at the deadline is not actually sufficient to complete the task. This would have to be paired with a lightweight machine-learning component where different tasks adaptively change their deadlines in the event of failures or near-failures.

Porter proposes a mechanism for deciding scheduling parameters such as deadlines based on a market [3]. An avenue of future work would be to run offline simulations of his model to generate deadlines and tweak other parameters, and see whether the modified parameters result in better schedules.

Implementing a dynamic scheduler for Wi-Fi, or wireless in general, is an intimidating task. The deadlines are very tight and the tasks extremely short, meaning that significant scheduler overhead is unacceptable. Therefore, any scheduler needs to be very lightweight and must employ a variety of optimizations to accomplish the task at hand.

As previously mentioned, we were unable to get a scheduler implementation running on the hardware. However, we have reason to believe that a dynamic scheduler implementation is possible. In addition to the fact that the schedules that would be produced dynamically are only slightly worse than optimal, the calculations needed to make those decisions are extremely lightweight. Moreover, there are several ways to trim scheduling overhead significantly; a variety of potential techniques are discussed below.

Each task in a packet has associated state, used for its initial scheduling, satisfying dependencies, and coordinating communication. However, with a maximum symbol count of 200 and 15 tasks per symbol, the amount of state for an individual task could be multiplied by 3000. (Currently we foresee about 4 bytes of state per task, which would result in 12000 bytes for an entire packet. While this is a small fraction of the L2 cache, it may prevent working directly out of the L1.) In addition to potentially straining memory, state management becomes difficult, and associative actions on the state become infeasible. However, we can avoid this overhead by noting that a given piece of state is only relevant for a short time. Specifically, with a symbol arrival rate of 4μs and a symbol deadline of 12μs, only 3 symbols will be active at any time. To illustrate, our first symbol has a deadline at 12μs; the fourth symbol arrives at 12μs. By this time, the first symbol must be done, or else we have failed to satisfy our self-imposed deadlines. This means that the state for symbol 1 can later be reused by symbol 4. One disadvantage of this technique is that it requires setting hard deadlines on tasks that don’t actually have them; symbols only need to be completed by the packet deadline, and there is no real individual deadline for each symbol. Allowing for softer deadlines may allow for some unique schedules, which may even make an otherwise infeasible workload feasible. However, our current scheduling policy makes the assignment of deadlines and reasoning about scheduling feasibility much simpler.
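The reuse scheme amounts to a small ring buffer of per-symbol state. A minimal sketch in Python (our own illustration; the slot count follows from the 12μs deadline and 4μs arrival interval, and all names are hypothetical):

```python
# Sketch: reuse per-symbol task state in a small ring buffer. With symbols
# arriving every 4us and a 12us per-symbol deadline, at most 12 / 4 = 3
# symbols are live at once, so symbol k can safely reuse slot k mod 3.

ACTIVE_SLOTS = 3          # deadline (12us) / arrival interval (4us)
TASKS_PER_SYMBOL = 15

# One fixed-size pool instead of per-symbol state for up to 200 symbols.
state_pool = [[0] * TASKS_PER_SYMBOL for _ in range(ACTIVE_SLOTS)]

def symbol_state(symbol_index):
    # Symbols k and k+3 share a slot; by the time symbol k+3 arrives,
    # symbol k must have met its deadline, so its slot is free.
    return state_pool[symbol_index % ACTIVE_SLOTS]
```

This shrinks the footprint from 12000 bytes for a full packet to 3 symbols' worth of state, small enough to plausibly stay L1-resident.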

One problem with the current algorithm is that it requires one calculation for each of potentially many cores. A potential optimization would be to decide that if a core’s forward surplus is sufficient, the task should be assigned to it straight away, rather than computing forward surplus for the remaining cores. This avoids the remaining comparisons, making the optimization particularly useful in the case of either many cores or a complicated decision calculation. A problem with this optimization comes from the heterogeneous nature of the platform: the algorithm may decide that a core has sufficient forward surplus to run a task even if that core runs it very slowly, and that core may then be out of commission at a time of less slack. Given that our current setup only uses 9 cores and our calculation is relatively straightforward, the incurred branch penalties will likely outweigh any potential benefit; however, since the OpenRadio platform is subject to change, this optimization may become necessary in the future. An accompanying lightweight learning component to determine which cores to check first could mitigate some of the potential problems.
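The early-exit variant can be sketched as follows (our own illustration in Python; the threshold value is an assumed tuning parameter, not something specified by the algorithm in [10]):

```python
# Sketch of the early-exit variant: accept the first core whose time-based
# forward surplus clears a threshold, skipping the remaining cores, and
# fall back to the best core seen if none clears it.

def assign_core_early_exit(runtimes, queued, deadline, threshold):
    best_core, best_surplus = None, None
    for i, runtime in enumerate(runtimes):
        surplus = deadline - queued[i] - runtime
        if surplus >= threshold:
            return i                     # good enough: stop scanning cores
        if best_surplus is None or surplus > best_surplus:
            best_core, best_surplus = i, surplus
    return best_core                     # no core cleared the threshold
```

The trade-off described above is visible here: a slow core that happens to clear the threshold is taken immediately, even when a later core would have had more surplus.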

The naïve way to run such a scheduler is to invoke it when a task finishes, since that task’s dependents will then have their dependencies satisfied. However, information about which dependencies are about to be completed, and which tasks to run next, is often available before a task actually finishes, even in non-deterministic cases. The scheduler can begin scheduling dependent tasks prior to the completion of the tasks they depend on, masking some of the scheduling latency. A possible extension is to introduce speculation about which tasks will run and when; such an extension would require preemption and rollback, and would only be introduced if necessary. Hiding latency is a very important optimization: if the scheduler runs in parallel on its own core, this optimization can make scheduling appear to have zero or close to zero overhead.

Currently, a task represents some basic computational block, such as a fast Fourier transform, channel estimation, or demodulation. This grouping is not optimal for a dynamic scheduler. To illustrate, if there is a set of consecutive tasks where each task depends on the previous one and any future task depends only on the last, then the entire set can be viewed as a single task without any loss of flexibility. All Wi-Fi rates have such a set at the beginning of each symbol. Grouping these sets amortizes the overhead of scheduling each individual task and ensures no communication cost between the dependent tasks in the bundle.

A slight variation on this is the case where future tasks depend only on tasks in the middle of the chain. For example, in some nine-task chain there may be a task that depends only on task 7, and not on 8 or 9. Theoretically this dependent task could be scheduled as soon as task 7 completed, but if the nine tasks are bundled as one atomic unit, the scheduler must wait until the entire bundle completes. The decision to bundle these nine tasks could then result in a suboptimal schedule, which may offset the benefits the amortization provides. Some basic criteria can guide the choice, such as how urgent forked tasks are and how early forks occur. If the task that depends on task 7 has a fairly urgent deadline relative to the deadline of the bundle, it will likely be advantageous to split the bundle so that the task can start as soon as task 7 finishes. If instead the third task in a nine-task chain frees a task outside the chain, keeping the chain together would severely hinder scheduler flexibility.
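One way to realize such bundling, sketched below in Python (our own illustration, not OpenRadio code), is to contract maximal chains in which each interior task has a unique predecessor and a unique successor, so that no outside task ever depends on the middle of a bundle:

```python
# Sketch: contract linear chains of a task DAG into bundles. A task is
# absorbed into its predecessor's bundle only when it is the unique
# successor of a task that is its unique predecessor; forks and joins
# terminate a bundle. Names are illustrative.

def bundle_chains(succs, preds):
    """succs[t] / preds[t]: lists of successors / predecessors of task t."""
    bundles = []
    seen = set()
    for t in succs:
        # Skip tasks already bundled, and tasks that will be absorbed
        # into their predecessor's chain.
        if t in seen or (len(preds[t]) == 1 and len(succs[preds[t][0]]) == 1):
            continue
        chain = [t]
        while len(succs[chain[-1]]) == 1:
            nxt = succs[chain[-1]][0]
            if len(preds[nxt]) != 1:      # nxt joins another branch: stop
                break
            chain.append(nxt)
        seen.update(chain)
        bundles.append(chain)
    return bundles
```

On a fork such as a → b → {c, d}, this yields the bundle [a, b] plus singleton bundles for c and d, matching the rule that a task freeing work outside the chain should end the bundle.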

Our results are obtained through simulations, both for static and for dynamic scheduling. Processing times are based partially on actual profiling of tasks on the Keystone platform and partially on extrapolation of the profiling results to estimate runtimes for higher bit-rates. The FFT co-processors are not currently used in OpenRadio, since FFT performance is comparably good on the DSP and offloading this task to an FFT co-processor would incur additional communication cost. Many of the Wi-Fi scenarios are linear, and therefore we focus on 36 Mbps and 54 Mbps, for which it is possible to use multiple Viterbi co-processors in parallel.

In Fig. 4 and Fig. 2 we present feasible schedules generated using the two approaches for 54 Mbps and 36 Mbps respectively. We see that, as expected, the Mixed-ILP solution produces a better schedule than the dynamic scheduling algorithm in terms of minimizing the processing time, but not by much: the difference is 600 cycles. This result suggests that using a lightweight heuristic scheduling algorithm can be practical.

We were also interested in examining how hardware platform parameters can affect scheduling. We chose to focus on inter-processor communication. We produced Mixed-ILP formulations of the scheduling problems for five different cases, corresponding to communication latencies of 0, 50, 100, 150 and 200 cycles. Comparing the total processing times obtained from the resulting schedules (Fig. 3), we can see that communication latencies do not influence scheduling much: the difference between the completion time of the schedule for the 0 communication overhead case and the schedule for the 200-cycle case is only 400 cycles. This experiment suggests that other parameters should be examined as well in order to better understand their implications for the feasibility of real-time wireless processing. Studies of this kind could result in hardware design choices based on better understanding instead of intuitive assembly of general-purpose processors with task-specific accelerators.

The OpenRadio system has very specific hardware constraints, such that the hardware specification portion of the MILP is fixed. The MILP framework could, however, be used to choose a set of hardware resources that best perform a particular task. This can be accomplished by simply expanding the set of processors and allowing the MILP solver to "choose" the best configuration to complete the given jobs. In addition to expanding the set of processors, one would also modify the objective function to penalize solutions that use more hardware resources than necessary. This would require additional intermediate binary variables representing whether a processor was used at least once or not at all. Extensions to the MILP such as this are left for future work; however, we did attempt to solve a scheduling problem for a more constrained hardware setup than what is provided by OpenRadio and found feasible solutions for high bandwidth Wi-Fi, suggesting that optimizing for reduced hardware requirements may be fruitful. The result of scheduling using only 2 DSP processors is presented in Fig. 4.

We have shown two methods for computing feasible schedules for implementing demanding Wi-Fi protocols on the OpenRadio platform. The mixed-integer linear program formulation gives us an optimal solution and all feasible solutions, but must be run offline in order to pre-compute schedules, which can then be implemented on the OpenRadio platform. We have also presented a heuristic scheduler which produces near-optimal schedules for the workflows and hardware we are interested in, and we have discussed the steps required to use this heuristic scheduler to schedule tasks on the fly with minimal overhead, depending on the complexity of the heuristic. Lastly, we have explored reduced-hardware scenarios for both the MILP and heuristic schedulers and found that they can find feasible schedules to implement 54 Mbps Wi-Fi with one fewer DSP than we have available on the OpenRadio hardware platform.

The most immediate work to be done with the results presented is to verify the feasibility of the computed schedules on the OpenRadio platform itself. After the performance of these schedules for Wi-Fi protocols has been verified, it may be possible to extend the MILP and the heuristic scheduler to handle non-deterministic wireless protocols such as LTE and implement these schedules on OpenRadio as well.

We aim to implement and test a dynamic scheduler on the OpenRadio platform, using the forward surplus algorithm and the optimizations discussed in the paper. While dynamic scheduling seems like a fruitful pursuit, profiling a real scheduler will give us much more insight into the true advantages of, and potential problems with, scheduling at runtime.

Another avenue of future work pertains to embedded systems and configurable hardware in general. The mixed-integer linear programming model as given can be used to find schedules for a given set of jobs and a given set of hardware, but as our results have shown, for many of the Wi-Fi rates not all of the hardware on the OpenRadio platform is required to fully implement the protocol. This suggests that we might instead consider finding an optimal hardware configuration that reduces cost and energy requirements while still yielding feasible schedules for protocols of interest. One way we propose to extend the MILP formulation is to simply allow the solver to use a large number of processors and let it choose which subset of them is best able to complete the jobs under the deadline. In addition to adding more available processors, one would also modify the objective function to penalize schedules that use more processors than are necessary, according to some metric of the "cost" of having the processor in the system at all.

[1] IBM ILOG CPLEX Optimizer, http://www-01.ibm.com/software/integration/optimization/cplex-optimizer.

[2] OpenRadio project, http://snsg.stanford.edu/projects/openradio.

[3] R. Porter. Mechanism design for online real-time scheduling. In Proceedings of the 5th ACM conference on Electronic commerce, pages 61–70. ACM, 2004.

[4] R. Rau. Iterative modulo scheduling. International Journal of Parallel Programming, 24(1):3–64, 1996.

[5] D. Sanchez, D. Lo, R. Yoo, J. Sugerman, and C. Kozyrakis. Dynamic fine-grain scheduling of pipeline parallelism. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 22–32. IEEE, 2011.

[6] J. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo. Implications of classical scheduling results for real-time systems. Computer, 28(6):16–25, 1995.

[7] P. Šucha and Z. Hanzálek. Cyclic scheduling of tasks with unit processing time on dedicated sets of parallel identical processors. In Proc. of the 3rd Multidisciplinary International Conference on Scheduling: Theory and Application, Paris, 2007.

[8] P. Šucha and Z. Hanzálek. Deadline constrained cyclic scheduling on pipelined dedicated processors considering multiprocessor tasks and changeover times. Mathematical and Computer Modelling, 47(9):925–942, 2008.

[9] P. Šůcha, M. Kutil, M. Sojka, and Z. Hanzálek. TORSCHE scheduling toolbox for MATLAB. In IEEE Computer Aided Control Systems Design Symposium (CACSD’06), pages 1181–1186, Munich, Germany, October 2006.

[10] H. Tang, K. Rupnow, P. Ramanathan, and K. Compton. Dynamic binding and scheduling of firm-deadline tasks on heterogeneous compute resources. In Embedded and Real-Time Computing Systems and Applications (RTCSA), 2010 IEEE 16th International Conference on, pages 275–280. IEEE, 2010.