Introduction

This page demonstrates the fault tolerance techniques in the Quest-V distributed system on a chip design. A Crazyflie quadcopter is controlled using a USB transceiver and a USB joystick, both connected to a machine running Quest-V. Faults are injected into the Crazyflie application, and the system detects and recovers from them. As a result, the quadcopter keeps operating seamlessly despite the injected faults.

Crazyflie Demonstration Introduction Diagram

Demonstration

This demonstration shows the effectiveness of the Quest-V fault tolerance subsystem. Specifically, it demonstrates roll-forward recovery, a novel technique that advances the state of a process forward in its execution using a duplicate copy that resides in a different sandbox.

In this demo, a 4-core microprocessor is divided into 4 sandboxes with one core per sandbox. The USB host controller is isolated to Sandbox 0. Sandbox 0 reads input from a USB joystick and sends the data to Sandboxes 1 to 3 via private shared memory channels. Sandboxes 1 to 3 each contain one instance of a process that repeatedly reads the joystick data from its shared memory channel, computes the parameters that should be sent to the Crazyflie quadcopter, places the results in a shared memory channel accessible only by Sandbox 0, and then makes a sync system call. During the sync system call, the kernel hashes the process's userspace memory and places the hash into another shared memory channel accessible only by Sandbox 0.
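As a rough illustration of the hashing done in the sync step described above, the sketch below hashes a replica's memory pages and shows that a single injected bit flip changes the digest. It is a user-level Python simulation, not Quest-V kernel code: the 4096-byte page size, the use of SHA-256, and the names `hash_memory` and `inject_bit_flip` are all assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size; the source does not state one


def hash_memory(pages):
    """Return one hex digest covering every page of a replica's user memory."""
    digest = hashlib.sha256()  # assumed hash function; the source does not name one
    for page in pages:
        digest.update(page)
    return digest.hexdigest()


def inject_bit_flip(pages, page_index, byte_index, bit):
    """Return a copy of `pages` with one bit flipped, emulating an injected fault."""
    corrupted = [bytearray(page) for page in pages]
    corrupted[page_index][byte_index] ^= 1 << bit
    return [bytes(page) for page in corrupted]


# Two replicas with identical state produce identical hashes.
healthy = [bytes(PAGE_SIZE) for _ in range(4)]
replica_copy = [bytes(PAGE_SIZE) for _ in range(4)]

# A single flipped bit in one page produces a different hash.
faulty = inject_bit_flip(healthy, page_index=2, byte_index=10, bit=3)

print(hash_memory(healthy) == hash_memory(replica_copy))  # True
print(hash_memory(healthy) == hash_memory(faulty))        # False
```

Comparing fixed-size digests rather than whole memory images keeps the data exchanged between sandboxes small, which is presumably why only hashes are placed in the shared memory channel.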

After placing the joystick readings into the shared memory channels, Sandbox 0 waits for Sandboxes 1 to 3 to make the sync system call. Once all of the sync system calls have been made, Sandbox 0 compares the hashes. If the hashes are identical, no recovery action is necessary, and Sandbox 0, acting as the arbitrator, uses the results from Sandboxes 1 to 3 (which will also be identical) to control the Crazyflie quadcopter via the USB transceiver. If the hashes differ, Sandbox 0 determines which sandboxes are in the majority and sends each minority sandbox a message indicating that the recovery procedure should occur. When a minority sandbox receives this message, it performs a hypercall into the monitor to carry out the recovery procedure. After the recovery procedure has finished, the hypercall returns into the kernel at the sync system call, which then returns to the recovered userspace process.
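The comparison Sandbox 0 performs can be sketched as a majority vote over the reported hashes. This is a minimal Python sketch under stated assumptions: the function name `find_minority`, the dictionary mapping sandbox id to hash, and the choice to raise an error when no strict majority exists are illustrative and not taken from Quest-V.

```python
from collections import Counter


def find_minority(hashes):
    """Given {sandbox_id: hash}, return (majority_hash, sandboxes needing recovery)."""
    counts = Counter(hashes.values())
    majority_hash, votes = counts.most_common(1)[0]
    # Assumed policy: without a strict majority there is no trusted state to copy.
    if votes <= len(hashes) // 2:
        raise RuntimeError("no strict majority among replica hashes")
    minority = sorted(sb for sb, h in hashes.items() if h != majority_hash)
    return majority_hash, minority


# Sandboxes 1 and 2 agree; Sandbox 3 reported a different hash.
majority, to_recover = find_minority({1: "abc", 2: "abc", 3: "xyz"})
print(majority, to_recover)  # abc [3]
```

With three replicas, a single faulty sandbox is always outvoted by the other two, which is consistent with the demo running three instances of the process.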

Below is a video of a Crazyflie quadcopter being controlled by Quest-V. On the right, the monitor output shows each time the system detects and recovers from a fault, along with a counter of how many faults have been detected and fixed.

Step By Step Outline

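Below is a diagram and step-by-step outline of the roll-forward recovery mechanism used in the video above. Note that the outline numbers the sandboxes differently from the demo: here Sandbox 1 compares the hashes, the role played by Sandbox 0 in the demo.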

Roll-Forward Recovery Diagram
  1. A program is run in multiple (m >= 3) sandboxes

  2. Whenever the program writes to a new page a copy-on-write occurs

  3. The program makes a sync syscall, dropping down into the kernel

  4. The memory pages of the program are hashed in Sandboxes 2 to m and the hashes are sent to Sandbox 1

  5. The programs wait for Sandbox 1 to send a message indicating whether an error occurred or not

  6. Sandbox 1 compares the hashes of the programs, detects that there is an error in Sandbox 2 and sends a message to Sandbox 2 to recover

  7. Sandbox 2 receives the message to recover, makes a hypercall into the monitor and copies the correct pages from Sandbox m

  8. The program sends a message to Sandbox 1 that it has recovered and returns to userspace to continue execution

  9. Sandbox 1 releases Sandbox m, which returns to userspace
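
The page copy in step 7 can be sketched as overwriting every page of the faulty replica that differs from a healthy one. This is a hedged Python simulation rather than the monitor's actual code: the `Replica` class, the `roll_forward` function, the 8-byte pages, and the page-by-page comparison used to find the pages to copy are all assumptions made for illustration.

```python
class Replica:
    """Hypothetical stand-in for one sandbox's copy of the program's memory."""

    def __init__(self, sandbox_id, pages):
        self.sandbox_id = sandbox_id
        # Each replica owns private copies of its pages.
        self.pages = [bytearray(page) for page in pages]


def roll_forward(faulty, donor):
    """Overwrite pages in `faulty` that differ from `donor`; return their indices."""
    repaired = []
    for index, good_page in enumerate(donor.pages):
        if faulty.pages[index] != good_page:
            faulty.pages[index] = bytearray(good_page)
            repaired.append(index)
    return repaired


# Three small pages per replica; the sizes are arbitrary for the example.
initial_pages = [bytes(8), bytes(8), bytes(8)]
sandbox_2 = Replica(2, initial_pages)
sandbox_m = Replica(3, initial_pages)

# Both replicas compute the same update, then a fault corrupts Sandbox 2.
for replica in (sandbox_2, sandbox_m):
    replica.pages[0][0] = 42
sandbox_2.pages[1][3] ^= 0x10  # injected single-bit fault

repaired_pages = roll_forward(sandbox_2, sandbox_m)
print(repaired_pages)  # [1]
```

Because the faulty replica takes on the healthy replica's current state instead of reverting to an earlier checkpoint, it resumes from the same point as the other replicas, which is the sense in which the recovery rolls forward.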