This post was written by Adam Share, an engineer who helped build OtterPOS's offline mode. Matt Park and Tim Pinkawa contributed notes on backend design.
A little over one year ago, the Otter team embarked on a mission to build a point-of-sale (POS) product that meets the needs of many of our restaurant partners. Otter's Order Manager product has already proven an invaluable tool for managing a restaurant's "online" orders from food delivery providers, but what about all those in-person "offline" orders that make up 80% of a restaurant's business?
This post describes how we built a single solution to manage both online and offline orders using a restaurant's existing hardware, and how that product maintains business continuity in the face of unexpected outages and connectivity issues. Our goal is to deliver the most reliable POS experience in the industry: no orders lost, ever.
OtterPOS: Going online
OtterPOS combined all the things we're good at in the online world with new hardware and software to support the offline world. Payments run through a Stripe terminal, and a large screen makes it easy to customize new orders.
This version worked well, and we delighted our initial customers! They loved our simple user interface with a single home for all their orders. But, as we discussed our product with larger restaurant groups, we learned there was an important feature they cared about that we didn't yet support:
What happens when the internet goes down? What if our servers go down? If you have a customer physically handing you money, it's unacceptable that your POS stops working for any reason.
Traditional POS systems solve this by running a large server onsite in the restaurant. The POS and other devices connect to the server and the server connects to the internet. When the internet goes down, the server insulates the devices by continuing to process orders. This has some drawbacks, though: restaurants are required to buy and install the server, the server is a single point of failure, and onboarding to the system is a lot more complicated.
Can we do better?
OtterPOS: Off the grid
So we built "offline mode".
Two new components allow our devices to continue to operate in the event of connectivity loss:
A write-ahead log on the OtterPOS device to capture all of the events that happen while the device is offline
An asynchronous workflow engine in our servers, built on Temporal, to process the log when connectivity is restored
The write-ahead log captures every event that happens on the device. Every order creation, order update, payment, refund, and clock-in/clock-out is saved in persistent storage. This means that if the device is turned off for the night and turned back on for the morning shift, the events are still there and the device will attempt to flush them. The events are kept in a local database until the device is sure that the server has received and processed them, at which point they are cleared from the DB.
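To make the idea concrete, here is a minimal sketch of a device-side event log, assuming SQLite as the local database. The post does not say which database or schema OtterPOS uses, so the `EventLog` class, the table layout, and the method names are all hypothetical.

```python
import json
import sqlite3


class EventLog:
    """Append-only local log of device events, backed by SQLite (an assumption)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Passing a file path instead of ":memory:" is what lets events survive a restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " order_id TEXT NOT NULL,"
            " kind TEXT NOT NULL,"
            " payload TEXT NOT NULL)"
        )
        self.db.commit()

    def append(self, order_id: str, kind: str, payload: dict) -> int:
        """Persist the event and return its sequence number."""
        cur = self.db.execute(
            "INSERT INTO events (order_id, kind, payload) VALUES (?, ?, ?)",
            (order_id, kind, json.dumps(payload)),
        )
        self.db.commit()
        return cur.lastrowid

    def pending(self) -> list:
        """Return events not yet acknowledged, in the order they were recorded."""
        rows = self.db.execute(
            "SELECT seq, order_id, kind, payload FROM events ORDER BY seq"
        ).fetchall()
        return [(seq, oid, kind, json.loads(p)) for seq, oid, kind, p in rows]

    def acknowledge(self, seq: int) -> None:
        """Delete an event once the server has confirmed it was processed."""
        self.db.execute("DELETE FROM events WHERE seq = ?", (seq,))
        self.db.commit()
```

The key property is that an event leaves the table only through `acknowledge`, so anything the server has not confirmed is still there after a power cycle.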
The device will attempt to flush the log after every new event, and periodically if it detects that connectivity has been restored. When the server receives the log, it immediately starts up a durable workflow to process the events. As soon as the server is sure the workflow is started, it returns the workflow ID so that the device knows the events have been received and will be processed.
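The flush behavior can be sketched roughly as follows. This is an illustration under assumptions, not the OtterPOS client: `Flusher`, `record`, `flush`, and the `send` callable (a stand-in for the network call that returns a workflow ID or raises `ConnectionError` while offline) are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Flusher:
    # `send` stands in for the request to the server. It returns a workflow ID,
    # or raises ConnectionError while the device has no connectivity.
    send: Callable[[list], str]
    unsent: list = field(default_factory=list)
    in_flight: dict = field(default_factory=dict)  # workflow ID -> events it covers

    def record(self, event: dict) -> Optional[str]:
        """Queue a new event and immediately try to flush the log."""
        self.unsent.append(event)
        return self.flush()

    def flush(self) -> Optional[str]:
        """Try to hand every unsent event to the server.

        Intended to be called after each new event and from a periodic timer.
        """
        if not self.unsent:
            return None
        batch = list(self.unsent)
        try:
            workflow_id = self.send(batch)
        except ConnectionError:
            return None  # still offline: keep everything and try again later
        # The server confirmed a workflow was started, so stop resending this batch.
        self.in_flight[workflow_id] = batch
        self.unsent = self.unsent[len(batch):]
        return workflow_id
```

Events move out of `unsent` only once a workflow ID comes back, which mirrors the handoff described above: until the server confirms the workflow exists, the device assumes nothing was received.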
Events are processed serially within the scope of a single order to ensure they are processed in the correct order. However, we process events from different orders in parallel with separate Temporal activities, so that a failure in one order does not block the processing of another order.
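The ordering rule can be illustrated without Temporal. The sketch below uses plain `asyncio` tasks as a stand-in for Temporal activities; `process_log`, `handle`, and the event shape (a dict with an `order_id` key) are assumptions made for the example. Unlike the real workflow, which keeps retrying, this sketch simply stops an order at its first failure and reports the error.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Optional


async def process_log(
    events: list,
    handle: Callable[[dict], Awaitable[None]],
) -> dict:
    """Process events serially per order and concurrently across orders.

    Returns a map of order_id -> the exception that stopped that order, or None.
    """
    by_order = defaultdict(list)
    for event in events:  # the log arrives in the order the device recorded it
        by_order[event["order_id"]].append(event)

    async def run_order(order_events: list) -> None:
        for event in order_events:  # strictly one at a time within an order
            await handle(event)

    order_ids = list(by_order)
    results = await asyncio.gather(
        *(run_order(by_order[oid]) for oid in order_ids),
        return_exceptions=True,  # a failing order must not cancel the others
    )
    outcome: dict = {}
    for oid, result in zip(order_ids, results):
        error: Optional[BaseException] = result
        outcome[oid] = error
    return outcome
```

Grouping by order before scheduling is what gives both guarantees at once: each order's events stay in sequence, and an exception in one group leaves the other groups running.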
In the event of a Temporal activity failure, the workflow will retry indefinitely to ensure the event is not lost. These events already happened in the real world, so we cannot fail to process and record them in our system of record. The durable nature of the workflow and our retry logic provide us with this guarantee.
For transient failures, the activity will eventually succeed on its own. For failures due to backend bugs, an engineer will fix the bug and redeploy, at which point the activity will automatically retry and succeed, moving on to the next event to process. In the event of a client bug where we receive unexpected bad input, we even have the ability to manually rerun a workflow with the correct input.
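The retry behavior is handled by Temporal in our system, but the underlying idea is simple enough to show in plain Python. The sketch below is not the Temporal API; `run_until_success` and its parameters are hypothetical, and the backoff values are arbitrary defaults chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_until_success(
    activity: Callable[[], T],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `activity` until it returns, backing off exponentially between failures."""
    delay = initial_delay
    while True:  # no attempt limit: the event already happened and must be recorded
        try:
            return activity()
        except Exception:
            sleep(delay)
            # Cap the wait so that a fix deployed later is picked up reasonably soon.
            delay = min(delay * 2, max_delay)
```

Because there is no maximum attempt count, a failure caused by a backend bug just keeps the loop waiting until a fixed version is deployed, at which point the next attempt succeeds and processing moves on.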
Upon success, we send an event status to our push framework, which pushes the status back to the device. In the event of an unrecoverable error, we send a failure status to the push framework instead. In either case, the device now knows the event was processed and can safely clear it from the DB.
Offline mode has already saved our customers from downtime. In March 2023, our cloud provider, Azure, experienced an outage, degrading some of our services. Our customers using OtterPOS did not notice because orders and payments were still processed locally. When the outage was over, our workflows processed all the pending events and everything worked as expected.
OtterPOS: Offline evolution
Offline mode works great when you have a single POS device, but what about restaurants with multiple POS devices? What about restaurants with kitchen display system (KDS) devices? We still rely on our backend to sync state between the devices, so that an order started on one POS can be finished on a second POS or viewed on a KDS.
The next version of our POS product will solve this by creating a mesh of interconnected devices all sharing state with each other and syncing events to our servers. This will allow all devices in the restaurant to share state even when the network goes down.
Our local mesh network of devices will leverage an existing open-source peer-to-peer networking API, Nearby Connections. Nearby Connections abstracts away Bluetooth, BLE, Wi-Fi, and LAN connections and will automatically connect using the best available method. Each device can connect to one or more devices over any of the available protocols to form the mesh.
When two devices establish a connection, they will move communication to the highest-bandwidth pathway available. So if two devices find each other over Bluetooth first but then join the same WLAN, they will move from the Bluetooth connection to a higher-bandwidth WLAN connection.
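The upgrade decision happens inside the networking library, so we do not write it ourselves. Purely to illustrate the rule, the sketch below picks the best medium two devices have in common. It is not the Nearby Connections API: the `Medium` enum, its ranking values, and both functions are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Medium(IntEnum):
    # The values are an illustrative bandwidth ranking, not measured throughput.
    BLE = 1
    BLUETOOTH = 2
    WIFI_LAN = 3


def best_shared_medium(a: set, b: set) -> Optional[Medium]:
    """Return the highest-ranked medium both devices support, if any."""
    shared = a & b
    return max(shared) if shared else None


def maybe_upgrade(current: Medium, a: set, b: set) -> Medium:
    """Switch to a better shared medium when one becomes available."""
    best = best_shared_medium(a, b)
    if best is not None and best > current:
        return best
    return current  # never downgrade a working connection in this sketch
```

Two devices that first meet over Bluetooth and later join the same network would see `maybe_upgrade` return `Medium.WIFI_LAN` once both sets contain it.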
With devices able to discover and connect to each other to form a mesh network, all app data can now be distributed over the local network. We’ll follow a “publish–subscribe” pattern to generalize how data is sent and received, so that existing features and future development can leverage the offline-first approach. Events that are published will also be cached so that if devices become disconnected or new devices connect, the relevant data can be exchanged to bring all devices back into sync.
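As a rough illustration of publish-subscribe with a replay cache, here is an in-process sketch. It leaves out the network entirely, and `LocalBus`, its methods, and the `since` cursor are hypothetical names rather than our actual design.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class LocalBus:
    """In-process publish-subscribe bus that caches events for late subscribers."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # topic -> handlers
        self._cache = defaultdict(list)        # topic -> every event published so far

    def publish(self, topic: str, event: dict) -> None:
        """Cache the event, then deliver it to every current subscriber."""
        self._cache[topic].append(event)
        for handler in self._subscribers[topic]:
            handler(event)

    def subscribe(self, topic: str, handler: Handler, since: int = 0) -> None:
        """Register a handler, first replaying cached events from index `since`.

        A device that already holds the first N events can pass since=N to
        receive only what it missed.
        """
        for event in self._cache[topic][since:]:
            handler(event)
        self._subscribers[topic].append(handler)
```

A KDS that subscribes to an `"orders"` topic after two orders were published would receive both from the cache and then every later order live, which is the catch-up behavior described above.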
In this model, the devices act as both clients and servers and are able to sync data between themselves without the backend orchestrating it. The backend is still essential for durable storage of events, but its responsibility is reduced from being an authoritative coordinator to becoming a peer node in a mesh of devices.
With our new local mesh functionality, all devices in the restaurant can seamlessly continue to operate with no downtime when connectivity is lost, fulfilling our promise of business continuity.
Conclusion
Customers have been delighted with our single solution for managing both online and offline orders. But a great solution won't get you very far unless it is reliable and allows you to run your business even when things go wrong. This is why offline mode has been a key focus area for our team and continues to be a competitive advantage as we scale offline mode to more scenarios. We've already seen offline mode save our customers hundreds of orders during various unforeseen connectivity issues.
We will continue working to make our entire fleet of devices work while offline, including front-of-house Order Manager, back-of-house Kitchen Display Systems, and customer-facing kiosks. They will all work seamlessly together even when the network drops.
Additionally, we plan to make more features available while offline. We focused our efforts first on ensuring business continuity by supporting order placement and payment. But we plan to bring offline mode to menu management, analytics, and more in the near future.