How I learnt about SaaS System Architecture by building my smart home
5 common SaaS architecture best practices I found myself applying while building my smart home.
After fielding friends' questions about the design of my smart home, I realized that its complicated architecture, system capabilities, and product limitations were intriguing. Describing these made it clear that my home hobby had taught me the same basics of SaaS architecture that are the usual (best? not necessarily) practices in the industry. I'll share my challenges and learnings about architecture and product design from building a smart home and draw parallels with the cloud software industry that you might've experienced as a Product Manager or an engineer.
A DIY approach to a smart home
We bought a house last year and one of my interests was to make it a “smart home”. After months of attempts, we now have lights, thermostats, fans, blinds, curtains, cameras, TVs, fragrance diffusers, a humidifier, a kettle, and a garden heater controlled via automation, without needing to press buttons. The automations are triggered by motion, illumination, sunset, geolocation, temperature, time of day, and doorbell presses. At the same time, the systems also allow overrides via voice commands, button presses, and smartphone apps.
A look at the app drawer on my smartphone shows the mix of products I've needed to build the smart home. That mix, in turn, led to architectural challenges that taught me perspectives similar to the ones we've seen in SaaS tech companies.
Challenges and Learnings
1) Read and write, a.k.a. bidirectional information
Some systems, like Somfy for automating blinds, can be ordered to do something by other systems (a.k.a. receive a command) but do not let those systems read the status of the devices, for example, whether the blinds are open or closed. This limitation reduced their usability significantly and required workarounds.
For example, I created two virtual switches per blind, one to open it and one to close it. The SmartThings automation for blinds would check when it last asked a blind to close (or open) and wait at least 20 minutes before asking the same blind again. If the automation could instead read the status of the blind, it would avoid these repetitive actions. Sometimes blinds open automatically when I want them closed, so I have to command them to close every few minutes; with a bidirectional flow of information, the automation would know that the blinds are open despite it having closed them recently, assume I want them open, and pause itself for some time.
So, my learning was that bidirectional information has significant benefits in making a system simple and reliable. This is similar to how my product team has launched APIs in the past as a set - create, read, and update together, even if delete comes later. I have also done the opposite once, when my team launched an MVP that could only write data, not update it. The former approach worked much better and made initial adoption easier, without punishing early adopters by making them build workarounds.
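To make that workaround concrete, here is a minimal sketch in Python, assuming hypothetical send_command and get_state functions (neither Somfy nor SmartThings exposes exactly these), contrasting the write-only cooldown hack with what bidirectional information would allow:

```python
import time

COOLDOWN_SECONDS = 20 * 60  # wait 20 minutes between repeated commands
last_command_at = {}        # blind name -> timestamp of last command


def close_blind_write_only(blind, send_command):
    """Write-only workaround: we cannot read the blind's state, so we
    rate-limit ourselves and hope the command was not redundant."""
    now = time.time()
    if now - last_command_at.get(blind, 0) < COOLDOWN_SECONDS:
        return  # assume the earlier command worked; do nothing
    send_command(blind, "close")
    last_command_at[blind] = now


def close_blind_bidirectional(blind, send_command, get_state):
    """With read access, the logic collapses: only act when needed, and
    back off if the blind is open despite a recent close (the user
    probably opened it on purpose)."""
    if get_state(blind) == "closed":
        return
    recently_closed = time.time() - last_command_at.get(blind, 0) < COOLDOWN_SECONDS
    if recently_closed:
        return  # it was reopened after we closed it; respect the override
    send_command(blind, "close")
    last_command_at[blind] = time.time()
```

The second function needs no guesswork: it reads the state, acts only when needed, and can tell the difference between "my command failed" and "someone overrode me".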
2) Logs as anecdotes.
I found the historical record of tasks performed by a device or system, a.k.a. system logs, to be crucial for understanding unexpected behaviors. Logs reminded me of user survey data - a few anecdotes can tell me if something in the product is horrible, even if they will not tell me what is good or great.
For example, I wanted my ground-floor living room curtains to close automatically if I am upstairs on weekdays, because I don't want passers-by to be able to see inside my home when it looks deserted. On the other hand, I want the curtains to open if I come downstairs for lunch or snacks. Although the curtains closed reliably, they did not always open when I expected them to, so I looked at the historical logs of each sensor (the inputs) and of the app that controlled the curtains (the output) to find the issues and make changes.
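To give a flavor of that debugging session, here is a small sketch, assuming the event history has been exported as a list of dicts with hypothetical field names, that lines up motion events against curtain commands to find the occasions where motion was not followed by an "open" command:

```python
from datetime import timedelta

# Hypothetical exported event history, e.g.
# {"time": <datetime>, "source": "living_room_motion", "event": "motion"}
# {"time": <datetime>, "source": "curtain_app", "event": "open"}


def find_missed_opens(events, window=timedelta(minutes=2)):
    """Return motion events that were not followed by a curtain 'open'
    command within the given window."""
    motions = [e for e in events if e["source"] == "living_room_motion"]
    opens = [e for e in events
             if e["source"] == "curtain_app" and e["event"] == "open"]
    missed = []
    for m in motions:
        followed = any(m["time"] <= o["time"] <= m["time"] + window for o in opens)
        if not followed:
            missed.append(m)
    return missed
```

A handful of such anecdotes was enough to point at the sensor whose events never reached the curtain app.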
3) A unified view is less important than capabilities.
Across multiple roles I’ve been in, I’ve seen companies try to be a great unified solution so that they can cross-sell and have tighter lock-in with customers. However, when your customers are themselves tech companies, many of them want the best capability for each product, a.k.a. only the best-in-class solution, instead of a good-enough solution from another vendor they already use. I found the selection of smart home architectural components to be a similar case.
For example, Philips Hue can control lights and trigger light automations from motion sensors. Although Hue’s reliability and set of presets are best in class, its automation capability is basic and just “good enough”. So, I connect Philips Hue to Samsung SmartThings and code the automations in SmartThings. However, SmartThings is also not sufficient for complex automations, such as gradually brightening the lights in a room at the end of my toddler’s nap, which happens at a different time each day and is tracked with a button press; for those I use WebCore for SmartThings, as sketched below. This kind of mix-and-match does create overhead in maintaining multiple pieces of software, but the benefits outweigh the costs.
Another example is Google Home’s and Amazon Alexa’s attempt at a single-pane-of-glass view of all my devices - there is just so much information that it is useless. Here is a screenshot from Google Home of just one of the 11 rooms in my home: a little bit of information about a lot of devices. The “show everything” approach means it takes a long time to click through, and yet the small amount of information per device means I still cannot understand a situation.
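The actual nap-time piston lives in WebCore, but the idea translates to roughly this Python sketch, with a hypothetical set_brightness function and an assumed nap length, under one plausible reading where the button press marks the start of the nap:

```python
import time

RAMP_MINUTES = 15      # how long the gradual brightening takes
STEPS = 15             # one brightness step per minute
NAP_LENGTH_MIN = 90    # assumed typical nap length, not a measured value


def on_nap_button_pressed(set_brightness, sleep=time.sleep):
    """Triggered by the button press that marks the start of the nap.
    Waits out the nap, then ramps the room's lights from dim to 100%
    in small steps instead of snapping them on."""
    sleep(NAP_LENGTH_MIN * 60)  # wait until the expected end of the nap
    for step in range(1, STEPS + 1):
        set_brightness(int(100 * step / STEPS))
        sleep(RAMP_MINUTES * 60 / STEPS)
```

Expressing this as a basic "if motion then light on" rule in Hue or stock SmartThings is not really possible, which is why the best-in-class automation layer won out over the unified one.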
4) Performance monitoring helps at the limits.
Just as logs are helpful for looking at anecdotal issues, I found performance monitoring useful for understanding patterns of issues. This is similar to how tech teams build DataDog dashboards, PagerDuty alerts, and so on to look at aggregate usage data of their systems, a.k.a. “measure what matters”.
For example, my automation to remind me at night to turn on my security system would never run. I wanted it to be triggered when there is movement in the bedroom but no movement in the kitchen, office, or living room, and it is night; it would then remind us on the bedroom smart speaker to turn on the security system (ideally this could be done automatically using Tasker on Android). Reviewing the logs did not resolve my confusion, but the performance statistics showed that the automation was taking so long to process that it was timing out. Simplifying it to use fewer sensor inputs stopped the timeouts.
Another example is when I wanted all the lights in our home to loop through colors every second on Halloween night, but instead every automation stopped working. The statistics showed that hundreds of commands every second were overloading the network, and the hub was so busy with this one task that it couldn’t process any other automation. Simplifying the automation to change the lights’ colors only on a doorbell press kept the hub from being overloaded.
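SmartThings’ own statistics surfaced this for me, so the sketch below is not its implementation; it just illustrates the general technique of timing each automation run and warning when it gets close to an execution budget (the 20-second budget is an assumption, not a documented SmartThings limit):

```python
import time
from collections import defaultdict

TIMEOUT_BUDGET_SECONDS = 20.0  # assumed execution budget, not an official limit
durations = defaultdict(list)  # automation name -> list of run durations


def run_with_timing(name, automation, *args, **kwargs):
    """Run an automation, record how long it took, and warn when it gets
    close to the budget (so it can be simplified before it starts timing out)."""
    start = time.monotonic()
    try:
        return automation(*args, **kwargs)
    finally:
        elapsed = time.monotonic() - start
        durations[name].append(elapsed)
        if elapsed > 0.8 * TIMEOUT_BUDGET_SECONDS:
            print(f"WARNING: {name} took {elapsed:.1f}s, close to the budget")
```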
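The underlying idea behind the Halloween fix can be sketched as throttling: keep the number of commands per second under a budget rather than commanding every light at once. The send function and the per-second budget below are hypothetical; my real fix was simply to trigger the color change only on a doorbell press.

```python
import itertools
import time

COLORS = ["red", "orange", "purple", "green"]
MAX_COMMANDS_PER_SECOND = 5  # assumed budget the hub and network can absorb


def spooky_loop(lights, send, duration_seconds=30):
    """Cycle lights through colors without flooding the hub: instead of
    commanding every light every second, rotate through the lights a few
    at a time while staying under the command budget."""
    color_cycle = itertools.cycle(COLORS)
    light_cycle = itertools.cycle(lights)
    end = time.time() + duration_seconds
    while time.time() < end:
        color = next(color_cycle)
        for _ in range(MAX_COMMANDS_PER_SECOND):
            send(next(light_cycle), color)  # hypothetical hub command
        time.sleep(1)
```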
5) It makes sense to keep tech debt.
I have seen the tech debt dilemma in every team I’ve been on, both as a product manager and as an engineer. Product teams are unhappy about tech debt and want to fix it, but fixing it takes a long time, and by the time you would get ROI from it, the new tech would already be outdated. I found a similar dilemma when architecting the smart home.
For example, I had built my automations in SmartThings before I learnt to do so in WebCore. The latter makes it easier to debug automations and reuse code. It also allows backing up automations, which helps reduce the impact of a black swan event when you have 200+ automations. But is it worth my time to read each automation in SmartThings and rewrite it in WebCore? It remains a low-priority to-do on my list that I keep ignoring. Tech debt probably relates to preparedness for black swans - you cannot predict when one will come or how big it will be, but you can prepare to reduce its negative impact. In that way, the impact of tech debt cannot be measured, so it ends up being underestimated and ranked lower than new features.
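On the backup point, here is a rough sketch of the idea, assuming the automation definitions can be exported as plain dictionaries (WebCore’s real backup feature works differently): dump everything to a dated file so a hub failure does not mean recreating 200+ automations from memory.

```python
import json
from datetime import date


def backup_automations(automations, path_prefix="automation-backup"):
    """Write all automation definitions to a dated JSON file.
    `automations` is assumed to be a list of plain dicts, e.g.
    [{"name": "close blinds", "trigger": "sunset", "action": "close"}]."""
    filename = f"{path_prefix}-{date.today().isoformat()}.json"
    with open(filename, "w") as f:
        json.dump(automations, f, indent=2)
    return filename
```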
More on smart home
I intend to document a day in the life of my smart home and illustrate its architecture in a later article.
What are your thoughts? Which of the usual/best practices I shared above feels the most controversial or the most enlightening to you?