In the biological sense, a bug is an insect of the order Hemiptera with unsightly, piercing mouthparts, but to a data engineer it is a niggling nightmare. Jacqueline Bilston, a Software Developer at Yelp and a native Calgarian, represented the city when she presented at the Databricks Data + AI Summit in June. Her topic is timely: complex Big Data pipelines now move large volumes of data at high velocity, and numerous downstream reports, machine learning models, and other pipelines depend on them. So how do you efficiently test ETL pipelines to ensure the robustness of your data system? In her presentation, Jacqueline details how a bug found in her company’s data infrastructure resulted in a 5-day investigation, a 3-day bug fix, and a 7-day historical data fix. In debriefing the situation, Jacqueline had an epiphany that drastically changed how she approaches testing ETL pipelines, and she wants to share this lesson with all data engineers.
How is it possible that a pipeline with extensive unit tests and 100% coverage can allow this to happen? A unit test has three components: Arrange, Act, and Assert. In theory, each unit test should test only a single piece of logic, but in her case the unit test she was working with tested all of the logic in her pipeline. This meant that when it failed, it was difficult to tell why: was it because the new code she'd added was working and legitimately changing the output, or had she introduced a different bug by changing something accidentally?
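As a rough illustration of the Arrange, Act, Assert structure applied to a single behaviour, here is a minimal Python sketch. The pipeline step `drop_cancelled_orders` and its data are hypothetical, invented for this example, and are not taken from the presentation.

```python
# Hypothetical pipeline step, invented for illustration; not from the talk.
def drop_cancelled_orders(orders):
    """Keep only the orders whose status is not 'cancelled'."""
    return [order for order in orders if order["status"] != "cancelled"]


def test_drop_cancelled_orders_removes_cancelled_rows():
    # Arrange: the smallest input that exercises this one behaviour.
    orders = [
        {"order_id": 1, "status": "complete"},
        {"order_id": 2, "status": "cancelled"},
    ]

    # Act: call only the step under test, not the whole pipeline.
    result = drop_cancelled_orders(orders)

    # Assert: check the single behaviour named in the test.
    assert result == [{"order_id": 1, "status": "complete"}]


# Run the test directly; a test runner such as pytest would also discover it.
test_drop_cancelled_orders_removes_cancelled_rows()
```

Because the test covers one step and one behaviour, a failure points straight at that step rather than at the pipeline as a whole.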
With nowhere left to turn, Jacqueline reached out to a mentor of hers for guidance. Two things that her mentor said immediately gave her fresh insight into the problem and how to go about solving it. First, he questioned the logic being tested, and Jacqueline realized she was testing more than one behaviour in her test. Second, he mentioned the idea of a Saff Squeeze, whose name evokes the football image of a ball carrier hit by two opposing players, one high and one low. The technique was described by JUnit co-creator Kent Beck, who named it after David Saff; it is the act of inlining and deleting parts of a failing test until the problem presents itself. As presented, you move the assert statement up from the bottom of the test (hit them low), deleting the lines that still pass, until you reach a line that no longer passes; that line is the culprit. Then you delete unused inputs from top to bottom (hit them high) until the test is nice, simple and clean. It was through the application of this method that Jacqueline finally found the root cause of this bug: an inner join that should have been a left join. One simple line of code caused eight hair-pulling days and much anxiety, so this lesson highlights the importance of both having unit tests and keeping them simple.
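To make the inner-versus-left-join distinction concrete, here is a minimal sketch in plain Python of what a squeezed test for that kind of bug might look like. The tables (users and review counts), the helper `join_users_to_reviews`, and its `how` parameter are all assumptions made for illustration; the presentation summary does not say which datasets were joined.

```python
# Hypothetical data and helper names, invented for illustration only.
def join_users_to_reviews(users, reviews, how):
    """Join user rows to review counts on user_id, as an inner or left join."""
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")
    counts = {row["user_id"]: row["review_count"] for row in reviews}
    joined = []
    for user in users:
        if user["user_id"] in counts:
            joined.append({**user, "review_count": counts[user["user_id"]]})
        elif how == "left":
            # A left join keeps the unmatched user, with a null on the right side.
            joined.append({**user, "review_count": None})
        # An inner join silently drops the unmatched user.
    return joined


def test_user_without_reviews_is_kept():
    # After squeezing: one unmatched input row and one assertion remain.
    users = [{"user_id": 7}]
    reviews = []

    result = join_users_to_reviews(users, reviews, how="left")

    assert result == [{"user_id": 7, "review_count": None}]


test_user_without_reviews_is_kept()
```

With `how="inner"` the same input yields an empty result, so this small test fails immediately and for exactly one reason, which is the property the squeeze is meant to produce.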
The overriding message from the presentation is that targeting 100% coverage can provide false assurance. Packing several pieces of logic into one unit test to check multiple behaviours may seem efficient when writing the test, but when a bug hides inside a labyrinth of inputs and outputs, you would gladly trade the time saved by writing one all-encompassing test for the time saved by finding the bug quickly. Decades after the US Navy coined the design principle KISS (Keep it simple, silly!), it is as relevant today as it ever was.