I was the victim of an email today that was basically an invoice that was wrong. A company that has sent me a wrong invoice in the past has sent me it again another 3 times now, with increasing levels of angry demands and warnings about credit ratings. This raised my heart rate and anger level to extremely high distressing levels, but this is not a rant about how some companies are just awful, that would be too simple. This is more of an investigation into why stuff like that happens, and how software engineering can help you learn how to not fuck up like that.
I have been coding for 44 years now, since age 11. In all that time I’ve only coded in BASIC, C, C++ and some PHP, but 90% of it has been C++, so that’s the one I’m really good at. I don’t even use all of C++, so lets just say when it comes to the bits I DO use, I’m very experienced. This is why I’m using the term software engineering, and not coding or hacking, which implies a sort of amateurish copy-pasting-from-the-internet style of development. I’ve both worked on large game projects, and coded relatively large projects entirely from scratch, including patching, re-factoring and a lot of debugging. I coded my own neural networks from scratch before they were really a popular thing. Anyway…
Take the simple example of my experience today. An invoice is wrong, and sent to a customer anyway. This has happened three times now, despite the customer replying in detail, with sources and annotations and other people copied in, AND including replies from other people at the offending company who openly admit its wrong and will be fixed. Clearly this is a smorgasbord of incompetence, and irritating. But why do companies do this? what actually goes wrong? I think I know the answer: Once a problem like this is finally flagged up (which often involves threats from someone like me to visit the CEOs home address at 3AM to shout at them), the company ‘fixes’ the mistake, maybe apologizes, and everyone gets on with their life…
This is the mistake.
Companies want to ‘fix’ the mistake in the quickest, easiest and cheapest way possible. If an invoice was accidentally doubled, then they ‘go into the system’ (they just mean database) and they halve it. Problem solved, customer happy. Now I don’t give a damn about the sorry state of this company’s awful internal processes, but I am aware of the fact that companies like this try to fix things in this way, rather than the true software engineering way. So how SHOULD it be done?
Before I was a full time computer programmer I worked in IT. I had a bunch of jobs, gradually going up in seniority. I saw everything from ‘patch it and who cares’ mentality (consumer PC sales) to ‘The absolute core problem must be fixed, verified, documented and reported on with regards to how it could have happened’ (Stock trading floor software). I can imagine that most managers think the first approach is faster, cheaper, better, and the second is only needed for financial systems or healthcare or weapons systems, but this is just flat out wrong.
When a mistake happens, fixing it forever, at a fundamental level, will save way, way more time in the lifetime of the company than patching over a bug.
Spending a lot of time in IT support taught me that there are many more stages to actually finding the cause of a problem than you might think. Here is a breakdown of how I ended up thinking about this stuff:
Stage 1: Work out what has actually happened. This is where almost everyone gets it wrong. For example, in my case I got an email for a £1k bill. This was not correct. Thats the error. Thats ALL WE KNOW RIGHT NOW. You might want to leap immediately to ‘the customer was charged too much’ or ‘the customers invoice was doubled’ but we do not know. Right now we only know the contents of the email are wrong. Is that a problem with the email generation code? Does the email match the amount in the credit control database? Until you check, you are just guessing. You might spend hours trying to fix a database calculation error you cannot find until finally realizing the email writing code is borked…
Stage 2: Draw a box around the problem. This was so helpful in IT. Even when you know what you know, you need to be able to define the scope of the problem. Again, right now all we know is the customer ‘cliff’ got a wrong invoice. Is that the problem? Or is it that every invoice we send out is wrong? or was it only invoices on a certain date? or for a certain product? Until we check a bunch of other emails we do not even know the scope of the problem. If its systemic, then fixing it for cliff is pointless.
Stage 3: Where did things go wrong. You need to find the last moment when everything worked. Was the correct amount applied to the product in the database? was the correct quantity selected when entering the customer’s order? was the amount correct right up until the invoice was triggered? Unless you know WHERE something went wrong, you cannot work out WHAT or WHY.
Stage 4: Developer a theory that fits all available data. You find a line of code that seems to explain the incorrect data. This needs to not only explain why it screwed things up for cliff, but also explain why the invoice for dave was actually ok.
Stage 5: Test a fix: You can now fix the problem. Hurrah. Change the code or the process and run it again. Is everything ok? If so you may be preparing a victory lap but this is laughably premature.
Stage 6: Reset back to the failed state: Now undo your fix and run it again. Seems redundant doesn’t it? Experience has taught me this is vital. If you REALLY have the fix, then undoing your fix should restore the problem. Maybe 1 in a 100 times it will not. The ‘random’ problem just fixed itself for some indeterminate period. Your fix was a red herring. A REAL fix is like a switch. Turn it on and its fixed, turn it off and its broken. Verify this.
Stage 7: Post Mortem: This is where you work out HOW this could ever have happened. This is actually by far the most important stage of the whole process. If you just fix some bad code, then you have just fixed one instance in one program. The coder who made that mistake will make it again, and again and again. The REAL fix is to make it IMPOSSIBLE to ever have that error again. This takes time, and analysis. The solution may be better training, or it may mean changing an API to make it impossible to process bad data. It might mean firing someone incompetent. It might mean adding a QA layer. Whatever the mistake was, if you do not go through this stage, you failed at the task of fixing the problem.
Coders and organizations that work like this have fewer bugs, fewer mistakes. They need smaller QA teams, and less time devoted to fixing mistakes and implementing patches. Their complaints department is tiny, and yet still provides great communication, because there are hardly any complaints. In short, companies that run like this are awesome, popular, profitable, and great places to work. Why is everywhere not like this? Because it requires some things:
- A culture of giving people fixing problems free reign to follow the problem everywhere. If only the department that sends the emails is allowed to get involved in complaints about email, database errors or coding errors will never get fixed because they CAN never be fixed. The phrase ‘not my department’ needs to be banned.
- A culture of accepting that someone is STILL working on a proper fix, even after the complaint has been handled and the customer is happy. When there is a bug, there are TWO problems to fix: the customers bad experience AND the company process failure that allowed this. Managers who pull people away from bug-fix post-mortems to fix the next thing right now are a curse.
Of course your mileage may vary. Be aware I am a hypomanic workaholic who works for himself so my ‘advice’ may not be applicable to everybody, but I hope this was interesting and thought provoking anyway. Too many coders are just copy-pasting from stack overflow or asking chatgpt to quickly patch their bad code. Working out how to fix things properly is a worthwhile pursuit.