My database has bugs — now what?

Wed 21 November 2018, Raphael ‘kena’ Poss

Upon our arrival in New Orleans on November 5th, 2018, my friend Nathan explained to me, with a mix of apparent excitement and apprehension: “our internal Jepsen test suite found a real consistency bug.”

I replied: “This is great! This confirms yet again that good testing is actually useful! Have you found the cause already?”

A discussion ensued about the investigation (the bug has been fixed since). However, in the course of the conversation, Nathan did ask: “don’t you feel bad that there is a real consistency bug?”

And that, dear reader, is the topic for today: my database has bugs — now what?

❦❦❦

This blog post will not cover the topic of “what to do” to address correctness bugs from the database developer’s perspective — for every database product, there are programmers, architects and product managers who apply expertise, business acumen and customer/user sensitivity to decide a good course of action. These are interesting things to understand, but I don’t want to focus on them today.

Instead, I want to come back to “don’t you feel bad.” This was really a surprising question to me, and the subsequent conversations I had with other colleagues revealed that there is something at play here that deserves attention.

Of death, taxes and bugs

To start, a platitude: all software has bugs. Software is built by people, people make mistakes, so software contains bugs.

Also, a small dirty secret of computer architecture science: all hardware has bugs. Hardware is designed by people, people make mistakes, yada, yada. Also, well-functioning hardware deteriorates into wrongly-functioning hardware before it deteriorates further into non-functioning hardware.

Therefore any storage system will contain bugs. A bug in a storage system can have all kinds of consequences. It can even cause data to not be stored properly. This is just life.

Bugs in computer things are like deaths and taxes. There is, philosophically, no point in hoping they did not exist.

❦❦❦

Systems evolve over time. Hardware and software must be upgraded. New features must be added, bugs must be fixed. In any case, evolution means new software and hardware things to replace old things. New things have bugs.

Therefore, every system that evolves over time will encounter a steady stream of new bugs.

There is no such thing as “asymptotically fixing all the bugs.” [1]

Therefore, an evolving database engine (any storage system really) is bound to contain new bugs regularly, even with arbitrarily many people and infinite programming speed to fix bugs.

How should we all feel about that?

[1] Some software like TeX claims to contain no bugs. This does not contradict my point; this type of software does not evolve any more. In that case, asymptotic freedom from bugs becomes a practical possibility.

❦❦❦

“An evolving system is bound to contain new bugs regularly, even with infinite programming speed to fix bugs.” How can this be true? If programmers can fix bugs instantaneously, how can I also say there are always bugs?

That’s because there is always some time between the moment a bug starts to exist (when mistakes are made) and when it is discovered. No amount of people and speed will make that time disappear.

Testing can reduce it greatly, though.

Investing in testing is therefore a very good way to make bugs disappear nearly as fast as they appear.
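The reasoning above can be sketched with a toy model. It is an illustration under stated assumptions, not a measurement: bugs appear at a steady rate, each one stays hidden for some average delay, and it is fixed instantly once discovered. Under those assumptions, Little's law from queueing theory gives the average number of bugs present at any time as the rate multiplied by the delay. The function name and all numbers below are hypothetical.

```python
def expected_latent_bugs(bugs_introduced_per_week: float,
                         mean_weeks_to_discovery: float) -> float:
    """Average number of undiscovered bugs in steady state.

    Little's law: items in the system = arrival rate x mean time in system.
    Fixing is assumed instantaneous, so the only time a bug spends
    "in the system" is the delay before it is discovered.
    """
    return bugs_introduced_per_week * mean_weeks_to_discovery


# Hypothetical figures: 5 new bugs per week in both scenarios.
without_tests = expected_latent_bugs(5.0, 8.0)  # bugs found after ~8 weeks
with_tests = expected_latent_bugs(5.0, 0.5)     # bugs found after ~half a week

print(without_tests, with_tests)
```

Better testing shrinks the second factor, so the count drops sharply, but it only reaches zero if the discovery delay itself is zero.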

However…

Testing is made of software and hardware bits too. So tests have bugs.

A buggy test can fail to find a bug in the thing that it is testing. So the main thing will always continue to contain bugs that tests do not catch (because tests will always contain bugs too).

We really can’t get rid of all bugs, even with stunningly many programmers, stunning programming speed and stunningly many tests. As long as people make mistakes, systems will contain bugs.

❦❦❦

Bugs come from mistakes? So we can make bugs disappear by changing people to make fewer mistakes?

Well, there is yet another thing as inescapable as death, taxes and bugs: we cannot change people.

Also, even for those few individuals with the combined ability and will to change towards making fewer mistakes, this process takes time.

Meanwhile, people get older, new people are born, so teams of programmers evolve too. An evolving system will, over time, see new contributors. These contributors will make new mistakes even if older contributors no longer do (or make fewer).

So, more / new mistakes will be made. New bugs will continue to appear.

Of bugs, and consequences

If I have not convinced you, dear reader, that the pursuit of “no bugs” is pointless, let me then kindly request that you suspend your disbelief, or perhaps simply humor me. What if? What if indeed systems are bound to contain bugs no matter what?

In that world, a storage system, a database engine perhaps, can eat your data because of a bug. Maybe it will eat the last picture you had of your late spouse. Maybe it will eat your monies. Or it may tell lies to authorities and get you in trouble.

You may, as a user of said system, feel angry at the problem, or perhaps just frustrated about the inconvenience.

Meanwhile, would the database programmer feel bad about their mistakes?

❦❦❦

Let us become technical for a moment. Structured storage systems have existed since, like, forever. There are databases, but not that many people use databases, or realize they do. What everyone does understand well, however, is file systems.

You can copy a picture to a folder of some hard drive on your computer. Two months later, you can try to show the picture to a friend, and the computer can say “no.”

This happens all the damn time and is called filesystem corruption—or, in human words, “computer ate my pictures.”

Sometimes it is a bug in the file system software. Sometimes it is a bug in the underlying storage. Regardless; from the perspective of the end user, problems with storage are just facts of life.

Painful and expensive facts of life.

Therefore, we could be having a conversation about whether the experience of pain and cost by the end users of storage systems would be mirrored by “bad feelings” from the programmers who built the things that ate the data, when the data loss is the result of a mistake.

From a primal “eye for eye, tooth for tooth” perspective, it is tempting to “make the developer feel the pain.”

However, I shall assume henceforth that people are civil, and that end-users of storage systems do not actively wish programmers to feel bad about their mistakes. In fact, based on experience alone, I have never encountered this situation. If anything, users can and do sometimes wish ill feelings upon the seller (or owner, in case of rented online storage) of storage systems, but this is almost always far removed from the programmers.

❦❦❦

So what are we really talking about when we ask ourselves “do programmers feel bad about bugs in their database engine?”

Psychologically, humans naturally [2] tend to detach their own feelings away from things they cannot change. Things like death, taxes and the presence of bugs. So if we are talking about “feeling bad about database bugs” we are really talking about feelings induced by others. Experiencing bad feelings in that context requires work — the programmer would have to choose to experience these feelings. They do not have to do that.

Or do they?

That is where the discussion becomes interesting.

Why would a programmer choose to feel bad about mistakes they made in the past?

Choose your adventure: did you react to reading the previous sentence with “I don’t have the impression I choose to feel bad! This is nonsense!” In that case, you can skip directly to the end of this post.

If you buy into the idea of choice, read on.

[2] or, with help from therapy, eventually

Segue

(This section is a mere aside. Feel free to skip to the next section if you are pressed for time or driven by curiosity.)

One thing we can briefly consider is a discussion of remediable vs. non-remediable problems. If a user has back-ups of their data, data loss is remediable by restoring data. If a user is facing mismatched data (for example, one copy of the data tells a police department they committed a crime, and another copy of the data says they did not), this is remediable using due process. Then there are non-remediable situations: no back-ups, bad data turning into false positives in crime identification with no counter-evidence, etc.

Intuitively, there exists some scale of “good” to “bad” where a mistake that causes a remediable bad situation is slightly less bad than a mistake that causes an irremediable situation. Maybe there is even a scale in-between with “cost” or “risk” analysis to decide badness.

However, this is not directly relevant here. The question is really about whether a programmer would feel bad about past mistakes, without too much (or perhaps not any) knowledge of remediability for end-users.

Or perhaps you can humor me and consider the extrema on this scale: What if? What if all situations were somewhat irremediable? Would programmers choose to feel bad about their mistakes then? Or if all situations were always remediable at negligible cost?

These are not theoretical concerns.

Some storage systems exist to store data whose accuracy is not important. For example, the number of “thumbs down” clicks on published articles or videos. This data can, and often does, get lost because of bugs in database systems. Nobody cares. Programmers of said databases certainly do not choose to feel bad about those.

Meanwhile, in the army and in organizations that build medical equipment systems, mistakes/bugs that cause bad data can kill people. For example, the most egregious mistake in computer history cost the lives of dozens of radiology patients. Would the programmers who made that mistake choose to feel bad about it? In fact, they did not, for reasons also elucidated below.

Actively opting into bad feelings about mistakes

One concrete reason why a programmer could choose to feel bad about mistakes is the pursuit of wealth.

Namely, when the storage system is a commercial product, and programmers choose to receive participatory wealth as a share of increased (or continuing) business revenue around that product, they are really opting into a possible “bad feeling” of loss, or perhaps frustration, or perhaps anguish, or perhaps disappointment, when their mistakes cost money to the business and thus, ultimately, to themselves.

Yet, this is a choice. No one is forced to choose participatory profit sharing, and, more to the point, no one is forced to choose to let temporary setbacks in the pursuit of wealth influence one’s feelings. After all, job security is not at play here and neither are survival issues.

And then, there are open source projects, with contributors without participatory wealth sharing. The question about bad feelings applies to them too.

❦❦❦

What other ways then would a programmer choose to experience bad feelings about their past mistakes?

Maybe they could choose this out of active empathy for the pain and cost they imagine is incurred by end users. [3]

Yet, (most) programmers are also rational and would likely make their choice by balancing multiple factors against each other.

Empathy and working towards “being nice to users” by feeling bad about the consequences of mistakes is one thing, but then bad feelings also impair productivity. All the time spent feeling bad about mistakes is time not well spent actually fixing the mistakes or helping others deal with their consequences.

All I am saying here is that there is an opportunity cost paid by the choice to feel bad about past mistakes. If one chooses to feel bad about their past mistakes, they must be finding a way to recoup this opportunity cost.

[3] I am setting aside here the case where they personally witness the consequences of their mistakes. In that case, mirror neurons kick in (assuming absence of autism) and “bad feelings” occur without choice. In practice, however, programmers are rarely in a position to witness the consequences of their mistakes directly.

❦❦❦

For the sake of completeness, I must mention the obvious: maybe the programmer chooses to become legally and personally liable for the consequences of their mistakes.

This can happen if, for example, they did not actively choose to opt into a legal shield, either via employment or any other suitable instrument to limit their personal liability (incorporation, contracts, etc.). It can also happen if they choose to break laws.

In this case a mistake could naturally yield a feeling of fear—for one’s safety, freedom or both.

There is not much to say about this. Also, it’s exceedingly unlikely in this industry. I will instead focus on the more common situations.

❦❦❦

This is where a common, yet supremely flimsy productivity argument usually kicks in:

“Programmers choose to feel bad about past mistakes because the bad feeling will pressure them psychologically into addressing the mistakes or their consequences faster.”

Bzzzt. Wrong.

Shame, guilt, anguish, fear and the other bad feelings that naturally develop when one lets them develop are extremely disabling. They cause anxiety and depression. They objectively reduce productivity and the likelihood of a good outcome.

❦❦❦

An aside to give lip service to a minor argument I heard:

“Programmers choose to feel bad about past mistakes because otherwise they would be lazy and stop doing anything new as soon as they do better work than their competition.”

This is nonsensical to me.

First of all, feeling bad would kill the glow of pride for past successes. A programmer who wishes to remain lazy on top of the pride of past successes would not rationally choose to feel bad about anything.

Then, in the storage world, there are always new problems to solve. Even past successes would not be sufficient to stop the momentum of continued innovation.

❦❦❦

One more effective utility argument is one of social standing or reputation:

“Programmers choose to feel bad about past mistakes because displaying any other emotional reaction would be detrimental to their reputation.” (or that of their employers, and ultimately whatever business is made possible by the existence of the thing being built.)

I particularly like this view because it reveals three key insights into the situation:

  1. what really matters for social standing is not the actual feeling but what the outside world perceives. A programmer could safeguard their social standing, reputation, business, whatever, by merely displaying adequate empathy, but without choosing to actually feel bad about their mistakes.
  2. in practice however, it is also a true fact that many programmers are not extremely capable of displaying empathy successfully or convincingly. As a personal expert on the matter, I stand witness to the prevalence of autism in the industry. Therefore, safeguarding things will require either:
    • that the programmer remains hidden from scrutiny (this is what most organizations with proprietary products choose to do);
    • that the programmer actually experiences the feelings so that they can show they share the pain without using empathy (this is required in the cases where the programmer is exposed to public scrutiny).
  3. if, for unrelated reasons, the programmer chooses to work “in the open” (for example, as an open source project), they are choosing to become exposed to scrutiny. It entails that they must either become good at displaying adequate empathy, or opt into actively “feeling bad” about their mistakes.

So, here we are, at last with a candidate good answer:

“don’t you feel bad that there is a real consistency bug?”

“yes, I feel bad for our users! [unsaid: because they expect me to].”

I do not need to actually feel bad, I just “have to” say I do.

❦❦❦

But wait! There is more!

The argument above assumes that the display of empathy is the only way to safeguard social standing and reputation.

This, of course, is nonsensical. If I make a mistake, and you are experiencing the bad consequences of the mistake, you may have some social expectation of a display of empathy but it is not a strict requirement for you to continue to respect me. I can also “make things good” in other ways.

I could apologize. Or, I could engage with you to acknowledge your situation and dedicate time to help you with it. Or, I could distract you away by giving you something shiny (e.g., monetary compensation or feature candy) that offsets your (data) loss.

This is a serious proposal. All databases have bugs. Some commercially successful products offered by vendors with a name starting with I and rhyming with “toxines”, or whose name may be inspired from classical Greece, have bugs too and yet do not display any empathy when their customers experience the consequence of those bugs.

Instead, they offer SLAs (Service Level Agreements).

My database has bugs? I will help you and otherwise pay you to recoup your costs, within a contractual financial liability.

This is completely adequate to preserve social standing and reputation. In fact, serious users of storage systems expect all problems to be scrutinized through the lens of SLAs with no feelings involved.

❦❦❦

So, finally, here we are, at last with a better answer:

“don’t you feel bad that there is a real consistency bug?”

“no, absolutely not. Our users are paying us with the contractual knowledge that bugs will happen, and I am proud to support a business that is respectful of its contractual obligations without delving into irrational feeling-based decision making.”

Of free software or contract-less users

When users adopt a storage system for free, they often do so without contract. At best, they do so via a license that usually states:

… Licensor provides the Work […] on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.

In other words, “Sadly Outta Luck” in case of bugs.

Yet, these systems do have users, users who are aware of the existence of bugs. What is happening?

Are users hoping that programmers who make mistakes, and whose mistakes end up eating their data, will “feel bad” then knock at their door, profusely apologize, pay them for the inconvenience, and work days and nights to fix consequences?

Hah! Of course not.

❦❦❦

Let us be technical again for a moment. Let us recall the situation around new file systems. There are new file systems coming into existence regularly. This happens either because of the appearance of new operating systems, or because of existing OS builders deciding that they wish to provide a new file system, even when there are already perfectly fine file systems around.

Cue in ext2 (1993), NTFS (1993), XFS (1994), ReiserFS (2001), ZFS (2005), Btrfs (2009), ReFS (2012), NOVA (2016).

New file systems have bugs. So we have programmers who knowingly release systems with bugs and tell users to rely on them, when users may be happy already with a previous file system with (usually) fewer bugs. How is that even possible?

And there is something we know for sure is not involved with new file systems: bad feelings.

Ideally, users would make backups and there would not be anything to say about this further. However, users make mistakes and so backups are not perfect (or not always existent). So data still gets irremediably lost.

Do programmers of file systems feel bad about this? No!

Instead, new file systems come with recovery tools.

New file system comes in. Users come in for free (without contract). Data gets lost or bad. Users report problems. Respect is shown. Back-ups or recovery tools are involved. End of story.

The story is not so different with databases.

New database engine comes in. Users come in for free (without contract). Data gets lost or bad. Users report problems. Respect is shown. Back-ups or recovery tools are involved. End of story.

All that is needed is recovery tools and a support channel where respect is shown. No bad feelings are needed.

❦❦❦

The essence of the situation is this:

Free / contract-less access to complex systems (e.g. via free software) is shielded from legal liability but still happens under a social contract.

The programmer offers the output of their work for free. Users use the product of this work for free. It balances out:

| | For programmers | For users |
|---|---|---|
| Understanding | Programmers know they make mistakes | Users know their storage solution is bound to contain bugs |
| Gain in happy case | Programmer gets rich and famous, or at least the satisfaction their work is considered useful. | User’s data is stored reliably. |
| Bugs happen, what now? | Programmers show respect (no need for bad feelings). | Users expect respect. And recovery tools. |
| Bugs happen, what’s next? | Programmers analyze problem and organize solutions. | Users provide information about problem. |
| Bugs happen, what of reputation? | Programmers acknowledge in public the lessons they learned. (Transparency) | Users are grateful for respect shown, recovery tools, and the knowledge that solutions to bugs found by other users will be shared with everyone. |

And here we are. No bad feelings needed.

When bad feelings are happening already

On that fateful day, soon after Nathan asked me “whether I felt bad” about this newly discovered bug, my other friend Andrei shared with me:

“I don’t know bro, I actually feel kinda bad about it. It’s not great!”

A week or two afterwards, my friend Tobias reflected on the gap between the number of bugs and the number of programmers available to fix mistakes. He used the words “somber outlook,” which also somewhat reflects bad feelings.

What then?

❦❦❦

I feel sad for my friends (not the bugs).

I see them feeling bad for something they cannot change (the continued existence of bugs); I am sad because my friends should not have to feel bad about something out of their control.

❦❦❦

Also, as this blog post discusses above, these feelings are not useful: they impair productivity (bad feelings don’t make bugs disappear any faster, on the contrary), they incur an opportunity cost, they will leak via empathy to the rest of the programmer team, and all other kinds of bad things can happen as a result.

Observing bad feelings happen once as a result of mistakes and bugs was unfortunate and made me sad. I now worry that I will see more bad feelings develop in the future; after all, there will always be more bugs.

❦❦❦

I also felt angry, briefly. I started to feel angry when I realized why my friends are feeling bad about their past mistakes.

You see, it is all about expectations. As we discussed above, users know that systems come with bugs. It is unfortunate but there is precedent on how to deal with this without bad feelings.

So bad feelings come from mismanaged expectations on the provider side, not the user side.

In this particular case, my friends were originally given a mission by their leaders to “aim high,” which allows for mistakes without bad feelings, but also to “build to last” and later to “build stable and correct.”

One can aim high, make mistakes while aiming, and continue to aim while dealing with mistakes without bad feelings. The mission is not compromised.

However, if the mission is to “build to last” and the thing being built breaks, the mission has failed. If the mission is to “build stable and correct”, then the mission has failed as soon as there is at least one known stability or correctness bug.

People on a mission will feel bad about failing their mission — that is unavoidable.

However, the mission statement is not received from a deity in the sky. The mission statement is a choice by some leader.

So here my friends are feeling bad because some leader has chosen a mission statement that is bound to make them feel bad, unavoidably.

And that, dear reader, made me angry.

But then, soon afterwards, I was not angry any more.

❦❦❦

Leaders are people too. People make mistakes. A bad mission statement is just that. A mistake.

Once the mistake is known, we can work to fix it, again without bad feelings.

❦❦❦

All of this comes with the understanding that, like any storage system, a database engine must provide recovery tools. Otherwise it breaches the social contract, and then there is a good reason to feel bad.

Summary; What’s next?

My database has bugs. What’s next?

From my point of view:

  • bugs are just past mistakes.
  • bad feelings about mistakes are bad for productivity and for morale.
  • bad feelings about mistakes are not useful.
  • not everyone learns from their mistakes at the same rate. The same mistakes can be made repeatedly by different programmers without it being anyone’s fault.
  • users want relief, not bad feelings — also, respect shown, recovery tools and compensation work just as well as (and sometimes better than) just fixing bugs.
  • if I set business goals as the pursuit of fewer mistakes (e.g. “build to last” or “build a correct and stable product”) I am likely to tie the presence of mistakes / bugs with bad feelings. Because bugs are unavoidable, I would be creating bad feelings as a business requirement. Not good.

Instead, I aim to:

  • tie business goals to the relief of consequences and cost thereof, not the avoidance (or fixing) of mistakes. In human language, “make programmers think about helping users before they think about fixing and avoiding bugs.” [4]
  • not let programmers feel bad about mistakes (and bugs).
  • celebrate mistakes and their discovery as learning experiences.
  • explain to users how I deal with mistakes when I become aware of them.
  • treat bad feelings as a symptom of insufficient liability protection by SLAs, or of a bad mission statement.
  • build recovery tools with my database engine.

[4] Of course, there is a balance. The cost of fixing and avoiding bugs can be smaller than the cost of relief of consequences. However, this is a much more interesting conversation to have than how much one should feel bad about mistakes. And in any case, build recovery tools.

