TransWikia.com

What does the crash early concept mean?

Software Engineering Asked by Hawk on December 21, 2020

While reading The Pragmatic Programmer (2nd edition), I came across Tip 38: Crash Early. As far as I understand it, the author advises against catching exceptions and recommends letting the program crash instead. He goes on to say:

One of the benefits of detecting problems as soon as you can is that
you can crash earlier, and crashing is often the best thing you can do.
The alternative may be to continue, writing corrupted data to some
vital database or commanding the washing machine into its twentieth
consecutive spin cycle.

Later he says:

In these environments, programs are designed to fail, but that failure
is managed with supervisors. A supervisor is responsible for running
code and knows what to do in case the code fails, which could include
cleaning up after it, restarting it, and so on.

I am struggling to translate that into real code. What could the supervisor be that the author is referring to? In Java, I am used to writing a lot of try/catch blocks. Do I need to stop doing that? And replace them with what? Do I simply let the program restart every time there is an exception?

Here is the example the author used (Elixir):

try do
  add_score_to_board(score);
rescue
  InvalidScore
  Logger.error("Can't add invalid score. Exiting");
  raise
rescue
  BoardServerDown
  Logger.error("Can't add score: Board is down. Exiting");
  raise
rescue
  StaleTransaction
  Logger.error("Can't add score: stale transaction. Exiting");
  raise
end

This is how Pragmatic Programmers would write this:

add_score_to_board(score);

10 Answers

In Java, almost two decades ago, this concept was called "fail fast" rather than "crash early". The main issue back then was that most programs were servers handling multiple requests, and a server could not stop because of a problem raised while processing a single request. People soon found out that when a program merely writes a few log lines whenever an exception is thrown, those lines usually end up buried under a huge amount of trivial reporting, and troubleshooting turns into painful digging through piles of log files. The most pragmatic approach was to send error reports to external applications and, at the same time, to use assertions, which are checks that are disabled by default unless the JVM is started with the enable-assertions flag (-ea). The purpose of assertions was to stop processing quickly in test environments so that people could spot possible issues immediately, hence the term "fail fast" as opposed to "resilient behaviour".
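
As a minimal sketch of the assertion mechanism described above (the Account class and its withdraw method are made-up names for illustration), a Java assert statement is skipped unless the JVM is started with -ea, in which case a failed check throws an AssertionError and stops processing immediately:

```java
final class Account {
    // Hypothetical example: returns the balance left after a withdrawal.
    static int withdraw(int balance, int amount) {
        int newBalance = balance - amount;
        // Evaluated only when assertions are enabled (java -ea ...).
        // With assertions disabled, which is the default, this line is skipped.
        assert newBalance >= 0 : "balance went negative: " + newBalance;
        return newBalance;
    }
}
```

Run with -ea in a test environment, an overdraft would fail right at this line rather than surfacing later as a puzzling log entry.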

Unfortunately, later on JUnit reused the name assert, which created a lot of confusion, but that is another issue.

With the development of enterprise Java and server programs spawning multiple instances to process client requests, the need to keep working when an exception is raised lessened. People began saying that if you don't know what to do in a catch block, you had better not wrap the code in a try/catch that would just hide the exception from the caller. This does not mean that you should not use try/catch at all. Exceptions should always be handled as well as possible; you can ignore them only if you are sure that, one way or another, they will be handled at a higher level in the call stack.

Lately, the idea of spawning a child for each request has been expanded by the advocates of reactive programming, with the goal of obtaining fail-fast and resilient behaviour at the same time, provided you have a framework able to supervise, monitor, and automatically handle failed requests. Hence the idea of the supervisor, taken from the Actor Model. Or, better said, the supervisor is a role in the Actor Model as it has been outlined in functional programming; see, for example, the Akka framework or the Actor model in Scala.

Answered by FluidCode on December 21, 2020

What could be the supervisor the author is referring to?

This could be many different things, depending on the platform and environment in which the program is running.

One possibility is daemontools.  This is a Unix-based program which takes charge of starting and stopping programs under its control, including restarting them if they stop unexpectedly.  I've successfully used it in production environments.

(I think it's largely been superseded by systemd, which does some of the same things.)

Note that it can only do this if the entire program shuts down — if one thread crashes but others continue, it has no way to tell.

(And, as mentioned in other answers, that would leave your program in an inconsistent state.  For example, consider a simple program with one thread reading from a data feed, and another writing it to a DB.  If the reader thread crashed, the writer might happily continue indefinitely, with no data to write.  However, if the writer thread crashed, the reader could only continue until the memory was full of data waiting to be written — which might take a long time.  Either way, it could miss many minutes or hours of data from the feed.  Whereas if the whole program shut down, it would miss data for only the few seconds it took to notice the problem, shut down, and be restarted by the supervisor.)

So if your program uses multiple threads (directly or not), it's not safe just to let an exception go uncaught; for safety, you may need to set up a global uncaught-exception handler which does an explicit shutdown of the entire program.  (Such a handler needs to be very carefully written, in case it throws an exception…  Out-of-memory conditions are especially tricky to handle reliably.)
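
A minimal sketch of such a global handler in Java, assuming the process runs under an external supervisor that restarts it. The class name CrashWholeProcess and the injectable exit parameter are my own choices for illustration; Thread.setDefaultUncaughtExceptionHandler is the standard JDK hook:

```java
import java.util.function.IntConsumer;

final class CrashWholeProcess {
    // Installs a JVM-wide handler: when any thread dies from an uncaught
    // exception, the whole process is taken down so that an external
    // supervisor can restart it from a clean state.
    // The exit action is a parameter so it can be swapped out in tests.
    static void install(IntConsumer exit) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            try {
                System.err.println("Fatal: uncaught exception in thread "
                        + thread.getName() + ": " + error);
            } finally {
                // Still exit if the logging above itself fails.
                exit.accept(1);
            }
        });
    }
}
```

At startup you would call something like CrashWholeProcess.install(Runtime.getRuntime()::halt). Note that halt terminates immediately without running shutdown hooks, whereas System::exit runs them first; which one is appropriate depends on how much you trust the program's state at that point.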

Answered by gidds on December 21, 2020

How about a physical analogy? Your boss instructs you to organize and file some boxes of paperwork, and tells you to do exactly as you're told and not to bother her for any reason until the job is done. During the process of filing, you:

  • Discover the office hallway is blocked off for construction work. You stand there starving and dehydrated for a week until the work is complete and the hallway is reopened.

  • Discover the file room door is locked, so you pile all the files up in front of the door.

  • Realize the index that tells you where each file goes is missing, so you make up your own new filing scheme, entirely different from the existing one.

  • Realize the labels on the files are in a writing system you don't understand, so you guess wildly at how to alphabetize them.

  • Notice one of the boxes contains an active bomb, but you know you're not supposed to disturb your boss, so you file the bomb and don't tell anyone.

  • Notice the office is now exploded and on fire, and keep delivering files into the flames until the fire department drags you out of the building.

When you meet your boss outside, you let her know you finished the filing job and there were just a few problems you noticed along the way. That's what happens when you don't crash early: at every point in the process, the environment was unsuitable for you to do the work, but you kept on going in the hope it would work out instead of stopping immediately.


So what does that mean for programming? If there's a problem (usually delivered to you in the form of an exception or a failed assertion check), you need to immediately assess whether it's something you can deal with. Unless you have a clear plan to recover from the problem, you should never just keep going on blindly in the hope it's all going to be fine somehow.

There are a lot of judgement calls here that will depend on your application. If you're processing all the files in a directory and one turns out to be corrupt, there's no hard rule about the right thing to do. For some applications, it will make the most sense to roll everything back and leave things as they were. For others, it would make more sense to skip that file and process the rest. Or it might be best to pause and alert a human and give them a choice of what to do, or allow such configuration before the task starts. You'll have to decide what makes the most sense given the context of how your application is used and the ways in which it could cause problems if something goes wrong. This requires even more careful analysis when the software is serving a critical purpose: your judgement about how to handle missing sensor data will likely be different if you're designing a floor cleaning robot (where it may be more important to stop the robot immediately before it causes damage) vs flight control software (where you've put considerable design into redundancy, gradual degradation, and failure modes).

Answered by Zach Lipton on December 21, 2020

The important part here is the kind of error you encountered. There are errors that are expected, and where you know what to do with them. Typical examples are network errors, e.g. in your web application you need to display an error if the server doesn't respond, and probably give the user a button to retry. You don't want to crash everything for this kind of error that you can cleanly handle.

Another type of error is one that simply makes the current job impossible. For example, if you need to read 100 different files for a specific job and any one of them fails, there is no point in continuing, because the job cannot be completed. So you don't need a try/catch around every file access; you can let the whole thing either succeed completely or fail on the first error.

The most important error, and the one this statement is really about is an unexpected error that has put your application into an unknown state. Let's assume we're in an application with multiple threads and shared memory. We have a try/catch around the whole program in each thread that catches anything. Is it safe to just restart the thread if any kind of arbitrary exception is thrown?

The answer is no, because of the shared state. The error could have done anything to the shared memory and put it into a corrupt state. What you need to do is to get the program into a defined, known good state again. In most programming languages this means crashing the entire program and restarting it. You can't recover from having your application in an unknown state. Any of your assumptions might be broken; there might simply be garbage data in some of your state.

So of course you should handle exceptions if you understand the error and know how to recover, and if it makes sense to handle the error at that particular point and not at a higher level. What you should not do is try to handle errors that you don't understand, and where you can't guarantee that your application is still in a valid state.

What is special about Erlang/Elixir is that you don't need to crash the entire application. The Erlang VM allows you to have easily hundreds of thousands of processes, and each process is completely isolated, there is no shared, mutable memory there. So in many cases you don't need to catch any exceptions at all, you just let the process crash. This can't affect anything outside that process. And Erlang/Elixir has Supervisors that manage these processes, and you can define restart policies there. So in most cases the process that failed would be simply restarted automatically from a known good state.
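
To make the restart-policy idea a bit more concrete for Java readers, here is a rough sketch. It is not how Erlang implements supervisors, and it does not give you Erlang's process isolation: the Supervisor class, the supervise method and the maxRestarts parameter are names I made up, and everything runs in one thread. The point is only that each restart begins with a freshly created worker instead of reusing the state of the crashed one, and that the supervisor gives up (escalates) after too many crashes:

```java
import java.util.function.Supplier;

final class Supervisor {
    // Runs a worker until it finishes normally. After a crash, a brand-new
    // worker is created so that no state from the failed run is reused.
    // Returns how many restarts were needed; rethrows the last crash once
    // maxRestarts is used up, handing the problem to the next level.
    static int supervise(Supplier<Runnable> workerFactory, int maxRestarts) {
        int restarts = 0;
        while (true) {
            Runnable worker = workerFactory.get(); // known good starting state
            try {
                worker.run();
                return restarts;
            } catch (RuntimeException crash) {
                if (restarts >= maxRestarts) {
                    throw crash; // restart budget exhausted: escalate
                }
                restarts++;
                System.err.println("Worker crashed (" + crash + "), restart #" + restarts);
            }
        }
    }
}
```

This only works safely if the worker really owns all of its state. With shared mutable memory, as discussed above, restarting just the worker is not enough.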

Answered by Mad Scientist on December 21, 2020

Exceptions are meant to communicate to your caller that you couldn't fulfill your job. [That's the most-ignored fact about exceptions.]

Fail Early

That's good advice. As soon as you find out that you can't complete successfully, it's best to immediately inform your caller about that fact (after cleaning up any inconsistent state that you'd otherwise leave behind, if that applies to your application).

Continuing in your program is typically useless, can even be dangerous because of missing or wrong data.

So, e.g. when opening a file, don't immediately catch the exception, log it and continue. The following code will try to read from that file and of course fail as well.

Generally, you write program statements because your logic needs them. So, if one of your steps fails, the whole method won't give the desired results. So, let exceptions that you receive simply bubble up the stack, and actively throw appropriate exceptions whenever you detect failure conditions.

Avoid Catching

Although a good general guideline, "Avoid Catching" is over-simplified.

Better: think three times if you really want to catch exceptions here in this place. I've seen lots and lots of code cluttered with try/catch constructs that are unnecessary and most of the time even quality traps or plain programming mistakes.

Catch exceptions only in places where you can successfully continue, even after some of your program so far has failed. That translates to the question: Do I have a fallback or recovery strategy available that can turn the failure I just experienced into a success? Maybe by a retry/reconnect or by having an alternative algorithm or whatever.

In catching exceptions, you have to be honest to yourself:

  • You know that something in your current code block failed.
  • You also know how the failing code labeled the type of failure (by means of creating its exception object), or how some layer in-between re-labeled the original failure reason (by wrapping the original exception object). [You see, this isn't the most reliable source of information.]
  • Take into account that a given exception type might come from any enclosed piece of code, from any level deep down the stack. So don't assume you know what happened from just looking at the exception object.
  • Given only this information, can you turn your current method into a success?

An honest answer to this reasoning will be "No" in most cases. And then don't catch the exception.

Valid "Yes" situations are e.g. having a retry/reconnect strategy at hand, or an alternative algorithm, or just reasoning about optional code, something like a cleanup that's nice to have, but not necessary for making your current method succeed.
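
A sketch of one such "Yes" situation, a retry strategy. The names Retry, withRetry and TransientFailure are hypothetical; the idea is that only the one failure type known to be retryable is caught, while everything else bubbles up untouched:

```java
import java.util.function.Supplier;

// Hypothetical exception type for failures that are worth retrying,
// for example a dropped connection.
class TransientFailure extends RuntimeException {
    TransientFailure(String message) { super(message); }
}

final class Retry {
    // Runs the operation up to maxAttempts times. Only TransientFailure is
    // caught; any other exception propagates to the caller immediately.
    static <T> T withRetry(int maxAttempts, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (TransientFailure e) {
                last = e; // we have a recovery strategy for this one: try again
            }
        }
        throw last; // out of attempts: tell the caller we could not do the job
    }
}
```

A real version would probably wait between attempts; that is left out here to keep the sketch short.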

You should finally catch exceptions at some top-level (user-interface action level, service API top layer, etc.). There:

  • log the error,
  • tell the user or your client that their request failed,
  • and wait for the next request, that probably (hopefully?) won't run into the same problem.

Supervisor

What the author calls a supervisor translates to a well-designed catch block in more traditional languages: a place where you know how to deal with a failure in such a way that you can meaningfully continue.

Answered by Ralf Kleberhoff on December 21, 2020

Further to @DocBrown's answer, it's also worth throwing errors/exceptions in early guard clauses. The example's structure already facilitates this, but people often write code that checks things too late. This can lead to long, WET, highly nested code, more execution than necessary, complex boolean logic, more bugs, and a result that is hard to verify as bug-free at a glance. These concerns apply even if the code can't fall over. But if it can, and you use early guard clauses, you don't just avoid the above problems; you can also assume certain things have succeeded in most of the code.
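
For illustration, a small sketch of early guard clauses in Java. ScoreBoard, addScore and MAX_SCORE are invented for this example and have nothing to do with the book's actual code:

```java
final class ScoreBoard {
    static final int MAX_SCORE = 1_000; // hypothetical upper limit

    // All checks happen up front. Once past the guards, the rest of the
    // method can assume valid input and needs no nesting.
    static int addScore(int total, int score) {
        if (total < 0) {
            throw new IllegalStateException("corrupt total: " + total);
        }
        if (score < 0 || score > MAX_SCORE) {
            throw new IllegalArgumentException("invalid score: " + score);
        }
        return total + score;
    }
}
```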

Answered by J.G. on December 21, 2020

What could be the supervisor the author is referring to?

It depends on the type of program. In the very common case of a webserver, the "supervisor" would be a dispatcher thread that hands off individual requests to worker threads to process. If there is an exception while a request is processed, it's perfectly acceptable (and in fact common) practice to just let that exception bubble up to some high level exception handler, which logs the exception, rolls back any DB transactions and sends an HTTP 500 error response to the client. The worker thread can then start processing the next request. For a standalone GUI program, you can imagine something similar in the event handler that responds to user input, although here there is a danger of leaving the GUI in a corrupted state.
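
A stripped-down sketch of that high-level handler, without any real web framework. RequestDispatcher, Response and dispatch are made-up names, and requests are plain strings here to keep it self-contained:

```java
import java.util.function.Function;

final class RequestDispatcher {
    // Minimal stand-in for an HTTP response.
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // The single place where exceptions from request handling are caught.
    // Handler code below this level simply lets exceptions bubble up.
    static Response dispatch(String request, Function<String, String> handler) {
        try {
            return new Response(200, handler.apply(request));
        } catch (RuntimeException e) {
            // A real server would also roll back any open DB transaction here.
            System.err.println("Request '" + request + "' failed: " + e);
            return new Response(500, "Internal Server Error");
        }
    }
}
```

One failed request produces a 500 for that client only; the next call to dispatch starts fresh.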

In Java, I am used to use a lot of try/catch, do I need to stop doing that?

It depends. In many cases, specific exceptions indicate known error conditions that your program can and should handle. For example, trying to parse a date a user entered, but they entered the 30th of February. Keep doing that.

What you should not do is catch(Exception e) and then just continue as if nothing happened.
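
As a small sketch of the difference (DateInput and parseUserDate are hypothetical names), this catches exactly the one exception that means "the user typed an invalid date" and nothing else. LocalDate.parse with its default ISO format rejects dates such as the 30th of February with a DateTimeParseException:

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class DateInput {
    // Returns the parsed date, or empty if the user's text is not a valid
    // ISO date (yyyy-MM-dd). Only the expected parse failure is caught;
    // anything else, such as a null argument, is a bug and is left to crash.
    static Optional<LocalDate> parseUserDate(String text) {
        try {
            return Optional.of(LocalDate.parse(text));
        } catch (DateTimeParseException e) {
            return Optional.empty(); // known condition: ask the user again
        }
    }
}
```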

Java is unique in that it has checked exceptions that it forces you to catch (or pollute your method signature with). This is widely considered a failed language design experiment. A pretty common way to deal with checked exceptions you can't usefully handle is to wrap them in a RuntimeException, i.e. do something like catch(IOException e) { throw new RuntimeException("Error processing " + filename, e); } - note that you can still do something very useful by adding information (in this case about the file that was being accessed) that will help in debugging when the exception is logged.

and replace that with what? Do I simply let the program restarts every time there is exception?

Depends on the program. If you can find a central, high-level place where it makes sense to catch exceptions because you can log them and have the program in a consistent state as if whatever action ultimately caused the exception was never started, do that.

If there is no such place, yes, let the program crash. For command line utilities, that is often the correct choice.

Answered by Michael Borgwardt on December 21, 2020

What could be the supervisor the author is referring to?

In the context of the book, the author is referring to the supervisor in Erlang. It handles restart logic for crashing processes and receives exit messages from its dying child processes. The supervisor can then decide what action to take to bring the system back to a stable state. Restart policies for the processes can be defined there.

Because supervisors in Erlang manage the processes, we can just let a process crash without affecting anything outside it, instead of catching the exceptions and trying to address or fix them.

In Java, I am used to use a lot of try/catch, do I need to stop doing that?

We should avoid abusing try/catch to swallow unexpected exceptions, because it may be unclear whether it is safe for the program to continue. If the program fails later, it may be very difficult to track down the root cause.

Taking Java as an example, unchecked exceptions (those inheriting from RuntimeException) are not enforced by the compiler and, if nothing catches them, terminate the thread that threw them. For example, rather than wrapping code in try/catch, just let it crash on a NullPointerException.

In your code example, the exception is caught, logged, and then rethrown. It is similar in Java, where a checked exception (one the compiler forces you to handle) can be caught and rethrown wrapped in a RuntimeException without losing the stack trace, because the original exception is kept as the cause. For example:

try {
  // ... code that may throw SQLException ...
} catch (final SQLException e) {
  // log the error if necessary, then rethrow with the original as the cause
  throw new RuntimeException(e);
}

Answered by lennon310 on December 21, 2020

In principle, your code should handle unusual situations, but it shouldn't handle programming errors (and not expecting that an unusual situation might happen is a programming error).

If there are programming errors, then your code should crash, then a developer figures out why it crashed, and fixes it. If there are unusual situations, your code should handle them if it is possible and safe.

Answered by gnasher729 on December 21, 2020

Basically, the author, [...] advises to avoid catching exceptions and let the program crash

No, that is a misunderstanding.

The recommendation is to let a program terminate its execution ASAP when there is an indication that it cannot safely continue (the term "crash" can also be replaced by "end gracefully", if one prefers this). The important word here is not "crash", but "early": as soon as such an indication becomes apparent in a certain part of the code, the program should not "hope" that parts of the code executed later might still work, but simply end execution, ideally with a full error report. A common way of ending execution is to use a specific exception for this, which transports the information about where the problem occurred to the outermost scope, where the program should be terminated.
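
A minimal Java sketch of that pattern, under my own naming (CrashEarly, FatalProblem and run are not from the book): a dedicated exception carries the problem up to the outermost scope, which prints the report and decides the exit code.

```java
final class CrashEarly {
    // Hypothetical exception type meaning "we cannot safely continue".
    static final class FatalProblem extends RuntimeException {
        FatalProblem(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Outermost scope: runs the program and turns a fatal problem into an
    // error report plus a non-zero exit code. Nothing deeper in the program
    // catches FatalProblem.
    static int run(Runnable program) {
        try {
            program.run();
            return 0;
        } catch (FatalProblem e) {
            System.err.println("Terminating: " + e.getMessage());
            e.printStackTrace(); // full report, including the original cause
            return 1;
        }
    }
}
```

A main method could then end with System.exit(CrashEarly.run(app)), where app is whatever Runnable represents the program. Exceptions of any other type are deliberately not caught here and still crash the program in the usual way.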

Moreover, the recommendation is not against catching exceptions in general. The recommendation is against the abuse of catching unexpected exceptions to prevent the end of a program. Continuing a program though it is unclear whether this is safe or not can mask severe errors, makes it hard to find the root cause of a problem and has the risk of causing more damage than when the program suddenly stops.

Your example shows how to catch some severe exceptions, for logging. But it does not just continue the execution, it rethrows those exceptions, which will probably end the program. That is exactly in line with the "crash early" idea.

And to your question

What could be the supervisor the author is referring to?

Such a supervisor is either a person who will deal with the failure of a program, or another program running in a separate process, which monitors the activity of other, more complex programs and can take appropriate actions when one of them "fails".

What this is precisely depends heavily on the kind of program, and the potential costs of a failure. Imagine the failure scenarios for

  • a desktop application with some GUI for managing address data in a database

  • a malware scanner on your PC

  • the software which makes the regular backups for the Stack Exchange sites

  • software which does automatic high speed stock trading

  • software which runs your favorite search engine or social network

  • the software in your newest smart TV or your smartphone

  • controller software for an insulin pump

  • controller software for steering of an airplane

  • monitoring software for a nuclear power plant

I think you can imagine for yourself for which of these examples a human supervisor is enough, and where an "automatic" supervisor is required to keep the system stable even when one of its components fails.

Answered by Doc Brown on December 21, 2020
