There’s a clear philosophy to successful break-fix methods. While there are implementation-specific and component-specific details (such as with computers), there are some universal and near-universal approaches to just about any technical/interpersonal matter.
From the top down, all things can be fixed with a simple procedure:
- Identify a problem.
- Consult/research an answer.
- Apply that solution.
- Verify it works.
It sounds easy, but there’s more to it.
It’s All Networked
Imagine the diagram of a network with nodes:
- A computer network
- A single computer with all its components, a motherboard, one IC on that motherboard, or one wafer of one IC on that motherboard
- An engine
- A building electrical system
- A supply chain
- A social network, large entity, or singular group
- A human body, subsystem, organ, tissue of an organ, or cell
- A philosophy, belief, thought, or sensation
Experience allows people to know which of those details affect results and which are irrelevant. A lightbulb usually fails before a wire, fuses before cables, alternators before starters, and networks before software.
If node 3 fails in the above example, it may take the whole system offline, and someone would broadly declare the thing “broken”. However, only a small part of the device is effectively broken, not the whole thing, and it only matters if the result passes through node 3.
We can typically salvage the object if we mend or circumvent that broken component, wherever it is.
Troubleshooting has a practical problem that prevents this from being easy: where is the breakdown?
Consider the Chain
A network is the “state” of reality, but all things related to living beings have a predetermined purpose or are part of the background noise. “Broken” is shorthand for “the user’s expectations of cause-and-effect driven by prior experience haven’t been satisfied”.
At the beginning, someone doesn’t know what’s wrong. The thing created a result, but now it doesn’t.
Previously-known experience is highly useful because it focuses the scope of what could be wrong:
- It specifies exactly what someone was doing when the thing broke, which creates a “start” point.
- The desired result becomes clear, which creates an “end” point.
- Those “start” and “end” ideas together mean all the possibilities are represented as “chains” between “start” and “end”, with extra complexities branching into separate chains.
Take the above diagram again, and imagine someone saying “I have tried to get 8 to happen, but it’s not doing it”. Without further explanation of where the chain comes and goes, it could represent anywhere on that network. It’s a different story entirely if the statement was “I have tried to get 8 to happen by engaging 15, but it’s not doing it”.
The first task, above all else, is to find out where the chain begins and ends.
- If the action going to 8 starts at 16, there are only a few possibilities (16, 8, or the 16-8 junction).
- If there are many possible connections, the cause-and-effect will be harder to simply deduce.
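The idea of "chains" between a start and an end can be sketched in code. The following is only an illustration: the adjacency list is hypothetical (the node numbers are borrowed from the examples above, but the connections are invented), and `chains` and `suspects` are made-up helper names.

```python
# Hypothetical network: each node maps to the nodes it feeds into.
# The connections are invented for illustration only.
NETWORK = {
    15: [3, 12],
    12: [3, 16],
    3: [8],
    16: [8],
    8: [],
}


def chains(network, start, end, path=None):
    """Yield every loop-free chain of nodes leading from start to end."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in network.get(start, []):
        if nxt not in path:  # never revisit a node, so loops can't trap us
            yield from chains(network, nxt, end, path)


def suspects(network, start, end):
    """Collect every node that sits on at least one chain."""
    found = set()
    for chain in chains(network, start, end):
        found.update(chain)
    return found


# Starting right next to the end leaves very little to check.
print(list(chains(NETWORK, 16, 8)))    # [[16, 8]]
# Starting farther away opens up several chains and more suspects.
print(list(chains(NETWORK, 15, 8)))
print(sorted(suspects(NETWORK, 15, 8)))
```

Any node outside the suspect set can be ignored for this particular complaint. The junctions between nodes are suspects too, even though this sketch only tracks the nodes.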
If you must deduce smaller links in the chain, draw from several approaches:
- Use your intuition and experience to check what most likely fails, though this approach is the most easily swayed by bias. There’s no reason to investigate an obscure cause simply because you read about it in school. This mostly requires hands-on experience, but it is the quickest way to diagnose.
- Test something in the middle of the chain. If it slices the chain in half, the number of possible problem areas has been halved. To be sure, test both sides of that chain, since there might be two problems!
This resolves the location of most simple problems. For more elaborate ones, perception and perspective are the only ways to figure them out.
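Testing the middle of the chain is essentially a binary search. Here is a minimal sketch, assuming a single fault where everything upstream of it still works and everything from the fault onward does not. The chain, the `make_probe` helper, and the `first_bad_link` function are all hypothetical.

```python
def first_bad_link(chain, output_ok):
    """Return the most upstream link whose output is bad, or None.

    Assumes every link before the fault tests good and every link from the
    fault onward tests bad.
    """
    lo, hi = 0, len(chain) - 1
    if output_ok(chain[hi]):
        return None                 # the end of the chain works: no fault
    while lo < hi:
        mid = (lo + hi) // 2
        if output_ok(chain[mid]):
            lo = mid + 1            # good here, so the fault is downstream
        else:
            hi = mid                # bad here, so the fault is here or upstream
    return chain[lo]


def make_probe(chain, broken):
    """Build a pretend test: links before the broken one report good output."""
    position = chain.index(broken)
    return lambda link: chain.index(link) < position


# Hypothetical chain, purely for illustration.
CHAIN = ["wall outlet", "power strip", "power supply", "motherboard", "display"]

print(first_bad_link(CHAIN, make_probe(CHAIN, "power supply")))
```

This only finds the most upstream fault. If there are two problems, repair the first one and run the search again on the rest of the chain.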
Pay close attention to every little detail. Absolutely anything out-of-place from convention (e.g., minor scuffs, a strange prompt, an unusual statement) may be connected to the problem.
It’s not uncommon for bias to get the better of us when a breakdown creates severe risks. While imagining the adverse consequences of a sustained failure, it’s very easy to assume the cause must be as extraordinary as the consequences. However, the mundane and likely answer is often the quickest solution, and it’s often worth investigating the obvious first impulse of the least-educated person in the room.
While completely focused, test it a few more times (if possible) by changing the origin and destination locations:
- Remove the farthest-outputting part and watch what happens.
- Feed in a different input, such as hand-cranking or varying the quantities.
- Insert a new input or output that gives more or different information.
- Plug in another known-good screen or try a different mouse.
- Access a different website or talk to a different person in that group.
- Use a different computer on the network or try a different operating system.
- Try a different fluid, or remove the extraction pump.
- Use a different thought experiment.
Many times, the end user (who will not be naturally aware of any of this) will have a hard time understanding the specifics of their problem, but will lose patience with the fact the problem still exists. They will often say “my car is broken” or “the computer won’t turn on”. If you can, ask them for key information about precise elements as the event transpires. This serves several functions:
- To the degree of their perceptiveness, you won’t have to revisit everything they’ve experienced in-person.
- They become informally trained on what to perceive in the future, which can profoundly impact an entire group’s corporate culture over time.
To get even more perceptions, it helps tremendously to shift around perspectives. Look at it from multiple angles, which is a product of creative thinking. We can train ourselves to examine conventional things through unconventional lenses.
Smart people tend to be better at finding problems because intelligence could be defined as “a person’s ability to maintain multiple perspectives at once”, especially if they have ADD.
By this point, you’ll typically see the problem and know what needs replacing. However, there’s a comparatively smaller chance the source of the problem won’t arise in a blaze of clarity.
The next step is to whittle away things it can’t be. Your purpose should be to make the likely chain of issues as small as reasonably possible:
- If it doesn’t create any adverse consequences, try to reproduce the issue again, but with different inputs.
- If it does create adverse consequences, ask for volunteers to experiment (which may be the end-user in some scenarios).
- Examine other ways to break the system the same way.
This part here is the most tedious portion of diagnosis. You know what’s going wrong, but now must be 100% sure you’ve replaced all non-working components.
Every single part has at least two aspects to it:
- The part’s inner workings that do things (e.g., the database, the compressor’s parts).
- That part’s connection to other things (e.g., the GUI for the database, the plumbing leading to the rest of the unit).
As stated above, each system’s components are their own system, scaling down to the nanoscopic, so it’s typically difficult to tell at first glance what’s actually wrong with any degree of authority. This gets much worse if any bad design decisions are involved as well.
- The car not getting power may be from a bad battery, a bad alternator, or a blown fuse.
- The forgotten message may be from a bad messenger, poor interpretation, or a vague message.
- A failed command may be from a bad API, a bad command to the API, or a bad network configuration.
- Failed recipes could be bad instructions, bad ingredients, or bad cooking skills.
It always requires at least 2-3 perspectives of that one part to get an accurate picture. You won’t know until you’ve tried it a few ways.
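As a rough sketch of checking one part from two perspectives, the function below tests the part’s inner workings directly, then tests it again through its connection. The function name and the two probe callables are hypothetical. In practice each probe would be a real test, such as querying a database from the command line versus through its GUI.

```python
def locate_fault(works_directly, works_via_connection):
    """Compare a direct test of a part with a test through its connection."""
    if not works_directly():
        # The part fails even with the connection bypassed.
        return "inner workings"
    if not works_via_connection():
        # The part itself is fine, so the connection to it is the suspect.
        return "connection"
    return "neither"


# Hypothetical result: the part responds directly, but not through its usual
# connection.
print(locate_fault(lambda: True, lambda: False))
```

When the direct test fails, the connection remains untested, so it is worth repeating both checks after the part itself is repaired.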
Finding a cause becomes more difficult in proportion to the system’s complexity. Elaborately designed things (e.g., computers) almost guarantee you can never be 100% sure about any solution. For this reason, most advanced troubleshooting uses more in-the-weeds technical controls to keep the chains as small as possible:
- Command-line prompts remove the uncertainty of a bad GUI and allow the user’s input to be more precise.
- Before computerized throttle controls, auto mechanics tested the engine by operating the throttle on top of the engine instead of the accelerator.
- Philosophers tend to imagine idealized scenarios instead of practical considerations that complicate matters.
- Expert conflict managers tend to speak low-context.
Generally, understanding which things are likely to fail is the best preventative measure, but that requires gleaned experience (either yours or that of others who work with the items in question). Constant changes in an industry can also quickly erode any gains in that department.
Keep an eye out for the XY problem.
- The end user can get lost in the details of X when they’re trying to solve Y, which has nothing to do with X.
- This problem scales dramatically as systems become more complex:
  - Repeated support tickets and requests are opened for the user’s X problem.
  - Those tickets promptly close as highly qualified people consistently fix X.
  - The end user becomes increasingly frustrated until they either give up or someone notices.
Both the decay of any individual part and discoveries of failures usually work on a logarithmic curve. Unfortunately, after a fix has been mostly implemented in a vast system, enough time can pass between events that everyone has forgotten what happened last time, or nobody remembers because none of the previous technicians work there anymore.
Some issues only arise once in a while, but can cause technicians of all types to lose sleep trying to figure out why:
- Two relatively innocuous edge cases can create problems. One minor failure that typically doesn’t matter can occasionally interact with another minor failure that typically doesn’t matter, and the combined effect makes an entire system fall apart.
- On occasion, a connection between two elements can be defective while the elements themselves are known-good.
- Another hidden risk in diagnosing problems is when the chain has two or more points of failure.
- Rolling out an update can break things. Anytime something updates, it’s no longer known-good until every case has been tested, so always keep backup versions ready to roll things back.
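The rollback advice can be sketched as a small routine: copy the known-good state, apply the update, run checks, and restore the copy if any check fails. This is a minimal sketch, assuming the "system" is a plain dictionary of settings. The `apply_update` name, the settings, and the checks are all hypothetical.

```python
import copy


def apply_update(system, update, checks):
    """Apply `update` to `system` (a dict) and roll back if any check fails.

    Returns (succeeded, names_of_failed_checks).
    """
    backup = copy.deepcopy(system)   # keep a known-good copy first
    update(system)                   # from here on, the system is unverified
    failed = [name for name, check in checks.items() if not check(system)]
    if failed:
        system.clear()
        system.update(backup)        # restore the known-good state
    return (not failed), failed


# Hypothetical settings and update, purely for illustration.
settings = {"version": 1, "timeout_seconds": 30}


def risky_update(s):
    s["version"] = 2
    del s["timeout_seconds"]         # the update silently drops a setting


checks = {
    "version bumped": lambda s: s["version"] == 2,
    "timeout still set": lambda s: "timeout_seconds" in s,
}

ok, failed_checks = apply_update(settings, risky_update, checks)
print(ok, failed_checks, settings)
```

The sketch does not handle an update that raises an exception partway through. A fuller version would restore the backup in that case as well.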
Unusual issues are more likely to happen if the object itself had inherent design or engineering flaws, was poorly maintained, or was “fixed” incorrectly by previous people. When unsure, expect something was neglected and more problems are simply waiting to happen.
Ironically, while redundant systems can be very useful to prevent failures, they make it absurdly difficult to diagnose them. If A leads to B and there are two more backup systems to lead A to B, there are now three systems to verify instead of one. The only way to discover which one failed is to create explicit distinctions in the output of the first part of the system.
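One way to create that kind of explicit distinction is to have every result carry the name of the path that produced it. The sketch below assumes three interchangeable paths from A to B. The `deliver` function and the path names are hypothetical.

```python
def deliver(message, paths):
    """Try each (name, send) path in order.

    Returns (name_of_path_used, result, failures_so_far), so the caller can
    see which path actually produced the result.
    """
    failures = []
    for name, send in paths:
        try:
            return name, send(message), failures
        except ConnectionError as exc:
            failures.append((name, str(exc)))  # record it instead of hiding it
    raise ConnectionError(f"all paths failed: {failures}")


def broken_primary(message):
    raise ConnectionError("link down")


def working_backup(message):
    return message


# Hypothetical paths from A to B, purely for illustration.
paths = [
    ("primary", broken_primary),
    ("backup-1", working_backup),
    ("backup-2", working_backup),
]

used, result, failures = deliver("hello", paths)
print(used, result, failures)
```

Without the path name in the output, the message would still arrive and the dead primary would go unnoticed until the backups failed too.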
To fix things, you need parts or supplies. If you have the new part, swap it out and be done with it. You can add the old part to a “fix later” pile: your time is important, and swapping gets things running again ASAP so you can close the ticket.
The easiest solution, if it’s attainable, is to get an identical copy of the thing:
- For software, this is trivial on an internet-connected computer (or an OS with a reliable backup schedule), and may simply require copying drivers from another computer.
- For hardware, you will have to have a good inventory management policy beforehand to have parts ready to go.
- The logistics to get new components should be arranged beforehand, and are a significant aspect of good management.
- However, it can become horrifyingly complex if the components are extremely rare or an organization’s security policy becomes Orwellian.
When updating, make sure the thing is as immovable as possible. Particularly savvy people update or replace on the fly, sometimes while multitasking:
- Always secure and ground the objects you’re working with.
- Keep everything organized to prevent something interacting inappropriately.
- Never permit distractions to an important conversation.
- Keep everything labeled as you go.
- During critical stages, never make more changes than necessary.
There are 3 distinct phases in any repair job:
- Disassembly – remove everything that may obstruct convenient access to the problem.
  - Keep everything neatly arranged and labeled to allow the final phase to operate smoothly.
- Repair – access the problem directly, typically to swap out a component.
  - Any new problems are a recursion of this 3-phase system, but smaller.
  - If the component is a cable, securely attach the end of the new cable to the end of the old cable, then yank the old cable out and detach its end, leaving the new cable.
- Assembly – combine everything back together.
  - If it’s a physical object, make sure everything is securely tightened and in place to prevent it from moving later.
Many managers imagine breakdowns are a great time to upgrade. This is only a good idea if it’s replacing an entire system, and even then the upgrade will likely make more things fail later.
It’s dead critical to thoroughly document what you do end up doing:
- If something failed, it may fail again and represent a different issue along the chain.
- Clarify any slight deviation from what it was before you touched it.
- Even if it’s your solo project, memory is fickle, even 2 days later.
A “duct tape and baling wire” solution is always available. It fixes the problem in the short term, but it’s usually not durable enough to withstand heavy use and is more likely to break down later. The future work needed to properly fix today’s quick fix is called “technical debt”.
Technical debt is never worth incurring unless it’s an emergency. If emergencies keep happening, that’s the product of lots of technical debt, and it’s probably worth considering investment into a better system.
Technical debt isn’t a big deal with an edge case, but will destroy any efficiency gains on a common case.
Also, technical debt is permanently attached to the object as long as it’s unresolved and the object exists. This can be dramatically severe in a long-term project.
Documentation can help curb at least some of it, but the cure for technical debt is to never have it.
From the beginning, think months and years ahead. It’s not uncommon for an edge case to become the de facto common case after a long time, especially after frequently using something. Things frequently scale over time, so it’s often not a question of if something will scale, but of what.
As much as possible, keep spare resources around before you need them.
- This applies to spare parts, especially the parts most likely to break.
- You’ll often need more time than you’d expect to diagnose problems.
- You’ll also need plenty of space to work in.
- For this reason, it makes tons of sense to eventually assign a designated area (e.g., a workbench) that has all the necessary parts and space, and it’s sometimes worth hiring someone to help with fixing as well.
Frequently, a network has a distinctive pain point where stuff breaks down more often, and there are several ways to work through it:
- Keep swapping out the component, which can be time-intensive and won’t scale well.
- Find a way around it, which often requires clever hacks. Sometimes, if it’s better than the original solution, it might become the new standard.
- Invent something that creates the desired result from a completely different angle.