There’s a clear philosophy to successful break-fix methods. While there are implementation-specific and component-specific details (such as with computers), there are some universal and near-universal approaches to just about any technical/interpersonal matter.
While the steps may sometimes only take a few seconds, all things are fixed with a relatively straightforward, mundane procedure:
1. Observe what exists and what is happening.
2. Identify a discrepancy between what you imagine is supposed to happen and what is actually happening.
3. Brainstorm a solution.
   - People often zip quickly through this one, especially if they’re impulsive or not particularly intelligent.
4. Decide on what to do.
   - This becomes more difficult proportionally to someone’s intelligence.
5. Gather resources to perform the task.
   - It doesn’t have to be complete, but that person has to be able to imagine where all the resources are located.
6. Apply that solution, which typically takes very little time to do.
7. Observe that the solution works.
   - This requires watching exactly what happens, and is frequently easier with a second observer.
   - If it doesn’t work, go back to Step 3.
8. Post-mortem followup and analysis.
   - Document what happened, understand why it happened, make sure it’s easier next time to fix, etc.
It sounds easy enough, but there’s more to it.
It’s All Networked
Imagine the diagram of a network with nodes:
This network can represent anything that accomplishes a defined purpose:
- A computer network
- A single computer with all its components, a motherboard, one IC on that motherboard, or one wafer of one IC on that motherboard
- An engine
- A building electrical system
- A supply chain
- A social network, large entity, or singular group
- A human body, subsystem, organ, tissue of an organ, or cell
- A philosophy, belief, thought, or sensation
Reality frequently has a type of atomism: each larger thing is made of smaller things, ever-progressively, until we start diving into the periphery of our understanding.
Except for the specific neurodivergent state of autism, most of our intuitive thinking works in reverse to nature: imagine the object as a collective unit first, then work downward into the details.
Experience allows people to know which of those details affect results and which details are irrelevant. A lightbulb will fail before a wire, fuses fail before cables, alternators fail before starters, networks fail before software.
In the above example, if node 3 fails, it may cause the whole system to go offline, and someone would broadly declare the thing as “broken”. However, only a small part of the device is effectively broken, not the entire thing, and it’s only broken when the result passes through node 3.
We can typically salvage the object if we mend or circumvent whatever that broken component is.
Troubleshooting has a practical problem that prevents this from being easy: which node is broken? Or, in plainer terms, where is the breakdown?
Consider the Chain
A network is the “state” of reality, but all things related to living beings have a predetermined purpose or are part of the background noise. “Broken” is shorthand for “the user’s expectations of cause-and-effect driven by prior experience haven’t been satisfied”.
At the beginning, someone doesn’t know what’s wrong. The thing created a result, but now it doesn’t.
Previously known experience is highly useful because it focuses the scope of what could be wrong:
- It specifies exactly what someone was doing when the thing broke, which creates a “start” point.
- The desired result becomes clear, which creates an “end” point.
- Those “start” and “end” ideas together mean all the possibilities are represented as “chains” between “start” and “end”, with extra complexities branching into separate chains.
Take the above diagram again, and imagine someone saying “I have tried to get 8 to happen, but it’s not doing it”. Without further explanation of where the chain comes and goes, it could represent anywhere on that network. It’s a different story entirely if the statement was “I have tried to get 8 to happen by engaging 15, but it’s not doing it”.
The first thing, more than anything else, is to find out the limits of where the chain lies.
- If the action going to 8 starts at 16, there are only a few possibilities (16, 8, or the 16-8 junction).
- If there are many possible connections, the cause-and-effect will be harder to simply deduce.
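As a rough sketch of that idea, the snippet below lists every chain between a “start” and an “end” and collects the nodes and junctions that could be at fault. The wiring in `GRAPH` is purely hypothetical (it is not the diagram above), and the names `chains` and `suspects` are made up for illustration.

```python
def chains(graph, start, end, path=None):
    """Yield every chain (path with no repeated node) from start to end."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in graph.get(start, []):
        if nxt not in path:  # never walk through the same node twice
            yield from chains(graph, nxt, end, path)


def suspects(graph, start, end):
    """Collect every node and junction that lies on some start-to-end chain."""
    nodes, links = set(), set()
    for chain in chains(graph, start, end):
        nodes.update(chain)
        links.update(zip(chain, chain[1:]))
    return nodes, links


# Hypothetical wiring, assumed only for this example.
GRAPH = {
    15: [3, 16],
    3: [8],
    16: [8],
    8: [],
}

# "I tried to get 8 to happen by engaging 16": one chain, few suspects.
print(list(chains(GRAPH, 16, 8)))  # [[16, 8]]

# "I tried to get 8 to happen by engaging 15": more chains, more suspects.
print(list(chains(GRAPH, 15, 8)))  # [[15, 3, 8], [15, 16, 8]]
```

Knowing the starting point shrinks the suspect list from the whole network to whatever sits on those chains.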
If you must deduce smaller links in the chain, draw from several approaches:
- Use your intuition and experience to check what most likely fails. This is the quickest way to diagnose, but it mostly requires hands-on experience and is the approach most easily swayed by bias. There’s no reason to investigate an obscure cause simply because you read about it in school.
- Test something in the middle of the chain. If the test slices the chain in half, the number of possible problem areas has been cut in half as well. To be sure, test both sides of that chain, since it might be two issues!
- Turn off every single feature or option and see what happens. If it doesn’t work, your chain has become much smaller. If it works, sequentially turn features back on until you see something stop working.
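The last two approaches can be sketched in a few lines. This is only an illustration under loud assumptions: the chain has exactly one fault, everything downstream of the fault looks bad, and the probe functions (`works_up_to`, `works_with`) are hypothetical stand-ins for whatever test you can actually perform. The chain of parts is invented for the example.

```python
def first_bad_link(chain, works_up_to):
    """Halve the chain repeatedly to find the first stage with bad output.

    works_up_to(i) is a hypothetical probe: True if the signal is still
    good after stage i. Assumes a single fault and that the end of the
    chain is known to be failing.
    """
    lo, hi = 0, len(chain) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1  # good through mid, so the fault is downstream
        else:
            hi = mid      # bad at mid, so the fault is at mid or upstream
    return lo


def first_breaking_feature(features, works_with):
    """Start with everything off, then turn features back on one at a time."""
    enabled = []
    for feature in features:
        enabled.append(feature)
        if not works_with(enabled):
            return feature  # the thing stopped working when this came on
    return None


# Invented chain and fault location, for illustration only.
CHAIN = ["wall outlet", "power strip", "power supply",
         "motherboard", "GPU", "cable", "monitor"]
FAULT = 4
probes = []


def probe(i):
    probes.append(i)
    return i < FAULT


bad = first_bad_link(CHAIN, probe)
print(CHAIN[bad], "found after", len(probes), "probes")
```

Halving needs roughly log2(n) probes for a chain of n stages, which is why it beats checking each link in order. If two things are broken at once, it will only find the first, so the warning about testing both sides still applies.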
Most simple issues can be resolved this way. For more elaborate ones, the only way to figure them out is through perception and changing perspective.
Perceive
Pay close attention to every little detail. Absolutely anything out-of-place from convention (e.g., minor scuffs, a strange prompt, an unusual statement) may be connected to the issue.
When a breakdown creates severe risks, it’s not uncommon for our bias to blind us to the patently obvious. Imagining adverse consequences of a sustained failure often leads us to imagine the unlikely and phenomenal experience we heard about once in school is the most likely. But, the answer to a far-reaching problem is frequently a mundane and common fix. To that end, it’s typically worth asking the least-educated person in the room what their intuition says.
While completely focused, test it a few more times (if possible) by changing the origin and destination locations:
- Remove the farthest-outputting part and watch what happens.
- Feed in a different input, such as hand-cranking, or vary up the quantities.
- Insert a new input or output that gives more or different information.
- Plug in another known-good screen or try a different mouse.
- Access a different website or talk to a different person in that group.
- Use a different computer on the network or try a different operating system.
- Try a different fluid, or remove the extraction pump.
- Use a different thought experiment.
Many times, the end user or consumer (who will naturally be oblivious to all this) will have a hard time understanding the specifics of their problem, but will lose patience with the fact that the problem persists. They will often say “my car is broken” or “the computer won’t turn on”. If you can, ask them for key information about precise elements as the event transpires. This serves several functions:
- To the degree of their perceptiveness, you won’t have to revisit everything they’ve experienced in-person.
- They may become aware that there are more details than they were originally aware of, which may give them more patience.
- They become informally trained on what to perceive in the future, which can profoundly impact an entire group’s corporate culture over time.
To get even more perceptions, it helps tremendously to shift around perspectives. Look at it from multiple angles, which is a product of creative thinking. We can train ourselves to examine conventional things through unconventional lenses.
Smart people tend to be better at finding problems because intelligence could be defined as “a person’s ability to maintain multiple perspectives at once”, especially if they have ADD. However, the neurology of smart people makes them a bit slower to act, and it’s sometimes better just to throw things at it until something works.
Frustration
By this point, you’ll typically see the issue and know what needs replacing. However, there’s a comparatively smaller chance the cause of the issue won’t arise in a blaze of clarity.
The next step is to whittle away things it can’t be. Your purpose should be to make the likely chain of issues as small as reasonably possible:
- If it doesn’t create any adverse consequences, try to reproduce the issue again, but with different inputs.
- If it does create adverse consequences, ask for volunteers to experiment (which may be the end-user in some scenarios).
- Examine other ways to break the system the same way.
Repetition
This part here is the most tedious portion of diagnosis, but is also often the most overlooked. You know what’s going wrong, but now must be 100% sure you’ve replaced all non-working components.
Every single part has at least two aspects to it:
- The part’s inner workings that do things (e.g., the database, the compressor’s parts).
- That part’s connection to other things (e.g., the GUI for the database, the plumbing leading to the rest of the unit).
As stated above, each system’s components are a separate and dynamic smaller system, scaling down to the nanoscopic, so it’s typically difficult to tell at first glance what’s actually wrong with any degree of authority. This gets much worse if any bad design decisions are involved as well.
- The car not getting power may be from a bad battery, a bad alternator, or a blown fuse box.
- The forgotten message may be from a bad messenger, poor interpretation, or a vague message.
- A failed command may be from a bad API, a bad command to the API, or a bad network configuration.
- Failed recipes could be bad instructions, bad ingredients, or bad cooking skills.
It always requires at least 2–3 perspectives of that one part to get an accurate picture. You won’t know until you’ve tried it a few ways.
Complications
Finding a cause becomes more difficult proportionally to how much complexity is in the system. Elaborately designed things (e.g., computers) almost guarantee you can never be 100% sure about any solution. For this reason, most advanced troubleshooting uses more in-the-weeds technical controls to keep the chains as small as possible:
- Command-line prompts remove the uncertainty of a bad GUI and allow the user’s input to be more precise.
- Before computerized throttle controls, auto mechanics tested the engine by operating the throttle on top of the engine instead of the accelerator.
- Philosophers tend to imagine idealized scenarios instead of practical considerations that complicate matters.
- Expert conflict managers tend to speak low-context.
Generally, understanding which things are likely to fail is the best preventative measure. That requires gleaned experience (either yours or that of others who work with the items in question), and constant changes in an industry can quickly erode any gains in that department.
Keep an eye out for the XY problem, which scales dramatically as systems become more complex:
- The end user can get lost in the details of X when they’re trying to solve Y, which has nothing to do with X.
- Repeated support tickets and requests open up for the user’s X problem.
- Those tickets promptly close as highly qualified people consistently fix X.
- End user becomes increasingly frustrated until they either give up (and solve their problem without that particular product) or someone in the organization notices that everyone keeps fixing X.
Both the decay of any individual part and discoveries of failures usually work on a logarithmic curve. Unfortunately, after a fix has been mostly implemented in a vast system, enough time can transpire between events that everyone forgot what happened last time or nobody remembers because all the previous technicians no longer work there.
Some issues only arise once in a while, but can cause technicians of all types to lose sleep trying to figure out why:
- Two relatively innocuous edge cases can create unique and difficult-to-reproduce failures. One minor failure that typically doesn’t matter will frequently interact with another minor failure that typically doesn’t matter, with the effect making an entire system fall apart with zero intuitive predictability.
- Occasionally, a connection between two elements can be defective while the elements themselves are known-good.
- Another hidden risk in diagnosing problems is when the chain has two or more points of failure.
- Rolling out an update can break things. Anytime something updates, it’s no longer known-good until every case has been tested, so always keep backup versions ready to roll things back.
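The first item in that list, two harmless conditions that only break things together, can be hunted for mechanically once you have a list of candidates. The sketch below is a minimal illustration, not a general method: the condition names and the `system_ok` check are entirely hypothetical, and it assumes you can safely re-run the system with a chosen set of conditions active.

```python
from itertools import combinations


def failing_pairs(conditions, system_ok):
    """Return pairs that fail together although each passes on its own."""
    # Keep only conditions that are harmless when applied alone.
    alone_ok = [c for c in conditions if system_ok({c})]
    # Then try every pair of those and keep the ones that break the system.
    return [pair for pair in combinations(alone_ok, 2)
            if not system_ok(set(pair))]


# Hypothetical candidates, assumed only for this example.
CONDITIONS = [
    "daylight-saving change",
    "nightly backup at 02:30",
    "full disk warning",
]


def system_ok(active):
    # Invented behavior: the system only falls over when both of these
    # otherwise harmless conditions are active at the same time.
    return not {"daylight-saving change", "nightly backup at 02:30"} <= active


print(failing_pairs(CONDITIONS, system_ok))
```

The number of pairs grows quickly with the number of candidates, so this only helps after perception and experience have narrowed the list.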
Unusual issues are more likely to happen if the object itself had inherent design or engineering flaws, it was poorly maintained, or previous people who “fixed” it didn’t do it correctly. When unsure, expect something was neglected and more issues are simply waiting to happen.
Ironically, while redundant systems can be very useful to prevent failures, they make it absurdly difficult to diagnose them. If A leads to B and there are two more backup systems to lead A to B, there are now three systems to verify instead of one. The only solution to discover it is to create explicit distinctions in the output of the first part of the system (i.e., severing more connections). This mindset is mostly why hackers are the best technicians.
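A small sketch of why redundancy hides failures, and what severing connections buys you. Everything here is assumed for illustration: the three path names, their health, and the `system_works` check stand in for a real system where A reaches B if any enabled path is healthy.

```python
# Hypothetical state: the primary path is dead, but the backups mask it.
HEALTH = {"primary": False, "backup-1": True, "backup-2": True}


def system_works(enabled):
    """A reaches B if at least one enabled path is healthy."""
    return any(HEALTH[path] for path in enabled)


def isolate(paths, works):
    """Sever all but one path at a time and record what still works."""
    return {path: works([path]) for path in paths}


paths = list(HEALTH)

# With every path connected, the whole system looks fine.
print(system_works(paths))

# With each path tested alone, the broken one becomes visible.
print(isolate(paths, system_works))
```

The whole-system check passes even though one of the three paths is broken, so only the isolated tests tell you which path needs repair before the backups fail too.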
Repairing
To fix things, you need parts or supplies. If you have the new part, swap it out and be done with it. You can add the old part into a “fix later” pile, since your time is important, and it’ll get things back to running immediately, which means you can close the ticket. Just make sure you’re actually ordering the replacement parts before you require them.
It’s always a good idea to keep tools available and in working order. While your needs will vary by industry, it’s almost always worth keeping a few tools around:
- A high-quality multi-tool
- Hammer
- Variously sized screwdrivers
- Needle nose and snub nose pliers
- Crescent wrench
- Hex wrench set
- Utility knife
- Protective gloves
- Headlamp and flashlight
The easiest solution, if it’s attainable, is to get an identical copy of the thing:
- For software, this is trivial on an internet-connected computer (or an OS with a reliable backup schedule), but may simply require copying drivers from another computer.
- For hardware, you will have to have a good inventory management policy beforehand to have parts ready to go.
  - The logistics for getting new components should be arranged beforehand, which is a significant aspect of good management.
  - However, it can become horrifyingly complex if the components are extremely rare or an organization’s security policy becomes Orwellian.
When updating, make sure the thing is as immovable as possible. Particularly savvy people update or replace on the fly, sometimes while multitasking:
- Always secure and ground the objects you’re working with.
- Keep everything organized to prevent something interacting inappropriately.
- Never permit distractions to an important conversation.
- Keep everything labeled as you go.
- During critical stages, never make more changes than necessary.
There are 3 distinct phases in any repair job:
- Disassembly – remove everything that may obstruct convenient access to the problem.
  - Keep everything neatly arranged and labeled to allow the final phase to operate smoothly.
- Repair – access the issue directly, typically to swap out a component.
  - Any new issues are a recursion of this 3-phase system, but smaller.
  - If the component is a cable, securely attach the end of the new cable to the end of the old cable, then yank the old cable out and detach its end, leaving the new cable.
- Assembly – combine everything back together.
  - If it’s a physical object, make sure everything is securely tightened and in place to prevent it from moving later.
Many managers imagine breakdowns are a great time to upgrade. This is only a good idea if it’s replacing an entire system, since that upgrade will likely make more things fail later. It’s only cost-effective if it’s been proven to work elsewhere.
It’s dead critical to thoroughly and precisely document what you actually did:
- If something failed, it may fail again and represent a different issue along the chain.
- Clarify any slight deviation from what it was before you touched it.
- Even if it’s your solo project, memory is fickle, even 2 days later.
Technical Debt
A “duck tape and baling wire” solution is always available. It fixes it in the short term, but it’s usually not durable enough to withstand heavy use, and is more likely to break down later. The likelihood of future work to fix a quick fix now is called “technical debt”.
Technical debt is never worth incurring unless it’s an emergency, or you expect the product to be replaced soon. If emergencies keep happening, that’s the product of lots of technical debt, and it’s probably worth considering investment into a better system.
Technical debt isn’t a big deal with an edge case, but will destroy any efficiency gains on a common case.
Technical debt is permanently attached to the object as long as it’s unresolved and the object exists. This can be dramatically severe in a long-term project.
Documentation can help curb at least some of it, but the best cure for technical debt is to never have it. Most people want to ship quickly, so this is against human nature.
From the beginning, think months and years ahead. It’s not uncommon for an edge case to become the de facto common case after a long time, especially after frequently using something. Things frequently scale over time, so it’s often not a question of if, but more of what.
Long-Term Improvement
As much as possible, keep spare resources around before you need them.
- This applies to spare parts, especially the most likely parts which could break.
- You’ll frequently need more time than you’d expect to diagnose problems.
- You’ll also need plenty of space to work in.
- For this reason, it makes tons of sense to eventually assign a designated area (e.g., a workbench) that has all the necessary parts and space, and it’s sometimes worth hiring someone to help with fixing as well.
Frequently, a network has a distinctive pain point where stuff breaks down more often, and there are several ways to work through it:
- Keep swapping out the components, which can be time-intensive and won’t scale well.
- Find a way around it, which may require clever hacks. Sometimes, if it’s better than the original solution, it might become the new standard.
- Invent something that creates the desired result from an entirely different angle.
For this reason, the most skilled people at repairing also often become inventors and start businesses.