This post is the third and last entry in a short series on interviewing candidates for knowledge and skills in reliability. See also Part 1: Root Cause Analysis and Part 2: Bearings. With these three tools, a fairly comprehensive picture of a candidate’s knowledge and skills can be gleaned in a short time.
Tell me about a time that you analyzed a system or piece of equipment for failures and then designed a reliability strategy.
- What process did you use?
- How did you get the recommendations implemented?
- RCM is the flip side to RCA: it is a kind of “RCA in advance.” It is one of the most challenging tasks that a reliability professional will undertake. Not only do team RCM efforts tend by their very nature to constantly go off track, creating challenges for the facilitator, but the facilitator must direct the group to the correct level of detail: deep enough, but not too deep. Often, the facilitator has to overcome the perception that it is a waste of time. Finally, the recommendations generally span multiple departments and RCM facilitators have to “influence without authority.”
- The RCM process is surprisingly controversial in the reliability world. It has a reputation for “getting in the weeds” and not creating enough value to justify the expenditure of manpower, but that can be the result of inadequate facilitation, an improper team ordering, or inadequate preparation. My personal feeling is that RCM is a tool that is best applied once a general reliability program including lubrication, inspection, and condition monitoring has been established. Otherwise, for each piece of equipment there will be dozens of recommendations that could be addressed by general efforts at establishing the “basics” at a much lower cost in manpower.
Qualities of an Ideal Answer
- Before jumping straight to recommendations, the process should spend at least some time on system and equipment function, functional failure, and FMEA, addressing failure effects and consequences separately before deciding on tasks. Ideally the cost of periodic tasks, whether on the run or during shutdowns, should be weighed against operational consequences unless there is a safety or compliance aspect.
- It would be best to have an experienced and credible reliability engineer sit in on this question. However, many of the comments regarding root cause analysis in Part 1 also apply here, particularly those on driving organizational change.
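The weighing of periodic task cost against operational consequences can be sketched with a little arithmetic. This is a minimal illustration, not a method from the original post: the task names, costs, and failure rates are all hypothetical, and the model simply adds the annual cost of doing the task to the expected annual cost of the failures that still occur.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskOption:
    """One candidate strategy for a single failure mode (all figures hypothetical)."""
    name: str
    cost_per_task: float      # labor, materials, and any downtime per execution
    tasks_per_year: float     # how often the task is performed
    failures_per_year: float  # expected failures per year if this option is used


def annual_cost(option: TaskOption, cost_per_failure: float) -> float:
    """Annual task cost plus expected annual cost of the remaining failures."""
    task_cost = option.cost_per_task * option.tasks_per_year
    failure_cost = option.failures_per_year * cost_per_failure
    return task_cost + failure_cost


def cheapest_option(
    options: List[TaskOption],
    cost_per_failure: float,
    safety_or_compliance: bool = False,
) -> Optional[TaskOption]:
    """Pick the lowest-cost option, or None when cost alone must not decide."""
    if safety_or_compliance:
        # Safety or compliance consequences are not settled by an economic comparison.
        return None
    return min(options, key=lambda o: annual_cost(o, cost_per_failure))


# Hypothetical options for one failure mode on one piece of equipment.
options = [
    TaskOption("run to failure", 0.0, 0, 0.5),
    TaskOption("monthly vibration route (on the run)", 150.0, 12, 0.05),
    TaskOption("annual overhaul (during shutdown)", 8000.0, 1, 0.1),
]

best = cheapest_option(options, cost_per_failure=40_000.0)
for o in options:
    print(f"{o.name}: {annual_cost(o, 40_000.0):,.0f} per year")
print("Lowest cost:", best.name)
```

A candidate does not need to produce numbers like these in an interview, but an answer that shows this kind of comparison was made, and that it was set aside when safety or compliance was involved, is a good sign.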
I’ve never worked at a plant that had a maintenance engineer. Recently I started working at a site that did have one (specifically, me). While I had a list of responsibilities from the job description I had applied for, I was interested in a broader description of the role.
Then I ran across a list of typical maintenance engineer responsibilities from Terry Wireman in Maintenance Strategy Series Volume 1 – Preventive Maintenance.
The maintenance engineer typically will have all or some of the following responsibilities:
- Ensures that equipment is properly designed, selected, and installed based on life-cycle philosophy.
- Ensures that equipment is performing effectively and efficiently.
- Establishes and monitors programs for engine/compressor analysis and vibration and other condition-monitoring techniques.
- Reviews deficiencies noted during corrective maintenance.
- Provides technical guidance for CMMS.
- Maintains and advises on use and disposition of stock, surplus, and rental rotating equipment
- Promotes equipment standardization, recommends spare parts levels, and coordinates sharing of spare parts with other asset teams.
- Available for consultation with maintenance technicians.
- Monitors new technology and keeps management/staff apprised of new developments.
- Champions quality assurance services including shop qualifications for outside services.
- Develops standards and procedures for major maintenance jobs, management programs for areas of responsibility and exchanges information across asset teams.
- Provides technical guidance for PM and PDM programs.
- Monitors competitors’ activities in the field of Maintenance Management.
- Focal point for monitoring performance indicators for maintenance management program.
- Optimizes maintenance strategies.
- Focal point for analyzing equipment operating data.
Interestingly, some of this I’ve been working on—or see coming my way—despite it not being in the job description. Hopefully, I will not be too hampered by project work, which is not mentioned in the above, but I will have some.
From the perspective of RR&Es, these responsibilities (what you own) are matched to the role of maintenance engineer (who you are on the team). Under MBM, each responsibility would then have one or more expectations (how well you are expected to perform). For example, there would be metrics related to performance of the preventative maintenance program, shop services, maintenance costs, etc.
If you are a maintenance engineer, let me know how well these responsibilities match up to your day-to-day activities.
This post is the second in a short series on interviewing candidates for knowledge and skills in reliability. See also Part 1: Root Cause Analysis.
How long should a bearing last and why?
- This question is not an SBO. It is a pure knowledge question, but the “why” portion of the answer will reveal the candidate’s true understanding of reliability as well as book knowledge.
Qualities of an Ideal Answer
- A poor answer would be “5 years because that’s what’s typical in the plant” or “It depends on the bearing” with little or unconvincing elaboration.
- A better answer would be “20+ years because that’s what I learned in training.” Such an answer shows knowledge, but also a lack of insight.
- An ideal answer would be “If the integrity of the lubricant film is never compromised, then Hertzian fatigue is minimized and there is little reason for the bearing ever to fail. Things that might compromise the film are overloads, vibration exposure, viscosity changes (perhaps due to temperature), contamination, incompatible lubricants, out-of-spec shafts/housings, belt tension, stray electric currents, or abuse. However, these things are generally the result of errors and are therefore preventable.”
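One item in that list, overloads, can be illustrated with the basic rating life (L10) calculation found in bearing catalogs. This sketch is not from the original post: the load rating, applied loads, and speed are hypothetical, and the formula estimates fatigue life only. It says nothing about the lubrication, contamination, or installation factors that the ideal answer emphasizes.

```python
def l10_hours(dynamic_rating_kn: float, load_kn: float, rpm: float,
              exponent: float = 3.0) -> float:
    """Basic rating life in hours: L10h = (1e6 / (60 * n)) * (C / P) ** p.

    C is the basic dynamic load rating, P the equivalent dynamic load,
    n the speed in rpm, and p the life exponent (3 for ball bearings,
    10/3 for roller bearings). L10 is the life that 90% of a group of
    identical bearings is expected to reach or exceed.
    """
    if dynamic_rating_kn <= 0 or load_kn <= 0 or rpm <= 0:
        raise ValueError("rating, load, and speed must all be positive")
    million_revolutions = (dynamic_rating_kn / load_kn) ** exponent
    return million_revolutions * 1_000_000 / (60.0 * rpm)


# Hypothetical ball bearing: C = 30 kN, running at 1800 rpm.
design_life = l10_hours(30.0, 2.0, 1800.0)      # at the design load
overloaded_life = l10_hours(30.0, 4.0, 1800.0)  # at twice the design load

print(f"Design load: {design_life:,.0f} h")
print(f"Double load: {overloaded_life:,.0f} h")
print(f"Life reduction factor: {design_life / overloaded_life:.0f}x")
```

With a life exponent of 3, doubling the load cuts the calculated fatigue life by a factor of eight, which is why an unrecognized overload (belt tension, misalignment) shows up as a "typical" short bearing life in the plant.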
Henry Ford compares the freedom and dignity of an employee to that of a businessman and finds the life of a businessman sometimes inferior to that of an employee:
I am sometimes asked whether it is better to go into business for oneself, or to take employment. Employment as a career competes with private business in a way which few realize. Employment now offers a career such as men sought in their own business and often failed to attain. The very growth of business has tended to give employment a status which even business ownership did not have fifty years ago. A great deal of nonsense had been written about the freedom of the workman under the old system. The old-time guild system held nothing of the ideal. The union rules and repressive tradition of that system weighed heavily alike upon master and man, and led to little satisfaction for the individual and to no prosperity for society.
—Henry Ford, Today and Tomorrow
It is still fashionable for employees to dream of striking out on their own, while those who actually run businesses would cast off the cares of accounting and compliance and making payroll and marketing to focus on practicing their trade and serving customers.
Perhaps it is just another case of the grass always being greener on the other side.
But who has greater potential for fulfillment? The employee who can engage in work they have passion for while (relatively) unencumbered by mundane business cares, or the business owner who is struggling with sales or taxes or accounting and barely has time to think about their primary service?
There is potential for fulfillment, as well as risk, in either path.
As promised, here is the first in a short series of posts on interviewing people for reliability positions. These questions will fit well in a behavior-style interview for evaluating knowledge and skills.
Tell me about a time when you had to determine the causes of a recurring problem.
- How did you determine the causes?
- What were the solutions proposed?
- What were the results?
- Since this is not an initiative question, the candidate does not need to show that they proactively identified a problem. Either they could have found the problem themselves or it could have been given to them. It would actually be best if the problem was given to them since it is less likely to be a problem which they have chosen as particularly interesting or which they have special knowledge about. This helps isolate K&S factors from initiative.
- The plural form of “cause” is used on purpose. It is very unlikely that any problem tracked down to latent roots has just one cause. Therefore, any response that settles on just one cause is unlikely to be as thorough as RCA actually needs to be. An alternative approach would be to ask about a singular cause, in order to try to misdirect them to think it is okay to identify just one single cause, but it might then get into the realm of “trick questions.” The phrasing clearly indicates the expectation that more than one cause is identified.
- Portfolios are under-used in the technical professions, but the ideal candidate will bring written examples of their RCAs in preparation for questions like this. If the printed example follows a logical structured process, they are probably in the top 1%. However, probe to ensure that the example has some substance behind it, and question anything that looks like an assumption.
Qualities of an Ideal Answer
- The failure mode is well defined. Sometimes, the failure definition will change midstream as new information becomes available and the problem is not what people first thought. That’s good. Sporadic problems are generally just what they appear to be: “The shaft broke and the roll fell.” With chronic problems the definition of the failure mode often changes as new information is gathered. The answer should sound almost like a story.
- Drives to multiple latent causes. Stopping at a physical cause is not acceptable as a failure analysis. Examples of physical causes are: lubrication breakdown, shock load, fatigue, contamination, and abrasion. Stopping at a human cause (decision errors) is also not acceptable unless it was proven to be an intended violation or act of sabotage (rare). Examples of human causes are: decision not to follow procedure, decision to use different lubricant than specified, decision to fill tank all the way instead of ¾ full, decision to use screwdriver as pry bar, and decision to use hammer instead of induction heater to install bearing. Latent causes come from asking “why” about human causes, and often deal with perverse incentives. Examples of latent causes are: buyer’s performance measured by cost savings, mechanic received frequent past recognition for getting the job done quickly, and correct tools/parts/supplies not available or out of stock.
- Verification used at all steps. No assumptions are made at any point in the analysis based on past history. Human and latent causes are verified through interviews and feedback from people involved. Strong interview skills are a must for RCA work. Good interviewers have to “hear” what others don’t say (whether due to embarrassment or malicious intent) as much as what they do say.
- The solutions proposed tie directly back to cause. Solutions must not be detached from cause. Proposals can address any of the causes, not just “root” causes, but a direct link must always exist. The department that owns the cause (Engineering, Operations, Maintenance, or Procurement) must own the solution. For example, an equipment installation problem must not be addressed with periodic maintenance inspections.
- Demonstrates economic thinking in the level of analysis chosen. Example: metallurgical analysis is avoided if visual inspection can do the job. Metallurgical analysis would generally be chosen to answer a specific question whose answer can’t be obtained from more economic options.
- Demonstrates organizational savvy in driving change. Corrective actions should be reviewed and aligned upon by department managers before assigning and trying to implement on the floor. Convincing a department manager that a problem is coming from their department and winning their support in correcting it takes a level of tact and skill that will only be found in an exceptional candidate.
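The rule about not stopping at physical or human causes can be pictured as a check on a cause tree. The sketch below is only an illustration of that rule: the `Cause` structure and the `unfinished_branches` helper are hypothetical names, not part of any RCA tool or of the original post, and the example causes are borrowed from the lists above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PHYSICAL, HUMAN, LATENT = "physical", "human", "latent"


@dataclass
class Cause:
    """One node in a cause tree; children answer "why" for this node."""
    description: str
    kind: str                   # PHYSICAL, HUMAN, or LATENT
    intentional: bool = False   # proven violation or sabotage (rare)
    children: List["Cause"] = field(default_factory=list)


def unfinished_branches(cause: Cause, path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every branch that stops at a physical cause, or at a human
    cause that was not proven to be intentional."""
    path = path + (cause.description,)
    if not cause.children:
        stops_early = cause.kind == PHYSICAL or (
            cause.kind == HUMAN and not cause.intentional
        )
        return [path] if stops_early else []
    found: List[Tuple[str, ...]] = []
    for child in cause.children:
        found.extend(unfinished_branches(child, path))
    return found


# Hypothetical analysis of a bearing failure with two branches.
tree = Cause("lubrication breakdown", PHYSICAL, children=[
    Cause("decision to use different lubricant than specified", HUMAN, children=[
        Cause("correct supplies out of stock", LATENT),
    ]),
    Cause("decision to use hammer instead of induction heater", HUMAN),
])

for branch in unfinished_branches(tree):
    print("Needs another 'why':", " -> ".join(branch))
```

Here the first branch reaches a latent cause and passes, while the second stops at a human decision and is flagged. In an interview, the equivalent check is simply to keep asking "why" about each decision the candidate describes.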
This quote is on the title page of Chapter 2 of The Science of Success:
Regardless of whether you are an entrepreneur or whether you are an employee of a large company, the absolute prerequisite is that you must know your stuff. There is no substitute for this.
—Fred C. Koch
When it comes to technical skills vs. people skills, the recruiting process at many companies gives short shrift to technical skills. The assumption is that the combination of education and work experience on the resume “qualifies” a candidate for a position. The remainder of the interview process, therefore, is focused on soft skills.
It takes but a brief reflection to see the flaw in this reasoning. When I attended commencement for my degree, there were a few people who shocked me by their attendance. “These people are graduating?!”
I’ve seen resumes with glaring errors in the details. “You were responsible for three 80,000 hp air compressors?!” “Yes, I was.”
I’ve asked straightforward questions of experienced candidates that revealed deficiencies in the technical body of knowledge required for the position. “How long should a bearing last and why?” “About 5 years. That’s how long they lasted in the plant I worked in.”
None of this is to imply that soft skills are unimportant. Identifying behaviors that will disrupt the workplace and cause disharmony is a valid goal of recruiters and interviewers.
Yet, it is time for some companies to restore a bit more balance. Rather than 95% soft to 5% technical skills, perhaps a shift to 75% soft to 25% technical skills would add value. Often, mastery of the field’s body of knowledge can be evaluated within two or three questions. Later, I will post some of the questions I’ve asked in interviews for reliability engineer positions, and a few tips on how to interpret the answers.
Quoted in What Went Wrong?, Chapter 12:
I deplore the phrase “Near Miss”, because it has such happy-go-lucky connotations. A near miss is an accident that, solely by chance, did not happen. Near Hit is better… For every 400 near-hits, there is a fatal or serious injury. Railways must find a way of capturing this information, and turning it into part of the learning process. If you cover up a near hit (which is so easy) the elephant trap stays in place waiting for next time.
This quote is on the title page of Chapter 2 of The Science of Success:
The man who grasps principles can successfully select his own methods. The man who tries methods, ignoring principles, is sure to have trouble.
—Ralph Waldo Emerson
This quote is on the title page of Chapter 1 of The Science of Success:
In truth, there is no such thing as a growth industry, I believe. There are only companies organized and operated to create and capitalize on growth opportunities. Industries that assume themselves to be riding some automatic growth escalator invariably descend into stagnation. The history of every dead and dying “growth” industry shows a self-deceiving cycle of bountiful expansion and undetected decay.
—Theodore Levitt, Marketing Myopia, Harvard Business Review, 1960
They say that you shouldn’t mistake a bull market for personal genius. People riding the stock market bubble of the 1990s, or the real estate bubble of the 2000s, or the government bond bubble of the 2010s might want to keep this quote in mind.
Today’s potential injury from What Went Wrong? by Trevor Kletz involves a simple change that could happen anywhere combined with our tendency to go into “auto-pilot” with routine tasks.
A plant used sulfuric acid and caustic soda in small quantities, so the two substances were supplied in similar plastic containers called polycrates. [Two polycrates of acid were kept on one side of the tank and two polycrates of caustic were kept on the other side]. While an operator was on his day off, someone decided it would be more convenient to have a polycrate of acid and a polycrate of alkali on each side. When the operator came back, no one told him about the change. Without checking the labels, he poured some excess acid into a caustic crate. There was a violent reaction, and the operator was sprayed in the face. Fortunately, he was wearing goggles.
In this case, the book suggests that people should be told about changes made while they are away, and that containers for different chemicals should vary in size, shape, or color. What other measures might be taken to prevent this kind of incident?