Maintenance in Data Centers Discussion with George Parada

George Parada, Global Asset Management Manager at Facebook, joins Ryan Chan, CEO & Founder of UpKeep, to discuss George's data center maintenance experience.

Duration: 8 minutes

George Parada

Published on January 29, 2021

Data centers are responsible for so much of the information transfer we all take for granted today. Managing maintenance and reliability for these centers can therefore pose some unique challenges.

Ryan Chan, CEO and Founder of UpKeep, welcomed George Parada, Global Asset Management Manager at Facebook, to join him on a webinar to share his experiences. George manages programs and processes around preventive maintenance, life cycle management, and spare parts optimization to keep Facebook’s data centers up and running.

George's Journey Into Maintenance and Reliability

When George graduated from college, he started in a plant environment where he was responsible for a production area for a global food and beverage company.

“Like Ryan, I was in operations and got really fed up with how maintenance was being executed in my department,” George said. “I could see there were big opportunities to get Cargill’s equipment to be more reliable. Our machines kept breaking down, and we were missing customer orders.”

George started to think about how he could make a difference. After a conversation with his manager, he was tasked with something he had never done before: running the maintenance department.

With a background in mechanical engineering, George loved figuring out how things worked. “We actually made a lot of improvements very quickly,” George said. “It was a matter of getting the skill sets of our folks to the level that they needed to be. A lot of them had a passion to learn new things like predictive and condition-based inspections. We got into that whole space of oil analysis and vibration and just really started to understand how we could continue to improve.”

A few years later, George was asked to lead the maintenance reliability team for the entire business unit, which included about 22 locations around the world.

“I had the opportunity to see best practices at other sites, which allowed us to leverage those for the entire organization,” George explained. After 13 years, George moved over to Constellation Brands, which wanted to establish a reliability program from the ground up.

“I had an opportunity at Constellation to start from a clean slate,” George said. “Our work order system was a whiteboard and some Excel spreadsheets. After exploring options, we found several software solutions. We went through a whole vetting process, and UpKeep shook out in the end for that organization.”

Constellation quickly ramped up and implemented the CMMS across its wineries and operations. “The key was really understanding the foundation that was needed,” George explained. “What processes did we need to improve? Where should we focus? Understanding the people processes as well as the work management processes was essential. We also learned that we had to keep improving our maintenance strategies, particularly around our people. We had to help them execute those strategies effectively.”

After 18 months, George was hired by Facebook to manage a spare parts optimization program, which eventually became one of the foundational elements of the company’s asset management program. “There were a lot of other opportunities at Facebook,” George said. “We started thinking about asset management and how it all ties into our work order system. It has continued to grow, and we are working on processes to not only keep our customer happy but also to make sure we have sustainable, reliable processes. At the end of the day, our mission at Facebook is to give the people the power to build community and just really bring the world closer together. Our data centers are a big piece of that.”

How Do You Manage Data Center Cooling and Redundancy?

As the webinar got under way, the first question addressed data centers, cooling systems, and the need for redundancy. George agreed that those systems are critical to reliable operations.

“I would say cooling systems are probably some of our most critical assets,” George said. “Typically, data centers have a great deal of electrical-based assets where the cooling is critical to keeping the racks and servers at the optimal temperature. We need to make sure those assets are running reliably and safely.”

Organizations cannot run critical assets to failure. However, how much redundancy to build in, and how to structure it, is always an open question. “I think you have to identify your goal, what you are really trying to accomplish, and the number of data halls you need to cool,” George said. “For us, we define the appropriate level of redundancy that we need to have, and we just make sure that we’re focusing on the right maintenance strategy for those specific assets.”
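
To make the redundancy trade-off concrete, here is a minimal sketch of how adding a spare cooling unit changes the chance of having enough capacity. It assumes every unit fails independently and has the same availability; the function name and all of the numbers are illustrative assumptions, not figures from Facebook or from the webinar.

```python
from math import comb


def system_availability(units_installed: int, units_required: int,
                        unit_availability: float) -> float:
    """Probability that at least `units_required` of `units_installed`
    units are running, assuming independent units with equal availability."""
    unit_unavailability = 1 - unit_availability
    return sum(
        comb(units_installed, k)
        * unit_availability ** k
        * unit_unavailability ** (units_installed - k)
        for k in range(units_required, units_installed + 1)
    )


if __name__ == "__main__":
    # Hypothetical data hall that needs 4 cooling units, each up 98% of the time.
    no_spare = system_availability(4, 4, 0.98)   # N: no redundancy
    one_spare = system_availability(5, 4, 0.98)  # N+1: one redundant unit
    print(f"N   availability: {no_spare:.4f}")
    print(f"N+1 availability: {one_spare:.4f}")
```

Real cooling plants share pumps, power feeds, and controls, so failures are rarely fully independent. A calculation like this is a starting point for the conversation about redundancy levels, not a substitute for an engineering study.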

Currently, Facebook is going through an asset criticality process and trying to understand its equipment failure modes and the likelihood of experiencing them. The company is introducing certain elements of condition-based maintenance and exploring Internet of Things (IoT) solutions.

An organization like Facebook will have different redundancy levels when compared with a manufacturing facility. “Obviously, a server room at Facebook is going to be high on the criticality list,” George said. “You have to take things like OEM recommendations and make a determination about the asset’s operational context. You have to understand the failure modes, and some of that needs to be data-driven. We want to find ways to optimize our inspections and how we can rightsize the maintenance for our specific organization.”
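
George does not describe a specific scoring method, but a common way to start a criticality ranking is to multiply the consequence of a failure by its likelihood. The sketch below assumes 1-to-5 scales for both; the asset names, scores, and strategy thresholds are hypothetical and would need to be set from an organization's own data and operational context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    consequence: int  # assumed scale: 1 (minor impact) to 5 (severe impact)
    likelihood: int   # assumed scale: 1 (rare failure) to 5 (frequent failure)

    @property
    def risk_score(self) -> int:
        """Simple consequence-times-likelihood score (1 to 25)."""
        return self.consequence * self.likelihood


def suggest_strategy(asset: Asset) -> str:
    """Map a risk score to a maintenance strategy using hypothetical thresholds."""
    if asset.risk_score >= 15:
        return "condition-based"
    if asset.risk_score >= 6:
        return "preventive"
    return "run-to-failure"


def rank_assets(assets: list[Asset]) -> list[Asset]:
    """Return assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk_score, reverse=True)


if __name__ == "__main__":
    # Illustrative assets and scores only.
    assets = [
        Asset("hallway lighting", consequence=1, likelihood=3),
        Asset("data hall cooling unit", consequence=5, likelihood=3),
        Asset("main switchboard", consequence=5, likelihood=2),
    ]
    for asset in rank_assets(assets):
        print(asset.name, asset.risk_score, suggest_strategy(asset))
```

Keeping the scoring this simple makes it easy to review with operators and to revise as failure data accumulates, which fits the data-driven approach George describes.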

What Are Maintenance and Reliability Differences Between Industries?

Although an outsider looking in may expect that working in food and beverage is completely different from working at a big tech company like Facebook, there are actually many similarities when it comes to maintenance and reliability.

“At the end of the day, if you really think about it, whether you’re managing assets at a data center or for a bottling plant, you’re still managing assets,” George said. “The questions are the same. Which assets are critical? Which ones need maintenance? Which ones run to failure? Where should I be investing?”

According to George, the main difference between industries is the maturity level of reliability programs. Industries such as chemical, oil, and gas that have been maturing their reliability programs for decades are at a different level than most data center organizations that have recently started their journey.

What Are Common Assets Found Within a Data Center?

The next question addressed the types of assets within a data center, and how Facebook determined it was time to add a reliability program for its systems.

According to George, most large data center operators, such as Amazon and Google, have the same kind of infrastructure, including an electrical substation and incoming electrical components that feed into the data center. Other common assets are transformers, switchboards, and electrical gear as well as cooling systems for the compute and storage equipment. You will also find typical building and facilities equipment such as fire systems, domestic water, and lighting.

When Did Facebook Realize It Had to Manage Assets?

As Facebook continues to grow, capacity requirements will continue to be evaluated. “Four years ago, Facebook had about four to five data centers around the world,” George said. “That number has been rapidly expanding, and now we have more than 20 data centers globally. At this growth rate, we absolutely need to establish solid maintenance programs and practices to keep these data centers safe, reliable, and efficient.”

Besides scaling up preventive maintenance programs, Facebook is starting to discuss managing the aging of these data center assets. “I’ve worked in plants that were 80 to 100 years old; Facebook is talking about operations that are maybe five or 10 years old,” George said. “The reality is that technology in data centers is ever-evolving, and we need to get ahead in our industry.”

How Do You Develop Talent?

With the rapidly expanding technological needs for data centers and related maintenance and reliability, developing the people to manage it all is always a challenge.

“Our team is very small right now, which makes it easier to focus on individual development plans,” George said. “More importantly, we want to address what the team needs as a whole. Where is the team falling short? What skills do we need?”

For example, the Facebook team will be taking a 12-week Certified Reliability Leader training course through Reliabilityweb.com. “We want to make sure our team has these skills and can learn from best-in-class organizations,” George said. “I think it’s important to think about your team and how individual team members are mastering their roles. I want to identify gaps where I can coach and mentor them as their manager. That said, I think individuals need to understand that although managers need to be supportive and help remove barriers, employees have to own their own development.”

What Is the Most Important Lesson You’ve Learned?

To wrap up the webinar, George shared the best lesson he’s learned over the years.

“I think the biggest lesson is that reliability is a journey,” George reflected. “When you think about trying to implement an asset management or reliability program, you’re really just trying to take the organization to the next level. I think sometimes we spend a lot of time just trying to perfect it before we roll it out. Sometimes, we need to be okay with continuous improvement and be okay with delivering an 80% solution because that’s what we need now.”

After implementation and training, companies are going to find ways to improve the process, and maybe the next time around, it will be a 90% solution. “It’s also important to move fast,” George said. “You don’t want perfection to block you and keep you from making a significant impact right now, but also keeping focused on your north star.”

Note: This article is based on a webinar “Maintenance in Data Centers: A Live Discussion” with George Parada and Ryan Chan. To view the recording of the webinar, visit this link.
