I was reflecting on one of the more challenging problems I’ve encountered during my career as an IT infrastructure architect. As IT architects move higher up the enterprise stack, they encounter challenges that stretch beyond their core area of expertise. An architect’s role becomes gathering the subject matter experts and guiding the troubleshooting sessions, drawing on an architect’s overall knowledge of distributed systems. One of the more challenging examples in my career was troubleshooting a large-scale webcast system.
A team I led was tasked with providing streaming video of quarterly town hall meetings. In normal circumstances, the project would be fun. The source of video was analog video fed through a recording studio into a networked video encoder. At times, a question and answer session would be added via the high-definition video conferencing system. The output would be streamed via a web-based video server to about 12,000 desktops. Overall, a nice technical challenge.
However, the team was faced with a couple of business challenges and a slew of technical hurdles. The first business challenge was that we didn’t agree on the scope of the service provided to our client. The client expected broadcast TV quality video. My team’s expectation was to provide YouTube (circa 2012) video quality. What we were actually delivering fell short of even YouTube quality. The symptoms included stuttering and skipping video.
As a result, we were in danger of being placed on notice for the service. While the service was a relatively small part of a $150M contract, it would impact our performance ratings for the overall contract. Therefore, our organization’s chances of winning the renewal of the $150M contract were at risk. To say we were getting a good deal of attention was an understatement.
Here’s a list of groups and devices involved in maintaining the system.
In addition to my direct support organization, we had “help” from the mother ship. The $150M program was up for re-bid, and we’d have to compete against the likes of HP ES and IBM for the new services contract. I was surprised by the amount of time I spent working on politics vs. technology. I reported weekly status to my management team and the client’s management team, all in person.
Initial troubleshooting efforts
An event like this will test not only the technical skill of an architect but also their communication and leadership ability. With all the technical stakeholders, there will be an unending stream of suggestions about where to focus. Outsiders will recommend you implement end-to-end quality of service. Vendor partners will suggest additional investment in their products. Engineers will ask for more analytical tools. A successful architect will help organize the chaos and provide a clear set of paths for investing time and resources.
Ultimately, as in most distributed systems, the problem didn’t have a single source. The symptoms were caused by compounding issues that started at the video encoders, were magnified at the webcast servers, and were hidden within different layers of the network.
While an enterprise architect needs deep technical skill, this is only one aspect of their skill set. They need to understand the infrastructure, the applications that run within it, and the business operations and drivers associated with the service.
We made a series of changes that ultimately resolved the issues. It was by far one of the most complex challenges I experienced in my career. It also showed me where my passions lay. If you are looking to hire or become a CTO or architect, there are some lessons to be learned in troubleshooting distributed systems.