How to Find a Needle in the Desert

If you happened to be up and in system a couple of weeks ago, you may have noticed a prolonged period of “disruption”. While I won’t go into the details of what happened, for a variety of reasons, the short story is that one evening the brigade on rotation here began to see instability across the satellite network affecting both TDMA and FDMA. We were informed by the commercial satellite provider that a carrier on the satellite was causing extensive interference and that it was causing problems across the entire 500 MHz of the transponder (for anyone who doesn’t know satellites, that’s a pretty big deal). After some time, the provider contacted us again and informed us that the problem terminal was indeed located in the Fort Irwin area.

The unit and NTC quickly, and unfortunately unsystematically, began muting terminals in an effort to stop the interference. After nearly 12 hours, the interference ended. When it was over, the brigade in the box had four terminals out of 16 that had been identified as “possible” problem terminals and were ordered to stay off the satellite until further notice. The disruption likely cost the satellite provider well over a million dollars in lost revenue, and its thousands of other customers on the satellite considerably more.

The focus of this post is sort of an AAR of the effort to identify the responsible terminal, what could have been done to improve the situation from the very beginning, and how to minimize the chances of this happening again.

Peak and Pol

Anyone who has been around satellite communications for more than a few days is familiar with the terms peak and pol, but many may not realize what exactly they are doing. “Peaking” refers to ensuring that your dish is optimally pointed at the satellite and that your terminal is transmitting at just the right power. Too little power and the signal is disrupted and unusable. Too much power and you risk jamming a portion of the spectrum or, as in this particular case, raising the overall noise floor across the entire spectrum. “Poling”, or polarization, is the adjustment of the polarization of your signal to the satellite. By using multiple polarizations on the satellite, the company is able to reuse the spectrum more efficiently and increase the number of customers it can support. Again, if the polarization is off, you may be forced to raise power, or you may bleed into another customer on the same spectrum.

It is standard procedure for any satellite terminal to call the controller and peak and pol before they begin to access, but in reality a very small number actually do. This MUST occur each and every time that the dish is moved. If you simply stow the dish without moving it, it is possible to return to the previously established configuration (assuming you actually wrote it down) but if the terminal is moved in any way, the process needs to be repeated.

For many units, the primary reason for not doing this is that making a call to the satellite controller isn’t always the easiest thing in the world. The simple fact is that there is always a way for you to peak and pol, even if it means putting some guy on a hilltop to get a cell phone signal and talking to him over the FM radio, or talking to NETOPS via FM and having them call the controller. It’s not the easiest thing in the world, but it must be done.

Knowing Terminal Location

Knowing terminal locations is important for a huge number of reasons (it’s nice not to blow up your own guys), but in this particular case it matters because knowing those locations could have greatly aided in finding the terminal responsible. The satellite provider was able to provide us with a plot of the general area where the responsible terminal was located. I had known this was possible, although I expected the area to be fairly large. I was impressed to see an ellipse that, while probably 50 miles long, was only about 5 miles wide. When you take into consideration that NTC has a training area the size of Rhode Island, that allowed us to eliminate a large number of the terminals involved in the rotation as possibly responsible.

Unfortunately for the unit, at the time they did not have good locations for each terminal which meant that instead of being able to focus on a few select terminals, they instead had to go through a process of coming off the satellite for each and every terminal. Only afterwards were we able to determine terminal location and use it to help confirm the possible terminals that were responsible.
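
As a rough illustration of how good location data narrows the search, the check below tests which terminals fall inside a geolocation ellipse. It is only a sketch under stated assumptions: positions are flat east/north offsets in miles from an arbitrary reference point, the ellipse parameters and terminal names are made up for illustration, and the provider's actual plot format is not something I am describing here.

```python
import math
from dataclasses import dataclass


@dataclass
class Ellipse:
    """Search area in local flat coordinates (miles east/north of a reference point)."""
    center_east: float
    center_north: float
    semi_major: float   # half the length, e.g. 25 for a 50-mile-long ellipse
    semi_minor: float   # half the width, e.g. 2.5 for a 5-mile-wide ellipse
    bearing_deg: float  # direction of the long axis, degrees clockwise from north


def inside_ellipse(area: Ellipse, east: float, north: float) -> bool:
    """Return True if the point lies inside or on the ellipse."""
    d_east = east - area.center_east
    d_north = north - area.center_north
    theta = math.radians(area.bearing_deg)
    # Project the offset onto the ellipse's long and short axes.
    along = d_east * math.sin(theta) + d_north * math.cos(theta)
    across = d_east * math.cos(theta) - d_north * math.sin(theta)
    return (along / area.semi_major) ** 2 + (across / area.semi_minor) ** 2 <= 1.0


def candidate_terminals(area, terminals):
    """Keep only the terminals whose reported location is inside the ellipse."""
    return [name for name, (east, north) in terminals.items()
            if inside_ellipse(area, east, north)]


# Hypothetical data: a 50 x 5 mile ellipse running east-west, three made-up terminals.
area = Ellipse(center_east=0.0, center_north=0.0,
               semi_major=25.0, semi_minor=2.5, bearing_deg=90.0)
terminals = {
    "TERMINAL-A": (10.0, 1.0),   # inside
    "TERMINAL-B": (0.0, 5.0),    # too far north of the long axis
    "TERMINAL-C": (30.0, 0.0),   # past the end of the ellipse
}
print(candidate_terminals(area, terminals))
```

With a current location on file for every terminal, a filter like this turns "mute everyone" into "check these few first".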

Logs

Logs are critical to the management and operation of a network. There are two types of logs: automated (syslog server, SNMPc, etc.) and manual (think of your duty log from staff duty). Each provides important information, especially in a situation like this, and unfortunately in this case none were accurately kept. Early in the troubleshooting process, the brigade main and the NTC JNN were pulled off the satellite as possible offenders. When this happened, it not only greatly hindered the brigade’s ability to later contact teams and direct them, but it also caused them to lose visibility of the network, so they could no longer see who was up, who wasn’t, and when they came off. Camp Roberts (the RHN involved) had a duty log that captured some key events, as well as automated logs such as SNMPc and the MRT, but we discovered that the time from the MRT was not consistent and that their SNMPc logs had limited information for what we were looking for.

When we went to examine the terminals, we found that no manual logs were kept to identify key actions they had taken (access, deaccess, receiving commands related to the dish, etc.). In one case the time on the SNMPc logs was not consistent, and in the other SNMPc wasn’t running at all. In both cases the syslog collector was also not running, which made it impossible to determine exactly when each terminal had done something.

For its part, the satellite provider had detailed RF plots that were able to give us to the second when the interference started and ended but without other logs to compare it to, they were of limited value.

Both automated and manual logs would have made this process significantly easier and must be kept at each terminal. SNMPc should be configured to monitor key links and system statuses. The syslog collector should be running to receive informational and warning messages from the routers, switches, and firewalls. In both cases, the clocks on the laptop and on the network devices should be set to a common time zone (normally Zulu) and kept as accurate as possible through the use of NTP. This is the only way to ensure that logs are collected in a clear way and can be easily compared across the network.
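
To make the point about a common time base concrete, here is a small sketch of converting locally stamped log entries to Zulu and then matching them against the moment the provider saw the interference stop. The timestamps, terminal names, UTC offsets, and the 30-second tolerance are all assumptions for illustration; none of them come from the actual incident.

```python
from datetime import datetime, timedelta, timezone


def to_zulu(stamp: str, utc_offset_hours: float) -> datetime:
    """Convert a 'YYYY-MM-DD HH:MM:SS' local timestamp to an aware UTC datetime."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=local_zone).astimezone(timezone.utc)


def events_near(events, moment, tolerance=timedelta(seconds=30)):
    """Return the log events that fall within `tolerance` of `moment`."""
    return [event for event in events if abs(event[0] - moment) <= tolerance]


# Hypothetical log entries: (time in Zulu, terminal, action).
events = [
    (to_zulu("2020-01-01 18:00:10", -8), "TERMINAL-A", "mute"),  # laptop on Pacific time
    (to_zulu("2020-01-02 01:15:00", 0), "TERMINAL-B", "mute"),   # already on Zulu
]

# Hypothetical time the provider's RF plot shows the interference ending.
interference_end = datetime(2020, 1, 2, 2, 0, 0, tzinfo=timezone.utc)

for when, terminal, action in events_near(events, interference_end):
    print(when.isoformat(), terminal, action)
```

A match like this is only as trustworthy as the clocks behind it, which is why NTP and a single time zone matter so much.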

Manual logs should also be kept by the operators and include key events. In a previous assignment, I was the operations officer for a satellite control facility where DISA regulations required my operators to log everything (to include when someone covered the console position so they could use the restroom). While I don’t think that JNNs/CPNs need that kind of fidelity, they should at the very least include information about access, deaccess, instructions they receive from NETOPS and other controllers, and things along those lines. If we had had this information for the locations involved, I think it would have been much easier to identify the terminal in question.
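
For operators who keep the duty log on the laptop rather than on paper, a minimal sketch of what a consistent entry might look like follows. The event categories are taken from the list above; the field layout, function names, and sample entry are hypothetical, not any official format.

```python
from datetime import datetime, timezone

# Event types drawn from the key events discussed above; "OTHER" is a catch-all.
KEY_EVENTS = {"ACCESS", "DEACCESS", "INSTRUCTION", "OTHER"}


def format_entry(when: datetime, terminal: str, event: str, note: str = "") -> str:
    """Build one log line with a Zulu timestamp so entries compare across terminals."""
    if event not in KEY_EVENTS:
        raise ValueError(f"unknown event type: {event}")
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    return f"{stamp} | {terminal} | {event} | {note}"


def append_entry(path: str, entry: str) -> None:
    """Append the entry to a plain-text log file, one event per line."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(entry + "\n")


# Hypothetical entry for a made-up terminal.
entry = format_entry(
    datetime(2020, 1, 2, 2, 0, 10, tzinfo=timezone.utc),
    "TERMINAL-A",
    "DEACCESS",
    "Muted on NETOPS instruction",
)
print(entry)
```

A paper log with the same four fields (Zulu time, terminal, event, note) works just as well; the point is that every entry carries a time that can be compared against everyone else's.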

Trained Operators

As I wrote very recently, having trained operators is very important to the operation of the network, but it is something that many units are lacking. In this case, we ran into some problems with operators who were given instructions on procedures to perform but weren’t trained in how to actually do them. This led to delays in the actions being performed and further delayed the overall process of removing the offending system from the network and ending the interference.

Positive Contact

Being able to have positive contact with every terminal is important. Once upon a time the Army gave each STT a satellite phone that was supposed to be part of the STT. This would allow the operator to contact controllers anywhere in the world, and the controllers to contact them. Unfortunately, this was short lived and is rarely the case anymore. Instead we rely on cell phones.

Use of cell phones is severely restricted for rotational units at NTC; however, there is an exception to policy for STT operators. Unfortunately, the exception doesn’t mean that the unit will actually set up in a location where there is cell phone reception. As I said above, this hinders units when it comes to peak and pol, and can make it impossible for the satellite controller to directly contact the terminal (assuming that the SAA actually has a valid contact number on it).

Instead NETOPS may be forced to act as that point of contact who then reaches out and relays instructions to the actual terminal. This is normally done simply by calling the other terminal on SIPR or NIPR but when they are down due to interference or because your own dish is stowed, it can make that impossible as was the case here. Instead NETOPS had to rely on FM communications and FIPR messages which caused significant delays. Some BNs didn’t have FM communications, others didn’t respond to FIPR messages and many others didn’t acknowledge the instructions that they did receive. Being able to reach out and communicate with each node through a variety of means is the only way to ensure positive control of our terminals and these communications means need to be verified regularly.

Final Thoughts

In the end, this was an extremely unfortunate incident made worse for a variety of reasons. From the satellite provider’s side, they lost access to a critical asset for a considerable amount of time, losing a huge amount of money in the process. For their customers, the same is true. We were told of one customer who operated a number of offshore oil rigs. When those rigs lost their communications link, they were required by law to stop drilling.

For the unit, the incident caused a large disruption in being able to operate during the interference itself. Following the interference, several terminals were forced to stay off the satellite. We were eventually able to clear two of those terminals and get them back up; however, two more were forced to stay down for many days while we worked to prove or disprove which terminal was responsible. Having some or all of the information I wrote about above may have greatly aided our efforts to find the terminal that was responsible and gotten it back online much sooner.

Update: There was a comment in one of the Facebook forums that said in part “Sounds like the unit in question, or at least their communicators, have a discipline and leadership issue. It might be training, but experience tells me that they know they’re supposed to do these things but they don’t.” While I don’t disagree that the operators/unit should know to do at least some of these things (peak and pol in particular, but others as well), this article was not written as a hit on the particular unit but instead to bring to light things that I see literally every single rotation. These guys just happened to be the unit that got affected by this particular event, which served as a great learning opportunity. I would hazard to say that if this same thing happened on pretty much any other rotation, my comments would be more or less exactly the same.