Tuesday, April 12, 2016

Distributed Infrastructure Testing with Raspberry Pi

Consider this scenario: early in the morning, a couple of emails are waiting in the inbox from users at a remote office, more than six hours' time difference away, complaining that they cannot reach any internal IT service from the office. The only way they can reach those services is by connecting directly from their smartphones or from home.

The first thing to check: is there any alert from our monitoring system? No, there is none. In fact, everything looks good: all services are up and running, the office’s devices (switches, firewalls, etc.) are reachable from the main datacenter, latency looks normal, and our firewall is not dropping packets.

So what is the problem? The only way to find out is to connect to a computer in the remote office and try to access the internal services. We were lucky that one user allowed us to connect to his computer. After a couple of tests, the problem was evident: a DNS resolution issue. Accessing the internal services by IP address worked perfectly, but accessing them by DNS name did not.

Our firewall was acting as the DNS server for this office, forwarding each DNS query to the right DNS server depending on whether the domain belonged to an internal or an external service. For some reason, the firewall was trying to reach the internal DNS resolver through the wrong interface. That interface had previously been used to connect the remote office with the internal services, but we had changed the Internet connection and added a new virtual interface for it. The change was made the night before the failure, and our internal tests had passed.
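The symptom described above (a service that answers by IP address but not by name) is easy to test for from inside the office. What follows is only a minimal sketch, not the script we actually used. The function names are made up for illustration, and it assumes that the hostname and the known IP address passed in refer to the same service.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname, known_ip, port,
             resolve=socket.gethostbyname, reachable=tcp_reachable):
    """Classify a failure as DNS-related or network-related.

    hostname and known_ip are assumed to point at the same service.
    resolve and reachable are parameters so the logic can be tested offline.
    """
    ip_ok = reachable(known_ip, port)
    try:
        resolved = resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = None

    if resolved is None:
        return "dns_failure" if ip_ok else "dns_and_network_failure"
    if not ip_ok:
        return "network_failure"
    return "ok"
```

Called as `diagnose("intranet.example.internal", "10.0.0.10", 443)` (both values are placeholders), a result of `dns_failure` would match what we saw that morning: the network path was fine, but names did not resolve.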

There are two questions to be asked here:

  • How can we prevent this from happening again?
  • How can we detect this issue before our users do?

The first step is to understand why this issue happened, and then to work out how to stop it from happening again. After brainstorming with the team, we agreed that our main problem was a lack of visibility from the remote office. We were running the tests and monitoring all services from our main datacenter, but we didn’t have the users’ perspective of the IT services.

A second problem arose during this troubleshooting: we needed access to a device in the same office to test or troubleshoot any issue. However, users weren’t always able to help, and in some cases making a change after normal working hours required somebody in the office who was willing to help us.

It was evident that we needed to set up a device locally that could help monitor and test the infrastructure. However, this raised many questions:

- What will the hardware specs be?
- What if the device fails?
- How are we going to monitor all services?
- How are we going to install, configure and manage all those devices?

To make the evaluation simpler, we established the following rules:

- The device must run Linux.
- The device must be low cost and easily replaceable.
- The device should be configured centrally and then shipped to each office.

We chose the Raspberry Pi as our monitoring and remote infrastructure testing device. It meets all of the requirements:

- It runs Linux.
- It’s a low-cost device.
- We can configure it centrally, and later ship it to the remote offices.

To make the deployment process simpler, we created a Linux image (Raspbian) that can easily be copied to the SD cards using dd.

However, we always need to make changes to the images once they are installed. Installing new packages or simply upgrading the current installation are very important processes in our operation, and we didn’t want to connect to each device manually to carry them out. The obvious solution is a configuration management system, in our case Ansible.

Ansible helps us configure all devices and keep their configuration in sync and current. However, we still needed to add a couple of services to satisfy the main requirements.

First, we added the Nagios agent to the Raspberry Pi. Instead of adding many checks, we decided to keep it simple and set a rule of no more than three checks:

- Internal service located in our main datacenter.
- Internal service located in AWS.
- External web site.

All three checks must be end-to-end tests: log into the service (for the first two cases) and access a feature. A failure in any of the checks could point to any of several root causes. Adding more specific checks (local Internet link, main datacenter Internet link, DNS resolution, service availability, etc.) would give extra granularity, but at the expense of a more complex configuration and a sharp increase in the number of checks generated by the Nagios system.
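To give an idea of what such an end-to-end check can look like, here is a minimal sketch of a Nagios-style plugin. It is not our production check: the function name, the step labels and the `warn_seconds` threshold are assumptions made for illustration. The only part taken from Nagios itself is the plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN).

```python
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def end_to_end_check(name, steps, warn_seconds=5.0):
    """Run each step in order and return (exit_code, status_line).

    steps is a list of (label, callable) pairs, for example a login step
    followed by a step that accesses a feature. A step signals failure by
    raising an exception; the first failure stops the check.
    """
    start = time.monotonic()
    for label, step in steps:
        try:
            step()
        except Exception as exc:
            return CRITICAL, f"{name} CRITICAL - step '{label}' failed: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return WARNING, f"{name} WARNING - {len(steps)} steps took {elapsed:.2f}s"
    return OK, f"{name} OK - {len(steps)} steps in {elapsed:.2f}s"
```

A wrapper script would print the status line and exit with the returned code, which is how the agent reports the result back. Because the check only says whether the whole login-and-use path works, a CRITICAL result tells us that something is wrong for that office, not what.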

We don’t need this granularity. If a failure occurs, we can connect to the Raspberry Pi in the affected remote office and run a set of commands to understand the root cause.

One additional requirement for the troubleshooting process was to make scripts and commands easy to run, so that our support team, or whichever team member is on call, can receive the alarm and act on it. If they first have to connect over ssh to the right device and then run a set of commands, troubleshooting becomes harder and takes longer than our clients are willing to wait to have all services up and running again.

For this reason, we introduced RunDeck as a central point from which to access the Raspberry Pis and run commands and scripts on them. RunDeck provides a web interface that can be reached from any web browser, and our team can deploy infrastructure tests easily using a central Git repository. With one click, we can check whether all VPNs, Internet links and MPLS connections are up.

When a remote office has a failure, we run the service availability check for that office in RunDeck and wait for the result. The same applies after making a change: instead of working through a checklist manually, we just run the infrastructure tests for the remote office.
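The kind of script such a job runs can be very small. The sketch below is illustrative only: the function name, the office label and the check labels are invented, and each check is just a callable that returns True or False. It also assumes that the job step is marked as failed when the script exits with a non-zero code.

```python
def run_infrastructure_tests(office, checks):
    """Run every check and return (exit_code, report_lines).

    checks maps a label (for example "VPN") to a zero-argument callable
    that returns True on success. A check that raises counts as failed.
    """
    lines = []
    failures = 0
    for label, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            failures += 1
        lines.append(f"[{office}] {label}: {'PASS' if passed else 'FAIL'}")
    passed_count = len(checks) - failures
    lines.append(f"[{office}] {passed_count}/{len(checks)} checks passed")
    # Non-zero exit code if anything failed, so the job shows up as failed.
    return (0 if failures == 0 else 1), lines
```

Printing the report lines and exiting with the returned code gives whoever is on call a single pass/fail answer plus a short list of which links are down, without opening an ssh session.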



We’re still learning and experimenting with this approach, incrementally adding more tests. But we have already gained more visibility into our infrastructure.