Our secondary DNS and DHCP server died last Sunday. Apart from a few people noticing that some services were slower on the network, it was a non-event, and that is a good thing. Rather than just restoring the old server, we decided to upgrade the OS to the latest version of Red Hat, and DHCP and DNS to whatever versions are supported on that Red Hat release. I realize that is the easy way out, but we used to run hand-compiled versions and I just did not see the advantage. I am going to take the time to document the upgrade process for those planning their own upgrade.
After NS2 died, the primary DHCP server started to run out of leases because the peer held all of the free leases, so we told the primary that its peer was down. Make sure you have an OMAPI port defined in your dhcpd.conf file:
# This is for omshell
omapi-port 7911;
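A quick way to confirm the statement is really there before relying on omshell is to grep for it. This is only a sketch: the helper name has_omapi_port is made up for illustration, and the default path /etc/dhcpd.conf is an assumption that may not match your Red Hat release.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given dhcpd.conf contains an
# uncommented "omapi-port <number>;" statement.
has_omapi_port() {
    grep -Eq '^[[:space:]]*omapi-port[[:space:]]+[0-9]+[[:space:]]*;' "$1" 2>/dev/null
}

# Assumed default path; pass a different file as the first argument.
conf="${1:-/etc/dhcpd.conf}"
if has_omapi_port "$conf"; then
    echo "omapi-port is defined in $conf"
else
    echo "no omapi-port statement found in $conf"
fi
```

A commented-out line such as "# omapi-port 7911;" is deliberately not counted as a match.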
From this site we got the basics for the following script:
omshell << EOF
connect
new failover-state
set name = "dhcp-failover"
open
set local-state = 2
update
EOF
Here are the options for setting the failover state in omshell. Note that in this list 2 is normal and partner_down is 4:
/* A failover peer's running state. */
enum failover_state {
    unknown_state = 0, /* XXX: Not a standard state. */
    startup = 1,
    normal = 2,
    communications_interrupted = 3,
    partner_down = 4,
    potential_conflict = 5,
    recover = 6,
    paused = 7,
    shut_down = 8,
    recover_done = 9,
    resolution_interrupted = 10,
    conflict_done = 11,
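To avoid typing the wrong number, the state list can be wrapped in a small lookup. The following is a minimal sketch, not something we ran in production: the function names are hypothetical, the numbers are copied from the list above, and the peer name "dhcp-failover" is the one from our script. It only prints the omshell commands; it does not run them.

```shell
#!/bin/sh
# Hypothetical helper: map a failover state name to its number,
# using the values from the enum listing above.
failover_state_num() {
    case "$1" in
        startup)                    echo 1 ;;
        normal)                     echo 2 ;;
        communications_interrupted) echo 3 ;;
        partner_down)               echo 4 ;;
        potential_conflict)         echo 5 ;;
        recover)                    echo 6 ;;
        paused)                     echo 7 ;;
        shut_down)                  echo 8 ;;
        recover_done)               echo 9 ;;
        resolution_interrupted)     echo 10 ;;
        conflict_done)              echo 11 ;;
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (do not execute) the omshell commands
# that set the local state of the named failover peer.
print_omshell_commands() {
    peer="$1"
    num=$(failover_state_num "$2") || return 1
    cat <<EOF
connect
new failover-state
set name = "$peer"
open
set local-state = $num
update
EOF
}

print_omshell_commands dhcp-failover partner_down
```

After reviewing the printed commands, they could be piped into omshell on the surviving server, assuming omshell can reach the OMAPI port with its default connection settings.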
Here are all of the DNS/DHCP servers we built for the upgrade:
NS1 — Primary server that needed to be upgraded, physical machine.
NS2 — Secondary server, DOA physical machine.
NS3 — Temporary secondary server, virtual machine.
NS4 — New primary DNS/DHCP server, physical machine.
NS5 — New test primary DNS/DHCP server, virtual machine.
NS6 — New test secondary DNS/DHCP server, virtual machine.
The plan was to test the upgrade process on NS5 and NS6 while one of the other team members built NS4. This may look like overkill, but let me explain the rationale behind each server. After the failure of NS2, the first thing we did was stand up a third DNS server, NS3, as a new secondary so that we had a live copy of all of our zones should something happen to our primary DNS. Initially we turned on DHCP for this server as well, but because the versions of the failover protocol differed between the servers, we left only DNS running. The failover protocols in versions 3.0 and 3.1 are different enough that they are not compatible. This server was not actually being queried by end users but was there as a failsafe should we need one. It has been left running as an immediate option for the future.
Once we had a secondary server that would maintain current state, we started building servers for the upgrade process. NS4 would eventually become the new primary DNS/DHCP server and is a physical machine. When it was brought online it was first a secondary server to NS1 so that it had a complete DNS database; it was then promoted to be the new NS1. Because a physical machine takes so much longer to build, we quickly spun up NS5 and NS6 as test servers. The plan was to test on NS5 and NS6, promote NS6 to be the new NS2, and convert NS4 to the new NS1. The reason we didn’t just build new servers and move to them was that we did not want to have to change our IP helper addresses throughout our network.
Here is a step-by-step outline of the actual go live.
1. Build NS4 as secondary DNS server to NS1 so that it has a copy of the DNS database and we don’t have to copy files from NS1.
2. Secure shell into each of the servers to be worked on during this time.
ssh into ns1 on the backup NIC.
ssh into ns2 on the backup NIC.
ssh into ns4 on the backup NIC.
We have a dedicated network for backup traffic. I connected through the backup NIC so that I could change the primary addresses without losing connectivity to the servers.
3. Stop DHCP on NS1 and copy the lease database to the other servers.
service dhcpd stop
scp /var/state/dhcp/dhcpd.ad.leases root@ns2.chainringcircus.org:/var/state/dhcp/dhcpd.leases
scp /var/state/dhcp/dhcpd.ad.leases root@ns4.chainringcircus.org:/var/state/dhcp/dhcpd.leases
4. Shut the interfaces on NS1 before taking down DNS.
ifconfig eth0 down
ifconfig eth1 down
5. Start DHCP on NS2 so that DHCP stays available and we don’t have too many problems.
service dhcpd start
6. Shut down DNS on NS1.
rndc freeze  # Make sure there are no .jnl files left.
service named stop
7. Convert NS4 to NS1. We left the old NS1 up for now in case we needed to copy files or bring that server back online.
Change /etc/sysconfig/network to be ns1.chainringcircus.org
Change the addresses from NS4 to NS1
cp ~/DNS-Primary/ifcfg-eth0 /etc/sysconfig/network-scripts/
cp ~/DNS-Primary/ifcfg-eth1 /etc/sysconfig/network-scripts/
cp ~/DNS-Primary/named.conf.primary /etc/named.conf
8. We tested to make sure everything was running correctly and then rebooted the new NS1 to make sure it came up correctly.
shutdown -r now
9. Shut down the old NS1 for the last time.
shutdown -h now
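Two of the manual checks in the outline above, confirming that no .jnl files remain after rndc freeze and changing the hostname in /etc/sysconfig/network, can be scripted. This is a rough sketch under assumptions: both function names are hypothetical, and the paths in the usage comments (/var/named for the zone directory in particular) are guesses that need to be adjusted for your servers.

```shell
#!/bin/sh
# Hypothetical helper: list any BIND journal files left under a zone
# directory. Prints nothing if the directory is clean.
leftover_journals() {
    find "$1" -type f -name '*.jnl' 2>/dev/null
}

# Hypothetical helper: rewrite the HOSTNAME= line in a sysconfig-style
# file. The edit goes to a temporary file first; the original is only
# overwritten if sed succeeds, and keeps its permissions.
set_hostname_line() {
    file="$1"
    new_name="$2"
    tmp="$file.tmp.$$"
    sed "s/^HOSTNAME=.*/HOSTNAME=$new_name/" "$file" > "$tmp" \
        && cat "$tmp" > "$file" \
        && rm -f "$tmp"
}

# Example usage (paths are assumptions, run as root on the right host):
#   rndc freeze
#   [ -z "$(leftover_journals /var/named)" ] && service named stop
#   set_hostname_line /etc/sysconfig/network ns1.chainringcircus.org
```

The journal check is worth doing before stopping named, since a leftover .jnl file suggests that dynamic updates have not all been written into the zone file yet.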