During an online upgrade of an Ubuntu 18.04 server to Ubuntu 20.04, the system became unresponsive for over an hour.
One option was to reset the host, but this could have caused serious problems for the upgrade.
At this point even the host's virtual machine manager interface was unresponsive, so we had to use their help line.
This sorry saga ended when we asked the host to temporarily upgrade the RAM and CPU so that the upgrade could complete.
Here is a transcript of some of the events:
When the upgrade started, we got this message:

    To make recovery in case of failure easier, an additional sshd will
    be started on port '1022'. If anything goes wrong with the running
    ssh you can still connect to the additional one.
    If you run a firewall, you may need to temporarily open this port. As
    this is potentially dangerous it's not done automatically. You can
    open the port with e.g.:
    'iptables -I INPUT -p tcp --dport 1022 -j ACCEPT'

    To continue please press [ENTER]

    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    Hit http://za.archive.ubuntu.com/ubuntu bionic InRelease
    ...

    ... Very long delay after finding initrd image, then things went wrong ....

    Found linux image: /boot/vmlinuz-4.15.0-153-generic
    Found initrd image: /boot/initrd.img-4.15.0-153-generic
    done
    Setting up python3-gi (3.36.0-1) ...
    Setting up libnet-libidn-perl (0.12.ds-3build2) ...
    Setting up cloud-initramfs-copymods (0.45ubuntu1) ...
    Setting up proftpd-basic (1.3.6c-2) ...
    Installing new version of config file /etc/default/proftpd ...
    Installing new version of config file /etc/init.d/proftpd ...
    usermod: no changes
    Failed to reload daemon: Connection timed out
    Failed to reload daemon: Connection timed out
    Failed to retrieve unit state: Connection timed out
    Failed to start proftpd.service: Connection timed out
    See system logs and 'systemctl status proftpd.service' for details.
    invoke-rc.d: initscript proftpd, action "start" failed.
    Failed to get properties: Connection timed out
    invoke-rc.d: release upgrade in progress, error is not fatal

    ... New CPU and RAM added ...

    Setting up libvariable-magic-perl (0.62-1build2) ...
    Setting up libb-hooks-op-check-perl (0.22-1build2) ...
    ...
    Installation completed!
During the upgrade many configuration files were updated, and in each case only the default option was selected. This is a highly complex server with many services, but in the end everything appears to be working.
These are some of the configuration files for which the default had to be chosen:
- postfix (no configuration was selected)
- jail.conf
- nginx.conf
- /etc/services
- /etc/logrotate.conf
- /etc/bind/named.conf.default-zones
- /etc/bind/named.conf.options
- /etc/dovecot/conf.d/10-mail.conf
- /etc/ssh/sshd_config
- /etc/dovecot/conf.d/20-pop3.conf
- /etc/default/snmpd
- /etc/snmp/snmpd.conf
- /etc/mysql/mysql.conf.d/mysqld.cnf
- /etc/default/opendkim
- /etc/opendkim.conf
- /etc/default/spamassassin
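
One way to review these choices afterwards is to look for the copies that dpkg sets aside when it asks about a configuration file. The sketch below is our own illustration and was not run during this upgrade. It assumes the usual .dpkg-old, .dpkg-dist and .dpkg-new suffixes, plus the .ucf-old and .ucf-dist suffixes used by ucf. The function name find_leftover_configs is hypothetical.

```python
import os

# Suffixes that dpkg and ucf are assumed to append to the copy they set
# aside when a configuration file prompt is answered.
LEFTOVER_SUFFIXES = (".dpkg-old", ".dpkg-dist", ".dpkg-new", ".ucf-old", ".ucf-dist")


def find_leftover_configs(root="/etc"):
    """Return a sorted list of files under root that end in a leftover suffix."""
    found = []
    # os.walk silently skips directories it cannot read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(LEFTOVER_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    for leftover in find_leftover_configs():
        print(leftover)
```

Each path it prints can then be compared with the live file next to it (for example with diff) to see what the default choice kept or discarded.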
Finally, the following cross-checks were done:
- Check if both IP addresses are available afterwards
- Send and receive test emails
- Check billing system health including automatic invoice generation
- Check SMS Gateway
- Check BIND replication
- Check Fail2ban
- Check if the firewall is running
- Check if email is working
- Check whether AppArmor causes issues, e.g. with the control panel
- Tail as many log files as possible, especially syslog
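
Some of the checks above can be partly automated. The following is a minimal sketch and not the procedure we actually followed. It only tests whether a TCP port accepts connections, so it says nothing about whether mail is delivered or zones replicate. The service-to-port map is an assumption and must be adjusted to the services really running. The names port_is_open and run_checks are hypothetical.

```python
import socket

# Assumed service-to-port map, for illustration only.
CHECKS = {
    "ssh": 22,
    "smtp": 25,
    "dns (tcp)": 53,
    "pop3": 110,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def run_checks(host, checks=CHECKS):
    """Return a dict mapping each service name to True (reachable) or False."""
    return {name: port_is_open(host, port) for name, port in checks.items()}


if __name__ == "__main__":
    for name, ok in run_checks("127.0.0.1").items():
        print(f"{name:12} {'ok' if ok else 'FAILED'}")
```

Running run_checks once per IP address of the server also covers the first item in the list.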
In the end, the most complex problem to troubleshoot, believe it or not, was ionCube and PHP 7.4 not working properly together.