Replacing Proxmox Virtual Environment Server in a Ceph cluster

This article covers replacing a Proxmox Virtual Environment server in a hyperconverged Ceph configuration.

This guide assumes a four-node cluster with hostnames node1 through node4, and that node2 is the node being replaced.
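
Before starting, it's worth a quick pre-flight check from any node to confirm the cluster has quorum and Ceph is healthy. A minimal sketch using the standard pvecm and ceph tools:

  # Confirm cluster membership and quorum
  pvecm status
  # Confirm Ceph reports HEALTH_OK before beginning any maintenance
  ceph -s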

  1. Using HA, migrate running VMs from node2 to node1, or to any other node that has ample resources and is currently a member of Ceph.
  2. Set the OSDs on node2 to "out" and wait for the rebalance to complete (see the command sketch after this list).
  3. Following rebalance, stop and destroy all OSDs on node2.
  4. Remove Ceph mon and manager from node2.
  5. Clean up the Ceph CRUSH map and remove the host bucket using "ceph osd crush remove node2".
  6. From a node that's still participating in Ceph, run "pvecm delnode node2".
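
A rough sketch of the commands behind steps 2 through 6 (the OSD IDs are placeholders; use ceph osd tree to see which OSDs actually live on node2):

  # Step 2: mark node2's OSDs out (IDs 4 and 5 are examples)
  ceph osd out osd.4
  ceph osd out osd.5
  # Watch the rebalance; wait until all PGs are active+clean
  ceph -s
  # Step 3: on node2, stop and destroy each OSD
  systemctl stop ceph-osd@4
  pveceph osd destroy 4
  # Step 4: on node2, remove its monitor and manager
  pveceph mon destroy node2
  pveceph mgr destroy node2
  # Step 5: remove the now-empty host bucket from the CRUSH map
  ceph osd crush remove node2
  # Step 6: from a remaining node, remove node2 from the Proxmox cluster
  pvecm delnode node2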

The node is now decommissioned and no longer participating in Ceph. It can be removed. Let's install the replacement.

  1. Physically remove the old server and install the new one. Cable and power the new server.
  2. Configure the LOM (such as a Dell DRAC) to use the correct IP address.
  3. Set the new node2's management IP address to the IP of the previous machine. Validate connectivity.
  4. Edit /etc/hostname and /etc/hosts so the hostname matches the previous install's hostname.
  5. Reboot and verify hostname and IP are correct.
  6. If the previous machine had a Proxmox license, apply it now.
  7. Validate network connectivity on the corosync network and on both the Ceph frontend (consumption and management) and backend (replication) networks to all other nodes (see the command sketch after this list).
  8. Join the Proxmox cluster.
  9. Install Ceph.
  10. Add Ceph Mon and Ceph Manager to this node.
  11. Migrate a test VM to the new node to confirm it can consume Ceph storage.
  12. If there are any other maintenance tasks to complete (such as swapping another node onto the previous node2's hardware), do NOT add OSDs back to node2 until you are ready.
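
A rough sketch of steps 7 through 11 (the 10.10.x.x addresses and VM ID 100 are placeholders; substitute your own corosync, Ceph frontend, and Ceph backend networks and a real test VM):

  # Step 7: from the new node2, check reachability of an existing node on each network
  ping -c 3 node1            # management / corosync address (may be separate subnets)
  ping -c 3 10.10.10.1       # example Ceph frontend (public) address of node1
  ping -c 3 10.10.20.1       # example Ceph backend (cluster) address of node1
  # Step 8: on the new node2, join the Proxmox cluster by pointing at an existing member
  pvecm add node1
  # Step 9: install the Ceph packages on the new node2
  pveceph install
  # Step 10: recreate the monitor and manager on this node
  pveceph mon create
  pveceph mgr create
  # Step 11: live-migrate a test VM to node2 (run from the node currently hosting it)
  qm migrate 100 node2 --online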

A similar series of steps can be taken if existing drives are being moved to a new server installation, keeping the OS and OSDs, as opposed to installing new drives. We'll assume a replacement of node1 with the previous node2's hardware.

  1. Using HA, migrate running VMs from node1 to node2, or to any other node that has ample resources and is currently a member of Ceph.
  2. Unlike before, set the noout flag: the OSDs aren't actually going anywhere, so we do not want a rebalance (see the command sketch after this list).
  3. Shut down node1.
  4. Physically move the boot and data drives from node1 to the donor that was previously node2.
  5. Un-rack the now driveless node1 and replace it with the now-populated donor (previously node2). Cable it.
  6. Power on and configure the LOM (such as a Dell DRAC) to use the correct IP address.
  7. Validate system boot.
  8. Validate network connectivity on the corosync network and on both the Ceph frontend (consumption and management) and backend (replication) networks to all other nodes.
  9. Verify that the OSDs are online and that all PGs report active+clean.
  10. Unset the noout flag.
  11. We can now safely add OSDs back to the new node2 and allow it to rebalance. This could take a long time, up to several days, depending on the amount of storage.
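
A rough sketch of the flag handling for this procedure, plus the eventual OSD creation from step 11 (/dev/sdb is a placeholder for each new data disk on the new node2):

  # Step 2: prevent rebalancing while node1's OSDs are briefly offline
  ceph osd set noout
  # Steps 9-10: after the hardware swap, confirm the OSDs are up and the PGs clean,
  # then clear the flag
  ceph osd tree
  ceph pg stat
  ceph osd unset noout
  # Step 11: once all other maintenance is finished, create the OSDs on the new node2
  pveceph osd create /dev/sdb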