Exchange 2007: SCR replication repair

Last week I had to do some serious debugging on Standby Continuous Replication (SCR).  We discovered that one of our SCC clusters had quit replicating to the SCR node at the other site.  We’re not sure why (we think it’s because the SCR node was rebooted without replication being cleanly suspended first), but the ramifications of failed replication are interesting.

In the Exchange 2003 world, you had to depend on your backups running smoothly to purge log files from the log disks, or else you’d eventually find your databases dismounting in the middle of the day because you were out of space.  Exchange 2007 and storage group replication add a new complexity to that.  Now, not only do your backups have to succeed, your log file replication has to be working well too.  If your replication is broken for any reason, Exchange 2007 will not purge those log files.  We discovered that log files were not being purged and voila… databases dismounted.

So, with that in mind, I thought I’d share part of the email that went around to the team on how to troubleshoot storage group replication, just in case someone out there needs it.

(introduction cut)

Sometime last week, SCOM started complaining about the ReplayQueueLength being elevated on SCR02.  This meant that replication had, once again, halted for some reason.  I thought I’d share how to debug and correct this should it happen again.

Open up Exchange Management Shell ON THE PASSIVE NODE.  To check the replication status of a storage group, type:

Get-StorageGroupCopyStatus -server <source_server> -standbymachine <scr_server>

For instance:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

This will produce a list of the storage groups and the replication status.

First column lists the storage group.

Second column informs you of the overall status.  Should be == HEALTHY.

Third column lists the CopyQueueLength.  This is how many log files still have to be copied from the active source node to the passive SCR node.  Should be a low number or zero.  Anything higher that is not decrementing means there is likely an issue developing.  SCOM is probably (hopefully) alarming about it if that’s the case.

Fourth column lists the ReplayQueueLength.  This is how many log files still need to be replayed into the passive copy of the database on the SCR side.  Should be 50 or below.  Above 50 indicates there is some kind of problem on the passive SCR side.  DO NOT BE ALARMED by this number being 50.  Exchange is hard-coded to not replay anything into the SCR copy until it has 50 log files.  We cannot change this.  If we were to activate the SCR copy, it would replay these 50 files.

Fifth column lists the last time the replication service checked the log files out (should be FAIRLY recent, depends on database usage).
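
If the list is long, it can help to show only the storage groups that need attention.  The following is a rough sketch rather than a tested script: it assumes the five columns described above correspond to the Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength and LastInspectedLogTime properties, and it reuses the example server names from this post.

```powershell
# Show only the storage groups whose SCR copy is not reporting Healthy.
# exchscc02 / exchscr02 are the example source and SCR target used in this post.
Get-StorageGroupCopyStatus -Server exchscc02 -StandbyMachine exchscr02 |
    Where-Object { $_.SummaryCopyStatus -ne 'Healthy' } |
    Format-Table Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, LastInspectedLogTime -AutoSize
```

No output means every storage group is reporting Healthy.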

If you discover any of the storage groups are in a state of “suspended” or “failed”, you must debug the issue.

If the storage group is in a “suspended” state, you may be able to restart the replication with little issue.  Try this (RUN THESE COMMANDS FROM THE SCR OR PASSIVE NODE ONLY!)

Resume-StorageGroupCopy -identity <servername\sgname> -standbymachine <scr_server>

If the log files are intact and things are happy, replication will restart and you’ll be told that all is well.  If something goes awry at this point, the storage group will drop to a failed state.  You can run Get-StorageGroupCopyStatus again to double-check where things are after trying a resume.

If you get a storage group in a FAILED state, things are a little more delicate.  Make sure there are no issues with the servers talking to each other.  CHECK EVENT LOGS, especially the APPLICATION log, for any errors (PLEASE always do this FIRST, for EVERY EVERY EVERY EVERY EVERY and I mean EVERY (did I say EVERY?) Exchange issue!)  Make sure the replication service is started on both nodes.  Make sure they can ping each other.  Make sure they can open a Windows Explorer window to each other’s c$ share.  Check all of that out before proceeding.
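
Those checks can be run from the shell as well.  This is a sketch under a few assumptions: the replication service is registered under the name MSExchangeRepl, Exchange events are logged under sources starting with “MSExchange”, and you run it on one node against the other (exchscr02 here is just the example name from this post).  It sticks to cmdlets and parameters available in PowerShell 1.0.

```powershell
# Name follows the example in this post; set it to the OTHER node.
$other = 'exchscr02'

# 1. Recent errors in the local Application log from Exchange sources.
Get-EventLog -LogName Application -Newest 200 |
    Where-Object { $_.EntryType -eq 'Error' -and $_.Source -like 'MSExchange*' } |
    Format-List TimeGenerated, Source, EventID, Message

# 2. Is the replication service running on the other node?
#    (Service name MSExchangeRepl is an assumption; verify it on your servers.)
Get-WmiObject Win32_Service -ComputerName $other -Filter "Name='MSExchangeRepl'" |
    Format-Table Name, State, StartMode -AutoSize

# 3. Basic reachability and access to the admin share.
ping $other
Test-Path ('\\' + $other + '\c$')
```

Run it once from each node so both directions get checked.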

If you can find absolutely no reason why the servers cannot talk to each other and the SGs should be replicating fine, you can try to reseed the databases.  This is a time-consuming operation and could consume lots of bandwidth.

Before reseeding a database, you must put the FAILED storage groups in the suspended state.  In this example, let’s assume exchscc02\SG02 went down to a FAILED state.  First, we suspend it:

Suspend-storageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02

Now run Get-StorageGroupCopyStatus again to verify it is suspended:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

Verify that SG02 is now showing SUSPENDED.

Now the moment of truth.  BE CAREFUL to execute this ONLY on the passive node (usually the SCR node).  This command DELETES THE PASSIVE COPY of the database and log files and restarts replication!  There’s no going back once you’ve made this decision.  Choose carefully.

Update-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02 -deleteexistingfiles

After a confirmation and a pause, you should get a progress bar as the live copy of the edb file is copied over the wire to the passive copy and log files begin accumulating.

After this completes, be sure to run Get-StorageGroupCopyStatus again to verify everything is healthy again.
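
If you would rather not retype the command while the copy catches up, a small polling loop does the job.  This is only a sketch: the one-minute interval and the ten-iteration limit are arbitrary choices of mine, and the identity is the example storage group from above.

```powershell
# Check the reseeded storage group once a minute, ten times, then stop.
for ($i = 0; $i -lt 10; $i++) {
    Get-StorageGroupCopyStatus -Identity exchscc02\sg02 -StandbyMachine exchscr02 |
        Format-Table Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength -AutoSize
    Start-Sleep -Seconds 60
}
```

You are looking for the status to settle on Healthy and the CopyQueueLength to drop toward zero.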

There are no reboots or storage group/database offlines required for any of these commands.

(end email)

After reviewing the notes and the activities that led up to the failed replication states, we decided that, as a standard operating procedure, replication should be manually suspended on all SCC-to-SCR pairs prior to patching and rebooting machines.  This means, of course, that replication has to be resumed after your patch party is over.

Suspending is pretty much the same as above:

Suspend-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

You could get fancy and pipe Get-StorageGroupCopyStatus to this command, and it would probably fill in all of the identity stuff.  That’d be fine, but I prefer to do things the hard way I guess.  I like to take it easy.
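
For anyone who does want the fancy version, here is one way it might look.  I have not verified that the copy status objects bind cleanly on the pipeline, so this sketch loops over Get-StorageGroup instead and passes each identity explicitly.  It assumes every storage group on the source server has an SCR copy on the same target; the server names are the examples from this post.

```powershell
# Suspend SCR replication for every storage group on the source before patching.
# Assumes all storage groups on exchscc02 replicate to exchscr02.
Get-StorageGroup -Server exchscc02 | ForEach-Object {
    Suspend-StorageGroupCopy -Identity $_.Identity -StandbyMachine exchscr02 -Confirm:$false
}
```

Swapping in Resume-StorageGroupCopy gives you the matching loop for after the reboot.  Drop -Confirm:$false if you would rather be prompted for each storage group.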

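Then when your patch party is over, resume replication.  A plain resume is enough for a cleanly suspended copy; Update-StorageGroupCopy reseeds the database and is only needed if the copy ends up FAILED:

Resume-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>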

Hope these notes help someone out there struggling with Exchange 2007.