What I Want from Data Center Management Software

(note: the following is a stream of consciousness post regarding some software requirements as i dream them up.  if you are a developer and actually take up these requirements as the design for a software project, please let me know.  if you are aware of a software product that accomplishes all of this, please do not bother to let me know about it.  i don’t care.  fact is, nothing on the market today does this well enough to make me care about it the way i want to care about it.)

Let’s face it.  Documentation sucks.

I’ve traveled around this country and seen many an IT environment.  All of them have one thing in common: the documentation about the environment sucks.  It’s in such a sad state that should anything happen to anything, nothing would be recoverable.

We’re guilty of it in our own environment.  I’m not going to sit here and disparage everyone else’s IT environment without realizing that it’s a problem where I work too.  I’ve spent a lot of time wondering why this documentation is in such a sad state and come to a few conclusions.  I suspect these conclusions aren’t a surprise to anyone.

  • The staff is overworked.  They have no time to sit in your meetings, listen to the managers and customers rant and rave about how nothing works right (funny how that not-listening thing travels both ways), or get all of their assigned work done to begin with.
  • Documentation is boring.  There is nothing glamorous about writing a Word document about how you configured a paging file.
  • After writing the documentation, maintaining it is a real bear, especially in an age when the corporation that owns 90% of your data center farts a new patch daily.  What?  Tuesdays?  Oh man, that’s just for OS patches.  Try running some enterprise software sometime.  (NOTE TO SELF: Bitch about Exchange 2007 more, because that obviously hasn’t sunk in yet).
  • Too many fires to put out.  Remember that not-listening-is-a-two-way-street crack?  Yeah well, since management didn’t listen about your needs, you’re working 70 hours this week to fix all the crap that broke.  Oh yeah, don’t forget to document what you did to fix it.  (Now it’s 90 hours).

I could go on, but I think you get the idea.

So, now that I’ve listed reasons why you do not have the documentation, let’s talk about what happens when you do not sit near the data center and have questions about what’s what out there.

  • Need to find out what port a server is hooked up to?  Scan through your endless amounts of PSTs on the file share (haha!) to discover what port was assigned two years ago.  Fail.  Look for the document.  Oh!  Wait.  Fail that too.  No docs.  Ask someone who is sitting in the data center.  How the hell should they know?  They’re busy and don’t have time to help you.  Oh, by the way, that cable isn’t labeled anyway.  Look it up in the docs, dumbass.  Yeah, what docs?  Time to get in the car and drive over to look for yourself, cursing all the way that you have no documentation.
  • Need switch zoning information for that fabric?  See above.  At least you can log in to the switch remotely… until Java fails.  Drive over.
  • Time to build a server.  Time to put it into production.  What do you mean it’s got a bug we fixed two years ago?  Oh, shit.  We forgot that NoServerTimeZoneComp registry key.  It’s always the Mac users that make your Windows admin lives hell, right?  No, buddy, it’s because you didn’t follow the documentation.  Uhh, what documentation?

I think I’ve stated my case.  Now then.  I want software that can overcome the burden of writing this documentation and I want it available in damn near real time.  So, here goes.

I want data center management software that:

  • …is object-oriented like C#.  I want to be able to instantiate a new instance of a Dell 2950 and define its properties – like what rack it’ll be in and what U numbers it occupies.
  • …can perform discovery on that new Dell 2950 and figure out the rest of the properties for the object (a la service tag number, CPU, RAM, maintenance left on contract, etc.)
  • …can allow me to connect the network to a specific switch port by dragging and dropping a line like Visio.
  • …can allow me to connect it to a storage area network like the network connection above.
  • …can produce a 3-dimensional rack drawing (the rack itself should be just another object, since we’re object-oriented and the server objects are just properties) that details every network connection, fibre hookup and power connection.
  • …can, upon sensing a failure from SCOM 2007 or NetIQ, label each server and cable that has failed so I can look for common properties in an anomaly (because it’s always the network’s fault).
  • …is able to produce a server installation document by right-clicking on it and selecting “current state documentation.”  I want it in PDF format so I don’t have to open fracking Microsoft Word ever, ever, ever again.  I want it to be able to spot every piece of software that is loaded on the server.  I want it to be able to tell me every patch and registry tweak that has been applied to that server since I racked it and installed the operating system.
  • …is able to alert me when servers are about to run out of maintenance.
  • …is visual enough that the customers can use a dashboard of sorts to view some of the same properties and elements that I need to see.
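To make the "everything is an object" idea a little more concrete, here is a minimal sketch in Python.  Every name in it (Server, Rack, mount, expiring) is made up for illustration, the 42U rack size and 90-day warning window are assumptions, and it only covers the rack-position and maintenance-alert wishes, not discovery or drawing.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Server:
    name: str
    model: str
    first_u: int             # lowest rack unit the chassis occupies
    height_u: int            # chassis height in rack units
    maintenance_ends: date   # last day of the maintenance contract

    def units(self) -> range:
        """Rack units this server occupies."""
        return range(self.first_u, self.first_u + self.height_u)


@dataclass
class Rack:
    name: str
    size_u: int = 42         # assumed rack height
    servers: list = field(default_factory=list)

    def mount(self, server: Server) -> None:
        """Add a server, refusing positions outside the rack or already taken."""
        wanted = set(server.units())
        if min(wanted) < 1 or max(wanted) > self.size_u:
            raise ValueError(f"{server.name} does not fit in {self.name}")
        for other in self.servers:
            if wanted & set(other.units()):
                raise ValueError(f"{server.name} overlaps {other.name}")
        self.servers.append(server)

    def expiring(self, today: date, within_days: int = 90) -> list:
        """Names of servers whose maintenance ends on or before today + within_days."""
        cutoff = today + timedelta(days=within_days)
        return [s.name for s in self.servers if s.maintenance_ends <= cutoff]
```

With something like this, racking a box becomes `rack.mount(Server("exch01", "Dell 2950", 10, 2, date(2009, 1, 15)))`, and the maintenance alert is just a call to `rack.expiring(date.today())`.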

I think you see the challenge here.

Now I ask you…

…why doesn’t this software exist?


Exchange 2007: SCR replication repair

Last week I had to do some serious debugging on standby continuous replication (SCR).  We discovered that one of our SCC clusters had decided to quit replicating to the SCR node at the other site.  We’re not sure why (we think it’s because the SCR node was rebooted and replication was not cleanly suspended), but the ramifications of failed replication are interesting.

In the Exchange 2003 world, you had to depend on your backups running smoothly to purge log files from the log disks or else, eventually, you’d find your databases dismounting in the middle of the day because you were out of space.  Exchange 2007 and storage group replication have added a new complexity to that.  Now, not only do your backups have to succeed, your log file replication has to be working well too.  We discovered that log files were not being purged and voila… databases dismounted.  If your replication is broken for any reason, Exchange 2007 will not purge those log files.

So, with that in mind, I thought I’d share some of the email that was sent around to the team that discusses how to troubleshoot the storage group replication processes just in case someone out there needs it.

(introduction cut)

Sometime last week, SCOM started complaining about the ReplayQueueLength being elevated on SCR02.  This meant that replication had, once again, halted for some reason.  I thought I’d share how to debug and correct this should it happen again.

Open up Exchange Management Shell ON THE PASSIVE NODE.  To check the replication status of a storage group, type:

Get-StorageGroupCopyStatus -server <servername> -standbymachine <SCRnode>

For instance:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

This will produce a list of the storage groups and the replication status.

First column lists the storage group.

Second column informs you of the overall status.  Should be == HEALTHY.

Third column lists the CopyQueueLength.  This is how many log files must still be copied from the source active node to the passive SCR node.  Should be a low number or zero.  Anything higher that is not decrementing means there is likely an issue developing.  SCOM is probably (hopefully) alarming about it if that’s the case.

Fourth column lists the ReplayQueueLength.  This is how many log files need to be played into the passive copy of the database at the SCR side.  Will always be 50 or below.  Above 50 indicates there is some kind of problem at the passive SCR side.  DO NOT BE ALARMED by this number being 50.  Exchange is hard-coded to not play anything into the database until it gets 50 log files.  We cannot change this.  If we were to activate the SCR side of the node, it would play these 50 files in.

Fifth column lists the last time the replication service checked the log files out (should be FAIRLY recent, depends on database usage).
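If you would rather not eyeball the columns every time, the rules above can be sketched as a small check.  This is Python rather than Exchange Management Shell, it works on values you have already copied out of the output, and the names (CopyStatus, problems) plus the copy queue limit of 10 are assumptions for illustration.  A single snapshot also cannot tell you whether a queue is decrementing, so treat it as a first pass only.

```python
from dataclasses import dataclass

REPLAY_LAG = 50  # per the notes above, SCR holds back up to 50 logs before replaying


@dataclass
class CopyStatus:
    storage_group: str
    summary: str               # e.g. "Healthy", "Suspended", "Failed"
    copy_queue_length: int
    replay_queue_length: int


def problems(status: CopyStatus, copy_queue_limit: int = 10) -> list:
    """Return warnings for one row of output; an empty list means it looks fine."""
    found = []
    if status.summary.lower() != "healthy":
        found.append(f"status is {status.summary}")
    if status.copy_queue_length > copy_queue_limit:
        found.append(f"copy queue at {status.copy_queue_length}")
    # Exactly 50 is normal, so only flag values above the built-in lag.
    if status.replay_queue_length > REPLAY_LAG:
        found.append(f"replay queue at {status.replay_queue_length}")
    return found
```

A healthy storage group sitting at a replay queue of 50 comes back with no warnings, which matches the "do not be alarmed" rule.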

If you discover any of the storage groups are in a state of “suspended” or “failed” you must debug the issue.

If the storage group is in a “suspended” state, you may be able to restart the replication with little issue.  Try this (RUN THESE COMMANDS FROM THE SCR OR PASSIVE NODE ONLY!):

Resume-StorageGroupCopy -identity <servername\sgname> -standbymachine <SCRnode>

If the log files are intact and things are happy, replication will restart and you’ll be told that all is well.  If something goes awry at this point, the storage group will go down to a failed state.  You can run the get-storagegroupcopystatus to double check where things are after trying a resume.

If you get a storage group in a FAILED state, things are a little more delicate.  Make sure there are no issues with the servers talking to each other.  CHECK EVENT LOGS, especially APPLICATION log for any errors (PLEASE always do this FIRST, for EVERY EVERY EVERY EVERY EVERY and I mean EVERY (did I say EVERY?) Exchange issue!).  Make sure the replication service is started on both nodes.  Make sure they can ping each other.  Make sure they can open a Windows Explorer window to each other’s c$ share.  Check all of that out before proceeding.

If you can find absolutely no reason why the servers cannot talk to each other and the SG’s should be replicating fine, you can try to reseed the databases.  This is a time-consuming operation and could consume lots of bandwidth.

Before reseeding a database, you must put the FAILED storage groups in the suspended state.  In this example, let’s assume exchscc02\SG02 went down to a FAILED state.  First, we suspend it:

Suspend-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02

Now do another get-StorageGroupCopyStatus command to verify it is suspended:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

Verify that SG02 is now showing SUSPENDED.

Now the moment of truth.  BE CAREFUL to execute this ONLY on the passive node (usually the SCR node).  This command DELETES THE PASSIVE COPY of the database and log files and restarts replication!  There’s no going back once you’ve made this decision.  Choose carefully.

Update-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02 -deleteexistingfiles

After a confirmation and a pause, you should get a progress bar as the live copy of the edb file is copied over the wire to the passive copy and log files begin accumulating.

After this completes, be sure to run another get-StorageGroupCopyStatus command to verify everything is healthy again.

There are no reboots or storage group/database offlines required for any of these commands.
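To summarize the decision flow from this email, here is a rough Python sketch.  It only builds the command strings quoted above so you can read them before pasting them into Exchange Management Shell on the passive node; it runs nothing itself, and the function name is made up for illustration.  The FAILED branch assumes you have already checked event logs and connectivity as described.

```python
def next_commands(state: str, identity: str, standby: str) -> list:
    """Map a copy status to the commands from the notes above, in order."""
    target = f"-Identity {identity} -StandbyMachine {standby}"
    state = state.lower()
    if state == "healthy":
        return []  # nothing to do
    if state == "suspended":
        return [f"Resume-StorageGroupCopy {target}"]
    if state == "failed":
        # Reseeding deletes the passive copy, so suspend first and then update.
        return [
            f"Suspend-StorageGroupCopy {target}",
            f"Update-StorageGroupCopy {target} -DeleteExistingFiles",
        ]
    raise ValueError(f"unrecognized status: {state}")
```

After either path, a final Get-StorageGroupCopyStatus is still the way to confirm things are healthy again.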

(end email)

Upon review of the notes and the activities that led up to the failed replication states, it was determined that as a standard operating procedure, replication should be manually suspended on all SCC –> SCR nodes prior to patching and rebooting machines.  This means, of course, that replication has to be restarted after your patch party is over.

Doing this is pretty much the same as above:

Suspend-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

You could get fancy and do something like a pipe of Get-StorageGroupCopyStatus to this command and it would probably fill in all of the identity stuff.  That’d be fine, but I prefer to do things the hard way I guess.  I like to take it easy.

Then when your patch party is over:

Resume-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>
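If typing the identity for every storage group gets old, a short script can write the command list for you.  The Python sketch below is an illustration only: the function name is made up, you supply the storage group names yourself, and it assumes Resume-StorageGroupCopy (the cmdlet used earlier in these notes to restart a suspended storage group) is what brings replication back after patching.  It prints nothing and runs nothing; it just returns the two lists of commands.

```python
def patch_commands(source: str, storage_groups: list, standby: str) -> tuple:
    """Build (before_patching, after_patching) command lists for one SCC/SCR pair."""
    before, after = [], []
    for sg in storage_groups:
        # Identity takes the servername\sg form shown above.
        target = f"-Identity {source}\\{sg} -StandbyMachine {standby}"
        before.append(f"Suspend-StorageGroupCopy {target}")
        after.append(f"Resume-StorageGroupCopy {target}")
    return before, after
```

For example, `patch_commands("exchscc02", ["sg01", "sg02"], "exchscr02")` gives you two suspend commands to run before the reboot and two resume commands for afterwards.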

Hope these notes help someone out there struggling with Exchange 2007.
