Wallpaper Outrage

Paul Thurrott posted a nice attaboy to the MSN folks today for releasing a wallpaper product that checks with Microsoft for updates to your operating system.

Get ready folks, I’m about to show my ass again.

Are you KIDDING ME?  Paul Thurrott has obviously never had to manage a network beyond his own house.  Microsoft commonly releases updates through Windows Update, and if you’re a Windows admin worth your salt, you know it’s wise to wait on many of these updates until you’re sure they’re not going to fry your systems.  Indeed, many enterprises flat out block Windows Update and only deploy patches once they’re ready to support any mishaps.
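
If you want to know where a given box stands, the Windows Update policy lives in a well-known spot in the registry.  Here’s a minimal PowerShell sketch (these are the standard policy value names; adapt to taste):

# Sketch: is this machine pointed at an internal WSUS server,
# or is it free to pull updates straight from Microsoft?
$wu = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate'
if (Test-Path $wu) {
    # WUServer holds the internal update server URL when the policy is applied
    Get-ItemProperty -Path $wu | Select-Object WUServer, WUStatusServer
    Get-ItemProperty -Path "$wu\AU" -ErrorAction SilentlyContinue |
        Select-Object UseWUServer, NoAutoUpdate
}
else {
    'No Windows Update policy found -- this box answers to Microsoft directly.'
}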

Anyone who thinks Microsoft cannot commit mishaps in an operating system update is just a fracking idiot.  Period.  Don’t even bother talking to me.

So now these MSN goons have released software that lets you bypass your enterprise security measures on Windows Update?  THANKS ASSHOLES.  If you’re running an enterprise network, please take a look at this software package that your users are RUSHING OUT THERE TO DOWNLOAD before they gank up your network.

This is basically WALLPAPER that can update your OS?  Great.  Another background app.  Another systray app.  Another useless waste of time and resources from a company that should be spending its time fixing Exchange 2007 instead of releasing useless garbage that grants enterprise users free license to bypass their IT department.

THANKS AGAIN, MICROSOFT!

You have so jumped the shark.

What I Want from Data Center Management Software

(note: the following is a stream of consciousness post regarding some software requirements as i dream them up.  if you are a developer and actually take up these requirements as the design for a software project, please let me know.  if you are aware of a software product that accomplishes all of this, please do not bother to let me know about it.  i don’t care.  fact is, nothing on the market today does this well enough to make me care about it the way i want to care about it.)

Let’s face it.  Documentation sucks.

I’ve traveled around this country and seen many an IT environment.  All of them have one thing in common: the documentation about the environment sucks.  It’s in such a sad state that should anything happen to anything, nothing would be recoverable.

We’re guilty of it in our own environment.  I’m not going to sit here and disparage everyone else’s IT environment without realizing that it’s a problem where I work too.  I’ve spent a lot of time wondering why this documentation is in such a sad state and come to a few conclusions.  I suspect these conclusions aren’t a surprise to anyone.

  • The staff is overworked.  They have no time to sit in your meetings, listen to the managers and customers rant and rave about how nothing works right (funny how that not-listening thing travels both ways), or get all of their assigned work done to begin with.
  • Documentation is boring.  There is nothing glamorous about writing a Word document about how you configured a paging file.
  • After writing the documentation, maintaining it is a real bear, especially in an age when the corporation that owns 90% of your data center farts a new patch daily.  What?  Tuesdays?  Oh man, that’s just for OS patches.  Try running some enterprise software sometime.  (NOTE TO SELF: Bitch about Exchange 2007 more, because that obviously hasn’t sunk in yet).
  • Too many fires to put out.  Remember that not-listening-is-a-two-way-street crack?  Yeah well, since management didn’t listen about your needs, you’re working 70 hours this week to fix all the crap that broke.  Oh yeah, don’t forget to document what you did to fix it.  (Now it’s 90 hours).

I could go on, but I think you get the idea.

So, now that I’ve listed reasons why you do not have the documentation, let’s talk about what happens when you do not sit near the data center and have questions about what’s what out there.

  • Need to find out what port a server is hooked up to?  Scan through your endless amounts of PSTs on the file share (haha!) to discover what port was assigned two years ago.  Fail.  Look for the document.  Oh!  Wait.  Fail that too.  No docs.  Ask someone who is sitting in the data center.  How the hell should they know?  They’re busy and don’t have time to help you.  Oh, by the way, that cable isn’t labeled anyway.  Look it up in the docs, dumbass.  Yeah, what docs?  Time to get in the car and drive over to look for yourself, cursing all the way that you have no documentation.
  • Need switch zoning information for that fabric?  See above.  At least you can login to the switch remotely… until Java fails.  Drive over.
  • Time to build a server.  Time to put it into production.  What do you mean it’s got a bug we fixed two years ago?  Oh, shit.  We forgot that NoServerTimeZoneComp registry key.  It’s always the Mac users that make your Windows admin lives hell, right?  No, buddy, it’s because you didn’t follow the documentation.  Uhh, what documentation?  (That check is scriptable, by the way – see the sketch after this list.)
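
(And yes, scriptable means it.  A rough sketch follows; the registry path and expected value are placeholders, because I’m not publishing our build doc here:)

# Sketch: verify build-standard registry keys before a server goes to production
# (the path and value below are made up -- substitute the real ones from your build doc)
$requiredKeys = @(
    @{ Path = 'HKLM:\SOFTWARE\Contoso\BuildStandard'; Name = 'NoServerTimeZoneComp'; Expected = 1 }
)
foreach ($key in $requiredKeys) {
    $actual = (Get-ItemProperty -Path $key.Path -ErrorAction SilentlyContinue).($key.Name)
    if ($actual -ne $key.Expected) {
        Write-Warning "$($key.Path)\$($key.Name) is '$actual', expected '$($key.Expected)'"
    }
}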

I think I’ve stated my case.  Now then.  I want software that can overcome the burden of writing this documentation and I want it available in damn near real time.  So, here goes.

I want data center management software that:

  • …is object-oriented like C++.  I want to be able to instantiate a new instance of a Dell 2950 and define its properties – like what rack it’ll be in and what U numbers it occupies.  (A rough sketch of what I mean follows this list.)
  • …can perform discovery on that new Dell 2950 and figure out the rest of the properties for the object (a la service tag number, CPU, RAM, maintenance left on contract, etc.)
  • …can allow me to connect its network interfaces to a specific switch port by dragging and dropping a line, like in Visio.
  • …can allow me to connect it to a storage area network like the network connection above.
  • …can produce a 3-dimensional rack drawing (the rack itself should be just another object, since we’re object-oriented and the server objects are just properties) that details every network connection, fibre hookup and power connection.
  • …can, upon sensing a failure from SCOM 2007 or NetIQ, label each server and cable that has failed so I can look for common properties in an anomaly (because it’s always the network’s fault).
  • …is able to produce a server installation document by right-clicking on it and selecting “current state documentation.”  I want it in PDF format so I don’t have to open fracking Microsoft Word ever, ever, ever again.  I want it to be able to spot every piece of software that is loaded on the server.  I want it to be able to tell me every patch and registry tweak that has been applied to that server since I racked it and installed the operating system.
  • …is able to alert me when servers are about to run out of maintenance.
  • …is visual enough that the customers can use a dashboard of sorts to view some of the same properties and elements that I need to see.
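
To make the first bullet concrete, here’s a rough sketch of the kind of object model I’m imagining, in PowerShell class syntax.  Every class and property name here is my own invention, not any real product’s API:

# Sketch only: a made-up object model for racks and servers
class Rack {
    [string]$Name
    [int]$TotalU = 42
}

class Server {
    [string]$Model
    [string]$ServiceTag
    [Rack]$Rack
    [int[]]$UNumbers
    [hashtable]$SwitchPorts = @{}

    Server([string]$model) { $this.Model = $model }

    # The Visio-style drag-and-drop would call something like this underneath
    [void] ConnectNic([string]$nic, [string]$port) {
        $this.SwitchPorts[$nic] = $port
    }
}

# Instantiate a new Dell 2950 and define its properties
$rack = [Rack]@{ Name = 'R12' }
$box = [Server]::new('Dell 2950')
$box.Rack = $rack
$box.UNumbers = 20..21
$box.ConnectNic('NIC1', 'sw-core-01/Gi1/0/24')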

I think you see the challenge here.

Now I ask you…

…why doesn’t this software exist?

Exchange 2007: SCR replication repair

Last week I had to do some serious debugging on storage copy replication.  We discovered that one of our SCC clusters had decided to quit replicating to the SCR node at the other site.  We’re not sure why (we think it’s because the SCR node was rebooted and replication was not cleanly suspended), but the ramifications of failed replication are interesting.

In the Exchange 2003 world, you had to depend on your backups running smoothly to purge log files from the log disks, or else eventually you’d find your databases dismounting in the middle of the day because you’re out of space.  Exchange 2007’s storage group replication adds a new complexity to that.  Now, not only do your backups have to succeed, your log file replication has to be working well too.  We discovered that log files were not being purged and voila… databases dismounted.  If your replication is broken for any reason, Exchange 2007 will not purge those log files.
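
(If you want a crude early warning for that disk-full scenario, watch free space on the log volumes.  A quick sketch; filtering it down to just the log drives is left to you:)

# Sketch: report free space on all local fixed disks
Get-WmiObject -Class Win32_LogicalDisk -Filter "DriveType=3" |
    Select-Object DeviceID,
        @{ Name = 'FreeGB'; Expression = { [math]::Round($_.FreeSpace / 1GB, 1) } }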

So, with that in mind, I thought I’d share some of the email that was sent around to the team that discusses how to troubleshoot the storage group replication processes just in case someone out there needs it.

(introduction cut)

Sometime last week, SCOM started complaining about the ReplayQueueLength being elevated on SCR02.  This meant that replication had, once again, halted for some reason.  I thought I’d share with you on how to debug/correct this should it happen again.

Open up the Exchange Management Shell ON THE PASSIVE NODE.  To check the replication status of a storage group, type:

Get-StorageGroupCopyStatus -server <servername> -standbymachine <SCRnode>

For instance:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

This will produce a list of the storage groups and the replication status.

First column lists the storage group name.

Second column informs you of the overall status.  Should be == HEALTHY.

Third column lists the CopyQueueLength.  This is how many log files must be copied from the source active node to the passive SCR node.  Should be a low number or zero.  Anything higher that is not decrementing means there is likely an issue developing.  SCOM is probably (hopefully) alarming about it if that’s the case.

Fourth column lists the ReplayQueueLength.  This is how many log files need to be played into the passive copy of the database at the SCR side.  Will normally be 50 or below.  Above 50 indicates there is some kind of problem at the passive SCR side.  DO NOT BE ALARMED by this number being 50.  Exchange is hard-coded to not play anything into the database until it accumulates 50 log files.  We cannot change this.  If we were to activate the SCR side of the node, it would play those 50 files in.

Fifth column lists the last time the replication service inspected the log files (should be FAIRLY recent, depending on database usage).

If you discover any of the message stores are in a “suspended” or “failed” state, you must debug the issue.

If the message store is in a “suspended” state, you may be able to restart the replication with little issue.  Try this (RUN THESE COMMANDS FROM THE SCR OR PASSIVE NODE ONLY!):

Resume-StorageGroupCopy -identity <servername\sgname> -standbymachine <SCRnode>

If the log files are intact and things are happy, replication will restart and you’ll be told that all is well.  If something goes awry at this point, the storage group will drop to a failed state.  You can run Get-StorageGroupCopyStatus to double-check where things are after trying a resume.
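
A quick way to surface only the problem children (property name from memory, so verify it against your build):

# Sketch: list only the storage group copies that are not reporting Healthy
Get-StorageGroupCopyStatus -Server exchscc02 -StandbyMachine exchscr02 |
    Where-Object { $_.SummaryCopyStatus -ne 'Healthy' }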

If you get a storage group in a FAILED state, things are a little more delicate.  Make sure there are no issues with the servers talking to each other.  CHECK EVENT LOGS, especially the APPLICATION log, for any errors (PLEASE always do this FIRST, for EVERY EVERY EVERY EVERY EVERY and I mean EVERY (did I say EVERY?) Exchange issue!)  Make sure the replication service is started on both nodes.  Make sure they can ping each other.  Make sure they can open a Windows Explorer window to each other’s c$ share.  Check all of that out before proceeding.
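
Those checks are scriptable too.  A rough sketch, with server names assumed from the examples above (some of these parameters may want a newer Powershell than what ships on your nodes, so adjust accordingly):

# Sketch: basic sanity checks against the standby node
$standby = 'exchscr02'
Test-Connection -ComputerName $standby -Count 2      # can we ping it?
Test-Path "\\$standby\c$"                            # can we reach the admin share?
Get-EventLog -LogName Application -ComputerName $standby -Newest 25 |
    Where-Object { $_.EntryType -eq 'Error' }        # recent Application log errors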

If you can find absolutely no reason why the servers cannot talk to each other and the SG’s should be replicating fine, you can try to reseed the databases.  This is a time-consuming operation and could consume lots of bandwidth.

Before reseeding a database, you must put the FAILED storage groups in the suspended state.  In this example, let’s assume exchscc02\SG02 went down to a FAILED state.  First, we suspend it:

Suspend-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02

Now do another Get-StorageGroupCopyStatus command to verify it is suspended:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

Verify that SG02 is now showing SUSPENDED.

Now the moment of truth.  BE CAREFUL to execute this ONLY on the passive node (usually the SCR node).  This command DELETES THE PASSIVE COPY of the database and log files and restarts replication!  There’s no going back once you’ve made this decision.  Choose carefully.

Update-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02 -deleteexistingfiles

After a confirmation and a pause, you should get a progress bar as the live copy of the edb file is copied over the wire to the passive copy and log files begin accumulating.

After this completes, be sure to run another Get-StorageGroupCopyStatus command to verify everything is healthy again.

There are no reboots or storage group/database offlines required for any of these commands.

(end email)

Upon review of the notes and the activities that led up to the failed replication states, it was determined that as a standard operating procedure, replication should be manually suspended on all SCC -> SCR nodes prior to patching and rebooting machines.  This means, of course, that replication has to be restarted after your patch party is over.

Doing this is pretty much the same as above:

Suspend-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

You could get fancy and do something like a pipe of Get-StorageGroupCopyStatus into this command, and it would probably fill in all of the identity stuff.  That’d be fine, but I prefer to do things the hard way, I guess.  I like to take it easy.
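
Something in this neighborhood, roughly (untested, so treat it as a sketch):

# Sketch: suspend every storage group copy on the cluster in one shot
Get-StorageGroup -Server exchscc02 |
    ForEach-Object { Suspend-StorageGroupCopy -Identity $_.Identity -StandbyMachine exchscr02 }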

Then when your patch party is over:

Resume-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

Hope these notes help someone out there struggling with Exchange 2007.

Where Powershell Fails

I’m all about negativity today. Sorry.

Anyway, I’ve had something nagging at me for a while now and I think I’ve just figured it out. Powershell is Microsoft’s answer to having a dumb command line through the Win95 – Win2003 years, and it’s quite powerful, as the name implies. Microsoft likes it so much that it put most of the Exchange 2007 administration effort into the Exchange Management Shell, a derivative of Powershell that contains Exchange-specific cmdlets.

I’ve long complained to our internal support personnel… and… well, probably my Microsoft contacts too… about how discombobulated Powershell actually is. It’s like it was designed with no standard in mind for the commands – each developer wrote their own cmdlet with their own switches and methods to do things the way they saw fit.

But it’s actually worse than that. Now I’ve come to realize that the problem with managing Exchange from the shell is not only the lack of standardization, but that a great deal of this SHOULDN’T be done in a shell command at all. I’ve heard that Powershell was designed to attract Linux admins who prefer the command line, and that’s fine. But I do not know of a single Linux admin who would type a command to set a disclaimer on an entire mail organization; he or she would edit a config file of some kind. That way, not only would the disclaimer setting be readily apparent and visible, it wouldn’t take some obscure command just to show me the meat of the option.

What tripped this realization was this “power tip” when I just went into the Exchange shell on one of our servers:

Tip of the day #58:

Do you want to add a disclaimer to all outbound e-mail messages? Type:

$Condition = Get-TransportRulePredicate FromScope
$Condition.Scope = "InOrganization"
$Condition2 = Get-TransportRulePredicate SentToScope
$Condition2.Scope = "NotInOrganization"
$Action = Get-TransportRuleAction ApplyDisclaimer
$Action.Text = "Sample disclaimer text"
New-TransportRule -Name "Sample disclaimer" -Condition @($Condition, $Condition2) -Action @($Action)

Why am I not looking in a config file for this information? Fail.
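
(And when you want to read that setting back later, it is, naturally, yet another cmdlet.  Which rather proves my point:)

# The "config file" equivalent: another command just to see what was set
Get-TransportRule -Identity "Sample disclaimer" | Format-List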

When RUS Strikes

One item you’ve probably learned by now if you’re an Exchange admin working on a 2007 deployment is that Microsoft has changed the behavior of the Recipient Update Service (RUS).  Most of you won’t care about this and that’s just fine.  You shouldn’t.  I would dare say that if your Exchange environment is engineered well and planned out the way Microsoft probably expects it to be, you should have almost no issues whatsoever.

Consider, however, if you’ve deployed Exchange with some type of “non-standard” approach.  Yes, please picture air quotes around that.  We’re trying to be politically correct here.  What if your Exchange deployment wasn’t, for instance, master of all mail within your TLD?

Let’s say you have a TLD of contoso.com.  Now let’s say you set up an Exchange service forest called services.contoso.com (see my earlier post about why an Exchange service forest is a Bad Idea).  Now let’s say that because there are many other businesses and entities within contoso.com that route their own mail, the decision is made that Exchange cannot be authoritative for all mail coming in to contoso.com.  You need to forward it up to some traffic directors at the top level to determine where the traffic goes.  Now you have Exchange installed in a service forest and you’re not authoritative for contoso.com.  So let’s say you decide to become authoritative for mail.contoso.com.

Now your recipient policy probably says that when new users are created, give them a services.contoso.com and a mail.contoso.com SMTP address.  What about the contoso.com address?  Well, since you’re handling that elsewhere, a third party process has to come in and manually assign that address.  Fine.

Now in 2003, once the user object is created and the addresses are stamped, RUS will never touch the object again or muck with it unless you forcibly tell it to do so.  Believe me though, it’s rare in this setup that you’ll be rerunning RUS manually.

When you begin to roll out Exchange 2007, you get a new issue.  If you’re configured in this manner and make any changes to the user object… say… moving a mailbox or anything of that nature… you’ll notice that RUS takes your user object and mangles it up according to what it thinks the SMTP addresses should be.  It’ll reset the primary address.  Fun.  Now your users start to complain that their mailing list memberships are failing, their business cards are incorrect, yadda yadda.  Yes, the behavior of RUS changed between 2003 and 2007.  Take note of it, because if you’re set up in a wonky way that prevents you from being authoritative in your domain, this is going to bite you once for every user you have.
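
If you’re stuck in this spot, one thing worth testing (no promises it fits your environment) is opting the affected mailboxes out of automatic address stamping so the policy keeps its hands off them:

# Possible escape hatch (test first): stop the e-mail address policy from
# restamping this mailbox's addresses when the object changes
Set-Mailbox -Identity someuser -EmailAddressPolicyEnabled $false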
