Metro on a server?

Here’s a gem.

The overall point of this blog post is:

  • Metro UI is not intuitive and it’s so bad that they have to blog about how to use it.

Bonus:

  • Why the hell do we have a touch interface on a server product? Are data centers planning to replace all of those KVMs with touch screens?

12 Routines of Windows Server 2012 Using a Non-Touch Device (1 of 3) – Yung Chou on Hybrid Cloud – Site Home – TechNet Blogs

Lion Finally Installed (…and Here’s Why It Was Failing)

After a tremendous ordeal of trying to install OS X Lion on my January 2008 Mac Pro, I finally had a breakthrough. I discovered what appears to be a hardware incompatibility.

To properly tell this story we’ll have to go back in time.

In January of 2008 when this model of Mac Pro was available (version 3.1) it was definitely the cat’s meow. I bought a true boss of a system too. I picked up an 8-core 2.8 GHz Mac Pro with 4 GB of RAM. Later, I bought some third party RAM from Crucial to stuff it to the brim with 32 GB.

I also visited Newegg.com to pick up three more hard drives. That was an easy decision. I picked up three more 750 GB hard drives, all of them Seagate. When I was a PC guy Seagates or Maxtors were the only drives to buy.

I also picked up a Drobo. If you’ve read this blog for a while, you know what happened with the Drobo. To replace the Drobo I eventually purchased a Promise DS4600 and 4 Seagate ST31000340AS 1 terabyte drives.

The Promise unit had issues. It was finicky and liked to drop drives out of the array for no discernible reason. Oddly enough, the drive that failed out of the array most often was whatever drive was in bay 4. It didn’t have to be the same drive. You could literally swap the drives in and out of bay 4 and eventually it would fail. It was really bizarre. I opened numerous cases with Promise.

Promise finally came back with the information that this particular model of Seagate drive was not certified to work with the DS4600. Okie, I can handle that. No problem. I reviewed the compatibility list provided by Promise and selected a set of Hitachi 2 TB drives.

I then moved on to fight the DS4600 again to make it work over eSATA. Only this weekend did that finally get resolved. But that’s another story for another day. If you want to hear it, let me know and I’ll be happy to tell it.

Anyway, back to the Seagate drives. After all of that drama and switching around, I now had a set of Seagate ST31000340AS 1 TB drives… four of them to be exact. During this time I was also experiencing regular S.M.A.R.T. failures in the 750 GB drives in the chassis of the Mac Pro. Those drives were being sent off and replaced on a fairly regular basis.

As these drives were replaced and rolled around, I decided that maybe I should move the 1 TB drives into the Mac Pro and get 250 GB of extra space per bay. That’s kind of a no-brainer decision, right? I ended up with the original 750 GB drive that shipped from Apple and three ST31000340AS drives in the other bays. I had the bright idea of creating a RAID-0 and installing Lion on it. I just knew it was going to scream.

…except that Lion wouldn’t install.

The install would always start off just fine. It would write files and then reboot. Then, somewhere in the next stage of the install, it would just die. An error message popped up claiming that there was a problem and Lion couldn’t be installed.

How can this be? thought I. There’s no way Apple would release an operating system that has an incompatibility of this nature with a 2008 Mac Pro. This is insane.

I lost countless hours of sleep to install attempts. I would try to install. I would watch it fail. I would research a little more. I spent weeks trying to get through this. Nothing… and I mean NOTHING would get through with the install…

…until one time, it did.

Of course, I immediately distrusted the whole installation. Why would it fail to install so many times and then, out of the blue, just work? It didn’t make sense. I had tried reseating hardware. I had tried pulling out the BlackMagic Intensity Card. I had tried pulling out the eSATA cards. I tried putting the stock RAM back in place. I tried everything. Nothing worked… until this time it did. Weird.

I ran with Lion on the RAID-0 for a few days and happily thought I would go about the installation of Carbon Copy Cloner so I could set up clone tasks for the operating system disk.

Cue music. It started to happen. Everything segfaulted. I could literally open the Console application and watch the crash reports roll in like a riot was going on in the Grid and reporters were on the air. No program was safe. Every one of them blew up. Sync your iPhone? Bam. iTunes died. Sync your iPod? SLAM. VTPCDecoder (or something) explodes. Yeah, this OS is suspect.

I decided to whack the RAID-0 and try the install again on a single ST31000340AS drive. Guess what? The install failed… again and again.

I booked a Genius Bar appointment. Obviously, my logic board was bad.

I’m not sure what made me think to try it, but I did. One of the 1 TB drives had died at some point and I received a replacement that was a completely different model. I also ran across information on the net that a lot of people were having problems with ST31000340AS drives and certain versions of firmware. Those versions were SD1A and SD15. I looked over the drives I had. All of them had one of those two firmware revisions.

Interesting.

I took one of the drives Seagate sent back that was a different model number. For the record, the drive was model ST31000528AS. I slapped it in the chassis and formatted it with HFS+. I fired up the Lion installer and hit go. I asked it to give me a full, fresh install of Lion on this disk. It worked the first time.

Not only did it work, it has been rock solid. Nothing is crashing like it was before on the other drives. Lion has become a joy to use the past two days. I stored away the Snow Leopard volume and kept it for emergencies.

I cancelled the Genius Bar appointment.

By now I think you can figure out what my conclusion is. There’s something wrong with ST31000340AS Seagate drives. Don’t try to use them with Lion. Something about the kernel in Lion disagrees strongly with that model of drive. If you read around on the net you will find many, many horror stories with those drives.

Beware.

Exchange Server 2007 SP3 RU4

Description of Update Rollup 4 for Exchange Server 2007 Service Pack 3

It looks like Exchange 2007 SP3 RU4 has a lot of goodies in it. At least 5 of the items in this list are impacting the environment at my day job.

While it’s good to see progress, I’m always wary of these updates because of the regression bugs they often introduce. Test and patch carefully, gang.

Exchange ActiveSync and Your Mobile Devices

It’s brutally important that you understand this article if you support Exchange 2007 or 2010.

Read it. Now.

http://support.microsoft.com/kb/2563324

Drobo Still Takes Forever to Rebuild?

I’m guessing from the number of hits on the Drobo article from 2009 that people are still having problems with Drobos rebuilding the array in a decent amount of time.

Ever since I got a DS4600 using standard RAID-5 I’ve been quite happy. Rebuild times on a 6 TB volume are about 2.5 hours. Note: the volume is only about one-third full, but it’s still way more data than what was on the Drobo in 2009.

Since that incident I strongly reconsider anything that implements something in a closed, proprietary fashion to replace a standard.

Just sayin’.

If you have one of the newer Drobo units and still have problems with the array rebuilding in an acceptable amount of time, let me know in the comments. I’d love to hear about it.

Could a Bug be Deliberately Coded into an Open Source Project for Financial Gain?

For some bizarre reason, the thought at the top of my head last night at bedtime was… “I wonder if sometimes… open source developers deliberately code bugs or withhold fixes for financial gain?”

If you don’t follow what I mean, here’s where I was: oftentimes, large corporations or benefactors will offer a code fix bounty or development funding for an open source project they have come to rely upon.  What if an open source developer were to deliberately code a bug into an open source project, or withhold a fix, so they might extract some financial support this way?

I brought it up in #morphix to Gandalfar, one of my trusted open source advisors.  We debated it briefly and he brought up several good points.  While this may happen, the scheme is likely to fall apart quickly.  The community is the resolver of situations like this.  If the community finds a bug and offers a fix for the problem, then the developer will find themselves in a political combat situation.  They would likely try to stifle the fix with some ridiculous excuses and/or start to censor discussion of the subject over mailing lists or on forums.  Speculation could be raised about the issue and ultimately, people could start to fork the project elsewhere, unless the license of the project disallows that.  In the long run, the community would resolve the situation by simply offering a new solution.

So while it could theoretically be achieved for short-term gain, in the long run the community makes the approach unsustainable.

Why do I bring this up?  Well, I think we all know that closed source entities often engage in this practice.  I could point out several examples where I have absolute knowledge of this happening, but I don’t think I have to.  I’m not completely absolving open source from this either – look at what “official distributions” do in some situations… Red Hat Enterprise Linux or Novell (SUSE) for example.  But in those situations, if you didn’t want to pay to upgrade the operating system and still wanted to resolve your situation, we all know that with the right application of effort and skill you could overcome it.

All in all, this whole thought process ends up with a positive note about open source.  If it’s broken, you can fix it yourself or work with others to make it happen.  The community – that incredibly large, global groupthink – keeps it all honest.

Or, you can put all your money and eggs into a closed source basket and find out you’re getting screwed when it’s too late.

It’s all about choice, right?


A brand new NO CARRIER

For those of you who follow my adventures here, but not necessarily my adventures over there, you should be aware that we’ve posted NO CARRIER Episode #11.  This episode is very special to my heart because it’s the first show we did in our new studio (Whitey is still over Skype though).  I think the audio quality is MUCH better.  Of course, we’ll be tweaking as things move on, but the new studio and the new processes we’re using to lay down the audio sound damn fine if I do say so myself.

Check it out and let us know what you think!


Count the messages on your Exchange 2007 environment

Are you curious about the hard stats of messages running around your organization?

Try this one in PowerShell on your Hub Transport server:

Get-MessageTrackingLog -Start "mm/dd/yyyy hh:mm:ss" -End "mm/dd/yyyy hh:mm:ss" -EventId "SEND" -ResultSize 9999999 | Measure-Object

This will pull stats for messages that were “sent”.  To pull the number of messages received, change the “eventid” parameter to “receive.”
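If you want every event type counted in one pass instead of running the command once per event, something like the sketch below should do it. Measure-TrackingEvent is a helper name made up for this example (it is not an Exchange cmdlet), the dates in the usage comment are placeholders, and the sketch assumes each tracking entry exposes an EventId property, which is what the -EventId filter above matches on.

```powershell
# Measure-TrackingEvent is a hypothetical helper, not part of Exchange.
# It groups whatever is piped in by EventId and returns one Name/Count row
# per event type (SEND, RECEIVE, and so on).
function Measure-TrackingEvent {
    # $input holds every object that was piped into the function.
    $input | Group-Object -Property EventId | Select-Object Name, Count
}

# Assumed usage on a Hub Transport server (placeholder dates):
# Get-MessageTrackingLog -Start "01/01/2010 00:00:00" -End "01/02/2010 00:00:00" -ResultSize Unlimited |
#     Measure-TrackingEvent
```

Leaving out -EventId pulls every event type, so on a busy server this will take noticeably longer than the single-event version.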


Exchange 2007: SCR replication repair

Last week I had to do some serious debugging on standby continuous replication (SCR).  We discovered that one of our SCC clusters had decided to quit replicating to the SCR node at the other site.  We’re not sure why (we think it’s because the SCR node was rebooted and replication was not cleanly suspended), but the ramifications of failed replication are interesting.

In the Exchange 2003 world, you had to depend on your backups running smoothly to purge log files from the log disks or else eventually, you’d find your databases dismounting in the middle of the day because you’re out of space.  Exchange 2007 and storage group replication have added a new complexity to that.  Now, not only do your backups have to succeed, your log file replication has to be working well too.  We discovered that log files were not being purged and voila… databases dismounted.  If your replication is broken for any reason, Exchange 2007 will not purge those log files.

So, with that in mind, I thought I’d share some of the email that was sent around to the team that discusses how to troubleshoot the storage group replication processes just in case someone out there needs it.

(introduction cut)

Sometime last week, SCOM started complaining about the ReplayQueueLength being elevated on SCR02.  This meant that replication had, once again, halted for some reason.  I thought I’d share with you how to debug/correct this should it happen again.

Open up Exchange Management Shell ON THE PASSIVE NODE.  To check the replication status of a storage group, type:

Get-StorageGroupCopyStatus -server <servername> -standbymachine <SCRnode>

For instance:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

This will produce a list of the storage groups and the replication status.

First column lists the storage group.

Second column informs you of the overall status.  Should be == HEALTHY.

Third column lists the CopyQueueLength.  This is how many log files still have to be copied from the source active node to the passive SCR node.  Should be a low number or zero.  Anything higher and not decrementing means there is likely an issue developing.  SCOM is probably (hopefully) alarming about it if that’s the case.

Fourth column lists the ReplayQueueLength.  This is how many log files need to be played into the passive copy of the database at the SCR side.  Should always be 50 or below.  Above 50 indicates there is some kind of problem at the passive SCR side.  DO NOT BE ALARMED by this number being 50.  Exchange is hard-coded to not play anything into the database until it gets 50 log files.  We cannot change this.  If we were to activate the SCR side of the node, it would play these 50 files in.

Fifth column lists the last time the replication service checked the log files out (should be FAIRLY recent, depends on database usage).

If you discover any of the message stores are in a state of “suspended” or “failed” you must debug the issue.
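The column rules above can be turned into a quick filter so you only see the storage groups that need attention. Treat this as a sketch: Select-UnhealthyCopy is a hypothetical helper name, and it assumes the status objects expose SummaryCopyStatus and ReplayQueueLength properties matching the column headings. The limit of 50 comes straight from the notes above.

```powershell
# Select-UnhealthyCopy is a hypothetical helper, not an Exchange cmdlet.
# It keeps only the rows that are not Healthy or whose replay queue is
# above the limit (50 by default, per the notes above).
function Select-UnhealthyCopy {
    param([int]$ReplayLimit = 50)
    $input | Where-Object {
        $_.SummaryCopyStatus -ne "Healthy" -or $_.ReplayQueueLength -gt $ReplayLimit
    }
}

# Assumed usage from the passive (SCR) node:
# Get-StorageGroupCopyStatus -Server exchscc02 -StandbyMachine exchscr02 | Select-UnhealthyCopy
```

A single snapshot cannot tell you whether CopyQueueLength is stuck or just busy, so this filter ignores it. Run the status command twice a few minutes apart and compare if you suspect that case.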

If the message store is in a “suspended” state, you may be able to restart the replication with little issue.  Try this (RUN THESE COMMANDS FROM THE SCR OR PASSIVE NODE ONLY!)

Resume-StorageGroupCopy -identity <servername\sgname> -standbymachine <SCRnode>

If the log files are intact and things are happy, replication will restart and you’ll be told that all is well.  If something goes awry at this point, the storage group will go down to a failed state.  You can run Get-StorageGroupCopyStatus to double check where things are after trying a resume.

If you get a storage group in a FAILED state, things are a little more delicate.  Make sure there are no issues with the servers talking to each other.  CHECK EVENT LOGS, especially APPLICATION log for any errors (PLEASE always do this FIRST, for EVERY EVERY EVERY EVERY EVERY and I mean EVERY (did I say EVERY?) Exchange issue!)  Make sure the replication service is started on both nodes.  Make sure they can ping each other.  Make sure they can open a windows explorer window to each other’s c$ share.  Check all of that out before proceeding.
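The basic checks above (ping, c$ share, replication service) can be bundled into one pass. This is a rough sketch and not a replacement for reading the event logs: Test-ScrPrerequisite is a hypothetical helper name, it assumes PowerShell 2.0 or later (Test-Connection and Get-Service -ComputerName are not in 1.0), and it assumes the replication service is registered under the name MSExchangeRepl.

```powershell
# Test-ScrPrerequisite is a hypothetical helper, not an Exchange cmdlet.
# Assumes PowerShell 2.0+ and a replication service named MSExchangeRepl.
function Test-ScrPrerequisite {
    param([string[]]$Node)
    foreach ($n in $Node) {
        # Ask the remote machine for the state of its replication service.
        $svc = Get-Service -Name MSExchangeRepl -ComputerName $n
        New-Object PSObject -Property @{
            Node        = $n
            Pingable    = [bool](Test-Connection -ComputerName $n -Count 1 -Quiet)
            AdminShare  = [bool](Test-Path ('\\{0}\c$' -f $n))
            ReplService = $svc.Status
        }
    }
}

# Assumed usage; run it once from each node so both directions get checked:
# Test-ScrPrerequisite -Node exchscc02, exchscr02
# Get-EventLog -LogName Application -EntryType Error -Newest 20
```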

If you can find absolutely no reason why the servers cannot talk to each other and the SG’s should be replicating fine, you can try to reseed the databases.  This is a time-consuming operation and could consume lots of bandwidth.

Before reseeding a database, you must put the FAILED storage groups in the suspended state.  In this example, let’s assume exchscc02\SG02 went down to a FAILED state.  First, we suspend it:

Suspend-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02

Now do another Get-StorageGroupCopyStatus command to verify it is suspended:

Get-StorageGroupCopyStatus -server exchscc02 -standbymachine exchscr02

Verify that SG02 is now showing SUSPENDED.

Now the moment of truth.  BE CAREFUL to execute this ONLY on the passive node (usually the SCR node).  This command DELETES THE PASSIVE COPY of the database and log files and restarts replication!  There’s no going back once you’ve made this decision.  Choose carefully.

Update-StorageGroupCopy -identity exchscc02\sg02 -standbymachine exchscr02 -DeleteExistingFiles

After a confirmation and a pause, you should get a progress bar as the live copy of the edb file is copied over the wire to the passive copy and log files begin accumulating.

After this completes, be sure to run another Get-StorageGroupCopyStatus command to verify everything is healthy again.

There are no reboots or storage group/database offlines required for any of these commands.

(end email)

Upon review of the notes and the activities that led up to the failed replication states, it was determined that as a standard operating procedure, replication should be manually suspended on all SCC –> SCR nodes prior to patching and rebooting machines.  This means, of course, that replication has to be restarted after your patch party is over.

Doing this is pretty much the same as above:

Suspend-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

You could get fancy and do something like a pipe of Get-StorageGroupCopyStatus to this command and it would probably fill in all of the identity stuff.  That’d be fine, but I prefer to do things the hard way I guess.  I like to take it easy.
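For the curious, here is one way the fancy version might look. It is a sketch, not a tested procedure: rather than relying on a straight pipe to bind the identity, it loops over the storage groups explicitly. Suspend-ScrCopy is a hypothetical helper name, and the sketch assumes Get-StorageGroup -Server returns objects whose Identity value is accepted by Suspend-StorageGroupCopy.

```powershell
# Suspend-ScrCopy is a hypothetical helper, not an Exchange cmdlet.
function Suspend-ScrCopy {
    param(
        [string]$Server,          # source (active) mailbox server
        [string]$StandbyMachine,  # SCR target
        [switch]$WhatIf           # preview only, change nothing
    )
    # Suspend replication for every storage group on the source server.
    Get-StorageGroup -Server $Server | ForEach-Object {
        Suspend-StorageGroupCopy -Identity $_.Identity -StandbyMachine $StandbyMachine -WhatIf:$WhatIf
    }
}

# Preview first, then run for real (the shell may ask for confirmation per group):
# Suspend-ScrCopy -Server exchscc02 -StandbyMachine exchscr02 -WhatIf
# Suspend-ScrCopy -Server exchscc02 -StandbyMachine exchscr02
```

Swapping Suspend-StorageGroupCopy for Resume-StorageGroupCopy in the loop should give the matching step for restarting replication afterwards.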

Then when your patch party is over:

Resume-StorageGroupCopy -identity <servername\sg> -standbymachine <scr_server>

Hope these notes help someone out there struggling with Exchange 2007.


Where Powershell Fails

I’m all about negativity today. Sorry.

Anyway, I’ve had something nagging at me for a while now and I think I’ve just figured it out. Powershell is Microsoft’s answer to having a dumb command line through the Win95 – Win2003 years and it’s quite powerful, as the name implies. Microsoft likes it so much that they put most of the Exchange 2007 administration work into the Exchange Management Shell, a derivative of Powershell that contains Exchange-specific cmdlets.

I’ve long bemoaned to our internal support personnel… and… well, probably my Microsoft contacts too… about how discombobulated Powershell actually is. It’s like it was designed with no standard in mind for the commands – each developer wrote their own cmdlet with their own switches and methods to do things the way they saw fit.

But it’s actually worse than that. Now I’ve come to realize that the problem with managing Exchange from the shell is not only the lack of standardization, but that a great deal of this SHOULDN’T be done in a shell command. I’ve heard that Powershell was designed to attract Linux admins who prefer the command line and that’s fine. But I do not know of a Linux admin who would type a command to set a disclaimer on the entire Exchange organization; he or she would edit a config file of some kind. That way, not only would the disclaimer setting be readily apparent and visible, but it wouldn’t take some obscure command to show me the meat of the option.

What triggered this realization was this “power tip” that came up when I went into the Exchange shell on one of our servers:

Tip of the day #58:

Do you want to add a disclaimer to all outbound e-mail messages? Type:

$Condition = Get-TransportRulePredicate FromScope
$Condition.Scope = "InOrganization"
$Condition2 = Get-TransportRulePredicate SentToScope
$Condition2.Scope = "NotInOrganization"
$Action = Get-TransportRuleAction ApplyDisclaimer
$Action.Text = "Sample disclaimer text"
New-TransportRule -Name "Sample disclaimer" -Condition @($Condition, $Condition2) -Action @($Action)

Why am I not looking in a config file for this information? Fail.
