Category: work

DUI and riding at night

Colorado Springs consistently ranks high in the Men's Health survey of the drunkest cities in the U.S. (the only such national survey I am aware of; Colorado Springs is 16th for 2010), so it follows that DUI is a big problem here.

This past weekend, a cyclist was struck and killed by a hit-and-run driver at 1:00 a.m. while coming home from work. Having commuted at night for a couple of years, I can say that vigilance is paramount in this arena. As cyclists, the first questions we ask when we read these reports are: Was drinking a factor, and was the cyclist reflective and well lit? All that is clear in this case is that drinking may have been involved.

VMware ESXi on SunFire x4140, x4240, x4440

A work posting again because everybody always asks what I do for a living.

I work in x64 engineering at Sun Microsystems. I ensure that vendor software from VMware works well on the various AMD and Intel x64 systems we ship. The process works something like this.

    * Product Team specifies hardware and software requirements.

    * Hardware Team designs the hardware.

    * Software Team (that's me) engages OS vendors (e.g. VMware, Red Hat, Novell, Microsoft and Sun) and says, "In six months we'll release the SuperConstellationMegaPlus using the unreleased 64-core Unobtanium HyperQuickConnect CPU with support for 256 sockets and maximum 32TB RAM, and we already know your OS breaks with that high CPU count and memory size; InterHub MCP100 bridge on each socket, so don't forget to fix your multiroot PCI Express support; SAS 2 with zoning and up to 1024 discrete solid state disks; and the usual peripheral support and high-speed InfiniBand, 10Gbps networking, etc., with hot-plug required on everything, including the processors and RAM."

    * Three months later, the OS vendors toss their pre-release builds to me.

    * I toss the builds to our Software QA.

    * Software QA finds bugs. It's my job to work with the OS vendor to find the cause of those bugs and to ask the vendor to fix them.

Here’s an example:

Early on, we discovered that VMware's ESXi 'thin' hypervisor would not install on the SunFire x4140, x4240 and x4440 servers. These machines, codenamed "Dorado Tucana" (DTa for short), are essentially identical and share the same motherboard.

Previously, VMware always booted a modified Red Hat distribution to install ESX. The ESXi install process differs from ESX "Classic" in that ESXi is its own installer: when you boot the ESXi install CD, you are booting ESXi.

I initially suspected a bug involving the HBA storage adapter, because the install program always locked up at "Loading aacraid…"; aacraid is the driver for the Adaptec storage controller we use in our test machines. Debugging by process of elimination, I removed the Adaptec controller.

So now the machine hangs somewhere else. Hmmm….

But now I’m able to see messages like “Keyboard controller buffer overflow….” And our nifty hardware debug tool shows me that the program is stuck in a very small loop that looks something like this:

    while (inb(0x64) & 0x01)   /* 8042 status port, bit 0: output buffer full */
        somefunction();        /* the driver then reads the data port at 0x60 */

I/O port 0x64 belongs to the old legacy 8042 keyboard controller, except that DTa does not have an 8042 or even a SuperIO chip! When I was reviewing the DTa hardware design way back in 2007, I even noted to our product team that this was our first platform without a legacy keyboard controller of any kind and that we might encounter some OS bugs.

Nearly all modern PCs emulate the old 8042 keyboard controller first used in the IBM PC/AT in 1984, because MS-DOS, the BIOS setup program, and the various option ROM setup programs all depend on the existence of a PC/AT keyboard, even though your PC probably no longer has a keyboard connector. The system BIOS can find your USB keyboard and make it pretend to be a legacy PS/2 keyboard for this old software.

When your modern OS (such as Windows, Linux or ESXi) boots, it pokes the BIOS and USB controller and tells them to stop pretending to be an 8042 and start acting like a real USB controller; this is called USB BIOS handoff. Almost every PC made, however, still has something that acts like an 8042 at I/O locations 0x64 and 0x60 somewhere on the motherboard, so when the pretending stops, real 8042 I/O is still there. That 8042 isn't connected to a keyboard, though, so when the OS reads the keyboard status register at 0x64, it always reports that the keyboard buffer is empty with a value of "0".

As I mentioned previously, DTa does not have an 8042 of any kind. As soon as the OS takes over the USB operations and tells the BIOS and the USB controller to stop pretending to be an 8042, there's no longer anything at I/O locations 0x64 and 0x60. When the CPU reads an I/O location that nothing responds to, it gets all ones: 0xff, or "-1" as a signed byte. This means every bit of what the keyboard driver thinks is a status register is set. The keyboard driver thinks the keyboard buffer is full, reads the keyboard data register at 0x60 (which also returns -1, or 0xff), and tests the keyboard status again, which is 0xff again. Rinse and repeat until done, except, of course, it never is done, because inb(0x64) always returns -1.

I proved this by dissecting the guts of ESXi and removing the OHCI and UHCI USB drivers, which are what trigger the handoff; without them, the BIOS and USB controller stay in legacy keyboard mode. When I removed those software bits, the problem went away. I reported this to VMware so they could make the necessary changes.

There are a couple of fixes to this problem. Linux counts the number of "-1" values it reads, and if it decides the number is unreasonable, it concludes there's no 8042. The engineers at VMware got a little more clever with their fix: they look at the ACPI DSDT (Differentiated System Description Table), a data structure in the BIOS that describes the system's hardware components. If an 8042 is not listed in this table, ESXi knows not to load the keyboard controller device driver.

For those waiting to install ESXi on the x4140, x4240, and x4440 (and many people have asked): the fixed version of ESXi is not yet released, though it should be available Real Soon Now, and we're already certifying it for those servers.

Bike to work challenge

To promote alternative transportation, environmental awareness and wellness, my employer's Eco Responsibility group and fitness center are offering incentives to those who can spend 2,000 minutes biking to work over the next 12 weeks. The human resources department sent an email promoting this program to all USA employees, and I've actually heard people in the hallway who normally don't bike to work talking about it.

2,000 minutes is a genuine challenge: that's about 33 hours, or roughly 167 minutes a week for 12 weeks, and it'll take effort even for me to get that many hours in. I know the people who put this program together, but unfortunately I don't think it will do anything to encourage newbies to try bike commuting.

Fun at work

I should mention bicycles and bicycling. I did none yesterday because I stayed home with a nasty head cold. I ached. I coughed. I sniffled. I had a fever. I stayed home and slept (when I wasn’t committing evil conspiracy against the bike industry).

This morning I dragged myself out of bed, grabbed the bike and came to work. I feel better. I relate to Mr. Elder’s bike commuter race ethic, so I passed two cyclists this morning (never mind the half dozen who passed me).

Now to work. Jonathan Schwartz is CEO of Sun Microsystems. He is my boss’s boss’s boss’s boss. He was captured on a hidden camera enjoying lunch with a friend at a local restaurant.

While I'm talking about my work: the Sun Constellation system won the Product of the Year award from Supercomputing Online, and UAE University has rolled out a student-designed 8-teraflop grid computer using Sun Microsystems blade systems. While I'm not directly involved in these specific efforts, I'm in the group in Menlo Park that designed the systems, and we're all pretty proud of how well they're doing.

Finally, somebody unleashed literally hundreds of pink and blue dolphins on our South Bay campuses this morning. They represent Sun's ownership of MySQL (whose logo is a dolphin), but I think they're also a play on the French Poisson d'Avril, or "April Fish," as the French call this day.