By Yokota Fritz
A work posting again because everybody always asks what I do for a living.
I work in x64 engineering at Sun Microsystems. I ensure that vendor software from VMware works well on the various AMD and Intel x64 systems we ship. The process works something like this.
* Product Team specifies hardware and software requirements.
* Hardware Team designs the hardware.
* Software Team (that's me) engages OS vendors (e.g. VMware, Red Hat, Novell, Microsoft and Sun) and says, "In six months we'll release the SuperConstellationMegaPlus using the unreleased 64 core Unobtanium HyperQuickConnect CPU with support for 256 sockets and maximum 32TB RAM and we already know your OS breaks with that high CPU count and memory size; InterHub MCP100 bridge on each socket so don't forget to fix your multiroot PCI Express support; SAS 2 with zoning and up to 1024 discrete solid state disks; and the usual peripheral support and high speed InfiniBand, 10Gbps networking, etc. with hot plug required on everything, including the processors and RAM."
* 3 months later, OS vendors toss their pre-release builds to me.
* I toss the builds to our Software QA.
* Software QA finds bugs. It's my job to work with the OS vendors to find the cause of those bugs and get them fixed.
Here's an example:
Early on, we discovered that VMware's ESXi 'thin' hypervisor would not install on the SunFire x4140, x4240 and x4440 servers. These machines, codenamed "Dorado Tucana" (DTa for short), are essentially identical and share the same motherboard.
Previously, VMware always booted a modified Red Hat distribution to install ESX. The ESXi install process differs from ESX "Classic" in that it uses itself as the installer. When you boot the ESXi install CD, you are booting ESXi.
I initially thought there was a bug in the HBA storage adapter because the install program always locked up at "Loading aacraid...", which is the software to control the Adaptec storage controller we use in our test machines. Debug by process of elimination: I removed the Adaptec controller.
So now the machine hangs somewhere else. Hmmm....
But now I'm able to see messages like "Keyboard controller buffer overflow...." And our nifty hardware debug tool shows me that the program is stuck in a very small loop that looks something like this:
while (inb(0x64) & 0x01) { somefunction(); }
I/O port 0x64 is the old legacy 8042 keyboard controller, except DTa does not have an 8042 or even a SuperIO chip! When I was reviewing the DTa hardware design way back in 2007, I even made a notation to our product team that this was our first platform without a legacy keyboard controller of any kind and we may encounter some OS bugs.
All modern PCs emulate the old 8042 keyboard controller first used in the IBM PC AT in 1984, because MS-DOS, the BIOS setup program, and the various option ROM setup programs all depend on the existence of a PC/AT keyboard even though your PC no longer even has a keyboard connector. The system BIOS can find your USB keyboard and make it pretend that it's an old legacy PS/2 keyboard for this old legacy software.
When your modern OS (such as Windows, Linux or ESXi) boots, it pokes the BIOS and USB controller and tells them to stop pretending to be an 8042 and start acting like a real USB controller -- this is called USB BIOS handoff. Almost every PC made, however, still has something that acts like the 8042 at I/O locations 0x64 and 0x60 somewhere on the motherboard -- when the pretending stops, the real 8042 I/O is still there. When the OS reads the keyboard status register at 0x64, though, the 8042 isn't connected to a keyboard, so it always reports the keyboard buffer is empty with a value of "0".
As I mentioned previously, DTa does not have an 8042 of any kind. As soon as the OS takes over the USB operations and tells BIOS and the USB controller to stop pretending to be an 8042, there's no longer anything at I/O locations 0x64 and 0x60. And when the CPU reads an invalid I/O location, the returned value is always "-1." This means every bit of what the keyboard driver thinks is a status register is set. The keyboard driver thinks the keyboard buffer is full, reads the keyboard data register at 0x60 (which also returns -1 or 0xff), and tests the keyboard status again, which will be 0xff again. Rinse and repeat until done, except, of course, it never is done because inb(0x64) always returns -1.
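To make the failure concrete, here is a minimal C sketch of that loop. It is not ESXi's actual driver code. `sim_inb` is a hypothetical stand-in for a real port read, and the `max_spins` cap exists only so the demonstration terminates; the loop I saw in the debugger had no such cap.

```c
#include <stdbool.h>
#include <stdint.h>

#define KBD_DATA_PORT   0x60
#define KBD_STATUS_PORT 0x64
#define KBD_STAT_OBF    0x01  /* bit 0: a byte is waiting in the data register */

/* Hypothetical stand-in for inb(): simulates what the CPU reads back. */
static uint8_t sim_inb(bool has_8042, uint16_t port)
{
    (void)port;
    if (!has_8042)
        return 0xFF;  /* nothing decodes the port, so every bit reads as 1 */
    return 0x00;      /* idle 8042 with no keyboard attached: buffer empty */
}

/*
 * Drain the keyboard buffer the way the stuck driver did.
 * Returns the number of bytes drained, or -1 if the loop was still
 * spinning after max_spins iterations (demo-only safety cap).
 */
static int drain_keyboard(bool has_8042, int max_spins)
{
    int spins = 0;

    while (sim_inb(has_8042, KBD_STATUS_PORT) & KBD_STAT_OBF) {
        (void)sim_inb(has_8042, KBD_DATA_PORT);  /* "read" the pending byte */
        if (++spins >= max_spins)
            return -1;
    }
    return spins;
}
```

With an 8042 present, `drain_keyboard(true, 1000)` returns 0 right away. Without one, `drain_keyboard(false, 1000)` hits the cap and returns -1, which is the demo's version of hanging forever.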
I proved this by dissecting the guts of ESXi and removing the OHCI and UHCI USB drivers. Those drivers are what trigger the handoff, so removing them keeps the BIOS and USB controller in legacy keyboard mode. When I removed those software bits, the problem went away. I reported this to VMware so they could make the necessary changes.
There are a couple of fixes to this problem. Linux counts the number of "-1" values it reads and if it decides the number is unreasonable, it decides there's no 8042. The engineers at VMware got a little more clever for the fix and they look at the ACPI DSDT -- the Differentiated System Description Table. This is a data structure in BIOS that lists the component hardware. If an 8042 is not listed in this table, ESXi knows not to load the keyboard controller device driver.
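The counting approach can be sketched roughly like this in C. This illustrates the idea only and is not the actual Linux source. The `MAX_FLOATING_READS` threshold, the `port_reader` type and the two simulated buses are all assumptions made up for the example.

```c
#include <stdint.h>

#define KBD_DATA_PORT      0x60
#define KBD_STATUS_PORT    0x64
#define KBD_STAT_OBF       0x01  /* bit 0: a byte is waiting in the data register */
#define MAX_FLOATING_READS 16    /* hypothetical "this is unreasonable" threshold */

/* Hypothetical abstraction over inb() so the probe can be run on simulated buses. */
typedef uint8_t (*port_reader)(uint16_t port);

/* Simulated platform with no 8042: every port read returns all ones. */
static uint8_t floating_bus(uint16_t port) { (void)port; return 0xFF; }

/* Simulated platform with an idle 8042: buffer always reports empty. */
static uint8_t idle_8042(uint16_t port) { (void)port; return 0x00; }

/*
 * Returns 1 if something that behaves like an 8042 answered,
 * 0 if the reads look like an unpopulated I/O location.
 */
static int controller_present(port_reader read_port)
{
    int all_ones = 0;

    for (;;) {
        uint8_t status = read_port(KBD_STATUS_PORT);
        if (!(status & KBD_STAT_OBF))
            return 1;  /* buffer drained, so a real status register answered */

        uint8_t data = read_port(KBD_DATA_PORT);
        if (status == 0xFF && data == 0xFF) {
            if (++all_ones >= MAX_FLOATING_READS)
                return 0;  /* too many consecutive -1 reads: assume no 8042 */
        } else {
            all_ones = 0;  /* a plausible byte arrived, restart the count */
        }
    }
}
```

Here `controller_present(idle_8042)` returns 1 and `controller_present(floating_bus)` returns 0. A production driver would also want an overall timeout for hardware that misbehaves in other ways; the sketch leaves that out.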
For those waiting to install ESXi on the x4140, x4240, and x4440 (and many people have asked): This fixed version of ESXi is not yet released, though it should be available Real Soon Now and we're already certifying that new version of ESXi for those servers.
By Yokota Fritz
My employer's Eco Responsibility group and fitness center is offering incentives to those who can spend 2,000 minutes biking to work over the next 12 weeks to promote alternative transportation, environmental awareness and wellness. The human resources department sent an email to all USA employees promoting this program, and I've actually heard people who normally don't bike to work in the hallway talking about it.
2,000 minutes is a genuine challenge -- it'll take effort even for me to get that many hours in. I know the people who put this program together, but unfortunately I don't think it will do anything to encourage newbies to try bike commuting.
By Yokota Fritz
I should mention bicycles and bicycling. I did none yesterday because I stayed home with a nasty head cold. I ached. I coughed. I sniffled. I had a fever. I stayed home and slept (when I wasn't committing evil conspiracy against the bike industry).
This morning I dragged myself out of bed, grabbed the bike and came to work. I feel better. I relate to Mr. Elder's bike commuter race ethic, so I passed two cyclists this morning (never mind the half dozen who passed me).
Now to work. Jonathan Schwartz is CEO of Sun Microsystems. He is my boss's boss's boss's boss. He was captured on a hidden camera enjoying lunch with a friend at a local restaurant.
While I'm talking about my work: the Sun Constellation system won the Product of the Year award from Supercomputing Online; and UAE University has rolled out a student-designed 8 teraflop grid computer using Sun Microsystems blade systems. While I'm not directly involved in these specific efforts, I'm in the group in Menlo Park that designed the systems and we're all pretty proud of how well they're doing.
Finally, somebody unleashed literally hundreds of pink and blue dolphins on our South Bay campuses this morning. They represent Sun's ownership of MySQL (which uses the dolphin) but I think it's also a play on the French Poisson d'Avril or "April Fish" as they call this day.
By Yokota Fritz
There are apparently unconfirmed reports of a 'beer bust' last week at my work location. When you think 'beer bust,' what do you usually think?
Imagine, hypothetically, about 2,000 (primarily male) socially inept engineers all gathered in one place. Beer, peanuts and pretzels are available.
If you're picturing a junior high school dance, you're pretty close to this hypothetical reality.
VMware ESXi on SunFire x4140 x4240 x4440
Uh, all I hear is "white noise". Sounds pretty involved, though, even though I didn't understand a lick of that!
All I'm hearing is, "I'm Richard, I work for Sun Microsystems but for some reason I still insist on forwarding my domain to blogger rather than getting a virtual host and installing Wordpress."
There's no forwarding going on at all, Kit.
When is the new ESXi coming out? Also the latest ESX 3.5 has a very difficult time booting also on the X4140. Only successfully installs in text mode. I have 3 Sun X4140 that need to be installed and shipping by March 20th, so can you send me your patch or work around so I can have these servers installed.
Emerich:
(1) I'm not allowed to say when VMware's next updates will be released.
(2) ESXi installable is not (yet) certified for x4140, which means neither VMware nor Sun supports this configuration right now.
(3) ESX 3.5 "Classic" is certified and supported. Have you contacted Sun or VMware support? What kind of problems do you see? What option cards, memory, CPUs and storage?
Praises to you for debugging! We were waiting our asses off for ESXi to run on the X4240.
Yokota, when is Sun going to compile HERD for the ESX service console? I have two X4440, my colleague has two, and yet another has two.
Also, is there a fix for: a) IPMI checksum errors, b) IPMI claiming that the power supplies are asserting/deasserting "ok" all the time, or c) cimserver (pegasus) soaking up CPU0?
You might also let the tech writers know that slots 2,4,5 work just fine and don't disable the onboard NICs 2&3, if ACPI is on. And ACPI is on by default.
Regards, Nathan dot hudson hyphen crim at milliman dot com
We were supplied an X4140 from a "solutions provider" as an ESXi platform. After first trying to install ESXi U3 unsuccessfully I came across this post. As it seemed the issue would soon be fixed we held off on returning the machine. ESXi U4 is now out but still no X4x40 support, and I'm curious to know what happened. Obviously our X4140 is on its way back to the supplier, a shame as it is a nice machine.
Any word on X4140 certification for ESX 4? I was about to purchase some 4140's for VMware, and then ESX 4 came out...and is not on the HCL. :(
Casey, certification for ESX 4 on x4140, x4240 and x4440 is in process. I did initial ESX & ESXi 4 bringup and don't expect problems (including on the new Istanbul processor), but I'm still supposed to *ahem* manage expectations and not pre-announce any release dates. Sorry. You can install ESX and ESXi 4.0 on the x4x40 family and it will work, especially if you stick to vanilla configurations, but you won't get any support on it from either Sun or VMware.
Several Sun x64 systems were on the HCL at vSphere launch, but we didn't have the manpower to get everything certified. Thanks for your interest.
THANK YOU!!!!
I've tried to install VMware ESXi on our x4140 and now I know why it fails. Sun support told me that the problem was that quad proc was not certified. Now I know what the problem was and I feel better.
Gabriele.
Since people are still asking: ESX & ESXi 3.5 Update 4 are certified on the x4140, 4240, and 4440 for all released CPU configurations except Istanbul. Istanbul support will come with the next software release that's due Real Soon Now.
GREAT!!! Somebody is actually saying the WHY behind an error, what will be done, and what has been done! VERY instructional!!!!
I have successfully installed ESXi 4 on the Sun x4240 without any errors and it works perfectly fine. The issue may have been an old one and may have been resolved by now.
@Anon 7:51 - Yeah, we fixed it a while back.
Bike to work challenge
Bit of a silly way to promote cycle commuting - by creating or reinforcing a perception that it takes time.
2000 minutes? That's 33 hours - and nobody has a spare 33 hours to "waste" riding a bike, regardless of the fact that they're probably already wasting most (or all) of this time sitting in a car.
This might be a little steep, but it's actually not that unreasonable. 12 weeks is 60 "work" days, so it comes out to 33 minutes per day (round trip) if you were commuting every work day. Which is only going to be, say, 6-7 miles (round-trip) for someone just getting started. Unless you're in a pretty urban center, most people likely live further away than that from work.
If you realistically go on the fact that people are more likely to do 2 or 3 days a week, it comes out to a 33 minute trip each way, at which point that 6-7 miles is each way. Still not much for a spread out area, but starting to be more plausible.
Maybe it's geared more toward gearheads?
"Incentive." Ha! Those that meet the challenge are eligible for a raffle that might get a jersey. Talk about incentive!
...fritz...i would guess the main point would be that your company is at least making a consideration...the next step is perhaps helping them set realistic goals for maybe two groups...the regulars like yourself that already bike commute & a program to encourage new &/or occasional commuters...
...a good idea, not particularly well conceived...
I'm with Jeremy.
This seems trivial to me. My commute is about 55 minutes each way by bike, 35-40 minutes average by car. So, 19 days over 12 weeks. That's less than twice a week.
Anybody who is commuting from the "suburbs" of Boston to either Boston or Cambridge probably can't be shorter than a 20 minute one way trip. And the shorter the commute, the more likely that driving takes a substantial fraction of that time that a bike commute would take.
Jennifer, it seems exactly like it's geared to gearheads.
Since the weather has gotten nice, I've been that gearhead.
It is a little bit intimidating to the newbie - but then, newbies do things like sign up for MS 150 rides all the time. They oughter have a 100-minute get-you-hooked mini-incentive (something free like getting your name somewhere)... and a 1000-minute lure... and then the addiction will happen.
For some people it's nothing, for me it would mean adding riding on top of having to do every day possible. My commute is at most 25 min a day, so hitting the 2000 min would take more than 12 weeks or additional time.
Not that I'm whining, but... The bike part of my commute is typically about 25 minutes, so I'll need to get creative and stretch things. I also work from home once a week, plus there's a week of vacation during that time period.
On the internal bicycling discussion list at work, a couple of people joked about doing laps around the parking lot. The parking lot circumference is right at one mile.
Fun at work
Intro music was by Plump DJs. One of my faves.
Engineers' beer bust
Actually, I was thinking a bunch of buxom ladies who enhanced their mangos with beer.
Balding, middle-aged male engineers, on Prozac and Levitra, trying to live out their wild and crazy juvenile delinquent days they NEVER had when they were actually juveniles NEVER crossed my mind when 'beer' and 'bust' are mentioned in the same breath.
Happy Super Bowl Sunday, people! Suppose you post pix of this said 'beer bust' for the rest of us to make snide comments.
What photos, Paul? This was hypothetical.
I love hypotheticals like that...
Like all the hypothetical drinking I could have done in Taiwan with all of my alleged vendor partners. It didn't happen of course and I certainly did get the nickname "Sake King" while at dinner having sushi. That never happened.
My bad. I started drinking again.