Tracking Down Snow Leopard’s Apple Events Bug
No major system upgrade is without bugs, but it’s surprising to find breakage in deep mechanisms that are crucial to the workings of the system as a whole and have operated perfectly well for many years. Surely, one thinks, Apple wouldn’t mess with such a mechanism, on the grounds that (a) it’s crucial and (b) it isn’t broken so it’s better not to fix it. Yet such is apparently the case with an Apple event bug that I isolated and reported to Apple on October 6th.
What are Apple events, and why should you care about them? For a complete answer (with some really great diagrams), read the online book chapter I wrote some time ago on this topic. Briefly, Apple events are a messaging mechanism that allows one running process to communicate with another. Apple events are the underlying basis of application scriptability on Mac OS X; they are what allows AppleScript to work (for those who have forgotten, I wrote a book about AppleScript). Every AppleScript command directed at an application – asking the Finder for a folder’s name, or asking iTunes for the tracks in
an album – is an Apple event. But Apple events are far more fundamental than that; even something as simple as opening a document by dropping its icon onto an application’s icon in the Finder relies on an Apple event.
To be honest, although I’m proud of my detective work on this bug, I didn’t have to think very hard, and my role in the story was rather passive. Here’s what happened. On October 2nd, someone on the rb-appscript mailing list (which discusses rb-appscript, a bridge between the Ruby scripting language and AppleScript scriptability) posted a note complaining that his script, which drives iTunes rather intensively, worked perfectly well on Leopard but occasionally came to a grinding halt on Snow Leopard with an “Apple event timed out” error. Then, on October 4th, I noticed that someone on the AppleScript-Users
mailing list was complaining that his script, which drives Microsoft Entourage, worked perfectly well on Leopard but occasionally came to a grinding halt on Snow Leopard with an “Apple event timed out” error.
At first I had been inclined to propose a theory that maybe some external force, such as a bad scripting addition, might be messing things up somehow. This was a reasonable guess, because the Snow Leopard transition to 64-bit has in fact caused existing 32-bit scripting additions to become troublesome.
But the rb-appscript user (Nick Forge, who deserves a great deal of credit in this tale) was able to reproduce the problem even when all third-party scripting additions were removed from his system; moreover, he provided an extremely short, simple script that just looped so as to do one simple thing over and over, which he said would elicit the problem. So I tried his script on my machine. At first I couldn’t reproduce the problem, but it turned out that this was just because I wasn’t letting the script run long enough. When I allowed the script to run for 20 or 30 seconds, I got the same error message he did.
Now my eyes were opened. We knew that the problem was not due to any particular scriptable application, because it didn’t matter whether you targeted Entourage or iTunes. We knew that AppleScript was not the problem, because rb-appscript isn’t AppleScript: they both use the underlying Apple events mechanism, but in different ways. And we knew that a perfectly good Apple event, if sent repeatedly enough times, would eventually error out. But what did “eventually” mean? To find out, I wrote this script:
set i to 0
try
tell application "Finder"
repeat -- forever
set i to i + 1
count Finder windows
-- this is the Apple event
end repeat
end tell
on error errMsg number errNum
display dialog i
error errMsg number errNum
end try
In theory, this script should just run forever, repeatedly sending a single innocuous Apple event (one that asks the Finder to report how many windows it has open). Indeed, if you run it in Script Editor on Leopard, it will run forever, and you’ll have to force quit Script Editor (so don’t run it on Leopard; I’ll present a “safer” version of the script in a moment).
But in AppleScript Editor (the renamed Script Editor) on Snow Leopard, after about 60 seconds, the script freezes for an additional minute – because the “Apple event timed out” error is occurring – and then reports the error. The clever part is that the script also puts up a dialog reporting how many times we sent our Apple event before the error occurred.
And here is the Really Interesting Part. In most cases, the reported number is somewhere around 65000. In fact, in some cases, it is extremely close to 65535. And that fact is highly suspicious, because 65535 is one less than 2 to the 16th power – the size of a “short integer” (a 16-bit value) in computer science. It’s as if, behind the scenes, something other than the script is counting our Apple events, and stumbling at the point where the count resets to zero.
At that point, I wrote a “safer” version of the script and sent it to Apple. Here it is:
set i to 0
try
tell application "Finder"
repeat -- forever
with timeout of 5 seconds
set i to i + 1
count Finder windows
if i > 70000 then
display dialog
"No problem!"
return
end if
end timeout
end repeat
end tell
on error errMsg number errNum
display dialog i
error errMsg number errNum
end try
That version is “safer” because if by chance it gets past the 65535-event limit, it will come to a stop in good order all by itself. So you can try it on Leopard and on Snow Leopard and see the difference for yourself. Another feature of this version is the “with timeout” line, which shortens the frozen moment when our Apple event errors out from 60 seconds (the default) to 5 seconds.
After I had sent the script to Apple, I posted it on the AppleScript-Users list, and a particularly knowledgeable user, Hamish Sanderson (who is, not coincidentally, the author of rb-appscript) wrote back and said: “I bet I know what’s counting in the background: it’s the increment of the return ID.”
That was the final piece of the puzzle. Here’s what Hamish was saying. When a process sends an Apple event that expects a reply, there’s no telling when the reply will come back. So there needs to be a way to associate the reply, when it does come, with the original Apple event. So when you send the Apple event, you specify a return ID – a “short integer” that identifies the Apple event. The reply is given the same return ID, so it can be matched up with the original Apple event.
However, it is more common not to assign your own return ID. Instead, you give a special return ID value called kAutoGenerateReturnID. This means that the system itself should dynamically assign the Apple event a return ID. And how does the system do this? By adding 1 to the last return ID that it assigned. Thus, eventually, after enough Apple events have been sent, a moment will come where the next return ID in the sequence is hexadecimal FFFF, also known as decimal 65535. In a short signed integer, that value is the same as -1. And -1 is kAutoGenerateReturnID.
This fact, evidently, is confusing Snow Leopard in a way that no previous system was ever confused since 1991 (when Apple events were invented). When Snow Leopard assigns FFFF (-1) as the Apple event’s return ID, it takes this as an invitation to increase the return ID again. So the Apple event goes out with a return ID -1, but the reply comes back with the next return ID in the sequence, which is 0. The two return IDs don’t match! Thus, the reply can’t be associated with the original Apple event. So the sender thinks that no reply has ever come back – and, after waiting for a while, gives up and generates the “Apple event timed out” error.
(I have also written a Cocoa application which repeatedly sends an Apple event and keeps track of the return IDs of both the sent Apple event and the reply, and proves that this account of the bug is correct.)
The bug sounds minor, but it is really very important because Apple events are crucial to so much of what goes on under the hood in Mac OS X, and in any case it has caused everyone’s scripts to break (whether written in AppleScript, rb-appscript, or anything else that sends Apple events). The underlying Apple event manager assigns a new return ID to every Apple event, and so sooner or later some Apple event is going to hit the magic FFFF value, and whatever sent that Apple event is going to error out, apparently randomly. You may even have seen such random errors on your machine without knowing it. Anyhow, it’s an easy bug to understand, and there are already indications that Apple will roll a fix into the Mac OS X 10.6.2 update,
whenever that comes out. What’s hard to understand is how Apple came to inflict breakage on such a fundamental mechanism in the first place.
Ouch. This one is quite painful. Seems strange that Apple wouldn't have picked this up... Seems like Apple should have automated stress-testing tools for such an essential subsystem.
I'll bet that this is because someone noticed some odd looking crufty old code while fixing something else and 'fixed' the obvious error without investigating why it was that way.
good find guys; nice writeup
I'd guess Liam has it exactly right. Anyone who's done serious software development can imagine how fixing or rewriting one piece of code could trigger this bug in a possibly-far-off other section of code. And it's easy to see what tests need to be written with 20-20 hindsight. Besides, anyone who thinks Apple management views AppleEvents as an "essential subsystem" is barking mad.
When you double-click a file in the Finder and an application opens it, that's an Apple event. You wouldn't call that an essential subsystem? I sure as heck would.
Sounds like a classic off by one bug to me.
On a probably unrelated bug I've noticed that some scripts of mine that do a repeated series of key down events (ie tell application "System Events" to keystroke "s" ) into Safari will drop characters, whereas they didn't under Leopard.
Thank you for this article, Matt. I always read you with great interest. Out of curiosity, I tried your 'safer' script under Mac OS X 10.4.11 (copied and pasted in Script Editor, clicked Run -- I'm an ignorant AppleScript tourist, I admit) and after a good while it gave me an "AppleEvent timed out" error. Twice. I can try again of course, I just wanted to ask you if it's a normal behaviour under Tiger. According to what you've written, it shouldn't be.
But I may very well be missing something.
Kind regards,
Rick
In addition to timeout errors, have you noticed the execution of scripts taking longer as a result of this bug? How much extra time would be added to Finder queries with a bug like this? especially if you are recursively asking for information on every item in a folder?
Great article. This would have made a great Kon and Bal!
"Nasty."
"Yeah."
sounds like a 32-bit to 64-bit error. On 32-bit FFFF is -1 but on 64-bit it isn't. Guessing they use a 64-bit number in one place and forgot to use it in another.
The return ID is (I assume) an unsigned short int, which should be 2 bytes across 32 and 64 bit machines
Is this why the Aperture to InDesign scripts by Apple aren't working properly?
As a Programmer and System Designer for over 40 years, I can say that this is a bad error and has a simple solution. Since -1 is a special value, all that is needed to prevent the confusion of it with 65535 is to put a sanity check in the routine that adds 1 to the prior value. Once that value reaches (lets say) 65500 return 0 (or 1) NOT 65501. This insures that 65535 does not get returned to be confused with the special -1 value.
Hi Matt,
Thanks for your help in both diagnosing this issue and submitting a comprehensive bug report so that Apple can (hopefully) roll out a fix as soon as possible. I've written a blog post explaining why it was that I was even coming close to the 65535 return ID limit in the first place:
http://forgecode.net/2009/10/scripting-itunes-with-ruby-and-snow-leopards-apple-events-bug/
Cheers,
Nick
Re: "it's surprising to find breakage [..] in mechanisms that [..] have operated perfectly well for many years."
in many posts like this I read that it's surprising to find bugs in a "zero features" update and so on.
But I think bugs and surprises are much more likely to occur when you rewrite existing things under the hood than when you just add a hand full of new features to existing mess^Wcode.
And this one? Maybe as simple as someone (or even a changed compiler default) changing "unsigned short" to "short"
Thanks for a clear discussion of a complex subject. Would this account for the many weird, random misbehavior of several programs I use, notably PS CS3 and keyboard commands being ignored in many programs at times?
I apparently ran into this same problem while writing an AppleScript to transfer a bunch of data from some old Expression Media files to FileMaker. It would run fine for a while, then get an event time-out. I eventually got around it by trapping the error and retrying, and eventually got everything transfered.
I am now having problems getting Aperture to back up to a vault. It runs the comparison scan for a while, then says there is an error in the vault, even with a brand new vault. Since I have >65K items to copy, I wonder if this is due to the event time-out bug also? (I'm running up-to-date Snow Leopard on a Mac Pro.)
I just tried this in 10.6.2, and it appears as if the bug still exists. Has anyone else tried?
Yes, the bug still shows itself in 10.6.2