Hardcore Debugging - Heavy Weapons for Hard Bugs
Tracking Retain/Release
The other day, while refactoring, our tests suddenly crashed with the message every experienced developer fears:
*** -[PSPDFDocument release]: message sent to deallocated instance 0x617000007780
In the default background queue. Without any further context nor any information in the stack trace. The state on other threads was not relevant and changed on every run. This is pretty much one of the worst bugs.
Now people who worked on the Mac platform long enough to remember manual retain release will surely know these kind of issues. However, ARC normally does a really great job at making sure over-releases are no longer an issue. Still, here I was, with this impossible crash, at midnight, desparate to find a solution.
This only happened when Zombies are enabled, so our CI was still green, and it also only happened on our UI KIF tests. This caused quite a lot of frustation. Instruments is pretty good at tracking retain/release, however this only works for regular targets, not for test targets.
And I need to step back a little bit here, because there are actually quite a few things we do to make these kind of issues very rare:
- We give our background queues explicit names. I consider it bad practice to throw tasks in the global queue without creating a named serial subqueue. GCD queues are cheap, and it’s incredibly helpful to name all our background operations. We wrote a wrapper class around GCD in Objective-C that makes this really easy:
// release resources as we no longer need them - but don't block the calling thread for this!
weakify(self);
[[PSPDFDispatchQueue.backgroundQueue namedQueue:@"cache.deallocation"] async:^{
strongify_or_return(self);
@synchronized(self) {
self->_document = nil;
self->_imageRenderingCompletionBlock = nil;
}
}];
I’m sorry for all the Objective-C here. Our framework is heavily built around C++ and Objective-C, as we cannot use Swift yet, as it’s not yet binary compatible. (Blog Post) Notice how this creates a one-time use queue to deallocate things in the background - this makes background tasks much simpler to follow, and debugging much easier.
In addition to that, when our code is compiled in DEBUG
mode, we track all created queues weakly, so we can break at any point and call [PSPDFDispatchQueue _listAllQueues]
to get a list of all active queues.
static NSHashTable<PSPDFDispatchQueue *> *_activeQeues;
static dispatch_block_t PSPDFDebugTrackedBlock(PSPDFDispatchQueue *queue, dispatch_block_t block) {
PSPDFCAssertBlock(block);
PSPDFCAssertClass(queue, PSPDFDispatchQueue);
PSPDFCAssertClass(_activeQeues, NSHashTable);
return ^{
@synchronized(_activeQeues) {
[_activeQeues addObject:queue];
}
block();
@synchronized(_activeQeues) {
[_activeQeues removeObject:queue];
}
};
}
// Wrap working block:
#if GCD_DEBUG_ENABLED
workItem = PSPDFDebugTrackedBlock(self, workItem);
#endif
+ (void)_listAllQueues {
let activeQueues = _activeQeues.allObjects.copy;
NSLog(@"Active Queues: \n");
foreach (queue, activeQueues) {
NSLog(@"%@\n", queue);
}
}
In the case of this bug, it didn’t help a bit. My next step was using malloc stack logging. You set MallocStackLoggingNoCompact=YES
as environment variable, and lldb should be able to print object generation.
script import lldb.macosx.heap
malloc_info --stack-history
However, this didn’t work at all, and it’s unclear to me if that still works at all or just failed in the particular case.
Next, I went and enabled ASAN, UbSAN and the Main Thread Checker. Zombie Objects was already enabled. This creates a lengthy recompile, but didn’t help a bit in my case.
The root cause here is that some code has to call autorelease
on the PSPDFDocument
. So let’s set a breakpoint. BUT - that doesn’t work anymore with ARC, since ARC doesn’t use message passing anymore for performance reasons, instead the compiler emits objc_retain
, objc_release
and objc_autorelease
, which is quite a bit harder to break on, and debugging becomes impossibly slow if you try to break on every objc_autorelease
, even using conditional rules is to slow. ARC does work work classes that implement custom retain/release logic, so there’s a runtime check to test if a document implements such a method - so there’s still hope. So, let’s implement autorelease
in our class, which forces ARC to go the slow path:
The naive way of just overriding NSObject’s autorelease does not work with ARC either. Now, I can’t casually convert a 2000 line class to MRR just for debugging (chances are, I’d introduce 10 new bugs while even trying to convert this, if I don’t go insane before that). So that doesn’t work either. The good thing is that this isn’t the first time we needed something like that, and we have a class which helps with that. We used it in a different context, but it’ll do:
/**
Helps to track down problems related to "The Deallocation Problem".
http://developer.apple.com/library/ios/#technotes/tn2109/_index.html%23//apple_ref/doc/uid/DTS40010274-CH1-SUBSECTION11
*/
@interface PSPDFRetainTracker : NSObject
@end
@implementation PSPDFRetainTracker
- (BOOL)shouldPrintRetainRelease {
return YES;
// return !NSThread.isMainThread;
}
- (const char *)queueName {
PSPDF_DEPRECATED_NOWARN(return dispatch_queue_get_label(dispatch_get_current_queue());)
}
- (id)retain {
if ([self shouldPrintRetainRelease]) {
printf("RETAIN %p %s (queue:%s) -> %s", (void *)self, NSThread.currentThread.name.UTF8String, [self queueName], [NSThread callStackSymbols].description.UTF8String);
}
return [super retain];
}
- (id)autorelease {
if ([self shouldPrintRetainRelease]) {
printf("AUTORELEASE %p %s (queue:%s) -> %s", (void *)self, NSThread.currentThread.name.UTF8String, [self queueName], [NSThread callStackSymbols].description.UTF8String);
}
return [super autorelease];
}
- (oneway void)release {
if ([self shouldPrintRetainRelease]) {
printf("RELEASE %p %s (queue:%s) -> %s", (void *)self, NSThread.currentThread.name.UTF8String, [self queueName], [NSThread callStackSymbols].description.UTF8String);
}
[super release];
}
@end
Now, PSPDFDocument
is basically our core model, and it’s used everywhere. I tried to just change the superclass from NSObject
to PSPDFRetainTracker
, however that caused header module errors. The helper class lives in a file called PSPDFDebugHelper.h
which isn’t part of the framework, and it’d be painful to restructure the headers to include that. Copying the code over also doesn’t work, since PSPDFDebugHelper.m
needs to be compiled with ARC disabled.
Objective-C Runtime to the rescue. Let’s just swap classes at runtime!
+ (void)initialize {
let thisClass = PSPDFDocument.class;
if (self == thisClass) {
return;
}
class_setSuperclass(PSPDFDocument.class, (Class)NSClassFromString(@"PSPDFRetainTracker"));
}
I’ll simply get the instance from runtime here so I don’t need to fiddle with further imports, and we use PSPDFDocument
’s initialize
to take care of that early on.
Now, if you’ve never ever heard about class_setSuperclass
, there’s a good reason for that. DO NOT USE IT IN PRODUCTION. Just check out the header file. Apple is pretty blunt about that and writes “warning You should not use this function.” and “not recommended”. But alright, we’re debugging here, and we need the heavy weaponry!
/**
* Sets the superclass of a given class.
*
* @param cls The class whose superclass you want to set.
* @param newSuper The new superclass for cls.
*
* @return The old superclass for cls.
*
* @warning You should not use this function.
*/
OBJC_EXPORT Class _Nonnull
class_setSuperclass(Class _Nonnull cls, Class _Nonnull newSuper)
__OSX_DEPRECATED(10.5, 10.5, "not recommended")
__IOS_DEPRECATED(2.0, 2.0, "not recommended")
__TVOS_DEPRECATED(9.0, 9.0, "not recommended")
__WATCHOS_DEPRECATED(1.0, 1.0, "not recommended")
__BRIDGEOS_DEPRECATED(2.0, 2.0, "not recommended");
To my surprise, this actually works! Running the test gets really really slow, as it spits out a ton of stack traces for every single retain/release and autorelease. And compiler optimizations are not enabled here since we want to debug, which makes ARC really really dumb - there’s no optimization at all, and it emits a ton of calls. However, eventually, things crash again. Now copy out that huge log output, use your text editor of choice (Ideally not Atom or VS Code but one that can deal with A LOT of text, such as Sublime). And we just search for AUTORELEASE 0x617000007780 (queue:com.apple.root.background-qos)
because that’s what we want to know. And here we are:
)AUTORELEASE 0x617000007780 (queue:com.apple.root.background-qos) -> (
0 PSPDFKit 0x000000012ed5818b -[PSPDFRetainTracker autorelease] + 747
1 PSPDFKit 0x0000000129fbce4c _ZN5PSPDF2asI13PSPDFDocumentEEPT_P11objc_object + 652
2 PSPDFKit 0x0000000129fb83e3 _ZN5PSPDF11as_force_nnI13PSPDFDocumentEEPT_P8NSObject + 2563
3 PSPDFKit 0x0000000129fb8b02 -[PSPDFCacheDocumentHashManager willCloseDocument:] + 514
4 CoreFoundation 0x000000010feb2b8c __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ + 12
5 CoreFoundation 0x000000010feb2a65 _CFXRegistrationPost + 453
6 CoreFoundation 0x000000010feb27a1 ___CFXNotificationPost_block_invoke + 225
7 CoreFoundation 0x000000010fe74422 -[_CFXNotificationRegistrar find:object:observer:enumerator:] + 1826
8 CoreFoundation 0x000000010fe735a1 _CFXNotificationPost + 609
9 Foundation 0x000000010eef2e57 -[NSNotificationCenter postNotificationName:object:userInfo:] + 66
10 PSPDFKit 0x000000012a073e05 -[PSPDFDocument destroyDocumentProviders] + 917
11 PSPDFKit 0x000000012a0533d7 -[PSPDFDocument dealloc] + 375
12 Foundation 0x000000010ef574e8 NSKVODeallocate + 158
13 libobjc.A.dylib 0x000000010f525a6e _ZN11objc_object17sidetable_releaseEb + 202
14 PSPDFKit 0x000000012ed587dd -[PSPDFRetainTracker release] + 1117
15 PSPDFKit 0x000000012a2a37f6 -[PSPDFRenderRequest .cxx_destruct] + 182
16 libobjc.A.dylib 0x000000010f50f920 _ZL27object_cxxDestructFromClassP11objc_objectP10objc_class + 127
17 libobjc.A.dylib 0x000000010f51b502 objc_destructInstance + 124
18 CoreFoundation 0x000000010ff98936 -[NSObject(NSObject) __dealloc_zombie] + 150
19 libobjc.A.dylib 0x000000010f525a6e _ZN11objc_object17sidetable_releaseEb + 202
20 PSPDFKit 0x000000012a02a208 -[PSPDFRenderTask .cxx_destruct] + 152
21 libobjc.A.dylib 0x000000010f50f920 _ZL27object_cxxDestructFromClassP11objc_objectP10objc_class + 127
22 libobjc.A.dylib 0x000000010f51b502 objc_destructInstance + 124
23 CoreFoundation 0x000000010ff98936 -[NSObject(NSObject) __dealloc_zombie] + 150
24 PSPDFKit 0x000000012a0204e9 -[PSPDFRenderTask dealloc] + 409
25 libobjc.A.dylib 0x000000010f525a6e _ZN11objc_object17sidetable_releaseEb + 202
26 CoreFoundation 0x000000010fe938bd -[__NSSetM dealloc] + 157
27 libobjc.A.dylib 0x000000010f525a6e _ZN11objc_object17sidetable_releaseEb + 202
28 PSPDFKit 0x000000012a1161fe -[PSPDFRenderJob .cxx_destruct] + 126
29 libobjc.A.dylib 0x000000010f50f920 _ZL27object_cxxDestructFromClassP11objc_objectP10objc_class + 127
30 libobjc.A.dylib 0x000000010f51b502 objc_destructInstance + 124
31 CoreFoundation 0x000000010ff98936 -[NSObject(NSObject) __dealloc_zombie] + 150
32 PSPDFKit 0x000000012a10ce39 -[PSPDFRenderJob dealloc] + 409
33 libobjc.A.dylib 0x000000010f525a6e _ZN11objc_object17sidetable_releaseEb + 202
34 PSPDFKit 0x000000012a115fcd __destroy_helper_block_.127 + 29
35 libsystem_blocks.dylib 0x000000011414d98a _Block_release + 111
36 PSPDFKit 0x000000012ec8334b __destroy_helper_block_.216 + 59
37 libsystem_blocks.dylib 0x000000011414d98a _Block_release + 111
38 libobjc.A.dylib 0x000000010f5261b2 _ZN12_GLOBAL__N_119AutoreleasePoolPage3popEPv + 860
39 libdispatch.dylib 0x0000000114066762 _dispatch_last_resort_autorelease_pool_pop + 27
40 libdispatch.dylib 0x0000000114070fc1 _dispatch_root_queue_drain + 1186
41 libdispatch.dylib 0x0000000114070ac1 _dispatch_worker_thread3 + 119
42 libsystem_pthread.dylib 0x00000001145881ca _pthread_wqthread + 1387
43 libsystem_pthread.dylib 0x0000000114587c4d start_wqthread + 13
)
During deallocation of one of our render tasks, the document is deallocated. Inside dealloc we clean up some things and also update the hash manager, which listens to the close event and runs following code:
- (void)willCloseDocument:(NSNotification *)aNotification {
let document = PSPDF::as_force_nn<PSPDFDocument>(aNotification.object);
if (document.hasUnsavedChanges || document.hasDirtyAnnotations) { return; }
[self updateHashesForDocument:document];
}
This used to work, because in earlier versions of it, we just did a manual cast of the notification object to the document class. It can only be a document in the object. Years of writing code made me a grumpy old suspicious developer however, so we add checks even if we are sure what we get. Because crashing early is good. In that case we use a cast-helper in C++, which basically is a check that the object is not null and that it matches the class we say we want (PSPDFDocument
!) via an isKindOfClass:
check. This helper is quite optimized and cheap, but in debug, ARC emits autorelease
in one of the inlined helpers. Calling autorelease
within dealloc
is not a good idea, and will give us exactly that crash.
Let’s look at the generated assembly to understand where this autorelease is really coming from:
So we see that our cast is converted to a __ZN5PSPDF11as_force_nnI13PSPDFDocumentEEPT_P8NSObject
and ARC is using _objc_retainAutoreleasedReturnValue
to retain the value here. The naming here is C++ mangled, but still readable.
The fix was to add an explicit __attribute__((always_inline))
to the inline helper, which forces inlining and thus keeps ARC from doing dumb things.
/// as? Casts nullable object to type T. Return nil if wrong type.
template <class T> inline __attribute__((always_inline)) T *_Nullable as(id _Nullable o) noexcept {
if ([o isKindOfClass:[T class]]) {
return (T * _Nullable)o;
}
return nil;
}
And that’s the fix. One word.