The Descriptor Protocol, and Python Black Magic
April 26, 2016
Late last night, I saw a very confusing tweet:
    class A:
        def b(self):
            pass

    if A.b is A.b:
        print("Python 3")
    else:
        print("Python 2")
— Jake VanderPlas (@jakevdp) April 26, 2016
Like any self-respecting programmer, the first thing I did was copy and paste that into a terminal, even though I knew exactly what to expect. My response was simple:
Since I graduated last summer, I have been writing lots of both Python 2 and 3. This snippet seemed like something I should understand well. However, I did not, so this post is an attempt to fix that. I was inspired by Julia Evans and her campaign to share the things she learns, however incomplete her understanding might be.
This post assumes you have at least a basic understanding of Python and OOP. For a good overview of OOP in Python, I recommend Leonardo Giordani’s series, which builds up nicely from simple concepts to the internals of Python classes (he also has one on Python 2.x, although I haven’t read it closely).
So, what is this black magic?
My first instinct was to check the behavior of the comparison itself. While == delegates to an object’s __eq__ method to check equality, the is operator checks identity, so in Python 2 the two objects returned by A.b can’t be the same object in memory!
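To make the equality/identity distinction concrete, here is a minimal sketch using plain lists. The variable names are made up for illustration and have nothing to do with the tweet.

```python
# Two distinct list objects with equal contents.
first = [1, 2, 3]
second = [1, 2, 3]
alias = first  # a second name for the very same object

equal_but_distinct = (first == second, first is second)
same_object = (first == alias, first is alias)

print(equal_but_distinct)  # (True, False)
print(same_object)         # (True, True)
```

Equality asks whether two objects have the same value; identity asks whether two names refer to one object.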
    # Python 2
    >>> A.b
    <unbound method A.b>
    >>> hex(id(A.b))
    '0x1006ebc80'
    >>> hex(id(A.b))
    '0x1006beb90'

    # Python 3
    >>> A.b
    <function A.b at 0x101b75158>
    >>> hex(id(A.b))
    '0x101b75158'
    >>> hex(id(A.b))
    '0x101b75158'
As expected! The memory locations (as given by id) in Python 2 are different, causing the identity check to fail. Not so in Python 3. So far so good. But why do we get an unbound method on one end and a function on the other? How are these objects even stored internally? In most cases, Python uses a dictionary, accessible under __dict__, to store the attributes, or namespace, of an object (note that not all objects have a __dict__, but that is a different story). Let’s look up b in the __dict__ of A:
    # Python 2
    >>> A.__dict__['b']
    <function b at 0x1007a8398>
    >>> type(A.__dict__['b'])
    <type 'function'>
    >>> type(A.b)
    <type 'instancemethod'>

    # Python 3
    >>> A.__dict__['b']
    <function A.b at 0x101b75158>
    >>> type(A.__dict__['b'])
    <class 'function'>
    >>> type(A.b)
    <class 'function'>
Accessed as A.b, Python 2 gives us an instancemethod, while Python 3 spits out a function, but if we check the type inside the enclosing __dict__, we see they are both functions. How does this work? This is caused by the design of the Descriptor Protocol, which lets an object stored in a class decide, through its __get__ method, what an attribute access on that class or its instances returns. Functions implement this protocol. In Python 2, the protocol sets in place a type distinction based on how the function object is accessed. In the docs, Raymond Hettinger explains:
    # Python 2
    >>> class D(object):
    ...     def f(self, x):
    ...         return x
    ...
    >>> d = D()
    >>> D.__dict__['f']  # Stored internally as a function
    <function f at 0x00C45070>
    >>> D.f              # Get from a class becomes an unbound method
    <unbound method D.f>
    >>> d.f              # Get from an instance becomes a bound method
    <bound method D.f of <__main__.D object at 0x00B18C90>>
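The protocol itself is easier to see with a hand-written descriptor. The following is only a sketch: LoggingDescriptor and Host are hypothetical names invented for this illustration, not anything from the Python docs. It records the arguments that attribute lookup passes to __get__.

```python
class LoggingDescriptor:
    """Hypothetical descriptor that records how it was reached."""

    def __init__(self):
        self.calls = []

    def __get__(self, instance, owner=None):
        # instance is None when the attribute is looked up on the class.
        self.calls.append((instance, owner))
        return "via class" if instance is None else "via instance"


class Host:
    attr = LoggingDescriptor()


host = Host()
from_class = Host.attr       # triggers __get__(None, Host)
from_instance = host.attr    # triggers __get__(host, Host)
raw = Host.__dict__["attr"]  # reading __dict__ bypasses the protocol

print(from_class)          # via class
print(from_instance)       # via instance
print(type(raw).__name__)  # LoggingDescriptor
```

Functions play the role of LoggingDescriptor here: what you get back from A.b or a.b is whatever the function's own __get__ decides to return.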
In Python 3, this distinction between bound and unbound methods doesn’t exist (there are only plain functions and bound methods), but strangely, the docs for Python 3 are not up to date, so I can’t tell what the underlying behavior is. The same code clearly has a different output:
    # Python 3
    >>> class D(object):
    ...     def f(self, x):
    ...         return x
    ...
    >>> d = D()
    >>> D.__dict__['f']  # Stored internally as a function
    <function D.f at 0x1014021e0>
    >>> D.f              # Get from a class becomes an unbound method... NOT!
    <function D.f at 0x1014021e0>
    >>> d.f              # Get from an instance becomes a bound method
    <bound method D.f of <__main__.D object at 0x10123cf28>>
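One way to poke at the Python 3 behavior without relying on the docs is to call the function's __get__ by hand. This sketch assumes Python 3 and reuses the class D from the session above.

```python
import types


class D:
    def f(self, x):
        return x


d = D()
func = D.__dict__["f"]

# Class access: __get__ receives None as the instance
# and hands back the function itself.
via_class = func.__get__(None, D)

# Instance access: __get__ builds a bound method around
# the function and the instance.
via_instance = func.__get__(d, D)

print(via_class is func)                           # True
print(isinstance(via_instance, types.MethodType))  # True
print(via_instance(42))                            # 42
```

Since class access returns the stored function untouched, D.f and D.__dict__['f'] are one and the same object in Python 3.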
Also explained in the documentation is the fact that both bound and unbound methods are backed by the same C implementation, differing only in the value of their im_self attribute, which is NULL when unbound (it shows up as None from Python code). So I am guessing that in Python 2, every access creates a new instancemethod object wrapping the function at runtime, regardless of whether it is bound or unbound, while in Python 3 the instantiation only happens when bound, given that unbound methods don’t exist. This would make sense, as the function’s __get__ must be executed each time you access it through the class or an instance.
If that were the case, we would expect that accessing b on an instance of A would always return a different object, regardless of which Python runtime we’re on, as those methods are always bound:
    # Python 2&3
    >>> a = A()
    >>> a.b is a.b
    False
    >>> hex(id(a.b))
    '0x1003bf988'
    >>> hex(id(a.b))
    '0x1003f1448'
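A caveat about comparing id values of temporaries: once an object is freed, its address can be reused, so two short-lived objects may report the same id. A more reliable sketch keeps both method objects alive under names (m1 and m2 are made up here) and then inspects what they wrap.

```python
class A:
    def b(self):
        pass


a = A()
m1 = a.b  # keep both method objects alive at the same time
m2 = a.b

print(m1 is m2)                    # False: two separate method objects
print(m1 == m2)                    # True: same function bound to the same instance
print(m1.__func__ is m2.__func__)  # True: both wrap one underlying function
print(m1.__self__ is a)            # True: both are bound to a
```

So each access builds a new, thin wrapper, while the function underneath never changes.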
So, the reason why A.b is A.b holds in Python 3 but not in Python 2 is this whole bound/unbound story. Seems like the Descriptor Protocol is responsible for this sorcery! Magic is just technology we don’t understand, yet.
If you have more insight into the inner workings of this, I’d love to hear about it.
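As a rough illustration of the conclusion, the Python 2 outcome can be imitated on Python 3 with a descriptor that wraps the function on every access. This is only a sketch: Py2StyleMethod is a hypothetical name, and functools.partial stands in for Python 2's instancemethod type, which it does not faithfully reproduce.

```python
import functools


class Py2StyleMethod:
    """Hypothetical descriptor imitating Python 2: wrap on every access."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner=None):
        if instance is None:
            # Python 2 built a fresh "unbound method" here;
            # a Python 3 function simply returns itself.
            return functools.partial(self.func)
        return functools.partial(self.func, instance)


class A:
    @Py2StyleMethod
    def b(self):
        return "called"


print(A.b is A.b)  # False, as in Python 2
print(A().b())     # called
```

Every lookup of A.b runs __get__ and gets a brand-new wrapper, so the identity check fails even though the wrapped function is the same each time.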
Update (4/26/16): Jake VanderPlas replied to my tweet, and pointed to a 2009 post by Guido describing the behavior. Apparently, the bound/unbound distinction was introduced as a way to achieve “first-class everything,” which methods didn’t quite fit into. Python 3’s undoing of unbound methods is just a further expression of the idea.
Update 2 (4/29/16): Today I received an email from Todd Jennings, who pointed me to the bug that tracks the out-of-date documentation for Python 3. Sadly, it is marked as still waiting.
Image: “The Witch No. 1” by Baker, Joseph E. – Licensed under Public Domain, via Wikimedia Commons