Wednesday, October 18, 2017

(Cape of) Good Hope for PyPy

Hello from the other side of the world (for most of you)!

With the excuse of coming to PyCon ZA during the last two weeks Armin, Ronan, Antonio and sometimes Maciek had a very nice and productive sprint in Cape Town, as pictures show :). We would like to say a big thank you to, which sponsored part of the travel costs via its awesome Sourcelift program to help Open Source projects.

Armin, Anto and Ronan at Cape Point

Armin, Ronan and Anto spent most of the time hacking at cpyext, our CPython C-API compatibility layer: during the last years, the focus was to make it working and compatible with CPython, in order to run existing libraries such as numpy and pandas. However, we never paid too much attention to performance, so the net result is that with the latest released version of PyPy, C extensions generally work but their speed ranges from "slow" to "horribly slow".

For example, these very simple microbenchmarks measure the speed of calling (empty) C functions, i.e. the time you spend to "cross the border" between RPython and C. (Note: this includes the time spent doing the loop in regular Python code.) These are the results on CPython, on PyPy 5.8, and on our newest in-progress version:

$ python     # CPython
noargs      : 0.41 secs
onearg(None): 0.44 secs
onearg(i)   : 0.44 secs
varargs     : 0.58 secs

$ pypy-5.8   # PyPy 5.8
noargs      : 1.01 secs
onearg(None): 1.31 secs
onearg(i)   : 2.57 secs
varargs     : 2.79 secs

$ pypy       # cpyext-refactor-methodobject branch
noargs      : 0.17 secs
onearg(None): 0.21 secs
onearg(i)   : 0.22 secs
varargs     : 0.47 secs

So yes: before the sprint, we were ~2-6x slower than CPython. Now, we are
faster than it!
To reach this result, we did various improvements, such as:

  1. teach the JIT how to look (a bit) inside the cpyext module;
  2. write specialized code for calling METH_NOARGS, METH_O and METH_VARARGS functions; previously, we always used a very general and slow logic;
  3. implement freelists to allocate the cpyext versions of int and tuple objects, as CPython does;
  4. the cpyext-avoid-roundtrip branch: crossing the RPython/C border is slowish, but the real problem was (and still is for many cases) we often cross it many times for no good reason. So, depending on the actual API call, you might end up in the C land, which calls back into the RPython land, which goes to C, etc. etc. (ad libitum).
The branch tries to fix such nonsense: so far, we fixed only some cases, which are enough to speed up the benchmarks shown above. But most importantly, we now have a clear path and an actual plan to improve cpyext more and more. Ideally, we would like to reach a point in which cpyext-intensive programs run at worst at the same speed of CPython.

The other big topic of the sprint was Armin and Maciej doing a lot of work on the unicode-utf8 branch: the goal of the branch is to always use UTF-8 as the internal representation of unicode strings. The advantages are various:
  • decoding a UTF-8 stream is super fast, as you just need to check that the stream is valid;
  • encoding to UTF-8 is almost a no-op;
  • UTF-8 is always more compact representation than the currently used UCS-4. It's also almost always more compact than CPython 3.5 latin1/UCS2/UCS4 combo;
  • smaller representation means everything becomes quite a bit faster due to lower cache pressure.
Before you ask: yes, this branch contains special logic to ensure that random access of single unicode chars is still O(1), as it is on both CPython and the current PyPy.
We also plan to improve the speed of decoding even more by using modern processor features, like SSE and AVX. Preliminary results show that decoding can be done 100x faster than the current setup.

In summary, this was a long and profitable sprint, in which we achieved lots of interesting results. However, what we liked even more was the privilege of doing commits from awesome places such as the top of Table Mountain:

The panorama we looked at instead of staring at cpyext code


Nickolas said...

It was awesome meeting you all, and I'm so stoked about the recent PyPy improvements :-D

Anonymous said...

Fantastic news. Many Python users need to use some of these many specialized CPython-based extension modules for which there is no CFFI alternative extensively and as a result have not benefited much, or not at all, from PyPy's speed advantages. These improvements could make PyPy the default Python for many of us.

Anonymous said...

Could you give a hint to how you're doing O(1) individual character access in UTF-8 strings? Not that I'd find such a requirement particularly necessary (might be handy for all-ASCII strings, but easy to flat those cases), but how is it done? I can figure O(log(n)) ways with up to O(n) storage overhead or O(sqrt(n)) with up to O(sqrt(n)) storage overhead, but O(1) w/o the O(n) storage overhead of having UTF-32 around?

Maciej Fijalkowski said...

Hi Anonymous.

It's O(1) time with O(n) storage overhead, but the constants can be manipulated to have 10% or 25% overhead, and only if ever indexed and not ascii at that.

intgr said...

Really excited to hear about the Unicode representation changes; it should finally make PyPy significantly faster at Unicode manipulation than CPython 3.6 is. It seems this has been bogging down PyPy's advantage at Unicode-heavy workloads like webapp template rendering.

Even without O(1) access to characters by index, I think it's a great idea to use UTF-8 internally, since that's the prevalent encoding for input/output pretty much everywhere. Accessing Unicode characters by index is an antipattern in most situations and UCS-2/UTF-16 is becoming irrelevant.

Oscar Smith said...

I would also be really interested in a quick blogpost at some point about how to get O(1) indexing without greater storage overhead than just using UTF-32