import site.body

Mutation and splice()ing: Hitting the point where things go wrong

Few bugs have stumped me like this before. I knew all the parts, I correctly guessed the cause (at least for part of it) and yet due to the nature of the bug this has me scratching my head for days exclaiming 'that's not possible'. This is a deep dive into what occurred, How i solved this and that nature of python and its different implementations (pypy and cpython).

Note: I use vmsplice(2) to refer to the kernel implementation of vmsplice bellow and vmsplice() to refer to the python level implementation.

I received a bug report from a user pointing to a problem with my vmsplice() implementation in my library Butter. The function would not work as intended on pypy (an implementation of pypy with a JIT compiler and a proper non ref counted GC). For those who want to follow along with the src code try looking here. As pypy was one of my supported interpreters and my preferred choice i looked into this issue asap and hit a much larger issue. My module did not compile with the latest libraries.

In the time that i swapped over to the use of cffi for my bindings to Linux syscalls and the time of the bug report, cffi had hit a 1.0 release and deprecated the method i used to build modules in favour of a much better and cleaner method (at least form a distribution point of view). While i wont focus on the specifics of this, A major feature of the 0.12 release of butter was porting to this new mechanism as well as getting rid of some of the build errors. Now a pip install butter is both near silent and does not require you to install cffi first.

With the build out of the way i made sure to grab the latest arm build of pypy. Some of the developers of pypy are the same people behind cffi. As cffi is tied to the specific implementation of the interpreter and pypy can be somewhat different, a copy of cffi is bundled with pypy and does not need to be pip installed. I mention the ARM version as my primary development and day to day machine is ARM. As such specific attention has been paid when writing the library to ensure that it works on ARM and x86 and i am acutely aware that some bugs may be architecture specific. If i could not replicate it locally then i would have to build a quick test environment on a x86 machine (this proved not to be the case however).

So what is the issue

vmsplice() had a hack in place to support 'zero copy' to the pipe, It would take the object and get a void * to it and pass this to the IOVEC that gets passed to vmsplice(2). On cpython this will work as long as the string is 'alive' in the GC as it is a non moving garbage collector. Pypy on the other hand has a full GC that moves objects from place to place (or generation in GC parlance) based on numerous factors such as longevity. This would render the pointer invalid as it would no longer be pointing to the right location if a GC happens between pointer creation and pointer use. To its credit, cffi on pypy detects this and prints an exception as below:

Traceback (most recent call last):
 File "test.py", line 39, in <module>
   spliced = vmsplice(wr, ['1', '2', '3\n'])
 File "/home/dablitz/code/butter/venv-pypy/site-packages/butter/splice.py", line 174, in vmsplice
   v.iov_base = _ffi.from_buffer(s)
TypeError: from_buffer() got a 'str' object, which supports the buffer interface but cannot be rendered as a plain raw address on PyPy

The report that came through included a fix, drop my hack, build a char[] using ffi.new() (causing the string to be copied and losing zero copy feature) and pass this to vmsplice(2). The code provided giving an example worked as expected and after being unable to find no way to avoid losing the zero copy feature i dropped the requirement and made the changes.

Then the code stopped working on both platforms.

I lie for dramatic effect, however what i saw was not that much different. What i saw was bytes being written and the same amount of bytes being read and was very happy with the results. Digging futher and looking at the output i was seeing 1-2 corrupted bytes on cpython with 'aaaa' coming back as 'aaau' on the read(). This seemed to be happening on the last byte and was consistent between runs.

Looking in to the code it all looked good, not perfect but no obvious bugs. I quickly took this as an opportunity to get this code up to standard and improve it as much as possible hoping the flaw would present itself in this refactor but the bug still remained.

The code i ended up going was not far off the code suggested in the email with some minor modifications (it used len() on a string when we want the byte length which is different for unicode) and this code had been proven to work in the test script. when inlined into the vmsplice() the code exhibited odd corruption issues.

Putting print statements to look at the raw memory in the vmsplice() function itself showed the data being written was exactly what i expected yet on the read i would get a random 'u' back. after much hair pulling i decided to try with different data sizes and found that the issue did not manifest with strings of 8 bytes or longer. The IOVEC struct is 8 bytes (32bit pointer on arm and a 32bit int) making me wonder if i had screwed up a length or address.

After getting some sleep and coming back to the issue, I tried much of the same thing again trying to deny reality. Pulling out strace showed something truly odd:

vmsplice(4, [{"a", 1}, {"a", 1}], 2, 0) = 2
read(3, "au", 20)                       = 2

The data i was seeing from the print statements in my code was correct, the kernel was getting the correct data. Yet when read i would get corrupted data. After a quick trip to twitter to accuse the Linux kernel of horribleness i decided to see if this behavior was identical in C. I am not a good C programmer but was able to whip up a proof of concept in a fairly short amount of times. The most painful thing was seeing the C program work as expected for strings of any length. I could not reproduce the issue with C.

This cemented in my mind that this was likely a GC issue in python. I had seen similar issues in the past and this had the same 'smell'. On a whim i decided to simulate a GC issue in C by overwriting the buffers in the code after the vmsplice(2) but before the subsequent read()

vmsplice(4, [{"aaaaaaaaaaaaaaaaa", 20}], 1, 0) = 20
read(3, "yes it is a gc issue", 20)           = 20

The above strace has been changed for dramatic effect but highlights what occurred nicely. The data is not read until the read(), ie it is not copied as i had originally thought. Reading the man page:

The vmsplice() system call maps nr_segs ranges of user memory described by 
iov into a pipe. The file descriptor fd must refer to a pipe.

vmsplice(2) promises to map ranges of memory. Not copy them. In retrospect it is likely that the SPLICE_F_GIFT flag lead me astray into the behavior of this syscall. I was of the impression that this flag produced a major change in behavior (mapping memory pages) when in fact it was a minor optimization for what the function was already doing.

Time for a quick lesson about GC

Pypy has a full generational GC while python has a ref counting GC. in the context of this issue this is minor apart from complications with pypy's ability to move data around which lead to the issue being reported. The other major difference between these GC's is when the GC will be invoked. On pypy this is simple (and makes your life complex): 'anytime'. You cannot guarantee when or when not the GC will be invoked. The actual trigger points are based on memory usage and if you are interested i would recommend consulting pypy's docs.

cpython can be basically summed up as 'when it needs more system mem and on exiting a function'. This is highly inaccurate but accurate enough to show what went wrong here.

Remember when i said the provided patch worked until i inlined it in the vmsplice() function? lets lay down some code to illustrate:

Patch

def vmsplice(data):
    data mangling
    lib.vmsplice(data) # call the original implementation

data = [ffi.new('char[]', x) for x in ('a', 'b', c')]
vmsplice(data)
new_data = read()
print(new_data)

Inlined

def vmsplice(data):
    data mangling
    data = [ffi.new('char[]', x) for x in data]
    C.vmsplice(data) # call the C implementation

vmsplice(['a', 'b', 'c'])
new_data = read()
print(new_data)

Question: in the above 2 examples when would you expect cpython to do a GC of data causing its removal?

If you answered 'For Patch: at the end of the program' and 'For Inlined: when vmsplice() returns but before the read()' you would be 100% correct and described the issue exactly. This is 1:1 with the C program. data was being GC'd causing new values to be overwritten in some locations in the string that was passed to vmsplice(2).

Now for Fixes

Well now we know this is the issue, we can just use malloc() and put the strings in a known, non movable location. Once again we have lost Zero copied (basically the only reason to use this syscall) however there is a larger issue and this highlights the real issue with vmsplice(2). When do we free that memory? in our application we where using a 'self pipe' trick for testing but how do we know the other end has done a read() and that we can deallocate the memory? I can do it, I have tricks but they dont work 100% of the time and are approximate (syscall interception) and not to mention frankly horrible.

This relegates vmsplice(2) to copying RO data into a pipe. This may be useful in C but there is no real concept of RO non-movable data in python.

Sad Goodbyes

It was at this point i decided to remove this syscall from my library and return a NotImplemented for compatibility. There is no safe way to implement this in python that wont cause a mem leak and it does not do what i expected it to do. Luckily for us there is a nearly identical API that does what we want to do which vmsplice(2) was modeled off: os.writev(). This has the copying behavior we want and is as zero copy as we can get in python.

This is not an ideal situation however i don't fully believe that vmsplice() worked correctly in butter to begin with. I am unaware of any users and should anyone complain i may just change the implementation to call writev.

Lessons learned

  • My test coverage is inadequate
  • Part of this is due to the amount of APIs that need root or significant setup work to test (testing forking is a PITA)
  • The above is a poor excuse and i am just lazy
  • Easy to see 'patterns in the clouds' that are not actually there or reflective of anything at all and lead you down a rabbit hole
  • Many Linux APIs look useful until you try and use them.
  • Many Linux APIs are not as generic as they look.
  • This was not an x86 specific issue (or ARM)

Clarifications

This is a half remembered post, some possibilities where not covered as they where not tested. An example of this is that the bug was reported for pypy and yet most of my testing was on cpython. on occasion i switched interpreters but mad no attempt to exhaustively test on both interpreters until i had locked the nature of the bug in place.

There may also be some inaccuracies in the above or odd conclusions (ie not enough supporting evidence). While i remember the conclusions i don't remember every command and change i did and bash_histories only get you so far. This post is intended to serve as a warning and entertainment only.