Fusepy was recently modified to treat all filenames and metadata as Unicode strings instead of byte strings. This works when filenames are UTF-8, but breaks when filenames or other metadata contain byte sequences that are not valid UTF-8 characters.
As I originally reported here, user code should be written to deal with bytes, not the other way around. On POSIX operating systems, file paths are NOT specified as being UTF-8 or any specific Unicode encoding. The only correct way to deal with filenames on Unix is to treat them as byte strings. Most software I've seen treats them as UTF-8, but at the file system level, they are binary strings and any FUSE implementation would be broken if it didn't support non-UTF-8 filenames.
In other words, I need to be able to cd to a FUSE-mounted file system, open Python 3, and type this:
>>> import os
>>> open(b'd\xe9j\xe0_vu.txt', 'w').close()
>>> os.listdir(b'.')
[b'd\xe9j\xe0_vu.txt']
In most shells, if you ls
this file, it will display as d?j?_vu.txt
. But it is a perfectly valid Latin-1-encoded filename. If fusepy encoded the filename as a Unicode string before sending it to the user code, it would either throw an exception in this case, or corrupt the filename.
The current version of fusepy (the GitHub version) breaks this assumption. Commit 2590f48 adds an 'encoding' argument to the FUSE constructor, and then decodes all the bytes values to strs with this encoding before giving them to the user-supplied operations, and encodes all strs supplied by user code before giving them back to the operating system. Unfortunately, it isn't a correct solution to simply say "pick an encoding before you start". File systems must be able to support different files with different encodings on their names.
If you run memory.py and then perform my above example, you get this:
Traceback (most recent call last):
File "fuse.py", line 402, in _wrapper
return func(*args, **kwargs) or 0
File "fuse.py", line 410, in getattr
return self.fgetattr(path, buf, None)
File "fuse.py", line 640, in fgetattr
attrs = self.operations('getattr', path.decode(self.encoding), fh)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte
Alternatively, create the b'd\xe9j\xe0_vu.txt'
file somewhere on a normal drive, and then run the loopback.py example on the directory containing that file. Attempting to 'ls' the directory results in this exception:
Traceback (most recent call last):
File "fuse.py", line 402, in _wrapper
return func(*args, **kwargs) or 0
File "fuse.py", line 586, in readdir
if filler(buf, name.encode(self.encoding), st, offset) != 0:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 1: surrogates not allowed
First things first: I have created a relatively incomplete test script which is in my tests branch. It tests mounting the memory filesystem and creating a binary filename, and this test currently fails in both Python 2 and 3. (Note that this test passes in Python 2 on the latest Google Code version of fusepy, SVN r65.)
I have also prepared a branch which fixes this issue in the ideal way for me: bytes. This branch removes all of the encode/decode calls and the encoding parameter to the FUSE constructor. All Operation methods accept and return plain bytes objects. This is ideally the way it should work. I have made sure this works on Python 2.7 and 3.2 (with 2to3). Note that I have updated the memory, loopback and context examples, but not the sftp example, as I was unable to test it.
However, it may be too drastic to accept this change: for one thing, it breaks on Python 2.5 and earlier, because it introduces b'' notation. (This is necessary to properly run on Python 3.) For another thing, it may break existing users who have adopted the Unicode strings in the current version.
I have thought a bit about another less drastic solution: work the way that Python 3's os.listdir works. The Python docs explain a bit about how this works here. Basically, byte strings are converted into unicode strings, but if any bytes are invalid in that encoding, they are replaced with U+DCxx where xx is the byte value. This is a bit hacky, but it allows the user to deal with Unicode strings, and then roundtrip properly back into bytes objects. Fusepy could employ the same technique on Python 3 when sending and receiving strings from the user's Operations methods. It is easy to implement: just call encode and decode with errors='surrogateescape'.
However, this only works in Python 3. In Python 2, you should absolutely not be encoding filenames and metadata as Unicode strings, because the Python 2 str type allows you to represent binary data but treat it as text.
If you go with this approach, I would also like the option of using bytes in my own code. Perhaps if I call the FUSE constructor with encoding=None, it will deal with bytes objects, and if I specify an encoding, it will use surrogateescape?
I'm happy to help work on the solution to this problem. I recognise that my branch is not a final solution, but the start of a discussion (which is why I haven't made a pull request). But I think it is necessary to resolve this issue in one way or another, because the current behaviour makes it impossible to implement a robust filesystem, and that is unacceptable to me.