Those darn strings

It's long overdue, but I finally added the promised section about "How to deal with strings" to the Python GTK+3 Tutorial.

It all started with bug 663610, which confused me: with PyGObject for Python 2.x you can pass a unicode instance to a method that expects a string, even though GTK+ can only deal with UTF-8 encoded 8-bit strings. However, every GTK+ method that returns a string always returns a str instance. This is not necessarily PyGObject's fault; it is exceptionally easy to do bad things in Python 2.x when mixing unicode and str (hello UnicodeDecodeError). The lesson here is that if you use PyGObject for Python 2.x, don't use unicode and stick to str all the time. If you use Python 3.x instead, you don't have to worry about this. A more detailed description and a little bit of background can be found in the "How to deal with strings" section of the tutorial.
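To make the UnicodeDecodeError mentioned above concrete, here is a minimal sketch. It assumes that the value stands in for a UTF-8 encoded str returned by a GTK+ method (no GTK+ call is made), and it performs explicitly the ASCII decode that Python 2.x attempts implicitly when a str is combined with a unicode instance.

```python
# -*- coding: utf-8 -*-
# Stand-in for a UTF-8 encoded byte string as a GTK+ method would return it
# under Python 2.x (hypothetical value, no GTK+ involved).
raw = u"Gr\u00fc\u00dfe".encode("utf-8")

# Python 2.x decodes a str with the default 'ascii' codec when it is mixed
# with unicode. Doing the same decode explicitly shows why that fails as
# soon as the data contains a non-ASCII character.
try:
    raw.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True

# Decoding with the encoding GTK+ actually uses works fine.
text = raw.decode("utf-8")
```

The failure only shows up once non-ASCII data appears, which is why this kind of bug tends to survive testing with plain English strings.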

I also want to thank everybody who helped to improve the tutorial as part of Google Code-In 2011. Students wrote sections about Drag and Drop, sub-classing GObject, Gtk.Builder, Gtk.IconView, Gtk.Clipboard, and Gtk.TreeSortable. They all did an outstanding job in improving the tutorial to enable others to learn GTK+ 3 in Python.


I haven't checked whether this is handled, but Python allows standard UTF-8 strings, whereas GLib and Gtk+ typically use "modified UTF-8". The difference is that standard UTF-8 allows NUL bytes ('\0') in the string, so you need to track the string length. Modified UTF-8 instead encodes NUL using the two bytes 0xC0, 0x80.
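The difference described in this comment can be sketched in a few lines. The two helper names below are hypothetical and not part of Python, GLib, or PyGObject, and the sketch covers only the NUL substitution mentioned here, not any other aspect of modified UTF-8.

```python
# -*- coding: utf-8 -*-
def encode_modified_utf8_nul(text):
    """Encode text as UTF-8, writing each NUL as the two bytes 0xC0 0x80."""
    # In valid UTF-8 a 0x00 byte can only be the NUL character itself,
    # so a plain byte-level replace is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified_utf8_nul(data):
    """Reverse the substitution above and decode the result as UTF-8."""
    # The byte 0xC0 never occurs in valid standard UTF-8, so the pair
    # 0xC0 0x80 can only come from the NUL substitution.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = u"ab\x00cd"
standard = sample.encode("utf-8")            # contains a literal 0x00 byte
modified = encode_modified_utf8_nul(sample)  # no 0x00 byte, one byte longer
```

With the modified form, the byte string can be passed around as a NUL-terminated C string without tracking its length separately.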

I think you should at least mention how it was done in PyGTK and how to do the same using PyGObject, e.g.:

import sys

In reply to Johan Dahlin (not verified)

What problem would this work-around solve?

Judging from the documentation of sys.setdefaultencoding, it seems that this function is not supposed to be used by applications.

In reply to sebp

It will maintain PyGTK compatibility and make sure that you can use str types instead of unicode in all APIs. It enforces UTF-8 encoding for all strings, which makes total sense in a GTK+ application.

There are advantages to using unicode strings over UTF-8 encoded strs in a Python app: e.g. s.upper()/.lower() work, you can slice to get a substring, and you can left-align columns of text for plain-text reports (oh dear, I'm showing my age here, aren't I?). The general rule is to convert between str and unicode at application boundaries (I/O), and it works just fine if you treat PyGObject as a boundary and call .decode('UTF-8') as soon as you get a str object from a Gtk+ method call.
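As a rough illustration of treating PyGObject as a boundary, the sketch below uses two hypothetical helpers, to_text and to_utf8, which are not part of PyGObject. The byte string is a stand-in for a value returned by a Gtk+ method, so the example runs without GTK+ installed.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Decode a byte string coming from the toolkit; pass text through."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.x
        return value.decode(encoding)
    return value


def to_utf8(value, encoding="utf-8"):
    """Encode text before handing it back; pass byte strings through."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Stand-in for a UTF-8 encoded value as returned by a Gtk+ method.
raw = u"Gr\u00fc\u00dfe".encode("utf-8")

text = to_text(raw)        # decode once, right at the boundary
prefix = text[:3]          # slicing counts characters, not bytes
padded = text.ljust(8)     # column alignment uses the character count
outgoing = to_utf8(text)   # encode again only when leaving the application
```

Slicing the raw byte string at the same index would instead cut the two-byte sequence for "ü" in half, which is the kind of breakage the comment is warning about.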

In reply to Marius Gedminas (not verified)

I would prefer working with unicode all the way as well, but since PyGObject returns str instances you have to be super careful not to mess things up. Therefore, I suggested using only str to be on the safe side.