Unicode Trial by Fire

As part of our intranet at work, I created a web interface for collecting textual data, which is then published in a PDF along with data from other sources. Given this interface, people will inevitably paste text into the form from Microsoft Word.

What appears in Word as:

“this is inside curly quotes”

… when read by Python becomes:

'\xe2\x80\x9cthis is inside curly quotes\xe2\x80\x9d'

I had been using ReportLab 1.21 to generate these PDFs. At that point, I had a function that would replace all the weird Word character encodings with a different encoding that would render properly in the final PDF. Without this function, the PDF generation would hang indefinitely or throw a nasty traceback. For example:

txt = txt.replace('\xe2\x80\x9c', '\x93')  # double left quote
txt = txt.replace('\xe2\x80\x9d', '\x94')  # double right quote
# ... etc.
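In hindsight, that hand-written replace table was doing a codec's job: the incoming bytes are UTF-8, and each replacement maps a character to its CP-1252 byte. A sketch of the same transformation using Python's codecs (Python 3 syntax, where byte strings need a `b` prefix; the post's examples are Python 2):

```python
# Bytes as they arrive from a Word paste, UTF-8 encoded.
word_paste = b'\xe2\x80\x9cthis is inside curly quotes\xe2\x80\x9d'

# Decode the UTF-8 bytes to text, then re-encode as CP-1252.
# U+201C/U+201D become the single bytes 0x93/0x94 -- exactly what
# the replace() calls above produce by hand.
cp1252_bytes = word_paste.decode('utf-8').encode('cp1252')

assert cp1252_bytes == b'\x93this is inside curly quotes\x94'
```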

All this broke when I upgraded the ReportLab installation to version 2.2.

Section 3.1 of the ReportLab 2.2 User Guide states:

Starting with reportlab Version 2.0 (May 2006), all text input you provide to our APIs should be in UTF8 or as Python Unicode objects.

So this was to be my Unicode trial by fire.

I fuckin’ tried everything Unicode-related that I could think of. All of my attempts were semi-educated stabs in the dark. All of my hopes were subsequently dashed by the dreaded UnicodeDecodeError.

Findings:

1. My default encoding is 'ascii', and there’s no simple way to change it.
2. What I now know is that the function used with 1.21 was changing the encoded characters from (ostensibly) UTF-8 to CP-1252.
3. BUT, the information was being saved as a straight ASCII string (see 1) even though the data was (ostensibly) UTF-8 encoded.
4. All attempts to convert this to Unicode failed. Until…

>>> txt = '“crazy shit”'
>>> txt
'\xe2\x80\x9ccrazy shit\xe2\x80\x9d'
>>> print txt
“crazy shit”
>>> txt.decode('utf-8')
u'\u201ccrazy shit\u201d'
>>> print txt.decode('utf-8')
“crazy shit”

Yeah, so it seems logical in retrospect. The pasted Word text arrives from the form as UTF-8-encoded bytes (that's what those \xe2\x80\x9c sequences are), so the text with crazy characters in it needs to be converted to Unicode by decoding it as UTF-8.
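The earlier failures make sense now too: with an 'ascii' default encoding, every conversion attempt choked on those high bytes. The same failure and fix, translated to Python 3 syntax:

```python
raw = b'\xe2\x80\x9ccrazy shit\xe2\x80\x9d'

# Decoding with the (default) ASCII codec reproduces the dreaded error:
# byte 0xe2 is outside the ASCII range.
try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    print(err)

# Decoding with the right codec succeeds.
assert raw.decode('utf-8') == '\u201ccrazy shit\u201d'
```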

What a huge pain in my ass.

Upshot:
If using ReportLab 1.21, convert the UTF-8 text to CP-1252 so it renders properly in the PDF. If using ReportLab 2.x, call txt.decode('utf-8') and go from there.
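For the 2.x path, a defensive wrapper is handy, since not every paste is guaranteed to be valid UTF-8. This helper (hypothetical, not from the post; Python 3 syntax) is one way to sketch it:

```python
def to_unicode(raw, encoding='utf-8'):
    """Hypothetical helper: coerce form input to text before handing
    it to ReportLab, tolerating stray non-UTF-8 bytes."""
    if isinstance(raw, str):
        return raw  # already decoded text
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # latin-1 maps all 256 byte values, so this fallback never raises
        # (it may mangle the odd character, but it won't blow up the PDF).
        return raw.decode('latin-1')

print(to_unicode(b'\xe2\x80\x9ccrazy shit\xe2\x80\x9d'))
```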
