Unicode Trial by Fire

As part of our intranet at work, I created a web interface that collects text and publishes a PDF containing it, alongside data from other sources. Given this interface, people will inevitably paste text into the form from Microsoft Word.

What appears in Word as:

“this is inside curly quotes”

… when read by Python becomes:

'\xe2\x80\x9cthis is inside curly quotes\xe2\x80\x9d'

I had been using ReportLab 1.21 to generate these PDFs. At that point, I had a function that would replace all the weird Word characters with a different encoding that would render properly in the final PDF. Without this function, the PDF generation would hang indefinitely, or throw a nasty traceback. For example:

txt = txt.replace('\xe2\x80\x9c', '\x93') # double left quote
txt = txt.replace('\xe2\x80\x9d', '\x94') # double right quote
... etc.
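
In hindsight, that chain of replace() calls amounts to re-encoding the text from UTF-8 to CP-1252. Here is a minimal sketch of doing it in one step, written for Python 3 (where byte strings are bytes objects) rather than the Python 2 used in this post. The helper name utf8_to_cp1252 is made up for illustration, and the sketch assumes it is acceptable to turn any character without a CP-1252 equivalent into a question mark.

```python
def utf8_to_cp1252(raw):
    """Re-encode UTF-8 bytes as CP-1252 bytes (hypothetical helper)."""
    text = raw.decode('utf-8')
    # errors='replace' swaps unmappable characters for '?' instead of raising
    return text.encode('cp1252', errors='replace')


# The bytes Python sees after a paste from Word
word_bytes = b'\xe2\x80\x9cthis is inside curly quotes\xe2\x80\x9d'
converted = utf8_to_cp1252(word_bytes)
```

With the sample bytes above, the curly quotes come out as the single bytes \x93 and \x94, the same values the hand-written replacements produce.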

All this broke when I upgraded the ReportLab installation to version 2.2.

Section 3.1 of the ReportLab 2.2 User Guide states:

Starting with reportlab Version 2.0 (May 2006), all text input you provide to our APIs should be in UTF8 or as Python Unicode objects.

So this was to be my Unicode trial by fire.

I fuckin’ tried everything Unicode-related that I could think of. All of my attempts were semi-educated stabs in the dark. All of my hopes were subsequently dashed by the dreaded UnicodeDecodeError.

Findings:

1. My default encoding is 'ascii', and there’s no simple way to change it.
2. What I now know is that the function used with 1.21 was changing the encoded characters from (ostensibly) UTF-8 to CP-1252.
3. BUT, the information was being saved as a plain byte string (a Python 2 str, which gets treated as ASCII; see 1) even though the data was (ostensibly) UTF-8 encoded.
4. All attempts to convert this to Unicode failed. Until…

>>> txt = '“crazy shit”'
>>> txt
'\xe2\x80\x9ccrazy shit\xe2\x80\x9d'
>>> print txt
“crazy shit”
>>> txt.decode('utf-8')
u'\u201ccrazy shit\u201d'
>>> print txt.decode('utf-8')
“crazy shit”

Yeah, so it seems logical in retrospect. The text pasted from Word reaches Python as UTF-8-encoded bytes, so the text with crazy characters in it needs to be converted to Unicode by way of decoding the UTF-8 encoding.

What a huge pain in my ass.

Upshot:
If using ReportLab 1.21, convert UTF-8 to CP-1252 for it to render properly in the PDF. If using ReportLab 2.x, txt.decode('utf-8') and go from there.
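
For the 2.x route, here is a minimal sketch of a "decode once, at the edge" helper, again in Python 3 syntax. The name to_text and the sample bytes are hypothetical. The sketch also reproduces the dreaded UnicodeDecodeError by decoding the same bytes as ASCII, which is effectively what an 'ascii' default encoding does behind your back.

```python
def to_text(value, encoding='utf-8'):
    """Return text whether given bytes or text (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


form_bytes = b'\xe2\x80\x9ccrazy\xe2\x80\x9d'
text = to_text(form_bytes)

# Decoding the same bytes as ASCII fails on the first non-ASCII byte
try:
    form_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Decoding once, as soon as the form data arrives, means everything downstream handles text rather than raw bytes.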

ReportLab Table Cell Follow-Up

Previously, I documented wrestling with links inside table cells. There, I wrote:

So, for example, if you have one cell with a number and another with a long string of text, you can’t say, “Make the number cell X wide, and make the text cell Y wide.”

That’s completely wrong, and the ReportLab documentation proves it. While I did scan the docs, it’s safe to say that I didn’t actually RTFM.

You can set the same width for every table column by passing a number (the behavior I complained about):

inch = 72 # 72 points per inch
t = Table(table_data, colWidths=1*inch)

or, you can set different widths for each by passing a list of numbers:

t = Table(table_data, colWidths=[1*inch, 2*inch, 3*inch])

This well-documented feature is a boon to my application, which–until today–suffered from table cell overruns and general suckiness.

I wrote a function that measures the longest string in each column, then sets the width accordingly. If the total width of all columns exceeds that of the page (less the margins), then I have a “sacrifice” column that’s left to text-wrap as ReportLab sees fit.
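
The function itself isn't shown here, so what follows is only a sketch of the idea, not the actual code. The names fit_column_widths and rough_width, the 12-point padding allowance and the sample rows are all assumptions. The measuring function is passed in so the sketch runs without ReportLab.

```python
def fit_column_widths(rows, measure, available, sacrifice, padding=12):
    """Return one width per column, in points (hypothetical helper).

    rows      -- list of rows, each a list of strings
    measure   -- callable giving the rendered width of a string in points
    available -- usable page width (page width less the margins)
    sacrifice -- index of the column that is allowed to shrink and wrap
    padding   -- assumed allowance for cell padding, in points
    """
    ncols = len(rows[0])
    widths = [
        max(measure(row[col]) for row in rows) + padding
        for col in range(ncols)
    ]
    overflow = sum(widths) - available
    if overflow > 0:
        # Only the sacrifice column gives up width; never go below zero
        widths[sacrifice] = max(widths[sacrifice] - overflow, 0)
    return widths


def rough_width(s):
    """Stand-in measure for the demo: pretend each character is 6 points."""
    return 6 * len(s)


rows = [['1', 'short'], ['22', 'a much longer description of the thing']]
widths = fit_column_widths(rows, rough_width, available=200, sacrifice=1)
```

In real use, measure would wrap pdfmetrics.stringWidth with your font name and size, and the resulting list goes straight into colWidths.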

While I can report that things are coming together pretty well using the open source libraries… work has agreed to pony up for an “Enterprise” license. I’ll dive into merging PDFs and popping my RML cherry next week.

I have to fix that whole “using A4 instead of 8.5×11 anachronism” layout problem first.

Stupid metric system.

LayoutError in ReportLab

Attempting To: Generate a PDF file listing a few attributes of various systems. Each set of system information needs to stay together as a block of info on the page; having some on one page, and the remainder on the following page is unacceptable.

Established Method: Employ the “KeepTogether” flowable. Pass a list of things (here, Paragraph objects) to it, and it’s supposed to handle things properly. If the content passed won’t fit in the space left on the current page, it will auto-magically insert a page break, and place the content on the next page.

Problem: ReportLab throws “LayoutError: Splitting error” on one of the KeepTogether flowable objects. Further, this particular exception seems impossible to handle; wrapping the calls in “try” and “except” clauses appears to have no effect whatsoever, and the error is raised without fail.

Investigation, conclusion & code examples follow…


A ReportLab Link in Table Cell Workaround

Front matter: You’re generating a table inside a PDF document with ReportLab, and one of the cells needs to feature a link to an external website. Links aren’t supported (as far as I know) outside a Paragraph object, but shoving a Paragraph into the table cell results in some pretty fucked up word wrapping and table-cell sizing.

It’s ugly, and unacceptable.

According to the ReportLab mailing list:

Paragraphs behave well inside a table cell with a fixed width.

I won’t even try to re-find the post wherein this valuable nugget lies, but it’s somewhere in the mailing-list archives.

The biggest problem here is: You cannot specify individual widths of table cells to fulfill whatever random “okay, now it works” behavior. When you create a Table object, there’s an optional cell-width argument (colWidths), but it applies across the board to every cell in the table. So, for example, if you have one cell with a number and another with a long string of text, you can’t say, “Make the number cell X wide, and make the text cell Y wide.”

Update August 21, 2008: The above paragraph is completely wrong. See my post “ReportLab Table Cell Follow-Up” for more information.

Of course, you can just leave the cell-width calculation up to the Table class when you instantiate it… but that merely results in the aforementioned fucked up shit.

You see the problem, eh?

My solution is:

1. Create the link within a Paragraph. Again, AFAIK, this is the only way to link.
2. Determine the length of the linked text. Use pdfmetrics.stringWidth.
3. Create a Table with only one cell whose width is that determined in step 2. This makes things “behave”.
4. Drop that Table into the cell of the real Table you actually want displayed in the final PDF.

Here:

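from reportlab.lib.styles import getSampleStyleSheet
from reportlab.pdfbase import pdfmetrics
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.platypus.tables import Table

doc = SimpleDocTemplate('filename_of_pdf')
Story = []

# Stand-ins for your own style; the sample 'Normal' style is only an example
your_paragraph_style = getSampleStyleSheet()['Normal']
your_font_name = your_paragraph_style.fontName
your_font_size = your_paragraph_style.fontSize

link_text = 'linked text'
url = '<link href="http://the/actual/url.html">%s</link>' % link_text

# Step 2: width of the linked text in points, in the Paragraph's font
L = pdfmetrics.stringWidth(link_text, your_font_name, your_font_size)
# Step 3: a one-cell Table pinned to that width
inside_Table = Table([[Paragraph(url, your_paragraph_style)]], colWidths=[L])

# Step 4: Table data is a list of rows, so the single row is itself a list
real_Table_data = [['a', 'b', inside_Table, 'c', 'd']]
t = Table(real_Table_data)

Story.append(t)
doc.build(Story)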

For the most part: Blickity-Blam!

Sometimes the pdfmetrics.stringWidth() call will still cause your shit to wrap, but throw a +1 after the call in the L declaration (to add a single point, which is 1/72 of an inch) and things will work beautifully.

How to Make a Comparative Bar Graph PDF using Python and ReportLab

Note: this how-to comes from the same vein as “Generating CSS-Only Sparklines in Python”.

So.

Let’s say that you have 2 sets of data where each data point in each set corresponds to the same 1 parameter… and you want to make a printable bar graph of it all.

For example, you might want to compare your store’s history-averaged daily sales to recent daily sales for some specific dates of the year. Or… it could be the high temperature for your town… or, it could be your weight versus percent body fat. It doesn’t really matter, as long as you need to compare 2 “variables” against the same 1 “constant”.

If you’re familiar with graphing, then you’ll immediately realize that the dates will work on one axis (in this post, that’s the x-axis), and that the things you’re comparing will work on the other (here, the y-axis).

Without going into the advantages of print-resolution vectors over screen-resolution rasters…

I’ve written a Python module (with a ton of kickass notation by way of comments) that will generate a PDF file featuring this type of bar graph for you. Of course, it’s easily customizable.

Before you can use my code, however, you need to have Python and ReportLab installed on your system. I’ll leave the details of that up to you (it’s easy). RTFMs; they won’t steer you wrong.

As with the sparkline how-to, this module will squish-down all the data so that the graph fits on one (US standard “letter” size) page, in “landscape” (wide) orientation. The range of the data and the number of data points don’t matter; it’ll all fit.
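
The squishing boils down to two ratios: divide the plot width by the number of bars, and map the data range onto the plot height. This is a rough sketch of that arithmetic only, not the module's actual code. The name scale_bars, the 36-point margin and the sample numbers are assumptions.

```python
def scale_bars(series_a, series_b, plot_width, plot_height):
    """Return (bar_width, heights_a, heights_b) in points (hypothetical)."""
    if len(series_a) != len(series_b):
        raise ValueError('both series need the same number of points')
    low = min(series_a + series_b)
    high = max(series_a + series_b)
    span = (high - low) or 1  # avoid dividing by zero when the data is flat
    # Two bars per data point, side by side, filling the whole width
    bar_width = plot_width / (2.0 * len(series_a))

    def height(value):
        # The lowest value maps to 0 and the highest to plot_height
        return (value - low) * plot_height / float(span)

    return (bar_width,
            [height(v) for v in series_a],
            [height(v) for v in series_b])


# Landscape US letter is 792 x 612 points; assume a 36-point margin all round
PAGE_W, PAGE_H, MARGIN = 792, 612, 36
bar_width, heights_a, heights_b = scale_bars(
    [10, 20, 30], [15, 25, 40], PAGE_W - 2 * MARGIN, PAGE_H - 2 * MARGIN)
```

Because both ratios come from the data itself, adding more points only makes the bars thinner and a wider range only compresses them vertically. In this sketch the smallest value gets a zero-height bar, since the y-axis starts at the data's lower limit.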

For the time being there is no graph legend, no headers for the axes, and no labels for the x-axis. The y-axis, however, does have labels based on the lower- and upper-limits of the data.

With all of that (unnecessary?) exposition aside, I give you:

  • The code (remember to change the .txt to a .py), and
  • A sample PDF (which you can generate from the code’s default state).

If you have any questions or comments, please feel free to leave ’em here on the site.