Monday, June 8, 2009

Much Ado About Nothing

Whitespace in XML files can behave unintuitively, especially when it comes to newlines. My script that converts AsciiDoc to Blogger post entries is peppered with awful kludges meant to bring the whitespace into line, but it still fails to indent source code.

After much experimentation, and as little reading of the official XML specifications as possible, I have determined the best practices for my situation:

  • Adding xml:space="preserve" to the opening content tag preserves the source code indentation.

  • It appears impossible to preserve newlines anywhere. They vanish mysteriously, not even leaving a space behind. I expected XML to leave at least a space in their place, so I don’t understand why; perhaps it’s an Atom thing. Thus I add "<br />" at the end of each line where formatting matters, and prefix each newline with a space.
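As a rough sketch of the second point, the three substitutions can be tried on a small sample before running them on a real entry. The helper name fix_newlines and the sample input are made up for illustration; the sed expressions are the ones used in the script, chained in a single call rather than applied with -i.

```shell
#!/bin/bash
# Hypothetical helper: applies the workaround to stdin instead of editing a file in place.
fix_newlines() {
  # 1. Put a space before every newline so adjacent words are not joined.
  # 2. Between <pre> and </pre>, turn the trailing space(s) into <br />.
  # 3. Remove the <br /> that step 2 added after the closing </pre>.
  sed -e 's/$/ /' \
      -e '/<pre>/,/<\/pre>/s/ *$/<br \/>/' \
      -e 's/\(<\/pre>.*\)<br \/>/\1/'
}

# Sample input: one paragraph and one indented listing.
printf '%s\n' '<p>one' 'two</p>' '<pre>if x:' '    y()</pre>' | fix_newlines
```

Lines inside the listing end in <br />, the closing </pre> line is left bare, and every other line gains a trailing space.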

My updated script appears below, this time with indentation.

#!/bin/bash

if [[ -z "$1" ]] ; then
  # Quoted so that [LABELS...] is not treated as a glob pattern.
  echo "Usage: $0 ASCIIDOC_SOURCE [LABELS...]"
  exit 1
fi

if [[ ! -f "$1" ]] ; then
  echo "$1 not found."
  exit 1
fi

outfile=$1.xml
# Extract = Title =, which must be on a line by itself.
title=$(grep '^ *=' -m 1 $1 | sed 's/^ *=* *//' | sed 's/ *=* *$//')

# Hmm, this draft thing used to work, but not anymore.
echo '<entry xmlns="http://www.w3.org/2005/Atom">
<app:control xmlns:app="http://www.w3.org/2007/app">
<app:draft>yes</app:draft>
</app:control>
<title type="text">'$title'</title>
<content type="xhtml" xml:space="preserve">
<div xmlns="http://www.w3.org/1999/xhtml">' > $outfile
# We need \n newlines for sed.
asciidoc -a newline=\\n -s -o - $1 >> $outfile
echo '
</div>
</content>' >> $outfile
while [[ -n $2 ]]; do
  echo ' <category scheme="http://www.blogger.com/atom/ns#" term="'"$2"'" />' >> $outfile
  shift
done
echo '</entry>' >> $outfile

# I can't figure out how to preserve line breaks in Atom XML,
# hence the following.
# Prefix each newline with a space.
sed -i 's/$/ /' $outfile
# Add <br /> to the end of all lines between <pre> and </pre>.
sed -i '/<pre>/,/<\/pre>/s/ *$/<br \/>/' $outfile
# Undo what we just did for the </pre> line.
sed -i 's/\(<\/pre>.*\)<br \/>/\1/' $outfile

if [[ -z $AUTH_TOKEN ]]; then
  stty -echo
  read -p "Blogger password: " pw
  stty echo

  token=$(curl --silent https://www.google.com/accounts/ClientLogin \
    -d Email=benlynn@gmail.com -d Passwd="$pw" \
    -d accountType=GOOGLE \
    -d source=asciidoc2blogger \
    -d service=blogger | grep Auth | cut -d = -f 2)
  AUTH_TOKEN=$token
  echo AUTH_TOKEN=$token
fi

# The URL was cut and pasted from <link rel="service.post"> from
# my blog's HTML source.
curl --silent --request POST --data "@$outfile" \
  --header "Content-Type: application/atom+xml" \
  --header "Authorization: GoogleLogin auth=$AUTH_TOKEN" \
  "http://www.blogger.com/feeds/4222267598459829544/posts/default" \
  | tidy -xml -indent -quiet
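
The title extraction in the script is easy to check on its own. The following is a minimal sketch that wraps the same grep and sed pipeline in a function; the name extract_title and the temporary sample file are assumptions added for illustration.

```shell
#!/bin/bash
# Hypothetical helper wrapping the script's title pipeline.
extract_title() {
  # Take the first line that starts with '=' (optionally indented),
  # then strip the leading and trailing '=' markers and surrounding spaces.
  grep -m 1 '^ *=' "$1" | sed 's/^ *=* *//' | sed 's/ *=* *$//'
}

# Try it on a throwaway AsciiDoc file.
sample=$(mktemp)
printf '%s\n' '= Much Ado About Nothing =' '' 'Body text.' > "$sample"
extract_title "$sample"
rm -f "$sample"
```

Only the first matching line is used, so a title-style line later in the document does not affect the result.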

Sunday, June 7, 2009

AsciiDoc To Blogger

Once I grew accustomed to writing AsciiDoc, editing even tiny amounts of HTML became bothersome. The sites I maintain use custom scripts to build HTML files from AsciiDoc source, but I have less control over this blog. Up until now I’ve been using the Blogger in-browser editor to fine-tune the markup in these posts.

AsciiDoc’s author, Stuart Rackham, also wrote a tool to go from AsciiDoc to a WordPress blog. Blogger should be similar, and perhaps even easier to work with, since WordPress appears to have a few quirks.

My first thought was to use the Mail-to-Blogger feature: I could run AsciiDoc on the source, then send it to a particular email address to publish it. This attempt floundered because Gmail has no raw HTML mode. Of course, I could script an SMTP client instead, but this seems excessive.

Next I considered the import and export feature. But even if I could figure out how to generate suitable XML files, I’d have to click around and solve a CAPTCHA each time I imported a post.

Finally the simplest solution hit me: use the Blogger Data API. With an HTTPS request or two, I can post raw HTML, set labels, and even choose whether to publish immediately or save as a draft. All it takes is a shell script using curl with Google data services:

#!/bin/bash

if [[ -z "$1" ]] ; then
  # Quoted so that [LABELS...] is not treated as a glob pattern.
  echo "Usage: $0 ASCIIDOC_SOURCE [LABELS...]"
  exit 1
fi

if [[ ! -f "$1" ]] ; then
  echo "$1 not found."
  exit 1
fi

outfile=$1.xml
# Extract = Title =, which must be on a line by itself.
title=$(grep '^ *=' -m 1 $1 | sed 's/^ *=* *//' | sed 's/ *=* *$//')

# Hmm, this draft thing used to work, but not anymore.
echo '<entry xmlns="http://www.w3.org/2005/Atom">
<app:control xmlns:app="http://www.w3.org/2007/app">
<app:draft>yes</app:draft>
</app:control>
<title type="text">'$title'</title>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">' > $outfile
asciidoc -f macros -s -o - $1 >> $outfile
echo '
</div>
</content>' >> $outfile
while [[ -n $2 ]]; do
  echo ' <category scheme="http://www.blogger.com/atom/ns#" term="'"$2"'" />' >> $outfile
  shift
done
echo '</entry>' >> $outfile

# I can't figure out how to preserve line breaks in Atom XML, hence the
# following workaround.
sed -i '/<pre>/,/<\/pre>/a<br \/>' $outfile

if [[ -z $AUTH_TOKEN ]]; then
  stty -echo
  read -p "Blogger password: " pw
  stty echo

  token=$(curl --silent https://www.google.com/accounts/ClientLogin \
    -d Email=benlynn@gmail.com -d Passwd="$pw" \
    -d accountType=GOOGLE \
    -d source=asciidoc2blogger \
    -d service=blogger | grep Auth | cut -d = -f 2)
  AUTH_TOKEN=$token
  echo AUTH_TOKEN=$token
fi

# The URL was cut and pasted from <link rel="service.post"> from my
# blog's HTML source.
curl --silent --request POST --data "@$outfile" \
  --header "Content-Type: application/atom+xml" \
  --header "Authorization: GoogleLogin auth=$AUTH_TOKEN" \
  "http://www.blogger.com/feeds/4222267598459829544/posts/default" \
  | tidy -xml -indent -quiet

Actually, that’s not quite all: to work around another weird XML whitespace issue I use the following AsciiDoc macros file.

[miscellaneous]
newline=" \n"

We insert a space before each newline to prevent words separated by a line break from being joined together.
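
To see why the space matters, here is a small sketch that imitates the observed behaviour, where newlines disappear without leaving a space. The function drop_newlines is a stand-in made up for illustration; it is an assumption about the effect, not a description of what Blogger actually does internally.

```shell
#!/bin/bash
# Stand-in for the observed behaviour: newlines vanish without leaving a space.
# (An assumption for illustration; this is not Blogger's actual code.)
drop_newlines() {
  tr -d '\n'
}

# Without the extra space, the words on either side of the break are joined.
printf 'line\nbreak\n' | drop_newlines; echo

# With a space before each newline, the words stay separate.
printf 'line \nbreak \n' | drop_newlines; echo
```

The first command prints the two words run together; the second keeps them apart, at the cost of a harmless trailing space.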