Monday, June 8, 2009

Much Ado About Nothing

Whitespace in XML files can behave unintuitively, especially when it comes to newlines. My script that converts AsciiDoc to Blogger post entries is peppered with awful kludges meant to bring newlines into line, but it still fails to indent source code.

After much experimentation, and as little reading of the official XML specifications as possible, I have determined the best practices for my situation:

  • Adding xml:space="preserve" to the opening content tag preserves the source code indentation.

  • It appears impossible to preserve newlines anywhere. They vanish mysteriously, not even leaving a space behind. By default XML leaves a space, so I don’t understand why; perhaps it’s an Atom thing. Thus I add "<br />" at the end of each line where formatting matters, and append a space to every line so that each newline is preceded by one.
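To see what the newline workaround does, here is a minimal sketch that runs the same three sed substitutions as a filter over a made-up HTML fragment. The function name fix_newlines and the sample input are assumptions for illustration; the script itself edits the output file in place with sed -i.

```shell
#!/bin/bash

# Hypothetical helper: the script's three sed passes, combined into one
# filter that reads stdin and writes stdout.
fix_newlines() {
  # 1. Append a space to every line.
  # 2. Between <pre> and </pre>, replace trailing spaces with <br />.
  # 3. Strip the <br /> that step 2 added to the </pre> line.
  sed -e 's/$/ /' \
      -e '/<pre>/,/<\/pre>/s/ *$/<br \/>/' \
      -e 's/\(<\/pre>.*\)<br \/>/\1/'
}

# Made-up sample input: one paragraph and one indented code listing.
printf '%s\n' '<p>Intro</p>' '<pre>if x:' '    y()' '</pre>' | fix_newlines
```

Lines outside the listing only gain a trailing space, lines inside it end in <br />, and the indentation of "    y()" is untouched, since only trailing spaces are rewritten.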

My updated script appears below, this time with indentation.

#!/bin/bash

if [[ -z "$1" ]] ; then
  echo "Usage: $0 ASCIIDOC_SOURCE [LABELS...]"
  exit 1
fi

if [[ ! -f "$1" ]] ; then
  echo "$1 not found."
  exit 1
fi

outfile=$1.xml
# Extract = Title =, which must be on a line by itself.
title=$(grep -m 1 '^ *=' "$1" | sed 's/^ *=* *//' | sed 's/ *=* *$//')

# Hmm, this draft thing used to work, but not anymore.
echo '<entry xmlns="http://www.w3.org/2005/Atom">
<app:control xmlns:app="http://www.w3.org/2007/app">
<app:draft>yes</app:draft>
</app:control>
<title type="text">'$title'</title>
<content type="xhtml" xml:space="preserve">
<div xmlns="http://www.w3.org/1999/xhtml">' > "$outfile"
# We need \n newlines for sed.
asciidoc -a newline=\\n -s -o - "$1" >> "$outfile"
echo '
</div>
</content>' >> "$outfile"
while [[ -n $2 ]]; do
  echo ' <category scheme="http://www.blogger.com/atom/ns#" term="'"$2"'" />' >> "$outfile"
  shift
done
echo '</entry>' >> "$outfile"

# I can't figure out how to preserve line breaks in Atom XML,
# hence the following.
# Append a space to every line, so each newline is preceded by a space.
sed -i 's/$/ /' "$outfile"
# Add <br /> to the end of all lines between <pre> and </pre>.
sed -i '/<pre>/,/<\/pre>/s/ *$/<br \/>/' "$outfile"
# Undo what we just did for the </pre> line.
sed -i 's/\(<\/pre>.*\)<br \/>/\1/' "$outfile"

if [[ -z $AUTH_TOKEN ]]; then
  stty -echo
  read -p "Blogger password: " pw
  stty echo

  token=$(curl --silent https://www.google.com/accounts/ClientLogin \
    -d Email=benlynn@gmail.com -d Passwd="$pw" \
    -d accountType=GOOGLE \
    -d source=asciidoc2blogger \
    -d service=blogger | grep Auth | cut -d = -f 2)
  AUTH_TOKEN=$token
  echo AUTH_TOKEN=$token
fi

# The URL was cut and pasted from <link rel="service.post"> from
# my blog's HTML source.
curl --silent --request POST --data "@$outfile" \
  --header "Content-Type: application/atom+xml" \
  --header "Authorization: GoogleLogin auth=$AUTH_TOKEN" \
  "http://www.blogger.com/feeds/4222267598459829544/posts/default" \
  | tidy -xml -indent -quiet
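
The title extraction is easy to try in isolation. The sketch below wraps the same grep and sed pipeline in a function and runs it on a throwaway file. The names extract_title and xml_escape and the sample text are assumptions for illustration. The escaping step in particular is not something the script above does; it is shown only because a title containing & or < would otherwise end up unescaped inside the <title> element.

```shell
#!/bin/bash

# Hypothetical helper: print the first "= Title =" line of an AsciiDoc
# file with the surrounding "=" markers and spaces stripped.
extract_title() {
  grep -m 1 '^ *=' "$1" | sed -e 's/^ *=* *//' -e 's/ *=* *$//'
}

# Hypothetical helper (not in the original script): escape the characters
# that are special in XML text. "&" must be handled first.
xml_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Made-up sample document.
sample=$(mktemp)
printf '%s\n' 'Some preamble' '= Much Ado & Nothing =' 'Body text.' > "$sample"

extract_title "$sample"               # Much Ado & Nothing
extract_title "$sample" | xml_escape  # Much Ado &amp; Nothing

rm -f "$sample"
```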
