©2001 Pierre Igot
One unfortunate thing, for most
writers, about the personal computer —
which has effectively replaced the typewriter
as the tool of choice for those who need or
want a tool to write — is that it was
invented by American engineers, and not by European
writers or expert typesetters.
Why is this a problem? Because,
obviously, those American engineers were faced
with all kinds of technological issues that
were more important to them than character/typesetting
issues such as “curly quotes,” diacritics
and other so-called “special characters.”
The computer they invented, therefore, initially
lacked the ability to meet the typesetting needs
of the professional writer. And, in many ways,
more than two decades later, in spite of all
the progress that has been made, we are still
suffering from the consequences of this simple
original fact.
Addressing all the details of
the historical evolution of the personal computer
with respect to character and typesetting issues
would be too great an undertaking. I will
therefore take a more superficial approach and
just attempt to “cover the bases,”
so that other readers/writers, I hope, may gain
a better understanding of those issues. I am
not a specialist in any of the areas touched
upon below either — I am just a long-time
computer-using writer who happens to have taken
a keen interest in those issues and who is constantly
struggling to achieve the best possible compromise
in word processing and web publishing. (I am
afraid I will also have to restrict myself to
Western languages using the Latin alphabet,
as I have zero knowledge of the way modern computers
are used to handle Arabic, Japanese or the Cyrillic
alphabet.)
The ASCII Code
In order to really understand
the issues with which we are still faced today,
one needs to get a bit technical and go back
to the fundamentals. In short, computers store
and transfer data in the form of strings of
“bits,” which are, metaphorically
speaking, single units that can only be either
“ON” or “OFF.” In the
language of the computer, they can only have
one of two values: “1” (on) and
“0” (off).
In order to use this system to
store alphabetical characters, you need to use
a number of “bits” that provide
you with enough possible combinations to cover
the alphabet. For example, with two bits, you
can only have the following four (2^2) combinations:
00, 01, 10, 11 — and four is clearly insufficient
to cover an alphabet of 26 letters in both
lowercase and uppercase, numerals and all the
necessary punctuation marks and associated symbols.
With seven bits, however, you
have 2^7, i.e. 128 possible combinations. Due
to technological limitations and the fact that
their priorities were elsewhere, the American engineers
who designed early computer systems decided that
128 was enough to cover basic writing needs
and came up with what is called the ASCII code
(acronym for American Standard Code for Information
Interchange). In the ASCII code, for example,
uppercase A is represented by the 7-bit string
‘1000001,’ uppercase B by ‘1000010,’
uppercase C by ‘1000011,’ etc.
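As a rough illustration of this arithmetic, here is a minimal sketch in Python (a language not used elsewhere in this article). The function names `combinations` and `ascii_bits` are made up for the example; the sketch only counts bit combinations and prints the 7-bit ASCII pattern for a few letters.

```python
def combinations(bits: int) -> int:
    """Number of distinct values that `bits` on/off units can represent."""
    return 2 ** bits


def ascii_bits(char: str) -> str:
    """Return the 7-bit ASCII pattern for a single character."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is not in the 7-bit ASCII table")
    return format(code, "07b")  # zero-padded binary, 7 digits wide


for n in (2, 7, 8):
    print(n, "bits ->", combinations(n), "combinations")

for letter in "ABC":
    print(letter, ascii_bits(letter))
```

Calling `ascii_bits` on an accented letter raises an error, which is exactly the limitation discussed next.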
With seven bits, there are enough
combinations for the 26 letters of the alphabet
in both lowercase and uppercase, the 0 to 9
digits, and a few punctuation signs and symbols
such as (, ), [, ], +, -, etc. (The entire
ASCII table is available here.)
Clearly, however, there are not enough possible
combinations for non-English characters using
Latin characters with diacritics, i.e. additional
character features used to represent a different
pronunciation or conventional spelling. No room
for ‘é,’ ‘ã,’
‘û,’ etc. No room either
for a number of punctuation marks such as the
em dash (—), curly English quotes (“
and ”), French quotes (‘«’
and ‘»’), ligatures, etc.
And there you have the beginning
of the nightmare. As it became clear that personal
computers would also be used as writing tools
by non-American writers and demanding typesetters,
the ASCII code was extended by using 8-bit combinations
(for a total of 256). But it was not standardized
and the extended ASCII code for the Macintosh
ended up being different from the extended ASCII
code on PCs. The ASCII code for ‘é,’
for example, is not the same depending on whether
you are using a Mac or a Wintel machine.
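To make the mismatch concrete, the following sketch uses Python's built-in codecs for three legacy 8-bit tables. Treating "mac_roman", "cp1252" and "cp437" as stand-ins for the Mac and PC extended ASCII codes is an assumption made for this example, and `byte_for` is a hypothetical helper name.

```python
# Assumed stand-ins for the platform-specific "extended ASCII" tables.
LEGACY_ENCODINGS = {
    "Mac OS Roman (classic Macintosh)": "mac_roman",
    "Windows-1252 (Windows)": "cp1252",
    "Code page 437 (DOS-era PC)": "cp437",
}


def byte_for(char: str, encoding: str) -> int:
    """Return the single byte value used for `char` in an 8-bit encoding."""
    return char.encode(encoding)[0]


for label, codec in LEGACY_ENCODINGS.items():
    print(f"{label}: 0x{byte_for('é', codec):02X}")

# A byte written under one table and read back under another
# comes out as a different character.
mac_bytes = "é".encode("mac_roman")
print(mac_bytes.decode("cp1252"))
```

The last line shows what a Windows program would display if it read a Mac-encoded ‘é’ without any conversion.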
ISO Latin
The different ASCII codes created
all kinds of problems when transferring files
from computer to computer, which had to be addressed
either by software developers or by the users
themselves. With the advent of the Internet,
however, the problem became even more important,
because all kinds of people started exchanging
all kinds of files over the network using all
kinds of different pieces of software.
Once again, the Internet was
clearly not “invented” by non-American
writers or typesetters. And all the problems
that software developers and users had managed
to fix had to be fixed again, only on a much
larger scale. Before a Mac user could send an
email with accented characters to a Windows
user, for example, we had to have two email
programs (one on the Mac and one on the Windows
PC) and a whole network of computers between
the sender and the recipient that all “understood”
the same language, the same code — and
that clearly couldn’t be the extended
ASCII code.
This problem led to the “ISO Latin 1”
standard (officially named ISO-8859-1), developed
by the International Organization for Standardization.
It uses the same code as ASCII for the first
128 characters, but the “extended”
portion (also known as “upper ASCII”)
is the same for everyone.
Gradually, Internet servers,
providers, and individual users have been migrating
to computer programs that understand the ISO Latin 1
code and are able to encode and decode texts
written using this code. Unfortunately, even
today in 2001, there are still a significant
number of programs out there, either on servers
or on end-user PCs, that do not encode and decode
ISO Latin 1 properly. For example,
the popular ListBot
service, which lets people create and manage
free mailing lists, still does not support ISO Latin 1
characters, which means that if you create a
ListBot list for French users, you’ll
have to ask them all not to use accented characters,
or else Windows users won’t be able to
read the list messages coming from Mac users
and vice-versa. Needless to say, it’s
easier said than done and, as a minority, Mac
users know all too well how easy it is for Windows
users to ignore their needs altogether.
In addition, the 256 characters
provided by the ISO Latin 1 code still
aren’t enough, not only to meet the needs
of other languages such as Greek, Chinese and
Japanese, but even to meet the needs of the
demanding Western writer. For example, the ISO Latin 1
code doesn’t include the ligatured ‘œ,’
which is commonly used in French for words such
as œuf, œil, sœur, etc.
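A quick way to check whether a given text fits into ISO Latin 1 is to try encoding it. The sketch below relies on Python's "iso-8859-1" codec; `fits_latin1` is a hypothetical helper written for this example.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# The first 128 codes are identical to ASCII.
print("A".encode("iso-8859-1") == "A".encode("ascii"))

# Accented French letters fit, but the ligatured 'œ' does not.
for word in ("élève", "œuf", "sœur"):
    print(word, fits_latin1(word))
```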
Unicode
The next step was then to move
beyond the 8-bit, 256-character barrier. A new
standard was thus developed using 16 bits instead
of 8, for a total of 65536 possible combinations.
This standard is called Unicode,
and it covers many languages as well as many
symbols and icons.
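The 16-bit arithmetic can be sketched in the same way. The example below prints the number of 16-bit combinations and the 16-bit code for a few characters mentioned in this article; using Python's "utf-16-be" codec to display those codes is a choice made for the example, and `code_unit` is a hypothetical helper name.

```python
def code_unit(char: str) -> str:
    """Return the 16-bit code for `char` as four hexadecimal digits."""
    return char.encode("utf-16-be").hex()


print(2 ** 16, "possible combinations with 16 bits")

# 'A', 'é', the ligatured 'œ' and the em dash (written as an escape).
for ch in ("A", "é", "œ", "\u2014"):
    print(repr(ch), f"U+{ord(ch):04X}", code_unit(ch))
```

Note that ‘A’ and ‘é’ keep the same numbers they have in ASCII and ISO Latin 1; characters such as ‘œ’ simply get numbers above 255.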
The problem, here again, is
the slow adoption of Unicode as the universal
standard on all computers and all software programs.
Unfortunately, we are still very far from having
reached a “critical mass” of Unicode-compatible
machines. Many pieces of software and many versions
of system software still in use out there don’t
support Unicode. We are still very much in a
transition phase and will remain so for several
more years at least.
For the Mac-using writer, this
means that he has to adopt a cautious approach
and achieve a compromise between his own character
requirements as a writer and the current global
situation when it comes to Unicode compliance.
Over the past couple of years,
I have developed a set of personal rules and
tools that enable me to achieve such a compromise.
This compromise is satisfactory for me, but
it might not be right for you. I still want
to share it with you so that you get a better
idea of what the issues are for the Mac-using
writer and what you can do to address them to
your own satisfaction.
Characters in Word Processing
When I write a text (such as
this one), I always do so in a word processor
first (Word 2001 in my case). This is for
a number of reasons: I want the WYSIWYG interface
that a text editor such as BBEdit doesn’t
offer; I want to be able to use style sheets
that automate my formatting tasks; I want to
be able to print the text on my laser printer
in a quality that approaches that of traditional
printed text, with a minimum number of page
layout features; I want to be able to exchange
the text (with all its formatting, style sheets, etc.)
with other users, many of whom are Windows users
using Microsoft Office; etc.
So I type my texts in Word 2001.
I use all the characters I need or like to use,
including accented and ligatured French characters
when I write en français, punctuation
marks such as the em dash, non-breaking spaces, etc.
I do not worry about Unicode at this point.
Even if I wanted to, I couldn’t, because
Word doesn’t use the Unicode standard
for its files. (It actually uses the Windows
ASCII code, even on the Mac, as you can see
if you open a Word file containing diacritics
with a text editor such as BBEdit. This is a
pain if you like to use a program such as the
excellent SpeedSearch
to search through the contents of your Word
files to locate a particular occurrence of a
string of characters with diacritics. It simply
won’t work, and you can only search for
strings without diacritics in Word files with
SpeedSearch.)
Word does have a Unicode “feature”
in the form of a file format called “Unicode
Text” that you can select in the “Save
As…” dialog. However, although saving
a Word file as “Unicode Text” appears
to work, reopening the Unicode Text file in
Word uncovers a bug that makes Word “forget”
characters here and there in your text, effectively
rendering it unusable. This bug was already
present in Word 98, and still hasn’t been
fixed in Word 2001. I do not recommend,
therefore, that you use Word’s Unicode
feature in the current version. Microsoft has
a long history of very iffy standards compliance,
and it will probably be a while before Office
applications become 100% compliant with the
Unicode standard.
I suspect it will be more or
less the same with other word processing applications.
Characters in Web Publishing
Once my text is ready in Word 2001,
I want to transfer it to my HTML editing application
of choice, i.e. BBEdit, to publish it on the
web. There are several possible ways to do this:
I can save the text as a “plain text”
file and open it in BBEdit, I can drag-and-drop
the text from Word to BBEdit, or I can use the
clipboard to copy and paste it. Whichever approach
I choose, however, I will lose some vital character
information.
For example, if I have any non-breaking
spaces in my document, they will be lost in
the process. In this particular article, I have
typed non-breaking spaces between “Word”
and “2001” in all occurrences of
the string “Word 2001,” so
as to avoid having an unsightly “2001”
at the beginning of a line, separated from the
preceding “Word.” Those non-breaking
spaces will be lost. (Non-breaking spaces are
also widely used in French typography.)
Since there is no way to automatically
restore such non-breaking spaces in BBEdit (I
can’t expect BBEdit or a BBEdit AppleScript
script to guess that I had put a non-breaking
space between “Word” and “2001”),
I need to “prepare” my text in Word
before I transfer it to BBEdit.
The way I prepare my text
is with a Word macro command which automatically
replaces all my non-breaking spaces with the
HTML entity for the character, i.e. &nbsp;,
in the selected text.
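' Replace every non-breaking space in the selection with its HTML entity.
With Selection.Find
    .Text = "^s"                  ' ^s is Word's search code for a non-breaking space
    .Replacement.Text = "&nbsp;"
    .Forward = True
    .Wrap = wdFindStop            ' do not continue beyond the selected text
    .MatchCase = False
    .Format = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll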
When I work on my text later
in BBEdit, I just need to make sure that it
won’t automatically replace ‘&’
characters with the HTML entity for that character,
i.e. &amp;. (And if I want to use the ‘&’
character in my document, I will need to replace
it with ‘&amp;’ using a Word
macro before I do anything else.)
Things get more complicated,
however. BBEdit is excellent in terms of standards
compliance, but web browsers are not. Both Netscape 4.x
and Internet Explorer 4.5/5.x still misinterpret
several key characters in the way that BBEdit
encodes them — a form of encoding which
is perfectly correct, of course, but ends up
being useless for my target audience. For example,
the HTML entity for the ellipsis (‘…’)
is &hellip;. Internet Explorer 5.x
will interpret this entity properly and display
an ellipsis, but Netscape Navigator 4.x will
not.
Because Netscape
Navigator 4.x is still widely used and will
remain so for quite a while, I need, here again,
to “prepare” my text before I transfer
it to BBEdit. In this particular case, the only
solution is, unfortunately, to replace the ellipsis
with a string of three dots (which I have done
for this article). This is just an example of
the type of compromise one currently has to
accept.
Why not use Unicode text?
Ideally, what I would like to
do, of course, is use Unicode all the way through
and not have to encode my characters at any
point using hard-to-read HTML entities. Theoretically,
I could get closer to that ideal by using Word’s
“Unicode Text” file format. However,
even if the current versions of the most popular
browsers on the Mac and under Windows (Explorer 5
and Netscape 4) do provide some level of
support for Unicode, this support is not yet
sufficient to enable me to use the standard
throughout my web publishing activities.
For example, the Unicode encoding
standard requires that the file start with what
is called a “Byte-Order Mark” (BOM)
for proper decoding. However, if you do include
this BOM in your HTML file, Explorer 5
will be unable to display it properly. If you
don’t include it, when you close your
Unicode file and open it again in BBEdit, since
BBEdit doesn’t find this BOM, it decodes
the Unicode file improperly and all your character
encoding is lost. This effectively makes Unicode
encoding unusable for web publishing with the
tools I am currently using.
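The role of the Byte-Order Mark can be illustrated with a small sketch. It uses Python's own codecs, so it only mirrors the general mechanism; it does not reproduce what BBEdit or Explorer actually do internally. The sample word and variable names are arbitrary.

```python
import codecs

text = "élève"

# 16-bit Unicode, most significant byte first, with no BOM in front.
payload = text.encode("utf-16-be")

# The same bytes preceded by the big-endian Byte-Order Mark (FE FF).
with_bom = codecs.BOM_UTF16_BE + payload

# With the BOM, a generic decoder can work out the byte order by itself.
print(with_bom.decode("utf-16"))

# Without it, a program that guesses the wrong byte order
# still "succeeds", but produces unrelated characters.
wrong = payload.decode("utf-16-le")
print([f"U+{ord(ch):04X}" for ch in wrong])
```

The second result contains the right number of characters, but none of them is the one that was typed, which matches the kind of silent corruption described above.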
Netscape 4 doesn’t
fare much better and keeps displaying all kinds
of question marks for characters that it apparently
doesn’t understand. I have yet to test
Unicode HTML files under Netscape 6 —
but, in any case, even if AOL/Netscape does
manage to release a proper version of the product,
it will be a while before the browser is widely
adopted.
HTML Entities
This brings us back to ISO Latin 1
and HTML entities. (Originally, HTML entities
were just alphanumeric equivalents of ISO Latin 1
codes. For example, &eacute; and &#233;
both refer to the same ISO Latin 1
character, the ‘é’ with an
acute accent. Now, with Unicode, however, there
are also HTML entities referring to Unicode
characters. In the rest of this article, however,
when I refer to “HTML entities,”
I mean ISO Latin 1 HTML entities.)
Those entities — which
are always of the form &xxxx; — are
what makes your HTML file with diacritics, special
characters, etc. readable on all computers
with all browsers (most of them anyway). They
are inconvenient, because a word such as “&eacute;l&egrave;ve”
is definitely more difficult to read and edit
than élève, but,
right now, they are the only reliable way for
the Mac-using writer to publish texts with diacritics
and special characters on the web.
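For readers who want to see the conversion spelled out, here is a minimal sketch using Python's standard `html` module. The helper `to_named_entities` is hypothetical, and the table it relies on (`codepoint2name`) covers the HTML 4 entity list, which is somewhat larger than the ISO Latin 1 subset discussed here.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with a named HTML entity when one exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)  # plain ASCII, or no named entity available
    return "".join(out)


encoded = to_named_entities("élève")
print(encoded)

# The named and the numeric form refer to the same character.
print(html.unescape("&eacute;") == html.unescape("&#233;"))

# Decoding the entities gives back the original word.
print(html.unescape(encoded))
```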
In addition, the complete list
of HTML entities, which is the HTML equivalent
of the ISO Latin 1 character set,
doesn’t include important characters such
as the ligatured ‘œ.’
What’s worse, Netscape 4.x
is notorious for not supporting several of those
HTML entities. For example, if I didn’t
use the HTML entity for the superscript ‘2’
above when discussing 2-bit strings (I used
“2^2”), it’s not because I
didn’t want to or because it’s not
part of the ISO Latin 1 code. It’s
simply because it’s not supported by Netscape 4.x.
“2^2” is a more or less acceptable
substitute, but it’s clearly not satisfactory
for the demanding writer/web publisher in me.
Yet, at this moment in the evolution of Internet
standards and their adoption, I don’t
really have any choice.
And to make things even more
complicated, several web designing tools, such
as Adobe GoLive and Macromedia Director, also
have their own specific idiosyncrasies and do
not fully support the importing (through cut-and-paste
or by opening a text file) and exporting of
ISO Latin 1 entities.
The Big Compromise
Everyone has different needs
and different target audiences. And the current
situation is obviously a royal mess for both
the average user and the demanding writer/typesetter.
It’s understandable, therefore,
that many people have simply given up on trying
to make things work for now and gone back to
using a very minimal set of characters. For
example, many people only use straight English
quotes ("), even in languages which would normally
use a different type of quotation mark, and
many people still use a double dash (--) in
lieu of an em dash.
Even then, you can still use
automated tools (either Word macros or BBEdit
AppleScript scripts, for example) to simplify
your work and enable you to still have both
a high-quality, ready-for-print word processing
document and a web-ready file using only the
most common and widely supported characters
and HTML entities. Depending on the person with
whom I want to share a file, for example, I
will use a BBEdit script to replace all em dashes
with ‘--’:
tell application "BBEdit 6.0"
replace "—" using "--" searching in the selection
end tell
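The same kind of fallback can be sketched outside BBEdit. The Python example below maps a handful of typographic characters to plain substitutes; the contents of the `FALLBACKS` table and the function name `to_minimal` are assumptions for illustration, and you would adjust the table to your own needs.

```python
# Typographic character -> plain substitute (illustrative choices only).
FALLBACKS = {
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u201c": '"',    # left curly double quote
    "\u201d": '"',    # right curly double quote
    "\u00a0": " ",    # non-breaking space
}

_TABLE = str.maketrans(FALLBACKS)


def to_minimal(text: str) -> str:
    """Replace each character listed in FALLBACKS with its plain substitute."""
    return text.translate(_TABLE)


sample = "Word\u00a02001 \u2014 \u201cwait\u2026\u201d"
print(to_minimal(sample))
```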
For my own personal web publishing
needs, however, I have decided to adopt a “middle
ground” approach by using a combination
of HTML entities and some Unicode characters
that are already supported by both Explorer 4.5/5.x
and Netscape 4. For example, since the
‘&mdash;’ entity is supported
by Explorer but not by Netscape, I use the ‘&#8212;’
Unicode entity, which appears to be already
supported by both browsers (based on my own
testing). Similarly, I use &#339; and &#338;
for ‘œ’ and ‘Œ’
respectively.
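Numeric entities of this kind can be generated mechanically, since the number is simply the character's Unicode value in decimal. The sketch below uses the "xmlcharrefreplace" error handler of Python's ASCII codec; `to_numeric_references` is a hypothetical helper name, and whether a given browser displays the result is something you still have to test yourself.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#NNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The em dash is written as an escape (\u2014) to keep the source plain.
print(to_numeric_references("\u2014"))
print(to_numeric_references("œuf, Œuvre"))
```

Unlike a named-entity conversion, this approach replaces every non-ASCII character, including those that have no name in the HTML entity list.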
If you want to find out more
about which characters are supported by which
browsers — depending on your own personal
needs, of course — I recommend that you
do your own research. One useful web site that
I have recently found is “HTML
4.0 Character Entity References”.
You can then write automated tools such as Word
macros or BBEdit scripts to automate the replacement
of your special characters by the appropriate
entities.
Of course, it is a “patch-up”
kind of approach that makes switching back and
forth between your word processor and your web
page editor pretty much unrealistic. Once you
have converted all your special characters into
the HTML entities you have chosen to use, you
are unlikely to want to go back and do the reverse
conversion in order to be able to continue your
editing work in the word processor.
Unfortunately, we will only achieve
such a smooth, seamless process when all computers,
all system software and all applications finally
adopt the same, uniform standard. In the
meantime, I hope this article has given you a few
ideas on how to achieve a reasonable compromise
that will both ensure that your web published
work is accessible to all the members of your
target audience and enable you to use a set
of characters that satisfies your own personal
standards as a Mac-using writer.
Next time: “Word Processing
& Web Publishing - Part III: Character Styles”
You can contact Pierre Igot at
applepeel@applelust.com
Pierre's "Apple
Peel" page here at Applelust