
Word Processing & Web Publishing - Part II: ASCII, HTML Entities, ISO Latin, Unicode and All That Jazz

©2001 Pierre Igot

One unfortunate thing, for most writers, about the personal computer — which has effectively replaced the typewriter as the tool of choice for those who need or want a tool to write — is that it was invented by American engineers, and not by European writers or expert typesetters.

Why is this a problem? Because, obviously, those American engineers were faced with all kinds of technological issues that were more important to them than character/typesetting issues such as “curly quotes,” diacritics and other so-called “special characters.” The computer they invented, therefore, initially lacked the ability to meet the typesetting needs of the professional writer. And, in many ways, more than two decades later, in spite of all the progress that has been made, we are still suffering from the consequences of this simple original fact.

Addressing all the details of the historical evolution of the personal computer with respect to character and typesetting issues would be too great an undertaking. I will therefore take a more superficial approach and just attempt to “cover the bases,” so that other readers/writers, I hope, may gain a better understanding of those issues. I am not a specialist in any of the areas touched upon below either — I am just a long-time computer-using writer who happens to have taken a keen interest in those issues and who is constantly struggling to achieve the best possible compromise in word processing and web publishing. (I am afraid I will also have to restrict myself to Western languages using the Latin alphabet, as I have zero knowledge of the way modern computers are used to handle Arabic, Japanese or the Cyrillic alphabet.)

The ASCII Code

In order to really understand the issues with which we are still faced today, one needs to get a bit technical and go back to the fundamentals. In short, computers store and transfer data in the form of strings of “bits,” which are, metaphorically speaking, single units that can only be either “ON” or “OFF.” In the language of the computer, they can only have one of two values: “1” (on) and “0” (off).

In order to use this system to store alphabetical characters, you need to use a number of “bits” that provide you with enough possible combinations to cover the alphabet. For example, with two bits, you can only have the following four (2^2) combinations: 00, 01, 10, 11 — and four is clearly insufficient to cover an alphabet of 26 letters in both lowercase and uppercase, numerals and all the necessary punctuation marks and associated symbols.

With seven bits, however, you have 2^7, i.e. 128 possible combinations. Due to technological limitations and the fact that their priorities were elsewhere, the original designers of the personal computer decided that 128 was enough to cover basic writing needs and came up with what is called the ASCII code (acronym for American Standard Code for Information Interchange). In the ASCII code, for example, uppercase A is represented by the 7-bit string ‘1000001,’ uppercase B by ‘1000010,’ uppercase C by ‘1000011,’ etc.
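The arithmetic above can be checked with a short sketch. Python is used here purely for illustration (it is not part of the workflow described in this article), and the helper names `combinations` and `seven_bit` are hypothetical.

```python
def combinations(bits: int) -> int:
    """Number of distinct values that `bits` on/off units can represent."""
    return 2 ** bits


def seven_bit(char: str) -> str:
    """Return the 7-bit ASCII pattern for a single character."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is not part of 7-bit ASCII")
    return format(code, "07b")


print(combinations(2))  # two bits
print(combinations(7))  # seven bits
for letter in "ABC":
    print(letter, seven_bit(letter))
```

Passing an accented letter such as ‘é’ to `seven_bit` raises an error, which is the limitation discussed next.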

With seven bits, there are enough combinations for the 26 letters of the alphabet in both lowercase and uppercase, the 0 to 9 digits, and a few punctuation signs and symbols such as (, ), [, ], +, -, etc. (The entire ASCII table is available here.) Clearly, however, there are not enough possible combinations for the characters that other languages build from Latin letters and diacritics, i.e. additional marks used to represent a different pronunciation or conventional spelling. No room for ‘é,’ ‘ã,’ ‘û,’ etc. No room either for a number of punctuation marks such as the em dash (—), curly English quotes (“ and ”), French quotes (‘«’ and ‘»’), ligatures, etc.

And there you have the beginning of the nightmare. As it became clear that personal computers would also be used as writing tools by non-American writers and demanding typesetters, the ASCII code was extended by using 8-bit combinations (for a total of 256). But the extension was not standardized, and the extended ASCII code for the Macintosh ended up being different from the extended ASCII code on PCs. The code for ‘é,’ for example, differs depending on whether you are using a Mac or a Wintel machine.
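To make the mismatch concrete, here is a minimal Python sketch. It assumes that Python's `mac_roman` and `cp1252` codecs are fair stand-ins for what this article calls the extended ASCII codes of the Macintosh and of Windows PCs; the article itself does not name the character sets.

```python
text = "é"

mac_bytes = text.encode("mac_roman")  # classic Mac OS Western character set
win_bytes = text.encode("cp1252")     # Windows Western character set

# The same letter is stored as a different byte on each platform.
print(mac_bytes.hex(), win_bytes.hex())

# Reading the Mac byte as if it were a Windows byte yields another character.
misread = mac_bytes.decode("cp1252")
print(misread)
```

This is the kind of silent substitution that turns accented text into gibberish when a file crosses platforms without conversion.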

ISO Latin

The different ASCII codes created all kinds of problems when transferring files from computer to computer, which had to be addressed either by software developers or by the users themselves. With the advent of the Internet, however, the problem became even more important, because all kinds of people started exchanging all kinds of files over the network using all kinds of different pieces of software.

Once again, the Internet was clearly not “invented” by non-American writers or typesetters. And all the problems that software developers and users had managed to fix had to be fixed again, only on a much larger scale. Before a Mac user could send an email with accented characters to a Windows user, for example, we had to have two email programs (one on the Mac and one on the Windows PC) and a whole network of computers between the sender and the recipient that all “understood” the same language, the same code — and that clearly couldn’t be the extended ASCII code.

This problem led to the “ISO Latin 1” standard (officially named ISO-8859-1), developed by the International Organization for Standardization. It uses the same code as ASCII for the first 128 characters, but the “extended” portion (also known as “upper ASCII”) is the same for everyone.

Gradually, Internet servers, providers, and individual users have been migrating to computer programs that understand the ISO Latin 1 code and are able to encode and decode texts written using this code. Unfortunately, even today in 2001, there are still a significant number of programs out there, either on servers or on end-user PCs, that do not encode and decode ISO Latin 1 properly. For example, the popular ListBot service, which lets people create and manage free mailing lists, still does not support ISO Latin 1 characters, which means that if you create a ListBot list for French users, you’ll have to ask them all not to use accented characters, or else Windows users won’t be able to read the list messages coming from Mac users and vice-versa. Needless to say, it’s easier said than done and, as a minority, Mac users know all too well how easy it is for Windows users to ignore their needs altogether.

In addition, the 256 characters provided by the ISO Latin 1 code still aren’t enough, not only to meet the needs of other languages such as Greek, Chinese and Japanese, but even to meet the needs of the demanding Western writer. For example, the ISO Latin 1 code doesn’t include the ligatured ‘œ,’ which is commonly used in French for words such as œuf, œil, sœur, etc.
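A quick way to see which words fit into ISO Latin 1 is to try encoding them. The following Python sketch is only an illustration; the helper name `fits_latin1` is hypothetical.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


for word in ("élève", "sœur"):
    print(word, fits_latin1(word))
```

The accented letters pass, but the ligatured ‘œ’ does not.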

Unicode

The next step was then to move beyond the 8-bit, 256-character barrier. A new standard was thus developed using 16 bits instead of 8, for a total of 65536 possible combinations. This standard is called Unicode, and it covers many languages as well as many symbols and icons.
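As a rough illustration of the extra room that 16 bits provide, this Python sketch lists the Unicode numbers of three characters discussed in this article and checks whether each would fit in 8 or in 16 bits. The `samples` dictionary is a made-up name for the example.

```python
# Number of distinct values available with 8 bits and with 16 bits.
print(2 ** 8, 2 ** 16)

# Three characters from this article, written as escapes to avoid
# depending on the encoding of the source file.
samples = {
    "e acute": "\u00e9",
    "oe ligature": "\u0153",
    "em dash": "\u2014",
}

for name, char in samples.items():
    code = ord(char)
    print(f"{name}: {code}, fits in 8 bits: {code < 2 ** 8}, "
          f"fits in 16 bits: {code < 2 ** 16}")
```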

The problem, here again, is about the slow adoption of Unicode as the universal standard on all computers and all software programs. Unfortunately, we are still very far from having reached a “critical mass” of Unicode-compatible machines. Many pieces of software and many versions of system software still in use out there don’t support Unicode. We are still very much in a transition phase and will remain so for several more years at least.

For the Mac-using writer, this means that he has to adopt a cautious approach and achieve a compromise between his own character requirements as a writer and the current global situation when it comes to Unicode compliance.

Over the past couple of years, I have developed a set of personal rules and tools that enable me to achieve such a compromise. This compromise is satisfactory for me, but it might not be right for you. I still want to share it with you so that you get a better idea of what the issues are for the Mac-using writer and what you can do to address them to your own satisfaction.

Characters in Word Processing

When I write a text (such as this one), I always do so in a word processor first (Word 2001 in my case). This is for a number of reasons: I want the WYSIWYG interface that a text editor such as BBEdit doesn’t offer; I want to be able to use style sheets that automate my formatting tasks; I want to be able to print the text on my laser printer in a quality that approaches that of traditional printed text, with a minimum number of page layout features; I want to be able to exchange the text (with all its formatting, style sheets, etc.) with other users, many of whom are Windows users using Microsoft Office; etc.

So I type my texts in Word 2001. I use all the characters I need or like to use, including accented and ligatured French characters when I write en français, punctuation marks such as the em dash, non-breaking spaces, etc. I do not worry about Unicode at this point. Even if I wanted to, I couldn’t, because Word doesn’t use the Unicode standard for its files. (It actually uses the Windows ASCII code, even on the Mac, as you can see if you open a Word file containing diacritics with a text editor such as BBEdit. This is a pain if you like to use a program such as the excellent SpeedSearch to search through the contents of your Word files to locate a particular occurrence of a string of characters with diacritics. It simply won’t work, and you can only search for strings without diacritics in Word files with SpeedSearch.)

Word does have a Unicode “feature” in the form of a file format called “Unicode Text” that you can select in the “Save As…” dialog. However, although saving a Word file as “Unicode Text” appears to work, reopening the Unicode Text file in Word uncovers a bug that makes Word “forget” characters here and there in your text, effectively rendering it unusable. This bug was already present in Word 98, and still hasn’t been fixed in Word 2001. I do not recommend, therefore, that you use Word’s Unicode feature in the current version. Microsoft has a long history of very iffy standards compliance, and it will probably be a while before Office applications become 100% compliant with the Unicode standard.

I suspect it will be more or less the same with other word processing applications.

Characters in Web Publishing

Once my text is ready in Word 2001, I want to transfer it to my HTML editing application of choice, i.e. BBEdit, to publish it on the web. There are several possible ways to do this: I can save the text as a “plain text” file and open it in BBEdit, I can drag-and-drop the text from Word to BBEdit, or I can use the clipboard to copy and paste it. Whichever approach I choose, however, I will lose some vital character information.

For example, if I have any non-breaking spaces in my document, they will be lost in the process. In this particular article, I have typed non-breaking spaces between “Word” and “2001” in all occurrences of the string “Word 2001,” so as to avoid having an unsightly “2001” at the beginning of a line, separated from the preceding “Word.” Those non-breaking spaces will be lost. (Non-breaking spaces are also widely used in French typography.)

Since there is no way to automatically restore such non-breaking spaces in BBEdit (I can’t expect BBEdit or a BBEdit AppleScript script to guess that I had put a non-breaking space between “Word” and “2001”), I need to “prepare” my text in Word before I transfer it to BBEdit.

To prepare my text, I use a Word macro which automatically replaces all my non-breaking spaces with the HTML entity for the character, i.e. <&nbsp;>, in the selected text.


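' Replace every non-breaking space in the selection with the &nbsp; entity.
With Selection.Find
    .Text = "^s"                    ' ^s is Word's code for a non-breaking space
    .Replacement.Text = "&nbsp;"
    .Forward = True
    .Wrap = wdFindStop              ' do not continue past the selection
    .MatchCase = False
    .Format = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll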

When I work on my text later in BBEdit, I just need to make sure that it won’t automatically replace ‘&’ characters with the HTML entity for that character, i.e. <&amp;>. (And if I want to use the ‘&’ character in my document, I will need to replace it with ‘&amp;’ using a Word macro before I do anything else.)
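The order of those two replacements matters, and a small sketch shows why. This is a Python illustration of the same idea, not the Word macro itself; `prepare_for_html` and `NBSP` are hypothetical names.

```python
NBSP = "\u00a0"  # the non-breaking space character


def prepare_for_html(text: str) -> str:
    """Escape ampersands, then turn non-breaking spaces into entities."""
    # Literal ampersands go first. If they were escaped after the next
    # step, the '&' of every '&nbsp;' would be escaped as well.
    text = text.replace("&", "&amp;")
    return text.replace(NBSP, "&nbsp;")


print(prepare_for_html("Word" + NBSP + "2001 & BBEdit"))
```

Reversing the two steps would produce ‘&amp;nbsp;’, which a browser displays as literal text instead of a space.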

Things get more complicated, however. BBEdit is excellent in terms of standards compliance, but web browsers are not. Both Netscape 4.x and Internet Explorer 4.5/5.x still misinterpret several key characters in the way that BBEdit encodes them — a form of encoding which is perfectly correct, of course, but ends up being useless for my target audience. For example, the HTML entity for the ellipsis (‘…’) is <&hellip;>. Internet Explorer 5.x will interpret this entity properly and display an ellipsis, but Netscape Navigator 4.x will not.

Because Netscape Navigator 4.x is still widely used and will remain so for quite a while, I need, here again, to “prepare” my text before I transfer it to BBEdit. In this particular case, the only solution is, unfortunately, to replace the ellipsis with a string of three dots (which I have done for this article). This is just an example of the type of compromise one currently has to accept.

Why not use Unicode text?

Ideally, what I would like to do, of course, is use Unicode all the way through and not have to encode my characters at any point using hard-to-read HTML entities. Theoretically, I could get closer to that ideal by using Word’s “Unicode Text” file format. However, even if the current versions of the most popular browsers on the Mac and under Windows (Explorer 5 and Netscape 4) do provide some level of support for Unicode, this support is not yet sufficient to enable me to use the standard throughout my web publishing activities.

For example, the Unicode encoding standard requires that the file start with what is called a “Byte-Order Mark” (BOM) for proper decoding. However, if you do include this BOM in your HTML file, Explorer 5 will be unable to display it properly. If you don’t include it, when you close your Unicode file and open it again in BBEdit, since BBEdit doesn’t find this BOM, it decodes the Unicode file improperly and all your character encoding is lost. This effectively makes Unicode encoding unusable for web publishing with the tools I am currently using.
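The role of the BOM can be sketched in a few lines of Python. The article does not say which Unicode encoding form or byte order the tools use, so big-endian UTF-16 is an assumption made for this example only.

```python
import codecs

text = "élève"

# Big-endian UTF-16 with an explicit byte-order mark in front.
with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
# The same bytes without the mark.
without_bom = text.encode("utf-16-be")

# The generic "utf-16" decoder reads the mark, picks the matching byte
# order and removes the mark from the result.
print(with_bom.decode("utf-16"))

# Without the mark, a reader that assumes the other byte order decodes
# every 16-bit unit back to front and produces unrelated characters.
print(ascii(without_bom.decode("utf-16-le")))
```

A program that finds no BOM has to guess the byte order, and a wrong guess scrambles every character, which matches the behavior described above.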

Netscape 4 doesn’t fare much better and keeps displaying all kinds of question marks for characters that it apparently doesn’t understand. I have yet to test Unicode HTML files under Netscape 6 — but, in any case, even if AOL/Netscape does manage to release a proper version of the product, it will be a while before the browser is widely adopted.

HTML Entities

This brings us back to ISO Latin 1 and HTML entities. (Originally, HTML entities were just alphanumeric equivalents of ISO Latin 1 codes. For example, &#233; and &eacute; both refer to the same ISO Latin 1 character, the ‘é’ with an acute accent. Now, with Unicode, however, there are also HTML entities referring to Unicode characters. In the rest of this article, however, when I refer to “HTML entities,” I mean ISO Latin 1 HTML entities.)
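The equivalence of the numeric and the named form can be verified with Python's standard `html` module, used here only as a convenient decoder for the illustration.

```python
import html

numeric = html.unescape("&#233;")    # numeric form: the character's code
named = html.unescape("&eacute;")    # named form: a mnemonic for the same code

print(numeric, named, numeric == named)
print(ord(named))  # the code both forms refer to
```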

Those entities — which are always of the form &xxxx; — are what makes your HTML file with diacritics, special characters, etc. readable on all computers with all browsers (most of them anyway). They are inconvenient, because a word such as “&eacute;l&egrave;ve” is definitely more difficult to read and edit than élève, but, right now, they are the only reliable way for the Mac-using writer to publish texts with diacritics and special characters on the web.

In addition, the complete list of HTML entities, which is the HTML equivalent of the ISO Latin 1 character set, doesn’t include important characters such as the ligatured ‘œ.’

What’s worse, Netscape 4.x is notorious for not supporting several of those HTML entities. For example, if I didn’t use the HTML entity for the superscript ‘2’ above when discussing 2-bit strings (I used “2^2”), it’s not because I didn’t want to or because it’s not part of the ISO Latin 1 code. It’s simply because it’s not supported by Netscape 4.x. “2^2” is a more or less acceptable substitute, but it’s clearly not satisfactory for the demanding writer/web publisher in me. Yet, at this moment in the evolution of Internet standards and their adoption, I don’t really have any choice.

And to make things even more complicated, several web designing tools, such as Adobe GoLive and Macromedia Director, also have their own specific idiosyncrasies and do not fully support the importing (through cut-and-paste or by opening a text file) and exporting of ISO Latin 1 entities.

The Big Compromise

Everyone has different needs and different target audiences. And the current situation is obviously a royal mess for both the average user and the demanding writer/typesetter.

It’s understandable, therefore, that many people have simply given up on trying to make things work for now and gone back to using a very minimal set of characters. For example, many people only use straight English quotes ("), even in languages which would normally use a different type of quotation mark, and many people still use a double dash (--) in lieu of an em dash.

Even then, you can still use automated tools (either Word macros or BBEdit AppleScript scripts, for example) to simplify your work and enable you to still have both a high-quality, ready-for-print word processing document and a web-ready file using only the most common and widely supported characters and HTML entities. Depending on the person with whom I want to share a file, for example, I will use a BBEdit script to replace all em dashes with ‘--’:


tell application "BBEdit 6.0"

	replace "—" using "--" searching in the selection

end tell
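The same kind of downgrade to a minimal character set can be sketched in Python for several characters at once. This is an illustration rather than a replacement for the BBEdit script; the `SIMPLE_FORMS` table and the `simplify` function are hypothetical, and the choice of substitutes simply follows the examples given in this article.

```python
# Typographic characters and the plain substitutes mentioned in the text.
SIMPLE_FORMS = {
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u201c": '"',    # left curly double quote
    "\u201d": '"',    # right curly double quote
}


def simplify(text: str) -> str:
    """Replace each typographic character with its plain substitute."""
    return text.translate(str.maketrans(SIMPLE_FORMS))


print(simplify("\u201cWait\u2026\u201d \u2014 no"))
```

Keeping the substitutions in one table makes it easy to add or remove characters depending on the recipient.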

For my own personal web publishing needs, however, I have decided to adopt a “middle ground” approach by using a combination of HTML entities and some Unicode characters that are already supported by both Explorer 4.5/5.x and Netscape 4. For example, since the ‘&mdash;’ entity is supported by Explorer but not by Netscape, I use the ‘&#8212;’ Unicode entity, which appears to be already supported by both browsers (based on my own testing). Similarly, I use &#339; and &#338; for ‘œ’ and ‘Œ’ respectively.
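Such a “middle ground” conversion could be automated along the following lines. This Python sketch is an assumption about how one might do it, not the tool actually used for this article: it writes named entities for ISO Latin 1 characters and numeric references for everything beyond that range, and it expects plain text that does not already contain entities. The function name `to_entities` is hypothetical.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Encode non-ASCII characters as HTML entities.

    ISO Latin 1 characters get their named entity; characters outside
    that range get a numeric reference such as &#8212;.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "&<>":
            # Characters with a special meaning in HTML are always escaped.
            out.append(f"&{codepoint2name[code]};")
        elif code < 128:
            out.append(ch)
        elif code <= 255 and code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")
    return "".join(out)


print(to_entities("élève"))
print(to_entities("s\u0153ur \u2014 \u0152"))
```

Which characters deserve a named entity and which a numeric one depends on the browsers you are targeting, so the cut-off at 255 is only one possible rule.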

If you want to find out more about which characters are supported by which browsers — depending on your own personal needs, of course — I recommend that you do your own research. One useful web site that I have recently found is “HTML 4.0 Character Entity References”. You can then write automated tools such as Word macros or BBEdit scripts to automate the replacement of your special characters by the appropriate entities.

Of course, it is a “patch-up” kind of approach that makes switching back and forth between your word processor and your web page editor pretty much unrealistic. Once you have converted all your special characters into the HTML entities you have chosen to use, you are unlikely to want to go back and do the reverse conversion in order to be able to continue your editing work in the word processor.

Unfortunately, we will only achieve a smooth, seamless process when all computers, all system software and all applications finally adopt the same, uniform standard. In the meantime, I hope this article has given you a few ideas on how to achieve a reasonable compromise that will both ensure that your web-published work is accessible to all the members of your target audience and enable you to use a set of characters that satisfies your own personal standards as a Mac-using writer.

Next time: “Word Processing & Web Publishing - Part III: Character Styles”

You can contact Pierre Igot at applepeel@applelust.com

Pierre's "Apple Peel" page here at Applelust



©2000-2001 Applelust.com. All rights reserved. No part of this publication may be reproduced in any way without prior, expressed permission from the Publisher. It is the sole property of Applelust.com and its writers, who retain copyright to their own works. If you wish to link to us, please see our Privacy Statement for conditions. Apple, Macintosh, and Mac are trademarks of Apple Computer, Inc, with whom we are in no way affiliated or endorsed.
