Writing Files with the Encoder Fallbacks

If you are using the DLL version of the Microsoft Ajax Minifier, you might have noticed that it does not include any file input or output functionality. The DLL only works with in-memory strings, so your application has to handle reading from and writing to the hard drive itself.

You don’t need to do anything special to read JavaScript or CSS files. Simply use the .NET StreamReader class (System.IO namespace) or any other existing file-reading approach, although you may have to provide the appropriate Encoding object if the encoding can’t be determined from the file content itself.

Writing files, however, can be a little more complicated. If your JavaScript doesn’t contain anything outside the ASCII range, or if you are writing files in the UTF-8 or UTF-16 character encodings, you don’t need to do anything special. The complication arises when you want to write in a different encoding and your code contains characters that can’t be represented in that encoding.

Take, for example, this JavaScript snippet:

alert("你好！")

It’s simple, but it contains Unicode Chinese characters. The string representation of this code is twelve characters long, and the code points are:

0061 006c 0065 0072 0074 0028 0022 4f60 597d ff01 0022 0029

If you write this string (held in the minifiedCode variable) with the simplest possible StreamWriter code, like so, you will create a file encoded as UTF-8 (the default encoding for StreamWriter):

using (var writer = new StreamWriter(@"c:\test.js"))
{
    writer.Write(minifiedCode);
}

If you open the file up in a binary editor, you will see the characters have been written in eighteen bytes:

61 6c 65 72 74 28 22 e4 bd a0 e5 a5 bd ef bc 81 22 29

The Chinese characters have each been encoded into three-byte UTF-8 sequences. The beauty of this is that UTF-8 can represent any character in the Unicode character set, which is essentially every character you will want to use on a computer. Nothing special needs to be done with this file, because browsers and any other tools that might read it back (like AjaxMin) will correctly decode the UTF-8 characters. Loading this file in Internet Explorer, for example, will get you the original text:

alert("你好！")
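If you would like to reproduce the code points and the byte sequence listed above, a small helper along these lines should do it. This is only a sketch using standard .NET calls; the EncodingInspection class and its method names are made up for this illustration and are not part of AjaxMin.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for this article; not part of AjaxMin.
public static class EncodingInspection
{
    // Returns the UTF-16 code units of the text as space-separated
    // four-digit hexadecimal values.
    public static string CodeUnits(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("x4")));
    }

    // Returns the bytes produced by the given encoding as space-separated
    // two-digit hexadecimal values.
    public static string HexBytes(string text, Encoding encoding)
    {
        return string.Join(" ", encoding.GetBytes(text).Select(b => b.ToString("x2")));
    }
}
```

Calling CodeUnits on the sample string gives the twelve code points, and HexBytes with Encoding.UTF8 gives the eighteen bytes. Passing Encoding.ASCII instead shows 3f in place of each of the three Chinese characters.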

That’s all fine and dandy, but what if you want to save your files in another text encoding, say, ASCII or Big-5? Let’s assume we want to write this JavaScript code as an ASCII file. Obviously the ASCII character set does not include the Chinese characters in our code snippet. If we do the wrong thing and simply pass the default ASCII encoding to the StreamWriter constructor, we will lose our Chinese characters:

using (var writer = new StreamWriter(@"c:\test.js", false, Encoding.ASCII))
{
    writer.Write(minifiedCode);
}

That code produces a file with twelve bytes, and the three Chinese characters will all be changed into question marks (0x3F). Probably not what you want.

To handle characters outside the encoding scheme, you need to use an EncoderFallback class. To make a long story short, an encoder fallback is called whenever a text encoding object is asked to write a character that isn’t in the destination encoding. The default fallback simply writes a question mark (hence the lost Chinese characters).

Microsoft Ajax Minifier provides two encoder fallback classes for use in your code: JSEncoderFallback for JavaScript and CssEncoderFallback for CSS. All they do is take the character that cannot be represented in the output encoding you are using and escape it as appropriate for the particular language. In JavaScript, characters are written with the Unicode escape sequence \uXXXX if the character is in the standard 16-bit Unicode range, and as the surrogate pair \uUUUU\uLLLL if the character is in the extended Unicode range (all digits in hexadecimal). In CSS, the escape sequence is \XXXXXX, consisting of up to six hexadecimal digits; if fewer than six digits are needed, a space character is appended to signify the end of the escape sequence.
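To make the two escape formats concrete, here is a minimal sketch that formats a single code point each way. The EscapeFormatSketch class and its method names are hypothetical and written only for this illustration; the code follows the formats as described above and is not taken from the AjaxMin source, so details such as hex digit casing are assumptions.

```csharp
using System;
using System.Text;

// Hypothetical illustration of the escape formats; not AjaxMin code.
public static class EscapeFormatSketch
{
    // JavaScript form: one \uXXXX escape per UTF-16 code unit, so a code
    // point above U+FFFF comes out as a surrogate pair of two escapes.
    // char.ConvertFromUtf32 throws for values that are not valid code points.
    public static string JsEscape(int codePoint)
    {
        var sb = new StringBuilder();
        foreach (char unit in char.ConvertFromUtf32(codePoint))
        {
            sb.AppendFormat("\\u{0:x4}", (int)unit);
        }
        return sb.ToString();
    }

    // CSS form: a backslash plus up to six hex digits, followed by a space
    // when fewer than six digits were written.
    public static string CssEscape(int codePoint)
    {
        string hex = codePoint.ToString("x");
        return hex.Length < 6 ? "\\" + hex + " " : "\\" + hex;
    }
}
```

For U+4F60 this sketch yields \u4f60 for JavaScript and \4f60 followed by a space for CSS. For a character in the extended range such as U+1F600, the JavaScript form becomes the surrogate pair \ud83d\ude00.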

Now, back to our example. If we want to write the file in the ASCII encoding, we need to clone the default ASCII encoding object and set the EncoderFallback property to an instance of the JSEncoderFallback class. We have to clone it because the shared Encoding.ASCII object is read-only and will throw an exception if you try to set its EncoderFallback property directly.

var encoding = (Encoding)Encoding.ASCII.Clone();
encoding.EncoderFallback = new JSEncoderFallback();
using (var writer = new StreamWriter(@"c:\test.js", false, encoding))
{
    writer.Write(minifiedCode);
}

Now when we write our sample code, we will get the following ASCII text:

alert("\u4f60\u597d\uff01")

Loading that code in a browser that executes it will get you the desired alert box, showing the original text 你好！
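If you are curious how an encoder fallback plugs into the .NET encoding machinery, the following is a minimal sketch of a custom fallback that emits \uXXXX escapes. It is not the AjaxMin JSEncoderFallback implementation; the SketchJsEscapeFallback and FallbackDemo names are hypothetical, and the sketch assumes only the standard System.Text base classes EncoderFallback and EncoderFallbackBuffer.

```csharp
using System;
using System.Text;

// Hypothetical sketch; NOT the AjaxMin JSEncoderFallback source.
public sealed class SketchJsEscapeFallback : EncoderFallback
{
    // Worst case: a surrogate pair becomes two six-character escapes.
    public override int MaxCharCount { get { return 12; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new EscapeBuffer();
    }

    private sealed class EscapeBuffer : EncoderFallbackBuffer
    {
        private string _pending = string.Empty; // replacement text being handed out
        private int _position;                  // next character of _pending to return

        public override int Remaining { get { return _pending.Length - _position; } }

        // Called by the encoding for a single character it cannot encode.
        public override bool Fallback(char charUnknown, int index)
        {
            _pending = string.Format("\\u{0:x4}", (int)charUnknown);
            _position = 0;
            return true;
        }

        // Called by the encoding for a surrogate pair it cannot encode.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            _pending = string.Format("\\u{0:x4}\\u{1:x4}", (int)charUnknownHigh, (int)charUnknownLow);
            _position = 0;
            return true;
        }

        // The encoding pulls the replacement one character at a time.
        public override char GetNextChar()
        {
            return _position < _pending.Length ? _pending[_position++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (_position > 0) { _position--; return true; }
            return false;
        }

        public override void Reset()
        {
            _pending = string.Empty;
            _position = 0;
        }
    }
}

public static class FallbackDemo
{
    // Encodes the text as ASCII bytes through the sketch fallback and
    // decodes the bytes back into a string for inspection.
    public static string ToEscapedAscii(string text)
    {
        var encoding = (Encoding)Encoding.ASCII.Clone();
        encoding.EncoderFallback = new SketchJsEscapeFallback();
        return Encoding.ASCII.GetString(encoding.GetBytes(text));
    }
}
```

The encoding calls Fallback whenever it meets a character it cannot represent, then reads the replacement text through GetNextChar and encodes those characters instead. Because the replacement consists only of ASCII characters, the output stays valid in the target encoding.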

There are only a handful of default encoders that you can clone. If you want to encode your output in a different encoding, I would suggest using the Encoding.GetEncoding method, passing in the name of the desired encoding along with the encoder and decoder fallbacks:

var encoding = Encoding.GetEncoding(
    "iso-8859-1",
    new JSEncoderFallback(),
    new DecoderReplacementFallback("?"));
using (var writer = new StreamWriter(@"c:\test.js", false, encoding))
{
    writer.Write(minifiedCode);
}

The encoding names you can use are defined in the Info.Name column of the data table on this MSDN page. For example, if you want to encode your output in the Big-5 encoding scheme, pass “big5” to the GetEncoding method.
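An unrecognized encoding name makes GetEncoding throw, so if the name comes from configuration or user input you may want to guard the lookup. The helper below is a sketch with hypothetical names (EncodingLookup, TryGetEncoding) and uses only standard .NET calls. One assumption worth checking for your runtime: on .NET Core and later, code-page encodings such as big5 are generally only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

// Hypothetical helper for this article; not part of AjaxMin.
public static class EncodingLookup
{
    // Tries to create an encoding by name with the supplied encoder fallback.
    // Returns false, instead of throwing, when the name is not supported.
    public static bool TryGetEncoding(string name, EncoderFallback fallback, out Encoding encoding)
    {
        try
        {
            encoding = Encoding.GetEncoding(name, fallback, DecoderFallback.ReplacementFallback);
            return true;
        }
        catch (ArgumentException)
        {
            // The name is not a valid or supported encoding name.
            encoding = null;
            return false;
        }
        catch (NotSupportedException)
        {
            // The platform does not support the code page behind the name.
            encoding = null;
            return false;
        }
    }
}
```

In the scenario described here you would pass an instance of JSEncoderFallback or CssEncoderFallback as the fallback argument and hand the resulting encoding to the StreamWriter constructor.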

Generally speaking, it’s a good idea to use an encoding object with the appropriate AjaxMin-provided EncoderFallback object whenever you write JS or CSS code, regardless of which encoding scheme you are using or whether your code contains special characters. Using the provided EncoderFallback objects will always generate proper JavaScript or CSS code.

What If I Just Want A JS-Encoded String?

That is also easily handled, but it will take a little more work. What we need to do is write the minified code (internally stored by .NET as a Unicode string) to an array of ASCII-encoded bytes using an instance of JSEncoderFallback, and then read those bytes back into a .NET string using a StreamReader. For instance, you could define a function like this:

private static string GetAsciiEncodedOutput(string minified)
{
    var encoder = (Encoding)Encoding.ASCII.Clone();
    encoder.EncoderFallback = new Microsoft.Ajax.Utilities.JSEncoderFallback();
    var encodedBytes = encoder.GetBytes(minified);

    using (var memoryStream = new MemoryStream(encodedBytes))
    {
        using (var reader = new StreamReader(memoryStream))
        {
            return reader.ReadToEnd();
        }
    }
}
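Since the bytes produced this way are plain ASCII, one possible variation is to decode them directly with the encoding object rather than going through a MemoryStream and StreamReader. The sketch below takes the fallback as a parameter so that it compiles without the AjaxMin DLL; the AsciiRoundTrip class and ToAsciiString method are hypothetical names, and in the scenario above you would pass a JSEncoderFallback instance.

```csharp
using System;
using System.Text;

// Hypothetical helper for this article; not part of AjaxMin.
public static class AsciiRoundTrip
{
    // Encodes the text to ASCII bytes, letting 'fallback' decide what
    // replaces characters outside the ASCII range, then decodes the
    // bytes straight back into a .NET string.
    public static string ToAsciiString(string text, EncoderFallback fallback)
    {
        // Clone first: the shared Encoding.ASCII instance is read-only.
        var encoding = (Encoding)Encoding.ASCII.Clone();
        encoding.EncoderFallback = fallback;

        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

With the standard EncoderReplacementFallback("?") this turns each non-ASCII character into a question mark, which mirrors the default behavior described earlier; with an escaping fallback the same call returns the escaped source text.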

Last edited May 18, 2013 at 2:04 AM by ronlo, version 5
