:merzwaren:

More Text Extractors

Last updated on March 22, 2002


What The Heck Is This?

More Text Extractors (MTE) is an experimental suite of Find by Content plug-ins, i.e., extension modules that help Sherlock understand and index the textual content of more file formats.

There are currently two plug-ins in this suite: the Unicode Text Extractor and the Style Text Extractor. The Unicode Text Extractor allows Sherlock to index plain text files encoded in UTF-16, one of the encoding forms defined by the Unicode standard. The Style Text Extractor allows Sherlock to index documents created by our Style scriptable text editor. The plug-in supports documents generated by versions 1.7 and 1.8 of Style, but not by earlier versions.


About Unicode Text Files

Unicode is the universal character set: an encoding capable of representing the most widely used writing systems of the world, supported by more and more vendors, platforms, programming languages and applications. It defines tens of thousands of characters and has room for more than a million. The writing systems supported by Unicode 3.0.1 (the most recent version of the standard at the time of this writing) include Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Chinese, Japanese and Korean.

Unicode comes in several different flavors known as transformation formats, including UTF-8, UTF-16 and UTF-32. The Unicode Text Extractor specifically handles UTF-16 files. On the Macintosh platform, these files have a 'utxt' file type. On other platforms, such as Windows, where file types are not available, you can usually tell a UTF-16 file from other files by looking for a characteristic 'signature' at the beginning of the file, known as the byte order mark. This signature is a strong hint that the file is UTF-16-encoded, and also allows the reader to tell so-called little-endian files (typical of Windows) from big-endian files (common on just about every other platform). The Unicode Text Extractor can handle both variants of UTF-16.
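As a rough illustration of the byte order mark check described above, here is a minimal C++ sketch. It is not taken from the More Text Extractors sources: the names Utf16ByteOrder, detectUtf16ByteOrder and readCodeUnit are hypothetical, and the code only assumes the standard encoding of the byte order mark (U+FEFF), which appears as the bytes FE FF in big-endian files and FF FE in little-endian files.

```cpp
#include <cstddef>
#include <cstdint>

// Byte order inferred from the first two bytes of a file.
enum class Utf16ByteOrder { Unknown, BigEndian, LittleEndian };

// Hypothetical helper (an assumption, not the plug-in's actual code):
// looks for the UTF-16 byte order mark, U+FEFF, at the start of a buffer.
// Big-endian files start with FE FF; little-endian files start with FF FE.
Utf16ByteOrder detectUtf16ByteOrder(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size < 2) {
        return Utf16ByteOrder::Unknown;  // too short to hold a signature
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16ByteOrder::BigEndian;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16ByteOrder::LittleEndian;
    }
    return Utf16ByteOrder::Unknown;  // no signature: the caller must guess
}

// Combines two consecutive bytes into one UTF-16 code unit.
// Anything other than LittleEndian is read as big-endian.
std::uint16_t readCodeUnit(const std::uint8_t* p, Utf16ByteOrder order) {
    if (order == Utf16ByteOrder::LittleEndian) {
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

A reader would typically call detectUtf16ByteOrder on the first bytes of the file, skip the two signature bytes if one was found, and then decode the rest with readCodeUnit. As noted above, the signature is only a strong hint: a file without one may still be UTF-16, so a real extractor needs a fallback, such as the 'utxt' file type on the Macintosh.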

There are several Macintosh programs that understand UTF-16 files, including Style, Tex-Edit Plus, BBEdit and Microsoft Word.


System Requirements

More Text Extractors requires Mac OS 8.6 or newer. It has only been tested with Mac OS 9.0.4.


Installation

To install More Text Extractors, drag 'Unicode Text Extractor' and 'Style Text Extractor' onto the System Folder. The Finder will route these items to the appropriate location in your Extensions folder, where they supplement the PDF and HTML extractors that come standard with the system software.

The plug-ins will be available immediately. There is no need to reboot your Macintosh.

The plug-ins will add two new entries to the File Extension Mappings database maintained by the system. This is required for proper operation of the Find by Content library. One entry maps the '.uni' file extension to UTF-16 files; the other maps the '.style' file extension to Style documents. You can inspect and modify the mappings database at any time using the File Exchange or Internet control panels. In the Internet control panel, set the user mode to Advanced, then click the Advanced tab.


Distribution

More Text Extractors is freeware: it is copyrighted, but it can be used and redistributed freely. Full C++ source code is included. Reuse of the source code in commercial applications is restricted (please contact me for details).


Copyright © 2000-2002 Merzwaren