This release is pre-ALPHA and is made available to allow testing
outside the current development environment.  More stable future
releases with better documentation are under development.

Please comment on problems directly to Alan W Black (awb@cs.cmu.edu)

This directory contains scripts and models for the expansion of
non-standard words to simple words.  That is, this software is
designed to expand arbitrary tokens in text to simple words, expanding
numbers, abbreviations, roman numerals etc.

This work is a product of the CLSP Summer Workshop at Johns Hopkins
University 1999.

Authors:
    Alan W Black (awb@cs.cmu.edu)
    Stan Chen (sfc@cs.cmu.edu)
    Shankar Kumar (skumar@clsp.jhu.edu)
    Mari Ostendorf (mo@rcs.ee.washington.edu)
    Chris Richards (crichard@wso.williams.edu)
    Richard Sproat (rws@research.att.com)

Please read
    http://www.clsp.jhu.edu/ws99/projects/normal/
for more details on the project, and the final report there for a
description of the scientific aspects of this work.

---------------------------------------------------------------------

The distribution consists of a number of parts:

nsw-X.X.tar.gz
    Basic expansion scripts and basic expansion models for four
    domains.  Includes scripts for building new models from data.
    This part of the distribution is free software and may be used
    for any purpose, commercial or otherwise.

nsw-data-xxx.tar.gz
    XML marked up data (raw, labels and marked up XML) for various
    databases.  These fall under varying licences and are not all
    freely re-distributable.

    nsw-data-rfr.tar.gz
        From rec.food.recipes, freely re-distributable.
    nsw-data-pc110.tar.gz
        From the pc110 mailing list, freely re-distributable.  The
        PC110 is an IBM palmtop PC; the list is technical,
        e-mail-like text.
    nsw-data-nantc.tar.gz
        Data is a subset of the LDC's North American News Text Corpus
        (Wall Street Journal, New York Times, LA Times and two
        Reuters news sources).  You must have access to the LDC's CD
        to use this.
        Scripts are included to take the raw data from the CD and
        annotate it with the provided labels.
    nsw-data-classifieds.tar.gz
        Classified real estate ads from various sources as collected
        by the LDC.

============
REQUIREMENTS
============

You must have gnumake (any version) and the festival speech synthesizer
installed.  GNU make is available from any good ftp site, while Festival
(1.4.0 or later) is available from
    http://www.cstr.ed.ac.uk/projects/festival.html
or
    http://www.speech.cs.cmu.edu/festival/index.html

In this release festival is simply used as a scripting language, and
technically only the CMU lexicon needs to be installed in addition to
the festival executable, though I would recommend installing a complete
US English voice.  Festival is used as it contains all the sub-parts
necessary to run the basic expansion model (e.g. lexical access, CART
interpretation, tokenizing, regex support, ngrams and Viterbi decoding)
even though we are not doing synthesis here.

Later releases that support model building will require use of more
aspects of Festival and the Edinburgh Tools (notably the Wagon CART
tree builder) and possibly other FSM libraries (FSMTOOLS and LEXTOOLS
from AT&T) for some aspects of building.

At present these scripts have only been tested under Unix systems
(Linux, FreeBSD and Solaris).  There is nothing that explicitly stops
them from working under NT (everything in the expander is Festival
internal) but we have not looked at this at all.

============
INSTALLATION
============

At present the system only offers the ability to expand texts using
the pre-built domain models (nantc, classifieds, pc110 and rfr).

To install

    cd config
    cat config-dist >config
    cd ..
    gnumake

The make process is very short; it merely makes a few scripts based on
the pathname of your festival binary.

The program festival should be in your path for this to work, or you
may explicitly set the variable FESTIVAL in config/config as in

    FESTIVAL := /usr/local/festival/bin/festival

Note that the default automatic setting of this variable (through
"which festival") may not work properly in multiple NFS environments.

=====
USAGE
=====

Basic usage will expand an arbitrary text file (no XML markup is
required) into words.

    bin/nsw_expand -domain classifieds examples/ads2.txt -output ads2.word

Various output formats are supported (more to follow).  The default
output is simply words with the whitespace/newlines from the original;
this won't be useful in many cases.

    bin/nsw_expand -domain classifieds examples/ads2.txt -format opl -output ads2.word

Format opl (one per line) outputs each found token, its NSW tag, a
binary flag telling you if this is the first token in a split or not
(tokens that weren't separated from previous tokens by white space
will have 0), and then the list of words that the token expands to.
Other formats will be added when we have a better idea of what is
needed.

Multiple files may be expanded by listing multiple input files and
specifying an -output argument containing a %s, e.g.

    bin/nsw_expand -domain classifieds examples/*.txt -format opl -output out/%s.word

The example database files are marked up in XML format (called NSWML).
The input mode may be specified on the command line

    bin/nsw_expand -domain classifieds -mode NSWML data/classifieds/xml/adsBG.aa.xml -mode NSWML

======
FUTURE
======

The beginnings of model building are included in this release, but this
is neither fully tested nor documented yet.  We also intend to provide
full scripts and instructions for building expansion models for
unlabelled data.

Documentation, databases, test suites etc. are obviously currently
missing.

If you have specific requests, or are using this work in any way,
please let us know as we want this to be as useful as possible.  Minor
changes as well as large recommendations are welcome.
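
As a rough illustration of consuming the opl output described under
USAGE, the Python sketch below reads such lines into records.  The
exact column layout of opl is not specified in this README, so the
sketch ASSUMES whitespace-separated fields in the order token, NSW
tag, flag, words.  The names parse_opl and OplRecord, and the sample
lines and tag names, are invented for illustration and are not part of
this distribution.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class OplRecord:
    token: str            # token as found in the input text
    tag: str              # NSW tag assigned to the token
    first_in_split: bool  # True when the flag field is "1"
    words: List[str]      # words the token expands to (may be empty)


def parse_opl(lines: Iterable[str]) -> Iterator[OplRecord]:
    """Yield one record per non-blank line.

    ASSUMPTION: fields are whitespace-separated, in the order
    token, tag, flag, word, word, ...
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3 or fields[2] not in ("0", "1"):
            raise ValueError(f"unexpected opl line: {line!r}")
        token, tag, flag = fields[:3]
        yield OplRecord(token, tag, flag == "1", fields[3:])


# Invented sample lines, NOT real nsw_expand output.
SAMPLE = [
    "2 NUM 1 two",
    "BR EXPN 0 bedroom",
    "",
    "apt EXPN 1 apartment",
]

if __name__ == "__main__":
    for rec in parse_opl(SAMPLE):
        print(rec.token, rec.tag, rec.first_in_split, " ".join(rec.words))
```

To read a real output file, pass an open file object instead of the
sample list, for example parse_opl(open("ads2.word")).  If the real
layout differs from the assumption above, only the field-splitting
line should need to change.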