SPECIALIST Lexicon

The SPECIALIST lexicon is the newest of the four UMLS® sources under development by the National Library of Medicine as part of the Unified Medical Language System® (UMLS) project. The SPECIALIST lexicon is an English language lexicon with many biomedical terms. The lexicon has been developed in the context of the SPECIALIST natural language processing project at the National Library of Medicine. The 1994 version of the lexicon includes approximately 60,000 lexical records and over 120,000 forms.

Scope and Content of the SPECIALIST lexicon

The lexicon entry for each word or term records syntactic, morphological, and orthographic information. Lexical entries may be single or multi-word terms. Entries that share a base form and spelling variants, if any, are collected into a single lexical record. The base form is the uninflected form of the lexical item: the singular form in the case of a noun, the infinitive form in the case of a verb, and the positive form in the case of an adjective or adverb.

Lexical information includes syntactic category, inflectional variation (e.g., singular and plural for nouns, the conjugations of verbs, the positive, comparative, and superlative for adjectives and adverbs), and allowable complementation patterns (i.e., the objects and other arguments that verbs, nouns, and adjectives can take).

The lexicon recognizes eleven syntactic categories, or parts of speech: verbs, nouns, adjectives, adverbs, auxiliaries, modals, pronouns, prepositions, conjunctions, complementizers, and determiners.

The basic sentence patterns of a language are determined by the number and nature of the complements taken by verbs, since the complementation of the main verb largely determines the structural skeleton of a sentence. The lexicon recognizes five broad complementation patterns: intransitive, transitive, ditransitive, linking, and complex-transitive. Verb entries also encode each of the inflected forms (the principal parts of the verb). Verbs are inflectionally classified as regular, Greco-Latin regular, or irregular.

Noun entries describe the inflection (pluralization) and spelling variations of nouns. Complementation patterns for nouns and nominalization information are also included when relevant.

In addition to inflection and complement codes, adjectives in the lexicon have position codes to indicate the syntactic positions in which they may occur. An adjective may be a qualitative, classifying, or color adjective.

Adverbs in the lexicon are coded to indicate their modification properties. The lexicon recognizes sentence, verb phrase, and intensifier type adverbs, and classifies sentence and verb phrase adverbs into manner, temporal, and locative types.

Lexical items are selected for coding from a variety of sources. Approximately 20,000 lexical items from the UMLS Test Collection of MEDLINE® citation records, together with lexical items that appear in both the UMLS Metathesaurus® and Dorland's Illustrated Medical Dictionary, form the core of the entries. In addition, an effort has been made to include words from the general English vocabulary. The 10,000 most frequent words listed in The American Heritage Word Frequency Book and the list of 2,000 words used in the controlled definitions in Longman's Dictionary of Contemporary English have also been coded.

Lexical Tools

The 1994 release of the UMLS knowledge sources includes tools that may be of use to developers who work with the UMLS knowledge sources. Included on the CD-ROM is a set of lexical programs, indexes, and databases. The lexical programs generate a range of variations for English lexical items and should be useful for recognizing lexical variation in biomedical terminologies and texts. The lexical variant generation (lvg) programs use data from the SPECIALIST lexicon as they compute different forms of lexical items.

The programs consist of several different modules that can be combined in a variety of ways. Each module is a single lvg option. Options include lowercasing, uninversion, sorting words in a multi-word term, stopword removal, possessive marker removal, punctuation removal, and generation of inflectional and derivational variants. Three indexes are also provided: a simple word index of all words in Metathesaurus strings, a normalized word index, and a normalized string index. The creation of the normalized indexes involves one set of lvg options.
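The way several options can be chained to normalize a string may be easier to see in a small sketch. The Python below is an illustration only and is not the lvg code: the function names, the stopword list, and the order of the steps are all assumptions made for this example, and the generation of inflectional and derivational variants (which requires lexicon data) is left out.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"of", "the", "and", "with", "for", "in", "to", "by", "on"}


def uninvert(term):
    """Reverse comma-separated segments: 'Disease, Hodgkin's' -> 'Hodgkin's Disease'."""
    parts = [p.strip() for p in term.split(",") if p.strip()]
    return " ".join(reversed(parts))


def strip_possessive(term):
    """Remove possessive markers: 's after a word, or a bare ' after a final s."""
    return re.sub(r"'s\b|(?<=s)'(?!\w)", "", term)


def remove_punctuation(term):
    """Replace every character that is neither a word character nor whitespace with a space."""
    return re.sub(r"[^\w\s]", " ", term)


def remove_stopwords(term):
    """Drop words found in the (assumed) stopword list; expects lowercased input."""
    return " ".join(w for w in term.split() if w not in STOPWORDS)


def sort_words(term):
    """Sort the words of a multi-word term alphabetically."""
    return " ".join(sorted(term.split()))


# One possible ordering of the options; each step is a plain str -> str function.
STEPS = [uninvert, strip_possessive, remove_punctuation, str.lower,
         remove_stopwords, sort_words]


def normalize(term, steps=STEPS):
    """Apply each step in turn and return the normalized string."""
    for step in steps:
        term = step(term)
    return term


print(normalize("Cancer of the Lung"))   # cancer lung
print(normalize("Disease, Hodgkin's"))   # disease hodgkin
```

Because each option is an independent function, a different index could be built simply by passing a different list of steps to normalize.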

Three lexical databases that may be useful for some developers have also been provided on the CD-ROM. These are a file of known derivational variants (alternations such as "aphasic/aphasia"), a file of closely related terms that mean the same thing but may have a different syntactic category (e.g., "hepatocellular/liver cells"), and a file of spelling alternations (e.g., "foetal/fetal").
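One way such paired data might be used is as a symmetric lookup, so that a term retrieves its variant regardless of which member of the pair was seen. The sketch below is a hedged illustration: it assumes one "a/b" pair per line, which is only the notation used in the examples above and not a statement about the actual file layout, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Split lines of the assumed form 'a/b' into (a, b) tuples, skipping blanks."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        left, _, right = line.partition("/")
        pairs.append((left.strip(), right.strip()))
    return pairs


def build_variant_index(pairs):
    """Map each member of a pair to the other, in both directions."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def expand(term, index):
    """Return the term together with its directly related variants, sorted."""
    return sorted({term} | index.get(term, set()))


sample = ["foetal/fetal", "aphasic/aphasia", "hepatocellular/liver cells"]
index = build_variant_index(parse_pairs(sample))
print(expand("fetal", index))       # ['fetal', 'foetal']
print(expand("liver cells", index)) # ['hepatocellular', 'liver cells']
```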

Distribution Formats

The SPECIALIST lexicon is provided in two formats: a relational table format and a unit record format. The information associated with each lexical entry includes a unique identifier, a base form, a syntactic category code, certain agreement information, complementation information if relevant, and various other properties relevant to the particular lexical entry. Data for lexical entries is represented in ten relational tables.

The lexicon relational format is not fully normalized. By design, there is duplication of data among different relations and within certain relations. Developers will need to decide about the extent to which this redundancy should be retained, reduced, or increased for their specific applications. Among other tables, there are separate tables for agreement and inflection information, for complementation patterns, for spelling variants, and for abbreviations and acronyms and their fully expanded forms.

The unit record format is a frame structure consisting of slots and fillers. The slots are the basic lexical attributes, and the fillers express the possible values of those attributes for that particular lexical item. The record for "anaesthetic" given below illustrates some of the features of the lexical unit record:

    {base=anaesthetic
    spelling_variant=anesthetic
    entry=E0008769
        cat=noun
        variants=reg
    entry=E0008770
        cat=adj
        variants=inv
        position=attrib(3)
    }

The base form "anaesthetic" and its spelling variant "anesthetic" determine a lexical record consisting of a noun entry and an adjective entry. The variants= slot contains a code indicating the inflectional morphology of the entry. The filler reg in the noun entry indicates that the noun "anaesthetic" is a count noun which undergoes regular English plural formation ("anaesthetics"); inv in the variants= slot of the adjective entry indicates that the adjective "anaesthetic" does not form a comparative or superlative. The position= slot indicates that the adjective "anaesthetic" is attributive and appears after color adjectives in the normal adjective order.

The lexical programs are written in the C programming language and are provided as ASCII source code.
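The slot-and-filler structure of the unit record for "anaesthetic" can be read into a nested data structure with very little code. The following Python is a minimal sketch, assuming only the layout visible in that sample record (one slot=filler pair per line, braces delimiting the record, and each entry= line opening a new entry); the function name and the shape of the resulting dictionary are choices made for this example, not part of the distribution.

```python
RECORD = """\
{base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
    cat=noun
    variants=reg
entry=E0008770
    cat=adj
    variants=inv
    position=attrib(3)
}
"""


def parse_unit_record(text):
    """Parse one unit record into a dict with a base, spelling variants, and entries."""
    record = {"base": None, "spelling_variants": [], "entries": []}
    current = None  # the entry currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip().lstrip("{").strip()
        if not line or line == "}":
            continue
        slot, _, filler = line.partition("=")
        if slot == "base":
            record["base"] = filler
        elif slot == "spelling_variant":
            record["spelling_variants"].append(filler)
        elif slot == "entry":
            current = {"id": filler}
            record["entries"].append(current)
        elif current is not None:
            # Other slots are stored as lists, on the assumption that a slot may repeat.
            current.setdefault(slot, []).append(filler)
    return record


rec = parse_unit_record(RECORD)
for entry in rec["entries"]:
    print(entry["id"], entry["cat"], entry["variants"])
```

Storing fillers as lists is a conservative choice here: the sample shows each slot once per entry, but nothing in it rules out repeated slots.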

Application Procedures

Those who wish to obtain copies of UMLS products are required to sign a one-year experimental agreement with the NLM. Sample records, documentation, and copies of the experimental agreement are available from the NLM anonymous FTP service at nlmpubs.nlm.nih.gov (UMLS files and documents are located in the /nlmpubs/umls directory), or by writing to:

Betsy L. Humphreys
UMLS Project Officer
National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894
FAX 301/496-4450

Other fact sheets in the UMLS series:

  1. Unified Medical Language System
  2. UMLS Metathesaurus
  3. UMLS Semantic Network
  4. UMLS Information Sources Map

Copies are available from:

Office of Public Information
National Library of Medicine
8600 Rockville Pike
Bethesda, Maryland 20894

Internet

UMLS fact sheets are also available to Internet users through FTP (File Transfer Protocol). To access them, ftp to nlmpubs.nlm.nih.gov and log in as: anonymous.


NLM HyperDOC / SPECIALIST Lexicon / June 1994