Lexical information includes syntactic category, inflectional variation (e.g., singular and plural for nouns, the conjugations of verbs, the positive, comparative, and superlative for adjectives and adverbs), and allowable complementation patterns (i.e., the objects and other arguments that verbs, nouns, and adjectives can take).
The lexicon recognizes eleven syntactic categories, or parts of speech: verbs, nouns, adjectives, adverbs, auxiliaries, modals, pronouns, prepositions, conjunctions, complementizers, and determiners.
The basic sentence patterns of a language are determined by the number and nature of the complements taken by verbs, since the complementation of the main verb largely determines the structural skeleton of a sentence. The lexicon recognizes five broad complementation patterns: intransitive, transitive, ditransitive, linking, and complex-transitive. Verb entries also encode each of the inflected forms (the principal parts of the verb). Verbs are inflectionally classified as regular, Greco-Latin regular, or irregular.
Noun entries describe the inflection of the noun (pluralization) and its spelling variations. Complementation patterns for nouns and nominalization information are also included when relevant.
In addition to inflection and complement codes, adjectives in the lexicon have position codes that indicate the syntactic positions in which they may occur. An adjective may be a qualitative, classifying, or color adjective.
Adverbs in the lexicon are coded to indicate their modification properties. The lexicon recognizes sentence, verb phrase, and intensifier adverbs, and classifies sentence and verb phrase adverbs into manner, temporal, and locative types.
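As a rough illustration of what a "regular" inflection class implies for nouns, the sketch below applies the ordinary English spelling rules for plural formation. The function name regular_plural is hypothetical, and the rules are the familiar textbook ones, not necessarily the exact rule set used by the lexicon or its programs.

```python
def regular_plural(noun: str) -> str:
    """Form a plural using ordinary English spelling rules (illustrative only)."""
    vowels = "aeiou"
    # Sibilant endings take "-es": reflex -> reflexes, branch -> branches.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + "y" becomes "-ies": biopsy -> biopsies.
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # Everything else simply takes "-s": anaesthetic -> anaesthetics.
    return noun + "s"


if __name__ == "__main__":
    for word in ("anaesthetic", "biopsy", "reflex", "assay"):
        print(word, "->", regular_plural(word))
```

Nouns in the Greco-Latin regular or irregular classes would not be handled by these rules and would need their forms listed or derived separately.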
Lexical items are selected for coding from a variety of sources. Approximately 20,000 lexical items from the UMLS Test Collection of MEDLINE® citation records, together with lexical items that appear in both the UMLS Metathesaurus® and Dorland's Illustrated Medical Dictionary, form the core of the entries. In addition, an effort has been made to include words from the general English vocabulary. The 10,000 most frequent words listed in The American Heritage Word Frequency Book and the list of 2,000 words used in the controlled definitions in Longman's Dictionary of Contemporary English have also been coded.
The lexical programs consist of several different modules that can be combined in a variety of ways. Each module is a single lvg option. Options include lowercasing, uninversion, sorting the words in a multi-word term, stopword removal, possessive marker removal, punctuation removal, and generation of inflectional and derivational variants. Three indexes are also provided: a simple word index of all words in Metathesaurus strings, a normalized word index, and a normalized string index. The normalized indexes are created with one particular set of lvg options.
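To make the idea of combinable modules concrete, the following is a minimal sketch of several of the options listed above written as small string functions and chained into one normalization step. It is not the lvg implementation: the function names, the stopword list, and the order in which the steps are applied are all assumptions made for illustration, and the generation of inflectional and derivational variants is left out.

```python
import re
import string

# Hypothetical stopword list, for illustration only (not the list used by lvg).
STOPWORDS = frozenset({"of", "and", "with", "for", "to", "in", "by", "on", "the"})


def lowercase(term: str) -> str:
    return term.lower()


def uninvert(term: str) -> str:
    """Simplified uninversion: 'cancer, lung' -> 'lung cancer'."""
    parts = [part.strip() for part in term.split(",")]
    return " ".join(reversed([part for part in parts if part]))


def remove_possessives(term: str) -> str:
    """Drop possessive markers: "Hodgkin's" -> "Hodgkin", "patients'" -> "patients"."""
    term = re.sub(r"'s\b", "", term)
    return re.sub(r"s'(?=\s|$)", "s", term)


def remove_punctuation(term: str) -> str:
    """Replace ASCII punctuation with spaces and collapse runs of whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(term.translate(table).split())


def remove_stopwords(term: str) -> str:
    return " ".join(word for word in term.split() if word not in STOPWORDS)


def sort_words(term: str) -> str:
    return " ".join(sorted(term.split()))


def normalize(term: str) -> str:
    """Chain the steps in one assumed order to get a normalized string."""
    for step in (remove_possessives, remove_punctuation, lowercase,
                 remove_stopwords, sort_words):
        term = step(term)
    return term


if __name__ == "__main__":
    print(uninvert("cancer, lung"))
    print(normalize("Hodgkin's Disease, NOS"))
    print(normalize("Diseases of the Liver"))
```

Because the words are sorted at the end, strings that differ only in word order, case, punctuation, or possessive marking map to the same key, which is the property a normalized string index relies on.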
Three lexical databases that may be useful for some developers have also been provided on the CD-ROM. These are a file of known derivational variants (alternations such as "aphasic/aphasia"), a file of closely related terms that mean the same thing but may have a different syntactic category (e.g., "hepatocellular/liver cells"), and a file of spelling alternations (e.g., "foetal/fetal").
The lexicon relational format is not fully normalized. By design, there is duplication of data among different relations and within certain relations. Developers will need to decide about the extent to which this redundancy should be retained, reduced, or increased for their specific applications. Among other tables, there are separate tables for agreement and inflection information, for complementation patterns, for spelling variants, and for abbreviations and acronyms and their fully expanded forms.
The unit record format is a frame structure consisting of slots and fillers. The slots are the basic lexical attributes, and the fillers express the possible values of those attributes for that particular lexical item. The record for "anaesthetic" given below illustrates some of the features of the lexical unit record:
{base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
    cat=noun
    variants=reg
entry=E0008770
    cat=adj
    variants=inv
    position=attrib(3)
}
The base form "anaesthetic" and its spelling variant "anesthetic" determine a lexical record consisting of a noun entry and an adjective entry. The variants= slot contains a code indicating the inflectional morphology of the entry: the filler reg in the noun entry indicates that the noun "anaesthetic" is a count noun that undergoes regular English plural formation ("anaesthetics"), and inv in the variants= slot of the adjective entry indicates that the adjective "anaesthetic" does not form a comparative or superlative. The position= slot indicates that the adjective "anaesthetic" is attributive and appears after color adjectives in the normal adjective order.
The lexical programs are written in the C programming language and are provided as ASCII source code.
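The slot-and-filler structure of the unit record can be read with very little code. The sketch below parses the "anaesthetic" record into a dictionary with one sub-dictionary per entry. The function name parse_unit_record and the output layout are hypothetical, and the parser assumes that fillers contain no whitespace, which holds for this record but would not hold for multi-word base forms.

```python
RECORD = """{base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
    cat=noun
    variants=reg
entry=E0008770
    cat=adj
    variants=inv
    position=attrib(3)
}"""


def parse_unit_record(text: str) -> dict:
    """Parse one brace-delimited unit record into nested dictionaries."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("unit record must be wrapped in braces")

    record = {"entries": []}
    current = record  # slots before the first entry= belong to the record itself
    for token in body[1:-1].split():
        slot, _, filler = token.partition("=")
        if slot == "entry":
            # Each entry= slot opens a new entry; later slots attach to it.
            current = {"entry": filler}
            record["entries"].append(current)
        else:
            # A slot may repeat, so fillers are collected in a list.
            current.setdefault(slot, []).append(filler)
    return record


if __name__ == "__main__":
    parsed = parse_unit_record(RECORD)
    print(parsed["base"], parsed["spelling_variant"])
    for entry in parsed["entries"]:
        print(entry)
```

Keeping fillers in lists is a design choice made here so that a slot appearing more than once in an entry does not overwrite its earlier value.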
Betsy L. Humphreys
UMLS Project Officer
National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894
FAX 301/496-4450
Other fact sheets in the UMLS series:
Office of Public Information
National Library of Medicine
8600 Rockville Pike
Bethesda, Maryland 20894
Internet
UMLS fact sheets are also available to Internet users through FTP (File Transfer Protocol). To access them, ftp to nlmpubs.nlm.nih.gov and log in as: anonymous.