Natural Language Processing and Applications

An African American walks into a cafe…

"We don't serve coloured people here", says the owner.

"That's fine", says the man, "I don't eat coloured people; I'd just like a piece of chicken"

Introduction

Natural Language Processing (NLP), also called language technology, linguistic engineering and computational linguistics, aims to study and develop methods by which "natural" human languages can be processed effectively by computer.
NLP has the potential to make a very significant contribution to the usefulness of information technology in the long-term future. Some key growth areas are:

  • automatic localisation of software and its documentation (via language translation)
  • information retrieval
  • machine assisted translation
  • grammatical and stylistic analysis
  • natural language interfaces for databases

There has recently been an explosion in this field as the computer hardware available to home users achieves a level where real-time processing is possible.
A growing number of groups are discovering the potential of large scale linguistic resources such as machine readable dictionaries, tagged linguistic manuscripts and bi-lingual texts. The existence of these resources has allowed the development of NLP system components such as part-of-speech taggers and machine tractable lexicons.
Standards are being established for the representation of linguistic components in a machine-readable form. Internationally supported projects such as the Text Encoding Initiative have recently appeared with the specific objective of creating and disseminating such standards.
There is a move towards freer exchange of information, data and software between groups. This is exemplified by the growing number of electronic newsgroups dedicated to NLP, and by the formation of international clearing houses such as the Consortium for Lexical Research. Access to these resources has been greatly facilitated by the extension of the internet to Europe.
The fields of text processing and natural language processing are gradually converging. For example, style checkers are often incorporated into word processors. Developments of this kind are greatly expanding the potential market for NLP products.
Speech recognition and speech synthesis form a huge field of study in their own right. These areas are beyond the scope of this research; however, it is clear that speech recognition will be able to provide higher levels of language comprehension accuracy due to the added components of stress, accenting and pauses.

Components of language

A machine that processes natural language must first be able to categorise, ‘understand’ and process the wide variety of language components. Some of the different levels of language, as distinguished by Russian analysts (Zvegintsev, 1976), are:

These components are listed in order of decreasing size. The last several in this list (morphemes to phonemes) are most important for understanding and replicating speech, so this research will focus mainly on discourse, sentences, phrases and words.

Language Analysis

Now that we have a basic understanding of the units of language, we can begin to examine how a computer can process them. Luger and Stubblefield (1998) identify several key analysis methods for language understanding:

Linguists have classically preferred rigid, structured analysis techniques, such as grammar and word order, to study language. Computer scientists have found that these techniques do not allow enough flexibility to process "ungrammatical sentences", slang, and garbled input. Thus AI researchers have established other approaches.

They have introduced more flexible data structures and parallel parsing techniques that allow several analysis techniques to be run concurrently, while pooling their results.

Production rules (IF-THEN rules based on logic that enable some understanding of input text to be derived) and semantic networks have been used to achieve greater processing opportunities.

Semantic networks are a general representational technique, and they are used in NLP for several different purposes (Beardon et al, 1991). One of the most powerful is the representation of type hierarchies (or knowledge hierarchies), which allow us to capture the properties of other objects through a process of inheritance. See a graphical example.

All these techniques lead to the same focus: the need to be able to process input language and ascertain as many facts as possible. Some common processing goals are determining:

  • what objects were involved
  • what occurred
  • when it occurred
  • what was the outcome

Morphology

Morphological analysis helps determine the use of a word in a sentence by analysing the effect of prefixes and suffixes, thus giving information about tense, number, and part of speech.

Morphological analysis

A morphological analysis means processing word forms without considering context. Word form is defined by Popov as "that part of a text which lies between two blanks (punctuation marks are also considered word forms)".
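Popov's definition of a word form can be illustrated with a small tokeniser. This is only a sketch under stated assumptions: the function name `word_forms` and the regular expression are illustrative choices, not part of Popov's work, and the pattern only handles plain ASCII apostrophes inside words.

```python
import re

def word_forms(text):
    """Split text into word forms.

    A word form is either a run of word characters (optionally containing
    one internal apostrophe, as in "John's") or a single punctuation mark.
    """
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(word_forms("He lent Jan some money. She was very grateful."))
```

Note that punctuation marks come out as separate items, in line with the definition quoted above.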

Normal steps in MA

  1. searching for a word form in the dictionary
  2. distinguishing the stem of the word
  3. the search for the stem in the dictionary of stems
  4. word-combination processing
  5. pre-syntax
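The first three steps above can be sketched roughly as follows. The dictionaries, the suffix rules and the function name `analyse_word_form` are all hypothetical toy data invented for this illustration; a real analyser would use much larger lexicons and language-specific rules, and steps 4 and 5 are not covered here.

```python
# Hypothetical toy dictionaries (assumptions made for this sketch only).
WORD_FORMS = {"ran": {"stem": "run", "pos": "verb", "inflection": "past tense"}}
STEMS = {"dog": "noun", "bite": "verb", "like": "verb", "walk": "verb"}
SUFFIX_RULES = [
    ("ing", "present participle"),
    ("ed", "past tense"),
    ("s", "plural or third person singular"),
]

def analyse_word_form(word_form):
    """Analyse one word form without looking at its context."""
    word = word_form.lower()
    # Step 1: search for the whole word form in the dictionary.
    if word in WORD_FORMS:
        return dict(WORD_FORMS[word])
    if word in STEMS:
        return {"stem": word, "pos": STEMS[word]}
    # Step 2: distinguish a candidate stem by stripping a known suffix.
    for suffix, inflection in SUFFIX_RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            # Step 3: search for the stem in the dictionary of stems,
            # also trying the stem with a restored final "e" (liked -> like).
            for stem in (base, base + "e"):
                if stem in STEMS:
                    return {"stem": stem, "pos": STEMS[stem], "inflection": inflection}
    return None  # unknown word form

for form in ["ran", "bites", "liked", "dogs"]:
    print(form, "->", analyse_word_form(form))
```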

With most European languages, sentence analysis is traditionally divided into morphological, syntactic and semantic analyses. Analysis of Asian languages is a very different and difficult process due to the structure of those languages.

The processor is given goals or objectives for analysis. Common goals include:

  1. identifying words
  2. determining those which correspond to events
  3. distinguishing and processing nominal groups

Grammar and Syntax

The rules of grammar can give us information about the events taking place. We can determine how many objects were affected and whether the action took place in the past, will take place in the future or only has a chance of happening. Because language is fuzzy, the classical language analysis techniques cannot provide the depth of understanding that humans achieve. Grammar is but one way for a machine to get closer to that understanding.

Immediate Constituent Analysis (IC)

This type of analysis was pioneered by Bloomfield (Crystal, 1971), who illustrated how you can take a sentence and split it up into two immediate constituents. For example, he used the sentence Poor John ran away. He first split this up into a subject and a predicate:

Subject: Poor John
Predicate: ran away

In turn these were split up into Poor and John, and ran and away. Thus he was one of the first to see the sentence not as a sequence, but as a series of layers of constituents. As a result, tree diagrams began to be used as a visual reference to language structure.

Strengths: it gives a first look at the structure of language.
Weaknesses: it does not consider grammatical relationships. It cannot distinguish active from passive sentences, so it does not show that "That man saw John’s mother" and "John’s mother was seen by that man" are almost the same.
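The layered view of "Poor John ran away" can be sketched as a nested data structure, where every constituent is either a single word or a pair of immediate constituents. The representation and the helper names `flatten` and `layers` are assumptions made for this illustration, not Bloomfield's notation.

```python
# A constituent is either a word (str) or a tuple of immediate constituents.
sentence = (("Poor", "John"), ("ran", "away"))

def flatten(node):
    """Return the words covered by a constituent as one string."""
    if isinstance(node, str):
        return node
    return " ".join(flatten(child) for child in node)

def layers(node, depth=0):
    """Yield (depth, text) for every constituent, outermost layer first."""
    yield depth, flatten(node)
    if isinstance(node, tuple):
        for child in node:
            yield from layers(child, depth + 1)

# Print the constituents as an indented tree.
for depth, text in layers(sentence):
    print("  " * depth + text)
```

The structure records only how the sentence divides, which is why it says nothing about the relationship between an active sentence and its passive counterpart.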

‘Deep’ Syntax

Deep syntax is a much better way to represent a sentence. Deep syntax trees (see below) allow a sentence to be stored in a more systematic and flexible way. Their structure makes it easy to convert between passive and active and between different tenses, and it also facilitates translation to other languages.

Figure: a deep syntax tree for the sentence "John seems to know the answer"

Semantics

In general, semantics is the study of meaning. A machine has to analyse any input data in great detail in order to deduce some meaning from it. It needs to split the sentence into syntactic components, layer by layer. Often a sentence has more than one possible meaning, so the machine must either guess, using experience and heuristics, or determine the most appropriate meaning from the sentences before and after it. Because the machine needs to take into account not only the meaning of the sentence but also that of the broader discourse, it needs to support multiple parsing.

Pragmatics

In broad terms, pragmatics is the way that the setting of the sentence in a discourse is used to determine its correct interpretation. The key features of pragmatics are context and reference. These will be discussed later under Inference.

Inference and Interpretation

Inference and interpretation are logical processes that require examining the input language, comparing it with knowledge that has already been accumulated, and drawing a conclusion. To get to this stage, the analysis techniques already discussed need to be run, and an internal representation of the discourse needs to exist (see diagram from Luger & Stubblefield).

This discourse processing uses functions of predicate logic to draw general conclusions. For a machine to draw sensible conclusions, it must interpret the incoming data correctly and thoroughly, so that there is as much data as possible from which to draw a conclusion. There are two methods that are especially useful for extracting information from the natural language source: reference and context.

Reference

By considering the references of certain objects in a sentence, a machine can determine the linguistically expressible interdependencies between sentences in a discourse. Reference is the most important means of linking sentences in a discourse (Popov, 1982). The procedure for analysing reference is:

  1. establish where in the context we should seek the entity that is denoted by the given reference
  2. establish how to determine that a given referent and the given reference correspond to each other (Popov, 1982)

Using reference, we are able to determine which pronouns refer to which previously described object. For example: "He lent Jan some money. She was very grateful". A machine would be able to determine that she referred to Jan by calculating reference (Popov, 1982).
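A very rough sketch of the two-step procedure is shown below. Step 1 (where to look) is reduced to "the entities mentioned earlier in the discourse", and step 2 (how to match) is reduced to gender agreement. The `GENDER` lexicon, including the assumption that Jan is female, and the function name `resolve` are hypothetical and serve only this example; real reference resolution uses many more cues.

```python
# Hypothetical toy lexicon mapping names and pronouns to a gender feature.
GENDER = {"Jan": "female", "John": "male", "she": "female", "he": "male"}

def resolve(pronoun, preceding_entities):
    """Return the most recently mentioned entity that agrees with the pronoun."""
    wanted = GENDER.get(pronoun.lower())
    if wanted is None:
        return None  # the pronoun is not in the toy lexicon
    # Step 1: search the preceding context, most recent mention first.
    for entity in reversed(preceding_entities):
        # Step 2: accept the referent if its gender matches the reference.
        if GENDER.get(entity) == wanted:
            return entity
    return None

# "He lent Jan some money. She was very grateful."
print(resolve("She", ["Jan"]))
```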

Context

By considering context while processing natural language, we are able to interpret the meaning of a sentence from a connected text by placing that individual sentence in context. If an individual sentence being processed is not related to context, that sentence may have several different meanings, or it might be totally incomprehensible.

There are many levels of context that need to be considered when processing natural language (Popov, 1982):

textual context is the meaning derived from the sentences preceding the current sentence

situational context is the meaning from the current sentence, and is usually only given implicitly.

global context is the overall topic of the conversation and allows an algorithm to choose between several meanings (for example, "bark" would be interpreted differently in a conversation about dogs than in one about trees)

local context is the meaning derived from only the few preceding sentences. This is useful because the topic of the conversation may progress; local context provides the most recent topic.
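The "bark" example can be sketched as a simple overlap count between the words in the local context and a set of cue words for each meaning. The `SENSES` table, the cue words and the function names are hypothetical and chosen only for illustration; this is a toy stand-in, not a method taken from Popov.

```python
import re

# Hypothetical cue words for each meaning of an ambiguous word.
SENSES = {
    "bark": {
        "sound made by a dog": {"dog", "puppy", "growl", "loud"},
        "outer layer of a tree": {"tree", "trunk", "branch", "wood"},
    }
}

def local_context(sentences, window=2):
    """Collect the words of only the last few sentences (the local context)."""
    words = []
    for sentence in sentences[-window:]:
        words.extend(re.findall(r"\w+", sentence.lower()))
    return words

def choose_sense(word, context_words):
    """Pick the meaning whose cue words overlap most with the context."""
    context = set(context_words)
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context))

discourse = ["We walked in the forest.", "The tree had a thick trunk."]
print(choose_sense("bark", local_context(discourse)))
```

Restricting the window to the most recent sentences is what lets the chosen meaning follow the topic as the conversation moves on.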

A simple algorithm for processing context and reference is not really possible since a form of 'fuzzy' processing is required. Thus researchers are experimenting with neural networks to train a computer to recognise certain common situations (called frames) and also to generalise about new situations.

Now that the machine has a basis for accurately determining which objects are being referred to and how, when and where those objects are interacting, we have reached a stage at which some level of understanding is possible. This opens the way to accurate translation from natural language to a machine-readable form, and then to a different natural language.

Knowledge Hierarchies

"Because [an NLP] requires such large amounts of broad-based knowledge, natural language understanding has always been a driving force for research in knowledge representation." (Luger & Stubblefield, 1998)

For language parsing, comprehension and translation, a machine must have a knowledge base of information from which it can process the incoming language. Current systems base their knowledge storage on the way that human brains operate (Luger & Stubblefield, 1998). This knowledge is stored in object hierarchies, with each object having a set of properties and values associated with it. As in all object-oriented storage, subclasses inherit properties from their superclass. Examples of these properties could be "colour is yellow" and "size is small" for the object canary. In addition, the object canary will inherit the properties "can fly" and "lays eggs" from the superclass object: bird. There can also be class exceptions. For example, a penguin is a kind of bird, but it has an exception: it cannot fly. This knowledge is stored in a tree (see diagram from Luger & Stubblefield, 1998), and during the process of language parsing, this tree may need to be referred to. This provides the NLP system with a basic set of "common sense".
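The canary and penguin example can be sketched as a small hierarchy in which a property lookup walks up towards the superclass until a value is found. The `Node` class and its `lookup` method are names assumed for this sketch; they are not taken from Luger & Stubblefield.

```python
class Node:
    """One object in the knowledge hierarchy."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent          # superclass, or None at the root
        self.properties = properties  # properties stated directly on this object

    def lookup(self, prop):
        """Return the nearest value for prop, searching up the hierarchy."""
        node = self
        while node is not None:
            if prop in node.properties:
                return node.properties[prop]
            node = node.parent
        return None  # nothing is known about this property

bird = Node("bird", can_fly=True, lays_eggs=True)
canary = Node("canary", parent=bird, colour="yellow", size="small")
penguin = Node("penguin", parent=bird, can_fly=False)  # class exception

print(canary.lookup("can_fly"))    # inherited from bird
print(penguin.lookup("can_fly"))   # overridden by the exception
print(penguin.lookup("lays_eggs")) # still inherited from bird
```

Because the search stops at the nearest object that states the property, an exception stored on penguin takes priority over the general rule stored on bird.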

Semantic Networks

The first computer implementations of semantic networks were developed in the early 1960s for use in machine translation (Luger & Stubblefield, 1998).

An early influential program that illustrates many of the features of early semantic networks was written by Quillian in the late 1960s (Quillian, 1967). He defined words in terms of other words. This sometimes resulted in circular definitions, but the program traversed the network until it gained a satisfactory understanding.

Quillian suggested that a natural language system would have to:

  1. Determine the meaning of a body of text by building up collections of these intersection nodes
  2. Choose between multiple meanings, by finding the closest meaning as an intersection on the relationship path
  3. Answer a flexible range of queries based on associations between word concepts in the queries and concepts in the system.
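The idea of finding an intersection between two word concepts can be sketched with a breadth-first search from each word over a small network. The `NETWORK` contents and the function names are hypothetical toy data, and choosing the common concept with the smallest combined distance is a simplification assumed here rather than a description of Quillian's program.

```python
from collections import deque

# Hypothetical toy network: each word concept links to related concepts.
NETWORK = {
    "cry": ["sad", "sound"],
    "comfort": ["sad", "calm"],
    "sad": ["emotion"],
    "calm": ["emotion"],
    "sound": [],
    "emotion": [],
}

def distances(start):
    """Breadth-first search: number of links from start to each reachable concept."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in NETWORK.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def intersection(a, b):
    """Return the common concept closest to both words, or None."""
    da, db = distances(a), distances(b)
    common = set(da) & set(db)
    if not common:
        return None
    # Smallest combined path length wins; ties are broken alphabetically.
    return min(common, key=lambda n: (da[n] + db[n], n))

print(intersection("cry", "comfort"))
```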

Machine Translation (MT)

MT has been explored and tested for many years. Originally the goal was defined very simply: process an input text in one language and produce an output text in another language, such that the meaning remained unchanged.

The following components of language must be considered during translation:

Machine translation is a very complicated process because the source and destination languages may be very different from each other. Slang, idioms and regional dialects complicate the process even further.

MT Methods

A process for choosing the appropriate word meaning, like the one used by Quillian (1967), is extremely important in machine translation, where choosing the incorrect meaning of a word could totally change the meaning of the translated sentence.

General structure of MT programs:

  • A syntactic analyser creates a syntactic parse tree
  • Syntactic transformations modify the syntactic parse tree for the destination language
  • A language generator builds the target sentence from the parse tree
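The three-part structure above can be sketched as a toy English-to-French pipeline. Everything here is an assumption made for illustration: the four-word `LEXICON`, the function names, the use of a flat list of tagged words in place of a real parse tree, and the single transformation rule (moving an adjective after its noun).

```python
# Hypothetical toy lexicon: English word -> (category, French word).
LEXICON = {
    "the": ("Article", "le"),
    "black": ("Adjective", "noir"),
    "cat": ("Noun", "chat"),
    "sleeps": ("Verb", "dort"),
}

def analyse_syntax(sentence):
    """Syntactic analyser: tag each word (a flat stand-in for a parse tree)."""
    return [(LEXICON[word][0], word) for word in sentence.lower().split()]

def transform(tree):
    """Syntactic transformation: place an adjective after the noun it precedes."""
    tree = list(tree)
    i = 0
    while i < len(tree) - 1:
        if tree[i][0] == "Adjective" and tree[i + 1][0] == "Noun":
            tree[i], tree[i + 1] = tree[i + 1], tree[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return tree

def generate(tree):
    """Language generator: emit the target-language word for each leaf."""
    return " ".join(LEXICON[word][1] for _, word in tree)

def translate(sentence):
    return generate(transform(analyse_syntax(sentence)))

print(translate("The black cat sleeps"))
```

Keeping the three stages as separate functions mirrors the structure listed above: only `transform` and the lexicon depend on the particular language pair.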

Stages of Language Generation:

  1. Dictionary look-up and morphological analysis
  2. Identification of homographs
  3. Identification of compound nouns
  4. Identification of noun and verb phrases
  5. Processing of idioms
  6. Processing of prepositions
  7. Subject-predicate identification
  8. Syntax identification

The field of machine translation has recently come of age. Many packages are available for home PCs at affordable prices. However, the quality of these applications is still rather poor. Part of the problem is that efficient machine translation requires neural networks, and until parallel processors are more affordable, software emulation must be used. This software emulation is slow, and so quality is compromised so that a reasonable speed can be achieved.

Conclusion

The field of natural language processing has come a long way in the last few years, but it has a long way to go yet. With even faster processors on the horizon, there will be more opportunity for even more accurate and complicated processing tasks. With the business opportunities that present themselves with listening and speaking computers, there is sure to be no lack of funding for more research. A future of voice-activated appliances, cars and houses looks certain.

See the Humorous Examples in the appendix for some natural language examples that a machine may have trouble processing.

Bibliography

Beardon, C., Lumsden, D. & Holmes, G. (1991) Natural Language and Computational Linguistics: an Introduction. Chichester: Ellis Horwood Limited

Brachman, R.J. and Levesque, H. J. (1985) Readings in Knowledge Representation. Los Altos, CA: Morgan Kaufmann.

Charniak, E & Wilks, Y. (1981) Fundamental Studies in Computer Science: Computational Semantics. Amsterdam: North-Holland Publishing

Chomsky, N. (1976) Reflections on Language. Glasgow: William Collins Sons & Co.

Chomsky, N. (1986) Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger Publishers.

Crystal, D. (1971) Linguistics. Harmondsworth: Penguin Books

Gazdar, G. & Mellish, C.S. (1989) Natural Language Processing in PROLOG. Wokingham: Addison-Wesley Publishing

Luger, G. F. & Stubblefield, W. A. (1998) Artificial Intelligence – Structures and Strategies for Complex Problem Solving. Harlow: Addison Wesley Longman, Inc.

Maslov, Y.S. (1975) Vvedenye v yazikoznanye. Visshaya Shkola, Moscow.

Mellish, C. S. (1985). Computer Interpretation of Natural Language Descriptions. Chichester: Ellis Horwood Limited

Obermeier, K. K. (1989). Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective. Chichester: Ellis Horwood Limited

Popov, E. V. (1982) Talking with Computers in Natural Language. Berlin: Springer-Verlag

Quillian, M.R. (1967) Word concepts: A theory and simulation of some basic semantic capabilities. In Brachman and Levesque (1985)

Taylor, J. R. (1989) Linguistic Categorization – Prototypes in Linguistic Theory. Oxford: Clarendon Press

Waterworth, J. A. & Talbot, M. (1987). Speech and Language-based Interaction with Machines: towards the conversational computer. Chichester: Ellis Horwood Limited

WWWebster Dictionary (1998) Merriam-Webster Online. http://www.m-w.com/

Zvegintsev, V. A. (1976) Predlozzheye i yevo otnoshenye k yaziku i rechi. Moskovskiy Universitet, Moscow.

Appendix

Glossary

dialect (Webster Dictionary) - a regional variety of language distinguished by features of vocabulary, grammar, and pronunciation from other regional varieties and constituting together with them a single language

differentiating signs - non-alphabet characters that enable words to be distinguished. These can include capital letters, italics and accents.

discourse - a complete text or conversation

frames / scripts (Obermeier, 1989) - a way of representing knowledge as chunks of information, which are actually data structures that represent stereotypical situations.

idioms (Webster Dictionary) - the syntactical, grammatical, or structural form peculiar to an individual language

language [1] (Maslov, 1975) - a system of elements, possessed by a certain group, which constitutes units of different levels (words, significant parts of words, etc.) plus a set of rules governing the usage of these units. The system of units is called the vocabulary of the language, while the system of rules for creating and understanding intelligible statements is called the grammar of this language.

language [2] (Webster Dictionary) - the words, their pronunciation, and the methods of combining them used and understood by a community

language [3] (Webster Dictionary) - a systematic means of communicating ideas or feelings by the use of conventionalised signs, sounds, gestures, or marks having understood meanings

language [4] (Webster Dictionary) - a formal system of signs and symbols (as FORTRAN or a calculus in logic) including rules for the formation and transformation of admissible expressions

MT - machine translation

metonymy - a figure of speech in which something is referred to by the name of a thing closely associated with it (e.g. "the crown" for the monarchy)

morphemes (Webster Dictionary) - a distinctive collocation of phonemes (as the free form pin or the bound form -s of pins) having no smaller meaningful parts

morphology - the components, called morphemes, that make up words

NLP - natural language processing

phonemes - the smallest units of sound that distinguish one word from another in a language (e.g. the initial sounds of pin and bin)

phonology - the sounds that combine to form language

phrase (Webster Dictionary) - a word or group of words forming a syntactic constituent with a single grammatical function (e.g. "under the bridge", or "before breakfast")

pragmatics - the study of appropriate conversation content

prosody - the study of rhythm and intonation of language

semantics - the way that order and word components indicate meaning

sentences - a package of language that may or may not contain enough information to derive meaning (i.e. context is important)

slang (Webster Dictionary) - an informal nonstandard vocabulary composed typically of coinages, arbitrarily changed words, and extravagant, forced, or facetious figures of speech

syllables (Webster Dictionary) - a unit of spoken language that is next bigger than a speech sound and consists of one or more vowel sounds alone or of a syllabic consonant alone or of either with one or more consonant sounds preceding or following

syntax (Webster Dictionary) - the part of grammar dealing with the way in which linguistic elements (as words) are put together to form constituents (as phrases or clauses)

words - constituents of a sentence that, due to their order, their suffixes, prefixes and differentiating signs, give some meaning.

world knowledge - background knowledge, and goal understanding required to understand text and conversation

Semantic

Production System Example

An example from Luger & Stubblefield (1998) uses a type of production system to reach the goal:

Rules

  1. Sentence = Noun_Phrase + Verb_Phrase
  2. Noun_Phrase = Adjective + Noun
  3. Noun_Phrase = Article + Noun
  4. Noun_Phrase = Noun
  5. Verb_Phrase = Verb
  6. Verb_Phrase = Verb + Noun_Phrase
  7. Article = ‘a’
  8. Article = ‘the’
  9. Noun = ‘man’
  10. Noun = ‘dog’
  11. Verb = ‘likes’
  12. Verb = ‘bites’

Note: this is by no means an exhaustive list of sentence breakdowns

Thus the sentence "The dog bites the man" can be parsed as follows:

Output                                      Rule
Sentence                                    1
Noun_Phrase + Verb_Phrase                   3
Article + Noun + Verb_Phrase                8
‘The’ + Noun + Verb_Phrase                  10
‘The’ + ‘dog’ + Verb_Phrase                 6
‘The’ + ‘dog’ + Verb + Noun_Phrase          12
‘The’ + ‘dog’ + ‘bites’ + Noun_Phrase       3
‘The’ + ‘dog’ + ‘bites’ + Article + Noun    8
‘The’ + ‘dog’ + ‘bites’ + ‘the’ + Noun      9
‘The’ + ‘dog’ + ‘bites’ + ‘the’ + ‘man’
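The derivation can also be reproduced mechanically. The sketch below encodes the twelve rules and searches, with backtracking, for the sequence of rule numbers that rewrites Sentence into the given words. The `RULES` encoding and the function name `derive` are assumptions made for this illustration, not the notation of Luger & Stubblefield; words are compared in lower case.

```python
# The twelve rules: number -> (left-hand side, right-hand side).
RULES = {
    1: ("Sentence", ["Noun_Phrase", "Verb_Phrase"]),
    2: ("Noun_Phrase", ["Adjective", "Noun"]),
    3: ("Noun_Phrase", ["Article", "Noun"]),
    4: ("Noun_Phrase", ["Noun"]),
    5: ("Verb_Phrase", ["Verb"]),
    6: ("Verb_Phrase", ["Verb", "Noun_Phrase"]),
    7: ("Article", ["a"]),
    8: ("Article", ["the"]),
    9: ("Noun", ["man"]),
    10: ("Noun", ["dog"]),
    11: ("Verb", ["likes"]),
    12: ("Verb", ["bites"]),
}
# Symbols that have at least one rule. "Adjective" has none in this rule set,
# so it is treated like a word that never matches the input.
NONTERMINALS = {lhs for lhs, _ in RULES.values()}

def derive(symbols, words):
    """Return the rule numbers of a leftmost derivation of words, or None."""
    if not symbols:
        return [] if not words else None
    first, rest = symbols[0], symbols[1:]
    if first not in NONTERMINALS:
        # A word: it must match the next input word.
        if words and words[0] == first:
            return derive(rest, words[1:])
        return None
    # Try every rule for this symbol, backtracking on failure.
    for number, (lhs, rhs) in RULES.items():
        if lhs == first:
            result = derive(rhs + rest, words)
            if result is not None:
                return [number] + result
    return None

print(derive(["Sentence"], "the dog bites the man".split()))
```

For "the dog bites the man" this yields the rule sequence 1, 3, 8, 10, 6, 12, 3, 8, 9, which matches the Rule column of the parse shown above.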

Universal Grammar

The idea of a universal grammar shared by all humans was proposed by Noam Chomsky, a famous linguist and political writer. This Universal Grammar (UG) is an abstraction of the rules of every human language. Chomsky proposed that since humans can so easily learn new languages and switch between them, there must be a basic set of rules that governs all languages so that our brains can process them. Thus, if our brains can process them, then so can computers.

Humorous Examples

Some Natural Language examples that computers may have trouble understanding

Squad helps dog bite victim

Man eating piranha mistakenly sold as pet fish

Juvenile court to Try shooting Victim

Women are requested not to have children in the bar

Dwarf seer escapes from jail - small medium at large

Lost small terrier de-sexed at Hungry Jack's

Stud tires out

Drunk gets nine months in violin case

Iraqi head seeks arms

Queen Mary having bottom scraped

Note: Queen Mary is the name of an ocean liner

Yoko Ono will talk about her husband John Lennon who was killed in an interview with Barbara Walters

Two cars were reported stolen by the Groverton police yesterday

We will sell gasoline to anyone in a glass container

For Sale: mixing bowl set designed to please a cook with round bottom for efficient beating

New housing for elderly not yet dead

Note: These samples were taken from "The Language Instinct", a book by Steven Pinker (1995) Australia Print Group: Maryborough

Inheritance system description of birds
