An African American walks into a cafe…
"We don't serve coloured people here", says the owner.
"That's fine", says the man, "I don't eat coloured people; I'd just like a piece of chicken"
Natural Language Processing (NLP), also called language technology, linguistic engineering and computational linguistics, aims to study and develop methods by which "natural" human languages can be processed effectively by computer.
NLP has the potential to make a very significant contribution to the usefulness of information technology in the long-term future. Some key growth areas are:
There has recently been an explosion in this field as the computer hardware available to home users achieves a level where real-time processing is possible.
A growing number of groups are discovering the potential of large-scale linguistic resources such as machine-readable dictionaries, tagged linguistic manuscripts and bilingual texts. The existence of these resources has allowed the development of NLP system components such as part-of-speech taggers and machine-tractable lexicons.
Standards are being established for the representation of linguistic components in a machine-readable form. Internationally supported projects such as the Text Encoding Initiative have recently appeared with the specific objective of creating and disseminating such standards.
There is a move towards freer exchange of information, data and software between groups. This is exemplified by the growing number of electronic newsgroups dedicated to NLP, and by the formation of international clearing houses such as the Consortium for Lexical Research. Access to these resources has been greatly facilitated by the extension of the internet to Europe.
The fields of text processing and natural language processing are gradually converging. For example, style checkers are often incorporated into word processors. Developments of this kind are greatly expanding the potential market for NLP products.
Speech recognition and speech synthesis form a huge field of study. These areas are beyond the scope of this research; however, it is clear that speech recognition will be able to provide higher levels of language comprehension accuracy due to the added components of stress, accenting and pauses.
A machine that processes natural language must first be able to categorise, ‘understand’ and process the wide variety of language components. Some of the different levels of language, as distinguished by Russian analysts (Zvegintsev, 1976), are:
These components are listed in order of decreasing size. The last several in this list (morphemes to phonemes) are most important for understanding and replicating speech, so this research will focus mainly on discourse, sentences, phrases and words.
Now that we have a basic understanding of the units of language, we can begin to examine how a computer can process them. Luger and Stubblefield (1998) identify several key analysis methods for language understanding:
Linguists have classically preferred rigid, structured analysis techniques based on grammar and word order to study language. Computer scientists have found that these techniques do not allow enough flexibility to process "ungrammatical sentences", slang, and garbled input. Thus AI researchers have established other approaches.
They have introduced more flexible data structures and parallel parsing techniques that allow several analysis techniques to be run concurrently, while pooling their results.
Production rules (IF-THEN rules based on logic that enable some understanding of input text to be derived) and semantic networks have been used to achieve greater processing opportunities.
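To make the idea of IF-THEN production rules concrete, here is a minimal sketch of forward chaining. The two rules and the fact strings are invented purely for illustration and are not taken from any of the cited systems.

```python
# Hypothetical IF-THEN rules: (set of required facts, fact to conclude).
RULES = [
    ({"word ends in -ed"}, "verb is past tense"),
    ({"verb is past tense", "subject is plural"},
     "several actors acted in the past"),
]


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"word ends in -ed", "subject is plural"}, RULES)
```

The second rule can only fire after the first has added its conclusion, which is why the loop keeps running until a full pass adds nothing new.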
Semantic networks are a general representational technique and are used in NLP for several different purposes (Beardon et al., 1991). One of the most powerful is the representation of type hierarchies (or knowledge hierarchies), which allow us to capture the properties of other objects through a process of inheritance. See a graphical example.
All these techniques lead to the same focus: the need to be able to process input language and ascertain as many facts as possible. Some common processing goals are determining:
Morphology analysis helps determine the use of a word in a sentence by analysing the effect of prefixes and suffixes, thus giving information about tense, number, and part of speech.
A morphological analysis means processing word forms without considering context. Word form is defined by Popov as "that part of a text which lies between two blanks (punctuation marks are also considered word forms)".
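A very rough sketch of context-free morphological analysis is given below. The suffix table is a hypothetical three-entry stand-in for a real morphological lexicon, and the minimum stem length of three letters is an arbitrary assumption used to avoid splitting short words.

```python
# Hypothetical suffix table: (suffix, information it contributes).
SUFFIXES = [
    ("ing", {"part_of_speech": "verb", "tense": "present participle"}),
    ("ed", {"part_of_speech": "verb", "tense": "past"}),
    ("s", {"number": "plural"}),
]


def analyse_word_form(word_form):
    """Return (stem, features) for one word form, ignoring context."""
    lowered = word_form.lower()
    for suffix, features in SUFFIXES:
        # Require a stem of at least three letters so that short words
        # such as "is" or "red" are left alone.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)], dict(features)
    return lowered, {}


print(analyse_word_form("walked"))
print(analyse_word_form("dogs"))
```

Because no context is used, a form such as "bites" is tagged as a plural even when it is acting as a verb, which is exactly the limitation that the later syntactic and semantic stages are meant to address.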
With most European languages, sentence analysis is traditionally divided into morphological, syntactic and semantic analyses. Analysis of Asian languages is a very different and difficult process due to the structure of those languages.
The processor is given goals or objectives for analysis. Common goals include:
The rules of grammar can give us information about the events taking place. We can determine how many objects were affected and whether the action took place in the past, will take place in the future or only has a chance of happening. Because language is fuzzy, the classical language analysis techniques cannot provide the depth of understanding that humans achieve. Grammar is but one way for a machine to get closer to that understanding.
This type of analysis was pioneered by Bloomfield (Crystal, 1971), who illustrated how a sentence can be split into two immediate constituents. For example, he used the sentence "Poor John ran away". He first split this up into a subject and a predicate:
Subject: Poor John
Predicate: ran away
In turn these were split up into Poor and John, and ran and away. Thus he was one of the first to see the sentence not as a sequence, but as a series of layers of constituents. Tree diagrams then began to be used as a visual reference to language structure.
Strengths: it gives a beginning look at the structure of language.
Weaknesses: it does not consider grammatical relationships.
It cannot distinguish between active and passive sentences, and it does not show that "That man saw John’s mother" and "John’s mother was seen by that man" are almost the same.
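The layered split of "Poor John ran away" can be written down as a small nested structure. This is only a sketch: the Subject and Predicate labels come from the example above, while the word-level labels (Adjective, Noun, Verb, Adverb) are assumptions added for illustration.

```python
# Each node is (label, children); a leaf is (label, word).
TREE = (
    "Sentence",
    [
        ("Subject", [("Adjective", "Poor"), ("Noun", "John")]),
        ("Predicate", [("Verb", "ran"), ("Adverb", "away")]),
    ],
)


def leaves(node):
    """Collect the words under a node, left to right."""
    _, content = node
    if isinstance(content, str):
        return [content]
    words = []
    for child in content:
        words.extend(leaves(child))
    return words


def immediate_constituents(node):
    """Return the text covered by each direct child of a node."""
    _, children = node
    return [" ".join(leaves(child)) for child in children]


print(immediate_constituents(TREE))        # ['Poor John', 'ran away']
print(immediate_constituents(TREE[1][0]))  # ['Poor', 'John']
```

Note that nothing in this structure records who did what to whom, which is the weakness described above.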
Deep syntax is a much better way to represent a sentence. Deep syntax trees (see below) allow storage in a more systematic and flexible way. Their structure makes it easy to convert between passive and active and between different tenses, and they also facilitate translation to other languages.
A deep syntax tree for the sentence - "John seems to know the answer"
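The benefit for active and passive conversion can be illustrated with a much flatter stand-in for a deep syntax tree: a record of the verb and its participants. This is a simplification and not the tree structure itself; the `DeepClause` class and the one-entry verb table are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical verb table: base form -> (past tense, past participle).
VERB_FORMS = {"see": ("saw", "seen")}


@dataclass(frozen=True)
class DeepClause:
    verb: str     # base form, e.g. "see"
    agent: str    # who performs the action
    patient: str  # who or what is affected


def render_active(clause):
    """Surface form with the agent as grammatical subject."""
    past, _ = VERB_FORMS[clause.verb]
    return f"{clause.agent} {past} {clause.patient}"


def render_passive(clause):
    """Surface form with the patient as grammatical subject."""
    _, participle = VERB_FORMS[clause.verb]
    return f"{clause.patient} was {participle} by {clause.agent}"


clause = DeepClause(verb="see", agent="that man", patient="John's mother")
print(render_active(clause))   # that man saw John's mother
print(render_passive(clause))  # John's mother was seen by that man
```

Both surface sentences are produced from one stored representation, so the fact that they mean almost the same thing is explicit rather than something the machine has to rediscover.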
In general, semantics is the study of meaning. A machine will have to analyse any input data in great detail in order to deduce some meaning from it. It needs to split up the sentence into syntactic components, layer by layer. Often there is more than one possible meaning for a sentence, so a machine will have to choose, either by using experience and heuristics or by determining the most appropriate meaning according to the sentences before and after it. Because a machine needs to take into account not only the meaning of the sentence but also that of the broader discourse, it would need to support multiple parsing.
In broad terms, pragmatics is the way that the setting of the sentence in a discourse is used to determine its correct interpretation. The key features of pragmatics are context and reference. These will be discussed later under Inference.
Inference and interpretation are logic processes that require examining the input language, comparing it with knowledge that has already been accumulated, and drawing a conclusion. To get to this stage, the analysis techniques already discussed need to be run, and an internal representation of the discourse needs to exist. (see diagram from Luger & Stubblefield).
This discourse processing uses functions of predicate logic to draw general conclusions. For a machine to draw sensible conclusions it is necessary to interpret the incoming data correctly and thoroughly, so that there is as much data as possible from which to draw a conclusion. There are two methods that are especially useful for extracting information from the natural language source: reference and context.
By considering the references of certain objects in a sentence, a machine can determine the linguistically expressible interdependencies between sentences in a discourse. Reference is the most important means of linking sentences in a discourse (Popov, 1982). The procedure for analysing reference is:
Using reference, we are able to determine which pronouns refer to which previously described objects. For example: "He lent Jan some money. She was very grateful". A machine would be able to determine that "she" referred to Jan by calculating reference (Popov, 1982).
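As a rough illustration of the Jan example, the sketch below resolves a pronoun to the most recent name that agrees with it in gender. This is a naive heuristic, not the procedure described by Popov; the two small lexicons are hypothetical, and treating Jan as female follows the reading of the example given above.

```python
# Hypothetical lexicons: gender of known names and of pronouns.
NAME_GENDER = {"Jan": "female", "John": "male"}
PRONOUN_GENDER = {"he": "male", "she": "female"}


def resolve_pronoun(pronoun, previous_sentence):
    """Return the most recent name that agrees in gender, or None."""
    wanted = PRONOUN_GENDER.get(pronoun.lower())
    if wanted is None:
        return None
    words = [w.strip(".,!?") for w in previous_sentence.split()]
    # Search backwards so that the nearest candidate wins.
    for word in reversed(words):
        if NAME_GENDER.get(word) == wanted:
            return word
    return None


print(resolve_pronoun("She", "He lent Jan some money."))  # Jan
```

A real system would also have to handle pronouns such as "it" and "they", and candidates introduced several sentences earlier.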
By considering context while processing natural language, we are able to interpret the meaning of a sentence from a connected text by placing that individual sentence in context. If an individual sentence being processed is not related to context, that sentence may have several different meanings, or it might be totally incomprehensible.
There are many levels of context that need to be considered when processing natural language (Popov, 1982):
textual context is the meaning derived from the sentences preceding the current sentence
situational context is the meaning from the current sentence, and is usually only given implicitly.
global context is like the topic of the conversation and allows an algorithm to choose between several meanings (for example, a different sense of "bark" would be chosen if we were talking about dogs rather than trees)
local context is the meaning derived from only the few preceding sentences. This is useful because the topic of the conversation may progress; local context provides the most recent topic.
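The "bark" example can be sketched as a simple overlap count between the preceding sentences and a set of cue words for each sense. The sense inventory and cue words below are invented for illustration, and counting shared words is only one crude way of approximating context.

```python
# Hypothetical sense inventory: each sense lists words that signal its topic.
SENSES = {
    "bark": {
        "sound a dog makes": {"dog", "dogs", "puppy", "growl"},
        "outer layer of a tree": {"tree", "trees", "trunk", "branch"},
    }
}


def choose_sense(word, context_sentences):
    """Pick the sense whose cue words overlap most with the context."""
    context_words = set()
    for sentence in context_sentences:
        context_words.update(w.strip(".,!?").lower() for w in sentence.split())
    best_sense, best_score = None, 0
    for sense, cues in SENSES.get(word, {}).items():
        score = len(cues & context_words)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense  # None when the context gives no evidence


print(choose_sense("bark", ["The dog ran to the gate."]))
print(choose_sense("bark", ["The old tree had a thick trunk."]))
```

Passing only the last few sentences corresponds to local context, while passing the whole conversation so far approximates global context.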
A simple algorithm for processing context and reference is not really possible since a form of 'fuzzy' processing is required. Thus researchers are experimenting with neural networks to train a computer to recognise certain common situations (called frames) and also to generalise about new situations.
Now that the machine has a basis for accurately determining which objects are being referred to and how, when and where those objects are interacting, we have reached a stage where some level of understanding is possible. This makes possible an accurate translation from natural language to a machine-readable form, and then to a different natural language.
"Because [an NLP] requires such large amounts of broad-based knowledge, natural language understanding has always been a driving force for research in knowledge representation." (Luger & Stubblefield, 1998)
For language parsing, comprehension and translation, a machine must have a knowledge base of information from which it can process the incoming language. Current systems base their knowledge storage on the way that human brains operate (Luger & Stubblefield, 1998). This knowledge is stored in object hierarchies, with each object having a set of properties and values associated with it. Like all object-orientated storage, subclasses inherit properties from their superclass. Examples of these properties could be "colour is yellow" and "size is small" for the object canary. In addition, the object canary will inherit the properties "can fly" and "lays eggs" from the superclass object: bird. There can also be class exceptions. For example, a penguin is an instance of the bird class, but it has an exception: it cannot fly. This knowledge is stored in a tree (see diagram from Luger & Stubblefield, 1998), and during the process of language parsing, this tree may need to be referred to. This provides the NLP system with a basic set of "common sense".
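The canary, bird and penguin example can be sketched as a small hierarchy in which a property lookup climbs from an object to its superclass until a value is found. The `Concept` class and its `lookup` method are hypothetical names chosen for this illustration, not part of any cited system.

```python
class Concept:
    """A node in a type hierarchy with local properties and one parent."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent
        self.properties = properties

    def lookup(self, key):
        """Return the nearest value for key, searching upwards."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None


bird = Concept("bird", can_fly=True, lays_eggs=True)
canary = Concept("canary", parent=bird, colour="yellow", size="small")
# The exception is stored locally, so it is found before the bird default.
penguin = Concept("penguin", parent=bird, can_fly=False)

print(canary.lookup("can_fly"))     # True, inherited from bird
print(penguin.lookup("can_fly"))    # False, local exception
print(penguin.lookup("lays_eggs"))  # True, inherited from bird
```

Storing the exception on the penguin node means the general rule for birds does not have to be rewritten.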
The first computer implementations of semantic networks were developed in the early 1960s for use in machine translation (Luger & Stubblefield, 1998).
An early influential program that illustrates many of the features of early semantic networks was written by Quillian in the late 1960s (Quillian, 1967). He defined words in terms of other words. This sometimes resulted in circular definitions, but the program traversed the tree until it gained a satisfactory understanding.
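The problem of circular definitions can be illustrated with a small traversal that remembers which words it has already visited. This is not a reconstruction of Quillian's program; the mini-dictionary and the depth limit are assumptions made for the sketch.

```python
# Hypothetical mini-dictionary: each word is defined by other words.
DEFINITIONS = {
    "plant": ["living", "structure"],
    "living": ["alive"],
    "alive": ["living"],  # circular: living -> alive -> living
    "structure": ["thing"],
}


def expand(word, definitions, max_depth=3):
    """Collect every word reachable from `word` within max_depth steps."""
    seen = {word}
    frontier = [word]
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for related in definitions.get(current, []):
                # Skip words already visited, so cycles cannot loop forever.
                if related not in seen:
                    seen.add(related)
                    next_frontier.append(related)
        frontier = next_frontier
    return seen - {word}


print(sorted(expand("plant", DEFINITIONS)))
```

The depth limit stands in for the point at which the traversal has gathered enough related words to be considered a satisfactory understanding.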
Quillian suggested that a natural language system would have to:
The following components of language must be considered during translation:
Machine translation is a very complicated process because the source and destination languages may be very alien to each other. Slang, idioms and regional dialects confuse the process even further.
A process for choosing the appropriate word meaning, like the one used by Quillian (1967), is extremely important in machine translation, where choosing the incorrect meaning of a word could totally change the meaning of a translated sentence.
Syntactic analyser creates syntactic parse tree
The field of machine translation has recently come of age. Many packages are available for home PCs at affordable prices. However, the quality of these applications is still rather poor. Part of the problem is that efficient machine translation requires neural networks, and until parallel processors are more affordable, software emulation must be used. This software emulation is slow, and so quality is compromised so that a reasonable speed can be achieved.
Check out some natural language examples that the machine may have trouble processing.
Beardon, C., Lumsden, D. & Holmes, G. (1991) Natural Language and Computational Linguistics: An Introduction. Chichester: Ellis Horwood Limited
Brachman, R.J. and Levesque, H. J. (1985) Readings in Knowledge Representation. Los Altos, CA: Morgan Kaufmann.
Charniak, E & Wilks, Y. (1981) Fundamental Studies in Computer Science: Computational Semantics. Amsterdam: North-Holland Publishing
Chomsky, N. (1976) Reflections on Language. Glasgow: William Collins Sons & Co.
Chomsky, N. (1986) Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger Publishers.
Crystal, D. (1971) Linguistics. Harmondsworth: Penguin Books
Gazdar, G. & Mellish, C.S. (1989) Natural Language Processing in PROLOG. Wokingham: Addison-Wesley Publishing
Luger, G. F. & Stubblefield W. A. (1998) Artificial Intelligence – Structures and Strategies for Complex Problem Solving. Harlow: Addison Wesley Longman, Inc.
Maslov, Y.S. (1975) Vvedenye v yazikoznanye. Visshaya Shkola, Moscow.
Mellish, C. S. (1985). Computer Interpretation of Natural Language Descriptions. Chichester: Ellis Horwood Limited
Obermeier, K. K. (1989). Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective. Chichester: Ellis Horwood Limited
Popov, E. V. (1982) Talking with Computers in Natural Language. Berlin: Springer-Verlag
Quillian, M.R. (1967) Word concepts: A theory and simulation of some basic semantic capabilities. In Brachman and Levesque (1985)
Taylor, J. R. (1989) Linguistic Categorization – Prototypes in Linguistic Theory. Oxford: Clarendon Press
Waterworth, J. A. & Talbot, M. (1987). Speech and Language-based Interaction with Machines: towards the conversational computer. Chichester: Ellis Horwood Limited
WWWebster Dictionary (1998) Merriam-Webster Online. http://www.m-w.com/
Zvegintsev, V. A. (1976) Predlozzheye i yevo otnoshenye k yaziku i rechi. Moskovskiy Universitet, Moscow.
dialect (Webster Dictionary) - a regional variety of language distinguished by features of vocabulary, grammar, and pronunciation from other regional varieties and constituting together with them a single language
differentiating signs - non-alphabet characters that enable words to be distinguished. These can include capital letters, italics and accents.
discourse - a complete text or conversation
frames / scripts (Obermeier, 1989)- a way of representing knowledge as chunks of information, which are actually data structures that represent stereotypical situations.
idioms (Webster Dictionary) - the syntactical, grammatical, or structural form peculiar to an individual language
language [1] (Maslov, 1975) - a system of elements, possessed by a certain group, with constituent units of different levels (words, significant parts of words, etc.) plus a set of rules governing the usage of these units. The system of units is called the vocabulary of the language, while the system of rules for creating and understanding intelligible statements is called the grammar of this language.
language [2] (Webster Dictionary) - the words, their pronunciation, and the methods of combining them used and understood by a community
language [3] (Webster Dictionary) - a systematic means of communicating ideas or feelings by the use of conventionalised signs, sounds, gestures, or marks having understood meanings
language [4] (Webster Dictionary) - a formal system of signs and symbols (as FORTRAN or a calculus in logic) including rules for the formation and transformation of admissible expressions
MT - machine translation
metonymy - a figure of speech in which the name of one thing is used for another with which it is closely associated (e.g. "the crown" for the monarchy)
morphemes (Webster Dictionary) - a distinctive collocation of phonemes (as the free form pin or the bound form -s of pins) having no smaller meaningful parts
morphology - the components, called morphemes, that make up words
NLP - natural language processing
phonemes - the smallest units of sound that distinguish one word from another in a language (e.g. the initial sounds of "pin" and "bin")
phonology - the sounds that combine to form language
phrase (Webster Dictionary) - a word or group of words forming a syntactic constituent with a single grammatical function (e.g. "under the bridge", or "before breakfast")
pragmatics - the study of appropriate conversation content
prosody - the study of rhythm and intonation of language
semantics - the way that order and word components indicate meaning
sentences - a package of language that may or may not contain enough information to derive meaning (i.e. context is important)
slang (Webster Dictionary) - an informal nonstandard vocabulary composed typically of coinages, arbitrarily changed words, and extravagant, forced, or facetious figures of speech
syllables (Webster Dictionary) - a unit of spoken language that is next bigger than a speech sound and consists of one or more vowel sounds alone or of a syllabic consonant alone or of either with one or more consonant sounds preceding or following
syntax (Webster Dictionary) - the part of grammar dealing with the way in which linguistic elements (as words) are put together to form constituents (as phrases or clauses)
words - constituents of a sentence that, due to their order, their suffixes, prefixes and differentiating signs, give some meaning.
world knowledge - background knowledge, and goal understanding required to understand text and conversation

An example from Luger & Stubblefield (1998) uses a type of production system to reach the goal:
Note: this is by no means an exhaustive list of sentence breakdowns
Thus the sentence "The dog bites the man" can be parsed as follows:
| Output | Rule |
| --- | --- |
| Sentence | 1 |
| Noun_Phrase + Verb_Phrase | 3 |
| Article + Noun + Verb_Phrase | 8 |
| ‘The’ + Noun + Verb_Phrase | 10 |
| ‘The’ + ‘dog’ + Verb_Phrase | 6 |
| ‘The’ + ‘dog’ + Verb + Noun_Phrase | 12 |
| ‘The’ + ‘dog’ + ‘bites’ + Noun_Phrase | 3 |
| ‘The’ + ‘dog’ + ‘bites’ + Article + Noun | 8 |
| ‘The’ + ‘dog’ + ‘bites’ + ‘the’ + Noun | 9 |
| ‘The’ + ‘dog’ + ‘bites’ + ‘the’ + ‘man’ | |
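A derivation of this kind can be reproduced with a few rewrite rules and a recursive search that always expands the leftmost non-terminal. The grammar below is a hypothetical one written to match the shape of the table; it does not reproduce the numbered rules from Luger & Stubblefield, and the rule numbers are therefore omitted.

```python
# Hypothetical grammar: each non-terminal maps to its possible expansions.
GRAMMAR = {
    "Sentence": [["Noun_Phrase", "Verb_Phrase"]],
    "Noun_Phrase": [["Article", "Noun"]],
    "Verb_Phrase": [["Verb", "Noun_Phrase"]],
    "Article": [["the"]],
    "Noun": [["dog"], ["man"]],
    "Verb": [["bites"]],
}


def derive(symbols, words):
    """Return the steps of a leftmost derivation of words, or None."""
    for i, symbol in enumerate(symbols):
        if symbol in GRAMMAR:
            break  # found the leftmost non-terminal
        # Words already produced on the left must match the input.
        if i >= len(words) or symbol != words[i]:
            return None
    else:
        # Only words remain: succeed if they are exactly the input.
        return [symbols] if symbols == words else None
    for expansion in GRAMMAR[symbol]:
        rest = derive(symbols[:i] + expansion + symbols[i + 1:], words)
        if rest is not None:
            return [symbols] + rest
    return None


steps = derive(["Sentence"], "the dog bites the man".split())
for step in steps:
    print(" + ".join(step))
```

For this sentence the search prints ten lines, one per row of the table, with a wrong choice such as expanding the final Noun to "dog" being abandoned as soon as it fails to match the input.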
Squad helps dog bite victim
Man eating piranha mistakenly sold as pet fish
Juvenile court to try shooting victim
Women are requested not to have children in the bar
Dwarf seer escapes from jail - small medium at large
Lost small terrier de-sexed at Hungry Jack's
Stud tires out
Drunk gets nine months in violin case
Iraqi head seeks arms
Queen Mary having bottom scraped
Note: Queen Mary is the name of an ocean liner
Yoko Ono will talk about her husband John Lennon who was killed in an interview with Barbara Walters
Two cars were reported stolen by the Groverton police yesterday
We will sell gasoline to anyone in a glass container
For Sale: mixing bowl set designed to please a cook with round bottom for efficient beating
New housing for elderly not yet dead
Note: These samples were taken from "The Language Instinct", a book by Steven Pinker (1995) Australia Print Group: Maryborough
