Toward the Realization of Typological Semantic Pattern Dictionaries for MT

Satoru Ikehara (ikehara@ike.tottori-u.ac.jp)
Faculty of Engineering, Tottori University, Japan

Invited talk, presented at ELSNET's MT Roadmap Workshop in conjunction with TMI 2002, Keihanna, Japan

1. Introduction

I would like to discuss the importance of "Semantic Analysis" and propose building a large-scale knowledge base for "Semantic Patterns". I think that the most important subjects in research on Natural Language Processing will be "Syntactic Analysis" and "Semantic Analysis". But these subjects, particularly "Semantic Analysis," seem to be avoided, and recent research tends to lean toward statistical processing, which uses a large-scale corpus.

Since the proposal of Chomsky's "Generative Grammar", I think, the role of mathematical methods has been overestimated, and natural languages have been separated from the human community. Natural language is not a natural phenomenon but a human intellectual product. It is deeply related to human life and community. We need to reconsider the concept of the scientific method in this field.

From the viewpoint of mathematics, natural language is classified as a non-linear phenomenon. In order to approximate it using linear models, we are studying a new method. The method works by obtaining the mapping of the whole through local mappings. We have also been developing a knowledge base for the meanings of Japanese expressions over the past ten years. Three years ago, a partial result was published, titled "A-Japanese-Lexicon".

Based on this experience, I would like to discuss the essentials of meanings of expressions, analysis methods for them and the necessity for semantic knowledge bases for linguistic expressions.

2. Past Research on Machine Translation and Meaning Processing

Now, let's begin with an overview of the current status of machine translation. In Japan, the race to develop machine translation systems began in the latter half of the 1980s, after the publication of the results of the "Mu-project," conducted by Professor Nagao. This race ended in 1993 with the well-known bursting of Japan's economic bubble.

The driving forces of this race were advances in computer hardware and support from the Japanese government, which at that time had the economic power for this kind of funding. Translation quality was improved by a fair degree, and the systems had the good fortune to become very popular in the field of "Internet Translations". However, they cannot satisfy users' expectations. Much research on analysis algorithms based on syntactic knowledge has been conducted to date. However, machine translation technologies seem to have reached a saturation point.

If we recall the days more than 30 years ago, the ALPAC committee in the United States raised three points about research on machine translation. The first was that computing power was not satisfactory. The second was that semantic processing methods needed to be studied. And the third was, therefore, that preparation of translation aids should take priority.

In the ensuing years, the first problem, the shortage of computer power, has been resolved by rapid progress in hardware technologies. The third problem, translation aids, is also being resolved by the realization of "Translation Memory" and many kinds of electronic dictionaries. However, the second problem, semantic processing, has not yet been solved. As to the reasons, we consider the following to be most important: First, there is no established theory for the definition of a meaning, so that the substance of semantic processing has not yet been clarified. The second reason is that a large-scale knowledge base seems to be necessary for the realization of semantic processing.

There were some attempts in the 1980s to address this problem. For example, the "Electronic Dictionary Research Company" developed a "Concept Dictionary" of 400,000 Japanese words. The "Information Processing Agency" compiled a new dictionary for Japanese basic verbs and nouns. In this dictionary, meanings of words are defined as finely as possible. Unfortunately, these dictionaries are not useful in resolving ambiguities in meanings.

Recently, the importance of lexical knowledge has been recognized, and attention has been paid to research on ontology. But no matter how detailed the collected information is, the semantic ambiguity of expressions cannot be resolved without decision knowledge about the relationship between meanings and usage.

3. Fundamental Problem in Natural Language Processing

In any case, the most important problem is the ambiguity of structures and meanings of expressions. As in all research, it is very important to clarify the fundamental characteristics of the research target. Particularly in research on natural languages, which are human intellectual products, the method of approach depends on one's perception of the nature of natural languages. Now, I would like to recall the fundamental characteristics of natural languages and discuss inevitable ambiguities. After that, we will discuss what is necessary to solve the ambiguities.

3.1. Linguistic Norms and their Characteristics

(1) Basic Nature of Natural Languages

What is the nature of natural languages? There are many theories and discussions about this problem, such as the concepts of symbolism and instrumentalism. I will start with the idea that natural language is an expression. Here, an expression can be defined as the medium of conveying a person's perception to the mind of another. In this definition, pictures and music are also classified as expressions.

In order to discriminate among these expressions, I will classify expressions into "Sensitive Expressions" and "Rational Expressions". "Sensitive Expressions" stand for expressions that appeal to the five human senses. This kind of expression does not need any promise in advance. On the other hand, "Rational Expressions" appeal to the rational mind. This kind of expression requires promises about the relationships between expressions and meanings. In this definition, pictures and music are classified into the former category and natural language is classified into the latter one.

Since a "Rational Expression" can be considered as a symbol, symbols that are used in such places as world atlases or road signs belong to "Rational Expressions". If we consider the differences between natural languages and these symbols, we can point out that, basically, the meanings of symbols in atlases and road signs need to be defined in advance of use. In contrast, symbols in natural language can be used without definitions beforehand. The reason is as follows. There exist social promises that can be called "linguistic norms", and a speaker's perception is related to the expression through these promises.

In the end, natural languages can be said to be "Rational Expressions" based on "Linguistic Norms". This is a very important aspect of natural languages.

(2) Characteristics of Linguistic Norms

Thus, it is important to clarify the characteristics of "Linguistic Norms". The most important feature is that a "Linguistic Norm" differs from the rules of nature. It is one of the "Social Conventions" that are spontaneously formed in a society with the same way of life.

Generally, the "convention" reflects how the world is seen based on the means of production and the way of life in the community, so that the "linguistic norm" also reflects them. Thus, the "linguistic norm" is individual to each community and natural language is also individual. It is impossible to design a universal language that will cover all of the natural languages in the world.

The "Pivot Method" has been proposed as a method for a multi-lingual translation system that translates natural languages through a universal language. But, in reality, no such system has been implemented. The actual systems use English or Japanese as a pivot.

(3) Semantic Conventions and Syntactic Conventions

Next, let's consider the contents of a "linguistic norm". As mentioned previously, the "linguistic norm" is the social promise that relates the speaker's perception to expressions. The promises can be classified into "Semantic Promise" and "Syntactic Promise", namely, a "Semantic Convention" and a "Syntactic Convention."

The "Semantic Convention" is a rule, such as those that are registered in a word dictionary. Usually, we say that meanings are written in a word dictionary. But, strictly speaking, what is written is not a meaning but a rule relating a word to its meanings. Here, we need to pay attention to the fact that no dictionary has been developed for the relationships between linguistic expressions and meanings. This is a regrettable problem.

On the other hand, a "Syntactic Convention" is a rule for the classification of parts of speech and word order in a sentence. Linguists have extracted this kind of rule as a feature common to the "Semantic Conventions".

Therefore, we can say that "Semantic Convention" is primary and "Syntactic Convention" is subsidiary. Conventional methods have been mainly based upon subsidiary rules, and primary rules have rarely been used. This is one of the most important problems.

3.2. Polysemy and Ambiguity of Linguistic Expressions

(1) Necessity of Polysemy

Now, let's consider polysemy, or multiplicity of meanings of linguistic expressions.

A linguistic expression is open-ended and full of variety. Expressions do not always represent only correct things. Society is not controlled by complete discipline. There are many kinds of conflicts and disorders. Consequently, human perception of them also has the same distortions. However, in a natural language, these perceptions need to be related to one-dimensional expressions such as character strings or waves of air. Many frameworks have been developed for linguistic expressions, but it is impossible to strictly relate human perception to expressions. Approximation is inevitable and, thus, polysemic expressions arise. When this happens, a listener needs to decide which rule is being used in the expression, using his own knowledge about the target world. And based on this decision, even supplementing information that is not described, he experiences for himself what the speaker has gone through.

I might add that these characteristics of a language make human mental life very rich. If polysemy were deleted from linguistic expressions, presentation ability would suffer serious losses and the natural language would become like an artificial language such as a "programming language". Therefore, the difference between a natural language and a compiler language is in whether there are polysemy phenomena or not. Thus, we can say that a processing method that does not deal with polysemy phenomena is not a method that will be effective for a natural language.

(2) Polysemy and Ambiguity

Let's differentiate polysemy from ambiguity. Polysemy means that an expression that has two or more semantic rules is used in an actual expression. On the other hand, ambiguity means that the rule that is used in the actual expression cannot be determined.

In reality, polysemy often does not matter for a human listener because it is handled in such a way as not to disturb the listener's comprehension. But it is a serious problem for the computer. The reason for ambiguity is a lack of information. Even if there is polysemy in an actual expression, there is no ambiguity if the computer has the information to resolve it. Conversely, if a computer does not have the information, no algorithm can resolve the ambiguities.

Thus, it is very important in natural language processing to make a plan to solve ambiguity. The target is ambiguity and the fighting power is information. In some cases, the necessary information can be supplemented by deeper analysis of expressions. Sometimes, however, there is no way to solve ambiguities other than supplying the information from the outside in the form of a dictionary or other knowledge base.

3.3. Meaning of Expression and Semantic Analysis

In the above, we have discussed the characteristics of natural languages and the necessity of solving the problem of ambiguity. In the following, we will consider what is the meaning of an expression and what is the role of semantic analysis.

(1) Meaning of Expression and Promise written in Dictionary

The linguistic representation and comprehension process is composed of four elements: "subject", "speaker's perception", "expression" and "listener's comprehension". These are called "linguistic substances". Depending on which of these substances is considered a meaning, conventional semantic theories can be classified into four groups. For example, Searle's semantics is classified into "Recognition Semantics" and Grice's semantics is "Interpretation Semantics".

Differing from these conventional semantics, I will propose a new semantics: that the meaning of an expression is a corresponding relationship between an expression and a speaker's perception. Usually, a relationship can be represented by a bidirectional pointer in a computer. This pointer is tied to an actual expression. However, the meaning in a dictionary is not connected to an actual expression, so it is not a real meaning but a candidate for the meaning. When there are two or more candidates registered in the dictionary, ambiguities arise.

(2) Semantic Analysis and Semantic Dictionary

Let's consider comprehension by a computer. Comprehension can be thought of as experiencing for oneself what another person has gone through. This process includes the process that determines the semantic convention used in the actual expression. Focusing on this process, I divide the comprehension process into two steps.

The first step is "Semantic Analysis," where the convention used in the expression is determined. The second step is "Semantic Understanding" where the speaker's perception is reproduced in a computer.

Let me show you some examples of the first step. Fig.1 [omitted here] represents the relationships between expressions and the semantic conventions used in them. There are two sentences whose syntactic structures are the same. The translation of the first sentence is "She tied an expensive present with a ribbon", and that of the second is "She sat down on a high place."

The adjective "takai" is used in both sentences. This word can be used with many meanings, as defined in the dictionary. Out of these, it is used with the meaning of "expensive" in the first sentence. On the other hand, it is used with the meaning of "high" in the second sentence. Similarly, the verb "kakeru" is used with the meaning of "tie" in the first sentence, but it is used in combination with the noun "koshiwo" with the meaning of "sit down" in the second sentence.

In semantic analysis, no problem arises when only one rule corresponds to the actual expression. Problems arise only when two or more conventions correspond to the expression that is used. But, as you see, this case is quite common. We must also notice the fact that there are many conventions for expressions as well as for words. Until now, very little attention has been paid to the semantic conventions surrounding an expression. That's a problem.

Fig.1 Example of Semantic Analysis for Japanese Expressions [omitted here]

But, I think, this problem can be solved if we compile the application conditions of semantic conventions. However, "semantic understanding" is not easy for general use, because it requires common sense and world knowledge. Next, limiting the aim to the realization of "semantic analysis", I will discuss the knowledge base for it. This knowledge base will undoubtedly grow to a huge scale, and compilation in the form of a dictionary is expected.

4. Fundamental Conditions for Compiling a Semantic Dictionary

Now, let's proceed to the discussion of the contents and construction of a "Semantic Dictionary" required for semantic analysis.

4.1. Conditions for Dictionary Construction

The first subject is the conditions for constructing a semantic dictionary. In the following, I will discuss four conditions: "unit of semantic convention", "granularity of a meaning", "description of meaning" and "application condition of rules".

(1) Unit of Semantic Convention

The first is about "Unit of semantic convention". If we classify the semantic conventions into two categories, one for each word and one for an expression composed of two or more words, the latter one is more important. If we decompose the expression into words, many times, the original meaning will be lost. I think this is the most important problem that we encounter in natural language processing.

In this field, as you know, linguistic expressions are assumed to be linear, based on "Frege's Principle". This principle is the same as "Compositional Semantics" and "Reductionism". Most of conventional natural science has been constructed based on this principle. However, in many cases, this assumption does not hold for natural languages.

The unit of a meaning does not always correspond to individual words. There are many conventions that relate an expression of two or more words to a meaning. For example, take the case where one unit of meaning is represented by an expression of three words, the meaning of which cannot be separated into the meaning of each word nor into two binomial expressions. The whole expression must be registered in the dictionary. Thus it is very important to find the units of meaning in expressions and compile them into the dictionary.
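The idea of registering a whole expression as one unit of meaning can be sketched as a longest-match lookup. This is only an illustrative sketch, not the method used in our dictionary: the table EXPRESSIONS, its glosses, the splitting of "koshiwo" into "koshi" and "wo", and the function segment are all assumptions made for this example.

```python
# Hypothetical expression dictionary (illustrative entries only).
# Keys are word sequences; values are English glosses.
EXPRESSIONS = {
    ("koshi", "wo", "kakeru"): "sit down",  # three words, one unit of meaning
    ("kakeru",): "tie",
    ("koshi",): "waist",
    ("wo",): "<object marker>",
}


def segment(words):
    """Split a word list into units of meaning, preferring the longest entry."""
    max_len = max(len(key) for key in EXPRESSIONS)
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in EXPRESSIONS:
                units.append((key, EXPRESSIONS[key]))
                i += n
                break
        else:
            # Word not registered: keep it as a unit without a gloss.
            units.append(((words[i],), None))
            i += 1
    return units
```

With these toy entries, segment(["koshi", "wo", "kakeru"]) returns the single unit "sit down" and never reaches the word-level entries for "koshi" and "kakeru", whose glosses could not be combined into the same meaning.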

(2) Granularity of Meaning

The next concerns the granularity of a meaning. The granularity of a word meaning is relative to each language. For example, Eskimo has several hundred words to differentiate snow and ice. On the other hand, Arabic has the same number of words to represent the states of sand.

Therefore, it is important to define a meaning depending on the target of semantic analysis. Here, let's consider the case of Japanese to English machine translation. In this case, it is sufficient that the meanings of Japanese expressions can be mapped to English expressions. A more detailed decomposition is not required. This is the condition for the granularity of meanings in semantic analysis.

(3) Description of Meanings

The third is the principle of "the meaning of the meaning". In a dictionary for human use, meanings of a word or an expression are written in natural language. But a computer cannot understand this description. To the computer, any description is nothing more than a symbol, so that any description will do in so far as it is systematically defined. Therefore, we describe the meanings of Japanese words and expressions by English expressions. This is easy and convenient for designing a translation system.

(4) Application Conditions of Rules

The last condition is about "application conditions of rules". If the meanings of a word and examples are given in a dictionary, a human can understand them and skilfully use the defined word or expression in an actual sentence. But, a computer has no such ability. Not only the definition of word senses, but also an activating framework for it is required.

Here, the framework is a system that defines the application conditions. In this system, the relationships between the meanings of expressions and their usage need to be exclusively defined.
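As a minimal sketch of such exclusive application conditions, consider the adjective "takai" discussed earlier. The noun romanizations, the attribute names "goods" and "place", and the function select_gloss are assumptions made for this illustration only; they are not taken from the actual dictionary.

```python
# Hypothetical semantic attributes of nouns (illustrative only).
NOUN_ATTRIBUTE = {"purezento": "goods", "tokoro": "place"}

# Hypothetical rules for the adjective "takai":
# (required attribute of the modified noun, English gloss).
TAKAI_RULES = [("goods", "expensive"), ("place", "high")]


def select_gloss(rules, noun):
    """Return the gloss of the one rule whose condition the noun satisfies."""
    attribute = NOUN_ATTRIBUTE.get(noun)
    matches = [gloss for required, gloss in rules if required == attribute]
    if len(matches) != 1:
        # Zero matches: information is missing. Two or more: the conditions
        # are not exclusive. Either way the ambiguity is left unresolved.
        raise LookupError(f"{len(matches)} rules apply to {noun!r}")
    return matches[0]
```

Under these assumptions, select_gloss(TAKAI_RULES, "purezento") yields "expensive" and select_gloss(TAKAI_RULES, "tokoro") yields "high". A noun with no registered attribute raises an error, which reflects the point that no algorithm can resolve the ambiguity when the information is not supplied.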

4.2. Definition of Meaning and Usage

Based on the above 4 conditions, let's consider how to describe the "Semantic Conventions".

(1) Definition by Semantic Pattern

I said before that if we recklessly decompose an expression into small pieces, the original meaning cannot be reconstructed. The strict way to avoid this risk in machine translation would be to define target expressions for all of the original expressions. However, as in the Example-Based Translation Method, it is impossible to realize such a method because of the infinity of linguistic expressions. On the other hand, if we take the generality of rules seriously and pursue only the conventional method of syntactic similarity, the meanings will be lost. This is a conflict well known as the relationship of individuality to generality.

This is one conflict, but not necessarily a hostile one. It can be harmonized. This kind of problem has been solved in engineering. In the field of engineering, many inventions have been achieved by resolving such a conflict from a higher point of view. The same can be expected in natural language processing, too. Here, we propose the method of typological "Semantic Pattern" in order to define the "Semantic Conventions" and their usage.

The importance of a "Semantic Pattern" in the perception process has long been indicated. "Theory of meaning and form" says that a "grasping process for the world" involves form. Hegel pointed out "the analogy grasped by intuition" in "Philosophy of Nature". I think a "Semantic Pattern" is much more suitable for human thinking than "Phrase Structure Grammar". It should be re-evaluated.

Now let's talk about how to construct a pattern. Generally, a linguistic expression is composed of "Objective Expressions" and "Subjective Expressions". An "Objective expression" represents the speaker's conceptualized perception of the contents. Nouns and verbs, namely independent words, are used for this expression in Japanese. On the other hand, a "Subjective Expression" represents un-conceptualized perception of the speaker's emotion, intention, decision and feeling. Appendant words such as particles and auxiliary verbs are used for this expression. This relationship was found by a Japanese linguist, Norinaga Motoori, more than 200 years ago. But the same idea was proposed by the Port Royal grammarians in France 300 years ago.

In Japanese, an "Objective Expression" is followed by a "Subjective Expression" to form a unit of an expression. And this unit is nested to build an embedded expression structure. I think this structure represents the way of perception of Japanese people. Taking notice of this point, we represent the form of the speaker's perception by a "Semantic Pattern".

(2) Classification of Semantic Pattern

The number of appendant words is limited to a relatively small number. Compared to appendant words, the number of independent words is huge. Thus I think it is better to compose a semantic pattern from variables for independent words and literals for appendant words. As for variables, any of "Syntactic Attributes", "Semantic Attributes" and literals can be used.
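A pattern built from variables and literals might be represented as follows. This is a sketch under stated assumptions: the classes Var and Lit, the table ATTRIBUTE, the pattern KAKERU_PATTERN and the function match are hypothetical names introduced here, and the verb is written as a literal only to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str        # variable name, e.g. "N1"
    attribute: str   # semantic attribute required of the independent word


@dataclass(frozen=True)
class Lit:
    text: str        # fixed word, typically an appendant word such as a particle


# Hypothetical word -> semantic attribute table (illustrative only).
ATTRIBUTE = {"kanojo": "person", "ribon": "goods", "purezento": "goods"}

# Hypothetical pattern: N1(person) ga N2(goods) wo kakeru
KAKERU_PATTERN = [Var("N1", "person"), Lit("ga"), Var("N2", "goods"),
                  Lit("wo"), Lit("kakeru")]


def match(pattern, words):
    """Return variable bindings if the words fit the pattern, else None."""
    if len(pattern) != len(words):
        return None
    bindings = {}
    for element, word in zip(pattern, words):
        if isinstance(element, Lit):
            if element.text != word:
                return None
        elif ATTRIBUTE.get(word) != element.attribute:
            return None
        else:
            bindings[element.name] = word
    return bindings
```

Because the literals are few and the variables are constrained only by attributes, one such pattern covers every sentence whose independent words carry the required attributes.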

Here, independent words can be classified into declinable words such as verbs and adjectives and indeclinable words such as nouns. Taking the usage of these two kinds of words into consideration, we classify semantic patterns into three groups as follows.

(1) Expression structures for simple sentences

(2) Expression structures for complex sentences

(3) Expression structures for noun phrases.

"Valency pattern" is suitable to write a pattern for the first group. Valency pattern is similar to "Case Frame" but not the same. The difference is as follows. "Case frame" assumes the existence of deep structure as introduced by Chomsky. On the contrary, "Valency Pattern" does not assume it. I think "Valency pattern" is much more practical for actual languages.

5. Current Status of Semantic Dictionary Development

We decided to develop a semantic dictionary for Japanese expressions about 15 years ago. At this time, most of the semantic patterns for the first group, namely simple sentences, have been gathered and compiled into a knowledge base. An outline of this dictionary is shown in Table 1 [omitted here]. This dictionary is composed of three parts: "Semantic Attribute System" for words, "Semantic Word Dictionary" and "Semantic Structure Dictionary".

Table 1. Construction of Semantic Pattern Dictionary [omitted here]

"Semantic Attribute System" has 3,000 categories in total. The semantic relationships among these categories are defined by a tree structure. It is a system of description words used for defining "Valency patterns". "Semantic Attribute" is also used to define the "semantic usage" of words. Here, "semantic usage" means the way words are used. For example, the English noun "school" is used with many meanings, such as place, association, building and so on. If we say, "We take refuge at the school," the noun "school" is used with the meaning of a place. In the sentence, "The school is on fire," the noun "school" is used with the meaning of "building." Here, "place" and "building" are the "semantic usages" of the noun "school." This idea is basically different from the conventional concepts of "semantic features" or "case marker."

Next, "Semantic Word Dictionary" defines the semantic usage for 400,000 Japanese words. It is very important to notice that many words have two or more "semantic usages" in this dictionary. This made it possible to differentiate the meanings of a noun.
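The combination of an attribute tree and multiple semantic usages per word can be sketched as follows. The tree fragment PARENT, the table USAGES and both functions are assumptions made for illustration; the real system has 3,000 categories, and its actual category names are not reproduced here.

```python
# Hypothetical fragment of an attribute tree, stored as child -> parent.
PARENT = {
    "place": "concrete",
    "building": "concrete",
    "association": "abstract",
    "concrete": "noun",
    "abstract": "noun",
    "noun": None,
}

# Hypothetical semantic usages of a word (the "school" example above).
USAGES = {"school": {"place", "association", "building"}}


def is_a(category, ancestor):
    """True if category equals ancestor or lies below it in the tree."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False


def usages_under(word, required):
    """Usages of the word that satisfy a required semantic attribute."""
    return sorted(u for u in USAGES.get(word, ()) if is_a(u, required))
```

If a pattern slot requires "building", usages_under("school", "building") leaves exactly one usage, so the meaning of the noun is determined. If the slot only requires "concrete", both "building" and "place" remain and further information is needed.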

"Semantic Structure Dictionary" contains 16,000 semantic patterns for 6,000 verbs. In other words, many verbs have two or more sentence patterns. This made it possible to differentiate the meanings of a verb. In this dictionary, English Semantic Patterns are also defined in relation to Japanese "Semantic Patterns".

Table 2. Translation of Japanese verb "kakeru" [omitted here]

This dictionary was recompiled for human use and published by Iwanami Publisher in 1997. A CD-ROM version was also published in 1999.

Many kinds of semantic analysis have come to be realized by using this dictionary. The most prominent effect is correct translations of Japanese verbs and adjectives. Table 2 [omitted here] shows some examples of the translations of the Japanese verb "kakeru". The dictionary has 101 "Semantic Patterns" for this verb. It is translated in about 30 ways in this table. The verb that has the most diverse meanings is "suru." This verb roughly corresponds to the English verb "do." In our dictionary, 319 patterns are defined for this verb.
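How paired patterns select the translation of a verb can be sketched with two toy patterns for "kakeru". Everything below is an assumption for illustration: the romanized words, the attribute and gloss tables, the two pattern pairs, and the uninflected English output are not entries from the actual dictionary.

```python
# Hypothetical word tables (illustrative only).
ATTRIBUTE = {"kanojo": "person", "purezento": "goods", "ribon": "goods"}
GLOSS = {"kanojo": "she", "purezento": "a present", "ribon": "a ribbon"}

# Each entry pairs a Japanese pattern with an English pattern.
# In the Japanese pattern, a (name, attribute) tuple is a variable
# and a plain string is a literal.
PATTERN_PAIRS = [
    ([("N1", "person"), "ga", "koshi", "wo", "kakeru"],
     "{N1} sit down"),
    ([("N1", "person"), "ga", ("N2", "goods"), "ni", ("N3", "goods"), "wo", "kakeru"],
     "{N1} tie {N2} with {N3}"),
]


def bind(pattern, words):
    """Return English glosses for the variables, or None if no match."""
    if len(pattern) != len(words):
        return None
    bindings = {}
    for slot, word in zip(pattern, words):
        if isinstance(slot, str):
            if slot != word:
                return None
        else:
            name, attribute = slot
            if ATTRIBUTE.get(word) != attribute:
                return None
            bindings[name] = GLOSS[word]
    return bindings


def translate(words):
    """Fill the English pattern of the first Japanese pattern that matches."""
    for japanese, english in PATTERN_PAIRS:
        bindings = bind(japanese, words)
        if bindings is not None:
            return english.format(**bindings)
    return None
```

In this sketch the verb is never translated by itself. Which English verb appears depends on which whole pattern matches, and that is how a single verb can come to have many translations.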

This dictionary can be used for semantic analysis of nouns as well as verbs. It has already been used for dependency analysis as well. In this case, semantic analysis is used in syntactic analysis, so that it is difficult to separate syntactic analysis and semantic analysis.

This dictionary can be used for Kana-Kanji Translation, Voice Recognition and Information Retrieval. In the case of Information Retrieval, semantic retrieval methods and the Vector Space Method are studied.

6. Further Problems

(1) Remaining Subjects

As mentioned above, the current dictionary is partial and not yet completed. About 10,000 patterns of simple sentences still need to be added to this dictionary. There are many points to be improved. Much more important, though, is to start developing the other two kinds of dictionaries, that is, the knowledge bases for complex sentences and for noun phrases.

From our experience, 100,000 semantic patterns will be almost sufficient to cover Japanese expressions. Hearing this, some of you will say "what a large number," and others will say "what a small number." In any case, it is not impossible to compile this number of patterns. I hope that every nation will start to build such a knowledge base early in this century. I think that the compilation of this dictionary will create dramatic changes in natural language processing.

(2) Limitations of Semantic Dictionary

In order to overcome the limitations of conventional syntactic analysis, I proposed to build a Semantic Dictionary. Here, we need to talk about the limitations of Semantic Dictionaries. No truth is complete; every truth is partial. It changes into falsehood beyond its application domain. Semantic analysis realized by using a semantic dictionary also has limitations. These are related to the human ability to generate new concepts.

Since natural language is based on "Social Conventions", capability of expression is limited. New concepts that cannot be represented by conventional rules are continuously generated in human life. In these cases, the framework of metaphor is used. I think that this makes a natural language open-ended. If the same metaphor is repeated many times in the expression, it will be taken into the linguistic conventions to expand the language. Thus I would consider metaphor not to be an exceptional expression but a fundamental function in natural language.

However, the semantic dictionary proposed in my talk does not have a function to cover metaphor phenomena. Lakoff's idea seems very useful for attacking this problem. In the next step, we want to deal with this problem.

7. New Direction of Artificial Intelligence

Time is running out. One more point only: I would like to discuss the nonlinear phenomena of linguistic expressions and their relation to artificial intelligence.

It is difficult for a computer to realize human behaviour even if it is easy and unconscious for humans. In machine translations, we often experience that expressions easy even to a child are difficult to translate. This is because these expressions strongly reflect a Japanese way of thinking.

The limitation of the conventional research on Artificial Intelligence seems to be caused by the "reductive methodology" of modern science. Recently, research on Complexity has become very popular. The fundamental features of a complex system can also be seen in a natural language.

The first is the non-linear feature of the relationship between expressions and their meanings. To say nothing of idiomatic expressions, Frege's principle does not hold in many expressions. This means that it is important to consider the semantic interactions among words as a many-body problem. This relates to the principle of the reciprocal relationship between quality and quantity. We often experience the fact that an algorithm validated on some small system is of no help in solving actual problems.

The second is the self-organization of rules. Spontaneous generation of the linguistic norm means that the linguistic norm is self-organized through the mutual interaction of human activity. The same phenomenon can also be seen in the process of language acquisition. In this process, the circuit that recognizes the linguistic norm is self-organized in the learner's brain through interaction with the outside.

The third is that language is an open system of non-equilibrium and evolves by itself. Languages are always exposed to perceptions that cannot be represented by conventional frameworks. New concepts are usually represented using analogy or metaphor. However, the frequent use of such expressions makes new rules. When analogy and metaphor are used, the entropy of expressions increases for a time. However, after new words or expressions are generated by self-organization, it decreases. Thus, the evolution of languages can be thought of as a process of decreasing entropy.

The above-mentioned method based on the "Semantic Dictionary" aims to extract many nonlinear parts as units of an expression and to reconstruct the total expression, approximating by combining these units of expressions. Many difficulties are foreseen in the research on this complex system. We expect that our new approach will contribute to the research on this problem.

8. Concluding Remarks

My presentation can be summarized in the following 3 points:

The first is that one of the most important problems in natural language processing lies in the ambiguities of structures and meanings of expressions. The second is that we need to realize the full scale of semantic analysis. The third is that such a semantic analysis requires large-scale "Semantic Dictionaries" for linguistic expressions.

The problem of semantic analysis has been shelved for about half a century. As seen in the research by Winograd, the importance of knowledge in understanding languages was raised by the field of artificial intelligence. However, no response to this problem came from the field of natural language processing.

The problem is troublesome and research tends to take the easy way. However, as long as languages convey meaning, the analysis of their meanings is inevitable. I wish to start the study early in the new century.