Advanced Technology Program

Project Brief


Open Competition 5 - Information Technology

A Phrase-Based Statistical Approach to Understanding and Translating Natural Language


Develop and demonstrate technologies that will enable accurate machine understanding of human languages by isolating statistically significant phrases and mapping equivalencies in their usage.

Sponsor: Sehda, Inc.

1040 Noel Drive
Suite 100
Menlo Park, CA 94025
  • Project Performance Period: 6/1/2002 - 5/31/2005
  • Total project (est.): $1,556,209.00
  • Requested ATP funds: $1,305,751.00

Machines are currently unable to fully "understand" human language. Highly restricted vocabularies of individual words may be recognized in a specific context, but overall the words are not understood the way humans understand them. New methods are needed to resolve semantic, syntactic, and even pragmatic ambiguities. Conventional approaches focusing on keywords, grammar rules, and simple probabilistic modeling appear to have reached their limits.
In a two-year project, Sehda plans to develop and demonstrate novel technologies, usable by anyone with or without specialized linguistic knowledge, to automate the understanding of text and spontaneous conversation by mapping equivalencies in the usage of phrases instead of focusing on the meaning of individual words. Sehda's approach is based on statistical modeling of human conversations. Research has shown that children learn their native language phrase by phrase rather than word by word; preliminary tests suggest this concept has promise for machine understanding as well as machine translation. The company's goal is to construct a network of equivalent phrases of conversational English using algorithmic procedures that automatically extract a significant number of phrases from text and organize them into semantically and syntactically equivalent classes. The same procedure will be applied to either French or Spanish, and a mapping between the phrase networks of the two languages will be used to produce valid translations.
The overall challenge is to build and validate a viable system despite the very large scale of natural language usage, and to verify the heuristics for measuring the closeness between phrase meanings. ATP support is needed because Sehda is a small company and the project is too risky for external private investors, who are wary of the limited success of other natural language translation systems.
If successfully developed and deployed, the new technology would provide the core language engine for a variety of applications in addition to language translation, such as speech recognition and data mining. Speech interfaces could be built quickly and inexpensively, companies could translate product information easily, and customer service costs could be reduced through the use of automatic question-answering systems. The technology would reduce the cost of developing a natural speech recognition system by an estimated 50 to 80 percent.
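
As a rough illustration of the two steps described above (extracting statistically significant phrases, then measuring how close two phrases are in usage), the sketch below may help. It is not Sehda's method, which this brief does not specify. It assumes pointwise mutual information over adjacent word pairs as the significance test and Jaccard overlap of neighboring words as the closeness heuristic. The function names, thresholds, and six-sentence corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only.
SENTENCES = [
    "thank you very much",
    "thank you for the help",
    "i would like the menu",
    "i would like a table",
    "could i have the menu",
    "could i have a table",
]

def extract_phrases(sentences, min_count=2, min_pmi=1.0):
    """Return {(w1, w2): pmi} for adjacent word pairs that occur at least
    min_count times and more often than chance predicts.
    PMI is an assumed stand-in for 'statistically significant'."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    phrases = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bi
        p_independent = (unigrams[w1] / total_uni) * (unigrams[w2] / total_uni)
        pmi = math.log2(p_pair / p_independent)
        if pmi >= min_pmi:
            phrases[(w1, w2)] = pmi
    return phrases

def phrase_contexts(sentences, phrases):
    """Collect the words seen immediately left and right of each phrase."""
    contexts = {phrase: set() for phrase in phrases}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair not in contexts:
                continue
            if i > 0:
                contexts[pair].add(("left", tokens[i - 1]))
            if i + 2 < len(tokens):
                contexts[pair].add(("right", tokens[i + 2]))
    return contexts

def closeness(ctx_a, ctx_b):
    """Jaccard overlap of two context sets (assumed closeness heuristic)."""
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0

def similar_pairs(contexts, threshold=0.5):
    """List phrase pairs whose usage contexts overlap at least threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(contexts), 2)
        if closeness(contexts[a], contexts[b]) >= threshold
    ]

phrases = extract_phrases(SENTENCES)
contexts = phrase_contexts(SENTENCES, phrases)
pairs = similar_pairs(contexts)
for a, b in pairs:
    print(" ".join(a), "~", " ".join(b))
```

On this toy corpus, "would like" and "i have" come out as close in usage because both are followed by "the" and "a". A real system would need far more data, longer phrases, and a validated closeness measure, which is the scale problem the brief describes.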

For project information:
Farzad Ehsani, (650) 328-8877
farzad@sehda.com

ATP Project Manager
Christopher Currens, (301) 975-8503
christopher.currens@nist.gov


ATP website comments: webmaster-atp@nist.gov
NIST is an agency of the U.S. Commerce Department