Monday, October 22, 2007

Architectural Design of a Voice Browser System

Assignment 2
System Architectural Design
[Voice Browser System]
ID No : 0700165
==========================================================================
Description of the system

Voice Browser System - enables access to the web through spoken interaction.

Two types of clients are illustrated: telephony and data networking. The fundamental telephony client is, of course, the telephone, either wireline or wireless. The handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be tip/ring, T1, or a higher-level interface, and may include hybrid echo cancellation to remove line echoes so that ASR barge-in works over audio output. A speakerphone also requires an acoustic echo canceller to remove room echoes. The data network interface requires only acoustic echo cancellation, and only when used with an open microphone, since there is no line echo on data networks. The IP interface is shown for illustration only; other data transport mechanisms can be used as well.
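The echo cancellation rules above can be written down as a small decision function. This is only a sketch: the `Client` type, the transport labels "pstn" and "data", and the canceller names "line" and "acoustic" are illustrative assumptions, and it assumes barge-in is wanted on every telephony call.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Client:
    """Hypothetical description of a client; not part of any specification."""
    transport: str         # "pstn" for telephony, "data" for a data network
    open_microphone: bool  # True for a speakerphone or an open microphone


def required_echo_cancellers(client: Client) -> Set[str]:
    """Return the kinds of echo cancellation the client interface needs."""
    cancellers: Set[str] = set()
    if client.transport == "pstn":
        # Hybrid (line) echo cancellation; assumed here to always be enabled
        # so that barge-in works over audio output.
        cancellers.add("line")
    if client.open_microphone:
        # Room echoes occur whenever the microphone can hear the loudspeaker.
        cancellers.add("acoustic")
    return cancellers
```

For example, a PSTN speakerphone needs both cancellers, while a headset on a data network needs neither.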

Once data has passed the client interface, it can be processed in a similar manner regardless of the client type. One minor difference may be speech endpointing. For speech input from the telephony interface, endpointing will most likely be performed either in the telephony interface or at the front end of the ASR processor. For speech arriving via the IP interface, endpointing can be performed at the client as well as at the ASR front end. The choice of where endpointing occurs is coupled with the choice of where echo cancellation occurs.

It is currently not clear how non-speech data will be handled at the telephony interface. This can include inputs such as pointing device input from a "smart phone," address books and other client-resident file data, and eventually even data like video. These smart telephone devices are now on the drawing boards of many suppliers. Some of this traffic can be handled by WAP/WML, but there are still open issues with regard to multi-modality. Therefore, voice markup language specifications should provide a means of extending the language features.

Data from the ASR/DTMF (etc.) recognizer must be in a format compatible with the NL (Natural Language) interpreter. Typically this would be text, but might include non-textual components for pointing device input, in which case pointing coordinates can be associated with text and/or semantic tags. If the recognizer has detected valid input while output is still being presented, the recognizer can signal the presentation component to stop output. Barge-in may not be desirable for certain types of multi-media output, and should primarily be considered important for interrupting speech output. In some cases it may also be undesirable to interrupt speech output, such as in the processing of commands to change speaking volume or rate.
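The barge-in policy described above might be sketched as follows. The `Output` type, the command vocabulary, and the rule that only speech output is interruptible are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical handle on whatever the presentation component is playing."""
    kind: str             # "speech", or another media type such as "video"
    playing: bool = True


# Assumed command vocabulary; a real system would take this from its grammar.
VOLUME_RATE_COMMANDS = {"louder", "softer", "faster", "slower"}


def should_barge_in(output: Output, recognized_text: str) -> bool:
    """Decide whether valid recognizer input should stop the current output."""
    if not output.playing:
        return False  # nothing to interrupt
    if output.kind != "speech":
        return False  # simplification: only speech output is interrupted
    if recognized_text in VOLUME_RATE_COMMANDS:
        return False  # adjust volume or rate without cutting the speech off
    return True
```

With this policy, saying "louder" during a spoken prompt leaves the prompt playing, while any other valid utterance stops it.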

The recognizer can produce multiple outputs with associated confidence scores. The NL interpreter can also produce multiple interpretations. Interpreted NL output is coordinated with other modes of input that may require interpretation in the current NL context or may alter or augment the interpretation of the NL input. It is the responsibility of the multi-media integration module to produce one or more coordinated joint interpretations of the multi-modal input and present these to the dialog manager. Context information can also be shared with the dialog manager to further refine the interpretation, including resolution of anaphora and implied expressions. The dialog manager is responsible for the final selection of the best interpretation.
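As an illustration of the final selection step, the sketch below ranks candidate interpretations and picks one. The `Interpretation` fields and the scoring rule (recognizer confidence multiplied by an interpreter score) are assumptions; the text does not say how the dialog manager weighs its candidates, and a real one would also use dialog context.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical joint interpretation handed to the dialog manager."""
    text: str              # recognizer hypothesis
    meaning: str           # NL interpretation of that hypothesis
    asr_confidence: float  # recognizer confidence in [0, 1]
    nl_score: float        # interpreter plausibility in [0, 1]


def select_best(candidates: List[Interpretation]) -> Interpretation:
    """Pick the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("no interpretations to choose from")
    # Assumed scoring rule: product of the two scores.
    return max(candidates, key=lambda c: c.asr_confidence * c.nl_score)
```

Under this rule a hypothesis with slightly lower recognizer confidence can still win if its interpretation is much more plausible.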

The dialog manager is also responsible for responding to the input statement. This responsibility can include resolving ambiguity, issuing instructions and/or queries to the task manager, collecting output from the task manager, forming a natural language expression or visual presentation of the task manager output, and coordinating recognizer context.

The task manager is primarily an Application Program Interface (API), but can also include pragmatic and application-specific reasoning. The task manager can be an agent, or proxy, can possess state, and can communicate with other agents or proxies for services. The primary application interface for the task manager is expected to be web servers, but it can use other APIs as well.

Finally, the presentation manager, or output media "renderer," has responsibility for formatting multi-media output in a coordinated manner. The presentation manager should be aware of the client device capabilities.
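One way to picture capability awareness is a filter that drops output modalities the client cannot render. The modality names, capability names, and the mapping between them are illustrative assumptions, not part of the architecture described here.

```python
from typing import Dict, List, Set

# Hypothetical mapping from output modality to the device capability it needs.
REQUIRED_CAPABILITY: Dict[str, str] = {
    "speech": "audio",
    "text": "display",
    "image": "display",
}


def plan_presentation(content: Dict[str, str], capabilities: Set[str]) -> List[str]:
    """Return the modalities in `content` that the client device can render."""
    return [
        modality
        for modality in content
        # Unknown modalities map to None, which is never a capability.
        if REQUIRED_CAPABILITY.get(modality) in capabilities
    ]
```

A plain telephone would report only an audio capability, so text and image parts of the output would be left out and only speech rendered.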




