Grammars and Speech Recognition
Speech recognizers use grammars to determine what to listen for; a grammar describes the utterances a user may say. Starting with VoiceXML 2.0, the W3C requires all VoiceXML platforms to support at least one common format, the XML Form of the W3C Speech Recognition Grammar Specification (SRGS). Plum implements the SRGS+XML grammar format for both voice and DTMF grammars, as well as the JSpeech Grammar Format (JSGF). Refer to the W3C Speech Recognition Grammar Specification or the JSGF Specification for further detail.
SRGS+XML
To use an SRGS+XML grammar (the XML form of SRGS, as opposed to its ABNF form), the type must be set explicitly on each grammar tag: set the “type” attribute of the grammar tag to “application/srgs+xml”. An SRGS+XML grammar must also include a root attribute.
The following tags are supported by Plum DEV for both inline and external grammars:
Tag | Description |
<grammar> | Root element of an XML grammar |
<meta> | Header declaration of meta content of an HTTP equivalent |
<metadata> | Header declaration of XML metadata content |
<lexicon> | Header declaration of a pronunciation lexicon |
<rule> | Declare a named rule expansion of a grammar |
<token> | Define a word or other entity that may serve as input |
<ruleref> | Refer to a rule defined locally or externally |
<item> | Define an expansion with optional repeating and probability |
<one-of> | Define a set of alternative rule expansions |
<example> | Element contained within a rule definition that provides an example of input that matches the rule |
<tag> | Define an arbitrary string, included inline in an expansion, that may be used for semantic interpretation |
Below is an example of how an SRGS+XML grammar can be implemented:
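As a minimal sketch, an inline SRGS+XML voice grammar inside a field might look like the following. The field name pet, the rule id ROOT, and the words dog and cat are illustrative assumptions rather than part of Plum's documentation; the type and root attributes are the ones the text above requires.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="pet">
      <!-- type must be set explicitly; root names the rule to start from -->
      <grammar type="application/srgs+xml" root="ROOT" mode="voice">
        <rule id="ROOT">
          <one-of>
            <item>dog</item>
            <item>cat</item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Say dog or cat.</prompt>
      <filled>
        <prompt>You said <value expr="pet"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Here the <one-of> element lists the alternatives, and each <item> is one phrase the recognizer listens for.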
JSGF
The JSpeech Grammar Format (JSGF) is a platform-independent, vendor-independent textual representation of grammars for use in speech recognition. JSGF adopts the style and conventions of the Java™ Programming Language in addition to the use of traditional grammar notations.
The JSGF grammar syntax is the default syntax for Plum DEV. JSGF can be used to specify speech and DTMF grammars. Refer to the JSGF Specification for further detail.
Because JSGF is not part of the VoiceXML 2.0 specification, its integration into Plum DEV is based on the VoiceXML 1.0 specification. To explicitly specify a JSGF grammar, set the “type” attribute of the grammar tag to “application/x-jsgf”.
Finally, no distinction is made between DTMF grammars and normal speech grammars now that the VoiceXML 1.0 <dtmf> tag has been removed. To have DTMF digits recognized, specify numerals in the grammar. For example, “(1|2|3|4)+” matches any sequence of one or more DTMF digits from one through four. The # and * symbols must be enclosed in double quotes. Spoken digits are specified in grammars as written-out words. For example, “(one|two|three|four)+” matches any sequence of one or more of the spoken words one through four.
For example, this grammar matches any number of 1s or 2s entered on the phone keypad:
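A minimal sketch of such a grammar, assuming that an inline JSGF grammar may contain a bare expansion in the same notation as the “(1|2|3|4)+” example above. The field name keys and the prompt wording are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="keys">
      <!-- One or more keypresses, each either 1 or 2 -->
      <grammar type="application/x-jsgf" mode="dtmf">( 1 | 2 )+</grammar>
      <prompt>Press any combination of one and two.</prompt>
      <filled>
        <prompt>You entered <value expr="keys"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The trailing + is what allows repetition; without it the grammar would match a single keypress only.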
SRGS+XML Tag Format
With the SRGS+XML grammar format, you can attach information to an <item> so that it is returned when that <item> is matched. This is shown in the following example:
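A hedged sketch of a grammar with tags, using the Dog, Cat, and Turtle values described in this section. SRGS leaves the content of <tag> to the semantic interpretation processor, so the assignment syntax shown inside <tag> is an assumption, as is reading the result through pet.Name (modeled on the [variablename].city pattern this document describes for uscitystate). The field name pet is hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="pet">
      <grammar type="application/srgs+xml" root="ROOT" mode="voice">
        <rule id="ROOT">
          <one-of>
            <!-- Each <tag> links three attributes to its <item>;
                 the syntax inside <tag> is an assumption -->
            <item>dog <tag>Name='Lassie'; Age='5'; Color='brown'</tag></item>
            <item>cat <tag>Name='Garfield'; Age='7'; Color='orange'</tag></item>
            <item>turtle <tag>Name='Franklin'; Age='10'; Color='green'</tag></item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Say dog, cat, or turtle.</prompt>
      <filled>
        <prompt>
          <value expr="pet.Name"/> is <value expr="pet.Age"/> years old
          and is <value expr="pet.Color"/>.
        </prompt>
      </filled>
    </field>
  </form>
</vxml>
```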
In this example, the grammar lets the user say one of three choices: Dog, Cat, or Turtle. The grammar uses the <rule>, <one-of>, <item>, and <tag> tags. The <tag> inside each <item> links information to that <item>. For “Dog”, “Name” is “Lassie”, “Age” is “5”, and “Color” is “brown”. For “Cat”, “Name” is “Garfield”, “Age” is “7”, and “Color” is “orange”. For “Turtle”, “Name” is “Franklin”, “Age” is “10”, and “Color” is “green”.
JSGF Tag Format
With the JSGF grammar format, tags use the following format:
<rule> = <action> {tag in here};
For example:
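A minimal sketch, assuming an inline JSGF grammar may hold a bare expansion and that a tag in braces applies to the group that precedes it. The synonyms follow the description in this section; the field name pet and the prompts are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="pet">
      <grammar type="application/x-jsgf" mode="voice">
        ( dog | puppy ) {Dog} | ( cat | kitten | kitty ) {Cat} | turtle {Turtle}
      </grammar>
      <prompt>Which pet would you like?</prompt>
      <filled>
        <!-- pet holds the tag (Dog, Cat, or Turtle), not the spoken word -->
        <prompt>You chose <value expr="pet"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```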
In this example, the grammar lets the user pick one of three choices: Dog, Cat, or Turtle. If the user says “Puppy” or “Dog”, the corresponding tag is “Dog”. If the user says “Cat”, “Kitten”, or “Kitty”, the corresponding tag is “Cat”.
With JSGF, you cannot link additional information to a tag. Unlike SRGS+XML, “Dog” cannot carry attributes such as “Name”, “Age”, and “Color”. You also cannot use ECMAScript with JSGF. And although JSGF tags are much simpler to specify, JSGF grammars perform worse than SRGS+XML grammars.
In short, JSGF has the simpler tag format but should only be used for small grammars. For medium to large grammars, use SRGS+XML.
Form Interpretation Algorithm
The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose <filled> action are constructed from the <choice> elements.
The FIA must handle:
Form initialization.
Prompting, including the management of the prompt counters needed for prompt tapering.
Grammar activation and deactivation at the form and form item levels.
Entering the form with an utterance that matched one of the form's document-scoped grammars while the user was visiting a different form or menu.
Leaving the form because the user matched another form, menu, or link's document-scoped grammar.
Processing multiple field fills from one utterance, including the execution of the relevant <filled> actions.
Selecting the next form item to visit, and then processing that form item.
Choosing the correct catch element to handle any events thrown while processing a form item.
The main loop of the FIA has three phases:
The select phase: the next unfilled form item is selected for visiting.
The collect phase: the selected form item is visited, which prompts the user for input, enables the appropriate grammars, and then waits for and collects an input (such as a spoken phrase or DTMF key presses) or an event (such as a request for help or a noinput timeout).
The process phase: an input is processed by filling form items and executing <filled> elements to perform actions such as input validation. An event is processed by executing the appropriate event handler for that event type.
Note that the FIA may be given an input (a set of grammar slot/slot value pairs) that was collected while the user was in a different form's FIA. In this case the first iteration of the main loop skips the select and collect phases, and goes right to the process phase with that input. Also note that if an error occurs in the select or collect phase that causes an event to be generated, the event is thrown and the FIA moves directly into the process phase.
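The walkthrough that follows refers to a food-court ordering form. As a hedged sketch, a form consistent with that walkthrough could look like this. Only the go_ahead field name, its modal setting, and the prompt wording come from this document; the names food_court, start, and main_course, the grammar contents, and the assignment syntax inside <tag> are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="food_court">
    <!-- Form-level grammar, needed for <initial>. The slot assignment
         syntax inside <tag> is an assumption. -->
    <grammar type="application/srgs+xml" root="ROOT" mode="voice">
      <rule id="ROOT">
        <one-of>
          <item>chicken <tag>main_course='chicken'</tag></item>
          <item>fish <tag>main_course='fish'</tag></item>
        </one-of>
      </rule>
    </grammar>

    <!-- Visited first; bargein is off so every caller hears the reminder -->
    <block>
      <prompt bargein="false">
        Welcome to the Food Court. Remember to donate to charity!
      </prompt>
    </block>

    <initial name="start">
      <prompt>
        Please order a main course. The main courses are chicken or fish.
        Please say your order now.
      </prompt>
      <!-- First and second silence: repeat the prompt -->
      <noinput>
        <reprompt/>
      </noinput>
      <!-- Third silence: mark <initial> as done so the FIA selects the field -->
      <noinput count="3">
        <assign name="start" expr="true"/>
        <reprompt/>
      </noinput>
    </initial>

    <field name="main_course">
      <grammar type="application/x-jsgf" mode="voice">chicken | fish</grammar>
      <prompt>Please choose a main course. Chicken or fish?</prompt>
    </field>

    <!-- modal="true": only this field's boolean grammar is active -->
    <field name="go_ahead" type="boolean" modal="true">
      <prompt>Did you order the <value expr="main_course"/>?</prompt>
    </field>

    <filled mode="all" namelist="main_course go_ahead">
      <if cond="go_ahead">
        <prompt>
          You ordered the <value expr="main_course"/>.
          Remember to donate to charity!
        </prompt>
      </if>
      <!-- Reset every form item so the dialog starts over -->
      <clear/>
    </filled>
  </form>
</vxml>
```

Each pass through the select, collect, and process phases visits one form item: the <block>, then <initial>, then whichever <field> is still unfilled.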
In this example, the FIA first visits the <block> form item to welcome the user. Next, it visits the <initial> form item and prompts the user to choose a main course. If the user still says nothing after two reprompts, the FIA moves on to the <field> form item, which prompts for a main course again. After the user chooses a main course, the FIA visits the next <field> form item, “go_ahead”. If the user confirms the choice, the FIA executes the <filled> element, which tells the user what was ordered. A dialog for an inexperienced user might be:
Computer: Welcome to the Food Court. Remember to donate to charity!
Computer: Please order a main course. The main courses are chicken or fish. Please say your order now.
Human: (says nothing)
Computer: Please order a main course. The main courses are chicken or fish. Please say your order now.
Human: (says nothing)
Computer: Please order a main course. The main courses are chicken or fish. Please say your order now.
Human: (says nothing)
Computer: Please choose a main course. Chicken or fish?
Human: Fish.
Computer: Did you order the fish?
Human: Yes.
Computer: You ordered the fish. Remember to donate to charity!
Computer: (starts over again for a new customer)
For someone more experienced, the go_ahead field allows the user to speed up the dialog process (but the user still has to hear the reminder to donate to charity). The go_ahead field has its modal attribute set to true. This disables all grammars except the ones defined in the current form item, so the only grammar active during this field is the built-in boolean grammar. A dialog for an experienced user might be:
Computer: Welcome to the Food Court. Remember to donate to charity!
Computer: P…
Human: Chicken.
Computer: D…
Human: Yes.
Computer: You ordered the chicken. Remember to donate to charity!
Computer: (starts over again for a new customer)
Mixed Initiative Forms
To make a form mixed initiative, where both the computer and the human direct the conversation, it must have one or more form-level grammars. The dialog may be written in several ways. One common authoring style combines an <initial> element that prompts for a general response with <field> elements that prompt for specific information. More complex techniques, such as using the 'cond' attribute on <field> elements, may achieve a similar effect.
If a form has form-level grammars:
Its input items can be filled in any order.
More than one input item can be filled as a result of a single user utterance.
For example:
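A hedged sketch of a mixed initiative form in which one utterance fills three fields. The field names name, age, and color and the prompt “Say either dog or cat.” come from this section; the form id pet_info, the item name start, the slot values, and the assignment syntax inside <tag> are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="pet_info">
    <!-- Form-level grammar: each <tag> sets one slot per field.
         The syntax inside <tag> is an assumption. -->
    <grammar type="application/srgs+xml" root="ROOT" mode="voice">
      <rule id="ROOT">
        <one-of>
          <item>dog <tag>name='Lassie'; age='5'; color='brown'</tag></item>
          <item>cat <tag>name='Garfield'; age='7'; color='orange'</tag></item>
        </one-of>
      </rule>
    </grammar>

    <initial name="start">
      <prompt>Say either dog or cat.</prompt>
    </initial>

    <!-- These fields are filled from the slots of the form-level grammar -->
    <field name="name"/>
    <field name="age"/>
    <field name="color"/>

    <filled mode="all" namelist="name age color">
      <prompt>
        <value expr="name"/> is <value expr="age"/> years old
        and is <value expr="color"/>.
      </prompt>
    </filled>
  </form>
</vxml>
```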
In this example, the user first hears “Say either dog or cat.” from the <initial> element. The user must then give a response that matches the <grammar> element, that is, “Dog” or “Cat”. When one of these choices is matched, the fields “name”, “age”, and “color” are all filled with the values that correspond to that choice.
Built-in Grammars
The seven built-in grammar types defined by VoiceXML are supported, along with three additional ones: lastname, firstname, and uscitystate. They can be referenced by name in the “type” attribute of the “field” tag or in the “src” attribute of the “grammar” tag. An example of using the “type” attribute of the “field” tag would be: <field name="example_field" type="boolean">
When referencing built-in grammars in a “grammar” tag, the built-in type must be appended to “builtin:grammar/” or “builtin:dtmf/”. An example of this would be: <grammar src="builtin:grammar/boolean"/>
In this example, the grammar expects an affirmative phrase (such as “yes”) or a negative phrase (such as “no”). See the table below for more information.
The supported built-in grammars are as follows:
Built-in Grammars | Description |
boolean | Inputs include affirmative and negative phrases appropriate to the current language. DTMF 1 is affirmative and 2 is negative. The result is ECMAScript true for affirmative or false for negative. The value will be submitted as the int 1 or the int 0. If the field value is subsequently used in <say-as> with the interpret-as value “vxml:boolean”, it will be spoken as an affirmative or negative phrase appropriate to the current language. |
date | Valid spoken inputs include phrases that specify a date, including a month, day, and year. DTMF inputs are four digits for the year, followed by two digits for the month and two digits for the day. The result is a fixed-length date string in the format yyyymmdd, e.g. “20180704”. If the year is not specified, yyyy is returned as “????”; if the month is not specified, mm is returned as “??”; and if the day is not specified, dd is returned as “??”. NOTE: The inputs “today”, “tomorrow”, and “yesterday” are also accepted. They do not return a yyyymmdd string; instead they return 0, +1, and -1, respectively. Also, dates are not checked for validity, either for day of week or for days in month. For example, the input “April 31” is not rejected even though April has only 30 days. |
digits | Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the result is subsequently used in <say-as> with the interpret-as value “vxml:digits”, it will be spoken as a sequence of digits appropriate to the current language. A user can say for example “two one two seven”, but not “twenty one hundred and twenty-seven”. |
currency | Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the “*” key acts as the decimal point. The result is a string in the format UUUmm.nn, where UUU is the three-character currency indicator according to ISO standard 4217 [ISO4217], or mm.nn if the currency is not spoken by the user or cannot be reliably determined (e.g. “dollar” and “peso” are ambiguous). If the field is subsequently used in <say-as> with the interpret-as value “vxml:currency”, it will be spoken as a currency amount appropriate to the current language. Only dollars, cents, and euros can be used. NOTE: The currency built-in accepts inputs such as “five dollars” (returns “5.00”, spoken as “five dollars”), “five U S dollars” (returns “USD5.00”, spoken as “five US dollars”), and “five twenty-five” (returns “5.25”, spoken as “five point two five”). |
number | Valid spoken inputs include phrases that specify numbers, such as “one hundred twenty-three” or “five point three”. Valid DTMF input includes positive numbers entered using digits and “*” to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (“.”) and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when they are used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret it as an octal number). If the field is subsequently used in <say-as> with the interpret-as value “vxml:number”, it will be spoken as a number appropriate to the current language. Numbers up to 999,999,999,999 are supported. |
phone | Valid spoken inputs include phrases that specify a phone number. DTMF asterisk “*” represents “x”. The result is a string containing a telephone number consisting of a string of digits and optionally containing the character “x” to indicate a phone number with an extension. For North America, a result could be “8005551234x789”. If the field is subsequently used in <say-as> with the interpret-as value “vxml:phone”, it will be spoken as a phone number appropriate to the current language. Note: This works only for Nuance OSR engines. |
time | Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five character string in the format hhmmx, where x is one of “a” for AM, “p” for PM, “h” to indicate a time specified using 24 hour clock, or “?” to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input, the result will always end with “h” or “?”. If the field is subsequently used in <say-as> with the interpret-as value “vxml:time”, it will be spoken as a time appropriate to the current language. Note: This works only for Nuance OSR engines. |
lastname | Recognizes a last name and its spelling, such as “Smith, S-M-I-T-H”. The recognized last names, however, cover only 90% of the U.S. population. See the example below for how to implement this built-in grammar. |
firstname | Recognizes a first name and its spelling, such as “John, J-O-H-N”. The recognized first names, however, cover only 90% of the U.S. population. See the example below for how to implement this built-in grammar. |
uscitystate | Recognizes a US city and state such as “Plymouth, Massachusetts”. Returns 3 attributes: city, state, and county. To address these attributes in your script, you would type [variablename].city for the city, [variablename].state for the state, and [variablename].county for the county. |
The following parameters can be used with the boolean and digits grammars:
Parameters | Description |
digits?minlength=n | A string of at least n digits. Applicable to speech and DTMF grammars. If minlength conflicts with either the length or maxlength attribute, an error.badfetch event is thrown. |
digits?maxlength=n | A string of at most n digits. Applicable to speech and DTMF grammars. If maxlength conflicts with either the length or minlength attribute, an error.badfetch event is thrown. |
digits?length=n | A string of exactly n digits. Applicable to speech and DTMF grammars. If length conflicts with either the minlength or maxlength attribute, an error.badfetch event is thrown. |
boolean?y=d | A grammar that treats the keypress d as an affirmative answer. Applicable only to the DTMF grammar. |
boolean?n=d | A grammar that treats the keypress d as a negative answer. Applicable only to the DTMF grammar. |
The following example demonstrates how you can use a builtin digits grammar to restrict a digit entry between a minimum number of digits and a maximum number of digits:
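A minimal sketch, using the semicolon-separated parameter form from the VoiceXML built-in grammar syntax. The field name pin, the bounds 4 and 6, and the prompt wording are illustrative assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- Accepts between 4 and 6 digits, spoken or keyed -->
    <field name="pin" type="digits?minlength=4;maxlength=6">
      <prompt>Please enter your PIN. It is four to six digits long.</prompt>
      <filled>
        <prompt>
          You entered
          <say-as interpret-as="vxml:digits"><value expr="pin"/></say-as>.
        </prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Setting minlength greater than maxlength would be a conflict of the kind that throws error.badfetch, as described in the table above.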
Here's an example showing the use of the “lastname” and “firstname” builtin grammars:
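A hedged sketch only: it assumes the name grammars can be referenced through the field's type attribute like the other built-in types, which this document does not confirm for these two grammars. The property values 0.8 and 0.2 and the field names first and last are illustrative assumptions chosen to show a high sensitivity and a low confidence level.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- High sensitivity and a low confidence threshold;
       the numeric values are assumptions -->
  <property name="sensitivity" value="0.8"/>
  <property name="confidencelevel" value="0.2"/>
  <form>
    <!-- Referencing the grammars via type is an assumption -->
    <field name="first" type="firstname">
      <prompt>Please say and spell your first name.</prompt>
    </field>
    <field name="last" type="lastname">
      <prompt>Please say and spell your last name.</prompt>
    </field>
    <filled mode="all" namelist="first last">
      <prompt>
        Thank you, <value expr="first"/> <value expr="last"/>.
      </prompt>
    </filled>
  </form>
</vxml>
```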
In this example, a high sensitivity setting and a low confidence level setting are used so that the application can best match the user's speech input. Also, because each grammar covers only 90% of the U.S. population, there is roughly a 19% chance (1 − 0.9 × 0.9 = 0.19, assuming the two are independent) that either the user's first name or last name will not be recognized (see the “lastname” and “firstname” built-in grammars above).
Extended Built-in Grammars
Plum has a set of extended grammars that can be used. They can be referenced by using the “src” attribute of the <grammar> tag. The supported extended grammars are as follows:
Extended Built-in Grammars | Description |
usstreetaddress | Recognizes a US street address that includes the street number, street name, and type of street, such as “123 Main Street”, but needs a city and state or a zip code to be specified first. The zip code takes precedence over the city and state. So, if “Chicago, Illinois” and “02215” (a zip code in Boston, Massachusetts) were both given, “02215” would take precedence. Returns 5 attributes: StreetName, StreetNumber, StreetSuffix, StreetPreDirectional (if applicable), and StreetPostDirectional (if applicable). To address these attributes in your script, you would type [variablename].StreetName for the street name, [variablename].StreetNumber for the street number, [variablename].StreetSuffix for the street suffix, [variablename].StreetPreDirectional if the street has a “pre direction” in its name (e.g., North Main Street), and [variablename].StreetPostDirectional if the street has a “post direction” in its name (e.g., Wee Kirk Rd SE). NOTE: Currently, the usstreetaddress grammar only works for English. |
usstreet | Recognizes a US street name such as “Main Street”, but needs a city and state or a zip code to be specified first, as described above. Returns 4 attributes: StreetName, StreetSuffix, StreetPreDirectional (if applicable), and StreetPostDirectional (if applicable). To address these attributes in your script, you would type [variablename].StreetName for the street name, [variablename].StreetSuffix for the street suffix, [variablename].StreetPreDirectional if the street has a “pre direction” in its name (e.g., North Main Street), and [variablename].StreetPostDirectional if the street has a “post direction” in its name (e.g., Wee Kirk Rd SE). NOTE: Currently, the usstreet grammar only works for English. |
Here's an example showing how you can use the “uscitystate” built-in grammar along with the “usstreetaddress” and “usstreet” extended grammars:
In this example, the “srcexpr” attribute of the <grammar> tag is used to compute the URI to fetch for the street address. Note that when “usstreetaddress” is used, a city and state (or a zip code) must be defined for a US street address to be returned. The same applies to “usstreet”.
External Grammars
An external grammar is specified by an element of the form:
<grammar src="URI" type="media-type"/>
External grammars let you update a grammar file without changing your application code. The same rules apply to external grammars as to inline grammars. For example, an external grammar can use the same tags listed in the SRGS+XML section above.
When creating an external grammar file, name it with a .grxml extension. The file must still begin with a <?xml version="1.0"?> header, and the rest of the file must place the contents of the grammar within <grammar> tags. Below is an example of a VoiceXML script that uses an external grammar:
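A hedged sketch of the two pieces involved, based on the description in this section. The first block stands for the VoiceXML that entercardnumber.php would output; the field name cardnumber and the prompt wording are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="cardnumber">
      <!-- maxage="0" asks the platform to fetch a fresh copy every time -->
      <grammar src="numbergrammar.grxml" type="application/srgs+xml" maxage="0"/>
      <prompt>Please enter your card number.</prompt>
      <filled>
        <prompt>You entered <value expr="cardnumber"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The second block is a possible numbergrammar.grxml. The type, root, and mode values follow this section's description; the namespace declaration and the rule layout are assumptions.

```xml
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         type="application/srgs+xml" root="ROOT" mode="dtmf">
  <rule id="ROOT">
    <one-of>
      <!-- Between 1 and 255 keypresses, each a digit from 0 to 9 -->
      <item repeat="1-255">
        <one-of>
          <item>0</item> <item>1</item> <item>2</item> <item>3</item>
          <item>4</item> <item>5</item> <item>6</item> <item>7</item>
          <item>8</item> <item>9</item>
        </one-of>
      </item>
      <!-- Or the two-key sequence star followed by pound -->
      <item>* #</item>
    </one-of>
  </rule>
</grammar>
```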
In this example, the user is prompted to enter a card number. In the entercardnumber.php file, a <grammar> tag references numbergrammar.grxml as the grammar file. The <grammar> tag also sets maxage to 0, which ensures that the platform fetches a fresh copy of numbergrammar.grxml each time it is needed, in case the external grammar file has changed.
The rules of the grammar are set inside the numbergrammar.grxml file. The type attribute of the grammar is set to “application/srgs+xml”, the root attribute is set to ROOT, and the mode is set to dtmf. This grammar allows the user to enter up to 255 digits (0 through 9) or to enter “* #” on the phone keypad.
Recording Timeout Chart
Recording Timeout Behavior:
Caller input | dtmfterm=true | dtmfterm=false |
silence | stops after timeout seconds; returns <noinput> | stops after timeout seconds; returns <noinput> |
speech followed by silence | stops finalsilence seconds after speech ends; returns recording | stops finalsilence seconds after speech ends; returns recording |
continuous speech | stops after maxtime seconds; returns recording | stops after maxtime seconds; returns recording |
speech followed by any DTMF | stops immediately after DTMF input ends; DTMF input written to termchar; returns recording | stops finalsilence seconds after DTMF input ends; returns recording including DTMF tones |
any DTMF | stops immediately after DTMF input ends; DTMF input written to termchar; returns a recording of dead air | stops finalsilence seconds after DTMF input ends; returns recording of DTMF tones |
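A minimal sketch of a <record> element that sets the attributes named in the chart above. The item name msg and all duration values are illustrative assumptions; timeout is set here through the standard VoiceXML timeout property.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <!-- timeout: how long to wait in silence before noinput -->
    <property name="timeout" value="5s"/>
    <!-- finalsilence: silence after speech that ends the recording
         maxtime: upper bound on continuous speech
         dtmfterm="true": any keypress ends the recording at once -->
    <record name="msg" beep="true" maxtime="30s"
            finalsilence="3s" dtmfterm="true">
      <prompt>Please record your message after the beep.</prompt>
      <noinput>
        I did not hear anything.
        <reprompt/>
      </noinput>
      <filled>
        <prompt>Your recording was <audio expr="msg"/></prompt>
      </filled>
    </record>
  </form>
</vxml>
```

With dtmfterm="true", the key that ended the recording is available in the shadow variable msg$.termchar.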
Supported Languages
Language | US | UK |
American English (en-US) | x | x |
British English (en-GB) | - | x |
Spanish (es-US) | x | - |
German (de-DE) | x | - |
Canadian French (fr-CA) | x | - |
French (fr-FR) | x | - |
Italian (it-IT) | x | - |
Dutch (nl-NL) | x | x |
Polish (pl-PL) | x | - |
Brazilian Portuguese (pt-BR) | x | - |
Portuguese (pt-PT) | x | - |
Russian (ru-RU) | x | - |