Closed Captioning Presentation Methods to Improve the Viewing and Communication Experiences of Deaf and Hard-of-Hearing Users
Grant Powell
School of Behavioral and Brain Sciences, University of Texas at Dallas
ACN 6341: Human Computer Interaction I
Professor Ericka Orrick
November 24, 2020
Introduction
Ever since closed captioning first debuted in 1971 and became available for limited use in 1972, it has undergone relatively little transformation to better address the needs of the deaf and hard-of-hearing (HOH) (Hunter, 2017). Closed captioning, unlike open captioning, means that the text showing what someone on the television is saying is not visible the moment one turns on the television; it appears only when a viewer goes into the television settings and turns it on after realizing the need to improve their viewing experience (“What is the Difference Between Open and Closed Captioning,” 2020). Closed captioning did not reach the entire viewing public until 1980, when widespread access was provided by allowing deaf and HOH viewers to purchase special decoder boxes that attached to the television in order to access closed captioning. Thanks to the United States (US) Television Decoder Circuitry Act, which took effect in 1993, decoders are now embedded in all televisions that customers purchase in the US (Branje et al., 2005, p. 2330).
Hanson (2009) pointed out that the National Institute on Deafness and Other Communication Disorders (NIDCD) estimated in 2004 that 28 million people in the US had some degree of hearing loss. That figure has since increased to 48 million people in the US, with 477 million people worldwide, according to the most recent estimates available (Hearing Health Foundation, 2020).
Closed captioning is still presented as white text, in a single mono-spaced font and single font size, against a black background on most video platforms, although some changes have been made (see Appendix). For example, captions have moved from all capital letters to mixed-case letters, and they sometimes use a small set of text colors and special characters such as music notes to inform the user when noise, sound, or music is playing (Branje et al., 2005, p. 2330). These minor changes are the result of advancements in encoder and decoder technologies since captioning first appeared on analogue televisions (Fels et al., 2008, p. 506). Given these limited changes, and the fact that current closed captioning guidelines in the captioning industry have remained unchanged in the US since captioning was mandated by the Federal Communications Commission, captioners sometimes make creative choices that may not be obvious to the user, such as assigning descriptive text to sound effects, paraphrasing long speeches, and omitting information deemed unnecessary, to try to improve the user’s viewing experience (Hanson, 2009, p. 128; Fels et al., 2008, p. 506). However, these changes are not enough to meet the needs of the deaf and HOH population, whose needs and range of hearing loss are very diverse.
The major issue that separates people across the spectrum of hearing loss, from relatively little to profound, is language acquisition. Because hearing loss brings difficulties with pitch, timbre, and loudness, it affects not only speech perception but, inevitably, reading and writing skills as well (Hanson, 2009, p. 126). Understanding this has implications for interface designers as it relates to closed captioning aimed at improving the end user’s video viewing experience, whether for entertainment, education, or communication. For example, deaf and HOH users may have difficulty with reading because of speech perception difficulties. This is especially the case if they lost their hearing at birth or during the pre-lingual stage of child development, because reading is based on the phonetic structure of a spoken language and on understanding its grammar and comprehending its text (Hanson, 2009, p. 127). Instead of presenting the captioned text verbatim and making a captioner’s job harder in an attempt to improve the user’s viewing experience, one way designers can make closed captioning more user-friendly is to consider incorporating graphics, animations, or animated text to convey the emotion and sound information being displayed (Fels et al., 2008, p. 507).
Emotive Captioning
Two ideas that use graphics, animations, or animated text appear in the research literature on closed captioning for deaf and HOH users: emotive captioning and kinetic typography. The first, emotive captioning, is a type of captioning that Branje et al. (2005) describe as the use of a combination of graphics, color, icons, animations, and text to illustrate sound information. Their study used graphics to represent emotion and sound effects, relying on a production team to deploy the designed graphical captions over very short segments of two vignettes, “Forever and Ever” and “Traffic Jammed,” from the eight-vignette television series Burnt Toast (Branje et al., 2005, p. 2331). To choose graphics that convey sound information, the researchers had to decide what information to convey, how best to express it graphically, and which graphics were most appropriate (Branje et al., 2005, p. 2331). Drawing on the research literature to determine which graphics and icons to display in the captions, they identified the most common emotions as fear, anger, sadness, happiness, disgust, and surprise, plus two more that members of the production team wanted to add: sexy and love (Branje et al., 2005, p. 2331). The production team marked up their script using a Caption Markup tool to tag it with the different emotions and emotional intensities, producing the tagged script as a text-based file (Branje et al., 2005, p. 2332). Some members of the production team then marked up the script further with their interpretation of the emotive characteristics of the show (Branje et al., 2005, p. 2332). This allows their Rendering Engine tool to automatically create graphical pop-on captions using pre-designed image files associated with each emotion selected for the study, and allows the captions to be edited with image and text tools controlling font style, size, and text color (Branje et al., 2005, p. 2332; see Appendix).
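To make the markup-to-rendering pipeline concrete, the following sketch shows in Python how a script line tagged with an emotion and intensity might be resolved to a pre-designed graphic and a pop-on caption. The data structures, file names, and function here are hypothetical illustrations; they are not the actual Caption Markup or Rendering Engine tools used by Branje et al. (2005).

```python
# Hypothetical sketch of the emotive-captioning pipeline described by
# Branje et al. (2005): a marked-up script entry (emotion + intensity)
# is resolved to a pre-designed graphic and a pop-on caption.
# File names and structures are illustrative assumptions only.
from dataclasses import dataclass

# Emotions used in the study, each assumed to map to a pre-designed image file.
EMOTION_GRAPHICS = {
    "fear": "fear.png", "anger": "anger.png", "sadness": "sadness.png",
    "happiness": "happiness.png", "disgust": "disgust.png",
    "surprise": "surprise.png", "sexy": "sexy.png", "love": "love.png",
}

@dataclass
class ScriptLine:
    speaker: str
    text: str
    emotion: str        # tag added with the markup tool
    intensity: float    # 0.0 (low) to 1.0 (high)

def render_caption(line: ScriptLine) -> dict:
    """Build a pop-on caption description from a tagged script line."""
    graphic = EMOTION_GRAPHICS.get(line.emotion, "neutral.png")
    return {
        "text": f"{line.speaker}: {line.text}",
        "graphic": graphic,
        # Intensity could scale visual emphasis, e.g. icon size or color saturation.
        "scale": 1.0 + 0.5 * line.intensity,
        "font": {"style": "sans-serif", "size": 24, "color": "white"},
    }

if __name__ == "__main__":
    tagged = ScriptLine("Anna", "I can't believe it!", "surprise", 0.8)
    print(render_caption(tagged))
```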
Branje et al.’s study had eleven participants: six deaf American Sign Language (ASL) users and five HOH users. To compare data, the researchers had the users complete pre-test and post-test questionnaires and interviews, and they recorded the discussions that took place among the users while they watched the two video segments (Branje et al., 2005, p. 2332). Reactions to three versions of both video segments, one with conventional captions, one with emotive captions fixed in a single location on the screen, and one with emotive captions placed in different locations on the screen to indicate speaker identification, were then compared between the group of deaf users with ASL knowledge and the group of HOH users without ASL knowledge (Branje et al., 2005, p. 2332).
Kinetic Typography
The second idea, kinetic typography, is a type of captioning that may use, for example, trembling letters in a scratchy typeface to convey a sense of terror, large font sizes to express the emotion associated with screaming, and a small font size to communicate the emotion associated with whispering (Fels et al., 2008, p. 507). Fels et al. (2008) conducted a study that experimented with kinetic typography on twenty-five participants split into two groups: ten hearing (H) users and fifteen HOH users. The kinetic texts used in the study corresponded to five basic emotions: anger, fear, happiness, sadness, and disgust. These emotions were later combined into a range of fifteen total emotions depicted through kinetic texts, each with an emotional intensity modifier defined as high, medium, or low (Fels et al., 2008, p. 508). The different emotions expressed by kinetic texts were presented in two one-minute video segments from two episodes of a children’s show called Deaf Planet, titled “Bad Vibes” and “To Air is Human” (Fels et al., 2008, p. 509). The episodes were presented to the participants in three versions: closed captions, enhanced animated text shown statically at the bottom of the screen, and extreme animated text animated dynamically around the screen (Fels et al., 2008, p. 509; see Appendix). To obtain the results, data from the three versions of both shows were compared by having the users complete pre-test and post-test questionnaires and a checklist and by videotaping the users’ comments and reactions to the viewing experience (Fels et al., 2008, p. 509).
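As a rough illustration of how kinetic text properties could be derived from an emotion and an intensity modifier, the sketch below maps the five base emotions and a high/medium/low intensity to font size and jitter parameters. The parameter values, typeface names, and function are assumptions made for illustration and are not taken from Fels et al.’s (2008) implementation.

```python
# Hypothetical mapping from (emotion, intensity) to kinetic-text animation
# parameters, loosely following the idea in Fels et al. (2008) that, e.g.,
# large type suggests screaming and trembling letters suggest terror.
# All parameter values are illustrative assumptions.

BASE_STYLES = {
    "anger":     {"font_size": 36, "jitter_px": 2, "typeface": "bold"},
    "fear":      {"font_size": 24, "jitter_px": 6, "typeface": "scratchy"},
    "happiness": {"font_size": 28, "jitter_px": 0, "typeface": "rounded"},
    "sadness":   {"font_size": 20, "jitter_px": 0, "typeface": "light"},
    "disgust":   {"font_size": 24, "jitter_px": 3, "typeface": "condensed"},
}

INTENSITY_SCALE = {"low": 0.75, "medium": 1.0, "high": 1.5}

def kinetic_style(emotion: str, intensity: str) -> dict:
    """Return animation parameters for a caption with the given emotion/intensity."""
    base = BASE_STYLES[emotion]
    scale = INTENSITY_SCALE[intensity]
    return {
        "font_size": round(base["font_size"] * scale),  # larger text = louder voice
        "jitter_px": round(base["jitter_px"] * scale),  # trembling amplitude
        "typeface": base["typeface"],
    }

if __name__ == "__main__":
    print(kinetic_style("fear", "high"))    # large jitter for terror
    print(kinetic_style("sadness", "low"))  # small, calm text for quiet sadness
```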
Branje et al. (2005) found no significant differences in the users’ responses between the two video segments in their study on emotive captions. However, they did find that emotive captions tended to benefit the HOH users more than the deaf users, because the HOH users preferred using graphics, icons, and color to represent sound information, whereas the deaf users did not (Branje et al., 2005, p. 2336). The emotive captions were received more positively by HOH users than by deaf users in terms of willingness to engage in conversation with hearing friends about the viewing content, the appearance of the face icons in the graphics of the emotive captions, the graphical representation of the emotions, and the color of the emotive captions (Branje et al., 2005, p. 2336).
Fels et al. (2008) found a significant difference in likeability among the three caption presentations in their study on kinetic typography. Both H and HOH viewers preferred the enhanced animated captions over the conventional captions and the extreme animated captions. The ratio of positive to negative comments favored the enhanced captions, and results were more positive for enhanced captions concerning the movement of the text and the portrayal of emotions by the moving text (Fels et al., 2008, p. 516). In identifying the correct emotion presented through the three captioning presentations, measured by the number of attempts needed, HOH viewers missed fewer emotions than H viewers (Fels et al., 2008, p. 517). Significant differences among HOH viewers were found concerning text descriptions for sound effects, where most viewers suggested that symbols be used for sound effects instead of text (Fels et al., 2008, p. 517). That finding reiterates the report from Branje et al. (2005) that deaf and HOH viewers like animated symbols for representing sound effects (Fels et al., 2008, p. 517). Overall, the results from this first study of kinetic text suggest that animated text is emerging as a promising path forward for helping HOH viewers, who indicate that they want more sound information in closed captioning (Fels et al., 2008, p. 517). Despite some of the positive findings on emotive captioning and kinetic typography as closed captioning alternatives, closed captioning, whether in its current form or in these adaptations, may still not be effective in meeting the needs of deaf users who only sign, especially given the results of the study by Branje et al. (2005).
Sign Language Signing Interfaces
Deaf and HOH users who only sign need an adaptable form of closed captioning that communicates what is being said, what the emotional content is, and whether music, sound, or noise is playing by providing a visual representation of a sign language signer conveying that information. Communicating all of that information through sign interfaces has always come with a disadvantage: they have always required a live signer (Hanson, 2009, p. 129). Video phone technology now exists that allows deaf and HOH users to hold signed phone conversations, which is the closest a sign interface has come to resembling the automated closed captioning derived from speech recognition technology (Hanson, 2009, p. 131). However, when a live signer is involved, everything must be pre-recorded, and there is no automatic signed translation of software and web content to rely on if one is looking for signed versions of audio or multimedia created in real time (Hanson, 2009, p. 129).
One solution created to deal with this limitation of sign interfaces is an automatic sign presentation called “concatenated signing,” in which a software program strings the words of a sentence together by concatenating signs for each word that were produced by a live signer (Hanson, 2009, p. 129). Because the transition from one video of a live signer to another can be disjointed, computer algorithms have been created to smooth the flow and ensure a seamless transition from one signed word to the next (Hanson, 2009, p. 129). Other disadvantages of providing an automatic sign presentation in this manner include the high production cost of creating a professional-quality video, the recurring cost of making new videos each time the content of a website or piece of software changes, the large size of the videos for storage and downloading, and the need for a sophisticated server architecture to accommodate streaming content that changes at runtime (Glauert & Kennaway, 2007). Despite its disadvantages, automatic sign presentation as described is not necessarily a bad concept, but there is one automatic sign interface concept that is even more promising.
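A minimal sketch of the concatenation idea follows, assuming a hypothetical library of pre-recorded sign clips keyed by word and using a simple crossfade as a stand-in for the smoothing algorithms Hanson (2009) mentions; the clip names, durations, and fingerspelling fallback are invented for illustration.

```python
# Hypothetical sketch of "concatenated signing" (Hanson, 2009): pre-recorded
# clips of a live signer, one per word, are strung together, with a short
# crossfade standing in for the smoothing algorithms used between clips.
# The clip library, durations, and fallback clip are made up for illustration.

SIGN_CLIPS = {            # word -> (clip filename, duration in seconds)
    "train": ("train.mp4", 1.2),
    "leaves": ("leaves.mp4", 1.0),
    "now": ("now.mp4", 0.8),
}

CROSSFADE_S = 0.2  # overlap between consecutive clips to hide disjointed cuts

def build_playlist(sentence: str) -> list[dict]:
    """Turn a sentence into a timed playlist of sign clips with crossfades."""
    playlist, t = [], 0.0
    for word in sentence.lower().split():
        clip, duration = SIGN_CLIPS.get(word, ("fingerspell.mp4", 1.5))
        playlist.append({"clip": clip, "start": round(t, 2), "duration": duration})
        t += duration - CROSSFADE_S  # next clip starts before this one ends
    return playlist

if __name__ == "__main__":
    for entry in build_playlist("Train leaves now"):
        print(entry)
```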
Virtual Sign Language Signing Avatars
The automatic sign interface concept with the greatest potential to benefit deaf users who only sign is the use of virtual signing avatars (see Appendix). Signing avatars are driven by animation software in which motion data is generated in real time from a scripting notation designed for describing the avatar’s signing (Glauert & Kennaway, 2007). The use of signing avatars has six distinct advantages that make it an attractive alternative to sign interfaces on the internet, which have always been presented through video.
First, a user can define the signs through scripting notation and compose sequences of signs on a desktop computer without the need for video or motion capture equipment (Glauert & Kennaway, 2007). Second, the flow from sign to sign is very smooth because any piece of signing can be sequenced for any avatar (Glauert & Kennaway, 2007). Third, the details of the signing content can be edited without having to rerecord any sequences of signs (Glauert & Kennaway, 2007). Fourth, bandwidth and disk space are of little concern because, once software is installed on the user’s computer to translate scripting notation into motion data and render the avatar, only the scripting notation needs to be transmitted, and that data can be much smaller than highly compressed video or motion data of live signers (Glauert & Kennaway, 2007). Fifth, the frame rate of the graphics is less of a concern because, with the motion data generated on the end user’s machine, the avatar and its signing can be rendered at whatever rate the user’s computer is capable of (Glauert & Kennaway, 2007). Finally, the user has more control over the avatar than over videos of live signers, for example by adjusting the viewing angle of the avatar and switching among different avatars during playback (Glauert & Kennaway, 2007).
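The sketch below illustrates, under stated assumptions, why only a small script needs to be transmitted: a compact scripting notation is expanded on the user’s machine into frame-by-frame motion data for the avatar. The notation, joint names, and keyframe values are invented for this illustration and are not the notation described by Glauert and Kennaway (2007).

```python
# Hypothetical sketch of avatar-based signing: a small text script is expanded
# locally into per-frame motion data, so only the script needs to be transmitted.
# The notation ("sign:HELLO@1.0s"), joints, and poses are invented for
# illustration; they are not the actual notation in Glauert and Kennaway (2007).

# Pre-installed sign definitions: per-sign keyframes of joint angles (degrees).
SIGN_KEYFRAMES = {
    "HELLO": [{"r_elbow": 90, "r_wrist": 0}, {"r_elbow": 90, "r_wrist": 45}],
    "THANKS": [{"r_elbow": 70, "r_wrist": 10}, {"r_elbow": 30, "r_wrist": 0}],
}

def expand_script(script: str, fps: int = 30) -> list[dict]:
    """Expand 'sign:NAME@duration' tokens into interpolated per-frame joint poses."""
    frames = []
    for token in script.split():
        name, duration = token.removeprefix("sign:").split("@")
        keys = SIGN_KEYFRAMES[name]
        n = int(float(duration.rstrip("s")) * fps)  # frame count at the local fps
        for i in range(n):
            t = i / max(n - 1, 1)  # interpolation parameter, 0..1 across the sign
            start, end = keys[0], keys[-1]
            frames.append({j: start[j] + t * (end[j] - start[j]) for j in start})
    return frames

if __name__ == "__main__":
    # The fps argument can match whatever rate the user's computer can render.
    motion = expand_script("sign:HELLO@1.0s sign:THANKS@0.5s", fps=30)
    print(len(motion), "frames; first:", motion[0], "last:", motion[-1])
```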
Signing avatars have been used primarily in European Union (EU) countries as part of various projects and initiatives: to provide counter clerks with a tool for communicating with deaf customers in post offices, to convert the captioning stream from the EU’s version of closed captioning, called teletext, into signing, and to improve broadcast and face-to-face interaction applications on the web for deaf users (Glauert & Kennaway, 2007). Although improvements to the implementation of signing avatars are always needed, evaluations of deaf participants’ comprehension of single signs, signed sentences, and text chunks have produced high scores among some deaf participants, indicating that the concept, including the modifications and adjustments made to improve the presentation of signing avatars, is working (Glauert & Kennaway, 2007). This is also reflected in the participants’ appreciation for the existence of signing avatars: they cited the avatars as helpful to those with severe reading difficulties, as offering more user control over how to use the avatars to meet their needs, and as potentially helpful for presenting information at train stations, airports, town halls, hospitals, and travel agencies (Glauert & Kennaway, 2007).
Conclusion
Overall, the summarized results and conclusions from the evaluations of the signing avatars used in the EU show that they benefit deaf users who only sign, and the studies of emotive captioning and kinetic typography show that those approaches benefit HOH users. Although these closed captioning adaptations and alternatives do not benefit their respective users in every area, they still benefit them in certain noticeable areas. The adaptations and alternatives presented in this review are not necessarily the only ways to enact them. More research and creative problem-solving should continue, to further validate the findings from these closed captioning adaptations and alternatives, improve upon them, and discover new ones. Continuing this work may help loosen some of the strict captioning guidelines set by the US federal government, which do not allow much flexibility in creating and providing closed captioning adaptations and alternatives (Fels et al., 2008, p. 506). Meanwhile, countries in the EU have been implementing and providing such adaptations and alternatives for some time now (Fels et al., 2008, p. 506).
References
Branje, C., Fels, D. I., Hornburg, M., & Lee, D. G. (2005). Emotive Captioning and Access to
Television. AMCIS 2005 Proceedings, 2330-2337. http://aisel.aisnet.org/amcis2005/300
Fels, D. I., Hunt, R., Vy, Q., & Rashid, R. (2008). Dancing with Words: Using Animated Text
for Captioning. International Journal of Human-Computer Interaction, 24(5), 505-519.
https://doi.org/10.1080/10447310802142342
Glauert, J. R. W., & Kennaway, J. R. (2007). Providing Signed Content on the Internet by
Synthesized Animation. ACM Transactions on Computer-Human Interaction, 14(3). http://doi.acm.org/10.1145/1279700.1279705
Hanson, V. L. (2009). Computing Technologies for Deaf and Hard of Hearing Users. In A. Sears
& J. A. Jacko (Eds.), Human-Computer Interaction: Designing for Diverse Users and
Domains (pp. 125-133). Taylor & Francis Group.
Hearing Health Foundation. (2020). Hearing Loss and Tinnitus Statistics. Retrieved from
Hunter, E. (2017, December). A Brief History of Closed Captioning. The Sign Language
Company. Retrieved from https://signlanguageco.com/a-brief-history-of-closed-captioning/
What is the Difference Between Open and Closed Captioning. (2019). DO∙IT. Retrieved