Text-to-speech audio description

What is text-to-speech audio description?

Text-to-speech audio description (TTS AD) is a type of AD where instead of a human narrator, the script is read by speech synthesis software. TTS AD is not meant to eliminate human audio describers. The idea behind TTS AD is to increase the availability of audio description as it is believed to be more cost-effective than ‘traditional’ AD.

Text-to-speech audio description has several advantages. From the perspective of the audio description provider, TTS AD offers unequalled cost-effectiveness in terms of AD production in comparison with conventional methods of producing audio description as it does not require the recording of the AD script (for pre-recorded AD) nor does it incur any human labour costs for the reading out of the AD script (for live AD). Furthermore, in contrast to audio describers involved in the production of conventional AD, who should develop “the vocal instrument through work with speech and oral interpretation fundamentals” (Snyder 2008: 196), audio describers for TTS AD need not have any particular vocal skills.

From the point of view of end users, TTS AD also spares expense for the many blind and partially sighted people who already have speech synthesis software at home or at work and who are accustomed to using it in their daily lives. Thanks to the high quality of speech synthesis software now available in many languages, watching a film with synthetic AD can be an enjoyable and entertaining experience. For viewers watching an audiovisual programme online, another advantage is that the solution does not require a high-speed Internet connection, because the viewer is simply offered a text file with the AD script (in .txt or .sub format) to be read out by a text-to-speech programme. The solution seems particularly attractive to those visually impaired people who live in small towns and villages and thus cannot enjoy cinema screenings or theatre performances with AD, which are usually organised in large cities. Furthermore, TTS AD allows spectators with visual impairments to watch films and other audiovisual programmes on their own, without depending on others or being confined to the explanations of their sighted friends or family.
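To make the idea of a text file "to be read out by a text-to-speech programme" more concrete, here is a minimal sketch in Python. It rests on several assumptions that are not stated in the text: that the .sub file uses the frame-based MicroDVD layout ({start}{end}text), that the film runs at 25 frames per second, and that the speech engine can be reduced to a function taking a string. The names parse_ad_script and read_out are hypothetical, the sample lines are invented for illustration, and print merely stands in for a real synthesiser such as the one used in the studies.

```python
import re
from typing import Callable, List, NamedTuple


class Cue(NamedTuple):
    start_s: float  # when the description should start, in seconds
    end_s: float    # when it should be finished, in seconds
    text: str       # the description to be read out


# One MicroDVD-style line (assumed format): {start_frame}{end_frame}text
CUE_LINE = re.compile(r"^\{(\d+)\}\{(\d+)\}(.*)$")


def parse_ad_script(script: str, fps: float = 25.0) -> List[Cue]:
    """Turn a frame-based .sub script into a list of timed cues."""
    cues = []
    for raw_line in script.splitlines():
        match = CUE_LINE.match(raw_line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        start_frame, end_frame = int(match.group(1)), int(match.group(2))
        # "|" marks a line break in MicroDVD files; a voice does not need it
        text = " ".join(part.strip() for part in match.group(3).split("|"))
        cues.append(Cue(start_frame / fps, end_frame / fps, text))
    return cues


def read_out(cues: List[Cue], speak: Callable[[str], None] = print) -> None:
    """Hand each cue to a speech function (print is only a placeholder)."""
    for cue in cues:
        speak(f"[{cue.start_s:.1f}s] {cue.text}")


# Made-up sample, not taken from any of the AD scripts used in the studies
SAMPLE_SCRIPT = """\
{250}{375}A man wakes up|and looks at the clock.
{500}{600}He walks to the window.
"""

cues = parse_ad_script(SAMPLE_SCRIPT)
read_out(cues)
```

Because the script is plain text of this kind, the file stays very small compared with a recorded audio track, which is why a slow connection is enough to download it.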

Text-to-speech audio description can be used for both domestic and dubbed productions (where only one language is heard) and for foreign programmes (where two languages can be heard: the original and the translation). For foreign materials, it can be combined with either audio subtitling of the dialogue (in subtitling countries) or with voice-over translation (as in Poland).

Like anything, text-to-speech AD does have its downsides. First of all, it requires media literacy and, as such, it largely excludes visually impaired people, especially the elderly, who live outside the modern high-tech information society and do not interact with new digital technologies. Another criticism of TTS AD that can be anticipated is that it does not promote integration and inclusion, as the viewing takes place at home, often by only one visually impaired person. To this I can only say that the arrival of the DVD/home video has not sent cinema to the dustbin of cinematographic history. In the same way, text-to-speech AD is to complement, not eradicate, the experience of watching films. TTS AD is by no means intended to replace the audio description practice currently in use. Rather, it aims to supplement it and to increase the number of audio described films and audiovisual programmes made available to people with visual impairments.

Citation guidance: Szarkowska Agnieszka (2011) “Text-to-speech audio description. Towards a wider availability of AD”. In: Journal of Specialised Translation in January 2011.


“The Day of the Wacko”

by Agnieszka Szarkowska

The feature film selected for the experiment was Dzień Świra (The Day of the Wacko, 2002, dir. Marek Koterski), a tragi-comedy telling the story of a middle-aged Polish literature teacher, Adam Miauczyński.

It was decided that the AD script would be read by a female voice. This choice was motivated by three major factors. The first one stemmed from the fact that The Day of the Wacko consists largely of the main protagonist’s monologues, interspersed with him conversing with, or rather barking at, other characters. Since the main character is male, it was thought that it would be easier and less confusing for viewers to listen to audio description delivered by a female voice.

The second factor contributing to the selection of the female voice for synthetic AD was the unquestioned hegemony which has so far been enjoyed by male voice talents in Poland, where the study is carried out. Poland is a country where the dominant mode of audiovisual translation on television is voice-over (Szarkowska 2009), both for fiction and non-fiction genres. The overwhelming majority of voice-over artists, known as lektors, are male. With the advent of pre-recorded audio description in Poland, it was only natural for many people, accustomed to hearing male narrators, and for the lobby of lektors themselves, that AD should also be read by men. Hence, on all the DVDs with pre-recorded AD released on the Polish market so far, as well as on tens of hours of audio described TV series produced by public television (TVP) and made available online, the audio describer is always male.

Research questions
The key objective of the present study was to determine whether visually impaired viewers would find it acceptable for text-to-speech software to read AD scripts. To address this objective, the following three research questions were formulated:

1) Which AD voice would the visually impaired prefer if they had a choice between a human voice and a synthetic voice?
2) Would TTS AD be acceptable as an interim solution, until a system has been agreed to have a human voice reading out the AD?
3) Would TTS AD be acceptable as a permanent solution, next to AD read by a human voice?

The questionnaire was administered after a screening of The Day of the Wacko with text-to-speech AD on 4 December 2009. The screening was part of the conference Reha for the Blind in Poland, which took place in Warsaw.

The audience were first invited to watch the film and after the projection they were asked to provide answers to 15 questions, which were read out to them by sighted volunteers.

A total of twenty-four people were interviewed (13 females, 11 males). Five were aged 18-25 years (3 females, 2 males), ten were aged 26-39 years (5 females, 5 males) and nine were aged 40-59 years (5 females, 4 males).

As for educational background, four respondents had primary education, eleven – secondary, and nine were university graduates.

Out of the total number of participants (n=24), sixteen (66%) were congenitally blind and six (33%) had an acquired sight loss. The level of participants’ sight loss was classified into four categories: mild, moderate, severe and profound. The scale used in the questionnaire was adopted from the research conducted by the RNIB (Freeman et al. 2008), which was based on the Network 1000 research report (Douglas et al. 2006).

As seen in the figure below, eight respondents (33%) had profound sight loss, three (13%) had severe sight loss, eight (33%) had moderate sight loss and five (21%) had mild sight loss:

Degree of sight loss among respondents (n=24)
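With a sample this small, each respondent accounts for roughly four percentage points, so it can help to see how the shares follow from the head counts. The short Python sketch below recomputes them from the four counts reported above. The helper name percent_half_up is hypothetical, and rounding halves upwards is an assumption about how the published figures were rounded (Python's built-in round would turn 12.5 into 12 rather than 13).

```python
def percent_half_up(count: int, total: int) -> int:
    """Share of total as a whole percentage, rounding halves upwards."""
    # Integer arithmetic avoids floating-point surprises at exactly .5
    return (200 * count + total) // (2 * total)


# Head counts as reported for the degree of sight loss
sight_loss = {"profound": 8, "severe": 3, "moderate": 8, "mild": 5}

total = sum(sight_loss.values())
shares = {degree: percent_half_up(count, total)
          for degree, count in sight_loss.items()}

for degree, share in shares.items():
    print(f"{degree}: {sight_loss[degree]} of {total} = {share}%")
```

The four counts add up to the 24 respondents, and the rounded shares match the percentages quoted in the text.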

With respect to research question (1), i.e. the preference for either a human or a synthetic voice to read out the AD script, the majority of respondents (n=13, 54%) stated they would prefer a human voice whilst two people (8%) claimed they preferred a synthetic voice over a human one. As many as one in four declared that the choice of human vs. synthetic voice depended on the programme. Many others were not sure and wanted to have more experience with TTS AD in order to be able to make an informed choice.

Research question 1 (n=24)

No significant patterns were found in terms of the preference for human/synthetic voice depending on the age, gender, education of the respondents or their previous exposure to synthetic voice. In order to formulate further generalisations on each of these variables, a larger sample would need to be examined.

Research questions (2) and (3) addressed the acceptance of TTS AD as either an interim or permanent solution. Nearly all respondents (23 out of 24, 96%) were in favour of introducing TTS AD as an interim solution until there are more programmes available with human audio describers. More than half of the respondents (n=14, 58%) supported TTS AD as a permanent solution, functioning next to AD with a human voice. Just under a third (n=7, 29%) were against TTS AD as a permanent solution. Some respondents (n=3, 13%) said they would need more time and experience with TTS AD to make an informed choice.

Research questions (2) and (3) (in %, out of the total n=24)

The preference for TTS AD as an interim or permanent solution was then examined by different variables, such as the type and degree of sight loss as well as the age of the respondents:

                          TTS AD as an interim solution      TTS AD as a permanent solution
                          yes      no      don’t know        yes      no      don’t know
by type of sight loss
  congenital              94%      6%      0                 47%      35%     18%
  acquired                100%     0       0                 85%      15%     0
by degree of sight loss
  mild                    100%     0       0                 80%      20%     0
  moderate                100%     0       0                 37%      25%     37%
  severe                  100%     0       0                 66%      33%     0
  profound                87%      13%     0                 63%      37%     0
by age
  18-25                   100%     0       0                 40%      40%     20%
  26-39                   90%      10%     0                 60%      20%     20%
  40-59                   100%     0       0                 66%      33%     0

TTS AD as an interim or permanent solution (n=24)

No significant patterns emerged in terms of the attitude towards TTS AD as an interim solution. As for TTS AD as a permanent solution, more respondents with acquired sight loss than those with congenital sight loss declared their support for the idea. Furthermore, perhaps somewhat surprisingly, there were more respondents from the older age brackets who declared their support for permanent TTS AD than respondents from the youngest group.

The familiarity of the respondents with computers, the Internet and speech synthesis software was found to be quite high. 21 out of 24 (87%) respondents have either a PC or a laptop at home and 18 respondents (75%) also have an Internet connection. The overwhelming majority of respondents (75%) use speech synthesis software on a regular basis, but only 5 people watch films with subtitles read out by text-to-speech software (many respondents were surprised to hear this is possible and were willing to try it out).

The overall results of the present study are in line with the RNIB report on the use of synthetic speech by blind and partially sighted people, which states that “listeners prefer natural sounding speech, both in comparing natural speech to synthetic speech and in comparing different synthetic voices” (Cryer and Home 2008: 7). It is worth noting, however, that while the visually impaired viewers in this study find natural speech preferable, many of them would find synthetic speech acceptable.

The high level of familiarity with technology demonstrated by the respondents may not be representative of an overall visually impaired population. While TTS AD may not be a feasible solution for older age groups, whose unfamiliarity with computers, the Internet and speech synthesis software remains a serious obstacle, the results of the present research still demonstrate an untapped potential for text-to-speech audio description.

Many thanks to Piotr Wasylczyk, Anna Jankowska, Robert Więckowski and Mateusz Ciborowski for their help with the drafting of the AD script; to Leen Petré from RNIB for her invaluable help and feedback on the design and interpretation of the questionnaire; to Marek Kalbarczyk from the Foundation of the Chance for the Blind for allowing me to organise the screening during the conference Reha for the Blind; to Ivo Software for letting me use the Ivona synthesiser at the screening; to all my friends and students who helped me distribute the questionnaire, and finally to all the respondents for participating in the experiment and sharing their opinions on TTS AD.

This work has been supported by research grant No. N N104 148038 of the Polish Ministry of Science and Higher Education for the years 2010-2011.
By Agnieszka Szarkowska. More information can be found in the Journal of Specialised Translation.



“Volver”

by Agnieszka Szarkowska and Anna Jankowska

The film selected for research purposes was Volver, a 2006 Spanish drama directed by Pedro Almodóvar. As the film was already available on DVD with Polish voice-over translation, it was decided that this version would be used in the study, complemented with the AD script read by text-to-speech software. For the project, the speech synthesiser Ivona (Ivo Software) was used together with the synthetic voice Krzysztof (Loquendo).

voice: Kendra, Ivona Reader

Audio subtitling vs. voice-over
Volver was one of the few foreign (i.e. non-English speaking) films audio described in the UK and released on DVD. The AD script had to be accompanied by a translation of dialogues, which was done through audio subtitles read out by a female narrator. The choice of the female voice most probably stemmed from the nature of the film, where it is women who play the most important characters. The AD script, in contrast, was read by a male voice talent. This solution enabled the audience not to confuse the AD script with the dialogues. However, the presence of one female voice for all the characters and the poor quality of the recording of audio subtitles, which drowned out the original Spanish voices so that they were hardly audible, resulted in viewers having difficulties recognising which character was speaking, as many scenes in the film feature several women talking. As a result, the overall quality of the AD was perceived as poor and the audio described film met with fierce criticism from the British visually impaired community.

It is worth noting at this point that the British audience is not used to hearing a translation of a film being read out to them on top of the original voices. Polish viewers, in contrast, have had many years of experience of listening to the voice-over translation of film dialogues on television, which makes them more accustomed to this AVT modality.
Audio subtitling and voice-over seem to be two audiovisual translation modalities which have a lot in common. First of all, they both consist of a translation of the dialogue list to a foreign or multilingual film. Secondly, the translation is read out to the target audience – the main difference being that in the case of voice-over, the target audience is simply conceived of as mainstream sighted population, whereas in audio subtitling it comprises a much smaller group of visually impaired people. Thirdly, the translation is usually read out by one voice talent (typically a male in Poland), while the voices of the original actors can still be heard in the background though their volume has been turned down. In contrast to the UK, Polish voice-over is always done in a professional recording studio, which usually guarantees good sound quality. Finally, apart from the different target audiences envisaged at the production stage, audio subtitling is created together with the AD script and thus allows for some flexibility in combining the two tracks, whereas in the case of Poland, AD would be added to a voiced-over film at a later stage, which makes it virtually impossible to introduce changes to the pre-recorded VO so that it can be seamlessly interwoven with the AD script.

Research questions
The key objective of the present study was to determine whether visually impaired viewers would find it acceptable for text-to-speech software to read AD scripts to voiced-over feature films. To address this objective, the following three research questions were formulated:
1) Which AD voice would the visually impaired prefer if they had a choice between a human voice and a synthetic voice?
2) Would TTS AD be acceptable as an interim solution, until a system has been agreed to have a human voice reading out the AD?
3) Would TTS AD be acceptable as a permanent solution, next to AD read by a human voice?

The screening took place at an informal meeting for blind and partially sighted people organised by the Foundation Chance for the Blind (Szansa dla Niewidomych) in Jachranka near Warsaw on 24 April 2010.

The audience were first invited to watch the film and after the projection they were asked to provide answers to 13 questions, which were read out by sighted volunteers.

After the screening, a total of 20 people were interviewed: 14 women (70%) and 6 men (30%). As shown in Table 1, five of them were blind (25%), 13 were partially sighted (65%) and two of them (10%) were sighted.
Table 1. Participants by age and degree of sight loss

Age bracket    Blind    Partially sighted    Sighted
18-25          -        3                    -
26-39          2        7                    1
40-59          3        1                    -
60-74          -        2                    1
Total          5        13                   2

Most participants (12 people, 67%) had a congenital sight loss, while one in three (8 people, 33%) acquired the sight loss at a later stage in life. Both the degree and type of sight loss were determined based on self-declarations of the participants.

12 out of 20 participants (67%) said they use text-to-speech software regularly, either at home or at work. Only 11 people (55%) had seen some films with audio description before, while nine of them had no prior experience of AD.

When asked about what voice they would prefer to read AD scripts, half of the participants (10 people, 50%) declared their preference for a human voice. Perhaps somewhat surprisingly, one person preferred a synthetic voice to read AD, whereas many others stated that this depends on the type of programme (6 people, 30%). Three participants (15%) were not sure and would like to have more experience with AD to make a more informed choice.

In terms of accepting TTS AD as either an interim or permanent solution, most participants were in favour of both (Table 2). Some expressed concern that the introduction of TTS AD might result in human voices being eliminated and replaced completely with synthetic voices.

Table 2. The acceptance of TTS AD as an interim or permanent solution

              Interim    Permanent
Yes           95%        70%
No            -          15%
Don’t know    5%         15%


Overall, all participants apart from one were in favour of introducing TTS AD as an interim solution, especially if it meant more audio described programmes accessible to people with visual impairments. The participants were slightly more sceptical, however, about the introduction of TTS AD as a permanent solution: while 70% of them support the idea, almost one in three (30%) is either against or unsure.
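Since only 20 people took part, each participant is worth five percentage points, and the figures in Table 2 are easier to judge as head counts. The Python sketch below converts them back, under two assumptions that should be read as such: the empty "No" cell in the interim column is treated as zero, and the percentages are taken to refer to all 20 participants. The helper name to_count is hypothetical.

```python
N_PARTICIPANTS = 20

# Percentages from Table 2; the blank interim "no" cell is assumed to be 0
table_2 = {
    "interim":   {"yes": 95, "no": 0,  "don't know": 5},
    "permanent": {"yes": 70, "no": 15, "don't know": 15},
}


def to_count(percent: int, n: int) -> int:
    """Convert a whole percentage of n people to the nearest head count."""
    return (percent * n + 50) // 100


counts = {
    solution: {answer: to_count(pct, N_PARTICIPANTS)
               for answer, pct in answers.items()}
    for solution, answers in table_2.items()
}

for solution, answers in counts.items():
    print(solution, answers)
```

Read this way, 95% corresponds to 19 of the 20 participants, which agrees with the statement that all but one were in favour of TTS AD as an interim solution, and 70% corresponds to 14 participants supporting it as a permanent one.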

A closer examination of the preferences for TTS AD as an interim or permanent solution based on the degree of sight loss has shown a slight tendency on the part of blind participants to be more supportive of the idea (Table 3).

Table 3. The acceptance of TTS AD as an interim or permanent solution by degree of sight loss

TTS AD as interim TTS AD as permanent
yes no don’t know yes no don’t know
Blind 100% 80% 20%
Partially Sighted 92% 8% 70% 15% 15%

In terms of gender, it is female participants who appear to be more inclined to accept TTS AD than men (Table 4).

Table 4. The acceptance of TTS AD as an interim or permanent solution by gender

          TTS AD as interim               TTS AD as permanent
Gender    yes     no     don’t know       yes     no      don’t know
Women     100%    -      -                79%     14%     7%
Men       83%     -      17%              50%     16%     34%


There seem to be no clearly discernible correlations between participants’ age and their preference for TTS AD as an interim or permanent solution (Table 5).
Table 5. The acceptance of TTS AD as an interim or permanent solution by age

TTS AD as interim TTS AD as permanent
Age yes no don’t know yes no don’t know
18-25 100% 67% 33%
26-39 90% 10% 60% 20% 20%
40-59 100% 75% 25%
60-74 100% 100%


Interestingly, participants from older age groups seem to be slightly more willing to see TTS AD as a permanent solution than those from younger age groups in the study. Naturally, the sample is too small to draw any far-reaching conclusions.

Previous studies on synthetic speech revealed that experience of and exposure to text-to-speech software may positively influence the attitude towards it. This pattern seems to be confirmed in our study (Table 6).
Table 6. The acceptance of TTS AD as an interim or permanent solution by the use of TTS software

                 TTS AD as interim               TTS AD as permanent
                 yes     no     don’t know       yes     no      don’t know
TTS users        100%    -      -                75%     8%      17%
not TTS users    86%     -      14%              57%     29%     14%

TTS users are more likely to accept TTS AD both as an interim and as a permanent solution when compared to those who have had no regular experience with speech synthesis software. This pattern is more noticeable with regard to TTS AD as a permanent solution.

A similar trend can be observed when it comes to the preference for either human or text-to-speech narrator (Table 7).
Table 7. The preference for human/synthetic narrators by the use of TTS software

                            TTS users    not TTS users
Human                       33%          64%
Synthetic                   -            12%
Depends on the programme    42%          12%
Don’t know                  25%          12%
Total                       12 people    8 people

In the study, people who do not habitually use text-to-speech software were more likely to prefer human narrators, while regular TTS users were more open to the idea that TTS AD may be a good solution for some types of programmes, but not for all (42% stated the choice of the human/synthetic voice depended on the programme). This issue is pursued in further stages of our research when we investigate the application of TTS AD in non-fiction genres, such as a documentary and an educational programme.

This work has been supported by research grant No. N N104 148038 of the Polish Ministry of Science and Higher Education for the years 2010-2011.
Many thanks to Marzena Chrobak, the Polish translator of the Volver screenplay published by the ZNAK publishing house, for letting us use several fragments of her translation in the AD script.


“Once upon a time… life”

by Agnieszka Walczak

The role of audiovisual translation is becoming increasingly important in our daily lives. Until recently, however, young visually impaired media audiences have been the focus of fairly little academic research. The existing studies have concentrated on adult rather than young viewers of audio described programmes. Research concerning the accessibility and reception of audio description (AD) services among blind and visually impaired children, especially in the case of educational programmes, is particularly lacking. Since this audience constitutes quite a large group among AD recipients in Poland, a research study focusing on AD for children was undertaken.

The study was aimed at examining the acceptability and reception of an educational programme with text-to-speech audio description (TTS AD) by young visually impaired viewers. It focused on a rather demanding audience group, which requires a special approach to AD creation on the part of the audio describer. The task seemed to be even more challenging because the audiovisual material chosen for the project was not a feature film, but an animated programme designed to be used as an educational tool. It was the author’s contention that audio described films can greatly enhance the learning process of children and make classes more enjoyable.

About the audiovisual material used
The audiovisual material employed in the study was an episode from the educational animation series Once Upon a Time… Life. Directed by Albert Barillé, this programme was originally produced in France in 1987 and then aired in numerous countries in the world. The episode chosen for the purpose of this research was titled Blood and it was meant to be used in the biology/environment class in schools for blind and partially sighted children.

A definite advantage of the material was the combination of an entertaining storyline with a significant amount of factual information. Every episode of this series tells the story of a different organ or system within the human body. There are, for instance, episodes devoted to the functions of the heart, brain, liver or kidneys, and others dealing with the lymphatic or nervous system. The human body is depicted through numerous animated characters introduced into the series. They are divided into two groups: the good characters, representing the body’s defence mechanisms (i.e. white blood cells), and the bad characters (i.e. viruses and bacteria), which pose a threat to the human body.

Screenshots and short clips of the film (in Polish) are presented below.

Here are two samples of text-to-speech audio description with two different voices:
TTS AD (IVONA text-to-speech, voice: Zosia by Loquendo)

TTS AD (IVONA text-to-speech, voice: Ewa by Ivo Software)

This text-to-speech audio description was recorded with Subik (www.subik.com.pl).

Why text-to-speech audio description?
Bearing in mind possible inconveniences as well as prohibitive costs connected with preparation of traditional pre-recorded human AD, the delivery of AD in this study was made with the use of text-to-speech software.

The freeware programme BESTplayer (version 2.0) together with the text-to-speech application Ivona Reader and a female Polish synthetic voice named Ewa (manufactured by Ivo Software) were used for the purposes of the screening.

Study participants
A total of 76 children (35 girls and 41 boys) participated in the study. They were pupils at the following schools:

  • Róża Czacka Educational Centre for Blind Children in Laski;
  • Louis Braille Special Educational Centre for Blind and Partially Sighted Children in Bydgoszcz; and
  • Special Educational Centre for Blind and Partially Sighted Children in Kraków.

They were aged between 8 and 17 (see the chart below).

Participants by age


For Children
The questionnaire was administered after each of three screenings of the film.
The first part of the questionnaire aimed to establish participants’ personal characteristics, such as gender, age, type (congenital or acquired) and degree (blind or partially sighted) of sight loss. Then they were asked about their previous experience with audio described films as well as their familiarity with speech synthesis software.

The second part of the questionnaire was meant to verify whether the respondents could answer any questions concerning the film’s content after taking part in the screening.
The last part of the questionnaire focused on determining whether the text of TTS AD was clear and intelligible to them, on gathering opinions on the use of synthetic voice for reading AD and on the participants’ eagerness to watch other episodes of Once Upon a Time… Life series with TTS AD.

For Teachers
Where possible, a specially prepared questionnaire was also distributed among teachers in order to collect their views and opinions concerning TTS AD and its use in educational films aimed at visually impaired children. The questionnaire was also designed to show whether such programmes could be used as additional didactic tools in biology/environment classes in the future.
Below you can find partial results of the study. In order to obtain the full report, please contact me at: agnieszka_walczak(AT)hotmail.com.

In general, the results of the study appear to be quite promising. The overall findings confirm the assumption that the animation series under analysis has the potential to become an educational tool for blind and partially sighted children. Since the majority of participants reported gaining new information from the screening of the film, which was the intention, it is suggested that the series could complement biology/environment classes, thus making the lessons more enjoyable. It was found that previous exposure to both audio described films and synthetic speech could affect the acceptability of the programme under analysis. Although the responses on the use of speech synthesis software to read the AD script were varied, with some negative comments on the speech rate and voice intelligibility, a large majority of participants enjoyed the voice employed (see the first chart below) and opted for future screenings of the other episodes of the series (see the second chart below).

Participants’ opinions on the synthetic voice used

Participants’ eagerness towards watching next episodes of the series

Both the screening and the questionnaire were greeted with much excitement by the children, and they also aroused a lot of interest and curiosity among teachers. Furthermore, not only learners but also teachers were enthusiastic about this initiative and its innovativeness.

The generally positive results of the study suggest that the service of TTS AD in educational animation films should be developed in the future. Although there is still room for improvement, participants’ feedback seems to be the best motivation to undertake such actions. Among the commentaries elicited when conducting the questionnaire after the screening of the film, the following ones are definitely worth citing here:

“I’d really like to watch more episodes. Up till now I have watched films without AD. My parents described me the action of the film. But I prefer films with AD.”
Boy, 13 years old

“I liked the voice and the series is really interesting. If there were more episodes, I would definitely like to watch them.”
Girl, 14 years old

“I want to watch the next episodes, because thanks to them I can better understand what is going on in my body and that is very interesting to me.”
Girl, 15 years old

“I’m very happy that there are films with AD. I think this one was faultless. If I didn’t know this was a synthesiser, I would think a real person was reading the text. In general, good and clear AD.”
Girl, 13 years old

This experience is thought to open up not only a new avenue of university research into accessibility, but also a possible accessibility mode that has the potential to be implemented on a wider scale. It may prove to be a much cheaper and more time-effective alternative to traditional audio description provided by human voices.
