What does the introduction of reply suggestions generated by AI do to human communication? Whose “voice” is represented in these machine-generated responses and whose voice is diminished by them?
Google, LinkedIn, and Facebook now offer automated reply suggestions on their platforms, which are used by millions of people every day. The reply suggestions are generally short text snippets generated by machine learning models trained on massive amounts of data. Google’s Smart Reply, for example, provides reply suggestions to Gmail users and is used in 10% of all mobile replies (Kannan et al., 2016). These assistive-AI systems are part of a broader class of systems that aim to help people carry out everyday online tasks.
This project examines the semantic and stylistic similarity of Google’s Smart Reply suggestions to replies written by a diverse set of people. The leading hypothesis is that populations underrepresented in Google’s training set (e.g., older adults, low-income people, less educated individuals) produce responses that are less similar to Smart Reply suggestions in both content and style. This hypothesis is motivated by algorithmic biases documented in Google’s image recognition, Apple’s face recognition products, and criminality risk-assessment software used by Chicago police (Seaver, 2013). Underlying all of these is the same fundamental issue: machine learning models are trained on whatever data is available rather than on a representative sample of the target population, which introduces bias.
The study consists of two parts: collecting Smart Reply suggestions and collecting human responses from a diverse set of people. For the first part, we collected 20 general-purpose, non-identifying emails and the Smart Reply suggestions associated with them. For the second part, we used the crowdsourcing platform Prolific AC to recruit a diverse population of English speakers from around the world with different ethnic backgrounds. Each participant first completed a short battery of demographic questions and was then asked to provide short replies to the same set of emails. The ongoing analysis compares the AI-generated and human-generated responses in terms of semantic meaning and stylistic characteristics.
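As a rough illustration of the kind of comparison described above, the sketch below scores how close a human reply is to the nearest suggestion using bag-of-words cosine similarity. This is not the project's actual measure, which is not specified here; the function names and the example replies are hypothetical, and word overlap is only a crude stand-in for semantic similarity. Stylistic characteristics would need separate features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no word tokens
    return dot / (norm_a * norm_b)


def max_similarity(human_reply, suggestions):
    """Similarity of a human reply to its closest suggested reply."""
    return max(cosine_similarity(human_reply, s) for s in suggestions)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    suggestions = ["Sounds good!", "Thanks, I will be there.", "Sorry, I can't make it."]
    human_reply = "Sounds good, thanks!"
    print(max_similarity(human_reply, suggestions))
```

Averaging such scores within each demographic group would give one simple way to check whether some groups' replies sit farther from the suggestions than others.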
Nir Grinberg, Postdoctoral Fellow, Lazer Lab