(Lack of) Anonymity in the Ubuntu Dialogue Corpus

Presented by:
Paul-Hieu Nguyen (Reed College)
Abstract:

Researchers have recently paid considerable attention to anonymity and privacy in data sets, especially public ones. Much of this research focuses on structured data; work on unstructured data centers largely on medical text. We examine anonymity in a popular human-human conversational data set, the Ubuntu Dialogue Corpus. Using three different named entity recognition models as well as regular expressions, we identify key examples within the data set that demonstrate privacy threats. These threats include the presence of explicit identifiers (name, email), quasi-identifiers (location, age, gender, ethnicity, birthday, etc.), and sensitive information (passwords). We provide a procedure for scaling the work to the entire data set or applying it to similar data sets. Finally, we reflect on the privacy threats and their implications, and outline future work.
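
As an illustration of the regular-expression part of such a scan, the following is a minimal sketch only. The abstract does not give the expressions that were actually used, so the patterns `EMAIL_RE` and `PASSWORD_RE` and the helper `find_identifiers` are hypothetical, and the sample utterance is invented rather than taken from the corpus.

```python
import re

# Hypothetical patterns: assumptions for illustration, not the expressions
# used in the work described above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches phrases such as "password is X", "password: X", or "passwd=X".
PASSWORD_RE = re.compile(
    r"\b(?:password|passwd)\s*(?:is|:|=)\s*(\S+)", re.IGNORECASE
)


def find_identifiers(utterance):
    """Return candidate explicit identifiers and sensitive strings in one utterance."""
    return {
        "email": EMAIL_RE.findall(utterance),
        "password": PASSWORD_RE.findall(utterance),
    }


if __name__ == "__main__":
    # Invented example utterance, not drawn from the Ubuntu Dialogue Corpus.
    sample = "you can mail me at jane.doe@example.com, my password is hunter2"
    print(find_identifiers(sample))
```

Patterns like these would likely over- and under-match on real chat logs, so any hits are best treated as candidates for manual review alongside the named entity recognition output.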