AE 11: Data science ethics

Part 1 - Data privacy

Consider the following scenario: There appears to be an increase in bicycle accidents around the school before and after class. You have been tasked with collecting data to help protect the health of your peers and improve your community. What data might you collect and how? What responsibility do you have to protect that data?

Specifically:

Which data would you collect?

Class Answers - Location, weather, severity

How would you collect the data?

Class Answers - Traffic cameras, hospital reports, weather reports

How would you keep the data private?

Class Answers - Do not collect personal information; de-identify the information that is collected
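As one way to picture the "de-identify" answer, here is a minimal base R sketch. The data frame, its column names, and every value in it are hypothetical and invented for illustration; the helper `deidentify()` is likewise an assumption, not part of any package. It drops direct identifiers and coarsens location and time so that individual records are harder to trace back to a person.

```r
# Hypothetical accident records; every value below is invented for illustration.
accidents <- data.frame(
  name     = c("A. Rider", "B. Cyclist", "C. Walker"),
  phone    = c("555-0101", "555-0102", "555-0103"),
  lat      = c(36.00123, 36.00456, 36.00789),
  lon      = c(-78.93812, -78.94077, -78.93555),
  time     = as.POSIXct(
    c("2024-03-04 08:12:00", "2024-03-04 15:47:00", "2024-03-05 08:05:00"),
    tz = "UTC"
  ),
  weather  = c("rain", "clear", "clear"),
  severity = c("minor", "moderate", "minor")
)

# Hypothetical helper: remove direct identifiers and coarsen quasi-identifiers.
deidentify <- function(df) {
  # Drop columns that identify a person directly.
  out <- df[, !(names(df) %in% c("name", "phone")), drop = FALSE]
  # Round coordinates to two decimal places (roughly a 1 km grid).
  out$lat <- round(out$lat, 2)
  out$lon <- round(out$lon, 2)
  # Keep only the hour of day, not the exact timestamp.
  out$hour <- as.integer(format(out$time, "%H"))
  out$time <- NULL
  out
}

accidents_deid <- deidentify(accidents)
accidents_deid
```

Coarsening reduces the risk of re-identification but does not remove it: with few accidents in an area, a location and hour may still point to one person, so access to the data should still be limited.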

Part 2 - Predicting ethnicity - data ethics

Your turn (12 minutes): Imai and Khanna (2016) built a racial prediction algorithm using a Bayes classifier trained on voter registration records from Florida and the U.S. Census Bureau’s name list.

The following is the title and the abstract of the paper. Take a minute to read them.

Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records (Imai and Khanna, 2016)

In both political behavior research and voting rights litigation, turnout and vote choice for different racial groups are often inferred using aggregate election results and racial composition. Over the past several decades, many statistical methods have been proposed to address this ecological inference problem. We propose an alternative method to reduce aggregation bias by predicting individual-level ethnicity from voter registration records. Building on the existing methodological literature, we use Bayes’s rule to combine the Census Bureau’s Surname List with various information from geocoded voter registration records. We evaluate the performance of the proposed methodology using approximately nine million voter registration records from Florida, where self-reported ethnicity is available. We find that it is possible to reduce the false positive rate among Black and Latino voters to 6% and 3%, respectively, while maintaining the true positive rate above 80%. Moreover, we use our predictions to estimate turnout by race and find that our estimates yields substantially less amounts of bias and root mean squared error than standard ecological inference estimates. We provide open-source software to implement the proposed methodology. The open-source software is available for implementing the proposed methodology.

The “open-source software” mentioned in the abstract is the wru package: https://github.com/kosukeimai/wru.

Then, if you feel comfortable, install the wru package and try it out using the sample data provided in the package. And if you don’t feel comfortable doing so, take a look at the results below. Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?

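# install.packages("wru")

library(tidyverse)
library(wru)

# `voters` is the sample voter file that ships with wru
data(voters)

# Predict race/ethnicity probabilities from surname alone
predict_race(voter.file = voters, surname.only = TRUE) %>% 
  select(surname, pred.whi, pred.bla, pred.his, pred.asi, pred.oth)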

      surname     pred.whi     pred.bla     pred.his   pred.asi   pred.oth
1      Khanna 0.0049265455 0.0016079483 0.0023108109 0.88994257 0.10121213
2        Imai 0.0059040605 0.0007184811 0.0193291824 0.76407641 0.20997186
3      Rivera 0.0133185684 0.0121164643 0.8680470387 0.07086170 0.03565623
4     Fifield 0.5122939613 0.0052487193 0.0596200858 0.06003480 0.36280243
5        Zhou 0.0006977652 0.0006618759 0.0001766502 0.98773098 0.01073272
6    Ratkovic 0.3845625798 0.0176011903 0.0131533900 0.04793387 0.53674897
7     Johnson 0.1700395443 0.5164452995 0.0263012602 0.02193102 0.26528287
8       Lopez 0.0122148735 0.0074083470 0.9026491278 0.03610843 0.04161922
10 Wantchekon 0.1016771922 0.2873197891 0.3838356390 0.06225610 0.16491128
9       Morse 0.4688011703 0.1153905281 0.0411994830 0.05237443 0.32223438

If you have installed the package, re-run the code, this time to see which race the package predicts for your own surname. Now consider the same questions again: Was the publication of this model ethical? Does the open-source nature of the code affect your answer? Is it ethical to use this software? Does your answer change depending on the intended use?

Answers will vary

Class answers - It depends on the usage

me <- tibble(surname = "Blake")

predict_race(voter.file = me, surname.only = TRUE)

Warning: Unknown or uninitialised column: `state`.

Proceeding with last name predictions...

ℹ All local files already up-to-date!
ℹ All local files already up-to-date!

  surname  pred.whi  pred.bla   pred.his    pred.asi   pred.oth
1   Blake 0.6663199 0.2276731 0.03137626 0.007787957 0.06684285
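The predictions above use the surname alone. The abstract describes going further by using Bayes's rule to combine surname information with geographic information. The following is only a toy sketch of that idea and is not the wru implementation: all probabilities are invented, the helper `bayes_update()` is hypothetical, and the sketch assumes that surname and location are independent given race.

```r
# Hypothetical inputs: all probabilities below are invented for illustration.

# P(race | surname): what a surname list alone would suggest.
p_race_given_surname <- c(whi = 0.50, bla = 0.30, his = 0.10, asi = 0.05, oth = 0.05)

# P(location | race): share of each group's population living in the voter's area.
p_geo_given_race <- c(whi = 0.001, bla = 0.004, his = 0.001, asi = 0.001, oth = 0.001)

# Hypothetical helper: Bayes's rule with the surname probabilities as the prior
# and the location shares as the likelihood.
bayes_update <- function(prior, likelihood) {
  stopifnot(identical(names(prior), names(likelihood)))
  unnormalized <- prior * likelihood
  unnormalized / sum(unnormalized)  # rescale so the probabilities sum to 1
}

posterior <- bayes_update(p_race_given_surname, p_geo_given_race)
round(posterior, 3)
```

With these made-up numbers, the probability for `bla` rises from 0.30 to about 0.63 once location is taken into account, which shows how adding more personal information sharpens the prediction about an individual.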

Part 3 - Bias

library(tidyverse)

data <- tibble(x = c(7,6,6,9,8,4.3,6,7,5.1,7,6.1,6.6,4,4.4,2,5.6,6.5,4.8,10,5,5,7,4.4,6.8) )

# Red line: x = 4.22; blue line: mean of the plotted values.
# `linewidth` replaces `size` for lines as of ggplot2 3.4.0.
data |>
  ggplot() + 
  geom_histogram(aes(x = x), bins = 20) + 
  geom_vline(xintercept = 4.22, color = "red", linewidth = 2) +
  geom_vline(xintercept = mean(data$x), color = "blue", linewidth = 2)

What are your major takeaways from this activity?

Class Answers - We were biased when sampling, even when we tried not to be. We did not take a random sample.
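The difference between a random sample and a hand-picked one can be sketched with a small simulation. Everything here is an assumption made for illustration: the population is invented, and human selection is modeled as choosing larger items with probability proportional to their size, which is only one plausible way that judgment sampling can go wrong.

```r
set.seed(199)  # arbitrary seed so the simulation is reproducible

# Hypothetical population of 55 item sizes (invented for illustration);
# small items are more common than large ones, and the true mean is 4.
population <- rep(1:10, times = 10:1)
true_mean <- mean(population)

n_draws <- 2000  # number of simulated samplers
n <- 5           # items chosen by each sampler

# Simple random sampling: every item is equally likely to be chosen.
random_means <- replicate(n_draws, mean(sample(population, n)))

# "Eye-catching" sampling: larger items are more likely to be chosen,
# modeled here with selection probability proportional to size.
judgment_means <- replicate(
  n_draws,
  mean(sample(population, n, prob = population))
)

# Compare the true mean with the average sample mean under each scheme.
c(
  true     = true_mean,
  random   = mean(random_means),
  judgment = mean(judgment_means)
)
```

Under these assumptions, the random samples average out close to the true mean, while the size-favoring samples overshoot it consistently, no matter how many samplers take part.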

How does this concept relate to bias in algorithms?

Class Answers - Bias can creep into how models are developed and into the data used with them. We need to think critically and ask questions throughout the research process to be as objective as possible.

Optional

Part 4 - Stochastic parrots

Your turn (10 minutes):

Read the following title and abstract.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 (Bender et al., 2021)

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Have you used a natural language model before? Describe your use.
What is meant by “stochastic parrots” in the paper title?