“Big data”: the term is everywhere. We are living in the aftermath of a big data explosion, in which mountains of data have been collected and are waiting to be mined. Data scientists are the people with the skills to extract usable insights from this data, and as such they are in high demand, prized by tech companies seeking to make better business decisions. What is easy to forget, however, is that data science is still so new a field that even its top practitioners are figuring out best practices as they go.
One issue that has drawn attention recently is ethics in data science. Many data scientists hail from academia, with PhDs in physics, astronomy, computational biology, and other fields, and they are very good at working with data from their respective domains. But working with data about humans is a very different thing from working with, say, astronomical observations. Some of that data may come from people who never gave explicit informed consent to its collection. It is tempting to think that “mere information” is no big deal, since no individual human is visibly harmed. But as Fairfield and Shtein (2014) wrote, “The traditional focus of social science has been on physical, rather than informational harms, and on not harming individuals, but big data impacts communities as much (or more) than individuals. Yet the notion that information is not a cognizable harm is not supportable in the context of an information-based society.”
Furthermore, there is much we cannot foresee about the implications of using data in a particular way. Using a dataset for a purpose it was never intended for can have unintended consequences, so data scientists need to be careful and sensitive. They need to keep in mind the context in which the data was collected and how clean it is. As Cathy O’Neil (2016) wrote, “People have too much trust in data to be intrinsically objective, even though it is in fact only as good as the human processes that collected it.” She described a former project of hers in which her team was tasked with developing an algorithm to predict how long a family would remain in New York City’s homeless services system, with the aim of pairing each family with the most appropriate service. The training dataset spanned the previous 30 years, and her team had to decide whether to use race as a predictor variable. In the end they chose not to, because the dataset was sure to be biased against blacks: due to the effects of racism, blacks have historically made up a large share of the homeless population. If the algorithm learned that blacks were less likely to find a job, and city officials used it today, then homeless black men might be less likely to receive job counseling.
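To make that feature-exclusion decision concrete, here is a minimal sketch of what it looks like in practice, written in Python with pandas and scikit-learn. The file name, column names, and choice of model are all hypothetical illustrations; O’Neil’s actual project, data, and methods are not public.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real dataset is not public.
df = pd.read_csv("homeless_services_history.csv")

SENSITIVE = ["race"]        # attributes deliberately withheld from the model
TARGET = "days_in_program"  # what the model is asked to predict

# Build the feature matrix without the sensitive attribute or the target.
X = df.drop(columns=SENSITIVE + [TARGET])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```

Note that dropping a column is no guarantee of fairness: other features, such as zip code, can act as proxies for race, which is one more reason the kind of human judgment O’Neil describes cannot simply be automated away.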
Knowing all this, it can seem as though today’s data scientists are walking through a minefield: there is no telling when their work might harm the communities it touches. That is why it is important to keep this conversation alive and to keep asking the right questions of ourselves: whether we are doing enough, and whether we have considered the unique needs of the communities whose data we handle.
References
Fairfield, J., & Shtein, H. (2014). Big Data, Big Problems: Emerging Issues in the Ethics of Data Science and Journalism. Journal of Mass Media Ethics, 29(1), 38–51.
O’Neil, C. (2016). The Ethical Data Scientist. Slate. http://www.slate.com/articles/technology/future_tense/2016/02/how_to_bring_better_ethics_to_data_science.html