Paullada et al. 2020 review the research literature on the ethical and sociological issues surrounding the (typically large) datasets commonly used in machine learning, such as ImageNet.
This note isn’t a comprehensive summary of the review, just certain takeaways of particular interest.
- Databases of faces under-represent darker skinned faces (Buolamwini and Gebru 2018).
- Databases of images under-represent non-Western countries (DeVries et al. 2019).
- Databases of images reflect social stereotypes (Zhao et al. 2017, Burns et al. 2018).
- A toxicity database labels words associated with queer identities as toxic (Dixon et al. 2018).
Poor Data Collection
- Millions of non-consensual pornographic images appear in large image datasets (Prabhu and Birhane 2020).
- Annotation labels reflect subjective judgments, not ground truth (Miceli et al. 2020).
- Tsipras et al. 2020 argue ImageNet:
  - Applies only one label to an image that might contain multiple objects.
  - Relies on crowdworkers (Mechanical Turk, etc.) whose labeling process introduces bias.
Poor Data Documentation
- There is widespread variation in how datasets report the way their data was produced (Geiger et al. 2019).
Myopic Focus On Benchmarks
- “… the machine learning community still has much to learn from other disciplines with respect to how they handle the data of human subjects. Unlike in the social sciences or medicine, the machine learning field has yet to develop the data management practices required to store and transmit sensitive human data” (Paullada et al. 2020).
- Data science should be treated as human-subjects research and require IRB approval (Metcalf and Crawford 2016).
Data Use And Reuse
- Data gathered for one purpose may not be appropriate to reuse for another.
- The DukeMTMC database was collected from eight security cameras around the Duke campus and distributed without the consent of the individuals in the videos. The database was eventually taken down but remains widely available online.
- ImageNet draws on a number of data sources whose licensing/copyright status is unclear. ImageNet sidesteps this by not hosting the images itself.
- Creative Commons holds that training machine learning models on CC-licensed works is permitted under the fair use doctrine (Merkley 2019).
- Legal scholars argue it is okay to use copyrighted text to train machine learning models (Sag 2019).
- Using copyrighted text can combat bias by enabling a larger, more diverse training corpus (Levendowski 2017).
- “In closing, we advocate for a turn in the culture towards carefully collected datasets, rooted in their original contexts, distributed only in ways that respect the intellectual property and privacy rights of data creators and data subjects, and constructed in conversation with the relevant scientific and scholarly fields required to create datasets that faithfully model tasks and tasks which target relevant and realistic capabilities. Such datasets will undoubtedly be more expensive to create, in time, money and effort, and therefore smaller than today’s most celebrated benchmarks. This, in turn, will encourage work on approaches to machine learning (and to artificial intelligence beyond machine learning) that go beyond the current paradigm of techniques idolizing scale. Should this come to pass, we predict that machine learning as a field will be better positioned to understand how its technology impacts people and to design solutions that work with fidelity and equity in their deployment contexts.” (Paullada et al. 2020).