Comments (2)

ludwigschmidt commented on July 24, 2024

Thank you for the suggestion for improving DataComp. The cited study uses one of LAION’s NSFW classifiers to find CSAM content in LAION-5B. Unlike LAION-5B, we removed NSFW content when assembling DataComp, so to the best of our knowledge, the CSAM images in question are not in DataComp. We will review this issue in more depth and welcome specific suggestions for removing content from DataComp. For additional information, please see Section 3.2, Appendix E, and Appendix G of the DataComp paper, which describe our safety measures in more detail.

from datacomp.

ahundt commented on July 24, 2024

Thank you for your reply. I appreciate your attention to my concerns. However, I would like to point out that my name already appears in the acknowledgements section on page 10 of your paper, because I previously read and shared several items about the design, construction, collection, and publication approach for this dataset with another member of your team. Those concerns were noted but, to the best of my knowledge, have not been addressed in practice. Addressing them would require actions like those found in the papers I reference below.

Regarding CSAM, the 404 Media article makes the very high risk explicit. I would appreciate it if you could substantively address the items in this issue, since I was asking what you have done beyond what is outlined in the paper.

Simply multiplying your own error rate figures by the scale of your dataset yields very large numbers of potentially problematic images. Multiple papers by Birhane et al., as well as the work of the Stanford group that verified the CSAM in LAION, include substantially more comprehensive evaluation steps that, according to your paper, have not been completed.

Here is Dr. Birhane’s Google Scholar page with the relevant papers and methods:

  1. Multimodal Datasets
  2. Data-swamps
  3. LAION’s den
  4. Large image datasets

Here is the page with the Stanford group’s work detecting CSAM.

The paper "Stable Bias" is also likely to be relevant:
https://arxiv.org/abs/2303.11408

I would appreciate it if this matter were taken seriously and acted upon with equal or greater care and attention than the authors of the papers I've cited have taken. The 404 Media article makes the risks, the motivation for addressing them, and their impacts all crystal clear.

Thank you for your time and consideration.

