Comments (2)
Thank you for the suggestion for improving DataComp. The cited study uses one of LAION’s NSFW classifiers to find CSAM content in LAION-5B. Unlike LAION-5B, we removed NSFW content when assembling DataComp, so to the best of our knowledge, the CSAM images in question are not in DataComp. We will review this issue in more depth and welcome specific suggestions for removing content from DataComp. For additional information, please see Section 3.2, Appendix E, and Appendix G of the DataComp paper, which describe our safety measures in more detail.
from datacomp.
Thank you for your reply and for your attention to my concerns. However, I would like to point out that my name already appears in the acknowledgements section on page 10 of your paper, because I previously read about and shared several items on the design, construction, collection, and publication approach for this dataset with another member of your team. Those concerns were noted but, to the best of my knowledge, have not been addressed in practice; doing so would require actions like those found in the papers I reference below.
Regarding CSAM, the 404 Media article makes explicit the very high risk posed. I would appreciate it if you could substantively address the items in this issue, since I was asking what you have done beyond what is outlined in the paper.
Simply multiplying your own error-rate figures by the scale of your dataset yields a very large number of potentially problematic images. The papers by Birhane et al., as well as the work of the Stanford group that verified the CSAM in LAION, include substantially more comprehensive evaluation steps that, according to your paper, have not been completed.
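As a rough illustration of that multiplication, here is a minimal sketch. The pool size and the error rates below are hypothetical placeholders, not figures taken from the DataComp paper; substitute the values reported there.

```python
def expected_count(num_samples: int, error_rate: float) -> float:
    """Expected number of items a filter lets through, given a per-item miss rate."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return num_samples * error_rate


if __name__ == "__main__":
    # Hypothetical values for illustration only (not from the paper).
    pool_size = 1_000_000_000
    for rate in (1e-2, 1e-3, 1e-4):
        missed = expected_count(pool_size, rate)
        print(f"miss rate {rate:.4%}: about {missed:,.0f} items expected to slip through")
```

Even a miss rate that looks small per image translates into a large absolute count once the pool reaches billions of samples.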
Here is Dr. Birhane’s Google Scholar page with the relevant papers and methods:
- Multimodal Datasets
- Data-swamps
- LAION’s den
- Large image datasets
Here is the page with the Stanford group’s work detecting CSAM.
The Stable Bias paper is also likely to be relevant:
https://arxiv.org/abs/2303.11408
I would appreciate it if this matter were taken seriously and acted upon with equal or greater care and attention than the authors of the papers I’ve provided have shown. The 404 Media article makes the risks, the motivation for addressing them, and their impacts all crystal clear.
Thank you for your time and consideration.