pd-diffusion's People

Contributors

kmeisthax

pd-diffusion's Issues

Parse all rights assertions in the wikitext

Wikimedia Commons is ostensibly CC-BY-SA, but the wikitext carries specific legal rationales, exclusions, and other rights assertions. These need to be parsed to determine the copyright status of each image. At a minimum:

  • The date the image was originally drawn, painted, or taken must be known.
  • Any copyright claim to label data independent of the underlying image must be known.
  • Image creation dates must precede the worldwide copyright cut-off date.
  • Copyright claims to label data must not be incompatible with CC-BY-SA.

Any data that breaches these rules should be dropped. Specifically, images with active copyright must be excluded from the dataset at export time, and label data not compatible with CC-BY-SA must be ignored.
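The rules above might be applied at export time roughly as follows. This is a minimal sketch, not the project's implementation: the `ImageRecord` fields, the `CUTOFF_YEAR` value, and the set of label licenses treated as compatible are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed placeholder; the real worldwide cut-off has to be decided by the project.
CUTOFF_YEAR = 1900
# Assumed set of label licenses treated as compatible with CC-BY-SA.
COMPATIBLE_LABEL_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "PD"}


@dataclass
class ImageRecord:
    """Hypothetical result of parsing one file page's wikitext."""
    title: str
    creation_year: Optional[int]  # None when no usable date was found
    label_license: Optional[str]  # None when no rights assertion was parsed
    label: str


def image_is_exportable(rec: ImageRecord) -> bool:
    """Keep an image only if its creation date is known and precedes the cut-off."""
    return rec.creation_year is not None and rec.creation_year < CUTOFF_YEAR


def usable_label(rec: ImageRecord) -> Optional[str]:
    """Return the label if its license is known and compatible, else None."""
    if rec.label_license in COMPATIBLE_LABEL_LICENSES:
        return rec.label
    return None
```

Unknown values are treated as failures in both checks, so a record with a missing date or a missing label license is dropped or ignored rather than assumed safe.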

Structured Wikidata data is claimed to be public domain. In practice this is either because the data itself is not copyrightable or because of an explicit permissive license or dedication. While this may have odd copyright implications in countries that do not recognize public domain dedication, such as Germany, it does not affect the licensing status of generated imagery, so it can remain in the dataset for the time being.

Export to static dataset

Currently we use Dataset.from_generator to pull data from SQL into a Dataset. This has several problems; most notably, we can't use dataset preloading or other features that require pickle-able datasets.
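One possible shape for a static export is to dump the SQL rows to a JSON Lines file that can later be loaded without a live database connection. The sketch below assumes a hypothetical `images` table with `id`, `title`, and `label` columns and uses SQLite only for illustration; the project's actual schema and database may differ.

```python
import json
import sqlite3


def export_to_jsonl(conn: sqlite3.Connection, path: str) -> int:
    """Write each row of the (hypothetical) `images` table as one JSON object per line.

    Returns the number of rows written.
    """
    cursor = conn.execute("SELECT id, title, label FROM images ORDER BY id")
    columns = [col[0] for col in cursor.description]
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for row in cursor:
            out.write(json.dumps(dict(zip(columns, row))) + "\n")
            count += 1
    return count
```

A file like this is plain data on disk, so whatever loads it does not need to hold a database cursor or a generator closure, which is what gets in the way of pickling.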

Not all categories are scraped off each image

MediaWiki's API for scraping categories will only include either hidden or non-hidden categories, but not both. We need both in order to enforce category checks elsewhere in the code.

We also need a scraping pass for getting parent category data, since the API only returns categories that are directly attached to the page.
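A rough sketch of both fixes follows: query the categories twice (once with `clshow=hidden`, once with `clshow=!hidden`), merge the results, then expand them with parent categories. The function names and the in-memory `parents` mapping are hypothetical; the network request itself is left out, and only the `prop=categories`, `cllimit`, and `clshow` parameters come from the MediaWiki API.

```python
from typing import Dict, Iterable, List, Set


def category_query_params(title: str, hidden: bool) -> Dict[str, str]:
    """Build the parameters for one pass of a `prop=categories` query."""
    return {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "clshow": "hidden" if hidden else "!hidden",
    }


def merge_passes(hidden: Iterable[str], visible: Iterable[str]) -> Set[str]:
    """Combine the category names returned by the hidden and non-hidden passes."""
    return set(hidden) | set(visible)


def with_ancestors(direct: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Expand directly attached categories with every ancestor category.

    `parents` maps a category to its parent categories (assumed to have been
    scraped in a separate pass). Cycles in the category graph are tolerated.
    """
    seen: Set[str] = set()
    stack = list(direct)
    while stack:
        category = stack.pop()
        if category in seen:
            continue
        seen.add(category)
        stack.extend(parents.get(category, []))
    return seen
```

The `seen` set matters because category graphs on a wiki are not guaranteed to be trees; without it, a category loop would make the walk run forever.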

Store precalculated CLIP vectors

Currently, CLIP calculation takes over an hour on a dataset of 90k images. This has to be done every time the training process restarts, which is a pain in the ass.

Intersects with #1 - if we move to static datasets then we need to also store CLIP vectors in that dataset. If we do this before static datasets then we need SQL tables to store CLIP data per trained model.
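If the SQL route is taken first, a table keyed by image and model would be one way to hold the vectors. The sketch below is only an illustration: the `clip_vectors` table, its column names, and the choice of packed 32-bit floats in a BLOB are assumptions, and SQLite stands in for whatever database the project uses.

```python
import sqlite3
from array import array
from typing import List, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS clip_vectors (
    image_id   INTEGER NOT NULL,
    model_name TEXT    NOT NULL,
    vector     BLOB    NOT NULL,
    PRIMARY KEY (image_id, model_name)
)
"""


def store_vector(conn: sqlite3.Connection, image_id: int,
                 model_name: str, vector: List[float]) -> None:
    """Save one vector as packed 32-bit floats (native byte order)."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO clip_vectors VALUES (?, ?, ?)",
        (image_id, model_name, blob),
    )


def load_vector(conn: sqlite3.Connection, image_id: int,
                model_name: str) -> Optional[List[float]]:
    """Return the cached vector, or None if it has not been computed yet."""
    row = conn.execute(
        "SELECT vector FROM clip_vectors WHERE image_id = ? AND model_name = ?",
        (image_id, model_name),
    ).fetchone()
    if row is None:
        return None
    values = array("f")
    values.frombytes(row[0])
    return values.tolist()
```

On restart, training would call `load_vector` first and only run CLIP for images that return `None`, so the hour-long pass is paid once per model rather than once per run.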

Filter duplicate images in the training set

Wikimedia Commons has a LOT of maps that are uploaded in multiple formats. For example:

https://commons.wikimedia.org/wiki/File:5th_plan,_from_7th_east_to_13th_Street_and_G_Street_south_to_East_Capitol_Street_-_(S.E._Washington_D.C.)._LOC_88694121.jpg

https://commons.wikimedia.org/wiki/File:5th_plan,_from_7th_east_to_13th_Street_and_G_Street_south_to_East_Capitol_Street_-_(S.E._Washington_D.C.)._LOC_88694121.tif

This is inflating the number of maps in the training set and causing the model to overfit on drawing maps rather than the condition specified by CLIP.
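For the specific case shown above, where the same file title exists with different extensions, a simple filter could group titles by their name without the extension and keep one per group. This is a sketch under assumptions: `FORMAT_PREFERENCE` is an invented ordering, and the approach only catches same-name uploads, not visually identical images under different titles.

```python
import os
from typing import Dict, Iterable, List

# Assumed preference order when one title exists in several formats.
FORMAT_PREFERENCE = [".jpg", ".jpeg", ".png", ".tif", ".tiff"]


def _rank(ext: str) -> int:
    """Lower rank means more preferred; unknown extensions rank last."""
    ext = ext.lower()
    if ext in FORMAT_PREFERENCE:
        return FORMAT_PREFERENCE.index(ext)
    return len(FORMAT_PREFERENCE)


def drop_format_duplicates(titles: Iterable[str]) -> List[str]:
    """Keep one title per extension-less name, choosing the preferred format."""
    best: Dict[str, str] = {}
    for title in titles:
        stem, ext = os.path.splitext(title)
        current = best.get(stem)
        if current is None or _rank(ext) < _rank(os.path.splitext(current)[1]):
            best[stem] = title
    return list(best.values())
```

Applied to the two URLs above, only the `.jpg` title would survive, since both share the same name up to the extension.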
