kmeisthax / pd-diffusion
License: Apache License 2.0
Wikimedia Commons is ostensibly CC-BY-SA, but there are specific legal rationales, exclusions, and other rights assertions in the wikitext. These need to be parsed to determine the copyright status of each image. At a minimum:
Any data that breaches these rules should be dropped. Specifically, images with active copyright must be excluded from the dataset at export time, and label data not compatible with CC-BY-SA must be ignored.
Structured wikidata is claimed to be public domain. In practice this is either because the data itself is not copyrightable or because of explicit permissive licensing or dedications. While this may have odd copyright implications in countries that do not recognize public domain dedication, such as Germany, it does not affect the licensing status of generated imagery, so it will be allowed to remain for the time being.
Currently we use Dataset.from_generator to pull data from SQL into a Dataset. This has several problems; most notably, we can't use dataset preloading or other features that require pickle-able datasets.
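One possible way around the pickling limitation, sketched here under assumptions: copy the rows out of SQL eagerly into plain Python records, which pickle cleanly, instead of wrapping a live cursor in a generator. The `images` table, its columns, and the `load_rows` helper are hypothetical, and `sqlite3` stands in for whichever database the project actually uses.

```python
import pickle
import sqlite3


def load_rows(conn: sqlite3.Connection) -> list[dict]:
    """Eagerly copy rows into plain dicts so the result holds no DB handle."""
    cursor = conn.execute("SELECT id, path, label FROM images ORDER BY id")
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO images (id, path, label) VALUES (?, ?, ?)",
    [(1, "a.jpg", "portrait"), (2, "b.jpg", "landscape")],
)

rows = load_rows(conn)
conn.close()

# A plain list of dicts survives a pickle round trip, unlike a generator
# that closes over an open connection.
restored = pickle.loads(pickle.dumps(rows))
```

A list like this could then be handed to whichever dataset constructor accepts in-memory records. The trade-off is memory: everything is materialized up front.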
MediaWiki's API for scraping categories will only include either hidden or non-hidden categories, but not both. We need both in order to enforce category checks elsewhere in the code.
We also need a scraping pass for getting parent category data, since the API only returns categories that are directly attached to the page.
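A minimal sketch of both passes, assuming the standard MediaWiki `action=query` request with `prop=categories` and its `clshow` filter (`hidden` or `!hidden`). The helper names and sample category titles are hypothetical, and the network fetch is deliberately left out: `direct_parents` stands in for whatever lookup the scraper performs.

```python
from collections import deque


def category_query_params(title: str, hidden: bool) -> dict:
    """Build query parameters for one pass (hidden or non-hidden categories)."""
    return {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "clshow": "hidden" if hidden else "!hidden",
    }


def merge_passes(visible: list[str], hidden: list[str]) -> dict[str, bool]:
    """Combine both passes into {category title: is_hidden}."""
    merged = {name: False for name in visible}
    merged.update({name: True for name in hidden})
    return merged


def all_ancestors(start: list[str], direct_parents: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk collecting every parent category reachable from start."""
    seen: set[str] = set()
    queue = deque(start)
    while queue:
        category = queue.popleft()
        for parent in direct_parents.get(category, []):
            if parent not in seen:  # guards against category cycles
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical data standing in for API responses.
merged = merge_passes(["Category:Maps of France"], ["Category:CC-BY-SA-4.0"])
parents = {
    "Category:Maps of France": ["Category:Maps by country"],
    "Category:Maps by country": ["Category:Maps"],
    "Category:Maps": ["Category:Maps by country"],  # deliberate cycle
}
ancestors = all_ancestors(list(merged), parents)
```

The `seen` set matters because category graphs on a wiki are not guaranteed to be trees, so a naive recursive walk could loop forever.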
Currently, CLIP calculation takes over an hour on a dataset of 90k images. This has to be done every time the training process restarts, which is a pain in the ass.
Intersects with #1 - if we move to static datasets then we need to also store CLIP vectors in that dataset. If we do this before static datasets then we need SQL tables to store CLIP data per trained model.
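The SQL option could look roughly like the sketch below: a table keyed on image and model, so vectors computed for one model are never confused with another's. The table name `clip_vectors`, the column layout, the float32 encoding, and the model label are all assumptions for illustration, not the project's schema.

```python
import sqlite3
from array import array


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache with one row per (image, model) pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clip_vectors (
               image_id INTEGER NOT NULL,
               model_name TEXT NOT NULL,
               vector BLOB NOT NULL,
               PRIMARY KEY (image_id, model_name)
           )"""
    )
    return conn


def store_vector(conn, image_id: int, model_name: str, vector: list[float]) -> None:
    """Serialize the vector as packed float32 and upsert it."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO clip_vectors VALUES (?, ?, ?)",
        (image_id, model_name, blob),
    )


def load_vector(conn, image_id: int, model_name: str):
    """Return the cached vector as a list of floats, or None on a cache miss."""
    row = conn.execute(
        "SELECT vector FROM clip_vectors WHERE image_id = ? AND model_name = ?",
        (image_id, model_name),
    ).fetchone()
    if row is None:
        return None
    values = array("f")
    values.frombytes(row[0])
    return values.tolist()


conn = open_cache()
store_vector(conn, 1, "hypothetical-clip-model", [0.5, -1.25, 2.0])
cached = load_vector(conn, 1, "hypothetical-clip-model")
missing = load_vector(conn, 2, "hypothetical-clip-model")
```

On restart, the training process would only need to compute vectors for images where the lookup returns `None`.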
Wikimedia Commons has a LOT of maps that are uploaded in multiple formats. For example:
This is inflating the number of maps in the training set and causing the model to overfit on drawing maps rather than the condition specified by CLIP.
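One rough heuristic for thinning these out, offered as a sketch only: treat files whose titles differ only by extension as the same image and keep one per group. Whether the duplicates on Commons actually share a filename stem is an assumption; the titles below are hypothetical, and a content-based comparison may be needed where names differ.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_stem(titles: list[str]) -> dict[str, list[str]]:
    """Group file titles that are identical apart from case and extension."""
    groups: dict[str, list[str]] = defaultdict(list)
    for title in titles:
        name = title.removeprefix("File:")
        stem = PurePosixPath(name).stem.casefold()
        groups[stem].append(title)
    return dict(groups)


def pick_one_per_group(titles: list[str]) -> list[str]:
    """Keep the first title seen in each group and drop the rest."""
    return [members[0] for members in group_by_stem(titles).values()]


# Hypothetical titles standing in for scraped file pages.
titles = [
    "File:Example map.svg",
    "File:Example map.png",
    "File:Other chart.jpg",
]
kept = pick_one_per_group(titles)
```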