Here is my immediate theory for constructing a black-box methodology to detect poisoned outputs of a given model, which can be arbitrarily large. A black-box methodology is desirable because deployed models can reach sizes of 175 billion parameters. The main hypothesis of this approach is that poisoned outputs are not robust, so exercising the model's robustness may reveal them:
For a given model that takes an input S (which can be an image or text) and produces an output G (which can be an image, text, or a probability vector), the methodology adds three small-scale models (as shown in the figure below) on top of the given original model:
The augmentor model augments the input data (e.g., an image) so that the information content stays the same while the input is no longer identical element-wise. This augmented input is then sent to the original model as a query alongside the original input.
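The three-model pipeline described above can be sketched as follows. This is a minimal illustration, not the actual notebook code: `model`, `augmentor`, and `similarity` are placeholders for the original model, the augmentor model, and the similarity model, and `n_views` is an assumed parameter for how many augmented queries to issue.

```python
import numpy as np

def similarity_profile(model, augmentor, similarity, x, n_views=8):
    """Query the original model with several augmented versions of the
    input and record how similar each output is to the output on the
    unaugmented input. The resulting sequence is the 'similarity profile'
    that a final classifier would inspect."""
    base_out = model(x)
    profile = []
    for _ in range(n_views):
        x_aug = augmentor(x)  # information-preserving perturbation of x
        profile.append(similarity(base_out, model(x_aug)))
    return np.array(profile)
```

A robust (non-poisoned) behaviour should yield a profile of consistently high similarities, while a poisoned behaviour is expected to produce a more erratic profile.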
In my demo of this study-like approach, I take a CIFAR-10-trained model as the original model and reproduce the approach above by making the augmentor model approximate the image with its center-cropped version, and making the similarity model a simple cosine similarity function. I did not build the final classifier model, because this is a study, not a paper.
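The center-crop augmentation and cosine similarity used in the demo could look like the following sketch. The crop size and nearest-neighbour resize are my assumptions for 32x32 CIFAR-10 images, not necessarily what the notebooks do.

```python
import numpy as np

def center_crop_and_resize(img, crop=24):
    """Center-crop an HxWxC image and resize it back to the original size
    by nearest-neighbour indexing: the image content is preserved, but the
    pixels are no longer identical element-wise."""
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = img[top:top + crop, left:left + crop]
    rows = np.arange(h) * crop // h  # map output rows back into the crop
    cols = np.arange(w) * crop // w
    return cropped[rows][:, cols]

def cosine_similarity(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```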
The "main_model.ipynb" notebook covers creating the poisoned model and computing its similarity profile. The "better_pure_model.ipynb" notebook creates a non-poisoned model. The "augmentor.ipynb" notebook creates the augmentor model, which fundamentally exercises robustness by generating queries similar to the original query to the original model. Following is a sample input-output comparison of this augmentor model.
The comparison of the similarity profiles of the poisoned and non-poisoned models is shown below.
There is a clear difference in the temporal dynamics of the time series associated with the two categories, and this difference can be picked up by a sequence classification model.
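As a rough illustration of how such a classifier might separate the two profiles, a few hand-made summary features plus a threshold rule already capture the idea; a proper sequence model (e.g. a small LSTM) would replace this. The feature choices and the cutoff value here are illustrative assumptions, not results from the study.

```python
import numpy as np

def profile_features(profile):
    """Summary statistics of one similarity profile: overall level,
    spread, and step-to-step variation of the time series."""
    diffs = np.diff(profile)
    return np.array([profile.mean(), profile.std(), np.abs(diffs).mean()])

def flag_as_poisoned(profile, mean_cutoff=0.9):
    # Toy rule: flag the model as poisoned when augmented queries drift
    # far from the clean output on average. The cutoff is illustrative
    # and would be learned by a real sequence classifier.
    return bool(profile_features(profile)[0] < mean_cutoff)
```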