I've found a way to use LIME with Vision Transformers (ViTs) and will share it in this post.
Along with its text explainability tools, LIME provides a set of functions for image explainability as well. To use LIME in your vision project, you will need these modules:
Later, once your model is set up and ready, use lime_image like this:
Next, call explain_instance on the explainer object; it takes an image (or list of images) and a prediction function. Here it is tempting to pass in the model.predict function of our Vision Transformer directly. That is wrong. Instead, you must write a helper function that prepares the data before calling the model for predictions:
The first parameter is a numpy array of the image(s) I want analyzed. *Important*: it must not include the extra batch dimension that is usually added up front. That dimension is added later, inside the helper function.
The second parameter is the helper function, pred_fn. This function lets me add the extra batch dimension a Vision Transformer needs. Make sure that when you call explain_instance, the images passed in the first parameter do not yet have that dimension.
Here is my implementation of pred_fn, which accepts a single image or a list of them.
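The implementation itself is missing here; a sketch of such a helper follows, with a dummy model standing in for the real ViT. The `DummyViT` class and the `/255` preprocessing are assumptions for illustration — substitute your own model and its expected preprocessing:

```python
import numpy as np

class DummyViT:
    """Hypothetical stand-in for the real Vision Transformer."""
    def predict(self, batch):
        # Fake probabilities over 1000 classes, one row per image.
        rng = np.random.default_rng(0)
        probs = rng.random((batch.shape[0], 1000))
        return probs / probs.sum(axis=1, keepdims=True)

model = DummyViT()

def pred_fn(images):
    # Accept a single HxWxC image or a batch of them.
    images = np.asarray(images)
    if images.ndim == 3:
        # Add the batch dimension the Vision Transformer requires.
        images = images[np.newaxis, ...]
    # Assumed preprocessing: scale pixels to [0, 1]; adjust as needed.
    images = images.astype("float32") / 255.0
    preds = model.predict(images)  # 2D array: (num_images, num_classes)
    return preds
```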
The output of the predict function is a 2D array: one row of class probabilities per input image. This (num_images, num_classes) shape is exactly what LIME expects from the classifier function, so only those rows go into the returned list.
Finally, the explanation can be used to mask off areas of the original image most salient to the model's output.
The process outputs these beautiful image explanations:
Original image - the model correctly labeled the image "golden_retriever".

The first masked image shows the areas most salient to the decision. Note that the black background is due to the parameter "hide_rest" being set to True.

The second masked image again shows the areas most salient to the decision, but this time in context with the rest of the image.
I hope this is helpful in your Vision Transformer explanations!