Visual data are informative, but they can also be confusing, whether intentionally or unintentionally. Images that appear visually similar, confusing, or manipulative to the human eye can benefit from fine-grained pattern identification and associated description. Understanding which tokens, words, phrases, or sentences best convey the meaning, intention, and motivation of an image captured in real life can have wide applications. Our research attempts to understand this use of objects and complementary cues, such as the motivation or feelings behind descriptions (as seen in the real world, e.g., in news articles and transcribed video interviews), to find the images that best match these fine-grained descriptions. These language-based heuristics, we contend, will not always yield an unequivocal interpretation of an image, but they will at least explain at what point, and why, interpretations differ.
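As a minimal illustration of this kind of description-to-image matching, the sketch below scores candidate descriptions of different granularities against a single image with an off-the-shelf vision-language model. The choice of CLIP, the model checkpoint, and the example file name are assumptions made purely for illustration; our approach is not tied to any particular model.

```python
# Minimal sketch: rank candidate fine-grained descriptions of an image by
# image-text similarity. CLIP is used here only as an assumed stand-in for
# any vision-language similarity model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image

# Candidate descriptions at word, phrase, and sentence granularity.
descriptions = [
    "a protest",
    "a crowd celebrating a sports victory",
    "people gathered in anger outside a government building",
]

inputs = processor(text=descriptions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher scores suggest a description better captures the image; the spread
# of scores hints at where interpretations of the same image start to diverge.
scores = outputs.logits_per_image.softmax(dim=1).squeeze().tolist()
for desc, score in sorted(zip(descriptions, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {desc}")
```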