Fine-grained image recognition is a problem in Computer Vision which focuses on discriminating between objects that appear similar. Two images of an object in this problem classification can appear vastly different while images from different classes can appear nearly identical. To solve this problem, one must determine regions of significance or salient regions of this image and determine the classification from these regions. Current approaches take the approach of a hard extraction of these regions or some small deviation off hard extraction. Furthermore, in most cases, these regions are used without the spatial context of the region positioning in the image, using an approach similar to the bag-of-words model found in natural language processing. The approach described in this paper will abandon salient region proposals, electing instead to decompose the image into a series of subsets. Each of these subsets will undergo the same feature extraction process, carried out by a series of convolution and pooling layers. The output of this process will be used as the input to a recurrent neural network, ultimately classify the initial image. In processing the image in this fashion, each of these subsets’ salience in the context of the larger classification will be determined. A standardized implementation of this architecture has not yet been completed. As such, results indicative of performance can not currently be determined.
This work is licensed under a Creative Commons Attribution 4.0 International License.