—Image caption generation is a field of research at the intersection of machine vision and natural language processing. The evidence so far shows that it is difficult for a machine to understand an image the way a human does. Most of the methods proposed for automatic image description follow the encoder-decoder framework, in which the word at step n is generated from the features of the image and the previously generated words. Recently, the attention mechanism, which usually creates a spatial map highlighting the image regions associated with each word, has been widely used in this research. This paper also adopts the encoder-decoder framework. The encoder of our model uses ResNet101 to extract image features, and the decoder consists of three parts: an Attention-LSTM, a Language-LSTM, and an Attention-Layer. Our attention mechanism uses local evidence to better represent image features. Our method generates good captions and improves the METEOR and ROUGE evaluation metrics.
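The abstract names the decoder components but not their wiring. As a minimal sketch of one plausible reading, in the spirit of two-LSTM captioning decoders with additive attention, the PyTorch code below shows how an Attention-LSTM, an attention layer, and a Language-LSTM could be composed for a single decoding step. All class and method names (AttentionLayer, TwoLSTMDecoder, step), dimensions, and the exact input concatenations are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Additive attention over spatial image features (assumed formulation)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (batch, regions, feat_dim); h: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hidden_proj(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # spatial map over image regions
        context = (alpha * feats).sum(dim=1)   # attended image feature
        return context, alpha.squeeze(-1)

class TwoLSTMDecoder(nn.Module):
    """Attention-LSTM + Language-LSTM decoder, advanced one word at a time."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512,
                 hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention-LSTM input: language hidden state, mean image feature, word embedding
        self.attn_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.attention = AttentionLayer(feat_dim, hidden_dim, attn_dim)
        # Language-LSTM input: attended feature and Attention-LSTM hidden state
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, state):
        (h1, c1), (h2, c2) = state
        mean_feat = feats.mean(dim=1)
        x1 = torch.cat([h2, mean_feat, self.embed(word_ids)], dim=1)
        h1, c1 = self.attn_lstm(x1, (h1, c1))
        context, alpha = self.attention(feats, h1)
        x2 = torch.cat([context, h1], dim=1)
        h2, c2 = self.lang_lstm(x2, (h2, c2))
        logits = self.out(h2)                  # scores for the next word
        return logits, alpha, ((h1, c1), (h2, c2))
```

A short usage example, assuming 7x7 grid features from ResNet101 (the 2048-channel feature map size is an assumption about how the encoder output is used):

```python
decoder = TwoLSTMDecoder(vocab_size=10000)
feats = torch.randn(4, 49, 2048)  # batch of 4 images, 49 regions each
state = tuple((torch.zeros(4, 512), torch.zeros(4, 512)) for _ in range(2))
logits, alpha, state = decoder.step(torch.zeros(4, dtype=torch.long), feats, state)
```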