Develop a deep learning model to automatically describe photographs in Python with Keras, step-by-step.
Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.
It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn that understanding of the image into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on this problem.
What is perhaps most impressive about these methods is that a single end-to-end model can be defined to predict a caption given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.
In this tutorial, you will discover how to develop a photo-captioning deep learning model from scratch.
After completing this tutorial, you will know:
- How to prepare photo and text data for training a deep learning model.
- How to design and train a deep learning caption generation model.
- How to evaluate a trained caption generation model and use it to caption entirely new photographs.
This tutorial is divided into the following parts:
- Photo and Caption Dataset
- Prepare Photo Data
- Prepare Text Data
- Develop Deep Learning Model
- Train With Progressive Loading (NEW)
- Evaluate Model
- Generate New Captions
Python Environment
This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.
You must have Keras installed with the TensorFlow backend. The tutorial also assumes you have the NumPy and NLTK libraries installed.
Before we proceed, let's check your deep learning library versions.
Run the following script and check your version numbers:
# tensorflow version
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
# keras version
import keras
print('keras: %s' % keras.__version__)
Running the script should show the same library version numbers, or higher.
tensorflow: 2.4.0
keras: 2.4.3
Let's dive in.
Photo and Caption Dataset
The Flickr8K dataset is a good dataset to use when getting started with image captioning.
The reason is that it is realistic and relatively small, so you can download it and build models on your workstation using a CPU.
The dataset is described well in the 2013 paper "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics".
The authors describe the dataset as follows:
We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
…
The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.
— Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.
The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to the files for you, but the email address expressly requests: "Please do not redistribute the dataset".
You can use the official request form to request the dataset (note that this may no longer work; see the update below).
Within a short time you will receive an email containing links to two files:
- Flickr8k_Dataset.zip (1 Gigabyte): an archive of all photographs.
- Flickr8k_text.zip (2.2 Megabytes): an archive of all text descriptions for the photographs.
UPDATE (February 2019): The official site appears to have been taken down (although the request form still works). Direct download links are available from my dataset GitHub repository.
Download the datasets and unzip them into your current working directory. You will have two directories:
- Flickr8k_Dataset: contains 8,092 photographs in JPEG format.
- Flickr8k_text: contains a number of files with different sources of descriptions for the photographs.
The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).
One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset:
- BLEU-1: 0.401 to 0.578.
- BLEU-2: 0.176 to 0.390.
- BLEU-3: 0.099 to 0.260.
- BLEU-4: 0.059 to 0.170.
We describe the BLEU metric in more detail later when we evaluate the model; a quick preview is shown below.
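As a preview, a minimal hedged example of how cumulative BLEU-1 through BLEU-4 can be computed with the NLTK library (the reference and candidate captions below are made up for illustration):

from nltk.translate.bleu_score import corpus_bleu

# one list of reference captions (tokenized) per generated caption
references = [[['a', 'dog', 'runs', 'on', 'the', 'beach']]]
candidates = [['a', 'dog', 'runs', 'on', 'the', 'sand']]
# cumulative 1-gram to 4-gram BLEU, matching the score ranges reported above
print('BLEU-1: %f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(references, candidates, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))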
Next, let's look at how to load the images.
Prepare Photo Data
We will use a pre-trained model to interpret the content of the photos.
There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that won the ImageNet competition in 2014.
Keras provides this pre-trained model directly. Note that the first time you use this model, Keras will download the model weights from the internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.
We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.
Instead, we can pre-compute the "photo features" using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different to running the photo through the full VGG model; it is just that we do it once in advance.
This is an optimization that will make training our models faster and consume less memory.
We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to make a classification prediction for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the "features" that the model has extracted from the photo.
Keras also provides tools for reshaping a loaded photo into the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).
Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096-element vector.
The function returns a dictionary of image identifier to image features.
# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named 'features.pkl'.
The complete example is listed below.
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flickr8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))
Running this data preparation step may take a while depending on your hardware, perhaps an hour on a CPU on a modern workstation.
At the end of the run, you will have the extracted features stored in 'features.pkl' for later use. This file will be about 127 Megabytes in size.
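As a quick sanity check, a small hedged sketch that loads the saved pickle and inspects one entry (it assumes 'features.pkl' was created by the script above):

from pickle import load

# load the saved photo features
features = load(open('features.pkl', 'rb'))
print('Loaded photo features: %d' % len(features))
# inspect one arbitrary entry; each value is an array of shape (1, 4096)
some_id = next(iter(features))
print(some_id, features[some_id].shape)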
Prepare Text Data
The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.
First, we will load the file containing all of the descriptions.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
Each photo has a unique identifier. This identifier is used in the photo filename and in the text file of descriptions.
Next, we will step through the list of photo descriptions. Below defines a function load_descriptions() that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.
We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:
- Convert all words to lowercase.
- Remove all punctuation.
- Remove all words that are one character or less in length (e.g. 'a').
- Remove all words with numbers in them.
Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.
import string

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word)>1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
Once cleaned, we can summarize the size of the vocabulary.
Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that will train faster.
For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a list of all description strings
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
Finally, we can save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.
Below defines the save_descriptions() function that, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves the mapping to file.
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')
Putting this all together, the complete listing is provided below.
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word)>1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a list of all description strings
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')
Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).
Loaded: 8,092
Vocabulary Size: 8,763
Finally, the clean descriptions are written to 'descriptions.txt'.
Taking a look at the file, we can see that the descriptions are ready for modeling. The order of descriptions in your file may vary.
2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
…
Develop Deep Learning Model
In this section, we will define the deep learning model and fit it on the training dataset.
This section is divided into the following parts:
- Loading Data.
- Defining the Model.
- Fitting the Model.
- Complete Example.
Loading Data
First, we must load the prepared photo and text data so that we can use it to fit the model.
We are going to train the model on all of the photos and captions in the training dataset. While training, we are going to monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.
The train and development datasets have been pre-defined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, which both contain lists of photo filenames. From these filenames, we can extract the photo identifiers and use those identifiers to filter the photos and descriptions for each set.
The function load_set() below will load a pre-defined set of identifiers given the train or development set filename.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.
Below is the function load_clean_descriptions() that loads the cleaned text descriptions from 'descriptions.txt' for a given set of identifiers and returns a dictionary of identifiers to lists of text descriptions.
The model we will develop will generate a caption given a photo, and the caption will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we need a "first word" to kick off the generation process and a "last word" to signal the end of the caption.
We will use the strings 'startseq' and 'endseq' for this purpose. These tokens are added to the descriptions as they are loaded. It is important to do this now before we encode the text so that the tokens are also encoded correctly.
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
Next, we can load the photo features for a given dataset.
Below defines a function named load_photo_features() that loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers.
This is not very efficient; nevertheless, it will get us up and running quickly.
# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
We can pause here and test everything developed so far.
The complete code example is listed below.
from pickle import load

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
Running this example first loads the 6,000 photo identifiers in the training dataset. These are then used to filter and load the cleaned description text and the pre-computed photo features.
We are nearly there.
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
The description text will need to be encoded to numbers before it can be presented to the model as input or compared to the model's predictions.
The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data.
Below defines to_lines() to convert the dictionary of descriptions into a list of strings, and the create_tokenizer() function that will fit a Tokenizer given the loaded photo description text.
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
We can now encode the text.
Each description will be split into words. The model will be provided one word and the photo and generate the next word. Then the first two words of the description will be provided to the model as input with the image to generate the next word. This is how the model will be trained.
For example, the input sequence "little girl running in field" would be split into 6 input-output pairs to train the model:
X1, X2 (text sequence), y (word)
photo startseq, little
photo startseq, little, girl
photo startseq, little, girl, running
photo startseq, little, girl, running, in
photo startseq, little, girl, running, in, field
photo startseq, little, girl, running, in, field, endseq
Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.
The function below named create_sequences(), given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, will transform the data into input-output pairs for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model, which is the encoded next word in the text sequence.
The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which will be a probability distribution over all words in the vocabulary.
The output data will therefore be a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1. A tiny illustration of this encoding is shown below, followed by the create_sequences() function itself.
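As a minimal hedged illustration of what one-hot encoding the output word looks like (using a toy vocabulary of 5 words rather than the real one):

from keras.utils import to_categorical

# word with integer index 3 becomes a 5-element one-hot vector
print(to_categorical([3], num_classes=5)[0])
# expected output: [0. 0. 0. 1. 0.]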
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
We will need to calculate the maximum number of words in the longest description. A short helper function named max_length() is defined below.
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)
We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.
Defining the Model
The authors provide a nice schematic of the model, reproduced below.
Schematic of the Merge Model for Image Captioning
We will describe the model in three parts:
- Photo Feature Extractor. This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
- Sequence Processor. This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
- Decoder (for lack of a better name). Both the feature extractor and the sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.
The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256-element representation of the photo.
The Sequence Processor model expects input sequences with a pre-defined length (34 words), which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.
Both input models produce a 256-element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting of the training dataset, as this model configuration learns very fast.
The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense 256-neuron layer and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.
The function below named define_model() defines and returns the model ready to be fit.
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
To get a sense of the structure of the model, specifically the shapes of the layers, see the summary listed below.
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_2 (InputLayer) (None, 34) 0
____________________________________________________________________________________________________
input_1 (InputLayer) (None, 4096) 0
____________________________________________________________________________________________________
embedding_1 (Embedding) (None, 34, 256) 1940224 input_2[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout) (None, 4096) 0 input_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout) (None, 34, 256) 0 embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 256) 1048832 dropout_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM) (None, 256) 525312 dropout_2[0][0]
____________________________________________________________________________________________________
add_1 (Add) (None, 256) 0 dense_1[0][0]
lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 256) 65792 add_1[0][0]
____________________________________________________________________________________________________
dense_3 (Dense) (None, 7579) 1947803 dense_2[0][0]
====================================================================================================
Total params: 5,527,963
Trainable params: 5,527,963
Non-trainable params: 0
____________________________________________________________________________________________________
We also create a plot to visualize the structure of the network that better helps understand the two streams of input.
Plot of the Caption Generation Deep Learning Model
Fitting the Model
Now that we know how to define the model, we can fit it on the training dataset.
The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.
At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.
We can do this by defining a ModelCheckpoint in Keras and specifying it to monitor the minimum loss on the validation dataset and to save the model to a file that has both the training and validation loss in the filename.
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
We can then specify the checkpoint in the call to fit() via the callbacks argument. We must also specify the development dataset in fit() via the validation_data argument.
We will only fit the model for 20 epochs, but given the amount of training data, each epoch may take 30 minutes on modern hardware.
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Complete Example
The complete example for fitting the model on the training data is listed below.
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# train dataset

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features, vocab_size)

# dev dataset

# load test set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features, vocab_size)

# fit model

# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
Running the example first prints a summary of the loaded training and development datasets.
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
Vocabulary Size: 7,579
Description Length: 34
Dataset: 1,000
Descriptions: test=1,000
Photos: test=1,000
After the summary of the model, we can get an idea of the total number of training and validation (development) input-output pairs.
Train on 306,404 samples, validate on 50,903 samples
The model then runs, saving the best model to .h5 files along the way.
On my run, the best validation result was saved to the file:
- model-ep002-loss3.245-val_loss3.612.h5
This model was saved at the end of epoch 2 with a loss of 3.245 on the training dataset and a loss of 3.612 on the development (validation) dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Let me know what you get in the comments below.
Did you get an error message like the following?
Memory Error
If so, see the next section.
Train With Progressive Loading
Note: if you had no problems in the previous section, please skip this section. This section is for those who do not have enough memory to train the model as described in the previous section (e.g. cannot use AWS EC2 for whatever reason).
Training the caption model does assume you have a lot of RAM.
The code in the previous section is not memory efficient and assumes you are running on a large EC2 instance with 32GB or 64GB of RAM. If you are running the code on a workstation with 8GB of RAM, you cannot train the model.
One workaround is to use progressive loading. This is discussed in detail in the second-to-last section, titled "Progressive Loading", of my post on data generators for caption models.
I recommend reading that section before continuing.
If you want to use progressive loading to train this model, this section will show you how.
The first step is to define a function that we can use as the data generator.
We will keep things very simple and have the data generator yield one photo's worth of data per batch. This will be all of the sequences generated for a photo and its set of descriptions.
The function data_generator() below will be the data generator and will take the loaded textual descriptions, photo features, tokenizer, and max length. Here, I assume that you can fit this training data in memory, which I believe 8GB of RAM should be more than capable of.
How does this work? Read the post on data generators I just mentioned above.
# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    # loop for ever over images
    while 1:
        for key, desc_list in descriptions.items():
            # retrieve the photo feature
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo, vocab_size)
            yield [in_img, in_seq], out_word
You can see that we are calling the create_sequences() function to create a batch worth of data for a single photo rather than an entire dataset. This means that we must update the create_sequences() function to remove the "iterate over all descriptions" for-loop.
The updated function is as follows:
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc_list, photo, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)
We now have almost everything we need.
Note, this is a very basic data generator. The big memory saving it offers is to not hold the unrolled sequences of train and test data in memory prior to fitting the model; instead, these samples (e.g. the results from create_sequences()) are created as needed, one photo at a time.
Some off-the-cuff ideas for further improving this data generator include:
- Randomize the order of photos each epoch.
- Work with a list of photo ids and load text and photo data as needed to further reduce memory.
- Yield more than one photo's worth of samples per batch.
I have experimented with these variations myself in the past; a minimal sketch of the first idea is shown below. Let me know if you try them, and how you go, in the comments.
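A hedged sketch of the first idea, shuffling the photo order at the start of every epoch (it assumes the single-photo create_sequences() defined above and the same arguments as data_generator(); the function name is mine):

from random import shuffle

# data generator variant that shuffles the photo order each epoch
def data_generator_shuffled(descriptions, photos, tokenizer, max_length, vocab_size):
    keys = list(descriptions.keys())
    # loop for ever over images
    while 1:
        # new photo order for every pass over the dataset
        shuffle(keys)
        for key in keys:
            # retrieve the photo feature
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, descriptions[key], photo, vocab_size)
            yield [in_img, in_seq], out_word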
You can sanity check the data generator by calling it directly, as follows:
# test the data generator
generator = data_generator(train_descriptions, train_features, tokenizer, max_length, vocab_size)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)
Running this sanity check shows what one batch worth of sequences looks like; in this case, 47 samples to train on for the first photo.
(47, 4096)
(47, 34)
(47, 7579)
Finally, we can use the fit_generator() function on the model to train the model with this data generator.
In this simple example we will discard the loading of the development dataset and model checkpointing and simply save the model after each training epoch. You can then go back and load/evaluate each saved model after training to find the one with the lowest loss, and use it in the next section.
The code to train the model with the data generator is as follows:
# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
    # create the data generator
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length, vocab_size)
    # fit for one epoch
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    # save model
    model.save('model_' + str(i) + '.h5')
That's it. You can now train the model using progressive loading and save a ton of RAM. This may also be a lot slower.
The complete updated example with progressive loading (use of the data generator) for training the caption generation model is listed below.
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc_list, photo, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    # loop for ever over images
    while 1:
        for key, desc_list in descriptions.items():
            # retrieve the photo feature
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo, vocab_size)
            yield [in_img, in_seq], out_word

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# define the model
model = define_model(vocab_size, max_length)
# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
    # create the data generator
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length, vocab_size)
    # fit for one epoch
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    # save model
    model.save('model_' + str(i) + '.h5')
Perhaps evaluate each saved model and choose the final model with the lowest loss on a holdout dataset; a hedged sketch of this is shown below. The next section may help with this.
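A minimal sketch of that idea, looping over the per-epoch files saved above and printing BLEU scores for each (it assumes the test-set loading code, tokenizer, max_length, and the evaluate_model() function from the next section are already defined):

from keras.models import load_model

# evaluate every per-epoch model saved by the progressive-loading loop
for i in range(20):
    filename = 'model_' + str(i) + '.h5'
    model = load_model(filename)
    print(filename)
    evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)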
Did you use this addition to the tutorial?
How did you go?
Evaluate Model
Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.
We will evaluate the model by generating descriptions for all photos in the test dataset and evaluating those predictions with a standard cost function.
First, we need to be able to generate a description for a photo using a trained model.
This involves passing in the start description token 'startseq', generating one word, then calling the model recursively with the generated words as input until the end of sequence token 'endseq' is reached or the maximum description length is reached.
The function below named generate_desc() implements this behavior and generates a textual description given a trained model and a given prepared photo as input. It calls the function word_for_id() in order to map an integer prediction back to a word.
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
We will generate predictions for all photos in the test dataset (and in the train dataset).
The function below named evaluate_model() will evaluate a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score that summarizes how close the generated text is to the expected text.
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
BLEU scores are used in text translation for evaluating translated text against one or more reference translations.
Here, we compare each generated description against all of the reference descriptions for the photograph. We then calculate BLEU scores for 1, 2, 3 and 4 cumulative n-grams.
The NLTK Python library implements the BLEU score calculation in the corpus_bleu() function. A higher score closer to 1.0 is better; a score closer to zero is worse.
We can put all of this together with the functions from the previous section for loading the data. We first need to load the training dataset in order to prepare a Tokenizer so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words using exactly the same encoding scheme that was used when training the model.
We then use these functions to load the test dataset.
The complete example is listed below.
from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# prepare tokenizer on train set

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# prepare test set

# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model-ep002-loss3.245-val_loss3.612.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
Running the example prints the BLEU scores.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We can see that the scores fit within, and are close to the top of, the expected range for a skillful model on this problem. The chosen model configuration is by no means optimized.
BLEU-1: 0.579114
BLEU-2: 0.344856
BLEU-3: 0.252154
BLEU-4: 0.131446
Generate New Captions
Now that we know how to develop and evaluate a caption generation model, how can we use it?
Almost everything we need to generate captions for entirely new photographs is in the model file.
We also need the Tokenizer for encoding generated words for the model while generating a sequence, and the maximum length of input sequences used when we defined the model (e.g. 34).
We can hard-code the maximum sequence length. With the encoding of text, we can create the tokenizer and save it to a file so that we can load it quickly whenever we need it without needing the entire Flickr8K dataset. An alternative would be to use our own vocabulary file and mapping-to-integers function during training; a hedged sketch of that alternative follows.
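A minimal sketch of that alternative, persisting the learned word-to-integer mapping as a plain text vocabulary file (it assumes a fitted tokenizer; the filename vocab.txt is hypothetical):

# write one "word index" pair per line so the mapping can be reloaded without the dataset
with open('vocab.txt', 'w') as f:
    for word, index in tokenizer.word_index.items():
        f.write('%s %d\n' % (word, index))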
We can create the Tokenizer as before and save it as a pickle file, tokenizer.pkl. The complete example is listed below.
from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
We can now load the tokenizer whenever we need it without having to load the entire training dataset of annotations.
Now, let's generate a description for a new photograph.
Below is a new photograph that I chose randomly on Flickr (available under a permissive license).
Photo of a dog at the beach.
We will generate a description for it using our model.
Download the photograph and save it to your local directory with the filename 'example.jpg'.
First, we must load the Tokenizer from tokenizer.pkl and define the maximum length of the sequence to generate, needed for padding inputs.
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
Then we must load the model, as before.
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
Next, we must load the photo we wish to describe and extract the features.
We could do that by re-defining the model and adding the VGG-16 model to it, or we can use the VGG model to predict the features and use them as inputs to our existing model. We will do the latter and use a modified version of the extract_features() function used during data preparation, adapted to work on a single photo.
# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# load and prepare the photograph
photo = extract_features('example.jpg')
We can then generate a description using the generate_desc() function defined when evaluating the model.
The complete example for generating a description for an entirely new standalone photograph is listed below.
from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, the description generated was as follows:
startseq dog is running across the beach endseq
You can remove the start and end tokens and you will have the basis for a neat automatic photo captioning model; a small sketch of stripping the tokens is shown below.
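Continuing from the example above, one simple way to drop the 'startseq' and 'endseq' tokens from the generated text:

# strip the start and end tokens from the generated description
tokens = description.split()
print(' '.join(tokens[1:-1]))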