Understanding batch(batch_size) and shuffle(buffer_size) in TensorFlow Datasets

Author: 品位人生2602905223 | Source: Internet | 2023-09-01 12:58

Preface: This article was compiled by the editors of 编程笔记. It covers how batch(batch_size) and shuffle(buffer_size) work in TensorFlow Datasets, and we hope it serves as a useful reference.

Related reference: https://zhuanlan.zhihu.com/p/42417456

1. shuffle(buffer_size)

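The Dataset class in TensorFlow has a shuffle method that randomizes the order of the elements in a dataset. It is used very often during training. The method takes a buffer_size parameter, which the documentation explains as follows: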

dataset.shuffle(buffer_size, seed=None, reshuffle_each_iteration=None)

Randomly shuffles the elements of this dataset. This dataset fills a buffer with `buffer_size` elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required. For instance, if your dataset contains 10,000 elements but `buffer_size` is set to 1,000, then `shuffle` will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer. `reshuffle_each_iteration` controls whether the shuffle order should be different for each epoch.

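First, the Dataset takes the first buffer_size items of the data and uses them to fill the buffer.

Next, one item is picked at random from the buffer and emitted. Suppose item 7 is picked: the slot that held item 7 in the buffer is now empty.

Then the next item in the Dataset, taken in order, fills that empty slot. In this example the next item in order is item 10.

After that, the next output item is again picked at random from the buffer, and the cycle repeats.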
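The steps above can be sketched in plain Python. This is only a minimal simulation of the buffer mechanism described here, not TensorFlow's actual implementation; the function name `shuffle_buffer` and its `seed` argument are made up for illustration.

```python
import random

def shuffle_buffer(items, buffer_size, seed=None):
    """Yield items in the order a buffer-based shuffle could emit them."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Step 1: fill the buffer with the first buffer_size items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Steps 2-4: emit a random slot, then refill that slot with the
    # next item taken from the source in order.
    for item in source:
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = item
    # The source is exhausted: drain the remaining items in random order.
    while buffer:
        slot = rng.randrange(len(buffer))
        yield buffer.pop(slot)

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
shuffled = list(shuffle_buffer(data, buffer_size=4, seed=0))
print(shuffled)
```

With `buffer_size=1` the sketch returns the data in its original order, and with a buffer at least as large as the data every ordering becomes possible, which matches the "perfect shuffling" remark in the documentation quoted above.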

A concrete example:

import tensorflow as tf
import numpy as np

buffer_size = 4
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
label = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# Each dataset element is a (data, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((data, label))
dataset = dataset.shuffle(buffer_size)

# Idiomatic Python 3: use the built-in iter() and next().
it = iter(dataset)
for i in range(10):
    x, y = next(it)
    print(x, y)

Output (from one run; the order is random and will differ between runs):

tf.Tensor(0.1, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.2, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.6, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.5, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.8, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.7, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.4, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(0.3, shape=(), dtype=float64) tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0.9, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1.0, shape=(), dtype=float64) tf.Tensor(0, shape=(), dtype=int32)

The table below traces this run step by step. In each row the struck-through entry is the one that was selected; its slot is refilled by the next element of the dataset, as the following row shows. Once the dataset is exhausted (from the eighth row on), the buffer simply shrinks.

| Buffer before selection | Output (randomly selected) | Buffer after selection |
| --- | --- | --- |
| 0.1, 0.2, 0.3, 0.4 | 0.1 | ~~0.1~~, 0.2, 0.3, 0.4 |
| 0.5, 0.2, 0.3, 0.4 | 0.2 | 0.5, ~~0.2~~, 0.3, 0.4 |
| 0.5, 0.6, 0.3, 0.4 | 0.6 | 0.5, ~~0.6~~, 0.3, 0.4 |
| 0.5, 0.7, 0.3, 0.4 | 0.5 | ~~0.5~~, 0.7, 0.3, 0.4 |
| 0.8, 0.7, 0.3, 0.4 | 0.8 | ~~0.8~~, 0.7, 0.3, 0.4 |
| 0.9, 0.7, 0.3, 0.4 | 0.7 | 0.9, ~~0.7~~, 0.3, 0.4 |
| 0.9, 1.0, 0.3, 0.4 | 0.4 | 0.9, 1.0, 0.3, ~~0.4~~ |
| 0.9, 1.0, 0.3 | 0.3 | 0.9, 1.0, ~~0.3~~ |
| 0.9, 1.0 | 0.9 | ~~0.9~~, 1.0 |
| 1.0 | 1.0 | ~~1.0~~ |

The shuffled dataset therefore comes out in exactly the order shown in the output above.

2. batch(batch_size)

import tensorflow as tf
import numpy as np

# The source data is in order.
dataset = tf.data.Dataset.from_tensor_slices(
    np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]))
batch_dataset = dataset.batch(4)
for ele in batch_dataset:
    print(ele)

Output:

tf.Tensor([1 2 3 4], shape=(4,), dtype=int32)
tf.Tensor([5 6 7 8], shape=(4,), dtype=int32)
tf.Tensor([ 9 10 11 12], shape=(4,), dtype=int32)
tf.Tensor([13 14 15 16], shape=(4,), dtype=int32)

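Here batch simply cuts the dataset, in order, into 4 batches of 4 elements each. Looking closely, every result above is still in the original order. Training a model on batches that are always in the same order wastes resources and gains little, so the data needs to be shuffled. That way each batch contains a different set of elements from one pass to the next, which can improve training.

For this reason, batch should be used together with shuffle.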
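To make the batching step itself concrete before combining it with shuffle, here is a minimal pure-Python sketch of grouping consecutive elements. It is an illustration, not TensorFlow's implementation; the helper name `batch` is hypothetical, and its `drop_remainder` flag is modeled on the parameter of the same name on `tf.data.Dataset.batch`.

```python
def batch(items, batch_size, drop_remainder=False):
    """Group consecutive items into lists of batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    # A final, shorter batch is kept unless drop_remainder is set.
    if current and not drop_remainder:
        yield current

full = list(batch(range(1, 17), 4))      # 16 elements: four full batches
ragged = list(batch(range(1, 11), 4))    # 10 elements: last batch is short
trimmed = list(batch(range(1, 11), 4, drop_remainder=True))

print(full)
print(ragged)
print(trimmed)
```

The example in this section uses 16 elements and a batch size of 4, so every batch is full. When the dataset size is not a multiple of the batch size, the last batch is smaller unless the remainder is dropped.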

3. shuffle(buffer_size) + batch(batch_size)

import tensorflow as tf
import numpy as np

dataset = tf.data.Dataset.from_tensor_slices(
    np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]))
dataset1 = dataset.shuffle(16)   # the buffer covers the whole dataset
dataset2 = dataset1.batch(2)     # batch the shuffled elements in pairs

for i in dataset1:
    print(i)
print("separate")
for j in dataset2:
    print(j)

Output (the two loops print different orders because shuffle reshuffles on every iteration by default):

tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(15, shape=(), dtype=int32)
tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
separate
tf.Tensor([8 2], shape=(2,), dtype=int32)
tf.Tensor([4 7], shape=(2,), dtype=int32)
tf.Tensor([ 3 12], shape=(2,), dtype=int32)
tf.Tensor([ 9 16], shape=(2,), dtype=int32)
tf.Tensor([10 5], shape=(2,), dtype=int32)
tf.Tensor([15 14], shape=(2,), dtype=int32)
tf.Tensor([ 1 13], shape=(2,), dtype=int32)
tf.Tensor([ 6 11], shape=(2,), dtype=int32)

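Here, buffer_size works as follows: shuffle first builds a buffer of size buffer_size and fills it with elements taken from the dataset, and batch then draws its elements from that buffer. If buffer_size is smaller than the dataset, each element taken out of the buffer is replaced by the next element of the dataset that has not been drawn yet. So in general the best choice, memory permitting, is buffer_size = Dataset_size.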
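A small simulation can show why a small buffer shuffles poorly. The sketch below is a pure-Python stand-in for the buffer mechanism (not TensorFlow's code; `shuffle_buffer` and `max_early_jump` are hypothetical names). It measures how far ahead of its original position an element can appear in the output. Because the buffer only ever holds elements that have already been read, an element can move forward by at most buffer_size - 1 positions.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Simulate a buffer-based shuffle (assumes buffer_size >= 1)."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]               # emit a random buffered item
        buffer[slot] = item              # refill the slot with the next item
    rng.shuffle(buffer)                  # source exhausted: drain the rest
    yield from buffer

def max_early_jump(n, buffer_size, seed=0):
    """Largest number of positions any element moved toward the front."""
    rng = random.Random(seed)
    order = list(shuffle_buffer(range(n), buffer_size, rng))
    return max(original - position for position, original in enumerate(order))

for k in (1, 2, 4, 16):
    print("buffer_size =", k, "max early jump =", max_early_jump(16, k))
```

With a buffer of 2 on 16 elements, no element can show up more than one position early, so the output stays close to the original order. Only a buffer as large as the dataset allows the last element to come out first.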

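What changes if the order of shuffle and batch is swapped?

t1 = t.shuffle(int).batch(int)

# This shuffles the elements of t first, then batches them.

t2 = t.batch(int).shuffle(int)

# This shuffles the order of the batches; the elements inside each batch stay together.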

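# Continues from the snippet above (same `dataset`).
# Batch first, then shuffle, so that the code matches the output shown below.
dataset3 = dataset.batch(2)
dataset4 = dataset3.shuffle(16)

for i in dataset3:
    print(i)
print("separate")
for j in dataset4:
    print(j)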

Output:

tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor([7 8], shape=(2,), dtype=int32)
tf.Tensor([ 9 10], shape=(2,), dtype=int32)
tf.Tensor([11 12], shape=(2,), dtype=int32)
tf.Tensor([13 14], shape=(2,), dtype=int32)
tf.Tensor([15 16], shape=(2,), dtype=int32)
separate
tf.Tensor([11 12], shape=(2,), dtype=int32)
tf.Tensor([13 14], shape=(2,), dtype=int32)
tf.Tensor([15 16], shape=(2,), dtype=int32)
tf.Tensor([5 6], shape=(2,), dtype=int32)
tf.Tensor([1 2], shape=(2,), dtype=int32)
tf.Tensor([3 4], shape=(2,), dtype=int32)
tf.Tensor([7 8], shape=(2,), dtype=int32)
tf.Tensor([ 9 10], shape=(2,), dtype=int32)
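The difference between the two orderings can also be reproduced without TensorFlow. The following is a rough sketch that assumes a buffer at least as large as the data, so a full `random.shuffle` stands in for the shuffle step; the helper `make_batches` is a hypothetical name.

```python
import random

def make_batches(items, batch_size):
    """Cut a list into consecutive chunks of batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

rng = random.Random(0)
data = list(range(1, 17))

# shuffle -> batch: individual elements are mixed, then paired up.
elements = data[:]
rng.shuffle(elements)
shuffle_then_batch = make_batches(elements, 2)

# batch -> shuffle: pairs are formed first, then only their order is mixed.
batch_then_shuffle = make_batches(data, 2)
rng.shuffle(batch_then_shuffle)

print("shuffle then batch:", shuffle_then_batch)
print("batch then shuffle:", batch_then_shuffle)
```

In the second result every pair still holds two neighbouring values such as [5, 6], as in the output above. For training, shuffling before batching is usually the ordering that gives each batch a different mix of elements.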
