
TensorFlow 2: Preprocessing the Titanic Survival CSV Dataset

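This example uses TensorFlow 2 to load CSV data into a tf.data.Dataset, using a classic dataset: the Titanic passenger data.

1. Import the required libraries

import tensorflow as tf
import numpy as np
import pandas as pd
import functools

# Print the version of each library used below
for i in [tf, np, pd]:
    print(i.__name__, ": ", i.__version__, sep="")

Output: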

tensorflow: 2.2.0
numpy: 1.17.4
pandas: 0.25.3

2. Download and load the data


2.1 Download the data to the local machine

trainDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
testDataUrl = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

# get_file downloads each file once and returns its local path
trainFilePath = tf.keras.utils.get_file("trainTitanic.csv", trainDataUrl)
testFilePath = tf.keras.utils.get_file("testTitanic.csv", testDataUrl)

Output:

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 1s 29us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 28us/step

On Windows, the downloaded files are saved under <system drive>:\Users\<username>\.keras\datasets.
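
To check where the files land on a given machine, the path can be built by hand. This is a minimal sketch that assumes the default Keras cache location (the .keras/datasets folder in the user's home directory) has not been overridden; the helper name default_keras_dataset_path is made up for illustration.

```python
import os

def default_keras_dataset_path(filename):
    """Build the path where get_file is assumed to store `filename` by default."""
    return os.path.join(os.path.expanduser("~"), ".keras", "datasets", filename)

train_path = default_keras_dataset_path("trainTitanic.csv")
print(train_path)
print(os.path.exists(train_path))  # True only after the download has run
```

The same expression works on Linux and macOS, where the home directory replaces the Windows user folder.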

2.2 Load the data

labelColumn = "survived"  # name of the label column
labels = [0, 1]

def getDataset(filePath, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        filePath,
        batch_size=5,  # small batches keep the printed examples readable
        label_name=labelColumn,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

rawTrainData = getDataset(trainFilePath)
rawTestData = getDataset(testFilePath)

def showBatch(dataset):
    # Print every feature column of the first batch, followed by the labels
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}:{}".format(key, value.numpy()))
        print("{:20s}:{}".format("label", label.numpy()))

showBatch(rawTrainData)

Output:

sex :[b'male' b'female' b'male' b'male' b'male']
age :[50. 30. 28. 31. 27.]
n_siblings_spouses :[0 0 0 1 0]
parch :[0 0 0 0 0]
fare :[ 13. 106.425 8.4583 52. 8.6625]
class :[b'Second' b'First' b'Third' b'First' b'Third']
deck :[b'unknown' b'unknown' b'unknown' b'B' b'unknown']
embark_town :[b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton']
alone :[b'y' b'y' b'y' b'n' b'y']
label :[0 1 0 0 1]
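
Each batch printed above is a dictionary that maps a column name to batch_size values, plus a separate array of labels. The plain-Python sketch below mimics that layout on a tiny in-memory CSV. It only illustrates the shape of the data and is not how make_csv_dataset is implemented; the sample rows and the helper name batches_by_column are illustrative assumptions.

```python
import csv
import io

SAMPLE_CSV = """survived,sex,age,fare
0,male,22.0,7.25
1,female,38.0,71.2833
1,female,26.0,7.925
"""

def batches_by_column(text, label_name, batch_size):
    """Yield (features, labels); features maps column name -> list of values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        labels = [int(row[label_name]) for row in chunk]
        names = [name for name in chunk[0] if name != label_name]
        # Values stay as strings here; no type inference is attempted.
        features = {name: [row[name] for row in chunk] for name in names}
        yield features, labels

first_features, first_labels = next(batches_by_column(SAMPLE_CSV, "survived", 2))
for key, value in first_features.items():
    print("{:20s}:{}".format(key, value))
print("{:20s}:{}".format("label", first_labels))
```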

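3. Data preprocessing

Columns loaded from a CSV file may have different data types, so the data must be preprocessed before it is fed to the model. You can preprocess with a tool such as sklearn and then pass the result to TensorFlow, or use TensorFlow's built-in tf.feature_column tools. The advantage of the latter is that if the trained model is saved or shared with others, the preprocessing steps are saved along with it.

3.1 Feature selection: continuous data

selectColumns = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']  # analyze only these columns
defaults = [0, 0.0, 0.0, 0.0, 0.0]  # one default per selected column; also fixes each column's dtype
tempDataset = getDataset(trainFilePath,
                         select_columns=selectColumns,
                         column_defaults=defaults)

showBatch(tempDataset)

Output: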

age :[25. 43. 18. 55.5 47. ]
n_siblings_spouses :[1. 0. 0. 0. 1.]
parch :[2. 0. 0. 0. 1.]
fare :[151.55 8.05 7.7958 8.05 52.5542]
label :[0 0 0 0 1]

example_batch, labels_batch = next(iter(tempDataset))

# Pack all columns together into a single tensor
def pack(features, label):
    return tf.stack(list(features.values()), axis=1), label

packed_dataset = tempDataset.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.numpy(), labels.numpy(), sep="\n\n")

Output:

[[28.      1.      0.     14.4542]
 [28.      0.      0.      7.2292]
 [52.      1.      0.     78.2667]
 [48.      1.      2.     65.    ]
 [30.5     0.      0.      8.05  ]]

[0 0 1 1 0]
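
The pack function relies on stacking along axis=1, which turns a dictionary of per-column vectors into one row per example. Here is a NumPy sketch of the same operation, with np.stack standing in for tf.stack and the first two rows of the output above as input:

```python
import numpy as np

# One vector per column, each holding the values for two examples
features = {
    "age": np.array([28.0, 28.0]),
    "n_siblings_spouses": np.array([1.0, 0.0]),
    "parch": np.array([0.0, 0.0]),
    "fare": np.array([14.4542, 7.2292]),
}

# axis=1 places the columns side by side: one row per example
packed = np.stack(list(features.values()), axis=1)
print(packed.shape)
print(packed)
```

Stacking along axis=0 instead would give one row per column, which is the transpose of what the model expects.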

# A general-purpose preprocessing class: select some features and pack them into a single column
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names

    def __call__(self, features, labels):
        # Remove the numeric columns from the feature dict, cast and stack them
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=1)
        features["numeric"] = numeric_features
        return features, labels

numericFeatures = ["age", "n_siblings_spouses", "parch", "fare"]

packed_train_data = rawTrainData.map(PackNumericFeatures(numericFeatures))
packed_test_data = rawTestData.map(PackNumericFeatures(numericFeatures))

showBatch(packed_train_data)

Output:

sex :[b'male' b'male' b'male' b'male' b'male']
class :[b'Third' b'Third' b'Third' b'Third' b'Second']
deck :[b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town :[b'Cherbourg' b'Southampton' b'Queenstown' b'Cherbourg' b'Southampton']
alone :[b'n' b'y' b'n' b'y' b'y']
numeric :[[15. 1. 1. 7.2292]
 [25. 0. 0. 7.05 ]
 [ 7. 4. 1. 29.125 ]
 [28. 0. 0. 7.8958]
 [28. 0. 0. 13. ]]
label :[0 0 0 0 1]

example_batch, labels_batch = next(iter(packed_train_data))

3.2 Normalizing continuous data

Continuous data usually needs to be normalized.

desc = pd.read_csv(trainFilePath)[numericFeatures].describe()
desc

Output:

mean = np.array(desc.T["mean"])
std = np.array(desc.T["std"])

print(mean, type(mean))
print(std, type(std))

Output:

[29.63130781 0.54545455 0.37958533 34.38539856] <class 'numpy.ndarray'>
[12.51181763 1.1510896 0.79299921 54.5977305 ] <class 'numpy.ndarray'>
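
Note that the std reported by pandas' describe() is the sample standard deviation (divisor n - 1), not the population one. A small standard-library sketch on made-up ages shows the difference:

```python
import statistics

ages = [22.0, 38.0, 26.0, 35.0]  # made-up values for illustration

age_mean = statistics.mean(ages)
sample_std = statistics.stdev(ages)       # divisor n - 1, as in describe()
population_std = statistics.pstdev(ages)  # divisor n

print(age_mean, sample_std, population_std)
```

With several hundred rows, as in the training file, the two values are nearly identical, so the choice has little effect on the normalized features.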

def normalization(data, mean, std):
    return (data - mean) / std

# Bind mean and std so the normalizer takes only the data as its argument
normalizer = functools.partial(normalization, mean=mean, std=std)

numeric_column = tf.feature_column.numeric_column(
    "numeric",
    normalizer_fn=normalizer,
    shape=[len(numericFeatures)])
numeric_columns = [numeric_column]
numeric_column

Output:

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalization at 0x...>, mean=array([29.63130781, 0.54545455, 0.37958533, 34.38539856]), std=array([12.51181763, 1.1510896 , 0.79299921, 54.5977305 ])))

example_batch["numeric"]

Output:

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[28.  ,  1.  ,  0.  , 24.15],
       [51.  ,  0.  ,  0.  ,  8.05],
       [ 6.  ,  0.  ,  1.  , 33.  ],
       [26.  ,  0.  ,  0.  , 10.5 ],
       [16.  ,  0.  ,  0.  , 26.  ]], dtype=float32)>

numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

Output:

array([[-0.13038135,  0.39488277, -0.4786705 , -0.18746932],
       [ 1.7078807 , -0.47385937, -0.4786705 , -0.4823534 ],
       [-1.888719  , -0.47385937,  0.7823648 , -0.02537466],
       [-0.2902302 , -0.47385937, -0.4786705 , -0.4374797 ],
       [-1.0894746 , -0.47385937, -0.4786705 , -0.15358512]],
      dtype=float32)
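
The normalized values can be checked by hand. The sketch below reuses the mean and std printed earlier and applies the same (data - mean) / std formula to the first row of example_batch["numeric"], with NumPy standing in for the TensorFlow layer:

```python
import functools

import numpy as np

mean = np.array([29.63130781, 0.54545455, 0.37958533, 34.38539856])
std = np.array([12.51181763, 1.1510896, 0.79299921, 54.5977305])

def normalization(data, mean, std):
    return (data - mean) / std

# functools.partial fixes mean and std, leaving a one-argument function
normalizer = functools.partial(normalization, mean=mean, std=std)

row = np.array([28.0, 1.0, 0.0, 24.15])  # age, n_siblings_spouses, parch, fare
normalized = normalizer(row)
print(normalized)
```

The result should agree with the first row of the layer output up to float32 rounding.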

3.3 Feature selection: categorical data

CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    # Vocabulary entries must match the values in the data exactly
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone': ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    # indicator_column turns the category index into a one-hot vector
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

categorical_columns

Output:

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

Output:

[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
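
The 20 values above are the one-hot vectors of the five categorical columns placed end to end. The output is consistent with the columns being concatenated in alphabetical order of their keys (alone, class, deck, embark_town, sex), and with a value outside the vocabulary, such as deck 'unknown', becoming an all-zero vector. A minimal NumPy sketch of that encoding follows; the helper name one_hot is made up, and this is an illustration rather than the library's implementation:

```python
import numpy as np

def one_hot(value, vocabulary):
    """Return a one-hot vector; values outside the vocabulary give all zeros."""
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    if value in vocabulary:
        vector[vocabulary.index(value)] = 1.0
    return vector

class_vector = one_hot("Third", ["First", "Second", "Third"])
deck_vector = one_hot("unknown", ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])

print(class_vector)
print(deck_vector)
```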

3.4 Combining the continuous and categorical layers

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns + numeric_columns)

print(preprocessing_layer(example_batch).numpy()[0])

Output:

[ 0. 1. 0. 0. 1. 0.
  0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 1.
 -0.13038135 0.39488277 -0.4786705 -0.18746932 1. 0. ]

4. Build the model

model = tf.keras.Sequential([
    preprocessing_layer,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1)  # outputs a single raw logit, no sigmoid
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=["accuracy"])
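
Because the last Dense layer has no activation, the model outputs raw logits, and from_logits=True tells the loss to apply the sigmoid internally. The NumPy sketch below shows one common numerically stable way to compute binary cross-entropy from logits and checks it against the naive formula. The labels and logits are made up, and this is an illustration rather than the Keras implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed from raw logits in a stable form."""
    y_true = np.asarray(y_true, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    per_example = (np.maximum(logits, 0.0) - logits * y_true
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()

y_true = np.array([0.0, 1.0, 1.0])   # made-up labels
logits = np.array([-1.2, 0.3, 4.0])  # made-up model outputs

stable = bce_from_logits(y_true, logits)

# Naive version: apply the sigmoid first, then take logs
p = sigmoid(logits)
naive = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean()

print(stable, naive)
```

The stable form avoids taking the log of a probability that has rounded to 0 or 1, which is the usual reason for passing logits to the loss instead of probabilities.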

5. Train and evaluate the model

train_data = packed_train_data.shuffle(500)
test_data = packed_test_data

model.fit(train_data, epochs=20)

Output:

Epoch 1/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4894 - accuracy: 0.7544
Epoch 2/20
126/126 [==============================] - 0s 984us/step - loss: 0.4147 - accuracy: 0.8230
Epoch 3/20
126/126 [==============================] - 0s 921us/step - loss: 0.4002 - accuracy: 0.8309
Epoch 4/20
126/126 [==============================] - 0s 849us/step - loss: 0.3894 - accuracy: 0.8325
Epoch 5/20
126/126 [==============================] - 0s 841us/step - loss: 0.3812 - accuracy: 0.8485
Epoch 6/20
126/126 [==============================] - 0s 833us/step - loss: 0.3729 - accuracy: 0.8341
Epoch 7/20
126/126 [==============================] - 0s 849us/step - loss: 0.3716 - accuracy: 0.8421
Epoch 8/20
126/126 [==============================] - 0s 984us/step - loss: 0.3647 - accuracy: 0.8453
Epoch 9/20
126/126 [==============================] - 0s 770us/step - loss: 0.3472 - accuracy: 0.8501
Epoch 10/20
126/126 [==============================] - 0s 794us/step - loss: 0.3470 - accuracy: 0.8533
Epoch 11/20
126/126 [==============================] - 0s 841us/step - loss: 0.3449 - accuracy: 0.8421
Epoch 12/20
126/126 [==============================] - 0s 873us/step - loss: 0.3360 - accuracy: 0.8485
Epoch 13/20
126/126 [==============================] - 0s 857us/step - loss: 0.3313 - accuracy: 0.8565
Epoch 14/20
126/126 [==============================] - 0s 857us/step - loss: 0.3293 - accuracy: 0.8533
Epoch 15/20
126/126 [==============================] - 0s 873us/step - loss: 0.3236 - accuracy: 0.8644
Epoch 16/20
126/126 [==============================] - 0s 897us/step - loss: 0.3336 - accuracy: 0.8581
Epoch 17/20
126/126 [==============================] - 0s 770us/step - loss: 0.3185 - accuracy: 0.8565
Epoch 18/20
126/126 [==============================] - 0s 778us/step - loss: 0.3118 - accuracy: 0.8596
Epoch 19/20
126/126 [==============================] - 0s 794us/step - loss: 0.3130 - accuracy: 0.8581
Epoch 20/20
126/126 [==============================] - 0s 929us/step - loss: 0.3099 - accuracy: 0.8644

test_loss, test_accuracy = model.evaluate(test_data)

print("\nTest Loss: {}, Test Accuracy: {}".format(test_loss, test_accuracy))

Output:

53/53 [==============================] - 0s 906us/step - loss: 0.4588 - accuracy: 0.8561

Test Loss: 0.45877906680107117, Test Accuracy: 0.8560606241226196

predictions = model.predict(test_data)

# Caveat: make_csv_dataset shuffles by default, so the labels read in this
# second pass over test_data may not line up with the predictions above.
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
    prediction = tf.sigmoid(prediction).numpy()  # logit -> probability
    print("Predicted survival: {:.2%}".format(prediction[0]),
          "| Actual outcome: ",
          ("survived" if bool(survived) else "died"))

Output:

Predicted survival: 39.23% | Actual outcome: survived
Predicted survival: 99.98% | Actual outcome: survived
Predicted survival: 86.77% | Actual outcome: died
Predicted survival: 79.44% | Actual outcome: survived
Predicted survival: 30.34% | Actual outcome: died
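
The percentages above come from passing each raw logit through a sigmoid. A standard-library sketch of that conversion follows; the logits are made up, and the 0.5 decision threshold is a common convention assumed here, not something the original code applies.

```python
import math

def logit_to_probability(logit):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

for logit in [-2.0, 0.0, 3.0]:  # made-up logits for illustration
    p = logit_to_probability(logit)
    outcome = "survived" if p >= 0.5 else "died"
    print("logit {:+.1f} -> {:.2%} -> predicted {}".format(logit, p, outcome))
```

A logit of 0 corresponds to exactly 50%, so the sign of the logit alone decides the predicted class under this threshold.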
