Data Preprocessing Methods in Data Mining

Author: 狮子座刘娜_676 | Source: Internet | 2023-08-28 17:04


In the previous article, we discussed data exploration, the first step of our detailed journey into data mining. We covered data exploration, the statistical description of data, the concept of data visualization, and various data visualization techniques.

In this article we will discuss:

Need for Data Preprocessing
Data Cleaning Process
Data Integration Process
Data Reduction Process
Data Transformation Process

1) Need for Data Preprocessing

Data preprocessing refers to the set of techniques applied to databases to handle noisy, missing, and inconsistent data. The data preprocessing techniques used in data mining are data cleaning, data integration, data reduction, and data transformation.

The need for data preprocessing arises because real-world data, including much of the data stored in databases, is often incomplete and inconsistent, which can lead to incorrect and inaccurate data mining results. To improve the quality of the data on which observation and analysis are performed, the data is treated with these four preprocessing steps. The better the quality of the data, the more accurate the observations and predictions.

Fig 1: Steps of Data Preprocessing

2) Data Cleaning Process

Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process aims to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Let us discuss the basic methods of data cleaning.

2.1. Missing Values

Suppose you are working with sales and customer data and you observe that several attributes have missing values. One cannot reliably compute on data with missing values, so some method is needed to deal with them. Let us go through the options one by one.

2.1.1. Ignore the tuple:

This method can be used when a tuple has no class label specified. It is not effective when the percentage of missing values per attribute varies considerably.

2.1.2. Enter the missing value manually or fill it with a global constant:

Filling in values manually is time-consuming and is not feasible when the database contains many missing values. An alternative is to fill every missing value with the same global constant.

2.1.3. Fill the missing value with the attribute mean or the most probable value:

Filling the missing value with the attribute mean is another option. The most probable value can be determined using regression, a Bayesian formulation, or a decision tree.
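
The three simplest strategies above (ignoring the tuple, filling with a global constant, and filling with the attribute mean) can be sketched in a few lines of Python. This is only an illustration: the records, the field names `income` and `label`, and the constant `"Unknown"` are all made up for the example and do not come from a real data set.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer": "A", "income": 52000, "label": "yes"},
    {"customer": "B", "income": None, "label": "no"},
    {"customer": "C", "income": 48000, "label": None},
    {"customer": "D", "income": 62000, "label": "yes"},
]

def ignore_tuples(rows, key):
    """Drop every row whose value for `key` is missing."""
    return [row for row in rows if row[key] is not None]

def fill_constant(rows, key, constant):
    """Return copies of the rows with missing `key` values set to a constant."""
    return [{**row, key: constant if row[key] is None else row[key]} for row in rows]

def fill_mean(rows, key):
    """Return copies of the rows with missing `key` values set to the observed mean."""
    observed = [row[key] for row in rows if row[key] is not None]
    return fill_constant(rows, key, mean(observed))

labeled = ignore_tuples(records, "label")             # 2.1.1: ignore the tuple
flagged = fill_constant(records, "income", "Unknown")  # 2.1.2: global constant
imputed = fill_mean(records, "income")                 # 2.1.3: attribute mean
```

Each helper returns new rows instead of modifying `records`, so the strategies can be compared side by side on the same input.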

2.2. Noisy Data

Noise refers to any error in a measured variable. Given a numerical attribute, you can smooth the data to reduce the noise. Some data smoothing techniques are as follows.

2.2.1. Binning:

Smoothing by bin means: Each value in a bin is replaced by the mean value of the bin.
Smoothing by bin medians: Each value in a bin is replaced by the median value of the bin.
Smoothing by bin boundaries: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by the closest boundary value.

Let us understand this with an example.

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partitioned into equal-frequency bins of four values each:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Smoothing by bin medians (taking the upper of the two middle values):
- Bin 1: 9, 9, 9, 9
- Bin 2: 24, 24, 24, 24
- Bin 3: 29, 29, 29, 29
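
The binning example can be reproduced with a short Python sketch. It assumes equal-frequency bins of four values, rounds bin means to the nearest integer, uses the upper median (`statistics.median_high`) for even-sized bins, and sends ties in boundary smoothing to the lower boundary. Those choices are assumptions made to match the figures in the example, not the only reasonable ones.

```python
from statistics import mean, median_high

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(sorted_values, size):
    """Split sorted values into consecutive bins of `size` values each."""
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]

def smooth_by_means(bins):
    """Replace every value with the (rounded) mean of its bin."""
    return [[round(mean(b))] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the upper median of its bin."""
    return [[median_high(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = equal_frequency_bins(prices, 4)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```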

2.2.2. Regression:

Regression smooths data by fitting it to a function that is then used to predict values. Linear regression fits a straight line that predicts the value of y for a given value of x, whereas multiple linear regression predicts the value of a variable from the given values of two or more other variables.
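
As a rough illustration of smoothing by linear regression, the sketch below fits a straight line with ordinary least squares and replaces each noisy y value with the value the line predicts. The sample points are invented, and least squares is one common way to fit the line; the article itself does not specify a fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothed data: each y is replaced by the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
```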

3) Data Integration Process

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

3.1. Approaches

There are two major approaches to data integration: the "tight coupling" approach and the "loose coupling" approach.

Tight Coupling:

Here, a data warehouse is treated as an information retrieval component.

In this coupling, data from different sources is combined into one physical location through the ETL process: Extraction, Transformation, and Loading.

Loose Coupling:

Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends it directly to the source databases to obtain the result. The data remains only in the actual source databases.

3.2. Issues in Data Integration

There are several issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.

3.2.1. Schema Integration:

Schema integration combines metadata from different sources.

Matching real-world entities from multiple sources is referred to as the entity identification problem.

For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?

3.2.2. Redundancy:

An attribute may be redundant if it can be derived from another attribute or set of attributes.

Inconsistencies in attribute naming can also cause redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis.
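
One way to make the correlation analysis concrete is the Pearson correlation coefficient between two numeric attributes. The sketch below is only illustrative: the attribute values and the 0.9 cut-off used to flag a pair as possibly redundant are assumptions for this example, not a fixed rule.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical attributes: price_with_tax is derived from price (10% tax),
# while discount_code is unrelated to it.
price = [10, 20, 30, 40]
price_with_tax = [11, 22, 33, 44]
discount_code = [5, 3, 8, 1]

THRESHOLD = 0.9  # assumed cut-off for "strongly correlated"

r_derived = pearson(price, price_with_tax)
r_unrelated = pearson(price, discount_code)

derived_is_redundant = abs(r_derived) >= THRESHOLD
unrelated_is_redundant = abs(r_unrelated) >= THRESHOLD
```

A high correlation only suggests that one attribute may be derivable from the other; whether to drop it is still a judgment call for the analyst.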

3.2.3. Detection and resolution of data value conflicts:

This is the third important issue in data integration. For the same real-world entity, attribute values from different sources may differ. An attribute in one system may also be recorded at a lower level of abstraction than the "same" attribute in another.

4) Data Reduction Process

Data warehouses usually store large amounts of data, so data mining operations can take a long time to process it. Data reduction techniques reduce the size of the data set without substantially affecting the result. The following methods are commonly used for data reduction.

Data cube aggregation

Refers to a method where aggregation operations are applied to the data to construct a data cube, which helps in analyzing business trends and performance.

Attribute subset selection

Refers to a method where redundant or irrelevant attributes or dimensions are identified and removed.

Dimensionality reduction

Refers to a method where encoding techniques are used to reduce the size of the data set.

Numerosity reduction

Refers to a method where the data is replaced by a smaller data representation.

Discretization and concept hierarchy generation

Refers to methods where raw data values for attributes are replaced by higher-level conceptual values. Data discretization is a form of numerosity reduction that is useful for the automatic generation of concept hierarchies.
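
To illustrate the first method in the list, data cube aggregation, the sketch below rolls hypothetical quarterly sales up to annual totals per branch. The figures, the branch name, and the tuple layout `(year, quarter, branch, amount)` are invented for the example.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, branch, amount).
sales = [
    (2022, "Q1", "North", 224), (2022, "Q2", "North", 408),
    (2022, "Q3", "North", 350), (2022, "Q4", "North", 586),
    (2023, "Q1", "North", 300), (2023, "Q2", "North", 416),
    (2023, "Q3", "North", 380), (2023, "Q4", "North", 600),
]

def aggregate(rows, key_fn):
    """Sum the amount of every row that shares the same aggregation key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row[3]
    return dict(totals)

# Roll up from the quarter level to the year level for each branch.
annual = aggregate(sales, lambda row: (row[0], row[2]))
```

Eight quarterly rows are reduced to two annual rows, which is enough for an analysis that only needs yearly trends.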

5) Data Transformation Process

In the data transformation process, data is transformed from one format into another that is more appropriate for data processing.

Some data transformation strategies:

Smoothing

Smoothing is the process of removing noise from the data.

Aggregation

Aggregation is a process where summary or aggregation operations are applied to the data.

Generalization

In generalization, low-level data is replaced with higher-level data by climbing concept hierarchies.

Normalization

Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0.

Attribute construction

In attribute construction, new attributes are constructed from the given set of attributes.
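
Normalization to a range such as 0.0 to 1.0 is often done with min-max scaling, which the sketch below shows. Min-max scaling is one common choice (the article does not name a specific method), and the income values are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]

# Hypothetical income values.
incomes = [12000, 35000, 73600, 98000]
normalized = min_max_normalize(incomes)
```

The smallest input maps to `new_min` and the largest to `new_max`, so the result is sensitive to outliers at either end of the range.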

Translated from: https://www.includehelp.com/basics/data-preprocessing-in-data-mining.aspx
