Hi everyone, I'm Mac Jiang. Today I'd like to share the first part, Anomaly Detection, of Coursera - Stanford University - Machine Learning - Programming Exercise 8: Anomaly Detection and Recommender System. The code for the second part, Recommender System, will be given in an upcoming post. My code passed the grader's tests, but it is not necessarily the best; if you have better ideas, please leave a comment. I hope this post can be of some help to your studies!
This part covers Anomaly Detection, i.e. detecting anomalous samples. Anomaly detection finds the samples that deviate markedly from the rest: in manufacturing it can help identify defective products, in daily life it can help detect whether a bank card is being used fraudulently, and it has wide applications in machine learning. The usual algorithm is built on the Gaussian distribution: fit a Gaussian model (possibly multi-dimensional) to the existing samples and set a threshold epsilon; a sample is considered anomalous if its density p(x) is smaller than epsilon.
Because normal samples form the vast majority and anomalous samples are very rare, the classes are skewed. For skewed classes we evaluate performance with the F1 score, so the F1 score is what decides which epsilon to pick.
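As a quick illustration of these metrics (toy counts, not from the exercise data), precision, recall and the F1 score would be computed like this:
% Toy example with made-up counts, only to illustrate the metric:
tp = 8;  fp = 2;  fn = 4;             % true positives, false positives, false negatives
prec = tp / (tp + fp)                 % precision = 0.8000
rec  = tp / (tp + fn)                 % recall    = 0.6667
F1   = 2 * prec * rec / (prec + rec)  % F1        = 0.7273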
This exercise performs anomaly detection on a two-dimensional dataset. The procedure is:
1. Load the two-dimensional data and fit a Gaussian distribution to the samples.
2. Since the best value of the threshold epsilon is unknown, try a range of epsilon values, compute the F1 score for each, and take the epsilon with the largest F1 score as the best threshold.
3. With this epsilon, mark the samples whose p(x) is smaller than epsilon; these are the anomalies.
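For concreteness, here is a minimal sketch (not part of the assignment files) of the density and threshold rule described above, assuming x is a single example stored as an n x 1 vector, mu and sigma2 are the n x 1 vectors of per-feature means and variances, and epsilon has already been chosen:
% p(x) = prod_j 1/sqrt(2*pi*sigma2(j)) * exp(-(x(j)-mu(j))^2 / (2*sigma2(j)))
px = prod( (1 ./ sqrt(2 * pi * sigma2)) .* exp( -(x - mu).^2 ./ (2 * sigma2) ) );
isAnomaly = (px < epsilon);   % flag the example as an outlier if its density is below the threshold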
1. Datasets and files
Datasets: ex8data1.mat --- the two-dimensional dataset described above; being two-dimensional, it can be visualized
ex8data2.mat --- a dataset for anomaly detection with more features; it is actually 11-dimensional and cannot be visualized
Files: ex8.m --- driver script that controls the anomaly detection process
multivariateGaussian.m --- builds the multivariate Gaussian density from the given means mu and variances sigma2 (a sketch of this computation follows this list)
estimateGaussian.m --- for the input samples X, computes the mean mu and variance sigma2 of each feature and stores them in vectors; code to be completed!
selectThreshold.m --- computes the F1 score for a range of epsilon values and takes the epsilon with the largest F1 score as the best one; code to be completed!
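For reference, the computation multivariateGaussian.m performs is roughly the following (a paraphrased sketch, not the course file itself): when sigma2 is passed as a vector it is used as the diagonal of the covariance matrix, which is equivalent to treating the features as independent Gaussians.
function p = multivariateGaussianSketch(X, mu, Sigma2)
% Sketch only: density of each row of X under a Gaussian with mean mu and
% covariance Sigma2 (a vector Sigma2 is interpreted as a diagonal covariance).
k = length(mu);
if isvector(Sigma2)
    Sigma2 = diag(Sigma2);              % per-feature variances -> diagonal covariance
end
Xc = bsxfun(@minus, X, mu(:)');         % center each example
p = (2 * pi)^(-k / 2) * det(Sigma2)^(-0.5) * ...
    exp(-0.5 * sum((Xc / Sigma2) .* Xc, 2));
end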
2. ex8.m walkthrough
%% Initialization
clear ; close all; clc
%% ================== Part 1: Load Example Dataset ===================
% We start this exercise by using a small dataset that is easy to
% visualize.
%
% Our example case consists of 2 network server statistics across
% several machines: the latency and throughput of each machine.
% This exercise will help us find possibly faulty (or very fast) machines.
%
fprintf('Visualizing example dataset for outlier detection.\n\n');
% The following command loads the dataset. You should now have the
% variables X, Xval, yval in your environment
load('ex8data1.mat');

% Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bx');
axis([0 30 0 30]);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');
fprintf('Program paused. Press enter to continue.\n');
pause

%% ================== Part 2: Estimate the dataset statistics ===================
% For this exercise, we assume a Gaussian distribution for the dataset.
%
% We first estimate the parameters of our assumed Gaussian distribution,
% then compute the probabilities for each of the points and then visualize
% both the overall distribution and where each of the points falls in
% terms of that distribution.
%
fprintf('Visualizing Gaussian fit.\n\n');
% Estimate mu and sigma2
[mu sigma2] = estimateGaussian(X);
% Returns the density of the multivariate normal at each data point (row)
% of X
p = multivariateGaussian(X, mu, sigma2);
% Visualize the fit
visualizeFit(X, mu, sigma2);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');
fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 3: Find Outliers ===================
% Now you will find a good epsilon threshold using a cross-validation set
% probabilities given the estimated Gaussian distribution
%
pval = multivariateGaussian(Xval, mu, sigma2);
[epsilon F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set: %f\n', F1);
fprintf(' (you should see a value epsilon of about 8.99e-05)\n\n');
% Find the outliers in the training set and plot the
outliers = find(p < epsilon);
hold on
plot(X(outliers, 1), X(outliers, 2), 'ro', 'LineWidth', 2, 'MarkerSize', 10);
hold off

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 4: Multidimensional Outliers ===================
% We will now use the code from the previous part and apply it to a
% harder problem in which more features describe each datapoint and only
% some features indicate whether a point is an outlier.
%
% Loads the second dataset. You should now have the
% variables X, Xval, yval in your environment
load('ex8data2.mat');
% Apply the same steps to the larger dataset
[mu sigma2] = estimateGaussian(X);
% Training set
p = multivariateGaussian(X, mu, sigma2);
% Cross-validation set
pval = multivariateGaussian(Xval, mu, sigma2);
% Find the best threshold
[epsilon F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set: %f\n', F1);
fprintf('# Outliers found: %d\n', sum(p < epsilon));
pause
Part 1: Load Example Dataset --- load the data and visualize it
Part 2: Estimate the dataset statistics --- call estimateGaussian.m to compute the mean mu(i) and variance sigma2(i) of each feature, then use them to build the multivariate Gaussian model and compute p(x) for every sample
Part 3: Find Outliers --- call selectThreshold.m on the cross-validation set to pick the best epsilon, then use the fitted Gaussian and this epsilon to flag the anomalous points
Part 4: Multidimensional Outliers --- run the same steps on the data in ex8data2.mat, which is 11-dimensional, to find its anomalies. Note: 11-dimensional data cannot be visualized.
3. estimateGaussian.m implementation
function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a
%Gaussian distribution using the data in X
% [mu sigma2] = estimateGaussian(X),
% The input X is the dataset with each n-dimensional data point in one row
% The output is an n-dimensional vector mu, the mean of the data set
% and the variances sigma^2, an n x 1 vector
%
% Useful variables
[m, n] = size(X);
% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
% In particular, mu(i) should contain the mean of
% the data for the i-th feature and sigma2(i)
% should contain variance of the i-th feature.
%
% A for loop over the features would also work here, but the vectorized version is faster
mu = sum(X)' / m;            % sum(X) sums each column of X, i.e. over the m examples
temp = X' - repmat(mu,1,m);  % repmat(mu,1,m) tiles the n x 1 vector mu into an n x m matrix
                             % alternatively: temp = bsxfun(@minus, X', mu)
sigma2 = sum(temp.^2,2) / m; % sum(temp.^2, 2) sums each row, i.e. over the m examples
end
Note: some readers may write this with a for loop, but the vectorized version is generally much faster than a loop, so prefer the vectorized approach whenever possible.
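A quick way to check the vectorized implementation (assuming X has already been loaded, e.g. from ex8data1.mat) is to compare it against the built-in mean and the population variance var(X, 1):
[mu, sigma2] = estimateGaussian(X);
max(abs(mu'     - mean(X)))     % should be close to 0
max(abs(sigma2' - var(X, 1)))   % var(X, 1) uses the 1/m normalization, so this should also be ~0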
4. selectThreshold.m implementation
function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
% [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
% threshold to use for selecting outliers based on the results from a
% validation set (pval) and the ground truth (yval).
%
bestEpsilon = 0;
bestF1 = 0;
F1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions
% pval holds p(x) computed on the cross-validation set with the fitted model;
% if p(x) < epsilon we predict the example to be an anomaly
% tp: true positive  - examples that are actually 1 and predicted as 1
% fp: false positive - examples that are actually 0 but predicted as 1
% fn: false negative - examples that are actually 1 but predicted as 0
% prec = tp/(tp+fp) is the precision; rec = tp/(tp+fn) is the recall;
% F1 score = 2*prec*rec/(prec+rec)
cvPrediction = (pval < epsilon);
tp = sum((cvPrediction == 1) & (yval == 1));
fp = sum((cvPrediction == 1) & (yval == 0));
fn = sum((cvPrediction == 0) & (yval == 1));
prec = tp / (tp + fp);
rec = tp / (tp + fn);
F1 = 2 * prec * rec / (prec + rec);
% =============================================================
if F1 > bestF1
    bestF1 = F1;
    bestEpsilon = epsilon;
end
end
end
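A small usage example (made-up values, not the course data) shows how selectThreshold behaves: yval marks the true anomalies and pval holds the densities p(x) computed on the cross-validation set.
yval = [0; 0; 1; 0; 1];
pval = [0.30; 0.25; 0.001; 0.28; 0.002];
[epsilon, F1] = selectThreshold(yval, pval)
% The two low-density examples are exactly the labelled anomalies,
% so the best threshold separates them perfectly and F1 should come out as 1.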
From:http://blog.csdn.net/a1015553840/article/details/50913824