
A Programmer's Guide to Data Mining — Reading Notes 6


Preface

The source code and data for this book are available at: http://guidetodatamining.com/

The theory in this book is fairly simple, it contains few errors, and it offers plenty of hands-on exercises; if you write out every piece of code yourself, you will learn a great deal. In short: a good introduction.

Feel free to repost; please credit the source, and corrections are welcome.

Collected notes: https://www.zybuluo.com/hainingwyx/note/559139

Naive Bayes and Text

Training phase:

1. Combine the documents labeled with the same hypothesis into a single text file.
2. Count the total number of word occurrences n in that file, and build the vocabulary from it.
3. For each word w_k in the vocabulary, record its number of occurrences in the file as n_k.
4. For each word w_k in the vocabulary (with stopwords removed), compute the smoothed conditional probability

   P(w_k | h) = (n_k + 1) / (n + |Vocabulary|)
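The add-one (Laplace) smoothing used in training, (n_k + 1) / (n + |Vocabulary|), can be sketched on toy counts (all numbers below are made up, not from the book's data):

```python
def smoothed_prob(word_count, total_words, vocab_size):
    """Add-one smoothed probability of a word given a class:
    (n_k + 1) / (n + |Vocabulary|)."""
    return (word_count + 1) / float(total_words + vocab_size)

# toy counts: a word seen 5 times among 100 words, vocabulary of 50 words
p_seen = smoothed_prob(5, 100, 50)     # (5 + 1) / 150 = 0.04
p_unseen = smoothed_prob(0, 100, 50)   # (0 + 1) / 150 -- never exactly zero
```

The point of the +1 is visible in `p_unseen`: a word that never occurred in a category still gets a small nonzero probability, so one missing word cannot zero out the whole product.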

```python
import codecs
import os

class BayesText:

    def __init__(self, trainingdir, stopwordlist):
        """This class implements a naive Bayes approach to text
        classification.
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) that will be
        removed before any counting takes place.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}          # stopword dictionary
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        # filter out files that are not directories
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            print('    ' + category)
            # word counts and word total for the current category
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category)
        # I am going to eliminate any word in the shared vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        print("Computing probabilities:")
        for category in self.categories:
            print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                # conditional probability with add-one smoothing
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        print("DONE TRAINING\n\n")

    # input:  trainingdir -- the training-data directory
    #         category    -- the category to train on
    # return: (counts, total) -- per-word counts for this category,
    #         and the total number of words seen in it
    def train(self, trainingdir, category):
        """counts word occurrences for a particular category"""
        currentdir = trainingdir + category
        files = os.listdir(currentdir)
        counts = {}
        total = 0
        for file in files:
            #print(currentdir + '/' + file)
            f = codecs.open(currentdir + '/' + file, 'r', 'iso8859-1')
            for line in f:
                tokens = line.split()
                for token in tokens:
                    # get rid of punctuation and lowercase token
                    token = token.strip('\'".,?:-')
                    token = token.lower()
                    if token != '' and not token in self.stopwords:
                        self.vocabulary.setdefault(token, 0)
                        self.vocabulary[token] += 1   # counts over all documents
                        counts.setdefault(token, 0)
                        counts[token] += 1            # counts for this category
                        total += 1                    # total words in this category
            f.close()
        return (counts, total)

# test code -- assumes trainingDir and stoplistfile point at the data
bT = BayesText(trainingDir, stoplistfile)
bT.prob['rec.motorcycles']["god"]
```

Classification phase:

Classify a document by choosing the hypothesis h that maximizes P(h) multiplied by the product of P(w_k | h) over the words w_k in the document.

If the probabilities are extremely small, Python cannot represent their product; work with sums of logarithms instead.

Stopwords: when the stopwords really are noise, removing them reduces the amount of processing and improves performance. In some cases the stopword list deserves a second look, though: sex offenders, for example, use words such as "me" and "you" far more often than the general population.
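The underflow concern can be demonstrated in a few lines (the tiny probabilities below are invented for illustration):

```python
import math

# multiplying many tiny conditional probabilities underflows to 0.0
probs = [1e-80] * 5
product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 -- the true value 1e-400 is below float range

# summing logarithms preserves the ranking without underflow
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -921.03, a usable finite score
```

This is exactly why the ten-fold version of classify below accumulates `math.log(self.prob[category][token])` instead of multiplying raw probabilities.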

The classify shown here is the structured-data version from earlier in the book: categorical attributes are scored with stored conditional probabilities, and numeric attributes with a Gaussian density. (The text version of classify appears in the ten-fold code below.)

```python
import math

# method of the structured-data naive Bayes classifier
def classify(self, itemVector, numVector):
    """Return the class we think itemVector is in"""
    results = []
    sqrt2pi = math.sqrt(2 * math.pi)
    for (category, prior) in self.prior.items():
        prob = prior
        col = 1
        for attrValue in itemVector:
            if not attrValue in self.conditional[category][col]:
                # we did not find any instances of this attribute value
                # occurring with this category so prob = 0
                prob = 0
            else:
                prob = prob * self.conditional[category][col][attrValue]
            col += 1
        col = 1
        for x in numVector:
            mean = self.means[category][col]
            ssd = self.ssd[category][col]
            ePart = math.pow(math.e, -(x - mean)**2 / (2 * ssd**2))
            prob = prob * ((1.0 / (sqrt2pi * ssd)) * ePart)
            col += 1
        results.append((prob, category))
    # return the category with the highest probability
    #print(results)
    return (max(results)[1])

# test code
bT.classify(testDir + 'rec.motorcycles/104673')
```
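The Gaussian term in the numeric loop can be pulled out and checked on its own (the mean and standard-deviation values below are invented):

```python
import math

def gaussian_density(x, mean, ssd):
    """Normal probability density of x, as computed inside the
    numeric-attribute loop of classify."""
    e_part = math.exp(-(x - mean) ** 2 / (2 * ssd ** 2))
    return (1.0 / (math.sqrt(2 * math.pi) * ssd)) * e_part

# the density peaks at the mean and falls off with distance from it
peak = gaussian_density(100, 100, 15)
tail = gaussian_density(160, 100, 15)
```

A value near the category's mean contributes a large factor to `prob`, while an outlying value contributes a small one, which is how numeric attributes influence the final ranking.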

10-fold cross-validation

```python
from __future__ import print_function
import os, codecs, math

class BayesText:

    # input: training-file directory, stopword list, and the number
    #        of the bucket (file subset) to ignore during training
    def __init__(self, trainingdir, stopwordlist, ignoreBucket):
        """This class implements a naive Bayes approach to text
        classification.
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) that will be
        removed before any counting takes place.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        # filter out files that are not directories; in this program,
        # the categories are neg and pos
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            #print('    ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category,
                                                 ignoreBucket)
        # I am going to eliminate any word in the vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        #print("Computing probabilities:")
        for category in self.categories:
            #print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        #print("DONE TRAINING\n\n")

    def train(self, trainingdir, category, bucketNumberToIgnore):
        """counts word occurrences for a particular category"""
        ignore = "%i" % bucketNumberToIgnore
        currentdir = trainingdir + category
        directories = os.listdir(currentdir)
        counts = {}
        total = 0
        for directory in directories:
            if directory != ignore:
                currentBucket = trainingdir + category + "/" + directory
                files = os.listdir(currentBucket)
                #print("   " + currentBucket)
                for file in files:
                    f = codecs.open(currentBucket + '/' + file, 'r', 'iso8859-1')
                    for line in f:
                        tokens = line.split()
                        for token in tokens:
                            # get rid of punctuation and lowercase token
                            token = token.strip('\'".,?:-')
                            token = token.lower()
                            if token != '' and not token in self.stopwords:
                                self.vocabulary.setdefault(token, 0)
                                self.vocabulary[token] += 1
                                counts.setdefault(token, 0)
                                counts[token] += 1
                                total += 1
                    f.close()
        return (counts, total)

    def classify(self, filename):
        results = {}
        for category in self.categories:
            results[category] = 0
        f = codecs.open(filename, 'r', 'iso8859-1')
        for line in f:
            tokens = line.split()
            for token in tokens:
                #print(token)
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        if self.prob[category][token] == 0:
                            print("%s %s" % (category, token))
                        results[category] += math.log(
                            self.prob[category][token])
        f.close()
        results = list(results.items())
        results.sort(key=lambda tuple: tuple[1], reverse=True)
        # for debugging I can change this to give me the entire list
        return results[0][0]

    # input:  test directory for one category, the category itself,
    #         and the number of the bucket to test on
    # return: classification tallies for this category,
    #         e.g. {'neg': 12, 'pos': 23}
    def testCategory(self, direc, category, bucketNumber):
        results = {}
        directory = direc + ("%i/" % bucketNumber)
        #print("Testing " + directory)
        files = os.listdir(directory)
        total = 0
        #correct = 0
        for file in files:
            total += 1
            result = self.classify(directory + file)
            results.setdefault(result, 0)
            results[result] += 1
            #if result == category:
            #    correct += 1
        return results

    # input:  test directory and the number of the bucket to test on
    # return: tallies for every category,
    #         e.g. {'neg': {'neg': 12, 'pos': 23}, ...}
    def test(self, testdir, bucketNumber):
        """Test all files in the test directory--that directory is
        organized into subdirectories--each subdir is a classification
        category"""
        results = {}
        categories = os.listdir(testdir)
        # filter out files that are not directories
        categories = [filename for filename in categories if
                      os.path.isdir(testdir + filename)]
        for category in categories:
            #print(".", end="")
            results[category] = self.testCategory(
                testdir + category + '/', category, bucketNumber)
        return results

def tenfold(dataPrefix, stoplist):
    results = {}
    for i in range(0, 10):
        bT = BayesText(dataPrefix, stoplist, i)
        r = bT.test(dataPrefix, i)
        for (key, value) in r.items():
            results.setdefault(key, {})
            for (ckey, cvalue) in value.items():
                results[key].setdefault(ckey, 0)
                results[key][ckey] += cvalue
    categories = list(results.keys())
    categories.sort()
    print("\n       Classified as: ")
    header =    "        "
    subheader = "      +"
    for category in categories:
        header += "% 2s   " % category
        subheader += "-----+"
    print(header)
    print(subheader)
    total = 0.0
    correct = 0.0
    for category in categories:
        row = " %s  |" % category
        for c2 in categories:
            if c2 in results[category]:
                count = results[category][c2]
            else:
                count = 0
            row += " %3i |" % count
            total += count
            if c2 == category:
                correct += count
        print(row)
    print(subheader)
    print("\n%5.3f percent correct" % ((correct * 100) / total))
    print("total of %i instances" % total)

# change these to match your directory structure
prefixPath = "data/review_polarity/"
theDir = prefixPath + "txt_sentoken/"
stoplistfile = prefixPath + "stopwords25.txt"

tenfold(theDir, stoplistfile)
```
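The accuracy line at the end of tenfold reduces to a small computation over the accumulated actual-to-predicted tallies. A sketch with hypothetical counts in the same nested-dict shape:

```python
# results[actual][predicted] = count, shaped like tenfold's accumulation
results = {"neg": {"neg": 418, "pos": 82},
           "pos": {"neg": 107, "pos": 393}}

# total instances, and those on the confusion matrix's diagonal
total = sum(sum(row.values()) for row in results.values())
correct = sum(results[c].get(c, 0) for c in results)
print("%5.3f percent correct" % (correct * 100.0 / total))  # 81.100 percent correct
```

The diagonal entries (actual equals predicted) are the correct classifications; everything off the diagonal is an error, which is what the printed confusion matrix lets you inspect per category.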


