
A Programmer's Guide to Data Mining — Reading Notes 6


Preface

The source code and data for this book are available at: http://guidetodatamining.com/

The theory in this book is fairly simple, it contains few errors, and it offers plenty of hands-on exercises; if you write out every piece of code yourself, you will learn a great deal. In short: a good introduction.

Feel free to repost; please credit the source, and corrections are welcome.

Collected notes: https://www.zybuluo.com/hainingwyx/note/559139

Naive Bayes and Text

Training phase:

1. Combine the documents labeled with the same hypothesis into a single text file.
2. Count the total number of word occurrences n in that file, and build the vocabulary from it.
3. For each word w_k in the vocabulary, record its number of occurrences in the file as n_k.
4. For each word w_k in the vocabulary (with stopwords removed), compute the smoothed conditional probability

   P(w_k | h) = (n_k + 1) / (n + |Vocabulary|)
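The add-one (Laplace) smoothing used in training, (n_k + 1) / (n + |Vocabulary|), can be sketched on toy counts (all numbers below are made up, not from the book's data):

```python
def smoothed_prob(word_count, total_words, vocab_size):
    """Add-one smoothed probability of a word given a class:
    (n_k + 1) / (n + |Vocabulary|)."""
    return (word_count + 1) / float(total_words + vocab_size)

# toy counts: a word seen 5 times among 100 words, vocabulary of 50 words
p_seen = smoothed_prob(5, 100, 50)     # (5 + 1) / 150 = 0.04
p_unseen = smoothed_prob(0, 100, 50)   # (0 + 1) / 150 -- never exactly zero
```

The point of the +1 is visible in `p_unseen`: a word that never occurred in a category still gets a small nonzero probability, so one missing word cannot zero out the whole product.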

```python
import codecs
import os

class BayesText:

    def __init__(self, trainingdir, stopwordlist):
        """This class implements a naive Bayes approach to text
        classification.
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) that will be
        removed before any counting takes place.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}          # stopword dictionary
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        # filter out files that are not directories
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            print('    ' + category)
            # word counts and word total for the current category
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category)
        # I am going to eliminate any word in the shared vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        print("Computing probabilities:")
        for category in self.categories:
            print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                # conditional probability with add-one smoothing
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        print("DONE TRAINING\n\n")

    # input:  trainingdir -- the training-data directory
    #         category    -- the category to train on
    # return: (counts, total) -- per-word counts for this category,
    #         and the total number of words seen in it
    def train(self, trainingdir, category):
        """counts word occurrences for a particular category"""
        currentdir = trainingdir + category
        files = os.listdir(currentdir)
        counts = {}
        total = 0
        for file in files:
            #print(currentdir + '/' + file)
            f = codecs.open(currentdir + '/' + file, 'r', 'iso8859-1')
            for line in f:
                tokens = line.split()
                for token in tokens:
                    # get rid of punctuation and lowercase token
                    token = token.strip('\'".,?:-')
                    token = token.lower()
                    if token != '' and not token in self.stopwords:
                        self.vocabulary.setdefault(token, 0)
                        self.vocabulary[token] += 1   # counts over all documents
                        counts.setdefault(token, 0)
                        counts[token] += 1            # counts for this category
                        total += 1                    # total words in this category
            f.close()
        return (counts, total)

# test code -- assumes trainingDir and stoplistfile point at the data
bT = BayesText(trainingDir, stoplistfile)
bT.prob['rec.motorcycles']["god"]
```

Classification phase:

Classify a document by choosing the hypothesis h that maximizes P(h) multiplied by the product of P(w_k | h) over the words w_k in the document.

If the probabilities are extremely small, Python cannot represent their product; work with sums of logarithms instead.

Stopwords: when the stopwords really are noise, removing them reduces the amount of processing and improves performance. In some cases the stopword list deserves a second look, though: sex offenders, for example, use words such as "me" and "you" far more often than the general population.
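The underflow concern can be demonstrated in a few lines (the tiny probabilities below are invented for illustration):

```python
import math

# multiplying many tiny conditional probabilities underflows to 0.0
probs = [1e-80] * 5
product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 -- the true value 1e-400 is below float range

# summing logarithms preserves the ranking without underflow
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -921.03, a usable finite score
```

This is exactly why the ten-fold version of classify below accumulates `math.log(self.prob[category][token])` instead of multiplying raw probabilities.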

The classify shown here is the structured-data version from earlier in the book: categorical attributes are scored with stored conditional probabilities, and numeric attributes with a Gaussian density. (The text version of classify appears in the ten-fold code below.)

```python
import math

# method of the structured-data naive Bayes classifier
def classify(self, itemVector, numVector):
    """Return the class we think itemVector is in"""
    results = []
    sqrt2pi = math.sqrt(2 * math.pi)
    for (category, prior) in self.prior.items():
        prob = prior
        col = 1
        for attrValue in itemVector:
            if not attrValue in self.conditional[category][col]:
                # we did not find any instances of this attribute value
                # occurring with this category so prob = 0
                prob = 0
            else:
                prob = prob * self.conditional[category][col][attrValue]
            col += 1
        col = 1
        for x in numVector:
            mean = self.means[category][col]
            ssd = self.ssd[category][col]
            ePart = math.pow(math.e, -(x - mean)**2 / (2 * ssd**2))
            prob = prob * ((1.0 / (sqrt2pi * ssd)) * ePart)
            col += 1
        results.append((prob, category))
    # return the category with the highest probability
    #print(results)
    return (max(results)[1])

# test code
bT.classify(testDir + 'rec.motorcycles/104673')
```
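The Gaussian term in the numeric loop can be pulled out and checked on its own (the mean and standard-deviation values below are invented):

```python
import math

def gaussian_density(x, mean, ssd):
    """Normal probability density of x, as computed inside the
    numeric-attribute loop of classify."""
    e_part = math.exp(-(x - mean) ** 2 / (2 * ssd ** 2))
    return (1.0 / (math.sqrt(2 * math.pi) * ssd)) * e_part

# the density peaks at the mean and falls off with distance from it
peak = gaussian_density(100, 100, 15)
tail = gaussian_density(160, 100, 15)
```

A value near the category's mean contributes a large factor to `prob`, while an outlying value contributes a small one, which is how numeric attributes influence the final ranking.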

10-fold cross-validation

```python
from __future__ import print_function
import os, codecs, math

class BayesText:

    # input: training-file directory, stopword list, and the number
    #        of the bucket (file subset) to ignore during training
    def __init__(self, trainingdir, stopwordlist, ignoreBucket):
        """This class implements a naive Bayes approach to text
        classification.
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) that will be
        removed before any counting takes place.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        # filter out files that are not directories; in this program,
        # the categories are neg and pos
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            #print('    ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category,
                                                 ignoreBucket)
        # I am going to eliminate any word in the vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        #print("Computing probabilities:")
        for category in self.categories:
            #print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        #print("DONE TRAINING\n\n")

    def train(self, trainingdir, category, bucketNumberToIgnore):
        """counts word occurrences for a particular category"""
        ignore = "%i" % bucketNumberToIgnore
        currentdir = trainingdir + category
        directories = os.listdir(currentdir)
        counts = {}
        total = 0
        for directory in directories:
            if directory != ignore:
                currentBucket = trainingdir + category + "/" + directory
                files = os.listdir(currentBucket)
                #print("   " + currentBucket)
                for file in files:
                    f = codecs.open(currentBucket + '/' + file, 'r', 'iso8859-1')
                    for line in f:
                        tokens = line.split()
                        for token in tokens:
                            # get rid of punctuation and lowercase token
                            token = token.strip('\'".,?:-')
                            token = token.lower()
                            if token != '' and not token in self.stopwords:
                                self.vocabulary.setdefault(token, 0)
                                self.vocabulary[token] += 1
                                counts.setdefault(token, 0)
                                counts[token] += 1
                                total += 1
                    f.close()
        return (counts, total)

    def classify(self, filename):
        results = {}
        for category in self.categories:
            results[category] = 0
        f = codecs.open(filename, 'r', 'iso8859-1')
        for line in f:
            tokens = line.split()
            for token in tokens:
                #print(token)
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        if self.prob[category][token] == 0:
                            print("%s %s" % (category, token))
                        results[category] += math.log(
                            self.prob[category][token])
        f.close()
        results = list(results.items())
        results.sort(key=lambda tuple: tuple[1], reverse=True)
        # for debugging I can change this to give me the entire list
        return results[0][0]

    # input:  test directory for one category, the category itself,
    #         and the number of the bucket to test on
    # return: classification tallies for this category,
    #         e.g. {'neg': 12, 'pos': 23}
    def testCategory(self, direc, category, bucketNumber):
        results = {}
        directory = direc + ("%i/" % bucketNumber)
        #print("Testing " + directory)
        files = os.listdir(directory)
        total = 0
        #correct = 0
        for file in files:
            total += 1
            result = self.classify(directory + file)
            results.setdefault(result, 0)
            results[result] += 1
            #if result == category:
            #    correct += 1
        return results

    # input:  test directory and the number of the bucket to test on
    # return: tallies for every category,
    #         e.g. {'neg': {'neg': 12, 'pos': 23}, ...}
    def test(self, testdir, bucketNumber):
        """Test all files in the test directory--that directory is
        organized into subdirectories--each subdir is a classification
        category"""
        results = {}
        categories = os.listdir(testdir)
        # filter out files that are not directories
        categories = [filename for filename in categories if
                      os.path.isdir(testdir + filename)]
        for category in categories:
            #print(".", end="")
            results[category] = self.testCategory(
                testdir + category + '/', category, bucketNumber)
        return results

def tenfold(dataPrefix, stoplist):
    results = {}
    for i in range(0, 10):
        bT = BayesText(dataPrefix, stoplist, i)
        r = bT.test(dataPrefix, i)
        for (key, value) in r.items():
            results.setdefault(key, {})
            for (ckey, cvalue) in value.items():
                results[key].setdefault(ckey, 0)
                results[key][ckey] += cvalue
    categories = list(results.keys())
    categories.sort()
    print("\n       Classified as: ")
    header =    "        "
    subheader = "      +"
    for category in categories:
        header += "% 2s   " % category
        subheader += "-----+"
    print(header)
    print(subheader)
    total = 0.0
    correct = 0.0
    for category in categories:
        row = " %s  |" % category
        for c2 in categories:
            if c2 in results[category]:
                count = results[category][c2]
            else:
                count = 0
            row += " %3i |" % count
            total += count
            if c2 == category:
                correct += count
        print(row)
    print(subheader)
    print("\n%5.3f percent correct" % ((correct * 100) / total))
    print("total of %i instances" % total)

# change these to match your directory structure
prefixPath = "data/review_polarity/"
theDir = prefixPath + "txt_sentoken/"
stoplistfile = prefixPath + "stopwords25.txt"

tenfold(theDir, stoplistfile)
```
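The accuracy line at the end of tenfold reduces to a small computation over the accumulated actual-to-predicted tallies. A sketch with hypothetical counts in the same nested-dict shape:

```python
# results[actual][predicted] = count, shaped like tenfold's accumulation
results = {"neg": {"neg": 418, "pos": 82},
           "pos": {"neg": 107, "pos": 393}}

# total instances, and those on the confusion matrix's diagonal
total = sum(sum(row.values()) for row in results.values())
correct = sum(results[c].get(c, 0) for c in results)
print("%5.3f percent correct" % (correct * 100.0 / total))  # 81.100 percent correct
```

The diagonal entries (actual equals predicted) are the correct classifications; everything off the diagonal is an error, which is what the printed confusion matrix lets you inspect per category.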


