热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

K-fold交叉验证实现python。-K-foldcrossvalidationimplementationpython

Iamtryingtoimplementthek-foldcross-validationalgorithminpython.IknowSKLearnprovidesan

I am trying to implement the k-fold cross-validation algorithm in python. I know SKLearn provides an implementation but still... This is my code as of right now.

我正在尝试在python中实现k-fold交叉验证算法。我知道SKLearn提供了一个实现,但是…这是我现在的代码。

from sklearn import metrics
import numpy as np

class Cross_Validation:

@staticmethod
def partition(vector, fold, k):
    size = vector.shape[0]
    start = (size/k)*fold
    end = (size/k)*(fold+1)
    validation = vector[start:end]
    if str(type(vector)) == "":
        indices = range(start, end)
        mask = np.ones(vector.shape[0], dtype=bool)
        mask[indices] = False
        training = vector[mask]
    elif str(type(vector)) == "":
        training = np.concatenate((vector[:start], vector[end:]))
    return training, validation

@staticmethod
def Cross_Validation(learner, k, examples, labels):
    train_folds_score = []
    validation_folds_score = []
    for fold in range(0, k):
        training_set, validation_set = Cross_Validation.partition(examples, fold, k)
        training_labels, validation_labels = Cross_Validation.partition(labels, fold, k)
        learner.fit(training_set, training_labels)
        training_predicted = learner.predict(training_set)
        validation_predicted = learner.predict(validation_set)
        train_folds_score.append(metrics.accuracy_score(training_labels, training_predicted))
        validation_folds_score.append(metrics.accuracy_score(validation_labels, validation_predicted))
    return train_folds_score, validation_folds_score

The learner parameter is a classifier from SKlearn library, k is the number of folds, examples is a sparse matrix produced by the CountVectorizer (again SKlearn) that is the representation of the bag of words. For example:

学习者参数是SKlearn库中的分类器,k是折叠数,例子是由CountVectorizer(再次SKlearn)制作的稀疏矩阵,它是单词包的表示形式。例如:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from Cross_Validation import Cross_Validation as cv

vectorizer = CountVectorizer(stop_words='english', lowercase=True, min_df=2, analyzer="word")
data = vectorizer.fit_transform("""textual data""")
clfMNB = MultinomialNB(alpha=.0001)
score = cv.Cross_Validation(clfMNB, 10, data, labels)
print "Train score" + str(score[0])
print "Test score" + str(score[1])

I'm assuming there is some logic error somewhere since the scores are 95% on the training set (as expected) but practically 0 on the test test, but I can't find it.

我假设有一些逻辑上的错误,因为在训练集上的分数是95%(如预期的),但是在测试测试中几乎是0,但是我找不到。

I hope I was clear. Thanks in advance.

我希望我是清白的。提前谢谢。

________________________________EDIT___________________________________

________________________________EDIT___________________________________

This is the code that loads the text into the vector that can be passed to the vectorizer. It also returns the label vector.

这是将文本加载到可以传递给vectorizer的向量的代码。它还返回标签向量。

from nltk.tokenize import word_tokenize
from Categories_Data import categories
import numpy as np
import codecs
import glob
import os
import re

class Data_Preprocessor:

def tokenize(self, text):
    tokens = word_tokenize(text)
    alpha = [t for t in tokens if unicode(t).isalpha()]
    return alpha

def header_not_fully_removed(self, text):
    if ":" in text.splitlines()[0]:
        return len(text.splitlines()[0].split(":")[0].split()) == 1
    else:
        return False

def strip_newsgroup_header(self, text):
    _before, _blankline, after = text.partition('\n\n')
    if len(after) > 0 and self.header_not_fully_removed(after):
        after = self.strip_newsgroup_header(after)
    return after

def strip_newsgroup_quoting(self, text):
    _QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'r'|^In article|^Quoted from|^\||^>)')
    good_lines = [line for line in text.split('\n')
        if not _QUOTE_RE.search(line)]
    return '\n'.join(good_lines)

def strip_newsgroup_footer(self, text):
    lines = text.strip().split('\n')
    for line_num in range(len(lines) - 1, -1, -1):
        line = lines[line_num]
        if line.strip().strip('-') == '':
            break
    if line_num > 0:
        return '\n'.join(lines[:line_num])
    else:
        return text

def raw_to_vector(self, path, to_be_stripped=["header", "footer", "quoting"], noise_threshold=-1):
    base_dir = os.getcwd()
    train_data = []
    label_data = []
    for category in categories:
        os.chdir(base_dir)
        os.chdir(path+"/"+category[0])
        for filename in glob.glob("*"):
            with codecs.open(filename, 'r', encoding='utf-8', errors='replace') as target:
                data = target.read()
                if "quoting" in to_be_stripped:
                    data = self.strip_newsgroup_quoting(data)
                if "header" in to_be_stripped:
                    data = self.strip_newsgroup_header(data)
                if "footer" in to_be_stripped:
                    data = self.strip_newsgroup_footer(data)
                if len(data) > noise_threshold:
                    train_data.append(data)
                    label_data.append(category[1])
    os.chdir(base_dir)
    return np.array(train_data), np.array(label_data)

This is what "from Categories_Data import categories" imports...

这是“从分类数据导入类别”导入的内容……

categories = [
    ('alt.atheism',0),
    ('comp.graphics',1),
    ('comp.os.ms-windows.misc',2),
    ('comp.sys.ibm.pc.hardware',3),
    ('comp.sys.mac.hardware',4),
    ('comp.windows.x',5),
    ('misc.forsale',6),
    ('rec.autos',7),
    ('rec.motorcycles',8),
    ('rec.sport.baseball',9),
    ('rec.sport.hockey',10),
    ('sci.crypt',11),
    ('sci.electronics',12),
    ('sci.med',13),
    ('sci.space',14),
    ('soc.religion.christian',15),
    ('talk.politics.guns',16),
    ('talk.politics.mideast',17),
    ('talk.politics.misc',18),
    ('talk.religion.misc',19)
 ]

1 个解决方案

#1


2  

The reason why your validation score is low is subtle.

你的验证分数低的原因很微妙。

The issue is how you have partitioned the dataset. Remember, when doing cross-validation you should randomly split the dataset. It is the randomness that you are missing.

问题是如何划分数据集。记住,在进行交叉验证时,应该随机地分割数据集。这就是你缺少的随机性。

Your data is loaded category by category, which means in your input dataset, class labels and examples follow one after the other. By not doing the random split, you have completely removed a class which your model never sees during the training phase and hence you get a bad result on your test/validation phase.

您的数据是按类别加载的,这意味着在您的输入数据集中,类标签和示例会跟随一个接着一个。通过不进行随机分割,您已经完全删除了在训练阶段中您的模型从未看到的类,因此您将在测试/验证阶段得到一个糟糕的结果。

You can solve this by doing a random shuffle. So, do this:

你可以通过随机洗牌来解决这个问题。所以,这样做:

from sklearn.utils import shuffle    

processor = Data_Preprocessor()
td, tl = processor.raw_to_vector(path="C:/Users/Pankaj/Downloads/ng/")
vectorizer = CountVectorizer(stop_words='english', lowercase=True, min_df=2, analyzer="word")
data = vectorizer.fit_transform(td)
# Shuffle the data and labels
data, tl = shuffle(data, tl, random_state=0)
clfMNB = MultinomialNB(alpha=.0001)
score = Cross_Validation.Cross_Validation(clfMNB, 10, data, tl)

print("Train score" + str(score[0]))
print("Test score" + str(score[1]))

推荐阅读
  • 本文探讨了如何使用Scrapy框架构建高效的数据采集系统,以及如何通过异步处理技术提升数据存储的效率。同时,文章还介绍了针对不同网站采用的不同采集策略。 ... [详细]
  • 视觉Transformer综述
    本文综述了视觉Transformer在计算机视觉领域的应用,从原始Transformer出发,详细介绍了其在图像分类、目标检测和图像分割等任务中的最新进展。文章不仅涵盖了基础的Transformer架构,还深入探讨了各类增强版Transformer模型的设计思路和技术细节。 ... [详细]
  • 【MySQL】frm文件解析
    官网说明:http:dev.mysql.comdocinternalsenfrm-file-format.htmlfrm是MySQL表结构定义文件,通常frm文件是不会损坏的,但是如果 ... [详细]
  • 本文介绍了使用Python和C语言编写程序来计算一个给定数值的平方根的方法。通过迭代算法,我们能够精确地得到所需的结果。 ... [详细]
  • 使用Python构建网页版图像编辑器
    本文详细介绍了一款基于Python开发的网页版图像编辑工具,具备多种图像处理功能,如黑白转换、铅笔素描效果等。 ... [详细]
  • 本文分享了作者在使用LaTeX过程中的几点心得,涵盖了从文档编辑、代码高亮、图形绘制到3D模型展示等多个方面的内容。适合希望深入了解LaTeX高级功能的用户。 ... [详细]
  • 笔记说明重学前端是程劭非(winter)【前手机淘宝前端负责人】在极客时间开的一个专栏,每天10分钟,重构你的前端知识体系& ... [详细]
  • 题目描述:Balala Power! 时间限制:4000/2000 MS (Java/Other) 内存限制:131072/131072 K (Java/Other)。题目背景及问题描述详见正文。 ... [详细]
  • Gradle 是 Android Studio 中默认的构建工具,了解其基本配置对于开发效率的提升至关重要。本文将详细介绍如何在 Gradle 中定义和使用共享变量,以确保项目的一致性和可维护性。 ... [详细]
  • 汇总了2023年7月7日最新的网络安全新闻和技术更新,包括最新的漏洞披露、工具发布及安全事件。 ... [详细]
  • 本文详细介绍如何在SSM(Spring + Spring MVC + MyBatis)框架中实现分页功能。包括分页的基本概念、数据准备、前端分页栏的设计与实现、后端分页逻辑的编写以及最终的测试步骤。 ... [详细]
  • 本文详细介绍了如何使用C#实现不同类型的系统服务账户(如Windows服务、计划任务和IIS应用池)的密码重置方法。 ... [详细]
  • ArcBlock 发布 ABT 节点 1.0.31 版本更新
    2020年11月9日,ArcBlock 区块链基础平台发布了 ABT 节点开发平台的1.0.31版本更新,此次更新带来了多项功能增强与性能优化。 ... [详细]
  • 1、编写一个Java程序在屏幕上输出“你好!”。programmenameHelloworld.javapublicclassHelloworld{publicst ... [详细]
  • Pandas参数调整实用指南
    在使用Pandas处理数据时,由于数据集大小和格式的不同,相同的函数或方法可能产生不同的效果。了解如何调整Pandas的参数设置,可以让我们更灵活地应对各种数据挑战,提高数据分析的效率和质量。 ... [详细]
author-avatar
呆保保_369
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有