SKLearn reshape warning for X and Y

Author: 生活趣图分享 | Source: Internet | 2023-06-03 18:10

I'm new to machine learning and am working on a project using Python (3.6), pandas, NumPy and scikit-learn.

My DataFrame is:

discount   tax   total   subtotal   productid
  3         0     20       13        002
  10        3     106      94        003
  46.49     6     21       20        004

Here's how I have performed the classification:

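import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder

df_full = pd.read_excel('input/Potential_Learning_Patterns.xlsx', sheet_name=0)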
df_full.head()
# Convert the discount columns to numeric; unparseable values become NaN
df_full['discount'] = pd.to_numeric(df_full['discount'], errors='coerce')
df_full['productdiscount'] = pd.to_numeric(df_full['productdiscount'], errors='coerce')
df_full['Class'] = ((df_full['discount'] > 20) & 
                (df_full['tax'] == 0) &
                (df_full['productdiscount'] > 20) &
                (df_full['total'] > 100)).astype(int)
print (df_full)

# Get some sample data from entire dataset
data = df_full.sample(frac = 0.1, random_state = 1)
print(data.shape)
data.isnull().sum()
# Convert excel data into matrix
columns = "invoiceid locationid timestamp customerid discount tax total subtotal productid quantity productprice productdiscount invoice_products_id producttax invoice_payments_id paymentmethod paymentdetails amount Class(0/1) Class".split()
X = data[columns].values  # DataFrame.as_matrix was removed in pandas 1.0
Y = data.Class
# temp = np.array(temp).reshape((len(temp), 1)
Y = Y.values.reshape(Y.shape[0], 1)
X.shape
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.06)
X_test, X_dev, Y_test, Y_dev = train_test_split(X_test, Y_test, test_size = .5)

# Check if there is Classification Values - 0/1 in training set and other set 
np.where(Y_train == 1)
np.where(Y_test == 1)
np.where(Y_dev == 1)

# Determine no of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# calculate percentages for Fraud & Valid 
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)

print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))

# Correlation matrix
corrmat = data.corr()
fig = plt.figure( figsize = (12, 9))

sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()

Here's how I applied the reshaping:

# Get all the columns from dataframe
columns = data.columns.tolist()

# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"] ]

# Store the variable we want to predict
target = "Class"
for column in data.columns:
    if data[column].dtype == type(object):
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        X = data[column]
    X = data[column]        
    Y = data[target]

    # Print the shapes of X & Y
    print(X.shape)
    print(Y.shape)
# define a random state
state = 1

# define the outlier detection method
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                       contamination=outlier_fraction,
                                       random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}



# fit the model
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # run classification metrics 
    print('{}:{}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred ))
    print(classification_report(Y, y_pred ))

The code works fine up to the point where the samples and target are reshaped. But when I call the fit method of my classifiers, it raises an error like:

ValueError: Expected 2D array, got 1D array instead: array=[1 0]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I'm new to machine learning. What did I do wrong here? I have multiple features, so how can I correctly reshape my sample arrays?

Help me, please! Thanks in advance!

1 Answer

#1

In the following loop you are overwriting variable X with a single column (Series) in each loop iteration:

for column in data.columns:
    if data[column].dtype == type(object):
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        X = data[column]
    X = data[column]        #  <------- NOTE: X is replaced by a single column here
    Y = data[target]
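
To make the effect of that loop concrete, here is a minimal sketch. The toy DataFrame is modeled on the table in the question, and the "Class" values are invented for illustration. After the loop, X holds only the last column visited, which is a one-dimensional Series; that is the kind of input that triggers the "Expected 2D array, got 1D array" error.

```python
import pandas as pd

# Toy data modeled on the question's table; the "Class" values are made up.
data = pd.DataFrame({
    "discount": [3.0, 10.0, 46.49],
    "tax": [0, 3, 6],
    "total": [20, 106, 21],
    "subtotal": [13, 94, 20],
    "Class": [0, 0, 1],
})
target = "Class"

for column in data.columns:
    X = data[column]   # rebinds X on every pass; only the last column survives
    Y = data[target]

print(X.name)    # Class
print(X.shape)   # (3,)
print(X.ndim)    # 1
```

A feature matrix for scikit-learn needs two dimensions, (n_samples, n_features), so X has to be built from all feature columns at once rather than inside the loop.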

Instead, you can define X and Y after the loop as follows:

X = data.drop(columns=target)
Y = data[target]
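
A minimal sketch of the whole fix, assuming a toy DataFrame in place of the real Excel data (the values and the "Class" labels are invented). The loop only encodes non-numeric columns, and X and Y are assigned once, after the loop. The sketch uses the keyword form `drop(columns=...)` because recent pandas versions no longer accept the axis as a positional argument, and it checks for non-numeric columns with `pd.api.types.is_numeric_dtype`, which is a substitution for the dtype comparison in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; "productid" is stored as text so that it needs encoding.
data = pd.DataFrame({
    "discount": [3.0, 10.0, 46.49],
    "tax": [0, 3, 6],
    "total": [20, 106, 21],
    "productid": ["002", "003", "004"],
    "Class": [0, 0, 1],
})
target = "Class"

# Encode every non-numeric column in place; do not touch X or Y here.
for column in data.columns:
    if not pd.api.types.is_numeric_dtype(data[column]):
        data[column] = LabelEncoder().fit_transform(data[column])

# Build the feature matrix and the target once, after the loop.
X = data.drop(columns=target)   # 2D: all feature columns
Y = data[target]                # 1D: the labels

print(X.shape)   # (3, 4)
print(Y.shape)   # (3,)
```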

The vast majority of sklearn methods accept pandas DataFrames and Series as input data sets...
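
As an illustration of passing a DataFrame straight to an estimator, the sketch below fits an IsolationForest on randomly generated data. The column names, the sample size, and the `contamination=0.1` value are assumptions chosen for the example, not values from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic 2D feature matrix: 50 samples, 3 features (hypothetical column names).
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["discount", "tax", "total"])

# The DataFrame is passed directly; no manual reshape is needed.
clf = IsolationForest(n_estimators=10, contamination=0.1, random_state=1)
raw_pred = clf.fit(X).predict(X)      # +1 for inliers, -1 for outliers

# Map to the question's convention: 0 for valid, 1 for fraudulent.
y_pred = np.where(raw_pred == -1, 1, 0)

print(y_pred.shape)   # (50,)
```

Only a one-dimensional input, such as a single Series, needs `reshape(-1, 1)` or a double-bracket selection like `data[["discount"]]` before being used as a feature matrix.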
