
Python Practice: Handling Missing Values in a Dataset


I built a small dataset in Excel to use as demo data, shown below:

[Screenshot: the demo dataset built in Excel, with several cells left blank]

As you can see, some values in the data are missing.

Step 1: read in the Excel file.

This step uses the read_excel() function from the pandas library. If it is your first time using the function, it is worth taking a look at its help documentation first.

A small tip when using help() to look up documentation: first, make sure to qualify the function with the library it belongs to; second, do not write () after the function name, or an error is raised. The reason is that the parentheses actually call the function, and since read_excel() requires arguments, calling it without any fails before help() ever runs.

[Screenshot: the error raised by help(pd.read_excel()), which calls read_excel() without its required arguments]

The correct way to view the help documentation for read_excel() is:

import pandas as pd
help(pd.read_excel)
Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name=0, header=0, names=None, index_col=None,
           usecols=None, squeeze=False, dtype=None, engine=None,
           converters=None, true_values=None, false_values=None,
           skiprows=None, nrows=None, na_values=None, keep_default_na=True,
           verbose=False, parse_dates=False, date_parser=None,
           thousands=None, comment=None, skipfooter=0, convert_float=True,
           mangle_dupe_cols=True, **kwds)
    Read an Excel file into a pandas DataFrame.

    Supports `xls`, `xlsx`, `xlsm`, `xlsb`, and `odf` file extensions read
    from a local filesystem or URL. Supports an option to read a single
    sheet or a list of sheets.

    Parameters
    ----------
    io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL.
    sheet_name : str, int, list, or None, default 0
        Strings are used for sheet names. Integers are used in zero-indexed
        sheet positions. Lists of strings/integers are used to request
        multiple sheets. Specify None to get all sheets.
    header : int, list of int, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. Use None if there is no header.
    index_col : int, list of int, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. By default values such as
        '', '#N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan' and 'null' are
        interpreted as NaN.
    keep_default_na : bool, default True
        Whether or not to include the default NaN values when parsing the
        data.
    [... the remaining parameters, Returns, See Also and Examples sections
    of the help output are omitted here ...]

The first attempt to read the file raised an error saying that a required package, 'xlrd', was missing; it can be installed with pip.

[Screenshot: the import error reporting that the 'xlrd' package is missing]

The installation command:

pip install xlrd

Trying to read the file again:

[Screenshot: the DataFrame read from the Excel file, with NaN in the missing cells]

As shown, the missing cells are read in as NaN, which stands for Not a Number.
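As a minimal sketch of this first step, assuming the file sits at the Windows path used later in this article (adjust it to your own location):

import pandas as pd

# Read the demo dataset; empty Excel cells come in as NaN.
path = r'C:\Users\Cara\Desktop\stuinfo.xlsx'
df = pd.read_excel(io=path)

print(df)                 # the missing cells are displayed as NaN
print(df.isnull().sum())  # how many values are missing in each column

df.isnull().sum() is a quick way to confirm which columns contain missing values before deciding how to treat them.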

Missing-value handling, Method 1: drop the rows that contain missing values

[Screenshot: the result of dropna(), with the rows containing missing values removed]

Notes: ① The DataFrame type provides a dropna() method, which can modify and overwrite the original DataFrame directly (when called with inplace=True).

② All of dropna()'s parameters have default values: axis=0 means work row by row, and how='any' means a row is dropped as soon as it contains a single missing value. how='all' means a row is dropped only when every value in it is missing.
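A minimal sketch of Method 1, assuming df is the DataFrame read in above:

# Default behaviour: axis=0 (row by row) and how='any', so a row is
# dropped as soon as it contains a single missing value.
df_dropped = df.dropna()

# Only drop rows in which every value is missing.
df_dropped_all = df.dropna(how='all')

# To overwrite the original DataFrame instead of returning a new one:
# df.dropna(inplace=True)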

Missing-value handling, Method 2: fill in the missing values

There are many ways to fill missing values. Common ones include filling with the mean, filling with the maximum or minimum, filling with the value from the previous record, filling with the value from the next record, Lagrange interpolation, and so on.

Demo 1: filling with the mean:

[Screenshot: the result of filling the missing values with the column mean]

Notes: ① The mean here refers to the mean of the column that contains the missing value.

② In fillna(), inplace=True means the new values overwrite the original ones; the default is False.
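A minimal sketch of Demo 1. The column name 'score' is only an illustration, since the exact columns of the demo dataset are not listed here; substitute the numeric column in your own data:

# Fill the missing values in one numeric column with that column's mean.
df['score'] = df['score'].fillna(df['score'].mean())

# The article's approach uses inplace=True to overwrite the values directly:
# df['score'].fillna(df['score'].mean(), inplace=True)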

Demo 2: filling a row's missing value with the corresponding value from the next record.

[Screenshot: the result of filling the missing values with method='bfill']

Notes: method='bfill' means a row's missing value is filled with the corresponding value from the next row ("bfill" is short for backfill, i.e. backward fill). One special case to watch for: if the missing value is in the last row, there is no next row to take a value from, so it cannot be filled this way.

Similarly, method='ffill' fills a row's missing value with the corresponding value from the previous row ("ffill" is short for forward fill). Pay particular attention to a missing value in the first row: there is no previous row, so this method cannot fill it.
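A minimal sketch of Demo 2, again assuming df from above:

# Backfill: each NaN is replaced by the value from the next row.
# A NaN in the last row stays NaN, because there is no next row.
df_bfill = df.fillna(method='bfill')

# Forward fill: each NaN is replaced by the value from the previous row.
# A NaN in the first row stays NaN, because there is no previous row.
df_ffill = df.fillna(method='ffill')

Newer pandas versions also expose df.bfill() and df.ffill() as shorthand for the same two operations.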

For more information about the fillna() method, check the help documentation:

Note: to view help for an instance method this way, you first need an instance of the class; here the DataFrame read from the file is named df.

import pandas as pd
path = r'C:\Users\Cara\Desktop\stuinfo.xlsx'
df = pd.read_excel(io = path)
help(df.fillna)
Help on method fillna in module pandas.core.frame:
fillna(value=None, method=None, axis=None, inplace=False, limit=None,
       downcast=None) -> Union[ForwardRef('DataFrame'), NoneType]
    method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method.

    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). This value
        cannot be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series.
        pad / ffill: propagate last valid observation forward to next valid.
        backfill / bfill: use next valid observation to fill gap.
    axis : {0 or 'index', 1 or 'columns'}
        Axis along which to fill missing values.
    inplace : bool, default False
        If True, fill in-place. Note: this will modify any other views on
        this object.
    limit : int, default None
        If method is specified, this is the maximum number of consecutive
        NaN values to forward/backward fill.
    downcast : dict, default is None
        A dict of item->dtype of what to downcast if possible.

    Returns
    -------
    DataFrame or None
        Object with missing values filled or None if ``inplace=True``.

    Examples
    --------
    >>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
    ...                    [3, 4, np.nan, 1],
    ...                    [np.nan, np.nan, np.nan, 5],
    ...                    [np.nan, 3, np.nan, 4]],
    ...                   columns=list('ABCD'))
    >>> df.fillna(0)
         A    B    C  D
    0  0.0  2.0  0.0  0
    1  3.0  4.0  0.0  1
    2  0.0  0.0  0.0  5
    3  0.0  3.0  0.0  4
    >>> df.fillna(method='ffill')
         A    B    C  D
    0  NaN  2.0  NaN  0
    1  3.0  4.0  NaN  1
    2  3.0  4.0  NaN  5
    3  3.0  3.0  NaN  4
    [... the See Also section and further examples of the help output are
    omitted here ...]

References:

Playing with Data in Python (用Python玩转数据), China University MOOC: www.icourse163.org


