A Summary of Three Common Data-Cleaning Techniques in Python

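This post covers three core data-cleaning tasks that come up when processing data in Python: missing values, stray whitespace, and duplicate values.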

As before, we use pandas to load the source data from an Excel or CSV file. pandas is not part of the standard library, so install it with pip if it is missing.

    pip install pandas

Next, prepare some dirty data to work on. The example here uses an Excel file, but any other supported format works as well.

Use pandas' read_excel() function to read the data.xlsx file we want to process.

    # Import the pandas library under the alias pd.
    import pandas as pd

    # Read the Excel file and store it in a variable called result_.
    result_ = pd.read_excel('D:/test/data.xlsx')

    # Print the DataFrame.
    print(result_)

Note that in a fresh Python environment, running the code above right after installing pandas may fail because the openpyxl module is not installed.

In that case, install openpyxl with pip as well.

    pip install openpyxl

Once that is done, running the read_excel code again returns the result successfully.

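    #          姓名  年龄    班级   成绩 表现
    # 0   Python集中营  10  1210   99  A
    # 1   Python集中营  11  1211  100  A
    # 2   Python集中营  12  1212  101  A
    # 3   Python集中营  13  1213  102  A
    # 4   Python集中营  14  1214  103  A
    # 5   Python集中营  15  1215  104  A
    # 6   Python集中营  16  1216  105  A
    # 7   Python集中营  17  1217  106  A
    # 8   Python集中营  18  1218  107  A
    # 9   Python集中营  19  1219  108  A
    # 10  Python集中营  20  1220  109  A
    # 11  Python集中营  21  1221  110  A
    # 12  Python集中营  22  1222  111  A
    # 13  Python集中营  23  1223  112  A
    # 14  Python集中营  24  1224  113  A
    # 15  Python集中营  25  1225  114  A
    # 16  Python集中营  26  1226  115  A
    # 17  Python集中营  27  1227  116  A
    # 18  Python集中营  28  1228  117  A
    #
    # Process finished with exit code 0

The columns are 姓名 (name), 年龄 (age), 班级 (class), 成绩 (score), and 表现 (performance). With the source data ready, we clean it in three ways.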

First, extract all the column names through the DataFrame's columns attribute.

To keep the code short, we strip the whitespace from the column names directly with a list comprehension.

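    # Extract the column names (the step described above).
    columns_ = result_.columns

    # Iterate over columns_ with a list comprehension and strip the
    # surrounding whitespace from each name.
    result_.columns = [column_name.strip() for column_name in columns_]

    # Print the column names of the DataFrame.
    print(result_.columns.values)

    # ['姓名' '年龄' '班级' '成绩' '表现']

After cleaning, the whitespace around every column name is gone. If the values inside a column also contain stray whitespace, they can be cleaned with strip in the same way.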
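For values inside a column, a minimal sketch might look like the following. The small DataFrame here is hypothetical sample data made up for illustration, not the article's data.xlsx; it assumes the column holds strings, so the vectorized Series.str.strip() accessor applies.

```python
import pandas as pd

# Hypothetical sample data: the names carry stray leading/trailing spaces.
df = pd.DataFrame({
    "姓名": ["  Python集中营", "Python集中营  "],
    "年龄": [10, 11],
})

# Series.str.strip() removes surrounding whitespace from each string value.
df["姓名"] = df["姓名"].str.strip()

print(df["姓名"].tolist())
```

Whitespace inside a value is left alone; only the leading and trailing characters are removed.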

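There are two ways to decide that data is duplicated. In the first, two rows are duplicates when they are identical in every column. In the second, rows are partial duplicates: they share the same value in a particular column, and that repetition is what needs cleaning.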

    # duplicated() returns a boolean Series that is True where a row
    # repeats an earlier row and False otherwise; sum() counts the True values.
    repeat_num = result_.duplicated().sum()

    # Print the number of duplicate rows in the DataFrame.
    print(repeat_num)

    # 1

The duplicated().sum() call above gives the number of rows that are exact copies of another row.

Next, actually delete those rows from the source data with the DataFrame's drop_duplicates function.

    # drop_duplicates() drops the duplicate rows, and inplace=True
    # modifies the DataFrame in place instead of returning a copy.
    result_.drop_duplicates(inplace=True)

    # Print the DataFrame.
    print(result_)

    #          姓名  年龄    班级   成绩 表现
    # 0   Python集中营  10  1210   99  A
    # 1   Python集中营  11  1211  100  A
    # 2   Python集中营  12  1212  101  A
    # 3   Python集中营  13  1213  102  A
    # 4   Python集中营  14  1214  103  A
    # 5   Python集中营  15  1215  104  A
    # 6   Python集中营  16  1216  105  A
    # 7   Python集中营  17  1217  106  A
    # 8   Python集中营  18  1218  107  A
    # 9   Python集中营  19  1219  108  A
    # 10  Python集中营  20  1220  109  A
    # 11  Python集中营  21  1221  110  A
    # 12  Python集中营  22  1222  111  A
    # 13  Python集中营  23  1223  112  A
    # 14  Python集中营  24  1224  113  A
    # 15  Python集中营  25  1225  114  A
    # 16  Python集中营  26  1226  115  A
    # 17  Python集中营  27  1227  116  A

Because the last row was an exact duplicate of the first row, it has been removed.
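The second kind of duplicate described earlier, rows that match only on a particular column, can be handled with the subset parameter that both duplicated() and drop_duplicates() accept. The sketch below uses a hypothetical three-row DataFrame rather than the article's data, and treats 年龄 as the column that must be unique purely for illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share the same 年龄 but differ in 成绩.
df = pd.DataFrame({
    "姓名": ["Python集中营", "Python集中营", "Python集中营"],
    "年龄": [10, 11, 10],
    "成绩": [99, 100, 101],
})

# Whole-row comparison: no row matches another in every column.
full_dupes = df.duplicated().sum()

# Column-level comparison: only 年龄 is compared.
partial_dupes = df.duplicated(subset=["年龄"]).sum()

# Keep the first row seen for each distinct 年龄 and drop the rest.
deduped = df.drop_duplicates(subset=["年龄"], keep="first")

print(full_dupes, partial_dupes, len(deduped))
```

Passing keep="last" would retain the final occurrence instead, and keep=False would drop every row that has a twin.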

After removing duplicates, you usually need to reset the index so that it does not end up with gaps.

    # range(result_.shape[0]) produces the integers from 0 up to (but not
    # including) the number of rows in the DataFrame.
    result_.index = range(result_.shape[0])

    # Print the index of the DataFrame.
    print(result_.index)

    # RangeIndex(start=0, stop=18, step=1)
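As an alternative to assigning a range by hand, pandas also offers reset_index(drop=True), which rebuilds a contiguous index in one call. This is a sketch on hypothetical sample data, not a replacement for the article's approach; both end up with the same index.

```python
import pandas as pd

# Hypothetical sample data: the middle row repeats the first one.
df = pd.DataFrame({"年龄": [10, 10, 11]})

# Dropping the duplicate leaves a gap in the index labels.
df = df.drop_duplicates()
index_with_gap = list(df.index)

# drop=True discards the old labels instead of keeping them as a new column.
df = df.reset_index(drop=True)

print(index_with_gap, list(df.index))
```

Without drop=True, the old index labels would be preserved in a new column named index.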

Missing values in a DataFrame are usually located with the isnull function, which flags every cell whose data is missing.

    # isnull() returns a boolean DataFrame of the same shape that is True
    # where a value is missing and False where it is present.
    sul_ = result_.isnull()

    # Print the boolean DataFrame.
    print(sul_)

    #        姓名     年龄     班级     成绩     表现
    # 0   False  False  False  False  False
    # 1   False  False  False  False  False
    # 2   False  False  False  False  False
    # 3   False  False  False  False  False
    # 4   False  False  False  False  False
    # 5   False  False  False  False  False
    # 6   False  False  False  False  False
    # 7   False  False  False  False  False
    # 8   False  False  False  False  False
    # 9   False  False  False  False  False
    # 10  False  False  False  False  False
    # 11  False  False  False  False  False
    # 12  False  False  False  False  False
    # 13  False  False  False  False  False
    # 14  False  False  False  False  False
    # 15  False  False  False  False  False
    # 16  False  False  False  False  False
    # 17  False  False  False  False  False

A False in a cell means that cell's data is not missing. You can also use notnull to view the same information inverted.

When you would rather not print the full cell-by-cell table from isnull, chain sum onto it to get a summary.

    # isnull().sum() sums the boolean DataFrame column by column, giving
    # a Series with the number of missing values in each column.
    isnull_sum = result_.isnull().sum()

    # Print the per-column counts.
    print(isnull_sum)

    # 姓名    0
    # 年龄    0
    # 班级    0
    # 成绩    0
    # 表现    0
    # dtype: int64

Applying sum after isnull returns, for each column, the number of cells that are empty.
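The article's data happens to contain no missing values, so every count is zero. To see non-zero counts, here is a small sketch on hypothetical sample data that does contain gaps (None and NaN are both treated as missing).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with deliberate gaps.
df = pd.DataFrame({
    "姓名": ["Python集中营", None, "Python集中营"],
    "成绩": [99.0, np.nan, np.nan],
})

# Number of missing cells in each column.
missing_per_column = df.isnull().sum()

# notnull() is the inverse view: number of present cells in each column.
present_per_column = df.notnull().sum()

print(missing_per_column.to_dict())
print(present_per_column.to_dict())
```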

Next comes filling in the values. A common approach is to select the empty cells in each column and fill them with a fixed value.

    # result_.loc[result_.姓名.isnull(), '姓名'] selects the cells of the
    # 姓名 column where the value is missing; 'Python集中营' is the fixed
    # value assigned to those cells.
    result_.loc[result_.姓名.isnull(), '姓名'] = 'Python集中营'

    # Print the DataFrame.
    print(result_)

    #          姓名  年龄    班级   成绩 表现
    # 0   Python集中营  10  1210   99  A
    # 1   Python集中营  11  1211  100  A
    # 2   Python集中营  12  1212  101  A
    # 3   Python集中营  13  1213  102  A
    # 4   Python集中营  14  1214  103  A
    # 5   Python集中营  15  1215  104  A
    # 6   Python集中营  16  1216  105  A
    # 7   Python集中营  17  1217  106  A
    # 8   Python集中营  18  1218  107  A
    # 9   Python集中营  19  1219  108  A
    # 10  Python集中营  20  1220  109  A
    # 11  Python集中营  21  1221  110  A
    # 12  Python集中营  22  1222  111  A
    # 13  Python集中营  23  1223  112  A
    # 14  Python集中营  24  1224  113  A
    # 15  Python集中营  25  1225  114  A
    # 16  Python集中营  26  1226  115  A
    # 17  Python集中营  27  1227  116  A
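The same kind of fixed-value fill can also be written with fillna, which accepts a dict mapping column names to fill values. This is a sketch on hypothetical sample data; the fill value of 0 for 成绩 is an arbitrary choice made for illustration, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing name and one missing score.
df = pd.DataFrame({
    "姓名": ["Python集中营", None],
    "成绩": [99.0, np.nan],
})

# Fill each column's gaps with its own fixed value; fillna returns a new
# DataFrame and leaves df unchanged.
filled = df.fillna({"姓名": "Python集中营", "成绩": 0})

print(filled)
```

Columns not listed in the dict are left as they are, so gaps in them would remain.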

Once cleaning is finished, save the data in the format you need with the functions the DataFrame provides, such as to_csv or to_excel.
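A minimal sketch of the saving step, using hypothetical sample data and an in-memory buffer in place of a real file path so that it runs anywhere. Passing a path string such as 'cleaned.csv' (a made-up name) to to_csv would write to disk instead.

```python
import io

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"姓名": ["Python集中营"], "年龄": [10]})

# index=False leaves the row index out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back to check that the round trip preserves the data.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```

to_excel is called the same way with an .xlsx path, and it needs an Excel writer engine such as openpyxl to be installed, just as read_excel did earlier.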
