热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

TextcleaningwithExcelVBA

1.TheProblemRecentlyIcameacrossaprojectthatmycolleaguesheavilyrelythemselfonexcel.Normally

1. The Problem

Recently I came across a project that my colleagues heavily rely themself on excel. Normally that is not a problem and I do think excel is a good tool as well.

However in the workflow we have a text column waiting to be cleaned and the logic is somehow complex. Actually we have about 10-15 if-else statements.

Algorithmically that is not a problem because it is only a FuzzBuzz question. Which means you have an array, and you can loop over the array, and check each one of the element whether it meets your condition.

If it does meet the condition, you output a label and store it in an new array. You can write as many if-else statements as you need.

But in excel that is a real problem. In the excel‘s cell you can hardly write meaningful funtion when the logic is complex. Besides, excel lacks some most commonly used functions, like string match.

Although I can use R or Python to handle the problem but my colleague‘s excel background requires an easy solution. Firstly I think maybe build a console applicaiton to do the cleaning job is a good idea, for example C# or Go. 

Somehow, I am thinking whether it is a overkill on this FuzzBuzz question and I do meet some GBK csv encoding problem in both C# and Go.

That is how I came back to VBA, if one method is stupid but works, it is not stupid after all.

 

2. Normal VBA

Press Alt+F11 we can inter Excel VBA editor. We will build our function here. Later we can link the function to a button into our spreadsheet.

I don‘t want to off topic so let‘s say we copy-paste the csv data into an excel Sheet1. (I mean it does have a method to automatically copy data from csv file into excel sheet, but we don‘t discuss in this article)

Althought in excel spreadsheet we lack of some commonly used built-in functions, but in VBA we have some good functions:

(1). Check data range
        rows: Sheet1.UsedRange.Rows.Count
  Warning: This result could easily exceed integer, so do not declare an integer to store it.
(2). Initialize a cell
  Cells(num, label) = ...
  eg: Cells(1, "G") = "Classified_Label"
(3). Data type change
  Cstr()
  eg: current = CStr(Cells(1, "D"))
  There are also other kind of C*() as a family, like CInt(), CDbl() and so on.
(4). String Detect
  InStr()
  eg: InStr(current, "Alex")
  This function is especially useful because in excel spreadsheet FIND() is too bad design to use.
(5). Substring
  Mid()
  eg: Mid(current, 2, 2)

Go back to the probelm. The For-loop with If-Else is also as normal as we will do other languages and it is much easier than write them in excel spreadsheet.

Let‘s say our original text column is "D" and we want to output result in column "G". This will make us easy to write example code.


Sub adjust()
‘ determine data range
Dim length ‘ do not decalre it as an integer, because the range could bigger than intger.
length = Sheet1.UsedRange.Rows.Count

‘ initialize column name
Cells(1, "G") = "Classified_Label"

‘ adjust data ‘ do not declare it as an integer
Dim i
For i = 2 To length
Dim current As String
current = CStr(Cells(i, "D")) ‘ CStr() means toString()
If InStr(current, "Alice") Then
Cells(i, "G") = "Alex"
ElseIf InStr(current, "Bob") Then
Cells(i, "G") = "Bobe"
‘ .... write as many as your logic need
Else
Cells(i, "G") = "Others"
EndIf
Next
End Sub

 

3. VBA with ADO  

Ado is a library of standars and functions developed by MicroSoft. Using this tool we can use SQL statement in VBA.

In real work, I occasionally meet someone‘s PC that cannot use this method. Maybe they lack of drivers or they are using bad version of excel.

But still it is good to know and be packed at our backpoctet.


Sub adjust()
Dim CNN As Object
Set CNN = CreateObject("ADODB.CONNECTION")

Dim Driver As String
Driver = "Driver={Microsoft Text Driver (*.txt; *.csv)};" &_
"DBQ=" & ThisWorkbook.Path & "\;"
CNN.Open "Provider=MSDASQL;" & Driver

Dim file As String
file = "data.csv" ‘ can use InBox() function to require any data file on disk
Dim Query As String
Query = "SELECT * FROM " & file ‘ this step can be as complex as you need
Dim RS As Object
Set RS = CreateObject("ADODB.RECORDSET")
RS = CNN.Excute(Query)
With RS
For i = 1 To .Fields.Count
Cells(1, i).Value = .Fields(i - 1).Name
Next

Range("A2").CopyFromRecordset RS
ActiveSheet.Cells.Font.Name = "宋体"
.Close
End With

CNN.Close
Set CNN = Nothing
Set RS = Nothing
End Sub

The Query can be as complex as you need like what you will do in real SQL query, but in a more ACCESS SQL way. For example, we don‘t use CASE WHEN but IIF().

For an easy example:


Query = "SELECT *, IIF(InStr(Name, "Alice"), "Alex", " & _
"IIF(InStr(Name, "Bob"), "Bobe", "Others"))"

  

4. Finale

After our function is built, we can go back to Excel spreadsheet and add an geometry object into the spreadsheet. Right click the object and link  our VBA function to the button.

 


推荐阅读
  • 深入理解Cookie与Session会话管理
    本文详细介绍了如何通过HTTP响应和请求处理浏览器的Cookie信息,以及如何创建、设置和管理Cookie。同时探讨了会话跟踪技术中的Session机制,解释其原理及应用场景。 ... [详细]
  • 深入理解OAuth认证机制
    本文介绍了OAuth认证协议的核心概念及其工作原理。OAuth是一种开放标准,旨在为第三方应用提供安全的用户资源访问授权,同时确保用户的账户信息(如用户名和密码)不会暴露给第三方。 ... [详细]
  • 在计算机技术的学习道路上,51CTO学院以其专业性和专注度给我留下了深刻印象。从2012年接触计算机到2014年开始系统学习网络技术和安全领域,51CTO学院始终是我信赖的学习平台。 ... [详细]
  • CSS 布局:液态三栏混合宽度布局
    本文介绍了如何使用 CSS 实现液态的三栏布局,其中各栏具有不同的宽度设置。通过调整容器和内容区域的属性,可以实现灵活且响应式的网页设计。 ... [详细]
  • Linux 系统启动故障排除指南:MBR 和 GRUB 问题
    本文详细介绍了 Linux 系统启动过程中常见的 MBR 扇区和 GRUB 引导程序故障及其解决方案,涵盖从备份、模拟故障到恢复的具体步骤。 ... [详细]
  • 本文介绍了如何使用jQuery根据元素的类型(如复选框)和标签名(如段落)来获取DOM对象。这有助于更高效地操作网页中的特定元素。 ... [详细]
  • 1:有如下一段程序:packagea.b.c;publicclassTest{privatestaticinti0;publicintgetNext(){return ... [详细]
  • PHP 5.2.5 安装与配置指南
    本文详细介绍了 PHP 5.2.5 的安装和配置步骤,帮助开发者解决常见的环境配置问题,特别是上传图片时遇到的错误。通过本教程,您可以顺利搭建并优化 PHP 运行环境。 ... [详细]
  • 本文介绍如何在 Xcode 中使用快捷键和菜单命令对多行代码进行缩进,包括右缩进和左缩进的具体操作方法。 ... [详细]
  • 本文介绍了一款用于自动化部署 Linux 服务的 Bash 脚本。该脚本不仅涵盖了基本的文件复制和目录创建,还处理了系统服务的配置和启动,确保在多种 Linux 发行版上都能顺利运行。 ... [详细]
  • 在Linux系统中配置并启动ActiveMQ
    本文详细介绍了如何在Linux环境中安装和配置ActiveMQ,包括端口开放及防火墙设置。通过本文,您可以掌握完整的ActiveMQ部署流程,确保其在网络环境中正常运行。 ... [详细]
  • 如何在WPS Office for Mac中调整Word文档的文字排列方向
    本文将详细介绍如何使用最新版WPS Office for Mac调整Word文档中的文字排列方向。通过这些步骤,用户可以轻松更改文本的水平或垂直排列方式,以满足不同的排版需求。 ... [详细]
  • 理解存储器的层次结构有助于程序员优化程序性能,通过合理安排数据在不同层级的存储位置,提升CPU的数据访问速度。本文详细探讨了静态随机访问存储器(SRAM)和动态随机访问存储器(DRAM)的工作原理及其应用场景,并介绍了存储器模块中的数据存取过程及局部性原理。 ... [详细]
  • 几何画板展示电场线与等势面的交互关系
    几何画板是一款功能强大的物理教学软件,具备丰富的绘图和度量工具。它不仅能够模拟物理实验过程,还能通过定量分析揭示物理现象背后的规律,尤其适用于难以在实际实验中展示的内容。本文将介绍如何使用几何画板演示电场线与等势面之间的关系。 ... [详细]
  • 本文介绍如何通过Windows批处理脚本定期检查并重启Java应用程序,确保其持续稳定运行。脚本每30分钟检查一次,并在需要时重启Java程序。同时,它会将任务结果发送到Redis。 ... [详细]
author-avatar
华力
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有