热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

TextcleaningwithExcelVBA

1.TheProblemRecentlyIcameacrossaprojectthatmycolleaguesheavilyrelythemselfonexcel.Normally

1. The Problem

Recently I came across a project that my colleagues heavily rely themself on excel. Normally that is not a problem and I do think excel is a good tool as well.

However in the workflow we have a text column waiting to be cleaned and the logic is somehow complex. Actually we have about 10-15 if-else statements.

Algorithmically that is not a problem because it is only a FuzzBuzz question. Which means you have an array, and you can loop over the array, and check each one of the element whether it meets your condition.

If it does meet the condition, you output a label and store it in an new array. You can write as many if-else statements as you need.

But in excel that is a real problem. In the excel‘s cell you can hardly write meaningful funtion when the logic is complex. Besides, excel lacks some most commonly used functions, like string match.

Although I can use R or Python to handle the problem but my colleague‘s excel background requires an easy solution. Firstly I think maybe build a console applicaiton to do the cleaning job is a good idea, for example C# or Go. 

Somehow, I am thinking whether it is a overkill on this FuzzBuzz question and I do meet some GBK csv encoding problem in both C# and Go.

That is how I came back to VBA, if one method is stupid but works, it is not stupid after all.

 

2. Normal VBA

Press Alt+F11 we can inter Excel VBA editor. We will build our function here. Later we can link the function to a button into our spreadsheet.

I don‘t want to off topic so let‘s say we copy-paste the csv data into an excel Sheet1. (I mean it does have a method to automatically copy data from csv file into excel sheet, but we don‘t discuss in this article)

Althought in excel spreadsheet we lack of some commonly used built-in functions, but in VBA we have some good functions:

(1). Check data range
        rows: Sheet1.UsedRange.Rows.Count
  Warning: This result could easily exceed integer, so do not declare an integer to store it.
(2). Initialize a cell
  Cells(num, label) = ...
  eg: Cells(1, "G") = "Classified_Label"
(3). Data type change
  Cstr()
  eg: current = CStr(Cells(1, "D"))
  There are also other kind of C*() as a family, like CInt(), CDbl() and so on.
(4). String Detect
  InStr()
  eg: InStr(current, "Alex")
  This function is especially useful because in excel spreadsheet FIND() is too bad design to use.
(5). Substring
  Mid()
  eg: Mid(current, 2, 2)

Go back to the probelm. The For-loop with If-Else is also as normal as we will do other languages and it is much easier than write them in excel spreadsheet.

Let‘s say our original text column is "D" and we want to output result in column "G". This will make us easy to write example code.


Sub adjust()
‘ determine data range
Dim length ‘ do not decalre it as an integer, because the range could bigger than intger.
length = Sheet1.UsedRange.Rows.Count

‘ initialize column name
Cells(1, "G") = "Classified_Label"

‘ adjust data ‘ do not declare it as an integer
Dim i
For i = 2 To length
Dim current As String
current = CStr(Cells(i, "D")) ‘ CStr() means toString()
If InStr(current, "Alice") Then
Cells(i, "G") = "Alex"
ElseIf InStr(current, "Bob") Then
Cells(i, "G") = "Bobe"
‘ .... write as many as your logic need
Else
Cells(i, "G") = "Others"
EndIf
Next
End Sub

 

3. VBA with ADO  

Ado is a library of standars and functions developed by MicroSoft. Using this tool we can use SQL statement in VBA.

In real work, I occasionally meet someone‘s PC that cannot use this method. Maybe they lack of drivers or they are using bad version of excel.

But still it is good to know and be packed at our backpoctet.


Sub adjust()
Dim CNN As Object
Set CNN = CreateObject("ADODB.CONNECTION")

Dim Driver As String
Driver = "Driver={Microsoft Text Driver (*.txt; *.csv)};" &_
"DBQ=" & ThisWorkbook.Path & "\;"
CNN.Open "Provider=MSDASQL;" & Driver

Dim file As String
file = "data.csv" ‘ can use InBox() function to require any data file on disk
Dim Query As String
Query = "SELECT * FROM " & file ‘ this step can be as complex as you need
Dim RS As Object
Set RS = CreateObject("ADODB.RECORDSET")
RS = CNN.Excute(Query)
With RS
For i = 1 To .Fields.Count
Cells(1, i).Value = .Fields(i - 1).Name
Next

Range("A2").CopyFromRecordset RS
ActiveSheet.Cells.Font.Name = "宋体"
.Close
End With

CNN.Close
Set CNN = Nothing
Set RS = Nothing
End Sub

The Query can be as complex as you need like what you will do in real SQL query, but in a more ACCESS SQL way. For example, we don‘t use CASE WHEN but IIF().

For an easy example:


Query = "SELECT *, IIF(InStr(Name, "Alice"), "Alex", " & _
"IIF(InStr(Name, "Bob"), "Bobe", "Others"))"

  

4. Finale

After our function is built, we can go back to Excel spreadsheet and add an geometry object into the spreadsheet. Right click the object and link  our VBA function to the button.

 


推荐阅读
  • 深入理解Cookie与Session会话管理
    本文详细介绍了如何通过HTTP响应和请求处理浏览器的Cookie信息,以及如何创建、设置和管理Cookie。同时探讨了会话跟踪技术中的Session机制,解释其原理及应用场景。 ... [详细]
  • 技术分享:从动态网站提取站点密钥的解决方案
    本文探讨了如何从动态网站中提取站点密钥,特别是针对验证码(reCAPTCHA)的处理方法。通过结合Selenium和requests库,提供了详细的代码示例和优化建议。 ... [详细]
  • 在Linux系统中配置并启动ActiveMQ
    本文详细介绍了如何在Linux环境中安装和配置ActiveMQ,包括端口开放及防火墙设置。通过本文,您可以掌握完整的ActiveMQ部署流程,确保其在网络环境中正常运行。 ... [详细]
  • 本文介绍如何通过Windows批处理脚本定期检查并重启Java应用程序,确保其持续稳定运行。脚本每30分钟检查一次,并在需要时重启Java程序。同时,它会将任务结果发送到Redis。 ... [详细]
  • Vue 2 中解决页面刷新和按钮跳转导致导航栏样式失效的问题
    本文介绍了如何通过配置路由的 meta 字段,确保 Vue 2 项目中的导航栏在页面刷新或内部按钮跳转时,始终保持正确的 active 样式。具体实现方法包括设置路由的 meta 属性,并在 HTML 模板中动态绑定类名。 ... [详细]
  • 深入理解 Oracle 存储函数:计算员工年收入
    本文介绍如何使用 Oracle 存储函数查询特定员工的年收入。我们将详细解释存储函数的创建过程,并提供完整的代码示例。 ... [详细]
  • 本文总结了2018年的关键成就,包括职业变动、购车、考取驾照等重要事件,并分享了读书、工作、家庭和朋友方面的感悟。同时,展望2019年,制定了健康、软实力提升和技术学习的具体目标。 ... [详细]
  • 在计算机技术的学习道路上,51CTO学院以其专业性和专注度给我留下了深刻印象。从2012年接触计算机到2014年开始系统学习网络技术和安全领域,51CTO学院始终是我信赖的学习平台。 ... [详细]
  • CSS 布局:液态三栏混合宽度布局
    本文介绍了如何使用 CSS 实现液态的三栏布局,其中各栏具有不同的宽度设置。通过调整容器和内容区域的属性,可以实现灵活且响应式的网页设计。 ... [详细]
  • Linux 系统启动故障排除指南:MBR 和 GRUB 问题
    本文详细介绍了 Linux 系统启动过程中常见的 MBR 扇区和 GRUB 引导程序故障及其解决方案,涵盖从备份、模拟故障到恢复的具体步骤。 ... [详细]
  • 本文介绍了如何使用jQuery根据元素的类型(如复选框)和标签名(如段落)来获取DOM对象。这有助于更高效地操作网页中的特定元素。 ... [详细]
  • 本文介绍如何在 Xcode 中使用快捷键和菜单命令对多行代码进行缩进,包括右缩进和左缩进的具体操作方法。 ... [详细]
  • 前言--页数多了以后需要指定到某一页(只做了功能,样式没有细调)html ... [详细]
  • QUIC协议:快速UDP互联网连接
    QUIC(Quick UDP Internet Connections)是谷歌开发的一种旨在提高网络性能和安全性的传输层协议。它基于UDP,并结合了TLS级别的安全性,提供了更高效、更可靠的互联网通信方式。 ... [详细]
  • 导航栏样式练习:项目实例解析
    本文详细介绍了如何创建一个具有动态效果的导航栏,包括HTML、CSS和JavaScript代码的实现,并附有详细的说明和效果图。 ... [详细]
author-avatar
华力
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有