热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

TextcleaningwithExcelVBA

1.TheProblemRecentlyIcameacrossaprojectthatmycolleaguesheavilyrelythemselfonexcel.Normally

1. The Problem

Recently I came across a project that my colleagues heavily rely themself on excel. Normally that is not a problem and I do think excel is a good tool as well.

However in the workflow we have a text column waiting to be cleaned and the logic is somehow complex. Actually we have about 10-15 if-else statements.

Algorithmically that is not a problem because it is only a FuzzBuzz question. Which means you have an array, and you can loop over the array, and check each one of the element whether it meets your condition.

If it does meet the condition, you output a label and store it in an new array. You can write as many if-else statements as you need.

But in excel that is a real problem. In the excel‘s cell you can hardly write meaningful funtion when the logic is complex. Besides, excel lacks some most commonly used functions, like string match.

Although I can use R or Python to handle the problem but my colleague‘s excel background requires an easy solution. Firstly I think maybe build a console applicaiton to do the cleaning job is a good idea, for example C# or Go. 

Somehow, I am thinking whether it is a overkill on this FuzzBuzz question and I do meet some GBK csv encoding problem in both C# and Go.

That is how I came back to VBA, if one method is stupid but works, it is not stupid after all.

 

2. Normal VBA

Press Alt+F11 we can inter Excel VBA editor. We will build our function here. Later we can link the function to a button into our spreadsheet.

I don‘t want to off topic so let‘s say we copy-paste the csv data into an excel Sheet1. (I mean it does have a method to automatically copy data from csv file into excel sheet, but we don‘t discuss in this article)

Althought in excel spreadsheet we lack of some commonly used built-in functions, but in VBA we have some good functions:

(1). Check data range
        rows: Sheet1.UsedRange.Rows.Count
  Warning: This result could easily exceed integer, so do not declare an integer to store it.
(2). Initialize a cell
  Cells(num, label) = ...
  eg: Cells(1, "G") = "Classified_Label"
(3). Data type change
  Cstr()
  eg: current = CStr(Cells(1, "D"))
  There are also other kind of C*() as a family, like CInt(), CDbl() and so on.
(4). String Detect
  InStr()
  eg: InStr(current, "Alex")
  This function is especially useful because in excel spreadsheet FIND() is too bad design to use.
(5). Substring
  Mid()
  eg: Mid(current, 2, 2)

Go back to the probelm. The For-loop with If-Else is also as normal as we will do other languages and it is much easier than write them in excel spreadsheet.

Let‘s say our original text column is "D" and we want to output result in column "G". This will make us easy to write example code.


Sub adjust()
‘ determine data range
Dim length ‘ do not decalre it as an integer, because the range could bigger than intger.
length = Sheet1.UsedRange.Rows.Count

‘ initialize column name
Cells(1, "G") = "Classified_Label"

‘ adjust data ‘ do not declare it as an integer
Dim i
For i = 2 To length
Dim current As String
current = CStr(Cells(i, "D")) ‘ CStr() means toString()
If InStr(current, "Alice") Then
Cells(i, "G") = "Alex"
ElseIf InStr(current, "Bob") Then
Cells(i, "G") = "Bobe"
‘ .... write as many as your logic need
Else
Cells(i, "G") = "Others"
EndIf
Next
End Sub

 

3. VBA with ADO  

Ado is a library of standars and functions developed by MicroSoft. Using this tool we can use SQL statement in VBA.

In real work, I occasionally meet someone‘s PC that cannot use this method. Maybe they lack of drivers or they are using bad version of excel.

But still it is good to know and be packed at our backpoctet.


Sub adjust()
Dim CNN As Object
Set CNN = CreateObject("ADODB.CONNECTION")

Dim Driver As String
Driver = "Driver={Microsoft Text Driver (*.txt; *.csv)};" &_
"DBQ=" & ThisWorkbook.Path & "\;"
CNN.Open "Provider=MSDASQL;" & Driver

Dim file As String
file = "data.csv" ‘ can use InBox() function to require any data file on disk
Dim Query As String
Query = "SELECT * FROM " & file ‘ this step can be as complex as you need
Dim RS As Object
Set RS = CreateObject("ADODB.RECORDSET")
RS = CNN.Excute(Query)
With RS
For i = 1 To .Fields.Count
Cells(1, i).Value = .Fields(i - 1).Name
Next

Range("A2").CopyFromRecordset RS
ActiveSheet.Cells.Font.Name = "宋体"
.Close
End With

CNN.Close
Set CNN = Nothing
Set RS = Nothing
End Sub

The Query can be as complex as you need like what you will do in real SQL query, but in a more ACCESS SQL way. For example, we don‘t use CASE WHEN but IIF().

For an easy example:


Query = "SELECT *, IIF(InStr(Name, "Alice"), "Alex", " & _
"IIF(InStr(Name, "Bob"), "Bobe", "Others"))"

  

4. Finale

After our function is built, we can go back to Excel spreadsheet and add an geometry object into the spreadsheet. Right click the object and link  our VBA function to the button.

 


推荐阅读
  • SvpplyTable: 实现可扩展和可折叠的菜单动画
    SvpplyTable 是一个示例项目,旨在实现类似 Svpply 应用程序中的可扩展和可折叠的菜单动画效果。该项目托管在 GitHub 上,地址为 https://github.com/liuminqian/SvpplyTable。 ... [详细]
  • 线段树,注 ... [详细]
  • Java EE 平台集成了多种服务、API 和协议,旨在支持基于 Web 的多层应用程序开发。本文将详细介绍 Java EE 中的 13 种关键技术规范,帮助开发者更好地理解和应用这些技术。 ... [详细]
  • Gty的二逼妹子序列 - 分块与莫队算法的应用
    Autumn 和 Bakser 正在研究 Gty 的妹子序列,但遇到了一个难题。他们希望计算某个区间内美丽度属于 [a, b] 的妹子的美丽度种类数。本文将详细介绍如何利用分块和莫队算法解决这一问题。 ... [详细]
  • 阿里云 Aliplayer高级功能介绍(八):安全播放
    如何保障视频内容的安全,不被盗链、非法下载和传播,阿里云视频点播已经有一套完善的机 ... [详细]
  • 本文介绍了如何在 Qt 应用程序中实现状态栏、浮动窗口(铆接部件)和中心部件。通过简单的代码示例,详细解释了每个组件的创建和设置方法。 ... [详细]
  • 本文介绍了如何使用Postman构建和发送HTTP请求,包括四个主要部分:方法(Method)、URL、头部(Headers)和主体(Body)。特别强调了Body部分的重要性,并详细说明了不同类型的请求体。 ... [详细]
  • 本文介绍了 Confluence 6 中使用的其他 Cookie,这些 Cookie 主要用于存储产品的基本持久性和用户偏好设置,以提升用户体验。 ... [详细]
  • iOS snow animation
    CTSnowAnimationView.hCTMyCtripCreatedbyalexon1614.Copyright©2016年ctrip.Allrightsreserved.# ... [详细]
  • 如何解决TS1219:实验性装饰器功能可能在未来版本中更改的问题
    本文介绍了两种方法来解决TS1219错误:通过VSCode设置启用实验性装饰器,或在项目根目录下创建配置文件(jsconfig.json或tsconfig.json)。 ... [详细]
  • packagecom.panchan.tsmese.utils;importjava.lang.reflect.ParameterizedType;importjava.lang. ... [详细]
  • malloc 是 C 语言中的一个标准库函数,全称为 memory allocation,即动态内存分配。它用于在程序运行时申请一块指定大小的连续内存区域,并返回该区域的起始地址。当无法预先确定内存的具体位置时,可以通过 malloc 动态分配内存。 ... [详细]
  • 本文详细介绍了Linux系统中用于管理IPC(Inter-Process Communication)资源的两个重要命令:ipcs和ipcrm。通过这些命令,用户可以查看和删除系统中的消息队列、共享内存和信号量。 ... [详细]
  • 本文章提供了适用于 Cacti 的多核 CPU 监控模板,支持 2、4、8、12、16、24 和 32 核配置。请注意,0.87g 版本的 Cacti 需要手动修改哈希值为 0021 才能使用,而 0.88 及以上版本则可直接导入。 ... [详细]
  • 本文将详细介绍如何在 Vue 项目中使用 Handsontable 插件,包括 npm 安装、基本配置和常用功能的实现。 ... [详细]
author-avatar
华力
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有