我正在 VBA 中处理非常大的(45,000,000+ 个字符)字符串,我需要删除多余的 空白。
一个空格(又名,ASCII Code 32)可以,但任何具有两个或多个连续空格的部分应减少到只有一个。
我发现了一个类似的问题here,尽管OP对“非常长的字符串”的定义是只有39,000个字符。接受的答案是使用
Replace
: 的循环
Function MyTrim(s As String) As String
Do While InStr(s, " ") > 0
s = Replace$(s, " ", " ")
Loop
MyTrim = Trim$(s)
End Function
我尝试了这个方法,它“有效”,但速度痛苦慢:
Len In: 44930886 Len Out: 35322469 Runtime: 247.6 seconds
有没有更快的方法来从“非常大”的字符串中删除空格?
我怀疑性能问题是由于创建了大量的大型中间字符串造成的。因此,任何无需创建中间字符串或更少的中间字符串即可执行操作的方法都会表现更好。
正则表达式替换很有可能实现这一点。
Option Explicit
Sub Test(ByVal text As String)
Static Regex As Object
If Regex Is Nothing Then
Set Regex = CreateObject("VBScript.RegExp")
Regex.Global = True
Regex.MultiLine = True
End If
Regex.Pattern = " +" ' space, one or more times
Dim result As String: result = Regex.Replace(text, " ")
Debug.Print Len(result), Left(result, 20)
End Sub
输入 4500 万个字符的字符串大约需要一秒钟。
跑步者:
Sub Main()
Const ForReading As Integer = 1
Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
Dim text As String: text = file.ReadAll()
Set file = Nothing
Set fso = Nothing
Debug.Print Len(text), Left(text, 20)
Test (text)
End Sub
测试数据创建器(C#):
var substring = "××\n× ×× ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);
顺便说一句——由于 VBA(VB4、Java、JavaScript、C#、VB...)使用 UTF-16,因此空格字符是一个 UTF-16 代码单元
ChrW(32)
。 (任何与 ASCII 的相似或比较都是不必要的脑力体操,如果将其作为 ANSI [Chr(32)
] 写入代码,则会在幕后进行不必要的转换,并且对于不同的机器、用户和时间会有不同的行为。)
在 VBA 中,
String
的大小限制为大约 20 亿个字符。上面的“Replace
-Loop
”方法对于4500万个字符串花了247秒,超过4分钟。
理论上,这意味着 20 亿个字符串至少需要 3 个小时——如果它甚至没有崩溃的话——所以它并不完全实用。
Trim
,它与 VBA 的 Trim
函数不一样。
工作表功能
Trim
删除文本中的所有空格(单词之间的单个空格除外)。
问题是
Trim
,像所有用Application.WorksheetFunction
调用的函数一样,有32,767个字符的大小限制,并且[不幸的是]这也适用即使在使用字符串从VBA调用函数时那甚至不在牢房中。
但是,如果我们使用它来循环遍历我们的“巨大字符串”,我们仍然可以使用该函数,如下所示:
编辑:甚至不用理会这些废话(我的函数,在下面)!参见RegEx答案上面。
Function bigTrim(strIn As String) As String Const maxLen = 32766 Dim loops As Long, x As Long loops = Int(Len(strIn) / maxLen) If (Len(strIn) / maxLen) <> loops Then loops = loops + 1 For x = 1 To loops bigTrim = bigTrim & _ Application.WorksheetFunction.Trim(Mid(strIn, _ ((x - 1) * maxLen) + 1, maxLen)) Next x End Function
在与“
Replace
-Loop
”方法使用的同一字符串上运行此函数会产生更多更好的结果:
Len In: 44930886 Len Out: 35321845 Runtime: 33.6 seconds
这比“
Replace
-Loop
”方法快了 7 倍以上,and 成功删除了其他方法遗漏的 624 个空格。
(我想调查为什么第一种方法丢失了字符,但因为我知道我的字符串没有丢失任何东西,而且这个练习的目的是为了节省时间,那太愚蠢了!)☺
这个问题比答案看起来要有趣得多,因为OP提出的解决方案应该没有任何问题,因为理论上该算法非常高效。
事实证明,这里的问题是 VBA 内置
Replace
函数在底层的实现不佳,这导致它在大量替换的大字符串上完全卡住。
可以轻松地手动实现具有线性运行时间的
Replace
函数,其性能远远优于大字符串的内置函数。这里提供了此类实现的示例:
'Works like the inbuilt 'Replace', but only allocates the buffer once and is
'therefore much, much faster on large strings with many replacements
'This function is the renamed function `ReplaceFast` from here:
'https://github.com/guwidoe/VBA-StringTools
'Note that this implementation is slightly slower than the inbuilt 'Replace'
'function for short strings with few replacements
Public Function Replace(ByRef str As String, _
ByRef sFind As String, _
ByRef sReplace As String, _
Optional ByVal lStart As Long = 1, _
Optional ByVal lCount As Long = -1, _
Optional ByVal lCompare As VbCompareMethod _
= vbBinaryCompare) As String
Const methodName As String = "Replace"
If lStart < 1 Then Err.Raise 5, methodName, _
"Argument 'lStart' = " & lStart & " < 1, invalid"
If lCount < -1 Then Err.Raise 5, methodName, _
"Argument 'lCount' = " & lCount & " < -1, invalid"
lCount = lCount And &H7FFFFFFF
If Len(str) = 0 Or Len(sFind) = 0 Then
Replace = Mid$(str, lStart)
Exit Function
End If
Dim lenFind As Long: lenFind = Len(sFind)
Dim lenReplace As Long: lenReplace = Len(sReplace)
Dim bufferSizeChange As Long
bufferSizeChange = CountSubstring(str, sFind, lStart, lCount, lCompare) _
* (lenReplace - lenFind) - lStart + 1
If Len(str) + bufferSizeChange < 0 Then Exit Function
Replace = Space$(Len(str) + bufferSizeChange)
Dim i As Long: i = InStr(lStart, str, sFind, lCompare)
Dim j As Long: j = 1
Dim lastOccurrence As Long: lastOccurrence = lStart
Dim count As Long: count = 1
Do Until i = 0 Or count > lCount
Dim diff As Long: diff = i - lastOccurrence
If diff > 0 Then _
Mid$(Replace, j, diff) = Mid$(str, lastOccurrence, diff)
j = j + diff
If lenReplace <> 0 Then
Mid$(Replace, j, lenReplace) = sReplace
j = j + lenReplace
End If
count = count + 1
lastOccurrence = i + lenFind
i = InStr(lastOccurrence, str, sFind, lCompare)
Loop
If j <= Len(Replace) Then Mid$(Replace, j) = Mid$(str, lastOccurrence)
End Function
Public Function CountSubstring(ByRef str As String, _
ByRef subStr As String, _
Optional ByVal lStart As Long = 1, _
Optional ByVal lLimit As Long = -1, _
Optional ByVal lCompare As VbCompareMethod _
= vbBinaryCompare) As Long
Const methodName As String = "CountSubstring"
If lStart < 1 Then Err.Raise 5, methodName, _
"Argument 'Start' = " & lStart & " < 1, invalid"
If lLimit < -1 Then Err.Raise 5, methodName, _
"Argument 'lLimit' = " & lLimit & " < -1, invalid"
If subStr = vbNullString Then Exit Function
Dim lenSubStr As Long: lenSubStr = Len(subStr)
Dim i As Long: i = InStr(lStart, str, subStr, lCompare)
CountSubstring = 0
Do Until i = 0 Or lLimit = CountSubstring
CountSubstring = CountSubstring + 1
i = InStr(i + lenSubStr, str, subStr, lCompare)
Loop
End Function
只需将此代码粘贴到项目中即可修复原始代码的性能问题,而无需对其进行任何更改,只需覆盖内置的
Replace
函数即可。
在我的测试中,当改进的
Replace
函数存在时,原始代码应该只需要大约 1.5 秒来处理类似于 OP 示例的字符串,改进超过 100 倍!:
Sub DemoMyTrim()
Const LEN_INPUT_STR = 45000000
Dim inputStr As String: inputStr = RepeatString(" aaa", LEN_INPUT_STR / 5)
Dim t As Single: t = Timer()
Dim outStr As String: outStr = MyTrim(inputStr)
Debug.Print "Trimming took " & Timer() - t & " seconds."
Debug.Print "Len Out: " & Len(outStr)
End Sub
''RepeatString' function source:
'https://github.com/guwidoe/VBA-StringTools
Private Function RepeatString(ByRef str As String, _
Optional ByVal repeatTimes As Long = 2) As String
Const methodName As String = "RepeatString"
If repeatTimes < 0 Then Err.Raise 5, methodName, _
"Argument 'repeatTimes' = " & repeatTimes & " < 0, invalid"
If repeatTimes = 0 Then Exit Function
If LenB(str) = 2 Then
RepeatString = String$(repeatTimes, str)
Exit Function
End If
Dim newLength As Long: newLength = LenB(str) * repeatTimes
RepeatString = Space$((newLength + 1) \ 2)
If newLength Mod 2 = 1 Then RepeatString = MidB$(RepeatString, 2)
MidB$(RepeatString, 1) = str
If repeatTimes > 1 Then MidB$(RepeatString, LenB(str) + 1) = RepeatString
End Function
Public Function MyTrim(ByRef s As String) As String
MyTrim = s
Do While InStr(MyTrim, " ") > 0
MyTrim = Replace(MyTrim, " ", " ")
Loop
End Function
虽然这非常有趣,但它仍然比已接受的答案提出的正则表达式解决方案慢。
由于接受的答案使用 Mac 上不可用的正则表达式,我想提出另一种替代方案,它比原始算法更快,具有改进的
Replace
函数,并且仍然可以使用在任何平台上可用的 VBA 内置函数。
这可以通过 LibStringTools 库中的另一个函数实现:
'Replaces consecutive occurrences of 'substring' that repeat more than 'limit'
'times with exactly 'limit' consecutive occurrences
'Source:
'https://github.com/guwidoe/VBA-StringTools
Public Function LimitConsecutiveSubstringRepetition( _
ByRef str As String, _
Optional ByRef subStr As String = vbNewLine, _
Optional ByVal limit As Long = 1, _
Optional ByVal Compare As VbCompareMethod _
= vbBinaryCompare) _
As String
Const methodName As String = "LimitConsecutiveSubstringRepetition"
If limit < 0 Then Err.Raise 5, methodName, _
"Argument 'limit' = " & limit & " < 0, invalid"
If limit = 0 Then
LimitConsecutiveSubstringRepetition = Replace(str, subStr, _
vbNullString, , , Compare)
Exit Function
Else
LimitConsecutiveSubstringRepetition = str
End If
If Len(str) = 0 Then Exit Function
If Len(subStr) = 0 Then Exit Function
Dim i As Long: i = InStr(1, str, subStr, Compare)
Dim j As Long: j = 1
Dim lenSubStr As Long: lenSubStr = Len(subStr)
Dim lastOccurrence As Long: lastOccurrence = 1 - lenSubStr
Dim copyChunkSize As Long
Dim consecutiveCount As Long
Dim occurrenceDiff As Long
Do Until i = 0
occurrenceDiff = i - lastOccurrence
If occurrenceDiff = lenSubStr Then
consecutiveCount = consecutiveCount + 1
If consecutiveCount <= limit Then
copyChunkSize = copyChunkSize + occurrenceDiff
ElseIf consecutiveCount = limit + 1 Then
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, i - copyChunkSize, copyChunkSize)
j = j + copyChunkSize
copyChunkSize = 0
End If
Else
copyChunkSize = copyChunkSize + occurrenceDiff
consecutiveCount = 1
End If
lastOccurrence = i
i = InStr(i + lenSubStr, str, subStr, Compare)
Loop
copyChunkSize = copyChunkSize + Len(str) - lastOccurrence - lenSubStr + 1
Mid$(LimitConsecutiveSubstringRepetition, j, copyChunkSize) = _
Mid$(str, Len(str) - copyChunkSize + 1)
LimitConsecutiveSubstringRepetition = _
Left$(LimitConsecutiveSubstringRepetition, j + copyChunkSize - 1)
End Function
Public Function CountSubstring(ByRef str As String, _
ByRef subStr As String, _
Optional ByVal lStart As Long = 1, _
Optional ByVal lLimit As Long = -1, _
Optional ByVal lCompare As VbCompareMethod _
= vbBinaryCompare) As Long
Const methodName As String = "CountSubstring"
If lStart < 1 Then Err.Raise 5, methodName, _
"Argument 'Start' = " & lStart & " < 1, invalid"
If lLimit < -1 Then Err.Raise 5, methodName, _
"Argument 'lLimit' = " & lLimit & " < -1, invalid"
If subStr = vbNullString Then Exit Function
Dim lenSubStr As Long: lenSubStr = Len(subStr)
Dim i As Long: i = InStr(lStart, str, subStr, lCompare)
CountSubstring = 0
Do Until i = 0 Or lLimit = CountSubstring
CountSubstring = CountSubstring + 1
i = InStr(i + lenSubStr, str, subStr, lCompare)
Loop
End Function
使用此功能,可以达到预期的效果,如下:
Dim inputStr as String
'... somehow populate input string
dim outStr as String
outStr = LimitConsecutiveSubstringRepetition(inputStr, " ", 1)