How to modify a large CSV with PowerShell without using all of the server's memory

Question

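I'm using PowerShell to make some data modifications to CSV files before importing them into Oracle. I watched Resource Monitor while the process was running, and it is consuming all 20 GB of available memory on the server. One of my CSVs is about 90 MB, with nearly 200 columns and 100K rows. The generated CSV is about 120 MB. This is the code I'm currently using: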

# Process Configuration File
$path = $PSScriptRoot + "\"

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
   # Convert Date Time values that are always populated
   % {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |

   # Convert DateTime values that may be blank
   % {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
   % {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |

   # Fix Multi-Select Picklist fields, replacing value divider with "|"
   % {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |

   # Trim values that can be too long
   % {$_.product__v = ($_.product__v)[0..254] -join "";$_} |

   # Add ExtractDate Column
   Select-Object *,@{Name='Extract_Date';Expression={$date}} |

   #Export Results
   Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

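Is there a more efficient way to modify large CSV files with PowerShell than what I'm currently doing? The process takes about 10 minutes to complete. I'm by no means a PowerShell expert; I built my script from information on this site and the MS PowerShell documentation. Any suggestions would be greatly appreciated.

Here is the data for creating a sample documents.csv with a single record: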

allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,

Answer 1

PowerShell's `Import-Csv` cmdlet is a known memory hog, mainly because the `[pscustomobject]` instances it constructs are memory-hungry; see GitHub issue #7603.

There are several mitigation strategies, in ascending order of complexity:

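- In your `ForEach-Object` (`%`) script block (you should combine the separate `%` calls into one), force a garbage collection every 1000 objects to relieve memory pressure.
  - As Santiago Squarzon points out, the inefficient implementation of `ForEach-Object` (as of PowerShell 7.2.x; see GitHub issue #10982) exacerbates the problem in terms of both memory consumption and runtime.
  - See the code below, which combines periodic garbage collection with `. { process { ... } }` as a faster and more memory-friendly alternative to `ForEach-Object`.[1]
- Use a custom PowerShell `class` to represent your CSV rows, but note that this increases execution time.
  - See this answer for an example.
  - GitHub issue #8862 proposes building this functionality into `Import-Csv`, so that it constructs instances of a given type in lieu of `[pscustomobject]`.
- If the above approaches are too slow, you'll need to resort to a third-party .NET parser library, such as `CSVHelper`.
  - See this blog post, which contains comparative benchmarks with links to many libraries, and the answers to this SO question (focused on C#).
  - Unfortunately, as of v7.2.x, using general-purpose .NET NuGet packages in PowerShell is cumbersome. This answer shows what is currently required. GitHub issue #6724 asks for a future `Add-Type` improvement that supports NuGet packages directly.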

Here is a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:

# Process Configuration File
$path = $PSScriptRoot + '\'

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
    process {

      # Perform garbage collection every 1000 objects 
      # in order to relieve memory pressure.
      if (++$i % 1000 -eq 0) { [GC]::Collect() }

      # Convert Date Time values that are always populated
      $_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')

      # Convert DateTime values that may be blank
      if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
      if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }

      # Fix Multi-Select Picklist fields, replacing value divider with "|"
      $_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')

      # Trim values that can be too long
      $_.product__v = ($_.product__v)[0..254] -join ""

      # Finally add an ExtractDate Column and output the modified object
      # (-PassThru) - this obviates the need for a separate Select-Object call.
      Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
    }
  } |
  Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8
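
The repeated `Replace` chains for the picklist columns in the code above could also be collapsed into a loop over the column names. The following is only a sketch: the function name `Convert-PicklistValue` and the three-column sample row are hypothetical, and the replacement logic is copied unchanged from the script.

```powershell
# Hypothetical helper: single commas become "|", doubled commas become ",".
function Convert-PicklistValue {
    param([string] $Value)
    $Value.Replace(',,', '~comma~').Replace(',', '|').Replace('~comma~', ',')
}

# A subset of the picklist columns from the script, for illustration only.
$picklistColumns = 'clinical_data__c', 'content_category__c', 'tags__v'

# Stand-in for one row as produced by Import-Csv.
$row = [pscustomobject]@{
    clinical_data__c    = 'Immunogenicity'
    content_category__c = 'Clinical Data,Special Population'
    tags__v             = 'a,b,,c'
}

# Apply the same conversion to every listed column.
foreach ($name in $picklistColumns) {
    $row.$name = Convert-PicklistValue $row.$name
}

$row.content_category__c   # Clinical Data|Special Population
$row.tags__v               # a|b,c
```

Inside the `process` block, the same `foreach` would operate on `$_` instead of `$row`. The loop shortens the script but is not expected to change its memory behavior.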

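[1] Note that the variant `& { process { ... } }`, i.e. execution in a child scope, can speed up execution (see this answer for an explanation), but it increases memory consumption again, which is why it isn't used here.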
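
The custom-class strategy from the list above might look roughly like the following. This is a minimal sketch under stated assumptions: `DocRow` is a hypothetical class that declares only two of the CSV columns plus the added `Extract_Date`, and the input is a small in-memory sample rather than documents.csv. The cast is expected to work only if the class declares a property for every column present in the input, so a real class for this file would need close to 200 properties.

```powershell
# Hypothetical row type; a real one would declare every CSV column.
class DocRow {
    [string] $id
    [string] $tags__v
    [string] $Extract_Date
}

$date = Get-Date -Format 'yyyy-MM-dd HH:mm:ss K'

# In-memory sample standing in for Import-Csv output.
$rows = @'
id,tags__v
10000,"a,b,,c"
'@ | ConvertFrom-Csv | . {
    process {
        # The cast copies same-named properties from the [pscustomobject].
        $row = [DocRow] $_
        $row.tags__v = $row.tags__v.Replace(',,', '~comma~').Replace(',', '|').Replace('~comma~', ',')
        $row.Extract_Date = $date
        $row
    }
}

@($rows)[0].GetType().Name   # DocRow
```

Because `Extract_Date` is a regular class property here, no `Add-Member` or `Select-Object` call is needed before piping the objects to `Export-Csv`.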

Answer 2

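In desperate situations where maximum performance and flexibility are needed (but PowerShell is still required), I have had to resort to `StreamReader` and `StreamWriter` to do my own CSV processing. The following example assumes a three-column source CSV file and outputs another CSV file in which the values in the first column are uppercased and the values in the second column are lowercased: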

$infilename = Join-Path $PSScriptRoot 'documents.csv'
$outfilename = Join-Path $PSScriptRoot 'VMC_DOCUMENTS.csv'
$bufsize = 1mb
$rowsep = "`r?`n"
$fieldsep = ","

New-Item -Force -Type "file" $outfilename

$readstream = New-Object -TypeName System.IO.StreamReader -ArgumentList $infilename
$writestream = New-Object -TypeName System.IO.StreamWriter -ArgumentList $outfilename

$writestream.WriteLine($readstream.ReadLine())
$partial = ''
$continue = $true
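while ($continue) {
    [char[]]$chunk = New-Object char[] $bufsize
    $received = $readstream.Read($chunk, 0, $bufsize)
    $continue = ($received -gt 0)
    if ($continue -eq $false) {
        break
    }
    # Use only the characters actually read; joining a partially filled
    # buffer would append NUL characters to the text.
    $chunkstr = [string]::new($chunk, 0, $received)
    $lines = (($partial, $chunkstr) -join "") -split $rowsep
    $partial = $lines[-1]
    for ($i = 0; $i -lt $lines.Length - 1; $i++) {
        $row = $lines[$i] -split ($fieldsep)
        
        # Process row/fields here:
        $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep 

        $writestream.WriteLine($new)
    }
}
# Process the final row if the input does not end with a row separator.
if ($partial -ne '') {
    $row = $partial -split ($fieldsep)
    $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep
    $writestream.WriteLine($new)
}
$readstream.Close()
$writestream.Close()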

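Note that the CSV parsing is very rudimentary and assumes that there are no escape characters and no need for quoting. If required, more robust logic based on regular expressions could be used.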
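
As one possible form of such regex-based logic, a row can be split only on commas that lie outside double-quoted fields. This is a sketch, not a full CSV parser: the variable name `$fieldPattern` is hypothetical, the surrounding quotes are kept in the resulting fields, and quoted fields containing line breaks are not handled. The lookahead also rescans the rest of the line at each comma, so it may be slow on very wide rows.

```powershell
# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are not inside a quoted field.
$fieldPattern = ',(?=(?:[^"]*"[^"]*")*[^"]*$)'

# Sample row shaped like the data in the question.
$line = '"00W000000000101",,"Clinical Data,Special Population",false'
$fields = $line -split $fieldPattern

$fields.Count   # 4
$fields[2]      # "Clinical Data,Special Population"  (quotes are kept)
```

In the example above, this pattern would take the place of `$fieldsep` in the `-split` call, while the `-join` that builds the output row would still use a plain comma.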

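The above could be simplified by using `ReadLine` instead of chunk processing, but only if conventional line breaks are used. The code above allows for arbitrary row separators.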
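
A `ReadLine`-based variant of the same three-column transformation could look like this. It is a sketch under stated assumptions: the temp-file names are hypothetical and exist only to make the example self-contained, rows are assumed to end in conventional line breaks, and the field splitting is as rudimentary as in the chunked version.

```powershell
# Hypothetical temp files so that the sketch runs on its own.
$infile  = Join-Path ([System.IO.Path]::GetTempPath()) 'readline_demo_in.csv'
$outfile = Join-Path ([System.IO.Path]::GetTempPath()) 'readline_demo_out.csv'
[System.IO.File]::WriteAllLines($infile, [string[]]('c1,c2,c3', 'ab,CD,x', 'ef,GH,y'))

$reader = [System.IO.StreamReader]::new($infile)
$writer = [System.IO.StreamWriter]::new($outfile)
try {
    # Copy the header row unchanged.
    $writer.WriteLine($reader.ReadLine())
    # ReadLine returns $null once the end of the input is reached.
    while ($null -ne ($line = $reader.ReadLine())) {
        $row = $line -split ','
        $writer.WriteLine((($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join ','))
    }
}
finally {
    # Release both file handles even if a row fails to process.
    $reader.Dispose()
    $writer.Dispose()
}

Get-Content $outfile
# c1,c2,c3
# AB,cd,x
# EF,gh,y
```

Only one row is held in memory at a time, and the `try`/`finally` ensures the streams are closed if processing throws.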
