How to modify a large CSV with PowerShell without using all of the server's memory

Question (votes: 0, answers: 2)

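I am using PowerShell to make some data modifications to CSV files before importing them into Oracle. I watched Resource Monitor while the process was running, and it uses up all 20 GB of available memory on the server. One of my CSVs is roughly 90 MB, with almost 200 columns and 100K rows. The generated CSV is about 120 MB. Here is the code I am currently using: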

# Process Configuration File
$path = $PSScriptRoot + "\"

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
   # Convert Date Time values that are always populated
   % {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |

   # Convert DateTime values that may be blank
   % {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
   % {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |

   # Fix Multi-Select Picklist fields, replacing value divider with "|"
   % {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |

   # Trim values that can be too long
   % {$_.product__v = ($_.product__v)[0..254] -join "";$_} |

   # Add ExtractDate Column
   Select-Object *,@{Name='Extract_Date';Expression={$date}} |

   #Export Results
   Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

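Is there a more efficient way to modify a large CSV file with PowerShell than what I am currently doing? The process takes about 10 minutes to complete. I am by no means a PowerShell expert; I built my script from information on this site and from the MS PowerShell documentation. Any suggestions would be greatly appreciated.

Below is data for creating a sample documents.csv with a single record: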

allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,
powershell csv
2 Answers
4 votes

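PowerShell's Import-Csv cmdlet is a known memory hog, mainly because the [pscustomobject] instances it constructs are memory-hungry; see GitHub issue #7603.

There are several mitigation strategies, in ascending order of complexity: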

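  • In your ForEach-Object (%) script block (you should merge the separate % calls into one), force a garbage collection every 1000 objects to relieve memory pressure.

    • As Santiago Squarzon points out, the inefficient implementation of ForEach-Object (as of PowerShell 7.2.x; see GitHub issue #10982) makes the problem worse in terms of both memory consumption and run time.

    • See the code below, which combines periodic garbage collection with . { process { ... } } as a faster and more memory-friendly alternative to ForEach-Object.[1]

  • Use a custom PowerShell class to represent your CSV rows, but note that this increases execution time.

    • See this answer for an example.

    • GitHub issue #8862 proposes building this capability into Import-Csv, so that it constructs instances of a given type instead of [pscustomobject].

  • If the approaches above are too slow, you will need to resort to a third-party .NET parser library such as CsvHelper.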
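
The custom-class strategy from the list above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked answer: the DocumentRow class, its three columns, and the temporary sample file are assumptions made for the example. A real class would need one property per CSV column, because the cast fails if a row carries a property that the class lacks.

```powershell
# Hypothetical class with one [string] property per CSV column.
# Only three columns are declared here to keep the sketch short.
class DocumentRow {
    [string] $id
    [string] $name__v
    [string] $product__v
}

# Create a small sample file (assumption: it stands in for documents.csv).
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'documentrow-sample.csv'
@'
id,name__v,product__v
10000,First document,00P000000001F36
10001,Second document,00P000000001F37
'@ | Set-Content -LiteralPath $sample -Encoding UTF8

# Cast each [pscustomobject] emitted by Import-Csv to the class.
# PowerShell initializes the instance property by property.
$rows = Import-Csv -LiteralPath $sample | . { process { [DocumentRow] $_ } }

$rows[0].GetType().Name
$rows.Count

Remove-Item -LiteralPath $sample
```

In a streaming pipeline the class instances would be modified and passed straight on to Export-Csv rather than collected in a variable; $rows is used here only to make the result easy to inspect.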


Here is a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:

# Process Configuration File
$path = $PSScriptRoot + '\'

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
    process {

      # Perform garbage collection every 1000 objects 
      # in order to relieve memory pressure.
      if (++$i % 1000 -eq 0) { [GC]::Collect() }

      # Convert Date Time values that are always populated
      $_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')

      # Convert DateTime values that may be blank
      if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
      if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }

      # Fix Multi-Select Picklist fields, replacing value divider with "|"
      $_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')

      # Trim values that can be too long
      $_.product__v = ($_.product__v)[0..254] -join ""

      # Finally add an ExtractDate Column and output the modified object
      # (-PassThru) - this obviates the need for a separate Select-Object call.
      Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
    }
  } |
  Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

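[1] Note that the variant & { process { ... } }, which executes in a child scope, can speed up execution (see this answer for an explanation), but it increases memory consumption again, which is why it is not used here.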
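
The scoping difference between the two operators can be seen in a small sketch. The variable names are made up for the illustration, and the snippet shows only the scope behavior, not the speed or memory effects mentioned in the footnote.

```powershell
# Dot-sourcing (.) runs the script block in the caller's scope,
# so the counter variable is updated directly.
$dotCount = 0
1..5 | . { process { $dotCount++ } }

# The call operator (&) runs the script block in a child scope:
# the increment creates a local copy and the caller's variable stays 0.
$callCount = 0
1..5 | & { process { $callCount++ } }

"dot-sourced: $dotCount; child scope: $callCount"
# → dot-sourced: 5; child scope: 0
```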


0 votes

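In desperate situations where maximum performance and flexibility are needed (but PowerShell is still required), I have had to resort to StreamReader and StreamWriter to do my own CSV processing. The following example assumes a three-column source CSV file and outputs another CSV file in which the values in the first column are uppercased and the values in the second column are lowercased: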

$infilename = Join-Path $PSScriptRoot 'documents.csv'
$outfilename = Join-Path $PSScriptRoot 'VMC_DOCUMENTS.csv'
$bufsize = 1mb
$rowsep = "`r?`n"
$fieldsep = ","

New-Item -Force -Type "file" $outfilename

$readstream = New-Object -TypeName System.IO.StreamReader -ArgumentList $infilename
$writestream = New-Object -TypeName System.IO.StreamWriter -ArgumentList $outfilename

$writestream.WriteLine($readstream.ReadLine())
$partial = ''
$continue = $true
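while ($continue) {
    [char[]]$chunk = New-Object char[] $bufsize
    $received = $readstream.Read($chunk, 0, $bufsize)
    $continue = ($received -gt 0)
    if ($continue -eq $false) {
        break
    }
    # Convert only the characters actually read (PowerShell 5+); a short read
    # would otherwise leave null characters from the unused buffer in the data.
    $chunkstr = [string]::new($chunk, 0, $received)
    $lines = (($partial, $chunkstr) -join "") -split $rowsep
    $partial = $lines[-1]
    for ($i = 0; $i -lt $lines.Length - 1; $i++) {
        $row = $lines[$i] -split ($fieldsep)
        
        # Process row/fields here:
        $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep 

        $writestream.WriteLine($new)
    }
}
# Process a final row that is not terminated by a row separator.
if ($partial -ne '') {
    $row = $partial -split ($fieldsep)
    $writestream.WriteLine((($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep))
}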
$readstream.Close()
$writestream.Close()

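Note that the CSV parsing is very rudimentary and assumes that no escape characters or quoting are needed. If they are, more robust logic based on regular expressions could be used.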
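
One possible shape for such regex-based logic is sketched below. It is an assumption about what "more robust" could mean, not part of the original answer: the pattern splits only on commas that lie outside double-quoted fields, and it still does not handle quoted fields that contain line breaks. The sample line and variable names are made up for the illustration.

```powershell
# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are outside a quoted field.
$fieldPattern = ',(?=(?:[^"]*"[^"]*")*[^"]*$)'

$line = '10000,"Clinical Data,Special Population",Expired'
$fields = $line -split $fieldPattern

# Strip the enclosing quotes and unescape doubled quotes ("" -> ").
$fields = $fields | ForEach-Object { $_ -replace '^"|"$', '' -replace '""', '"' }

$fields.Count
$fields[1]
```

The lookahead rescans the rest of the line for every comma, so on rows with around 200 columns this is likely to be noticeably slower than a plain split.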

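The chunk-based script could be simplified by using ReadLine instead of chunk processing, but only if conventional newlines are used. The chunk-based code allows arbitrary row separators.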
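
A ReadLine-based variant might look like the following sketch. It assumes conventional line endings and the same three-column layout; the temporary sample files and their contents are assumptions added so that the snippet runs on its own.

```powershell
# Assumption: temporary sample files stand in for the real input and output.
$infilename  = Join-Path ([System.IO.Path]::GetTempPath()) 'readline-in.csv'
$outfilename = Join-Path ([System.IO.Path]::GetTempPath()) 'readline-out.csv'
Set-Content -LiteralPath $infilename -Value 'c1,c2,c3', 'abc,DEF,Ghi', 'jkl,MNO,Pqr'

$fieldsep = ','
$readstream  = New-Object -TypeName System.IO.StreamReader -ArgumentList $infilename
$writestream = New-Object -TypeName System.IO.StreamWriter -ArgumentList $outfilename
try {
    # Copy the header row unchanged.
    $writestream.WriteLine($readstream.ReadLine())

    # ReadLine returns $null at the end of the stream.
    while ($null -ne ($line = $readstream.ReadLine())) {
        $row = $line -split $fieldsep
        $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep
        $writestream.WriteLine($new)
    }
}
finally {
    # Close both streams even if a row fails to process.
    $readstream.Close()
    $writestream.Close()
}

$result = Get-Content -LiteralPath $outfilename
$result
```

Only one line is held in memory at a time, and the try/finally block ensures the files are released if a malformed row causes an error.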
