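I'm using PowerShell to make some data modifications to CSV files before importing them into Oracle. I watched Resource Monitor while the process was running, and it consumes all 20 GB of available memory on the server. One of my CSVs is about 90 MB, with almost 200 columns and 100K rows. The generated CSV is about 120 MB. Here is the code I'm currently using: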
# Process Configuration File
$path = $PSScriptRoot + "\"
#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
# Convert Date Time values that are always populated
% {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
% {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
% {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
# Convert DateTime values that may be blank
% {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
% {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
# Fix Multi-Select Picklist fields, replacing value divider with "|"
% {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
# Trim values that can be too long
% {$_.product__v = ($_.product__v)[0..254] -join "";$_} |
# Add ExtractDate Column
Select-Object *,@{Name='Extract_Date';Expression={$date}} |
#Export Results
Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8
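Is there a more efficient way to modify large CSV files with PowerShell than what I'm currently doing? The process takes about 10 minutes to complete. I'm by no means a PowerShell expert; I built my script from information on this site and from the MS PowerShell documentation. Any suggestions would be greatly appreciated.
Here is the data for creating a sample documents.csv with a single record: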
allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,
The Import-Csv cmdlet is a known memory hog, primarily because the [pscustomobject] instances it constructs are memory-hungry; see GitHub issue #7603.
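There are several mitigation strategies, in ascending order of complexity:
1. In the ForEach-Object (%) script block (you should combine your separate % calls into one), force a garbage collection every 1000 objects to relieve memory pressure.
As Santiago Squarzon points out, the inefficient implementation of ForEach-Object (as of PowerShell 7.2.x, see GitHub issue #10982) exacerbates the problem, both in memory consumption and in run time.
See the code below, which combines periodic garbage collection with . { process { ... } } as a faster and more memory-friendly alternative to ForEach-Object.[1]
2. Use a custom class to represent your CSV rows, but note that this increases execution time.
See this answer for an example.
GitHub issue #8862 proposes building this functionality into Import-Csv, so that it constructs instances of a given type in lieu of [pscustomobject].
3. If the above approaches are too slow, you'll need to resort to a third-party .NET parser library such as CsvHelper.
Unfortunately, as of v7.2.x, using general-purpose .NET NuGet packages in PowerShell is cumbersome. This answer shows what is currently required. GitHub issue #6724 asks for a future Add-Type enhancement that supports NuGet packages directly.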
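As a rough illustration of the custom-class strategy, the sketch below maps each parsed row onto a small class. The class name DocRow, the choice of three columns, and the in-memory sample data are assumptions made for brevity; a real class would declare every column that is needed in the output.

```powershell
# Hypothetical class holding only three of the ~200 columns, for illustration.
class DocRow {
    [string] $id
    [string] $tags__v
    [string] $product__v
}

# Small in-memory stand-in for documents.csv (assumed sample data).
$csvText = @'
id,tags__v,product__v
10000,"a,b",P1
10001,"c",P2
'@

# Copy the needed properties of each [pscustomobject] into a class instance,
# so that the heavier [pscustomobject] can be released right away.
$rows = $csvText | ConvertFrom-Csv | ForEach-Object {
    $r = [DocRow]::new()
    $r.id         = $_.id
    $r.tags__v    = $_.tags__v
    $r.product__v = $_.product__v
    $r   # emit the class instance
}

$rows
```

The class instances can be piped to Export-Csv just like the original objects; only the properties declared on the class end up as columns.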
Here is a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:
# Process Configuration File
$path = $PSScriptRoot + '\'
#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"
# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
process {
# Perform garbage collection every 1000 objects
# in order to relieve memory pressure.
if (++$i % 1000 -eq 0) { [GC]::Collect() }
# Convert Date Time values that are always populated
$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
# Convert DateTime values that may be blank
if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
# Fix Multi-Select Picklist fields, replacing value divider with "|"
$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
$_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
# Trim values that can be too long
$_.product__v = ($_.product__v)[0..254] -join ""
# Finally add an ExtractDate Column and output the modified object
# (-PassThru) - this obviates the need for a separate Select-Object call.
Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
}
} |
Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8
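[1] Note that the variant & { process { ... } }, i.e. execution in a child scope, can speed up execution (see this answer for an explanation), but it increases memory consumption again, which is why it isn't used here.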
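The nineteen near-identical picklist lines in the script could probably be collapsed into a loop over column names. The following is only a sketch of that idea: the helper name Convert-Picklist is hypothetical, the column list is shortened, and a hand-built sample object stands in for one row emitted by Import-Csv.

```powershell
# Column names taken from the script above (shortened list for illustration).
$picklistColumns = 'clinical_data__c', 'content_category__c', 'tags__v'

# Hypothetical helper: a doubled comma is a literal comma, a single comma is
# a value divider that gets replaced with "|".
function Convert-Picklist([string] $Value) {
    $Value.Replace(',,', '~comma~').Replace(',', '|').Replace('~comma~', ',')
}

# Sample object standing in for one row emitted by Import-Csv.
$row = [pscustomobject] @{
    clinical_data__c    = 'Immunogenicity'
    content_category__c = 'Clinical Data,Special Population'
    tags__v             = 'a,,b,c'
}

# Apply the same transformation to every picklist column.
foreach ($name in $picklistColumns) {
    $row.$name = Convert-Picklist $row.$name
}

$row
```

Inside the process block, $_ would take the place of $row. Whether the extra function call per field is measurably slower than the inlined Replace chains on 100K rows has not been tested here.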
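In desperate situations that call for maximum performance and flexibility (but where PowerShell is still required), I've had to resort to doing my own CSV processing with StreamReader and StreamWriter. The following example assumes a three-column source CSV file and writes another CSV file in which the values in the first column are uppercased and those in the second column are lowercased: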
$infilename = Join-Path $PSScriptRoot 'documents.csv'
$outfilename = Join-Path $PSScriptRoot 'VMC_DOCUMENTS.csv'
$bufsize = 1mb
$rowsep = "`r?`n"
$fieldsep = ","
New-Item -Force -Type "file" $outfilename
$readstream = New-Object -TypeName System.IO.StreamReader -ArgumentList $infilename
$writestream = New-Object -TypeName System.IO.StreamWriter -ArgumentList $outfilename
$writestream.WriteLine($readstream.ReadLine())
$partial = ''
$continue = $true
while ($continue) {
[char[]]$chunk = New-Object char[] $bufsize
$received = $readstream.Read($chunk, 0, $bufsize)
$continue = ($received -gt 0)
if ($continue -eq $false) {
break
}
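# Convert only the characters actually read; the rest of the buffer is unused.
$chunkstr = [string]::new($chunk, 0, $received)
$lines = (($partial, $chunkstr) -join "") -split $rowsep
$partial = $lines[-1]
for ($i = 0; $i -lt $lines.Length - 1; $i++) {
$row = $lines[$i] -split ($fieldsep)
# Process row/fields here:
$new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep
$writestream.WriteLine($new)
}
}
# Process the last row if the file does not end with a row separator.
if ($partial) {
$row = $partial -split ($fieldsep)
$writestream.WriteLine((($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep))
}
$readstream.Close()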
$writestream.Close()
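Note that the CSV parsing is very rudimentary and assumes that there are no escaped characters and that no quoting is needed. If that is required, more robust logic based on regular expressions could be used.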
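One possible shape for such regex-based logic is sketched below. It is an assumption about what "more robust" could mean here, not part of the code above: the pattern splits only on commas that are followed by an even number of double quotes, which keeps commas inside quoted fields intact. It still assumes that no field contains an embedded line break.

```powershell
# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are outside of a quoted field.
$fieldPattern = ',(?=(?:[^"]*"[^"]*")*[^"]*$)'

# Sample line with a quoted field that contains a comma.
$line = '10000,"Clinical Data,Special Population",Expired'
$fields = $line -split $fieldPattern

# Strip the enclosing quotes and unescape doubled quotes.
$fields = $fields | ForEach-Object { $_ -replace '^"|"$', '' -replace '""', '"' }

$fields
```

The lookahead rescans the remainder of the line for every comma, so on very wide rows this is likely to cost a noticeable share of the performance gained by hand-rolled parsing.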
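The chunk-based code above could be simplified by using ReadLine instead of chunked processing, but only if conventional newlines are used. The chunk-based code allows for arbitrary row separators.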
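For files with conventional line endings, a ReadLine-based variant might look like the following sketch. The temporary file names and the three-row sample input are assumptions that keep the example self-contained; the per-row transformation is the same as in the chunk-based code.

```powershell
# Assumed file names in the temp directory; a small sample input is created
# so that the sketch can run on its own.
$infilename  = Join-Path ([System.IO.Path]::GetTempPath()) 'readline_in.csv'
$outfilename = Join-Path ([System.IO.Path]::GetTempPath()) 'readline_out.csv'
Set-Content -Path $infilename -Value 'c1,c2,c3', 'abc,DEF,x', 'ghi,JKL,y'

$reader = New-Object System.IO.StreamReader -ArgumentList $infilename
$writer = New-Object System.IO.StreamWriter -ArgumentList $outfilename
try {
    # Copy the header row unchanged.
    $writer.WriteLine($reader.ReadLine())
    # ReadLine returns $null at the end of the file.
    while ($null -ne ($line = $reader.ReadLine())) {
        $row = $line -split ','
        $writer.WriteLine(($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join ',')
    }
}
finally {
    # Close both streams even if a row fails to process.
    $reader.Close()
    $writer.Close()
}

$result = Get-Content $outfilename
$result
```

Only one line is held in memory at a time, and the last row is handled whether or not the file ends with a newline.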