How can I parse multiple large XML files with a good balance of performance and memory usage?

Question

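I'm implementing a program that takes files from the user in Angular and sends them to a Node.js backend, where the files are read and parsed into an array of objects.

I used JavaScript because it was the fastest way for me to implement and test the program.

If you have suggestions for another language, I'll consider them for the next implementation.

So far, my XML files range from 16 MB to 110 MB.

I need to be able to parse up to 3000 files of varying sizes.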
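
With up to 3000 files of this size, how many files are held in memory at the same time matters as much as raw parser speed. A minimal sketch of a concurrency limit follows; `mapWithLimit`, `parseOne` and the limit of 2 are illustrative assumptions, not part of the original code.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two runners share an index
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical stand-in for parsing one file; it only waits briefly.
async function parseOne(name: string): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  return name.toUpperCase();
}

mapWithLimit(["a.xml", "b.xml", "c.xml"], 2, parseOne).then((names) => console.log(names));
```

Unlike fixed batches awaited with `Promise.all`, a new file starts as soon as any earlier one finishes, so one slow 110 MB file does not hold up the rest of its batch.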

All of these files have the same structure and tags, in almost the same order.

XML example:

<start_xml_tag>
   <property1>something</property1>
   .
   .
       <property10>somethingElse</property10>
  <Offerte>
   <codice>123</codice>
   <nome>andrea</nome>
   <stato>italia</stato>
  </Offerte>
  ... 1 million rows later
  <Offerte>
   <codice>123</codice>
   <nome>andrea</nome>
   <stato>italia</stato>
  </Offerte>
  ... 2 million rows later
</start_xml_tag>

The result is an array of objects, where each object looks like this:

{ codice: 123, nome: "andrea", stato: "italia" }

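The user can set a filter beforehand: if the field stato must equal (for example) "italia", the result should contain only the objects with stato = italia.

I'll provide my code as soon as possible; I won't be home until tomorrow, so I don't have it here.

Can you help me, or tell me where my thinking is wrong?

Thanks in advance!

So far I've done this by loading the files as text, which worked with a smaller number of files, up to 400 or even 600.

Then I switched to a streaming SAX parser that reads the files in chunks, but it took a very long time to finish. I'm not sure whether I made a mistake: it parses correctly, but it was still running after 9 hours.

The chunk size is set to 100000 bytes; above that I get a maximum stack size error.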
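
One likely cause of that stack error, assuming chunks are converted with `String.fromCharCode.apply` as in the service code below: `apply` passes every byte as a separate function argument, and JavaScript engines cap the argument count. `String.fromCharCode` also treats each byte as a UTF-16 code unit, so multi-byte UTF-8 characters come out mangled. Separately, `Uint8Array.prototype.slice` copies its data, so re-slicing the remainder after every 100000-byte chunk of a 110 MB file copies roughly 110e6² / (2 × 100000) ≈ 60 GB in total. A minimal sketch that avoids both problems with the standard `TextDecoder`; `decodeInChunks`, `onChunk` and the 1 MiB default are illustrative names and values.

```typescript
// Decode a large byte array chunk by chunk and hand each piece of text
// to `onChunk` (for example, a function that calls saxStream.write).
function decodeInChunks(
  bytes: Uint8Array,
  onChunk: (text: string) => void,
  chunkSize: number = 1 << 20,
): void {
  const decoder = new TextDecoder("utf-8");
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray returns a view on the same buffer; slice would copy the bytes.
    const view = bytes.subarray(offset, offset + chunkSize);
    // stream: true keeps an incomplete multi-byte sequence for the next call.
    onChunk(decoder.decode(view, { stream: true }));
  }
  // Flush anything the decoder is still holding.
  const tail = decoder.decode();
  if (tail.length > 0) onChunk(tail);
}

// Usage: an 11-byte chunk size deliberately splits the two-byte "é".
const sampleBytes = new TextEncoder().encode("<nome>andré</nome>");
const pieces: string[] = [];
decodeInChunks(sampleBytes, (text) => pieces.push(text), 11);
console.log(pieces.join(""));
```

With this approach the chunk size no longer affects the call stack, so it can be chosen for throughput alone.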

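I'm now trying an XPath approach to see whether I can target all the "Offerte" children with an expression, but I'm new to this, and I still don't think it will be very efficient.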

import { Injectable } from '@angular/core';
import { HttpClient } from '@angular/common/http';

import * as sax from 'sax';
import { targetTags, TargetTags, startTagName, checkUnitReferenceNo, checkStatusCd } from 'src/models/util';

const BATCH_SIZE = 100;

@Injectable({
  providedIn: 'root'
})
export class BatchXmlService {

  constructor(private http: HttpClient) { }

  private result: TargetTags[] = [];

  async processFiles(files: FileList, unitReferenceNoFilter: any, statusCdFilters: any) {
    let result = [];
    for (let i = 0; i < files.length; i += BATCH_SIZE) {
      const batch = Array.from(files).slice(i, i + BATCH_SIZE);
      result.push(...await this.processBatch(batch, unitReferenceNoFilter, statusCdFilters));
    }
    return result;
  }

  async processBatch(batch: File[], unitReferenceNoFilter: any, statusCdFilters: any) {
    const promises = batch.map(file => this.parseFile(file, unitReferenceNoFilter, statusCdFilters));
    const parsedData = await Promise.all(promises);
    return parsedData.flatMap(x=>x);
  }

  // Function to parse a single XML file
  async parseFile(file: File, unitReferenceNoFilter: any, statusCdFilters: any): Promise<any[]> {
    const strict = true; 

    return new Promise((resolve, reject) => {
      const saxStream = sax.createStream(strict);

      saxStream.on('error', (error) => {
        reject(error);
      });

      const parsedData: any[] = []; // Array to store parsed data
      let lastTag = "";
      let inTargetSection = false; // Flag to track if we're within the target section
      let skipParsing = false;

      saxStream.on('opentag', (node) => {
        if (node.name === startTagName) {
          inTargetSection = true;
          parsedData.push({});
        } 
        else if (inTargetSection && !skipParsing) {
          lastTag = node.name;
          if(targetTags.includes(node.name)){
            // Store tag name as key and content as value in the current offer
            const currentOffer = parsedData[parsedData.length - 1];
            if (currentOffer) {
              currentOffer[node.name] = "";
            }
          }
        }
      });

      saxStream.on('text', (text) => {
        // Assuming you only care about text content within target tags
        if(text.trim() === "") return;
        if (inTargetSection && lastTag?.length > 0 && targetTags.includes(lastTag) && !skipParsing) {
          const currentOffer = parsedData[parsedData.length - 1];
          if(lastTag){
            if(
              lastTag === "UNIT_REFERENCE_NO" && !checkUnitReferenceNo(unitReferenceNoFilter, text.trim()) ||
              lastTag === "STATUS_CD" && !checkStatusCd(statusCdFilters, text.trim())
            ){
              skipParsing = true;
              parsedData.pop();
              return;
            }

            currentOffer[lastTag] = text.trim();
          }
        }
      });

      saxStream.on('closetag', (nodeName) => {
        if (nodeName === startTagName) {
          inTargetSection = false;
          skipParsing = false;
        }
      });

      saxStream.on('end', () => {
        resolve(parsedData);
      });

      const reader = new FileReader();
      reader.readAsArrayBuffer(file);

      reader.onload = () => {
        const arrayBuffer = reader.result as ArrayBuffer;
        const byteArray = new Uint8Array(arrayBuffer); // Convert to Uint8Array

        let remainingData = byteArray;

        while (remainingData.length > 0) {
          const chunkSize = Math.min(remainingData.length, 100000); // Read in chunks of up to 100000 bytes
          saxStream.write(String.fromCharCode.apply(null, remainingData.slice(0, chunkSize)));
          remainingData = remainingData.slice(chunkSize);
        }

        saxStream.end(); // Call end after processing all data
      };

      reader.onerror = (error) => {
        reject(error); // Handle file read errors
      };
    });
  }
}

<form [formGroup]="form" class="d-flex flex-column justify-content-around w-100" (ngSubmit)="generateExcel()" style="height: 300px">

    <div class="d-flex flex-column">
        <label for="Filtro_UNIT_REFERENCE_NO">Filtro per UNIT_REFERENCE_NO <sup style="color: red">*</sup></label>
        <div class="d-flex justify-content-between w-50">
            <select class="form-control" formControlName="UNIT_REFERENCE_NO" (change)="setUnitReferenceNoFilter()">
                <option *ngFor="let item of unitReferenceNoOptions" [value]="item.value">{{item.label}}</option>
            </select>
        </div>
    </div>

    <div class="d-flex flex-column">
        <label for="Filtro_STATUS_CD">Filtro per STATUS_CD</label>
        <div class="d-flex justify-content-between w-50">
            <ng-select [items]="statusCdOptions" class="w-100"
                [multiple]="true"
                placeholder="Se nessuna scelta e' selezionata allora si estraggono tutte"
                formControlName="STATUS_CD"
                bindLabel="label" 
                bindValue="value">
            </ng-select>
        </div>
    </div>

    <input #myInput for="files-xml" (click)="checkFiltersInserted($event)" formControlName="files" class="form-control mx-1" type="file" class="file-upload bg-secondary-subtle w-50" (change)="onUpload($event)" required webkitdirectory multiple />

    <button class="btn btn-primary w-25" type="submit" [disabled]="!isUnitReferenceNoFiltered || (loadingData && result !== [])">
        <span *ngIf="loadingData" class="spinner-border spinner-border-sm" role="status" aria-hidden="true"></span> 
        Genera Excel
    </button>
</form>

My app.ts onUpload function looks like this:

async onUpload(event: any) {
  const files = event.target.files;
  this.loadingData = true;
  try {
    this.result = await this.batchXmlService.processFiles(files, this.form.controls['UNIT_REFERENCE_NO'].value, this.form.controls['STATUS_CD'].value);
    this.loadingData = false;
  } catch (error) {
    this.loadingData = false;
  }
}

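I tried to clean up the code as much as possible; my earlier tests had a lot more commented-out code mixed in.

I know this runs in Angular rather than Node.js; I'm now trying to write the same logic in Node.js to delegate this computation to the backend.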
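
Because every file shares the same flat structure, one option on the Node.js side is to scan the decoded text stream for complete `<Offerte>` blocks instead of running a full XML parser. The sketch below is not a general XML parser: it assumes the record tag has no attributes, records are not nested, and field values contain no CDATA or entities that need decoding. `createExtractor` and its parameters are hypothetical names.

```typescript
type Offerta = Record<string, string>;

// Returns a function that accepts text chunks and yields the complete
// records found so far, keeping only those accepted by `keep`.
function createExtractor(
  recordTag: string,
  fields: string[],
  keep: (record: Offerta) => boolean = () => true,
): (chunk: string) => Offerta[] {
  const open = `<${recordTag}>`;
  const close = `</${recordTag}>`;
  // Assumes field names are plain identifiers (no regex metacharacters).
  const patterns = fields.map((f) => [f, new RegExp(`<${f}>([^<]*)</${f}>`)] as const);
  let buffer = "";

  return (chunk: string): Offerta[] => {
    buffer += chunk;
    const found: Offerta[] = [];
    let pos = 0;
    for (;;) {
      const start = buffer.indexOf(open, pos);
      if (start === -1) {
        // Keep only a short tail in case an opening tag is split across chunks.
        buffer = buffer.slice(Math.max(pos, buffer.length - open.length + 1));
        return found;
      }
      const end = buffer.indexOf(close, start + open.length);
      if (end === -1) {
        buffer = buffer.slice(start); // incomplete record: wait for more data
        return found;
      }
      const body = buffer.slice(start + open.length, end);
      const record: Offerta = {};
      for (const [name, pattern] of patterns) {
        const match = pattern.exec(body);
        if (match) record[name] = match[1].trim();
      }
      if (keep(record)) found.push(record); // rejected records are never stored
      pos = end + close.length;
    }
  };
}

// Usage: feed the document in arbitrary 7-character chunks.
const xml =
  "<root><Offerte><codice>1</codice><nome>andrea</nome><stato>italia</stato></Offerte>" +
  "<Offerte><codice>2</codice><nome>john</nome><stato>francia</stato></Offerte></root>";
const extract = createExtractor("Offerte", ["codice", "nome", "stato"], (r) => r.stato === "italia");
const rows: Offerta[] = [];
for (let i = 0; i < xml.length; i += 7) rows.push(...extract(xml.slice(i, i + 7)));
console.log(rows);
```

In Node.js the chunks could come from `fs.createReadStream(path, { encoding: 'utf8' })`, which yields strings and handles characters split across chunk boundaries. Memory use then depends on the size of one record plus the records that pass the filter, not on the file size.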

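I also forgot to mention that the second filter is an array of selections. In the XML example above, the field "nome" can have different values, all of which appear in the multi-select option list.

The filter works like this: if the list contains ['andrea', 'francesco', 'brian']

it filters the objects so that every result matches one of these three strings.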

Accepted XML example:
<Offerte>
   <codice>123</codice>
   <nome>andrea</nome>
   <stato>italia</stato>
</Offerte>

Discarded XML example:
<Offerte>
   <codice>123</codice>
   <nome>john</nome>
   <stato>italia</stato>
</Offerte>
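
The two filters described above can be combined into a single predicate that is built once and applied to each record as it is parsed. This is a minimal sketch: `Offerta`, `Filters` and `buildPredicate` are illustrative names, and treating an empty selection as "accept everything" is an assumption based on the placeholder text in the form.

```typescript
interface Offerta {
  codice: string;
  nome: string;
  stato: string;
}

interface Filters {
  stato?: string;  // single value; undefined or "" means no filter (assumption)
  nomi?: string[]; // multi-select; empty means accept every name (assumption)
}

// Build the predicate once so the Set is not recreated for every record.
function buildPredicate(filters: Filters): (o: Offerta) => boolean {
  const allowedNames = new Set(filters.nomi ?? []);
  return (o) =>
    (!filters.stato || o.stato === filters.stato) &&
    (allowedNames.size === 0 || allowedNames.has(o.nome));
}

const accept = buildPredicate({ stato: "italia", nomi: ["andrea", "francesco", "brian"] });
console.log(accept({ codice: "123", nome: "andrea", stato: "italia" }));
console.log(accept({ codice: "123", nome: "john", stato: "italia" }));
```

A `Set` lookup stays constant-time as the option list grows, which matters when the check runs once per record across millions of records.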

Answer 1

Try the following PowerShell script:

using assembly System.Xml.Linq

$xmlFilename = 'c:\temp\test.xml'
$reader = [System.Xml.XmlReader]::Create($xmlFilename)
$table = [System.Collections.ArrayList]::new()

While(-not $reader.EOF)
{
    if($reader.Name -ne 'Offerte')
    {
        $reader.ReadToFollowing('Offerte') | out-null;
    }
    if(-not $reader.EOF)
    {
        $offerte = [System.Xml.Linq.XElement][System.Xml.Linq.XElement]::ReadFrom($reader);
        $codice = $offerte.Element('codice').Value
        $nome = $offerte.Element('nome').Value
        $stato = $offerte.Element('stato').Value

        $newRow = [pscustomobject]@{
            codice = $codice
            nome = $nome
            stato = $stato
        }
        $table.Add($newRow) | out-null
    }
}
$table
