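I am implementing a program that takes files from the user in Angular and sends them to a Node.js backend, where they are read and parsed into an array of objects.
I used JavaScript because it was the fastest way for me to implement and test this.
If you have suggestions for another language, I will consider them for the next implementation.
So far, my XML files range from 16 MB to 110 MB.
I need to be able to parse up to 3000 files of varying sizes.
All of these files have the same structure and tags, in almost the same order.
XML example: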
<start_xml_tag>
<property1>something</property1>
.
.
<property10>somethingElse</property10>
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
... 1 million rows later
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
... 2 million rows later
</start_xml_tag>
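The result is an array of objects, where each object looks like this:
{ codice: 123, nome: "andrea", stato: "italia" }
The user can set a filter beforehand: if the filter requires the field stato to equal (for example) "italia", the result should contain only the objects with stato = italia.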
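I will provide my code as soon as possible; I won't be home until tomorrow, so I don't have it with me.
Can you help me, or tell me where my thinking is wrong?
Thanks in advance!
So far I have done this by loading the files as text, which worked for smaller numbers of files, up to 400 or even 600.
Then I switched to streaming with sax to read the files in chunks, but parsing took a very long time. I am not sure whether I made a mistake: it parses correctly, but after 9 hours it was still running.
The chunk size is set to 100000 bytes; above that I get a maximum stack size error.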
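A note on the chunk-size limit and the long runtime, judging from the chunking loop in parseFile further down. `String.fromCharCode.apply(null, bytes)` passes every byte as a separate function argument, which is a common cause of the maximum stack size error for large chunks, and it also garbles multi-byte UTF-8 characters. In addition, `Uint8Array.prototype.slice` copies the data, so re-slicing the remainder on every iteration copies most of the file again for each chunk. The sketch below shows one possible alternative, assuming the files are UTF-8 encoded; `decodeInChunks` is a hypothetical helper name, not part of sax or Angular.

```typescript
// Decode a byte buffer chunk by chunk without copying it.
// Assumption: the input is UTF-8. The helper name is illustrative only.
function decodeInChunks(
  bytes: Uint8Array,
  chunkSize: number,
  onText: (text: string) => void
): void {
  const decoder = new TextDecoder("utf-8");
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray() returns a view on the same memory; slice() would copy it.
    const view = bytes.subarray(offset, offset + chunkSize);
    // stream: true keeps an incomplete multi-byte character for the next call.
    const text = decoder.decode(view, { stream: true });
    if (text.length > 0) {
      onText(text);
    }
  }
  // Flush whatever the decoder is still holding.
  const tail = decoder.decode();
  if (tail.length > 0) {
    onText(tail);
  }
}
```

In parseFile, the callback would presumably be `(text) => saxStream.write(text)`, followed by `saxStream.end()` once the function returns. Whether this alone explains the 9-hour runtime is not certain, but it removes the repeated copying and the per-byte argument list.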
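I am now trying an XPath approach to see whether I can target all the "Offerte" children with an expression, but I am new to this and I doubt it will be efficient at all.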
import { Injectable } from '@angular/core';
import { HttpClient, } from '@angular/common/http';
import * as sax from 'sax';
import { targetTags, TargetTags, startTagName, checkUnitReferenceNo, checkStatusCd } from 'src/models/util';
const BATCH_SIZE = 100;
@Injectable({
providedIn: 'root'
})
export class BatchXmlService {
constructor(private http: HttpClient) { }
private result: TargetTags[] = [];
async processFiles(files: FileList, unitReferenceNoFilter: any, statusCdFilters: any) {
let result = [];
for (let i = 0; i < files.length; i += BATCH_SIZE) {
const batch = Array.from(files).slice(i, i + BATCH_SIZE);
result.push(...await this.processBatch(batch, unitReferenceNoFilter, statusCdFilters));
}
return result;
}
async processBatch(batch: File[], unitReferenceNoFilter: any, statusCdFilters: any) {
const promises = batch.map(file => this.parseFile(file, unitReferenceNoFilter, statusCdFilters));
const parsedData = await Promise.all(promises);
return parsedData.flatMap(x=>x);
}
// Function to parse a single XML file
async parseFile(file: File, unitReferenceNoFilter: any, statusCdFilters: any): Promise<any[]> {
const strict = true;
return new Promise((resolve, reject) => {
const saxStream = sax.createStream(strict);
saxStream.on('error', (error) => {
reject(error);
});
const parsedData: any[] = []; // Array to store parsed data
let lastTag = "";
let inTargetSection = false; // Flag to track if we're within the target section
let skipParsing = false;
saxStream.on('opentag', (node) => {
if (node.name === startTagName) {
inTargetSection = true;
parsedData.push({});
}
else if (inTargetSection && !skipParsing) {
lastTag = node.name;
if(targetTags.includes(node.name)){
// Store tag name as key and content as value in the current offer
const currentOffer = parsedData[parsedData.length - 1];
if (currentOffer) {
currentOffer[node.name] = "";
}
}
}
});
saxStream.on('text', (text) => {
// Assuming you only care about text content within target tags
if(text.trim() === "") return;
if (inTargetSection && lastTag?.length > 0 && targetTags.includes(lastTag) && !skipParsing) {
const currentOffer = parsedData[parsedData.length - 1];
if(lastTag){
if(
lastTag === "UNIT_REFERENCE_NO" && !checkUnitReferenceNo(unitReferenceNoFilter, text.trim()) ||
lastTag === "STATUS_CD" && !checkStatusCd(statusCdFilters, text.trim())
){
skipParsing = true;
parsedData.pop();
return;
}
currentOffer[lastTag] = text.trim();
}
}
});
saxStream.on('closetag', (nodeName) => {
if (nodeName === startTagName) {
inTargetSection = false;
skipParsing = false;
}
});
saxStream.on('end', () => {
resolve(parsedData);
});
const reader = new FileReader();
reader.readAsArrayBuffer(file);
reader.onload = () => {
const arrayBuffer = reader.result as ArrayBuffer;
const byteArray = new Uint8Array(arrayBuffer); // Convert to Uint8Array
let remainingData = byteArray;
while (remainingData.length > 0) {
const chunkSize = Math.min(remainingData.length, 100000); // Read in chunks of up to 100000 bytes
saxStream.write(String.fromCharCode.apply(null, remainingData.slice(0, chunkSize)));
remainingData = remainingData.slice(chunkSize);
}
saxStream.end(); // Call end after processing all data
};
reader.onerror = (error) => {
reject(error); // Handle file read errors
};
});
}
}
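A side note on the service above: processBatch passes up to BATCH_SIZE (100) files to Promise.all at once, and parseFile reads each file fully into an ArrayBuffer, so in the worst case many large files are held in memory at the same time. One option worth trying is to cap how many files are parsed concurrently. A minimal sketch, assuming a hypothetical helper called `mapWithLimit` (it is not part of Angular or sax):

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the order of the input. The helper name is illustrative only.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(Math.max(1, Math.floor(limit)), items.length);
  const runners = Array.from({ length: runnerCount }, () => run());
  await Promise.all(runners);
  return results;
}
```

With this, processBatch could call something like `mapWithLimit(batch, 4, file => this.parseFile(file, unitReferenceNoFilter, statusCdFilters))`; the limit of 4 is an arbitrary starting point and would need tuning against the available memory.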
<form [formGroup]="form" class="d-flex flex-column justify-content-around w-100" (ngSubmit)="generateExcel()" style="height: 300px">
<div class="d-flex flex-column">
<label for="Filtro_UNIT_REFERENCE_NO">Filtro per UNIT_REFERENCE_NO <sup style="color: red">*</sup></label>
<div class="d-flex justify-content-between w-50">
<select class="form-control" formControlName="UNIT_REFERENCE_NO" (change)="setUnitReferenceNoFilter()">
<option *ngFor="let item of unitReferenceNoOptions" [value]="item.value">{{item.label}}</option>
</select>
</div>
</div>
<div class="d-flex flex-column">
<label for="Filtro_STATUS_CD">Filtro per STATUS_CD</label>
<div class="d-flex justify-content-between w-50">
<ng-select [items]="statusCdOptions" class="w-100"
[multiple]="true"
placeholder="Se nessuna scelta e' selezionata allora si estraggono tutte"
formControlName="STATUS_CD"
bindLabel="label"
bindValue="value">
</ng-select>
</div>
</div>
<input #myInput for="files-xml" (click)="checkFiltersInserted($event)" formControlName="files" class="form-control mx-1" type="file" class="file-upload bg-secondary-subtle w-50" (change)="onUpload($event)" required webkitdirectory multiple />
<button class="btn btn-primary w-25" type="submit" [disabled]="!isUnitReferenceNoFiltered || (loadingData && result !== [])">
<span *ngIf="loadingData" class="spinner-border spinner-border-sm" role="status" aria-hidden="true"></span>
Genera Excel
</button>
</form>
My app.ts onUpload function looks like this:
async onUpload(event: any) {
const files = event.target.files;
this.loadingData = true;
try{
this.result = await this.batchXmlService.processFiles(files, this.form.controls['UNIT_REFERENCE_NO'].value, this.form.controls['STATUS_CD'].value);
this.loadingData = false;
}
catch(error){
this.loadingData = false;
}
}
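I tried to clean up the code as much as possible; in my earlier tests it was mixed with more commented-out code.
I know this runs in Angular rather than Node.js; I am now trying to write the same logic in Node.js so that this computation is delegated to the backend.
I also forgot to mention that the second filter is an array of selections: in the XML example above, the field "nome" can have different values, all of which are included in a multi-select option list.
This filter works as follows: if the list contains ['andrea', 'francesco', 'brian'],
it filters the objects so that every result matches one of those three strings.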
Accepted XML example:
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
Discarded XML example:
<Offerte>
<codice>123</codice>
<nome>john</nome>
<stato>italia</stato>
</Offerte>
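The two filters described above can be combined into a single predicate that is applied to each record as soon as it has been parsed. This is only a sketch using the field names from the simplified XML example; `Offerta`, `OffertaFilter`, and `makeOffertaFilter` are hypothetical names, and an empty nome list is assumed to mean "accept every nome", in line with the placeholder text of the multi-select in the form.

```typescript
// Shape of one parsed record, following the simplified XML example.
interface Offerta {
  codice: string;
  nome: string;
  stato: string;
}

// Hypothetical filter settings: both parts are optional.
interface OffertaFilter {
  stato?: string;  // exact match on stato, if given
  nomi?: string[]; // allowed values for nome; empty or missing means "all"
}

// Build the predicate once so the Set is not recreated for every record.
function makeOffertaFilter(filter: OffertaFilter): (offerta: Offerta) => boolean {
  const allowedNomi =
    filter.nomi !== undefined && filter.nomi.length > 0
      ? new Set(filter.nomi)
      : null;
  return (offerta) =>
    (filter.stato === undefined || offerta.stato === filter.stato) &&
    (allowedNomi === null || allowedNomi.has(offerta.nome));
}
```

A Set lookup stays cheap however long the selection list gets, which matters when the predicate runs millions of times per file.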
Try the following PowerShell script:
using assembly System.Xml.Linq
$xmlFilename = 'c:\temp\test.xml'
$reader = [System.Xml.XmlReader]::Create($xmlFilename)
$table = [System.Collections.ArrayList]::new()
While(-not $reader.EOF)
{
if($reader.Name -ne 'Offerte')
{
$reader.ReadToFollowing('Offerte') | out-null;
}
if(-not $reader.EOF)
{
$offerte = [System.Xml.Linq.XElement][System.Xml.Linq.XElement]::ReadFrom($reader);
$codice = $offerte.Element('codice').Value
$nome = $offerte.Element('nome').Value
$stato = $offerte.Element('stato').Value
$newRow = [pscustomobject]@{
codice = $codice
nome = $nome
stato = $stato
}
$table.Add($newRow) | out-null
}
}
$table
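The script above pulls one Offerte element at a time instead of loading the whole document. Since the code in the question is TypeScript, the same record-at-a-time idea can be sketched there as well. This is a deliberately simplified illustration, not a real XML parser: `RecordSplitter` and `parseFields` are hypothetical names, and the sketch assumes that the record tag has no attributes and is never nested, and that field values contain no entities, CDATA sections, or comments. For arbitrary XML a proper parser such as sax is the safer choice.

```typescript
// Collects decoded text chunks and returns the inner XML of each complete
// <tag>...</tag> record. Assumes the tag has no attributes and is not nested.
class RecordSplitter {
  private buffer = "";
  private readonly open: string;
  private readonly close: string;

  constructor(tag: string) {
    this.open = `<${tag}>`;
    this.close = `</${tag}>`;
  }

  push(chunk: string): string[] {
    this.buffer += chunk;
    const records: string[] = [];
    for (;;) {
      const start = this.buffer.indexOf(this.open);
      if (start === -1) {
        // Keep only a tail that might be the beginning of an opening tag.
        this.buffer = this.buffer.slice(-(this.open.length - 1));
        break;
      }
      const end = this.buffer.indexOf(this.close, start + this.open.length);
      if (end === -1) {
        // Record is incomplete: drop what precedes it and wait for more data.
        this.buffer = this.buffer.slice(start);
        break;
      }
      records.push(this.buffer.slice(start + this.open.length, end));
      this.buffer = this.buffer.slice(end + this.close.length);
    }
    return records;
  }
}

// Turns "<codice>123</codice><nome>andrea</nome>" into
// { codice: "123", nome: "andrea" }. Handles flat text-only children only.
function parseFields(recordXml: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /<([A-Za-z_][\w.-]*)>([^<]*)<\/\1>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(recordXml)) !== null) {
    fields[match[1]] = match[2].trim();
  }
  return fields;
}
```

Each record returned by `push` can be parsed, filtered, and then either kept or dropped straight away, so memory use depends on the number of matching records rather than on the file size.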