What's the best way to parse data that was logged serially rather than in a tabular, JSON, etc. format?

Question · 1 vote · 2 answers

I have a set of log files that all follow basically the same format as this example (file1.text):

================================================
Running taskId=[updateFieldInTbl]
startTime: 16:03:34,580
------------------------------------------------
INFO:DBExecute: SQL=[       UPDATE tbl set field = value where thing > 0; ]

SQL: UPDATE tbl set field = value where thing > 0
Statement affected [746664] rows.
------------------------------------------------
Finished taskId=[updateFieldInTbl]
endTime: 16:06:30,571
elapsed: 00:02:55,991
failure: false
anyFailure: false
================================================
================================================
Running taskId=[calculateChecksum]
startTime: 16:06:30,571
------------------------------------------------
INFO:DBExecute: SQL=[       update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3));     ]

SQL: update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3)); 
Statement affected [9608630] rows.
================================================
=====  Greater than 5 minutes Review! ==========
================================================
------------------------------------------------
Finished taskId=[calculateChecksum]
endTime: 16:44:04,473
elapsed: 00:37:33,901
failure: false
anyFailure: false
================================================
================================================
Running taskId=[deleteMatchingChecksum]
startTime: 16:44:04,473
------------------------------------------------
INFO:DBExecute: SQL=[       delete tbl from tbl inner join other on tbl.checksum = other.checksum;  ]

SQL: delete tbl from tbl inner join other on tbl.checksum = other.checksum;
Statement affected [9276213] rows.
================================================
=====  Greater than 5 minutes Review! ==========
================================================
------------------------------------------------
Finished taskId=[deleteMatchingChecksum]
endTime: 17:49:26,817
elapsed: 01:05:22,344
failure: false
anyFailure: false
================================================
================================================
Running taskId=[deletemissinguserDataChecksum]
startTime: 17:49:26,817
------------------------------------------------
INFO:DBExecute: SQL=[       delete from tbl          where  some_id =0;  ]

SQL: delete from tbl          where  some_id =0;
Statement affected [0] rows.
------------------------------------------------
Finished taskId=[deletemissinguserDataChecksum]
endTime: 17:49:26,847
elapsed: 00:00:00,030
failure: false
anyFailure: false
================================================

I'd like to convert each one into something like this:

file1 | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure
file1 | updateFieldInTbl | 16:03:34 | 16:06:30 | 00:02:55 | 746664 | SQL=[       UPDATE tbl set field = value where thing > 0; ] | false | false
file1 | calculateChecksum | 16:06:30 | 16:44:04 | 00:37:33 | 9608630 | SQL=[       update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3));     ] | false | false
file1 | deleteMatchingChecksum | 16:44:04 | 17:49:26 | 01:05:22 | 9276213 | SQL=[       delete tbl from tbl inner join other on tbl.checksum = other.checksum;  ] | false | false

Normally, I would just have the system log to a database table in the first place, so the logs would already be in an easy-to-use format, but that isn't an option at the moment, so I have to parse the existing logs into something similarly useful.

What tool would you recommend? The goal, I think, is to build something with a bash script if possible. Any guidance on how to structure the parser would be much appreciated.

bash awk text-processing text-parsing
2 Answers

2 votes

I would suggest Awk for this:

awk 'NR==1{
         fn=substr(FILENAME,1,length(FILENAME)-5);
         print fn" | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure" 
     }
     /Running taskId/{ gsub(/^.+=\[|\]$/, ""); taskId=$0 }
     /startTime:/{ sub(/,.*/,"",$2); startTime=$2 }
     /INFO:/{ sub(/^INFO:DBExecute: /,""); info=$0 }
     / affected/{ gsub(/\[|\]/,"",$3); affected=$3 }
     /endTime/{ sub(/,.*/,"",$2); endTime=$2 }
     /elapsed/{ sub(/,.*/,"",$2); elapsed=$2 }
     /^failure/{ fail=$2 }
     /anyFailure/{ 
         printf "%s | %s | %s | %s | %s | %d | %s | %s | %s\n", 
                 fn, taskId, startTime, endTime, elapsed, affected, info, fail, $2 
     }' file1.text

Output:

file1 | taskId | startTime | endTime | elapsed | rowsAffected | Info | failure | anyFailure
file1 | updateFieldInTbl | 16:03:34 | 16:06:30 | 00:02:55 | 746664 | SQL=[       UPDATE tbl set field = value where thing > 0; ] | false | false
file1 | calculateChecksum | 16:06:30 | 16:44:04 | 00:37:33 | 9608630 | SQL=[       update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3));     ] | false | false
file1 | deleteMatchingChecksum | 16:44:04 | 17:49:26 | 01:05:22 | 9276213 | SQL=[       delete tbl from tbl inner join other on tbl.checksum = other.checksum;  ] | false | false
file1 | deletemissinguserDataChecksum | 17:49:26 | 17:49:26 | 00:00:00 | 0 | SQL=[       delete from tbl          where  some_id =0;  ] | false | false
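For reference, the core of the approach can be reduced to a self-contained sketch: one hardcoded task block piped through a stripped-down version of the patterns above. The four-column layout here is illustrative, not the full script:

```shell
#!/bin/sh
# Sketch: extract a few fields from a single task block (lines copied
# from the question) using the same regex substitutions as above.
row=$(printf '%s\n' \
  'Running taskId=[updateFieldInTbl]' \
  'startTime: 16:03:34,580' \
  'Statement affected [746664] rows.' \
  'endTime: 16:06:30,571' \
  'anyFailure: false' |
awk '
  /Running taskId/ { gsub(/^.+=\[|\]$/, ""); taskId = $0 }  # strip wrapper around the id
  /startTime:/     { sub(/,.*/, "", $2); startTime = $2 }   # drop milliseconds
  / affected/      { gsub(/\[|\]/, "", $3); rows = $3 }     # unwrap the row count
  /endTime/        { sub(/,.*/, "", $2); endTime = $2 }
  /anyFailure/     { printf "%s | %s | %s | %s", taskId, startTime, endTime, rows }
')
echo "$row"   # updateFieldInTbl | 16:03:34 | 16:06:30 | 746664
```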

1 vote

FWIW I try to avoid hard-coding specific field names. Since most input lines follow the same format, there's no need to test for every value; just handle separately the few lines that don't follow the common format:

$ cat tst.awk
BEGIN { OFS="," }

!NF || /^([^[:alpha:]]|SQL|Finished)/ { next }

{ tag = val = $0 }

/^Running/ {
    prt()
    gsub(/^[^ ]+ |=.*/,"",tag)
    gsub(/.*\[|\].*/,"",val)
}

/^Statement/ {
    tag = "rowsAffected"
    gsub(/.*\[|\].*/,"",val)
}

/^[:[:alpha:]]+: / {
    sub(/:.*/,"",tag)
    sub(/^[:[:alpha:]]+: /,"",val)
}

{
    tags[++numTags] = tag
    tag2val[tag] = val
}

END { prt() }

function prt( tag,val,tagNr) {
    if (numTags > 0) {
        if ( ++recNr == 1 ) {
            printf "\"%s\"%s", "file", OFS
            for (tagNr=1; tagNr<=numTags; tagNr++) {
                tag = tags[tagNr]
                printf "\"%s\"%s", tag, (tagNr<numTags ? OFS : ORS)
            }
        }
        printf "\"%s\"%s", FILENAME, OFS
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            val = tag2val[tag]
            gsub(/"/,"\"\"",val)
            printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
        }
    }
    delete tags
    delete tag2val
    numTags = 0
}

I also made it output CSV so you can read it into Excel or otherwise do whatever you like with it:

$ awk -f tst.awk file1
"file","taskId","startTime","INFO","rowsAffected","endTime","elapsed","failure","anyFailure"
"file1","updateFieldInTbl","16:03:34,580","SQL=[       UPDATE tbl set field = value where thing > 0; ]","746664","16:06:30,571","00:02:55,991","false","false"
"file1","calculateChecksum","16:06:30,571","SQL=[       update tbl set checksum = MD5(CONCAT_WS('',field, field2, field3));     ]","9608630","16:44:04,473","00:37:33,901","false","false"
"file1","deleteMatchingChecksum","16:44:04,473","SQL=[       delete tbl from tbl inner join other on tbl.checksum = other.checksum;  ]","9276213","17:49:26,817","01:05:22,344","false","false"
"file1","deletemissinguserDataChecksum","17:49:26,817","SQL=[       delete from tbl          where  some_id =0;  ]","0","17:49:26,847","00:00:00,030","false","false"

If you really care about the ordering, you could easily adjust it to output the field values by their specific tags rather than in numeric order.
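As a side note, the `gsub(/"/,"\"\"",val)` in `prt()` is what keeps the CSV valid when a value contains a double quote. A minimal, self-contained illustration of that escaping step (the sample string is made up):

```shell
#!/bin/sh
# Sketch: RFC 4180-style CSV escaping, as done in prt() above.
# A literal " inside a field is doubled, then the field is quote-wrapped.
field='say "hello"'
escaped=$(printf '%s' "$field" | awk '{ gsub(/"/, "\"\""); print }')
printf '"%s"\n' "$escaped"   # "say ""hello"""
```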
