I have a non-delimited (fixed-width) text file with about 1 million lines.
Sample lines:
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 [email protected] VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
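Every line starts with a row type digit ("1", "2" or "3"). On each of these lines I have to insert delimiters based on character counts, i.e. at positions 0-1, 1-20, 21-25 and so on.
How can I do this with a Linux script?
Desired output: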
1|YBL LOYALTY EXT |10001|01172019|001
2|00010010100001151|2753|184907301010614199100919699034659 |[email protected] |VIDYA SAGAR |CROSS |BANDRA |WM |DELHI |456471
3|000000027
I tried this command:
perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" } if(/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' filename
Input lines:
1YBL LOYALTY EXT 1000112102018001
2000100101000002631653184911501010111199100919323739251 [email protected] VIJAY PANDEY PART OF GROUND FLOOR & BASEMENT SHOPPER STOP SV ROAD ANDHERI WEST LANDMARK-ERSTWHILE CRASSWORD BOOK STORE MUMBAI 400058
2000100101000019920453184964321010513199000919878857482 [email protected] MOHAMAD MAQSHUD MASTER H COLLECTION NEW SHIVPURI GALI NO 1 NEAR MAKHAN SINGH CHOWK LUDHIANA 141008
2000100101000023500853184923441010913197300919375580888 [email protected] JAYANTIBHAI TADA 44 KHODIYAR NAGAR B S ABHISHEK SUDAMA CHOWK KHODIYARNAGAR MOTA VARACHHA SURAT 395006
3000000066
Expected output:
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |[email protected] |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |[email protected] |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |[email protected] |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
But this is what I get (note the extra "1|41008|" and "3|95006" lines):
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |[email protected] |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |[email protected] |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
1|41008|
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |[email protected] |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|95006
3|000000066
You can try Perl:
perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' input_file
With the given input:
$ cat rahman.txt
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 [email protected] VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 [email protected] VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$
Just add more entries to @x as needed, e.g. change @x=(1,17,4) to @x=(1,17,4,10,20).
EDIT1
To also add delimiters to the fields that can be split on whitespace, use the following:
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } s/\S+\s+\K/|/g }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |[email protected] |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
$
Explanation
perl -lpe        # -p prints $_ automatically after the code has run for each line, so a line
                 # that does not start with 2 is still printed unchanged after the if statement;
                 # -l strips the newline on input and adds it back on output
' if(/^2/) # if - select line that starts with 2. $_ will have the current line
{
@x=(1,17,4); # x is an array to hold the widths of fields. - 1, 17, 4
for $i (@x) # open for loop to loop through the array x
{
s/(.{$i})//; # no variable is specified, so the substitution acts on the $_ i.e current line
# first instance is s/(.{1})// => match one character and store it in $1 capturing variable
# replace the captured part with nothing and update $_
# e.g if the line is "200010010100001151" .. loop one will capture "2" and $_ becomes "00010010100001151"
# loop 2 => s/(.{17})// matches 17 character and $1 stores "00010010100001151"
printf("%s|",$1) # print $1 along with delimiter pipe
} # end of for loop
} # end of if
# here is default print statement in perl that will print the $_ after all modification
' input_file
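For readers who find the substitution loop hard to follow, the same "peel off one field at a time" idea can be sketched in Python. This is only an illustration, not part of the Perl answer: the function name split_prefix is made up here, and unlike the regex (which fails to match on a line shorter than the requested width) Python slicing silently returns a shorter field.

```python
def split_prefix(line, widths):
    """Cut the leading fixed-width fields off `line`, mirroring the s/(.{$i})// loop."""
    fields = []
    for width in widths:
        fields.append(line[:width])   # the part Perl captures in $1
        line = line[width:]           # what is left over, i.e. the updated $_
    return fields, line


fields, rest = split_prefix("200010010100001151275318490730", [1, 17, 4])
print("|".join(fields + [rest]))      # 2|00010010100001151|2753|18490730
```

The remainder is returned separately so that the caller can decide how to treat it, just as the Perl version leaves it in $_ for the default print.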
Aaditi
With your input I get the following result. It works fine. What problem are you seeing?
$ perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" } if(/^3/) { @x=(1); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" }' rahman.txt
1|YBL LOYALTY EXT |10001|01172019|001
2|0001001010000115127|531849|0730|101|06141991|00919699034659 |[email protected] VID|YA SAGAR CRO|SS BAN|DRA WM | DEL|HI 456|471
3|000000027
$
Aadita:
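Found the problem: $_ is modified inside the loop, so at the end of the if(/^2/) block $_ holds the leftover value "141008". That value then satisfies the next condition, if(/^1/), so that block runs as well. To avoid this, copy $_ into a $line variable at the start and test $line against /^2/, /^1/ and /^3/ in the separate if blocks.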
$ perl -lne '$line=$_; if($line=~/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' rahman2.txt
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |[email protected] |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |[email protected] |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |[email protected] |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
$
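The fix above boils down to choosing the layout from the untouched line and never from the partly consumed remainder. A minimal Python sketch of that idea follows; the width lists are copied from the command above, while the names WIDTHS and delimit are hypothetical and chosen only for this example.

```python
# Field widths per row type, taken from the Perl command above.
WIDTHS = {
    "1": [1, 16, 5, 8],
    "2": [1, 19, 6, 4, 3, 8, 20, 60, 40, 40, 40, 40, 30],
    "3": [1],
}


def delimit(line, widths_by_type=WIDTHS, sep="|"):
    """Insert `sep` after each fixed-width field; the row type is read once, up front."""
    widths = widths_by_type.get(line[:1])   # decided from the original line only
    if widths is None:
        return line                          # unknown row type: leave the line alone
    out, pos = [], 0
    for width in widths:
        out.append(line[pos:pos + width])
        pos += width
    out.append(line[pos:])                   # whatever follows the listed fields
    return sep.join(out)


print(delimit("3000000066"))                 # 3|000000066
```

Because the row type is looked up exactly once per line, a leftover such as "141008" can never be mistaken for a type 1 record.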
With GNU awk, using FIELDWIDTHS:
$ awk -v FIELDWIDTHS='1 17 4 *' -v OFS='|' '/^2/{$1=$1; gsub(/\s+/,"&"OFS)} 1' file
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |[email protected] |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
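The FIELDWIDTHS setting above says the input should be split into 4 fields: 1 character, 17 characters, 4 characters, and then the rest of the line.
Assigning a value to a field makes awk rebuild the record, replacing the input field separators with the value of OFS, so $1=$1 inserts a | between each of the fields described by FIELDWIDTHS.
After that, all of the remaining whitespace-separated text still needs field separators, so gsub() appends OFS after every run of whitespace.
Older versions of gawk do not support * meaning "the rest of the line". If you have one of those, just replace * with a large value such as 99999.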
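The two steps described here (cut the fixed-width prefix, then append the delimiter after each run of whitespace) can also be sketched in Python for comparison. This is an illustrative equivalent under the same assumptions as the gawk command (only type 2 lines are changed, widths 1, 17 and 4); the function name delimit_type2 is invented for this sketch.

```python
import re


def delimit_type2(line, widths=(1, 17, 4), sep="|"):
    """Fixed-width split of the prefix, then `sep` after every whitespace run."""
    if not line.startswith("2"):
        return line                          # other row types pass through unchanged
    out, pos = [], 0
    for width in widths:
        out.append(line[pos:pos + width])
        pos += width
    # Same effect as gsub(/\s+/, "&" OFS): keep the whitespace and append sep.
    rest = re.sub(r"\s+", lambda m: m.group(0) + sep, line[pos:])
    return sep.join(out + [rest])


print(delimit_type2("2000100101000011512753184907 VIDYA SAGAR DELHI 456471"))
# 2|00010010100001151|2753|184907 |VIDYA |SAGAR |DELHI |456471
```

As with the gawk version, multi-word values such as a name are split at every space, which may or may not be what the target format wants.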
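There are delimiters in your file, you just cannot see them: they are spaces or tabs. So you only need to replace those, using a sed s/xxx/|/g command (by xxx I mean a space or TAB character). If you are unsure whether the characters are spaces or tabs, open the file in a hex editor (a space is ASCII code 32 (hex 20), a TAB is 9 (hex 09)).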
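If no hex editor is at hand, the space-versus-tab question can be answered with a few lines of Python. This is just a convenience sketch; the function name whitespace_counts and the sample string are made up for the example.

```python
from collections import Counter


def whitespace_counts(text):
    """Count spaces (0x20) and tabs (0x09) in `text`."""
    counts = Counter(ch for ch in text if ch in " \t")
    return {"space (0x20)": counts[" "], "tab (0x09)": counts["\t"]}


print(whitespace_counts("a b\tc  d"))   # {'space (0x20)': 3, 'tab (0x09)': 1}
```

To inspect a real file, pass it the contents of the first few lines instead of the sample string.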
You can try this with GNU sed:
sed -E '/^2/{s//&|/;s/(.{19})(....)(\S+\s+)/\1|\2|\3|/}' infile
If you do not have FIELDWIDTHS (i.e. your awk is not GNU awk), try the following.
awk -v widths="1,17,4" -v OFS="|" '
BEGIN {
    num = split(widths, w, ",")          # widths of the leading fixed-width fields
}
/^2/ {
    out = ""
    pos = 1
    for (i = 1; i <= num; i++) {
        # cut each field at a running offset and join the pieces with OFS
        out = (i == 1 ? "" : out OFS) substr($0, pos, w[i])
        pos += w[i]
    }
    rest = substr($0, pos)               # everything after the fixed-width prefix
    if (rest != "") {
        gsub(/[[:space:]]+/, "&" OFS, rest)   # add OFS after each run of whitespace
        out = out OFS rest
    }
    $0 = out
}
1
' Input_file