How does zlib.h in C achieve a higher compression ratio than Python's gzip or zlib libraries?

Question

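I want to port some C code to Python 3. The C code uses zlib.h and gzFile to save files as ".gz".

I know that several Python 3 libraries, such as zlib, gzip, and pandas.DataFrame.to_pickle(compression = "gzip"), can save files in ".gz" format. However, the compression ratios differ greatly! How can I reproduce the behavior of C's "zlib.h" in Python 3?

I have tried everything I know.

First, the example C code: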

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include <math.h>
#include <time.h>

int main()
{
    /* Declared here, defined elsewhere (the names match the reference mt19937-64.c). */
    void init_genrand64(unsigned long long);
    double genrand64_real2(void);

    init_genrand64((unsigned long long)time(NULL));
    double r = 0;

    gzFile fp = gzopen("test_in_C.gz", "wb");
    for (int i = 0; i < 293; ++i)
    {
        for (int j = 0; j < 10000; ++j)
        {
            r = genrand64_real2();
            if(j < 9999)
            {
                gzprintf(fp, "%.8lf,", r);
            }
            else
            {
                gzprintf(fp, "%.8lf\n", r);
            }
        }
    }
    gzclose(fp);
}

Next, all of the Python 3 code:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns = range(293), data = np.random.rand(10000,293))

#csv, plain txt
df.to_csv("test_dat.csv", index = False, header = False)

#pickle, binary
df.to_pickle("test_dat.pk")

#pickle, compressed binary
df.to_pickle("test_dat.pkgz", compression = "gzip")

#npy, binary
np.save("test_dat.npy", df.values)

import gzip

#npy, compressed binary
with gzip.open("test_dat.npygz", "wb") as f:
    np.save(f, df.values)
    
#csv, compressed binary
with gzip.open("test_dat.gz", "wb") as f:
    for i in range(len(df)):
        for j in range(len(df.columns)):
            if j < len(df.columns) - 1:
                f.write((str(df.iloc[i,j]) + ",").encode())
            else:
                f.write((str(df.iloc[i,j]) + "\n").encode())

#npy, compressed binary
import zlib
dat = zlib.compress(df.values)
with open("test_dat.zl", "wb") as f:
    f.write(dat)

File sizes:

test_in_C.gz   13MB
test_dat.csv   54MB (56460866B)
test_dat.pk    22MB (23440571B)
test_dat.npy   22MB (23440128B)
test_dat.pkgz  21MB (22104880B)
test_dat.npygz 21MB (22104445B)
test_dat.gz    24MB (25663307B)
test_dat.zl    21MB (22104671B)

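In this experiment, the .npygz file has the best compression ratio among the Python outputs, but it is still roughly 1.6 times the size of test_in_C.gz (about 21 MB versus 13 MB). As you can see, the file from C and the files from Python 3 hold data of the same shape (10000, 293).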
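
One possible source of the gap is that the programs are not compressing the same bytes: the C loop prints every value with "%.8lf", while the Python variants store either raw float64 data or the full str() representation. The sketch below uses only the standard library and a hypothetical helper, compressed_sizes, to compare the three encodings on a small random sample. It is an illustration under those assumptions, not a benchmark of the files above.

```python
import random
import struct
import zlib


def compressed_sizes(n=20000, seed=0):
    """Return zlib-compressed sizes (in bytes) of n random floats in three encodings."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n)]

    # Raw 8-byte IEEE 754 doubles, similar in spirit to np.save or pickle output.
    binary = struct.pack(f"<{n}d", *values)
    # Full-precision text, as produced by str(float).
    full_text = ",".join(str(v) for v in values).encode()
    # Text rounded to 8 decimals, like the C format "%.8lf".
    short_text = ",".join(f"{v:.8f}" for v in values).encode()

    return {
        "binary": len(zlib.compress(binary)),
        "full_text": len(zlib.compress(full_text)),
        "text_8dp": len(zlib.compress(short_text)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes().items():
        print(f"{name:10s} {size} bytes")
```

Rounding to 8 decimals discards digits, so the smaller output is not a lossless copy of the float64 data.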

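How can I close this gap? This matters a lot to me because I will be handling very large data (about 1,000 files from C, each around 800 MB). If those files double in size, my PC and my Python code may be overwhelmed.

Thank you for your help.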

Answer 1

import zlib
import numpy as np

data = np.random.rand(10000, 293)

# convert the data to string
data_str = "\n".join(",".join(f"{x:.8f}" for x in row) for row in data)

# Compress the data by using zlib with compression level 9
compressed_data = zlib.compress(data_str.encode(), level=9)

# Save the compressed data. Note: zlib.compress emits a zlib stream,
# not a gzip container, despite the .gz file name.
with open("test_dat_zlib.gz", "wb") as f:
    f.write(compressed_data)

Results:

test_in_C.gz     13MB (13779791B)
test_dat_zlib.gz 13MB (13598171B)
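
For output that gzip tools can read directly, a closer analogue to the C gzopen/gzprintf loop is gzip.open in text mode. The following is a minimal sketch, not a definitive implementation: the function name write_rows_gz and the output file name are hypothetical, and compresslevel=6 is an assumption chosen to mirror zlib's default level, which gzopen is understood to use when the mode string carries no digit (Python's gzip.open defaults to 9). Rows are written one at a time, so the full text never has to sit in memory.

```python
import gzip
import random


def write_rows_gz(path, rows, compresslevel=6):
    """Write an iterable of float rows as 8-decimal CSV text into a gzip file."""
    # "wt" opens the gzip stream in text mode, so strings can be written directly.
    with gzip.open(path, "wt", compresslevel=compresslevel, newline="\n") as f:
        for row in rows:
            # Same formatting as the C code: "%.8lf" values joined by commas,
            # one row per line.
            f.write(",".join(f"{x:.8f}" for x in row))
            f.write("\n")


if __name__ == "__main__":
    rng = random.Random(0)
    # Small stand-in data; the C code in the question writes 293 rows of 10000 values.
    sample = ([rng.random() for _ in range(100)] for _ in range(10))
    write_rows_gz("test_dat_text.gz", sample)
```

Because rows can be any iterable, a generator that yields one row at a time keeps memory use flat for very large inputs.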
