I have about 300k files, roughly 9 GB of medical literature.
My goal is to determine the frequency of every token in the dataset and serialize them to a CSV file (token, frequency).
To achieve this I am using
layers.TextVectorization
with output_mode='count'.
Adapting to the huge dataset went fine, but when retrieving the vocabulary I get:
2024-04-20 22:38:56.832518: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-20 22:38:56.833545: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Discovered 323719 files
Finished adapting
Traceback (most recent call last):
File "(file_location_of_source_code)", line 55, in <module>
inverse_vocab = vectorize_layer.get_vocabulary()
File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\text_vectorization.py", line 487, in get_vocabulary
return self._lookup_layer.get_vocabulary(include_special_tokens)
File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\index_lookup.py", line 385, in get_vocabulary
self._tensor_vocab_to_numpy(vocab),
File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\string_lookup.py", line 416, in _tensor_vocab_to_numpy
[tf.compat.as_text(x, self.encoding) for x in vocabulary]
File "D:\Anaconda\envs\src\lib\site-packages\keras\layers\preprocessing\string_lookup.py", line 416, in <listcomp>
[tf.compat.as_text(x, self.encoding) for x in vocabulary]
MemoryError
Process finished with exit code 1
The relevant part of my code:
import pathlib

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

AUTOTUNE = tf.data.AUTOTUNE

files_root = pathlib.Path(r"directoryname")
files = tf.data.TextLineDataset.list_files(str(files_root/'*'))
# TextLineDataset splits each file into elements by newline by default;
# a tf.data pipeline is used because the corpus is too large to hold in memory
text_ds = tf.data.TextLineDataset(files).filter(lambda x: tf.cast(tf.strings.length(x), bool))
# Convert the vocabulary to integer indexes, ordered by frequency
# (custom_standardization is defined elsewhere in the script)
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    output_mode='count')
print(f"Discovered {len(files)} files")
vectorize_layer.adapt(text_ds.batch(1024))
print("Finished adapting")
inverse_vocab = vectorize_layer.get_vocabulary()  # <---- ERROR
print("Vocabulary retrieved, with length:")
size_vocab = len(inverse_vocab)
print(size_vocab)
In addition, I plan to combine the frequency arrays of all sequences; this approach works for considerably smaller datasets:
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()
print("Finished vectorization")

freq_arr = None
for i, entry in enumerate(text_vector_ds.as_numpy_iterator()):
    if i == 0:
        freq_arr = np.zeros(len(entry))
    freq_arr += entry.astype(int)
What is the solution to this problem? I need the vocabulary in order to map the actual tokens to their frequencies.
I am fairly new to TensorFlow and Keras, and any suggestions or criticism of my approach are welcome. I could really use some guidance on handling large datasets. My end goal is to feed them into a neural network (more specifically, some skip-grams). Many thanks.
Unfortunately, you will most likely need more memory to fully process your dataset.
As far as I can tell,
vectorize_layer.get_vocabulary()
is not particularly well optimized and at some point holds multiple copies of the vocabulary in memory. And it appears your vocabulary is very large.
For the first part of the problem, consider using the
max_tokens
argument of TextVectorization. This will keep only the most frequent tokens.
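A minimal sketch of capping the vocabulary, assuming a placeholder limit of 100,000 tokens (tune this to whatever fits in memory) and reusing your custom_standardization:

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=100_000,   # placeholder cap; only the 100k most frequent tokens are kept
    output_mode='count')
vectorize_layer.adapt(text_ds.batch(1024))
vocab = vectorize_layer.get_vocabulary()  # now bounded by max_tokens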
For the second part, I would recommend avoiding unbatch and accumulating the mapped data batch by batch. Perhaps something like this:
text_count = text_ds.batch(1024).prefetch(1).map(vectorize_layer)

freq_arr = None
for batch in text_count:
    # sum counts over the batch dimension, then accumulate into the running total
    freq_batch = batch.numpy().sum(axis=0)
    if freq_arr is None:
        freq_arr = np.zeros(len(freq_batch))
    freq_arr += freq_batch.astype(int)
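With freq_arr accumulated, you can pair it with the vocabulary to produce the (token, frequency) CSV you mentioned. A rough sketch, assuming an output file name of token_counts.csv (my choice) and that get_vocabulary() now fits in memory thanks to max_tokens:

import csv

vocab = vectorize_layer.get_vocabulary()  # vocab[i] corresponds to the count at freq_arr[i]
with open("token_counts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["token", "frequency"])
    for token, count in zip(vocab, freq_arr):
        writer.writerow([token, int(count)])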