I'm working on a Java project, essentially a fake-news detection application. The dataset contains two columns: text (the news article) and label (0: fake / 1: real). The data is converted to a JSON file. In Java, I use regex to replace all stop words with spaces (""). Then I started vectorizing in Java. I ran into problems with the built-in vectorization techniques in Weka and Deeplearning4j. Currently I vectorize the text with the StringToWordVector filter. Below is the code of the .java files in my application.
DataProcessor.java
package fnd;
import com.fasterxml.jackson.databind.ObjectMapper;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
public class DataProcessor {

    public static void main(String[] args) {
        try {
            // Specify the path to your JSON file containing news data
            String jsonFilePath = "src/main/resources/fnd_output.json";

            // Create ObjectMapper instance to read JSON
            ObjectMapper objectMapper = new ObjectMapper();

            // Deserialize JSON array into an array of News objects
            News[] newsArray = objectMapper.readValue(new File(jsonFilePath), News[].class);

            // Prepare attributes for the Instances
            ArrayList<Attribute> attributes = new ArrayList<>();
            attributes.add(new Attribute("text", (ArrayList<String>) null)); // Text attribute as string

            // Define nominal values for the label attribute
            ArrayList<String> labelValues = new ArrayList<>();
            labelValues.add("positive");
            labelValues.add("negative");
            Attribute labelAttribute = new Attribute("label", labelValues); // Label attribute as nominal
            attributes.add(labelAttribute);

            // Create an empty Instances object
            Instances instances = new Instances("TextInstances", attributes, 0);

            // Set the index of the class attribute (label attribute)
            instances.setClassIndex(attributes.size() - 1);

            // Process each News object and add to Instances
            for (News news : newsArray) {
                String processedText = TextPreprocessor.preprocessText(news.getText());

                // Vectorize the processed text
                Instances vectorizedInstance = TextVectorization.vectorizeText(processedText);

                // Create a new Instance
                Instance instance = new DenseInstance(attributes.size());

                // Set the dataset for the instance
                instance.setDataset(instances);

                // Handle text attribute (assuming it's a string attribute)
                Attribute textAttr = attributes.get(0);
                if (textAttr.isString()) {
                    instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));
                } else {
                    System.err.println("Text attribute is not a string attribute.");
                }

                // Handle label attribute (assuming it's a nominal attribute)
                Attribute labelAttr = labelAttribute;
                if (labelAttr.isNominal()) {
                    instance.setValue(labelAttr, news.getLabel());
                } else {
                    System.err.println("Label attribute is not a nominal attribute.");
                }

                // Add the instance to Instances
                instances.add(instance);
            }

            // Output instances to ARFF file
            ArffSaver arffSaver = new ArffSaver();
            arffSaver.setInstances(instances);
            arffSaver.setFile(new File("vectorized_text_with_labels.arff"));
            arffSaver.writeBatch();
            System.out.println("Text vectorization complete with labels. Saved as vectorized_text_with_labels.arff");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
TextVectorization.java
package fnd;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
public class TextVectorization {

    // Method to perform text vectorization (convert string to word vector)
    public static Instances vectorizeText(String text) throws Exception {
        // Create ArrayList to hold attributes
        ArrayList<Attribute> attributes = new ArrayList<>();

        // Create a single attribute named "text"
        Attribute textAttribute = new Attribute("text", (ArrayList<String>) null);
        attributes.add(textAttribute);

        // Create Instances object with the specified attribute
        Instances instances = new Instances("TextInstances", attributes, 0);
        instances.setClass(textAttribute); // Set the class attribute to "text"

        // Create a new Instance with the provided text
        Instance instance = new DenseInstance(instances.numAttributes());
        instance.setValue(textAttribute, text);
        instances.add(instance);

        // Apply StringToWordVector filter to vectorize the text
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(instances);
        Instances vectorizedData = Filter.useFilter(instances, filter);
        return vectorizedData;
    }

    public static void saveInstancesToArff(Instances instances, String filename) throws IOException {
        ArffSaver arffSaver = new ArffSaver();
        arffSaver.setInstances(instances);
        arffSaver.setFile(new File(filename));
        arffSaver.writeBatch();
    }
}
TextPreprocessor.java
package fnd;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TextPreprocessor {

    private static final Pattern URL_PATTERN = Pattern.compile("http[s]?://\\S+|www\\.\\S+");
    private static final Pattern HTML_TAG_PATTERN = Pattern.compile("<[^>]+>");

    public static String preprocessText(String text) {
        if (text == null || text.isEmpty()) {
            return "";
        }
        // Convert text to lowercase
        text = text.toLowerCase();
        // Remove URLs and HTML tags
        text = removeUrlsAndHtmlTags(text);
        // Keep only letters and whitespace (drops digits and punctuation)
        text = removeSpecialCharacters(text);
        return text;
    }

    private static String removeUrlsAndHtmlTags(String text) {
        Matcher urlMatcher = URL_PATTERN.matcher(text);
        text = urlMatcher.replaceAll("");
        Matcher htmlTagMatcher = HTML_TAG_PATTERN.matcher(text);
        text = htmlTagMatcher.replaceAll("");
        return text;
    }

    private static String removeSpecialCharacters(String text) {
        StringBuilder processedText = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch) || Character.isWhitespace(ch)) {
                processedText.append(ch);
            }
        }
        return processedText.toString();
    }
}
Here are the details of the error, as far as I can tell:

java.lang.IllegalArgumentException: Attribute isn't nominal, string or date!
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:674)
    at weka.core.AbstractInstance.stringValue(AbstractInstance.java:644)
    at fnd.DataProcessor.main(DataProcessor.java:60)
The code runs if I comment out this line:
instance.setValue(textAttr, vectorizedInstance.instance(0).stringValue(0));
How can I vectorize the text and then feed the data into a model?
You are processing the data one document at a time, re-initializing the StringToWordVector filter each time. As a result, the filter generates a different bag of words for each document you push through it, so the columns of the vectorized output correspond to different words every time. To fix this, you need at minimum to add all of your text data to a single weka.core.Instances object and then apply the filter once.
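A minimal sketch of that batch approach (the class name BatchVectorizer is illustrative, not from your code):

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchVectorizer {

    // Collect ALL documents into one Instances object first, then apply
    // StringToWordVector once, so every row shares the same vocabulary
    public static Instances vectorizeAll(List<String> texts) throws Exception {
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null)); // string attribute

        Instances raw = new Instances("TextInstances", attributes, texts.size());
        for (String text : texts) {
            DenseInstance inst = new DenseInstance(1);
            inst.setDataset(raw);
            inst.setValue(raw.attribute("text"), text);
            raw.add(inst);
        }

        // One filter, one vocabulary: each column now means the same word
        // for every document
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(raw);
        return Filter.useFilter(raw, filter);
    }
}
```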
But...
Since you plan to perform classification, you should combine StringToWordVector with the FilteredClassifier meta-classifier and a base classifier of your choice to do the actual classification. This way you ensure that later predictions are preprocessed correctly by the already-initialized StringToWordVector filter. In that setup, your text data should be the first attribute, and the label associated with that text should be the second attribute (the class attribute).
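A minimal sketch of the FilteredClassifier setup; NaiveBayes is just an example base classifier, any Weka classifier works in its place:

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TrainModel {

    // data: attribute 0 = text (string), attribute 1 = label (nominal class)
    public static FilteredClassifier train(Instances data) throws Exception {
        data.setClassIndex(1);

        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(new StringToWordVector());
        classifier.setClassifier(new NaiveBayes());

        // The filter is initialized on the training data here and reused,
        // unchanged, for every later classifyInstance() call
        classifier.buildClassifier(data);
        return classifier;
    }
}
```

Because the filter lives inside the classifier, a new document passed to classifyInstance() is vectorized against the training vocabulary automatically, which is exactly the consistency the per-document approach lacked.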