The integration of generative AI (Gen-AI) tools for text data augmentation has rapidly shifted from niche experimentation to a mainstream methodology, particularly in fields that grapple with data scarcity and the intricacies of low-resource languages. Nowhere is this more pronounced than in the Lithuanian educational context, where classification tasks often suffer from limited, imbalanced, and idiosyncratic datasets. This article investigates the empirical influence of Gen-AI-powered augmentation on text classification accuracy, drawing on a meticulously structured study in which six traditional machine learning models are benchmarked across numerous augmentation and preprocessing strategies.
The Rationale for Traditional Machine Learning Over Deep Learning
At the core of the methodology is a deliberate focus on classical machine learning algorithms: multi-layer perceptron (MLP), random forest (RF), gradient-boosted trees (GBT), k-nearest neighbors (kNN), decision trees (DT), and naive Bayes (NB). The decision to eschew deep learning approaches, such as large Transformer-based architectures, is grounded in pragmatic concerns. Deep learning models require extensive datasets and significant computational resources to yield incremental improvements in accuracy, often making them impractical for smaller, domain-specific corpora such as Lithuanian educational text data.

Critically, the research aims not to push the boundaries of model complexity, but to isolate and assess how much performance gain, if any, can be attributed purely to text data augmentation via Gen-AI. This is a vital distinction: rather than conflating augmentation and architecture, the study offers a rare clarity of focus, ensuring that improvements are not masked by the overwhelming representational capacity of deep learning systems.
Text Vectorization: From Bag of Words to sBERT
A principal hurdle in text data science is the “curse of dimensionality”—transforming text into numerically encoded vectors generates expansive feature spaces that can impede algorithm performance, inflate training times, and exacerbate overfitting. To tackle this, the study employs two of the most prevalent text-to-vector techniques: bag of words (BoW) and sentence-BERT (sBERT) embeddings.

- BoW offers a straightforward, frequency-based representation where each unique word maps to a vector dimension. After preprocessing the original Lithuanian educational datasets, the BoW representation typically ballooned to around 10,000 features per sample.
- sBERT, meanwhile, condenses semantic meaning into dense, fixed-size (384-dimensional) vectors, leveraging the power of pretrained Transformer models distilled for sentence-level similarity.
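The contrast between the two representations is easy to make concrete. The sketch below is illustrative only: the sentence-transformers checkpoint named here is an assumption (any 384-dimensional, preferably multilingual model matches the description above), and the toy Lithuanian sentences merely stand in for the study's corpus.

```python
# Minimal sketch of the two vectorization routes. The sBERT checkpoint is
# an assumption: any 384-dimensional sentence-transformers model (ideally
# multilingual, to cover Lithuanian) matches the description above.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Mokinys puikiai atliko užduotį.",      # toy Lithuanian examples
    "Atsakymas yra neišsamus ir netikslus.",
]

# Bag of words: one dimension per unique token (~10,000 over the full corpus)
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)            # sparse matrix, shape (n_docs, vocab_size)

# sBERT: dense, fixed-size 384-dimensional sentence embeddings
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
X_sbert = sbert.encode(texts)               # ndarray, shape (n_docs, 384)

print(X_bow.shape, X_sbert.shape)
```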
Dimensionality Reduction: Latent Semantic Analysis in Action
High-dimensional BoW vectors necessitate dimensionality reduction for effective model training and hyperparameter tuning. Here, Latent Semantic Analysis (LSA) is deployed—a time-tested technique that uncovers latent relationships between words and documents by applying singular value decomposition to the term-document matrix.

The efficacy of LSA is statistically validated using paired t-tests, comparing model performance (accuracy, precision, recall, F1) before and after various reductions, ranging from 10 to 50 dimensions. The findings are striking:
- Both MLP and kNN see robust, statistically significant gains post-reduction, with kNN benefiting most.
- For random forest, however, the difference is muted in some configurations, suggesting a degree of robustness to high-dimensional input, albeit at the cost of training efficiency.
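In practice, LSA amounts to truncated SVD over the BoW matrix, and the before/after comparison can be paired fold by fold. A minimal sketch, assuming X_bow and y are the bag-of-words matrix and labels from the previous step, and using kNN as the classifier that benefited most:

```python
# LSA as truncated SVD on the BoW matrix, with a paired t-test over the
# per-fold accuracies of the same classifier with and without reduction.
from scipy.stats import ttest_rel
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

knn_raw = KNeighborsClassifier()                       # full ~10,000-dim BoW
knn_lsa = make_pipeline(TruncatedSVD(n_components=50, random_state=0),
                        KNeighborsClassifier())        # 50 LSA dimensions

# X_bow, y: bag-of-words matrix and class labels (assumed from the previous step)
scores_raw = cross_val_score(knn_raw, X_bow, y, cv=cv, scoring="accuracy")
scores_lsa = cross_val_score(knn_lsa, X_bow, y, cv=cv, scoring="accuracy")

t_stat, p_value = ttest_rel(scores_lsa, scores_raw)    # paired over identical folds
print(f"mean gain: {(scores_lsa - scores_raw).mean():+.3f}, p = {p_value:.4f}")
```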
Experimental Design: A Data Scientist’s Gauntlet
The broad experimental canvas encompasses 15,296 model trainings—an extraordinary breadth that permits rigorous cross-validation and hyperparameter optimization. Consistency is maintained through stratified k-fold cross-validation (k=5), which preserves class proportions in every fold and enables fair, comparative insights across classifiers, data subsets, and augmentation schemes.

Performance is not appraised by a single metric. Instead, the comprehensive quartet—accuracy, recall, precision, and F1 score—captures the multifaceted nature of classification efficacy, particularly critical for imbalanced datasets common in educational assessments.
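The evaluation protocol itself is straightforward to reproduce. In the sketch below, X and y stand for any of the vectorized datasets and their labels; macro averaging of precision, recall, and F1 is an assumption, chosen so that minority classes weigh equally.

```python
# Stratified 5-fold cross-validation scored with the four metrics used in
# the study. Macro averaging is an assumption, not stated in the summary.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

scoring = {
    "accuracy": "accuracy",
    "precision": "precision_macro",
    "recall": "recall_macro",
    "f1": "f1_macro",
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# X, y: any vectorized dataset and its labels
results = cross_validate(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring=scoring)
for name in scoring:
    print(f"{name}: {results['test_' + name].mean():.4f}")
```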
Gen-AI Data Augmentation: A Comparative Lens
Bag of Words Results: Incremental but Uneven Gains
Initial model training on raw, unaugmented BoW representations already delivers respectable performance—a testament to the careful preprocessing (removal of numbers, punctuation, stop words, and short tokens; see the sketch after the list below). For two-class problems, accuracy ranges from 82.28% (kNN) to 87.60% (RF), while three-class setups register slightly diminished values, highlighting the increased difficulty of multi-class tasks.

Introducing Gen-AI augmentation triggers a marked shift:
- Four of six algorithms (MLP, RF, GBT, kNN) show consistent accuracy improvements when data is augmented—especially when leveraging outputs from ChatGPT and Copilot in concert.
- Decision tree and naive Bayes models trail behind, with NB often exhibiting outright performance degradation, echoing findings from related research that note its vulnerability to altered feature distributions.
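The preprocessing filters mentioned above are conventional and compact. A minimal sketch follows; the stop-word set and the minimum token length are placeholders rather than the study's exact choices.

```python
# The preprocessing filters described above: lower-case, strip numbers and
# punctuation, drop stop words and very short tokens. The stop-word set and
# the length threshold are placeholders, not the study's exact choices.
import re

LT_STOPWORDS = {"ir", "bet", "kad", "su", "yra"}   # illustrative subset only
MIN_TOKEN_LEN = 3                                   # "short token" cut-off (assumed)

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)            # remove punctuation (Unicode-safe)
    tokens = [t for t in text.split()
              if t not in LT_STOPWORDS and len(t) >= MIN_TOKEN_LEN]
    return " ".join(tokens)

print(preprocess("Mokinys atliko 2 užduotis, bet ne visas!"))
# -> "mokinys atliko užduotis visas"
```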
Three-class Problems: Amplified Benefits for Augmentation
The trend intensifies for three-class tasks. Accuracy gains for MLP, RF, GBT, and kNN are more pronounced, with the kNN algorithm notably thriving in these higher-complexity scenarios. The performance uplift can, in certain configurations, exceed 10 percentage points.

What is pivotal here is the robustness of augmentation across multiple classifiers—suggesting that for complex, fine-grained label distinctions, Gen-AI’s paraphrasing, expansion, and re-contextualization capabilities generate more learnable representations that persist across model types.
sBERT: Diminishing Returns Without Augmentation—But Spectacular Gains When Used Well
Shifting to sBERT embeddings—ostensibly a more sophisticated representation—it is initially surprising that, without augmentation, most machine learning models actually fare worse than with the reduced BoW baseline. For instance, decision trees and naive Bayes see dramatic declines, over 14% and 10% respectively for binary classes.

This seeming shortfall is mitigated—indeed, reversed—when sBERT is paired with Gen-AI data augmentation. The results border on transformative:
- kNN model accuracy surges beyond 97% on two-class problems, a remarkable increase of more than 15 percentage points over the baseline.
- Decision tree models, formerly lagging, now record 4–12% improvements, while naive Bayes enjoys upticks in certain augmented configurations.
Hyperparameter and Model Optimization: Best Practices Surface
The exhaustive experimental design allows rare visibility into hyperparameter optimization trends:

- MLP: More iterations and greater neuron counts boost accuracy, though hidden layer depth is less influential.
- Random Forest: For small datasets, the information gain split criterion is optimal; for larger sets, the Gini index prevails. Model count (100–850) varies by subset size.
- GBT: Peak performance clusters around a learning rate of 0.01 and 1000 trees, with variable tree depth.
- kNN: Weighted neighbors and neighborhood sizes between 2–8 are consistently optimal.
- Decision Tree and Naive Bayes: No robust hyperparameter trends distinguish themselves; NB’s one-size-fits-all simplicity is unchanged by data augmentation.
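These trends translate directly into narrower, cheaper search grids. The sketch below is illustrative rather than the study's own protocol, using the kNN ranges reported as optimal; X_train and y_train stand for any vectorized training subset.

```python
# Illustrative grid search over the ranges the study reports as most
# productive for kNN; the grid itself is an assumption, not the paper's
# exact search space.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(2, 9)),        # 2-8 neighbours reported optimal
    "weights": ["uniform", "distance"],      # weighted neighbours tend to win
}
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1_macro",
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```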
Tool Efficacy: ChatGPT and Copilot Outperform Gemini
Across virtually every configuration, the combination of ChatGPT and Copilot for data augmentation delivers the most substantial accuracy improvements, dwarfing the incremental gains achieved via Google’s Gemini tool. This is an unambiguous signal that, in practice, augmentation tool choice is not a trivial variable but a significant determinant of model performance.

The specific reasons for this difference in tool efficacy are complex, likely relating to the qualitative characteristics of augmented text—length, contextual expansion, and linguistic naturalness—all domains where OpenAI and Microsoft models presently set benchmarks.
Critical Analysis: Risks, Caveats, and Generalizability
Strengths
- Methodological Transparency: The study’s use of paired t-tests, stratified cross-validation, and a dizzying array of model runs (15,296) yields a statistical rigor rarely encountered in real-world language data experiments.
- Comprehensive Feature Engineering: The parallel analysis of BoW and sBERT, the dimension reduction via LSA, and close tracking of hyperparameter effects ensure that results are robust across a broad methodological spectrum.
- Reproducibility: Clear documentation of preprocessing filters, vector reduction, and model training pathways supports future efforts at meta-analysis or replication in other low-resource languages.
Potential Risks and Limitations
- Overfitting via Augmentation: While augmentation generally boosts metrics, it can inflate the risk of “data leakage”—inadvertently introducing too-similar synthetic samples that fail to represent real-world diversity. Especially with traditional models prone to memorization, care is needed to monitor for this artefact (see the sketch after this list).
- Dependence on Augmentation Tool Quality: As evident from differences between Gemini, ChatGPT, and Copilot, model improvements are contingent on the linguistic and domain fidelity of the AI used. Results may not generalize if augmentation tools generate noisy, irrelevant, or subtly biased text—a risk if proprietary models are updated or their training data diverge stylistically.
- Focus on Accuracy over Real-World Outcomes: Accuracy metrics, F1, and recall are proxies for generalization—not guarantees. Misclassifications in sensitive educational contexts can have real consequences, so augmented improvements should be stress-tested with unseen, real-world test sets, not just synthetic splits.
- Scalability: While bag of words and LSA scale well to moderate dataset sizes, future deployments with larger corpora or more nuanced domain-specific language may necessitate revisiting deep learning or hybrid approaches for further efficiency gains.
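One concrete guard against the leakage risk above, not a step the study itself reports, is group-aware cross-validation: every Gen-AI paraphrase stays in the same fold as the original text it was derived from. A minimal sketch, assuming X, y, and a source_ids array mapping each sample (real or synthetic) back to its original text:

```python
# Guard against augmentation leakage: every original text and all of its
# Gen-AI paraphrases share a group id, so they always land in the same fold.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

groups = np.asarray(source_ids)   # one source-text id per (real or synthetic) sample
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y, groups=groups):
    # no source text contributes samples to both sides of a split
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
    ...  # fit on the training fold, score on the validation fold
```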
Wider Implications for Educational Technology and NLP
The surge in performance documented here has direct and immediate implications for Lithuanian educational data pipelines: accurate classification models can now be reliably trained even on previously insufficiently sized datasets. More broadly, the findings underscore a vital point for low-resource language AI—well-crafted Gen-AI augmentation, paired with thoughtful feature engineering and traditional models, can close much of the performance gap typically ceded to high-resource counterparts.

The applicability is not limited to Lithuanian or educational data. Similar methodological playbooks could be wielded in other small language communities, specialist domains, or archival contexts—anywhere authentic text is rare and imbalanced.
Best Practices and Recommendations
- Prioritize Reliable Augmentation Tools: When pursuing text augmentation, favor established, high-quality Gen-AI models—preferably with options for prompt engineering and domain-specific constraints.
- Use Hybrid Embedding and Reduction Pipelines: Begin with BoW for baseline robustness, then graduate to denser semantic vectors like sBERT. Always incorporate dimensionality reduction to guard against overfitting and ensure training efficiency.
- Benchmark Across Multiple Classifiers: Avoid premature optimization on a single model—kNN and GBT are especially effective, but RF and MLP also show promise. Decision tree and NB may falter unless carefully tuned.
- Validate with Real-World Unseen Data: Augmented gains are only meaningful if they extend to truly novel, untouched samples. Incorporate “gold standard” holdout sets wherever possible.
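The last recommendation is worth making concrete. In the sketch below, texts, labels, and model stand for the raw corpus, its labels, and any of the six classifiers, while augment_with_genai() and vectorize() are hypothetical placeholders for the augmentation call and whichever embedding route is chosen; the point is simply that only the training split is ever augmented, and the gold holdout remains untouched real data.

```python
# Split off a real-data "gold" holdout first, augment only the training
# portion, and report final metrics on the untouched holdout.
# augment_with_genai() and vectorize() are hypothetical placeholders.
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_gold, y_train, y_gold = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

X_train_aug, y_train_aug = augment_with_genai(X_train, y_train)   # hypothetical helper
model.fit(vectorize(X_train_aug), y_train_aug)                     # BoW+LSA or sBERT route

print(classification_report(y_gold, model.predict(vectorize(X_gold))))
```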
Conclusion: A Path Forward for Low-Resource NLP
The empirical evidence is resounding: Gen-AI text augmentation, when executed thoughtfully, substantially strengthens classification accuracy in Lithuanian educational contexts—and by extension, promises similar breakthroughs across other minoritized languages and domains with scarce original data.

Yet this technological leap is not without its cautions. The study’s success pivots on careful tool selection, rigorous preprocessing, and stratified cross-validation. Its lessons should inspire not just data scientists in education but all practitioners navigating the uncharted, AI-augmented waters of low-resource NLP.
As natural language processing continues to democratize, bridging historic capability gaps, the fusion of Gen-AI and classical machine learning stands out as both a pragmatic and promising strategy—a testament to the continual evolution within the field of artificial intelligence.
Source: The influence of Gen-AI tools application for text data augmentation: case of Lithuanian educational context data classification, Scientific Reports (Nature Portfolio)