new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 11

CosmicMan: A Text-to-Image Foundation Model for Humans

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488x1255, and attached with precise text annotations deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into down-streaming tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion model, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze.

  • 6 authors
·
Apr 1, 2024 1

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation

Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-foley, HunyuanVideo-foley and Thinksound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual-visual encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training efficiency and audio quality. Moreover, we introduce the classifier-free guidance strategy into VLMs to bootstrap generated audio quality. Furthermore, we establish an efficient data production pipeline to scale audio-video-text triple collection. Finally, extensive experiments are conducted to validate the effectiveness of our model, achieving promising performance across popular benchmarks. We hope that the findings in this study provide a strong foundation for future video-to-audio generation research. We also release the previously missing audio-visual textual descriptions from the public benchmark, aiming to facilitate subsequent researchers in conducting more convenient and effective evaluations and comparisons.

  • 5 authors
·
Dec 4

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.

  • 6 authors
·
Apr 22 2

MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community's development of MLLMs.

  • 17 authors
·
Dec 2

POLCA: Power Oversubscription in LLM Cloud Providers

Recent innovation in large language models (LLMs), and their myriad use-cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans of growth in their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the increasing model sizes of LLMs, they are becoming increasingly power intensive. In this paper, we show that there is a significant opportunity to oversubscribe power in LLM clusters. Power oversubscription improves the power efficiency of these datacenters, allowing more deployable servers per datacenter, and reduces the deployment time, since building new datacenters is slow. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the inference and training power consumption patterns. Based on our analysis of these LLMs, we claim that the average and peak power utilization in LLM clusters for inference should not be very high. Our deductions align with the data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment, makes it challenging to have a reliable and robust power oversubscription mechanism. We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in the same GPU cluster for inference, with minimal performance loss

  • 7 authors
·
Aug 24, 2023

A Probabilistic Framework for Temporal Distribution Generalization in Industry-Scale Recommender Systems

Temporal distribution shift (TDS) erodes the long-term accuracy of recommender systems, yet industrial practice still relies on periodic incremental training, which struggles to capture both stable and transient patterns. Existing approaches such as invariant learning and self-supervised learning offer partial solutions but often suffer from unstable temporal generalization, representation collapse, or inefficient data utilization. To address these limitations, we propose ELBO_TDS, a probabilistic framework that integrates seamlessly into industry-scale incremental learning pipelines. First, we identify key shifting factors through statistical analysis of real-world production data and design a simple yet effective data augmentation strategy that resamples these time-varying factors to extend the training support. Second, to harness the benefits of this extended distribution while preventing representation collapse, we model the temporal recommendation scenario using a causal graph and derive a self-supervised variational objective, ELBO_TDS, grounded in the causal structure. Extensive experiments supported by both theoretical and empirical analysis demonstrate that our method achieves superior temporal generalization, yielding a 2.33\% uplift in GMV per user and has been successfully deployed in Shopee Product Search. Code is available at https://github.com/FuCongResearchSquad/ELBO4TDS.

  • 5 authors
·
Nov 25

Production of Categorical Data Verifying Differential Privacy: Conception and Applications to Machine Learning

Private and public organizations regularly collect and analyze digitalized data about their associates, volunteers, clients, etc. However, because most personal data are sensitive, there is a key challenge in designing privacy-preserving systems. To tackle privacy concerns, research communities have proposed different methods to preserve privacy, with Differential privacy (DP) standing out as a formal definition that allows quantifying the privacy-utility trade-off. Besides, with the local DP (LDP) model, users can sanitize their data locally before transmitting it to the server. The objective of this thesis is thus two-fold: O_1) To improve the utility and privacy in multiple frequency estimates under LDP guarantees, which is fundamental to statistical learning. And O_2) To assess the privacy-utility trade-off of machine learning (ML) models trained over differentially private data. For O_1, we first tackled the problem from two "multiple" perspectives, i.e., multiple attributes and multiple collections throughout time, while focusing on utility. Secondly, we focused our attention on the multiple attributes aspect only, in which we proposed a solution focusing on privacy while preserving utility. In both cases, we demonstrate through analytical and experimental validations the advantages of our proposed solutions over state-of-the-art LDP protocols. For O_2, we empirically evaluated ML-based solutions designed to solve real-world problems while ensuring DP guarantees. Indeed, we mainly used the input data perturbation setting from the privacy-preserving ML literature. This is the situation in which the whole dataset is sanitized independently and, thus, we implemented LDP algorithms from the perspective of the centralized data owner. In all cases, we concluded that differentially private ML models achieve nearly the same utility metrics as non-private ones.

  • 1 authors
·
Apr 2, 2022

Measurement of the properties of Higgs boson production at $\sqrt{s} = 13$ TeV in the $H\toγγ$ channel using $139$ fb$^{-1}$ of $pp$ collision data with the ATLAS experiment

Measurements of Higgs boson production cross-sections are carried out in the diphoton decay channel using 139 fb^{-1} of pp collision data at s = 13 TeV collected by the ATLAS experiment at the LHC. The analysis is based on the definition of 101 distinct signal regions using machine-learning techniques. The inclusive Higgs boson signal strength in the diphoton channel is measured to be 1.04^{+0.10}_{-0.09}. Cross-sections for gluon-gluon fusion, vector-boson fusion, associated production with a W or Z boson, and top associated production processes are reported. An upper limit of 10 times the Standard Model prediction is set for the associated production process of a Higgs boson with a single top quark, which has a unique sensitivity to the sign of the top quark Yukawa coupling. Higgs boson production is further characterized through measurements of Simplified Template Cross-Sections (STXS). In total, cross-sections of 28 STXS regions are measured. The measured STXS cross-sections are compatible with their Standard Model predictions, with a p-value of 93%. The measurements are also used to set constraints on Higgs boson coupling strengths, as well as on new interactions beyond the Standard Model in an effective field theory approach. No significant deviations from the Standard Model predictions are observed in these measurements, which provide significant sensitivity improvements compared to the previous ATLAS results.

  • 1 authors
·
Jul 1, 2022

ComPile: A Large IR Dataset from Production Sources

Code is increasingly becoming a core data modality of modern machine learning research impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development by not utilizing the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 182B token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language's package manager or the compiler directly to extract the dataset of intermediate representations from production grade programs. Statistical analysis proves the utility of our dataset not only for large language model training, but also for the introspection into the code generation process itself with the dataset showing great promise for machine-learned compiler components.

  • 9 authors
·
Sep 27, 2023

Open-Sora: Democratizing Efficient Video Production for All

Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.

  • 9 authors
·
Dec 29, 2024

Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach

Artificial intelligence (AI), and especially its sub-field of Machine Learning (ML), are impacting the daily lives of everyone with their ubiquitous applications. In recent years, AI researchers and practitioners have introduced principles and guidelines to build systems that make reliable and trustworthy decisions. From a practical perspective, conventional ML systems process historical data to extract the features that are consequently used to train ML models that perform the desired task. However, in practice, a fundamental challenge arises when the system needs to be operationalized and deployed to evolve and operate in real-life environments continuously. To address this challenge, Machine Learning Operations (MLOps) have emerged as a potential recipe for standardizing ML solutions in deployment. Although MLOps demonstrated great success in streamlining ML processes, thoroughly defining the specifications of robust MLOps approaches remains of great interest to researchers and practitioners. In this paper, we provide a comprehensive overview of the trustworthiness property of MLOps systems. Specifically, we highlight technical practices to achieve robust MLOps systems. In addition, we survey the existing research approaches that address the robustness aspects of ML systems in production. We also review the tools and software available to build MLOps systems and summarize their support to handle the robustness aspects. Finally, we present the open challenges and propose possible future directions and opportunities within this emerging field. The aim of this paper is to provide researchers and practitioners working on practical AI applications with a comprehensive view to adopt robust ML solutions in production environments.

  • 2 authors
·
Oct 28, 2024

Dynamical Model of $J/Ψ$ photo-production on the nucleon

A dynamical model based on a phenomenological charm quark-nucleon(c-N) potential v_{cN} and the Pomeron-exchange mechanism is constructed to investigate the J/Psi photo-production on the nucleon from threshold to invariant mass W=300 GeV. The J/Psi-N potential,V_{J/Psi N}(r),is constructed by folding v_{cN} into the wavefunction Phi_{J/Psi}(cc) of J/Psi within a Constituent Quark Model(CQM) of Ref.[43]. A photo-production amplitude is also generated by v_{cN} by a cc-loop integration over the gammarightarrow cc vertex function and Phi_{J/Psi}(cc). No commonly used Vector Meson Dominance assumption is used to define this photo-production amplitude which is needed to describe the data near the threshold. The potential v_{cN}(r) is parameterized in a form such that the predicted V_{J/Psi N}(r) at large distances has the same Yukawa potential form extracted from a Lattice QCD(LQCD) calculation of Ref.[18]. The parameters of v_{cN} are determined by fitting the total cross section data of JLab by performing calculations that include J/Psi-N final state interactions(FSI). The resulting differential cross sections are found in good agreements with the data. It is shown that the FSI effects dominate the cross section in the very near threshold region, allowing for sensitive testing of the predicted J/Psi-N scattering amplitudes. By imposing the constraints of J/Psi-N potential extracted from the LQCD calculation, we have obtained three J/Psi-N potentials which fit the JLab data equally well. The resulting J/Psi-N scattering lengths are in the range of a=(-0.05 fm sim -0.25 fm). With the determined v_{cN}(r) and the wavefunctions generated from the same CQM, the constructed model is used to predict the cross sections of photo-production of eta_c(1S) and Psi(2S) mesons for future experimental tests.

  • 3 authors
·
Mar 4, 2024

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation.To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.

  • 11 authors
·
Jul 24, 2024 3

How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13\%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10\%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6\% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release accompanying and models see https://github.com/SunbirdAI/kinyarwanda-whisper-eval

  • 6 authors
·
Oct 8

Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers

Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.

  • 3 authors
·
Feb 14

Deep Open-Set Recognition for Silicon Wafer Production Monitoring

The chips contained in any electronic device are manufactured over circular silicon wafers, which are monitored by inspection machines at different production stages. Inspection machines detect and locate any defect within the wafer and return a Wafer Defect Map (WDM), i.e., a list of the coordinates where defects lie, which can be considered a huge, sparse, and binary image. In normal conditions, wafers exhibit a small number of randomly distributed defects, while defects grouped in specific patterns might indicate known or novel categories of failures in the production line. Needless to say, a primary concern of semiconductor industries is to identify these patterns and intervene as soon as possible to restore normal production conditions. Here we address WDM monitoring as an open-set recognition problem to accurately classify WDM in known categories and promptly detect novel patterns. In particular, we propose a comprehensive pipeline for wafer monitoring based on a Submanifold Sparse Convolutional Network, a deep architecture designed to process sparse data at an arbitrary resolution, which is trained on the known classes. To detect novelties, we define an outlier detector based on a Gaussian Mixture Model fitted on the latent representation of the classifier. Our experiments on a real dataset of WDMs show that directly processing full-resolution WDMs by Submanifold Sparse Convolutions yields superior classification performance on known classes than traditional Convolutional Neural Networks, which require a preliminary binning to reduce the size of the binary images representing WDMs. Moreover, our solution outperforms state-of-the-art open-set recognition solutions in detecting novelties.

  • 5 authors
·
Aug 30, 2022

ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale

Incremental learning is one paradigm to enable model building and updating at scale with streaming data. For end-to-end automatic speech recognition (ASR) tasks, the absence of human annotated labels along with the need for privacy preserving policies for model building makes it a daunting challenge. Motivated by these challenges, in this paper we use a cloud based framework for production systems to demonstrate insights from privacy preserving incremental learning for automatic speech recognition (ILASR). By privacy preserving, we mean, usage of ephemeral data which are not human annotated. This system is a step forward for production levelASR models for incremental/continual learning that offers near real-time test-bed for experimentation in the cloud for end-to-end ASR, while adhering to privacy-preserving policies. We show that the proposed system can improve the production models significantly(3%) over a new time period of six months even in the absence of human annotated labels with varying levels of weak supervision and large batch sizes in incremental learning. This improvement is 20% over test sets with new words and phrases in the new time period. We demonstrate the effectiveness of model building in a privacy-preserving incremental fashion for ASR while further exploring the utility of having an effective teacher model and use of large batch sizes.

  • 14 authors
·
Jul 19, 2022

California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops

California is a global leader in agricultural production, contributing 12.5% of the United States total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a timeseries encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R2 score of 0.76 across all crops of unseen test dataset, highlighting strong predictive performance across California diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.

  • 3 authors
·
Jun 11

Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models

The machine learning community is increasingly recognizing the importance of fostering trust and safety in modern generative AI (GenAI) models. We posit machine unlearning (MU) as a crucial foundation for developing safe, secure, and trustworthy GenAI models. Traditional MU methods often rely on stringent assumptions and require access to real data. This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models by aligning the conditional scores of "unsafe" classes or concepts with those of "safe" ones. To eliminate the need for real data, our SFD framework incorporates a score-based MU loss into the score distillation objective of a pretrained diffusion model. This serves as a regularization term that preserves desired generation capabilities while enabling the production of synthetic data through a one-step generator. Our experiments on pretrained label-conditional and text-to-image diffusion models demonstrate that our method effectively accelerates the forgetting of target classes or concepts during generation, while preserving the quality of other classes or concepts. This unlearned and distilled diffusion not only pioneers a novel concept in MU but also accelerates the generation speed of diffusion models. Our experiments and studies on a range of diffusion models and datasets confirm that our approach is generalizable, effective, and advantageous for MU in diffusion models. (Warning: This paper contains sexually explicit imagery, discussions of pornography, racially-charged terminology, and other content that some readers may find disturbing, distressing, and/or offensive.)

  • 3 authors
·
Sep 17, 2024

ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models

In this research work we have proposed high-level ChildDiffusion framework capable of generating photorealistic child facial samples and further embedding several intelligent augmentations on child facial data using short text prompts, detailed textual guidance from LLMs, and further image to image transformation using text guidance control conditioning thus providing an opportunity to curate fully synthetic large scale child datasets. The framework is validated by rendering high-quality child faces representing ethnicity data, micro expressions, face pose variations, eye blinking effects, facial accessories, different hair colours and styles, aging, multiple and different child gender subjects in a single frame. Addressing privacy concerns regarding child data acquisition requires a comprehensive approach that involves legal, ethical, and technological considerations. Keeping this in view this framework can be adapted to synthesise child facial data which can be effectively used for numerous downstream machine learning tasks. The proposed method circumvents common issues encountered in generative AI tools, such as temporal inconsistency and limited control over the rendered outputs. As an exemplary use case we have open-sourced child ethnicity data consisting of 2.5k child facial samples of five different classes which includes African, Asian, White, South Asian/ Indian, and Hispanic races by deploying the model in production inference phase. The rendered data undergoes rigorous qualitative as well as quantitative tests to cross validate its efficacy and further fine-tuning Yolo architecture for detecting and classifying child ethnicity as an exemplary downstream machine learning task.

  • 3 authors
·
Jun 17, 2024

Coconut Palm Tree Counting on Drone Images with Deep Object Detection and Synthetic Training Data

Drones have revolutionized various domains, including agriculture. Recent advances in deep learning have propelled among other things object detection in computer vision. This study utilized YOLO, a real-time object detector, to identify and count coconut palm trees in Ghanaian farm drone footage. The farm presented has lost track of its trees due to different planting phases. While manual counting would be very tedious and error-prone, accurately determining the number of trees is crucial for efficient planning and management of agricultural processes, especially for optimizing yields and predicting production. We assessed YOLO for palm detection within a semi-automated framework, evaluated accuracy augmentations, and pondered its potential for farmers. Data was captured in September 2022 via drones. To optimize YOLO with scarce data, synthetic images were created for model training and validation. The YOLOv7 model, pretrained on the COCO dataset (excluding coconut palms), was adapted using tailored data. Trees from footage were repositioned on synthetic images, with testing on distinct authentic images. In our experiments, we adjusted hyperparameters, improving YOLO's mean average precision (mAP). We also tested various altitudes to determine the best drone height. From an initial [email protected] of 0.65, we achieved 0.88, highlighting the value of synthetic images in agricultural scenarios.

  • 6 authors
·
Dec 16, 2024

Antagonising explanation and revealing bias directly through sequencing and multimodal inference

Deep generative models produce data according to a learned representation, e.g. diffusion models, through a process of approximation computing possible samples. Approximation can be understood as reconstruction and the large datasets used to train models as sets of records in which we represent the physical world with some data structure (photographs, audio recordings, manuscripts). During the process of reconstruction, e.g., image frames develop each timestep towards a textual input description. While moving forward in time, frame sets are shaped according to learned bias and their production, we argue here, can be considered as going back in time; not by inspiration on the backward diffusion process but acknowledging culture is specifically marked in the records. Futures of generative modelling, namely in film and audiovisual arts, can benefit by dealing with diffusion systems as a process to compute the future by inevitably being tied to the past, if acknowledging the records as to capture fields of view at a specific time, and to correlate with our own finite memory ideals. Models generating new data distributions can target video production as signal processors and by developing sequences through timelines we ourselves also go back to decade-old algorithmic and multi-track methodologies revealing the actual predictive failure of contemporary approaches to synthesis in moving image, both as relevant to composition and not explanatory.

  • 3 authors
·
Aug 25, 2023

Leveraging Large Language Models in Conversational Recommender Systems

A Conversational Recommender System (CRS) offers increased transparency and control to users by enabling them to engage with the system through a real-time multi-turn dialogue. Recently, Large Language Models (LLMs) have exhibited an unprecedented ability to converse naturally and incorporate world knowledge and common-sense reasoning into language understanding, unlocking the potential of this paradigm. However, effectively leveraging LLMs within a CRS introduces new technical challenges, including properly understanding and controlling a complex conversation and retrieving from external sources of information. These issues are exacerbated by a large, evolving item corpus and a lack of conversational data for training. In this paper, we provide a roadmap for building an end-to-end large-scale CRS using LLMs. In particular, we propose new implementations for user preference understanding, flexible dialogue management and explainable recommendations as part of an integrated architecture powered by LLMs. For improved personalization, we describe how an LLM can consume interpretable natural language user profiles and use them to modulate session-level context. To overcome conversational data limitations in the absence of an existing production CRS, we propose techniques for building a controllable LLM-based user simulator to generate synthetic conversations. As a proof of concept we introduce RecLLM, a large-scale CRS for YouTube videos built on LaMDA, and demonstrate its fluency and diverse functionality through some illustrative example conversations.

  • 13 authors
·
May 13, 2023

TeleAntiFraud-28k: A Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatically speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.

  • 10 authors
·
Mar 31 2

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

  • 121 authors
·
Feb 17

The Necessity of Imperfection:Reversing Model Collapse via Simulating Cognitive Boundedness

Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.

  • 1 authors
·
Dec 1

LEIA: Linguistic Embeddings for the Identification of Affect

The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer. The models produced for this article are publicly available at https://huggingface.co/LEIA

  • 6 authors
·
Apr 21, 2023

Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks

One major goal of the AI security community is to securely and reliably produce and deploy deep learning models for real-world applications. To this end, data poisoning based backdoor attacks on deep neural networks (DNNs) in the production stage (or training stage) and corresponding defenses are extensively explored in recent years. Ironically, backdoor attacks in the deployment stage, which can often happen in unprofessional users' devices and are thus arguably far more threatening in real-world scenarios, draw much less attention of the community. We attribute this imbalance of vigilance to the weak practicality of existing deployment-stage backdoor attack algorithms and the insufficiency of real-world attack demonstrations. To fill the blank, in this work, we study the realistic threat of deployment-stage backdoor attacks on DNNs. We base our study on a commonly used deployment-stage attack paradigm -- adversarial weight attack, where adversaries selectively modify model weights to embed backdoor into deployed DNNs. To approach realistic practicality, we propose the first gray-box and physically realizable weights attack algorithm for backdoor injection, namely subnet replacement attack (SRA), which only requires architecture information of the victim model and can support physical triggers in the real world. Extensive experimental simulations and system-level real-world attack demonstrations are conducted. Our results not only suggest the effectiveness and practicality of the proposed attack algorithm, but also reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoor into DNN models in user devices. By our study, we call for more attention to the vulnerability of DNNs in the deployment stage.

  • 6 authors
·
Nov 25, 2021

TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Benchmarks that closely reflect downstream application scenarios are essential for the streamlined adoption of new research in tabular machine learning (ML). In this work, we examine existing tabular benchmarks and find two common characteristics of industry-grade tabular data that are underrepresented in the datasets available to the academic community. First, tabular data often changes over time in real-world deployment scenarios. This impacts model performance and requires time-based train and test splits for correct model evaluation. Yet, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. For each specific dataset, this can have a different impact on the absolute and relative number of predictive, uninformative, and correlated features, which in turn can affect model selection. To fill the aforementioned gaps in academic benchmarks, we introduce TabReD -- a collection of eight industry-grade tabular datasets covering a wide range of domains from finance to food delivery services. We assess a large number of tabular ML models in the feature-rich, temporally-evolving data setting facilitated by TabReD. We demonstrate that evaluation on time-based data splits leads to different methods ranking, compared to evaluation on random splits more common in academic benchmarks. Furthermore, on the TabReD datasets, MLP-like architectures and GBDT show the best results, while more sophisticated DL models are yet to prove their effectiveness.

  • 4 authors
·
Jun 27, 2024 6

FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation

AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.

  • 9 authors
·
Jun 23 1

Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models

The introduction of the transformer architecture and the self-attention mechanism has led to an explosive production of language models trained on specific downstream tasks and data domains. With over 200, 000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains while addressing computational, security, and recency concerns. There is an urgent need for machine learning frameworks that can eliminate the burden of model selection and customization and unleash the incredible power of the vast emerging model library for end users. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library based on analysis of individual input prompts. Inspired by the thalamic router in the brain, Tryage employs a perceptive router to predict down-stream model performance on prompts and, then, makes a routing decision using an objective function that integrates performance predictions with user goals and constraints that are incorporated through flags (e.g., model size, model recency). Tryage allows users to explore a Pareto front and automatically trade-off between task accuracy and secondary goals including minimization of model size, recency, security, verbosity, and readability. Across heterogeneous data sets that include code, text, clinical data, and patents, the Tryage framework surpasses Gorilla and GPT3.5 turbo in dynamic model selection identifying the optimal model with an accuracy of 50.9% , compared to 23.6% by GPT 3.5 Turbo and 10.8% by Gorilla. Conceptually, Tryage demonstrates how routing models can be applied to program and control the behavior of multi-model LLM systems to maximize efficient use of the expanding and evolving language model ecosystem.

  • 2 authors
·
Aug 22, 2023

BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction

Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering a 2.4 times sample size compared to the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 80 chemical systems, 12 operating temperatures, and 646 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in a series of neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.

  • 9 authors
·
Feb 25

Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation

The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.

  • 11 authors
·
Nov 7, 2023

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on codemixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. The HingGPT is a GPT2 based generative transformer model capable of generating full tweets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification(LID) dataset and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp .

  • 2 authors
·
Apr 18, 2022

Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT

This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. Synthetic datasets present an effective solution to challenges pertaining to data privacy, scarcity, and control over variables - characteristics that make them particularly valuable for research pursuits. The utility of these datasets, however, largely depends on their quality, measured through the lenses of diversity, relevance, and coherence. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset. The experiment involved an iterative guidance of ChatGPT, progressively refining prompts and culminating in the creation of a comprehensive dataset for a hypothetical urban planning scenario in Columbus, Ohio. Upon generation, the synthetic dataset was subjected to an evaluation, focusing on the previously identified quality parameters and employing descriptive statistics and visualization techniques for a thorough analysis. Despite synthetic datasets not serving as perfect replacements for actual world data, their potential in specific use-cases, when executed with precision, is significant. This research underscores the potential of AI models like ChatGPT in enhancing data availability for complex sectors like telematics, thus paving the way for a myriad of new research opportunities.

  • 1 authors
·
Jun 23, 2023

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid and scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent's goal is to improve student performance. Students are iteratively trained and evaluated on generated data, with their feedback (in the form of errors or weak skills) being reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.

  • 4 authors
·
Oct 8, 2024

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-GURU, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KRAMABENCH show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, when extensive data processing and domain knowledge are required to construct real-world data science pipelines, existing out-of-box models fall short. Progress on KramaBench represents crucial steps towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.

  • 19 authors
·
Jun 6

FAIR Jupyter: a knowledge graph approach to semantic sharing and granular exploration of a computational notebook reproducibility dataset

The way in which data are shared can affect their utility and reusability. Here, we demonstrate how data that we had previously shared in bulk can be mobilized further through a knowledge graph that allows for much more granular exploration and interrogation. The original dataset is about the computational reproducibility of GitHub-hosted Jupyter notebooks associated with biomedical publications. It contains rich metadata about the publications, associated GitHub repositories and Jupyter notebooks, and the notebooks' reproducibility. We took this dataset, converted it into semantic triples and loaded these into a triple store to create a knowledge graph, FAIR Jupyter, that we made accessible via a web service. This enables granular data exploration and analysis through queries that can be tailored to specific use cases. Such queries may provide details about any of the variables from the original dataset, highlight relationships between them or combine some of the graph's content with materials from corresponding external resources. We provide a collection of example queries addressing a range of use cases in research and education. We also outline how sets of such queries can be used to profile specific content types, either individually or by class. We conclude by discussing how such a semantically enhanced sharing of complex datasets can both enhance their FAIRness, i.e., their findability, accessibility, interoperability, and reusability, and help identify and communicate best practices, particularly with regards to data quality, standardization, automation and reproducibility.

  • 2 authors
·
Apr 19, 2024