Data-centric AI (DCAI): Difference between revisions

==Introduction==
Model-centric AI is the paradigm taught in most ML classes and revolves around producing the best model given a clean, well-curated dataset. In contrast, Data-centric AI involves systematically engineering data to build better AI systems. Data-centric AI can come in two forms:
*algorithms that understand data and use that information to improve models
*algorithms that modify data to improve ML models. Examples of this include curriculum learning (...
 
However tempting it may be, don't skip Steps 2 through 4. You can repeat Steps 3-4 multiple times to deploy the most effective ML systems.


==Examples==
This field covers the following methods:
 
*Outlier detection and removal (handling unusual examples in the dataset).
*Error detection and correction (handling incorrect labels/values in the dataset).
*Establishing consensus (determining truth among many crowdsourced annotations).
*Data augmentation (adding examples of data to encode prior information).
*Feature engineering and selection (manipulating the way data are represented).
*Active learning (selecting the most informative data to label next).
*Curriculum learning (ordering the data in a dataset from easiest to most difficult).
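Several of the methods listed above can be illustrated in a few lines each. The following is a minimal sketch using only the Python standard library, not a reference implementation. All function names are hypothetical, and the specific rules chosen here (a z-score cutoff for outliers, a plain majority vote for consensus, predictive entropy for active learning, and a caller-supplied difficulty score for curriculum learning) are assumptions picked for simplicity; practical systems typically use more robust variants.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def find_outliers(values, z_threshold=3.0):
    """Outlier detection: indices whose |z-score| exceeds z_threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def majority_vote(annotations):
    """Consensus: most common label among annotators and its vote share."""
    label, count = Counter(annotations).most_common(1)[0]
    return label, count / len(annotations)


def most_uncertain(probabilities, k=1):
    """Active learning: indices of the k examples with highest predictive entropy.

    probabilities: one list of class probabilities per unlabeled example,
    as produced by some already-trained model (assumed, not shown here).
    """
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    ranked = sorted(range(len(probabilities)),
                    key=lambda i: entropy(probabilities[i]),
                    reverse=True)
    return ranked[:k]


def curriculum_order(examples, difficulty):
    """Curriculum learning: sort examples from easiest to most difficult."""
    return sorted(examples, key=difficulty)


if __name__ == "__main__":
    readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
    print(find_outliers(readings, z_threshold=2.5))
    print(majority_vote(["cat", "cat", "dog"]))
    print(most_uncertain([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]], k=1))
    print(curriculum_order(["a long sentence here", "short", "medium one"], len))
```

One caveat on the outlier rule: with the population standard deviation used here, no z-score in a sample of n values can exceed the square root of n-1, so a cutoff of 3.0 can never fire on ten or fewer points. That is why the usage example passes a lower threshold.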

Recent high-profile ML applications clearly show how the reliability of ML models deployed in the real world depends on their training data.
 
OpenAI has openly stated that errors in data and labels were the main problem in training well-known ML models such as DALL-E, GPT-3 and ChatGPT. These are stills from the demo of DALL-E 2.
 
Tesla was able to produce autonomous driving systems that are far more advanced than those of comparable competitors by using model-assisted data improvement (Step 3). The key to this success is the Data Engine. These slides are from Andrej Karpathy (Tesla Director of AI, 2021).
 
==Reasons for Data-centric AI==
*Data quality issues are costing the U.S. alone an estimated $3 trillion annually.
*Automated methods and systematic engineering principles are now needed to ensure ML models are trained with clean data.
*Recent research on image classification with noisily labeled data revealed that simple methods that adaptively change the dataset can lead to more accurate models than sophisticated modeling strategies.
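As an illustration of what "adaptively changing the dataset" can look like, the sketch below flags examples whose given label disagrees with a model's confident prediction and drops them before retraining. It is a simplified, assumed variant of a per-class confidence-threshold rule and is not necessarily the method used in the research mentioned above. The function names are hypothetical, and the predicted probabilities are assumed to come from some existing classifier, ideally out-of-sample (for example via cross-validation).

```python
def flag_likely_label_errors(labels, pred_probs):
    """Return indices of examples whose given label looks inconsistent.

    labels:     integer class labels as given in the dataset
    pred_probs: one list of per-class probabilities per example
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class c
    # over the examples that are labeled c.
    thresholds = []
    for c in range(num_classes):
        confs = [p[c] for y, p in zip(labels, pred_probs) if y == c]
        thresholds.append(sum(confs) / len(confs) if confs else 1.0)

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [c for c in range(num_classes) if p[c] >= thresholds[c]]
        if confident:
            best = max(confident, key=lambda c: p[c])
            if best != y:
                flagged.append(i)  # confident prediction contradicts the label
    return flagged


def prune(examples, labels, pred_probs):
    """Return (example, label) pairs with the flagged examples removed."""
    drop = set(flag_likely_label_errors(labels, pred_probs))
    return [(x, y) for i, (x, y) in enumerate(zip(examples, labels))
            if i not in drop]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
                  [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
    print(flag_likely_label_errors(labels, pred_probs))
    print(prune(list("abcdef"), labels, pred_probs))
```

Flagged examples could also be sent for relabeling instead of being dropped; removal is simply the shortest option to show.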