==Introduction==
Model-centric AI is the paradigm taught in most ML classes and revolves around producing the best model given a clean, well-curated dataset. In contrast, data-centric AI involves systematically engineering data to build better AI systems. Data-centric AI can come in two forms:
*algorithms that understand data and use that information to improve models
*algorithms that modify data to improve ML models.
Examples of this include curriculum learning (...
Tempting as it may be, don't skip Steps 2 through 4. You can repeat Steps 3-4 multiple times to deploy the most effective ML system.
==Examples of Data-centric AI==
This field covers the following methods:
*Outlier detection and removal (handling unusual examples in the dataset).
*Error detection and correction (handling incorrect labels or values in the dataset).
*Establishing consensus (determining the truth from many crowdsourced annotations).
*Data augmentation (adding examples to the data to encode prior information).
*Feature engineering and selection (manipulating how data are represented).
*Active learning (selecting the most informative data to label next).
*Curriculum learning (ordering the examples in a dataset from easiest to most difficult).
Recent high-profile ML applications clearly show how the reliability of models deployed in the real world depends on their training data.
OpenAI has stated openly that errors in data and labels were the main problem in training well-known ML models such as DALL-E, GPT-3 and ChatGPT. These are stills from the demo of DALL-E 2.
Tesla was able to produce autonomous driving systems that are far more advanced than those of comparable competitors by using model-assisted data improvement (Step 3). The key to this success is the Data Engine. These slides are from Andrej Karpathy (Tesla Director of AI, 2021).
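Two of the methods listed in this section, establishing consensus and curriculum learning, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the helper names <code>majority_vote</code> and <code>curriculum_order</code>, the sample annotations, and the use of annotator disagreement as a stand-in for difficulty are all hypothetical and are not taken from any particular library or from the systems described above.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most common label and the fraction of annotators who chose it."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


def curriculum_order(examples, difficulty):
    """Sort examples from easiest to hardest under a caller-supplied difficulty score."""
    return sorted(examples, key=difficulty)


# Hypothetical crowdsourced labels: three annotators per image.
crowd_labels = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
    "img_3": ["cat", "dog", "bird"],
}

# Consensus label and agreement rate for every image.
consensus = {name: majority_vote(votes) for name, votes in crowd_labels.items()}

# Assumption: treat low annotator agreement as a proxy for a harder example.
ordered = curriculum_order(list(crowd_labels), lambda name: 1 - consensus[name][1])
```

Real consensus methods usually also weigh annotators by their estimated reliability, and curriculum learning typically derives difficulty from a model's loss rather than from annotator agreement; the sketch keeps both steps deliberately simple.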
==Reasons for Data-centric AI==
*Data quality issues cost the U.S. alone an estimated $3 trillion annually.
*Automated methods and systematic engineering principles are now needed to ensure that ML models are trained on clean data.
*Recent research on image classification with noisily labeled data has shown that simple methods that adaptively change the dataset can lead to more accurate models than sophisticated modeling strategies.
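To make the idea of adaptively changing the dataset concrete, the following is a minimal sketch of one simple heuristic: flag examples whose given label receives unusually low predicted probability, then train on the rest. It is not the method from the research mentioned above. The function name <code>flag_suspect_labels</code> and the toy numbers are hypothetical, and <code>pred_probs</code> is assumed to hold out-of-sample predicted class probabilities from some previously trained classifier.

```python
def flag_suspect_labels(pred_probs, labels):
    """Return indices of examples whose given label looks unlikely to the model."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(own) / len(own) if own else 0.0)

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        predicted = max(range(n_classes), key=lambda k: p[k])
        # Flag only when the model disagrees with the label AND its confidence
        # in the given label is below that class's average.
        if predicted != y and p[y] < thresholds[y]:
            suspects.append(i)
    return suspects


# Toy data (assumed): five examples, two classes.
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.95, 0.05],  # labeled 1, but the model strongly prefers class 0
]
labels = [0, 0, 1, 1, 1]

suspects = flag_suspect_labels(pred_probs, labels)
clean_indices = [i for i in range(len(labels)) if i not in suspects]
```

A model retrained on <code>clean_indices</code> only sees the examples that passed the check. In practice, flagged examples are often reviewed or relabeled rather than simply dropped.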