AI Wiki
Category: Data & Datasets

104 articles

Bucketing

Machine Learning

C4 (Colossal Clean Crawled Corpus)

Natural Language Processing

CIFAR-10

AI Benchmarks, Computer Vision

COCO dataset

Computer Vision, Machine Learning

Categorical Data

Machine Learning, Statistics

CharXiv

AI Benchmarks

Class-Imbalanced Dataset

Machine Learning

Common Corpus

Natural Language Processing, Open Source AI

Common Crawl

Machine Learning, Natural Language Processing

Common Pile

Natural Language Processing, Open Source AI

Continuous Feature

Machine Learning, Statistics

Convenience Sampling

Machine Learning, Statistics

Cosmopedia

Large Language Models, Open Source AI

Coverage Bias

AI Ethics, Machine Learning

DCLM (DataComp for Language Models)

AI Benchmarks, Natural Language Processing, Open Source AI

Data Augmentation

Deep Learning, Machine Learning

Data Provenance Initiative

Machine Learning

Data Set or Dataset

Machine Learning

Data preprocessing

Machine Learning

Data-centric AI (DCAI)

MLOps

DatologyAI

AI Companies

Dense Feature

Machine Learning

Derived label

Machine Learning

Dimension Reduction

Machine Learning

Discrete Feature

Machine Learning

Dolma

Large Language Models, Open Source AI

Downsampling

Deep Learning, Machine Learning

Ego-Exo4D

Computer Vision, Meta AI

Ego4D

Computer Vision, Meta AI

Feature

Machine Learning

Feature Cross

Machine Learning

Feature Engineering

Machine Learning

Feature Extraction

Machine Learning

Feature Selection

Algorithms, Machine Learning

Feature Set

Data Science, Machine Learning

Feature Vector

Machine Learning

FineWeb

Machine Learning, Natural Language Processing, Open Source AI

FineWeb-2

Machine Learning

FineWeb-Edu

Large Language Models, Machine Learning

Ground Truth

Machine Learning

HotpotQA

AI Benchmarks, Artificial Intelligence, Natural Language Processing

How to Prevent OpenAI and Google From Training Their LLMs on Your Website's Data

Large Language Models

Imbalanced Dataset

Machine Learning

Instance

Machine Learning

Inter-rater agreement

Model Evaluation, Statistics

Iris dataset

AI Benchmarks, Machine Learning, Statistics

LAION

Computer Vision, Machine Learning

LAION-5B

Generative AI

LVIS (Large Vocabulary Instance Segmentation)

Computer Vision

Label

Machine Learning

MMMLU

AI Benchmarks

MNIST

Computer Vision, Machine Learning

Mecka

Robotics Companies

MetaCLIP

Meta AI, Multimodal AI

MimicGen

Embodied AI, NVIDIA, Robotics

Nemotron-CC

NVIDIA, Natural Language Processing

Noise

Machine Learning

Non-Response Bias

Machine Learning, Statistics

Normalization

Machine Learning

Numerical Data

Machine Learning