See also: Machine learning terms
The input layer is the first layer in a neural network, responsible for receiving raw data and passing it into the network for processing. Every neural network architecture, from simple perceptrons to complex deep neural networks, begins with an input layer that serves as the gateway between external data and the model's internal computations.
Unlike hidden layers and the output layer, the input layer does not perform any learned computation. It applies no weights, biases, or activation functions. Instead, it acts as a conduit that distributes the feature values of each data sample to the first set of trainable neurons in the network. The number of neurons (or nodes) in the input layer is determined entirely by the dimensionality of the input data. For example, if a dataset has 10 features per sample, the input layer will have 10 neurons, one for each feature.
The design of the input layer has significant downstream effects on model performance. Choosing the correct input shape, applying appropriate preprocessing, and handling different data types properly are all considerations that influence how well a network can learn from its training data.
The concept of an input layer has evolved alongside the development of artificial neural networks over several decades.
Warren McCulloch and Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in 1943, introducing the first mathematical model of an artificial neuron. Their model accepted binary inputs (0 or 1), representing excitatory and inhibitory signals, and produced a binary output based on a threshold function. While the McCulloch-Pitts neuron did not include a learning mechanism or trainable weights, it established the foundational idea that external signals enter a computational unit through designated input connections. The architecture lacked the ability to process continuous values and had no mechanism for adjusting connection strengths during training.
Frank Rosenblatt extended the McCulloch-Pitts model by introducing the perceptron in 1957. The Mark I Perceptron machine, first publicly demonstrated on June 23, 1960, featured a three-layer architecture with a distinctive input mechanism. Its sensory input layer consisted of an array of 400 photocells arranged in a 20x20 grid, called "sensory units" (S-units) or the "input retina." Each photocell could connect to up to 40 association units in the hidden layer. Rosenblatt's key innovation was replacing the binary inputs of the McCulloch-Pitts neuron with real-valued inputs and trainable weights, along with an error-correction training algorithm. This made the input layer not just a passive receiver but the entry point for a learnable system.
As neural networks grew deeper and more complex through the 1980s and beyond, the input layer's role became more formalized. Yann LeCun's work on convolutional neural networks in 1989, applied to handwritten zip code recognition, introduced multi-channel image inputs. The rise of recurrent neural networks brought sequential input handling, and the transformer architecture introduced by Vaswani et al. in 2017 redefined how text inputs are tokenized, embedded, and enriched with positional information before entering the network.
The input layer accepts a numerical representation of the data sample and forwards it to the next layer in the network, which is typically the first hidden layer. Each neuron in the input layer corresponds to a single input value (or a single element in a multi-dimensional input) and is connected to every neuron in the subsequent layer through weighted connections.
Here is a step-by-step description of the data flow:

1. The raw data sample is converted into a numerical representation (a vector, matrix, or higher-dimensional tensor).
2. Each input neuron receives one value from this representation, unchanged.
3. Every input neuron passes its value along weighted connections to the neurons of the first hidden layer.
4. Each hidden neuron computes a weighted sum of the incoming values, adds its bias, and applies its activation function.
Mathematically, if the input layer has n neurons holding values x_1, x_2, ..., x_n, then the output of the j-th neuron in the first hidden layer is computed as:
h_j = f(w_1j * x_1 + w_2j * x_2 + ... + w_nj * x_n + b_j)
where w_ij is the weight connecting input neuron i to hidden neuron j, b_j is the bias term, and f is the activation function.
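As a concrete illustration, here is a minimal NumPy sketch of this computation; the sizes and the ReLU activation are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 10, 4                     # arbitrary example sizes
x = rng.random(n_features)                       # values held by the input layer
W = rng.standard_normal((n_features, n_hidden))  # weights w_ij
b = np.zeros(n_hidden)                           # biases b_j

relu = lambda z: np.maximum(z, 0.0)              # example activation f

# Each hidden neuron j computes f(sum_i w_ij * x_i + b_j);
# the matrix form computes all j at once.
h = relu(x @ W + b)
print(h.shape)  # (4,)
```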
Because the input layer itself does not transform the data, it is sometimes excluded from the count when people refer to the "number of layers" in a neural network. A network described as having "three layers" may actually have an input layer plus three additional layers (two hidden layers and one output layer).
One of the most important decisions when designing a neural network is specifying the shape of the input layer. The shape must match the structure and dimensionality of the data. Different data types require different input shapes.
| Data type | Typical input shape | Example | Common network type |
|---|---|---|---|
| Tabular / structured | (num_features,) | A CSV row with 13 columns: shape (13,) | Feedforward (MLP) |
| Grayscale image | (height, width, 1) | 28x28 MNIST digit: shape (28, 28, 1) | CNN |
| Color image (RGB) | (height, width, 3) | 64x64 photo: shape (64, 64, 3) | CNN |
| Time series / sequence | (timesteps, features_per_step) | 30 days of 5 stock indicators: shape (30, 5) | RNN, LSTM |
| Text (tokenized) | (sequence_length,) | Sentence of 128 tokens: shape (128,) | Transformer, embedding layer |
| Audio (spectrogram) | (time_frames, frequency_bins, 1) | Mel spectrogram: shape (128, 80, 1) | CNN, RNN |
| Point cloud (3D) | (num_points, 3) or (num_points, features) | 1024 3D points: shape (1024, 3) | GNN, PointNet |
| Graph | (num_nodes, node_features) + adjacency | Social network with 500 nodes, 16 features: shape (500, 16) | GNN |
For tabular data, the input layer is a flat vector where each element represents one feature. For images, the input layer is a three-dimensional tensor encoding height, width, and color channels. For sequential data such as time series or text, the input layer is a two-dimensional structure encoding timesteps and features at each step. For graph-structured data, the input consists of node feature matrices paired with adjacency information that describes how nodes are connected.
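To make the table concrete, here is a hedged sketch of how several of these single-sample shapes might be declared in Keras; the sizes mirror the examples above and are purely illustrative:

```python
import keras

tabular_in = keras.Input(shape=(13,))                   # CSV row with 13 columns
image_in = keras.Input(shape=(28, 28, 1))               # grayscale MNIST digit
series_in = keras.Input(shape=(30, 5))                  # 30 timesteps, 5 indicators
tokens_in = keras.Input(shape=(128,), dtype="int32")    # 128 token IDs
```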
Different neural network architectures handle input data in distinct ways. The input layer's structure, preprocessing expectations, and connection pattern to subsequent layers vary depending on the architecture.
In a standard multilayer perceptron, the input layer is a one-dimensional vector of features. Every input neuron connects to every neuron in the first hidden layer (a fully connected or dense connection pattern). The input layer size equals the number of features in the dataset. This is the simplest and most direct form of input layer.
In a convolutional neural network, the input layer preserves the spatial structure of the data. Rather than flattening an image into a one-dimensional vector, the input retains its height, width, and channel dimensions. For a color image, the input has three channels (red, green, blue). The first convolutional layer then applies learned filters (kernels) that slide across the spatial dimensions of the input, detecting local patterns such as edges and textures. Yann LeCun's LeNet-5, published in 1998, was among the first networks to demonstrate that preserving spatial input structure through convolutional layers significantly outperforms flattened input approaches for image recognition tasks.
A recurrent neural network accepts sequential input where data arrives one timestep at a time. At each timestep, a new input vector enters the network and is combined with the hidden state from the previous timestep. The input layer at each step has the same dimensionality (the number of features per timestep), but the sequence length can vary across samples. LSTM and GRU variants extend this pattern with gating mechanisms that regulate how input information flows into and out of the cell state.
The transformer architecture, introduced by Vaswani et al. in 2017, handles input through a multi-step process before it reaches the first attention layer:

1. Tokenization. The raw text is split into discrete units (tokens), and each token is mapped to an integer ID from a fixed vocabulary.
2. Embedding. Each token ID is looked up in a trainable embedding matrix, producing a dense vector of a fixed dimension (embedding_dim).
3. Positional encoding. Because self-attention has no inherent notion of order, positional information is added to each embedding, either as fixed sinusoidal values (as in the original paper) or as learned position embeddings.
The result of these three steps is a matrix of shape (sequence_length, embedding_dim) that enters the first self-attention layer. This input pipeline is more complex than that of feedforward or convolutional networks, but it allows transformers to handle variable-length sequences and capture long-range dependencies efficiently.
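A minimal sketch of the embedding and positional-encoding steps, assuming Keras 3 (for keras.ops) and using learned position embeddings; the vocabulary size, sequence length, and embedding dimension are arbitrary example values:

```python
import numpy as np
import keras
from keras import layers, ops

vocab_size, seq_len, embed_dim = 10_000, 12, 8   # small example values

token_emb = layers.Embedding(vocab_size, embed_dim)  # step 2: token lookup
pos_emb = layers.Embedding(seq_len, embed_dim)       # step 3: learned positions

# A batch of 2 tokenized sentences (random IDs stand in for real tokens)
ids = np.random.randint(0, vocab_size, size=(2, seq_len))

x = token_emb(ids) + pos_emb(ops.arange(seq_len))    # broadcasts over the batch
print(x.shape)  # (2, 12, 8): (batch, sequence_length, embedding_dim)
```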
In a graph neural network (GNN), the input layer accepts both node feature matrices and structural information about how nodes are connected. A graph G = (V, E) is represented by a node feature matrix of shape (num_nodes, num_features) and an adjacency matrix or edge list describing the connections. For point cloud data, each 3D point becomes a node, and edges are constructed using k-nearest neighbor algorithms. The first GNN layer then aggregates features from neighboring nodes, combining structural and feature information from the very first step.
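A minimal NumPy sketch of one such aggregation step, using a simplified graph-convolution update of the form H' = f(Â X W); the random graph and sizes stand in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_features, hidden = 5, 16, 8    # example sizes

X = rng.random((num_nodes, num_features))     # node feature matrix
A = (rng.random((num_nodes, num_nodes)) < 0.4).astype(float)  # adjacency
A_hat = A + np.eye(num_nodes)                 # add self-loops
A_hat /= A_hat.sum(axis=1, keepdims=True)     # row-normalize

W = rng.standard_normal((num_features, hidden))

# Each node's new features mix its own features with its neighbors'
H = np.maximum(A_hat @ X @ W, 0.0)            # shape: (5, 8)
```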
In practice, neural networks process multiple data samples at once rather than one at a time. A group of samples processed together is called a batch, and the number of samples in the group is the batch size. The batch dimension is prepended to the input shape as the first axis of the tensor.
For example, if a single image has shape (28, 28, 1) and the batch size is 32, the tensor passed to the network has shape (32, 28, 28, 1). When defining a model, most frameworks ask you to specify only the shape of a single sample and handle the batch dimension automatically.
| Single sample shape | Batched shape (batch size = 64) |
|---|---|
| (13,) | (64, 13) |
| (28, 28, 1) | (64, 28, 28, 1) |
| (30, 5) | (64, 30, 5) |
| (128,) | (64, 128) |
The batch dimension allows frameworks to take advantage of parallel computation on GPUs, dramatically speeding up both training and inference. Larger batch sizes generally improve hardware utilization but require more memory and can affect generalization behavior.
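A short NumPy sketch of how the batch axis is added, with shapes matching the table above:

```python
import numpy as np

image = np.zeros((28, 28, 1))           # one grayscale sample
batch = np.stack([image] * 32)          # stack 32 samples along a new first axis
print(batch.shape)                      # (32, 28, 28, 1)

single = np.expand_dims(image, axis=0)  # a "batch" of one: (1, 28, 28, 1)
```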
Raw data is rarely suitable for direct input into a neural network. Proper preprocessing of input data is essential for stable and efficient training.
When input features exist on vastly different scales (for example, one feature ranges from 0 to 1 while another ranges from 0 to 10,000), the gradient descent optimization process struggles to converge. The loss landscape becomes elongated and asymmetric, causing updates to oscillate. Normalization and standardization address this problem by rescaling features to comparable ranges.
| Technique | Formula | Result range | When to use |
|---|---|---|---|
| Min-max normalization | x' = (x - min) / (max - min) | [0, 1] | When a bounded range is needed |
| Z-score standardization | x' = (x - mean) / std | Centered at 0, std = 1 | General-purpose; most common |
| Max-abs scaling | x' = x / max(abs(x)) | [-1, 1] | Sparse data |
| Robust scaling | x' = (x - median) / IQR | Varies | Data with many outliers |
| Pixel scaling (images) | x' = x / 255.0 | [0, 1] | Image pixel intensities |
Z-score standardization (also called the standard scaler approach) is the most common technique. It transforms each feature to have a mean of zero and a standard deviation of one, which tends to produce more spherical contours in the objective function and helps the network converge more quickly. The choice of scaling method can also depend on the activation function used in subsequent layers; for instance, sigmoid and tanh activations work well with inputs normalized to [0, 1] or [-1, 1], while ReLU is generally less sensitive to input scale.
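A brief NumPy sketch of z-score standardization; note that the mean and standard deviation are computed on the training split only and then reused for test data, to avoid leaking information:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((100, 13)) * 10_000   # example features on a large scale
X_test = rng.random((20, 13)) * 10_000

mean = X_train.mean(axis=0)   # per-feature statistics from training data only
std = X_train.std(axis=0)

X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std   # reuse the training statistics
```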
Neural networks require numerical inputs, so non-numeric categorical variables must be converted to numbers before entering the input layer. The two most common approaches are:
One-hot encoding. Each category is represented as a binary vector of length equal to the number of unique categories. Exactly one element is set to 1, and all others are 0. For example, a "color" feature with three possible values (red, green, blue) becomes three binary features: [1, 0, 0], [0, 1, 0], or [0, 0, 1]. One-hot encoding is straightforward but creates high-dimensional, sparse vectors when the number of categories is large.
Learned embeddings. Each category is mapped to a dense, low-dimensional vector through a trainable embedding layer. The embedding values are learned during training, capturing relationships between categories. This approach is far more efficient for high-cardinality features (such as product IDs or zip codes with thousands of unique values) because it avoids the sparsity problem of one-hot encoding.
| Encoding method | Dimensionality | Handles relationships | Best for |
|---|---|---|---|
| One-hot encoding | Equals number of categories | No | Low-cardinality features (< 20 categories) |
| Label encoding | 1 | Implies ordinal relationship | Ordinal variables |
| Learned embeddings | User-defined (typically 8-256) | Yes | High-cardinality features (> 20 categories) |
| Target encoding | 1 | Partially | Supervised learning with many categories |
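A sketch of both approaches; the category names and sizes are illustrative, and the embedding example assumes Keras:

```python
import numpy as np
import keras

# One-hot: three colors become three binary features
colors = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(colors)}
one_hot = np.eye(len(colors))[index["green"]]    # [0., 1., 0.]

# Learned embedding: 10,000 product IDs -> dense 16-dim trainable vectors
product_emb = keras.layers.Embedding(input_dim=10_000, output_dim=16)
vec = product_emb(np.array([4237]))              # shape (1, 16)
```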
Missing entries in the input data must be addressed before the data reaches the input layer. Common strategies include mean/median imputation (replacing missing values with the feature's average), forward/backward fill for time series, and indicator variables that add a binary feature flagging whether the original value was missing. Some architectures support masking layers that explicitly mark missing positions so the network can learn to ignore them.
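A small NumPy sketch of mean imputation combined with an indicator variable, one of the strategies described above:

```python
import numpy as np

x = np.array([2.0, np.nan, 5.0, np.nan, 3.0])  # one feature with missing entries

missing = np.isnan(x)                           # binary "was missing" indicator
x_imputed = np.where(missing, np.nanmean(x), x) # replace NaNs with the feature mean

feature_pair = np.column_stack([x_imputed, missing.astype(float)])
# Each row now carries the (imputed) value plus a flag marking imputation.
```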
When the number of input features is very large, dimensionality reduction techniques such as PCA (Principal Component Analysis) or autoencoders can reduce the input layer size. This helps mitigate the curse of dimensionality, where high-dimensional inputs lead to sparse data, increased computational cost, and a higher risk of overfitting. However, dimensionality reduction should be applied carefully, as it can discard information that is relevant to the task.
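A brief sketch using scikit-learn's PCA to shrink the input dimensionality; the 100 → 20 reduction is an arbitrary example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 100))        # 500 samples, 100 raw features

pca = PCA(n_components=20)        # keep the 20 strongest principal components
X_reduced = pca.fit_transform(X)  # shape: (500, 20)
# The network's input layer can now have 20 neurons instead of 100.
```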
For natural language processing tasks, raw text is converted into numerical token IDs before entering the network. Common tokenization algorithms include Byte Pair Encoding (BPE), WordPiece (used by BERT), and SentencePiece. The choice of tokenizer and vocabulary size directly determines the dimensionality and granularity of the input representation.
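As an illustration, here is a sketch using the Hugging Face transformers library, assuming it is installed; the model name is just an example of a WordPiece tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

ids = tokenizer("The input layer receives raw data.")["input_ids"]
print(ids)  # integer token IDs, ready for an embedding layer
```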
Data augmentation applies random transformations to input data during training to increase the effective size and diversity of the training set. While augmentation is not performed by the input layer itself, it modifies the data that reaches the input layer and plays a significant role in improving model generalization.
Common augmentation techniques by data type:
| Data type | Augmentation techniques |
|---|---|
| Images | Random flips, rotations, crops, color jitter, scaling, Gaussian noise |
| Text | Synonym replacement, random insertion, back-translation, token masking |
| Audio | Time stretching, pitch shifting, adding background noise, SpecAugment |
| Tabular | SMOTE (for class imbalance), noise injection, feature dropout |
| Time series | Window slicing, magnitude warping, time warping |
Augmentation must be applied only to training data. Validation and test data should remain unaugmented to provide an accurate estimate of model performance. In modern deep learning frameworks, augmentation is typically implemented as a preprocessing pipeline or as Keras preprocessing layers (such as RandomFlip, RandomRotation, and RandomZoom) that operate before or within the input pipeline.
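A sketch of an image-augmentation pipeline built from the Keras preprocessing layers named above; the rotation and zoom factors are arbitrary examples:

```python
import keras

augment = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),   # up to ±10% of a full turn (±36°)
    keras.layers.RandomZoom(0.2),
])

# Active only during training; at inference these layers pass data through.
inputs = keras.Input(shape=(64, 64, 3))
x = augment(inputs)
```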
Many real-world datasets contain inputs that vary in length. Sentences have different numbers of words, audio clips have different durations, and time series have different numbers of observations. Neural networks generally require fixed-size input tensors within a single batch, so special techniques are used to handle variable-length data.
Padding is the most common approach. All sequences in a batch are extended to the same length by appending (post-padding) or prepending (pre-padding) a special value, usually zero. The padded tensor can then pass through the input layer as a regular fixed-size tensor.
For example, if one sentence has 12 tokens and another has 20 tokens, the shorter sentence is padded with 8 zeros to reach length 20.
Because padded values carry no information, a mask tensor is often provided alongside the input. The mask marks which positions contain real data and which are padding. Layers such as LSTMs and transformers use this mask to ignore padded positions during computation, preventing the padding from distorting the learned representations.
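A minimal NumPy sketch of post-padding two sequences to a shared length and deriving the corresponding mask; it assumes the conventional choice of 0 as the padding ID, not a real token:

```python
import numpy as np

seqs = [[5, 12, 7, 9, 3, 18, 2, 6, 11, 4, 8, 1],   # 12 tokens
        list(range(1, 21))]                         # 20 tokens

max_len = max(len(s) for s in seqs)
padded = np.zeros((len(seqs), max_len), dtype=int)
for i, s in enumerate(seqs):
    padded[i, :len(s)] = s        # post-padding: zeros fill the tail

mask = padded != 0                # True where positions hold real tokens
```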
Bucketing groups sequences of similar length into the same batch to minimize the amount of padding needed. Instead of padding every sequence to the global maximum length, sequences are sorted or grouped by length, and each batch is padded only to the length of its longest member. This reduces wasted computation and speeds up training.
Frameworks like PyTorch support dynamic computation graphs, which allow the network's structure to change from one input to the next. This makes it possible to process variable-length inputs without padding by constructing the graph on the fly for each sample. However, batching remains challenging with this approach, and padding is still preferred in most production settings.
Some tasks require a network to accept data from multiple sources or in multiple formats simultaneously. A multi-input model uses more than one input layer, each designed for a different data stream.
In a multi-input architecture, each branch processes its input independently through its own initial layers. The outputs of these branches are then combined into a shared layer that feeds into the rest of the network. Three common fusion strategies exist:
| Fusion strategy | Description | When to use |
|---|---|---|
| Early fusion | Raw features from all modalities are concatenated at the input level | Modalities with similar structure and scale |
| Intermediate fusion | Each modality is processed by its own encoder first, then representations are combined | Most common approach; modalities have different structures |
| Late fusion | Each modality is processed through its own full subnetwork, and predictions are combined at the output | When modalities are largely independent |
Both TensorFlow/Keras (using the Functional API) and PyTorch support multi-input models. Cross-attention mechanisms, as used in models like Flamingo, allow one modality's input to attend to another modality's representations, enabling richer interaction between input streams.
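A sketch of intermediate fusion with the Keras Functional API, combining a tabular branch and an image branch; all layer sizes are illustrative:

```python
import keras
from keras import layers

tabular_in = keras.Input(shape=(13,), name="tabular")
image_in = keras.Input(shape=(28, 28, 1), name="image")

# Each modality is processed by its own encoder first
t = layers.Dense(32, activation="relu")(tabular_in)
i = layers.Conv2D(16, 3, activation="relu")(image_in)
i = layers.GlobalAveragePooling2D()(i)

merged = layers.Concatenate()([t, i])      # intermediate fusion
output = layers.Dense(1)(merged)

model = keras.Model(inputs=[tabular_in, image_in], outputs=output)
```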
The two most popular deep learning frameworks handle the input layer differently in their APIs.
In Keras, the input layer is an explicit object created using keras.Input(). You specify the shape of a single sample (excluding the batch dimension), and Keras builds a symbolic tensor that defines the model's entry point.
```python
import keras

# Explicit input layer for tabular data with 13 features
inputs = keras.Input(shape=(13,), dtype="float32", name="tabular_input")
x = keras.layers.Dense(64, activation="relu")(inputs)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
```
Key parameters of keras.Input() include:
| Parameter | Description |
|---|---|
| shape | Shape tuple for one sample, e.g., (13,) or (28, 28, 1) |
| dtype | Data type of the input tensor (default: "float32") |
| name | Optional string identifier |
| batch_size | Optionally fix the batch size |
PyTorch does not have a dedicated input layer class. Instead, the first layer of the network (such as nn.Linear or nn.Conv2d) implicitly defines the expected input shape through its in_features or in_channels parameter.
```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The first Linear layer implicitly defines the input: 13 features
        self.fc1 = nn.Linear(in_features=13, out_features=64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(in_features=64, out_features=1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
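Continuing from the class above, a short usage check (the random batch is just for illustration): a tensor whose last dimension is 13 passes through, while any other width raises a shape error in the first layer.

```python
model = SimpleNet()
batch = torch.randn(32, 13)    # batch of 32 samples, 13 features each
out = model(batch)
print(out.shape)               # torch.Size([32, 1])
```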
A key difference is that PyTorch requires you to specify both in_features and out_features for each linear layer, while Keras only requires the output units and infers the input size from the Input layer or the preceding layer.
| Aspect | TensorFlow / Keras | PyTorch |
|---|---|---|
| Input layer | Explicit keras.Input() object | Implicit; defined by first layer's in_features |
| Shape specification | shape=(13,) on Input | in_features=13 on nn.Linear |
| Activation functions | Can be included in layer definition | Must be applied separately |
| Batch dimension | Handled automatically | Handled automatically |
| Multi-input support | Functional API with multiple Input() objects | Custom forward() accepting multiple arguments |
| Shape inference | model.summary() shows all shapes | torchinfo.summary() (third-party) |
Dropout is a regularization technique that randomly sets a fraction of neuron outputs to zero during training, preventing the network from relying too heavily on any single feature. While dropout is most commonly applied to hidden layers, it can also be applied at the input layer to simulate missing features and reduce overfitting.
Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014) found that the optimal dropout rate for the input layer is typically lower than for hidden layers. For input neurons, a retention probability of around 0.8 (dropping 20% of inputs) often works well, compared to the standard 0.5 retention used in hidden layers. Input dropout effectively acts as a form of data augmentation by presenting slightly different views of the input to the network during each training step.
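A sketch of input-level dropout in Keras; note that Keras specifies the fraction dropped, so a retention probability of 0.8 corresponds to Dropout(0.2):

```python
import keras
from keras import layers

inputs = keras.Input(shape=(13,))
x = layers.Dropout(0.2)(inputs)            # drop 20% of inputs (training only)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.5)(x)                 # the usual, higher rate for hidden layers
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```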
The input layer typically does not use an activation function. Its purpose is to pass raw data forward without transformation. Any activation applied at this stage would distort the original feature values before the network has a chance to learn from them.
In some specialized architectures, a normalization layer (such as batch normalization) may be placed immediately after the input layer to standardize activations before they reach the first hidden layer. Batch normalization, introduced by Ioffe and Szegedy in 2015, normalizes inputs to each layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. When placed right after the input, it can serve as a learned form of input standardization, eliminating the need for manual feature scaling in the preprocessing pipeline. However, this normalization layer is technically separate from the input layer itself.
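A sketch of this pattern, placing a normalization layer directly after the input; this is a design choice, not a requirement, and the layer sizes are illustrative:

```python
import keras
from keras import layers

inputs = keras.Input(shape=(13,))
x = layers.BatchNormalization()(inputs)    # learned input standardization
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```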
Incorrect input data is one of the most common sources of errors in deep learning. Input validation catches problems early, before they propagate through the network and produce confusing error messages or silent failures.
| Error type | Example | Consequence |
|---|---|---|
| Shape mismatch | Feeding a (32, 10) tensor to a layer expecting (32, 13) | ValueError / RuntimeError |
| Wrong data type | Passing integer data when float32 is expected | Silent precision loss or error |
| Unnormalized data | Features with values in the millions | Exploding gradients, failed training |
| Missing batch dimension | Feeding shape (13,) instead of (1, 13) | Dimension error |
| NaN or Inf values | Corrupted data or failed preprocessing | NaN loss, training collapse |
| Incorrect channel order | Feeding channels-first data to a channels-last model | Silently incorrect results |
| Data leakage | Including target variable in input features | Inflated training metrics, poor generalization |
Simple defensive checks catch most of these problems. A plain Python assertion (for example, assert x.shape[1] == 13) verifies the feature dimension before a forward pass; TensorFlow offers tf.ensure_shape() to enforce tensor shapes at runtime, and PyTorch users can inspect tensor.shape and compare it against the expected value. Calling model.summary() in Keras or torchinfo.summary() in PyTorch shows the expected input and output shapes of every layer.

The number of features in the input layer directly affects a network's capacity, training time, and risk of overfitting. Including irrelevant or redundant features increases the dimensionality of the input, which can degrade model performance through the curse of dimensionality.
The curse of dimensionality refers to the phenomenon where the volume of the input space grows so fast with increasing dimensions that the available data becomes sparse. In high-dimensional spaces, distance metrics become less meaningful, and models need exponentially more data to maintain the same level of statistical significance.
Feature selection methods help identify which input features carry useful information:
| Method type | Examples | Description |
|---|---|---|
| Filter methods | Correlation analysis, mutual information, chi-squared test | Evaluate features independently of the model |
| Wrapper methods | Recursive feature elimination, forward/backward selection | Train the model repeatedly with different feature subsets |
| Embedded methods | L1 regularization (Lasso), tree-based feature importance | Feature selection happens as part of model training |
| Neural methods | Learned input gates, attention-based selection | The network learns which inputs to focus on |
Deep neural networks can perform implicit feature selection through their learned representations. Lower layers learn to detect basic patterns, while deeper layers identify more complex structures. However, starting with a carefully selected set of input features still improves training efficiency and can prevent the network from fitting to noise in irrelevant dimensions.
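As a brief filter-method sketch, here is scikit-learn's mutual information scorer used to keep the most informative features; the 20 → 8 reduction and the synthetic target are arbitrary examples:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 20))                    # 20 candidate input features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)    # target depends on features 0 and 3

selector = SelectKBest(mutual_info_classif, k=8)
X_selected = selector.fit_transform(X, y)    # shape: (200, 8)
# The input layer can now be sized to 8 neurons.
```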
The input layer is the starting point of a neural network: it receives raw input data and passes it unchanged to the next layer, where the actual processing begins.

The input layer acts as the gateway between external data and the model's internal computations. Its task is to present the model with all of the information it needs to make accurate predictions, in a form the network can learn from.

Several design choices at the input layer directly affect model quality:

- Matching the input shape to the structure of the data (flat vectors for tabular data, spatial tensors for images, sequences for time series and text).
- Normalizing or standardizing features so they share comparable scales.
- Encoding categorical variables appropriately (one-hot encoding for low-cardinality features, learned embeddings for high-cardinality ones).
- Handling missing values and variable-length inputs before they reach the network.
- Selecting informative features to keep the input dimensionality manageable.
Imagine you have a toy sorting machine. Before the machine can sort your toys, you need to place them on a special tray at the front. That tray is like the input layer. It does not sort the toys itself; it just holds them so the machine can see what it needs to work with.
If you are sorting toy cars, each slot on the tray holds one car. If you are sorting building blocks, each slot holds one block. The number of slots matches whatever you are putting in. Once the toys are on the tray, the machine takes them inside and starts figuring out how to group them. The tray is always the first step, and without it, the machine has nothing to work with.
Now, sometimes the toys come in all different sizes. Some are tiny and some are huge. If you just dump them on the tray like that, the machine gets confused because the big toys take up so much attention. So first, you resize all the toys to be about the same size. That is what "normalizing" the input means: making all the values roughly the same scale so the machine can pay fair attention to each one.