Vector embeddings

Despite these drawbacks, [[transformer models]] have been incredibly successful, leading to a proliferation of text vectorizer models for various data types such as audio, video, and images. Some models, like [[CLIP]], can vectorize multiple data types (e.g., images and text) into a single vector space, enabling content-based image searches using only text.
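As a rough sketch, multi-modal search with CLIP might look like the following, using the Hugging Face <code>transformers</code> implementation (the checkpoint name and <code>photo.jpg</code> are assumptions for the example):

<syntaxhighlight lang="python">
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical image to index
inputs = processor(text=["a photo of a dog"], images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# image_embeds and text_embeds live in the same vector space, so their
# similarity can rank images against a free-text query.
print(outputs.image_embeds.shape, outputs.text_embeds.shape)
</syntaxhighlight>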


==Creating Vector Embeddings for Non-text Media==
In addition to text, vector embeddings can be created for various types of data, such as [[images]] and [[audio]] recordings. Images can be embedded using [[convolutional neural network]]s ([[CNN]]s) such as [[VGG]] and [[Inception]], while audio recordings can be embedded by applying image embedding techniques to visual representations of the audio, such as [[spectrograms]].
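For instance, here is a minimal sketch of the audio-to-spectrogram step, assuming the librosa library is installed and using a hypothetical local file <code>recording.wav</code>; the resulting 2-D array can be treated like a greyscale image:

<syntaxhighlight lang="python">
import librosa
import numpy as np

# Load the waveform at its native sampling rate ("recording.wav" is a
# hypothetical example file).
waveform, sample_rate = librosa.load("recording.wav", sr=None)

# Compute a mel spectrogram: a 2-D time-frequency representation of the audio.
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)

# Convert power values to decibels, a common preprocessing step before
# feeding the spectrogram to an image embedding model.
spectrogram_db = librosa.power_to_db(spectrogram, ref=np.max)

print(spectrogram_db.shape)  # (n_mels, n_frames), image-like
</syntaxhighlight>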


===Example: Image Embedding with a Convolutional Neural Network===
In this example, raw images are represented as greyscale pixels, which correspond to a matrix of integer values ranging from 0 to 255, where 0 signifies black and 255 represents white. The matrix values define a vector embedding, with the first coordinate being the matrix's upper-left cell and the last coordinate corresponding to the lower-right matrix cell.
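The following is a small NumPy sketch of this raw pixel embedding; the 4×4 image is a toy example invented for illustration:

<syntaxhighlight lang="python">
import numpy as np

# A toy 4x4 greyscale image: integers in [0, 255], 0 = black, 255 = white.
image = np.array([
    [  0,  40,  80, 120],
    [ 40,  80, 120, 160],
    [ 80, 120, 160, 200],
    [120, 160, 200, 255],
], dtype=np.uint8)

# Flatten row by row: the first coordinate is the upper-left cell and the
# last coordinate is the lower-right cell.
embedding = image.flatten()
print(embedding.shape)              # (16,)
print(embedding[0], embedding[-1])  # 0 255
</syntaxhighlight>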


While such embeddings effectively maintain the semantic information of a pixel's neighborhood in an image, they are highly sensitive to transformations like [[shifts]], [[scaling]], [[cropping]], and other [[image manipulation]] operations. Consequently, they are often used as raw inputs to learn more robust embeddings.
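This sensitivity is easy to demonstrate: shifting an image by a single pixel produces a large distance between the raw embeddings, even though the two images look nearly identical. A small NumPy demonstration with random toy data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8))

# Shift the image one pixel to the right (columns wrap around).
shifted = np.roll(image, shift=1, axis=1)

# The Euclidean distance between the flattened embeddings is large even
# though the underlying images differ only by a one-pixel shift.
distance = np.linalg.norm(image.flatten() - shifted.flatten())
print(distance)
</syntaxhighlight>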


A Convolutional Neural Network (CNN or ConvNet) is a class of deep learning architectures typically applied to visual data that transforms images into embeddings. CNNs process their input hierarchically through small local sub-inputs known as receptive fields: each neuron in a layer processes a specific receptive field from the previous layer, and each layer either applies a convolution to its receptive fields or reduces the input size through a process called subsampling.


A typical CNN structure uses receptive fields, sub-squares of each layer that serve as input to a single [[neuron]] in the next layer. Subsampling operations reduce layer size, while convolution operations expand layer size. The resulting vector embedding is produced by a final fully connected layer.
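A minimal sketch of such an embedding model follows, written in PyTorch as an assumption (the article does not prescribe a framework); the 28×28 greyscale input and layer sizes are illustrative choices:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class CnnEmbedder(nn.Module):
    """Transforms a greyscale image into a fixed-size vector embedding."""

    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            # Convolutions apply learned filters over local receptive fields
            # and expand the number of feature maps.
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # subsampling: 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # subsampling: 14x14 -> 7x7
        )
        # A fully connected layer produces the final vector embedding.
        self.fc = nn.Linear(32 * 7 * 7, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)

model = CnnEmbedder()
batch = torch.randn(4, 1, 28, 28)  # four 28x28 greyscale images
print(model(batch).shape)          # torch.Size([4, 64])
</syntaxhighlight>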


Learning the network [[weights]] (i.e., the embedding model) requires a large set of [[label]]ed images. The weights are optimized so that images with the same label have closer embeddings than images with different labels. Once the CNN embedding model is learned, images can be transformed into vectors and stored in a [[K-Nearest-Neighbor]] index. A new, unseen image can then be transformed with the CNN model, its k most similar vectors retrieved, and the corresponding similar images identified.
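A hedged sketch of this pipeline is shown below, using PyTorch's <code>TripletMarginLoss</code> to pull same-label embeddings together and push different-label embeddings apart, and scikit-learn's <code>NearestNeighbors</code> as the index; the model, data, and sizes are toy stand-ins:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors

# Toy embedding model standing in for a trained CNN.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 14 * 14, 64),
)
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: anchor and positive share a label, negative does not
# (random tensors stand in for real labeled images here).
anchor, positive, negative = (torch.randn(8, 1, 28, 28) for _ in range(3))
loss = loss_fn(model(anchor), model(positive), model(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Embed a gallery of images and store the vectors in a k-NN index.
with torch.no_grad():
    gallery = model(torch.randn(100, 1, 28, 28)).numpy()
index = NearestNeighbors(n_neighbors=5).fit(gallery)

# Transform a new, unseen image and retrieve its 5 most similar vectors.
with torch.no_grad():
    query = model(torch.randn(1, 1, 28, 28)).numpy()
distances, neighbor_ids = index.kneighbors(query)
print(neighbor_ids)  # indices of the most similar gallery images
</syntaxhighlight>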


Although this example focuses on images and CNNs, vector embeddings can be created for various types of data, and multiple models or methods can be employed to generate them.