Building a Powerful Recommendation System with TensorFlow and BigQuery ML in Almost 5 Minutes
In the world of natural language processing (NLP), one of the fundamental tasks is to measure the similarity between pieces of text. Whether it’s finding similar product names, recommending related articles, or matching user queries, text similarity search plays a crucial role in various applications. In this article, we will explore how to create a simple text similarity search system using a pre-trained TensorFlow text encoder model and BigQuery ML.
Introduction to Text Embeddings
Text embeddings are dense vector representations of words or sentences that capture the semantic relationships between them: words or sentences with similar meanings are mapped to vectors that lie close to each other in a high-dimensional space. These embeddings are learned by powerful language models, such as the Universal Sentence Encoder (USE) by Google, which is pre-trained on a large corpus of text data.
The Universal Sentence Encoder encodes sentences into fixed-length vectors, making it an ideal choice for our text similarity search. TensorFlow provides a pre-trained version of the model on TensorFlow Hub, which we will use in this article.
Importing your ML model into BigQuery
In this section, we will walk through the process of downloading a SavedModel from the TensorFlow Hub and importing it into BigQuery for further use.
Download the Universal Sentence Encoder SavedModel using the “wget” command-line tool and save it to the desired location:
wget \
-O universal-sentence-encoder-multilingual-large-v3.tar.gz \
"https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3?tf-hub-format=compressed"
Extract the downloaded model files:
mkdir -p universal-sentence-encoder-multilingual-large-v3
tar -xf \
universal-sentence-encoder-multilingual-large-v3.tar.gz \
-C universal-sentence-encoder-multilingual-large-v3
Upload the model files to your Google Cloud Storage (GCS) bucket using the `gsutil` command-line tool:
gsutil cp \
universal-sentence-encoder-multilingual-large-v3/* \
gs://shikanime-studio-labs/universal-sentence-encoder-multilingual-large-v3
Now, with the Universal Sentence Encoder SavedModel available in your GCS bucket, you can proceed to import it into BigQuery:
CREATE OR REPLACE MODEL search.universal_sentence_encoder_large
OPTIONS(
model_type='tensorflow',
model_path='gs://shikanime-studio-labs/universal-sentence-encoder-multilingual-large-v3/*'
)
If everything is set up correctly, you should see information about the model’s details and I/O schema.
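You can also run a quick smoke test directly from SQL to confirm that the model produces embeddings. The inputs and outputs column names below match the model's serving signature and are the same names we will rely on later in this article:
SELECT
  outputs AS embeddings
FROM
  ML.PREDICT(
    MODEL search.universal_sentence_encoder_large,
    (SELECT 'hello world' AS inputs) )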
In this section, we downloaded a TensorFlow SavedModel from TensorFlow Hub and imported it into BigQuery. With the Universal Sentence Encoder available as a BigQuery model, we can now generate text embeddings directly within BigQuery and move on to building our simple text similarity search system.
Creating the Text Embeddings in BigQuery
For demonstration purposes, we will use the publicly available BBC News dataset hosted in BigQuery.
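You can preview the titles we are about to encode; the dataset is public, so this query runs as-is:
SELECT
  title
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 5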
CREATE OR REPLACE TABLE search.text_embeddings AS (
WITH samples AS (
SELECT
ROW_NUMBER() OVER (ORDER BY title) AS id,
TRIM(LOWER(title)) AS inputs
FROM
`bigquery-public-data.bbc_news.fulltext` )
SELECT
id,
inputs AS text,
outputs AS embeddings
FROM
ML.PREDICT(
MODEL search.universal_sentence_encoder_large,
TABLE samples )
ORDER BY
id )
This query trims and lowercases each article title from the bigquery-public-data.bbc_news.fulltext table, assigns it a numeric ID, and passes it through the imported model with ML.PREDICT. The resulting embeddings are saved into a new table called search.text_embeddings.
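Before moving on, it is worth a quick sanity check that every title received an embedding and that the vectors have the expected dimensionality (the Universal Sentence Encoder produces 512-dimensional embeddings):
SELECT
  COUNT(*) AS embedded_titles,
  ANY_VALUE(ARRAY_LENGTH(embeddings)) AS embedding_dimensions
FROM
  `search.text_embeddings`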
Calculating Text Similarities
Once we have the text embeddings ready, we can proceed to calculate how similar different texts are. We will use BigQuery's ML.DISTANCE function with the COSINE option, which returns the cosine distance between two vectors: 1 minus the cosine of the angle between them. The closer the cosine distance is to 0, the more similar the texts are.
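As a minimal illustration (not part of the pipeline itself), you can call ML.DISTANCE on hand-written vectors; identical vectors should give a cosine distance of 0, while orthogonal vectors should give 1:
SELECT
  ML.DISTANCE([1.0, 0.0], [1.0, 0.0], "COSINE") AS identical_vectors,
  ML.DISTANCE([1.0, 0.0], [0.0, 1.0], "COSINE") AS orthogonal_vectors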
CREATE OR REPLACE TABLE search.text_similarities AS (
WITH pairs AS (
SELECT
a.id AS id,
a.embeddings AS embeddings,
b.id AS candidate_id,
b.embeddings AS candidate_embeddings
FROM
`search.text_embeddings` a
INNER JOIN
`search.text_embeddings` b
ON
a.id < b.id )
SELECT
id,
candidate_id,
ML.DISTANCE(
embeddings,
candidate_embeddings,
"COSINE"
) AS cosine_distance
FROM
pairs )
In the SQL script above, we use a self-join to create pairs of text embeddings; the a.id < b.id condition ensures that each unique pair is computed only once, avoiding redundant work. The cosine distance for each pair is then calculated with the ML.DISTANCE function, and the results are stored in the search.text_similarities table.
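Since the join keeps each unordered pair exactly once, the similarities table should contain n * (n - 1) / 2 rows for n embedded titles, which you can verify with a quick count:
SELECT
  (SELECT COUNT(*) FROM `search.text_embeddings`) AS titles,
  COUNT(*) AS pairs
FROM
  `search.text_similarities`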
Retrieving Top Similarities
Finally, we can retrieve the most similar texts for each entry in the dataset. We group the results by the original text's ID and use the "ARRAY_AGG" function to collect the five closest candidate IDs along with their cosine distances, ordered so that the best match (the smallest distance) comes first.
CREATE OR REPLACE TABLE search.text_tops AS (
SELECT
id,
ARRAY_AGG(
STRUCT(candidate_id, cosine_distance)
ORDER BY
cosine_distance
LIMIT
5
) AS candidate_ids
FROM
`search.text_similarities`
GROUP BY
1 )
The search.text_tops table contains, for each original text ID, the IDs of its most similar texts together with their cosine distances, which can be used to rank and display the most relevant matches.
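For presentation, you can join the top candidates back to their original titles. The query below is just one possible way to do this; serializing its rows per title produces results like the example that follows:
SELECT
  q.text AS query_text,
  c.text AS candidate_text,
  candidate.cosine_distance
FROM
  `search.text_tops` t
JOIN
  `search.text_embeddings` q
ON
  q.id = t.id
CROSS JOIN
  UNNEST(t.candidate_ids) AS candidate WITH OFFSET AS candidate_rank
JOIN
  `search.text_embeddings` c
ON
  c.id = candidate.candidate_id
WHERE
  q.text = "eu 'too slow' on economic reforms"
ORDER BY
  candidate_rank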
{
  "query": "eu 'too slow' on economic reforms",
  "results": [
    {
      "text": "palestinian economy in decline",
      "cosine_distance": "0.59719177005056823"
    },
    {
      "text": "us economy still growing says fed",
      "cosine_distance": "0.58802716863231685"
    },
    {
      "text": "economy 'stronger than forecast'",
      "cosine_distance": "0.58484919826101955"
    },
    {
      "text": "japan narrowly escapes recession",
      "cosine_distance": "0.58384371093131937"
    },
    {
      "text": "sluggish economy hits german jobs",
      "cosine_distance": "0.57841135349034722"
    }
  ]
}
Conclusion
In this article, we explored how to build a simple text similarity search system using a pre-trained TensorFlow text encoder model, specifically the Universal Sentence Encoder. By calculating the cosine similarity between text embeddings, we can efficiently find similar texts for various applications, such as product recommendations, content recommendations, or search query expansion.
Text similarity search is a powerful technique that opens the door to a wide range of natural language processing applications. As you continue your NLP journey, you can further enhance the system with more advanced models, fine-tuning techniques, and optimizations to handle larger datasets and achieve even better results.