Building a Text Similarity Search App with Flask, BERT, and FAISS | Step-by-Step Tutorial
Webdock - Fast Cloud VPS Linux Hosting
Introduction
Let’s see how to create a text similarity search web app using Flask, a pre-trained BERT model, and FAISS for efficient similarity search!
Section 1: Importing Libraries
“First, we need to import all the necessary libraries. We’ll be using Flask to create our web application, Hugging Face’s transformers for the BERT model and tokenizer, NumPy for numerical operations, FAISS for similarity search, and PyTorch for handling our model computations.”
Code:
from flask import Flask, request, jsonify, render_template_string
from transformers import AutoTokenizer, AutoModel
import numpy as np
import faiss
import torch
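Before going further, it can help to confirm that these imports will resolve. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is our own, and the remark that faiss is usually installed from the faiss-cpu package is an assumption about your environment.

```python
from importlib.util import find_spec

# Import names used in this tutorial. The pip package can differ from the
# import name (assumption: faiss is typically installed as faiss-cpu).
REQUIRED_MODULES = ["flask", "transformers", "numpy", "faiss", "torch"]


def missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If anything is reported missing, install it before running the rest of the code.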
Section 2: Initializing Flask and Model
“Next, we initialize our Flask app and set up our BERT model and tokenizer. We’re using ‘bert-base-uncased’ for this example.”
Code:
app = Flask(__name__)
# Initialize the model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
Section 3: Creating the Embedding Function
“Here, we define a function, get_embedding, that takes a text string and returns its embedding. We tokenize the input text and pass it through the BERT model, then average the last hidden state over the token dimension to get a single embedding vector.”
Code:
def get_embedding(text: str) -> np.ndarray:
    """Generate embedding for a given text using a pre-trained model."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()
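Because get_embedding encodes one text at a time, no padding tokens are added, so a plain mean is fine here. If you later encode several texts in one batch, the shorter ones are padded, and a plain mean would include those padding positions. The sketch below shows one common way around that, mean pooling weighted by the attention mask. It uses small NumPy arrays as stand-ins for the model's last hidden state and the tokenizer's attention mask; the function name masked_mean_pool is hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden_states: array of shape (batch, tokens, dim)
    attention_mask: array of shape (batch, tokens) with 1 for real tokens
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for a sequence with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy batch: one sequence with two real tokens and one padding position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)   # averages only the first two tokens
plain = hidden.mean(axis=1)               # also averages the padding position
```

In the toy batch, the padding vector pulls the plain mean far away from the masked mean, which is the effect the mask is meant to prevent.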
Section 4: Preparing Sample Texts and FAISS Index
“We have some sample texts for demonstration. We generate embeddings for these texts and add them to a FAISS index. We use IndexFlatL2, which compares the query against every stored vector and ranks the matches by Euclidean (L2) distance.”
Code:
# Sample texts for demonstration
sample_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming industries.",
    "FAISS is a library for efficient similarity search.",
    "Natural language processing enables machines to understand text.",
    "Transformers models are state-of-the-art for NLP tasks."
]
embeddings = np.vstack([get_embedding(text) for text in sample_texts])
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
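To make it clearer what a flat L2 index does, here is a rough NumPy sketch of the same idea: compute the squared Euclidean distance from each query to every stored vector and keep the k smallest. This is an illustration, not FAISS's implementation; the function name flat_l2_search and the toy vectors are made up for the example. Note that IndexFlatL2 also reports squared distances rather than their square roots.

```python
import numpy as np


def flat_l2_search(database, queries, k):
    """Exact k-nearest-neighbor search by squared Euclidean distance.

    Returns (distances, indices), each of shape (num_queries, k),
    with the nearest neighbor first.
    """
    # Pairwise differences, shape (num_queries, num_database, dim).
    diffs = queries[:, None, :] - database[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Positions of the k smallest distances for each query, ascending.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order


# Toy 2-D vectors standing in for the 768-dimensional BERT embeddings.
database = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([[0.9, 0.0]], dtype=np.float32)

distances, indices = flat_l2_search(database, query, k=2)
```

The return value mirrors the shape of what index.search gives back: one row per query, with distances and the positions of the matching vectors.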
Section 5: Setting Up Flask Routes
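“Now, we set up the Flask routes. We have a home route that handles both GET and POST requests. For POST requests, it takes the user input, generates its embedding, and searches the FAISS index for the three most similar texts; a smaller distance means a closer match. The route refers to HTML_TEMPLATE, which we define in the next section.”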
Code:
@app.route('/', methods=['GET', 'POST'])
def home():
    if request.method == 'POST':
        query_text = request.form.get('text')
        if not query_text:
            return render_template_string(HTML_TEMPLATE, results=[], error="No text provided")
        query_embedding = get_embedding(query_text)
        k = 3  # Number of nearest neighbors
        distances, indices = index.search(query_embedding, k)
        results = [{"text": sample_texts[idx], "distance": float(dist)} for idx, dist in zip(indices[0], distances[0])]
        return render_template_string(HTML_TEMPLATE, results=results, error=None)
    return render_template_string(HTML_TEMPLATE, results=[], error=None)
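One detail worth guarding against: when an index holds fewer than k vectors, FAISS fills the unused result slots with an index of -1. With five sample texts and k = 3 that never happens, but if the corpus shrinks, sample_texts[-1] would silently return the last text instead of failing. A small helper along these lines could replace the list comprehension in the route; the name build_results is hypothetical.

```python
def build_results(texts, distances, indices):
    """Pair each returned index with its text, skipping -1 placeholder slots."""
    results = []
    for idx, dist in zip(indices, distances):
        if idx < 0:
            # No neighbor was found for this slot.
            continue
        results.append({"text": texts[int(idx)], "distance": float(dist)})
    return results


# Toy data: two real neighbors followed by one empty slot.
example = build_results(["first text", "second text"], [0.0, 1.5, 9.9], [1, 0, -1])
```

Inside the route, this would be called as build_results(sample_texts, distances[0], indices[0]).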
Section 6: HTML Template for Rendering the Web Page
“Here’s our HTML template. It uses Bootstrap for styling. The template includes a form for user input and displays the results of the similarity search.”
Code:
HTML_TEMPLATE = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Text Similarity Search</title>
    <!-- Bootstrap CSS CDN -->
    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
    <div class="container mt-4">
        <h1 class="mb-4">Text Similarity Search</h1>
        <form method="post" class="mb-4">
            <div class="form-group">
                <label for="text">Enter text:</label>
                <input type="text" id="text" name="text" class="form-control" required>
            </div>
            <button type="submit" class="btn btn-primary">Search</button>
        </form>
        {% if error %}
        <div class="alert alert-danger" role="alert">
            {{ error }}
        </div>
        {% endif %}
        {% if results %}
        <h2>Results:</h2>
        <ul class="list-group">
            {% for result in results %}
            <li class="list-group-item"><strong>{{ result.text }}</strong> - Distance: {{ result.distance }}</li>
            {% endfor %}
        </ul>
        {% endif %}
    </div>
    <!-- Bootstrap JS and dependencies -->
    <script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.10.2/dist/umd/popper.min.js" integrity="sha384-7+zCNj/IqJ95wo16oMtfsKbZ9ccEh31eOz1HGyDuCQ6wgnyJNSYdrPa03rtR1zdB" crossorigin="anonymous"></script>
    <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.min.js" integrity="sha384-QJHtvGhmr9XOIpI6YVutG+2QOK9T+ZnN4kzFN1RtK3zEFEIsxhlmWl5/YESvpZ13" crossorigin="anonymous"></script>
</body>
</html>
"""
Section 7: Running the Flask App
“Finally, we run the Flask app. This starts the server on host 0.0.0.0, which makes it listen on all network interfaces, and port 8000.”
Code:
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
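Since 0.0.0.0 exposes the app to other machines on the network, you may prefer to make the host and port configurable and default to localhost during development. Here is a minimal sketch; the environment variable names APP_HOST and APP_PORT and the helper server_config are our own inventions, not something Flask defines.

```python
import os


def server_config(env=None):
    """Read host and port from a mapping of environment variables.

    APP_HOST and APP_PORT are hypothetical names chosen for this example.
    Defaults to localhost so the app is not exposed unless you opt in.
    """
    env = os.environ if env is None else env
    host = env.get("APP_HOST", "127.0.0.1")
    port = int(env.get("APP_PORT", "8000"))
    return host, port


host, port = server_config({"APP_HOST": "0.0.0.0", "APP_PORT": "8000"})
```

You would then start the server with app.run(host=host, port=port), passing no argument to server_config so that it reads the real environment.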
Conclusion
“And that’s it! We’ve built a simple text similarity search web app using Flask, BERT, and FAISS. I hope you found this tutorial helpful.”