Overview
This article provides a detailed walkthrough of setting up the Elasticsearch database for an open-domain question answering system. The objective is to build a comprehensive system with three major components: Database, Retriever, and Reader. In this particular notebook, we will focus on the initial step, which involves configuring the database using Elasticsearch.
Elasticsearch install
In this section, we will configure Elasticsearch for our open-domain question answering system. We will utilize Elasticsearch 7.17, which can be conveniently deployed using Docker. Below are the steps to pull the Elasticsearch image and start a single-node cluster:
Pulling the Elasticsearch Image
Obtaining Elasticsearch for Docker is straightforward. Simply issue the following command to pull the image from the Elastic Docker registry:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.17.18
Starting a Single Node Cluster
To start a single-node Elasticsearch cluster suitable for development or testing purposes, we need to specify single-node discovery to bypass bootstrap checks. Execute the following command:
docker run -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.18
This command binds the Elasticsearch ports to localhost, allowing access to the cluster. The -e flag is used to set discovery.type to single-node, ensuring that Elasticsearch operates in single-node mode.
With Elasticsearch up and running in Docker, we can proceed to set up our database for the question answering system.
To try the Elasticsearch instance, open the URL localhost:9200 in your browser.
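You can run the same check from Python. Here is a minimal sketch using the requests library; the root endpoint returns basic cluster information such as the cluster name and version (the exact fields depend on your Elasticsearch version):
import requests
# Query the root endpoint of the local Elasticsearch instance
info = requests.get('http://localhost:9200').json()
print(info)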
PDF File Import and Preprocessing
We begin by importing the paper PDF file titled "Error Analysis of Pretrained Language Models (PLMs) in English-to-Arabic Machine Translation". The PyPDF2 library is employed to extract text from each page of the PDF. The text is then preprocessed, addressing issues such as unwanted characters and line breaks.
Install PyPDF2
!pip install PyPDF2
PDF Preprocessing
from PyPDF2 import PdfReader
# Open the PDF file
pdf_reader = PdfReader('./data/Error-Analysis-PLMs.pdf')
# Preprocess the text
def preprocess_text(text):
    """
    Preprocesses the text by replacing unwanted text with desired replacements.

    Args:
    - text (str): The input text to be preprocessed.

    Returns:
    - str: The preprocessed text.
    """
    replacements = {'\xa0': ' ', '\n': ' '}
    for old_text, new_text in replacements.items():
        text = text.replace(old_text, new_text)
    return text
# Extract the text from all pages
pages_text = [preprocess_text(page.extract_text()) for page in pdf_reader.pages]
This code snippet demonstrates the following steps:
- Importing the PdfReader class from the PyPDF2 library.
- Opening the PDF file titled "Error Analysis of Pretrained Language Models (PLMs) in English-to-Arabic Machine Translation" located at the specified file path.
- Defining a preprocess_text function to preprocess the extracted text by replacing unwanted characters such as \xa0 and newline characters (\n) with appropriate replacements (e.g., a space).
- Extracting text from each page of the PDF using a list comprehension and preprocessing it with the preprocess_text function; see the short example after this list. The preprocessed text is stored in the pages_text list.
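As a quick sanity check, here is what preprocess_text does to a small sample string:
# Non-breaking spaces and line breaks are collapsed into regular spaces
sample = 'Error\xa0Analysis of\nPLMs'
print(preprocess_text(sample))  # Error Analysis of PLMs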
✨ Once the PDF data is prepared, we'll proceed to configure the Elasticsearch instance and incorporate the PDF data into the Elasticsearch database.
Elasticsearch Setup
import requests
# Check cluster health
requests.get('http://localhost:9200/_cluster/health').json()
- The code sends a GET request to the Elasticsearch cluster health endpoint (http://localhost:9200/_cluster/health) to check the cluster health status; a small sketch of acting on this status follows below.
- The response is then parsed as JSON to extract relevant information about the cluster health.
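The field names below are standard in the _cluster/health response; note that a single-node cluster commonly reports 'yellow' because replica shards cannot be assigned, which is fine for development:
health = requests.get('http://localhost:9200/_cluster/health').json()
# 'green' and 'yellow' both mean the cluster is usable for our purposes
assert health['status'] in ('green', 'yellow'), f"Cluster unhealthy: {health['status']}"
print(health['cluster_name'], health['status'])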
Index Creation and Data Formatting
Create ElasticsearchDocumentStore instance
from haystack.document_stores import ElasticsearchDocumentStore
# Create ElasticsearchDocumentStore instance
doc_store = ElasticsearchDocumentStore(
host='localhost',
username='', password='',
index='error_plms'
)
- This code snippet utilizes the ElasticsearchDocumentStore class from the Haystack library to create an instance named doc_store.
- The instance is configured to connect to Elasticsearch running on localhost without authentication.
- The index specified is named 'error_plms', which will store data related to error analysis of pretrained language models.
Now, in order to write new data to the Elasticsearch index, we need to format it with a specific structure:
# Format data into a list of dictionaries
data_json = [
{
'content': paragraph,
'meta': {
'source': 'Human-Centric'
}
} for paragraph in pages_text
]
- This list comprehension iterates over each paragraph in the pages_text list, which contains the preprocessed text extracted from the PDF.
- Each paragraph is encapsulated in a dictionary with two keys: 'content' (containing the text of the paragraph) and 'meta' (containing metadata, in this case, the source 'Human-Centric'); see the inspection example below.
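To see the resulting shape, you can inspect the first formatted document (the content shown is illustrative):
# Each entry pairs a page's text with its metadata
print(data_json[0])
# {'content': '...text of the first page...', 'meta': {'source': 'Human-Centric'}}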
Data Upload to Elasticsearch
Finally, let's upload the data from the processed PDF file:
# Upload data to Elasticsearch index
doc_store.write_documents(data_json)
- The write_documents method of the ElasticsearchDocumentStore instance (doc_store) is used to upload the formatted data (stored in data_json) to the Elasticsearch index named 'error_plms'.
- This step completes the process of creating the index and populating it with the preprocessed data; a quick verification follows.
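As a verification step, you can ask the document store how many documents the index now holds; get_document_count is part of Haystack's document store API, and the count should match the number of PDF pages, assuming every page produced text:
# One document was written per PDF page
print(doc_store.get_document_count())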
Next step: building our Q&A pipeline, let's go 🙂
Haystack Pipeline
In this section, we establish the pipeline for our open-domain question answering system using the Haystack library. The pipeline consists of two crucial components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from the Elasticsearch document store, while the reader processes these documents to extract answers to user queries.
Connect to Elasticsearch Document Store
First, we connect to the Elasticsearch document store using the ElasticsearchDocumentStore class from the Haystack library. This allows us to access and retrieve documents stored in the Elasticsearch index.
from haystack.document_stores import ElasticsearchDocumentStore
doc_store = ElasticsearchDocumentStore(
host='localhost',
username='', password='',
index='error_plms'
)
- We create an instance of ElasticsearchDocumentStore named doc_store.
- The host parameter specifies the Elasticsearch host (in this case, 'localhost').
- We leave the username and password parameters empty since authentication is not required.
- The index parameter specifies the name of the Elasticsearch index to connect to ('error_plms').
Initialize Retriever and Reader Models
Next, we initialize the retriever and reader models. The retriever employs the BM25 algorithm to retrieve relevant documents based on a user query, while the reader utilizes a BERT-based model to extract answers from these retrieved documents.
from haystack.nodes import FARMReader, BM25Retriever
retriever = BM25Retriever(doc_store)
reader = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2', context_window_size=1500)
- We import the BM25Retriever and FARMReader classes from the Haystack library.
- The BM25Retriever is initialized with the Elasticsearch document store (doc_store); you can try it standalone, as sketched below.
- The FARMReader is initialized with the BERT-based model ('deepset/bert-base-cased-squad2') and a context window size of 1500 characters.
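Before wiring up the full pipeline, you can test the retriever on its own; retrieve is the standard Haystack retriever method, and the query below is just an example:
# Fetch the 3 most relevant documents for a sample query
docs = retriever.retrieve(query='pretrained language models', top_k=3)
for doc in docs:
    print(doc.content[:100])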
Initialize Open-Domain Question Answering Pipeline
Finally, we initialize the open-domain question answering pipeline using the retriever and reader models.
from haystack.pipelines import ExtractiveQAPipeline
qa = ExtractiveQAPipeline(reader=reader, retriever=retriever)
- We import the ExtractiveQAPipeline class from the Haystack library.
- The pipeline is initialized with the reader and retriever models we previously defined.
- This pipeline allows us to run queries and extract answers from documents stored in the Elasticsearch index, as sketched below.
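A minimal sketch of running the pipeline directly; the params dictionary addresses each node by name, controlling how many documents the retriever fetches and how many answers the reader returns (the query is just an example):
# Retrieve 10 candidate documents, then extract the top 3 answers
result = qa.run(
    query='what is PLMs?',
    params={'Retriever': {'top_k': 10}, 'Reader': {'top_k': 3}}
)
print(result['answers'][0].answer)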
Extract Answer Handler and Ask() Function
Extract Answer Handler
In our question answering pipeline, the extract_answer function serves as a handler to process the responses obtained from the pipeline. It takes a response object as input, which contains information about the user query and the extracted answers. The function then formats and prints the query along with the corresponding answers, if any.
def extract_answer(response_obj):
    query = response_obj['query']
    answers = response_obj['answers']
    if len(answers) > 0:
        print(f'Q: {query}')
        _ = [print(f'A{index+1}: {answer.answer}') for index, answer in enumerate(answers)]
- The extract_answer function takes a response object as input, which typically contains the query and extracted answers.
- It extracts the query and answers from the response object.
- If answers are found (i.e., the answers list is not empty), it prints the query along with each answer; a hand-built example follows.
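To illustrate the expected input shape, here is a small hand-crafted example; in real runs the answers list holds Haystack Answer objects, which is why the handler reads answer.answer:
from haystack.schema import Answer

# A hand-built response object mimicking the shape of qa.run output
mock_response = {
    'query': 'what is PLMs?',
    'answers': [Answer(answer='pretrained language models')],
}
extract_answer(mock_response)
# Q: what is PLMs?
# A1: pretrained language models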
Ask() Function
The Ask function is a utility function designed to facilitate querying our question answering pipeline. It takes a query string as input, runs it through the pipeline, and utilizes the extract_answer function to handle the response, printing the query and extracted answers.
def Ask(query=""):
    if len(query) > 0:
        response = qa.run(query=query, params={"Reader": {"top_k": 3}})
        extract_answer(response)
- The Ask function is a wrapper that simplifies the process of querying our pipeline.
- It accepts a query string as input.
- If the query string is not empty, it runs the query through the pipeline using the qa.run method.
- The qa.run method returns a response object, which is then passed to the extract_answer function for further processing and printing of the query and answers.
Example Queries and Responses
We demonstrate the functionality of our pipeline by posing several example queries and examining the corresponding answers retrieved from the documents stored in the Elasticsearch index.
Example Query: "What is PLMs?"
Ask("what is PLMs?")
output:
Q: what is PLMs ?
A1: pretrained language models
A2: Pretrained Language Models
A3: Pretrained Language Models
Example Query: "What are Pretrained language models?"
Ask("what are Pretrained language models?")
output:
Q: what are Pretrained language models ?
A1: languages with limited corpora for unsuper - vised NMT
A2: neural network models trained on large amounts of text data in an unsupervised manner
A3: a state-of-the-art walkthrough
Example Query: "What is the objectives when translating from English to Arabic?"
Ask("what is the objectives when translating from English to Arabic?")
output:
Q: what is the objectives when translating from English to Arabic ?
A1: assess the translation performance
A2: improves the translation accuracy
A3: better long sentence translation
Example Query: "What is Conclusion and Future Work?"
Ask("what is Conclusion and Future Work?")
output:
Q: what is Conclusion and Future Work ?
A1: advancing machine translation capabilities for the English–Arabic language pairs
A2: state-of-the-art pretrained language models
A3: Sect. 6 offers recommendations and future work, highlighting potential avenues for enhancing the PLM performance
In this walkthrough, we've successfully set up Elasticsearch for our open-domain question answering system. We pulled the Elasticsearch Docker image, configured a single-node cluster, and created an index named 'error_plms' to store relevant data.
After formatting our data and uploading it to the Elasticsearch index, we established a Haystack pipeline with a retriever using the BM25 algorithm and a reader utilizing a BERT-based model.
Demonstrating the system's functionality, we provided example queries with their corresponding responses, showcasing accurate information retrieval from the Elasticsearch index.
Conclusion
In conclusion, with Elasticsearch integrated into our question answering pipeline, we've laid a strong foundation for an efficient open-domain question answering system, capable of handling diverse user queries effectively.