Abdulrahman Muhialdeen

Building an Elasticsearch-Powered Question Answering System: Setup and Pipeline Construction

Explore the construction of an open-source Question Answering System leveraging Elasticsearch and Haystack, using the paper 'Error Analysis of Pretrained Language Models (PLMs) in English-to-Arabic Machine Translation' as the source document.

Overview

This article provides a detailed walkthrough of setting up the Elasticsearch database for an open-domain question answering system. The objective is to build a comprehensive system with three major components: Database, Retriever, and Reader. In this particular notebook, we will focus on the initial step, which involves configuring the database using Elasticsearch.

Elasticsearch Installation

In this section, we will configure Elasticsearch for our open-domain question answering system. We will utilize Elasticsearch 7.17, which can be conveniently deployed using Docker. Below are the steps to pull the Elasticsearch image and start a single-node cluster:

Pulling the Elasticsearch Image

Obtaining Elasticsearch for Docker is straightforward. Simply issue the following command to pull the image from the Elastic Docker registry:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.17.18

Starting a Single Node Cluster

To start a single-node Elasticsearch cluster suitable for development or testing purposes, we need to specify single-node discovery to bypass bootstrap checks. Execute the following command:

docker run -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.18

This command binds Elasticsearch ports to localhost, allowing access to the cluster. The -e flag is used to set the discovery.type to single-node, ensuring that Elasticsearch operates in a single-node mode.

With Elasticsearch up and running in Docker, we can proceed to set up our database for the question answering system.

To try the Elasticsearch instance, open the URL localhost:9200 in your browser.
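
If you'd rather verify readiness from Python, here is a minimal sketch (assuming the requests library is installed) that polls the instance until it responds:

import time
import requests

# Poll the local Elasticsearch instance until it responds (or we give up)
for attempt in range(30):
    try:
        info = requests.get('http://localhost:9200').json()
        print(f"Elasticsearch {info['version']['number']} is up")
        break
    except requests.exceptions.ConnectionError:
        time.sleep(2)  # the container may need a few seconds to boot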

PDF File Import and Preprocessing

We begin by importing the PDF file of the paper titled 'Error Analysis of Pretrained Language Models (PLMs) in English-to-Arabic Machine Translation'.

The PyPDF2 library is employed to extract text from each page of the PDF. The text is then preprocessed to address issues such as unwanted characters and line breaks.

Install PyPDF2

!pip install pypdf2

PDF Preprocessing

from PyPDF2 import PdfReader
 
# Open the PDF file
pdf_reader = PdfReader('./data/Error-Analysis-PLMs.pdf')
 
# Preprocess the text
def preprocess_text(text):
    """
    Preprocesses the text by replacing unwanted text with desired replacements.
    
    Args:
    - text (str): The input text to be preprocessed.
    
    Returns:
    - str: The preprocessed text.
    """
    replacements = {'\xa0': ' ', '\n':' '}
    for old_text, new_text in replacements.items():
        text = text.replace(old_text, new_text)
    return text
 
# Extract the text from all pages
pages_text = [preprocess_text(page.extract_text()) for page in pdf_reader.pages]
 

This code snippet demonstrates the following steps:

  1. Importing the PdfReader class from the PyPDF2 library.
  2. Opening the PDF file titled "Error Analysis of Pretrained Language Models (PLMs) in English‑to‑Arabic Machine Translation" located at the specified file path.
  3. Defining a preprocess_text function to preprocess the extracted text by replacing unwanted characters such as \xa0 and newline characters (\n) with appropriate replacements (e.g., space).
  4. Extracting text from each page of the PDF using list comprehension and preprocessing it using the defined preprocess_text function. The preprocessed text is stored in the pages_text list.

✨ Once the PDF data 🗄️ is prepared, we'll proceed to configure the Elasticsearch instance and load it into the 🔍 Elasticsearch database.

Elasticsearch Setup

import requests
 
# Check cluster health
requests.get('http://localhost:9200/_cluster/health').json()
  • The code sends a GET request to the Elasticsearch cluster health endpoint (http://localhost:9200/_cluster/health) to check the cluster health status.
  • The response is then parsed as JSON to extract relevant information about the cluster health.
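
As a quick sanity check, you can assert on the status field that the health endpoint returns. A minimal sketch, assuming a freshly started single-node cluster (which typically reports yellow or green):

health = requests.get('http://localhost:9200/_cluster/health').json()

# A single-node dev cluster usually reports 'yellow' or 'green'
assert health['status'] in ('green', 'yellow'), f"Cluster unhealthy: {health['status']}"
print(f"Cluster '{health['cluster_name']}' status: {health['status']}")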

Index Creation and Data Formatting

Create ElasticsearchDocumentStore instance

from haystack.document_stores import ElasticsearchDocumentStore
 
# Create ElasticsearchDocumentStore instance
doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='error_plms'
)
  • This code snippet utilizes the ElasticsearchDocumentStore class from the Haystack library to create an instance named doc_store.
  • The instance is configured to connect to Elasticsearch running on localhost without authentication.
  • The index specified is named 'error_plms', which will store data related to error analysis of pretrained language models.
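
To confirm that the index was created, you can list the cluster's indices through the standard _cat/indices endpoint. A minimal sketch; the 'error_plms' index should appear once the document store has initialized it:

import requests

# List all indices in the cluster; 'error_plms' should be among them
indices = requests.get('http://localhost:9200/_cat/indices?format=json').json()
print([idx['index'] for idx in indices])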

Now, in order to write new data to the Elasticsearch index, we need a specific data structure:

# Format data into a list of dictionaries
data_json = [
    {
        'content': paragraph,
        'meta': {
            'source': 'Human-Centric'
        }
    } for paragraph in pages_text
]
  • This list comprehension iterates over each paragraph in the pages_text list, which contains preprocessed text extracted from the PDF.
  • Each paragraph is encapsulated in a dictionary with two keys: 'content' (containing the text of the paragraph) and 'meta' (containing metadata, in this case, the source 'Human-Centric').
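
Before uploading, it is worth inspecting one formatted entry to confirm the structure is what we expect, for example:

# Inspect the formatted documents before uploading
print(f'Number of documents: {len(data_json)}')
print(data_json[0]['content'][:200])  # first 200 characters of page 1
print(data_json[0]['meta'])           # {'source': 'Human-Centric'}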

Data Upload to Elasticsearch

Finally! Let's upload the data from the processed PDF file:

# Upload data to Elasticsearch index
doc_store.write_documents(data_json)
  • The write_documents method of the ElasticsearchDocumentStore instance (doc_store) is used to upload the formatted data (stored in data_json) to the Elasticsearch index named 'error_plms'.
  • This step completes the process of creating the index and populating it with the preprocessed data.
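
To verify the upload, the document store exposes a get_document_count method; a quick check that the index was populated:

# Confirm how many documents were indexed (should match len(data_json),
# unless Haystack skipped any duplicate pages)
print(f'Documents in error_plms: {doc_store.get_document_count()}')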

Next step: building our Q&A pipeline. Let's go 🙂

Haystack Pipeline

In this section, we establish the pipeline for our open-domain question answering system using the Haystack library. The pipeline consists of two crucial components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from the Elasticsearch document store, while the reader processes these documents to extract answers to user queries.

Connect to Elasticsearch Document Store

First, we connect to the Elasticsearch document store using the ElasticsearchDocumentStore class from the Haystack library. This allows us to access and retrieve documents stored in the Elasticsearch index.

from haystack.document_stores import ElasticsearchDocumentStore
 
doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='error_plms'
)
  • We create an instance of ElasticsearchDocumentStore named doc_store.
  • The host parameter specifies the Elasticsearch host (in this case, 'localhost').
  • We leave the username and password parameters empty since authentication is not required.
  • The index parameter specifies the name of the Elasticsearch index to connect to ('error_plms').

Initialize Retriever and Reader Models

Next, we initialize the retriever and reader models. The retriever employs the BM25 algorithm to retrieve relevant documents based on a user query, while the reader utilizes a BERT-based model to extract answers from these retrieved documents.

from haystack.nodes import FARMReader, BM25Retriever
 
retriever = BM25Retriever(doc_store)
reader = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2', context_window_size=1500)
  • We import the BM25Retriever and FARMReader classes from the Haystack library.
  • The BM25Retriever is initialized with the Elasticsearch document store (doc_store).
  • The FARMReader is initialized with the BERT-based model ('deepset/bert-base-cased-squad2') and a context window size of 1500 tokens.
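
Before assembling the full pipeline, you can exercise the retriever on its own. A minimal sketch using the retriever's retrieve method (the query string here is just an example):

# Fetch the top 3 BM25 matches for a sample query
docs = retriever.retrieve(query='pretrained language models', top_k=3)
for doc in docs:
    print(doc.content[:100])  # preview each retrieved page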

Initialize Open-Domain Question Answering Pipeline

Finally, we initialize the open-domain question answering pipeline using the retriever and reader models.

from haystack.pipelines import ExtractiveQAPipeline
 
qa = ExtractiveQAPipeline(reader=reader, retriever=retriever)
  • We import the ExtractiveQAPipeline class from the Haystack library.
  • The pipeline is initialized with the reader and retriever models we previously defined.
  • This pipeline allows us to run queries and extract answers from documents stored in the Elasticsearch index.
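
A raw pipeline call looks like the following; the params dictionary controls how many documents the retriever passes on and how many answers the reader extracts (the values shown are illustrative):

# Run a query end-to-end: retrieve candidate pages, then extract answers
result = qa.run(
    query='what is PLMs?',
    params={'Retriever': {'top_k': 10}, 'Reader': {'top_k': 3}}
)
print(result['answers'])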

Extract Answer Handler and Ask() Function

Extract Answer Handler

In our question answering pipeline, the extract_answer function serves as a handler to process the responses obtained from the pipeline. It takes a response object as input, which contains information about the user query and the extracted answers. The function then formats and prints the query along with the corresponding answers, if any.

def extract_answer(response_obj):
    query = response_obj['query']
    answers = response_obj['answers']
    if len(answers) > 0:
        print(f'Q: {query}')
        for index, answer in enumerate(answers):
            print(f'A{index+1}: {answer.answer}')
  • The extract_answer function takes a response object as input, which typically contains the query and extracted answers.
  • It extracts the query and answers from the response object.
  • If answers are found (i.e., the answers list is not empty), it prints the query along with each answer.
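
If you also want to see how confident the reader is in each answer, Haystack's Answer objects carry a score attribute. A variant of the handler that prints it (the function name is our own):

def extract_answer_with_scores(response_obj):
    # Same as extract_answer, but also prints the reader's confidence score
    query = response_obj['query']
    answers = response_obj['answers']
    if len(answers) > 0:
        print(f'Q: {query}')
        for index, answer in enumerate(answers):
            print(f'A{index+1}: {answer.answer} (score: {answer.score:.2f})')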

Ask() Function

The Ask function is a utility function designed to facilitate querying our question answering pipeline. It takes a query string as input, runs it through the pipeline, and utilizes the extract_answer function to handle the response, printing the query and extracted answers.

def Ask(query=""):
    if len(query) > 0:
        response = qa.run(query=query, params={"Reader": {"top_k": 3}})
        extract_answer(response)
  • The Ask function is a wrapper function that simplifies the process of querying our pipeline.
  • It accepts a query string as input.
  • If the query string is not empty, it runs the query through the pipeline using the qa.run method.
  • The qa.run method returns a response object, which is then passed to the extract_answer function for further processing and printing of the query and answers.

Example Queries and Responses

We demonstrate the functionality of our pipeline by posing several example queries and examining the corresponding answers retrieved from the documents stored in the Elasticsearch index.

Example Query: "What is PLMs?"

Ask("what is PLMs?")

output:

Q: what is PLMs?

A1: pretrained language models

A2: Pretrained Language Models

A3: Pretrained Language Models

Example Query: "What are Pretrained language models?"

Ask("what are Pretrained  language  models?")

output:

Q: what are Pretrained language models?

A1: languages with limited corpora for unsuper - vised NMT

A2: neural network models trained on large amounts of text data in an unsupervised manner

A3: a state-of-the-art walkthrough

Example Query: "What is the objectives when translating from English to Arabic?"

Ask("what is the objectives when translating from English to Arabic?")

output:

Q: what is the objectives when translating from English to Arabic?

A1: assess the translation performance

A2: improves the translation accuracy

A3: better long sentence translation

Example Query: "What is Conclusion and Future Work?"

Ask("what is Conclusion and Future Work?")

output:

Q: what is Conclusion and Future Work?

A1: advancing machine translation capabilities for the English–Arabic language pairs

A2: state-of-the-art pretrained language models

A3: Sect. 6 offers recommendations and future work, highlighting potential avenues for enhancing the PLM performance


In This Walkthrough

We've successfully set up Elasticsearch for our open-domain question answering system. We pulled the Elasticsearch Docker image, configured a single-node cluster, and created an index named 'error_plms' to store relevant data.

After formatting our data and uploading it to the Elasticsearch index, we established a Haystack pipeline with a retriever using the BM25 algorithm and a reader utilizing a BERT-based model.

Demonstrating the system's functionality, we provided example queries with their corresponding responses, showcasing accurate information retrieval from the Elasticsearch index.

Conclusion

In conclusion, with Elasticsearch integrated into our question answering pipeline, we've laid a strong foundation for an efficient open-domain question answering system, capable of handling diverse user queries effectively.