
PhD Research: Neural Entity-Oriented Information Retrieval and Extraction

Download Dissertation»

Large Language Models (LLMs) like ChatGPT have revolutionized our approach to language comprehension and generation. Yet they face issues like hallucination: generating information that lacks grounding in factual sources. To mitigate this, the integration of information retrieval mechanisms has emerged as a pivotal solution, allowing LLMs to ground their responses in real-world, verifiable data and improving accuracy and reliability. Retrieval Augmented Generation (RAG) has therefore become central to modern AI applications, underscoring the need for effective information retrieval systems. Entities (specific people, places, concepts, events, and more) serve as explicit anchors in text and are indispensable assets in this context. While entities have historically proven their merit in feature-based information retrieval (IR) systems, modern neural IR models have scarcely tapped into their potential. My PhD research therefore advanced the use of entities within the AI ecosystem, with a particular focus on their role in bolstering the capabilities of information retrieval systems.

My PhD work pioneered the integration of Knowledge Graph semantics into neural IR, diving deep into the interplay between entity semantics and neural IR's vector representations. Specifically, I focused on:

Query-Specific Representation Learning

Graph embedding methods provide vector representations of entities in Knowledge Graphs and have proven beneficial for NLP tasks like entity linking and relation classification. While such methods excel in encapsulating the general semantics and knowledge of entities within the Knowledge Graph, they may not be ideal for downstream IR systems. This is because entities can have indirect relationships that become evident only in the context of a particular query. For instance, the Wikipedia page for the entity Food and Drug Administration does not mention the entity Robert Swanson, but they are contextually linked through the query "Genetically Modified Organism", as Swanson founded the company that produced the first FDA-approved genetically engineered insulin.
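To illustrate the difference, here is a minimal sketch (not the architecture from my papers) contrasting a static entity embedding with a query-conditioned one. It assumes the sentence-transformers library; the model name, text snippets, and the simple concatenation scheme are illustrative placeholders:

```python
# Sketch: static vs. query-specific entity embeddings (illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Genetically Modified Organism"

# A static embedding uses only the entity's own description ...
static_emb = model.encode("Robert Swanson co-founded Genentech, a biotechnology company.")

# ... whereas a query-specific embedding encodes the entity in the context of
# the query, e.g., via a passage that connects the two for this query.
context = ("Genentech, co-founded by Robert Swanson, produced the first "
           "FDA-approved genetically engineered human insulin.")
query_specific_emb = model.encode(query + " | " + context)

# The query-specific vector can capture the indirect FDA-Swanson connection
# that the static description alone misses.
q_emb = model.encode(query)
print("static:        ", float(util.cos_sim(q_emb, static_emb)))
print("query-specific:", float(util.cos_sim(q_emb, query_specific_emb)))
```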

In my previous work, I focused on models that learn query-specific entity representations, ensuring embeddings are both knowledge-rich and query-relevant. I was the first to demonstrate that such query-specific entity embeddings outperform traditional graph embedding techniques like Wikipedia2Vec in creating superior entity clusters. These enhanced clusters can play a pivotal role in improving RAG systems: by harnessing entities, we can organize vast information troves more intelligently, ensuring that document clusters rest not on superficial text matches but on deeper semantic connections. For instance, consider a user who seeks information about defensive measures during bear attacks. While simple clustering might group documents with generic references to bears, entity-driven clustering would specifically target documents that discuss entities directly related to "defensive measures during bear attacks", yielding a query-focused and contextually relevant cluster. When the user expresses interest in defensive measures, the system can access this refined cluster, crafting its response from the documents most pertinent and insightful for the query. In my prior research, I explored methods to identify brief text segments that elucidate the connection between an entity and a user's query; these techniques can be adapted to cluster documents by query-related sub-topics, as sketched below.
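As a rough illustration of entity-driven clustering, the sketch below represents each document by the mean of its linked entities' embeddings and clusters on those vectors. The embeddings, documents, and entity annotations are toy placeholders, not the models from my papers:

```python
# Sketch: cluster documents by the entities they mention, not surface text.
import numpy as np
from sklearn.cluster import KMeans

def doc_vector(entity_ids, entity_emb):
    """Represent a document as the mean of its entities' embeddings."""
    return np.mean([entity_emb[e] for e in entity_ids], axis=0)

# Toy query-specific entity embeddings (in practice, learned per query).
rng = np.random.default_rng(0)
entity_emb = {e: rng.normal(size=32) for e in
              ["Bear_attack", "Bear_spray", "Grizzly_bear", "Camping", "Hiking"]}

# Documents annotated with their linked entities.
docs = {
    "doc1": ["Bear_attack", "Bear_spray"],  # defensive measures
    "doc2": ["Bear_spray", "Hiking"],       # defensive measures
    "doc3": ["Grizzly_bear", "Camping"],    # generic bear content
}

X = np.stack([doc_vector(ents, entity_emb) for ents in docs.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(docs, labels)))  # documents sharing "defensive" entities co-cluster
```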

Additionally, organizing information with entity clusters can allow the RAG system to guide users through different sub-topics related to their main query. This is especially beneficial for users unfamiliar with the subject. The RAG system can take a proactive stance, leading users through the topic and highlighting sub-topics they might find informative.

The impact of this research and its implications caught the attention of major industry players, leading to invitations to present my work at prestigious platforms like the BBC Data Dates series. My latest work (under review) extends this by learning query-specific document representations to refine document ranking.

Fine-Grained Information Extraction for Text Understanding

Entities are pivotal for nuanced automated text understanding. For example, when a user inquires about bear attacks, it is essential for the system to discern whether the user is referring to the reasons behind bear attacks, preventive measures, or perhaps the aftermath. Understanding the specific context or aspect of the entity "bear attacks" is crucial. The task of entity aspect linking seeks to enhance the precision of textual entity links by identifying the exact facet of an entity being discussed, often sourcing these aspects from sections of the entity's Wikipedia page.
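As a minimal sketch, aspect linking can be posed as ranking an entity's candidate aspects by their similarity to the mention context. The aspect texts below are illustrative placeholders, and the similarity model is an off-the-shelf sentence encoder rather than my published system:

```python
# Sketch: score each candidate aspect of "bear attacks" against the context
# in which the entity is mentioned, and pick the best-matching one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

mention_context = "What should I carry to protect myself if a bear charges at me?"
aspects = {
    "Causes": "Bear attacks are often triggered by surprise encounters ...",
    "Defense": "Bear spray and playing dead are common defensive measures ...",
    "Aftermath": "Victims of bear attacks often require extensive treatment ...",
}

ctx_emb = model.encode(mention_context)
scores = {name: float(util.cos_sim(ctx_emb, model.encode(text)))
          for name, text in aspects.items()}
print(max(scores, key=scores.get))  # expected: "Defense"
```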

In my prior research, I pioneered a neural entity aspect linking system that utilizes query-specific knowledge. Notably, my system begins by predicting "guiding" entities (entities that are likely to appear in the correct aspect) and then employs these predictions to steer the system towards that aspect. In the realm of RAG systems, I envision entity aspects playing a transformative role. Explicit aspects such as "causes", "recovery", and "defense" for the entity "bear attacks" can support the systematic organization and categorization of documents by facet: if someone wants to explore the reasons behind bear attacks, the RAG system can use the "causes" aspect to swiftly pinpoint and retrieve the most relevant documents, bypassing the traditionally laborious clustering step.
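A toy sketch of the guiding-entities idea (with made-up similarity scores, made-up entity annotations, and a hypothetical interpolation weight `lam`) might look as follows:

```python
# Sketch: aspects whose text contains predicted "guiding" entities get a boost.
def rescore(aspect_scores, aspect_entities, guiding_entities, lam=0.5):
    """Combine text similarity with the overlap between an aspect's entities
    and the predicted guiding entities."""
    return {aspect: sim + lam * len(aspect_entities[aspect] & guiding_entities)
            for aspect, sim in aspect_scores.items()}

aspect_scores = {"Causes": 0.41, "Defense": 0.44, "Aftermath": 0.22}  # toy values
aspect_entities = {
    "Causes": {"Grizzly_bear", "Hiking"},
    "Defense": {"Bear_spray", "Pepper_spray"},
    "Aftermath": {"Rabies", "Surgery"},
}
guiding_entities = {"Bear_spray", "Pepper_spray"}  # predicted from the query

scores = rescore(aspect_scores, aspect_entities, guiding_entities)
print(max(scores, key=scores.get))  # "Defense" wins after the entity boost
```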

Finding Relevant Entities for a Query

Many search queries can be answered using entities: for example, questions such as "Who is the mayor of Berlin?" or queries that seek a particular list of entities, such as "Professional sports team in Philadelphia". In fact, prior work has found that approximately 40-70% of Web search queries target entities. As discussed above, knowledge of query-relevant entities can equip a RAG system to guide users through complex topics, especially when they lack a comprehensive understanding of the subject.

In my prior research, I focused on building systems that retrieve and rank entities from a given Knowledge Graph in response to user queries. I was the first to study the significance of employing fine-grained entity aspects and the role of query-specific knowledge in entity retrieval. Specifically, my findings revealed that incorporating entity aspects improved entity retrieval performance by 35%, while query-specific knowledge contributed a 42% improvement over prior state-of-the-art systems.
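As a simplified illustration (not my published model), an aspect-aware entity ranker can represent each candidate entity by its aspect texts and score it by its best-matching aspect. The data and the max-pooling choice here are assumptions made for the sketch:

```python
# Sketch: rank Knowledge Graph entities for a query via their aspect texts.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Professional sports team in Philadelphia"
entities = {  # entity -> toy aspect texts
    "Philadelphia_Eagles": [
        "The Eagles are a professional American football team ...",
        "The team plays at Lincoln Financial Field ...",
    ],
    "Philadelphia": ["Philadelphia is the largest city in Pennsylvania ..."],
}

q = model.encode(query)
ranking = sorted(
    entities,
    key=lambda e: max(float(util.cos_sim(q, model.encode(a))) for a in entities[e]),
    reverse=True,
)
print(ranking)  # expected: Philadelphia_Eagles first
```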

MSc Research: Multi-Objective Optimization


My research during my master's was in a completely different area: Multi-Objective Optimization. Multi-objective optimization is an area of multiple-criteria decision-making concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. For a nontrivial multi-objective optimization problem, no single solution exists that simultaneously optimizes each objective. In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions. A solution is called non-dominated, Pareto optimal, Pareto efficient, or non-inferior if none of the objective functions can be improved without degrading some other objective. Without additional subjective preference information, all Pareto optimal solutions are considered equally good. Researchers study multi-objective optimization problems from different viewpoints, so there exist different solution philosophies and goals when setting and solving them: the goal may be to find a representative set of Pareto optimal solutions, to quantify the trade-offs in satisfying the different objectives, and/or to find a single solution that satisfies the subjective preferences of a human decision maker.
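The dominance relation is easy to make concrete. Below is a small self-contained example that extracts the Pareto front from a set of candidate solutions, assuming two objectives that are both minimized:

```python
# Pareto dominance for minimization: a dominates b if a is no worse on every
# objective and strictly better on at least one.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return the non-dominated (Pareto optimal) solutions."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions if other != s)]

# Each tuple is (distance, time); both objectives are minimized.
candidates = [(10, 7), (8, 9), (12, 5), (9, 8), (12, 9)]
print(pareto_front(candidates))  # (12, 9) is dominated by (8, 9) and drops out
```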

In my thesis, I used a class of algorithms called Genetic Algorithms to solve a multi-objective travelling salesman problem. A genetic algorithm (GA) is a meta-heuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EAs). Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on biologically inspired operators such as mutation, crossover, and selection. The two objectives were minimizing the distance travelled by the salesman and minimizing the travel time. I modelled the problem as a single-objective optimization problem using the weighted-sum method to scalarize the objective function, then used a GA to study how the distance and time values change as the weights assigned to the two objectives vary, and varied the mutation probability, the initial population, and the number of generations to study their effect on the fitness value.
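To make the weighted-sum formulation concrete, here is a small self-contained sketch on a randomly generated toy instance. The GA operators shown (truncation selection, ordered crossover, swap mutation) are a common illustrative choice, not necessarily the exact configuration from my thesis:

```python
# Sketch: weighted-sum GA for a toy bi-objective TSP.
# Scalarized fitness: f = w * distance + (1 - w) * time, minimized.
import random

random.seed(42)
N = 8  # number of cities
dist = [[0 if i == j else random.randint(1, 20) for j in range(N)] for i in range(N)]
time_ = [[0 if i == j else random.randint(1, 20) for j in range(N)] for i in range(N)]

def fitness(tour, w):
    legs = list(zip(tour, tour[1:] + tour[:1]))  # close the cycle
    d = sum(dist[i][j] for i, j in legs)
    t = sum(time_[i][j] for i, j in legs)
    return w * d + (1 - w) * t

def ordered_crossover(p1, p2):
    # Copy a slice from parent 1; fill the rest in parent 2's order.
    a, b = sorted(random.sample(range(N), 2))
    hole = p1[a:b]
    rest = [c for c in p2 if c not in hole]
    return rest[:a] + hole + rest[a:]

def mutate(tour, p_mut):
    if random.random() < p_mut:  # swap mutation
        i, j = random.sample(range(N), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour

def ga(w, pop_size=50, generations=200, p_mut=0.2):
    pop = [random.sample(range(N), N) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, w))
        survivors = pop[: pop_size // 2]  # truncation selection
        children = [mutate(ordered_crossover(*random.sample(survivors, 2)), p_mut)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=lambda t: fitness(t, w))

# Sweeping w traces out the distance/time trade-off studied in the thesis.
for w in (0.0, 0.5, 1.0):
    best = ga(w)
    print(w, fitness(best, w))
```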

This work was published at the IEEE International Conference on Information Technology (ICIT).

BSc Research: Image Processing and Computer Vision


For my bachelor's thesis, I worked in yet another area: Computer Vision (not nearly as popular at the time, since neural networks had yet to make a foray into the field). My research was about designing an algorithm for a computer-vision-based surveillance system that generates an alert whenever a moving body is localized within a defined sensitive area, using a proximity operation. The approach first detects the sensitive area and then uniquely identifies the paths of all objects entering within a standoff distance of it. During the path-identification phase, a catalog is maintained containing the coordinates of objects already inside the area as well as newly entering ones. An alert is generated by comparing the cataloged data against a predefined reference frame. The proposed algorithm was validated on real-time CCTV footage as well as on an animated video covering all the possible test cases.
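A much-simplified modern analogue of the alerting step, assuming OpenCV, a placeholder video path, and a hard-coded sensitive polygon (neither taken from the thesis), might look like this:

```python
# Sketch: detect moving objects via background subtraction and alert when an
# object's centroid falls inside a predefined sensitive polygon.
import cv2
import numpy as np

SENSITIVE_AREA = np.array([[100, 100], [300, 100], [300, 300], [100, 300]],
                          dtype=np.int32)  # placeholder polygon

cap = cv2.VideoCapture("cctv_footage.mp4")  # placeholder path
bg_sub = cv2.createBackgroundSubtractorMOG2()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg_sub.apply(frame)
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) < 500:  # ignore small noise blobs
            continue
        x, y, w, h = cv2.boundingRect(c)
        cx, cy = x + w // 2, y + h // 2  # object centroid (catalog entry)
        # pointPolygonTest >= 0 means inside or on the boundary.
        if cv2.pointPolygonTest(SENSITIVE_AREA, (float(cx), float(cy)), False) >= 0:
            print("ALERT: object inside sensitive area at", (cx, cy))
cap.release()
```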

I had two journal publications from this research. [paper 1] [paper 2]