Building Knowledge Graphs with LLM Graph Transformer
By Tomaz Bratanic
Creating graphs from text is incredibly exciting, but definitely challenging. Essentially, it’s about converting unstructured text into structured data. While this approach has been around for some time, it gained significant traction with the advent of Large Language Models (LLMs), bringing it more into the mainstream.
In the image above, you can see how information extraction transforms raw text into a knowledge graph. On the left, multiple documents show unstructured sentences about individuals and their relationships with companies. On the right, this same information is represented as a graph of entities and their connections, showing who worked at or founded various organizations.
But why would you want to extract structured information from text and represent it as a graph? One key reason is to power retrieval-augmented generation (RAG) applications. While using text embedding models over unstructured text is a useful approach, it can fall short when it comes to answering complex, multi-hop questions that require understanding connections across multiple entities, or questions where structured operations like filtering, sorting, and aggregation are required. By extracting structured information from text and constructing knowledge graphs, you not only organize data more effectively but also create a powerful framework for understanding complex relationships between entities. This structured approach makes it much easier to retrieve and leverage specific information, expanding the types of questions you can answer while providing greater accuracy.
Around a year ago, I began experimenting with building graphs using LLMs, and due to the growing interest, we decided to integrate this capability into LangChain as the LLM Graph Transformer. Over the past year, we’ve gained valuable insights and introduced new features, which we’ll be showcasing in this blog post.
The code is available on GitHub.
We will use Neo4j as the underlying graph store, which comes with out-of-the-box graph visualizations. The easiest way to get started is to use a free instance of Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.
The LLM Graph Transformer was designed to provide a flexible framework for building graphs using any LLM. With so many different providers and models available, this task is far from simple. Fortunately, LangChain steps in to handle much of the standardization process. As for the LLM Graph Transformer itself, it's like two cats stacked in a trench coat, with the ability to operate in two completely independent modes.
The LLM Graph Transformer operates in two distinct modes, each designed to generate graphs from documents using an LLM in different scenarios:

- Tool-Based Mode (Default): When the LLM supports structured output or function calling, extraction is driven by the model's built-in tool-calling capabilities, with the output format defined through tool specifications.
- Prompt-Based Mode (Fallback): When the LLM doesn't support tools or function calls, the transformer falls back to a purely prompt-driven approach, using few-shot examples to guide extraction and parsing the text-based output with a custom function.

These two modes ensure that the LLM Graph Transformer is adaptable to different LLMs, allowing it to build graphs either directly using tools or by parsing output from a text-based prompt.
Note that you can use prompt-based extraction even with models that support tools/functions by setting the attribute ignore_tool_usage=True.
We initially chose a tool-based approach for extraction since it minimized the need for extensive prompt engineering and custom parsing functions. In LangChain, the with_structured_output method allows you to extract information using tools or functions, with output defined either through a JSON structure or a Pydantic object. Personally, I find Pydantic objects clearer, so we opted for that.
We start by defining a Node class.
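A minimal sketch of what such a class can look like, with abbreviated field descriptions; the allowed_labels list and the inline Property stub are illustrative placeholders, not the exact definitions shipped in langchain_experimental:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

# Illustrative placeholder; in practice the allowed labels come from the user-defined schema.
allowed_labels = ["Person", "Organization", "Location"]

class Property(BaseModel):
    """A key-value pair; covered in more detail below."""
    key: str
    value: str

class Node(BaseModel):
    """An extracted entity."""
    id: str = Field(..., description="Name or human-readable unique identifier")
    label: str = Field(..., description=f"Available options are {allowed_labels}")
    properties: Optional[List[Property]] = Field(
        None, description="Optional list of node properties"
    )
```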
Each node has an id, a label, and optional properties. For brevity, I haven't included full descriptions here. Describing ids as human-readable unique identifiers is important, since some LLMs tend to interpret ID properties in the more traditional way, as random strings or incremental integers. Instead, we want the names of entities to be used as the id property. We also limit the available label types by simply listing them in the label description. Additionally, LLMs like OpenAI's support an enum parameter, which we also use.
Next, we take a look at the Relationship class.
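A sketch of the flattened shape described below, continuing the previous snippet (the placeholder allowed_relationship_types list is an assumption for illustration):

```python
# Illustrative placeholder for permitted relationship types.
allowed_relationship_types = ["SPOUSE", "WORKS_AT", "AWARD"]

class Relationship(BaseModel):
    """An extracted relationship between two nodes."""
    source_node_id: str
    source_node_label: str = Field(
        ..., description=f"Available options are {allowed_labels}"
    )
    target_node_id: str
    target_node_label: str = Field(
        ..., description=f"Available options are {allowed_labels}"
    )
    type: str = Field(
        ..., description=f"Available options are {allowed_relationship_types}"
    )
    properties: Optional[List[Property]] = Field(
        None, description="Optional list of relationship properties"
    )
```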
This is the second iteration of the Relationship class. Initially, we used a nested Node object for the source and target nodes, but we quickly found that nested objects reduced the accuracy and quality of the extraction process. So, we decided to flatten the source and target nodes into separate fields—for example, source_node_id and source_node_label, along with target_node_id and target_node_label. Additionally, we define the allowed values in the descriptions for node labels and relationship types to ensure the LLMs adhere to the specified graph schema.
The tool-based extraction approach enables us to define properties for both nodes and relationships. Below is the class we used to define them.
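A sketch of the Property class, expanding the stub from earlier:

```python
class Property(BaseModel):
    """A single property represented as a key-value pair."""
    key: str = Field(..., description="Property key")
    value: str = Field(..., description="Property value")
```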
Each Property is defined as a key-value pair. While this approach is flexible, it has its limitations. For instance, we can't provide a unique description for each property, nor can we specify some properties as mandatory while others are optional, so all properties are defined as optional. Additionally, properties aren't defined individually for each node or relationship type but are instead shared across all of them.
We’ve also implemented a detailed system prompt to help guide the extraction. In my experience, though, the function and argument descriptions tend to have a greater impact than the system message.
Unfortunately, at the moment, there is no simple way to customize function or argument descriptions in LLM Graph Transformer.
Since only a few commercial LLMs and LLaMA 3 support native tools, we implemented a fallback for models without tool support. You can also set ignore_tool_usage=True to switch to a prompt-based approach even when using a model that supports tools.
Most of the prompt engineering and examples for the prompt-based approach were contributed by Geraldus Wilsen.
With the prompt-based approach, we have to define the output structure directly in the prompt. You can find the whole prompt here. In this blog post, we’ll just do a high-level overview. We start by defining the system prompt.
In the prompt-based approach, a key difference is that we ask the LLM to extract only relationships, not individual nodes. This means we won't have any isolated nodes, unlike with the tool-based approach. Additionally, because models lacking native tool support typically perform worse, we do not allow the extraction of any properties, whether for nodes or relationships, to keep the extraction output simpler.
Next, we add a couple of few-shot examples to the model.
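For illustration, the examples follow roughly this shape, with each record pairing an input text with one extracted triple; the exact examples and key names shipped in LangChain may differ:

```python
# Hypothetical few-shot example in the style used for prompt-based extraction.
examples = [
    {
        "text": "Adam is a software engineer at Microsoft.",
        "head": "Adam",
        "head_type": "Person",
        "relation": "WORKS_FOR",
        "tail": "Microsoft",
        "tail_type": "Company",
    },
]
```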
In this approach, there’s currently no support for adding custom few-shot examples or extra instructions. The only way to customize is by modifying the entire prompt through the prompt attribute. Expanding customization options is something we’re actively considering.
Next, we’ll take a look at defining the graph schema.
When using the LLM Graph Transformer for information extraction, defining a graph schema is essential for guiding the model to build meaningful and structured knowledge representations. A well-defined graph schema specifies the types of nodes and relationships to be extracted, along with any attributes associated with each. This schema serves as a blueprint, ensuring that the LLM consistently extracts relevant information in a way that aligns with the desired knowledge graph structure.
In this blog post, we’ll use the opening paragraph of Marie Curie’s Wikipedia page for testing with an added sentence at the end about Robin Williams.
We’ll also be using GPT-4o in all examples.
To start, let’s examine how the extraction process works without defining any graph schema.
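A minimal setup, assuming an OPENAI_API_KEY environment variable and using an abbreviated version of the test text:

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
no_schema = LLMGraphTransformer(llm=llm)

# Opening paragraph of Marie Curie's Wikipedia page, abbreviated here,
# plus the added sentence about Robin Williams.
text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist
and chemist who conducted pioneering research on radioactivity.
Also, Robin Williams.
"""
documents = [Document(page_content=text)]
```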
Now we can process the documents using the aconvert_to_graph_documents function, which is asynchronous. Using async with LLM extraction is recommended, as it allows for parallel processing of multiple documents. This approach can significantly reduce wait times and improve throughput, especially when dealing with multiple documents.
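The call itself is a one-liner, run inside an async context such as a notebook cell:

```python
data = await no_schema.aconvert_to_graph_documents(documents)
```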
The response from the LLM Graph Transformer will be a graph document, which has the following structure:
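An abbreviated, illustrative example of the shape; the actual contents vary between runs:

```python
[
    GraphDocument(
        nodes=[
            Node(id="Marie Curie", type="Person"),
            Node(id="Nobel Prize", type="Award"),
        ],
        relationships=[
            Relationship(
                source=Node(id="Marie Curie", type="Person"),
                target=Node(id="Nobel Prize", type="Award"),
                type="WON",
            ),
        ],
        source=Document(page_content="Marie Curie, born in 1867, was ..."),
    )
]
```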
The graph document describes the extracted nodes and relationships. Additionally, the source document of the extraction is added under the source key.
We can use the Neo4j Browser to visualize the outputs, providing a clearer and more intuitive understanding of the data.
The image above shows two extraction passes over the same paragraph about Marie Curie. In this case, we used GPT-4o with tool-based extraction, which also allows for isolated nodes, as illustrated in the image. Because no graph schema was defined, the LLM determines at runtime what information to extract, which can lead to variations in the output, even from the same paragraph. As a result, some extractions are more detailed than others and may vary in structure, even for the same information. For instance, on the left, Marie is represented as the WINNER of the Nobel Prize, while on the right, she WON the Nobel Prize.
Now, let’s try the same extraction using the prompt-based approach. For models that support tools, you can enable prompt-based extraction by setting the ignore_tool_usage parameter.
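Using the same llm and documents as before:

```python
no_schema_prompt = LLMGraphTransformer(llm=llm, ignore_tool_usage=True)
data = await no_schema_prompt.aconvert_to_graph_documents(documents)
```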
Again, we can visualize two separate executions in Neo4j Browser.
With the prompt-based approach, we won’t see any isolated nodes. However, as with previous extractions, the schema can vary between runs, resulting in different outputs on the same input.
Next, let’s walk through how defining a graph schema can help produce more consistent outputs.
Constraining the extracted graph structure can be highly beneficial, as it guides the model to focus on specific, relevant entities and relationships. By defining a clear schema, you improve consistency across extractions, making the outputs more predictable and aligned with the information you actually need. This reduces variability between runs and ensures that the extracted data follows a standardized structure, capturing expected information. With a well-defined schema, the model is less likely to overlook key details or introduce unexpected elements, resulting in cleaner, more usable graphs.
We’ll start by defining the expected types of nodes to extract using the allowed_nodes parameter.
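A sketch of this configuration; the two node types beyond Person, Organization, and Location are inferred from the results discussed below:

```python
allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
nodes_defined = LLMGraphTransformer(llm=llm, allowed_nodes=allowed_nodes)
data = await nodes_defined.aconvert_to_graph_documents(documents)
```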
Here, we defined five node types for the LLM to extract, including Person, Organization, and Location. We visualize two separate executions in Neo4j Browser for comparison.
By specifying the expected node types, we achieve more consistent node extraction. However, some variation may still occur. For example, in the first run, “radioactivity” was extracted as a research field, while in the second, it was not.
Since we haven’t defined relationships, their types can also vary across runs. Additionally, some extractions may capture more information than others. For instance, the MARRIED_TO relationship between Marie and Pierre isn’t present in both extractions.
Now, let’s explore how defining relationship types can further improve consistency.
As we’ve observed, defining only node types still allows for variation in relationship extraction. To address this, let’s explore how to define relationships as well. The first approach is to specify allowed relationships using a list of available types.
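A sketch of this configuration, with relationship types drawn from the results discussed below:

```python
allowed_relationships = ["SPOUSE", "AWARD", "FIELD_OF_RESEARCH", "WORKS_AT"]
rels_defined = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
)
data = await rels_defined.aconvert_to_graph_documents(documents)
```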
Let’s again examine two separate extractions.
With both nodes and relationships defined, our outputs become significantly more consistent. For example, Marie is always shown as winning an award, being the spouse of Pierre, and working at the University of Paris. However, since relationships are specified as a general list without restrictions on which nodes they can connect, some variation still occurs. For instance, the FIELD_OF_RESEARCH relationship might appear between a Person and a ResearchField, but sometimes it links an Award to a ResearchField. Additionally, since relationship directions aren’t defined, there may be differences in directional consistency.
To address the issues of not being able to specify which nodes a relationship can connect and enforcing relationship direction, we recently introduced a new option for defining relationships, as shown below.
Rather than defining relationships as a simple list of strings, we now use a three-element tuple format, where the elements represent the source node, relationship type, and target node, respectively.
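A sketch of the tuple-based schema, reusing the node types from earlier; the specific tuples are illustrative:

```python
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "AWARD", "Award"),
    ("Person", "WORKS_AT", "Organization"),
    ("Person", "FIELD_OF_RESEARCH", "ResearchField"),
]
rels_tuple = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
)
data = await rels_tuple.aconvert_to_graph_documents(documents)
```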
Let’s visualize the results again.
Using the three-tuple approach provides a much more consistent schema for the extracted graph across multiple executions. However, given the nature of LLMs, there may still be some variation in the level of detail extracted. For instance, on the right side, Pierre is shown as winning the Nobel Prize, while on the left, this information is missing.
The final enhancement we can make to the graph schema is to define properties for nodes and relationships. Here, we have two options. The first is to set either node_properties or relationship_properties to True, which allows the LLM to autonomously decide which properties to extract.
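For example:

```python
props_auto = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=True,           # let the LLM pick node properties
    relationship_properties=True,   # let the LLM pick relationship properties
)
data = await props_auto.aconvert_to_graph_documents(documents)
```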
Let’s examine the results.
We’ve enabled the LLM to add any node or relationship properties it considers relevant. For instance, it chose to include Marie Curie’s birth and death dates, her role as a professor at the University of Paris, and the fact that she won the Nobel Prize twice. These additional properties significantly enrich the extracted information.
The second option we have is to define the node and relationship properties we want to extract.
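A sketch of this configuration; the property names match the dates discussed below:

```python
node_properties = ["birth_date", "death_date"]
relationship_properties = ["start_date"]
props_listed = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=node_properties,
    relationship_properties=relationship_properties,
)
data = await props_listed.aconvert_to_graph_documents(documents)
```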
The properties are simply defined as two lists. Let’s see what the LLM extracted.
The birth and death dates remain consistent with the previous extraction. However, this time, the LLM also extracted the start date of Marie’s professorship at the University of Paris.
Properties indeed add valuable depth to the extracted information, though there are currently some limitations in this implementation:

- Properties can only be extracted using the tool-based extraction approach.
- All properties are extracted as optional, so there is no way to mark specific properties as mandatory.
- Properties are defined globally and shared across all node and relationship types rather than per type, and individual properties cannot carry their own descriptions.
If you thought we had perfected a way to make the LLM follow the defined schema flawlessly, I have to set the record straight. While we invested considerable effort into prompt engineering, it’s challenging to get LLMs, especially the less performant ones, to adhere to instructions with complete accuracy. To tackle this, we introduced a post-processing step, called strict_mode, that removes any information not conforming to the defined graph schema, ensuring cleaner and more consistent results.
By default, strict_mode is set to True, but you can disable it with the following code:
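```python
not_strict = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    strict_mode=False,
)
data = await not_strict.aconvert_to_graph_documents(documents)
```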
With strict mode turned off, you may get node or relationship types outside the defined graph schema, as LLMs can sometimes take creative liberties with output structure.
The extracted graph documents from the LLM Graph Transformer can be imported into graph databases like Neo4j for further analysis and applications using the add_graph_documents method. We’ll explore different options for importing this data to suit different use cases.
You can import nodes and relationships into Neo4j using the following code.
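A minimal sketch, assuming a local Neo4j instance with placeholder credentials; here data is the list returned by aconvert_to_graph_documents:

```python
from langchain_community.graphs import Neo4jGraph

# Placeholder credentials; Neo4jGraph can also read NEO4J_URI,
# NEO4J_USERNAME, and NEO4J_PASSWORD from environment variables.
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
)
graph.add_graph_documents(data)
```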
This method straightforwardly imports all nodes and relationships from the provided graph documents. We’ve used this approach throughout the blog post to review the results of different LLM and schema configurations.
Most graph databases support indexes to optimize data import and retrieval. In Neo4j, indexes can only be set for specific node labels. Since we might not know all the node labels in advance, we can handle this by adding a secondary base label to each node using the baseEntityLabel parameter. This way, we can still leverage indexing for efficient importing and retrieval without needing an index for every possible node label in the graph.
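```python
graph.add_graph_documents(data, baseEntityLabel=True)
```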
As mentioned, using the baseEntityLabel parameter will result in each node having an additional __Entity__ label.
The final option is to also import the source documents for the extracted nodes and relationships. This approach lets us track which documents each entity appeared in. You can import the source documents using the include_source parameter.
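```python
graph.add_graph_documents(data, include_source=True)
```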
Upon inspecting the imported graph, we should see a result similar to this.
In this visualization, the source document is highlighted in blue, with all entities extracted from it connected by MENTIONS relationships. This mode allows you to build retrievers that utilize both structured and unstructured search approaches.
In this post, we explored LangChain’s LLM Graph Transformer and its dual modes for building knowledge graphs from text. The tool-based mode, our primary approach, leverages structured output and function calling, which reduces prompt engineering and allows for property extraction. Meanwhile, the prompt-based mode is useful when tools aren’t available, relying on few-shot examples to guide the LLM. However, prompt-based extraction does not support property extraction and also yields no isolated nodes.
We observed that defining a clear graph schema, including allowed node and relationship types, improves extraction consistency and performance. A constrained schema helps ensure that the output adheres to our desired structure, making it more predictable, reliable, and applicable. Whether using tools or prompts, the LLM Graph Transformer enables more organized, structured representations of unstructured data, enabling better RAG applications and multi-hop query handling.
The code is available on GitHub. You can also try out the LLM Graph Transformer in a no-code environment using Neo4j’s hosted LLM Graph Builder application.
llm-graph-builder.neo4jlabs.com