May 2, 2024

Unlocking AI Advancement: The Superiority of Lakehouse Architectures for Hosting Enterprise Knowledge Graphs

Lakehouses, exemplified by table formats like Apache Iceberg, emerge as optimal environments for hosting enterprise knowledge graphs tailored for AI applications. Industry giants like Google, Microsoft, and Nvidia emphasize the pivotal role of knowledge graphs in AI, particularly evident in techniques like retrieval-augmented generation (RAG) for LLMs. However, establishing a standardized tech stack for housing knowledge graphs within data infrastructure remains elusive. Existing case studies primarily spotlight smaller datasets, overlooking the complexities of managing the large-scale knowledge graphs prevalent in industry. This presentation delves into the superiority of lakehouse architectures, such as those built on Iceberg, for accommodating enterprise knowledge graphs compared to alternative solutions. Through an analysis of scalability, performance, cost-effectiveness, ecosystem integration, and associated trade-offs, illustrated with real-world examples, we uncover why lakehouses stand out as the premier choice for hosting the expansive knowledge graphs critical to AI advancement.

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Dr. Weimo Liu:

Hi everyone. Today we will talk about unlocking AI advancement: the superiority of lakehouse architectures for hosting enterprise knowledge graphs. I’m Weimo from PuppyGraph. Before we created PuppyGraph, I worked at Google and TigerGraph on query languages and query engines. Here is today’s agenda. First, we will talk about the challenges of adopting LLMs in the enterprise and then explain why Graph RAG can help. We will also show some examples with a GPT model and IMDb data to prove our point. We will then share how to build a knowledge graph for LLMs, with our data stack and experience. Finally, we will talk about the advantages of hosting knowledge graphs on a lakehouse.

Challenges of Adopting LLMs in Enterprises

First, let’s talk about the challenges of adopting LLMs in the enterprise. The first challenge is that it’s hard for ChatGPT to query your private data. The GPT models are not trained on private data, and most enterprises don’t want to share their private data with OpenAI. The second challenge is that when the query is a data-oriented question, ChatGPT can hallucinate. It may provide a wrong answer, or give a long paragraph that doesn’t really answer the question. We will show the hallucination with a real example. If you ask ChatGPT what the ninth movie of Tom Hanks is, it will say that Tom Hanks is a famous actor with a lot of famous movies. But then the last sentence will be: if you have a specific movie in mind that you think may be his ninth, feel free to ask about it. And when we ask directly about the ninth movie, like how Dragnet ranks among Tom Hanks’ movies, ChatGPT answers that Dragnet is among Tom Hanks’ early movies and talks about what happens in Dragnet. That doesn’t answer the question either. But when we use Graph RAG to answer this question, we can see that the ninth movie of Tom Hanks is Dragnet. We directly get the answer and also a link as proof. Here is the IMDb reference for Dragnet.

Let’s take a look at another question: which person does Jackie Chan collaborate with most frequently? ChatGPT will answer that Jackie Chan is famous for his fighting-style movies and has a lot of achievements. But the last sentence is that his Wikipedia page or other filmography resources would offer detailed insights. That is not helpful at all and doesn’t answer the question. With Graph RAG, we can break it into three steps: thought, action, and observation. The thought is that we need to find the person Jackie Chan collaborates with most frequently. The action is to extract all the persons who have collaborated with Jackie Chan and then rank them by the number of collaborations. The observation is that the person Jackie Chan collaborates with most frequently is James Sie, with 181 collaborations. In fact, James Sie provided the voice of Jackie Chan in animations.

Why Graph RAG?

Then why Graph RAG? First, let us explain what Graph RAG is. Graph RAG is RAG based on a knowledge graph. It builds on the concept of RAG by leveraging the knowledge graph already built by the users or the enterprise. Graph RAG allows the integration of structured data like a knowledge graph, so the LLM can process it better and has a better-informed basis for its response. In the example, we can see that dogs are animals, animals are living things, cows are also animals, cows eat herbs, and herbs are plants. Plants are also living things. Each point is a concept, and each edge shows the relation between two concepts. A lot of famous entrepreneurs and scientists have already pointed out that knowledge graphs can benefit LLMs a lot. Jensen Huang, in his NVIDIA conference keynote, showed that different data sources like structured data, knowledge graphs, and vector databases can benefit LLMs and AI systems. Also, Andrew Ng, one of the most famous AI scientists, has a course on knowledge-graph-based RAG.
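To make the points-and-edges picture concrete, here is a minimal Python sketch of the dogs, cows, and plants example stored as triples. The relation names `is_a` and `eats` and the helper functions are assumptions made for illustration; they are not taken from the talk or from any particular graph product.

```python
from collections import defaultdict

# Toy triples for the example in the talk: (subject, relation, object).
# The relation names "is_a" and "eats" are illustrative assumptions.
TRIPLES = [
    ("dog", "is_a", "animal"),
    ("cow", "is_a", "animal"),
    ("animal", "is_a", "living thing"),
    ("cow", "eats", "herb"),
    ("herb", "is_a", "plant"),
    ("plant", "is_a", "living thing"),
]


def build_index(triples):
    """Index outgoing edges by (subject, relation)."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def ancestors(index, concept):
    """Return every concept reachable from `concept` by following is_a edges."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in index.get((current, "is_a"), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


INDEX = build_index(TRIPLES)
print(sorted(ancestors(INDEX, "dog")))
```

A retrieval step could hand facts like these to the LLM as context, so the model answers from explicit relations rather than from memory.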

How to Build Your Own AI Chatbot With Private Data and LLM

Then let’s share how we build our own AI chatbot with private data and an LLM. The tech stack includes GPT-3.5, the IMDb dataset, Apache Iceberg as the data storage, and PuppyGraph to host the data in Iceberg as a graph. There are three tables in the IMDb dataset. The first is title basics. It includes movies, animations, TV shows, and others. tconst is the primary key. The title type may be movie, animation, and so on. The primary title is the name of the movie, animation, or TV show. There are also some other attributes like start year and end year. The second table is name basics. Each row is a person who works in the movie industry: a producer, actor, director, or another role. nconst is the primary key. The primary name is the name of the person, and the primary profession says whether he or she is an actor, director, or producer. The third table is title principals. It is the connection between titles and names. tconst is the primary key of title basics and nconst is the primary key of name basics, which means that for a movie with actors, title principals will have a row with the tconst of the movie and the nconst of each actor. It stores the relation between title basics and name basics.

We can model these three tables as a graph. The blue point is the title and the red point is the person. A person can be in the cast or the crew of a title. A person can be an actor or actress, and also has attributes like primary name, birth year, and death year. A title has a title type, primary title, start year, and other information. Here your data is still stored in three tables, but logically you already have a graph schema on top of it. Then you can visualize the data and query it with PuppyGraph. In this dataset there are about 22 million vertices and 170 million edges.
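The idea of laying a graph schema over tables that stay where they are can be sketched in Python as follows. The mapping format, the edge label `worked_on`, and the sample rows are hypothetical and made up for illustration; this is not PuppyGraph’s actual schema syntax, only a way to show that vertices and edges can be read directly from the existing rows.

```python
# Rows shaped like the three IMDb tables from the talk. All values are made up.
TABLES = {
    "title_basics": [
        {"tconst": "tt_demo_1", "titleType": "movie",
         "primaryTitle": "Example Movie", "startYear": 1999},
    ],
    "name_basics": [
        {"nconst": "nm_demo_1", "primaryName": "Example Actor",
         "primaryProfession": "actor"},
    ],
    "title_principals": [
        {"tconst": "tt_demo_1", "nconst": "nm_demo_1", "category": "actor"},
    ],
}

# Hypothetical mapping format: which table backs each vertex or edge label.
GRAPH_SCHEMA = {
    "vertices": {
        "title": {"table": "title_basics", "id": "tconst"},
        "person": {"table": "name_basics", "id": "nconst"},
    },
    "edges": {
        "worked_on": {
            "table": "title_principals",
            "from": ("person", "nconst"),
            "to": ("title", "tconst"),
        },
    },
}


def vertices(label):
    """Yield (vertex_id, row) pairs straight from the backing table."""
    spec = GRAPH_SCHEMA["vertices"][label]
    for row in TABLES[spec["table"]]:
        yield (label, row[spec["id"]]), row


def edges(label):
    """Yield (source_id, target_id, row) triples straight from the backing table."""
    spec = GRAPH_SCHEMA["edges"][label]
    from_label, from_key = spec["from"]
    to_label, to_key = spec["to"]
    for row in TABLES[spec["table"]]:
        yield (from_label, row[from_key]), (to_label, row[to_key]), row


for source, target, row in edges("worked_on"):
    print(source, "-[worked_on]->", target, row["category"])
```

Nothing is copied into a separate store here: the graph view is only a description of how to interpret rows that already exist.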

After we have the knowledge graph, we can consider how to answer a question like which movie ranks ninth among Tom Hanks’ movies. From all the person vertices, we find the one whose primary name is Tom Hanks. Then we find all the movies he acted in, order them by start year, and get the ninth. Then we can directly get the name, which is Dragnet. The query is generated by the LLM with some prompting. We can see that it’s explainable and gives an accurate answer. Now the second question: who collaborates most with Jackie Chan? We first get the vertex whose type is person and whose name is Jackie Chan. Then we go along an edge and get all his titles. Then we go along the edge in the other direction and get all the other persons. Then we take the primary name, group by the primary name, and count each primary name. That gives a list of persons with the number of times each has collaborated with Jackie Chan. We order it by value and take the first, which is James Sie.
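The two traversals described above can be sketched in plain Python over a toy in-memory graph. The data, names, and function names below are made up for illustration, and the code is not the query the LLM generated in the talk; it only mirrors the steps: find the person, follow edges to titles, then either sort by start year or hop back to the other persons and count.

```python
from collections import Counter

# Toy data, made up for illustration.
PEOPLE = {"p1": "Actor A", "p2": "Actor B", "p3": "Actor C"}
TITLES = {
    "t1": ("First Film", 1980),
    "t2": ("Second Film", 1984),
    "t3": ("Third Film", 1987),
}
# Edges as (person_id, title_id) pairs.
WORKED_ON = [
    ("p1", "t1"), ("p1", "t2"), ("p1", "t3"),
    ("p2", "t1"), ("p2", "t2"),
    ("p3", "t3"),
]


def person_id(name):
    """Find the person vertex whose primary name matches."""
    return next(pid for pid, n in PEOPLE.items() if n == name)


def nth_title(name, n):
    """Order the person's titles by start year and return the n-th (1-based)."""
    pid = person_id(name)
    title_ids = [t for p, t in WORKED_ON if p == pid]
    ordered = sorted(title_ids, key=lambda t: TITLES[t][1])
    return TITLES[ordered[n - 1]][0]


def top_collaborator(name):
    """Person -> titles -> other persons, grouped by name and counted."""
    pid = person_id(name)
    my_titles = {t for p, t in WORKED_ON if p == pid}
    counts = Counter(
        PEOPLE[p] for p, t in WORKED_ON if t in my_titles and p != pid
    )
    return counts.most_common(1)[0]


print(nth_title("Actor A", 2))
print(top_collaborator("Actor A"))
```

Because every step is an explicit lookup or hop, the answer can be traced back to the rows that produced it, which is the explainability point made in the talk.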

Knowledge Graphs on Lakehouse

Finally, let’s explain why hosting a knowledge graph on a lakehouse is better than hosting it in a graph database. There are four main reasons. One is large scale. Another is integration with the existing data pipeline. There is also no data duplication. The final one is total data control. Let’s go through them one by one. The first is large scale. In the real world, a knowledge graph is usually huge. Take the Google Knowledge Graph: when you type into the Google search box, for example, Tom Hanks again, you will see several movies like Forrest Gump, Dragnet, and some others. Then you can click on a movie, and you will get its other actors, producers, directors, and other information.

Another very famous knowledge graph is the Wikipedia knowledge graph. You can imagine that it’s also a big knowledge graph. The enterprise knowledge graph is large as well. Because knowledge grows day by day, the data size can easily go up to billions of edges. A small knowledge graph is limited: the information it provides is limited, and the value it brings is relatively small. The reason we need a lakehouse to handle a big knowledge graph is that it’s hard for a graph database to handle a very large amount of data, and the cost is very high.

The second reason is no data duplication. Maybe some of us haven’t realized that the data in a lakehouse is already a knowledge graph. For example, an e-commerce website has tables like customer, product, credit card, and purchase history. They are different tables in the lakehouse, but logically they also form a graph.

Purchase history is actually an edge between the customer and the product. When a customer buys a product, there is an edge between them, and that edge is a row in the purchase history. You can also have other types of edges, like click history. If a customer clicks a product webpage, it’s a row in your lakehouse, but logically there is also an edge between them. They are all connected. When the data size is big, no one wants to do ETL because it’s hard, complex, and expensive. There must be an engineer maintaining the ETL to make sure it works well and the data stays in sync. It’s also hard and costly to maintain another CDC (change data capture) pipeline.

The third reason is integration with the existing data pipeline. If you have already set up a lakehouse, the data pipeline runs very well. The data is not only used for the LLM, it’s also used for the existing use cases, and no one wants to bring in additional system complexity. At the same time, if you can model the data in the lakehouse as a graph, you can also run graph data analysis jobs like anti-fraud, cybersecurity, and e-commerce recommendation. Then you can see the connections within your data, and a visualization gives the users and data owners easy insight into it. In this case, you only have one copy of the data, and your existing data pipeline needs no additional work. But you have more information and get more value from your data.
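The point that purchase-history rows already act as edges can be sketched in Python. The rows, column names, and the `also_bought` helper are made up for illustration; the sketch treats each row as a customer-to-product edge and walks product, then buyers, then their other products, which is the shape of a simple recommendation query.

```python
from collections import Counter

# Rows shaped like a purchase_history table; values are made up.
# Each row is logically an edge: customer -> product.
PURCHASE_HISTORY = [
    {"customer_id": "c1", "product_id": "laptop"},
    {"customer_id": "c1", "product_id": "mouse"},
    {"customer_id": "c2", "product_id": "laptop"},
    {"customer_id": "c2", "product_id": "keyboard"},
    {"customer_id": "c3", "product_id": "laptop"},
    {"customer_id": "c3", "product_id": "mouse"},
]


def also_bought(product_id, rows):
    """Two-hop walk: product -> its buyers -> the other products they bought."""
    buyers = {r["customer_id"] for r in rows if r["product_id"] == product_id}
    counts = Counter(
        r["product_id"]
        for r in rows
        if r["customer_id"] in buyers and r["product_id"] != product_id
    )
    return counts.most_common()


print(also_bought("laptop", PURCHASE_HISTORY))
```

The walk reads the table as it is, so no second copy of the data and no extra sync job are involved in this sketch.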

The final reason is total data control. First, no one wants a third party to access their data. That applies not only to the LLM vendor, but also to the data management vendor. If you need another data stack, you have to set up all the permissions and access control again. But if you can reuse your current lakehouse to host the knowledge graph, you can leverage the existing data management instead of taking on additional work.

Here is a summary. We don’t need to duplicate data, and we can handle a large knowledge graph. The data can be used for daily data analysis, for the LLM, and for new graph data analysis cases. You don’t need to add more data access management to your existing data pipeline. Thank you.