Retrieval-Augmented Generation (RAG) vs Data Extraction
While they may appear similar, RAG and Data Extraction differ fundamentally in their use cases and goals.
RAG combines information retrieval with generative AI to produce responses grounded in relevant context. RAG is ideal for answering specific questions or generating context-aware responses from a large corpus, such as handling FAQs, policy summaries, or dynamic information.
Let’s consider a practical example using an employee policy document that contains information about work-from-home guidelines, employee benefits, and leave policies.
For a RAG-based system, the user might ask a question, such as:
How much paid time off does an employee with 5 years of service receive?
The system might then respond with an answer:
Employees with 5 years of service are eligible for 20 days of paid time off per year. Leave requests should be submitted at least two weeks in advance, and PTO is restricted during company-wide blackout periods.
RAG provides contextual responses in natural language, making it well suited for conversational or FAQ-type applications.
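To make the contrast concrete, a minimal RAG flow over a ColiVara collection might look like the sketch below. It reuses the rag_client and llm_client that are set up in the extraction walkthrough later in this guide; the search() call and the result fields are assumptions based on ColiVara's search endpoint, not verified API.

# A minimal RAG sketch: retrieve the most relevant pages, then generate an answer.
# Assumes rag_client (ColiVara) and llm_client (OpenAI) from the setup steps below;
# the search() signature and result fields here are assumptions, not verified API.
query = "How much paid time off does an employee with 5 years of service receive?"
results = rag_client.search(query=query, collection_name="my_collection", top_k=3)

content = [{"type": "text", "text": f"Answer using the attached policy pages: {query}"}]
for result in results.results:  # assumed result container
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{result.img_base64}"},  # assumed field
    })

completion = llm_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(completion.choices[0].message.content)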
Data extraction, by contrast, pulls structured information out of documents (e.g., names, dates, specific values) without generating a narrative response. The goal is to capture and structure data for easier access and analysis.
Using the previous example, when querying the same employee policy document, a user might ask the system for "PTO details". A data extraction system would then return structured JSON.
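For illustration, the field names and values below are hypothetical, mirroring the PTO details from the RAG answer above:

{
    "policy": "Paid Time Off",
    "pto_days_per_year": 20,
    "years_of_service_required": 5,
    "request_notice_period": "two weeks",
    "blackout_period_restriction": true
}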
Data Extraction returns structured outputs, ideal for populating databases, automating forms, or generating reports where individual fields are needed without additional narrative context.
Data Extraction using ColiVara
Step 1: Client Setup
Install the colivara-py SDK.
If using Jupyter Notebook:
!pip install --no-cache-dir --upgrade colivara_py
If using the command shell:
pip install --no-cache-dir --upgrade colivara_py
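A quick way to confirm the installation is to import the client class used in the next step:

# The import should succeed without errors once the package is installed.
from colivara_py import ColiVara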
Step 2: Prepare Documents
1. Download Files: Download the desired files to your machine. The code below will download them into your docs/ folder.
import requests
import os

def download_file(url, local_filename):
    response = requests.get(url)
    if response.status_code == 200:
        os.makedirs('docs', exist_ok=True)
        with open(local_filename, 'wb') as f:
            f.write(response.content)
        print(f"Successfully downloaded: {local_filename}")
    else:
        print(f"Failed to download: {url}")

# URLs and local filenames
files = [
    {
        "url": "https://github.com/tjmlabs/colivara-demo/raw/main/docs/Work-From-Home%20Guidance.pdf",
        "filename": "docs/Work-From-Home-Guidance.pdf"
    },
    {
        "url": "https://github.com/tjmlabs/colivara-demo/raw/main/docs/StaffVendorPolicy-Jan2019.pdf",
        "filename": "docs/StaffVendorPolicy-Jan2019.pdf"
    }
]

# Download each file
for file in files:
    download_file(file["url"], file["filename"])
2. Upload Files to be processed: Sync documents to the ColiVara server. The server will process these files to generate the necessary embeddings.
from colivara_py import ColiVara
from pathlib import Path
import base64

rag_client = ColiVara(
    base_url="https://api.colivara.com",
    api_key="your-api-key"
)

new_collection = rag_client.create_collection(
    name="my_collection",
    metadata={"description": "A sample collection"}
)

def sync_documents():
    # Get all the documents under the docs/ folder and upsert them to ColiVara.
    documents_dir = Path('docs')
    files = [f for f in documents_dir.glob('**/*') if f.is_file()]
    for file in files:
        with open(file, 'rb') as f:
            file_content = f.read()
        encoded_content = base64.b64encode(file_content).decode('utf-8')
        rag_client.upsert_document(
            name=file.name,
            document_base64=encoded_content,
            collection_name="my_collection",
            wait=True
        )
        print(f"Upserted: {file.name}")

sync_documents()
3. (Optional) Verify that documents have been processed by ColiVara: Documents are converted into "screenshots" of each page to generate embeddings.
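One way to do this is to fetch an upserted document with its pages expanded, reusing the get_document call from the extraction step below (len(doc.pages) assumes pages is a list, which matches how it is iterated later):

# Fetch a processed document along with its page "screenshots".
document_name = "Work-From-Home-Guidance.pdf"
doc = rag_client.get_document(
    document_name=document_name,
    collection_name="my_collection",
    expand="pages"
)
print(f"{document_name}: {len(doc.pages)} page(s) processed")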
Step 3: Extract Data
1. Install the SDK for the LLM of your choice. Here, we are using OpenAI's GPT models.
If using Jupyter Notebook:
!pip install openai
If using the command shell:
pip install openai
2. Extract the JSON data:
import json
from openai import OpenAI

llm_client = OpenAI(api_key="your-api-key")

# If your document is too big (20+ pages), or you want to extract data across
# multiple documents, consider a pipeline where you search first and get the
# top 3 pages, and then do this step. Here, since our document is small, we
# are passing the whole document at once.
def extract_data(data_to_extract, colivara_document):
    string_json = json.dumps(data_to_extract)
    content = [
        {
            "type": "text",
            "text": f"""Use the following images as a reference to extract structured data with the following user example as a guide: {string_json}.\n
If information is not available, keep the value blank.""",
        }
    ]
    # Attach every page image ("screenshot") of the document to the prompt.
    pages = colivara_document.pages
    for page in pages:
        image_url = f"data:image/png;base64,{page.img_base64}"
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": image_url},
            }
        )
    messages = [
        {
            "role": "system",
            "content": "Our goal is to find out when a policy was issued to remind our users to review it at regular intervals. Always respond in JSON",
        },
        {"role": "user", "content": content},
    ]
    completion = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0.25,
        seed=123,
    )
    return completion.choices[0].message.content

document_name = "Work-From-Home-Guidance.pdf"
doc = rag_client.get_document(
    document_name=document_name,
    collection_name="my_collection",
    expand="pages"
)
data_to_extract = {"year_issued": 2014, "month_issued": 3}
data = extract_data(data_to_extract, doc)
print(data)