Open In Colab

Information Extraction with Haystack and NuExtract

Authored by: Stefano Fiorucci

In this notebook, we will see how to automate Information Extraction from textual data using Language Models.

🎯 Goal: create an application to extract specific information from a given text or URL, following a user-defined structure.

🧰 Stack

  • Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the Information Extraction Pipeline.

  • NuExtract: a small Language Model, specifically fine-tuned for structured data extraction.

Install dependencies

! pip install haystack-ai trafilatura transformers pyvis

Components

Haystack has two main concepts: Components and Pipelines.

🧩 Components are building blocks that perform a single task: file conversion, text generation, embedding creation…

Pipelines allow you to define the flow of data through your LLM application by combining Components in a directed (cyclic) graph.

We will now introduce the various components of our Information Extraction application. Afterwards, we will integrate them into a Pipeline.

LinkContentFetcher and HTMLToDocument: extract text from web pages

In our experiment, we will extract data from startup funding announcements found on the web.

To download web pages and extract text, we use two components:

>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument


>>> fetcher = LinkContentFetcher()

>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]

>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)

>>> print(docs)
{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}
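To double-check the conversion, we can peek at the beginning of the extracted text (a quick sketch):

# Sketch: inspect the first 100 characters of the converted document
print(docs["documents"][0].content[:100])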

HuggingFaceLocalGenerator: load and try the model

We use the HuggingFaceLocalGenerator, a text generation component that loads a model hosted on the Hugging Face Hub and runs it locally via the Transformers library.

Haystack supports many other Generators, including HuggingFaceAPIGenerator (compatible with Hugging Face APIs and TGI).
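For instance, here is a minimal sketch of the API-based alternative (not used in this notebook; it assumes an HF_API_TOKEN environment variable):

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

# Sketch: run the model via the Hugging Face Serverless Inference API instead of locally
remote_generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "numind/NuExtract"},
    token=Secret.from_env_var("HF_API_TOKEN"),
)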

We load NuExtract, a model fine-tuned from microsoft/Phi-3-mini-4k-instruct to perform structured data extraction from text. The model size is 3.8B parameters. Other variants are also available: NuExtract-tiny (0.5B) and NuExtract-large (7B).

The model is loaded with bfloat16 precision to fit in Colab with negligible performance loss compared to FP32, as suggested in the model card.

Notes on Flash Attention

At inference time, you will probably see a warning saying: “You are not running the flash-attention implementation”.

The GPUs available in free environments such as Colab and Kaggle do not support it, so we do not use it in this notebook.

In case your GPU architecture supports it (details), you can install it and get a speed-up as follows:

pip install flash-attn --no-build-isolation

Then add "attn_implementation": "flash_attention_2" to model_kwargs.
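Putting it together, the initialization would look like this (a sketch, only valid on GPUs that support Flash Attention 2 and with flash-attn installed):

# Sketch: same generator as below, with Flash Attention 2 enabled
generator = HuggingFaceLocalGenerator(
    model="numind/NuExtract",
    huggingface_pipeline_kwargs={
        "model_kwargs": {
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    },
)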

from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(
    model="numind/NuExtract", huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype": torch.bfloat16}}
)

# actually load the model (warm_up is invoked automatically when the generator runs inside a Pipeline)
generator.warm_up()

The model supports a specific prompt structure, as can be inferred from the model card.

Let’s manually create a prompt to try the model. Later, we will see how to dynamically create the prompt based on different inputs.

>>> prompt = """<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
...
... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
...
... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """

>>> result = generator.run(prompt=prompt)
>>> print(result)
{'replies': ['{\n    "Car": {\n        "Name": "Fiat Panda",\n        "Manufacturer": "Fiat",\n        "Designers": [\n            "Giorgetto Giugiaro",\n            "Aldo Mantovani",\n            "Giuliano Biasio",\n            "Roberto Giolito"\n        ],\n        "Number of units produced": "over 7.8 million"\n    }\n}\n']}

Nice ✅
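Since the reply is a JSON string, we can quickly verify that it parses into a Python dictionary (the Pipeline will automate this step later):

import json

# Parse the JSON string returned by the model
parsed = json.loads(result["replies"][0])
print(parsed["Car"]["Name"])  # Fiat Panda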

PromptBuilder: dynamically create prompts

The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in parameters passed through keyword arguments.

Our prompt template reproduces the structure shown in the model card.

During our experiments, we discovered that indenting the schema is particularly important to ensure good results. This probably stems from how the model was trained.
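As a quick sanity check, you can preview what the tojson filter emits with plain json.dumps (Jinja's default tojson policy sorts keys, which is why they appear alphabetically in the rendered prompt below):

import json

# Approximates Jinja's `tojson(indent=4)`, whose default policy sorts keys
schema_preview = {"Car": {"Name": "", "Manufacturer": ""}}
print(json.dumps(schema_preview, indent=4, sort_keys=True))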

from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text:
{{documents[0].content}}
<|output|>
"""

prompt_builder = PromptBuilder(template=prompt_template)
>>> example_document = Document(content="The Fiat Panda is a city car...")

>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }

>>> prompt = prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]

>>> print(prompt)
<|input|>
### Template:
{
    "Car": {
        "Designers": [],
        "Manufacturer": "",
        "Name": "",
        "Number of units produced": ""
    }
}

### Text:
The Fiat Panda is a city car...
<|output|>

Works well ✅
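The template also supports optional few-shot examples through the examples variable; a hypothetical call (the example content is made up):

# Hypothetical few-shot example; `examples` is an optional template variable
few_shot = [{"Car": {"Name": "Fiat Panda", "Manufacturer": "Fiat"}}]
prompt_with_examples = prompt_builder.run(
    documents=[example_document], schema=example_schema, examples=few_shot
)["prompt"]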

OutputAdapter

You may have noticed that the result of the extraction is the first element of the replies list and consists of a JSON string.

We would like to have a dictionary for each source document. To perform this transformation in a pipeline, we can use the OutputAdapter.

>>> import json
>>> from haystack.components.converters import OutputAdapter


>>> adapter = OutputAdapter(
...     template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
...     output_type=dict,
...     custom_filters={"json_loads": json.loads},
... )

>>> print(adapter.run(**result))
{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}

Information Extraction Pipeline

Build the Pipeline

We can now create our Pipeline by adding and connecting the individual components.

from haystack import Pipeline

ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)

ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")

# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL

# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
#                                       huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
#                                          output_type=dict,
#                                          custom_filters={"json_loads": json.loads}))

# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")

Let’s review our pipeline setup:

>>> ie_pipe.show()
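Besides the visual overview, the Pipeline can also be serialized, which is handy for versioning it (a sketch):

# Sketch: serialize the pipeline definition to YAML
print(ie_pipe.dumps())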

Define the sources and the extraction schema

We select a list of URLs related to recent startup funding announcements.

Additionally, we define a schema for the structured information we aim to extract.

urls = [
    "https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
    "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
    "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
    "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
    "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
    "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
    "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
    "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
    "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
    "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
    "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
    "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]


schema = {
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}

Run the Pipeline!

We pass the required data to each component.

Note that most of them receive data from previously executed components.

from tqdm import tqdm

extracted_data = []

for url in tqdm(urls):
    result = ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})

    extracted_data.append(result["adapter"]["output"])

Let’s inspect some of the extracted data

extracted_data[:2]

Data exploration and visualization

Let’s explore the extracted data to assess its correctness and gain insights.

Dataframe

We start by creating a pandas DataFrame. For simplicity, we flatten the extracted data.

def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)
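For instance, here is how flatten_dict behaves on a small made-up record:

# Made-up record to illustrate the flattening
sample = {"Company": {"Name": "Acme", "Investors": ["A Capital", "B Ventures"]}}
print(flatten_dict(sample))
# {'Company - Name': 'Acme', 'Company - Investors': 'A Capital, B Ventures'}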

import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by="Company - Name")

df

dataframe

Apart from some errors in “Company - Country”, the extracted data looks good.
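A quick way to eyeball those errors (sketch):

# Inspect the extracted country values for inconsistencies
print(df["Company - Country"].value_counts(dropna=False))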

Build a simple graph

To understand the relationships between companies and investors, we construct a graph and visualize it.

First, we build a graph using NetworkX.

NetworkX is a Python package that lets you create and manipulate networks/graphs in a simple way.

Our simple graph will have companies and investors as nodes. We will connect investors to companies if they are mentioned in the same document.

import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")

    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)
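As a quick check on the resulting graph, we can rank investors by how many companies they connect to (a sketch):

# Rank investor nodes by degree (number of connected companies)
investor_nodes = [n for n, d in G.nodes(data=True) if d.get("title") == "Investor"]
print(sorted(investor_nodes, key=G.degree, reverse=True)[:5])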

Next, we use Pyvis to visualize the graph.

Pyvis is a Python package for interactive visualization of networks/graphs. It integrates nicely with NetworkX.

from pyvis.network import Network
from IPython.display import display, HTML


net = Network(notebook=True, cdn_resources="in_line")
net.from_nx(G)

net.show("simple_graph.html")
display(HTML("simple_graph.html"))

graph visualization

Looks like Andreessen Horowitz is quite present in the selected funding announcements 😊

Conclusion and ideas

In this notebook, we demonstrated how to set up an information extraction system using a small language model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.

How can we use the extracted data?

Some ideas:

  • The extracted data can be added to the original documents stored in a Document Store, enabling advanced search capabilities with metadata filtering (see the sketch after this list).
  • Expanding on the previous idea, you can do RAG (Retrieval Augmented Generation) with metadata extraction from the query, as explained in this blog post.
  • Store the documents and extracted data in a Knowledge Graph and perform Graph RAG (Neo4j-Haystack integration).
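For the first idea, here is a minimal sketch (assuming an InMemoryDocumentStore and the flatten_dict helper defined above; the filter value is illustrative):

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Store each announcement with its extracted fields as metadata
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [Document(content=url, meta=flatten_dict(data)) for url, data in zip(urls, extracted_data)]
)

# Metadata filtering example (illustrative value)
print(document_store.filter_documents(
    filters={"field": "meta.Company - Name", "operator": "==", "value": "Anyscale"}
))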