Structured data
LLMs are quite good at extracting structured data from unstructured text, images, and other inputs. Although the results are not always perfect, this can greatly reduce the manual work needed to pull information out of a large volume of text or documents. Here are just a few scenarios where this can be useful:
- Form processing: Extract structured field-value pairs from scanned documents, invoices, and forms to reduce manual data entry.
- Automated table extraction: Identify and extract tables from unstructured text and images.
- Sentiment analysis: Extract sentiment scores and associated entities from customer reviews or social media posts to gain insights into public opinion.
- Classification: Classify text into predefined categories, such as spam detection or topic classification.
- Executive summaries: Extract key points and structured data from lengthy reports or articles to create concise summaries for decision-makers.
Intro to .extract_data()
The chatlas package provides a simple way to extract structured data: the .extract_data() method. To use it, you’ll need three things:
- Pick a model provider (e.g., ChatOpenAI()).
- Define a data model by subclassing pydantic’s BaseModel class.
  - Here you’ll define the fields and value types you’re expecting in the input.
- Pass the unstructured input and data_model to the .extract_data() method.
from chatlas import ChatOpenAI
from pydantic import BaseModel

# Data model: the fields and value types to extract
class Person(BaseModel):
    name: str
    age: int

chat_client = ChatOpenAI()

# Pass the unstructured input along with the data model
chat_client.extract_data(
    "My name is Susan and I'm 13 years old",
    data_model=Person,
)
.extract_data() then returns a dictionary matching the fields and types in the data_model:
{'name': 'Susan', 'age': 13}
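Because the model's output is not guaranteed to be perfect, it can be worth checking the returned dictionary before using it downstream. The following is a minimal sketch using only the standard library. `EXPECTED_FIELDS` and `check_extracted()` are hypothetical names introduced for illustration and are not part of chatlas, and the sample dictionary stands in for a real `.extract_data()` result.

```python
# Hypothetical helper (not part of chatlas): sanity-check an extracted dict.
EXPECTED_FIELDS = {"name": str, "age": int}


def check_extracted(result: dict, expected: dict) -> list:
    """Return a list of human-readable problems; empty means the dict looks fine."""
    problems = []
    for key, expected_type in expected.items():
        if key not in result:
            problems.append(f"missing field: {key}")
        elif not isinstance(result[key], expected_type):
            actual = type(result[key]).__name__
            problems.append(f"{key}: expected {expected_type.__name__}, got {actual}")
    return problems


# Stand-in for the dictionary returned by .extract_data() above.
extracted = {"name": "Susan", "age": 13}
print(check_extracted(extracted, EXPECTED_FIELDS))  # []
```

In an app, a non-empty list could be shown to the user as a prompt to review or correct the extracted values.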
For more examples and details on how .extract_data() works, see the chatlas documentation.
chatlas also supports input other than text, such as images (content_image_file()) and PDF (content_pdf_file()).
Basic app
To go from this basic script to a Shiny app, you’ll want at least a couple of things:
- Change .extract_data() to await chat_client.extract_data_async().
  - This helps the Shiny app scale efficiently to multiple concurrent users.
  - You could also wrap this part in a non-blocking task to keep the rest of the app responsive within the same session.
- Decide how the user will provide and/or navigate the unstructured input(s).
  - This could be a simple text input field (as below), a file upload, a chat interface, etc.
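The scaling point above can be illustrated without an API key. In this sketch, `fake_extract_async()` is a hypothetical stand-in for an awaitable extraction call (it only sleeps to simulate network latency) and is not chatlas code. Because each call yields control while it waits, several requests can overlap instead of running back to back.

```python
import asyncio
import time


async def fake_extract_async(text: str) -> dict:
    """Hypothetical stand-in for an async extraction call to an LLM."""
    await asyncio.sleep(0.2)  # simulate waiting on the model provider
    return {"input_length": len(text)}


async def main():
    inputs = ["first user", "second user!", "third"]
    start = time.perf_counter()
    # All three simulated requests wait at the same time.
    results = await asyncio.gather(*(fake_extract_async(t) for t in inputs))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run one after another, the three simulated calls would take roughly 0.6 seconds. Run concurrently, the total should be close to the duration of a single call.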
For now, let’s keep it simple and use a text input field:
app.py
from chatlas import ChatOpenAI
from pydantic import BaseModel
from shiny import reactive
from shiny.express import input, render, ui

chat_client = ChatOpenAI()

class Person(BaseModel):
    name: str
    age: int

with ui.card():
    ui.card_header("Enter some input with name and age")
    ui.input_text_area(
        "user_input", None, update_on="blur", width="100%",
        value="My name is Susan and I'm 13 years old",
    )
    ui.input_action_button("submit", label="Extract data")

    # Re-render the result only when the submit button is clicked
    @render.ui
    @reactive.event(input.submit)
    async def result():
        return ui.markdown(f"Extracted data: `{await data()}`")

# Async extraction of structured data from the current text input
@reactive.calc
async def data():
    return await chat_client.extract_data_async(
        input.user_input(),
        data_model=Person,
    )
Editable data
Remember that the LLM is not perfect – you may want to manually correct or refine the extracted data. In this scenario, it may be useful to allow the user to edit the extracted data, and download it when they are done. Here’s an example of how to do this in a named entity extraction app.
data_model.py
from pydantic import BaseModel, Field

class NamedEntity(BaseModel):
    """Named entity in the text."""

    name: str = Field(description="The extracted entity name")

    type_: str = Field(
        description="The entity type, e.g. 'person', 'location', 'organization'"
    )

    context: str = Field(
        description="The context in which the entity appears in the text."
    )

class NamedEntities(BaseModel):
    """Named entities in the text."""

    entities: list[NamedEntity] = Field(description="Array of named entities")

_ = NamedEntities.model_rebuild()
app.py
import pandas as pd
from chatlas import ChatOpenAI
from faicons import icon_svg
from shiny import reactive
from shiny.express import input, render, ui

from data_model import NamedEntities

chat_client = ChatOpenAI()

with ui.card():
    ui.card_header("Named Entity Extraction")
    ui.input_text_area(
        "user_input", None, update_on="blur", width="100%",
        value="John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco.",
    )
    ui.input_action_button("submit", label="Extract", icon=icon_svg("paper-plane"))

with ui.card():
    with ui.card_header(class_="d-flex justify-content-between align-items-center"):
        "Extracted (editable) table"

        @render.download(filename="entities.csv", label="Download CSV")
        async def download():
            d = await data()
            yield d.to_csv(index=False)

    @render.data_frame
    async def data_frame():
        return render.DataGrid(
            await data(),
            editable=True,
            width="100%",
        )

@reactive.calc
@reactive.event(input.user_input)
async def data():
    d = await chat_client.extract_data_async(
        input.user_input(),
        data_model=NamedEntities,
    )
    # One row per extracted entity; rename "type_" to "type" for display
    df = pd.DataFrame(d["entities"])
    return df.rename(columns={"type_": "type"})
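The reshaping and export steps in the app can be sketched without Shiny or pandas. In this standard-library example, the `extracted` dictionary is made-up sample data shaped like the `NamedEntities` model, and `apply_edit()` and `to_csv()` are hypothetical helpers that mimic a user correcting one cell before downloading.

```python
import csv
import io

# Made-up sample shaped like the NamedEntities model (not real model output).
extracted = {
    "entities": [
        {"name": "John", "type_": "person", "context": "John works at Google"},
        {"name": "Google", "type_": "location", "context": "works at Google"},
    ]
}

# Rename "type_" to "type", as the app does with DataFrame.rename().
rows = [
    {"name": e["name"], "type": e["type_"], "context": e["context"]}
    for e in extracted["entities"]
]


def apply_edit(rows: list, row_index: int, column: str, value: str) -> None:
    """Hypothetical helper: apply one user correction in place."""
    rows[row_index][column] = value


def to_csv(rows: list) -> str:
    """Hypothetical helper: serialize the rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "type", "context"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The user fixes a misclassified entity before downloading.
apply_edit(rows, 1, "type", "organization")
csv_text = to_csv(rows)
print(csv_text)
```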