What are structured, semi-structured, and unstructured documents?
Different types of documents, ranging from invoices to medical records, are structured in various ways. It is essential to understand these structures to apply the correct analysis technologies, implement business rules, and make informed decisions.
Structure plays a fundamental role in how we store, process, and analyse information. Documents, as basic units of information, can be classified into three main categories based on their level of structure: structured, semi-structured, and unstructured.
Structured documents
Structured documents have a specific and well-defined schema, where all information elements are organised uniformly. Each field or piece of data is located in a predefined position and follows a standard format, making it easy to process and validate automatically.
Examples
- Passports: Passports are a clear example of structured documents. Each information field (name, date of birth, nationality, passport number, etc.) has a specific position and format, allowing for easy manual and automated verification and authentication.
- Official identity documents: Documents such as national identity cards follow a standardised design with clearly defined fields for name, address, identification number, photograph, and so on.
- Driving licences: Similar to passports and identity documents, driving licences are structured with predefined fields for the holder’s personal information and details such as their licence category.
- Tax forms: Tax returns require specific information in predefined fields, as in Form 1040 in the U.S. or Modelo 100 in Spain.
In sectors like government, security, and transportation, the use of structured documents is crucial for ensuring the accuracy and authenticity of information. This enables efficient identification and verification processes, reduces the risk of fraud, and facilitates interoperability between different national and international systems.
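Because every field in a structured document has a known position and format, automated validation can be reduced to checking each field against its expected pattern. The following is a minimal sketch of that idea in Python. The record layout, the field formats, and the names `IdentityRecord` and `validate` are assumptions made for illustration and do not correspond to any official document standard.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical field formats for an illustrative identity record.
PASSPORT_NUMBER = re.compile(r"[A-Z0-9]{9}")
NATIONALITY = re.compile(r"[A-Z]{3}")


@dataclass
class IdentityRecord:
    name: str
    date_of_birth: str    # assumed to be an ISO date, e.g. 1974-08-12
    nationality: str      # assumed to be a three-letter uppercase code
    passport_number: str  # assumed to be nine uppercase letters or digits


def validate(record: IdentityRecord) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    if not record.name.strip():
        errors.append("name is empty")
    try:
        date.fromisoformat(record.date_of_birth)
    except ValueError:
        errors.append("date_of_birth is not a valid ISO date")
    if not NATIONALITY.fullmatch(record.nationality):
        errors.append("nationality is not a three-letter uppercase code")
    if not PASSPORT_NUMBER.fullmatch(record.passport_number):
        errors.append("passport_number is not nine uppercase letters or digits")
    return errors
```

Calling `validate` on a well-formed record returns an empty list, while a record with a missing name or a malformed date returns one message per failed field. That makes it straightforward to reject or flag a document before it reaches later processing steps.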
Semi-structured documents
This category includes documents with a partial structure, where certain elements are consistent and well-defined, but other data may vary in position and format. While these documents have tags or metadata that organise the information, not all fields follow a rigid order, allowing for some flexibility.
Examples
- Payslips: Payslips typically have a predefined format with clear sections such as employee name, gross salary, deductions, and net salary. However, the data can vary depending on the employee or the month, and the layout of some sections may change depending on the company or country.
- Invoices: Invoices contain common elements such as the date, invoice number, description of products or services, total amount, and details of the issuer and recipient. However, the format and placement of these elements can vary between different companies or billing systems.
- Purchase receipts: Similar to invoices, purchase receipts include standard information such as the total amount, details of the items purchased, and the transaction date. However, the layout of this information may not be consistent.
- Purchase orders: Purchase orders list the products or services ordered. Their format can vary depending on the supplier, but they typically share common elements (order number, date, product description, etc.).
- Curriculum vitae (CV): While CVs generally follow a common format with sections like work experience, education, and skills, their structure can vary significantly between individuals or sectors.
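With semi-structured documents, the fields are recognisable but their position is not fixed, so one common approach is to locate each value by its label rather than by its position. The sketch below illustrates this with regular expressions applied to the plain text of an invoice. The label variants in `FIELD_PATTERNS` and the function name `extract_fields` are assumptions for illustration; real invoices would need a much broader set of labels, date formats, and currencies.

```python
import re

# Hypothetical label variants; each pattern captures the value that follows a label.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(
        r"(?:invoice\s+date|date)\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "total": re.compile(
        r"(?:total(?:\s+amount)?|amount\s+due)\s*:?\s*([0-9]+(?:\.[0-9]{2})?)",
        re.IGNORECASE,
    ),
}


def extract_fields(text: str) -> dict:
    """Find each field by its label, wherever it appears in the text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # A missing label yields None instead of raising an error.
        fields[name] = match.group(1) if match else None
    return fields


# Two layouts with the same information in a different order and wording.
invoice_a = "ACME Ltd\nInvoice No: INV-001\nDate: 2024-03-01\nTotal: 1200.00\n"
invoice_b = "Total amount 89.50\nInvoice # B7734\nInvoice date 2024-04-15\n"
```

Both sample layouts yield the same three keys even though the labels and their order differ. A field whose label is absent comes back as `None`, which a downstream step can treat as a signal for manual review.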
Unstructured documents
As the name suggests, unstructured documents lack a predefined format or structure. Information within these documents can be scattered and does not follow a specific pattern, making automated analysis more challenging and often requiring the use of advanced technologies for data extraction.
Examples
- Bank statements: Although bank statements include key information such as balances, transactions, and dates, the presentation of this data can vary widely, lacking a uniform format that would facilitate automated processing.
- Applications and powers of attorney: These legal documents often contain natural language text without a fixed format, ranging from simple requests to complex legal authorisations, each with its own style and layout.
- Emails: While certain metadata such as sender, recipient, and subject are defined, the body of the message is freeform and can vary greatly.
- Legal contracts: Terms and conditions may be organised into paragraphs and sections, but the language and structure can differ significantly depending on the type of contract and the parties involved.
- Research articles: Although research articles have a basic structure in terms of sections (introduction, methodology, results, discussion), the textual content and details vary significantly.
In sectors such as banking, finance, and law, unstructured documents are common and present unique challenges. Handling these documents requires a careful approach, often supported by advanced analysis tools such as Natural Language Processing (NLP) or Artificial Intelligence (AI) to extract and analyse relevant information. Proper management of these documents is crucial for maintaining information integrity and ensuring compliance with regulations.
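The email example above shows the contrast in a single document: the headers can be read as defined fields, while the body has to be searched as free text. The following sketch uses Python's standard `email` package to read the headers, then applies a naive keyword search to the body. The sample message, the `KEYWORDS` list, and the function name `summarise_email` are assumptions for illustration, and keyword matching is far simpler than the NLP or AI tooling mentioned above.

```python
from email import message_from_string
from email.policy import default

RAW_EMAIL = """\
From: alice@example.com
To: claims@example.com
Subject: Request regarding account closure

Hello,

I am writing on behalf of my father. Please find attached the power of
attorney that authorises me to close his savings account.

Regards,
Alice
"""

# Hypothetical keyword list used only for this illustration.
KEYWORDS = ("power of attorney", "contract", "invoice")


def summarise_email(raw: str) -> dict:
    """Read the defined headers directly and search the freeform body for keywords."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_content()
    # Collapse whitespace so that phrases split across lines still match.
    normalised = " ".join(body.lower().split())
    return {
        "sender": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "keywords": [k for k in KEYWORDS if k in normalised],
    }
```

The headers come back without any searching, whereas the phrase "power of attorney" is only found after the body text is normalised, because it is split across two lines in the message. Variations like this are part of why freeform content is harder to process reliably.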
Integration of advanced technology in document management: TrustCloud AICR
TrustCloud AICR integrates powerful Optical Character Recognition (OCR) engines and advanced Artificial Intelligence (AI) capabilities for in-depth document analysis, regardless of their structure.
This system can “understand” the content of documents by analysing syntactic and semantic patterns. This means that not only is the text identified, but the meaning and context of the information are also interpreted, enabling a richer and more accurate understanding, as well as the application of business rules based on this comprehension of the content.
TrustCloud AICR performs complementary actions that are useful across various sectors: data and keyword extraction, transformation of documents into identity attributes, and automated information categorisation. These capabilities are especially valuable in processes such as credit assessments, loan management, tax compliance, and the administration of powers of attorney.
By combining OCR with artificial intelligence, the solution transforms documents of any type into valuable digital assets, enhancing accuracy, efficiency, and decision-making capabilities across a wide range of business applications.