Market pressures are forcing companies to invest millions of dollars in business intelligence (BI) architectures, enterprise content management (ECM) systems and decision reporting tools. But companies often ignore the component that is most vital for success: clear, consistent and reliable data. Data is the building block of information. If the data in the operational systems and data warehouses is not complete, accurate and consistent, attempts to benefit from the organization's information will fail and can even cause damage.
Consider a bank's data warehouse that imports data from various operational systems and contains duplicate records for the same customer. One record lists a negative balance of -$1,000 for the customer, while another shows that the same customer also has a positive balance of $1.5M. A branch manager who sees the first record but not the second can inadvertently cause the bank to lose this customer. This is a case of record duplication: the system fails to recognize that both records belong to the same customer.
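The bank scenario can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that two records belong to the same customer when their normalized name and date of birth match; the field names (`name`, `birth_date`, `balance`) and the sample records are hypothetical, and real matching tools use far richer rules.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase a name, drop punctuation, and collapse repeated spaces."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def group_duplicates(records):
    """Return groups of records that share a normalized name and birth date."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), rec["birth_date"])
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical records imported from two operational systems.
records = [
    {"id": 1, "name": "John  A. Smith", "birth_date": "1970-01-01", "balance": -1000},
    {"id": 2, "name": "john a smith", "birth_date": "1970-01-01", "balance": 1_500_000},
    {"id": 3, "name": "Jane Doe", "birth_date": "1982-05-17", "balance": 250},
]

duplicates = group_duplicates(records)
# Unified view: one combined balance per suspected duplicate group.
combined = [sum(rec["balance"] for rec in group) for group in duplicates]
```

With the two records grouped, the branch manager would see a single customer with a combined balance rather than an isolated overdraft.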
Decisions Based on Deficient Data
Without high-quality data, decision makers must guess what they should know. Worse, they are liable to make decisions that appear to be informed but are in fact based on deficient data. To work with information and decision support systems that provide reliable and uniform data in real time, it is not enough to collect the data and feed it to the BI tools. First, it is important to gain an in-depth understanding of all aspects of the data and to explore possibilities for improvement before making the data available to a wider group of users. To prevent poor performance of the BI system, it is important to test the quality of the data and verify that it is complete, consistent, up-to-date and accurate.
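As a rough sketch of such a test, the Python function below checks a single record for completeness, consistency and currency. The field names, the two-letter country code convention and the one-year staleness threshold are all assumptions chosen for the example. Accuracy is not covered, because it generally requires comparison against a trusted reference source.

```python
from datetime import date

# Assumed set of mandatory fields for a customer record.
REQUIRED_FIELDS = ("customer_id", "name", "country", "updated")


def quality_issues(row, today, max_age_days=365):
    """Return a list of human-readable quality problems found in one record."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # Consistency: country must follow one agreed format (two uppercase letters).
    country = row.get("country")
    if country and (len(country) != 2 or not country.isupper()):
        issues.append("country is not a two-letter uppercase code")
    # Currency: records not updated within max_age_days are flagged as stale.
    updated = row.get("updated")
    if updated and (today - updated).days > max_age_days:
        issues.append("record is stale")
    return issues


today = date(2024, 6, 1)
good = {"customer_id": 1, "name": "Acme", "country": "US", "updated": date(2024, 1, 10)}
bad = {"customer_id": 2, "name": "", "country": "usa", "updated": date(2020, 3, 1)}

good_issues = quality_issues(good, today)
bad_issues = quality_issues(bad, today)
```

Running checks like these before the data reaches the BI tools makes it possible to report quality issues to end users instead of silently loading deficient records.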
By strongly emphasizing data quality, organizations can accelerate the development of the architecture of their BI systems; reduce the number of repeat transformations and extractions from the warehouse; report to end users about data quality issues; and eventually increase profits and return on investment in the BI infrastructure.
A data cleansing process typically results in a significant reduction in the number of customer and vendor records because it eliminates duplicates and outdated records. At times, the reduction in the number of customer, vendor and item records reaches 70 percent. Automatic data cleansing tools are used to transfer data from operational systems to the data warehouse in combination with data integration tools, which transfer the data from the information systems while saving time and costly development resources. An additional benefit of using automated systems is that the monitoring and quality of the data improve significantly, and decisions are made based on a unified view of each customer as a single entity.
Many organizations are aware of problems having to do with transaction data in their systems, but do not know how to address the problems. By their nature, data migration and conversion processes reveal data quality problems. When data from various sources is integrated and subjected to new business requirements in the data warehouse, it becomes accessible to business users; in these situations, the completeness and reliability of the data gain paramount importance.
Data quality problems can be the result of several factors, including:
- Data distributed among several platforms and legacy systems.
- Extensive data redundancy between various application systems.
- Lack of standards for data within the organization.
- Deficient metadata or complete lack of metadata for legacy systems.
Maximizing Profits from Data
There are many tools for dealing with data quality by performing various operations such as tracing flaws and improving processes. Tools for handling data quality issues help organizations gain control over their data assets and derive optimal benefits from their BI infrastructure. These tools help maximize the organization's return on its investment in data infrastructure. They:
- Analyze organizational data and ensure that only "clean" and reliable data populates the warehouses and data marts.
- Reveal hidden business rules and verify their validity.
- Prioritize data quality issues so that the organization invests in the areas that have the greatest influence.
- Use business rules and data verification to cleanse the data while it is being transferred.
- Monitor and manage data over time to ensure that the active data cleansing programs provide consistent and measurable advantages.
By understanding the properties, advantages and deficiencies of the original data, the organization can prevent surprises, set expectations and reduce the need for corrections.
One of the most prevalent myths is that a new system or data warehouse will fix data problems originating with the legacy systems. Although a data transfer process transforms the data to better suit the business, the transformation process in itself does not guarantee cleaner data. With the right data cleansing, conversion and loading tools, and with data quality assurance tools, it is possible to maintain operational systems and data warehouses that contain reliable, uniform and useful data that serve the overall business success.
Printing paper and moving paper are both expensive and time-consuming. Companies that need to cut those costs and manage processes more effectively are simply converting documents that were being sent by mail into faxes or scanned document images. From the recipient's standpoint, the incoming documents are all unstructured or semi-structured. Whether image-based or data, the documents might look similar (all invoices), but the data elements do not contain understandable metatags. Therefore, each document must be looked at and interpreted into a common format that is understandable by the IT backend procedures.
Capturing data from images using traditional forms processing is based on knowing the specific form layout so that you can build a template to locate which fields to capture, the rules to use for each field and any cross-field validations. The template also defines the associated output metadata for the fields. Traditional forms processing works well when the layout of forms is the same or where clear identifiers define the format. Examples include tax returns, credit card applications, medical claims, etc.
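The template idea can be sketched in a few lines of Python. The example assumes the form has already been converted to text by OCR and that each field follows a fixed label. Production systems usually locate fields by position on the page, so the regular expressions, field names and sample text here are illustrative stand-ins only.

```python
import re

# Hypothetical template for one known invoice layout: each output field
# name (the metadata) maps to a pattern that locates its value.
INVOICE_TEMPLATE = {
    "invoice_number": r"Invoice No:\s*(\d+)",
    "net": r"Net:\s*([\d.]+)",
    "tax": r"Tax:\s*([\d.]+)",
    "total": r"Total:\s*([\d.]+)",
}


def extract(text, template):
    """Apply each field pattern; fields that are not found map to None."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


def totals_agree(data):
    """Cross-field validation: net plus tax must equal the stated total."""
    try:
        net, tax, total = (float(data[key]) for key in ("net", "tax", "total"))
    except (TypeError, ValueError):
        return False  # a field was missing or not numeric
    return abs(net + tax - total) < 0.005


ocr_text = "Invoice No: 4711\nNet: 100.00\nTax: 19.00\nTotal: 119.00"
fields = extract(ocr_text, INVOICE_TEMPLATE)
is_valid = totals_agree(fields)
```

The sketch also shows the limitation of the approach: a document with a different layout or different labels yields empty fields, because the template knows only one form.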
Capturing data from unstructured, unknown layouts can rely on search engines, which hunt through unstructured text to identify and extract contextually relevant documents and phrases. However, creating understandable metadata for output into business processes requires business-specific rules, which means that the software must understand what the document is.
New intelligent document recognition technologies, originally developed for invoice processing and the electronic mailroom, use techniques from each of the above areas and eliminate the limitations. It is no longer necessary to know what the form layout looks like. It is no longer necessary to insert batch separators. It is no longer necessary to presort. Specific rules can make the data understandable. Intelligent document recognition has the ability to figure out what the document category is and apply the appropriate business rules.
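The classify-then-apply-rules flow can be pictured with a toy Python sketch. Commercial IDR products rely on trained recognition engines. The keyword scoring below is a deliberately simple stand-in, and the categories, keyword sets and queue names are hypothetical.

```python
import re

# Hypothetical keyword sets that characterize each document category.
CATEGORY_KEYWORDS = {
    "invoice": {"invoice", "total", "due", "tax"},
    "claim": {"claim", "patient", "diagnosis", "provider"},
}

# Hypothetical business rule per category: the queue that receives the document.
ROUTES = {
    "invoice": "accounts payable",
    "claim": "claims processing",
}


def classify(text):
    """Pick the category whose keywords overlap most with the document text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(text):
    """Apply the rule for the detected category; unknown documents go to a person."""
    return ROUTES.get(classify(text), "manual review")


invoice_queue = route("Invoice 4711, total due 119.00 incl. tax")
claim_queue = route("Patient claim submitted by provider")
fallback_queue = route("Hello world")
```

No layout template, batch separator or presorting is involved: the category is inferred from the content, and the category then selects the rules.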
IDR, which is also called intelligent data capture, works much as humans do, relying on training and an internal knowledge of the layout and content of generic form types, which it uses to understand and extract required information and initiate workflows. That widens the types of forms that can be captured and reduces costs, but IDR also substantially changes capture capabilities, turning them into a series of tools that can interpret and extract data from all sorts of unstructured information.
The information can be input as scanned paper or document-formatted information, whether it is data-centric, such as a Word or PDF file, or image-based. Typically, IDR combines multiple methods, including pattern recognition, OCR (optical character recognition), and other recognition and search engines, to locate and extract required information before applying business rules to it. IDR capture provides the ability to make sense of and help manage the unstructured, untagged information that is coming into the corporation. It can provide the front-end understanding needed to feed business process management and business intelligence applications, as well as traditional accounting and records management systems.
Capture is evolving into a critical business systems need that improves core business processes and competitiveness through its development of business rules-based document understanding. The capture market can be broken down into four sub-segments:
- Ad Hoc and Desktop Scanning - used by office workers who want to convert paper documents into usable electronic documents on which they can work or collaborate. The devices used are slow-speed scanners or networked office digital copiers (MFPs).
- Batch and Distributed Batch Scanning - used to get documents into a centralized document repository or used to classify and route them to a centralized point as quickly as possible.
- Full-Text Capture OCR - converts textual documents, such as scanned magazine articles, into ASCII data that can be edited or managed or used to find documents.
- Transaction Capture and Process Management - previously forms processing, similar to batch and distributed batch capture, but the output is data-centric and used to provide data for use in a business process.
In recent years, those sub-segments have each shown some interesting trends. The first three have grown at more than 20 percent, driven by a number of key issues that are coming together to cause market stress, including increased business velocity; the need to reduce costs; the need to optimize equipment usage; the availability of lower-cost distributed duplex desktop scanners; full-text search at the desktop; and image standardization and business acceptance. The transaction capture and process management sector is currently the largest segment, accounting for 34 percent of the overall market. The proven forms processing technology offers some major cost reductions over in-house data entry or even offshore processing, which is increasing in cost.
[Source: Matrix]