Recently we have extended the machine learning capabilities of Capture 2.0 with the development of the Document Classification Engine. This Capture component automatically categorizes unstructured data entering Alfresco from a variety of sources according to our clients’ object models. This post explains how we leverage machine learning in our Classification Engine to automatically collect the data necessary for 21 CFR compliant batch records.
Document Classification Supports SuggestR Intelligent Indexing
We have previously demonstrated a machine learning approach to extracting metadata from AP invoices using Capture 2.0. In that example, the invoices entered Alfresco already classified by vendor, based on the email address they were sent from. Capture 2.0 used the classified vendor primary key to look up the locational data needed to extract metadata from that vendor’s invoice type.
More challenging is a batch records scenario, in which different types of relevant batch documents are received from different sources, requiring manual intervention to determine the object type (primary key) before Capture can index the metadata.
Using the Capture 2.0 Classification Engine, based on SuggestR, documents from scenarios that offer no easy way to distinguish them can be automatically categorized, allowing the user to focus on validation and efficiently process incoming documents.
Leveraging Naïve Bayes for Machine Learning
Similar to SuggestR, the Classification Engine is built to learn and scale in a real-time production system. The engine uses the Naïve Bayes probabilistic classification technique, in which a document is represented as a “bag of words” (no location data) and the classifying features of the document are the frequencies with which each word appears. The Naïve Bayes classifier assumes that the probability an incoming document is of a particular type, t, can be determined by evaluating the similarity of its word counts to those of previously classified documents of type t.
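To make the technique concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words features. The class names, document types, and training sentences are illustrative assumptions, not the engine’s actual implementation, which handles tokenization, smoothing, and persistence in a production-grade way.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Minimal multinomial Naive Bayes over bag-of-words word frequencies."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-type word frequencies
        self.doc_counts = Counter()              # training documents per type
        self.vocab = set()

    def train(self, doc_type, text):
        words = text.lower().split()
        self.word_counts[doc_type].update(words)
        self.doc_counts[doc_type] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_type, best_score = None, float("-inf")
        for doc_type, counts in self.word_counts.items():
            # Log prior: fraction of training documents of this type.
            score = math.log(self.doc_counts[doc_type] / total_docs)
            total_words = sum(counts.values())
            for w in words:
                # Laplace smoothing so unseen words don't zero out the score.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_type, best_score = doc_type, score
        return best_type

clf = NaiveBayesSketch()
clf.train("invoice", "invoice number total amount due vendor")
clf.train("batch_record", "lot number batch yield temperature deviation")
print(clf.classify("vendor invoice total due"))  # -> invoice
```

Because only per-type counts are compared, a new document’s most probable type falls out of a simple log-probability sum over its words.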
Leveraging a Naïve Bayes classifier offers us the following benefits:
- Learning based on incoming data – The word frequency features of a classified document, once validated by the user, join the dataset and are evaluated in subsequent classifications.
- Scaling – The Naïve Bayes classifier scales based on the number of categories, not the number of processed documents, so it is highly efficient, even in a large document repository.
- Minimal training required – A very small training set is needed for the algorithm to begin meaningful evaluation of features.
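The first two benefits above can be sketched together: the model state is just one word-frequency table per document type (so it grows with the number of types, not the number of documents), and a user-validated classification is incorporated by folding the document’s word counts back into its type’s table. The type names and counts below are hypothetical examples.

```python
from collections import Counter

# One frequency table per document type; model size is proportional to
# the number of types, not the number of documents processed.
model = {
    "invoice": Counter({"invoice": 40, "total": 35, "vendor": 30}),
    "batch_record": Counter({"lot": 25, "yield": 20, "deviation": 15}),
}

def incorporate_validated(doc_type, text):
    """Once the user validates a classification, fold the document's
    word counts into that type's table for subsequent classifications."""
    model[doc_type].update(text.lower().split())

incorporate_validated("batch_record", "lot yield within specification")
print(model["batch_record"]["lot"])  # 26
```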
Here is a demo of this Document Classification Engine for Batch Records:
Keep an eye on our blog for the next steps in our ongoing Capture 2.0 work:
- Multi-strategy capabilities (run strategies in addition to Naive Bayes to improve confidence)
- Additional engines, such as image classification
- Leveraging classification to reduce setup overhead in large-scale migrations (e.g., filestore to Alfresco)