
Workflow Information Extraction

The role of the Workflow Information Extraction (IE) component, as defined previously in [D3.4.1], is to provide a set of tools that support the (semi-)automatic capture of domain-specific knowledge from widely available textual data, primarily instructional texts. The input texts come from heterogeneous sources, such as organisational data (e.g., previously completed work orders in the EADS scenario; car manual templates in the CRF scenario) and the internet (e.g., recipe websites in the PRE scenario; car maintenance procedures in the CRF scenario). The output of IE is domain-specific procedural knowledge (referred to interchangeably as workflows in the SmartProducts context, see [D1.2.2] Section 4.18), structured and annotated according to the definition of workflows and domain-specific ontologies. It is used to populate a smart product’s proactive knowledge base.

In the initial phase of development, IE is defined as a designer/developer tool that enables large-scale knowledge capture. It produces coarse-grained procedural knowledge, which is submitted to Server-side Knowledge Fusion and to the Workflow Editor ([D3.4.1] Sections 3.1, 4.2) for manual verification and, if necessary, modification. The goal of these processes is to pre-populate the knowledge base of smart products before their release.

The IE component consists of two main parts: the Workflow Extractor and the Workflow Annotator. The entire IE process is a sequential flow, similar to that of most state-of-the-art IE systems. It requires two initial inputs: a formal definition of workflows (the Generic Workflow Ontology, [D2.1.2] Section 7), which describes the concepts and relations to be identified in a textual procedural description, and documents in HTML/XML format that embed the layout features of the text (referred to as HTML/XML documents). As described in [D3.4.1], such documents can be original (e.g., downloaded from the internet) or converted from other formats (e.g., car manual PDFs) using third-party OCR tools. The process begins with the Workflow Extractor, which identifies the textual passages (e.g., sections, paragraphs) that describe individual workflows and analyses them according to the Generic Workflow Ontology.

The output of the Workflow Extractor is a structured representation (in XPDL) of individual workflows, whose components are identified according to the Generic Workflow Ontology. The XPDL documents may be loaded into the Workflow Editor for manual verification or modification. Next, they are submitted to the Workflow Annotator, which takes as input the domain-specific ontologies ([D2.1.2] Sections 12, 13, 14) tailored to each application domain, identifies domain-specific objects and terms in the workflow, and links them to the ontologies. The motivation for annotating workflows against domain ontologies is to index the knowledge semantically and to enable machine understanding and manipulation of it. This step completes the IE process. The result is a set of semantically annotated workflows, linked to domain-specific ontologies and stored as XPDL documents, which may again be verified and modified in the Workflow Editor.

 

Developer: Ziqi Zhang


License

The current version of the Workflow IE module is planned for release under the LGPL 3.0 license.

Download

The SmartProducts release is available in the project SVN; check it out from: https://trac.tk.informatik.tu-darmstadt.de/svn/projects/. We work only with the subdirectory apps/workflow_ie. The source code of the examples is in apps/workflow_ie/src; the executables are in apps/workflow_ie/classes; the libraries are in apps/workflow_ie/lib; the example documents are in apps/workflow_ie/data. These documents comprise a Blue&Me manual in PDF, the corresponding HTML documents converted by Omnipage, and some manually created workflow descriptions. Details are explained below with each example.

How to get started

  • Example 1 - Getting started

In this section, we demonstrate a basic example that calls corresponding modules and classes of the IE API to form the standard IE processing pipeline. This example does not perform actual IE tasks but only checks if required modules are available and outputs certain messages; it serves as a template to demonstrate the procedure to be followed by users of this API in order to perform the IE tasks themselves.

Steps:

  1. Ensure you are in the directory [project-root]/apps/workflow_ie
  2. Type in command: java -classpath "classes;lib/*" eu.project.smartproduct.wp3.ie.TestBasic resources/data/D321-getstarted-example/sample.htm (NOTE: Linux and Mac users should replace ";" with ":")

The program takes only one parameter, which points to an HTML document. A sample document is provided: resources/data/D321-getstarted-example/sample.htm, an HTML document converted from the PDF of the Blue&Me manual, which is available in resources/data/D321-complex-additional/blueme.pdf.

  3. The system checks the availability of the input file, creates the IE processor objects, and reports a list of events as shown below.

The source code can be found in: eu.project.smartproduct.wp3.ie.TestBasic
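The classpath separator mentioned in step 2 differs between platforms. The following is a minimal sketch of how a launcher could assemble the same command portably, assuming the standard JDK constant `File.pathSeparator`; the class `ClasspathCommand` and its methods are hypothetical helpers for illustration and are not part of the IE API.

```java
import java.io.File;
import java.util.List;

final class ClasspathCommand {
    // Joins classpath entries with the given separator
    // (";" on Windows, ":" on Linux and Mac).
    static String joinClasspath(List<String> entries, String separator) {
        return String.join(separator, entries);
    }

    // Builds the argument list that launches TestBasic on the sample document.
    static List<String> buildCommand(String separator) {
        String classpath = joinClasspath(List.of("classes", "lib/*"), separator);
        return List.of(
                "java", "-classpath", classpath,
                "eu.project.smartproduct.wp3.ie.TestBasic",
                "resources/data/D321-getstarted-example/sample.htm");
    }

    public static void main(String[] args) {
        // File.pathSeparator holds the classpath separator of the current platform.
        System.out.println(String.join(" ", buildCommand(File.pathSeparator)));
    }
}
```

The printed line omits the quotes around the classpath; when pasting it into a shell, quote the classpath so that the shell does not expand `lib/*` itself.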

 

Output:

-Checking input file...

-Simulating IE process...

=== Layout Feature Extractor ===

=== Section Segmentation and Clustering ===

=== Workflow Recognizer ===

=== Completing WORKFLOW EXTRACTOR, outputting to XPDL ===

=== Starting WORKFLOW ANNOTATOR ===

=== Entity Recognition ===

=== Term Recognition ===

=== Ontology Population ===

=== Completing WORKFLOW ANNOTATOR, outputting to XPDL ===
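The event list above reflects the sequential structure of the IE process: the Workflow Extractor stages run first, followed by the Workflow Annotator stages. As a rough illustration only, the sketch below reproduces that ordering. The class `IePipelineSketch` is hypothetical and does not show how TestBasic is actually implemented; only the stage names are taken from the output above.

```java
import java.util.ArrayList;
import java.util.List;

final class IePipelineSketch {
    // Stage names copied from the TestBasic output, in reporting order.
    static final List<String> EXTRACTOR_STAGES = List.of(
            "Layout Feature Extractor",
            "Section Segmentation and Clustering",
            "Workflow Recognizer");
    static final List<String> ANNOTATOR_STAGES = List.of(
            "Entity Recognition",
            "Term Recognition",
            "Ontology Population");

    // Returns the banner lines a sequential run would report.
    static List<String> simulate() {
        List<String> log = new ArrayList<>();
        for (String stage : EXTRACTOR_STAGES) {
            log.add("=== " + stage + " ===");
        }
        log.add("=== Completing WORKFLOW EXTRACTOR, outputting to XPDL ===");
        log.add("=== Starting WORKFLOW ANNOTATOR ===");
        for (String stage : ANNOTATOR_STAGES) {
            log.add("=== " + stage + " ===");
        }
        log.add("=== Completing WORKFLOW ANNOTATOR, outputting to XPDL ===");
        return log;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```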

 
  • Example 2 - Section segmentation

This example involves two modules of the IE system: the Layout Feature Extractor and Section Segmentation and Clustering. The role of the Layout Feature Extractor in this example is to infer high-level features from the basic content and layout features. Its output is a list of reconstructed logical content blocks with refined and enriched layout features, which are then used for section segmentation and clustering.

Note that users of the IE system are not directly involved in this process; it is an internal process of the Workflow Extractor. We provide a simple class that demonstrates the functionality and shows how the corresponding modules/processors are assembled to perform section segmentation. This class is uk.ac.shef.oak.passagextractor.test.TestSecSegmentation.

 

Steps:

  1. Ensure you are in the directory [project-root]/apps/workflow_ie
  2. Type in command: java -classpath "classes;lib/*" uk.ac.shef.oak.passagextractor.test.TestSecSegmentation resources/data/D321-complex-additional/input (NOTE: Linux and Mac users should replace ";" with ":")

This program requires only one parameter, which is the directory containing HTML documents, as provided above.  

  3. The system starts processing and outputs the recognised sections and their relations.

Figure 1. A recognised section corresponding to the input document.


Figure 2. A recognised section with sub-sections, corresponding to the input document.

The example generates output, part of which is shown in Figure 1 and Figure 2. As shown in Figure 1, the process correctly identifies the section and its content. Firstly, it recognises that the strings “SRC”, “/” and “OK” are consecutive content blocks that together form a section heading (the tag in Figure 1); secondly, it identifies the list structure that exists in the document and its list items (the < li > tag; note that the symbol “❒” is no longer part of the body content, but is identified as the “bulletvalue” of the list). This section also has no sub-sections (SectioHasSecsection: ).

Figure 2 shows an example of a correctly identified sub-section relation. The top screenshot shows that the section with id “11” has two sub-sections with ids “12” and “13”. The bottom screenshot shows the section with id “12”. Checking the original PDF document confirms that this has been recognised correctly.
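The section and sub-section relation described for Figure 2 can be pictured as a simple parent-to-children mapping. The sketch below is an illustrative data structure only: `SectionTreeSketch` and its methods are hypothetical and do not reflect the internal representation used by the Workflow Extractor. Only the ids 11, 12 and 13 come from the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SectionTreeSketch {
    // Maps each known section id to the ids of its direct sub-sections.
    private final Map<String, List<String>> subSections = new LinkedHashMap<>();

    void addSection(String id) {
        subSections.putIfAbsent(id, new ArrayList<>());
    }

    // Records that childId is a direct sub-section of parentId.
    void addSubSection(String parentId, String childId) {
        if (!subSections.containsKey(parentId) || !subSections.containsKey(childId)) {
            throw new IllegalArgumentException("Unknown section id");
        }
        subSections.get(parentId).add(childId);
    }

    // Returns the direct sub-sections of a section (empty if none or unknown).
    List<String> subSectionsOf(String id) {
        return List.copyOf(subSections.getOrDefault(id, List.of()));
    }

    // Rebuilds the relation described for Figure 2: section 11 owns 12 and 13.
    static SectionTreeSketch figure2Example() {
        SectionTreeSketch tree = new SectionTreeSketch();
        tree.addSection("11");
        tree.addSection("12");
        tree.addSection("13");
        tree.addSubSection("11", "12");
        tree.addSubSection("11", "13");
        return tree;
    }

    public static void main(String[] args) {
        System.out.println("Sub-sections of 11: " + figure2Example().subSectionsOf("11"));
    }
}
```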

  • Example 3 - Entity recognition

     

    Once a collection of workflow descriptions has been identified, it is submitted to the Workflow Annotator, which recognises domain-specific entities and links them to domain ontologies. In this section, we demonstrate the entity recognition functionality, in particular:

    • The implemented learning algorithm, which learns from examples and extracts similar entities from new data;
    • The implemented features used by the learning algorithm.

    Please note that this example is designed for ease of understanding and to demonstrate functionality; the learning accuracy may therefore be below your expectations. This is mainly because:

    • The provided learning examples are simple and too few compared with what state-of-the-art systems use;
    • The features are selected arbitrarily.

    This example concerns only one domain concept, namely the “Vehicle Component” concept defined in the FIAT car domain ontology. We show that, by learning from a set of example workflow descriptions annotated with this concept, the system can extract new instances of the same concept from data that it has not seen during training.

    To run the Entity Recogniser example, the following resources are required. They are provided under [project-root]/apps/workflow_ie/resources:

    • A third-party generic machine learning tool in [project-root]/apps/workflow_ie/resources/ml, with “/windows” for the Windows platform and “/linux” for the Linux platform;
    • A third-party natural language processing (NLP) model in [project-root]/apps/workflow_ie/resources/nlp;
    • An example corpus in [project-root]/apps/workflow_ie/resources/data/D321-additional-example-2, which contains the following sub-folders:
      • examples – three toy documents of annotated workflow descriptions. The vehicle component concept (FIAT car domain) is annotated;
      • test – one document with a similar workflow description. Its content is different, uses new terms, and is not annotated;
      • testoutput – the output of Entity Recognition is written here.
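Before running the example, it can be useful to verify that the folders listed above are present. The following is a minimal sketch under the assumption that it is run from [project-root]/apps/workflow_ie; `ResourceCheck` is a hypothetical helper and not part of the IE API, while the relative paths are those listed above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ResourceCheck {
    // Folders the Entity Recogniser example expects, relative to apps/workflow_ie.
    // platform is "windows" or "linux".
    static List<String> requiredDirs(String platform) {
        String corpus = "resources/data/D321-additional-example-2/";
        return List.of(
                "resources/ml/" + platform,
                "resources/nlp",
                corpus + "examples",
                corpus + "test",
                corpus + "testoutput");
    }

    // Returns the required folders that do not exist under the given base folder.
    static List<String> missingDirs(Path base, String platform) {
        List<String> missing = new ArrayList<>();
        for (String dir : requiredDirs(platform)) {
            if (!Files.isDirectory(base.resolve(dir))) {
                missing.add(dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        Path base = Path.of(args.length > 0 ? args[0] : ".");
        List<String> missing = missingDirs(base, windows ? "windows" : "linux");
        System.out.println(missing.isEmpty() ? "All resources found." : "Missing: " + missing);
    }
}
```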

    Steps – training the system:

    1. Ensure you are in the directory [project-root]/apps/workflow_ie
    2. Type in command: java -Xmx512m -classpath "classes;lib/*" eu.project.smartproduct.wp3.ie.TestER resources/ml/windows resources/nlp 0 resources/data/D321-additional-example-2/examples (NOTE: Linux/Mac users: replace ";" with ":"; replace "ml/windows" with "ml/linux")

    The JVM option “-Xmx512m” sets the maximum heap size of the Java virtual machine to 512 MB, which is the memory requirement for running this example. The program parameters are:

      1. args[0] = “resources/ml/windows”: the path to the folder containing the machine learning tool for Windows. On Linux, use “resources/ml/linux” instead;
      2. args[1] = “resources/nlp”: the path to the folder containing the NLP models;
      3. args[2] = 0: a flag indicator (the training run uses 0);
      4. args[3] = “resources/….examples”: the path to the folder containing the annotated documents used as learning examples.

    3. The system analyses the examples and writes a learnt model to your current working directory.
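To confirm that the “-Xmx512m” option described above took effect, the heap limit can be inspected from inside the JVM. The sketch below uses the standard `Runtime.maxMemory()` call; the class `HeapCheck` is a hypothetical helper for illustration and is not part of the IE API.

```java
final class HeapCheck {
    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most memory the JVM will attempt to use.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap available to this JVM: "
                + toMegabytes(maxHeapBytes) + " MB");
    }
}
```

Running it with `java -Xmx512m HeapCheck.java` should report a value close to 512 MB; depending on the JVM, the reported figure may be somewhat lower than the value passed to -Xmx.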

     

    The output is not human-readable. However, users should expect one directory (of the form uk.ac.shef.wit.aleph.?) and two files (.script and .properties) to be created in their current directory. Once the program terminates and these files have been generated, the learning phase is complete, and users can apply the learnt model to extract the same type of entities from new documents. The steps are shown below:

    Steps – using the learnt model:

    1. Ensure you are in the directory [project-root]/apps/workflow_ie
    2. Type in command: java -Xmx512m -classpath "classes;lib/*" eu.project.smartproduct.wp3.ie.TestER resources/ml/windows resources/nlp 1 resources/data/D321-additional-example-2/test resources/data/D321-additional-example-2/testoutput (NOTE: Linux/Mac users: replace ";" with ":"; replace "ml/windows" with "ml/linux")

    The JVM option “-Xmx512m” sets the maximum heap size of the Java virtual machine to 512 MB, which is the memory requirement for running this example. The program parameters are:

      1. args[0] = “resources/ml/windows”: the path to the folder containing the machine learning tool for Windows. On Linux, use “resources/ml/linux” instead;
      2. args[1] = “resources/nlp”: the path to the folder containing the NLP models;
      3. args[2] = 1: a flag indicator (the run that applies the learnt model uses 1);
      4. args[3] = “resources/…/test”: the path to the folder containing the test documents to be annotated;
      5. args[4] = “resources/…/testoutput”: the path to the folder in which the annotated documents are saved.

    3. The system applies the learnt model and annotates the new document with “vehicle component”.

    The process annotates the documents specified by args[3] and places the annotated documents in the folder given by args[4]. Figure 3 shows a screenshot of the annotated document.
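The two TestER runs differ only in args[2] and in the number of path arguments. The sketch below shows one way to validate such an argument list. It rests on an assumption: the page describes args[2] only as a flag indicator, and reading 0 as "train" and 1 as "apply" is inferred from the two commands above. The class `TestErArgsSketch` is hypothetical and is not the TestER source.

```java
final class TestErArgsSketch {
    enum Mode { TRAIN, APPLY }

    // Assumption: "0" selects training and "1" selects applying the learnt model.
    static Mode parseMode(String flag) {
        switch (flag) {
            case "0": return Mode.TRAIN;
            case "1": return Mode.APPLY;
            default: throw new IllegalArgumentException("Expected flag 0 or 1, got: " + flag);
        }
    }

    // The training command passes four arguments, the apply command five.
    static int expectedArgCount(Mode mode) {
        return mode == Mode.TRAIN ? 4 : 5;
    }

    // Checks the argument list and returns the mode it describes.
    static Mode validate(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException("At least three arguments are required");
        }
        Mode mode = parseMode(args[2]);
        if (args.length != expectedArgCount(mode)) {
            throw new IllegalArgumentException(
                    mode + " expects " + expectedArgCount(mode) + " arguments, got " + args.length);
        }
        return mode;
    }

    public static void main(String[] args) {
        String[] training = {"resources/ml/linux", "resources/nlp", "0",
                "resources/data/D321-additional-example-2/examples"};
        System.out.println(validate(training));
    }
}
```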

    Figure 3. The result you should get from running this example. Annotation is represented by the tag

    As shown in Figure 3, the workflow differs from the examples provided (in resources/data/D321-additional-example-2/examples). Nevertheless, the system identifies the correct vehicle components even though the document is new. In particular, it identifies terms that do not occur in the examples, namely “tail”, “trunk”, “light housing” and “socket”.
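Checking which extracted terms are genuinely new amounts to a set difference between the extracted terms and the terms annotated in the training examples. The sketch below illustrates this. The class `NewTermsSketch` is hypothetical, and the training terms in `main` are made up for illustration, because the actual content of the example documents is not reproduced on this page; only the four new terms are taken from the text above.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NewTermsSketch {
    // Returns the extracted terms that do not occur among the training terms.
    // Comparison ignores case; the order of first appearance is preserved.
    static Set<String> unseenTerms(List<String> extracted, List<String> training) {
        Set<String> known = new HashSet<>();
        for (String term : training) {
            known.add(term.toLowerCase(Locale.ROOT));
        }
        Set<String> unseen = new LinkedHashSet<>();
        for (String term : extracted) {
            if (!known.contains(term.toLowerCase(Locale.ROOT))) {
                unseen.add(term);
            }
        }
        return unseen;
    }

    public static void main(String[] args) {
        // Hypothetical training terms, invented for this illustration.
        List<String> training = List.of("battery", "bulb");
        // "bulb" is also invented; the other four terms are the ones reported as new.
        List<String> extracted = List.of("bulb", "tail", "trunk", "light housing", "socket");
        System.out.println(unseenTerms(extracted, training));
    }
}
```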

    Because of space constraints on this webpage, some details are omitted. Readers are referred to the deliverable "D.3.1.2, D3.2.1, D3.3.1: Initial Version of Knowledge Management Methodologies and Technologies" for details.


    Related Tools

    • WorkflowEditor

    API Documentation

    • will be updated

    Additional Material

    • [D2.1.2] D2.1.1: Initial Version of the Conceptual Framework
    • [D3.1.1] D3.1.1: Requirements analysis for Smart Products based Knowledge Management, SmartProducts, 2009.
    • [D3.4.1] D.3.4.1: Initial Design of Knowledge Management Methodologies and Technologies, SmartProducts, 2010.
    • [D3.1.2-3.2.1-3.3.1] D.3.1.2, D3.2.1, D3.3.1: Initial Version of Knowledge Management Methodologies and Technologies