For connecting data sources to DCP, the design separates the data from the corresponding storage or query technology. Inside the application, the concept of logical data types is introduced. A data type in this context is a logical set of information within a domain that is normally stored together in the same system. Every data type provides an interface abstracting storage- and/or query-related specifics. This keeps the business layer independent, and additional or different sources can easily be introduced by implementing the interface as a driver for a specific technology.
The following logical data types are known inside DCP:
The logical data types are separated from the connection types, which represent the storage/access technology and the query format. The following connection types are defined in DCP:
By this separation, configuring data connections is a two-step process. In the first step, the user configures a data source; the connection parameters to be defined depend on the ConnectionType. In the second step, a mapping between the data source and the data type is performed. This mapping is implemented within the context of a module. The module's requirements in terms of data types are stored in the ConfigurationItems.json file. Every module may require 0 to n different data types. Modules that need the same data type, e.g. the site hierarchy contained in EquipmentData, can share a data source or use different ones. These modules can connect to validated and non-validated data sources according to their needs.
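As an illustration only, a module's entry in ConfigurationItems.json could look roughly like the snippet below; the exact schema, field names, and the module name are assumptions, not the real DCP format.

```json
{
  "module": "BatchAnalytics",
  "requiredDataTypes": [
    { "dataType": "EquipmentData", "validated": true },
    { "dataType": "AnalyticalData", "validated": false }
  ]
}
```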
Inside the application code, endpoints are always called in the context of the data type, not the technical implementation. Interface resolution is based on the connection type of the configured data source for the requested data type. This allows different data types and different data sources to be combined in the same method. Using a data source always involves injecting the required factory, e.g. IEquipmentDataFactory, via dependency injection. The factory resolves to the concrete implementation of IEquipmentDataSource based on the technical type of the data source mapping.
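A minimal sketch of this resolution pattern, assuming .NET dependency injection: IEquipmentDataFactory and IEquipmentDataSource are taken from the description above, while the connection types, method names, and driver classes are hypothetical stand-ins.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical connection types; the real list is part of the DCP configuration.
public enum ConnectionType { Mock, Rest }

public interface IEquipmentDataSource
{
    Task<string[]> GetAssetHierarchyAsync(string rootId);
}

public sealed class MockEquipmentDataSource : IEquipmentDataSource
{
    public Task<string[]> GetAssetHierarchyAsync(string rootId) =>
        Task.FromResult(new[] { rootId, $"{rootId}/Line-1", $"{rootId}/Line-1/Reactor-1" });
}

public sealed class RestEquipmentDataSource : IEquipmentDataSource
{
    public Task<string[]> GetAssetHierarchyAsync(string rootId) =>
        Task.FromResult(Array.Empty<string>()); // a real driver would query the remote system here
}

public interface IEquipmentDataFactory
{
    IEquipmentDataSource Resolve(ConnectionType connectionType);
}

public sealed class EquipmentDataFactory : IEquipmentDataFactory
{
    private readonly IServiceProvider _services;
    public EquipmentDataFactory(IServiceProvider services) => _services = services;

    // Pick the driver matching the connection type of the configured data source.
    public IEquipmentDataSource Resolve(ConnectionType connectionType) => connectionType switch
    {
        ConnectionType.Mock => _services.GetRequiredService<MockEquipmentDataSource>(),
        ConnectionType.Rest => _services.GetRequiredService<RestEquipmentDataSource>(),
        _ => throw new NotSupportedException($"No driver registered for {connectionType}")
    };
}
```

Business code injects IEquipmentDataFactory, asks it for the data source matching the configured connection type, and calls the logical endpoint without knowing which driver answers.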
A data type describes a set of logical entities in the same context; these are normally stored in the same data source.
Equipment Data as a logical data type includes contextualized sensor data, raw sensor telemetry, asset hierarchy, and operational context. Contextualized sensor data combines raw telemetry with timestamps and conditions, while raw telemetry captures parameters like temperature and pressure. The asset hierarchy organizes equipment relationships within a facility. Event frames record specific events or conditions, such as maintenance or faults, over time. Unit procedures provide standardized operation sequences. Together, these elements enable effective decision-making, predictive maintenance, and optimized asset performance.
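Purely as an illustration of these elements, Equipment Data can be pictured with a few simplified types; the record names and shapes below are assumptions, not the DCP data model.

```csharp
using System;
using System.Collections.Generic;

// Simplified, hypothetical shapes for the Equipment Data elements described above.
public sealed record SensorReading(string AttributeId, DateTimeOffset Timestamp, double Value);

public sealed record EventFrame(string Name, DateTimeOffset Start, DateTimeOffset End,
                                IReadOnlyList<SensorReading> ContextualizedData);

public sealed record AssetNode(string Id, string Name,
                               IReadOnlyList<AssetNode> Children,    // asset hierarchy
                               IReadOnlyList<string> AttributeIds);  // sensors/attributes attached to the asset
```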
Experiment Data as a logical data type includes both the schedule of the study (which run, with which set-up, was performed on which asset and when), such as a Design of Experiments (DoE), and the related online and offline measurements taken during process execution.
Electronic Document Management Data as a logical data type encompasses both the actual documents and their associated metadata within a company's digital repository. This data type includes access to all company documents such as reports, manuals, policies, and contracts, ensuring they are securely stored and easily retrievable. The associated metadata, which includes details like document title, author, creation date, revision history, and access permissions, provides essential context and aids in efficient document organization, search, and retrieval. Together, these components streamline document management, enhance compliance, and improve collaboration across the organization.
Analytical Data as a logical data type comprises attributes and their measured values, characterizing an entire run of a process or experiment. These values are typically obtained from analytical or laboratory instruments in an offline manner, after the run has been completed. This data provides detailed insights into parameters such as chemical composition, physical properties, and other critical metrics. An example of an analytical attribute is the final titer achieved during a fermentation.
A connection bridge is a conceptual tool or mechanism designed to facilitate the integration and interaction between two distinct data domains. These data domains may be disparate systems, databases, or datasets, each possibly employing unique keys to identify records. The primary function of a connection bridge is to enable the mapping and translation of data from one system to another, ensuring that records in one domain correspond accurately to records in the other.
Key Characteristics of a Connection Bridge:

- Integration of Different Data Domains:
  - Distinct Data Domains: these are separate systems or datasets with their own schemas, structures, and key identifiers.
  - Objective: the bridge creates a link between these domains, allowing for coherent data exchange and integration.
- Key Mapping: records in the two domains may use different keys; the bridge maps and translates these keys so that records in one domain correspond accurately to records in the other (see the sketch below).
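A minimal sketch of such a key mapping, assuming a simple in-memory lookup; the class name, method, and key values are hypothetical and only illustrate translating a record key from one domain into the key used by the other.

```csharp
using System.Collections.Generic;

// Hypothetical bridge: maps an equipment-domain asset ID to the key the experiment
// domain uses for the same physical asset.
public sealed class ConnectionBridge
{
    private readonly Dictionary<string, string> _equipmentToExperimentKey = new()
    {
        ["AF-123"] = "Reactor-A",
        ["AF-456"] = "Reactor-B"
    };

    public bool TryTranslate(string equipmentKey, out string experimentKey) =>
        _equipmentToExperimentKey.TryGetValue(equipmentKey, out experimentKey);
}
```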
A mock datasource is a simulated data provider that generates data for testing, development, and validation purposes. Inside DCP, the mock datasource uses pseudo-random functions to create the values. This means that instead of relying on actual data from real-world sensors or processes, the datasource produces synthetic data that mimics expected patterns and ranges. By using pseudo-random functions, the generated values can exhibit controlled variability and realistic behavior, allowing developers and testers to evaluate system performance, troubleshoot issues, and ensure the robustness of data handling algorithms without the need for live data connections.
The mock asset hierarchy is generated using a graph structure, where the configuration allows users to control the graph's depth and the maximum number of children each node can have. The deeper an element is within the graph, the less likely it is to have a high number of children. Each node in the graph represents a layer, and nodes without children are designated as assets where sensors and attributes are attached.
Each node is identified by a unique ID, which seeds the random generator used for generating deeper levels of the hierarchy. This ensures consistency and reproducibility in the mock data. The root node, which serves as the starting point of the hierarchy, is seeded based on the user's configuration settings. This structured approach allows for flexible and controlled creation of complex asset hierarchies, aiding in testing and development scenarios where realistic yet customizable data structures are needed.
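The following sketch illustrates this seeding approach; the node ID format, child count bounds, and hash function are choices made for the example, not the DCP implementation.

```csharp
using System;
using System.Collections.Generic;

public sealed record MockNode(string Id, int Depth, List<MockNode> Children);

public static class MockHierarchyGenerator
{
    public static MockNode Build(string nodeId, int depth, int maxDepth, int maxChildren)
    {
        var node = new MockNode(nodeId, depth, new List<MockNode>());
        if (depth >= maxDepth) return node; // childless nodes act as assets carrying sensors/attributes

        // string.GetHashCode() is randomized per process in .NET, so derive a stable seed instead.
        int seed = 17;
        foreach (char c in nodeId) seed = unchecked(seed * 31 + c);
        var rng = new Random(seed);

        // Allow fewer children the deeper the node sits in the graph.
        int childCount = rng.Next(0, Math.Max(1, maxChildren - depth) + 1);
        for (int i = 0; i < childCount; i++)
            node.Children.Add(Build($"{nodeId}/{i}", depth + 1, maxDepth, maxChildren));

        return node;
    }
}
```

Because every node re-derives its generator from its own ID, rebuilding the hierarchy with the same configuration yields the same tree.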
Sensor data is mocked by sampling from a list of function classes, using the attributeId as a key. Within each function class, certain parameters are randomized, and the functions map Unix times to values. There are three main generators: one for string values, one for numeric values during active events, and one for numeric values outside a running batch. Outside a running batch, values are mainly noise; within event frames, the functions produce similar (but not identical) patterns for a given attribute across event frames. Functions are always applied relative to the event frame time rather than the absolute time. Additionally, a fourth generator exists for real-time scenarios, producing constant values to support consistent and reproducible testing.
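A sketch of this pattern, with hypothetical function classes and a stable hash standing in for the real key lookup:

```csharp
using System;
using System.Collections.Generic;

public static class MockSensorSignal
{
    // Each entry is a "function class": given a seeded Random it fixes some parameters
    // and returns a function that maps seconds since the event frame start to a value.
    private static readonly List<Func<Random, Func<long, double>>> FunctionClasses = new()
    {
        rng => { double amp = 1 + rng.NextDouble() * 9; double period = 3600 + rng.Next(0, 3600);
                 return t => amp * Math.Sin(2 * Math.PI * t / period); },   // periodic signal
        rng => { double k = rng.NextDouble() / 1000;
                 return t => 100 * (1 - Math.Exp(-k * t)); },               // saturation curve
        rng => { double slope = rng.NextDouble() / 60;
                 return t => slope * t; }                                   // linear ramp
    };

    public static double Sample(string attributeId, long eventFrameStartUnix, long unixTime)
    {
        int seed = StableHash(attributeId);                   // attributeId selects and seeds the function
        var f = FunctionClasses[(seed & 0x7FFFFFFF) % FunctionClasses.Count](new Random(seed));
        return f(unixTime - eventFrameStartUnix);             // relative to the event frame, not absolute time
    }

    private static int StableHash(string s)
    {
        int h = 17;
        foreach (char c in s) h = unchecked(h * 31 + c);
        return h;
    }
}
```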
The event frame context is mocked by sampling a base period defined in the configuration, with sampling based on the template assigned to each node. The base period is used to segment the timescale, determining if event frames exist within each segment based on a random value seeded by the node ID and the base period's start time.
When event frames are present within the base period, there are two scenarios: constant and dynamic. The constant scenario is used for testing real-time cases where exactly reproducible patterns are required. In the dynamic scenario, which is more realistic, the duration of the event frame varies. This variation is achieved by shifting the start and end times within the base period: up to 20% of the base period is added when calculating the start time, and up to 20% is subtracted when calculating the end time.
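The segmentation and the constant/dynamic scenarios could be sketched as follows; seeds, probabilities, and type names are assumptions for illustration.

```csharp
using System;

public sealed record MockEventFrame(DateTimeOffset Start, DateTimeOffset End);

public static class MockEventFrameGenerator
{
    // Returns the event frame for one base-period segment, or null if none exists there.
    public static MockEventFrame Sample(string nodeId, DateTimeOffset segmentStart,
                                        TimeSpan basePeriod, bool dynamicScenario)
    {
        // Seed from the node ID and the segment start so the decision is reproducible.
        int seed = 17;
        foreach (char c in $"{nodeId}|{segmentStart.ToUnixTimeSeconds()}") seed = unchecked(seed * 31 + c);
        var rng = new Random(seed);

        if (rng.NextDouble() < 0.3) return null;               // no event frame in this segment

        if (!dynamicScenario)                                  // constant: exactly reproducible pattern
            return new MockEventFrame(segmentStart, segmentStart + basePeriod);

        // Dynamic: shift start and end within the base period by up to 20% of its length.
        var start = segmentStart + TimeSpan.FromTicks((long)(basePeriod.Ticks * 0.2 * rng.NextDouble()));
        var end = segmentStart + basePeriod - TimeSpan.FromTicks((long)(basePeriod.Ticks * 0.2 * rng.NextDouble()));
        return new MockEventFrame(start, end);
    }
}
```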
Analytical data is mocked by sampling from a list of function classes, using the attributeId as a key. Within each function class, certain parameters are randomized, and the functions map Unix times to values.