Mathematical Models

Introduction

The MVDA module implements various algorithms to provide different monitoring capabilities. From an implementation point of view, they can be separated along two dimensions, referred to as:

  • Model Class
  • Model Type

Model Class

The first dimension, the model class, is based on how the time series going into the analysis are represented. DCP distinguishes between:

  • Batch Evolution Models (BEM): these models monitor the process over time as a function of process maturity (most often equal to time), enabling real-time updates and inflight monitoring.
  • Batch Level Models (BLM): these models represent a batch as a single data point, where the time series are gathered either by batch-wise unfolding or by feature extraction. In either case, the process must be completed before an analysis can be performed.

Model Type

The second dimension DCP uses to classify models is referred to as the model type and is mainly an encoding of the applied algorithm. The model structure (available outputs, required encoding, etc.) is algorithm dependent and therefore needs to be managed.

These dimensions can be re-assembled into the following hierarchy. This diagram contains only the most important methods. For a complete list of methods or for method details, check the inline documentation of the interfaces; the model interfaces are named IModelService, IBatchEvolutionModelService and IBatchLevelService.

classDiagram
    class Model{
        +GetModelConfigurationFieldsAsync()
        +GetConfigurationDefaults()
        +CreateModelAsync()
        +UpdateModelAsync()
        +ModelUploadAsync()
        +TestModelAsync()
        +ParameterLimitsAsync()
        +RawParameterLimitsAsync()
        +GetModelParameters()
        +GetModelSensors()
    }

    class BEMModel{
        +CreateSignUpsForActiveBatchesAsync()
        +BatchDataHistoricalWithCacheAsync()
        +BatchDataAtTimeWithCacheAsync()
        +GetBulkValuesUpdate()
        +ContributionPlotAsync()
        +BatchRawDataHistoricalAsync()
        +BatchRawDataAtTimeAsync()
        +BatchDataAtTimeAsync()
        +CheckDataPointViolationsAsync()
        +CalculateBatchSummaryStatistics()
    }


    class BLMModel{
        +BatchDataAsync()
        +SourceOfVariationPlotAsync()
        +BatchRawDataAsync()
        +BatchDataPointCommentAsync()
        +CheckBatchViolations()
        +CalculateBatchSummaryStatistics()
    }

    Model <|-- BEMModel
    Model <|-- BLMModel

    class PLSBEMModel

    BEMModel <|-- PLSBEMModel

    class PCABLMModel

    BLMModel <|-- PCABLMModel
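The hierarchy above can be sketched in code as follows (a simplified TypeScript sketch; the method names follow the diagram, but the signatures and the PlsBemModel sample values are illustrative, not the actual C# service contracts):

```typescript
// Base contract shared by all models, regardless of model class.
interface IModel {
  getModelParameters(): string[];
  getModelSensors(): string[];
}

// Batch Evolution Models add point-in-time operations for inflight monitoring.
interface IBEMModel extends IModel {
  batchDataAtTime(timestamp: Date): number[];
  checkDataPointViolations(timestamp: Date): boolean;
}

// Batch Level Models operate on completed batches only.
interface IBLMModel extends IModel {
  batchData(batchId: string): number[];
  checkBatchViolations(batchId: string): boolean;
}

// A concrete model type implements the interface of its model class.
class PlsBemModel implements IBEMModel {
  getModelParameters(): string[] { return ["Temperature", "pH"]; }   // illustrative
  getModelSensors(): string[] { return ["sensor-1", "sensor-2"]; }   // illustrative
  batchDataAtTime(_timestamp: Date): number[] { return []; }
  checkDataPointViolations(_timestamp: Date): boolean { return false; }
}
```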


At this point in time, the following model types are implemented:

  • Partial Least Squares - Batch Evolution Models (PLS-BEM)
  • Principal Component Analysis - Batch Level Models (PCA-BLM)

The model class mainly defines the functionality that can be performed with the model. As BEM models additionally support real-time caching, more methods need to be implemented. The following table compares the supported functionality based on the model class:

Functionality | Batch Evolution Models | Batch Level Models
Summary Statistics | Yes, implemented on concrete model type | Yes, implemented on concrete model type
Real-time Caching | Supported | Not supported
Diagnostic plots | Contribution plot (event specific) | Source of variation plot (event agnostic)
Commenting | Single observations or complete batches | Complete batches only

The classes BEMModelService and BLMModelService are abstract classes, which implement only model-class-specific but model-type-agnostic functionality, for example working with raw data (BatchRawDataAtTimeAsync()). Due to their abstract nature, they need a concrete implementation that provides the algorithm/model-type-specific logic.

The main differences in the implementation per model type lie in method validations and in calculating distance metrics. The concrete mathematical calculations are performed on the calculation node. See the acdcMvda R package documentation for further details.
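The split between the abstract base class and the concrete model type can be illustrated with a short sketch (TypeScript; the distance metric shown is a hypothetical example - the real calculations happen on the calculation node via the acdcMvda R package):

```typescript
// Sketch of the abstract-class split: the base class implements
// model-class-specific but model-type-agnostic logic (e.g. raw data
// handling), while the distance metric is left to the concrete model type.
abstract class BemModelService {
  // Model-type-agnostic: raw data is forwarded 1:1, no algorithm involved.
  batchRawDataAtTime(values: number[]): number[] {
    return values;
  }

  // Model-type-specific: must be provided by the concrete implementation.
  abstract calculateDistanceMetric(observed: number[], predicted: number[]): number;
}

class PlsBemModelService extends BemModelService {
  // Hypothetical choice: sum of squared residuals as the distance metric.
  calculateDistanceMetric(observed: number[], predicted: number[]): number {
    return observed.reduce((acc, o, i) => acc + (o - predicted[i]) ** 2, 0);
  }
}
```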

Model Structure

DCP MVDA uses JSON as the storage format. The JSON documents are saved in the database utilizing the JSON support provided by MS SQL Server. The model format has some common sections; within the sections, the format is adapted to fit the needs of the model type. The common sections are presented below.

Info

This section contains general information on the model, e.g. the class, the format version, etc. The basic information is strongly typed and mandatory for every model implementation. If a model does not require a specific field, null values are allowed:

  • Maturity (string), the name of the maturity sensor used for alignment in BEM cases or used as a basis for the unfolding in BLM cases
  • Interval (integer), the sampling/interpolation interval between datapoints in the modeling dataset, used to align limits and further calculation
  • AvgRuntime (double), the average duration of a batch/process in hours. The value is used in the service layer to identify potential problems (e.g. not receiving batch end messages)
  • FormatVersion (string), the two-digit versioning of the model storage format, e.g. 2.0; see the Model formats section below for details
  • ModelClass (string), stored as a string see sections above for details
  • ModelType (string), stored as a string see sections above for details
  • A (integer), the number of fitted components
  • K (integer), the number of input parameters (including formulas, and other calculated/derived variables) going into the model.
  • N (integer), the number of observations included into the dataset, for batch models (BEM and BLM) this corresponds to the number of batches in the dataset
  • DfMod (double), the degrees of freedom calculated by the model, see the statistical appendix on how the values are calculated.
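A strongly typed view of the Info section could look like the following sketch (the field names are taken from the list above; the example values, e.g. the maturity sensor name, are invented for illustration):

```typescript
// Sketch of the strongly typed Info section. Nullable fields reflect that
// null values are allowed where a model does not require a specific field.
interface ModelInfo {
  Maturity: string | null;   // maturity sensor name (alignment / unfolding basis)
  Interval: number | null;   // sampling/interpolation interval
  AvgRuntime: number | null; // average batch duration in hours
  FormatVersion: string;     // e.g. "2.0"
  ModelClass: string;        // "BEM" or "BLM"
  ModelType: string;         // e.g. "PLS" or "PCA"
  A: number;                 // number of fitted components
  K: number;                 // number of input parameters
  N: number;                 // number of observations (batches)
  DfMod: number | null;      // degrees of freedom
}

// Illustrative values only - not taken from a real model.
const exampleInfo: ModelInfo = {
  Maturity: "BatchAge",
  Interval: 60,
  AvgRuntime: 48.5,
  FormatVersion: "2.0",
  ModelClass: "BEM",
  ModelType: "PLS",
  A: 3,
  K: 12,
  N: 25,
  DfMod: null,
};
```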

Sensors

This section contains a set/list of unique Ids at the data source layer, which are used to read the input variables. This section is strongly typed. Every sensor has the following properties:

  • WebId (string) unique (and URL-safe) identifier on the external datasource
  • Name (string) the user-defined name, used by the datasource to refer to a particular sensor
  • Description (string) the sensor description as used on the datasource
  • DeviceWebId (string) the unique identifier of the asset to which the sensor is tied. This is used to distinguish different datasources if a model is to be applied asset-agnostically.

Limits

In this section the limits of model parameters are stored. The structure is adapted to the concrete model class/structure; therefore the section has loose types.

Fit

This section contains information on model fit quality; a typical example is R2X (goodness of fit). As fit measures vary across different model types, this section has loose types.

Coefficients

This section contains the internal coefficients/weights used to calculate model outputs or for diagnosing models. Available coefficients are normally stored as matrices; however, the available components heavily depend on the model and therefore only loose types are used.

Parameter

This section contains the pre-processing information (transformation, scaling, etc.). In more detail, this includes:

  • Name (string) the identifier used to refer to an input in the model context
  • Mean (double) the mean value for the input in the modeling dataset, used for applying scaling
  • Std (double) the standard deviation value for the input in the modeling dataset, used for applying scaling
  • MissingValue (double) the share of missing values for the sensor in the modelling dataset (in percent)
  • Monotonic (bool) flag indicator if the variable is monotonically increasing in the modelling dataset
  • Scale (string), the scale method to be applied to this variable, possible values are uv - Unit Variance scaling, pareto for Pareto-scaling and none for not applying any scaling
  • Center (bool), should the variable be centred (subtracting the mean)
  • ScaleModifier (double) the scale modifier to be applied to the parameter during scaling
  • BlockWeight (double) the manual weight used for block scaling
  • Transform (string), the transformation formula to be applied during preprocessing, as an R expression, e.g. exp(X + 2)
  • Type (string), the type of the parameter, can either be: maturity, xVar (raw inputs), xVarCalc (inputs derived from other inputs by a formula)
  • Formula (string), if the parameter is derived from other inputs this contains the formula, other inputs are enclosed using the @ character, a simple example is: @Sensor1@ - first(@Sensor1@)
  • Included (bool), should the variable be used for model output generation, this is mainly used for models after batch level unfolding
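How the scaling-related Parameter fields above could be applied to a single raw value is sketched below (the ordering of centering and scaling and the Pareto formula are assumptions; the actual preprocessing is performed on the calculation node):

```typescript
// Sketch of applying the Parameter pre-processing fields to one raw value.
interface ParameterPreprocessing {
  Mean: number;
  Std: number;
  Scale: "uv" | "pareto" | "none";
  Center: boolean;
  ScaleModifier: number;
}

function preprocess(raw: number, p: ParameterPreprocessing): number {
  // Optional centering: subtract the mean from the modeling dataset.
  let value = p.Center ? raw - p.Mean : raw;
  // Scaling according to the configured method.
  switch (p.Scale) {
    case "uv":     value /= p.Std; break;            // unit variance scaling
    case "pareto": value /= Math.sqrt(p.Std); break; // Pareto scaling (assumed formula)
    case "none":   break;                            // no scaling
  }
  // The scale modifier is applied on top of the chosen scaling.
  return value * p.ScaleModifier;
}
```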

Model formats

The model format has a separate version identifier to allow the evolution of storage formats over time. The load model function implements adapters to translate the historical formats to the current format.
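The adapter idea can be sketched as follows (the version numbers and the migration step are hypothetical examples; the real adapters live in the load model function):

```typescript
// Sketch of load-time format adapters: historical storage formats are
// translated to the current format before the model is used.
interface StoredModel {
  Info: { FormatVersion: string };
}

const CURRENT_FORMAT = "2.0"; // assumed current format version

function upgradeModel(model: StoredModel): StoredModel {
  if (model.Info.FormatVersion === CURRENT_FORMAT) return model;
  // Hypothetical 1.0 -> 2.0 adapter; a real one would also migrate fields.
  if (model.Info.FormatVersion === "1.0") {
    return { ...model, Info: { ...model.Info, FormatVersion: CURRENT_FORMAT } };
  }
  throw new Error(`Unknown model format: ${model.Info.FormatVersion}`);
}
```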

Model vectors

There are two types of model vectors:

  • Model outputs are representing values, which should be monitored/trended
  • Internal vectors are used to calculate model outputs (e.g. weights) or to investigate and diagnose the model (e.g. fit). For the details on each model vector refer to the statistical appendix.

Model outputs

In principle, two groups of model output vectors have to be distinguished:

  • input signals: raw signals as ingested into the connected data source. As the data pre-processing is model type independent, there is one implementation for the raw signals within a model class. From a signal processing perspective, the system differentiates between 1:1 forwarding and cases where transformations or mathematical formulas need to be applied in order to get the desired output.

  • output signals: the result of a mathematical calculation. The model output is model class and model type specific.

A model output specification consists of a mandatory CalculationType. This is an enumeration with two additional attributes: the Description, which is used to convert the value into a human-readable format, and the OcpuName, which maps the calculation to the identifier used inside the calculation node. Every calculation type may be further described by a set of tags.

Tags have the following properties:

  • TagName (string): the name of the tag, used to refer to specific information inside the context of the calculation type.
  • TagValue (string): the value (set by the user) providing a detailed specification of the calculation type.
  • TagType (byte): categorizes the tags into different groups. Available groups are: limits, modelOutput. The groups are used to filter the tags of interest when they are used.
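The calculation type and tag structures can be sketched as follows (the registry entry and the numeric encoding of the tag groups are assumptions for illustration):

```typescript
// Sketch of a calculation type: human-readable description plus the
// identifier used on the calculation node.
interface CalculationType {
  description: string;
  ocpuName: string;
}

// Hypothetical registry entry - not taken from DCP.MVDA.Constants.
const calculationTypes: Record<string, CalculationType> = {
  DModX: { description: "Distance to model (X)", ocpuName: "dmodx" },
};

interface Tag {
  TagName: string;
  TagValue: string;
  TagType: number;
}

// Assumed numeric encoding of the tag groups.
const TAG_TYPE = { limits: 0, modelOutput: 1 };

// Filter the tags of interest by group, as described above.
function tagsOfType(tags: Tag[], type: number): Tag[] {
  return tags.filter((t) => t.TagType === type);
}
```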

In order to introduce new model vectors into the system, the following steps need to be done:

  • register a new output in DCP.MVDA.Constants.CalculationType. The description attribute is used for building the labels shown to the user. The Ocpu attribute translates the calculation to the name used inside the analytics layer.
  • implement the functions for calculating the new output in BatchDataHistorical()
  • extend the methods CalculateDistanceMeth and CheckDataPointViolationsAsync() used for the violation assessment
  • implement the rendering of the model vector description in ConfigurationToDisplayString()
  • extend the functions for the user input selection, GetModelConfigurationFieldsAsync() and GetConfigurationDefaults()

Model Factories

In order to resolve the correct instances of the classes in the different modules, the factory pattern is used. Depending on the usage in the application, different factories are used. There is a general factory which provides the IModel interface, containing the methods that need to be implemented by all models. In multiple places of the application, dedicated implementations based on the model class exist. In these cases, model-class-specific factories are used to return IBEMModel or IBLMModel. This is due to the functionality supported by class (see table above). As an example, the Cache worker service utilizes IBEMModel, as real-time caching is supported by batch evolution models only.
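The factory resolution can be sketched as follows (a TypeScript sketch; the real factories return the C# interfaces IModel, IBEMModel and IBLMModel, and the fields shown are illustrative):

```typescript
type ModelClass = "BEM" | "BLM";

interface IModel { modelClass: ModelClass }
interface IBEMModel extends IModel { supportsRealtimeCaching: boolean }

// General factory: accepts any model class, returns the broadest interface.
function createModel(modelClass: ModelClass): IModel {
  if (modelClass === "BEM") {
    const bem: IBEMModel = { modelClass, supportsRealtimeCaching: true };
    return bem;
  }
  return { modelClass };
}

// Class-specific factory, e.g. for the Cache worker service: only batch
// evolution models support real-time caching, so the result is narrowed.
function createBemModel(): IBEMModel {
  return createModel("BEM") as IBEMModel;
}
```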

Model lifecycle

Models can be in different states, which are best illustrated in the following diagram:

stateDiagram
    [*] --> InProgress
    InProgress --> Completed: complete model
    Completed --> InValidation: Open Validation Wizard
    Completed --> InProgress: Changed coefficients
    InProgress --> InProgress: Iteration without completion
    InValidation --> Completed: Model Report approved
    Completed --> Obsolete: User delete
    Completed --> Archived: User delete
    Completed --> [*]
    Obsolete --> [*]
    Archived --> [*]

The states have the following context:

  • InProgress models are currently being edited and therefore in an intermediate state; in this state the model can neither be applied nor validated.
  • Completed models can have either a GxP or a non-GxP tag associated with the model - they are in a stable state used for monitoring, etc.
  • InValidation models are locked and do not allow further edits while going through the process of validation. During this process the model can still be applied and events can be evaluated against the model.
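The transitions from the state diagram above can be encoded as an allowed-transition map, e.g. for validating state changes (a sketch only; the actual implementation may differ):

```typescript
type ModelState = "InProgress" | "Completed" | "InValidation" | "Obsolete" | "Archived";

// Derived directly from the state diagram above.
const allowedTransitions: Record<ModelState, ModelState[]> = {
  InProgress: ["Completed", "InProgress"],
  Completed: ["InValidation", "InProgress", "Obsolete", "Archived"],
  InValidation: ["Completed"],
  Obsolete: [],   // terminal state
  Archived: [],   // terminal state
};

function canTransition(from: ModelState, to: ModelState): boolean {
  return allowedTransitions[from].includes(to);
}
```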

Model development

Inside the MVDA frontend application, the Model Development Wizard is separated into a few modules and classes that implement the different steps and their logic.

Model manager modules separation

📂 MVDA app
 └──📂 src
     ├──📂 dataset
     ├──📂 model-diagnostics
     ├──📂 model-wizard
     └──📂 workset

The src/ folder is the main code and contains:

  • dataset - Contains the module, classes and components related to the first step in the model wizard, the creation of dataset(s)
  • model-diagnostics - Contains the module and components related to the third step in the model wizard, implementing model testing and other diagnostic options
  • model-wizard - Contains the module, classes and components used for managing the Model Wizard and for the model fit itself
  • workset - Contains the module and class for managing a Workset; a workset is based on a collection of datasets

The most important classes and corresponding methods are illustrated below. This is purposely not a complete list and is only intended to give a high-level overview of the most important concepts.

classDiagram
    class Model {
        -enum class
        -enum type
        +modelClass IModelClass
        +workset: IWorksetClass
        +details: IModelDetails
        +isPublic: bool
    }
    class ModelClassFactory {
        +enum class
        createPLSModel()
        createPCAModel()
    }
    
    class Workset {
        +number Id
        +Model model
        +DatasetCollection datasetCollection
    }
    class DatasetCollection {
        -number selectedIndex
        +Dataset[] datasets
        createNewDataset()
        deleteDataset()
        removeParameter()
        addSensor()
    }
    
    class Dataset {
        +string Name
        +string DeviceWebId
        +bool IsDefault
        removeParameter()
        addSensor()
    }
    class BEMFactory {
        enum class
        createPLSModel()
    }
    class BLMFactory {
        enum class
        createPCAModel()
    }
    class PCAModel {
        enum type
    }
    class PLSModel {
        enum type
    }
    DatasetCollection o-- Dataset
    Workset o-- DatasetCollection
    ModelClassFactory <|-- BEMFactory
    ModelClassFactory <|-- BLMFactory
    BEMFactory ..> PLSModel : creates
    BLMFactory ..> PCAModel : creates
    Model <|-- PCAModel
    Model <|-- PLSModel
    Model -- Workset

The usage of the model class depends on the workflow mode. DCP differentiates between Create and Read/Edit mode.

During Create, the process is separated into a few steps:

  1. The model class is checked, and the correct Class Factory is used to create a model class instance.
  2. Then the model type is checked, and the corresponding Model Type Class is used to create the Model.
  3. The next step is to create a new Workset with an empty DatasetCollection linked to the model instance.
  4. During the first step of the model wizard, a new default Dataset is created and added to the DatasetCollection. Adding additional datasets to the collection is optional.
  5. After completion of the first step of the Model Wizard, the Workset is saved to the database; then the user can continue to the next step.
  6. In the second step of the Model Wizard, additional information about the model is filled in, and the Model is saved to the database. From this point on, the model can be edited in the future.
  7. In the last (fourth) step, model monitoring settings are collected and saved in the database.
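The Create-mode steps above can be sketched as follows (steps 1-2, the factory resolution, are collapsed into direct construction here; all names and types are illustrative):

```typescript
interface Dataset { name: string; isDefault: boolean }
interface DatasetCollection { datasets: Dataset[] }
interface Workset { datasetCollection: DatasetCollection }
interface WizardModel {
  modelClass: "BEM" | "BLM";
  modelType: string;
  workset: Workset;
}

function startCreateMode(modelClass: "BEM" | "BLM", modelType: string): WizardModel {
  // Steps 1-2: model class / type resolution (normally via the factories).
  const model: WizardModel = {
    modelClass,
    modelType,
    // Step 3: a new Workset with an empty DatasetCollection.
    workset: { datasetCollection: { datasets: [] } },
  };
  // Step 4: a new default Dataset is created and added to the collection.
  model.workset.datasetCollection.datasets.push({ name: "Default", isDefault: true });
  return model;
}
```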

During the Edit/Read mode (inside the Model Wizard or Model Validation), the same structure is used, but with different steps.

  1. After loading the Model from the database, the corresponding Class Factory and Type Class are used to initialize the Model.
  2. Based on the Workset Id saved in the Model, the corresponding Workset is loaded.
  3. All Datasets related to the Workset are loaded.
  4. After loading the Datasets, a DatasetCollection is created inside the Workset.
  5. The Model is ready to be used in the application.

Model Storage

The model entities can be seen below:

erDiagram
    Model {
        int ID ""
        nvarchar Name "The model name - defined by the user"
        int Version "The model version - incremented when the coefficients are changing"
        nvarchar Conditions ""
        nvarchar ReportUrl "The URL to the model report stored as a PDF on the disk"
        int SiteID ""
        bit IsPublic "Can the model be accessed/applied by all MVDA users"
        nvarchar Sensors ""
        bit InProgress "Is the model currently being edited/reworked"
        nvarchar Description "The user entered description of the model"
        nvarchar TestSettings "The test settings used during internal testing"
        nvarchar DeviceWebIds ""
        nvarchar DefaultLimitSpecification "The default limit specification - defined by the model developer"
        int LastModifiedBy "The userId, who performed the last change - used for audit trail"
        int OwnerUserId "The userId, who is owning the record - may have special privileges"
        datetime2 SysEndTime ""
        datetime2 SysStartTime ""
        nvarchar Batches ""
        nvarchar Coefficients "The fitted coefficients used to calculate the model output(s)"
        nvarchar Fit "The fit measures used to assess the model quality"
        nvarchar Info ""
        nvarchar Limits "The calculated model limit parameters - used for the limit assessment"
        nvarchar Parameter ""
        int WorksetID ""
        bit IsArchived "Flag indicating whether the model is archived"
    }

    Workset {
        int ID ""
        nvarchar ModelBatches "An array of unique Ids from the datasource - describing the batches used in the training set"
        nvarchar TestBatches "An array of unique Ids from the datasource - describing the batches used in the internal testing set"
        tinyint UnfoldingType "the applied unfolding type - either BEM or BLM"
        int LastModifiedBy "The userId, who performed the last change - used for audit trail"
        int OwnerUserId "The userId, who is owning the record - may have special privileges"
        int SiteId ""
        datetime2 SysEndTime ""
        datetime2 SysStartTime ""
    }

    Dataset {
        int ID ""
        nvarchar Name "The name identifying the datasource"
        nvarchar Filter "The global filter - representing the hierarchy to the element"
        bit IsDefault "Flag indicating the leading dataset"
        int Interval "The interpolation interval between datapoints in seconds"
        nvarchar DeviceWebId "The equipment identifier on the datasource"
        int LastModifiedBy "The userId, who performed the last change - used for audit trail"
        int OwnerUserId "The userId, who is owning the record - may have special privileges"
        int SiteId ""
        int WorksetID ""
        datetime2 SysEndTime ""
        datetime2 SysStartTime ""
        nvarchar Batches ""
        nvarchar Parameter ""
        nvarchar Sensors ""
        nvarchar TimeRange ""
    }

    ModelSession {
        int ID ""
        int ModelID "The model as defined in the database"
        nvarchar SessionID "The session - referring to a model present on the calculation node"
        datetime2 SessionDateUpdated "The session date - used to check the validity"
    }

    WorksetSession {
        int ID ""
        int WorksetID ""
        nvarchar SessionID "The session - referring to a workset present on the calculation node"
        datetime2 SessionDateUpdated "The session date - used to check the validity"
    }

    DatasetSession {
        int ID ""
        int DatasetID "The dataset as defined in the database"
        nvarchar SessionID "The session - referring to a dataset present on the calculation node"
        datetime2 SessionDateUpdated "The session date - used to check the validity"
    }

    Model ||--|| ModelSession : "has session cache"
    Model  ||--|| Workset : owns
    Workset  ||--|| WorksetSession : "has session cache"
    Workset  ||--|{ Dataset : "consists of"
    Dataset  ||--|| DatasetSession : "has session cache"

Relations to notifications and dashboards are hidden on purpose to simplify the illustration.

Records classification and audit trail

For the session tables:

Specification | Value
Content/Overview | Sessions on the calculation node
Data classification | Cache only
Change Tracking | No
Audit Trail | No
Retention period | N/A

All other tables:

Specification | Value
Content/Overview | State and definitions of a MVDA model and the dataset definition
Data classification | Official records
Change Tracking | System-versioned table features inside SQL
Audit Trail | Module specific audit trails
Retention period | 10 years

erDiagram
    ModelSharedUsers {
        int ID ""
        int ModelId "The linked model from which the calculation was based on"
        int UserId "The contributor userId - allowed to perform edits"
        int LastModifiedBy "The userId, who performed the last change - used for audit trail"
        int OwnerUserId "The userId, who is owning the record - may have special privileges"
        datetime2 SysEndTime ""
        datetime2 SysStartTime ""
    }

    ModelReview {
        int ID ""
        int ModelID "The linked model from which the review was based on"
        int Version "The linked model version from which the review was based on"
        nvarchar Comment "The user added comment - describing the findings of the review"
        bit IsAbnormalitiesDetected "Were there any abnormal behaviors identified during the review"
        int LastModifiedBy "The userId, who performed the last change - used for audit trail"
        int OwnerUserId "The userId, who is owning the record - may have special privileges"
        datetime2 SysEndTime ""
        datetime2 SysStartTime ""
    }

    ModelLock {
        int ID ""
        int ModelID ""
        bit IsLocked "Flag indicating whether the model validation is locked"
        int LockedBy "UserId of the user who locked the model"
        datetime2 LockedOn "Timestamp in UTC when the model has been locked"
    }

    Model ||--o{ ModelSharedUsers : "shared with"
    Model ||--o{ ModelReview : "documented as"
    Model  ||--|| ModelLock : locks

Records classification and audit trail

For the lock table:

Specification | Value
Content/Overview | User lock state of the model validation
Data classification | Cache only
Change Tracking | No
Audit Trail | No
Retention period | N/A

All other tables:

Specification | Value
Content/Overview | Model state related records, e.g. sharing
Data classification | Convenience records
Change Tracking | System-versioned table features inside SQL
Audit Trail | Module specific audit trails
Retention period | 3 years

The main reason for separating out the session cache tables is the usage of temporal features and the implementation of the audit trail. As sessions have a lifetime (24 hours by default), records are updated frequently, which would result in incorrect audit trail entries.
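The session-lifetime check implied here can be sketched as follows (the 24-hour default is taken from the text; the function name and signature are illustrative):

```typescript
// A session record on the calculation node is only valid within its
// lifetime; afterwards it has to be refreshed.
const DEFAULT_SESSION_LIFETIME_MS = 24 * 60 * 60 * 1000; // 24 hours

function isSessionValid(
  sessionDateUpdated: Date,
  now: Date,
  lifetimeMs: number = DEFAULT_SESSION_LIFETIME_MS
): boolean {
  return now.getTime() - sessionDateUpdated.getTime() < lifetimeMs;
}
```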

Model validation

The model validation inside the DCP context consists only of helper steps for generating the actual model validation record, which is stored in EDMS.

Model validation records

erDiagram

ModelSimulationBatches {
 int ID ""
 int ModelValidationInformationId ""
 nvarchar Name "The user defined name of the test scenario"
 nvarchar Description "The user defined description of the test scenario"
 tinyint ModelTestResult "The SME defined expected test result"
 nvarchar SimulatedSensor ""
 int LastModifiedBy "The userId, who performed the last change - used for audit trail"
 int OwnerUserId "The userId, who is owning the record - may have special privileges"
 datetime2 SysEndTime ""
 datetime2 SysStartTime ""
 tinyint LikelyHoodOfOccurence "The likelihood of the described event occurring"
 tinyint Severity "The severity of the described event"
}

ValidationDocuments {
 int ID ""
 int ModelValidationInformationID ""
 tinyint DocumentType "The role of the document in the DCP context plan/report"
 nvarchar LocalFile "The UNC path to the hard disk location where the DCP generated document version is stored"
 nvarchar EDMSDocumentId "The unique identifier of the document in EDMS"
 int LastModifiedBy "The userId, who performed the last change - used for audit trail"
 int OwnerUserId "The userId, who is owning the record - may have special privileges"
 datetime2 SysEndTime ""
 datetime2 SysStartTime ""
 tinyint DocumentStatus "The document status in EDMS e.g. approved, effective, draft, etc."
}

ModelValidationLock {
 int ID ""
 int ModelValidationInformationID ""
 bit IsLocked "Flag indicating whether the model validation is locked"
 int LockedBy "UserId of the user who locked the model validation"
 datetime2 LockedOn "Timestamp in UTC when the model validation has been locked"
}

ModelValidationInformation {
 int ID ""
 int ModelId "The linked model to be validated"
 int Version "The linked model version to be validated"
 tinyint Scope "The user defined scope of the validation"
 nvarchar OtherScope "The user defined scope details of the validation"
 nvarchar IntendedUse "The user defined model intended use of the validation"
 nvarchar ProcessDescription "The user defined process description of the validation"
 nvarchar AcceptanceCriteria "The user defined acceptance criteria for the validation"
 nvarchar TestBatches "The user selected test batches"
 int LastModifiedBy "The userId, who performed the last change - used for audit trail"
 int OwnerUserId "The userId, who is owning the record - may have special privileges"
 datetime2 SysEndTime ""
 datetime2 SysStartTime ""
}

ModelValidationInformation || --o{ ValidationDocuments: "based on"
ModelValidationInformation ||--|| ModelValidationLock: "locks"
ModelValidationInformation || --o{ ModelSimulationBatches: "described by"

Records classification and audit trail

For the lock table:

Specification | Value
Content/Overview | User lock state of the model validation
Data classification | Cache only
Change Tracking | No
Audit Trail | No
Retention period | N/A

All other tables:

Specification | Value
Content/Overview | Information for the model validation; official validation plan/report in EDMS
Data classification | Official records
Change Tracking | System-versioned table features inside SQL
Audit Trail | Module specific audit trails
Retention period | 10 years

Validation wizard modules separation

📂 MVDA app
 └──📂 src
     ├──📂 generic-batch
     └──📂 model-validation

The src main folder contains:

  • generic-batch - Contains the module, service, store and components related to the third step of the Model Validation Wizard, Simulation
  • model-validation - Contains the module, service, store and components used to create the Model Validation Wizard

Most components used to create the Model Validation Wizard reside in the ModelValidation module. That includes the wizard component itself and most of the components used for the steps inside the wizard:

  • Scope Form
  • Batches for Validation grid and Form Controls
  • Simulation wrapper component
  • Finalize form

The ValidationWizard component holds the main Form and its Validators, for example the validation of the Expectation (Total To Fall and To Pass) of the selected Batches and Simulation Batches.

Logically, Simulation Batches represent a separate entity, and all CRUD operations for them are situated inside the GenericBatch module. The module contains the store, service and models used for managing simulation batches. The main components inside the module are:

  • GenericBatchEditor
  • GenericBatchGrid component

The GenericBatchEditor contains all the logic for editing or creating a new simulation batch. The GenericBatchGrid represents all simulation batches created for the current validation and uses a form control for the test expectation per batch.

This page was last edited on 03 May 2024, 07:57 (UTC).