How to document assumptions in ML preprocessing logic

Documenting assumptions in ML preprocessing logic is crucial for maintaining the integrity of your pipeline, enabling collaboration, and ensuring reproducibility. Assumptions are typically implicit beliefs or conditions on which the preprocessing steps depend, and when not documented, they can cause confusion or misinterpretation down the line.

Here’s a breakdown of how to document assumptions in ML preprocessing logic effectively:

1. Create a Clear Assumptions Section

At the beginning of your preprocessing code or notebook, create a dedicated section for assumptions. Label it clearly with headings like “Assumptions” or “Preprocessing Assumptions.”

Be explicit and concise in describing each assumption. For example:

python
# Assumptions
# 1. The input data is in CSV format with headers.
# 2. Missing values are represented by 'NaN' or empty strings.
# 3. All categorical features are encoded as strings.
# 4. Numerical features have already been normalized.

2. Describe Data Format and Integrity Assumptions

Include assumptions about the structure of the input data, such as:
- Column names, types, and formats
- The presence of required columns or fields
- Expected range or distribution of values
- Missing values and how they are handled (e.g., imputation, removal)
- Consistency of data types for specific columns

Example:

python
# Assumption: The dataset always has the columns 'age', 'income', and 'education_level'.
# Missing values in 'education_level' are imputed with 'unknown'.
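
A documented assumption is easier to trust when the code also checks it. The following is a minimal sketch, assuming records arrive as a list of dicts (a stand-in for a DataFrame) and that a missing value is `None`, an empty string, or the string `'NaN'`. The names `REQUIRED_COLUMNS` and `check_and_impute` are hypothetical.

```python
REQUIRED_COLUMNS = {"age", "income", "education_level"}
MISSING_MARKERS = (None, "", "NaN")  # assumed representations of a missing value


def check_and_impute(rows):
    """Validate the documented schema assumption, then apply the documented imputation."""
    cleaned = []
    for i, row in enumerate(rows):
        # Assumption: every record carries all required columns.
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"row {i} is missing required columns: {sorted(missing)}")
        # Assumption: a missing 'education_level' value is imputed with 'unknown'.
        fixed = dict(row)  # copy so the caller's data is left untouched
        if fixed["education_level"] in MISSING_MARKERS:
            fixed["education_level"] = "unknown"
        cleaned.append(fixed)
    return cleaned
```

Raising on a missing column, rather than filling it in, keeps a violated assumption visible instead of hiding it.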

3. Explain Assumptions About Preprocessing Steps

For each preprocessing step (e.g., scaling, encoding, outlier removal), document the underlying assumptions that affect how that step operates.

Example:

python
# Assumption: All numerical features must be scaled to the range [0, 1].
# This is done with Min-Max scaling, which assumes each feature has a bounded range with no extreme outliers.
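
One way to keep a scaling assumption honest is to pass the documented range into the scaler and fail when a value falls outside it. This is a sketch only: `min_max_scale`, `expected_min`, and `expected_max` are hypothetical names, and the range is assumed to be known in advance (for example, taken from the training set) rather than recomputed on each batch.

```python
def min_max_scale(values, expected_min, expected_max):
    """Scale values to [0, 1] using a documented, fixed range."""
    if expected_max <= expected_min:
        raise ValueError("expected_max must be greater than expected_min")
    span = expected_max - expected_min
    scaled = []
    for v in values:
        # Fail loudly when the documented range assumption is violated,
        # instead of silently producing values outside [0, 1].
        if not expected_min <= v <= expected_max:
            raise ValueError(
                f"value {v} is outside the documented range [{expected_min}, {expected_max}]"
            )
        scaled.append((v - expected_min) / span)
    return scaled
```

For example, `min_max_scale([0, 50, 100], 0, 100)` returns `[0.0, 0.5, 1.0]`, while a value of 150 raises a `ValueError`.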

4. State Assumptions About Data Splitting and Sampling

If the logic involves data splitting (train/test/validation) or sampling, document how this is done and under what assumptions.

Example:

python
# Assumption: The data is split using stratified sampling, which preserves the overall class proportions in each split.
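
To make the splitting assumption concrete, here is a small standard-library sketch of a stratified split. It is illustrative rather than a replacement for a library implementation; `stratified_split`, `test_fraction`, and `seed` are hypothetical names, and the code assumes every class has enough examples to appear in both splits.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, indices in by_class.items():
        n_test = round(len(indices) * test_fraction)
        # Assumption: each class is large enough to land in both splits.
        if n_test == 0 or n_test == len(indices):
            raise ValueError(f"class {label!r} has too few examples to stratify")
        rng.shuffle(indices)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Recording the seed and the test fraction next to the assumption lets someone else reproduce the exact split later.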

5. Clarify External Data Dependencies

If preprocessing logic relies on external datasets (e.g., pre-trained embeddings, vocabulary lists), explicitly mention these dependencies.

Example:

python
# Assumption: The model's word embeddings are pre-trained on a corpus of English text.
# If a different corpus is used, embeddings need to be retrained.
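
A dependency assumption can also be pinned to a specific version of the external file. The sketch below records a SHA-256 digest alongside the assumption and compares it at load time. The file name `vocab.txt`, its contents, and the helper names are hypothetical; in practice the bytes would be read from disk.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of an external resource's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def check_dependency(name, data, expected_digest):
    """Raise if the resource differs from the version the assumption was written for."""
    actual = fingerprint(data)
    if actual != expected_digest:
        raise RuntimeError(
            f"{name}: digest {actual[:12]} does not match documented {expected_digest[:12]}"
        )


# Hypothetical vocabulary contents used only for illustration.
vocab_bytes = b"the\nof\nand\n"
documented_digest = fingerprint(vocab_bytes)  # recorded once, next to the assumption
check_dependency("vocab.txt", vocab_bytes, documented_digest)  # passes silently
```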

6. Consider Time/Temporal Assumptions

For time series data or sequential models, document any temporal assumptions about data preprocessing (e.g., data continuity, time-window assumptions).

Example:

python
# Assumption: The dataset is time-ordered, and no future data is used in preprocessing.
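
Temporal assumptions are straightforward to verify in code. The following sketch assumes records are `(timestamp, value)` tuples; `time_ordered_split` and `fit_mean` are hypothetical helpers. It checks the ordering, splits at a cutoff, and computes a statistic from the training portion only.

```python
from datetime import datetime


def time_ordered_split(records, cutoff):
    """Split (timestamp, value) records into train/test at `cutoff`."""
    timestamps = [ts for ts, _ in records]
    # Assumption: records are sorted by timestamp (ties allowed).
    if any(later < earlier for earlier, later in zip(timestamps, timestamps[1:])):
        raise ValueError("records are not time-ordered")
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


def fit_mean(train):
    """Compute a statistic from training data only, so no future data leaks in."""
    values = [v for _, v in train]
    return sum(values) / len(values)


records = [
    (datetime(2024, 1, 1), 1.0),
    (datetime(2024, 1, 2), 3.0),
    (datetime(2024, 1, 3), 100.0),
]
train, test = time_ordered_split(records, cutoff=datetime(2024, 1, 3))
train_mean = fit_mean(train)  # uses only the first two records
```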

7. Reference the Source of Assumptions (if applicable)

Where possible, link assumptions to relevant literature, business requirements, or any source that provides context for why the assumption is made.

Example:

python
# Assumption: Feature 'temperature' is expected to be in Celsius. 
# This is based on business guidelines provided by the product team.

8. Use Comments in the Code

Inline comments are a simple and effective way to document assumptions that directly relate to specific code blocks. This helps maintain the context in which the assumptions apply.

Example:

python
# Assumption: 'age' should never be negative; any negative value is treated as an entry error and clipped to 0.
df['age'] = df['age'].clip(lower=0)

9. Version Control for Assumptions

As assumptions can change over time (e.g., with new business requirements or updated data), keep track of changes to assumptions by using version control and documenting them in your commit messages or changelogs.

Example:

bash
git commit -m "Updated assumption: 'age' column no longer has negative values."

10. Include Assumptions in Documentation or README

For more comprehensive documentation, include assumptions in the project’s README or in a formal documentation file (e.g., a Markdown file or a wiki page). This gives anyone who works with the code later a high-level overview of the assumptions.

Example Template for Documenting Assumptions:

markdown
# Preprocessing Assumptions

### 1. Data Format Assumptions
- Input data is in CSV format with headers.
- Columns `age`, `income`, and `education_level` are mandatory.
- Missing values are represented as 'NaN' or empty strings.
  
### 2. Preprocessing Steps Assumptions
- Numerical features are scaled using Min-Max normalization (0-1 scale).
- Categorical features are encoded using one-hot encoding.
- Outliers in the `income` column (values more than 3 standard deviations from the mean) are removed.

### 3. External Dependencies
- Pre-trained word embeddings are based on the English Wikipedia dataset.

### 4. Temporal Assumptions
- Data is ordered by time for time series models.
- No data leakage from future data in training sets.

### 5. Business Assumptions
- The `education_level` feature assumes values from a fixed set: 'high school', 'bachelor', 'master', 'PhD'. Any new value will be treated as 'unknown'.

---

11. Automate Assumption Tracking

In complex ML pipelines, especially those with multiple team members or dynamic datasets, consider automating assumption tracking by incorporating metadata management tools or frameworks that capture assumptions dynamically during preprocessing.
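
As a lightweight starting point before adopting a dedicated tool, assumptions can be stored as structured objects that carry both a description and an executable check. This is a sketch under stated assumptions: the `Assumption` class, `run_checks`, and the two example checks are hypothetical, and rows are assumed to be dicts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assumption:
    name: str
    description: str
    check: Callable[[list], bool]  # returns True when the assumption holds


def run_checks(rows, assumptions: List[Assumption]):
    """Evaluate every assumption and return a report, rather than stopping at the first failure."""
    return {a.name: a.check(rows) for a in assumptions}


ASSUMPTIONS = [
    Assumption(
        name="required_columns",
        description="Every row has 'age', 'income', and 'education_level'.",
        check=lambda rows: all({"age", "income", "education_level"} <= set(r) for r in rows),
    ),
    Assumption(
        name="age_non_negative",
        description="'age' is never negative.",
        # A row without 'age' passes here; 'required_columns' reports that case.
        check=lambda rows: all(r.get("age", 0) >= 0 for r in rows),
    ),
]
```

The report returned by `run_checks` can be logged with each pipeline run, and the `description` fields can be rendered into the README so the written and checked assumptions stay in sync.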

Conclusion

Documenting assumptions in preprocessing logic not only helps maintain clarity within the pipeline but also assists with debugging, future modifications, and collaboration. By integrating assumptions into your workflow documentation, you ensure transparency and avoid potential pitfalls that could arise due to misinterpretation of how data is being handled.
