MSR (Mining Software Repositories) Metrics

The MSR plugin analyzes git history to extract evolution-based architectural insights that complement static analysis. These metrics reveal real-world change patterns that may not be visible in the code structure alone.

Overview

MSR (Mining Software Repositories) is a research field that analyzes version control history to understand software evolution. This plugin implements key MSR metrics:

Churn: Volume of code changes per file/module
Co-change coupling: Files that change together (logical coupling)
Hotspots: Files with high churn AND high centrality (technical debt indicators)
Wrong boundaries: Files that co-change frequently but aren’t linked in the import graph

Metrics

Churn Metrics

Churn measures the volume of code changes over time:

msr.churn_total - Total lines changed (added + deleted) across all files
msr.churn_avg - Average churn per file
msr.churn_max - Maximum churn for a single file
msr.high_churn_file_count - Number of files with high churn (>1000 lines or >50 commits)
msr.commit_count - Total number of commits analyzed

Interpretation:

High-churn files often indicate:
- Technical debt
- Frequently changing requirements
- Unstable modules
- Areas needing refactoring
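
As a rough illustration of how the churn metrics above relate to each other, the sketch below aggregates per-file totals into the five msr.* values. The FileChurn record and the churn_metrics function are hypothetical names used only for this example and are not the plugin's API; the high-churn rule uses the thresholds listed above (>1000 lines or >50 commits).

```python
from dataclasses import dataclass


@dataclass
class FileChurn:
    """Per-file change totals (hypothetical record, not the plugin's type)."""
    path: str
    lines_changed: int  # lines added + deleted
    commits: int        # number of commits touching this file


def churn_metrics(files: list[FileChurn], commit_count: int) -> dict[str, float]:
    """Aggregate per-file churn into the msr.* churn metrics."""
    churns = [f.lines_changed for f in files]
    total = sum(churns)
    # A file is "high churn" if it exceeds either documented threshold.
    high = sum(1 for f in files if f.lines_changed > 1000 or f.commits > 50)
    return {
        "msr.churn_total": total,
        "msr.churn_avg": total / len(files) if files else 0.0,
        "msr.churn_max": max(churns, default=0),
        "msr.high_churn_file_count": high,
        "msr.commit_count": commit_count,
    }


files = [
    FileChurn("src/core/auth.ts", 2500, 40),    # high churn by lines
    FileChurn("src/auth/login.ts", 300, 60),    # high churn by commits
    FileChurn("src/auth/session.ts", 200, 5),
]
metrics = churn_metrics(files, commit_count=80)
print(metrics)
```

Note that a file can count as high churn through many small commits even when its line total is low, as the second file shows.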

Co-change Metrics

Co-change coupling measures how often files change together:

msr.cochange_pairs - Number of file pairs that co-changed
msr.cochange_avg - Average co-change count per pair
msr.cochange_max - Maximum co-change count between any two files

Interpretation:

High co-change indicates logical coupling
Files that co-change should ideally be architecturally linked
Co-change without import dependency suggests wrong boundaries
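
One common way to compute co-change counts is to count, for every unordered pair of files, the commits that touched both. The sketch below assumes the history is already available as a list of changed-file lists (one per commit); the cochange_counts function is a hypothetical helper, and the plugin may compute this differently.

```python
from collections import Counter
from itertools import combinations


def cochange_counts(commits: list[list[str]]) -> Counter:
    """Count, for every unordered file pair, the commits that touched both."""
    pairs: Counter = Counter()
    for changed in commits:
        # Sort and de-duplicate so (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(changed)), 2):
            pairs[(a, b)] += 1
    return pairs


commits = [
    ["src/auth/login.ts", "src/auth/session.ts"],
    ["src/auth/login.ts", "src/auth/session.ts", "src/core/auth.ts"],
    ["src/core/auth.ts"],  # single-file commits produce no pairs
]
pairs = cochange_counts(commits)
metrics = {
    "msr.cochange_pairs": len(pairs),
    "msr.cochange_avg": sum(pairs.values()) / len(pairs) if pairs else 0.0,
    "msr.cochange_max": max(pairs.values(), default=0),
}
print(metrics)
```

Because the number of pairs grows quadratically with the number of files in a commit, very large commits (for example, mass renames) can dominate the counts; whether the plugin filters such commits is not stated here.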

Hotspot Detection

Hotspots are files with both high churn AND high centrality:

msr.hotspot_count - Total number of hotspots detected
msr.hotspot_severe_count - Severe hotspots (churn >2000, centrality >20)
msr.hotspot_moderate_count - Moderate hotspots (churn >1000, centrality >15)

Hotspot Criteria:

A file counts as a hotspot when churn > 500 lines AND centrality > 10.
Severity levels:
- Severe: churn > 2000 AND centrality > 20
- Moderate: churn > 1000 AND centrality > 15
- Mild: churn > 500 AND centrality > 10

Interpretation:

Hotspots are critical technical debt indicators
These files are both frequently changed AND central to the architecture
Prioritize refactoring hotspots to reduce maintenance burden
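
The severity thresholds above can be read as a cascade checked from strictest to most lenient. The following is a minimal sketch of that reading; hotspot_severity is a hypothetical function, and centrality is treated as a plain number because the centrality measure itself is not specified in this document.

```python
from typing import Optional


def hotspot_severity(churn: int, centrality: float) -> Optional[str]:
    """Return the severity level, or None if the file is not a hotspot."""
    # Check the strictest level first so a file gets its highest matching level.
    if churn > 2000 and centrality > 20:
        return "severe"
    if churn > 1000 and centrality > 15:
        return "moderate"
    if churn > 500 and centrality > 10:
        return "mild"
    return None


print(hotspot_severity(2500, 25))  # both values exceed the severe thresholds
print(hotspot_severity(2500, 12))  # high churn, but centrality only clears "mild"
print(hotspot_severity(600, 5))    # churned, but not central: not a hotspot
```

Both conditions must hold at each level, so a very high churn value cannot compensate for low centrality.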

Wrong Boundary Detection

Wrong boundaries are pairs of files that co-change but aren’t architecturally linked:

msr.wrong_boundary_count - Total number of wrong boundaries
msr.wrong_boundary_severe_count - Severe cases (co-change >10 times)

Wrong Boundary Criteria:

Files co-changed ≥3 times
No direct import dependency between them
Both files exist in the import graph

Severity Levels:

Severe: co-change > 10 times
Moderate: co-change > 5 times
Mild: co-change ≥ 3 times

Interpretation:

Wrong boundaries indicate architectural misalignment
Files that change together should be in the same module/domain
Consider restructuring to align with change patterns
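
The three criteria and the severity levels above can be sketched as a filter over co-change counts. This example assumes the import graph is available as a mapping from each file to the set of files it imports, and that "no direct import dependency" means no edge in either direction; both are assumptions for illustration, as are the function names.

```python
def boundary_severity(count: int) -> str:
    """Map a co-change count (already >= 3) to a severity level."""
    if count > 10:
        return "severe"
    if count > 5:
        return "moderate"
    return "mild"


def wrong_boundaries(
    cochange: dict[tuple[str, str], int],
    imports: dict[str, set[str]],
) -> list[dict]:
    """Find file pairs that co-change often but have no direct import edge."""
    found = []
    for (a, b), count in cochange.items():
        if count < 3:
            continue  # not enough evidence of logical coupling
        if a not in imports or b not in imports:
            continue  # both files must exist in the import graph
        if b in imports[a] or a in imports[b]:
            continue  # a direct dependency explains the co-change
        found.append({"file1": a, "file2": b, "cochange_count": count,
                      "severity": boundary_severity(count)})
    return found


imports = {
    "src/auth/login.ts": {"src/core/auth.ts"},
    "src/auth/session.ts": set(),
    "src/core/auth.ts": set(),
}
cochange = {
    ("src/auth/login.ts", "src/auth/session.ts"): 15,  # no import edge
    ("src/auth/login.ts", "src/core/auth.ts"): 12,     # explained by an import
    ("src/auth/session.ts", "src/core/auth.ts"): 2,    # below the threshold
    ("README.md", "src/auth/login.ts"): 20,            # not in the import graph
}
result = wrong_boundaries(cochange, imports)
print(result)
```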

Configuration

The MSR plugin requires a git repository. It automatically detects the repository root by walking up from the source path.

metrics:
  - id: msr
    enabled: true
    config:
      # Optional: limit commits analyzed (default: 10000)
      max_commits: 10000
      # Optional: time range
      since: "2024-01-01T00:00:00Z"
      until: "2024-12-31T23:59:59Z"
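
Repository root detection by walking up from the source path could look roughly like the sketch below. The find_repo_root function is a hypothetical helper, not the plugin's code; it assumes a repository root is any directory containing a .git entry (a directory in a normal clone, a file in worktrees and submodules).

```python
from pathlib import Path
from typing import Optional


def find_repo_root(source_path: Path) -> Optional[Path]:
    """Walk up from source_path until a directory containing .git is found."""
    start = source_path.resolve()
    if start.is_file():
        start = start.parent
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None  # no repository found


# Prints the enclosing repository root, or None outside a git repository.
print(find_repo_root(Path.cwd()))
```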

Policy Examples

Detect High Churn

policy:
  invariants:
    - metric: msr.high_churn_file_count
      op: "<="
      value: 5

Prevent Hotspots

policy:
  invariants:
    - metric: msr.hotspot_count
      op: "<="
      value: 3
    - metric: msr.hotspot_severe_count
      op: "=="
      value: 0

Find Wrong Boundaries

policy:
  invariants:
    - metric: msr.wrong_boundary_count
      op: "<="
      value: 10
    - metric: msr.wrong_boundary_severe_count
      op: "=="
      value: 0
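
To make the semantics of these invariants concrete, the sketch below checks a list of invariants against a dictionary of metric values. It is an illustration only: the violations function is hypothetical, the policy is assumed to be already parsed into Python dictionaries, and only the two operators shown in the examples above are supported.

```python
import operator

# Only the operators that appear in the policy examples above.
OPS = {"<=": operator.le, "==": operator.eq}


def violations(invariants: list[dict], metrics: dict[str, float]) -> list[str]:
    """Return a message for every invariant the metrics do not satisfy."""
    failed = []
    for inv in invariants:
        actual = metrics[inv["metric"]]
        if not OPS[inv["op"]](actual, inv["value"]):
            failed.append(
                f'{inv["metric"]} = {actual}, expected {inv["op"]} {inv["value"]}'
            )
    return failed


invariants = [
    {"metric": "msr.wrong_boundary_count", "op": "<=", "value": 10},
    {"metric": "msr.wrong_boundary_severe_count", "op": "==", "value": 0},
]
metrics = {"msr.wrong_boundary_count": 4, "msr.wrong_boundary_severe_count": 1}
failed = violations(invariants, metrics)
print(failed)
```

Here the total count passes its limit, but the single severe case violates the second invariant.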

Use Cases

1. Technical Debt Identification

Use hotspots to identify files that need refactoring:

metrics:
  - id: msr
policy:
  invariants:
    - metric: msr.hotspot_severe_count
      op: "=="
      value: 0

2. Domain Boundary Validation

Use wrong boundaries to validate that architectural boundaries align with change patterns:

metrics:
  - id: msr
policy:
  invariants:
    - metric: msr.wrong_boundary_severe_count
      op: "=="
      value: 0

3. Change Impact Analysis

Use churn metrics to understand which modules are most volatile:

metrics:
  - id: msr
report:
  format: console
  # Will show top churn files in details

Details Output

The MSR plugin provides detailed information in the details field:

{
  "hotspots": [
    {
      "node_id": "src/core/auth.ts",
      "churn": 2500,
      "centrality": 25,
      "severity": "severe"
    }
  ],
  "wrong_boundaries": [
    {
      "file1": "src/auth/login.ts",
      "file2": "src/auth/session.ts",
      "cochange_count": 15,
      "severity": "severe"
    }
  ],
  "date_range": {
    "first_commit": "2024-01-01T00:00:00Z",
    "last_commit": "2024-12-31T23:59:59Z"
  }
}
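
A downstream script might consume this structure as follows. The sketch assumes the details object has been saved as JSON with exactly the fields shown above; how the details are exported is not covered here, and refactoring_candidates is a hypothetical helper.

```python
import json

details_json = """
{
  "hotspots": [
    {"node_id": "src/core/auth.ts", "churn": 2500, "centrality": 25, "severity": "severe"}
  ],
  "wrong_boundaries": [
    {"file1": "src/auth/login.ts", "file2": "src/auth/session.ts",
     "cochange_count": 15, "severity": "severe"}
  ],
  "date_range": {"first_commit": "2024-01-01T00:00:00Z", "last_commit": "2024-12-31T23:59:59Z"}
}
"""


def refactoring_candidates(details: dict) -> list[str]:
    """List severe hotspots (highest churn first), then severe wrong boundaries."""
    hotspots = sorted(details.get("hotspots", []), key=lambda h: h["churn"], reverse=True)
    lines = [f'hotspot: {h["node_id"]}' for h in hotspots if h["severity"] == "severe"]
    lines += [
        f'wrong boundary: {w["file1"]} <-> {w["file2"]}'
        for w in details.get("wrong_boundaries", [])
        if w["severity"] == "severe"
    ]
    return lines


details = json.loads(details_json)
candidates = refactoring_candidates(details)
for line in candidates:
    print(line)
```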

Limitations

Git Repository Required: The plugin requires a git repository. If no repository is found, it returns empty metrics with a message.
Performance: Analyzing large repositories can be slow. By default, the plugin analyzes at most 10,000 commits (see max_commits).
Line Count Approximation: For performance, line counts are approximated by distributing total diff stats across changed files. For exact per-file counts, a more detailed analysis would be needed.
Time Range: The since/until filter is applied after commits are fetched, which may be inefficient for very large repositories.
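
To illustrate why distributed line counts are only approximate, the sketch below splits a total diff size evenly across the changed files. The even split is an assumption made for this example, since the exact distribution rule is not specified here, and distribute_churn is a hypothetical name.

```python
def distribute_churn(total_lines: int, changed_files: list[str]) -> dict[str, int]:
    """Split a total line count evenly across files, preserving the sum."""
    if not changed_files:
        return {}
    share, remainder = divmod(total_lines, len(changed_files))
    # Hand the leftover lines to the first files so the shares add up to the total.
    return {
        path: share + (1 if i < remainder else 0)
        for i, path in enumerate(changed_files)
    }


approx = distribute_churn(
    100, ["src/core/auth.ts", "src/auth/login.ts", "src/auth/session.ts"]
)
print(approx)
```

Under this scheme, a change that rewrote 98 lines in one file and touched one line in each of the other two would still be recorded as roughly 33 lines per file, which is why per-file churn should be read as an estimate.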

Best Practices

Combine with Static Metrics: Use MSR metrics together with static metrics (SCC, PC, Modularity) for a complete picture.
Focus on Trends: Track MSR metrics over time to identify deteriorating areas.
Prioritize Hotspots: Address severe hotspots first, since they tend to carry the highest maintenance cost.
Validate Boundaries: Use wrong boundaries to validate that your architectural boundaries match real change patterns.
Set Realistic Thresholds: Start with lenient thresholds and tighten them as you refactor.

References

D’Ambros, M., et al. (2010). “Analyzing software evolution through code churn”
Zimmermann, T., et al. (2005). “Mining version histories to guide software changes”
Hassan, A. E. (2009). “Predicting faults using the complexity of code changes”
Hora, A., et al. (2015). “How do developers react to API evolution? The Pharo ecosystem case”