STLF for Flexibility Markets — What Counts as Good and How to Achieve It

TL;DR. “Unknown demand = no market” — short-term load forecasting (STLF) is the enabling layer for every flexibility trade. But the academic answer to what makes a good STLF (low MAPE, hybrid deep-learning models) is the wrong answer for flexibility markets. In a flex market the forecast is not a prediction to be admired — it is a financial instrument whose value is defined entirely by the market’s settlement rules. A 1.47%-MAPE system-load model can be a bad flex-market forecast; a higher-error per-DER probabilistic baseline can be a good one. “Good” here means accurate where errors are expensive, manipulation-resistant where money is at stake, and current at the timing the market clears. And the binding constraint on achieving it in Sweden is not model sophistication but data access and organizational capability. This page reframes Load Forecasting through the market-economics lens, drawing on Baseline Methods, Balancing Markets, Demand Response, Submetering, and Distribution System Operator.

The standard answer — and why it misleads

The forecasting literature gives a clear, well-evidenced answer to “what is a good load forecast”: minimize MAPE, and use hybrid ML/DL models to do it. Hybrids consistently beat single methods — LSTM-BPNN reached 1.47% MAPE vs 3.55% for standalone LSTM in one comparative study (Source - Load Forecasting Methods Survey (2025)). Temperature dominates the inputs; CNN-LSTM and LSTM-Transformer architectures are the documented frontier.

This is correct for aggregate system load, and almost beside the point for flexibility markets. Three reasons it misleads:

The headline accuracy figures are achieved at the transmission/system level, where the law of large numbers smooths thousands of loads into a predictable curve. Flex markets operate at the feeder, household-cluster, and single-DER level, where relative variability is far higher (Source - Load Forecasting Methods Survey (2025)). The very averaging that makes 1.47% achievable disappears at the edge.
MAPE is a symmetric, average-weighted loss. Flex-market economics are asymmetric, threshold-based, and concentrated in a few hours. The metric optimizes the wrong thing.
The literature forecasts gross load. Flex markets settle against a counterfactual baseline and a net imbalance position — different objects, with different — and adversarial — properties.

What flex markets actually reward

To define “good,” start from what the market pays for and penalizes. There are two distinct sides, with two distinct “good”s.

DSO / buyer side — the forecast creates the order. A DSO procures flexibility because it forecasts a specific grid segment will exceed its limit in a specific window; for availability products (LFM-h/p) the forecast defines the activation window in the product itself (Load Forecasting › Why forecasting matters for flexibility markets). A good DSO forecast correctly identifies which hours and which locations are congestion-risk. Error means buying flexibility that isn’t needed, or missing congestion entirely. This is feeder/segment-level, weather- and DER-inventory-driven.

FSP / aggregator side — the forecast constructs the bid and avoids penalty. The FSP must forecast (a) the baseline (what the resource would have done absent activation — the settlement counterfactual) and (b) the portfolio’s net position (which drives imbalance charges). This is household-cluster and per-DER level.

These two “good”s are not the same forecast, are not at the same granularity, and are not equally achievable.

Five ways “good for flex” diverges from “low MAPE”

1. The wrong object — baseline, not gross load

In flex settlement the relevant quantity is the baseline: the counterfactual consumption absent activation, against which delivered flexibility is measured (Baseline Methods). A model that nails total building load can still produce a useless baseline, because the baseline is a prediction of a counterfactual that never happens — and for DERs no pre-committed schedule exists to anchor it. The forecasting problem is structurally harder than predicting realized load, because there is no ground truth to validate against on activation days.

2. The wrong granularity — variability beats the averaging

Aggregate STLF is easy precisely because individual volatility cancels. At the level flex markets clear — a heat pump, an EV charger, a 500-battery cluster — load is near-binary, behavior- and weather-driven, with no smooth historical mean. A data gap in this wiki names it directly: what MAPE is achievable at household-cluster level is unknown, yet that is the level at which aggregators must forecast to enter the BRP framework at scale. Edge-level accuracy, not system-level accuracy, is the achievement that matters.
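A rough way to see why the averaging disappears at the edge, assuming (purely for illustration) N independent, identical on/off devices that are each on with probability p: the coefficient of variation of their total load is sqrt((1 - p) / (p N)). Real devices are correlated through weather and behavior, so smoothing in practice is weaker than this sketch suggests. The function name and the numbers are hypothetical.

```python
import math

def relative_variability(n_devices: int, p_on: float) -> float:
    """Coefficient of variation (std / mean) of the total load of
    n_devices independent on/off devices, each on with probability p_on.
    Illustrative assumption only: real loads are correlated."""
    return math.sqrt((1.0 - p_on) / (p_on * n_devices))

# Compare a single device, small clusters, and a system-scale population.
for n in (1, 50, 500, 100_000):
    print(f"{n:>7} devices: CV = {relative_variability(n, p_on=0.2):.3f}")
```

Under these assumptions the relative spread shrinks only with the square root of the number of devices, which is why a per-DER or small-cluster forecast cannot inherit system-level accuracy.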

3. The wrong loss function — cost-weighted, threshold, peak-hour

MAPE treats all errors equally. The market does not:

Imbalance settlement is price- and direction-weighted. BRP forecast error is settled at the balanskraftspris under the Nordic single-price (enpris) model — so an error against the system in a scarcity hour can cost orders of magnitude more than the same percentage error on a calm night (Balancing Markets › Cost allocation, Source - Nordic Imbalance Settlement Handbook v5.2 (2025)). Winter 2025/26 saw a balancing energy price of 7,363 EUR/MWh — the day to be wrong was catastrophically more expensive than average.
Delivery verification is a cliff. SWITCH requires meeting a 75% delivery threshold; a baseline that is “off by 10%” can mean 100% loss of the activation payment if it tips the resource under the line (Baseline Methods › Capacity clearing vs. energy settlement). The loss function is a step, not a slope.
The scarcity/peak hours dominate. Accuracy on the few highest-peak, low-wind days — which drive congestion, activation, and utnyttjningsgrad — matters far more than average-day accuracy. Those are exactly the days with the least historical precedent. A flex-good forecast is deliberately overweighted toward the tail.
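The gap between MAPE and a cost-weighted loss can be shown with a toy example. This is a minimal sketch with invented numbers: two forecasts make the same 10 MWh error, one in a calm hour and one in a scarcity hour. The price-weighted metric is a stylized stand-in for imbalance exposure, not the actual settlement formula.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def price_weighted_error(actual, forecast, prices):
    """Sum of |error| x price: a stylized stand-in for imbalance exposure
    (ignores the sign of the error relative to system direction)."""
    return sum(abs(a - f) * p for a, f, p in zip(actual, forecast, prices))

# Hypothetical four-hour example; the last hour is a scarcity hour.
actual = [100.0, 100.0, 100.0, 100.0]      # MWh
prices = [30.0, 30.0, 30.0, 3000.0]        # EUR/MWh, invented
calm_miss = [110.0, 100.0, 100.0, 100.0]   # 10 MWh error in a calm hour
peak_miss = [100.0, 100.0, 100.0, 110.0]   # same error in the scarcity hour

for name, fc in (("calm_miss", calm_miss), ("peak_miss", peak_miss)):
    print(name, mape(actual, fc), price_weighted_error(actual, fc, prices))
```

Both forecasts have identical MAPE, yet the price-weighted error of the peak-hour miss is a hundred times larger. A model selected on MAPE alone cannot tell them apart.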

4. The adversarial dimension — manipulation-resistance and recalculability

A statistical forecast has no adversary; a settlement baseline does. The FSP is paid against it and can game it: the sports-arena case (lights switched on before the measurement window to inflate the reference and be paid for a phantom reduction) and the MBMA battery problem (switching charge→discharge before the pre-reading) are documented Swedish examples (Baseline Methods › FSP experience — baseline as a participation barrier, Baseline Methods › MBMA and the battery integrity problem). So “good” includes properties orthogonal to accuracy: NC DR Art. 14 requires baselines be “recalculable, transparent, precise, accurate, and unbiased” — note transparent and unbiased sitting beside accurate. A slightly less accurate method that cannot be gamed is better for the market than a more accurate one that can.
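The gaming mechanism is easy to reproduce in a stylized meter-before/meter-after calculation. The sketch below assumes the simplest possible rule (baseline = the reading just before activation); actual SWITCH or MBMA rules are not reproduced here, and all readings are invented.

```python
def mbma_delivered_kw(pre_reading_kw: float, during_reading_kw: float) -> float:
    """Meter-before/meter-after in its simplest form: the pre-activation
    reading is the baseline, and delivery is the drop relative to it."""
    return max(0.0, pre_reading_kw - during_reading_kw)

normal_load_kw = 100.0

# Honest case: the site changes nothing, so nothing is credited.
honest = mbma_delivered_kw(pre_reading_kw=normal_load_kw,
                           during_reading_kw=normal_load_kw)

# Gamed case: load is raised just before the reading, then returned to normal.
gamed = mbma_delivered_kw(pre_reading_kw=140.0,
                          during_reading_kw=normal_load_kw)

print(honest, gamed)
```

In the gamed case the site is credited with a reduction although its consumption during activation equals its normal load. The baseline is accurate as a measurement and still wrong as a counterfactual.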

5. Information currency — the forecast must be fresh at gate closure

A model that is most accurate offline is not most useful if the market needs the number a day ahead. CoordiNet FSPs noted that a D-1 baseline “loses all information that is added between the day before and the control occasion … you can have a very good forecast, but it is a dynamic system” (Baseline Methods › FSP experience — baseline as a participation barrier). Forecast value decays with staleness relative to the market’s gate-closure timing. A good flex forecast is the best one usable at the moment the market clears, not the best one computable in principle.

The imbalance asymmetry — which forecasting bias is worst

Divergence 3 has a specific, decision-relevant answer that every FSP weighs: the economically worst forecasting bias is systematic over-optimism about deliverable flexibility — availability, battery state-of-charge, or a too-high baseline. The reason is the Nordic single-price (enpris) settlement design (Balancing Markets › Cost allocation, Source - Nordic Imbalance Settlement Handbook v5.2 (2025)).

Under single pricing, one imbalance price applies per settlement period regardless of the sign of the imbalance, and that price is set by the direction the system needed regulation: system short → high (up-regulation) price; system long → low/negative price. So an FSP is essentially never penalized for helping the system — the entire risk lives in being on the wrong side during scarcity. Swedish flex products are overwhelmingly upward, and upward flex is by construction called when the system is short — i.e., during the high-price, scarcity hours when the imbalance price (mFRR-based, strongly spot-correlated) is highest.
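The single-price logic can be sketched as a one-line cash-flow function. This is a simplification under stated assumptions: results are measured relative to a spot reference price (an assumption made here for illustration), fees are ignored, and all prices are invented.

```python
def imbalance_result_vs_spot(imbalance_mwh: float,
                             imbalance_price: float,
                             spot_price: float) -> float:
    """Gain (+) or loss (-) in EUR relative to having traded the same
    volume at spot. imbalance_mwh > 0 means the portfolio is long
    (a surplus versus its position); < 0 means it is short."""
    return imbalance_mwh * (imbalance_price - spot_price)

SPOT = 100.0  # EUR/MWh, hypothetical
cases = {
    "short into short system": (-2.0, 500.0),
    "long into short system": (2.0, 500.0),
    "short into long system": (-2.0, 20.0),
    "long into long system": (2.0, 20.0),
}
for label, (volume, price) in cases.items():
    print(f"{label:>24}: {imbalance_result_vs_spot(volume, price, SPOT):+.0f} EUR")
```

With these numbers, the two positions that help the system come out ahead of spot, and the worst outcome by far is being short into a short system.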

That sets up a sharp asymmetry:

Bias	Mechanism	Cost
Over-optimism (over-commit / SOC too high / baseline too high)	Under-delivers → short into a short system → settled at the penal up-regulation price; may also trip the 75% delivery cliff → total loss of activation payment	Cash loss, concentrated in the most expensive hour
Conservatism (under-commit / bid below capability)	Delivers fully or over-delivers; any imbalance tends to be favorable	Opportunity cost only

The deeper principle: it is not the magnitude of the bias that ruins you, it is its correlation with price. Imbalance cost = error × price; a zero-mean, scarcity-uncorrelated error nets out cheaply across many periods at the single price, while a bias that leaves the FSP short during scarcity concentrates all its cost in the highest-price hours. And those hours — cold snaps, low wind, peak demand — are exactly the ones a historical-average forecast handles worst, because they have the least precedent in the training data. The statistical weakness and the financial penalty coincide.
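The "correlation, not magnitude" point follows from a standard identity. As a sketch, let e_t be the shortfall in settlement period t (positive when the FSP is short) and p_t the imbalance price in that period; both symbols are introduced here for illustration only.

```latex
C = \sum_{t=1}^{T} e_t \, p_t,
\qquad
\mathbb{E}[e_t \, p_t] = \mathbb{E}[e_t]\,\mathbb{E}[p_t] + \operatorname{Cov}(e_t, p_t).
```

A zero-mean error removes the first term. What remains is the covariance term, which is positive exactly when shortfalls cluster in high-price periods.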

Two corollaries:

Battery-specific: over-optimism about SOC is the acute form — the AEM/NEM energy-management rules (Energy Storage › NEM and AEM — operational energy management) can pull a battery out of delivery exactly when called, producing under-delivery into scarcity.
Sweden-specific amplifier: with the BSP/compensation framework still a “paper construction” (no functional cross-BRP perimeter correction; Model 3/4 compensation unbuilt), activation accounting is messier than in Finland/Norway, so baseline and net-position errors bleed into imbalance more readily (Independent Aggregation in Sweden — The Implementation Gap).

The rational response — and the reason property 6 below matters — is deliberate conservatism: bid below expected capability and hold SOC headroom, sized against the asymmetric loss using the distribution of the forecast, not a point estimate.
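If the loss is approximated as linear on both sides (a per-MW cost for shortfall and a smaller per-MW opportunity cost for unused capability), the cost-minimizing bid is a quantile of the capability distribution, as in the classic newsvendor problem. The sketch below assumes that simplification; the cost values and samples are hypothetical, and the delivery cliff is deliberately left out.

```python
import math

def conservative_bid_mw(capability_samples_mw, shortfall_cost, opportunity_cost):
    """Bid that minimizes expected linear asymmetric loss: the empirical
    quantile of deliverable capability at level
    opportunity_cost / (opportunity_cost + shortfall_cost)."""
    level = opportunity_cost / (opportunity_cost + shortfall_cost)
    ordered = sorted(capability_samples_mw)
    rank = max(1, math.ceil(level * len(ordered)))  # nearest-rank quantile
    return ordered[rank - 1]

# Hypothetical probabilistic forecast of deliverable flexibility (MW).
samples = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]

# Shortfall assumed three times as costly per MW as unused capability.
bid = conservative_bid_mw(samples, shortfall_cost=150.0, opportunity_cost=50.0)
print(bid, sum(samples) / len(samples))
```

The more expensive a shortfall is relative to the opportunity cost, the lower the quantile and the further the bid sits below the mean forecast.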

A working definition of “good STLF for flexibility markets”

Pulling the five divergences together, and adding the probabilistic requirement that follows from the asymmetric loss, a forecast is good for flex markets to the extent it has six properties:

1. Right object — forecasts the counterfactual baseline and net imbalance position, not just gross load
2. Right granularity — accurate at feeder / household-cluster / per-DER level, where variability is high
3. Right loss function — minimizes cost-weighted, threshold-aware, peak-hour error, not average MAPE
4. Manipulation-resistant and recalculable — transparent, unbiased, auditable (NC DR Art. 14)
5. Information-current — usable at the market’s gate-closure timing
6. Probabilistic, not point — communicates uncertainty so the bidder can size bids to risk (under-delivery cliff vs. imbalance cost)

Property 6 is the quiet one: because the loss function is asymmetric and threshold-based, a bidder needs the distribution, not a point estimate, to choose a bid that balances the 75%-delivery cliff against imbalance exposure. The Energiforsk “what-if” behavioral scenario approach — publishing assumptions and ranges rather than a single number — is the methodological expression of this (Distribution Network Development Plan › DNDP digitalization and automation).
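Because the delivery threshold makes the payoff a step function, the bid has to be evaluated scenario by scenario. The following sketch uses a stylized settlement rule that is an assumption, not the actual SWITCH rule: delivered MW are paid only if delivery reaches 75% of the bid, and any shortfall is charged at an imbalance price. Prices and scenarios are invented.

```python
def net_revenue(bid_mw, capability_mw, activation_price, imbalance_price,
                threshold=0.75):
    """Stylized settlement for one scenario: pay for delivered MW only if
    delivery reaches threshold x bid; charge any shortfall at the
    imbalance price."""
    delivered = min(bid_mw, capability_mw)
    shortfall = bid_mw - delivered
    payment = activation_price * delivered if delivered >= threshold * bid_mw else 0.0
    return payment - imbalance_price * shortfall

def expected_net_revenue(bid_mw, scenarios_mw, activation_price, imbalance_price):
    """Average over equally likely capability scenarios."""
    total = sum(net_revenue(bid_mw, c, activation_price, imbalance_price)
                for c in scenarios_mw)
    return total / len(scenarios_mw)

scenarios = [4, 6, 8, 10, 12]   # MW of deliverable capability, hypothetical
ACT, IMB = 100.0, 300.0         # EUR/MW, hypothetical

point_bid = sum(scenarios) // len(scenarios)  # bid the mean forecast
best_bid = max(range(1, 13),
               key=lambda b: expected_net_revenue(b, scenarios, ACT, IMB))

print(point_bid, expected_net_revenue(point_bid, scenarios, ACT, IMB))
print(best_bid, expected_net_revenue(best_bid, scenarios, ACT, IMB))
```

With these numbers, bidding the mean forecast earns less in expectation than a bid well below it. A point estimate alone gives the bidder no way to find that lower bid.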

How to achieve it — ranked by what actually binds

The order matters: in Sweden the constraint is rarely the algorithm.

1. Data access — the binding constraint. Household- and DER-level STLF is impossible without clean, standardized, granular data. Two enablers, both incomplete:

Submetering / dedicated measurement devices (DMDs) give a per-device signal, removing whole-building noise and letting different baseline methods apply to different DER types in one portfolio (Submetering, Baseline Methods › Submetering as enabling infrastructure for baseline accuracy). Art. 7b of the Electricity Regulation (EU) 2019/943, inserted by Regulation (EU) 2024/1747, creates a customer right to install DMDs; Swedish implementation rules are pending. This is the single biggest accuracy lever on the FSP side.
The DHV/FIS data backbone is the prerequisite for scalable household STLF — standardized metering and settlement data across DSO boundaries. Until it exists (proposal due September 2026), data access, not modeling, is the structural barrier (Load Forecasting › Swedish context).

2. Bottom-up DER inventory. You cannot forecast a feeder without knowing what is connected to it. The PREDATOR analyze–aggregate–extrapolate framework formalizes this: identify current EV/solar/heat-pump/battery inventory from AMI and registers, build current-state feeder profiles, then extrapolate with growth and socioeconomic data (Source - Energiforsk 2026-1168 AI-modeller Prognostisering Efterfrågan El (2026)). The “analyze” step is the foundation; the RISE AMI classification tool addresses the same gap.
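The aggregate and extrapolate steps can be sketched in a few lines. This is not the PREDATOR implementation: the device counts, per-device profiles, and growth rates are invented, the analyze step is assumed to have been done already, and socioeconomic drivers are reduced to a single growth rate per DER type.

```python
def feeder_profile_kw(inventory, unit_profiles_kw):
    """Aggregate step: sum per-device profiles weighted by device counts."""
    hours = len(next(iter(unit_profiles_kw.values())))
    return [
        sum(count * unit_profiles_kw[der][h] for der, count in inventory.items())
        for h in range(hours)
    ]

def extrapolate_inventory(inventory, annual_growth, years):
    """Extrapolate step: compound each DER count by its assumed growth rate."""
    return {der: count * (1.0 + annual_growth[der]) ** years
            for der, count in inventory.items()}

# Analyze step (assumed already done): counts identified from AMI and registers.
inventory = {"ev_charger": 40, "heat_pump": 120}
# Invented average kW per device for three evening hours.
unit_profiles_kw = {"ev_charger": [1.0, 3.0, 2.0], "heat_pump": [1.5, 2.0, 2.0]}
annual_growth = {"ev_charger": 0.25, "heat_pump": 0.0}

today = feeder_profile_kw(inventory, unit_profiles_kw)
in_two_years = feeder_profile_kw(
    extrapolate_inventory(inventory, annual_growth, years=2), unit_profiles_kw)
print(today, in_two_years)
```

The structure makes the dependency explicit: any error in the inventory propagates linearly into every hour of the feeder profile, which is why the analyze step is the foundation.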

3. Hybrid ML/DL — where it actually helps. CNN-LSTM and similar models earn their complexity specifically on the EV and heat-pump non-linearity that invalidates historical averaging and that statistical baselines (XofY, rolling average) handle worst (Baseline Methods › Methods, Source - Load Forecasting Methods Survey (2025)). But they are only as good as the data feeding them and only worth deploying above a capability threshold most DSOs do not clear (next point). ML is a lever on granularity and non-linearity, not a substitute for data.

4. Probabilistic / scenario methods over point forecasts. Given the asymmetric loss, communicating uncertainty is itself an accuracy improvement in decision terms. Scenario-based “what-if” forecasting (publishing behavioral assumptions and ranges) is more honest and more useful than false-precision point estimates, especially for the tail hours that dominate the economics.

This is the DSO-side mirror of the FSP imbalance asymmetry, and E.ON’s SWITCH operation demonstrates it in practice (operator-reported, recent winter seasons). To stop under-forecasting peak demand and missing the highest-congestion hours, E.ON moved order generation and activation from a central point forecast (mean/P50) to an upper-quantile, high-confidence-interval (high-CI) forecast. The trade-off is explicit insurance: the high-CI forecast generates substantial excess availability remuneration for capacity that was never needed, but it succeeds where it matters. Far more of the actual congestion is managed by the market, because the earlier central forecast too often missed the very peak hours it existed to predict, producing poor results in prior seasons.

The systematic under-estimation of peak demand is observed broadly across DSO and third-party ML forecasters. It is exactly the tail-hour failure this page predicts, since cold-snap/low-wind peaks have the least precedent in training data. Each party rationally abandons the point forecast toward the quantile that protects against its catastrophic error: the FSP biases down on its own deliverable flex; the DSO biases up on demand. The “insurance cost” is the cost-weighted loss function made operational: a bounded, known over-procurement cost paid to avoid the unbounded cost of unmanaged congestion (market failure → curtailment or reinforcement). The congestion nodes in question are sub-transmission transformers of roughly 15–50 MW/MVA (rising to ~800 MW for large regional constraints such as Södra Skåne / Sege–Arrie), where relative variability is high enough that a single forecast miss flips the decision to call the market.
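The effect of moving from a central to an upper-quantile forecast on order generation can be sketched as follows. The samples, the node limit, and the choice of the 90th percentile are all hypothetical; the quantile E.ON actually uses is not stated in the source.

```python
import statistics

def nearest_rank_percentile(samples, percent: int) -> float:
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = max(1, -(-percent * len(ordered) // 100))  # ceil(percent * n / 100)
    return ordered[rank - 1]

def flexibility_order_mw(forecast_mw: float, limit_mw: float) -> float:
    """Order size: forecast overload above the node limit, never negative."""
    return max(0.0, forecast_mw - limit_mw)

# Hypothetical peak-hour forecast samples for one transformer (MW).
peak_samples = [38.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 47.0, 49.0, 52.0]
LIMIT_MW = 46.0

p50_order = flexibility_order_mw(statistics.median(peak_samples), LIMIT_MW)
p90_order = flexibility_order_mw(nearest_rank_percentile(peak_samples, 90), LIMIT_MW)
print(p50_order, p90_order)
```

With these samples the median forecast sits below the limit and generates no order, while the upper quantile does. In the scenarios where load stays below the limit, that order is the insurance cost; in the scenarios above it, the median forecast would have left the congestion unmanaged.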

5. Capacity-limit products — the forecasting bypass (but see the design caveat below). When good STLF is unachievable, change the product, not the model. Operating-envelope / capacity-limit products (and the Netherlands GOPACS schedule-modification model) define a power cap at the connection point and verify against it — avoiding the counterfactual baseline entirely (Baseline Methods › Capacity-limit products as a baseline alternative, Source - EC LFM Specification and Design Criteria (VITO, 2025)). This trades forecasting burden for a different clearing architecture, and is the right design choice for LV congestion, high-manipulation-risk resources, and early markets where settlement complexity blocks participation. Swedish experience is mixed — NODES MaxUsage (used and then discontinued at Kinnekulle Energi) is the cautionary case — and E.ON has been hesitant for two reasons that are both correct: baselining creeps back (the cap is often defined as a reduction from expected consumption — a counterfactual), and “what are we actually paying for” is unanswerable when the cap may never bind. The design recommendation below addresses both.

6. Organizational capability — the real bottleneck. PREDATOR’s central finding: many Swedish DSOs have no internal data-driven forecasting model at all, relying on expert judgment and the Energiforsk lathund; the national-method work found 32% lack any documented methodology. DSOs ask for holistic solutions, not components, because they lack staff combining power-systems and data-science skills (Source - Energiforsk 2026-1168 AI-modeller Prognostisering Efterfrågan El (2026)). The accuracy frontier is irrelevant to an organization that cannot operate a model, which connects directly to Small DSO Capacity — The Binding Constraint on Swedish Flexibility Policy.

Designing the cap product

A firm-cap product is the practical embodiment of this bypass, but a naive design fails exactly as E.ON found — the cap defined as “a reduction from expected consumption” smuggles the counterfactual back in, and a cap that never binds pays for nothing. The corrected design (anchor the cap to the firm contractual level as a soft-to-firm conversion or a reduction below it; price it as headroom insurance; set the strike with the high-CI forecast; settle on metered cap-compliance) is set out as recommendation B5 in the SWITCH markets analysis. The key link to this page: the cap product does not escape forecasting, it relocates the counterfactual from per-activation settlement to once-per-node targeting — handled by the same high-CI forecast E.ON already runs for order generation. The forecast that decides when to call the market is the forecast that decides where to set the cap.
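Settlement on metered cap-compliance needs no baseline input at all, which a minimal sketch makes visible. The cap level and readings below are invented, and real product rules (tolerance bands, partial payment, interval length) are not specified in the source and are not modeled.

```python
def cap_compliant(metered_kw, cap_kw: float) -> bool:
    """True if every metered interval in the window is at or below the cap."""
    return all(reading <= cap_kw for reading in metered_kw)

def worst_exceedance_kw(metered_kw, cap_kw: float) -> float:
    """Largest overshoot above the cap in the window (0.0 if none)."""
    return max(0.0, max(metered_kw) - cap_kw)

CAP_KW = 800.0  # hypothetical cap at the connection point
window_ok = [640.0, 710.0, 795.0, 780.0]
window_breach = [640.0, 710.0, 830.0, 780.0]

print(cap_compliant(window_ok, CAP_KW), worst_exceedance_kw(window_ok, CAP_KW))
print(cap_compliant(window_breach, CAP_KW), worst_exceedance_kw(window_breach, CAP_KW))
```

The only inputs are the meter readings and the cap. The forecast enters earlier, when the cap level is chosen, not at verification.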

The Swedish reality

The operational STLF layer is unstandardized and commercially held: individual aggregators, BRPs, and energy-service providers develop it as private competency, while only the LTLF/MTLF planning layer (DNDP/FNA) has a national method (Load Forecasting › Swedish context). No public benchmark exists for MAPE achieved at grid-segment (DSO) or aggregated-household (FSP) level in Swedish deployments — a tracked data gap. The result: “good” is currently defined per-actor and unverifiable across the market. The DHV/FIS backbone and DMD implementation are the two interventions that would turn STLF from a private moat into shared infrastructure — and shift the competitive frontier from data access to modeling skill, which is where the literature’s answer finally becomes the relevant one.

What to watch

DMD implementation rules (Art. 7b) — whether Sweden mandates settlement-grade submetering, the biggest FSP-side accuracy lever.
DHV/FIS data architecture (proposal Sept 2026) — whether it exposes standardized, API-accessible metering for third-party STLF.
NC DR baseline T&C — how Sweden codifies “recalculable, transparent, unbiased”; which methods enter the national register.
Capacity-limit product uptake — whether DSOs adopt operating envelopes to sidestep baselines where STLF is weak.
A public STLF benchmark — any DSO- or FSP-level MAPE data at feeder/household-cluster granularity would be the first hard evidence of what “good” actually is in Sweden.

Data gaps

Public benchmark of STLF accuracy at feeder (DSO) and household-cluster (FSP) level in Swedish flex markets — none exists
Whether Swedish DSOs use point or probabilistic forecasts to define LFM-h/p activation windows (E.ON SWITCH high-CI practice is operator-reported; corroborate against published SWITCH season documentation, including the quantified excess-availability “insurance” cost)
How the 75% delivery threshold interacts with baseline forecast error in practice — measured revenue impact of baseline miss
DMD settlement-grade vs indicative-grade decision in NC DR T&C — determines per-DER baseline accuracy ceiling
Whether DESGRID (PREDATOR follow-up) is funded — the route to an operational DSO forecasting tool
Whether E.ON/SWITCH pilots a firm-cap operating-envelope product (design proposed here) and how it performs vs the discontinued NODES MaxUsage

Related pages

Load Forecasting — the concept page: horizons (VSTLF/STLF/MTLF/LTLF), methods, metrics
Baseline Methods — the counterfactual problem; nine methods; manipulation and integrity
Balancing Markets — imbalance settlement as the cost function for BRP forecast error
Demand Response — what STLF enables; synchronization risk at scale
Submetering — DMDs as the per-DER accuracy enabler (Art. 7b)
Distribution System Operator — DSO forecasting-capability gap
Small DSO Capacity — The Binding Constraint on Swedish Flexibility Policy — the organizational bottleneck
The Swedish BESS Business Case — Revenue Stacking and the FCR Saturation Problem — imbalance cost as a driver of battery STLF need
Elmarknadshubb — DHV/FIS as the data backbone that makes household STLF scalable
SWITCH — the E.ON market platform; TO/DO products and baseline methods the cap product would extend
E.ON SWITCH Markets — Supply Liquidity and Growth Recommendations — operational SWITCH analysis; natural home for the capacity-limit product recommendation
Kinnekulle Energi — NODES MaxUsage as the discontinued Swedish capacity-limit precedent