Extensible Spaced-Repetition Simulator
This project is a small, dependency-light simulator inspired by "What will a general simulator of spaced repetition consist of?" that mirrors the Rust FSRS simulator ideas in Python. It separates the simulator into four modules (memory model, behavior model, cost model, and scheduler) so you can stress-test schedulers against richer real-world assumptions, with an event-driven reference engine and a batched tensor engine for multi-user sweeps.
Quickstart
Install dependencies with uv, then run a quick simulation (no logs, just plots). Quickstart assumes ../srs-benchmark and ../Anki-button-usage are available; see Requirements and data.
uv sync
uv run simulate.py --priority new-first --days 90 --no-log
The single-user CLI uses the event engine and can emit per-event records:
uv run simulate.py --engine event --priority new-first --days 90 --log-reviews
For larger retention sweeps, use the batched tensor entrypoint:
uv run experiments/retention_sweep/run_sweep_users_batched.py --start-user 1 --end-user 200 --env lstm --sched fsrs6,anki_sm2,memrise --torch-device cuda
Requirements and data
- Python 3.13+ (matches pyproject.toml).
- Use uv sync to install dependencies. Torch is pulled from the uv indexes declared in pyproject.toml (CUDA builds on Windows/Linux, CPU builds on macOS).
- Plotly is included for interactive FSRS6 ADR policy-surface HTML plots.
- cma/pycma is included for CMA-ES black-box policy search in RL scheduler experiments, including ADR and AP schedulers.
- The srs-benchmark repo is expected next to this repo at ../srs-benchmark (override with --srs-benchmark-root). It provides FSRS/HLR/DASH weights in result/*.jsonl and LSTM weights in weights/LSTM/<user_id>.pth.
- Generate LSTM weights in the srs-benchmark repo by running:
uv run script.py --algo LSTM --weights
- The Anki-button-usage repo is required by simulate.py and defaults to ../Anki-button-usage/button_usage.jsonl (override with --button-usage). Pass --user-id to select the matching per-user row. Button usage loads marginal rating probabilities and costs by default; pass --review-markov-transition to opt into long_term_transition review-button Markov behavior.
- SSP-MMC policies require precomputed policy files. Generate them in the sibling repo, then point SSPMMCScheduler at the outputs (see ../SSP-MMC-FSRS).
CLI usage
Simulation logs store metadata and totals by default; add --log-reviews to include per-event logs (can be large). Daily time series are written to a sidecar CSV file with the same basename as the JSONL log.
Common examples:
uv run simulate.py --days 30 --deck 500 --learn-limit 20 --review-limit 200 --cost-limit-minutes 60 --seed 7 --no-progress --no-log
uv run simulate.py --env fsrs3 --sched fsrs3 --desired-retention 0.85 --no-log
uv run simulate.py --env fsrs6 --sched hlr --desired-retention 0.8 --no-log
uv run simulate.py --sched fsrs6 --scheduler-priority high_difficulty --no-log
uv run simulate.py --sched fixed@7 --priority review-first --no-log
uv run simulate.py --log-dir logs/runs --days 180 --seed 123
uv run simulate.py --sched sspmmc --sspmmc-policy ../SSP-MMC-FSRS/outputs/policies/<policy>.json --no-log
Flag notes:
- --no-plot and --no-progress disable the Matplotlib dashboard and progress bar.
- --log-dir controls where JSONL logs and daily CSVs are written.
- --button-usage points at a button-usage JSONL file to override default costs and rating probabilities. Review-button Markov transitions from long_term_transition are ignored unless --review-markov-transition is set.
- --benchmark-result and --benchmark-partition override which srs-benchmark result rows are loaded.
- --fuzz applies Anki-style interval fuzzing to scheduler outputs.
FSRS6 priority modes: low_retrievability, high_retrievability, low_difficulty, high_difficulty.
RL experiment infrastructure
The rebooted RL scheduler experiment infrastructure starts from checked-in TOML profiles and machine-readable stage records:
uv run python experiments/rl_scheduler/run_experiment.py --config experiments/rl_scheduler/configs/fsrs6_adr_linear_cmaes_users_1_8.toml --stage all --run-id fsrs6_adr_linear_cmaes_users_1_8_v1
uv run python experiments/rl_scheduler/select_fsrs6_baseline_drs.py --config experiments/rl_scheduler/configs/fsrs6_adr_linear_portfolio_users_1_8.toml
uv run python experiments/rl_scheduler/select_fsrs6_baseline_drs.py --config experiments/rl_scheduler/configs/fsrs3_scheduler_users_1_8.toml --scheduler fsrs3 --output-manifest artifacts/rl_scheduler/baseline_dr_selection/fsrs3_users_1_8_16dr_pop16_gen5.json
uv run python experiments/rl_scheduler/run_experiment.py --config experiments/rl_scheduler/configs/fsrs6_adr_linear_portfolio_users_1_8.toml --stage all --run-id fsrs6_adr_linear_portfolio_users_1_8_v1
uv run python experiments/rl_scheduler/run_experiment.py --config experiments/rl_scheduler/configs/fsrs6_ap_cmaes_users_1_8.toml --stage all --run-id fsrs6_ap_cmaes_users_1_8_v1
uv run python experiments/rl_scheduler/run_portfolio_workflow.py --config experiments/rl_scheduler/configs/anki_sm2_ap_portfolio_users_1_8_pop16_20_v1.toml --run-id anki_sm2_ap_portfolio_users_1_8_pop16_20_v1 --skip-manifest --skip-baseline-sweep
uv run python experiments/rl_scheduler/inspect_run.py --run-root artifacts/rl_scheduler/fsrs6_adr_linear_cmaes_users_1_8/fsrs6_adr_linear_cmaes_users_1_8_v1
uv run python experiments/rl_scheduler/validate_artifact.py --metadata <artifact_metadata.json> --require-files
uv run python experiments/rl_scheduler/plot_fsrs6_adr_policy_surfaces.py --train-run-root artifacts/rl_scheduler/<profile>/<run-id> --users 1,2
The FSRS6 ADR surface plotter supports ordinary ADR artifacts by baseline DR and
ADR portfolio child artifacts by memorized_average / deck.
Stages of run_experiment.py (selected with --stage):
- dry-run validates the TOML and prints resolved commands without writing formal outputs.
- preflight writes a config snapshot, resolved config, command record, run record, GPU summary, gate summary, manifest, and preflight summary under the configured output_root.
- stage-baseline validates FSRS6 JSONL log metadata, including configured baseline.desired_retention_values or per-user [baseline_dr_selection] manifest values, and the configured simulation.review_markov_transition mode. It stages exact baseline logs by copy or hardlink without staging CSV sidecars.
- train-overfit runs the user-provided training.command_template once per training user and lambda value by default; portfolio trainers run once per training user and do not use training.lambda_grid. When [training.batch].enabled = true, supported in-tree RL trainers run in one Python process and batch multiple users into the same batched tensor simulation call, avoiding GPU multi-process requirements while preserving per-user artifact directories. It then requires scheduler policy artifact metadata under the command output directory.
- sweep can run the configured batched retention sweep from the same TOML and validates the resulting JSONL logs.
- build-pareto fans out build_pareto.py per user.
- analyze-pareto writes both analysis.md and machine-readable analysis_summary.json from those Pareto JSON files.
- all runs the configured stages in order and stops at the first non-zero stage result.
CUDA train-overfit and sweep stages automatically write GPU monitor artifacts under <stage>/gpu_monitor/; the formal performance_summary.json links to the monitor summary. Formal Markov mode defaults to simulation.review_markov_transition = false; new formal logs, artifacts, and reports record this field so Markov-off runs are not mixed with legacy Markov-on results.
Portfolio profiles should use run_portfolio_workflow.py as the standard entry point so baseline DR selection, baseline sweep, formal stages, and any configured [report] step run in order. Formal stages fail if required inputs are missing. Scheduler policy artifacts must validate against the metadata contract before they can be used by formal stages.
fsrs6_adr_linear_cmaes_users_1_8.toml trains one simplified
3-parameter fsrs6_adr_log_linear_v1 policy per user and baseline desired
retention with CMA-ES, then evaluates the resulting ordinary fsrs6_adr
schedulers.
fsrs6_adr_linear_portfolio_users_1_8.toml trains 16 simplified
3-parameter fsrs6_adr_log_linear_v1 portfolio children per user with
SMS-EMOA hypervolume optimization, then evaluates them as ordinary
fsrs6_adr schedulers.
fsrs6_default_adr_portfolio_users_1_8_pop16_20_v1.toml trains the same ADR
portfolio form while using default FSRS-6 weights inside the ADR scheduler; the
environment still follows the configured simulation environment.
fsrs6_ap_cmaes_users_1_8.toml trains fsrs6_ap (Adaptive Parameters), which
searches 21 bounded FSRS-6 scheduler parameters as standardized deltas from each
user's fitted FSRS-6 weights. It batches users and baseline DR values in one
process and still writes one independent policy artifact per user/DR/lambda for
the standard sweep and Pareto stages.
ADR trains one artifact per (user, baseline DR, lambda) and applies the
overfit gate against the same user's same-DR FSRS-6 baseline, with both relative
memorized-average and memorized-per-minute gains required to be greater than
0.0.
AP uses the ADR overfit gate per DR, but the action is the full FSRS-6
weight vector. The train lane layout flattens (user, baseline DR, CMA-ES candidate) and uses [training.ap].dr_batch_size plus [training.batch] to
control DR and user batching.
fsrs6_adr_portfolio and fsrs6_ap_portfolio train multiple child policies
per user with SMS-EMOA. The objective is the Pareto hypervolume gain of
candidate (memorized_average, -time_average) points over the user's FSRS-6 DR
baseline set. Portfolio profiles use [baseline_dr_selection] to point at a
generated per-user manifest of 16 FSRS-6 desired-retention values, selected in
the FSRS6 environment before formal staging/training. The selector writes
per-generation hypervolume progress to <manifest>.progress.jsonl by default
and batches multiple users together up to --max-lanes-per-batch lanes.
Portfolio artifacts are
lambda-less and write children under
user_<id>/policies/policy_*/; set
training.artifact_metadata_glob = "policies/**/metadata.json" so sweep stages
discover each child policy.
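The hypervolume-gain objective described above can be illustrated with a small two-objective sketch. This is not the repo's SMS-EMOA code: the function names, the reference point, and the example numbers are all assumptions made for illustration. Both coordinates are treated as "larger is better", which is why time enters as -time_average.

```python
from typing import Iterable

# (memorized_average, -time_average); both coordinates are maximized.
Point = tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by the reference point."""
    # Ignore points that do not improve on the reference in both objectives.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective down to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the horizontal strip not yet covered by earlier points.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume_gain(candidates: Iterable[Point], baseline: Iterable[Point], ref: Point) -> float:
    """Extra hypervolume the candidates add on top of the baseline set."""
    baseline = list(baseline)
    combined = baseline + list(candidates)
    return hypervolume_2d(combined, ref) - hypervolume_2d(baseline, ref)


# Hypothetical baseline DR points and one candidate child policy.
baseline_points = [(100.0, -30.0), (200.0, -60.0)]
reference = (0.0, -100.0)
gain = hypervolume_gain([(150.0, -40.0)], baseline_points, reference)
```

A candidate that is dominated by the baseline set contributes zero gain, so only points that extend the frontier are rewarded.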
Training batch mode is configured under [training.batch], for example:
[training.batch]
enabled = true
trainer = "auto"
batch_size = 8
trainer = "auto" resolves the built-in trainer from training.command_template;
set an explicit trainer such as "fsrs6_adr_cmaes",
"fsrs6_adr_portfolio", or "fsrs6_ap_cmaes" for command-template-free
in-process runs.
Experiments
Retention sweep + Pareto (compare environments, optional SSP-MMC policies):
uv run experiments/retention_sweep/run_sweep.py --env fsrs6,lstm --sched fsrs6
uv run experiments/retention_sweep/run_sweep.py --env fsrs6,lstm --sched sspmmc
uv run experiments/retention_sweep/run_sweep.py --env lstm --sched fsrs6_adr --fsrs6-adr-policy <policy.json>
uv run experiments/retention_sweep/run_sweep.py --env lstm --sched fsrs6_ap --fsrs6-ap-policy <policy.json>
uv run experiments/retention_sweep/run_sweep.py --env fsrs6,lstm --sched fsrs6,sspmmc
uv run experiments/retention_sweep/build_pareto.py --env fsrs6,lstm --sched fsrs6,sspmmc
Single-card lifecycle tradeoff experiments have their own guide in experiments/single_card_tradeoff/README.md, including finite and stationary FSRS6 oracle baselines, interval/retention distillation, and continuous desired-retention oracle variants.
That family now has its own formal stage runner at experiments.single_card_tradeoff.cli.run_experiment, which reads semantic task tables instead of raw [[commands]] blocks. The same guide also covers native FSRS6 ADR training via experiments.single_card_tradeoff.cli.fsrs6_adr_train_multiuser.
By default, SSP-MMC policies are loaded from ../SSP-MMC-FSRS/outputs/policies/user_<id>. Override with --sspmmc-policy-dir or --sspmmc-policies.
Use --sched to compare DR sweeps across schedulers; include sspmmc, fsrs6_adr, fsrs6_adr_time, or fsrs6_ap to add policy curves. For fsrs6_adr and fsrs6_adr_time, pass --fsrs6-adr-policy <policy.json>; the policy maps scheduler-side FSRS-6 state to desired retention. For fsrs6_ap, pass --fsrs6-ap-policy <policy.json>; the policy contains bounded FSRS-6 scheduler weights and its baseline DR. For fixed intervals, pass fixed@<days> in --sched.
Single-user retention sweep logs default to logs/retention_sweep/user_<id> and use the event engine. Retention sweeps write JSONL summaries by default but skip daily CSV sidecars to limit disk usage; pass --diagnostic-csv-logs when diagnosing simulation behavior or when using CSV-based plotting helpers.
build_pareto.py writes results JSON to logs/retention_sweep/<config>/ and plots to experiments/retention_sweep/plots/<config>/, where <config> encodes --short-term, --fuzz, --engine, and compare flags; per-user outputs are disambiguated with _user_<id> in the filename. build_pareto.py recursively scans JSONL logs under --log-dir, annotates points by default, and can compare staged baseline logs with nested sweep outputs. Pass --hide-labels to disable labels, --fuzz on/off to filter logs, or --compare-fuzz to overlay fuzz on/off curves.
Short-term scheduling:
uv run simulate.py --engine event --env lstm --sched lstm --short-term-source steps --learning-steps 1,10 --relearning-steps 10
To explicitly disable learning/relearning steps while using --short-term-source steps, pass empty strings:
uv run simulate.py --engine event --env lstm --sched lstm --short-term-source steps --learning-steps "" --relearning-steps ""
Scheduler-driven short-term (LSTM only, no steps):
uv run simulate.py --engine event --env lstm --sched lstm --short-term-source sched
Use --short-term-loops-limit <N> to cap short-term review loops per user per day in event and batched runs, not total short-term review interactions. A loop may process multiple due short-term cards; each card is processed at most once per loop. Remaining short-term cards carry over to the next day.
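The difference between capping loops and capping interactions can be shown with a toy sketch. It is not the engine's code: run_short_term_day and the review callback are hypothetical names, and the callback simply reports whether a card still needs another short-term review.

```python
from typing import Callable


def run_short_term_day(
    pending: list[int],
    loops_limit: int,
    review: Callable[[int], bool],
) -> tuple[int, list[int]]:
    """Run capped short-term loops; return (interactions, carried_over_ids)."""
    interactions = 0
    loops = 0
    while pending and loops < loops_limit:
        still_pending = []
        for card_id in pending:  # each card is processed at most once per loop
            interactions += 1
            if review(card_id):  # True means another short-term review is due
                still_pending.append(card_id)
        pending = still_pending
        loops += 1
    return interactions, pending  # leftovers carry over to the next day


# Hypothetical number of short-term steps each card still needs.
steps_left = {1: 1, 2: 3, 3: 2}


def review(card_id: int) -> bool:
    steps_left[card_id] -= 1
    return steps_left[card_id] > 0


interactions, carried = run_short_term_day([1, 2, 3], loops_limit=2, review=review)
```

With a limit of two loops, the toy day still performs five review interactions, and card 2 is carried over because it needs a third pass.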
When short-term scheduling is enabled, benchmark weights are loaded from *-short-secs result files, and LSTM weights are loaded from weights/LSTM-short-secs in the srs-benchmark repo (override via --benchmark-result if needed).
Additional retention sweep helpers:
uv run experiments/retention_sweep/run_sweep_users.py --start-user 1 --end-user 10 --env fsrs6,lstm --sched fsrs6,anki_sm2,memrise --max-parallel 4
uv run experiments/retention_sweep/run_sweep_users_batched.py --start-user 1 --end-user 200 --env lstm --sched fsrs6,anki_sm2,memrise
uv run experiments/retention_sweep/run_sweep_users_batched.py --start-user 1 --end-user 10 --env lstm --sched fsrs6_adr --fsrs6-adr-policy <policy.json>
uv run python experiments/retention_sweep/run_sweep_users_batched.py --config <edited-batched-sweep.toml> --dry-run
uv run python experiments/retention_sweep/run_sweep_users_batched.py --start-user 1 --end-user 8 --env lstm --sched fsrs6,fsrs6_adr --fsrs6-adr-policy-root <train-overfit/train_outputs>
uv run experiments/retention_sweep/build_pareto_users.py --start-user 1 --end-user 8 --env lstm --sched fsrs6,fsrs6_adr --engine batched
uv run experiments/retention_sweep/build_pareto_users.py --config experiments/rl_scheduler/configs/fsrs6_adr_linear_cmaes_users_1_8.toml --dry-run
uv run experiments/retention_sweep/build_pareto_users.py --start-user 1 --end-user 10 --env fsrs6,lstm --sched fsrs6,sspmmc
uv run experiments/retention_sweep/aggregate_users.py --env lstm --sched fsrs6,anki_sm2,memrise
uv run experiments/retention_sweep/dominance.py --env lstm
uv run python experiments/retention_sweep/plot_short_loops.py --env lstm --sched fsrs6 --short-term-source steps --desired-retention 0.9 --metric avg --out experiments/retention_sweep/plots/short_loops_fsrs6_steps_dr09.png --no-show
- run_sweep_users.py fans out run_sweep.py across a user-id range and supports --max-parallel, --cuda-devices (round-robin per worker), plus MPS env passthrough; --max-parallel only delivers speedups when GPU Multi-Process Service (MPS) is enabled on the host. In parallel it shows an overall work bar, a user bar, and per-worker bars (disable with --child-progress off, and use --show-commands on if you need the raw subprocess commands).
- run_sweep_users_batched.py runs LSTM/FSRS6 retention sweeps with the batched tensor engine. By default it uses --max-lanes-per-batch 10000 to precompute outer user batches before loading per-user weights, keeping each user's scheduler/DR lanes together; use --batch-size only when you want a fixed outer user count, such as distributing work with --cuda-devices. Each batch expands (user, scheduler, parameter/DR) into simulation lanes, so mixed scheduler sweeps share one batched engine call per environment batch. It is also the supported entrypoint for FSRS-trained fsrs6_adr, fsrs6_adr_time, fsrs6_default_adr, fsrs6_ap, and anki_sm2_ap policies evaluated in FSRS6 or external LSTM environments.
  - Use --fsrs6-dr-manifest <manifest.json> or --fsrs3-dr-manifest <manifest.json> to run FSRS6 or FSRSv3 DR lanes from per-user selected values instead of the uniform range.
  - Use --fsrs6-adr-policy <policy.json> for one policy, --fsrs6-adr-policy-root <train-overfit/train_outputs> or --fsrs6-adr-train-run-root <run-root> to expand trained (user, baseline DR, lambda) policies or lambda-less portfolio child policies for fsrs6_adr, fsrs6_adr_time, or fsrs6_default_adr, or --fsrs6-adr-policy-manifest <policies.toml> for explicit entries.
  - Use --fsrs6-ap-policy <policy.json> for one AP policy, --fsrs6-ap-policy-root <train-overfit/train_outputs> or --fsrs6-ap-train-run-root <run-root> to expand trained (user, baseline DR, lambda) AP policies or lambda-less AP portfolio children, or --fsrs6-ap-policy-manifest <policies.toml> for explicit entries.
  - Use --anki-sm2-ap-policy <policy.json> for one Anki SM2 AP policy, --anki-sm2-ap-policy-root <train-overfit/train_outputs> or --anki-sm2-ap-train-run-root <run-root> for portfolio children, or --anki-sm2-ap-policy-manifest <policies.toml> for explicit entries.
  - fsrs6_adr, fsrs6_adr_time, fsrs6_default_adr, and fsrs6_ap manifests include baseline_desired_retention; portfolio children may set it to null; anki_sm2_ap portfolio children are no-DR policies.
  - Formal ADR/AP experiments keep the batched sweep settings inside experiments/rl_scheduler/configs/*.toml so one TOML reproduces training, sweep, Pareto build, and analysis. When that TOML is used directly with run_sweep_users_batched.py --config, [sweep].log_dir is honored as the shared log root. When it is used through experiments/rl_scheduler/run_experiment.py --stage sweep, the same sweep settings are run-local and logs are written to <output_root>/<run_id>/sweep/sweep_outputs; formal build-pareto scans <output_root>/<run_id> rather than the shared log root.
  - Batched sweeps use --log-layout user by default, so --log-dir logs/retention_sweep writes logs/retention_sweep/user_<id>/sched_... for all schedulers and can be consumed directly by build_pareto_users.py; use --log-layout sweep to keep the legacy sched_.../user_<id> layout.
  - Short-term steps are supported via --short-term-source steps, and LSTM sched-based short-term is supported via --short-term-source sched.
  - Batched sweeps skip per-user daily CSV sidecars and batch GPU CSV logs by default; pass --diagnostic-csv-logs to write them under the normal log root.
  - SRS_LSTM_MAX_BATCH defaults to 65536. This is a throughput-oriented chunk size and can use more than 10 GiB of GPU memory on larger LSTM lane batches; batched sweep entrypoints enable PyTorch expandable CUDA segments at startup to reduce allocator fragmentation, but reduce SRS_LSTM_MAX_BATCH if shared GPU memory spill still appears, or keep --max-lanes-per-batch at or below the default 10000 unless memory allows larger chunks.
- build_pareto_users.py fans out build_pareto.py across a user-id range.
- aggregate_users.py aggregates per-user retention_sweep logs into summary JSON, recursively scanning nested batched lane logs under --log-dir. By default it plots FSRS-6 equivalent distributions vs Anki-SM-2/Memrise; use --equiv-baselines to choose different baseline scheduler specs and --equiv-pairs for generic DR-scheduler pair boxplots such as lstm:fsrs6. Use --equiv-report fsrs3 (and include fsrs3 in --sched) to switch the baseline-equivalence target to FSRSv3.
- dominance.py reports per-user dominance rates between Anki-SM-2 and Memrise, plus FSRS-6 default (DR=90%) vs Anki-SM-2/Memrise, and saves stacked bar charts.
- plot_short_loops.py plots the per-user distribution of daily short-term loops from retention_sweep CSV sidecars. Generate those sidecars with --diagnostic-csv-logs. For DR schedulers such as fsrs6, pass --desired-retention so the plot uses a single target-retention config instead of mixing multiple DR runs for the same user. Use --metric avg-active to average only over days with loops, --log-dir to point at a single user or alternate sweep root, and --max-user-ticks to cap lower-panel x-axis labels.
- Note: very high desired retention targets (e.g., --end-retention 0.99 for FSRS6 sweeps) can dramatically increase GPU memory usage; reduce --max-lanes-per-batch, set --batch-size, or cap retention if you hit OOM.
- Retention range flags use --start-retention/--end-retention across the sweep, aggregate, and pareto tools.
- To pass retention sweep overrides (e.g., --start-retention/--end-retention/--step, now 0-1 floats) to run_sweep.py, add them after -- when invoking run_sweep_users.py.
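The lane expansion and user batching described for run_sweep_users_batched.py can be pictured with the sketch below. It is an illustration only: plan_user_batches and expand_lanes are hypothetical helpers, and it assumes every scheduler shares one DR grid, which the real entrypoint does not require (schedulers such as fixed or anki_sm2 have no DR lanes).

```python
from itertools import product


def plan_user_batches(
    user_ids: list[int],
    lanes_per_user: int,
    max_lanes_per_batch: int,
) -> list[list[int]]:
    """Greedily group users so each batch stays at or under the lane cap.

    All lanes of one user stay in the same batch. A single user whose lanes
    exceed the cap still gets a batch of their own.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    for user_id in user_ids:
        if current and (len(current) + 1) * lanes_per_user > max_lanes_per_batch:
            batches.append(current)
            current = []
        current.append(user_id)
    if current:
        batches.append(current)
    return batches


def expand_lanes(batch: list[int], schedulers: list[str], dr_values: list[float]):
    """Expand one user batch into (user, scheduler, DR) simulation lanes."""
    return list(product(batch, schedulers, dr_values))


schedulers = ["fsrs6", "lstm"]
dr_values = [0.80, 0.85, 0.90]
batches = plan_user_batches(
    user_ids=[1, 2, 3, 4, 5],
    lanes_per_user=len(schedulers) * len(dr_values),
    max_lanes_per_batch=15,
)
first_batch_lanes = expand_lanes(batches[0], schedulers, dr_values)
```

In this toy setup each user contributes six lanes, so a cap of 15 lanes places two users in each batch and the last user in a batch alone.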
Engine support matrix
Legend: ✓ supported, — not supported.
Event engine:
| env \ sched | fsrs6 | fsrs3 | hlr | dash | lstm | fixed | anki_sm2 | anki_sm2_ap | memrise | sspmmc | fsrs6_adr | fsrs6_adr_time | fsrs6_default_adr | fsrs6_ap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lstm | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| fsrs6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| fsrs3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| hlr | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| dash | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Batched tensor engine (run_sweep_users_batched.py):
| env \ sched | fsrs6 | fsrs3 | lstm | anki_sm2 | anki_sm2_ap | memrise | fixed | fsrs6_adr | fsrs6_adr_time | fsrs6_default_adr | fsrs6_ap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| lstm | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| fsrs6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Notes:
- Event engine is the reference implementation and supports all scheduler/environment combinations, even if some pairings are not meaningful.
- Batched mode is intended for multi-user retention sweeps and is currently limited to the environments and schedulers listed above.
- fsrs6_adr, fsrs6_adr_time, and fsrs6_default_adr require one of --fsrs6-adr-policy, --fsrs6-adr-policy-root, --fsrs6-adr-train-run-root, or --fsrs6-adr-policy-manifest; fsrs6_ap uses the matching --fsrs6-ap-* policy source flags; anki_sm2_ap uses the matching --anki-sm2-ap-* policy source flags.
Evaluation
experiments/retention_sweep/aggregate_users.py compares scheduler efficiency by aggregating retention_sweep logs across users for each environment, scheduler, and target setting (desired retention or fixed interval) and restricting to the intersection of user IDs so each config is compared on the same users. Formal RL-scheduler analyze-pareto reports use scheduler-only Pareto hypervolume as the primary metric.
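Restricting every config to the same users can be sketched as a set intersection before averaging. The data layout (config name mapped to per-user values) and the helper names are assumptions for illustration, not the structure aggregate_users.py actually uses.

```python
def common_users(results: dict[str, dict[int, float]]) -> set[int]:
    """User IDs that appear in every config's results."""
    user_sets = [set(per_user) for per_user in results.values()]
    return set.intersection(*user_sets) if user_sets else set()


def mean_over_common(results: dict[str, dict[int, float]]) -> dict[str, float]:
    """Average each config over the shared users only."""
    users = common_users(results)
    if not users:
        return {}
    return {
        config: sum(per_user[u] for u in users) / len(users)
        for config, per_user in results.items()
    }


# Hypothetical memorized-per-minute values keyed by config and user ID.
results = {
    "fsrs6@0.90": {1: 4.0, 2: 6.0, 3: 8.0},
    "anki_sm2": {2: 5.0, 3: 7.0, 4: 9.0},
}
shared = common_users(results)
means = mean_over_common(results)
```

Users 1 and 4 are dropped here because each is missing from one config, so both averages are computed over users 2 and 3.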
Metrics and outputs:
- Memorized cards (average, all days): average number of memorized cards across the full simulation horizon.
- Study minutes per day (average): average study time per day from the logs.
- Memorized cards per minute (average): memorized cards divided by study minutes. In RL-scheduler portfolio reports, unweighted policy-point averages of memorized cards, time, and efficiency are diagnostics only because they depend on where the portfolio samples the frontier.
- Pareto portfolio reports: analyze-pareto summarizes scheduler-only HV delta against the FSRS6 baseline, HV delta / baseline HV, per-user HV delta five-number summaries, coverage-aware same-budget memory lift AUC over the common covered time-budget interval, and coverage-aware same-target time saved AUC over the common covered memory-target interval. The AUC metrics use linear interpolation between each scheduler's Pareto frontier points and do not extrapolate beyond the shared frontier coverage. The same metrics are written to analysis_summary.json; formal experiment reports are generated by experiments/rl_scheduler/generate_experiment_report.py from report_summary.json, not by hand-copying tables from Markdown.
- Plots: for baseline scheduler specs (default Anki-SM-2 and Memrise, configurable via --equiv-baselines), it computes each user's equivalence-target DR by interpolating the target DR scheduler (FSRS-6 by default, configurable via --equiv-report) points to match the baseline memorized cards (average, all days), then compares memorized cards per minute (average) distributions along with per-user differences and ratios. It also supports generic DR-scheduler pair boxplots via --equiv-pairs, which interpolate the target curve to each baseline DR point and bin the resulting efficiency ratio by baseline DR.
- Per-user Pareto frontier comparison: build_pareto_users.py saves a Pareto frontier plot per user (filename suffixed with _user_<id>) into a shared config-specific plot directory, overlaying environments and schedulers to show the tradeoff frontier in terms of memorized cards (average, all days) vs study minutes per day (average).
- Axes under default retention_sweep settings (1825 days, deck 10,000, learn limit 10/day, review limit 9,999/day, cost limit 720 minutes/day, review-first, seed 42, batched engine): the X axis "Memorized cards (average, all days)" is the expected number of cards remembered per day averaged over the whole run (sum of predicted retrievability across learned cards), and the Y axis "Minutes of studying per day (average)" is the average daily study time reported by the cost model over the whole run (lower = better, since it is the cost axis in the tradeoff plot).
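The equivalence comparison in the Plots item can be sketched as a linear interpolation along one user's DR curve. This is a simplified illustration, not the aggregate_users.py implementation: the tuple layout, function names, and numbers are assumptions, and the sketch returns None instead of extrapolating outside the curve's coverage.

```python
# One curve point: (desired_retention, memorized_average, minutes_per_day)
CurvePoint = tuple[float, float, float]


def interpolate_at_memorized(curve: list[CurvePoint], target_memorized: float):
    """Return (equivalent_dr, minutes) where the curve reaches target_memorized."""
    pts = sorted(curve, key=lambda p: p[1])  # order by memorized average
    for (dr0, m0, t0), (dr1, m1, t1) in zip(pts, pts[1:]):
        if m1 > m0 and m0 <= target_memorized <= m1:
            w = (target_memorized - m0) / (m1 - m0)
            return dr0 + w * (dr1 - dr0), t0 + w * (t1 - t0)
    return None  # target lies outside the curve; do not extrapolate


def efficiency_ratio(curve: list[CurvePoint], baseline_memorized: float, baseline_minutes: float):
    """Memorized-per-minute of the matched curve point divided by the baseline's."""
    hit = interpolate_at_memorized(curve, baseline_memorized)
    if hit is None:
        return None
    _, minutes = hit
    # Memorized counts are equal by construction, so the ratio reduces to time.
    return baseline_minutes / minutes


# Hypothetical per-user numbers.
target_curve = [(0.80, 4000.0, 20.0), (0.90, 5000.0, 30.0)]
equivalent = interpolate_at_memorized(target_curve, 4500.0)
ratio = efficiency_ratio(target_curve, baseline_memorized=4500.0, baseline_minutes=30.0)
```

A ratio above 1 means the interpolated target scheduler reaches the baseline's memorized average in less study time.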
Retention sweep comparisons (lstm)
_All figures below use --env lstm --engine batched --short-term on --short-term-source steps in the corresponding retention_sweep analysis scripts._
Interpretation caveat: these plots are conditional on the LSTM environment being the simulator's ground-truth memory model. That is not a neutral test bed for every scheduler, and it is especially favorable to the LSTM scheduler because the scheduler is being evaluated inside the same model family that defines recall dynamics. These figures should therefore be read as "performance under the LSTM simulator", not as unbiased evidence that LSTM is universally better on real users.
SM2 vs Memrise dominance
Caption: Per-user dominance outcomes between Anki-SM-2 and Memrise (dominates vs tradeoff).
FSRS-6 default (DR=90%) vs Anki-SM-2 dominance
Caption: Per-user dominance outcomes between FSRS-6 default (DR=90%) and Anki-SM-2 (dominates vs tradeoff).
FSRS-6 default (DR=90%) vs Memrise dominance
Caption: Per-user dominance outcomes between FSRS-6 default (DR=90%) and Memrise (dominates vs tradeoff).
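The "dominates vs tradeoff" outcomes in the three captions above follow the usual Pareto notion of dominance on the two plot axes. The sketch below shows one way to classify a pair of schedulers for a single user; the function name and the handling of exact ties are assumptions, and dominance.py may apply different criteria.

```python
def compare_schedulers(a: tuple[float, float], b: tuple[float, float]) -> str:
    """Compare two (memorized_average, minutes_per_day) points.

    More memorized cards is better; fewer study minutes is better.
    """
    a_mem, a_min = a
    b_mem, b_min = b
    a_no_worse = a_mem >= b_mem and a_min <= b_min
    b_no_worse = b_mem >= a_mem and b_min <= a_min
    if a_no_worse and b_no_worse:
        return "equal"  # identical on both axes
    if a_no_worse:
        return "a_dominates"
    if b_no_worse:
        return "b_dominates"
    return "tradeoff"  # each side wins on one axis


# Hypothetical per-user points for two schedulers.
outcome_dominates = compare_schedulers((5000.0, 25.0), (4800.0, 30.0))
outcome_tradeoff = compare_schedulers((5000.0, 35.0), (4800.0, 30.0))
```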
FSRS6 equivalence vs Anki-SM-2
Caption: FSRS-6 interpolated to match Anki-SM-2 memorized-average per user; compares memorized-per-minute distributions and deltas. (n=7954; superiority=78.8%; mean ratio=1.143; median ratio=1.124 (IQR 1.015-1.316); mean DR=0.885; median DR=0.895).
FSRS6 equivalence vs Memrise
Caption: FSRS-6 interpolated to match Memrise memorized-average per user; compares memorized-per-minute distributions and deltas. (n=7544; superiority=70.3%; mean ratio=1.074; median ratio=1.074 (IQR 0.980-1.188); mean DR=0.848; median DR=0.869).
LSTM equivalence vs Anki-SM-2
Caption: LSTM scheduler interpolated to match Anki-SM-2 memorized-average per user; compares memorized-per-minute distributions and deltas. (n=8281; superiority=87.4%; mean ratio=1.243; median ratio=1.232 (IQR 1.091-1.449); mean DR=0.895; median DR=0.901).
LSTM equivalence vs Memrise
Caption: LSTM scheduler interpolated to match Memrise memorized-average per user; compares memorized-per-minute distributions and deltas. (n=8070; superiority=82.4%; mean ratio=1.133; median ratio=1.132 (IQR 1.040-1.247); mean DR=0.872; median DR=0.885).
Benchmarks
Performance baselines live under benches/. See benches/README.md for details.
Run the default suite:
uv run python benches/run_bench.py --srs-benchmark-root ../srs-benchmark
Run a single scenario:
uv run python benches/run_bench.py --scenario event_lstm_lstm --srs-benchmark-root ../srs-benchmark
Key concepts
- MemoryModel / Environment (simulator.core.MemoryModel): governs how recall probability and memory state evolve. Implementations live under simulator/models.
- BehaviorModel / User (simulator.core.BehaviorModel): turns hidden retrievability into observed ratings, can skip days, and sets the first rating.
- CostModel / Workload (simulator.core.CostModel): converts each review into a dynamic time cost (e.g. longer latency when R is low).
- Scheduler (simulator.core.Scheduler): the agent under test. It only receives a CardView projection (history, due date, prior intervals) and returns the next interval plus its internal state.
- simulate (simulator.core.simulate): a day-stepped loop that wires all four components together.
- simulate_multiuser (simulator.batched_engine.simulate_multiuser): a tensor engine used by batched multi-user retention sweeps. It returns aggregate per-user stats and accepts a torch device override through the batched sweep entrypoint.
Architecture and control flow
The simulator follows an environment-agent loop where each module owns a distinct responsibility and communicates through lightweight data structures.
Data model
| Type | Purpose |
|---|---|
| Card | Internal state tracked by the simulator (id, due, lapses, memory/scheduler state, metadata). |
| CardView | Scheduler/behavior-visible projection of a card. Includes history but hides the environment's ground-truth state. |
| ReviewLog | (rating, elapsed, day) tuples appended to Card.history and used for logging/analysis. |
| SimulationStats | Time series counters plus a chronological list of Event records (day, action, card_id, rating, retrievability, cost, interval, due). |
Event loop
- Initialize deck - create Card objects, seed future queue with due dates, and set up per-day counters.
- Daily setup - each simulated day:
  - Call behavior.start_day(day, rng) to reset attendance/limit tracking.
  - Move cards whose due <= day from the future queue into the ready heap. Each ready entry stores (scheduler.review_priority(view, day), tie_breaker, card_id) so the scheduler can hint which review should run first (e.g., lowest retrievability).
- Behavior-driven actions - repeatedly ask behavior.choose_action(day, next_review_view, next_new_view, rng):
  - next_review_view is the highest-priority ready card, next_new_view is a placeholder for the next unseen card.
  - Behavior may return Action.REVIEW, Action.LEARN, or None (stop for the day). It enforces daily limits (new/review counts, cost ceiling) and implements heuristics such as new-first vs review-first.
- Learning path - when choosing Action.LEARN:
  - Behavior picks an initial rating via initial_rating.
  - MemoryModel.init_card sets the ground-truth stability/difficulty.
  - Scheduler.init_card computes the first interval and scheduler state.
  - CostModel.learning_cost returns task time; the simulator updates stats, records a "new" event, and schedules the next review by pushing (due, priority, id) back to the future queue.
- Review path - when choosing Action.REVIEW:
  - Compute elapsed days and call MemoryModel.predict_retention for true retrievability.
  - Behavior samples a rating via review_rating; if it returns None the user skipped the rest of the day and the card is deferred.
  - Otherwise update ground-truth (MemoryModel.update_card), ask the scheduler for the next interval (schedule), compute review cost, update stats, and log a "review" event.
- Deferral - once behavior stops or limits are reached, any remaining ready reviews are deferred by setting card.due = day + 1 and re-queuing. This ensures they appear first on the next day but retain scheduler-provided priority hints.
- Post-processing - after all days, compute daily retention (1 - lapses/reviews) and return SimulationStats.
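The queue handling in the steps above (future queue, ready heap, daily limit, deferral) can be sketched with two heaps. This toy loop is not simulator.core.simulate: it has no memory, behavior, or cost model, and run_days, first_due, and next_interval are hypothetical names used only to show how cards move between the queues.

```python
import heapq
from typing import Callable


def run_days(
    first_due: dict[int, int],
    days: int,
    review_limit: int,
    next_interval: Callable[[int], int],
) -> list[list[int]]:
    """Return the card IDs reviewed on each day, in review order."""
    # Future queue entries are (due, priority, card_id); the hint defaults to (due, id).
    future = [(due, (due, card_id), card_id) for card_id, due in first_due.items()]
    heapq.heapify(future)
    reviewed_per_day = []
    for day in range(days):
        ready: list[tuple[tuple[int, int], int]] = []
        # Daily setup: move cards whose due <= day into the ready heap.
        while future and future[0][0] <= day:
            _, priority, card_id = heapq.heappop(future)
            heapq.heappush(ready, (priority, card_id))
        done = []
        # Review path: take the highest-priority card until the limit is hit.
        while ready and len(done) < review_limit:
            _, card_id = heapq.heappop(ready)
            done.append(card_id)
            due = day + next_interval(card_id)
            heapq.heappush(future, (due, (due, card_id), card_id))
        # Deferral: leftovers become due tomorrow but keep their priority hint.
        for priority, card_id in ready:
            heapq.heappush(future, (day + 1, priority, card_id))
        reviewed_per_day.append(done)
    return reviewed_per_day


schedule = run_days({1: 0, 2: 0, 3: 0}, days=2, review_limit=2, next_interval=lambda c: 10)
```

With three cards due on day 0 and a limit of two reviews, the third card is deferred and reviewed first on day 1.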
Priority plumbing
- Scheduler hint - Scheduler.review_priority(view, day) returns a tuple (default (due, id)). FSRS schedulers override it to sort by predicted retrievability or difficulty. The simulator stores the hint in Card.metadata["scheduler_priority"].
- Behavior ordering - BehaviorModel.priority_key(view) prepends its own policy (e.g., review-first) and consumes the scheduler hint so user strategies can favor reviews or new cards without losing the scheduler's ordering inside each bucket.
This separation lets you benchmark schedulers against arbitrary memory models and user behaviors while keeping transparency about where each decision is made.
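One way to picture the two-level ordering is a sort key whose first element is the behavior's bucket and whose remainder is the scheduler hint. The standalone function below is a sketch under that assumption; the real BehaviorModel.priority_key takes a view, and its exact tuple layout is not shown in this document.

```python
def priority_key(is_new: bool, scheduler_hint: tuple, policy: str = "review-first") -> tuple:
    """Prepend a behavior bucket to the scheduler's hint (smaller sorts first)."""
    if policy == "review-first":
        bucket = 1 if is_new else 0
    elif policy == "new-first":
        bucket = 0 if is_new else 1
    else:
        raise ValueError(f"unknown policy: {policy}")
    return (bucket, *scheduler_hint)


# Hypothetical cards: (card_id, is_new, scheduler hint).
# Review hints here are (predicted retrievability, id), as a
# low-retrievability-first scheduler might produce.
cards = [
    (10, True, (5, 10)),
    (2, False, (0.7, 2)),
    (3, False, (0.4, 3)),
]
review_first = [cid for cid, is_new, hint in sorted(cards, key=lambda c: priority_key(c[1], c[2]))]
new_first = [
    cid
    for cid, is_new, hint in sorted(cards, key=lambda c: priority_key(c[1], c[2], "new-first"))
]
```

In both orderings the two reviews keep the scheduler's relative order (lowest retrievability first); only the position of the new card changes.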
Provided models
CLI environments are fsrs6, fsrs3, and lstm; HLR/DASH models are available for custom code but are not wired into the simulate.py CLI.
- FSRS6Model: FSRS v6-style environment (21 params loaded from srs-benchmark for the selected user).
- FSRS3Model: FSRS v3-style environment (13 params loaded from srs-benchmark).
- HLRModel: half-life regression with three weights loaded from srs-benchmark.
- DASHModel: stateless logistic model with placeholder features and nine weights loaded from srs-benchmark.
- LSTMModel: neural forgetting-curve predictor inspired by the srs-benchmark LSTM (requires PyTorch and --user-id weights; runs on CUDA when available, otherwise CPU; expects day-based intervals like the original delta_t feature).
Provided schedulers
- FSRS6Scheduler / FSRSScheduler: FSRS v6-style state; loads weights from srs-benchmark for the selected user.
- FSRS3Scheduler: FSRS v3-style scheduler with weights from srs-benchmark.
- HLRScheduler: schedules using half-life regression weights from srs-benchmark.
- DASHScheduler: logistic retention solver that mirrors the DASH model and uses weights from srs-benchmark.
- SSPMMCScheduler: loads precomputed SSP-MMC-FSRS policies (JSON + .npz) and maintains its own FSRS6 state so it can target optimal retention under any environment.
- FixedIntervalScheduler: stateless fixed-interval baseline (--sched fixed@<days>).
- AnkiSM2Scheduler: Anki SM-2-style ease scheduler (--sched anki_sm2).
- MemriseScheduler: Memrise sequence scheduler (--sched memrise).
- LSTMScheduler: LSTM curve-fit scheduler that targets a desired retention from review history.
Provided behavior and cost models
- StochasticBehavior: configurable attendance probability, lazy-good bias, and daily limits (max new/reviews/cost).
- StatefulCostModel: combines FSRS state rating costs (learning/review/relearning) with a latency penalty that grows as retrievability drops.
Extend
- Add a new memory model: subclass MemoryModel, implement init_card, predict_retention, and update_card.
- Add a new behavior model: subclass BehaviorModel, implement initial_rating and review_rating.
- Add a new cost model: subclass CostModel, implement review_cost.
- Add a new scheduler: subclass Scheduler, implement init_card and schedule that operate on CardView.
- Swap components in simulate to study how scheduler policies perform under different ground-truth models, user behaviors, and workload assumptions.
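As a rough illustration of the first item, the toy model below implements the three memory-model methods with an exponential forgetting curve. It is deliberately standalone: in the repo it would subclass simulator.core.MemoryModel, and the method signatures, the 1-4 rating scale, the ToyState container, and every numeric constant here are assumptions rather than the project's actual interface.

```python
from dataclasses import dataclass


@dataclass
class ToyState:
    stability: float  # days until predicted recall falls to 90%


class ToyExponentialModel:
    """Hypothetical memory model; the real base class may expect other signatures."""

    # Assumed first-review stabilities for ratings 1 (again) to 4 (easy).
    INITIAL_STABILITY = {1: 0.5, 2: 1.0, 3: 3.0, 4: 7.0}

    def init_card(self, rating: int) -> ToyState:
        return ToyState(stability=self.INITIAL_STABILITY[rating])

    def predict_retention(self, state: ToyState, elapsed_days: float) -> float:
        # Recall is 1.0 at zero elapsed time and 0.9 when elapsed == stability.
        return 0.9 ** (elapsed_days / state.stability)

    def update_card(self, state: ToyState, rating: int, elapsed_days: float) -> ToyState:
        # elapsed_days is unused in this toy update; a real model would use it.
        if rating == 1:
            # A lapse shrinks stability, with a floor of half a day.
            return ToyState(stability=max(0.5, state.stability * 0.3))
        # Successful reviews grow stability, more for easier ratings.
        return ToyState(stability=state.stability * (1.5 + 0.5 * (rating - 2)))


model = ToyExponentialModel()
state = model.init_card(rating=3)
recall_at_stability = model.predict_retention(state, elapsed_days=3.0)
after_good = model.update_card(state, rating=3, elapsed_days=3.0)
after_lapse = model.update_card(state, rating=1, elapsed_days=3.0)
```

Check the abstract methods and argument order in simulator.core before adapting a sketch like this to the real base class.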
Acknowledgements
- 1DWalker provided acceleration advice.
- Asuka Minato provided GPU servers.
- Expertium provided visualization advice.
- Jarrett Ye supervised Codex on this project.
- Luc Mcgrady inspired the event-driven engine.