Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.15.0] - 2026-07-04
Maintenance overhaul: full dependency refresh, Python 3.14 support, and a multi-dimension audit that fixed 30+ verified bugs across batch processing, fetch/cache, LLM providers, image handling, and configuration.
Added
- Python 3.14 Support: requires-python relaxed to <3.15; full test suite passes on 3.14 (the previous onnxruntime blocker is resolved). CI matrix and classifiers updated
- MIT License: LICENSE file added and declared in package metadata (License-Expression: MIT)
- Grouped --help: options are organized into panels (Output & Configuration / LLM Enhancement / OCR / Fetch & Conversion Backends / Batch Processing / Cache & Images / Logging & Info) via rich-click; lazy subcommand loading preserved so --help stays ~100ms
- Garbled-text detection: PDFs whose extracted text is unreadable (broken cmap/substitution ciphers) are detected via a CJK-safe vowel-ratio heuristic
- Scan/garbled advisory: converting a PDF with scanned-looking or garbled pages without --ocr now emits one consolidated warning naming the affected pages and suggesting --ocr
- Repeated header/footer suppression: running headers/footers (incl. "Page N of M" patterns) repeated across ≥60% of pages are stripped from PDF output, giving cleaner Markdown and fewer wasted LLM tokens; headings and tables are never touched, and documents under 4 pages are exempt
- VLM degeneration guard: vision/screenshot extraction results are checked for repetition loops (a known VLM-OCR failure mode); degenerate tails are truncated with a warning and never persisted to cache, so retries aren't poisoned
- HTML extraction quality (ported from Defuddle upstream): MathJax script[type="math/tex"] equations preserved as LaTeX; Wikipedia/MediaWiki MathML survives hidden-element removal; partial-selector clutter removal no longer deletes code blocks (<pre>-protection); anchor-wrapped headings unwrap cleanly; code-block language tags validated against an allowlist (no more ```codeblock); whitespace inside <pre> preserved
- Footnote engine (full Defuddle port): footnotes/citations across Wikipedia, arXiv, Substack, WordPress, Word/Google Docs exports, Tufte sidenotes and more are standardized and emitted as real Markdown footnotes ([^1] / [^1]: ...) with renumbering, duplicate-reference handling, multi-paragraph definitions, and back-link stripping; 15 ground-truth fixtures now match Defuddle's expected output
- Unified fetch strategy flag: new -s/--strategy auto|static|playwright|defuddle|jina|cloudflare; the five per-backend flags (--playwright, --defuddle, --static, --jina, --cloudflare) remain as deprecated aliases that print a migration notice
- Remote-fetch consent: URLs are no longer sent to third-party extraction services (defuddle.md, Jina, Cloudflare) without consent. fetch.remote_consent: ask|always|never (default ask: interactive runs prompt once per process; non-interactive/quiet runs skip remote and crawl locally); MARKITAI_NO_REMOTE_FETCH=1 forces never; explicit -s defuddle / -s jina counts as consent
- PDF hidden-text sanitization: invisible text (white-on-white, <2pt, zero-opacity, off-page), a prompt-injection vector for LLM pipelines, is detected; security.pdf_sanitize: off|warn|remove (default warn logs a consolidated advisory naming pages)
- Per-page OCR routing: --ocr on mixed digital/scanned documents keeps the native text layer for digital pages and only OCRs scanned/garbled ones (ocr.per_page_routing, default on)
- Conversion-quality benchmark harness: dev-only packages/markitai/benchmarks/ scores the HTML pipeline against the Defuddle ground-truth corpus (rapidfuzz block alignment + order score, marker-style); committed baseline: mean 91.04 over 83 fixtures
- Release automation: release-please drives versioning/changelog from conventional commits (release = merge the release PR; publishing chains via workflow dispatch); PR coverage comments via py-cov-action, no external service. Operational note: tag the 0.15.0 release manually first (or set bootstrap-sha) so release-please anchors correctly
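The repeated header/footer suppression described in the list above can be illustrated with a minimal sketch. This is not markitai's implementation: the function names, the digit-collapsing normalization, and the "starts with # or |" test for headings and tables are assumptions chosen for illustration; only the 60% threshold and the 4-page exemption come from the entry itself.

```python
import re
from collections import Counter
from typing import List


def normalize(line: str) -> str:
    """Collapse digit runs so 'Page 3 of 10' and 'Page 4 of 10' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(
    pages: List[List[str]], threshold: float = 0.6, min_pages: int = 4
) -> List[List[str]]:
    """Drop lines whose normalized form appears on >= threshold of pages."""
    if len(pages) < min_pages:
        return pages  # short documents are exempt
    counts: Counter = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page if line.strip()})
    repeated = {key for key, n in counts.items() if n / len(pages) >= threshold}

    def keep(line: str) -> bool:
        # Assumption: Markdown headings and table rows are recognised by prefix.
        if line.lstrip().startswith(("#", "|")):
            return True
        return normalize(line) not in repeated

    return [[line for line in page if keep(line)] for page in pages]
```

Counting each line once per page keeps a line that is merely repeated within a single page from being mistaken for a running header.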
Hunt round 5 (tweet pipeline root fix & flow polish)
- FxTwitter now actually serves default tweet fetches: the intercept only existed in the top-level PLAYWRIGHT dispatch branch — the default auto chain reached playwright through its own loop and skipped FxTwitter entirely, dropping users onto noisy DOM extraction. Fixed in the chain (with telemetry/return contract); regression tests pin the intercept
- X DOM extractor rebuilt for X's 2026 redesign: X removed every data-testid and moved to React+Tailwind markup, so the semantic tweet extractor never matched and the generic pipeline leaked avatars, cookie banners, stats bars, and blob: video links. The extractor now handles both the new (data-tweet-id, hover-card slots, permalink-text timestamps) and legacy markup, verified against a committed real-DOM fixture and live; the x.com domain profile's wait selector no longer burns a 10s timeout per tweet
- "All fetch strategies failed" is never empty: five silent skip paths (JS-detected static, consent-gated remote, missing playwright/browser, missing CF credentials) now record their reason; an all-skipped chain explains itself
- markitai init merges instead of dead-ending: with an existing config the wizard offers Update (default) / Overwrite / Keep. Update non-destructively appends newly detected providers; init -y applies it automatically and reports what changed. Auth login hints are config-aware ("adds it to your existing config" / "Already enabled in <path>"). Dependency + provider detection now run concurrently under a spinner (Gemini's userinfo call, up to 5s, no longer serializes the flow)
- Login failure cards: provider login failures render the status-card style with context-aware hints (never suggesting the command that just failed; install command first); gemini login no longer dumps a raw traceback
- mkai PyPI alias stub dropped: PyPI's name-similarity guard rejects mkai (existing mk-ai project); the same guard blocks would-be squatters, which was the stub's purpose. The mkai command itself still ships with markitai
Hunt round 4 (UX & quality polish)
- Tweet conversion at defuddle parity: the FxTwitter path (which serves default x.com fetches, with playwright as fallback) and the DOM extractor were both reworked: bold **Name @handle** · date author line, paragraphs preserved, t.co links expanded, video rendered as poster + link (was a broken mp4 embed), quoted tweets as blockquotes with author/date/media/permalink, author threads joined into the post body, card previews. Corpus mean 91.04 → 92.26
- Live progress feedback: long single-input conversions show a pure-ASCII stage-aware spinner on stderr (Fetching (static)… → Rendering (playwright)… → Enhancing with LLM…, bridged from fetch-stage logs); suppressed for pipes/--quiet/-v; stdout stays pure. Root cause of the "looks stuck" complaint: the old spinner machinery was constructed disabled in file-output mode
- markitai auth status cards: all four providers render a unified glyph card (✓/✗ login state, CLI/SDK presence, usage + next-step); bare markitai auth shows an all-provider overview; ChatGPT guidance now points at markitai auth chatgpt login (device-code flow verified live) instead of "pip install litellm"
- Fetch errors are never blank: exceptions with empty messages (e.g. httpx.ConnectError) now render their type via format_error_message across all fetch strategies
Hunt round 3 (release prep)
- .eml email support (native, zero deps): headers/body/attachments via stdlib email; HTML bodies go through the standard HTML pipeline, image attachments flow into the assets/vision pipeline, nested messages render quoted (depth 1); header values sanitized against injection. EML no longer delegates to kreuzberg
- HEIC/HEIF/AVIF input: new markitai[heif] extra (pillow-heif); 12-byte ftyp sniff, decode-to-PNG at the boundary with EXIF orientation applied, then the normal OCR/vision/compression pipeline, so iPhone photos just work
- Quality guardrails gate: benchmarks/guardrails.json pins a per-fixture minimum score (0.9 × current) plus corpus/local mean floors; --check fails CI (new ~1min job) when extraction quality regresses; --update-guardrails regenerates deliberately
- --config-json '<json>': inline config overrides for agents/CI, merged over the config file, under explicit CLI flags
- Subcommand help polish: all 26 subcommand helps now render rich panels with Examples; empty states got helpful hints (cache stats with no cache, silent init -y, config get unknown key → list hint); action commands print the natural next step
- mkai short command: ships with markitai as a second console script. (A separate PyPI alias package was evaluated and dropped: PyPI's name-similarity guard rejects mkai because mk-ai exists; the same guard equally prevents anyone else from squatting the name)
- kreuzberg floor raised to >=4.9.6: picks up the no-OCR-backend PDF fix (4.7.3), image-heavy-PDF hang fixes (4.9.x), PPTX slide-order fix (4.8.0). Note: kreuzberg >=4.8.0 is Elastic License 2.0 (optional extra; accepted)
- Cache-hit visibility: the Fetched via <strategy> line notes (cached); a cached defuddle result had masqueraded as the live default strategy (fresh fetches win with static)
- Dependency patches: litellm 1.90.3, rapidocr 3.9.1
Hunt round 2 (follow-up fixes)
- Playwright browser detection fixed: newer Playwright ships Google Chrome for Testing.app instead of Chromium.app, so the path-only check reported "browser not found" even after a successful install; detection now uses Playwright's own INSTALLATION_COMPLETE marker (bundle-name/version agnostic) with executable paths as fallback. The uv tool run --from 'markitai[all]' install hint was dropped (triggers a uv warning and resolves an ephemeral env whose Playwright version can mismatch)
- Remote-fetch consent is now lazy: the consent prompt fired before the chain ran, so even URLs satisfied by the local-first chain asked "Allow sending URLs to remote services?"; consent is now resolved only when a remote strategy is actually about to run, so local successes never prompt
- Config filename unified in all messages: hints/errors that said "in markitai.json" (doctor's LLM hint, the Cloudflare workflow error) now show the actually-loaded config path, matching the doctor header
- -h paragraph spacing normalized: rich-click renders \b blocks with two trailing blank lines but plain paragraphs with none; docstring now renders with exactly one blank line between all sections (TEXT_PARAGRAPH_LINEBREAKS)
- Dependency patch bumps: litellm 1.90.3, rapidocr 3.9.1
CLI & fetch polish (post-release hunt round)
- mkai short alias: installed alongside markitai (verified conflict-free on PyPI/homebrew/system)
- -b/--backend native|kreuzberg|cloudflare: file conversion backend is now its own orthogonal flag; --kreuzberg remains as a deprecated alias. -s is purely the URL fetch strategy
- Local-first auto chain: default order is now static → playwright → defuddle → jina → cloudflare (static's native extraction matches remote defuddle on the ground-truth corpus and beats it on CJK spacing); SPA/JS-heavy domains go straight to the browser
- markitai doctor first run 36s → ~5s, warm 1.0s → 0.33s: two root causes: (a) the RapidOCR check imported the real module, pulling in opencv's 119MB dylib whose one-time macOS dyld signature validation cost ~25s on a fresh install (now probed via package metadata only, no import); (b) an unconditional litellm import cost 0.55s even with no models configured (now deferred). Output normalized (consistent inline item format, single-blank-line sections, failure summary uses ✗ not ✓) and now shows the loaded config file path
- Actionable configuration errors: Cloudflare credentials, Playwright missing-browser, and Jina auth errors now include the concrete config file path, copy-pasteable markitai config set / env commands, and credential acquisition steps (token URL + required permissions)
- Jina refusal fallback: service refusals (e.g. github.com 451 anonymous block) no longer dump raw JSON; interactive runs are asked once per run whether to fall back to the auto chain (default yes); non-interactive runs fall back automatically with a warning
- Claude subscription detection fixed on macOS: Claude Code stores OAuth tokens in the Keychain, not ~/.claude/.credentials.json, so markitai reported "not authenticated" on logged-in machines and init silently skipped claude-agent. claude-agent/ models use the Claude subscription quota via the local CLI (no API key needed); markitai auth claude status now shows identity, plan, CLI/SDK state and a config snippet
- stdout no longer hard-wraps content: Rich's console.print wrapped output at terminal width, breaking long URLs mid-token in piped output; content is now written raw
- Help panels aligned: metavar column removed (appended to help text instead); deprecated aliases use terse uniform descriptions
- HTML extraction: GitHub repo pages now extract just the README (was: file-tree tables, About sidebar, star counts; ~950 junk words); frontmatter gains published date, full untruncated titles without site suffixes, and no longer emits homepage canonical_url on article pages; bilibili/Twitter-widget iframes survive as links (root cause: embed canonicalization ran after sanitize stripped iframes). Benchmark mean 91.86 → 92.24, embedded-videos fixture +31 points, zero regressions; local self-baseline fixtures (GitHub repo + CJK blog) added to the benchmark
- Xberg (kreuzberg successor): evaluated. PyPI xberg is currently a placeholder aliasing kreuzberg and real 1.0 wheels aren't published; kreuzberg v4 stays (LTS, maintained), with a documented migration checklist in the converter for when Xberg 1.0 ships
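The doctor speed-up above hinges on probing a dependency through package metadata rather than importing it. A minimal sketch of that idea with the standard library follows; the helper name installed_version is hypothetical, and how markitai actually wires this into its checks is not shown in this changelog.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution without importing it.

    Reading metadata only touches the .dist-info directory, so heavy native
    libraries pulled in by the package are never loaded.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A check built this way reports presence and version only; it cannot tell whether the package would import successfully, which is the trade-off for skipping the import cost.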
Changed (behavior)
- Mixing INPUT with a subcommand is now an error: markitai note.txt config list previously dropped note.txt silently; it now fails with guidance. A file named like a subcommand (config, doctor, ...) gets a stderr hint to use ./config
- -o out.md with a single file/URL writes that file: previously it silently created a directory named out.md; batch/directory input with a file-looking -o now errors clearly
- Diagnostics go to stderr: warnings/notices no longer pollute piped stdout output (markitai x --alt | pandoc receives pure markdown)
- Output naming replaces the extension: sample.pdf → sample.md (was sample.pdf.md). Colliding batch inputs (a.pdf + a.docx) and outputs that would overwrite the source keep the legacy <name>.<ext>.md scheme, per file. Re-running an old batch re-converts once under the new names
- stdout image links now survive the process: image.stdout_persist defaults to on (assets persisted under ~/.markitai/assets); absolute temp-dir links from the PDF pipeline are normalized; opting out prints an ephemeral-links warning to stderr
- Reports are batch-only by default: single-file/single-URL conversions no longer write .markitai/reports/ unless output.report = true (tri-state; false disables even for batches)
- Repo hygiene: AI-session artifacts (.claude/memory, docs/superpowers/ working plans containing local dev paths) removed from the repository and gitignored
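The output-naming rule in the list above (replace the extension, but fall back to the legacy scheme on collisions) can be sketched as follows. The function name is hypothetical, and treating any .md input as "would overwrite the source" is a simplifying assumption; the real check presumably compares resolved paths.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List


def plan_output_names(inputs: List[str]) -> Dict[str, str]:
    """Map each input to an output file name, per file."""
    stems = Counter(Path(p).stem for p in inputs)
    names: Dict[str, str] = {}
    for p in inputs:
        path = Path(p)
        collides = stems[path.stem] > 1            # e.g. a.pdf + a.docx
        overwrites = path.suffix.lower() == ".md"  # assumption: same directory
        if collides or overwrites:
            names[p] = f"{path.name}.md"   # legacy <name>.<ext>.md
        else:
            names[p] = f"{path.stem}.md"   # new scheme: extension replaced
    return names
```

Deciding per file means one collision in a batch does not force the legacy names onto unrelated inputs.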
Fixed
Batch & Workflow (correctness of success/failure reporting)
- LLM failures no longer masquerade as success: LLM API failures previously returned success-shaped results: batch marked files COMPLETED with cache_hit=true pointing at .llm.md files that were never written, and --resume skipped them. Failures now propagate, the file is marked FAILED, and the base-markdown fallback path actually runs
- Resume re-processes interrupted files: files that were IN_PROGRESS at crash time were silently dropped on --resume (never re-queued, counted in no summary bucket); they are now converted to FAILED on state load and re-processed
- Per-file LLM cost attribution: usage contexts are cleared after each file, fixing double-counted costs for same-basename files in different directories
- Base64 image index desync: an undecodable data URI shifted every subsequent image reference one position, attaching wrong images to wrong locations; extraction and replacement now apply the same skip rule
- Path traversal via custom output names: a .urls entry with a crafted output name (../../x or absolute path) could write converted output outside the output directory; custom names are now sanitized like auto-derived ones
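The path-traversal fix above comes down to never letting a user-supplied name contribute directory components. A minimal sketch under stated assumptions: the function name, the character allowlist, and the "output" fallback name are all hypothetical, not markitai's actual sanitizer.

```python
import re
from pathlib import Path, PurePosixPath


def safe_output_path(output_dir: Path, custom_name: str) -> Path:
    """Resolve a custom output name to a path guaranteed to sit in output_dir."""
    root = output_dir.resolve()
    # Keep only the final component; treat backslashes as separators too.
    name = PurePosixPath(custom_name.replace("\\", "/")).name
    # Replace anything outside a conservative allowlist, drop leading dots.
    name = re.sub(r"[^\w.\-]", "_", name).lstrip(".")
    if not name:
        name = "output"  # assumption: fallback for names like ".." or ""
    candidate = root / name
    # Defence in depth: refuse anything that still escapes the root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"unsafe output name: {custom_name!r}")
    return candidate
```

With this shape, ../../x and /etc/passwd both land inside the output directory as x and passwd rather than escaping it.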
Fetch & Cache
- AUTO-strategy cache revalidation: HTTP validators (ETag/Last-Modified) were discarded on the default fetch path, so cached pages were served stale forever; validators are now stored and conditional revalidation works as designed
- Playwright context leaks: cookie-validation errors leaked browser contexts; a new_page() failure in persistent mode raised UnboundLocalError masking the real error; concurrent same-domain fetches could double-create or double-close cached contexts (now lock-protected)
- HTTP client cleanup: old clients are no longer closed via unreferenced fire-and-forget tasks that asyncio could garbage-collect before running
- URL list robustness: a null or non-string url in a JSON URL list crashed the whole batch; it is now skipped with a warning
- Proxy auto-detection: SOCKS-only ports (Tor 9050, SOCKS5 1080, V2Ray 10808) were mislabeled as HTTP proxies and are removed from detection; detected proxies now log at WARNING
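The cache revalidation entry above relies on standard HTTP conditional requests. The sketch below shows only the header logic; the cache-entry shape (a dict with etag, last_modified, and body keys) and both function names are assumptions made for illustration, and no network client is involved.

```python
from typing import Dict, Optional


def conditional_headers(cached: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Build revalidation headers from validators stored with a cache entry."""
    headers: Dict[str, str] = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def resolve_body(status: int, fresh_body: Optional[str], cached_body: str) -> str:
    """On 304 Not Modified reuse the cached body; otherwise take the new one."""
    if status == 304:
        return cached_body
    return fresh_body if fresh_body is not None else cached_body
```

If the validators are discarded when the entry is written, conditional_headers returns an empty dict and the server can never answer 304, which matches the stale-forever behaviour the entry describes.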
LLM & Providers
- Batch vision deadlock: language rewrites re-acquired the already-held concurrency semaphore, so with llm.concurrency=1 a single rewrite hung the whole pipeline; rewrites now run after the semaphore is released
- Vision cache poisoning: batch results were zipped positionally with requested images; a skipped/reordered model response persisted the wrong analysis under the wrong image's content hash across sessions. Results are now aligned by the echoed image_index, and ambiguous batches skip cache persistence
- Copilot concurrent temp-file races: the singleton provider's shared temp-file list let one request's cleanup delete another in-flight request's image attachments; tracking is now per-request
- gemini-cli rate-limit failover: a 429 raised a non-retryable error that aborted the request instead of retrying on another pool model; it now raises litellm's retryable RateLimitError (cooldown still recorded)
- Event-loop stalls: blocking token refreshes (ChatGPT/gemini-cli auth) and CPU-heavy native HTML extraction now run in worker threads instead of freezing all concurrent tasks
- Retry backoff released: exponential-backoff sleeps no longer hold a concurrency-semaphore slot, so a rate-limit burst can't collapse throughput of healthy models
- Dynamic max_tokens: the retry path now takes the minimum output limit across the model pool (matching instructor call sites) instead of the top-weight model only
- Screenshot extraction cache: keyed by content fingerprint instead of filename, so re-fetches of changed pages aren't served stale extractions
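The vision cache poisoning fix in the list above replaces positional zipping with alignment on the echoed image_index. A minimal sketch of that alignment, assuming each result is a dict carrying an integer image_index; the function name and the exact ambiguity rules are illustrative guesses, not the project's code.

```python
from typing import Dict, List, Tuple


def align_results(num_images: int, results: List[dict]) -> Tuple[Dict[int, dict], bool]:
    """Return (results keyed by image index, whether caching is safe)."""
    aligned: Dict[int, dict] = {}
    ambiguous = len(results) != num_images  # skipped or extra responses
    for result in results:
        idx = result.get("image_index")
        valid = isinstance(idx, int) and not isinstance(idx, bool)
        if not valid or not 0 <= idx < num_images or idx in aligned:
            ambiguous = True  # missing, out-of-range, or duplicate index
            continue
        aligned[idx] = result
    return aligned, not ambiguous
```

Reordered responses still map to the right image, while a batch with a missing or duplicated index is usable for the current run but flagged as unsafe to persist.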
Images & Conversion
- EXIF orientation: rotation is baked in before re-encoding on all compression paths (OpenCV and Pillow); phone photos no longer come out sideways
- LA-mode transparency: grayscale+alpha images composite onto white when converting to JPEG instead of dropping the alpha channel
- Uncompressed image naming: with image.compress = false, original bytes are now saved under their actual format's extension/MIME instead of the configured output format's
- EMF/WMF conversion failures: unconverted EMF bytes are no longer mislabeled as PNG; failures now log a visible warning
- OCR engine consistency: a failed engine rebuild no longer leaves the old engine permanently served under the new config's fingerprint
- Temp directory leaks: converter paths that render page/slide images without an output directory now clean up their temp directories at process exit
Configuration & CLI
- Config editor validation: markitai config edit validated nothing before saving, so an out-of-range value bricked every subsequent CLI invocation (including the editor itself). The editor now validates before save, and config loading reports a clear actionable error instead of a raw traceback
- Symlink safety check: the nested-symlink branch inspected the resolved path (which by definition has no symlinks) and never fired; it now walks the original path's ancestors
- Config bounds: llm.concurrency requires >=1 (a persisted 0 hung every LLM task forever); router retry/timeout fields gained sane lower bounds
- config set type coercion: values are coerced by the target field's declared type; string fields keep leading zeros (API keys), bool fields accept 1/0
- config set bracket notation: llm.model_list[0].litellm_params.weight now works for set (previously only get)
- JSON log format: log lines are built with proper JSON serialization; messages containing quotes/newlines no longer produce invalid JSON
- cache clear prompt: shows the actual configured cache directory instead of a hardcoded path
- config get null handling: existing-but-null fields print null and exit 0 instead of "Key not found" exit 1
- Missing config path visibility: a nonexistent MARKITAI_CONFIG/explicit config path now warns (and ~ is expanded) instead of silently running with defaults
- Loguru misuse: printf-style logging calls that silently dropped the URL and traceback now use loguru idioms
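The JSON log format fix above is the difference between interpolating a message into a JSON-shaped string and serializing it. A small sketch; the field names level and message are assumptions, not markitai's actual log schema.

```python
import json


def format_log_line(level: str, message: str, **fields: object) -> str:
    """Serialize one log record as a single line of valid JSON."""
    record = {"level": level, "message": message, **fields}
    # json.dumps escapes quotes and newlines, so the line stays parseable.
    return json.dumps(record, ensure_ascii=False)


def format_log_line_naive(level: str, message: str) -> str:
    """The broken pattern: string templating breaks on quotes and newlines."""
    return '{"level": "%s", "message": "%s"}' % (level, message)
```

A message such as said "hi" followed by a newline round-trips through json.loads with the first function and raises a decode error with the second.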
Post-review hardening
- Parallel LLM task isolation: a document-processing failure no longer leaves the sibling image-analysis task running detached (and vice versa); image-analysis failure now degrades gracefully (.llm.md kept without alt text) instead of failing the whole file
- Usage cleanup on vision fallback paths: partial LLM usage is cleared when vision enhancement fails, so it isn't attributed to the next file
- markitai doctor exit code: exits 1 when required dependencies are missing (previously always 0, and the summary claimed success); failure summary now reports the missing count
- Symlink check refinement: root-owned symlinks on POSIX (e.g. /var/run -> /run) are treated as OS artifacts instead of raising false positives
Changed
- Dependency refresh: all dependencies upgraded ~4 months forward, including litellm 1.82.6 → 1.90.x, opencv-python 4.x → 5.x, starlette → 1.x, claude-agent-sdk 0.1 → 0.2, github-copilot-sdk 0.2 → 1.x, instructor 1.15, playwright 1.61, pymupdf 1.28. Test suite fully green on the new set. (markitdown stays at 0.1.5 because 0.1.6 requires a pre-release azure dependency; rich stays at 14.x, capped by instructor <15)
- Version single-sourcing: the package version is now read from src/markitai/__init__.py at build time (dynamic = ["version"]); no more triple-bump
- Release guard: publish.yml verifies the release tag matches the built version before publishing
- Dependabot on uv ecosystem: lockfile-aware dependency PRs (previously the pip ecosystem produced PRs that always failed uv sync --frozen)
- README: rewritten with install instructions (uv tool/pipx), extras table, and quick start; this is also the PyPI long description
- CONTRIBUTING.md: new contributor guide (dev setup, commands, conventions, release steps)
- pre-commit: pyright moved from per-commit to pre-push (full-project check was tens of seconds per commit)
- .env.example: bilingual (EN/zh) comments
- bs4 4.15 compatibility: attrs-only find/find_all calls pass an explicit tag matcher; NavigableString imported from bs4.element (upstream __all__ regression)
- Ruff target aligned to floor: target-version = "py311" (was py313, which could suggest syntax breaking 3.11 support)
- Modernized asyncio idioms: asyncio.get_event_loop() → asyncio.get_running_loop() in async code
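The asyncio idiom change in the last entry above can be shown in a few lines. This sketch also offloads CPU-bound work to a worker thread; the hashing function is a stand-in chosen for illustration and is unrelated to markitai's code.

```python
import asyncio
import hashlib


def cpu_heavy(data: bytes) -> str:
    """Stand-in for blocking or CPU-bound work."""
    return hashlib.sha256(data).hexdigest()


async def digest(data: bytes) -> str:
    # get_running_loop() raises RuntimeError outside a running loop instead of
    # silently creating one, which is why it is preferred inside coroutines.
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so other tasks proceed.
    return await loop.run_in_executor(None, cpu_heavy, data)


if __name__ == "__main__":
    print(asyncio.run(digest(b"markitai")))
```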
Security
- litellm supply-chain pin lifted: litellm>=1.83.0 replaces the <1.82.7 emergency pin. The March 2026 compromise affected only 1.82.7/1.82.8, upstream audited 1.78.0–1.82.6 clean, and releases are signed since 1.83.0
[0.14.0] - 2026-03-25
Added
- Steam News Extractor: Site-specific extractor for store.steampowered.com/news/ pages that parses BBCode announcements from JSON data attributes
- MathML-to-LaTeX Converter: Structural MathML conversion for pages without LaTeX annotations (KaTeX/MathJax), handling msup, msub, mfrac, msqrt, mover, munder, mtable, and 70+ Unicode math symbol replacements
- LibreOffice Functional Check: is_libreoffice_functional() verifies LibreOffice can actually convert files, not just that the binary exists
- CSS Modules Hidden Detection: Detect hashed hidden class names like isHidden-vzcyV0 from CSS-in-JS frameworks
Fixed
- Math Content Extraction: Body fallback now triggers when all retry levels fail to reach the sparse threshold, fixing KaTeX pages where scoring selected a single math div instead of the full article
- Integration Test Reliability: Batch test fixture filters to files with registered converters; LibreOffice tests skip properly when installation is non-functional
- CLI Preset Validation: Unknown presets now show available options and exit with error instead of silently continuing
- BBCode XSS Prevention: Raw HTML in Steam BBCode content is escaped before conversion to prevent injection
Security
- litellm Supply-Chain Pin: Pin litellm to <1.82.7 to exclude compromised versions
Changed
- CI Resilience: Windows LibreOffice install retries up to 3 times with backoff to handle transient Chocolatey failures
[0.13.1] - 2026-03-23
Added
- Config Editor Redesign: Replace questionary select with a custom prompt_toolkit UI featuring a visible search box with frame, fuzzy filtering, scrollable list with cursor, and "↑ N more above / ↓ N more below" scroll indicators
- Fuzzy Match Search: Case-insensitive fuzzy matching for config settings (characters in order, not necessarily consecutive) with scoring that rewards consecutive and early matches
- Config Field Descriptions: Add Field(description=...) to 66 Pydantic config fields, displayed inline in the config editor
- In-Place UI Refresh: Use ANSI cursor position queries to erase only the lines occupied by each UI component, preserving terminal history
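The fuzzy match search described in the list above (characters in order, not necessarily consecutive, with bonuses for consecutive and early matches) can be sketched like this. The specific weights and the greedy leftmost matching are assumptions; the changelog does not state the actual scoring formula.

```python
from typing import Optional


def fuzzy_score(query: str, candidate: str) -> Optional[int]:
    """Return a score if query is a case-insensitive subsequence, else None."""
    q, c = query.lower(), candidate.lower()
    score = 0
    prev = -1  # index of the previously matched character
    for i, ch in enumerate(q):
        idx = c.find(ch, prev + 1)
        if idx == -1:
            return None  # characters must appear in order
        score += 1
        if i == 0:
            score += max(0, 3 - idx)  # assumed bonus for an early first match
        elif idx == prev + 1:
            score += 2  # assumed bonus for consecutive characters
        prev = idx
    return score
```

Sorting candidates by this score in descending order and dropping the None results gives the filtered list a picker would display.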
Fixed
- Esc Key Support: Inject Esc key bindings into all questionary prompts (text, select) via prompt_toolkit merge_key_bindings; questionary 2.1.1 select() only binds Ctrl+C/Ctrl+Q natively
- Bool Editor: Replace questionary.confirm() with questionary.select() using Choice(value=True/False) for consistent Esc support
- Search + j/k Conflict: Disable use_jk_keys when use_search_filter is enabled (questionary 2.1.1 raises ValueError otherwise)
- Literal Type Preservation: Use Choice(value=original) to preserve original typed values (int, str) when editing Literal fields, instead of converting to string
0.12.1 - 2026-03-22
Added
- Stdout Terminal Image Display: Inline image rendering for Kitty/iTerm2 terminals in stdout mode, with three-tier resolution cascade (terminal protocol → persistent asset store → markdown placeholder)
- Content-Addressed Asset Store: Persistent image storage with symlink refs at ~/.markitai/assets/, enabling stdout image persistence across sessions
- Terminal Image Protocol Detection: Auto-detect Kitty and iTerm2 graphics protocols for native inline image display
- stdout_persist Config Fields: New image.stdout_persist and image.stdout_persist_dir settings for controlling stdout image persistence
- External Image Inline Display: Download and inline-display external images in single URL stdout mode (image.stdout_fetch_external)
- User Journey Documentation: Comprehensive Chinese user journey document covering all features and workflows
Fixed
- Stdout Mode LLM Errors: Make LLM errors visible in quiet/stdout mode via ERROR-level log handler
- LLM Warning Implementation: Address third-party review findings on LLM warning display
- Kitty Graphics Protocol: Convert images to PNG for Kitty protocol compatibility
- Stdout Image Handling: Resolve three bugs in stdout image asset resolution and display
- Cross-Platform Tests: Fix Windows test failures and missing Playwright browser handling
- markitai init Duplicate Routes: Deduplicate overlapping default provider entries in generated configs, preferring Claude CLI over Anthropic API and direct Gemini API over OpenRouter Gemini
Changed
- Stdout Asset Resolution: Rename strip_asset_references to resolve_asset_references with three-tier cascade logic
- Terminal Image Rendering: Harden rendering pipeline and improve test coverage
- markitai init Default Config: Stop writing redundant default image.compress and image.quality settings into newly generated configs
0.12.0 - 2026-03-20
Added
- Native HTML Extraction Parity: Introduce resolver-based extraction pipeline with typed extraction results, frontmatter builder, quality profiles, and semantic models for threaded pages
- Structured Site Extractors: Rebuild threaded extraction on shared abstractions and add native resolver coverage for GitHub Discussions, X threads, and YouTube pages
- Webextract Quality Enhancements: Add noise removal, enhanced scoring, standardization, multi-level retry, content patterns, heading anchors, callouts, srcset optimization, and code language detection
- CLI Force Flags: Add --static to force static HTTP with native webextract and --kreuzberg to force kreuzberg conversion for all formats
- Async Enrichment Pipeline: Add policy-aware enrichers and thread inclusion rules for structured extraction
- Language-Aware Vision Retry: Retry and rewrite image analysis outputs in the document language
Fixed
- URL Stdout Fallback: URL mode without -o now writes to stdout instead of erroring
- Concurrency Safety: Make ContentCache, _image_cache, model cooldown tracking, and io_semaphore thread-safe and reuse the cached semaphore instance
- Atomic Writes: Use atomic write patterns for ConfigManager.save() and async byte writes
- Resource Cleanup: Reset semaphores and proxy-bypass state in shared-client cleanup
- Observability: Add debug logging for previously silent exception handlers
- Webextract Regressions: Fix None tag.attrs, selector conflicts, math protection, callout/task-list/table formatting, X.com Playwright crash, tweet noise, and resolver acceptance parity
- Tooling Hygiene: Resolve remaining Ruff, Pyright, Pytest, and Bandit issues and close low-priority parity coverage gaps
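The atomic write pattern mentioned in the list above usually means writing to a temporary file in the same directory and renaming it over the target. A minimal synchronous sketch, assuming UTF-8 text; the function name is hypothetical and this is not ConfigManager.save() itself.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp, path)
    except BaseException:
        # Clean up the partial temp file and re-raise the original error.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A crash mid-write leaves at worst a stray .tmp file next to an intact original, rather than a truncated config.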
Changed
- HTML Conversion Path: Route HTML files through the native webextract pipeline by default
- Fetch Internals: Split fetch.py into smaller modules and decompose fetch_url() into composable sub-functions
- CLI Logging UX: Improve batch progress reporting and quiet/verbose URL logs
- Release Cleanup: Update dependencies, CI and website docs, model metadata, and clean up project structure for the 0.12.0 release
Removed
- Obsolete Project Docs: Remove outdated root docs, archived plans, and historical reference material during project cleanup
0.11.2 - 2026-03-14
Fixed
- Windows Compatibility: Add Windows GlobalMemoryStatusEx RAM detection for proper heavy task semaphore sizing
- Lazy Directory Creation: Defer ~/.markitai/ directory creation from import-time to first write; prevents side effects when the tool is only imported or used read-only
  - SPADomainCache: mkdir moved from __init__ to _save()
  - SQLiteCache: mkdir moved from __init__ to _get_connection() with _dir_ensured flag to avoid repeated syscalls
- Default Output/Log Dir: DEFAULT_OUTPUT_DIR and DEFAULT_LOG_DIR now default to None instead of hardcoded paths; output directory must be explicitly specified via CLI -o or config file
- Pyright Warnings: Eliminate all 27 pyright warnings: suppress reportUnsupportedDunderAll for PEP 562 lazy-loading modules, fix curl_cffi ProxySpec TypedDict type mismatch
- Schema Sync: Update config.schema.json to match new OutputConfig.dir and LogConfig.dir nullable types
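The lazy directory creation fix above moves mkdir from construction to the first write and remembers that it has run. A minimal sketch of that shape; the class is a toy file-backed cache invented for illustration, and only the _dir_ensured flag name is taken from the entry.

```python
from pathlib import Path
from typing import Optional


class LazyDirCache:
    """Toy cache that creates its directory only when something is written."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._dir_ensured = False  # no filesystem side effects at construction

    def _ensure_dir(self) -> None:
        if not self._dir_ensured:
            self.root.mkdir(parents=True, exist_ok=True)
            self._dir_ensured = True  # skip the syscall on later writes

    def save(self, key: str, value: str) -> None:
        self._ensure_dir()
        (self.root / key).write_text(value, encoding="utf-8")

    def load(self, key: str) -> Optional[str]:
        # Reads never create the directory.
        target = self.root / key
        return target.read_text(encoding="utf-8") if target.is_file() else None
```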
0.11.1 - 2026-03-14
Added
- Interactive Pure Mode: Add pure mode option to interactive CLI wizard
Fixed
- Pure Mode Vision Bypass: --pure now correctly skips screenshot-only and vision enhancement paths, falling through to text-only LLM processing
- Pure Mode Warning False Positive: --pure --screenshot-only no longer warns about --screenshot being ignored
- URL Content Validation: Lower too_short threshold from 100 to 30 characters; minimal landing pages were incorrectly rejected after stripping markdown syntax
- Type Safety: Fix merge_llm_usage parameter type to accept LLMUsageByModel (pyright warning)
- Dead Code: Remove unused _format_standalone_image_markdown alias
Changed
- CI: Upgrade GitHub Actions to Node.js 24 compatible versions
0.11.0 - 2026-03-13
Added
- Pure Mode (`--pure`): Full implementation of transparent LLM pass-through mode: text cleaning only, no frontmatter generation or post-processing
- Pure Mode Decoupled from LLM: `--pure` no longer implies `--llm`; `--pure` alone writes raw markdown without frontmatter, `--pure --llm` sends content through LLM cleaning only
- Image Vision in Pure Mode: `--llm --pure` with image inputs routes to Vision analysis path (`process_image_with_vision_pure`)
- `--keep-base` CLI Option: Explicitly write base `.md` even in LLM mode (default: skip base `.md` when LLM is enabled)
- Image-Only Format Handling: Skip image-only formats (PNG, JPG, etc.) in non-LLM/non-OCR mode with clear warning
- LLM Fallback: Write `.md` as fallback when LLM processing fails
- Batch Skip Summary: Group skipped files by reason with example filenames in batch summary
- Pure Mode Warning: Warn when `--pure` silently overrides `--alt`/`--desc`/`--screenshot`
- Mode-Specific Cleaner Prompt: `{mode_rules}` template variable in cleaner prompt; standard mode gets image placeholder rules, pure mode gets YAML frontmatter preservation rules
Fixed
- URL Processors: Respect `--pure`/`--llm`/`--keep-base` flags for base `.md` output in both single and batch URL processing
- Pure Mode Frontmatter: `process_with_llm` uses `clean_document_pure()` instead of `process_document()` in pure mode, preventing LLM-generated frontmatter (description, tags, etc.)
- Source Frontmatter Reconstruction: Reconstruct original YAML frontmatter from defuddle metadata before sending to LLM in pure mode
- Vision Prompt Drift: Add placeholder REMINDER to vision prompts to reduce LLM drift on `__MARKITAI_IMG_N__` placeholders
- Stabilization Dedup: Deduplicate stabilization calls and add `paged_stabilized` guard
- Vision JSON Mode: Fix wrong message index in vision `json_mode` and race condition in parallel gather
- Misc Fixes: Frontmatter regex, env variable quoting, Ctrl+C handling, hardcoded weight, docstring corrections
- SVG as Image-Only: Treat SVG as image-only format in batch mode
Changed
- Output Strategy: LLM mode skips writing base `.md` by default (use `--keep-base` to override)
- Test Performance: Optimize test suite speed (~70s → ~30s)
0.10.0 - 2026-03-12
Added
- Auto-detect LLM Providers: When no `markitai.json` config exists, automatically detect available providers from environment variables and authenticated CLI tools (Claude CLI, Copilot CLI, Gemini CLI, ChatGPT OAuth)
- Shared Provider Detection: Extract provider detection into `cli/providers_detect.py` shared module for reuse across interactive and non-interactive modes
Changed
- Interactive Mode UX: Separate OCR and screenshots from LLM features into independent "Additional options" prompt, since they are local processing capabilities (RapidOCR, Playwright) that don't require LLM
- Feature Display: Unified `build_feature_str()` in `ui.py` separates LLM features from local features with `|` delimiter (e.g., `LLM alt desc | OCR screenshot`)
- Interactive Mode Flow: Show configured models after user confirms LLM enablement, not before; warn when no provider detected
- Dependencies: Raise minimum constraints to match tested versions (pymupdf4llm >=1.27.2, litellm >=1.82.0, pydantic >=2.12.0, pytest >=9.0.0, ruff >=0.15.0)
- CLI Flags: `-v` is now `--verbose` (was `--version`), `-V` is now `--version`
Fixed
- Image Alt Text Language: Strip YAML frontmatter before extracting document context for image analysis, so alt text matches the document's actual language instead of defaulting to English
- Interactive Provider Display: Show actual configured models from config file instead of auto-detected provider name
- URL Processor Feature Display: Add missing OCR to URL processor dry-run features list
- Cold Startup Performance: Lazy imports in `cli/`, `processors/`, and `workflow/__init__.py` reduce cold startup from ~5s to ~0.3s
Removed
- Language Field: Remove LLM-generated `language` field from Frontmatter model; LLM should only generate `description` and `tags`, not infer extra metadata
0.9.2 - 2026-03-11
Fixed
- Copilot/Claude Login: Revert subprocess output interception for copilot/claude-agent login — always use inherited stdio so the CLI sees a real TTY, fixing credential storage failures
- Login Output Display: Detect URL and device code on the same line (copilot outputs both together); track externally-printed lines for clean erasure after login
- Error Message Clarity: Fix `format_error_message` following `__context__` (implicit exception chain) to wrapper exceptions like tenacity `RetryError`, replacing informative provider errors with opaque `<Future at 0x...>` messages in logs; now only follows `__cause__` (explicit `raise X from Y`)
- Error Message Consistency: Use `format_error_message` in CLI catch-all handlers (`file.py`, `workflow/core.py`) to prevent opaque chained exception messages reaching users
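The difference between the two chain attributes can be shown with a small sketch. The helper `describe_error` below is hypothetical and is not the project's `format_error_message`; it only demonstrates walking `__cause__` while ignoring `__context__`.

```python
def describe_error(exc: BaseException) -> str:
    """Follow explicit `raise X from Y` links only and describe the innermost cause."""
    seen = {id(exc)}
    current = exc
    # __cause__ is set only by `raise ... from ...`; __context__ is set implicitly
    # whenever an exception is raised inside an except block, so it is not followed.
    while current.__cause__ is not None and id(current.__cause__) not in seen:
        current = current.__cause__
        seen.add(id(current))
    return f"{type(current).__name__}: {current}"
```

An exception raised inside an `except` block without `from` keeps its own message, whereas an explicit `raise RuntimeError(...) from inner` is reported as `inner`.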
Added
- `SubprocessInterceptor` URL+code same-line formatting for copilot device code flow
- `OutputManager.track_external_lines()` for tracking terminal output from inherited-stdio subprocesses
0.9.1 - 2026-03-09
Fixed
- Provider Auth Preflight: Add `can_attempt_login()` guard to skip login prompt when provider SDK is missing; fix Rich markup swallowing `[gemini-cli]` via `escape()`; fix "Login failed: Login failed:" duplication
- Install Scripts Extras Parsing: Fix greedy regex (`\[.*\]` → `\[[^]]*\]`) that captured TOML outer brackets, corrupting extras names like `gemini-cli}]`
- Install Scripts Resilience: Progressive fallback when full extras install fails (retry without SDK-dependent extras); fix `set -e` silent exit on `uv tool install` failure; fix PowerShell 5.x `Join-Path` 3-arg incompatibility
- Install Scripts Extras Strategy: Merge-based finalize (no longer replaces manually tracked extras); generic receipt parsing (future-proof for new extras)
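The greedy-regex fix can be reproduced in a few lines. The sample receipt line and the helper `extract_extras` are assumptions made for illustration; the real install scripts are shell and PowerShell, and only the two bracket patterns come from the entry above.

```python
import re

# Illustrative TOML-like line with nested brackets (assumed shape, not the real receipt).
LINE = 'requirements = [{ name = "markitai", extras = ["gemini-cli"] }]'

GREEDY = re.compile(r"extras = \[(.*)\]")      # runs on to the LAST closing bracket
BOUNDED = re.compile(r"extras = \[([^]]*)\]")  # stops at the FIRST closing bracket


def extract_extras(pattern: re.Pattern[str], line: str) -> list[str]:
    """Return the comma-separated names captured between the brackets."""
    match = pattern.search(line)
    if match is None:
        return []
    return [part.strip().strip('"') for part in match.group(1).split(",") if part.strip()]


greedy_result = extract_extras(GREEDY, LINE)
bounded_result = extract_extras(BOUNDED, LINE)
```

The bounded pattern yields the clean name, while the greedy one drags the outer `] }` into the captured text.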
Added
- `markitai doctor --suggest-extras` as single source of truth for install scripts to query recommended extras
- `can_attempt_login()` provider guard with `get_auth_resolution_hint()` fallback messages
- i18n key `not_found` for zh-CN and en in both setup scripts
0.9.0 - 2026-03-09
Added
- Fetch Strategy Priority: Configurable global and per-domain strategy ordering via `strategy_priority` in `policy` and `domain_profiles`
- Domain/IP Exemption: `local_only_patterns` config field restricts specified domains/IPs to local-only strategies (static, playwright); supports exact domain, suffix (`.internal.com`), wildcard (`*.internal.com`), IP, and CIDR notation (`10.0.0.0/8`, `fd00::/8`)
- NO_PROXY Integration: `inherit_no_proxy` (default: true) automatically merges `NO_PROXY` environment variable patterns into local-only exemptions
- Fetch Security Feature: README documentation for the new information security compliance capabilities
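One possible way to match a host against the pattern kinds listed for `local_only_patterns` is sketched below. The function `is_local_only` is hypothetical, and the exact semantics are assumptions: here a `.suffix` pattern also matches the bare domain, while a `*.` wildcard matches subdomains only.

```python
import ipaddress


def is_local_only(host: str, patterns: list[str]) -> bool:
    """Return True if host matches any exact, suffix, wildcard, IP, or CIDR pattern."""
    host = host.lower().strip("[]")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None  # not an IP literal, treat as a domain name

    for raw in patterns:
        pattern = raw.lower()
        if addr is not None:
            # IP hosts are compared against IP and CIDR patterns only.
            try:
                if addr in ipaddress.ip_network(pattern, strict=False):
                    return True
            except ValueError:
                pass  # pattern is a domain, not a network
            continue
        if pattern.startswith("*."):
            if host.endswith(pattern[1:]):
                return True
        elif pattern.startswith("."):
            if host.endswith(pattern) or host == pattern[1:]:
                return True
        elif host == pattern:
            return True
    return False
```

A single IP pattern is handled by the same branch as CIDR, since `ip_network` treats a bare address as a one-host network.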
Fixed
- LLM Language Consistency: Strengthened 5 prompt templates to prevent language translation when fetching mixed-language content (e.g., English UI + Chinese body) — LLM now determines output language from body text, not UI elements
0.8.1 - 2026-03-06
Added
- Defuddle Fetch Strategy: New `defuddle` strategy (`GET https://defuddle.md/<url>`) as top-priority URL fetch method: free, no auth, returns clean Markdown with YAML frontmatter (title, author, published, description, word_count, domain)
- Aggressive Strategy Ordering: Default ordering changed to `defuddle → jina → static → playwright → cloudflare` (both default and SPA scenarios)
- CLI `--defuddle` Flag: Force defuddle-only URL fetching (mutually exclusive with `--playwright`, `--jina`, `--cloudflare`)
- DefuddleConfig: Configurable timeout and RPM rate limiting (conservative defaults for undocumented API limits)
Changed
- FetchPolicyEngine: Simplified ordering logic: removed `has_jina_key` branching; defuddle+jina always first
- max_strategy_hops: Default increased from 4 to 5 to accommodate the new strategy
0.8.0 - 2026-03-06
Added
- Extended Format Support: 20+ new file formats via markitdown and kreuzberg converters
- Markitdown-based: HTML/HTM/XHTML, CSV, EPUB, MSG, IPYNB (Jupyter Notebook), Apple Numbers
- Kreuzberg-based (optional dependency): TSV, XML, ODS, ODT, SVG, RTF, RST, ORG, TEX, EML
- Kreuzberg is a pure Rust wheel; install with `uv pip install markitai[kreuzberg]`
- Extended Image Support: GIF, BMP, TIFF now supported by ImageConverter; BMP/TIFF auto-converted to PNG for LLM vision APIs
- LLM Vision Format Helpers: `is_llm_supported_image()`, `get_llm_effective_mime()` in `utils/mime.py` for transparent BMP/TIFF → PNG handling
Fixed
- Claude Agent SDK v0.1.46 compatibility: Removed deprecated `allow_dangerously_skip_permissions` parameter (`permission_mode="bypassPermissions"` is sufficient)
- i18n test isolation: Fixed global state leak in `test_i18n.py` causing 3 integration tests to fail when run in full suite
- Import-time log leakage: Kreuzberg registration logs changed from `logger.debug` to `logger.trace` to prevent terminal noise before CLI log setup
Changed
- Converter registry: New `FileFormat` enum members for all added formats; kreuzberg registers as gap-filler (only for formats without native converters)
- Test fixtures: Renamed to consistent `sample.*` naming convention; added fixtures for all new formats; removed orphaned `sample.mobi`
- Markitdown lazy init: `MarkItDown()` in `markitdown_ext.py` now initialized on first use instead of import time
0.7.0 - 2026-03-05
Added
- ChatGPT Provider (`chatgpt/`): Subscription-based provider using ChatGPT OAuth Device Code Flow and Responses API. No extra SDK required; uses LiteLLM's built-in authenticator. Models: `chatgpt/gpt-5.2`, `chatgpt/codex-mini`, etc.
- Gemini CLI Provider (`gemini-cli/`): Uses Google's Gemini CLI OAuth credentials (`~/.gemini/oauth_creds.json`) with automatic token refresh. Optional SDK: `uv add markitai[gemini-cli]`. Models: `gemini-cli/gemini-2.5-pro`, `gemini-cli/gemini-2.5-flash`, etc.
- Weight=0 Model Disabling: Setting `weight: 0` in model config now explicitly disables the model (excluded from routing). Useful for temporarily disabling models without removing config.
- Interactive Mode Enhancements: Updated onboarding wizard with ChatGPT and Gemini CLI provider options
Fixed
- ZeroDivisionError in Router: Models with `weight=0` are now filtered before LiteLLM Router creation, preventing `division by zero` in `simple-shuffle` routing strategy when all selected models have zero weight
- Router Weight Selection: `_select_model` fallback uses `random.choice()` instead of `random.uniform(0, 0)` when all models have zero weight
Changed
- Weight Field Semantics: `weight` field description updated to clarify that 0 = disabled. Minimum value enforced at 0 (negative weights rejected by validation)
0.6.1 - 2026-03-05
Fixed
- Claude Agent SDK compliance: Add `allow_dangerously_skip_permissions=True` when using `bypassPermissions`, pass system messages via SDK's `system_prompt` parameter instead of XML tags, set `additionalProperties: false` in JSON object schema
- Auth pre-check gaps: Detect `GH_TOKEN`/`GITHUB_TOKEN` env vars as valid Copilot authentication, detect `CLAUDE_CODE_USE_BEDROCK`/`VERTEX`/`FOUNDRY` env vars as valid Claude authentication
- Resolution hints: Include env var alternatives in authentication error messages
Changed
- Docs: Update configuration guide and ai-tools-setup with env var auth methods
0.6.0 - 2026-03-04
Added
- Cloudflare Integration: Unified cloud backend with two capabilities:
  - Browser Rendering: `--cloudflare` flag for cloud-based URL rendering via CF `/markdown` API, with rate limiting, cache TTL, and advanced params (`user_agent`, `cookies`, `wait_for_selector`, `http_credentials`)
  - Workers AI `toMarkdown`: Cloud-based document conversion for PDF/XLSX/DOCX/PPTX (converter backend)
- Fetch Policy Engine (`fetch_policy.py`): Policy-driven strategy ordering with domain-specific profiles, session persistence, and adaptive targeting
- Domain Profiles: Per-domain fetch config (`wait_for_selector`, `wait_for`, `extra_wait_ms`, `prefer_strategy`) in `markitai.json`
- Playwright Session Persistence: `session_mode` (`isolated`/`domain_persistent`) and `session_ttl_seconds` for reusing browser contexts across requests
- Static HTTP Abstraction (`fetch_http.py`): Pluggable HTTP backend with `httpx` (default) and `curl-cffi` (TLS fingerprint impersonation) via `MARKITAI_STATIC_HTTP` env var
- Content Validation Gate: All fetch strategies now validate content quality before accepting results
- `api_base` `env:` syntax: `"api_base": "env:MY_BASE_URL"` in model config for environment variable expansion
- CF Markdown for Agents: Content negotiation via `Accept: text/markdown` header for Cloudflare-enabled sites
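The `env:` expansion for `api_base` might look roughly like this. The function `resolve_env_ref` is hypothetical, and raising `KeyError` for an unset variable is an assumption, since the changelog does not say how a missing variable is handled.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_env_ref(value: str | None, env: Mapping[str, str] | None = None) -> str | None:
    """Expand a value of the form 'env:NAME' from the environment, else return it as is."""
    env = os.environ if env is None else env
    if value is None or not value.startswith("env:"):
        return value
    name = value[len("env:"):]
    if name not in env:
        # Assumed behaviour: fail loudly rather than pass a literal 'env:...' downstream.
        raise KeyError(f"environment variable {name!r} is not set")
    return env[name]
```

Passing an explicit mapping keeps the helper easy to test without touching the real process environment.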
Changed
- Vision Router Fallback: When all vision models are disabled (`weight=0`), falls back to main router with warning instead of crashing
- Playwright UTF-8 Encoding: Force UTF-8 for HTML-to-Markdown conversion to prevent encoding errors
- Integration Test Resilience: Cloudflare integration tests now skip on rate limit (429) instead of failing
Fixed
- ZeroDivisionError in Vision Router: Models with `weight=0` (disabled) are now filtered out before litellm Router creation, preventing `division by zero` in `simple-shuffle` routing strategy
- Dead Code Cleanup: Removed 21 dead functions/classes across 15+ files (backward compat aliases, deprecated functions, unused constants)
Removed
- `_html_to_text`, `_normalize_bypass_list`, `_get_proxy_bypass`, `get_proxy_for_url`, `_url_to_session_id` from `fetch.py`
- `sanitize_error_message` from `security.py`
- `_deep_update`, `get_config` from `config.py`
- `order_dict_keys_sorted`, `_order_image_entry` from `json_order.py`
- `reset_consoles` from `console.py`
- `get_llm_not_configured_hint` from `hints.py`
- `remove_uncommented_screenshots`, `_UNCOMMENTED_SCREENSHOT_RE` from `llm/content.py`
- `get_pending_urls`, `finish_url_processing` from `batch.py`
- `LLMUsageAccumulator` from `workflow/helpers.py`
- `DEFAULT_LOG_PANEL_MAX_LINES` from `constants.py`
- Multiple backward-compatibility aliases from `cli/processors/`
0.5.2 - 2026-02-07
Fixed
- SQLite ResourceWarning: Close SQLite connections explicitly via `_connect()` context manager, preventing `ResourceWarning: unclosed database` on Python 3.13
- Windows path handling: `context_display_name()` now handles `C:/` forward-slash Windows paths (was only handling `C:\`)
- Windows install hints: `markitai doctor` shows platform-specific install commands (PowerShell/winget on Windows, curl on Unix)
- OAuth token expiry: `markitai doctor` no longer reports "Token expired" when a valid refresh token exists
- Config get output: `markitai config get` renders Pydantic models as formatted JSON with syntax highlighting instead of raw Python repr
- Copilot ProviderError: Added missing `provider` kwarg when raising `ProviderError` for unsupported models
- Pyright warnings: Resolved all Pyright warnings (lazy `__all__`, type narrowing, optional imports)
Changed
- 26 documentation fixes: Comprehensive audit fixing docstring-to-code mismatches across all modules (llm, providers, converter, utils, config)
0.5.1 - 2026-02-07
Added
- Playwright auto-scroll: Auto-scroll pages to trigger lazy-loaded content before extraction (up to 8 steps, inspired by baoyu-skills url-to-markdown)
- DOM noise cleanup: Remove navigation, ads, cookie banners, popups, and inline event handlers before content extraction
- `python -m markitai`: Add `__main__.py` for `-m` invocation support (fixes Windows execution)
- Multi-provider detection: Interactive mode (`-I`) now detects and displays all available LLM providers (DeepSeek, OpenRouter included)
- Copilot GPT-5 series support: GPT-5, GPT-5.1, GPT-5.2, GPT-5.1-Codex-Mini/Max, GPT-5.2-Codex now fully supported via Copilot provider
- 22 new unit tests: Vision fallback strategies, smart_truncate edge cases, content protection roundtrip, cache fingerprint collision resistance, batch thread safety
Changed
- Default models modernized: Updated outdated defaults across init/interactive/doctor (haiku→sonnet, gpt-4o→gpt-5.2, gemini-2.0→2.5, claude-sonnet-4→4.5)
- Init wizard: Multi-provider default selection, API keys stored in `.env` instead of plaintext config, next-steps hints after completion
- LLM code deduplication: `document.py` now delegates `_protect_image_positions`/`_restore_image_positions` to `content.py` shared functions
- Cache fingerprint: SHA256 over full content + page structure replaces `text[:1000]` prefix-based cache keys, preventing collisions for documents with identical prefixes
- Batch thread safety: Double-checked locking with timeout-based lock acquisition (5s) replaces non-blocking `acquire(blocking=force)`
- LiteLLM model database: Refreshed with 35 new models including Claude Opus 4.6
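The cache fingerprint change can be pictured with a short sketch. The function `content_fingerprint` and its use of per-page lengths as "page structure" are assumptions; the changelog only states that SHA256 over the full content plus page structure replaced `text[:1000]` keys.

```python
import hashlib


def content_fingerprint(text: str, page_lengths: list[int]) -> str:
    """Hash the whole document plus a simple page-structure summary."""
    digest = hashlib.sha256()
    digest.update(text.encode("utf-8"))
    digest.update(b"\x00")  # separator so content and structure cannot run together
    digest.update(",".join(str(n) for n in page_lengths).encode("ascii"))
    return digest.hexdigest()


# Two documents that share their first 1000 characters but differ afterwards.
doc_a = "x" * 1000 + "A"
doc_b = "x" * 1000 + "B"
key_a = content_fingerprint(doc_a, [1001])
key_b = content_fingerprint(doc_b, [1001])
```

A prefix-based key would treat `doc_a` and `doc_b` as the same entry, while the full-content hash keeps them apart.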
Fixed
- DOM cleanup JS syntax error: Selectors with double quotes (e.g., `[role="banner"]`) now properly escaped via `json.dumps()` instead of f-string interpolation
- Copilot model blocklist: Removed outdated GPT-5 series from `UNSUPPORTED_MODELS` (only o1/o3 reasoning models remain blocked)
- CLI provider display: Truncate provider list with `(+N more)` when >3 detected to prevent line overflow
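The DOM cleanup fix relies on `json.dumps()` producing a valid JavaScript string literal. A minimal sketch follows; the function `build_cleanup_js` and the generated snippet are hypothetical, not the project's actual cleanup script.

```python
import json


def build_cleanup_js(selectors: list[str]) -> str:
    """Build a JS snippet that removes every element matching the given selectors."""
    # json.dumps adds the surrounding quotes and escapes embedded double quotes,
    # which naive f-string interpolation inside "..." would not do.
    selector_literal = json.dumps(", ".join(selectors))
    return f"document.querySelectorAll({selector_literal}).forEach((el) => el.remove());"


js = build_cleanup_js(['[role="banner"]', "nav"])
```

The inner quotes of `[role="banner"]` come out as `\"`, so the selector string stays a single literal in the generated script.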
0.5.0 - 2026-02-06
Added
- Unified UI system: New `ui.py` components and `i18n.py` module with Chinese/English support across all CLI commands
- `markitai init`: One-stop setup wizard that checks dependencies, detects LLM providers, and generates config
- Interactive mode (`-I`): Guided setup with questionary prompts for new users
- `doctor --fix`: Auto-install missing components (e.g., Playwright)
- Cross-platform install hints: Platform-specific installation commands in doctor output
- `MARKITAI_LOG_FORMAT`: Environment variable override for log format
- JSON repair: Fallback parser for malformed LLM JSON responses using `json_repair`
Changed
Performance
- CLI startup: Lazy-load processor and command modules (~3x faster `--help`)
- Dependency checks: Parallelized doctor and init with `ThreadPoolExecutor`
- LLM processing: Pre-compiled regex patterns and batched replacements
- PDF rendering: Parallel page rendering for standard and LLM modes
- URL fetching: Async-safe cache locking for concurrent requests
- Executor: Auto-detect heavy task limit based on system RAM
- Image processing: Offloaded CPU-intensive work to thread pool
- Cache stats: Merged stats and model breakdown into single SQLite query
Refactoring
- Batch UI: Replaced Rich table/LogPanel with compact unified UI (progress bar with current file, completion summary)
- Log format: Default changed to human-readable text (was JSON)
- LLM cache: Deduplicated `SQLiteCache`/`PersistentCache` into `llm/cache.py`
- Single file output: Layered output with `--verbose` for detailed logs
- Setup scripts: Consolidated 10 scripts into 2 unified files (`setup.sh` + `setup.ps1`) with built-in i18n
Fixed
- Windows: LibreOffice detection with fallback to `Program Files` paths (not just PATH)
- Windows: FFmpeg/CLI path display now shows "installed" instead of long winget package paths
- Windows: `config path` alignment with dynamic padding and continuous `│` column
- Playwright: Default `wait_for` changed to `domcontentloaded` (was `networkidle`, caused hangs)
- Config: Schema and function defaults synced with constants
- Exceptions: Preserved exception chains (`raise from`) across codebase
- Cache: Prevented stale `markitai_processed` timestamp on cache hit
- CLI: Version flag reverted to `-v`/`--version`, `--verbose` kept without short flag
CI
- Added Windows LibreOffice install step (`choco`) to CI matrix
- Changed to `--all-extras` for comprehensive dependency testing
- Publish workflow: split unit/integration tests with `SKIP_LLM_TESTS`
0.4.2 - 2026-02-03
Changed
- Playwright defaults: `wait_for` changed to `networkidle`, `extra_wait_ms` to 5000ms for better SPA support
- Frontmatter validation: Pydantic validators reject empty description/tags, triggering Instructor auto-retry
- VitePress: Upgraded to 2.0.0-alpha.16
Fixed
- X/Twitter content: Pages now wait for full JS rendering before capture
- Cache directories: All caches now respect `cache.global_dir` config instead of hardcoded paths
- Setup scripts: Improved piped execution (`curl | sh`), proper Playwright installation paths
- Config init: Added `--yes`/`-y` flag for non-interactive use
0.4.1 - 2026-02-02
Added
- `markitai doctor`: New diagnostic command for system health and auth status checking
- Adaptive timeout: Local providers auto-adjust timeout based on request complexity
- Prompt caching: Claude Agent caches long system prompts (>4KB) for cost reduction
Changed
- `check-deps` renamed to `doctor` (old name kept as alias)
- Improved error messages with resolution hints for local providers
Fixed
- Request timeouts on large documents with Claude Agent / Copilot
- JSON extraction issues with control characters and markdown code blocks
0.4.0 - 2026-01-28
Added
- Claude Agent SDK: `claude-agent/sonnet|opus|haiku` via Claude Code CLI
- GitHub Copilot SDK: `copilot/claude-sonnet-4.5|gpt-4o|o1` models
- URL HTTP caching: ETag/Last-Modified conditional requests
- Quiet mode: `--quiet`/`-q` flag (auto-enabled for single file)
- Module refactoring: `cli.py` → `cli/`, `llm.py` → `llm/`, new `providers/`
- Setup scripts hardening: default N for high-impact ops, version pinning
- Docs: CONTRIBUTING.md, architecture.md, ai-tools-setup.md, dependabot.yml
Changed
- Python 3.13, docs reorganized to `docs/archive/`
- agent-browser locked to 0.7.6 (Windows bug in 0.8.x)
- Default `extra_wait_ms`: 1000 → 3000, Instructor mode: `JSON` → `MD_JSON`
Fixed
- Windows: UTF-8 console, Copilot CLI path discovery, script argument quoting
- LLM: Frontmatter regex fallback, `source` field fix, vision/frontmatter error handling
- Prompts: Enhanced prompt leakage prevention, placeholder protection rules
- Content: Social media cleanup rules (X/Twitter, Facebook, Instagram)
- Setup: WSL detection, Python pymanager support, PATH refresh order
0.3.2 - 2026-01-27
Added
- Chinese README (`README_ZH.md`) with language toggle
- Chinese setup scripts: `setup-zh.sh`, `setup-zh.ps1`, `setup-dev-zh.sh`, `setup-dev-zh.ps1`
Changed
- Improved setup scripts with better error handling and user feedback
- Updated Python version note: 3.11-3.13 (3.14 not yet supported)
- Updated documentation language toggle links
0.3.1 - 2026-01-27
Fixed
Prompt Leakage Prevention
- Split all prompts into `*_system.md` (role definition) and `*_user.md` (content template)
- Added `_validate_no_prompt_leakage()` to detect and handle prompt leakage in LLM output
- Updated LLM calls to use proper `[{"role": "system"}, {"role": "user"}]` message structure
LLM Compatibility
- Fixed `max_tokens` exceeding deepseek limit by using minimum across all router models
- Fixed terminal window popup on Windows when running agent-browser verification
URL Fetching
- Improved error messages for browser fetch timeout (no longer suggests installing when already attempted)
- Added auto-proxy detection for Jina API and browser fetching
  - Checks environment variables: `HTTPS_PROXY`, `HTTP_PROXY`, `ALL_PROXY`
  - Auto-detects local proxy ports: 7890 (Clash), 10808 (V2Ray), 1080 (SOCKS5), etc.
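The environment-variable half of the auto-proxy detection could be sketched as below. The function `detect_proxy_from_env` is hypothetical; the lookup order follows the order in which the variables are listed above, and the lowercase fallback is an assumption. Port probing is left out.

```python
import os
from collections.abc import Mapping
from typing import Optional


def detect_proxy_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty proxy URL found in the usual environment variables."""
    env = os.environ if env is None else env
    for name in ("HTTPS_PROXY", "HTTP_PROXY", "ALL_PROXY"):
        # Assumed convenience: also accept the lowercase spelling of each variable.
        value = env.get(name) or env.get(name.lower())
        if value:
            return value
    return None
```

Returning `None` when nothing is set leaves room for a caller to try the local-port detection next.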
Added
SPA Domain Learning
- New `SPADomainCache` for automatic detection and caching of JavaScript-heavy sites
- `markitai cache spa-domains` command to view/manage learned domains
- `markitai cache clear --include-spa-domains` option
Windows Performance Optimizations
- Thread pool optimization: Windows defaults to 4 workers (vs 8 on Linux/macOS)
- ONNX Runtime global singleton with preheat for OCR engine
- OpenCV-based image compression (releases GIL, 20-40% faster)
- Batch subprocess execution for agent-browser commands
Changed
- Default image quality: 85 → 75
- Default image max_height: 1080 → 99999 (effectively unlimited)
- Default image min_area filter: 2500 → 5000
- Default URL concurrency: 3 → 5
- Default scan_max_depth: 10 → 5
- Extended fallback_patterns with more social media domains
0.3.0 - 2026-01-26
Added
URL Conversion Support
- Direct URL conversion: `markitai <url>` converts web pages to Markdown
- URL batch processing: Support `.urls` file format (text or JSON), auto-detected from input
- URL image downloading: `download_url_images()` with concurrent downloads (5 parallel)
- Automatic relative URL resolution for images
- Cross-platform filename sanitization (Windows illegal characters handling)
Multi-Source URL Fetching (fetch.py)
- Three fetch strategies: `--static` / `--agent-browser` / `--jina`
  - `static`: MarkItDown direct HTTP fetch (default, fastest)
  - `browser`: agent-browser headless rendering (for JS-heavy pages)
  - `jina`: Jina Reader API (cloud-based, no local deps)
  - `auto`: Smart fallback (static → browser/jina if JS detected)
- FetchCache: SQLite-based URL cache with LRU eviction (100MB default)
- Screenshot capture: `--screenshot` for full-page screenshots via browser
- Multi-source content: Parallel static + browser fetch with quality validation
- Domain pattern matching for auto-browser fallback (x.com, twitter.com, etc.)
- `FetchResult` with `static_content`, `browser_content`, `screenshot_path`
agent-browser Integration
- Headless browser automation via `agent-browser` CLI
- Configurable wait states: `load`, `domcontentloaded`, `networkidle`
- Extra wait time for SPA rendering (`extra_wait_ms`)
- Session isolation for concurrent fetches
- `verify_agent_browser_ready()` with cached readiness check
- Screenshot compression with Pillow (JPEG quality + max height)
URL LLM Enhancement
- New `prompts/url_enhance.md` for URL-specific content cleaning
- Multi-source LLM processing: combine static + browser + screenshot
- Smart content selection based on validity detection
Cache Enhancements
- `--no-cache-for <pattern>`: Selective cache bypass with glob patterns
  - Single file: `--no-cache-for file1.pdf`
  - Glob pattern: `--no-cache-for "*.pdf"`
  - Mixed: `--no-cache-for "*.pdf,reports/**"`
- `markitai cache stats -v`: Verbose mode with detailed cache entries
- `--limit N`: Control number of entries in verbose output (default: 20)
- `--scope project|global|all`: Filter cache statistics by scope
- `SQLiteCache.list_entries()`: List cache entries with metadata
- `SQLiteCache.stats_by_model()`: Per-model cache statistics
- Improved cache hash: head + tail + length algorithm for better invalidation
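A comma-separated glob list such as the `--no-cache-for` examples above could be evaluated along these lines. The function `should_bypass_cache` is hypothetical, and using `fnmatchcase` (where `*` also matches path separators) is an assumption about the matching rules, not the tool's documented behaviour.

```python
from fnmatch import fnmatchcase


def should_bypass_cache(path: str, spec: str) -> bool:
    """Return True if path matches any glob in a comma-separated pattern list."""
    patterns = [p.strip() for p in spec.split(",") if p.strip()]
    # Normalize Windows separators so one pattern works on every platform.
    normalized = path.replace("\\", "/")
    return any(fnmatchcase(normalized, pattern) for pattern in patterns)
```

For example, the mixed spec `"*.pdf,reports/**"` would bypass the cache for `a.pdf` and for anything under `reports/`, but not for `notes.docx`.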
Workflow Core Refactor (workflow/core.py)
- `ConversionContext`: Unified single-file conversion context
- `convert_document_core()`: Main conversion pipeline
  - `validate_and_detect_format()` → `convert_document()` → `process_embedded_images()`
  - `write_base_markdown()` → `process_with_vision_llm()` / `process_with_standard_llm()`
- Parallel document + image processing with proper dependency handling
- Alt text injection after LLM processing completes (race condition fix)
Official Website
- VitePress 2.x documentation site with bilingual support (English/Chinese)
- Custom theme with brand colors matching logo
- Local search integration
- GitHub Actions auto-deployment to GitHub Pages
Project
- MIT License: Added LICENSE file
CI/CD
- `.github/workflows/ci.yml`: Automated testing on push/PR
- `.github/workflows/deploy-website.yml`: Website deployment to GitHub Pages
Code Architecture
- New `utils/paths.py`: `ensure_dir()`, `ensure_subdir()`, `ensure_assets_dir()`
- New `utils/mime.py`: `get_mime_type()`, `get_extension_from_mime()`
- New `utils/text.py`: `normalize_markdown_whitespace()`, text utilities
- New `utils/executor.py`: `run_in_executor()` with shared ThreadPoolExecutor
- New `utils/output.py`: Output formatting helpers
- New `json_order.py`: Ordered JSON serialization for reports/state files
- New `urls.py`: `.urls` file parser (JSON and plain text formats)
- `LLMUsageAccumulator` class for centralized cost tracking
- `create_llm_processor()` factory function
- Unified `detect_language()` with `get_language_name()` helper
- Centralized `IMAGE_EXTENSIONS`, `JS_REQUIRED_PATTERNS` constants
Configuration
- `supports_vision` now optional: Auto-detected from litellm when not explicitly set
- No need to manually configure for most models (GPT-4o, Gemini, Claude, etc.)
- Explicit `supports_vision: true/false` overrides auto-detection if needed
Changed
Package Rename
- `markit` → `markitai`: Package renamed for clarity
- CLI command remains `markitai`
Python Version
- Python 3.11+ support: Lowered minimum Python version from 3.13 to 3.11
CLI Behavior
- Single file mode: Direct stdout output (no logging by default)
- `--verbose`: Show logs before output in single file mode
- Batch processing behavior unchanged
Code Quality
- Refactored PowerShell COM conversion scripts (~18% code reduction)
- Unified MIME type mapping across codebase
- Extracted common fixtures to `conftest.py`
- Improved error messages for network failures (SSL/connection/proxy)
- Architecture diagram updated in `docs/spec.md`
Fixed
- URL filename cross-platform compatibility
- Cache invalidation for large documents (tail changes now detected)
- Image analysis race condition with `.llm.md` file writing
0.2.4 - 2026-01-21
Changed
- Restructured `assets.json` format with flat asset array
- Extract Live display management for early log capture
- Improved MS Office detection with file path fallback
Fixed
- Add openpyxl FileVersion compatibility patch
- Add pptx XMLSyntaxError compatibility patch
- Enhanced `check_symlink_safety` with nested symlink detection
- LLM empty response retry logic
- `normalize_frontmatter` for consistent YAML field order
0.2.3 - 2026-01-20
Added
Persistent LLM Cache
- SQLite-based cache with LRU eviction and size limits (default 1GB)
- Dual-layer lookup: project cache + global cache
- `CacheConfig` in `MarkitaiConfig` with enabled/no_cache/max_size options
- `--no-cache` CLI flag: Skip reading but still write (Bun semantics)
- `markitai cache stats [--json]`: View cache statistics
- `markitai cache clear [--scope]`: Clear cache by scope
Vision Router Optimization
- Smart router selection: auto-detect image content in messages
- `vision_router` property filtering only `supports_vision=true` models
- Replace hardcoded "vision" model name with "default" + smart routing
Legacy Office Conversion
- MS Office COM batch conversion: one app launch per file type
- `check_ms_word/excel/powerpoint_available()` registry-based detection
- Pre-convert legacy files before batch processing to reduce overhead
Performance (Phase 3)
- Parallel PDF processing: Concurrent page OCR & rendering
- Parallel image processing: `ProcessPoolExecutor` for CPU-bound compression
- Adaptive worker count based on file size
- LRU eviction and byte-size limits for image cache
- Batch semaphore for memory pressure control
Changed
- OCR optimization: `recognize_numpy()` and `recognize_pixmap()` for direct array processing
- Reuse already-rendered pixmap in PDF OCR (avoid re-rendering)
Fixed
- EMF/WMF format detection and PNG conversion support
- `DATA_URI_PATTERN` regex for hyphenated MIME types (x-emf, x-wmf)
- Base64 stripping: remove hallucinated images instead of replacing
- Batch timing: record `start_at` before pre-conversion for accurate duration
- Pyright venv detection: add venvPath/venv to pyproject.toml
0.2.2 - 2026-01-20
Added
- `constants.py` module to consolidate hardcoded values
- Unit tests for image and llm modules
- `convert_to_markdown.py` reference script
Changed
- Centralized constants usage across config.py, llm.py, batch.py, image.py
- Improved LLM content restoration with garbage detection logic
- Enable parallel batch processing for image analysis
- Move state saving outside semaphore to reduce blocking
Fixed
- Rich Panel markup parsing issue (escape file paths)
0.2.1 - 2026-01-20
Added
LLM Usage Tracking
- Context-based usage tracking (per-file instead of global)
- `get_context_cost()` and `get_context_usage()` for per-file stats
- Thread-safe lock for concurrent access to usage dictionaries
Type System
- `types.py` with TypedDict definitions (ModelUsageStats, LLMUsageByModel, AssetDescription)
- `ImageAnalysis.llm_usage` for multi-model tracking (renamed from `model`)
Model Configuration
- `get_model_max_output_tokens()` using `litellm.get_model_info()`
- Auto-inject max_tokens with fallback to conservative default (8192)
Office Detection
- `utils/office.py` module with cross-platform detection
- `has_ms_office()`: Windows COM-based MS Office detection
- `find_libreoffice()`: PATH + common paths search with `@lru_cache`
Image Processing
- `strip_base64_images()` method
- `remove_nonexistent_images()` to clean LLM-hallucinated references
- Normalize whitespace for standalone image `.llm.md` output
Changed
- File conflict rename strategy: `.2.md` → `.v2.md` for natural sort order
- Batch state: add `screenshots` field (separate from embedded images)
- Batch state: add `log_file` field for run traceability
- Store file paths as relative to input_dir in batch state
0.2.0 - 2026-01-19
Added
- Monorepo architecture with uv workspace (`packages/markitai/`)
- LiteLLM integration for unified LLM provider access
- New converter modules: `pdf`, `office`, `image`, `text`, `legacy`
- Workflow system for single file processing (`workflow/single.py`)
- Markdown-based prompt management system (`prompts/*.md`)
- Unified config with JSON schema validation (`config.schema.json`)
- Security module for path validation (`security.py`)
- Comprehensive test suite with fixtures
Changed
- CLI rewritten with Click (replaced Typer)
- Requires Python 3.13+
Removed
- Old `src/markitai/` structure and all legacy code
- Complex pipeline/router/state machine architecture
- Individual LLM provider implementations (OpenAI, Anthropic, etc.)
- Docker and CI scripts (to be re-added later)
Breaking Changes
- Configuration format changed (see migration guide)
- CLI command syntax updated
- Python 3.12 and below no longer supported
0.1.6 - 2026-01-14
Fixed
- Model routing strategy bugs
- Documentation accuracy improvements
0.1.5 - 2026-01-13
Changed
- Refactored prompt management system for better maintainability
- Simplified cleaner module logic
0.1.4 - 2026-01-13
Fixed
- JSON parsing edge cases in LLM responses
- Log formatting improvements for readability
0.1.3 - 2026-01-12
Added
- Test coverage improved to 81%
Changed
- Adopted `src` layout for project structure
- Reorganized documentation to `docs/reference/`
- Added GitHub Actions CI workflow
Fixed
- Provider-specific bugs in fallback handling
0.1.2 - 2026-01-12
Added
- Resilience features for network failures (retry logic, timeout handling)
- `CLAUDE.md` and `AGENTS.md` documentation for AI assistants
Changed
- Log optimization for cleaner, more informative output
0.1.1 - 2026-01-11
Changed
- Major architecture refactoring with service layer pattern
- Enhanced LLM support with better error handling and retries
0.1.0 - 2026-01-10
Added
Capability-Based Model Routing
- `required_capability` and `prefer_capability` parameters for LLM calls
- Text tasks prioritize text-only models for cost efficiency
- Vision tasks automatically use vision-capable models
- Backward compatible: parameters default to None (round-robin behavior)
Lazy Model Initialization
- Providers loaded on-demand instead of all at startup
- Significantly reduced initialization time for single-file conversions
- `warmup()` method for batch mode to validate providers upfront
- `required_capabilities` parameter in `initialize()`
Concurrent Fallback Mechanism
- Primary model timeout triggers parallel backup model execution
- Neither model is interrupted - first response wins
- Configurable via `llm.concurrent_fallback_timeout` (default: 180s)
- Handles Gemini 504 timeout scenarios gracefully
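
The fallback behaviour described above can be sketched with `asyncio`. This is an illustration under assumptions, not the project's implementation: `call_with_fallback` and the two demo coroutines are hypothetical names, and cancelling the slower task once a winner has returned is a choice made for this sketch.

```python
import asyncio


async def call_with_fallback(primary, backup, timeout: float):
    """Run `primary`; if it has not finished after `timeout` seconds,
    start `backup` alongside it and return whichever finishes first."""
    primary_task = asyncio.create_task(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=timeout)
    if done:
        return primary_task.result()

    # Primary is slow: start the backup WITHOUT cancelling the primary.
    backup_task = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {primary_task, backup_task}, return_when=asyncio.FIRST_COMPLETED
    )
    winner = primary_task if primary_task in done else backup_task
    # Sketch-only cleanup: drop the loser once a response is in hand.
    for task in pending:
        task.cancel()
    # result() re-raises if the winner failed; real code would handle that.
    return winner.result()


async def slow_primary():
    await asyncio.sleep(0.5)
    return "primary"


async def fast_backup():
    await asyncio.sleep(0.05)
    return "backup"


result = asyncio.run(call_with_fallback(slow_primary, fast_backup, timeout=0.1))
```

Here the primary exceeds the 0.1 s threshold, so the backup is started in parallel and its response is the one returned.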
Execution Mode Support
- `--fast` flag for speed-optimized batch processing
- Fast mode: skips validation, limits fallback attempts, reduces logging
- Default mode: full validation, detailed logging, comprehensive retries
- Configurable via `execution.mode` in config file
Enhanced Statistics
- `BatchStats` class for comprehensive processing metrics
- Per-model tracking: calls, tokens, duration, estimated cost
- `ModelCostConfig` for optional cost estimation
- Summary format: "Complete: X success, Y failed | Total: Xs | Tokens: N"
Changed
- CLI architecture refactored for better modularity
- Config format migrated from JSON to YAML
0.0.1 - 2026-01-08
Added
- Initial release
- CLI commands: `convert`, `batch`, `config`, `provider`
- Multi-format support: Word (.doc, .docx), PowerPoint (.ppt, .pptx), Excel (.xls, .xlsx), PDF, HTML
- LLM enhancement: markdown formatting, frontmatter generation, image alt text
- 5 LLM providers with fallback: OpenAI, Anthropic, Gemini, Ollama, OpenRouter
- 3 PDF engines: pymupdf4llm (default), pymupdf, pdfplumber
- Image processing: extraction, compression (oxipng/mozjpeg), LLM analysis
- Batch processing with resume capability and concurrency control
- Unit and integration tests
- Docker multi-stage build
- Chinese and English documentation