11 commits

Author SHA1 Message Date
fe89db11bd Improve link matching to avoid more classes of bad edits
- Skip no-parameter templates (navboxes) and add annotated link,
  excerpt, main, see to the list of skipped parameterised templates
- Preserve sentence-initial capitalisation when replacement is lowercase
- Skip matches that sit entirely inside an existing [[link]] destination
- Treat link destinations that start with q as more specific links to
  preserve, in both find_link_in_chunk and find_link_and_section
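
The capitalisation rule above could be sketched roughly as follows. This is only an illustration: `match_case` and `make_link` are hypothetical names, not functions from this repository, and the sketch assumes a wiki where the first letter of a page title is case-insensitive (as on English Wikipedia).

```python
def match_case(original: str, replacement: str) -> str:
    """Return replacement, capitalised if original starts with an uppercase letter."""
    if original[:1].isupper() and replacement[:1].islower():
        return replacement[:1].upper() + replacement[1:]
    return replacement


def make_link(matched_text: str, title: str) -> str:
    """Build a wikilink for title that keeps the casing of the matched text.

    Hypothetical helper: assumes the first letter of a title is
    case-insensitive, so changing it does not change the link target.
    """
    return f"[[{match_case(matched_text, title)}]]"


print(make_link("Surface runoff", "surface runoff"))
print(make_link("surface runoff", "surface runoff"))
```

Only the first character is changed, so a lowercase title used mid-sentence is left as it is.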

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 14:44:28 +01:00
4fe0acc167 Improve link matching to avoid many classes of bad edits
parse_cite: extend to skip {{cite}}/{{citation}}, {{short description}},
{{gli}}, {{defn}}, external links [https://...], italic text ''...'',
and bullet-point lines containing bare URLs (unformatted bibliography
entries). Uses brace-counting to handle nested templates correctly.
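
A minimal sketch of brace-counting, for illustration only. `find_template_end` is a hypothetical name and not necessarily how parse_cite is written; the sketch also ignores triple-brace `{{{parameter}}}` syntax, which a real parser would need to consider.

```python
def find_template_end(text: str, start: int) -> int:
    """Return the index just past the }} that closes the {{ at start.

    Returns -1 if the braces never balance.
    """
    if not text.startswith("{{", start):
        raise ValueError("start must point at an opening {{")
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1  # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1  # leaving a template
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1


text = "a {{cite web|title={{lang|fr|x}}|url=y}} b"
end = find_template_end(text, 2)
print(text[2:end])
```

Counting depth, rather than searching for the first `}}`, is what keeps the inner `{{lang|fr|x}}` from ending the outer template early.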

parse_links: yield [[Category:...]] links as 'category' tokens so they
are never modified.
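
As a rough illustration of the idea, a tokeniser can tag category links separately so later stages leave them alone. The function name, the token names other than 'category', and the regular expression are assumptions for this sketch, not the repository's parse_links.

```python
import re
from typing import Iterator

# Simplified pattern: does not handle nested brackets (e.g. image captions).
re_link = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def parse_links_sketch(text: str) -> Iterator[tuple[str, str]]:
    """Yield (kind, chunk) tokens where kind is 'text', 'link' or 'category'."""
    pos = 0
    for m in re_link.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        # The link target is the part before any | pipe.
        target = m.group(1).split("|", 1)[0].strip()
        kind = "category" if target.lower().startswith("category:") else "link"
        yield (kind, m.group(0))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


for token in parse_links_sketch("a [[b]] [[Category:C]]"):
    print(token)
```

Joining the second element of every token reproduces the input, so a caller can rewrite only the 'text' tokens and reassemble the page.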

add_link: handle four new boundary cases where the match overlaps an
existing [[link]]:
- match ends exactly at the link boundary: replace the whole thing with
  a single clean link (e.g. surface [[runoff (hydrology)|runoff]] →
  [[surface runoff]])
- match starts right after [[: absorb the stray [[ (e.g.
  [[anti-globalization]] movement → [[anti-globalization movement]])
- match starts partway inside a link: skip (would produce broken wikitext)
- match spans into but not through a link: use a piped prefix link
  (e.g. cross-platform [[interchange station]] →
  [[cross-platform interchange|cross-platform]] [[interchange station]])
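
One way to picture these cases is as a comparison of two spans in the wikitext. The sketch below is an interpretation of the list above, not the add_link implementation; `classify_overlap` and its return labels are hypothetical, and spans are assumed to be half-open with the link span covering the brackets.

```python
def classify_overlap(match_start: int, match_end: int,
                     link_start: int, link_end: int) -> str:
    """Classify how a match span overlaps a [[link]] span.

    link_start is the index of the opening [[ and link_end is the index
    just past the closing ]].
    """
    if match_end <= link_start or match_start >= link_end:
        return "no overlap"
    if match_start < link_start:
        if match_end == link_end:
            return "replace whole"   # match ends exactly at the link boundary
        if match_end < link_end:
            return "piped prefix"    # match runs into, but not through, the link
        return "unhandled"           # not described in the commit message
    if match_start == link_start + 2:
        return "absorb brackets"     # match starts right after [[
    if match_start > link_start + 2:
        return "skip"                # match starts partway inside the link
    return "unhandled"


t = "cross-platform [[interchange station]]"
print(classify_overlap(0, t.index(" station"), t.index("[["), len(t)))
```

Here the match for "cross-platform interchange" ends inside the existing link, so the sketch reports the piped-prefix case.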

Fallback search: mask [[Category:...]] spans with spaces so the pattern
cannot match inside them. Guard against matches that are part of a
longer named entity (title-case phrase followed by extra words then an
abbreviation in parentheses, e.g. "Anti-Globalization Movement of
Russia (AGMR)").
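
The masking step could look something like the sketch below. The regular expression and the name `mask_categories` are assumptions made for illustration; the point is that replacing each category link with the same number of spaces hides it from the search while keeping every other character offset unchanged.

```python
import re

# Assumed pattern for category links; the real code may differ.
re_category = re.compile(r"\[\[\s*Category:[^\]]*\]\]", re.IGNORECASE)


def mask_categories(wikitext: str) -> str:
    """Replace each [[Category:...]] span with spaces of equal length."""
    return re_category.sub(lambda m: " " * len(m.group(0)), wikitext)


text = "See runoff.\n[[Category:Surface runoff]]"
masked = mask_categories(text)
print(masked.count("runoff"), len(masked) == len(text))
```

Because the length is preserved, a match position found in the masked text can be applied directly to the original wikitext.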

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 18:11:23 +01:00
95ca5f755d Fix User-Agent header, timeouts, and JSON error handling
mediawiki_oauth: set User-Agent on all OAuth1Session instances so
Wikimedia doesn't reject token and API requests with 403; add timeout
parameter to api_post_request (default 4s).

mediawiki_api: add APIError exception; wrap .json() in call() to raise
APIError with status code and response body on decode failure; raise
timeout to 30s for edit POSTs.

api: wrap call_get_diff .json() with the same JSONDecodeError guard,
raising MediawikiError with HTTP status and body on failure.
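
The decode guard described in this commit might be sketched as below. This uses only the standard library and a plain string body so it runs on its own; the `APIError` constructor signature and the `decode_json` helper are assumptions for this sketch, not the project's actual code.

```python
import json
from typing import Any


class APIError(Exception):
    """Raised when an API response cannot be decoded as JSON."""

    def __init__(self, status_code: int, body: str) -> None:
        # Truncate the body in the message so HTML error pages stay readable.
        super().__init__(f"HTTP {status_code}: {body[:200]}")
        self.status_code = status_code
        self.body = body


def decode_json(status_code: int, body: str) -> Any:
    """Decode a response body, raising APIError with context on failure."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        raise APIError(status_code, body) from exc


try:
    decode_json(503, "<html>Service Unavailable</html>")
except APIError as e:
    print(e)
```

Carrying the status code and body on the exception means a caller sees the server's actual error page instead of a bare decode failure.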

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 18:11:23 +01:00
479dc864fd Remove debugging output 2023-12-09 18:43:05 +00:00
14d8539298 Link matching improvements 2023-12-09 18:42:53 +00:00
1da620875a Add type hints and docstrings 2023-12-09 18:42:03 +00:00
d76c74395b Fix name of module 2023-12-06 20:56:59 +00:00
2c267c67e2 Add types and docstrings 2023-12-06 11:30:34 +00:00
ea95c82b37 Rename wikidata_oauth to mediawiki_oauth 2023-12-06 11:29:03 +00:00
39f9ba31ed Raise LoginNeeded if not logged in 2023-12-06 09:53:06 +00:00
f07b407e7a Initial commit 2023-10-04 12:56:21 +01:00