parse_cite: extend to skip {{cite}}/{{citation}}, {{short description}},
{{gli}}, {{defn}}, external links [https://...], italic text ''...'',
and bullet-point lines containing bare URLs (unformatted bibliography
entries). Uses brace-counting to handle nested templates correctly.
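The brace-counting idea can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `find_template_end` is hypothetical, and it assumes templates are delimited only by `{{` and `}}` (triple-brace parameters are not treated specially).

```python
def find_template_end(text: str, start: int) -> int:
    """Return the index just past the '}}' that closes the template
    opening at `start`, or -1 if the template is never closed.

    Hypothetical helper: counts '{{' and '}}' pairs so that nested
    templates such as {{cite web|title={{lang|fr|x}}}} are skipped whole.
    """
    if not text.startswith("{{", start):
        raise ValueError("start must point at '{{'")
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # unbalanced braces: no closing '}}' found


sample = "a {{cite web|title={{lang|fr|x}}|url=y}} b"
end = find_template_end(sample, sample.index("{{"))
skipped = sample[sample.index("{{"):end]
```

A plain regex such as `\{\{.*?\}\}` would stop at the first `}}` and leave the tail of the outer template exposed, which is presumably why counting is used instead.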
parse_links: yield [[Category:...]] links as 'category' tokens so they
are never modified.
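A rough sketch of the tokenizing behaviour described above. The function name `tokenize_links`, the token names other than 'category', and the regex are assumptions for illustration; the real parse_links may differ in signature and detail.

```python
import re
from typing import Iterator, Tuple

# Assumed pattern: a non-greedy match of anything between [[ and ]].
LINK_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def tokenize_links(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, chunk) pairs where kind is 'text', 'link' or 'category'.

    Category links get their own kind so that a caller can pass them
    through untouched.
    """
    pos = 0
    for m in LINK_RE.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        target = m.group(1).lstrip()
        # [[:Category:Foo]] starts with ':' and is an ordinary link, so it
        # is deliberately not classified as a category token here.
        is_category = target.lower().startswith("category:")
        yield ("category" if is_category else "link", m.group(0))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


tokens = list(tokenize_links("See [[runoff]].\n[[Category:Hydrology]]"))
```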
add_link: handle four new boundary cases where the match overlaps an
existing [[link]]:
- match ends exactly at the link boundary: replace the whole thing with
  a single clean link (e.g. surface [[runoff (hydrology)|runoff]] →
  [[surface runoff]])
- match starts right after [[: absorb the stray [[ (e.g.
  [[anti-globalization]] movement → [[anti-globalization movement]])
- match starts partway inside a link: skip (would produce broken wikitext)
- match spans into but not through a link: use a piped prefix link
  (e.g. cross-platform [[interchange station]] →
  [[cross-platform interchange|cross-platform]] [[interchange station]])
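One way to picture the boundary cases listed above is as a classification of how the match span relates to the link span. The sketch below is an assumption-laden illustration: `classify_overlap` and its return labels are hypothetical, and both spans are taken as half-open offsets into the visible text (markup stripped), which may not be how add_link tracks positions.

```python
from typing import Tuple

Span = Tuple[int, int]  # half-open (start, end) offsets in visible text


def classify_overlap(match: Span, link: Span) -> str:
    """Name the boundary case for a match that may overlap a link."""
    ms, me = match
    ls, le = link
    if me <= ls or ms >= le:
        return "no_overlap"
    if ms < ls and me == le:
        return "replace_whole"   # surface [[...|runoff]]
    if ms == ls and me > le:
        return "absorb"          # [[anti-globalization]] movement
    if ls < ms < le:
        return "skip"            # starts partway inside the link
    if ms < ls and me < le:
        return "prefix_pipe"     # cross-platform [[interchange station]]
    return "other"               # cases not described in this message


visible = "cross-platform interchange station"
link_span = (visible.index("interchange"), len(visible))
match_span = (0, len("cross-platform interchange"))
case = classify_overlap(match_span, link_span)
```

Keeping the classification separate from the rewriting step would make the "skip" decision easy to test on its own, though whether add_link is structured this way is not stated.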
Fallback search: mask [[Category:...]] spans with spaces so the pattern
cannot match inside them. Guard against matches that are part of a
longer named entity (title-case phrase followed by extra words then an
abbreviation in parentheses, e.g. "Anti-Globalization Movement of
Russia (AGMR)").
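Both fallback-search safeguards can be sketched briefly. The names `mask_categories` and `looks_like_longer_entity`, the regexes, and the limit of one to four trailing words are all assumptions made for illustration, not values taken from the code.

```python
import re

# Assumed pattern for category links; case-insensitive on the namespace.
CATEGORY_RE = re.compile(r"\[\[\s*Category:[^\]]*\]\]", re.IGNORECASE)

# Assumed shape of a longer named entity: a few extra words, then an
# upper-case abbreviation in parentheses, e.g. " of Russia (AGMR)".
TRAILING_ABBREV_RE = re.compile(r"(?:\s+[\w-]+){1,4}\s+\([A-Z]{2,}\)")


def mask_categories(text: str) -> str:
    """Blank out category links with spaces of equal length, so match
    offsets in the masked text still line up with the original."""
    return CATEGORY_RE.sub(lambda m: " " * len(m.group(0)), text)


def looks_like_longer_entity(text: str, start: int, end: int) -> bool:
    """True if text[start:end] is title-case and is followed by extra
    words and a parenthesised abbreviation."""
    if not text[start:end].istitle():
        return False
    return TRAILING_ABBREV_RE.match(text, end) is not None


page = "Surface runoff matters.\n[[Category:Surface runoff]]"
masked = mask_categories(page)

sentence = "The Anti-Globalization Movement of Russia (AGMR) was founded."
phrase = "Anti-Globalization Movement"
start = sentence.index(phrase)
guarded = looks_like_longer_entity(sentence, start, start + len(phrase))
```

Masking with spaces rather than deleting the span means any offset found in the masked string can be applied directly to the original wikitext.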
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

| Name |
|---|
| `__init__.py` |
| `api.py` |
| `core.py` |
| `language.py` |
| `match.py` |
| `mediawiki_api.py` |
| `mediawiki_api_old.py` |
| `mediawiki_oauth.py` |
| `util.py` |
| `wikipedia.py` |