ArXiv — Scraping & Data Extraction
Extracts paper data from ArXiv through the Atom API, covering paper search and metadata retrieval.
Security score
This skill was audited on May 16, 2026. The audit found 29 security issues across 1 threat category. Review the findings below before installing.
Security Issues
| Finding | Line | Content |
| --- | --- | --- |
| External URL reference | 3 | `https://arxiv.org` — open-access preprint server. **Never use the browser for ArXiv.** All data is reachable via `http_get` using the Atom API or HTML meta tags. No API key required. |
| External URL reference | 13 | NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} |
| External URL reference | 15 | xml = http_get("http://export.arxiv.org/api/query?search_query=ti:transformer+AND+cat:cs.LG&max_results=5&sortBy=submittedDate&sortOrder=descending") |
| External URL reference | 22 | Use `http_get` on `https://arxiv.org/abs/{id}` + regex for `citation_*` meta tags when you need the full abstract from an HTML page. |
| External URL reference | 32 | NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} |
| External URL reference | 35 | "http://export.arxiv.org/api/query" |
| External URL reference | 59 | # PDF: https://arxiv.org/pdf/2604.15259v1 |
| External URL reference | 68 | NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} |
| External URL reference | 70 | xml = http_get("http://export.arxiv.org/api/query?id_list=1706.03762") |
| External URL reference | 84 | # PDF: https://arxiv.org/pdf/1706.03762v7 |
| External URL reference | 96 | NS = {'atom': 'http://www.w3.org/2005/Atom'} |
| External URL reference | 99 | xml = http_get(f"http://export.arxiv.org/api/query?id_list={','.join(ids)}&max_results={len(ids)}") |
| External URL reference | 123 | NS = {'atom': 'http://www.w3.org/2005/Atom'} |
| External URL reference | 126 | xml = http_get(f"http://export.arxiv.org/api/query?id_list={arxiv_id}") |
| External URL reference | 153 | html = http_get("https://arxiv.org/abs/1706.03762", headers={"User-Agent": "Mozilla/5.0"}) |
| External URL reference | 170 | # PDF: https://arxiv.org/pdf/1706.03762 (no version suffix — always latest) |
| External URL reference | 192 | 'atom': 'http://www.w3.org/2005/Atom', |
| External URL reference | 193 | 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/', |
| External URL reference | 198 | "http://export.arxiv.org/api/query" |
| External URL reference | 212 | "http://export.arxiv.org/api/query" |
| External URL reference | 223 | http://export.arxiv.org/api/query |
| External URL reference | 226 | HTTPS also works: `https://export.arxiv.org/api/query` |
| External URL reference | 259 | pdf_versioned = f"https://arxiv.org/pdf/{arxiv_id}" # specific version |
| External URL reference | 260 | pdf_latest = f"https://arxiv.org/pdf/{bare_id}" # always redirects to latest |
| External URL reference | 261 | abs_versioned = f"https://arxiv.org/abs/{arxiv_id}" |
| External URL reference | 262 | abs_latest = f"https://arxiv.org/abs/{bare_id}" |
| External URL reference | 281 | Full category taxonomy: https://arxiv.org/category_taxonomy |
| External URL reference | 287 | - **Always define the namespace dict.** Without `NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}`, `findall('atom:entry')` silently returns `[]`. All ArXiv Atom |
| External URL reference | 291 | - **`atom:id` contains a URL, not a bare ID.** The element text is `http://arxiv.org/abs/1706.03762v7` — always split on `/` and take `[-1]` to get the bare ID with version. Strip version with `re.sub |
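The flagged lines at 259-262, 287, and 291 describe one small pipeline: parse the Atom feed with a namespace dict, take the last path segment of `atom:id`, strip the version suffix, and build the PDF and abstract URLs. The following is a minimal sketch of that pipeline, not the skill's actual code. `SAMPLE_FEED` is a hand-written stand-in for an API response, `parse_ids` is a hypothetical helper name, and the `v\d+$` pattern is an assumption because the quoted `re.sub` call is truncated in the audit output. It also assumes new-style identifiers such as `1706.03762`.

```python
import re
import xml.etree.ElementTree as ET

# Namespace dict as quoted in the flagged lines.
NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

# Hypothetical, hand-written sample standing in for a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <title>Attention Is All You Need</title>
  </entry>
</feed>"""


def parse_ids(feed_xml):
    """Return versioned and bare IDs plus derived URLs for each entry."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall('atom:entry', NS):
        # atom:id holds a URL; the last path segment is the ID with version.
        arxiv_id = entry.find('atom:id', NS).text.strip().split('/')[-1]
        # Assumed pattern: drop a trailing "v<digits>" to get the bare ID.
        bare_id = re.sub(r'v\d+$', '', arxiv_id)
        results.append({
            'arxiv_id': arxiv_id,
            'bare_id': bare_id,
            'pdf_versioned': f"https://arxiv.org/pdf/{arxiv_id}",
            'abs_latest': f"https://arxiv.org/abs/{bare_id}",
        })
    return results


if __name__ == "__main__":
    for item in parse_ids(SAMPLE_FEED):
        print(item)
    # An unqualified tag name does not match namespaced elements.
    print(ET.fromstring(SAMPLE_FEED).findall('entry'))
```

Every element in the feed lives in the Atom namespace, so an unqualified lookup such as `findall('entry')` matches nothing. That is why the namespace dict is passed on each `find` and `findall` call.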