ArXiv — Scraping & Data Extraction

Facilitates data extraction from ArXiv using the Atom API for efficient paper searches and metadata retrieval.

Security score: 71/100

The ArXiv — Scraping & Data Extraction skill was audited on May 16, 2026 and we found 29 security issues across 1 threat category. Review the findings below before installing.

Security Issues

low line 3

External URL reference

Source: SKILL.md
`https://arxiv.org` — open-access preprint server. **Never use the browser for ArXiv.** All data is reachable via `http_get` using the Atom API or HTML meta tags. No API key required.
low line 13

External URL reference

Source: SKILL.md
NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
low line 15

External URL reference

Source: SKILL.md
xml = http_get("http://export.arxiv.org/api/query?search_query=ti:transformer+AND+cat:cs.LG&max_results=5&sortBy=submittedDate&sortOrder=descending")
low line 22

External URL reference

Source: SKILL.md
Use `http_get` on `https://arxiv.org/abs/{id}` + regex for `citation_*` meta tags when you need the full abstract from an HTML page.
low line 32

External URL reference

Source: SKILL.md
NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
low line 35

External URL reference

Source: SKILL.md
"http://export.arxiv.org/api/query"
low line 59

External URL reference

Source: SKILL.md
# PDF: https://arxiv.org/pdf/2604.15259v1
low line 68

External URL reference

Source: SKILL.md
NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
low line 70

External URL reference

Source: SKILL.md
xml = http_get("http://export.arxiv.org/api/query?id_list=1706.03762")
low line 84

External URL reference

Source: SKILL.md
# PDF: https://arxiv.org/pdf/1706.03762v7
low line 96

External URL reference

Source: SKILL.md
NS = {'atom': 'http://www.w3.org/2005/Atom'}
low line 99

External URL reference

Source: SKILL.md
xml = http_get(f"http://export.arxiv.org/api/query?id_list={','.join(ids)}&max_results={len(ids)}")
low line 123

External URL reference

Source: SKILL.md
NS = {'atom': 'http://www.w3.org/2005/Atom'}
low line 126

External URL reference

Source: SKILL.md
xml = http_get(f"http://export.arxiv.org/api/query?id_list={arxiv_id}")
low line 153

External URL reference

Source: SKILL.md
html = http_get("https://arxiv.org/abs/1706.03762", headers={"User-Agent": "Mozilla/5.0"})
low line 170

External URL reference

Source: SKILL.md
# PDF: https://arxiv.org/pdf/1706.03762 (no version suffix — always latest)
low line 192

External URL reference

Source: SKILL.md
'atom': 'http://www.w3.org/2005/Atom',
low line 193

External URL reference

Source: SKILL.md
'opensearch': 'http://a9.com/-/spec/opensearch/1.1/',
low line 198

External URL reference

Source: SKILL.md
"http://export.arxiv.org/api/query"
low line 212

External URL reference

Source: SKILL.md
"http://export.arxiv.org/api/query"
low line 223

External URL reference

Source: SKILL.md
http://export.arxiv.org/api/query
low line 226

External URL reference

Source: SKILL.md
HTTPS also works: `https://export.arxiv.org/api/query`
low line 259

External URL reference

Source: SKILL.md
pdf_versioned = f"https://arxiv.org/pdf/{arxiv_id}" # specific version
low line 260

External URL reference

Source: SKILL.md
pdf_latest = f"https://arxiv.org/pdf/{bare_id}" # always redirects to latest
low line 261

External URL reference

Source: SKILL.md
abs_versioned = f"https://arxiv.org/abs/{arxiv_id}"
low line 262

External URL reference

Source: SKILL.md
abs_latest = f"https://arxiv.org/abs/{bare_id}"
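
The four URL templates flagged at lines 259 to 262 of SKILL.md depend on two variables, `arxiv_id` and `bare_id`, whose definitions are not part of the flagged lines. The following is a minimal sketch of how they might fit together, assuming `arxiv_id` carries a version suffix such as `v7` and `bare_id` is the same ID with that suffix removed. The helper name `arxiv_urls` and the `v\d+$` pattern are illustrative assumptions, not taken from the skill.

```python
import re


def arxiv_urls(arxiv_id):
    """Build versioned and latest abs/PDF URLs for one arXiv ID.

    Hypothetical helper: assumes new-style IDs such as '1706.03762v7'.
    """
    # Assumed pattern: drop a trailing 'v<digits>' to get the unversioned ID.
    bare_id = re.sub(r"v\d+$", "", arxiv_id)
    return {
        "pdf_versioned": f"https://arxiv.org/pdf/{arxiv_id}",
        "pdf_latest": f"https://arxiv.org/pdf/{bare_id}",
        "abs_versioned": f"https://arxiv.org/abs/{arxiv_id}",
        "abs_latest": f"https://arxiv.org/abs/{bare_id}",
    }


urls = arxiv_urls("1706.03762v7")
print(urls["pdf_versioned"])
print(urls["abs_latest"])
```

If the input has no version suffix, the substitution leaves it unchanged, so the versioned and latest URLs come out identical.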
low line 281

External URL reference

Source: SKILL.md
Full category taxonomy: https://arxiv.org/category_taxonomy
low line 287

External URL reference

Source: SKILL.md
- **Always define the namespace dict.** Without `NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}`, `findall('atom:entry')` silently returns `[]`. All ArXiv Atom
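
The namespace point in the line 287 excerpt can be illustrated with Python's standard `xml.etree.ElementTree`. This is a minimal sketch on a small hand-written Atom snippet. The sample feed is an assumption made for illustration, not real API output, and no network request is made.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (illustrative only, not real API output).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <title>Attention Is All You Need</title>
  </entry>
</feed>"""

# Namespace dict as quoted from SKILL.md.
NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

root = ET.fromstring(SAMPLE_FEED)

# An unqualified tag name does not match the namespaced <entry> element.
entries_without_ns = root.findall('entry')

# Passing the prefix map lets 'atom:entry' resolve to the Atom namespace.
entries_with_ns = root.findall('atom:entry', NS)
titles = [entry.find('atom:title', NS).text for entry in entries_with_ns]

print(len(entries_without_ns), len(entries_with_ns), titles)
```

The lookup that ignores the namespace finds nothing and raises no error, which is why the omission is easy to miss.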
low line 291

External URL reference

Source: SKILL.md
- **`atom:id` contains a URL, not a bare ID.** The element text is `http://arxiv.org/abs/1706.03762v7` — always split on `/` and take `[-1]` to get the bare ID with version. Strip version with `re.sub
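
The split-and-strip step described in the line 291 excerpt might look like the sketch below. The excerpt is truncated before it shows the skill's actual `re.sub` call, so the `v\d+$` pattern and the function name `split_atom_id` here are assumptions chosen for illustration.

```python
import re


def split_atom_id(atom_id_text):
    """Return (versioned_id, bare_id) from the text of an atom:id element.

    Hypothetical helper: assumes a new-style ID such as '1706.03762v7'.
    """
    # Take the last path segment of the URL, e.g. '1706.03762v7'.
    versioned_id = atom_id_text.strip().split('/')[-1]
    # Assumed pattern: remove a trailing 'v<digits>' version suffix.
    bare_id = re.sub(r'v\d+$', '', versioned_id)
    return versioned_id, bare_id


versioned, bare = split_atom_id("http://arxiv.org/abs/1706.03762v7")
print(versioned, bare)
```

One caveat: old-style arXiv identifiers such as `hep-th/9901001` contain a slash, so taking only the last segment would drop the archive prefix. Code that must handle those IDs would need to split on `/abs/` instead.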
Scanned on May 16, 2026