ArXiv — Scraping & Data Extraction

Searches ArXiv papers and retrieves their metadata through the Atom API.

Security score: 71/100

The skill was audited on May 16, 2026, and the scan found 29 security issues across 1 threat category, all of them low-severity external URL references in SKILL.md. Review the findings below before installing.

Security Issues

low line 3

External URL reference

Source: SKILL.md

3	`https://arxiv.org` — open-access preprint server. Never use the browser for ArXiv. All data is reachable via `http_get` using the Atom API or HTML meta tags. No API key required.

low line 13

External URL reference

Source: SKILL.md

13	NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

low line 15

External URL reference

Source: SKILL.md

15	xml = http_get("http://export.arxiv.org/api/query?search_query=ti:transformer+AND+cat:cs.LG&max_results=5&sortBy=submittedDate&sortOrder=descending")
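
The query string in the flagged line packs a search expression and three paging and sorting parameters into one URL. A minimal sketch of assembling the same URL with the standard library follows; `build_search_url` and `API_URL` are illustrative names that do not appear in SKILL.md, and the sort parameters are simply the ones shown above.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the finding above.
API_URL = "http://export.arxiv.org/api/query"

def build_search_url(query, max_results=5):
    """Build a search URL sorted by newest submission first (hypothetical helper)."""
    params = {
        "search_query": query,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # safe=":" keeps the field prefixes (ti:, cat:) readable; spaces become "+".
    return f"{API_URL}?{urlencode(params, safe=':')}"

url = build_search_url("ti:transformer AND cat:cs.LG")
print(url)
```

Building the string from a dict keeps the search expression separate from the paging options, which makes it harder to forget the `+` between terms.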

low line 22

External URL reference

Source: SKILL.md

22	Use `http_get` on `https://arxiv.org/abs/{id}` + regex for `citation_*` meta tags when you need the full abstract from an HTML page.
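
The regex approach to `citation_*` meta tags mentioned above can be sketched as follows. The HTML is a hand-written sample, not a captured ArXiv page, and `parse_citation_meta` and `META_RE` are hypothetical names; the pattern assumes `name` precedes `content` and that both use double quotes, which would need checking against a real response.

```python
import re
from collections import defaultdict
from html import unescape

# Hand-written sample standing in for the <head> of an abstract page.
SAMPLE_HTML = '''
<meta name="citation_title" content="Attention Is All You Need" />
<meta name="citation_author" content="Vaswani, Ashish" />
<meta name="citation_author" content="Shazeer, Noam" />
<meta name="citation_arxiv_id" content="1706.03762" />
'''

# Captures the suffix after "citation_" and the content attribute.
META_RE = re.compile(r'<meta\s+name="citation_(\w+)"\s+content="([^"]*)"')

def parse_citation_meta(html):
    """Collect citation_* meta tags into {suffix: [values]} (hypothetical helper)."""
    fields = defaultdict(list)
    for key, value in META_RE.findall(html):
        fields[key].append(unescape(value))  # decode entities such as &amp;
    return dict(fields)

meta = parse_citation_meta(SAMPLE_HTML)
print(meta["title"], meta["author"])
```

Values are kept as lists because a tag such as `citation_author` can repeat once per author.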

low line 32

External URL reference

Source: SKILL.md

32	NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

low line 35

External URL reference

Source: SKILL.md

35	"http://export.arxiv.org/api/query"

low line 59

External URL reference

Source: SKILL.md

59	# PDF: https://arxiv.org/pdf/2604.15259v1

low line 68

External URL reference

Source: SKILL.md

68	NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

low line 70

External URL reference

Source: SKILL.md

70	xml = http_get("http://export.arxiv.org/api/query?id_list=1706.03762")

low line 84

External URL reference

Source: SKILL.md

84	# PDF: https://arxiv.org/pdf/1706.03762v7

low line 96

External URL reference

Source: SKILL.md

96	NS = {'atom': 'http://www.w3.org/2005/Atom'}

low line 99

External URL reference

Source: SKILL.md

99	xml = http_get(f"http://export.arxiv.org/api/query?id_list={','.join(ids)}&max_results={len(ids)}")
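
The batch lookup in the flagged line joins several IDs into one `id_list` and sets `max_results` to the number of IDs, presumably so that a default result cap does not truncate the batch. A small sketch of that URL construction follows; `build_batch_url` is a hypothetical helper, and the de-duplication step is an addition not shown in SKILL.md.

```python
API_URL = "http://export.arxiv.org/api/query"

def build_batch_url(ids):
    """Return one query URL covering every ID in `ids` (hypothetical helper)."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep first-seen order
    if not unique_ids:
        raise ValueError("need at least one arXiv ID")
    return f"{API_URL}?id_list={','.join(unique_ids)}&max_results={len(unique_ids)}"

# Example IDs only; the second and third are included to show de-duplication.
url = build_batch_url(["1706.03762", "1810.04805", "1706.03762"])
print(url)
```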

low line 123

External URL reference

Source: SKILL.md

123	NS = {'atom': 'http://www.w3.org/2005/Atom'}

low line 126

External URL reference

Source: SKILL.md

126	xml = http_get(f"http://export.arxiv.org/api/query?id_list={arxiv_id}")

low line 153

External URL reference

Source: SKILL.md

153	html = http_get("https://arxiv.org/abs/1706.03762", headers={"User-Agent": "Mozilla/5.0"})

low line 170

External URL reference

Source: SKILL.md

170	# PDF: https://arxiv.org/pdf/1706.03762 (no version suffix — always latest)

low line 192

External URL reference

Source: SKILL.md

192	'atom': 'http://www.w3.org/2005/Atom',

low line 193

External URL reference

Source: SKILL.md

193	'opensearch': 'http://a9.com/-/spec/opensearch/1.1/',

low line 198

External URL reference

Source: SKILL.md

198	"http://export.arxiv.org/api/query"

low line 212

External URL reference

Source: SKILL.md

212	"http://export.arxiv.org/api/query"

low line 223

External URL reference

Source: SKILL.md

223	http://export.arxiv.org/api/query

low line 226

External URL reference

Source: SKILL.md

226	HTTPS also works: `https://export.arxiv.org/api/query`

low line 259

External URL reference

Source: SKILL.md

259	pdf_versioned = f"https://arxiv.org/pdf/{arxiv_id}" # specific version

low line 260

External URL reference

Source: SKILL.md

260	pdf_latest = f"https://arxiv.org/pdf/{bare_id}" # always redirects to latest

low line 261

External URL reference

Source: SKILL.md

261	abs_versioned = f"https://arxiv.org/abs/{arxiv_id}"

low line 262

External URL reference

Source: SKILL.md

262	abs_latest = f"https://arxiv.org/abs/{bare_id}"
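
The four flagged lines 259 to 262 derive versioned and latest URLs from two variables, `arxiv_id` and `bare_id`. A sketch of how the two could relate is shown below, assuming `arxiv_id` carries a `vN` suffix and `bare_id` is the same string with that suffix removed; `arxiv_urls` and the regex are illustrative and are not taken from SKILL.md.

```python
import re

def arxiv_urls(arxiv_id):
    """Return versioned and latest abs/pdf URLs for an ID (hypothetical helper)."""
    # Assumption: the version, when present, is a trailing "v" plus digits.
    bare_id = re.sub(r"v\d+$", "", arxiv_id)
    return {
        "pdf_versioned": f"https://arxiv.org/pdf/{arxiv_id}",
        "pdf_latest": f"https://arxiv.org/pdf/{bare_id}",
        "abs_versioned": f"https://arxiv.org/abs/{arxiv_id}",
        "abs_latest": f"https://arxiv.org/abs/{bare_id}",
    }

urls = arxiv_urls("1706.03762v7")
print(urls["pdf_versioned"], urls["pdf_latest"])
```

When the input has no version suffix, the versioned and latest URLs come out identical.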

low line 281

External URL reference

Source: SKILL.md

281	Full category taxonomy: https://arxiv.org/category_taxonomy

low line 287

External URL reference

Source: SKILL.md

287	- Always define the namespace dict. Without `NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}`, `findall('atom:entry')` silently returns `[]`. All ArXiv Atom
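
The namespace pitfall described above can be reproduced offline. The sketch below parses a hand-written, heavily trimmed Atom feed (not a real API response) with the standard library's `xml.etree.ElementTree`, which is an assumption about the parser the skill uses.

```python
import xml.etree.ElementTree as ET

NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}

# Hand-written sample; real responses carry many more elements per entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <title>Attention Is All You Need</title>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# An unqualified tag name does not match namespaced elements: empty list, no error.
without_ns = root.findall('entry')

# The same search with the prefix map finds the entry.
with_ns = root.findall('atom:entry', NS)
titles = [entry.findtext('atom:title', namespaces=NS) for entry in with_ns]

print(len(without_ns), len(with_ns), titles)
```

With this parser, a prefixed path passed without any prefix map appears to raise `SyntaxError` rather than return an empty list, so the silent empty result is most likely to come from an unprefixed path such as `findall('entry')`.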

low line 291

External URL reference

Source: SKILL.md

291	- `atom:id` contains a URL, not a bare ID. The element text is `http://arxiv.org/abs/1706.03762v7` — always split on `/` and take `[-1]` to get the bare ID with version. Strip version with `re.sub
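
The ID extraction described above, splitting the `atom:id` URL on `/` and then dropping the version, might look like the sketch below. `parse_atom_id` is a hypothetical helper, and because the `re.sub` call in the quoted line is cut off, the pattern used here is an assumption rather than the one in SKILL.md.

```python
import re

def parse_atom_id(id_text):
    """Split an atom:id URL into (versioned_id, bare_id) (hypothetical helper)."""
    versioned_id = id_text.split('/')[-1]          # last path segment of the URL
    bare_id = re.sub(r'v\d+$', '', versioned_id)   # assumed pattern: trailing vN
    return versioned_id, bare_id

versioned_id, bare_id = parse_atom_id("http://arxiv.org/abs/1706.03762v7")
print(versioned_id, bare_id)
```

One caveat worth checking: old-style identifiers of the form `archive/number` contain a slash themselves, so taking only the last segment would drop the archive prefix for those papers.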

Scanned on May 16, 2026

GitHub Stars: 7

Category: data analytics

Updated: June 15, 2026

openclaw api data-analyst researcher ml-ai-engineer data analytics education research development

yangchuansheng/browser-harness-rust