anyone used headless browsers for scraping bibtex from webpages ?

John Scott jscott at posteo.net
Wed May 20 23:51:00 CEST 2020


I don't know about BibTeX specifically, but for web scripting or filling in 
basic forms, cURL is pretty handy. For activating elements on a web page, 
you'll probably want to look at saving and reusing cookies with --cookie-jar 
and --cookie, and at how to send POST requests.

For example, I recently wrote a script that lets me fill in a form and complete 
a CAPTCHA entirely from the CLI. First I ran
    curl --cookie-jar jar.txt http://foo.com/do.php
to get it to save the cookie for my session. Then I reused this cookie to 
fetch my CAPTCHA:
    curl --cookie jar.txt -o image.png  http://foo.com/captcha.php
and lastly, after reading the image, sent the request (you can figure out the 
field names with Inspect Element in a browser):
    curl --cookie jar.txt -X POST -F 'captcha_code=FfFfFf' http://foo.com/do.php
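
Put together, the three steps above could be wrapped up roughly like this. 
It's only a sketch: foo.com, do.php, captcha.php and the captcha_code field are 
the placeholders from above, and the function names and the BASE_URL and JAR 
variables are made up for illustration. I've also assumed you don't need the 
HTML of the first page, so it gets thrown away.

```shell
#!/bin/sh
# Sketch only: the URL, file names and field name are placeholders,
# not a real site. BASE_URL and JAR are hypothetical helper variables.
BASE_URL="${BASE_URL:-http://foo.com}"
JAR="${JAR:-jar.txt}"

# Step 1: load the form page and save the session cookie to the jar.
# The page body itself is discarded (an assumption; drop --output to see it).
start_session() {
    curl --silent --cookie-jar "$JAR" --output /dev/null "$BASE_URL/do.php"
}

# Step 2: reuse the saved cookie to download this session's CAPTCHA image.
# The optional first argument is the output file (default: image.png).
fetch_captcha() {
    curl --silent --cookie "$JAR" --output "${1:-image.png}" "$BASE_URL/captcha.php"
}

# Step 3: submit the form with the code read from the image.
# --form sends multipart/form-data, which already makes the request a POST.
submit_form() {
    curl --silent --cookie "$JAR" --form "captcha_code=$1" "$BASE_URL/do.php"
}
```

After sourcing the file you would run start_session, then fetch_captcha, look 
at image.png, and finish with something like submit_form FfFfFf. Keeping the 
steps separate matters because a human still has to read the image in between.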

For help with particular sites, please feel free to share details on- or 
off-list.