Anyone used headless browsers for scraping BibTeX from webpages?

Mike Marchywka marchywka at hotmail.com
Thu May 21 11:45:22 CEST 2020


On Wed, May 20, 2020 at 05:51:00PM -0400, John Scott wrote:
> I don't know about BibTeX specifically, but for web scripting or filling in
> basic forms, cURL is pretty handy. For activating elements on a web page,
> you'll probably want to look at saving and reusing cookies with --cookie-jar
> and --cookie, and at how to send POST requests.
> 
> For example, I recently wrote a script that lets me fill in a form and
> complete a CAPTCHA entirely from the CLI. First I ran
>     curl --cookie-jar jar.txt http://foo.com/do.php
> to get it to save the cookie for my session. Then I'd recycle this cookie to 
> get my CAPTCHA:
>     curl --cookie jar.txt -o image.png  http://foo.com/captcha.php
> and lastly, after reading it, send the request (figure out the field names
> with Inspect Element in the browser):
>     curl --cookie jar.txt -X POST -F 'captcha_code=FfFfFf' http://foo.com/do.php
> 
> For help with particular sites, please feel free to share details on or
> off-list.
The publisher finally reverted, which is what usually happens.
But I did find that the headless browser's PDF output could then
be converted to text, and from that I could get the DOI.
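
For anyone wanting to try the same route, here is a rough sketch of that pipeline: render the page to PDF with a headless browser, convert the PDF to text, and grep out anything that looks like a DOI. The function names extract_doi and page_doi are made up for illustration, and the sketch assumes a "chromium" binary (the name varies by system) plus poppler's pdftotext and a grep that supports -o. The DOI pattern is a common heuristic, not a validator.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and tool names may differ
# on your system (chromium vs. chromium-browser vs. google-chrome).

# Print every DOI-like string found on stdin, one per line, de-duplicated.
# Heuristic pattern: "10.", 4-9 digits, "/", then non-space characters.
# Trailing sentence punctuation is stripped afterwards.
extract_doi() {
    grep -oE '10\.[0-9]{4,9}/[^[:space:]"<>]+' | sed 's/[.,;)]*$//' | sort -u
}

# Render the URL in $1 to a PDF with headless Chromium, convert it to
# text with pdftotext, and print any DOIs found.
page_doi() {
    tmp=$(mktemp -d) || return 1
    chromium --headless --disable-gpu --print-to-pdf="$tmp/page.pdf" "$1" \
        >/dev/null 2>&1 &&
        pdftotext "$tmp/page.pdf" - | extract_doi
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Usage would be something like `page_doi 'http://foo.com/article'` after sourcing the file. Keeping the DOI extraction in its own function means it can also be fed text from other sources, such as curl output.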
Thanks.


-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X

