something dumb on bibtex entry format,

Mike Marchywka marchywka at hotmail.com
Wed Jan 13 16:24:45 CET 2021


On Tue, Jan 12, 2021 at 08:09:25PM +0100, barbara beeton wrote:
> I'm not sure how you'd accomplish it, and can't offer help, but if
> you can add this check, to at least give a warning, you'd almost
> certainly be able to identify all likely problems of this sort:
> 
> In a bibliographic entry, it's extremely unlikely that either a
> leading group-defining open brace or printing brace ( \{ ) would
> be "matched" by a closing brace of the other variety.  So checking
> for properly nested type-braces, and reporting unmatched ones,
> would allow you to manually identify and correct as necessary.

Thanks, I'm trying to update and generalize the parser anyway, so it
is easy to include nesting levels for many things. Right now,
all I have is braces, along with toggles for escapes and quotes.
Adding the "escaped" brace count turned out to be easy, although I
don't track order right now. I guess a user could want to write
about a right brace or something :) So I list it as an observation
and dump it with the output if there is a real error. Ultimately I want
to clean up and canonicalize the entry, including my own fields,
which are currently included as comments. I'm not too worried about copying
bibtex behavior, as I'll just run bibtex with an ok bst. My parser
should at least pick up things that don't conform to basic expectations :)
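
For anyone following along, here is a minimal sketch of the kind of
check discussed above: count group braces and "escaped" braces ( \{ and \} )
separately and note any underflow. It is written in Python for brevity and
is not my actual parser; the names BraceReport and scan_braces are made up
for illustration, and it deliberately ignores quotes and comments.

```python
from dataclasses import dataclass, field


@dataclass
class BraceReport:
    depth: int = 0           # net count of unescaped { minus unescaped }
    escaped_depth: int = 0   # net count of \{ minus \}
    notes: list = field(default_factory=list)


def scan_braces(text: str) -> BraceReport:
    """Walk the text once, tracking group braces and escaped braces separately."""
    report = BraceReport()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "{":
                report.escaped_depth += 1
            elif nxt == "}":
                report.escaped_depth -= 1
                if report.escaped_depth < 0:
                    report.notes.append(f"escaped brace underflow at offset {i}")
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            report.depth += 1
        elif ch == "}":
            report.depth -= 1
            if report.depth < 0:
                report.notes.append(f"group brace underflow at offset {i}")
        i += 1
    if report.depth > 0:
        report.notes.append(f"{report.depth} group brace(s) still open at end of input")
    return report


# The problem line from the test entry: one open group brace, one stray \}
r = scan_braces("month = {2015-09-08 00:00:00 \\},")
print(r.depth, r.escaped_depth, r.notes)
```

An underflow is only recorded as a note rather than treated as fatal, since
a stray \} could be intentional; the unclosed group brace is the real error.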

Right now the parser just complains at EOF when it doesn't have enough
right braces. At that point it dumps the partial strings it has already parsed,
and you can see it ends around the month field (the "strings" may contain
"spurious" chars, as it dumps 8 chars even if the string is shorter), so you
may start looking there, but the observation at the end does help.

An excerpt, 

.dump_errors(1)=18538 inproceedings earlyend  looking for comma or brace     pc=806 braces=1 state=2 buffer = % programmatically fixed probably bu toobib
% loaded from bbb written on 2019-11-02:17:33:45
%0 prio[...]l = {http://online.liebertpub.com/doi/full/10.1089/ten.tea.2015.5000.abstracts},
    year = {2015}
\0 strings :[0]inprocee,[1]18538ad,[2]address,[3]Boston, ,[4]authorS,[5]Silva, J,[6]doi10.1,[7]10.1089/,[8]journal,[9]Tissue E,[10]keywords,[11]Drug del,[12]month20,[13]2015-09-,[14],
observation  user braces underflow   ps.dump()= pc=530 state=50 braces=2 ubraces=-1 skipped=104 state=32
 

From this input,

cat xxx
% programmatically fixed probably bu toobib
% loaded from bbb written on 2019-11-02:17:33:45
%0 prior 0
@inproceedings{18538,
    address = {Boston, MA, USA},
    author = {Silva, J. C. and Aroso, I. M. and Mano, F. and S{\'a}-Nogueira, I. and Barreiros, S. and Reis, R. L. and Paiva, A. and Duarte, A. R. C.},
    doi = {10.1089/ten.tea.2015.5000.abstracts},
    journal = {Tissue Engineering Part A},
    keywords = {Drug delivery systems, green chemistry, Therapeutic deep eutectic solvents},
    month = {2015-09-08 00:00:00 \},
    publisher = {John Wiley \& Sons, Ltd },
    title = { Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients},
    url = {http://online.liebertpub.com/doi/full/10.1089/ten.tea.2015.5000.abstracts},
    year = {2015}
}


> 
> Also, but generally unrelated, initial or terminal spaces within
> individual bib items (author, title, etc.) are generally undesirable
> and can result in unwanted spaces in output, unless care is taken to
> ignore them in processing (may be done, but not guaranteed; I don't
> know how rigorous bib-processing macros are in this regard).

In theory once parsed, I can write everything out uniformly and fix
stuff like that or at least make it consistent...
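
A rough sketch of what "write everything out uniformly" could look like,
assuming the entry has already been parsed into a type, a key and a dict of
field values. This is Python for illustration only; normalize_fields and
format_entry are hypothetical names, and collapsing internal whitespace is an
assumption that may not be right for every field.

```python
import re


def normalize_fields(fields: dict) -> dict:
    """Lower-case field names, trim values, and collapse runs of whitespace."""
    return {
        name.strip().lower(): re.sub(r"\s+", " ", value).strip()
        for name, value in fields.items()
    }


def format_entry(entry_type: str, key: str, fields: dict) -> str:
    """Write one entry with sorted fields and consistent indentation."""
    cleaned = normalize_fields(fields)
    body = ",\n".join(
        f"    {name} = {{{value}}}" for name, value in sorted(cleaned.items())
    )
    return f"@{entry_type}{{{key},\n{body}\n}}"


print(format_entry("inproceedings", "18538", {
    "publisher": "John Wiley \\& Sons, Ltd ",
    "title": " Therapeutic deep eutectic solvents as solubility enhancers",
    "year": "2015",
}))
```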

> 
> Hope these observations are useful.
> 						-- bb
> 
> On Tue, 12 Jan 2021, Mike Marchywka wrote:
> 
> > On Tue, Jan 12, 2021 at 04:13:51PM +0000, David Carlisle wrote:
> > >    \} is the latex syntax for the character } so in fields taking tex streams you have a literal } character and a missing }
> > >    to match the {  at the beginning.
> > >    in fields taking numeric entries it's again a missing closing brace and then a spurious } in the date I would expect.
> > >    actually what happens is that bibtex takes this as a literal \ so the generated bbl file has
> > >    \newblock Boston, MA, USA, 2015-09-08 00:00:00 2015. John Wiley \& Sons, Ltd \.
> > >    but \. is the accent command  and the argument here is the \par from the following blank line so latex gives the error
> > >    ! Paragraph ended before \OT1\. was complete.
> > >    same with the month field the \ is just passed to latex but here you get
> > >    \newblock Boston, MA, USA, 2015-09-08 00:00:00 \ 2015. John Wiley \& Sons, Ltd.
> > >    so you get the safe \   rather than \.{\par} just by accidental chance of this bib style.  If the bib style had put a full
> > >    stop rather than a space after the date you would have had the same error as before.
> > > 
> > 
> > Thanks, I was originally just trying to do syntax but apparently that is not possible without a format.
> > My syntax parser is just that and  I wanted a format-independent validation. However, your point
> > makes sense and now I just returned the rendered reference which really is what matters as long
> > as the format picks up the right fields. The rendered/pdftotext outputs appear identical with the
> > backslash or not in the month field,
> > 
> > 
> > [1] J. C. Silva, I. M. Aroso, F. Mano, I. Sá-Nogueira, S. Barreiros, R. L. Reis,
> > A. Paiva, and A. R. C. Duarte. Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients. Boston, MA,
> > USA, 2015-09-08 00:00:00 2015. John Wiley & Sons, Ltd.
> > 
> > 
> > [1] J. C. Silva, I. M. Aroso, F. Mano, I. Sá-Nogueira, S. Barreiros, R. L. Reis,
> > A. Paiva, and A. R. C. Duarte. Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients. Boston, MA,
> > USA, 2015-09-08 00:00:00 2015. John Wiley & Sons, Ltd.
> > 
> > 1
> > 
> > I guess I'll have to make a validation bst entry that picks up everything and makes errors
> > more apparent.
> > 
> > 
> > echo val xxx  | ../a.out
> > mjm_assemble_putative_bibtex.h615  MJM_ASSEMBLE_PUTATIVE_BIBTEX Jan 12 2021 13:34:29
> > 
> > ../../mjm/hlib/mjm_pawnoff.h439 ONCE  fuxed m_today to exclude time wtf
> > ../../mjm/hlib/mjm_instruments.h791  popping an old stream
> > mjm>val xxx
> > ../../mjm/hlib/mjm_pawnoff.h347 ONCE  Fileio is not thread of process safe doh
> > mjm_assemble_putative_bibtex.h290  cmd=cat checkbib_test_output.xxx | grep -i "output\|error\|warning" | sed -e 's/  */ /g'
> > mjm_assemble_putative_bibtex.h292  c=0 StrTy(err)= StrTy(out)=No pages of output .
> > Warning - - empty booktitle in 18538
> > ( There was 1 warning )
> > LaTeX Warning : Label ( s ) may have changed . Rerun to get cross - references right .
> > Output written on xxx . pdf ( 1 page , 28929 bytes ) .
> > StrTy(data)=
> > mjm_assemble_putative_bibtex.h295  rcl=0 fnerr=checkbib_test_output.xxx
> > mjm_assemble_putative_bibtex.h299  StrTy(rendered)=References
> > [1] J. C. Silva, I. M. Aroso, F. Mano, I. Sá-Nogueira, S. Barreiros, R. L. Reis,
> > A. Paiva, and A. R. C. Duarte. Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients. Boston, MA,
> > USA, 2015-09-08 00:00:00 2015. John Wiley & Sons, Ltd.
> > 
> > 1
> > 
> > 
> > 
> > mjm_assemble_putative_bibtex.h648  m_n=0 m_errors=0 m_be.name()= m_be.type()= m_be.size()=0 m_be.errors()=0 m_latex_output=No pages of output .
> > Warning - - empty booktitle in 18538
> > ( There was 1 warning )
> > LaTeX Warning : Label ( s ) may have changed . Rerun to get cross - references right .
> > Output written on xxx . pdf ( 1 page , 28929 bytes ) .
> > 
> > mjm>../../mjm/hlib/mjm_instruments.h340  readline returns null danger will robinson 01
> > marchywka at happy:/home/documents/cpp/proj/toobib/junk$ vi xxx
> > marchywka at happy:/home/documents/cpp/proj/toobib/junk$ echo val xxx  | ../a.out
> > mjm_assemble_putative_bibtex.h615  MJM_ASSEMBLE_PUTATIVE_BIBTEX Jan 12 2021 13:34:29
> > 
> > ../../mjm/hlib/mjm_pawnoff.h439 ONCE  fuxed m_today to exclude time wtf
> > ../../mjm/hlib/mjm_instruments.h791  popping an old stream
> > mjm>val xxx
> > ../../mjm/hlib/mjm_pawnoff.h347 ONCE  Fileio is not thread of process safe doh
> > mjm_assemble_putative_bibtex.h290  cmd=cat checkbib_test_output.xxx | grep -i "output\|error\|warning" | sed -e 's/  */ /g'
> > mjm_assemble_putative_bibtex.h292  c=0 StrTy(err)= StrTy(out)=No pages of output .
> > Warning - - empty booktitle in 18538
> > ( There was 1 warning )
> > LaTeX Warning : Label ( s ) may have changed . Rerun to get cross - references right .
> > Output written on xxx . pdf ( 1 page , 28937 bytes ) .
> > StrTy(data)=
> > mjm_assemble_putative_bibtex.h295  rcl=0 fnerr=checkbib_test_output.xxx
> > mjm_assemble_putative_bibtex.h299  StrTy(rendered)=References
> > [1] J. C. Silva, I. M. Aroso, F. Mano, I. Sá-Nogueira, S. Barreiros, R. L. Reis,
> > A. Paiva, and A. R. C. Duarte. Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients. Boston, MA,
> > USA, 2015-09-08 00:00:00 2015. John Wiley & Sons, Ltd.
> > 
> > 1
> > 
> > 
> > 
> > mjm_assemble_putative_bibtex.h648  m_n=0 m_errors=0 m_be.name()= m_be.type()= m_be.size()=0 m_be.errors()=0 m_latex_output=No pages of output .
> > Warning - - empty booktitle in 18538
> > ( There was 1 warning )
> > LaTeX Warning : Label ( s ) may have changed . Rerun to get cross - references right .
> > Output written on xxx . pdf ( 1 page , 28937 bytes ) .
> > 
> > mjm>../../mjm/hlib/mjm_instruments.h340  readline returns null danger will robinson 01
> > marchywka at happy:/home/documents/cpp/proj/toobib/junk$
> > 
> > 
> > 
> > > >    On Tue, 12 Jan 2021 at 15:59, Mike Marchywka <marchywka at hotmail.com> wrote:
> > > 
> > >      I'm finally moving my bibtex scraping script to c++ and cleaning up a lot
> > >      of stuff. I validate a foreign or scraped bibtex entry using latex run on
> > > >      a test document and my own parser. It's unlikely my parser conforms to
> > >      bibtex requirements exactly so I want to do both. I found what appears
> > >      to be an old test case and it still does not make sense. The question
> > >      seems to be about backslashes preceding a terminating right brace.
> > >      Sometimes they are ok, others not. The test files are xxx.tex and xxx.bib
> > >      as shown below.  If I put a backslash on the "month" line before the right brace
> > >      it seems to work ( originally there was an abstract entry with the problem but
> > >      I deleted it for space and clarity ),
> > >          month = {2015-09-08 00:00:00 \},
> > >      However, doing it on the publisher line fails,
> > >          publisher = {John Wiley \& Sons, Ltd \},
> > >       cat /tmp/xxx.tex
> > >      \documentclass{article}
> > >      \begin{document}
> > >      \nocite{*}
> > >      \bibliographystyle{plain}
> > >      \bibliography{xxx}
> > >      \end{document}
> > >      marchywka at happy:/home/documents/cpp/proj/toobib/junk$ cat /tmp/xxx.bib
> > >      % programmatically fixed probably bu toobib
> > >      % loaded from bbb written on 2019-11-02:17:33:45
> > >      %0 prior 0
> > >      @inproceedings{18538,
> > >          address = {Boston, MA, USA},
> > > >          author = {Silva, J. C. and Aroso, I. M. and Mano, F. and S{\'a}-Nogueira, I. and Barreiros, S. and Reis, R. L. and Paiva, A. and Duarte, A. R. C.},
> > >          doi = {10.1089/ten.tea.2015.5000.abstracts},
> > >          journal = {Tissue Engineering Part A},
> > >          keywords = {Drug delivery systems, green chemistry, Therapeutic deep eutectic solvents},
> > >          month = {2015-09-08 00:00:00 },
> > >          publisher = {John Wiley \& Sons, Ltd },
> > > >          title = { Therapeutic deep eutectic solvents as solubility enhancers fordifferent active pharmaceutical ingredients},
> > > >          url = {http://online.liebertpub.com/doi/full/10.1089/ten.tea.2015.5000.abstracts},
> > >          year = {2015}
> > >      }
> > >      marchywka at happy:/home/documents/cpp/proj/toobib/junk$
> > >      I run 3 times, latex, bibtex, and latex again, then grep for error, output, and warning
> > >      giving the following lines in the two cases,
> > >      No pages of output .
> > >      Warning - - empty booktitle in 18538
> > >      ( There was 1 warning )
> > >      . / xxx . bbl : 9 : = = > Fatal error occurred , no output PDF file produced !
> > >      versus,
> > >      =No pages of output .
> > >      Warning - - empty booktitle in 18538
> > >      ( There was 1 warning )
> > >      LaTeX Warning : Label ( s ) may have changed . Rerun to get cross - references right .
> > >      Output written on xxx . pdf ( 1 page , 28937 bytes ) .
> > >      What is the backslash before the brace supposed to do or is there something silly I'm
> > >      missing? Thanks.
> > >      note new address
> > >       Mike Marchywka 306 Charles Cox Drive Canton, GA 30115
> > >       2295 Collinworth  Drive Marietta GA 30062.  formerly 487 Salem Woods Drive Marietta GA 30067 404-788-1216 (C)<- leave
> > >      message 989-348-4796 (P)<- emergency
> > 
> > -- 
> > 
> > mike marchywka
> > 306 charles cox
> > canton GA 30115
> > USA, Earth
> > marchywka at hotmail.com
> > 404-788-1216
> > ORCID: 0000-0001-9237-455X
> > 


-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X

