Software: beastie

Tue May 21 16:29:50 CEST 2024

Zeping, hello.

On 21 May 2024, at 12:01, Zeping Lee wrote:

> Just for your information. The repository https://github.com/aclements/biblib provides the
> most accurate grammar of the original BibTeX. The syntax for entry keys is very different from
> identifiers like entry types or field names. Additionally, the grammar that biber recognizes is
> described in https://metacpan.org/dist/Text-BibTeX/view/btparse/doc/bt_language.pod.

Oooh, thanks for this.  I didn't know of btparse and its grammar (though the name does ring a bell, so I've clearly been conscious of it in the past).

I did carefully consult Nelson Beebe's grammar and lexer, alongside the btxdoc document, but didn't follow Beebe's grammar directly.

I didn't aim to be 100% compatible with bibtex-the-program.  That program includes, by design, a lot of compatibility with Scribe, including lexical features which I've never seen in a .bib file.  Once I'd dropped such complete compatibility as a goal, and once I'd added in @include and %-comments, I felt a certain licence to deviate from the strictest interpretation of the btxdoc text in other respects, at least mildly, where it made the grammar or lexer simpler.  I think the modish phrase is that this is an at least slightly opinionated parser.

I think that the grammar below will cover everything I think of as a 'normal' .bib file.  That's obviously a fairly subjective definition of what counts as acceptable input, but my heuristic there has been to regard 'non-normal' as anything I'd advise a colleague or student to avoid.  I'm not massively committed to details of either the grammar or lexer, in the face of significant counterexamples.

The other significant deviation from bibtex's behaviour is that, in a field like title={Ol{\'e}}, the value will be lexed as "Olé", with a selection of 'well-known' commands being recognised (this can be turned off at run-time).  This makes it easier to use the output in non-TeX downstream programs.  In such a case, such command sequences have to be expanded somewhere, and it might as well be at parse time.  Also, we're in a UTF-8 world, now.
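
As a rough illustration of that kind of parse-time expansion (a sketch only, not beastie's actual code), the following Python handles a handful of accent commands in the forms {\'e} and {\'{e}}.  The function name, the table, and the set of accents covered are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical table: TeX accent command -> Unicode combining character.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

# Matches {\'e} and {\'{e}} for the accents listed above.
ACCENT = re.compile(r"""\{\\(['`^"~])(?:\{([A-Za-z])\}|([A-Za-z]))\}""")


def expand_accents(value: str) -> str:
    """Replace recognised accent commands with precomposed characters."""
    def repl(m):
        letter = m.group(2) or m.group(3)
        # Compose base letter + combining mark into a single code point.
        return unicodedata.normalize("NFC", letter + COMBINING[m.group(1)])
    return ACCENT.sub(repl, value)


print(expand_accents(r"Ol{\'e}"))
```

Anything the pattern does not recognise is left untouched, which is one plausible way of passing unknown commands through unchanged.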

The non-TeX downstream is part of my motivation.  HTML output is an obvious example, but I was also thinking of other programs which might want to process .bib files, but which don't want to write a parser of their own.  Thus a beastie output format which is easily parseable by other tools seemed to be a potentially useful contribution, even if in that scenario beastie did no actual bibliography generation.

A final fragment of my motivation was that this might start a conversation about bibliographies, in a world where bibtex v0.99 has been current, and v1.0 has been anticipated, since 1988.

Best wishes,

Norman

input: opt_interentry_text list_of_stanzas opt_interentry_text
list_of_stanzas: stanza
  | list_of_stanzas opt_interentry_text stanza
opt_interentry_text: /* empty */
  | opt_interentry_text INTERENTRYTEXT
stanza: entry
  | atpreamble
  | atstring
  | atcomment
  | atinclude
entry: ENTRYTYPE '{' NAME ',' list_of_fields '}'
  | ENTRYTYPE '{' NAME '}'
  | ENTRYTYPE '{' NAME ',' '}'
list_of_fields: field
  | list_of_fields ',' field
  | list_of_fields ','
field: NAME '=' string
atpreamble: ATPREAMBLE '{' string '}'
atstring: ATSTRING '{' NAME '=' string '}'
atinclude: ATINCLUDE '{' NAME '}'
atcomment: ATCOMMENT string
string: STRINGVALUE
  | NAME
  | string '#' STRINGVALUE
  | string '#' NAME
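
To make the `string` rule above concrete, here is a minimal hand-written sketch of it in Python.  It is not taken from the implementation: the token representation (a list of (kind, text) pairs) and the function name are assumptions made for the example.

```python
def parse_string(tokens, pos=0):
    """Parse the `string` rule starting at tokens[pos].

    Returns (parts, next_pos), where parts lists the operand tokens in order.
    """
    def operand(p):
        # An operand is either a STRINGVALUE or a NAME
        # (presumably a macro defined by @string).
        if p < len(tokens) and tokens[p][0] in ("STRINGVALUE", "NAME"):
            return tokens[p]
        raise SyntaxError(f"expected STRINGVALUE or NAME at token {p}")

    parts = [operand(pos)]
    pos += 1
    # The left-recursive alternatives `string '#' ...` become a loop.
    while pos < len(tokens) and tokens[pos][0] == "#":
        parts.append(operand(pos + 1))
        pos += 2
    return parts, pos


# Hypothetical tokens for the field value in: month = jan # " 2024" }
tokens = [("NAME", "jan"), ("#", "#"), ("STRINGVALUE", " 2024"), ("}", "}")]
parts, next_pos = parse_string(tokens)
print(parts, next_pos)
```

The parser stops at the first token that is not '#', leaving the closing '}' for whichever rule (field, atstring, atpreamble) called it.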

This does include the @include form that Beebe suggests, and the lexer also supports %-comments (to line end) within entries.

Some lexemes from my implementation:

NAME            [0-9]*[A-Za-z][-A-Za-z0-9:.+/_&]*

this is Beebe's choice, plus [_&] (both of which I've seen in the wild) and minus ['], and permitting names to start with digits.  Like Beebe, I decided to have the same lexeme for citation keys, entries, and fields, even though bibtex the program does distinguish them.  I don't _recall_ having seen entry keys including ['], but feel that including that character in a key is just asking for trouble _somewhere_, even when using the bibtex program directly, so I'm pretty relaxed about that deviation.
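
For anyone wanting to check particular keys against that pattern, the regular expression can be tried directly in Python.  The sample keys are invented for the example, and the helper name is an assumption.

```python
import re

# The NAME lexeme quoted above, unchanged.
NAME = re.compile(r"[0-9]*[A-Za-z][-A-Za-z0-9:.+/_&]*")


def is_name(text: str) -> bool:
    """True if the whole of text matches the NAME lexeme."""
    return NAME.fullmatch(text) is not None


for key in ["gray:2024", "2024gray", "smith_jones&co", "o'brien", "1234"]:
    print(f"{key!r}: {is_name(key)}")
```

Note that a purely numeric key such as 1234 does not match, since the pattern requires at least one letter after any leading digits.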

I see that btparse permits also [!$*;<>?[]^`|] -- myself, I'd put those in the 'asking for trouble' category, especially if the output is intended for a non-TeX pipeline, but I don't think anything would break if I added at least some of them.

Both [{(] turn into the '{' lexeme, and similarly for '}'.  No other Scribe delimiters, anywhere.
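
A toy sketch of that delimiter folding, in Python, assuming a simple character-to-lexeme table (the names are invented for the example).  It deliberately ignores lexer state, so it says nothing about braces or parentheses inside string values.

```python
# Hypothetical table: both opening forms share one lexeme,
# as do both closing forms.
DELIMITER_LEXEME = {"{": "{", "(": "{", "}": "}", ")": "}"}


def delimiter_lexemes(text):
    """Return the delimiter lexemes found in text, ignoring everything else."""
    return [DELIMITER_LEXEME[ch] for ch in text if ch in DELIMITER_LEXEME]


print(delimiter_lexemes("@string(jan = january)"))
print(delimiter_lexemes("@string{jan = january}"))
```

With a table like this, the grammar only ever sees '{' and '}', whichever form was used in the .bib file.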

-- 
Norman Gray  :  https://nxg.me.uk