dataset October 7, 2024

Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus

publish date:

2024-10-03

authors:

Craig Messner et al.

paper id:

2410.02674v1

abstract:

We present a dataset of 19th-century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags, designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token-level (BERT) and character-level (CANINE) contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees given particular language modeling assumptions. Specifically, we find evidence that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
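To illustrate why the tokenization scheme might matter here, the following is a minimal sketch, not the authors' pipeline. It contrasts a toy greedy longest-match subword split (in the style of WordPiece, which BERT uses) with a character-level view based on Unicode code points (the input unit CANINE operates on). The vocabulary `TOY_VOCAB`, the helper names, and the example pair "going" / "goin'" are all hypothetical and are not drawn from the dataset.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Toy greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def char_tokenize(word):
    """Character-level view: one Unicode code point per character."""
    return [ord(ch) for ch in word]


# Hypothetical vocabulary, chosen only for illustration.
TOY_VOCAB = {"going", "go", "##in", "##ing", "##'"}

standard, variant = "going", "goin'"

subword_standard = wordpiece_tokenize(standard, TOY_VOCAB)
subword_variant = wordpiece_tokenize(variant, TOY_VOCAB)
chars_standard = char_tokenize(standard)
chars_variant = char_tokenize(variant)

# Count positions where the two character sequences agree.
shared_chars = sum(a == b for a, b in zip(chars_standard, chars_variant))

print(subword_standard, subword_variant)
print(shared_chars, "of", len(chars_standard), "characters shared")
```

Under this toy vocabulary the standard spelling stays a single unit while the variant fragments into several pieces, so the two forms share no subword tokens. At the character level they differ in only one position. A real subword vocabulary would split these words differently, so the sketch only shows the kind of divergence that could affect what each model family is able to surface.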

Edited and compiled by: wanghaisheng. Updated: October 7, 2024