Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls

Shotaro Ishihara (2024). Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls. Fourth Workshop on Trustworthy Natural Language Processing.
https://arxiv.org/abs/2404.17143

Shotaro Ishihara

May 08, 2024

Transcript

  1. Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls
     Shotaro Ishihara (Nikkei Inc.)
     https://arxiv.org/abs/2404.17143

     Research Question: Do Japanese PLMs memorize their training data to the same extent as English PLMs?

     Approach: We pre-trained GPT-2 models on Japanese newspaper articles. The publicly visible beginning of each article is used as a prompt, and the remaining string behind the paywall (private) is used for evaluation.

     Findings:
     1. Japanese PLMs sometimes "copy and paste" on a large scale.
     2. We replicated the empirical finding that memorization is related to duplication, model size, and prompt length: the more epochs (i.e., more duplication), the larger the model, and the longer the prompt, the more memorization.

     [Figure: example generation; memorized strings are highlighted in green (48 chars).]
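To make the prompt-and-compare evaluation concrete, here is a minimal sketch: generate greedily from the public lead of an article and count how many leading characters of the true paywalled continuation are reproduced verbatim. The checkpoint name (rinna/japanese-gpt2-medium, a public Japanese GPT-2) and the article pair are placeholders, not the in-house models or data used in the paper.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder checkpoint: a public Japanese GPT-2, standing in for the
    # paper's newspaper-pretrained models.
    MODEL_NAME = "rinna/japanese-gpt2-medium"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    def memorized_prefix_length(public_lead: str, paywalled_body: str,
                                max_new_tokens: int = 128) -> int:
        """Greedy-decode a continuation of the public lead and count how many
        leading characters match the true paywalled continuation."""
        inputs = tokenizer(public_lead, return_tensors="pt")
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding
        )
        # Keep only the newly generated tokens (drop the prompt).
        generated = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        # Longest common character prefix = verbatim "copy and paste" length.
        n = 0
        for g, r in zip(generated, paywalled_body):
            if g != r:
                break
            n += 1
        return n

    # Hypothetical article pair: public lead as prompt, paywalled body as reference.
    # print(memorized_prefix_length("記事の冒頭の無料公開部分…", "有料部分に続く本文…"))

Greedy decoding and character-level prefix matching are one reasonable way to operationalize "copy and paste"; the paper's exact decoding settings and matching criteria may differ.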