NYT
Analysts rate NYT as a hold. Recent commentary argues AI training copyright risk is limited in most scenarios, while data quality and pipeline complexity are growing operational drivers for model builders and platform providers.
Recent proof-backed thesis calls
Two recent theses: (1) Copyrighted works used in AI training are "laundered" into model weights, so copyright risk only meaningfully materializes when models reproduce long copyrighted passages; this implies limited legal overhang for commercialization (source: https://x.com/doodlestein). (2) Practical language-model training issues (HTML-heavy web data, PDFs needing OCR, language ID filtering, dataset auditing (e.g., C4), and deduplication via LSH) make data pipelines and curated/licensed datasets increasingly important, raising ongoing spend on compute, storage, and data tooling.
Short thesis: "everyone is tired of AI slop", i.e. audience fatigue with low-quality, mass-produced AI content. This is more a signal of a possible demand shift: less tolerance for generated filler, and more value placed on curated/premium content and on moderation and authenticity-verification tools. No specifics are given (platform, region, metrics), so trading applicability is low.
Post argues that using copyrighted works in AI training isn’t a major issue because the information is “laundered” into model weights, and the real concern is only if users generate long copyrighted passages. This frames copyright/training-data litigation risk as manageable for model developers and platforms, implying reduced regulatory/legal overhang for AI commercialization.
Lecture focuses on practical LM training-data issues: web data is mostly HTML; PDFs require detection + OCR (often via VLMs); language ID filtering; dataset auditing (e.g., C4 issues); and dedup/near-duplicate detection via LSH. Key takeaway is a research signal that *data quality and pipeline sophistication increasingly gate model performance*, especially as training runs get longer, implying sustained spend on compute + storage + data tooling, and rising strategic value of licensed/curated data.
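To make the near-duplicate detection step concrete, here is a minimal sketch of MinHash signatures with LSH banding using only the Python standard library. The source does not say which LSH variant the lecture covers, so MinHash over word shingles is an assumption, as are the function names and the parameter values (32 hashes, 8 bands, 3-word shingles), all chosen purely for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# All three values are illustrative assumptions, not taken from the lecture.
NUM_HASHES = 32   # length of each MinHash signature
BANDS = 8         # number of LSH bands (rows per band = NUM_HASHES // BANDS)
SHINGLE_SIZE = 3  # words per shingle


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    shingle_set = shingles(text)
    return tuple(
        min(seeded_hash(seed, s) for s in shingle_set)
        for seed in range(NUM_HASHES)
    )


def candidate_pairs(docs):
    """Return pairs of document ids whose signatures agree on at least one band.

    docs maps a document id to its text. Pairs returned here are only
    candidates; a real pipeline would verify them with an exact similarity.
    """
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(text_a, text_b):
    """Exact Jaccard similarity of two texts' shingle sets."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "completely unrelated text about copyright litigation and licensing",
    }
    for left, right in sorted(candidate_pairs(docs)):
        print(left, right, jaccard(docs[left], docs[right]))
```

Banding trades recall for cost: more bands with fewer rows each surface more candidate pairs (including lower-similarity ones), while fewer, wider bands keep only very close matches. Candidates are then checked with the exact Jaccard score before anything is dropped.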
Current stance
Hold. The team views AI copyright overhang as a manageable risk and emphasizes mitigation via output guardrails and data pipeline practices. Confidence in the primary signal is 0.28.
- Risk: AI copyright overhang perceived as manageable (training risk discounted; output guardrails emphasized). Source: https://x.com/doodlestein (confidence 0.28)
Active and historical ticker theses
Active play highlights litigation optionality: rights-holder litigation risk could be marked down if courts broadly permit training on copyrighted works; otherwise, focus remains on output-level reproduction risk and guardrails.
Monitor developments in copyright litigation and regulatory guidance on training data, plus signals on dataset auditing and enterprise demand for licensed/curated data—these will affect the legal/regulatory overhang and the cost structure for model developers.