The race for Artificial General Intelligence (AGI) has hit a “data wall.” To teach AI agents how to handle complex, multi-day professional tasks, OpenAI is moving beyond public datasets. The company is now reportedly recruiting contractors to provide real-world “artifacts” (Word documents, PDFs, PowerPoint decks, and Excel spreadsheets) created during their actual professional careers.
The Goal: Establishing a Human Baseline
OpenAI’s objective is to measure its models against human professionals. By collecting real assignments from various industries, the company can evaluate how well an AI agent can replicate high-level white-collar work, such as:
- Drafting market analysis reports.
- Constructing complex financial budget models.
- Turning messy meeting notes into actionable project plans.
- Managing technical documentation and code repositories.
The “Superstar Scrubbing” Tool
Recognizing the massive legal risks involved in uploading proprietary data, OpenAI has reportedly provided contractors with a specialized tool to sanitize their files.
- How it Works: Contractors are directed to a ChatGPT-powered tool called “Superstar Scrubbing.”
- The Function: This tool is designed to identify and redact personally identifiable information (PII), company secrets, and proprietary brand names before the file is uploaded to OpenAI’s training corpus (see the sketch after this list).
- The Risk: Legal experts warn that “scrubbing” is not foolproof. Contextual information left in a document can often be used to reverse-engineer its origin, potentially leading to breaches of non-disclosure agreements (NDAs).
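OpenAI’s actual scrubbing tool is proprietary and its internals have not been published; the snippet below is only a minimal, hypothetical sketch of what pattern-based redaction can look like in Python. Every name in it (scrub_text, PATTERNS, BLOCKLIST, the placeholder labels) is invented for illustration, and it also shows the limitation flagged above: simple pattern matching leaves contextual clues untouched.

```python
import re

# Hypothetical, minimal illustration of pattern-based redaction.
# A real sanitization pipeline would combine NER models, contractor-supplied
# dictionaries of client and brand names, and human review, not just regexes.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Proprietary terms the contractor knows must never leave the document.
BLOCKLIST = ["Acme Corp", "Project Nightingale"]

def scrub_text(text: str) -> str:
    """Replace likely PII and known proprietary terms with placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    for term in BLOCKLIST:
        text = re.sub(re.escape(term), "[TERM REDACTED]", text, flags=re.IGNORECASE)
    return text

print(scrub_text("Email jane.doe@acmecorp.com about the Project Nightingale budget."))
```

Even in this toy version, the surviving context (“the … budget”, the sender’s role, dates, figures) can be enough to identify the source document, which is exactly the reverse-engineering risk lawyers point to.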
Legal and Ethical Tripwires
The initiative has sparked a firestorm of debate regarding intellectual property (IP) and employment law.
| Stakeholder | Primary Concern |
| --- | --- |
| Former Employers | Claim that work produced under an employment contract belongs to the company, and uploading it to a third party is trade secret theft. |
| Contractors | Risk blacklisting or legal action from previous clients in exchange for short-term gig payments from OpenAI. |
| Industry Experts | Worry that AI is being trained on “stolen” expertise to eventually automate the very professions of the people providing the data. |
| Attorneys | Argue that relying on contractors to decide what is “safe” to upload places both the worker and OpenAI at extreme legal risk. |
The Industry Trend: Specialized Data Sourcing
This strategy reflects a broader 2026 trend where AI labs are no longer satisfied with “scraping the web.”
- Exhaustion of Public Data: Models have largely exhausted high-quality public text.
- Domain Expertise: To move from “chatbots” to “agents,” AI needs to see the internal logic of business processes (emails, memos, and process docs).
- Human-in-the-Loop: Labs are increasingly hiring thousands of specialized professionals (lawyers, engineers, doctors) to curate bespoke datasets rather than relying on automated scraping.
Advice for Professional Contractors
If you are approached for a data-collection gig, legal analysts recommend:
- Review Your NDAs: Most employment contracts explicitly forbid the distribution of “work product” to third parties.
- Create “Synthetic” Samples: Instead of uploading a real past project, create a new, fictionalized version that demonstrates the same complexity without using real data (a minimal sketch follows this list).
- Verify Ownership: Only upload materials for which you hold the full copyright (e.g., personal projects or open-source contributions).
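As a concrete, entirely hypothetical illustration of the “synthetic sample” point, the Python sketch below fabricates a budget spreadsheet with the shape and complexity of a real one; every department name and figure is invented, so nothing proprietary is involved.

```python
import csv
import random

# Minimal sketch: generate a fictionalized budget CSV that mirrors the
# structure of a real project deliverable without containing real data.
# All department names and figures are invented for illustration.

random.seed(7)  # reproducible fake numbers

DEPARTMENTS = ["Ops", "Engineering", "Marketing", "Support"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

with open("synthetic_budget.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Department", "Quarter", "Budget", "Actual", "Variance"])
    for dept in DEPARTMENTS:
        for quarter in QUARTERS:
            budget = random.randint(50, 250) * 1_000
            actual = round(budget * random.uniform(0.8, 1.2))
            writer.writerow([dept, quarter, budget, actual, actual - budget])
```

A fabricated artifact like this demonstrates the same modeling skill without exposing any former employer’s numbers, names, or formulas.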


