
ConvoScope

A Privacy-Preserving Platform for LLM Conversation Research

ConvoScope enables researchers to collect and study LLM conversations while protecting participant privacy. All conversation processing happens client-side in the browser—raw data never leaves the user's device.


Features

  • Alignment-Based Classifier Training: Define target conversations through natural language descriptions. No ML expertise required.
  • Client-Side Privacy: All filtering, classification, and PII detection runs entirely in the browser using ONNX models and Transformers.js.
  • Flexible Conversation Generation: Choose between OpenAI API or local vLLM (UserLM-8b + Qwen3-8B) for synthetic conversation generation.
  • Custom Binary Classifiers: Train study-specific classifiers that export to ONNX for browser inference.
  • Demo Mode: Try the platform with pre-generated sample conversations.
  • Complete Research Workflow: From study creation to data collection, designed for non-technical researchers.

Quick Start

  1. Clone the repository:

    git clone https://repo.paperbackwriters.club/code/convoscope.git
    cd convoscope
    
  2. Install dependencies:

    npm install
    cd frontend && npm install && cd ..
    cd backend && npm install && cd ..
    
  3. Enable demo mode:

    # Frontend
    echo "REACT_APP_DEMO_MODE=true" > frontend/.env
    
    # Backend
    echo "DEMO_MODE=true" >> backend/.env
    
  4. Start the development servers:

    npm run dev
    
  5. Open http://localhost:3000 in your browser.


Full Installation

Prerequisites

  • Node.js >= 18.0.0
  • npm >= 8.0.0
  • Python 3.9+ (for classifier training and local vLLM)
  • 32GB RAM (for local model inference)

Backend Setup

  1. Navigate to the backend directory:

    cd backend
    
  2. Install Node.js dependencies:

    npm install
    
  3. Install Python dependencies (for classifier training):

    pip install -r services/classifier/requirements.txt
    
  4. Configure environment variables:

    cp .env.example .env
    

    Edit .env:

    NODE_ENV=development
    PORT=3001
    DEMO_MODE=false
    
    # Optional: OpenAI API key for conversation generation
    # OPENAI_API_KEY=sk-your-key-here
    
    # Optional: SendGrid for email notifications (production only)
    # SENDGRID_API_KEY=your-sendgrid-key
    
  5. Start the server:

    npm run dev
    

Frontend Setup

  1. Navigate to the frontend directory:

    cd frontend
    
  2. Install dependencies:

    npm install
    
  3. Configure environment variables:

    echo "REACT_APP_DEMO_MODE=false" > .env
    
  4. Start the development server:

    npm start
    

Conversation Generation Options

Option 1: OpenAI API

Use the OpenAI API for generating synthetic conversations during the alignment phase.

  1. Obtain an API key from OpenAI Platform
  2. Enter the key in the study creation form when prompted

Option 2: Local vLLM (Privacy-Preserving)

Run conversation generation entirely locally using open-source models.

Required Models

  • User Simulator: microsoft/UserLM-8b - Trained on WildChat for realistic user behavior
  • Assistant Simulator: Qwen/Qwen3-8B - Natural conversational responses

Setup

  1. Navigate to the vLLM directory:

    cd vllm_test
    
  2. Install Python dependencies:

    pip install -r requirements.txt
    
  3. Download the models (the first run downloads them automatically):

    python test_mps_setup.py
    
  4. The backend will automatically detect available models when you select "Local vLLM" in the study creation form.

Hardware Requirements

Backend       Device         Memory       Speed
------------  -------------  -----------  ------
vLLM          CUDA GPU       ~32GB VRAM   Fast
Transformers  Apple Silicon  ~32GB RAM    Medium
Transformers  CUDA GPU       ~32GB VRAM   Medium
OpenAI API    N/A            Minimal      Fast

Classifier Training Service

ConvoScope includes a Python-based classifier training service that:

  1. Generates synthetic training data using few-shot learning from alignment samples
  2. Trains a binary classifier (relevant vs not relevant) using sentence-transformers
  3. Exports to ONNX for client-side browser inference
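
The classification head in step 2 can be sketched as a logistic layer over precomputed sentence embeddings. The real service embeds text with MiniLM-L6-v2 via sentence-transformers; the two-dimensional toy vectors and the `train_head`/`predict` helpers below are illustrative stand-ins, not the actual implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(embeddings, labels, dim, epochs=200, lr=0.5):
    """Fit a logistic classification head on precomputed embeddings via SGD."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the binary cross-entropy loss
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability that an embedded conversation is relevant."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy "embeddings": relevant samples cluster near (1, 1), irrelevant near (-1, -1).
X = [[1.0, 0.9], [0.8, 1.1], [-1.0, -0.9], [-1.2, -0.8]]
y = [1, 1, 0, 0]
w, b = train_head(X, y, dim=2)
print(predict(w, b, [1.0, 1.0]))    # high probability: relevant
print(predict(w, b, [-1.0, -1.0]))  # low probability: not relevant
```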

Architecture

backend/services/classifier/
├── config.py               # Configuration classes
├── synthetic_generator.py  # Few-shot data generation (threaded)
├── model.py                # Binary classifier (MiniLM-L6-v2 + classification head)
├── trainer.py              # Training loop with early stopping
├── export_onnx.py          # ONNX export with quantization
└── train_classifier.py     # Main CLI entry point

Training Pipeline

Accepted Samples (positive few-shot) ──┐
                                       ├─→ Synthetic Generator (8 threads)
Rejected Samples (negative few-shot) ──┘           │
                                                   ↓
                                         1000 synthetic samples
                                         (500 positive + 500 negative)
                                                   │
                                                   ↓
                                         Binary Classifier Training
                                         (5 epochs, early stopping)
                                                   │
                                                   ↓
                                         ONNX Export (quantized)
                                                   │
                                                   ↓
                                         Browser-ready model
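
The fan-in at the top of the diagram can be mimicked with a thread pool. In this sketch, `generate_sample` is a stub standing in for what is really an LLM call with few-shot prompts:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_sample(task):
    """Stub for one synthetic-sample request; the real generator prompts
    an LLM with accepted/rejected few-shot examples."""
    label, idx = task
    return (f"synthetic conversation {idx}", label)

def generate_dataset(n_per_class=500, workers=8):
    # 500 positive + 500 negative tasks, fanned out across 8 threads,
    # mirroring the pipeline diagram above.
    tasks = [(1, i) for i in range(n_per_class)] + [(0, i) for i in range(n_per_class)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate_sample, tasks))

data = generate_dataset()
print(len(data))                        # 1000
print(sum(label for _, label in data))  # 500 positives
```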

Demo Mode vs Real Mode

Mode                    Behavior
----------------------  --------------------------------------------------------------
Demo (DEMO_MODE=true)   Training completes instantly with mock metrics; the frontend uses the built-in category classifier.
Real (DEMO_MODE=false)  Full pipeline: generates 1000 synthetic samples, trains the classifier, and exports the ONNX model.

Manual Training

You can also run the training service directly:

cd backend/services/classifier
python train_classifier.py \
  --input training_data.json \
  --output ./output \
  --study-id my-study

Input JSON format:

{
  "study_id": "abc123",
  "prompt_spec": "Conversations about mental health...",
  "accepted_samples": [...],
  "rejected_samples": [...],
  "generation_method": "openai",
  "openai_api_key": "sk-...",
  "demo_mode": false
}
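
Assembling that file programmatically is straightforward. `build_training_input` below is a hypothetical helper that mirrors the fields shown above; only the key names come from the format, the helper itself is not part of the codebase:

```python
import json

REQUIRED_KEYS = {"study_id", "prompt_spec", "accepted_samples",
                 "rejected_samples", "generation_method", "demo_mode"}

def build_training_input(study_id, prompt_spec, accepted, rejected,
                         method="openai", api_key=None, demo=False):
    payload = {
        "study_id": study_id,
        "prompt_spec": prompt_spec,
        "accepted_samples": accepted,
        "rejected_samples": rejected,
        "generation_method": method,
        "demo_mode": demo,
    }
    if api_key:  # only needed when generation_method is "openai"
        payload["openai_api_key"] = api_key
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return payload

payload = build_training_input("abc123", "Conversations about mental health...",
                               ["sample A"], ["sample B"], demo=True)
print(json.dumps(payload, indent=2))
```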

Project Structure

convoscope/
├── backend/                    # Express.js API server
│   ├── routes/                 # API routes (studies, forms)
│   ├── services/
│   │   ├── classifier/         # Python classifier training service
│   │   │   ├── config.py
│   │   │   ├── synthetic_generator.py
│   │   │   ├── model.py
│   │   │   ├── trainer.py
│   │   │   ├── export_onnx.py
│   │   │   └── train_classifier.py
│   │   ├── classifierService.js  # Node.js wrapper
│   │   ├── modelTrainer.js       # Training orchestration
│   │   ├── storageService.js     # Data persistence
│   │   └── syntheticGenerator.js # Alignment sample generation
│   └── server.js               # Main server entry
├── frontend/                   # React application
│   ├── src/
│   │   ├── components/
│   │   │   ├── Form/           # Data donation form (participant)
│   │   │   └── ResearcherDashboard/  # Study management
│   │   └── config.js           # App configuration
│   └── public/                 # Static assets
├── vllm_test/                  # Local LLM generation scripts
├── train_classifier/           # Legacy classifier utilities
├── paper/                      # CHI 2026 paper source
├── README.md
├── LICENSE
└── .gitignore

Usage Guide

For Researchers

  1. Create a Study: Navigate to the Researcher Dashboard and define your study parameters:

    • Study name and description
    • Target conversation topics (prompt specification)
    • Generation method (OpenAI API or Local vLLM)
    • Conversation filters (turn count, message length, etc.)
  2. Alignment Phase: Review generated sample conversations:

    • Accept conversations that match your research criteria
    • Reject conversations that don't match
    • At least 5 accepted samples are required to proceed
  3. Training: The system trains a custom binary classifier:

    • Generates 1000 synthetic samples using your accepted/rejected examples
    • Trains a sentence-transformer classifier
    • Exports to ONNX for browser inference
  4. Publish: Receive a shareable link for participants.

  5. Collect Data: Participants upload their conversations, which are filtered and anonymized client-side before submission.

For Participants

  1. Export Conversations: Download your ChatGPT conversations from chat.openai.com → Settings → Data Controls → Export.

  2. Upload: Visit the study link and upload your exported ZIP file.

  3. Review: The system filters relevant conversations and highlights detected PII for your review.

  4. Anonymize: Add custom anonymization terms and review auto-detected sensitive information.

  5. Submit: Only reviewed, anonymized content is sent to the researcher.


API Endpoints

Health Check

GET /api/health

Studies (Researcher)

GET    /api/studies              # List all studies
POST   /api/studies              # Create new study
GET    /api/studies/:id          # Get study details
PUT    /api/studies/:id          # Update study
DELETE /api/studies/:id          # Delete study
POST   /api/studies/:id/generate # Generate alignment samples
POST   /api/studies/:id/accept   # Accept a sample
POST   /api/studies/:id/reject   # Reject a sample
POST   /api/studies/:id/start-training  # Start classifier training
GET    /api/studies/:id/training-status # Get training status
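
A minimal client only needs to compose these paths against the backend base URL (http://localhost:3001/api by default). `study_url` below is a hypothetical helper shown without the actual HTTP calls:

```python
# Default backend base URL; override via REACT_APP_API_URL / PORT in practice.
BASE = "http://localhost:3001/api"

def study_url(study_id=None, action=None, base=BASE):
    """Build a /api/studies endpoint URL from the route table above."""
    parts = [base, "studies"]
    if study_id is not None:
        parts.append(study_id)
    if action is not None:
        parts.append(action)
    return "/".join(parts)

print(study_url())                            # list or create studies
print(study_url("abc123"))                    # study details
print(study_url("abc123", "start-training"))  # kick off classifier training
```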

Forms (Participant)

GET    /api/form/:id          # Get form configuration
POST   /api/form/:id/submit   # Submit anonymized data

System

GET    /api/vllm/status       # Check local model availability

Environment Variables

Backend

Variable          Description                           Default
----------------  ------------------------------------  -----------
NODE_ENV          Environment mode                      development
PORT              Server port                           3001
DEMO_MODE         Enable demo mode                      false
OPENAI_API_KEY    OpenAI API key (optional)             -
SENDGRID_API_KEY  SendGrid key for emails (production)  -

Frontend

Variable             Description       Default
-------------------  ----------------  -------------------------
REACT_APP_DEMO_MODE  Enable demo mode  false
REACT_APP_API_URL    Backend API URL   http://localhost:3001/api

Deployment

Heroku

  1. Create a Heroku app:

    heroku create your-app-name
    
  2. Set environment variables:

    heroku config:set NODE_ENV=production
    heroku config:set SENDGRID_API_KEY=your-key
    
  3. Add Python buildpack (for classifier training):

    heroku buildpacks:add --index 1 heroku/python
    heroku buildpacks:add --index 2 heroku/nodejs
    
  4. Deploy:

    git push heroku main
    

Privacy Architecture

ConvoScope implements a client-side privacy architecture:

  1. Conversation Parsing: ZIP file extraction and JSON parsing happen in the browser.
  2. Classification: ONNX models run via Transformers.js to filter relevant conversations.
  3. PII Detection: Named entity recognition identifies sensitive information client-side.
  4. User Review: Participants review and approve all data before submission.
  5. Anonymization: PII is replaced with placeholders before leaving the browser.

No raw conversation data ever reaches the server.
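
As an illustration of steps 3–5, here is a regex-based sketch of placeholder substitution. The production pipeline detects PII with a NER model via Transformers.js in the browser; the patterns and the `anonymize` helper below are simplified assumptions, not the shipped code:

```python
import re

# Illustrative patterns only; real detection uses named entity recognition.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def anonymize(text, custom_terms=()):
    """Replace detected PII and user-supplied terms with placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for term in custom_terms:  # custom anonymization terms (step 4 of the participant flow)
        text = text.replace(term, "[REDACTED]")
    return text

msg = "Email me at jane@example.com or call +1 555 123 4567, says Jane."
print(anonymize(msg, custom_terms=["Jane"]))
```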


Citation

If you use ConvoScope in your research, please cite:

@inproceedings{convoscope2026,
  title={ConvoScope: A Privacy-Preserving Platform for LLM Conversation Research},
  author={[Authors]},
  booktitle={CHI '26 Extended Abstracts},
  year={2026},
  publisher={ACM}
}

License

MIT License - see LICENSE for details.


Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Support

  • Issues and support: please open an issue in the repository's issue tracker.