# Test Data Documentation

## 📊 Data Overview

This document describes the complete test data set for the tax risk control system: **6,850 records** in total, covering every data model the system requires.

### Data Generation Details
- **Generated at**: 2025-11-28T00:11:04
- **Generated by**: the `scripts/generate_test_data.py` script (automated)
- **Data format**: JSON
- **Character encoding**: UTF-8

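Because the files are UTF-8 encoded JSON, they can be loaded with Python's standard library alone. The following is a minimal sketch for comparing record counts against the figures in this document. It assumes each file holds a top-level JSON array; `count_records` and `data_dir` are illustrative names, not part of the project.

```python
import json
from pathlib import Path


def count_records(data_dir, filenames):
    """Return {filename: record count} for JSON files holding a top-level array."""
    counts = {}
    for name in filenames:
        # The data set is UTF-8 encoded, so pass the encoding explicitly.
        with open(Path(data_dir) / name, "r", encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{name}: expected a top-level JSON array")
        counts[name] = len(records)
    return counts
```

For example, `count_records(".", ["streamers.json", "recharges.json"])` would be expected to report 50 and 1,000 records if the files match the descriptions in this document.
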
## 📁 Data Files

### 1. Streamer Data (streamers.json)
- **Records**: 50
- **File size**: 11 KB
- **Purpose**: streamer profiles and streamer accounts

**Data structure**:
```json
{
  "streamer_id": "STREAMER_0001",
  "streamer_name": "主播_1",
  "tax_no": "91110000123456789A",
  "platform": "抖音",
  "tier": "S",
  "status": "active",
  "created_at": "2024-01-15 10:30:00"
}
```

### 2. Platform Recharge Records (recharges.json)
- **Records**: 1,000
- **File size**: 305 KB
- **Purpose**: revenue integrity detection

**Design notes**:
- Simulates realistic recharge scenarios
- Includes multiple payment methods
- Recharge amounts and times are randomly generated

**Data structure**:
```json
{
  "recharge_id": "RECHARGE_000001",
  "streamer_id": "STREAMER_0001",
  "recharge_date": "2024-01-15",
  "recharge_amount": 50000.0,
  "payment_method": "支付宝",
  "payment_status": "completed",
  "platform": "抖音"
}
```

### 3. Tax Declaration Data (tax_declarations.json)
- **Records**: 500
- **File size**: 137 KB
- **Purpose**: revenue integrity detection

**Design notes**:
- **Omitted (10% probability)**: the declared amount is 0
- **Under-reported (20% probability)**: the declared amount is far below the actual recharges
- **Normal (70% probability)**: the declared amount matches the actual amount

**Data structure**:
```json
{
  "declaration_id": "TAX_000001",
  "tax_no": "91110000123456789A",
  "declaration_date": "2024-01-15",
  "declared_amount": 50000.0,
  "tax_rate": 0.13,
  "tax_amount": 6500.0,
  "declaration_period": "2024-01",
  "status": "submitted"
}
```

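Two quick sanity checks on this file can be sketched as follows. The first measures the observed share of omitted declarations (`declared_amount` equal to 0), which should land near the 10% design target. The second assumes that `tax_amount` equals `declared_amount × tax_rate`, which the sample record satisfies but which is not stated explicitly here. Both function names are illustrative.

```python
def omission_rate(declarations):
    """Share of declarations whose declared_amount is 0 (omitted filings)."""
    if not declarations:
        return 0.0
    omitted = sum(1 for d in declarations if d["declared_amount"] == 0)
    return omitted / len(declarations)


def tax_amount_is_consistent(declaration, tolerance=0.01):
    """Check tax_amount against declared_amount * tax_rate within a tolerance.

    The relationship itself is an assumption inferred from the sample record.
    """
    expected = declaration["declared_amount"] * declaration["tax_rate"]
    return abs(declaration["tax_amount"] - expected) <= tolerance
```

Since the omission case is drawn with a probability, the observed rate over 500 records will generally be close to, but not exactly, 10%.
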
### 4. Bank Transaction Records (bank_transactions.json)
- **Records**: 2,000
- **File size**: 745 KB
- **Purpose**: private-account receipt detection

**Design notes**:
- **30% personal-account transfers**: counterparties are individuals such as 张三, 李四, and 王五
- **70% corporate-account transfers**: counterparties are company accounts
- **Personal transfers carry higher amounts**: this simulates the risk scenario of collecting revenue through private accounts

**Data structure**:
```json
{
  "transaction_id": "TXN_000001",
  "account_no": "6222xxxxxxxxxxxx",
  "transaction_date": "2024-01-15",
  "transaction_type": "转入",
  "amount": 80000.0,
  "counterparty_name": "张三",
  "counterparty_account": "6222yyyyyyyyyyyy",
  "counterparty_bank": "中国银行",
  "description": "转账",
  "balance": 150000.0
}
```

### 5. Invoice Data (invoices.json)
- **Records**: 800 invoices
- **File size**: 355 KB
- **Purpose**: falsely issued invoice detection

**Design notes**:
- **10% falsely issued invoices**: no corresponding order record, with large amounts (50,000 to 200,000)
- **90% normal invoices**: a corresponding order exists, with moderate amounts (5,000 to 50,000)
- **Random tax rate**: one of 6%, 9%, or 13%
- **Multiple invoice types**: special invoices and general invoices

**Data structure**:
```json
{
  "invoice_id": "INV_000001",
  "seller_tax_no": "91110000123456789A",
  "seller_name": "销售方企业",
  "buyer_tax_no": "91110000987654321B",
  "buyer_name": "购买方企业",
  "invoice_date": "2024-01-15",
  "total_amount": 50000.0,
  "tax_amount": 6500.0,
  "tax_rate": 0.13,
  "invoice_type": "special",
  "invoice_status": "valid",
  "business_type": "服务",
  "order_id": "ORDER_000001"
}
```

### 6. E-commerce Order Data (orders.json)
- **Records**: 1,500
- **File size**: 455 KB
- **Purpose**: falsely issued invoice detection

**Data structure**:
```json
{
  "order_id": "ORDER_000001",
  "seller_tax_no": "91110000123456789A",
  "buyer_tax_no": "91110000987654321B",
  "order_date": "2024-01-15",
  "total_amount": 50000.0,
  "payment_status": "paid",
  "fulfillment_status": "completed",
  "settlement_id": "SETTLE_000001"
}
```

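Section 5 characterizes falsely issued invoices as those with no corresponding order record. A minimal sketch of that cross-check follows, assuming `invoices.json` and `orders.json` have been loaded as lists of dicts. How a missing match is encoded is not specified here, so the sketch treats both a null or absent `order_id` and an `order_id` that does not appear in the orders as unmatched. `invoices_without_order` is an illustrative name, not the project's detection rule.

```python
def invoices_without_order(invoices, orders):
    """Return invoices whose order_id is missing or not present in orders."""
    known_order_ids = {order["order_id"] for order in orders}
    return [
        invoice
        for invoice in invoices
        # .get() covers both an absent key and an explicit null order_id.
        if invoice.get("order_id") not in known_order_ids
    ]
```

Building the set of known order IDs once keeps the check linear in the number of invoices and orders rather than scanning all 1,500 orders for each invoice.
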
### 7. Cost and Expense Vouchers (expenses.json)
- **Records**: 600
- **File size**: 305 KB
- **Purpose**: cost and expense anomaly detection

**Design notes**:
- **15% anomalous expenses**: concentrated amounts, large amounts, or cross-border payments
- **85% normal expenses**: ordinary amounts with clear categories
- **Multiple expense categories**: office (办公费), travel (差旅费), entertainment (招待费), and others
- **Large-amount flag**: expenses over 50,000 are flagged automatically

**Data structure**:
```json
{
  "expense_id": "EXP_000001",
  "voucher_no": "VOU2024000001",
  "expense_type": "费用",
  "expense_category": "办公费",
  "payer_name": "付款方企业",
  "payee_name": "张三",
  "expense_date": "2024-01-15",
  "expense_amount": 80000.0,
  "tax_amount": 10400.0,
  "tax_rate": 0.13,
  "payment_method": "银行转账",
  "is_large_amount": true,
  "is_cross_border": true,
  "fiscal_year": 2024,
  "fiscal_period": 1,
  "payment_status": "已支付"
}
```

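The large-amount rule can be checked against the generated records with a short sketch like the one below. It assumes "over 50,000" means strictly greater than 50,000 and that the flag is stored in `is_large_amount` as in the sample record. `mismatched_large_flags` is an illustrative helper, not part of the generator.

```python
LARGE_AMOUNT_THRESHOLD = 50_000  # "over 50,000"; strict comparison is an assumption


def mismatched_large_flags(expenses, threshold=LARGE_AMOUNT_THRESHOLD):
    """Return expense_ids whose is_large_amount flag disagrees with the amount."""
    return [
        e["expense_id"]
        for e in expenses
        if (e["expense_amount"] > threshold) != bool(e["is_large_amount"])
    ]
```

An empty result would indicate that every record's flag agrees with its amount under this reading of the rule.
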
### 8. Commission Settlement Statements (settlements.json)
- **Records**: 400
- **File size**: 107 KB
- **Purpose**: falsely issued invoice detection

**Data structure**:
```json
{
  "settlement_id": "SETTLE_000001",
  "order_id": "ORDER_000001",
  "actual_amount": 50000.0,
  "settlement_date": "2024-01-15",
  "settlement_status": "completed",
  "commission_rate": 0.15,
  "platform_commission": 7500.0
}
```

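In the sample settlement, `platform_commission` equals `actual_amount × commission_rate` (50,000 × 0.15 = 7,500). Assuming that relationship holds for every record, which is inferred from the sample rather than stated, a consistency check might look like this; the function names are illustrative.

```python
def expected_commission(settlement):
    """Commission implied by actual_amount and commission_rate, rounded to cents."""
    return round(settlement["actual_amount"] * settlement["commission_rate"], 2)


def commission_is_consistent(settlement, tolerance=0.01):
    """True when the stored platform_commission matches the implied value."""
    difference = settlement["platform_commission"] - expected_commission(settlement)
    return abs(difference) <= tolerance
```
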
## 🎯 Risk Detection Scenarios

### Scenario 1: Revenue Integrity Detection
**Test data**:
- Streamer: STREAMER_0001
- Total recharges: ¥500,000
- Total declared: ¥300,000 (deliberately under-reported)
- Risk level: HIGH

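The size of the gap in this scenario follows directly from the two totals. The sketch below works through that arithmetic; it does not reproduce the system's scoring, since the thresholds that map a gap to a risk level are not given in this document. `underreporting_gap` is an illustrative name.

```python
def underreporting_gap(recharge_total, declared_total):
    """Return (undeclared amount, undeclared share of recharge_total)."""
    if recharge_total <= 0:
        raise ValueError("recharge_total must be positive")
    gap = recharge_total - declared_total
    return gap, gap / recharge_total


gap, share = underreporting_gap(500_000, 300_000)
print(f"undeclared: ¥{gap:,} ({share:.0%})")  # undeclared: ¥200,000 (40%)
```
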
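### Scenario 2: Private-Account Receipt Detection
**Test data**:
- Personal transfers: 40 transactions totaling ¥3,200,000
- Corporate transfers: 60 transactions totaling ¥2,000,000
- Private-account share (by amount): 61.5%
- Risk level: CRITICAL
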
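The 61.5% figure is a share by amount (3,200,000 out of 5,200,000), not by transaction count, which would be 40 out of 100, or 40%. A short worked sketch of the calculation, with `private_share` as an illustrative name:

```python
def private_share(personal_total, corporate_total):
    """Share of the total transferred amount that went through personal accounts."""
    total = personal_total + corporate_total
    if total <= 0:
        raise ValueError("total transferred amount must be positive")
    return personal_total / total


share = private_share(3_200_000, 2_000_000)
print(f"{share:.1%}")  # 61.5%
```
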
### Scenario 3: Falsely Issued Invoice Detection
**Test data**:
- Falsely issued invoices: 80 (no matching order)
- Normal invoices: 720
- Falsely issued share: 10%
- Risk level: HIGH

### Scenario 4: Expense Anomaly Detection
**Test data**:
- Anomalous expenses: 90 (concentrated, large, or cross-border)
- Normal expenses: 510
- Anomalous share: 15%
- Risk level: MEDIUM

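### Scenario 5: Comprehensive Tax Risk Assessment
**Combined detection**:
- Revenue integrity: HIGH (82.5 points)
- Private-account receipts: CRITICAL (95 points)
- Falsely issued invoices: HIGH (78 points)
- Expense anomalies: MEDIUM (65 points)
- **Overall score**: 82.5 points
- **Risk level**: HIGH
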
## 📖 Usage

### Method 1: Inspect the Data Manually
```bash
# View the summary
jq . summary.json

# View the first five streamer records
jq '.[0:5]' streamers.json

# View up to 10 omitted tax declarations (declared amount of 0)
jq '[.[] | select(.declared_amount == 0)] | .[0:10]' tax_declarations.json
```

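### Method 2: Import into a Database
```python
# Import the data with SQLAlchemy
import json

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///risk_detection.db")

# Load the streamer records (the files are UTF-8 encoded)
with open("streamers.json", "r", encoding="utf-8") as f:
    streamers = json.load(f)

# Assumes a `streamers` table whose columns match the JSON keys already exists
insert_stmt = text(
    "INSERT INTO streamers "
    "(streamer_id, streamer_name, tax_no, platform, tier, status, created_at) "
    "VALUES (:streamer_id, :streamer_name, :tax_no, :platform, :tier, :status, :created_at)"
)

# engine.begin() opens a transaction and commits it if no error occurs
with engine.begin() as connection:
    connection.execute(insert_stmt, streamers)
```
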
### Method 3: Risk Detection Tests
```bash
# Run the revenue integrity check
curl -X POST http://localhost:8000/api/v1/detect \
  -H "Content-Type: application/json" \
  -d '{
    "entity_id": "STREAMER_0001",
    "entity_type": "streamer",
    "period": "2024-01",
    "rule_ids": ["REVENUE_INTEGRITY_CHECK"]
  }'

# Run the private-account receipt check
curl -X POST http://localhost:8000/api/v1/detect \
  -H "Content-Type: application/json" \
  -d '{
    "entity_id": "6222xxxxxxxxxxxx",
    "entity_type": "bank_account",
    "period": "2024-01",
    "rule_ids": ["PRIVATE_ACCOUNT_DETECTION"]
  }'
```

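For scripted test runs, the same requests can be built in Python with the standard library. This is a sketch under assumptions: it reuses the endpoint and payload fields from the curl examples, and it assumes the service answers with a JSON body, which this document does not specify. `build_detect_request` and `send` are illustrative names.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/detect"  # endpoint from the curl examples


def build_detect_request(entity_id, entity_type, period, rule_ids, url=API_URL):
    """Build a POST request carrying the same JSON payload as the curl examples."""
    payload = {
        "entity_id": entity_id,
        "entity_type": entity_type,
        "period": period,
        "rule_ids": list(rule_ids),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the response body, assumed to be JSON."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the detection service running locally, `send(build_detect_request("STREAMER_0001", "streamer", "2024-01", ["REVENUE_INTEGRITY_CHECK"]))` would issue the first request above.
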
## 🔄 Regenerating the Data

To regenerate the test data, run the following commands:

```bash
# Adjust the path to match your local checkout
cd /Users/liulujian/Documents/code/deeprisk-claude-1/backend
python scripts/generate_test_data.py
```

**Adjustable parameters**:
- Streamers: `generate_streamers(count=50)`
- Recharge records: `generate_recharges(streamers, count=1000)`
- Tax declarations: `generate_tax_declarations(streamers, count=500)`
- Bank transactions: `generate_bank_transactions(count=2000)`
- Invoices: `generate_invoices(count=800)`
- Orders: `generate_orders(count=1500)`
- Expense vouchers: `generate_expenses(count=600)`
- Settlement records: `generate_settlements(count=400)`

## 📊 Data Statistics

| Data type | Records | Share | Primary use |
|----------|--------|------|----------|
| Bank transactions | 2,000 | 29.2% | Private-account receipt detection |
| Orders | 1,500 | 21.9% | Falsely issued invoice detection |
| Recharge records | 1,000 | 14.6% | Revenue integrity detection |
| Invoices | 800 | 11.7% | Falsely issued invoice detection |
| Expense vouchers | 600 | 8.8% | Expense anomaly detection |
| Tax declarations | 500 | 7.3% | Revenue integrity detection |
| Settlements | 400 | 5.8% | Falsely issued invoice detection |
| Streamers | 50 | 0.7% | Base entity data |
| **Total** | **6,850** | **100%** | - |

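The shares in the table follow from the record counts. If the counts are changed when regenerating the data, the percentages can be recomputed with a sketch like this one; the dictionary keys are illustrative labels taken from the file names.

```python
RECORD_COUNTS = {
    "bank_transactions": 2000,
    "orders": 1500,
    "recharges": 1000,
    "invoices": 800,
    "expenses": 600,
    "tax_declarations": 500,
    "settlements": 400,
    "streamers": 50,
}


def record_shares(counts):
    """Return (total, {name: percentage of total rounded to one decimal})."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = record_shares(RECORD_COUNTS)
print(total, shares["bank_transactions"])  # 6850 29.2
```
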
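## ⚠️ Notes

1. **This data is for testing only**; do not use it in a production environment.
2. **All amounts, names, account numbers, and similar values are simulated**; do not map them to real information.
3. **The omitted and under-reported tax declarations are intentional**; they exist to test the detection algorithms.
4. **All dates fall in 2024**; adjust the time range as needed.
5. **Regenerate the data periodically** to avoid relying on the same test data for a long time.
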
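## 🎓 Learning Guide

### For Developers
1. **Review the data formats first**: understand the structure of each data type
2. **Analyze the risk scenarios**: understand the intent behind each designed anomaly
3. **Test the algorithms**: use the data for functional testing of the algorithms
4. **Refine the algorithms**: improve the detection logic based on test results

### For Testers
1. **Comprehensive testing**: test each algorithm with different entity IDs
2. **Boundary testing**: test the extreme cases
3. **Combined testing**: run multiple algorithms at the same time
4. **Result verification**: verify the accuracy of the detection results
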
## 📞 Technical Support

For questions or suggestions, contact the development team.

---

- **Last updated**: 2025-11-28 00:11
- **Data version**: v1.0
- **Status**: ✅ Available