Code Authorship Attribution via Stylometry Analysis
A comprehensive tool for identifying code authorship using stylometric analysis techniques. This project implements multiple machine learning approaches for source code authorship attribution, including N-gram analysis and GraphCodeBERT-based methods.
Overview
This project provides both Jupyter notebook-based analysis and a user-friendly GUI application for performing code authorship attribution. It can analyze source code samples and predict the author based on coding style patterns (stylometry).
Key Features
- Multiple Attribution Methods: N-gram character analysis and GraphCodeBERT (GC-BERT) deep learning approach
- Multi-Language Support: Supports C#, Python, Java, C++, Go, JavaScript, PHP, and Ruby
- Web-based GUI: Streamlit-powered interface for easy use
- Docker Support: Containerized deployment for consistent environments
- Comprehensive Analysis: Built-in evaluation metrics and visualization tools
- Dataset Generation: Tools for building training datasets from GitHub repositories
Methods
1. N-Gram Character Analysis
- Uses character-level n-grams (default: 7-gram) to capture coding style patterns
- Supports all major programming languages
- Fast training and prediction
- Best for single-language datasets (a minimal sketch follows below)
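A minimal sketch of this approach with scikit-learn (the classifier choice here is an assumption; the project's actual pipeline is defined in the notebooks):

```python
# Minimal sketch of character n-gram authorship attribution.
# Logistic regression is an assumed classifier for illustration;
# the notebooks define the project's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_code = ["def f(x):\n    return x + 1\n", "int add(int a,int b){return a+b;}\n"]
train_authors = ["alice", "bob"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(7, 7), lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(train_code, train_authors)
print(model.predict(["def g(y):\n    return y * 2\n"]))  # predicted author label
```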
2. GraphCodeBERT (GC-BERT)
- Deep learning approach using pre-trained transformer model
- Leverages code structure and semantic understanding
- Pre-trained on Python, Java, JavaScript, PHP, Ruby, and Go
- More robust for complex authorship patterns (see the embedding sketch below)
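As a rough sketch, the pre-trained checkpoint can be loaded with Hugging Face transformers to produce per-sample embeddings; how this project trains a classifier on top is defined in the notebooks and may differ:

```python
# Sketch: extract a fixed-size embedding per code sample from the
# pre-trained GraphCodeBERT checkpoint. The classification head this
# project uses on top is defined in the notebooks, not shown here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # vector at the [CLS] position
print(embedding.shape)  # torch.Size([1, 768])
```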
Project Structure
```
├── code_authorship_attribution.ipynb                  # Main analysis notebook
├── code_authorship_attribution_gui_integration.ipynb  # GUI backend notebook
├── code_authorship_attribution_terse.ipynb            # Condensed version
├── code_authorship_attribution_annotated.ipynb        # Detailed annotated version
├── test_results_analysis.ipynb                        # Performance analysis
├── GUI integration/                                   # Web application
│   ├── app.py                                         # Streamlit GUI
│   ├── code_authorship_attribution_gui_integration.py # Backend functions
│   ├── requirements.txt                               # Python dependencies
│   ├── Dockerfile                                     # Docker configuration
│   ├── docker-compose.yml                             # Docker Compose setup
│   ├── train_samples.zip                              # Training data sample
│   └── analyze_samples.zip                            # Test data sample
├── subsets.csv                                        # Dataset configuration
├── model_test_results.xlsx                            # Model performance results
├── authver_test_results.xlsx                          # Authorship verification results
└── balance_test_results.xlsx                          # Dataset balance analysis
```
Quick Start
Using the GUI Application
Option 1: Local Installation
- Install Python 3.10+ and ensure pip is available
- Clone the repository:

  ```bash
  git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
  cd wsu-pique-stylometry
  ```

- Navigate to the GUI directory:

  ```bash
  cd "GUI integration"
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  streamlit run app.py
  ```

- Access the GUI: Open your browser to http://localhost:8501
Option 2: Docker Deployment
- Clone the repository:

  ```bash
  git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
  cd wsu-pique-stylometry/"GUI integration"
  ```

- Run with Docker Compose:

  ```bash
  docker-compose up --build
  ```

- Access the GUI: Open your browser to http://localhost:3003
Option 3: Docker Manual Build
cd "GUI integration"
docker build -t stylometry-app .
docker run -p 3003:3003 stylometry-app
Using the GUI
- Upload Training Data: Upload a `train_samples.zip` file containing folders named after each author with their code samples
- Upload Test Data: Upload an `analyze_samples.zip` file with code samples to analyze
- Select Method: Choose between N-Gram or GC-BERT analysis
- Select Language (N-Gram only): Choose the programming language of your samples
- Optional Evaluation: Check the evaluation box if your test files follow the naming convention `<author>_<filename>.<ext>` (see the sketch after this list)
- Run Attribution: Click the button to start analysis
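For reference, the naming convention maps a file name to its ground-truth author like this (hypothetical helper, not the GUI's actual parsing code):

```python
# Hypothetical illustration of the <author>_<filename>.<ext> convention;
# the GUI's actual parsing logic may differ.
from pathlib import Path

def expected_author(path: str) -> str:
    """Extract the ground-truth author: 'author1_test.py' -> 'author1'."""
    stem = Path(path).stem          # 'author1_test'
    return stem.split("_", 1)[0]    # text before the first underscore

print(expected_author("author1_test.py"))  # author1
```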
Data Format Requirements
Training Data (train_samples.zip)
```
train_samples/
├── author1/
│   ├── file1.py
│   ├── file2.py
│   └── ...
├── author2/
│   ├── file1.py
│   ├── file2.py
│   └── ...
└── ...
```
Test Data (analyze_samples.zip)
```
analyze_samples/
├── unknown_file1.py
├── unknown_file2.py
├── author1_test.py    # For evaluation (optional)
└── ...
```
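To package data in this layout, the standard library is enough; a minimal sketch, assuming folders named `train_samples/` and `analyze_samples/` already exist locally:

```python
# Create train_samples.zip and analyze_samples.zip from local folders
# laid out as shown above (the zips keep the top-level folder).
import shutil

shutil.make_archive("train_samples", "zip", root_dir=".", base_dir="train_samples")
shutil.make_archive("analyze_samples", "zip", root_dir=".", base_dir="analyze_samples")
```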
Using Jupyter Notebooks
Main Analysis Notebook
Open `code_authorship_attribution.ipynb` for complete dataset generation and analysis:
- Part 1: Build dataset from GitHub repositories
- Part 2: Preprocess and prepare data
- Part 3: Feature extraction and vectorization
- Part 4: Model training and evaluation
- Part 5: Hyperparameter optimization
Quick Analysis
Use `code_authorship_attribution_terse.ipynb` for streamlined analysis with existing datasets.
Configuration
N-Gram Parameters (in notebooks)
```python
vec_kwargs = {
    'input': 'filename',
    'strip_accents': 'ascii',
    'lowercase': False,
    'ngram_range': (7, 7),   # Character n-gram size
    'analyzer': 'char',
    'max_features': 5000     # Maximum number of features
}
```
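These keyword arguments follow scikit-learn's text vectorizer API. As a rough illustration (whether the notebooks use `TfidfVectorizer` or `CountVectorizer` is not shown here, so treat the class choice as an assumption):

```python
# Sketch: pass vec_kwargs (defined in the block above) to a scikit-learn
# vectorizer. With input='filename', fit_transform expects file paths.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(**vec_kwargs)
X = vectorizer.fit_transform(['data/author1/file1.py', 'data/author2/file1.py'])
print(X.shape)  # (n_samples, n_features)
```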
Dataset Generation Parameters
```python
part1_kwargs = {
    'terms': 'machine learning',
    'params': {'language': 'python', 'size': '100000..250000'},
    'sorting': {'sort': 'stars', 'order': None},
    'num_authors': 10,
    'max_results': 100,
    'min_files': 15,
    'max_files': 50,
    'min_lines': 10,
    'data_directory': './data',
    'temp_directory': './temp'
}
```
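These parameters describe a GitHub repository search. A rough equivalent against the public REST API (the notebook's actual client code may differ; this only shows how the parameters map to a query):

```python
# Rough equivalent of the search that part1_kwargs describes, using
# GitHub's public repository search API directly. Unauthenticated
# requests are rate-limited; this is illustration only.
import requests

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={
        "q": "machine learning language:python size:100000..250000",
        "sort": "stars",
        "per_page": 100,  # mirrors max_results
    },
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for repo in resp.json().get("items", [])[:5]:
    print(repo["full_name"], repo["stargazers_count"])
```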
Performance
The project includes comprehensive performance analysis in `test_results_analysis.ipynb`. Key findings:
- N-Gram Method: Fast execution, good accuracy for single-language datasets
- GC-BERT Method: Higher accuracy for complex patterns, slower execution
- Language Impact: Performance varies by programming language
- Dataset Size: Larger, balanced datasets improve accuracy
Research Applications
This tool has been used for:
- Academic research in software engineering
- Code plagiarism detection
- Authorship verification in collaborative projects
- Stylometric analysis of programming patterns
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
Requirements
Core Dependencies
- Python 3.10+
- scikit-learn
- pandas
- numpy
- tensorflow/keras
- torch
- transformers
- streamlit (for GUI)
Full Requirements
See `GUI integration/requirements.txt` for the complete dependency list.
Troubleshooting
Common Issues
- "No valid samples found": Ensure your code samples match the selected language
- Memory errors: Reduce `max_features` in the N-gram configuration or use smaller datasets
- Docker build fails: Ensure Docker has sufficient memory allocation (4GB+ recommended)
- GC-BERT performance issues: Use recommended languages (Python, Java, JavaScript, PHP, Ruby, Go)
File Format Requirements
- Code files must have appropriate extensions (.py, .java, .js, etc.)
- Zip files must contain the exact folder structure shown above
- Files should contain sufficient code (minimum 10 lines recommended; a quick check is sketched below)
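A quick way to flag under-sized files before zipping them (hypothetical helper, not part of the project):

```python
# Flag files that fall below the recommended 10-line minimum.
from pathlib import Path

for path in Path("train_samples").rglob("*.py"):
    n_lines = sum(1 for _ in path.open(encoding="utf-8", errors="ignore"))
    if n_lines < 10:
        print(f"too short ({n_lines} lines): {path}")
```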
License
This project is part of the WSU PIQUE research initiative. Please cite appropriately if used in academic work.
Support
For questions and support:
- Create an issue in the GitHub repository
- Check the annotated notebook (`code_authorship_attribution_annotated.ipynb`) for detailed explanations
- Review the test results analysis for performance insights
Note: This tool is designed for research and educational purposes. Ensure you have appropriate permissions when analyzing code repositories.