Code Authorship Attribution via Stylometry Analysis
A comprehensive tool for identifying code authorship using stylometric analysis techniques. This project implements multiple machine learning approaches for source code authorship attribution, including N-gram analysis and GraphCodeBERT-based methods.
Overview
This project provides both Jupyter notebook-based analysis and a user-friendly GUI application for performing code authorship attribution. It can analyze source code samples and predict the author based on coding style patterns (stylometry).
Key Features
- Multiple Attribution Methods: N-gram character analysis and GraphCodeBERT (GC-BERT) deep learning approach
- Multi-Language Support: Supports C#, Python, Java, C++, Go, JavaScript, PHP, and Ruby
- Web-based GUI: Streamlit-powered interface for easy use
- Docker Support: Containerized deployment for consistent environments
- Comprehensive Analysis: Built-in evaluation metrics and visualization tools
- Dataset Generation: Tools for building training datasets from GitHub repositories
Methods
1. N-Gram Character Analysis
- Uses character-level n-grams (default: 7-gram) to capture coding style patterns
- Supports all major programming languages
- Fast training and prediction
- Best for single-language datasets (a minimal sketch follows below)
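A minimal sketch of this approach with scikit-learn (the classifier choice here is an assumption; the project's actual pipeline is defined in the notebooks):

```python
# Minimal sketch of character n-gram authorship attribution.
# Logistic regression is an assumed classifier for illustration;
# the notebooks define the project's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_code = ["def f(x):\n    return x + 1\n", "int add(int a,int b){return a+b;}\n"]
train_authors = ["alice", "bob"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(7, 7), lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(train_code, train_authors)
print(model.predict(["def g(y):\n    return y * 2\n"]))  # predicted author label
```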
2. GraphCodeBERT (GC-BERT)
- Deep learning approach using pre-trained transformer model
- Leverages code structure and semantic understanding
- Pre-trained on Python, Java, JavaScript, PHP, Ruby, and Go
- More robust for complex authorship patterns (see the embedding sketch below)
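As a rough sketch, the pre-trained checkpoint can be loaded with Hugging Face transformers to produce per-sample embeddings; how this project trains a classifier on top is defined in the notebooks and may differ:

```python
# Sketch: extract a fixed-size embedding per code sample from the
# pre-trained GraphCodeBERT checkpoint. The classification head this
# project uses on top is defined in the notebooks, not shown here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # vector at the [CLS] position
print(embedding.shape)  # torch.Size([1, 768])
```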
Project Structure
```
├── code_authorship_attribution.ipynb                  # Main analysis notebook
├── code_authorship_attribution_gui_integration.ipynb  # GUI backend notebook
├── code_authorship_attribution_terse.ipynb            # Condensed version
├── code_authorship_attribution_annotated.ipynb        # Detailed annotated version
├── test_results_analysis.ipynb                        # Performance analysis
├── GUI integration/                                   # Web application
│   ├── app.py                                         # Streamlit GUI
│   ├── code_authorship_attribution_gui_integration.py # Backend functions
│   ├── requirements.txt                               # Python dependencies
│   ├── Dockerfile                                     # Docker configuration
│   ├── docker-compose.yml                             # Docker Compose setup
│   ├── train_samples.zip                              # Training data sample
│   └── analyze_samples.zip                            # Test data sample
├── subsets.csv                                        # Dataset configuration
├── model_test_results.xlsx                            # Model performance results
├── authver_test_results.xlsx                          # Authorship verification results
└── balance_test_results.xlsx                          # Dataset balance analysis
```
Quick Start
Using the GUI Application
Option 1: Local Installation
- Install Python 3.10+ and ensure pip is available
- Clone the repository:

  ```bash
  git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
  cd wsu-pique-stylometry
  ```

- Navigate to the GUI directory:

  ```bash
  cd "GUI integration"
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  streamlit run app.py
  ```

- Access the GUI: Open your browser to http://localhost:8501
Option 2: Docker Deployment
- Clone the repository:

  ```bash
  git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
  cd wsu-pique-stylometry/"GUI integration"
  ```

- Run with Docker Compose:

  ```bash
  docker-compose up --build
  ```

- Access the GUI: Open your browser to http://localhost:3003
Option 3: Docker Manual Build
cd "GUI integration"
docker build -t stylometry-app .
docker run -p 3003:3003 stylometry-app
Using the GUI
- Upload Training Data: Upload a `train_samples.zip` file containing folders named after each author with their code samples
- Upload Test Data: Upload an `analyze_samples.zip` file with code samples to analyze
- Select Method: Choose between N-Gram or GC-BERT analysis
- Select Language (N-Gram only): Choose the programming language of your samples
- Optional Evaluation: Check the evaluation box if your test files follow the naming convention `<author>_<filename>.<ext>` (see the sketch after this list)
- Run Attribution: Click the button to start analysis
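For reference, the naming convention maps a file name to its ground-truth author like this (hypothetical helper, not the GUI's actual parsing code):

```python
# Hypothetical illustration of the <author>_<filename>.<ext> convention;
# the GUI's actual parsing logic may differ.
from pathlib import Path

def expected_author(path: str) -> str:
    """Extract the ground-truth author: 'author1_test.py' -> 'author1'."""
    stem = Path(path).stem          # 'author1_test'
    return stem.split("_", 1)[0]    # text before the first underscore

print(expected_author("author1_test.py"))  # author1
```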
Data Format Requirements
Training Data (train_samples.zip)
```
train_samples/
├── author1/
│   ├── file1.py
│   ├── file2.py
│   └── ...
├── author2/
│   ├── file1.py
│   ├── file2.py
│   └── ...
└── ...
```
Test Data (analyze_samples.zip)
```
analyze_samples/
├── unknown_file1.py
├── unknown_file2.py
├── author1_test.py    # For evaluation (optional)
└── ...
```
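To package data in this layout, the standard library is enough; a minimal sketch, assuming folders named `train_samples/` and `analyze_samples/` already exist locally:

```python
# Create train_samples.zip and analyze_samples.zip from local folders
# laid out as shown above (the zips keep the top-level folder).
import shutil

shutil.make_archive("train_samples", "zip", root_dir=".", base_dir="train_samples")
shutil.make_archive("analyze_samples", "zip", root_dir=".", base_dir="analyze_samples")
```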
Using Jupyter Notebooks
Main Analysis Notebook
Open `code_authorship_attribution.ipynb` for complete dataset generation and analysis:
- Part 1: Build dataset from GitHub repositories
- Part 2: Preprocess and prepare data
- Part 3: Feature extraction and vectorization
- Part 4: Model training and evaluation
- Part 5: Hyperparameter optimization
Quick Analysis
Use `code_authorship_attribution_terse.ipynb` for streamlined analysis with existing datasets.
Configuration
N-Gram Parameters (in notebooks)
```python
vec_kwargs = {
    'input': 'filename',
    'strip_accents': 'ascii',
    'lowercase': False,
    'ngram_range': (7, 7),   # Character n-gram size
    'analyzer': 'char',
    'max_features': 5000     # Maximum number of features
}
```
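These keyword arguments follow scikit-learn's text vectorizer API. As a rough illustration (whether the notebooks use `TfidfVectorizer` or `CountVectorizer` is not shown here, so treat the class choice as an assumption):

```python
# Sketch: pass vec_kwargs (defined in the block above) to a scikit-learn
# vectorizer. With input='filename', fit_transform expects file paths.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(**vec_kwargs)
X = vectorizer.fit_transform(['data/author1/file1.py', 'data/author2/file1.py'])
print(X.shape)  # (n_samples, n_features)
```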
Dataset Generation Parameters
```python
part1_kwargs = {
    'terms': 'machine learning',
    'params': {'language': 'python', 'size': '100000..250000'},
    'sorting': {'sort': 'stars', 'order': None},
    'num_authors': 10,
    'max_results': 100,
    'min_files': 15,
    'max_files': 50,
    'min_lines': 10,
    'data_directory': './data',
    'temp_directory': './temp'
}
```
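These parameters describe a GitHub repository search. A rough equivalent against the public REST API (the notebook's actual client code may differ; this only shows how the parameters map to a query):

```python
# Rough equivalent of the search that part1_kwargs describes, using
# GitHub's public repository search API directly. Unauthenticated
# requests are rate-limited; this is illustration only.
import requests

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={
        "q": "machine learning language:python size:100000..250000",
        "sort": "stars",
        "per_page": 100,  # mirrors max_results
    },
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for repo in resp.json().get("items", [])[:5]:
    print(repo["full_name"], repo["stargazers_count"])
```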
Performance
The project includes comprehensive performance analysis in `test_results_analysis.ipynb`. Key findings:
- N-Gram Method: Fast execution, good accuracy for single-language datasets
- GC-BERT Method: Higher accuracy for complex patterns, slower execution
- Language Impact: Performance varies by programming language
- Dataset Size: Larger, balanced datasets improve accuracy
Research Applications
This tool has been used for:
- Academic research in software engineering
- Code plagiarism detection
- Authorship verification in collaborative projects
- Stylometric analysis of programming patterns
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
Requirements
Core Dependencies
- Python 3.10+
- scikit-learn
- pandas
- numpy
- tensorflow/keras
- torch
- transformers
- streamlit (for GUI)
Full Requirements
See `GUI integration/requirements.txt` for the complete dependency list.
Troubleshooting
Common Issues
- "No valid samples found": Ensure your code samples match the selected language
- Memory errors: Reduce `max_features` in the N-gram configuration or use smaller datasets
- Docker build fails: Ensure Docker has sufficient memory allocation (4GB+ recommended)
- GC-BERT performance issues: Use recommended languages (Python, Java, JavaScript, PHP, Ruby, Go)
File Format Requirements
- Code files must have appropriate extensions (.py, .java, .js, etc.)
- Zip files must contain the exact folder structure shown above
- Files should contain sufficient code (minimum 10 lines recommended; a quick check is sketched below)
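A quick way to flag under-sized files before zipping them (hypothetical helper, not part of the project):

```python
# Flag files that fall below the recommended 10-line minimum.
from pathlib import Path

for path in Path("train_samples").rglob("*.py"):
    n_lines = sum(1 for _ in path.open(encoding="utf-8", errors="ignore"))
    if n_lines < 10:
        print(f"too short ({n_lines} lines): {path}")
```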
License
This project is part of the WSU PIQUE research initiative. Please cite appropriately if used in academic work.
Support
For questions and support:
- Create an issue in the GitHub repository
- Check the annotated notebook (`code_authorship_attribution_annotated.ipynb`) for detailed explanations
- Review the test results analysis for performance insights
Note: This tool is designed for research and educational purposes. Ensure you have appropriate permissions when analyzing code repositories.