Code Authorship Attribution via Stylometry Analysis

A comprehensive tool for identifying code authorship using stylometric analysis techniques. This project implements multiple machine learning approaches for source code authorship attribution, including N-gram analysis and GraphCodeBERT-based methods.

Overview

This project provides both Jupyter notebook-based analysis and a user-friendly GUI application for performing code authorship attribution. It can analyze source code samples and predict the author based on coding style patterns (stylometry).

Key Features

  • Multiple Attribution Methods: N-gram character analysis and GraphCodeBERT (GC-BERT) deep learning approach
  • Multi-Language Support: Supports C#, Python, Java, C++, Go, JavaScript, PHP, and Ruby
  • Web-based GUI: Streamlit-powered interface for easy use
  • Docker Support: Containerized deployment for consistent environments
  • Comprehensive Analysis: Built-in evaluation metrics and visualization tools
  • Dataset Generation: Tools for building training datasets from GitHub repositories

Methods

1. N-Gram Character Analysis

  • Uses character-level n-grams (default: 7-gram) to capture coding style patterns
  • Supports all major programming languages
  • Fast training and prediction
  • Best for single-language datasets
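
As a rough illustration, this style of attribution can be sketched with scikit-learn's character n-gram vectorizer and an off-the-shelf classifier. The vectorizer settings mirror the configuration shown later in this README; the logistic-regression classifier and the training snippets are assumptions for illustration, not necessarily what the notebooks use:

# Illustrative sketch of character n-gram authorship attribution.
# Classifier choice and training snippets are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_code = [
    "def load(path):\n    with open(path) as f:\n        return f.read()\n",
    "def load(path):\n  f=open(path);data=f.read();f.close();return data\n",
]
train_authors = ["author1", "author2"]

model = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(7, 7),
                    lowercase=False, max_features=5000),
    LogisticRegression(max_iter=1000),
)
model.fit(train_code, train_authors)
print(model.predict(["def load(p):\n  d=open(p).read();return d\n"]))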

2. GraphCodeBERT (GC-BERT)

  • Deep learning approach using pre-trained transformer model
  • Leverages code structure and semantic understanding
  • Pre-trained on Python, Java, JavaScript, PHP, Ruby, and Go
  • More robust for complex authorship patterns
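
A minimal sketch of obtaining a code embedding from the pre-trained microsoft/graphcodebert-base checkpoint with the transformers library. How the project fine-tunes or classifies on top of the encoder is not shown here and may differ:

import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained GraphCodeBERT encoder (illustrative; the notebooks may
# add a classification head and fine-tune rather than use raw embeddings).
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = encoder(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # summary vector for the snippet
print(embedding.shape)  # torch.Size([1, 768])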

Project Structure

├── code_authorship_attribution.ipynb                    # Main analysis notebook
├── code_authorship_attribution_gui_integration.ipynb    # GUI backend notebook
├── code_authorship_attribution_terse.ipynb              # Condensed version
├── code_authorship_attribution_annotated.ipynb          # Detailed annotated version
├── test_results_analysis.ipynb                          # Performance analysis
├── GUI integration/                                     # Web application
│   ├── app.py                                           # Streamlit GUI
│   ├── code_authorship_attribution_gui_integration.py   # Backend functions
│   ├── requirements.txt                                 # Python dependencies
│   ├── Dockerfile                                       # Docker configuration
│   ├── docker-compose.yml                               # Docker Compose setup
│   ├── train_samples.zip                                # Training data sample
│   └── analyze_samples.zip                              # Test data sample
├── subsets.csv                                          # Dataset configuration
├── model_test_results.xlsx                              # Model performance results
├── authver_test_results.xlsx                            # Authorship verification results
└── balance_test_results.xlsx                            # Dataset balance analysis

Quick Start

Using the GUI Application

Option 1: Local Installation

  1. Install Python 3.10+ and ensure pip is available

  2. Clone the repository:

    git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
    cd wsu-pique-stylometry
  3. Navigate to GUI directory:

    cd "GUI integration"
  4. Install dependencies:

    pip install -r requirements.txt
  5. Run the application:

    streamlit run app.py
  6. Access the GUI: Open your browser to http://localhost:8501

Option 2: Docker Deployment

  1. Clone the repository:

    git clone https://github.com/MSUSEL/wsu-pique-stylometry.git
    cd wsu-pique-stylometry/"GUI integration"
  2. Run with Docker Compose:

    docker-compose up --build
  3. Access the GUI: Open your browser to http://localhost:3003

Option 3: Docker Manual Build

cd "GUI integration"
docker build -t stylometry-app .
docker run -p 3003:3003 stylometry-app

Using the GUI

  1. Upload Training Data: Upload a train_samples.zip file containing folders named after each author with their code samples
  2. Upload Test Data: Upload an analyze_samples.zip file with code samples to analyze
  3. Select Method: Choose between N-Gram or GC-BERT analysis
  4. Select Language (N-Gram only): Choose the programming language of your samples
  5. Optional Evaluation: Check the evaluation box if your test files follow the naming convention <author>_<filename>.<ext> (see the parsing sketch after this list)
  6. Run Attribution: Click the button to start analysis
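
For reference, when evaluation is enabled the ground-truth author must be recovered from each test file name. A minimal sketch of one way to parse the <author>_<filename>.<ext> convention (the app's actual parsing rule may differ, for example for author names that contain underscores):

from pathlib import Path

def author_from_filename(path: str) -> str:
    # Treat everything before the first underscore as the author label,
    # per the <author>_<filename>.<ext> convention.
    return Path(path).name.split("_", 1)[0]

print(author_from_filename("author1_test.py"))  # author1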

Data Format Requirements

Training Data (train_samples.zip)

train_samples/
├── author1/
│   ├── file1.py
│   ├── file2.py
│   └── ...
├── author2/
│   ├── file1.py
│   ├── file2.py
│   └── ...
└── ...

Test Data (analyze_samples.zip)

analyze_samples/
├── unknown_file1.py
├── unknown_file2.py
├── author1_test.py # For evaluation (optional)
└── ...
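
A short sketch of packaging folders laid out as above into the expected archives, using only the Python standard library. The folder and archive names are the defaults shown in the trees; adjust base_dir if the app expects the contents at the zip root:

import shutil

# Zip train_samples/ (one sub-folder per author) and analyze_samples/ (flat
# files to attribute). base_dir keeps the top-level folder inside each archive.
shutil.make_archive("train_samples", "zip", root_dir=".", base_dir="train_samples")
shutil.make_archive("analyze_samples", "zip", root_dir=".", base_dir="analyze_samples")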

Using Jupyter Notebooks

Main Analysis Notebook

Open code_authorship_attribution.ipynb for complete dataset generation and analysis:

  1. Part 1: Build dataset from GitHub repositories
  2. Part 2: Preprocess and prepare data
  3. Part 3: Feature extraction and vectorization
  4. Part 4: Model training and evaluation
  5. Part 5: Hyperparameter optimization

Quick Analysis

Use code_authorship_attribution_terse.ipynb for streamlined analysis with existing datasets.

Configuration

N-Gram Parameters (in notebooks)

vec_kwargs = {
    'input': 'filename',
    'strip_accents': 'ascii',
    'lowercase': False,
    'ngram_range': (7, 7),    # Character n-gram size
    'analyzer': 'char',
    'max_features': 5000      # Maximum number of features
}
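
These keyword arguments follow scikit-learn's text vectorizer interface; with 'input': 'filename' the vectorizer reads each training file from disk rather than taking raw strings. A minimal usage sketch (TfidfVectorizer is an assumption here; the notebooks may use a different vectorizer class, and the paths below are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# With input='filename' the vectorizer opens each listed file and builds
# character 7-gram features from its contents.
vectorizer = TfidfVectorizer(**vec_kwargs)
X_train = vectorizer.fit_transform([
    "data/author1/file1.py",
    "data/author2/file1.py",
])
print(X_train.shape)  # (n_files, up to max_features)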

Dataset Generation Parameters

part1_kwargs = {
    'terms': 'machine learning',
    'params': {'language': 'python', 'size': '100000..250000'},
    'sorting': {'sort': 'stars', 'order': None},
    'num_authors': 10,
    'max_results': 100,
    'min_files': 15,
    'max_files': 50,
    'min_lines': 10,
    'data_directory': './data',
    'temp_directory': './temp'
}
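
These parameters describe a GitHub repository search: 'terms' and 'params' form the query and 'sorting' sets the result order, while the remaining keys control how many authors and files are kept. A rough sketch of the equivalent REST call with the requests library (the notebook's actual implementation, pagination, and authentication handling may differ):

import requests

# Search query assembled from part1_kwargs (illustrative; unauthenticated
# requests to the GitHub API are heavily rate-limited).
query = "machine learning language:python size:100000..250000"
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": query, "sort": "stars", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()
for repo in resp.json()["items"][:5]:
    print(repo["full_name"], repo["stargazers_count"])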

Performance

The project includes comprehensive performance analysis in test_results_analysis.ipynb. Key findings:

  • N-Gram Method: Fast execution, good accuracy for single-language datasets
  • GC-BERT Method: Higher accuracy for complex patterns, slower execution
  • Language Impact: Performance varies by programming language
  • Dataset Size: Larger, balanced datasets improve accuracy

Research Applications

This tool has been used for:

  • Academic research in software engineering
  • Code plagiarism detection
  • Authorship verification in collaborative projects
  • Stylometric analysis of programming patterns

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Create a Pull Request

Requirements

Core Dependencies

  • Python 3.10+
  • scikit-learn
  • pandas
  • numpy
  • tensorflow/keras
  • torch
  • transformers
  • streamlit (for GUI)

Full Requirements

See GUI integration/requirements.txt for complete dependency list.

Troubleshooting

Common Issues

  1. "No valid samples found": Ensure your code samples match the selected language
  2. Memory errors: Reduce max_features in N-gram configuration or use smaller datasets
  3. Docker build fails: Ensure Docker has sufficient memory allocation (4GB+ recommended)
  4. GC-BERT performance issues: Use recommended languages (Python, Java, JavaScript, PHP, Ruby, Go)

File Format Requirements

  • Code files must have appropriate extensions (.py, .java, .js, etc.)
  • Zip files must contain the exact folder structure shown above
  • Files should contain sufficient code (minimum 10 lines recommended)
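
A small pre-flight check along these lines can catch unusable files before uploading. The extension list and 10-line threshold follow the bullets above; the helper itself is illustrative and not part of the project:

from pathlib import Path

CODE_EXTENSIONS = {".py", ".java", ".js", ".cs", ".cpp", ".go", ".php", ".rb"}

def looks_usable(path: str, min_lines: int = 10) -> bool:
    # Reject files with unexpected extensions or too little code.
    p = Path(path)
    if p.suffix.lower() not in CODE_EXTENSIONS:
        return False
    lines = p.read_text(encoding="utf-8", errors="ignore").splitlines()
    return len(lines) >= min_lines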

License

This project is part of the WSU PIQUE research initiative. Please cite appropriately if used in academic work.

Support

For questions and support:

  • Create an issue in the GitHub repository
  • Check the annotated notebook (code_authorship_attribution_annotated.ipynb) for detailed explanations
  • Review the test results analysis for performance insights

Note: This tool is designed for research and educational purposes. Ensure you have appropriate permissions when analyzing code repositories.