haniwers.v1.cli.preprocess#

Data preprocessing commands for raw2csv functionality.

Provides commands for converting raw detector CSV files to processed format. Supports v0 compatibility mode and configurable preprocessing parameters.

Commands:

  • raw2csv: Full processing pipeline with statistics

  • raw2tmp: Quick conversion for temporary analysis (v0 compatible)

  • run2csv: Process raw data from a specific run using metadata from Google Sheets export

Dependencies:

  • haniwers.v1.preprocess: Core preprocessing functions

  • typer: CLI framework

  • loguru: Structured logging

  • pandas: Data frame manipulation

Module Contents#

Functions#

_validate_input_directory

Validate input directory exists and is accessible.

_discover_files

Discover files matching pattern in input directory.

_validate_runs_csv

Validate runs.csv file exists and is readable.

_get_run_info

Get run metadata from runs.csv using run_id.

raw2csv

Convert raw detector data to processed CSV format.

raw2tmp

Quick conversion for temporary analysis (v0 compatible output).

run2csv

Convert raw detector data for a specific run using metadata from runs.csv.

Data#

app

API#

haniwers.v1.cli.preprocess.app#

‘Typer(…)’

haniwers.v1.cli.preprocess._validate_input_directory(read_from: str) pathlib.Path#

Validate input directory exists and is accessible.

Args: read_from: Directory path to validate

Returns: Path: Validated Path object

Raises: typer.Exit: If directory doesn’t exist or is not a directory

haniwers.v1.cli.preprocess._discover_files(input_dir: pathlib.Path, pattern: str, verbose: bool = False) list[pathlib.Path]#

Discover files matching pattern in input directory.

Args:

  • input_dir: Directory to search

  • pattern: Glob pattern for matching files

  • verbose: Enable detailed logging

Returns: list[Path]: Sorted list of matching file paths

Raises: typer.Exit: If no files matching pattern are found
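The behaviour described above (glob the directory, sort the matches, stop when nothing is found) can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the name `discover_files` is hypothetical, and the built-in `SystemExit` stands in for `typer.Exit` so that the sketch runs without Typer installed.

```python
from pathlib import Path


def discover_files(input_dir: Path, pattern: str) -> list[Path]:
    """Return the sorted files in input_dir that match a glob pattern."""
    if not input_dir.is_dir():
        # The documented helpers raise typer.Exit; SystemExit is a stand-in.
        raise SystemExit(f"Not a directory: {input_dir}")

    # Sorting makes the processing order independent of the filesystem.
    files = sorted(p for p in input_dir.glob(pattern) if p.is_file())
    if not files:
        raise SystemExit(f"No files matching {pattern!r} in {input_dir}")
    return files
```

With the default pattern `*data*.csv`, a directory holding `a_data_1.csv`, `b_data_2.csv`, and `notes.txt` would yield only the two CSV files, in name order.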

haniwers.v1.cli.preprocess._validate_runs_csv(runs_csv_path: str) pathlib.Path#

Validate runs.csv file exists and is readable.

Args: runs_csv_path: Path to runs.csv file

Returns: Path: Validated Path object

Raises: typer.Exit: If file doesn’t exist or is not readable

haniwers.v1.cli.preprocess._get_run_info(runs_csv_path: pathlib.Path, run_id: str, workspace: str) dict#

Get run metadata from runs.csv using run_id.

Extracts raw data directory, search pattern, and output paths from runs.csv. Handles Google Sheets exports with type hint rows.

Expected CSV columns:

  • run_id: Run ID

  • path_raw_data: Directory containing raw data files

  • search_pattern: Glob pattern for raw data files

  • path_preprocessed_data: Output path for preprocessed data

  • path_resampled_data: Output path for resampled data

Args:

  • runs_csv_path: Path to runs.csv exported from Google Sheets

  • run_id: Run ID to lookup

  • workspace: Base workspace directory to construct full paths

Returns: dict: Run metadata with keys:

  • data_dir: Full path to raw data directory

  • pattern: Search pattern for raw data files

  • preprocessed_path: Relative path for preprocessed output

  • resampled_path: Relative path for resampled output

Raises:

  • InvalidCSVError: If CSV columns are invalid or file cannot be read

  • InvalidIDError: If run_id not found in runs.csv

  • FileNotFoundError: If data directory doesn’t exist

  • NotADirectoryError: If data path exists but is not a directory
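A lookup of this shape might be sketched with the standard library alone. Everything beyond the documented column names, return keys, and exception names is an assumption: the function name `get_run_info`, the locally defined exception classes (stand-ins for the package's own), and in particular the rule used to detect a type hint row (a `run_id` cell containing a type name such as `int`).

```python
import csv
from pathlib import Path


class InvalidCSVError(Exception):
    """Stand-in for the documented exception of the same name."""


class InvalidIDError(Exception):
    """Stand-in for the documented exception of the same name."""


REQUIRED_COLUMNS = (
    "run_id",
    "path_raw_data",
    "search_pattern",
    "path_preprocessed_data",
    "path_resampled_data",
)

# Assumption: a type hint row holds type names instead of values.
TYPE_HINTS = {"int", "str", "float", "bool", "path", "date", "datetime"}


def get_run_info(runs_csv_path: Path, run_id: str, workspace: str) -> dict:
    """Look up one run in runs.csv and return its metadata."""
    with open(runs_csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise InvalidCSVError(f"Missing columns: {missing}")
        rows = list(reader)

    for row in rows:
        cell = (row["run_id"] or "").strip()
        if cell.lower() in TYPE_HINTS:
            continue  # skip the type hint row
        if cell != str(run_id):
            continue
        data_dir = Path(workspace) / row["path_raw_data"]
        if not data_dir.exists():
            raise FileNotFoundError(data_dir)
        if not data_dir.is_dir():
            raise NotADirectoryError(data_dir)
        return {
            "data_dir": data_dir,
            "pattern": row["search_pattern"],
            "preprocessed_path": row["path_preprocessed_data"],
            "resampled_path": row["path_resampled_data"],
        }

    raise InvalidIDError(f"run_id {run_id!r} not found in {runs_csv_path}")
```

Comparing `run_id` as a string avoids depending on how the spreadsheet export formats numeric cells.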

haniwers.v1.cli.preprocess.raw2csv(read_from: str = typer.Argument(..., help='Directory containing raw CSV files from detector'), save: bool = typer.Option(False, '--save', help='Save processed data to files'), interval: int = PreprocessOptions.interval, offset: int = PreprocessOptions.offset, tz: str = typer.Option('UTC+09:00', '--tz', help='Timezone string for timestamp parsing (e.g., UTC+09:00, Asia/Tokyo)', rich_help_panel='Processing Options'), pattern: str = typer.Option('*data*.csv', '--pattern', help='Glob pattern for matching input files', rich_help_panel='Processing Options'), verbose: bool = LoggerOptions.verbose) None#

Convert raw detector data to processed CSV format.

Performs full preprocessing pipeline:

  1. Load raw CSV files from detector

  2. Parse timestamps and apply timezone/offset

  3. Compute hit classifications (per-layer and composite)

  4. Resample data into time windows

  5. Calculate statistics (mean, std, event rates)
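Steps 4 and 5 can be illustrated with pandas, which the module lists as a dependency. The sketch below is not the package's implementation: the function name `resample_events` and the column names `time` and `adc` are hypothetical, chosen only to show how fixed time windows and per-window statistics fit together.

```python
import pandas as pd


def resample_events(events: pd.DataFrame, interval: int) -> pd.DataFrame:
    """Aggregate per-event rows into windows of `interval` seconds.

    Expects a datetime column "time" and a numeric column "adc"
    (both names are assumptions made for this sketch).
    """
    windows = events.set_index("time").resample(pd.Timedelta(seconds=interval))

    # Mean and standard deviation per window, plus the number of events.
    out = windows["adc"].agg(["mean", "std", "count"])
    out = out.rename(columns={"count": "events"})

    # Event rate in events per second for each window.
    out["event_rate"] = out["events"] / interval
    return out.reset_index()
```

A window containing a single event has an undefined sample standard deviation, so its `std` is `NaN`.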

Output files (when --save is used):

  • processed.csv.gz: Full processed data with all columns

  • resampled.csv: Resampled data with statistical aggregates

Example:

  $ haniwers-v1 preprocess raw2csv sandbox/test_data/20240611_run93 --save

  $ haniwers-v1 preprocess raw2csv data/ --interval 300 --tz Asia/Tokyo

  $ haniwers-v1 preprocess raw2csv data/ --pattern "run*.csv" --verbose

Args:

  • read_from: Directory path containing raw CSV files

  • save: Save output files (processed.csv.gz, resampled.csv)

  • interval: Time window size in seconds for resampling

  • offset: Timestamp offset in seconds

  • tz: Timezone for timestamp interpretation

  • pattern: File glob pattern (default: “*data*.csv”)

  • verbose: Enable detailed logging
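The `--tz` help text gives two kinds of example value, a fixed offset (`UTC+09:00`) and a zone name (`Asia/Tokyo`). One way to accept both is sketched below using only the standard library. The helper names `parse_tz` and `localize` are hypothetical, and treating `offset` as a number of seconds added to each timestamp is an assumption; the source does not say how the command applies it.

```python
import re
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo

# Matches fixed offsets written as UTC+HH:MM or UTC-HH:MM.
_UTC_OFFSET = re.compile(r"^UTC([+-])(\d{2}):(\d{2})$")


def parse_tz(tz: str) -> tzinfo:
    """Turn a string such as "UTC+09:00" or "Asia/Tokyo" into a tzinfo."""
    match = _UTC_OFFSET.match(tz)
    if match:
        sign = 1 if match.group(1) == "+" else -1
        delta = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
        return timezone(sign * delta)
    # Anything else is treated as an IANA zone name.
    return ZoneInfo(tz)


def localize(naive: datetime, tz: str, offset: int = 0) -> datetime:
    """Attach the timezone to a naive timestamp, then shift it by offset seconds."""
    return naive.replace(tzinfo=parse_tz(tz)) + timedelta(seconds=offset)
```

Zone names rely on the system's IANA timezone database (or the `tzdata` package), whereas the fixed-offset form needs no external data.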

haniwers.v1.cli.preprocess.raw2tmp(read_from: str = typer.Argument(..., help='Directory containing raw CSV files'), interval: int = PreprocessOptions.interval, pattern: str = typer.Option('*data*.csv', '--pattern', help='Glob pattern for matching input files', rich_help_panel='Processing Options')) None#

Quick conversion for temporary analysis (v0 compatible output).

Faster alternative to raw2csv for quick data exploration. Always saves output with v0-compatible filenames:

  • tmp_raw2tmp.csv.gz: Processed data (compressed)

  • tmp_raw2tmp.csv: Resampled data

This command uses default timezone (UTC+09:00) and no time offset. For advanced options, use ‘raw2csv’ command.

Example:

  $ haniwers-v1 preprocess raw2tmp sandbox/test_data/20240611_run93

  $ haniwers-v1 preprocess raw2tmp data/ --interval 300

Args:

  • read_from: Directory path containing raw CSV files

  • interval: Time window size in seconds for resampling

  • pattern: File glob pattern (default: “*data*.csv”)

haniwers.v1.cli.preprocess.run2csv(run_id: str = typer.Argument(..., help='Run ID from runs.csv'), load_from: str = typer.Option('runs.csv', '--load_from', help='Path to runs.csv configuration exported from Google Sheets', rich_help_panel='Input Options'), workspace: str = typer.Option('.', '--workspace', help='Base workspace directory (root for relative paths in runs.csv)', rich_help_panel='Input Options'), preprocessed: str = typer.Option('.', '--preprocessed', help='Directory to save preprocessed (processed.csv.gz) files', rich_help_panel='Output Options'), resampled: str = typer.Option('.', '--resampled', help='Directory to save resampled (resampled.csv) files', rich_help_panel='Output Options'), save: bool = typer.Option(False, '--save', help='Save processed data to files'), interval: int = PreprocessOptions.interval, offset: int = PreprocessOptions.offset, tz: str = typer.Option('UTC+09:00', '--tz', help='Timezone string for timestamp parsing (e.g., UTC+09:00, Asia/Tokyo)', rich_help_panel='Processing Options'), verbose: bool = LoggerOptions.verbose) None#

Convert raw detector data for a specific run using metadata from runs.csv.

This command reads run information from a Google Sheets export (runs.csv), looks up the raw data directory and search pattern, then processes the raw detector data through the full preprocessing pipeline.

The runs.csv file should have:

  • run_id: Run ID column

  • path_raw_data: Directory containing raw data files

  • search_pattern: Glob pattern for raw data files (e.g., “*.dat”)

  • path_preprocessed_data: Output path for preprocessed data

  • path_resampled_data: Output path for resampled data

If --save is used with --preprocessed and --resampled options, output files are saved to those directories; otherwise paths from runs.csv are used relative to the base workspace.
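The precedence rule in the previous paragraph might be sketched as below. This is an illustration under stated assumptions, not the command's code: the helper `resolve_output_dir` is hypothetical, and `None` is used to mean "the option was not given", whereas the real options default to `'.'` and the source does not show how an explicit value is distinguished from the default.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(
    override: Optional[str], workspace: str, runs_csv_path: str
) -> Path:
    """Choose the output directory for one kind of output file.

    override:      value of --preprocessed or --resampled, or None if absent
    workspace:     base directory for relative paths in runs.csv
    runs_csv_path: path_preprocessed_data or path_resampled_data from runs.csv
    """
    if override is not None:
        # An explicit command-line directory wins over runs.csv.
        return Path(override)
    # Otherwise the runs.csv path is taken relative to the workspace.
    return Path(workspace) / runs_csv_path
```

The same helper would be called twice per run, once for the preprocessed output and once for the resampled output.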

Example:

  $ haniwers-v1 preprocess run2csv 1 --load_from runs.csv --workspace ./data --save

  $ haniwers-v1 preprocess run2csv 100 --load_from runs.csv --preprocessed ./output/processed --resampled ./output/resampled --save

  $ haniwers-v1 preprocess run2csv 85 --workspace /mnt/data --interval 300 --save --verbose

Args:

  • run_id: Run ID to process (from runs.csv)

  • load_from: Path to runs.csv configuration file exported from Google Sheets

  • workspace: Base directory for relative paths in runs.csv

  • preprocessed: Directory for preprocessed output (overrides runs.csv path)

  • resampled: Directory for resampled output (overrides runs.csv path)

  • save: Save output files

  • interval: Time window size in seconds for resampling

  • offset: Timestamp offset in seconds

  • tz: Timezone for timestamp interpretation

  • verbose: Enable detailed logging