haniwers.v1.cli.preprocess#
Data preprocessing commands for raw2csv functionality.
Provides commands for converting raw detector CSV files to processed format. Supports v0 compatibility mode and configurable preprocessing parameters.
Commands:
raw2csv: Full processing pipeline with statistics
raw2tmp: Quick conversion for temporary analysis (v0 compatible)
run2csv: Process raw data from a specific run using metadata from Google Sheets export
Dependencies:
- haniwers.v1.preprocess: Core preprocessing functions
- typer: CLI framework
- loguru: Structured logging
- pandas: Data frame manipulation
Module Contents#
Functions#
_validate_input_directory | Validate input directory exists and is accessible.
_discover_files | Discover files matching pattern in input directory.
_validate_runs_csv | Validate runs.csv file exists and is readable.
_get_run_info | Get run metadata from runs.csv using run_id.
raw2csv | Convert raw detector data to processed CSV format.
raw2tmp | Quick conversion for temporary analysis (v0 compatible output).
run2csv | Convert raw detector data for a specific run using metadata from runs.csv.
Data#
API#
- haniwers.v1.cli.preprocess.app#
‘Typer(…)’
- haniwers.v1.cli.preprocess._validate_input_directory(read_from: str) → pathlib.Path#
Validate input directory exists and is accessible.
Args: read_from: Directory path to validate
Returns: Path: Validated Path object
Raises: typer.Exit: If directory doesn’t exist or is not a directory
- haniwers.v1.cli.preprocess._discover_files(input_dir: pathlib.Path, pattern: str, verbose: bool = False) → list[pathlib.Path]#
Discover files matching pattern in input directory.
Args:
input_dir: Directory to search
pattern: Glob pattern for matching files
verbose: Enable detailed logging
Returns: list[Path]: Sorted list of matching file paths
Raises: typer.Exit: If no files matching pattern are found
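The discovery step described above can be sketched with `pathlib` alone. This is a minimal illustration, not the module's actual code: the name `discover_files` is hypothetical, and a plain `FileNotFoundError` stands in for the `typer.Exit` that the real helper raises, so the snippet has no third-party dependencies.

```python
from pathlib import Path


def discover_files(input_dir: Path, pattern: str) -> list[Path]:
    """Return regular files in input_dir that match a glob pattern, sorted by path."""
    matches = sorted(p for p in input_dir.glob(pattern) if p.is_file())
    if not matches:
        # Assumption: the real helper raises typer.Exit here; a built-in
        # exception keeps this sketch standard-library only.
        raise FileNotFoundError(f"No files matching {pattern!r} in {input_dir}")
    return matches
```

Sorting the matches gives a deterministic processing order, which matters when file names encode the acquisition sequence.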
- haniwers.v1.cli.preprocess._validate_runs_csv(runs_csv_path: str) → pathlib.Path#
Validate runs.csv file exists and is readable.
Args: runs_csv_path: Path to runs.csv file
Returns: Path: Validated Path object
Raises: typer.Exit: If file doesn’t exist or is not readable
- haniwers.v1.cli.preprocess._get_run_info(runs_csv_path: pathlib.Path, run_id: str, workspace: str) → dict#
Get run metadata from runs.csv using run_id.
Extracts raw data directory, search pattern, and output paths from runs.csv. Handles Google Sheets exports with type hint rows.
Expected CSV columns:
run_id: Run ID
path_raw_data: Directory containing raw data files
search_pattern: Glob pattern for raw data files
path_preprocessed_data: Output path for preprocessed data
path_resampled_data: Output path for resampled data
Args:
runs_csv_path: Path to runs.csv exported from Google Sheets
run_id: Run ID to lookup
workspace: Base workspace directory to construct full paths
Returns: dict: Run metadata with keys:
- data_dir: Full path to raw data directory
- pattern: Search pattern for raw data files
- preprocessed_path: Relative path for preprocessed output
- resampled_path: Relative path for resampled output
Raises:
InvalidCSVError: If CSV columns are invalid or file cannot be read
InvalidIDError: If run_id not found in runs.csv
FileNotFoundError: If data directory doesn’t exist
NotADirectoryError: If data path exists but is not a directory
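The lookup described above could look roughly like the following standard-library sketch. Several details are assumptions made for illustration: the function name `lookup_run` and the `skip_type_row` flag are hypothetical, the type hint row is assumed to be the first row after the header, and `ValueError` and `KeyError` stand in for the package's `InvalidCSVError` and `InvalidIDError`.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {
    "run_id",
    "path_raw_data",
    "search_pattern",
    "path_preprocessed_data",
    "path_resampled_data",
}


def lookup_run(runs_csv: Path, run_id: str, workspace: str, skip_type_row: bool = True) -> dict:
    """Look up one run in runs.csv and resolve its raw data directory."""
    with runs_csv.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            # Stand-in for InvalidCSVError.
            raise ValueError(f"runs.csv is missing columns: {sorted(missing)}")
        rows = list(reader)

    if skip_type_row:
        # Assumption: the first data row holds type hints such as "int" or "str".
        rows = rows[1:]

    row = next((r for r in rows if (r.get("run_id") or "").strip() == run_id), None)
    if row is None:
        # Stand-in for InvalidIDError.
        raise KeyError(f"run_id {run_id!r} not found in {runs_csv}")

    data_dir = Path(workspace) / row["path_raw_data"]
    if not data_dir.exists():
        raise FileNotFoundError(f"Raw data directory does not exist: {data_dir}")
    if not data_dir.is_dir():
        raise NotADirectoryError(f"Raw data path is not a directory: {data_dir}")

    return {
        "data_dir": data_dir,
        "pattern": row["search_pattern"],
        "preprocessed_path": row["path_preprocessed_data"],
        "resampled_path": row["path_resampled_data"],
    }
```

Only the raw data directory is joined with the workspace here; the two output paths stay relative, matching the return value described above.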
- haniwers.v1.cli.preprocess.raw2csv(read_from: str = typer.Argument(..., help='Directory containing raw CSV files from detector'), save: bool = typer.Option(False, '--save', help='Save processed data to files'), interval: int = PreprocessOptions.interval, offset: int = PreprocessOptions.offset, tz: str = typer.Option('UTC+09:00', '--tz', help='Timezone string for timestamp parsing (e.g., UTC+09:00, Asia/Tokyo)', rich_help_panel='Processing Options'), pattern: str = typer.Option('*data*.csv', '--pattern', help='Glob pattern for matching input files', rich_help_panel='Processing Options'), verbose: bool = LoggerOptions.verbose) → None#
Convert raw detector data to processed CSV format.
Performs full preprocessing pipeline:
Load raw CSV files from detector
Parse timestamps and apply timezone/offset
Compute hit classifications (per-layer and composite)
Resample data into time windows
Calculate statistics (mean, std, event rates)
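To illustrate the last two steps, the sketch below groups events into fixed time windows and computes a few aggregates per window. It is a standard-library illustration only: the package itself lists pandas as a dependency, and the function name `resample`, the input shape (pairs of timestamp in seconds and a measured value), and the output keys are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def resample(events, interval):
    """Group (timestamp_seconds, value) pairs into windows of `interval` seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Floor each timestamp to the start of its window.
        window_start = int(timestamp // interval) * interval
        windows[window_start].append(value)

    return [
        {
            "window_start": start,
            "events": len(values),
            "event_rate": len(values) / interval,  # events per second
            "mean": mean(values),
            "std": pstdev(values),  # population std, defined even for one event
        }
        for start, values in sorted(windows.items())
    ]
```

With `interval=300`, for example, each output row would summarize five minutes of detector data.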
Output files (when --save is used):
processed.csv.gz: Full processed data with all columns
resampled.csv: Resampled data with statistical aggregates
Example:
$ haniwers-v1 preprocess raw2csv sandbox/test_data/20240611_run93 --save
$ haniwers-v1 preprocess raw2csv data/ --interval 300 --tz Asia/Tokyo
$ haniwers-v1 preprocess raw2csv data/ --pattern "run*.csv" --verbose
Args:
read_from: Directory path containing raw CSV files
save: Save output files (processed.csv.gz, resampled.csv)
interval: Time window size in seconds for resampling
offset: Timestamp offset in seconds
tz: Timezone for timestamp interpretation
pattern: File glob pattern (default: “*data*.csv”)
verbose: Enable detailed logging
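The `--tz` option accepts both a fixed offset (`UTC+09:00`) and an IANA name (`Asia/Tokyo`). The sketch below shows one way such strings could be interpreted with the standard library. The helper names `parse_tz` and `localize` are hypothetical, and treating `offset` as a number of seconds added to the timestamp before the timezone is attached is an assumption. The actual parsing in haniwers may differ.

```python
import re
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo

_FIXED_OFFSET = re.compile(r"^UTC([+-])(\d{2}):(\d{2})$")


def parse_tz(tz: str) -> tzinfo:
    """Turn 'UTC+09:00' into a fixed offset, or an IANA name into a ZoneInfo."""
    match = _FIXED_OFFSET.match(tz)
    if match:
        sign = 1 if match.group(1) == "+" else -1
        delta = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
        return timezone(sign * delta)
    # Anything else is treated as an IANA zone name such as "Asia/Tokyo".
    return ZoneInfo(tz)


def localize(naive: datetime, tz: str, offset: int = 0) -> datetime:
    """Shift a naive timestamp by `offset` seconds, then attach the timezone."""
    return (naive + timedelta(seconds=offset)).replace(tzinfo=parse_tz(tz))
```

A fixed offset never changes with daylight saving time, whereas an IANA name follows the zone's rules. For Japan the two are equivalent in practice.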
- haniwers.v1.cli.preprocess.raw2tmp(read_from: str = typer.Argument(..., help='Directory containing raw CSV files'), interval: int = PreprocessOptions.interval, pattern: str = typer.Option('*data*.csv', '--pattern', help='Glob pattern for matching input files', rich_help_panel='Processing Options')) → None#
Quick conversion for temporary analysis (v0 compatible output).
Faster alternative to raw2csv for quick data exploration. Always saves output with v0-compatible filenames:
tmp_raw2tmp.csv.gz: Processed data (compressed)
tmp_raw2tmp.csv: Resampled data
This command uses the default timezone (UTC+09:00) and no time offset. For advanced options, use the raw2csv command.
Example:
$ haniwers-v1 preprocess raw2tmp sandbox/test_data/20240611_run93
$ haniwers-v1 preprocess raw2tmp data/ --interval 300
Args:
read_from: Directory path containing raw CSV files
interval: Time window size in seconds for resampling
pattern: File glob pattern (default: “*data*.csv”)
- haniwers.v1.cli.preprocess.run2csv(run_id: str = typer.Argument(..., help='Run ID from runs.csv'), load_from: str = typer.Option('runs.csv', '--load_from', help='Path to runs.csv configuration exported from Google Sheets', rich_help_panel='Input Options'), workspace: str = typer.Option('.', '--workspace', help='Base workspace directory (root for relative paths in runs.csv)', rich_help_panel='Input Options'), preprocessed: str = typer.Option('.', '--preprocessed', help='Directory to save preprocessed (processed.csv.gz) files', rich_help_panel='Output Options'), resampled: str = typer.Option('.', '--resampled', help='Directory to save resampled (resampled.csv) files', rich_help_panel='Output Options'), save: bool = typer.Option(False, '--save', help='Save processed data to files'), interval: int = PreprocessOptions.interval, offset: int = PreprocessOptions.offset, tz: str = typer.Option('UTC+09:00', '--tz', help='Timezone string for timestamp parsing (e.g., UTC+09:00, Asia/Tokyo)', rich_help_panel='Processing Options'), verbose: bool = LoggerOptions.verbose) → None#
Convert raw detector data for a specific run using metadata from runs.csv.
This command reads run information from a Google Sheets export (runs.csv), looks up the raw data directory and search pattern, then processes the raw detector data through the full preprocessing pipeline.
The runs.csv file should have:
run_id: Run ID column
path_raw_data: Directory containing raw data files
search_pattern: Glob pattern for raw data files (e.g., “*.dat”)
path_preprocessed_data: Output path for preprocessed data
path_resampled_data: Output path for resampled data
If --save is used together with the --preprocessed and --resampled options, output files are saved to those directories; otherwise the paths from runs.csv are used, relative to the base workspace.
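The precedence between command-line directories and runs.csv paths can be pictured with the small sketch below. It is illustrative only: `resolve_output_location` is a hypothetical helper, and it uses `None` to mean "option not supplied", whereas the CLI options above default to `'.'`. How the command actually detects an explicit override is not documented here.

```python
from pathlib import Path
from typing import Optional


def resolve_output_location(
    override: Optional[str], path_from_runs_csv: str, workspace: str
) -> Path:
    """Prefer an explicit output directory; otherwise fall back to runs.csv."""
    if override is not None:
        # Assumption: an explicitly supplied directory is used as given.
        return Path(override)
    # Paths from runs.csv are interpreted relative to the base workspace.
    return Path(workspace) / path_from_runs_csv
```

The same helper would apply to both outputs: called once with the `--preprocessed` value and `path_preprocessed_data`, and once with the `--resampled` value and `path_resampled_data`.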
Example:
$ haniwers-v1 preprocess run2csv 1 --load_from runs.csv --workspace ./data --save
$ haniwers-v1 preprocess run2csv 100 --load_from runs.csv --preprocessed ./output/processed --resampled ./output/resampled --save
$ haniwers-v1 preprocess run2csv 85 --workspace /mnt/data --interval 300 --save --verbose
Args:
run_id: Run ID to process (from runs.csv)
load_from: Path to runs.csv configuration file exported from Google Sheets
workspace: Base directory for relative paths in runs.csv
preprocessed: Directory for preprocessed output (overrides runs.csv path)
resampled: Directory for resampled output (overrides runs.csv path)
save: Save output files
interval: Time window size in seconds for resampling
offset: Timestamp offset in seconds
tz: Timezone for timestamp interpretation
verbose: Enable detailed logging