haniwers.v1.preprocess.converter


High-level API orchestrating the data processing pipeline.

This module provides the main entry point convert_files() that coordinates the reader, transformer, and aggregator modules to process detector data.

Design (Principle IX - SRP):

  • Single responsibility: Orchestrate pipeline stages

  • Pure function, no side effects

  • Returns DataFrames for downstream processing

References:

  • FR-007: Return two datasets (raw processed + resampled events)

  • ADR-005: Layered data processing pipeline

  • ADR-012: Pure orchestration function

Module Contents

Functions

convert_files

Convert raw detector CSV files to processed and resampled DataFrames.

API

haniwers.v1.preprocess.converter.convert_files(file_paths: List[pathlib.Path], tz_string: str = 'UTC+09:00', offset_seconds: int = 0, resample_interval: int = 600) -> Tuple[pandas.DataFrame, Optional[pandas.DataFrame]]

Convert raw detector CSV files to processed and resampled DataFrames.

Main entry point for the v1 preprocess module. Orchestrates:

  1. Load raw CSV files (reader layer)

  2. Add calculated columns (transformer layer)

  3. Optionally resample to time windows (aggregator layer - Phase 2)

Phase 1 (User Story 1) returns only processed events. Phase 2 (User Story 2) will also return resampled aggregations.
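The orchestration pattern described above can be sketched in isolation. This is a minimal illustration, not the module's actual implementation: the function name `orchestrate` and the `read`, `transform`, and `aggregate` parameters are hypothetical stand-ins for the reader, transformer, and aggregator layers, and plain lists stand in for DataFrames.

```python
from pathlib import Path
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def orchestrate(
    file_paths: List[Path],
    read: Callable[[List[Path]], T],
    transform: Callable[[T], T],
    aggregate: Optional[Callable[[T], T]] = None,
) -> Tuple[T, Optional[T]]:
    """Run the stages in order and return (processed, resampled).

    Hypothetical sketch: the stage callables are injected so the
    orchestrator itself holds no processing logic.
    """
    if not file_paths:
        raise ValueError("no input files provided")
    raw = read(file_paths)            # stage 1: reader layer
    processed = transform(raw)        # stage 2: transformer layer
    # Stage 3 is optional: without an aggregator the second element
    # is None, mirroring the Phase 1 behaviour described above.
    resampled = aggregate(processed) if aggregate is not None else None
    return processed, resampled


# Toy stages operating on lists instead of DataFrames.
result = orchestrate(
    [Path("run93_001.csv")],
    read=lambda paths: [p.name for p in paths],
    transform=lambda names: [n.upper() for n in names],
)
print(result)  # (['RUN93_001.CSV'], None)
```

Passing the stages as arguments keeps the orchestrator free of processing logic, which is one way to honour the single-responsibility goal stated in the design notes.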

Args:

  • file_paths: List of Path objects pointing to raw CSV files from the detector

  • tz_string: Timezone for datetime parsing (default: 'UTC+09:00')

  • offset_seconds: Time offset correction in seconds (default: 0)

  • resample_interval: Interval in seconds for aggregation (default: 600)
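To make the time-related parameters concrete, the following sketch shows one common way to apply a fixed UTC offset, a correction in seconds, and a fixed-width window to a timestamp. It uses only the standard library and is an assumption about what these parameters mean in practice; the helper `window_start` is hypothetical, and the module itself may parse `tz_string` and resample quite differently.

```python
from datetime import datetime, timedelta, timezone


def window_start(ts: datetime, interval_seconds: int = 600) -> datetime:
    """Floor a timezone-aware timestamp to the start of its window."""
    epoch = int(ts.timestamp())
    floored = epoch - (epoch % interval_seconds)
    return datetime.fromtimestamp(floored, tz=ts.tzinfo)


# A fixed-offset timezone equivalent to the default 'UTC+09:00'.
jst = timezone(timedelta(hours=9))

# A sample event time, shifted by an offset correction of 0 seconds
# (the default for offset_seconds).
offset_seconds = 0
event = datetime(2024, 6, 1, 12, 34, 56, tzinfo=jst) + timedelta(seconds=offset_seconds)

print(window_start(event))  # 2024-06-01 12:30:00+09:00
```

With the default interval of 600 seconds, every event between 12:30:00 and 12:39:59 falls into the same window in this scheme.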

Returns:

Tuple of (processed_df, resampled_df):

  • processed_df: All events with calculated columns (datetime -> time, hit_* columns)

  • resampled_df: None in Phase 1, time-windowed aggregations in Phase 2+
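Because the second element of the returned tuple may be None, calling code written against Phase 1 should guard for it. The helper below is a hypothetical sketch of such a guard; `select_dataset` is not part of the module, and lists stand in for DataFrames so the example runs without pandas.

```python
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")


def select_dataset(result: Tuple[T, Optional[T]], prefer_resampled: bool = True) -> T:
    """Return the resampled dataset when available, else the processed one."""
    processed, resampled = result
    # Compare against None explicitly: a pandas DataFrame raises
    # ValueError when evaluated in a boolean context.
    if prefer_resampled and resampled is not None:
        return resampled
    return processed


print(select_dataset(([1, 2, 3], None)))  # [1, 2, 3]
print(select_dataset(([1, 2, 3], [6])))   # [6]
```

A guard like this should keep working unchanged once a later phase starts returning a resampled dataset.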

Raises: ValueError: If no files provided or files missing required columns
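A caller that processes many runs may prefer to log a failed conversion and continue rather than abort. The wrapper below is a minimal sketch under that assumption; `try_convert` and `fake_convert` are hypothetical names, and the conversion function is passed in as a parameter so the example does not depend on the package being installed.

```python
from pathlib import Path
from typing import Any, Callable, List, Optional


def try_convert(
    convert: Callable[[List[Path]], Any],
    file_paths: List[Path],
) -> Optional[Any]:
    """Call convert(file_paths); report a ValueError and return None instead."""
    try:
        return convert(file_paths)
    except ValueError as exc:
        print(f"conversion skipped: {exc}")
        return None


def fake_convert(paths: List[Path]) -> Any:
    """Stand-in that mimics the documented ValueError for empty input."""
    if not paths:
        raise ValueError("no files provided")
    return ("processed", None)


print(try_convert(fake_convert, []))                       # prints the skip message, then None
print(try_convert(fake_convert, [Path("run93_001.csv")]))  # ('processed', None)
```

In real use, `convert_files` would be passed as the `convert` argument; other exception types are deliberately left to propagate.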

Example:

    >>> from pathlib import Path
    >>> from haniwers.v1.preprocess.converter import convert_files
    >>> files = [Path("run93_001.csv"), Path("run93_002.csv")]
    >>> raw_df, resampled_df = convert_files(files)
    >>> print(raw_df.shape, resampled_df)  # Phase 1: no resampling yet
    (13856, 13) None