haniwers.v1.preprocess.converter
High-level API orchestrating the data processing pipeline.
This module provides the main entry point convert_files() that coordinates the reader, transformer, and aggregator modules to process detector data.
Design (Principle IX - SRP):
Single responsibility: Orchestrate pipeline stages
Pure function, no side effects
Returns DataFrames for downstream processing
References:
FR-007: Return two datasets (raw processed + resampled events)
ADR-005: Layered data processing pipeline
ADR-012: Pure orchestration function
Module Contents
Functions
convert_files | Convert raw detector CSV files to processed and resampled DataFrames.
API
- haniwers.v1.preprocess.converter.convert_files(file_paths: List[pathlib.Path], tz_string: str = 'UTC+09:00', offset_seconds: int = 0, resample_interval: int = 600) -> Tuple[pandas.DataFrame, Optional[pandas.DataFrame]]
Convert raw detector CSV files to processed and resampled DataFrames.
Main entry point for the v1 preprocess module. Orchestrates:
Load raw CSV files (reader layer)
Add calculated columns (transformer layer)
Optionally resample to time windows (aggregator layer - Phase 2)
Phase 1 (User Story 1) returns only processed events. Phase 2 (User Story 2) will also return resampled aggregations.
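The layered orchestration described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the helper names `load_files`, `add_columns`, and `orchestrate`, and the column names `top` and `hit_top`, are assumptions made for the example.

```python
from pathlib import Path
from typing import List, Optional, Tuple

import pandas as pd


def load_files(file_paths: List[Path]) -> pd.DataFrame:
    """Hypothetical reader layer: concatenate the raw CSV files."""
    if not file_paths:
        raise ValueError("no files provided")
    return pd.concat((pd.read_csv(p) for p in file_paths), ignore_index=True)


def add_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformer layer: work on a copy, leave the input untouched."""
    out = raw.copy()
    # Assumed rule for the example: a layer counts as "hit" when its signal is positive.
    out["hit_top"] = (out["top"] > 0).astype(int)
    return out


def orchestrate(file_paths: List[Path]) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """Call each layer in order and return both datasets without mutating inputs."""
    raw = load_files(file_paths)
    processed = add_columns(raw)
    resampled = None  # Phase 1 behaviour: the aggregator layer is not called yet
    return processed, resampled
```

Keeping the orchestrator free of parsing and aggregation logic means each layer can be tested on its own, which is the point of the single-responsibility design noted earlier.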
Args:
    file_paths: List of Path objects pointing to raw CSV files from the detector
    tz_string: Timezone for datetime parsing (default: UTC+09:00)
    offset_seconds: Time offset correction in seconds (default: 0)
    resample_interval: Interval in seconds for aggregation (default: 600)
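To make the roles of `tz_string` and `offset_seconds` concrete, here is one way a string such as `UTC+09:00` and a seconds offset could be applied to a naive timestamp. The helpers `parse_tz` and `apply_offset` are hypothetical, and the accepted string format is an assumption based only on the default value shown; the library may interpret these arguments differently.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_tz(tz_string: str) -> timezone:
    """Hypothetical helper: turn 'UTC+09:00' style strings into a fixed-offset timezone."""
    match = re.fullmatch(r"UTC([+-])(\d{2}):(\d{2})", tz_string)
    if match is None:
        raise ValueError(f"unsupported timezone string: {tz_string!r}")
    sign = 1 if match.group(1) == "+" else -1
    delta = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
    return timezone(sign * delta)


def apply_offset(naive: datetime, tz_string: str = "UTC+09:00", offset_seconds: int = 0) -> datetime:
    """Attach the timezone, then shift by the clock correction in seconds."""
    aware = naive.replace(tzinfo=parse_tz(tz_string))
    return aware + timedelta(seconds=offset_seconds)


corrected = apply_offset(datetime(2024, 1, 1, 12, 0, 0), offset_seconds=30)
print(corrected.isoformat())  # 2024-01-01T12:00:30+09:00
```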
Returns:
    Tuple of (processed_df, resampled_df):
    - processed_df: All events with calculated columns (datetime -> time, hit_* columns)
    - resampled_df: None in Phase 1; time-windowed aggregations in Phase 2+
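Because `resampled_df` is `Optional`, callers should check for `None` before using it. As an illustration of what a time-windowed aggregation might look like, the sketch below counts events per window with pandas. The function `resample_counts` and the `events` column are assumptions for the example; only the `time` column name and the 600-second default come from the description above.

```python
import pandas as pd


def resample_counts(processed: pd.DataFrame, resample_interval: int = 600) -> pd.DataFrame:
    """Hypothetical aggregator: count events in fixed windows of `resample_interval` seconds."""
    windows = processed.set_index("time").resample(pd.Timedelta(seconds=resample_interval))
    return windows.size().rename("events").reset_index()


# Three made-up events: two fall in the first 10-minute window, one in the second.
processed_df = pd.DataFrame(
    {"time": pd.to_datetime(["2024-01-01 00:00:10", "2024-01-01 00:05:00", "2024-01-01 00:12:00"])}
)
resampled_df = resample_counts(processed_df)

if resampled_df is not None:  # in Phase 1, convert_files() returns None here
    print(resampled_df)
```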
Raises:
    ValueError: If no files are provided or the files are missing required columns
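The two failure conditions can be illustrated with a small validation sketch. The required column names below are placeholders, since the actual list is not given here, and `validate_inputs` is a hypothetical helper rather than part of the module.

```python
from typing import Iterable, Sequence

# Placeholder column names for illustration only; the real requirements may differ.
REQUIRED_COLUMNS = ("datetime", "top", "mid", "btm")


def validate_inputs(file_paths: Sequence, columns: Iterable[str]) -> None:
    """Raise ValueError for an empty file list or for missing required columns."""
    if not file_paths:
        raise ValueError("no input files provided")
    present = set(columns)
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")


try:
    validate_inputs(["run.csv"], ["datetime", "top"])
except ValueError as err:
    message = str(err)
    print(message)  # missing required columns: ['mid', 'btm']
```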
Example:
    >>> from pathlib import Path
    >>> from haniwers.v1.preprocess.converter import convert_files
    >>> files = [Path("run93_001.csv"), Path("run93_002.csv")]
    >>> raw_df, resampled_df = convert_files(files)
    >>> print(raw_df.shape, resampled_df)  # Phase 1: no resampling yet
    (13856, 13) None