# `latdata`: Tabular Data

Named-axis data structures built on top of `NDArray`.

The `latdata` module provides lightweight labeled 2D containers. The core design philosophy is that a **Table is an NDArray with row and column labels**, not a full DataFrame clone. This means:

- Selection always delegates to `NDArray` under the hood, and a Table inherits its computation methods.
- Label-based indexing is the primary interface, but integer positional indexing also works.
- There is no separate index object, no multi-index, and no set operations on labels. The Axis is simply an ordered list of names with optional shortcuts (aliases).

---

## Axis

```python
from latpy.latdata import Axis
```

An **Axis** is an ordered list of labels with support for name-based lookups and user-defined aliases. Every Table has two Axes (`.rows` and `.cols`).

| Signature | Description |
|---|---|
| `Axis(name, labels)` | Create axis with string name and label list |
| `len(axis)` | Number of labels |
| `axis.has(label) -> bool` | Check if label exists |
| `axis.pos(label) -> int` | Integer position of label (0-based) |
| `axis.alias(name, sel)` | Register a named selector (slice, int, label list, predicate) |
| `axis.resolve(sel) -> tuple` | Convert a user selector to internal index form |

**Selectors** (accepted by `resolve`):

| Selector Type | Example | Meaning |
|---|---|---|
| `slice` | `1:5:2` | Positional slice (integers only) |
| `int` | `2` | Direct integer index (supports negative) |
| `Label` | `"NYC"` | Single label lookup |
| `Sequence[Label]` | `["a", "b", "c"]` | List of labels |
| `(Label, Label)` | `("jan", "jun")` | Inclusive label range (by axis order, supports reverse) |
| `Callable[[Label], bool]` | `lambda x: x.startswith("t")` | Predicate filter |
| Alias name (str) | `"top_10"` | Expands to previously registered selector |

### Basic usage

```python
axis = Axis("city", ["NYC", "LA", "CHI", "HOU"])

len(axis)        # 4
axis.has("LA")   # True
axis.has("SF")   # False
axis.pos("CHI")  # 2

# Resolve various selectors:
axis.resolve(0)                      # ("int", 0, ["NYC"])
axis.resolve(-1)                     # ("int", 3, ["HOU"]); negative wraps
axis.resolve("LA")                   # ("int", 1, ["LA"])
axis.resolve(["NYC", "HOU"])         # ("list", [0, 3], ["NYC", "HOU"])
axis.resolve(("CHI", "LA"))          # ("slice", slice(2, 0, -1), ["CHI", "LA"]); reverse range
axis.resolve(lambda x: len(x) == 3)  # ("list", [0, 2, 3], ["NYC", "CHI", "HOU"])
```

### Aliases

Aliases are named shortcuts. They are stored by name and expanded during `resolve`. This is useful for giving semantic names to frequently used sub-selections, e.g. `"features"`, `"top_10"`, `"validation_set"`.

**Rationale:** Aliases make code self-documenting. Instead of repeating `table[:, ["col1", "col3", "col7"]]` everywhere, you define `table.cols.alias("features", ["col1", "col3", "col7"])` once and write `table[:, "features"]` thereafter.

```python
axis = Axis("row", ["train", "val", "test"])
axis.alias("train_val", ["train", "val"])
axis.alias("not_test", lambda x: x != "test")

axis.resolve("train_val")  # ("list", [0, 1], ["train", "val"])
axis.resolve("not_test")   # ("list", [0, 1], ["train", "val"])
```

Aliases can wrap any selector type: a single label, a list, a slice, a callable, or even another alias name.

### Edge cases

**Empty labels list**: An Axis may have zero labels. Label lookups through `resolve` and `pos` raise errors, but `len(axis)` returns 0 and the Axis can still store aliases.

```python
empty = Axis("x", [])
len(empty)      # 0
empty.has("a")  # False
empty.pos("a")  # ShapeError: Axis[x]: unknown label 'a'
```

**Duplicate labels**: An Axis accepts duplicate labels without raising an error, but label lookups see only the **first** occurrence. The `labels` list keeps every duplicate, while the internal `_index` dict stores only the first position of each label.
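The first-occurrence rule can be illustrated with a few lines of plain Python. This is a standalone sketch, not latpy's implementation: `build_index` is a hypothetical helper that only mimics the lookup behavior described above.

```python
def build_index(labels):
    """Map each label to the position of its first occurrence."""
    index = {}
    for pos, label in enumerate(labels):
        # setdefault leaves an existing entry alone, so later duplicates are ignored
        index.setdefault(label, pos)
    return index


labels = ["a", "b", "a"]
index = build_index(labels)

print(len(labels))  # 3: the label list itself still holds the duplicate
print(index["a"])   # 0: lookups see only the first "a"
print(index)        # {'a': 0, 'b': 1}
```

The second `"a"` stays in the label list but can no longer be reached by name, which is why the position of a duplicated label can look surprising.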
```python
ax = Axis("x", ["a", "b", "a"])  # duplicate "a"
len(ax)      # 3
ax.has("a")  # True
ax.pos("a")  # 0 (first occurrence)
ax.labels    # ["a", "b", "a"] (may show duplicates)
```

This can produce surprising behavior: `resolve` returns label lists based on the stored labels, but positional lookups always use the first occurrence. Users are encouraged to avoid duplicate labels.

**Alias with an unknown label**: If an alias name is registered but its selector refers to a label that does not exist, the error surfaces at `resolve` time, not at `alias` registration time.

```python
ax = Axis("x", ["a", "b"])
ax.alias("bad", ["a", "z"])  # registers fine
ax.resolve("bad")            # ShapeError: Axis[x]: unknown label 'z'
```

**Resolve with a missing label**: Direct label lookup on a non-existent label raises `ShapeError`.

```python
ax = Axis("x", ["a", "b"])
ax.resolve("z")  # ShapeError: Axis[x]: unknown label 'z'
```

**Resolve with `None`**: Special case: `None` is treated as `slice(None)`, meaning "all labels".

```python
ax = Axis("x", ["a", "b", "c"])
ax.resolve(None)  # ("slice", slice(0, 3, 1), ["a", "b", "c"])
```

---

## Table

```python
from latpy.latdata import Table
```

A **Table** is a 2D labeled grid: a single `NDArray` together with a row `Axis` and a column `Axis`. It is not a full DataFrame; it intentionally omits pivot tables, joins, stacked operations, and fancy index manipulation. Its strength is label-based reading and writing with a clean `T[rows, cols]` syntax.
| Signature | Description |
|---|---|
| `Table.from_list(values_2d, row_labels, col_labels, dtype=None, name_rows="rows", name_cols="cols") -> Table` | Construct from nested lists; labels are auto-generated when omitted |
| `.data` | `NDArray` backing store |
| `.rows` | Row `Axis` |
| `.cols` | Column `Axis` |
| `.tolist() -> list[list]` | Convert to Python lists |
| `table[key] -> scalar \| NDArray \| Table` | Index by label / integer / slice |
| `.where(mask, a, b) -> Table` | Element-wise select (like `ndarray.where`) |
| `.sum(axis="rows") -> NDArray` | Sum over rows or columns |

### Construction

```python
# From nested lists with explicit labels
T = Table.from_list(
    [[1, 2, 3], [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)
# T.data is NDArray of shape (2, 3), dtype i64
# T.rows → Axis("rows", ["r1", "r2"])
# T.cols → Axis("cols", ["a", "b", "c"])
# T.tolist() → [[1, 2, 3], [4, 5, 6]]

# Auto-generated labels when omitted:
T2 = Table.from_list([[1, 2], [3, 4]])
# T2.rows.labels → ["r0", "r1"]
# T2.cols.labels → ["c0", "c1"]
```

### Indexing: return-type rules

The indexing expression `T[rows, cols]` follows these rules:

| rows selector | cols selector | return type | Example |
|---|---|---|---|
| single label/int | single label/int | **scalar** | `T["r1", "b"] → 2` |
| single label/int | multiple | **1D NDArray** | `T["r1", :] → [1,2,3]` |
| multiple | single label/int | **1D NDArray** | `T[:, "b"] → [2,5]` |
| `:` | `:` | **Table** (view) | `T[:, :] → Table` |
| multiple (non-slice) | multiple (non-slice) | **Table** (copy) | `T[["r1"], ["a","c"]] → Table` |

**Rationale:** 1D results are plain `NDArray` objects (not 1-column Tables) because a 1D NDArray is simpler, lighter, and interoperates directly with mathematical operations. A 2D result is always wrapped in a Table so that labels are preserved for further chaining.
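The return-type rules reduce to a two-input decision. The sketch below is a standalone illustration in plain Python, assuming each selector has already been classified as selecting a single entry or multiple entries; `result_kind` is a hypothetical name and is not part of latpy.

```python
def result_kind(rows_single: bool, cols_single: bool) -> str:
    """Return the kind of object T[rows, cols] yields under the rules above."""
    if rows_single and cols_single:
        return "scalar"        # one row, one column
    if rows_single or cols_single:
        return "1D NDArray"    # exactly one dimension collapses
    return "Table"             # both dimensions survive, labels are kept


print(result_kind(True, True))    # scalar
print(result_kind(True, False))   # 1D NDArray
print(result_kind(False, True))   # 1D NDArray
print(result_kind(False, False))  # Table
```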
```python
T = Table.from_list(
    [[1, 2, 3], [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

# Scalar
T["r1", "b"]           # → 2 (Python int)

# 1D NDArray (single row, all cols)
T["r1", :]             # → NDArray(shape=(3,), dtype=i64, axes=("cols",))
T["r1", :].tolist()    # [1, 2, 3]

# 1D NDArray (all rows, single column)
T[:, "b"]              # → NDArray(shape=(2,), dtype=i64, axes=("rows",))
T[:, "b"].tolist()     # [2, 5]

# Table (inclusive label ranges on both axes; label ranges are written
# as (start, stop) tuples because slices are positional only)
S = T[("r1", "r2"), ("a", "b")]
# S is Table(shape=(2, 2), rows=rows, cols=cols)
S.tolist()             # [[1, 2], [4, 5]]

# Table (full slice, returned as a view)
S = T[:, :]
S.tolist()             # [[1, 2, 3], [4, 5, 6]]

# Column shortcut: T["a"] is T[:, "a"]
T["a"]                 # NDArray([1, 4])

# List-of-labels selection
T[:, ["a", "c"]]           # Table(shape=(2, 2), rows=rows, cols=cols)
T[:, ["a", "c"]].tolist()  # [[1, 3], [4, 6]]
```

### Boolean / predicate indexing

When you pass a callable predicate as a selector, the same dimensionality rules apply. If both dimensions resolve to multiple entries, the result is a Table.

```python
T = Table.from_list(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    row_labels=["train", "val", "test"],
    col_labels=["a", "b", "c"],
)

# Predicate on rows → Table (multiple rows, all cols)
S = T[lambda r: r != "val", :]
# S is Table(shape=(2, 3), rows=rows, cols=cols)
S.rows.labels   # ["train", "test"]
S.tolist()      # [[1, 2, 3], [7, 8, 9]]

# Predicate on cols → Table (all rows, filtered cols)
S2 = T[:, lambda c: c > "a"]
S2.cols.labels  # ["b", "c"]
S2.tolist()     # [[2, 3], [5, 6], [8, 9]]
```

### Edge cases

**`from_list` with empty list**: Raises `ShapeError` because the underlying NDArray cannot infer shape from `[]`.

```python
Table.from_list([], row_labels=[], col_labels=[])
# ShapeError: array(): expected list or nested list
```

**`from_list` with mismatched label lengths**: Raises `ShapeError` in `__post_init__` because the number of labels on each axis must match the corresponding dimension of the data.
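The kind of check involved can be sketched with a standard-library dataclass. Everything here is a stand-in for illustration: `LabeledGrid` and this `ShapeError` definition are hypothetical, and nested lists take the place of the `NDArray`; latpy's actual validation may differ.

```python
from dataclasses import dataclass


class ShapeError(ValueError):
    """Stand-in for latpy's ShapeError (hypothetical definition)."""


@dataclass
class LabeledGrid:
    data: list        # nested list standing in for the NDArray
    row_labels: list
    col_labels: list

    def __post_init__(self):
        # Runs right after the generated __init__, so bad input never
        # produces a half-valid object.
        n_rows = len(self.data)
        n_cols = len(self.data[0]) if self.data else 0
        if len(self.row_labels) != n_rows:
            raise ShapeError("row labels length must match data.shape[0]")
        if len(self.col_labels) != n_cols:
            raise ShapeError("col labels length must match data.shape[1]")


try:
    LabeledGrid([[1, 2], [3, 4]], ["r1"], ["a", "b"])
except ShapeError as exc:
    print(exc)  # row labels length must match data.shape[0]
```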
```python
Table.from_list(
    [[1, 2], [3, 4]],
    row_labels=["r1"],  # length 1, data has 2 rows
    col_labels=["a", "b"],
)
# ShapeError: Table: row labels length must match data.shape[0]
```

The examples below use the 2×3 table `T` from the indexing section (rows `"r1"`, `"r2"`; columns `"a"`, `"b"`, `"c"`).

**Indexing with an unknown label**: Raises `ShapeError` from `Axis.pos`.

```python
T[:, "z"]  # ShapeError: Axis[cols]: unknown label 'z'
```

**Indexing with out-of-range integer**: Also a `ShapeError`.

```python
T[10, :]  # ShapeError: Axis[rows]: index out of bounds
```

**Mixed label/integer indexing**: Labels and integers can be freely mixed. An integer is treated as a positional index, not a label name.

```python
T[0, "b"]         # scalar: 2 (row 0, column "b")
T["r1", 2]        # scalar: 3 (row "r1", column index 2)
T[1, 0]           # scalar: 4 (row 1, column 0)
T[0, ["a", "c"]]  # NDArray([1, 3]): row 0, cols "a" and "c"
```

### Computation

**`sum`**: Reduces along one axis, returning a 1D NDArray. Accepted axis names: `"rows"`, `"row"`, `"r"`, `"0"` for axis=0; `"cols"`, `"col"`, `"c"`, `"1"` for axis=1.

```python
T = Table.from_list(
    [[1, 2, 3], [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

T.sum(axis="rows")  # NDArray([5, 7, 9]): sum down each column
T.sum(axis="cols")  # NDArray([6, 15]): sum across each row
```

**`where`**: Element-wise select between two values based on a boolean mask.
The mask can be:

- A Table with `b1` data
- A callable `(row_label, col_label) -> bool`
- A broadcastable NDArray with dtype `b1`

```python
T = Table.from_list(
    [[1, 2], [3, 4]],
    row_labels=["a", "b"],
    col_labels=["x", "y"],
)

# Mask via callable: put 99 wherever label pair matches
result = T.where(lambda r, c: r == "a" and c == "x", 99, T)
result.tolist()  # [[99, 2], [3, 4]]
# result is a Table with same shape and labels

# Mask via b1 Table
from latpy.latmath.array import zeros
from latpy.latmath.array.dtypes import B1

mask_arr = zeros((2, 2), B1)
mask_arr[0, 0] = 1
mask_t = Table(mask_arr, T.rows, T.cols)
result2 = T.where(mask_t, T, 0)
result2.tolist()  # [[1, 0], [0, 0]]
```

---

## GroupBy

```python
from latpy.latdata import GroupBy
```

**GroupBy** implements split-apply-combine over a Table's rows. You group rows by a column of the table, by row label, or by a callable predicate, then apply an aggregation (sum, mean, count, min, max).

**Rationale:** GroupBy is the classic split-apply-combine pattern from SQL/pandas. Splitting by row labels is the most common use case for labeled data, since row labels often encode categorical membership (e.g. `"train"`, `"val"`, `"test"`). Each aggregation produces a new Table whose row labels are the group keys and whose columns match the original table.
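Split-apply-combine over row labels can be sketched in plain Python with nested lists. This is a standalone illustration rather than latpy's code: `group_sum` is a hypothetical helper, and ordering the groups by first appearance is an assumption.

```python
def group_sum(rows, labels):
    """Sum rows that share a label; groups appear in first-seen order."""
    groups = {}  # dicts preserve insertion order (Python 3.7+)
    for label, row in zip(labels, rows):
        if label not in groups:
            # Split: first row of a new group (copied so the input is untouched)
            groups[label] = list(row)
        else:
            # Apply: accumulate column-wise into the running sum
            groups[label] = [acc + v for acc, v in zip(groups[label], row)]
    # Combine: group keys become the new row labels
    return list(groups.keys()), list(groups.values())


keys, sums = group_sum([[1, 2], [3, 4], [5, 6]], ["a", "b", "a"])
print(keys)  # ['a', 'b']
print(sums)  # [[6, 8], [3, 4]]
```

Other aggregations follow the same shape; only the accumulation step changes.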
| Signature | Description |
|---|---|
| `GroupBy(table, by)` | Group rows by label string, label list, or callable predicate |
| `.sum() -> Table` | Sum of each group along columns |
| `.mean() -> Table` | Mean of each group along columns |
| `.count() -> Table` | Row count of each group |
| `.min() -> Table` | Minimum of each group along columns |
| `.max() -> Table` | Maximum of each group along columns |

### Basic example

```python
t = Table.from_list(
    [[1, 2], [3, 4], [5, 6]],
    row_labels=["a", "b", "a"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, "row")  # group by row label
result = gb.sum()
result.tolist()     # [[6, 8], [3, 4]]
result.rows.labels  # ["a", "b"] (group keys become row labels)
result.cols.labels  # ["x", "y"] (columns preserved)
```

**Output of each aggregation on this data:**

| Aggregation | Output |
|---|---|
| `gb.sum()` | `[[6, 8], [3, 4]]` |
| `gb.mean()` | `[[3.0, 4.0], [3.0, 4.0]]` |
| `gb.count()` | `[[2, 2], [1, 1]]` |
| `gb.min()` | `[[1, 2], [3, 4]]` |
| `gb.max()` | `[[5, 6], [3, 4]]` |

### Chaining

GroupBy results are Tables, so they can be further indexed or converted.

```python
# Group → sum → extract one column as NDArray
GroupBy(t, "row").sum()[:, "y"]  # NDArray([8, 4])

# Group → mean → convert to plain Python lists
GroupBy(t, "row").mean().tolist()  # [[3.0, 4.0], [3.0, 4.0]]

# Group → sum → sum across columns (one total per group)
GroupBy(t, "row").sum().sum(axis="cols")  # NDArray([14, 7])
```

### Grouping by callable predicate

When you pass a callable, it receives each row label and the return value is stringified to form the group key.

```python
t = Table.from_list(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    row_labels=["set_a", "set_b", "set_a", "set_b"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, lambda lbl: lbl.split("_")[1])
gb.sum().tolist()     # [[6, 8], [10, 12]]
gb.sum().rows.labels  # ["a", "b"]
```

### Edge cases

**Group by string that doesn't match any row label**: The `by` parameter refers to row labels.
If you pass a string that doesn't match any row label, all rows fall into a single group named by that string.

```python
t = Table.from_list([[1], [2]], row_labels=["a", "b"], col_labels=["x"])
gb = GroupBy(t, "z")    # no row label "z"
gb.sum().rows.labels    # ["z"]: all rows grouped together
gb.sum().tolist()       # [[3]]
gb.count().tolist()     # [[2]]
```

**Single-element groups**: Groups with exactly one row work naturally; the aggregation returns that row's values unchanged.

```python
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "b"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()    # [[1, 2], [3, 4]] (each row is its own group)
GroupBy(t, "row").count().tolist()  # [[1, 1], [1, 1]]
```

**All-identical row labels**: When all rows share the same label, there is one group containing all rows. Aggregation behaves normally.

```python
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "a"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()    # [[4, 6]]
GroupBy(t, "row").mean().tolist()   # [[2.0, 3.0]]
GroupBy(t, "row").count().tolist()  # [[2, 2]]
```

**Empty table**: If the table has no rows (shape `(0, n_cols)`), GroupBy produces zero groups and aggregation returns an empty Table with shape `(0, n_cols)`.

---

## Design rationale

### Table wraps NDArray with labels, not a full DataFrame clone

The Table deliberately does **not** replicate pandas DataFrames:

- **No multi-index**: Axes are flat label lists. Hierarchical indexing is left to the application layer.
- **No joins/merges**: Table is for in-memory computation on a single grid. Combining tables is done via `NDArray` operations.
- **No structural mutation**: There is no `drop`, `insert`, or `rename` on Table. Create a new Table with the desired axes instead.
- **1D results are NDArrays, not 1-column Tables**: This keeps the API simple and ensures computed vectors can be immediately used in math ops.
- **Inherits NDArray computation**: Methods like `where` and `sum` delegate directly to NDArray, so performance characteristics match the underlying array library.

### What Table does well

- Clean `T[rows, cols]` syntax with label, slice, list, range, callable, and alias selectors.
- Interop with NDArray for computation.
- Lightweight: no dependency on pandas; uses only NDArray + standard library.
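To make the selector kinds listed above concrete, here is a simplified, standalone sketch of how one selector could be turned into integer positions. It is not latpy's `Axis.resolve`: the function below is hypothetical, returns only a list of positions instead of the `(kind, index, labels)` tuple, raises the built-in `KeyError` and `IndexError` rather than `ShapeError`, and assumes alias names take precedence over labels.

```python
def resolve(labels, sel, aliases=None):
    """Turn a selector into a list of integer positions (simplified)."""
    aliases = aliases or {}
    index = {}
    for pos, label in enumerate(labels):
        index.setdefault(label, pos)  # first occurrence wins

    # Alias names expand to the selector they were registered with.
    if isinstance(sel, str) and sel in aliases:
        return resolve(labels, aliases[sel], aliases)
    if sel is None:
        return list(range(len(labels)))        # None means "all labels"
    if isinstance(sel, slice):
        return list(range(len(labels)))[sel]   # positional slice
    if isinstance(sel, int):
        return [range(len(labels))[sel]]       # negatives wrap, IndexError if out of range
    if callable(sel):
        return [i for i, label in enumerate(labels) if sel(label)]
    if isinstance(sel, tuple) and len(sel) == 2:
        start, stop = index[sel[0]], index[sel[1]]
        step = 1 if start <= stop else -1
        return list(range(start, stop + step, step))  # inclusive, may run backwards
    if isinstance(sel, list):
        return [index[label] for label in sel]
    return [index[sel]]                        # single label, KeyError if unknown


cities = ["NYC", "LA", "CHI", "HOU"]
aliases = {"coasts": ["NYC", "LA"]}

print(resolve(cities, -1))                     # [3]
print(resolve(cities, ("CHI", "LA")))          # [2, 1]
print(resolve(cities, lambda x: len(x) == 3))  # [0, 2, 3]
print(resolve(cities, "coasts", aliases))      # [0, 1]
```

Reducing every selector to positions first is what lets the actual element access stay a plain array operation.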