# `latdata` — Tabular Data

Named-axis data structures built on top of `NDArray`.

The `latdata` module provides lightweight labeled 2D containers. The core design
philosophy is that a **Table is an NDArray with row and column labels** — not a
full DataFrame clone. This means:

- Selection always delegates to `NDArray` under the hood, and a Table inherits
  `NDArray`'s computation methods.
- Label-based indexing is the primary interface, but integer positional indexing
  also works.
- There is no separate index object, no multi-index, and no set-operations on
  labels. The Axis is simply an ordered list of names with optional shortcuts
  (aliases).

---

## Axis

```python
from latpy.latdata import Axis
```

An **Axis** is an ordered list of labels with support for name-based lookups
and user-defined aliases. Every Table has two Axes (`.rows` and `.cols`).

| Signature | Description |
|---|---|
| `Axis(name, labels)` | Create axis with string name and label list |
| `len(axis)` | Number of labels |
| `axis.has(label) -> bool` | Check if label exists |
| `axis.pos(label) -> int` | Integer position of label (0-based) |
| `axis.alias(name, sel)` | Register a named selector (slice, int, label list, predicate) |
| `axis.resolve(sel) -> tuple` | Convert a user selector to a `(kind, index, labels)` tuple |

**Selectors** (accepted by `resolve`):

| Selector Type | Example | Meaning |
|---|---|---|
| `slice` | `1:5:2` | Positional slice (integers only) |
| `int` | `2` | Direct integer index (supports negative) |
| `Label` | `"NYC"` | Single label lookup |
| `Sequence[Label]` | `["a", "b", "c"]` | List of labels |
| `(Label, Label)` | `("jan", "jun")` | Inclusive label range (by axis order, supports reverse) |
| `Callable[[Label], bool]` | `lambda x: x.startswith("t")` | Predicate filter |
| Alias name (str) | `"top_10"` | Expands to previously registered selector |

### Basic usage

```python
axis = Axis("city", ["NYC", "LA", "CHI", "HOU"])

len(axis)          # 4
axis.has("LA")     # True
axis.has("SF")     # False
axis.pos("CHI")    # 2

# Resolve various selectors:
axis.resolve(0)            # ("int", 0, ["NYC"])
axis.resolve(-1)           # ("int", 3, ["HOU"])   — negative wraps
axis.resolve("LA")         # ("int", 1, ["LA"])
axis.resolve(["NYC", "HOU"])  # ("list", [0, 3], ["NYC", "HOU"])
axis.resolve(("CHI", "LA"))   # ("slice", slice(2, 0, -1), ["CHI", "LA"]) — reverse range
axis.resolve(lambda x: len(x) == 3)  # ("list", [0, 2, 3], ["NYC", "CHI", "HOU"])
```
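
The reverse range is the least obvious selector above. The following is a minimal pure-Python sketch of how an inclusive label range could map to a slice. `label_range` is a hypothetical helper written for illustration, not part of `latdata`, and the real `resolve` may be implemented differently.

```python
def label_range(labels, start, stop):
    """Hypothetical helper: inclusive label range as (slice, selected_labels)."""
    index = {}
    for position, label in enumerate(labels):
        index.setdefault(label, position)  # first occurrence wins
    i, j = index[start], index[stop]
    if i <= j:
        sl = slice(i, j + 1, 1)
    else:
        # Reverse range: stop one position before j. When j is 0 the stop
        # must be None, because -1 would wrap around to the last element.
        sl = slice(i, j - 1 if j > 0 else None, -1)
    return sl, labels[sl]


labels = ["NYC", "LA", "CHI", "HOU"]
print(label_range(labels, "LA", "HOU"))   # (slice(1, 4, 1), ['LA', 'CHI', 'HOU'])
print(label_range(labels, "CHI", "LA"))   # (slice(2, 0, -1), ['CHI', 'LA'])
print(label_range(labels, "CHI", "NYC"))  # (slice(2, None, -1), ['CHI', 'LA', 'NYC'])
```

The last call shows why a reverse range that ends at position 0 needs special handling in any slice-based implementation.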

### Aliases

Aliases are named shortcuts. They are stored by name and expanded during
`resolve`. This is useful for giving semantic names to frequently used
sub-selections, e.g. `"features"`, `"top_10"`, `"validation_set"`.

**Rationale:** Aliases make code self-documenting. Instead of repeating
`table[:, ["col1", "col3", "col7"]]` everywhere, you define
`table.cols.alias("features", ["col1", "col3", "col7"])` once and write
`table[:, "features"]` thereafter.

```python
axis = Axis("row", ["train", "val", "test"])
axis.alias("train_val", ["train", "val"])
axis.alias("not_test", lambda x: x != "test")

axis.resolve("train_val")  # ("list", [0, 1], ["train", "val"])
axis.resolve("not_test")   # ("list", [0, 1], ["train", "val"])
```

Aliases can wrap any selector type: a single label, a list, a slice, a
callable, or even another alias name.
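
As a rough illustration of alias-to-alias expansion, the sketch below stores selectors by name and follows alias names until it reaches a non-alias selector. `AliasBook` and its cycle check are assumptions made for this example; the source does not describe how `Axis` stores aliases or whether it guards against cycles.

```python
class AliasBook:
    """Hypothetical stand-in for an Axis's alias storage (not the latdata class)."""

    def __init__(self):
        self._aliases = {}

    def alias(self, name, sel):
        # Registration only stores the selector; nothing is checked yet.
        self._aliases[name] = sel

    def expand(self, sel):
        # Follow alias names until a non-alias selector is reached.
        seen = set()
        while isinstance(sel, str) and sel in self._aliases:
            if sel in seen:
                raise ValueError(f"alias cycle at {sel!r}")
            seen.add(sel)
            sel = self._aliases[sel]
        return sel


book = AliasBook()
book.alias("train_val", ["train", "val"])
book.alias("fit", "train_val")        # an alias that wraps another alias

print(book.expand("fit"))    # ['train', 'val']
print(book.expand("test"))   # test  (not an alias, returned unchanged)
```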

### Edge cases

**Empty labels list** — Axis allows zero labels. `resolve` and `pos` will
raise errors on lookups, but `len(axis)` returns 0 and the Axis can still
store aliases.

```python
empty = Axis("x", [])
len(empty)          # 0
empty.has("a")      # False
empty.pos("a")      # ShapeError: Axis[x]: unknown label 'a'
```

**Duplicate labels** — When constructing an Axis, duplicate labels are
silently accepted rather than deduplicated. The `labels` list keeps every
entry, but the internal `_index` dict stores only the first position of each
label, so lookups resolve to the **first** occurrence.

```python
ax = Axis("x", ["a", "b", "a"])   # duplicate "a"
len(ax)                 # 3
ax.has("a")             # True
ax.pos("a")             # 0   (first occurrence)
ax.labels               # ["a", "b", "a"]  (duplicates are kept)
```

This can produce surprising behavior — `resolve` returns label lists
based on the stored labels, but positional lookups always use the first
occurrence. Users are encouraged to avoid duplicate labels.
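
One way to picture the first-occurrence rule, and to check a label list before building an Axis, is the plain-Python sketch below. It uses only the standard library; the `index` dict mimics the behavior described for `_index` but is not the library's code.

```python
from collections import Counter

labels = ["a", "b", "a"]

# First-occurrence-wins index: setdefault never overwrites an existing key.
index = {}
for position, label in enumerate(labels):
    index.setdefault(label, position)

# Labels that appear more than once, in first-seen order.
duplicates = [label for label, n in Counter(labels).items() if n > 1]

print(index)       # {'a': 0, 'b': 1}
print(duplicates)  # ['a']
```

A non-empty `duplicates` list is a cheap signal that the labels should be made unique before constructing the Axis.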

**Alias with an unknown label** — If an alias name is registered but its
selector refers to a label that does not exist, the error surfaces at
`resolve` time, not at `alias` registration time.

```python
ax = Axis("x", ["a", "b"])
ax.alias("bad", ["a", "z"])        # registers fine
ax.resolve("bad")                  # ShapeError: Axis[x]: unknown label 'z'
```

**Resolve with a missing label** — Direct label lookup on a non-existent
label raises `ShapeError`.

```python
ax = Axis("x", ["a", "b"])
ax.resolve("z")     # ShapeError: Axis[x]: unknown label 'z'
```

**Resolve with `None`** — Special case: `None` is treated as
`slice(None)`, meaning "all labels".

```python
ax = Axis("x", ["a", "b", "c"])
ax.resolve(None)  # ("slice", slice(0, 3, 1), ["a", "b", "c"])
```
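
The normalized slice shown above matches what Python's built-in `slice.indices` produces for an open slice. Whether `resolve` uses `slice.indices` internally is an assumption; the sketch only demonstrates the normalization itself.

```python
labels = ["a", "b", "c"]

# slice.indices(n) clamps a slice to a sequence of length n.
start, stop, step = slice(None).indices(len(labels))
normalized = slice(start, stop, step)

print(normalized)          # slice(0, 3, 1)
print(labels[normalized])  # ['a', 'b', 'c']

# The same normalization for an empty sequence selects nothing.
print(slice(*slice(None).indices(0)))  # slice(0, 0, 1)
```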

---

## Table

```python
from latpy.latdata import Table
```

A **Table** is a 2D labeled grid: a single `NDArray` together with a row
`Axis` and a column `Axis`. It is not a full DataFrame — it intentionally
omits pivot tables, joins, stacked operations, and fancy index manipulation.
Its strength is label-based reading and writing with a clean `T[rows, cols]`
syntax.

| Signature | Description |
|---|---|
| `Table.from_list(values_2d, row_labels, col_labels, dtype=None, name_rows="rows", name_cols="cols") -> Table` | Construct from nested lists; `row_labels` and `col_labels` may be omitted, in which case labels are auto-generated |
| `.data` | `NDArray` backing store |
| `.rows` | Row `Axis` |
| `.cols` | Column `Axis` |
| `.tolist() -> list[list]` | Convert to Python lists |
| `table[key] -> scalar \| NDArray \| Table` | Index by label / integer / slice |
| `.where(mask, a, b) -> Table` | Element-wise select (like `ndarray.where`) |
| `.sum(axis="rows") -> NDArray` | Sum over rows or columns |

### Construction

```python
# From nested lists with explicit labels
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)
# T.data is NDArray of shape (2, 3), dtype i64
# T.rows → Axis("rows", ["r1", "r2"])
# T.cols → Axis("cols", ["a", "b", "c"])
# T.tolist() → [[1, 2, 3], [4, 5, 6]]

# Auto-generated labels when omitted:
T2 = Table.from_list([[1, 2], [3, 4]])
# T2.rows.labels → ["r0", "r1"]
# T2.cols.labels → ["c0", "c1"]
```

### Indexing — return-type rules

The indexing expression `T[rows, cols]` follows these rules:

| rows selector | cols selector | return type | Example |
|---|---|---|---|
| single label/int | single label/int | **scalar** | `T["r1", "b"] → 2` |
| single label/int | multiple | **1D NDArray** | `T["r1", :] → [1,2,3]` |
| multiple | single label/int | **1D NDArray** | `T[:, "b"] → [2,5]` |
| `:` | `:` | **Table** (view) | `T[:, :] → Table` |
| multiple (non-slice) | multiple (non-slice) | **Table** (copy) | `T[["r1"], ["a","c"]] → Table` |

**Rationale:** 1D results are plain `NDArray` objects (not 1-column Tables)
because a 1D NDArray is simpler, lighter, and interoperates directly with
mathematical operations. A 2D result is always wrapped in a Table so that
labels are preserved for further chaining.

```python
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

# Scalar
T["r1", "b"]      # → 2  (Python int)

# 1D NDArray (single row, all cols)
T["r1", :]        # → NDArray(shape=(3,), dtype=i64, axes=("cols",))
T["r1", :].tolist()  # [1, 2, 3]

# 1D NDArray (all rows, single column)
T[:, "b"]         # → NDArray(shape=(2,), dtype=i64, axes=("rows",))
T[:, "b"].tolist()  # [2, 5]

# Table (row range, column range)
S = T[("r1", "r2"), ("a", "b")]  # inclusive label ranges are tuples; slices are positional
# S is Table(shape=(2, 2), rows=rows, cols=cols)
S.tolist()        # [[1, 2], [4, 5]]

# Table (full slice — view)
S = T[:, :]
S.tolist()        # [[1, 2, 3], [4, 5, 6]]

# Column shortcut: T["a"] is T[:, "a"]
T["a"]            # NDArray([1, 4])

# List-of-labels selection
T[:, ["a", "c"]]  # Table(shape=(2, 2), rows=rows, cols=cols)
T[:, ["a", "c"]].tolist()  # [[1, 3], [4, 6]]
```
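
The return-type rules reduce to counting how many of the two selectors pick a single entry. The sketch below encodes that idea in plain Python. `is_single` and `result_kind` are hypothetical names, and the sketch deliberately ignores aliases, which are strings that may expand to several labels.

```python
def is_single(sel) -> bool:
    """Assumption: ints and plain strings select one entry; everything else many."""
    return isinstance(sel, (int, str)) and not isinstance(sel, bool)


def result_kind(row_sel, col_sel) -> str:
    # Two single selectors give a scalar, one gives a 1D array, none gives a Table.
    singles = is_single(row_sel) + is_single(col_sel)
    return {2: "scalar", 1: "1D NDArray", 0: "Table"}[singles]


print(result_kind("r1", "b"))                 # scalar
print(result_kind("r1", slice(None)))         # 1D NDArray
print(result_kind(slice(None), ["a", "c"]))   # Table
print(result_kind(("r1", "r2"), ("a", "b")))  # Table
```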

### Boolean / predicate indexing

When you pass a callable predicate as a selector, the same dimensionality
rules apply. If both dimensions resolve to multiple entries, the result is
a Table.

```python
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    row_labels=["train", "val", "test"],
    col_labels=["a", "b", "c"],
)

# Predicate on rows → Table (multiple rows, all cols)
S = T[lambda r: r != "val", :]
# S is Table(shape=(2, 3), rows=rows, cols=cols)
S.rows.labels  # ["train", "test"]
S.tolist()     # [[1, 2, 3], [7, 8, 9]]

# Predicate on cols → Table (all rows, filtered cols)
S2 = T[:, lambda c: c > "a"]
S2.cols.labels  # ["b", "c"]
S2.tolist()     # [[2, 3], [5, 6], [8, 9]]
```
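
A predicate selector can be thought of as a filter over the axis labels that yields positions. The following sketch, using a hypothetical `select_positions` helper on plain lists, reproduces the two selections above and makes the string comparison explicit.

```python
def select_positions(labels, predicate):
    """Hypothetical helper: positions and labels for which predicate is true."""
    positions = [i for i, label in enumerate(labels) if predicate(label)]
    return positions, [labels[i] for i in positions]


print(select_positions(["train", "val", "test"], lambda r: r != "val"))
# ([0, 2], ['train', 'test'])

# String comparison is lexicographic, so c > "a" keeps "b" and "c".
print(select_positions(["a", "b", "c"], lambda c: c > "a"))
# ([1, 2], ['b', 'c'])
```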

### Edge cases

**`from_list` with empty list** — Raises `ShapeError` because the
underlying NDArray cannot infer shape from `[]`.

```python
Table.from_list([], row_labels=[], col_labels=[])
# ShapeError: array(): expected list or nested list
```

**`from_list` with mismatched label lengths** — Raises `ShapeError` in
`__post_init__` because the data shape must match the label count.

```python
Table.from_list(
    [[1, 2], [3, 4]],
    row_labels=["r1"],         # length 1, data has 2 rows
    col_labels=["a", "b"],
)
# ShapeError: Table: row labels length must match data.shape[0]
```

**Indexing with an unknown label** — Raises `ShapeError` from
`Axis.pos`.

```python
T[:, "z"]   # ShapeError: Axis[cols]: unknown label 'z'
```

**Indexing with out-of-range integer** — Also a `ShapeError`.

```python
T[10, :]    # ShapeError: Axis[rows]: index out of bounds
```

**Mixed label/integer indexing** — Labels and integers can be freely
mixed. An integer is treated as a positional index, not a label name.

```python
T[0, "b"]          # scalar: 2  (row 0, column "b")
T["r1", 2]         # scalar: 3  (row "r1", column index 2)
T[1, 0]            # scalar: 4  (row 1, column 0)
T[0, ["a", "c"]]   # NDArray([1, 3])  — row 0, cols "a" and "c"
```

### Computation

**`sum`** — Reduces along one axis, returning a 1D NDArray. Compatible
axis name aliases: `"rows"`, `"row"`, `"r"`, `"0"` for axis=0;
`"cols"`, `"col"`, `"c"`, `"1"` for axis=1.

```python
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

T.sum(axis="rows")   # NDArray([5, 7, 9])   — sum down each column
T.sum(axis="cols")   # NDArray([6, 15])     — sum across each row
```
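
The axis naming can be confusing at first: `axis="rows"` reduces over the rows and therefore returns one value per column. The plain-list sketch below illustrates this. `AXIS_NAMES` and `table_sum` are hypothetical and simply mirror the accepted names listed above.

```python
# Assumed mapping, copied from the accepted axis names listed in this section.
AXIS_NAMES = {"rows": 0, "row": 0, "r": 0, "0": 0,
              "cols": 1, "col": 1, "c": 1, "1": 1}


def table_sum(values, axis="rows"):
    if AXIS_NAMES[axis] == 0:
        # Reduce over rows: one total per column.
        return [sum(column) for column in zip(*values)]
    # Reduce over columns: one total per row.
    return [sum(row) for row in values]


values = [[1, 2, 3],
          [4, 5, 6]]
print(table_sum(values, "rows"))  # [5, 7, 9]
print(table_sum(values, "c"))     # [6, 15]
```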

**`where`** — Element-wise select between two values based on a boolean
mask. The mask can be:

- A Table with `b1` data
- A callable `(row_label, col_label) -> bool`
- A broadcastable NDArray with dtype `b1`

```python
T = Table.from_list(
    [[1, 2],
     [3, 4]],
    row_labels=["a", "b"],
    col_labels=["x", "y"],
)

# Mask via callable: put 99 wherever label pair matches
result = T.where(lambda r, c: r == "a" and c == "x", 99, T)
result.tolist()  # [[99, 2], [3, 4]]
# result is a Table with same shape and labels

# Mask via b1 Table
from latpy.latmath.array import zeros
from latpy.latmath.array.dtypes import B1
mask_arr = zeros((2, 2), B1)
mask_arr[0, 0] = 1
mask_t = Table(mask_arr, T.rows, T.cols)
result2 = T.where(mask_t, T, 0)
result2.tolist()  # [[1, 0], [0, 0]]
```
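
To make the callable-mask semantics concrete, here is a minimal sketch on nested lists. `where_by_label` and `pick` are hypothetical helpers, and scalar broadcasting of `a` and `b` is assumed to work the way the examples above suggest.

```python
def pick(source, i, j):
    # Scalars are broadcast; nested lists are read element-wise.
    return source[i][j] if isinstance(source, list) else source


def where_by_label(predicate, a, b, row_labels, col_labels):
    """Hypothetical helper: take a where predicate(row, col) is true, else b."""
    return [
        [pick(a, i, j) if predicate(r, c) else pick(b, i, j)
         for j, c in enumerate(col_labels)]
        for i, r in enumerate(row_labels)
    ]


values = [[1, 2],
          [3, 4]]
rows, cols = ["a", "b"], ["x", "y"]

print(where_by_label(lambda r, c: r == "a" and c == "x", 99, values, rows, cols))
# [[99, 2], [3, 4]]
print(where_by_label(lambda r, c: r == "a" and c == "x", values, 0, rows, cols))
# [[1, 0], [0, 0]]
```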

---

## GroupBy

```python
from latpy.latdata import GroupBy
```

**GroupBy** implements split-apply-combine over a Table's rows. You group
rows by row label, by a label list, or by a callable applied to each row
label, then apply an aggregation (sum, mean, count, min, max).

**Rationale:** GroupBy is the classic split-apply-combine pattern from
SQL/pandas. Splitting by row labels is the most common use case for labeled
data, since row labels often encode categorical membership (e.g. `"train"`,
`"val"`, `"test"`). Each aggregation produces a new Table whose row labels
are the group keys and whose columns match the original table.

| Signature | Description |
|---|---|
| `GroupBy(table, by)` | Group rows by label string, label list, or callable applied to each row label |
| `.sum() -> Table` | Per-column sum within each group |
| `.mean() -> Table` | Per-column mean within each group |
| `.count() -> Table` | Row count of each group, repeated for every column |
| `.min() -> Table` | Per-column minimum within each group |
| `.max() -> Table` | Per-column maximum within each group |

### Basic example

```python
t = Table.from_list(
    [[1, 2],
     [3, 4],
     [5, 6]],
    row_labels=["a", "b", "a"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, "row")      # group by row label
result = gb.sum()
result.tolist()              # [[6, 8], [3, 4]]
result.rows.labels           # ["a", "b"]   (group keys become row labels)
result.cols.labels           # ["x", "y"]   (columns preserved)
```

**Output of each aggregation on this data:**

| Aggregation | Output |
|---|---|
| `gb.sum()` | `[[6, 8], [3, 4]]` |
| `gb.mean()` | `[[3.0, 4.0], [3.0, 4.0]]` |
| `gb.count()` | `[[2, 2], [1, 1]]` |
| `gb.min()` | `[[1, 2], [3, 4]]` |
| `gb.max()` | `[[5, 6], [3, 4]]` |
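
The outputs in this table can be reproduced with plain Python lists, which may help when reasoning about what each aggregation does. The sketch below is a generic split-apply-combine, not the `GroupBy` implementation; `group_rows` and `aggregate` are hypothetical names.

```python
def group_rows(values, row_labels, key=lambda label: label):
    """Split: collect rows under str(key(label)), in first-seen key order."""
    groups = {}
    for label, row in zip(row_labels, values):
        groups.setdefault(str(key(label)), []).append(row)
    return groups


def aggregate(groups, fn):
    """Apply fn to each column of each group, then combine into one row per key."""
    return {k: [fn(column) for column in zip(*rows)] for k, rows in groups.items()}


values = [[1, 2], [3, 4], [5, 6]]
groups = group_rows(values, ["a", "b", "a"])

print(aggregate(groups, sum))   # {'a': [6, 8], 'b': [3, 4]}
print(aggregate(groups, len))   # {'a': [2, 2], 'b': [1, 1]}
print(aggregate(groups, max))   # {'a': [5, 6], 'b': [3, 4]}
print(aggregate(groups, lambda column: sum(column) / len(column)))
# {'a': [3.0, 4.0], 'b': [3.0, 4.0]}
```

Because a `dict` preserves insertion order, group keys come out in the order they first appear among the row labels, which matches the `["a", "b"]` ordering shown in the basic example.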

### Chaining

GroupBy results are Tables, so they can be further indexed or converted.

```python
# Group → sum → extract one column as NDArray
GroupBy(t, "row").sum()[:, "y"]    # NDArray([8, 4])

# Group → mean → convert to plain Python lists
GroupBy(t, "row").mean().tolist()  # [[3.0, 4.0], [3.0, 4.0]]

# Group → sum → sum across columns (one total per group)
GroupBy(t, "row").sum().sum(axis="cols")  # NDArray([14, 7])
```

### Grouping by callable

When you pass a callable, it receives each row label and the return value
is stringified to form the group key.

```python
t = Table.from_list(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    row_labels=["set_a", "set_b", "set_a", "set_b"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, lambda lbl: lbl.split("_")[1])
gb.sum().tolist()   # [[6, 8], [10, 12]]
gb.sum().rows.labels  # ["a", "b"]
```

### Edge cases

**Group by string that doesn't match any row label** — The `by` parameter
refers to row labels. Apart from `"row"`, which the examples above use to
group by row label, a string that doesn't match any row label puts all rows
into a single group named by that string.

```python
t = Table.from_list([[1], [2]], row_labels=["a", "b"], col_labels=["x"])
gb = GroupBy(t, "z")             # no row label "z"
gb.sum().rows.labels             # ["z"]   — all rows grouped together
gb.sum().tolist()                # [[3]]
gb.count().tolist()              # [[2]]
```

**Single-element groups** — Groups with exactly one row work naturally;
`sum`, `min`, and `max` return that row's values unchanged, `mean` returns
them as floats, and `count` returns 1.

```python
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "b"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()    # [[1, 2], [3, 4]]  (each row is its own group)
GroupBy(t, "row").count().tolist()  # [[1, 1], [1, 1]]
```

**All rows share one label** — When every row has the same label, there is
one group containing all rows. Aggregation behaves normally.

```python
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "a"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()      # [[4, 6]]
GroupBy(t, "row").mean().tolist()     # [[2.0, 3.0]]
GroupBy(t, "row").count().tolist()    # [[2, 2]]
```

**Empty table** — If the table has no rows (shape `(0, n)`), GroupBy
produces zero groups and aggregation returns an empty Table with shape
`(0, n)`.

---

## Design rationale

### Table wraps NDArray with labels, not a full DataFrame clone

The Table deliberately does **not** replicate pandas DataFrames:

- **No multi-index** — Axes are flat label lists. Hierarchical indexing is
  left to the application layer.
- **No joins/merges** — Table is for in-memory computation on a single
  grid. Combining tables is done via `NDArray` operations.
- **No structural mutation** — There is no `drop`, `insert`, or `rename` on
  Table. Create a new Table with the desired axes instead.
- **1D results are NDArrays, not 1-column Tables** — This keeps the API
  simple and ensures computed vectors can be immediately used in math ops.
- **Inherits NDArray computation** — Methods like `where` and `sum`
  delegate directly to NDArray, so performance characteristics match the
  underlying array library.

### What Table does well

- Clean `T[rows, cols]` syntax with label, slice, list, range, callable,
  and alias selectors.
- Interop with NDArray for computation.
- Lightweight — no dependency on pandas; uses only NDArray + standard
  library.