Spaces: siyagajbhe/legal-llama-rag

siyagajbhe committed 27 days ago

Commit 0f37de1 (verified) · 1 Parent(s): a32d077

Create src/preprocess_caselaw.py

Files changed (1)

src/preprocess_caselaw.py ADDED

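+import json
+import os
+
+from datasets import load_dataset
+
+os.makedirs("data", exist_ok=True)
+
+
+def chunk_text(text, size=1000):
+    """Split text into consecutive, non-overlapping chunks of at most `size` characters."""
+    return [text[i:i+size] for i in range(0, len(text), size)]
+
+
+# Split slicing accepts only integer percentages, so "train[:0.1%]" fails to parse.
+# 1% is the smallest percentage slice; use an absolute count (e.g. "train[:500]") for less.
+ds = load_dataset("common-pile/caselaw_access_project", split="train[:1%]")
+
+with open("data/caselaw_chunks.jsonl", "w", encoding="utf-8") as f:
+    for item in ds:
+        # Treat a missing or null text field as empty.
+        text = item.get("text") or ""
+        # Skip documents shorter than 200 characters.
+        if len(text) < 200:
+            continue
+        # Write one JSON object per chunk, one per line (JSONL).
+        for chunk in chunk_text(text):
+            f.write(json.dumps({
+                "case_name": item.get("case_name", ""),
+                "court": item.get("court", ""),
+                "text": chunk
+            }) + "\n")
+
+print("✅ Preprocessed and saved data/caselaw_chunks.jsonl")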
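
The `chunk_text` helper in this commit cuts at fixed character offsets, so a sentence that straddles a boundary is split across two chunks. One common mitigation in retrieval pipelines is to let consecutive chunks share a few characters. The sketch below is not part of the commit: the function name `chunk_text_overlap` and the `overlap` parameter are hypothetical, and the default of 100 characters is an arbitrary assumption.

```python
def chunk_text_overlap(text, size=1000, overlap=100):
    """Hypothetical variant of chunk_text: consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    # Each chunk starts `step` characters after the previous one.
    step = size - overlap
    # Stop once the previous chunk has already reached the end of the text,
    # so no trailing chunk consists only of repeated characters.
    last_start_bound = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start_bound, step)]


# Small demonstration with a tiny size so the overlap is visible.
print(chunk_text_overlap("abcdefghij", size=4, overlap=1))
```

With `overlap=0` this sketch produces the same chunks as the original `chunk_text` for non-empty input, so it could be swapped in without changing the JSONL schema.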