Shutong: AI Study Helper with Enhanced OCR for Academic Research

The biggest shortcoming of today’s AI is that, while absorbing and compressing its source data, it loses the ability to trace claims back to those sources.

For academic study, however, being able to trace sources is essential; it is like the foundation of a house. Opinions and ideas that can’t be traced back to their sources can’t be grounded or trusted.

LLMs are currently at this stage when answering questions, especially questions that are less scientific and have more to do with thoughts and ideas: they can answer any question, but they can’t give the sources of their answers.

From a human ethical point of view, an LLM is very good at “making things up”, but it probably doesn’t know what it’s talking about. This is one of the major shortcomings of today’s AI: its answers are untraceable, so they are hard to trust.

Shutong tries to make up for this shortcoming, more or less, by ensuring that all of its answers and responses can be traced through standard academic citations. In other words, we try to make every output accurately citable to its source material.

:warning: Keep in mind that the more the data is processed, the more information it may lose; when AI does the processing, the processing itself may introduce mistakes and inaccuracies.

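The biggest deficiency of today’s AI is that it loses the ability to trace things back to their sources while absorbing and compressing data. For academic research in the humanities, being able to trace sources is like the foundation of a house. Views and ideas that can’t be traced to their sources can’t stand or be trusted, and this is where large language models are today when answering questions in the humanities: they can answer any question, but they can’t state where their answers come from. From a human ethical point of view, an LLM is very good at “making things up”, but it quite possibly doesn’t know what it is talking about. This is truly a major flaw of today’s AI. Shutong, though not the ultimate solution, makes up for this flaw to some degree by ensuring that its answers are grounded: for everything it outputs, accurate citation information pointing to the original material can be found (specifically, it can provide footnotes and a bibliography that identify the sources).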

Document Processing Capabilities

Shutong processes various document formats with specialized OCR capabilities:

  • PDF documents: Processed with the text layout preserved
  • Traditional Chinese vertical text: Special handling for right-to-left reading order
  • PowerPoint files (pptx/ppt): Converted to images for OCR processing
  • Images: Direct OCR with layout detection
  • Word documents (docx/doc): Converted to PDF for processing
  • Audio/video: For transcribing interviews, lectures, and other references

Shutong works best with published materials, and it can also be used on unpublished documents for personal use. It is not designed for original materials that require specialized processing (manuscripts, archaeological artifacts, paintings, etc.).

Enhanced OCR Features

  • Vertical Chinese text detection: Automatically identifies top-to-bottom text with right-to-left reading order
  • Traditional character preservation: Maintains traditional Chinese characters without simplification
  • Layout preservation: Retains headers, footers, and multi-column layouts
  • Image extraction: Saves images from documents for reference
  • Citation metadata: Extracts author, title, year, and other bibliographic information when available
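
The vertical text handling listed above comes down to ordering detected text regions column by column, from right to left, and from top to bottom inside each column. The following is a minimal Python sketch of that ordering, not Shutong’s actual implementation; the TextBox class, its fields, and the half-width column tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A detected text region; x and y are the top-left corner in pixels."""
    text: str
    x: float
    y: float
    width: float


def vertical_reading_order(boxes: list[TextBox]) -> list[TextBox]:
    """Order boxes column by column: right-to-left, then top-to-bottom."""
    columns: list[list[TextBox]] = []
    # Visit boxes from the right side of the page towards the left.
    for box in sorted(boxes, key=lambda b: -b.x):
        # A box joins the current column when its x position is within
        # half a box width of the column's first box (assumed tolerance).
        if columns and abs(columns[-1][0].x - box.x) < box.width / 2:
            columns[-1].append(box)
        else:
            columns.append([box])
    # Inside each column, read from top to bottom.
    return [b for col in columns for b in sorted(col, key=lambda b: b.y)]


boxes = [
    TextBox("C", x=100, y=10, width=40),
    TextBox("A", x=300, y=10, width=40),
    TextBox("B", x=302, y=80, width=40),
    TextBox("D", x=98, y=90, width=40),
]
print("".join(b.text for b in vertical_reading_order(boxes)))  # prints "ABCD"
```

Grouping by x position first keeps slightly misaligned characters in the same column, which a plain sort on (x, y) would not guarantee.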

Features

Document Ingestion and Processing

Shutong ingests various document formats and processes them for academic research:

  • Convert documents (PDF, DOC, DOCX, PPT, PPTX, TXT) to LLM-ready formats (CSV, Markdown, JSONL)
  • Extract text from images using specialized OCR with layout preservation
  • Process traditional Chinese with correct reading order for vertical text (top-to-bottom, right-to-left)
  • Transcribe audio/video files
  • Maintain citation information (author, year, title, location, publisher, pages)
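
To illustrate how citation information can travel with the extracted text, here is a small Python sketch of one JSONL record carrying the fields listed above. The Passage class, its field names, the citation layout, and the placeholder values are hypothetical; Shutong’s real record format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Passage:
    """One page of extracted text plus its citation fields (hypothetical schema)."""
    text: str
    author: str
    year: int
    title: str
    location: str
    publisher: str
    page: int


def to_jsonl_line(passage: Passage) -> str:
    """Serialize one passage as a single JSON line."""
    # ensure_ascii=False keeps Chinese characters readable in the output file.
    return json.dumps(asdict(passage), ensure_ascii=False)


def format_citation(p: Passage) -> str:
    """Render a footnote-style citation from the stored fields."""
    return f"{p.author}, {p.title} ({p.location}: {p.publisher}, {p.year}), {p.page}."


# Placeholder values only; this is not a real reference.
passage = Passage(
    text="Extracted page text goes here.",
    author="Author Name",
    year=2000,
    title="Example Title",
    location="City",
    publisher="Publisher",
    page=12,
)
print(to_jsonl_line(passage))
print(format_citation(passage))
```

Keeping the citation fields on every record means any retrieved passage can be cited without a second lookup.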

Search and Retrieval Pipeline

After ingestion, Shutong offers three ways to search:

  1. Keyword search: Find specific terms in your personal database with proper citations
  2. Q&A pipeline: Ask questions and receive answers backed by citations from the ingested data
  3. Vector search: Find semantically similar content using embeddings
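
As a rough illustration of the vector search step, the Python sketch below ranks stored passages by cosine similarity to a query vector. The three-dimensional vectors and the source labels are made-up toy values; in practice the vectors would come from an embedding model, and nothing here reflects Shutong’s actual index format.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query: list[float], index: dict[str, list[float]], top_k: int = 2):
    """Return the top_k (score, source) pairs, most similar first."""
    scored = [(cosine(query, vec), ref) for ref, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy index: source label -> embedding (hypothetical values).
index = {
    "book-a.pdf, p. 3": [0.9, 0.1, 0.0],
    "book-b.pdf, p. 41": [0.0, 1.0, 0.2],
    "lecture.mp4, 00:12:30": [0.7, 0.3, 0.1],
}
for score, ref in vector_search([1.0, 0.0, 0.0], index):
    print(f"{score:.2f}  {ref}")
```

Because each vector is keyed by its source label, every hit already carries the information needed to cite it.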

Command Line Interface

Shutong provides a powerful command-line interface for document processing and search:

OCR Command

shutong ocr [FILES] --lang [auto|english|chinese] --vertical-chinese --output-dir [DIR] --save-format [md|txt]

Options:

  • --lang: Specify document language (auto, english, chinese)
  • --vertical-chinese: Enable specialized processing for traditional Chinese vertical text
  • --output-dir: Directory to save OCR results (default: output/)
  • --save-format: Format for saving results (md or txt)
  • --debug: Enable debug logging
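
For readers who want to see the option set in one place, the following Python sketch declares the same flags with argparse. It is only an illustration of the interface described above, not Shutong’s source code; the default values for --lang and --save-format are assumptions, since only the --output-dir default is documented.

```python
import argparse


def build_ocr_parser() -> argparse.ArgumentParser:
    """Declare the documented `shutong ocr` options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="shutong ocr")
    parser.add_argument("files", nargs="+", help="files to run OCR on")
    # Default "auto" is an assumption; the README does not state it.
    parser.add_argument("--lang", choices=["auto", "english", "chinese"], default="auto")
    parser.add_argument("--vertical-chinese", action="store_true")
    # Documented default: output/
    parser.add_argument("--output-dir", default="output/")
    # Default "md" is an assumption; the README does not state it.
    parser.add_argument("--save-format", choices=["md", "txt"], default="md")
    parser.add_argument("--debug", action="store_true")
    return parser


args = build_ocr_parser().parse_args(
    ["scan.pdf", "--lang", "chinese", "--vertical-chinese", "--save-format", "txt"]
)
print(args.files, args.lang, args.vertical_chinese, args.output_dir, args.save_format)
```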

Ingest Command

shutong ingest [TARGET] --ocr-engine [engine] --langs [languages] --citations-forms [format] --llm [model]

TARGET can be a folder, a file (PDF, DOC, DOCX, PPT, PPTX, images, audio, video), or a URL.

The ingest process:

  1. Converts files to JSONL or Markdown format
  2. Creates OCR’d PDFs named filename-ocr.pdf in the target folder
  3. Indexes content into ~/.study/
  4. Embeds and tokenizes content for LLM processing
  5. Preserves headers and footers to maintain page numbers for citations
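
Two of the steps above lend themselves to a short Python sketch: deriving the filename-ocr.pdf name for a source file, and reading a printed page number from a preserved footer line. Both helper functions are hypothetical, and the assumption that the page number is the only number on the last non-empty line of a page is a simplification.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def ocr_output_path(source: PurePosixPath) -> PurePosixPath:
    """Return the sibling path `<stem>-ocr.pdf` in the same folder."""
    return source.with_name(f"{source.stem}-ocr.pdf")


def printed_page_number(page_text: str) -> Optional[int]:
    """Read the page number from the last non-empty line, if it holds exactly one number."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return None
    # Accept footers such as "17" or "- 17 -"; reject lines with several numbers.
    match = re.fullmatch(r"\D*(\d+)\D*", lines[-1])
    return int(match.group(1)) if match else None


print(ocr_output_path(PurePosixPath("notes/paper.docx")))  # prints "notes/paper-ocr.pdf"
print(printed_page_number("Chapter 1\nSome body text.\n- 17 -"))  # prints "17"
```

Reading the printed number rather than counting pages matters for citations, because front matter often shifts the printed numbering away from the PDF page index.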

Search Command

shutong keywords [TERM] --lines [COUNT]

The search displays results with:

  • Citations (author, title, year, publisher or filename)
  • Page numbers containing the keyword
  • Context around the search term
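
The shape of such a result can be sketched in Python as follows. This is a simplified stand-in for the real search: the page dictionaries, their keys, and the reading of --lines as the number of context lines shown on each side of a match are assumptions, and the sample citation is a placeholder.

```python
def keyword_search(pages: list[dict], term: str, lines: int = 1) -> list[dict]:
    """Find term in each page and return citation, page number, and context lines."""
    hits = []
    for page in pages:
        text_lines = page["text"].splitlines()
        for i, line in enumerate(text_lines):
            # Case-insensitive substring match on a single line.
            if term.lower() in line.lower():
                start, end = max(0, i - lines), i + lines + 1
                hits.append({
                    "citation": page["citation"],
                    "page": page["page"],
                    "context": text_lines[start:end],
                })
    return hits


# Placeholder data; not a real reference.
pages = [
    {
        "citation": "Author Name, Example Title, 2000, Publisher",
        "page": 12,
        "text": "First line.\nThe archive holds many sources.\nLast line.",
    },
    {
        "citation": "notes.docx",
        "page": 1,
        "text": "Nothing relevant here.",
    },
]
for hit in keyword_search(pages, "sources", lines=1):
    print(hit["citation"], "p.", hit["page"])
    print("\n".join(hit["context"]))
```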

Ask Command

shutong ask [QUESTION]

The ask command is similar to paperQA, but it works with all ingested file types, not just PDFs.