DeepSeek AI: A Crash Course in Local LLMs



Introduction to DeepSeek

Welcome to this crash course on DeepSeek AI. This chapter will guide you through the fundamentals of utilizing DeepSeek, a Chinese AI company specializing in open-weight Large Language Models (LLMs). We will explore various aspects of DeepSeek, from interacting with their online platform to running their models locally on different hardware configurations.

Large Language Models (LLMs): Sophisticated artificial intelligence models trained on massive datasets of text. They are designed to understand and generate human-like text, answering questions, writing different kinds of creative content, and translating languages.

Course Outline:

This crash course will cover the following key areas:

  • DeepSeek Website Exploration: We will begin by examining the DeepSeek website and how to interact with their AI assistant, similar to using platforms like ChatGPT.
  • Local Download and Initial Capabilities (OLLAMA): Next, we will download and run a DeepSeek model locally using OLLAMA, gaining an understanding of its basic capabilities.
  • Agentic Behavior with LM Studio: We will then utilize LM Studio to run DeepSeek models locally but with added “agentic behavior,” exploring its potential as an AI assistant.
  • Hardware Considerations and Troubleshooting: We will test DeepSeek on different hardware setups, including an AI PC and a workstation with a dedicated graphics card (RTX 4080), highlighting troubleshooting steps and hardware limitations.
  • Hugging Face Transformers for Local Inference: Finally, we will delve into using the Hugging Face Transformers library to perform local inference with DeepSeek models, providing a programmatic approach.

Agentic Behavior: In the context of AI models, this refers to the model exhibiting proactive, goal-directed behavior. It can involve reasoning, planning, and taking actions to achieve a specific objective, often beyond simple response generation.

Primer on DeepSeek AI

Before diving into practical applications, let’s understand what DeepSeek is and what they offer. DeepSeek is a Chinese company known for creating open-weight LLMs.

Open-weight Models: AI models where the model weights (the parameters learned during training that define the model’s behavior) are publicly available. This allows for greater transparency, customization, and community contribution compared to closed-source models.

DeepSeek’s Model Family:

DeepSeek offers a range of open-weight models, including:

  • R1 Series: The primary focus of this crash course, particularly DeepSeek R1.
  • R1.5: An enhanced version of R1.
  • R1-Zero: The precursor to R1, trained purely via large-scale reinforcement learning.
  • DeepSeek V3: Utilized on the DeepSeek.com website, a more advanced model compared to R1.
  • DeepSeek Math and DeepSeek Coder: Specialized models for mathematical and coding tasks.
  • MoE (Mixture of Experts) and MoM (Mixture of Models): Advanced model architectures that combine multiple sub-models for improved performance.

While DeepSeek offers various models, this chapter will primarily concentrate on DeepSeek R1 and initially explore DeepSeek V3 as it is available on their website.

DeepSeek R1: Key Features and Significance

DeepSeek R1 is built upon the foundation of DeepSeek R1-Zero, a model detailed in DeepSeek’s research paper.

Reinforcement Learning: A type of machine learning where an agent learns to make decisions in an environment to maximize a reward. In the context of LLMs, it can be used to refine model behavior based on feedback.

DeepSeek R1-Zero was trained using large-scale reinforcement learning without supervised fine-tuning. It demonstrated impressive reasoning capabilities but suffered from issues like poor readability and language mixing.

DeepSeek R1 was subsequently trained to address these shortcomings, aiming for performance comparable to OpenAI’s models. Benchmarks presented by DeepSeek suggest their models often outperform or match OpenAI in various text generation tasks.

Important Note: DeepSeek R1 is primarily focused on text generation and does not inherently possess capabilities like image recognition or other modalities.

The primary significance of DeepSeek R1 lies in its cost-efficiency. It is speculated to offer a 95-97% reduction in cost compared to OpenAI models. This drastic reduction in training and operational costs is a major advantage, as training and running large models typically involve millions of dollars in compute resources. DeepSeek reportedly trained R1 for approximately $5 million, a fraction of the cost associated with similar models.

The emergence of cost-effective models like DeepSeek R1 has even impacted the stock market, with chip manufacturers experiencing stock drops as companies reconsider the necessity for expensive, high-compute hardware.

This crash course aims to explore DeepSeek R1, investigate its capabilities, and identify its limitations in local deployment scenarios.

Hardware Considerations for Local LLM Deployment

Running LLMs locally is heavily dependent on your hardware capabilities. While cloud-based solutions exist, local deployment offers better control and privacy, making it a worthwhile investment to understand hardware requirements and limitations.

Hardware Used in this Course:

This course utilizes two primary hardware setups to demonstrate DeepSeek deployment:

  1. Intel Lunar Lake AI PC Dev Kit (Core Ultra 200V Series):

    • A mobile chip-based development kit released in September 2024.

    • Features an integrated Graphics Processing Unit (IGPU), designed for running machine learning models.

    • Includes a Neural Processing Unit (NPU) intended for smaller models and AI acceleration.

    • Represents a cost-effective solution for local AI processing.

    Integrated Graphics Processing Unit (IGPU): A graphics processor built directly into the CPU (Central Processing Unit) chip. It shares system memory with the CPU and is generally less powerful than dedicated GPUs but more energy-efficient and cost-effective.

    Neural Processing Unit (NPU): A specialized hardware accelerator designed to speed up machine learning tasks, particularly neural network computations. NPUs are optimized for efficiency and low-latency inference.

  2. Dell Precision 3680 Tower Workstation:

    • A more powerful workstation equipped with a 14th generation Intel Core i9 processor.

    • Features a dedicated NVIDIA GeForce RTX 4080 graphics card.

    • Represents a higher-performance setup, leveraging a dedicated GPU for LLM processing.

    Dedicated Graphics Card (GPU): A separate, high-performance graphics processor, distinct from the CPU and IGPU. Dedicated GPUs have their own memory (VRAM) and are significantly more powerful than IGPUs, making them ideal for demanding tasks like gaming, video editing, and running large AI models.

Hardware Performance and Limitations:

Testing DeepSeek R1 on both the AI PC and the workstation revealed:

  • Dedicated GPU (RTX 4080) Performance: The workstation with the RTX 4080 generally performed better due to the dedicated graphics processing power.
  • AI PC (Lunar Lake) Cost-Effectiveness: The AI PC dev kit, being a mobile chip solution, offers a more budget-friendly entry point to local LLM experimentation. Commercially available equivalents from manufacturers like ASUS or Lenovo exist for consumers. Mac M4-based systems also fall into a similar performance and capability category.
  • Model Size and Hardware Limits: Both setups could effectively run models around the 7-8 billion parameter range. Larger models or unoptimized configurations could lead to system instability and crashes, even on these relatively capable machines.

Practical Considerations:

  • Local Hardware Investment: Investing in local hardware is crucial for hands-on learning and experimentation with LLMs. Understanding hardware limitations is key to optimizing model selection and deployment strategies.
  • Networked Computing: For enhanced performance, consider networked setups. This could involve:
    • AI PCs on a Network: Utilizing multiple AI PCs on a local network to distribute the LLM workload.
    • Dedicated Computers with Multiple GPUs: Employing workstations with multiple dedicated GPUs for parallel processing.
  • Distributed Computing Frameworks: Tools like Ray and Ray Serve can facilitate distributed computing, allowing you to run LLMs across multiple devices for improved performance and handling larger models.
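
For illustration, the following is a minimal sketch of how a framework like Ray Serve could put a DeepSeek model behind a local HTTP endpoint that can later be scaled across replicas, GPUs, or machines. The model identifier, replica count, and port are assumptions, and Ray, Transformers, and PyTorch would need to be installed first (pip install "ray[serve]" transformers torch).

    import time
    from ray import serve
    from starlette.requests import Request


    @serve.deployment(num_replicas=1)  # raise num_replicas as more GPUs/nodes become available
    class TextGenerator:
        def __init__(self):
            # Load the model once per replica.
            from transformers import pipeline
            self.pipe = pipeline(
                "text-generation",
                model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # illustrative choice
            )

        async def __call__(self, request: Request) -> dict:
            payload = await request.json()
            result = self.pipe(payload["prompt"], max_new_tokens=128)
            return {"text": result[0]["generated_text"]}


    serve.run(TextGenerator.bind())  # exposes POST http://localhost:8000/ by default

    # Keep the script alive so the endpoint stays up (or launch it with Ray's `serve run` CLI instead).
    while True:
        time.sleep(10)

Requests can then be sent as JSON bodies such as {"prompt": "Hello"} to the local endpoint, and the same deployment definition can be scaled out once more hardware is on the network.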

Despite hardware limitations, local deployment enables valuable experimentation and understanding of LLM capabilities. The following sections will explore practical methods for interacting with and utilizing DeepSeek models locally.

Interacting with DeepSeek’s AI Assistant (DeepSeek.com)

Before programmatically working with DeepSeek models, let’s explore the user-friendly AI assistant available on their website, deepseek.com. This platform offers a ChatGPT-like experience and provides a convenient starting point to understand DeepSeek’s capabilities.

Accessing and Using the AI Assistant:

  • Free Access: Currently, the DeepSeek.com AI assistant is offered for free. This may change, however; as a product of a Chinese company, it could face future restrictions in some regions, such as North America.
  • Model in Use: The online assistant utilizes DeepSeek V3, a more advanced and capable model compared to DeepSeek R1, which we will focus on for local deployment.
  • Account Login: Accessing the assistant is straightforward, with easy login options like connecting through a Google account.

Testing DeepSeek V3’s Capabilities:

To evaluate the AI assistant, a practical use-case for learning Japanese was employed. A prompt document designed for language learning was used to test DeepSeek V3’s performance.

Prompt Document and Testing Methodology:

  • Japanese Language Teacher Role-Play: The prompt document instructs the AI to act as a Japanese language teacher, assisting with sentence translation and grammatical understanding.
  • Comparative Testing: The same prompt document was previously used with other AI models like Meta AI, Claude, and ChatGPT, allowing for a comparative analysis of DeepSeek V3’s performance.
  • Prompt Document Components: The prompt document includes detailed instructions, examples, and context to guide the AI’s behavior, including:
    • Defined role (Japanese language teacher)
    • Specified language (Japanese)
    • Teaching instructions
    • Agent flow and state handling
    • Examples for guidance

The prompt documents are publicly available on GitHub in the OmenKing/free-gen-ai-bootcamp-2025 repository, under the “sentence-constructor” directory.

Initial Results and Observations:

  • File Upload Functionality: DeepSeek.com allows users to upload multiple documents (text and images) for context and processing.
  • Prompt Following: Initially, DeepSeek V3 demonstrated strong performance in breaking down sentence structures as instructed by the prompt document.
  • Instruction Subversion: When tested to “subvert” instructions by asking for direct answers (which the prompt was designed to avoid), DeepSeek V3 provided the answer, indicating a potential weakness in strictly adhering to complex instructions.
  • Error Correction: Despite the initial instruction subversion, the AI acknowledged the error and apologized, showcasing a degree of responsiveness and correction capability.
  • Cost Factor: Even with minor imperfections, the potential cost-effectiveness of DeepSeek models remains a significant advantage, especially when compared to proprietary models like Claude or ChatGPT.
  • Model Comparison Context: It’s important to compare DeepSeek V3 (and R1) to other open-weight models like Meta AI’s Llama or Mistral 7B, as they might operate with fewer “extra checks” or layers of processing compared to commercial models like Claude or ChatGPT.

Vision Capabilities Testing:

To further explore DeepSeek V3’s capabilities, its vision functionality was tested.

  • Image Input: A Japanese newspaper image with Japanese text was used as input.
  • Transcription and Translation Request: The AI was asked to transcribe the Japanese text from the image and provide a translation and grammatical breakdown.
  • Vision Capability Confirmation: DeepSeek V3 successfully transcribed the Japanese text and provided a translation, confirming its vision capabilities. This is a notable feature, especially for DeepSeek V3 on their website, although DeepSeek R1 itself is primarily a text-generation model.

Conclusion:

The DeepSeek.com AI assistant, powered by DeepSeek V3, offers a compelling and accessible way to interact with a powerful language model. While minor instruction adherence issues were observed, its overall performance, vision capabilities, and potential cost advantages make it a valuable tool for exploration and experimentation.

Local Model Download and Basic Usage with OLLAMA

This section focuses on downloading and running DeepSeek models locally, addressing scenarios where online access to DeepSeek.com might be unavailable or for users who prefer local processing for privacy or performance reasons. We will use OLLAMA, a tool that simplifies the process of downloading and running LLMs on local machines.

OLLAMA: A tool that simplifies the process of downloading, setting up, and running Large Language Models locally on your computer from the command line. It handles model management and provides a straightforward interface for interacting with LLMs.

Setting up OLLAMA:

  • Installation: OLLAMA is easy to install. Download the installer for your operating system (Windows, macOS, Linux) from the official OLLAMA website.
  • Running OLLAMA: Once installed, OLLAMA typically runs in the background. On Windows, it appears in the system tray. On macOS and Linux, it runs as a background service.
  • Terminal Interface: Interaction with OLLAMA is primarily through the terminal or command prompt.

Downloading DeepSeek R1 Models with OLLAMA:

OLLAMA provides a command-line interface to download and run various LLMs, including DeepSeek R1 models.

  • Model Selection: DeepSeek R1 offers different parameter sizes, ranging from 1.5 billion to 671 billion parameters. Larger models generally have greater capabilities but require more computational resources. OLLAMA lists these models and their sizes.

    • 7 Billion Parameter Model: A good starting point for local experimentation, balancing performance and resource requirements.

    • 8 Billion Parameter Model: Slightly larger than the 7 billion model, potentially offering improved performance.

    • 14 Billion Parameter Model: A more demanding model, pushing the limits of typical local hardware.

    • 671 Billion Parameter Model: The largest DeepSeek R1 model (roughly 404 GB to download). This model is extremely demanding and generally requires specialized hardware with multiple GPUs and a large amount of memory; running it on standard consumer hardware is not feasible (a rough memory estimate follows this list).

    Parameters: In the context of Large Language Models, parameters are the learned variables within the model that determine its behavior. A model with more parameters generally has a greater capacity to learn complex patterns and perform more sophisticated tasks, but also requires more computational resources.

  • Downloading Models via Terminal: To download a DeepSeek R1 model using OLLAMA, use the ollama pull command followed by the model name. For example, to download the 7 billion parameter model, the command would be: ollama pull deepseek-r1:7b.

    • OLLAMA pulls models from its own model library; many of these are repackaged builds of models originally published on Hugging Face.

    Hugging Face: A platform and community hub for machine learning, particularly for Natural Language Processing. It hosts a vast collection of pre-trained models, datasets, and tools, making AI resources more accessible to developers and researchers.

  • Model Storage: Downloaded models are stored locally by OLLAMA.
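
To put the parameter counts above in perspective, here is a rough back-of-the-envelope calculation of weight-storage requirements. It is a sketch only: it covers the weights alone, real memory use also includes activations, the KV cache, and runtime overhead, and the “~4-bit” figure approximates typical quantized GGUF builds.

    # Approximate weight-storage requirements for different parameter counts,
    # assuming ~2 bytes per parameter at 16-bit precision and ~0.5 bytes per
    # parameter at roughly 4-bit quantization.
    def estimated_gb(params_billion, bytes_per_param):
        return params_billion * 1e9 * bytes_per_param / 1024**3

    for size in (7, 8, 14, 671):
        print(f"{size:>4}B params:  16-bit ≈ {estimated_gb(size, 2):6.1f} GB,  ~4-bit ≈ {estimated_gb(size, 0.5):6.1f} GB")

A 7-8 billion parameter model quantized to roughly 4 bits fits in a few gigabytes, which is why it runs on both test machines, while the 671 billion parameter model needs hundreds of gigabytes even when quantized.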

Running and Interacting with DeepSeek R1 via OLLAMA:

Once a model is downloaded, you can interact with it directly through the OLLAMA terminal interface.

  • Running a Model: To run a downloaded model, use the ollama run command followed by the model name. For example: ollama run deepseek-r1:7b.
  • Chat Interface: After running the command, OLLAMA provides a simple chat interface in the terminal where you can type prompts and receive text-based responses from the LLM.
  • Example Interaction: Typing “Hello, how are you?” will prompt the DeepSeek R1 model to generate a response.
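
Beyond the interactive terminal chat, OLLAMA also exposes a local HTTP API that is convenient for scripting. The sketch below assumes OLLAMA is running on its default port (11434) and that deepseek-r1:7b has already been pulled.

    # Minimal sketch: query a locally running OLLAMA model over its HTTP API.
    import json
    import urllib.request

    payload = {
        "model": "deepseek-r1:7b",
        "prompt": "Hello, how are you?",
        "stream": False,  # return one JSON object instead of a token stream
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())

    print(body["response"])  # the model's generated text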

Hardware Performance Observations with OLLAMA:

  • 7 Billion Parameter Model Performance: The 7 billion parameter DeepSeek R1 model runs reasonably well on both the Intel Lunar Lake AI PC and the RTX 4080 workstation.
  • 14 Billion Parameter Model Performance: The 14 billion parameter model can also run, but performance may be slower, and system instability becomes more likely, especially on less powerful hardware.
  • Hardware Resource Limits: Running larger models like the 671 billion parameter version is practically impossible on typical consumer hardware due to memory and computational demands.
  • RAM Importance: Sufficient RAM is crucial for loading and running LLMs. 32GB of RAM was available on the test systems.
  • GPU Utilization: OLLAMA can utilize GPUs if available, which significantly improves inference speed.

Inference: The process of using a trained machine learning model to make predictions or generate outputs based on new input data. In the context of LLMs, inference refers to generating text responses to user prompts.

Model Management in OLLAMA:

  • Listing Models: The command ollama list will display a list of models downloaded and managed by OLLAMA.
  • Removing Models: To remove a model and free up disk space, use the command ollama rm followed by the model name. For example: ollama rm deepseek-r1:7b.

Conclusion:

OLLAMA provides a user-friendly way to download and experiment with DeepSeek R1 models locally. It offers a basic terminal interface for interaction and allows users to explore the capabilities of smaller parameter models within the constraints of their local hardware. However, for more advanced features and a more user-friendly interface, tools like LM Studio are worth exploring.

Enhanced Local Interaction with LM Studio

This section introduces LM Studio, a graphical user interface (GUI) tool that provides a more user-friendly experience for interacting with local LLMs, including DeepSeek models. LM Studio offers features beyond basic terminal interaction, resembling a more complete AI assistant environment.

LM Studio: A desktop application that provides a user-friendly graphical interface for downloading, managing, and running Large Language Models locally. It offers features like model browsing, chat interfaces, and server functionality, making local LLM interaction more accessible.

Setting up LM Studio:

  • Download and Installation: Download LM Studio for your operating system (macOS, Windows, Linux) from the official LM Studio website.
  • Installation Process: Follow the standard installation procedure for your operating system.

Exploring LM Studio Interface and Features:

  • Model Selection: LM Studio features a model browser that allows you to search and download LLMs from Hugging Face and other sources.
  • Model Catalog: Access a wide range of models directly within the application, categorized and searchable.
  • Model Download: Download models directly through the LM Studio interface, simplifying the model acquisition process compared to command-line tools.
  • Chat Interface: LM Studio provides a chat interface similar to web-based AI assistants, allowing for conversational interaction with loaded models.
  • Settings and Customization: Offers various settings to configure model loading, hardware utilization (CPU, GPU offloading), context length, and other parameters.
  • Dark Mode and UI Customization: LM Studio provides options for customizing the user interface, including dark mode themes for improved readability.

Downloading and Running DeepSeek R1 in LM Studio:

  • Model Search: Use the LM Studio model catalog to search for “deepseek r1”.

  • Model Selection: Choose a DeepSeek R1 model variant, such as “deepseek-r1-distill-llama-8b”. Distilled models are often optimized for efficiency while retaining key capabilities.

    Distillation: A technique in machine learning where a smaller, simpler model (the “student”) is trained to mimic the behavior and knowledge of a larger, more complex model (the “teacher”). This allows for creating more efficient models with comparable performance.

  • Download Model: Click the “Download” button within LM Studio to download the selected DeepSeek R1 model. LM Studio downloads models in the GGUF file format, a format optimized for CPU inference and compatible with tools like OLLAMA and llama.cpp.

    GGUF File Format: A file format specifically designed for storing and distributing quantized Large Language Models. It is optimized for efficient inference, particularly on CPUs, and is commonly used with tools like llama.cpp and OLLAMA.

  • Load Model: Once downloaded, load the model into LM Studio.

  • Chat Interaction: Use the chat interface to interact with the loaded DeepSeek R1 model. Type prompts and engage in conversations, similar to using a web-based AI assistant.
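
Beyond the chat window, LM Studio’s server functionality can expose the loaded model over an OpenAI-compatible local endpoint, which allows scripted access. The sketch below assumes the server has been started on LM Studio’s default port (1234) and that the openai Python package is installed; the model name is illustrative and should match the identifier LM Studio reports.

    # Minimal sketch: call a model loaded in LM Studio through its local server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-8b",  # illustrative; use the name shown in LM Studio
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize what a distilled model is in one sentence."},
        ],
        temperature=0.7,
    )

    print(response.choices[0].message.content)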

Agentic Behavior and Reasoning in LM Studio:

LM Studio surfaces more than just the model’s final output. Because DeepSeek R1 is a reasoning model, LM Studio displays the intermediate “thinking” the model produces before its answer, which makes the interaction feel more agentic.

  • Thought Process Display: LM Studio can display the model’s “thought process” as it generates responses, showing intermediate reasoning steps and considerations. This offers insights into the model’s decision-making process.
  • Enhanced Interaction: This agentic behavior makes the interaction feel more like engaging with an AI assistant capable of reasoning and planning, rather than just a direct model output playground.

Hardware Performance and Troubleshooting with LM Studio:

  • GPU Offloading: LM Studio offers options to offload model layers to the GPU, improving inference speed if a compatible GPU is available. Experiment with GPU offloading settings to optimize performance based on your hardware.

  • Context Length Adjustment: LM Studio allows adjusting the context length, which can impact memory usage and performance. Reducing the context length might be necessary for resource-constrained systems.

    Context Length: The amount of preceding text that a Large Language Model considers when generating the next token or response. A longer context length allows the model to understand and maintain context over longer conversations or documents, but also increases computational demands.

  • Resource Exhaustion and System Instability: Running LLMs, especially with agentic behavior, can be resource-intensive. On resource-constrained systems (like the Intel Lunar Lake AI PC), LM Studio might lead to system slowdowns, crashes, or reboots if models are too large or settings are not optimized.

  • Hardware Compatibility: While LM Studio aims to support various hardware, compatibility and performance can vary. Experimentation and hardware-aware configuration are often necessary to achieve optimal results.

  • Potential for Optimization: Optimizing LM Studio settings, such as GPU offloading, context length, and CPU thread allocation, can help improve performance and stability, especially on less powerful hardware.
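
To see why the context-length setting matters on memory-constrained machines, the sketch below roughly estimates KV-cache size versus context length, assuming Llama-3-8B-like dimensions for the distilled R1 model (32 layers, 8 key/value heads, head dimension 128, 16-bit cache); actual figures vary by model and quantization.

    # Rough KV-cache memory estimate per context length (assumed architecture values).
    layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys and values
    for context in (2048, 8192, 32768):
        print(f"context {context:>6}: ~{context * bytes_per_token / 1024**3:.2f} GB of KV cache")

    # Halving the context length roughly halves this memory, which is why trimming it
    # helps on RAM-constrained machines like the AI PC.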

Conclusion:

LM Studio offers a significant step up from basic terminal interaction with LLMs. Its GUI, model catalog, chat interface, and agentic behavior features provide a more complete and user-friendly experience for exploring and experimenting with DeepSeek R1 and other local LLMs. However, hardware limitations and resource management remain important considerations for achieving smooth and stable performance, especially with larger models or on less powerful hardware.

Programmatic Interaction with DeepSeek R1 using Hugging Face Transformers

This section explores how to interact with DeepSeek R1 models programmatically using the Hugging Face Transformers library. This approach offers the greatest flexibility and control over model usage, enabling integration into custom applications and workflows.

Hugging Face Transformers: A widely used Python library that provides pre-trained models and tools for Natural Language Processing (NLP) tasks, including working with Large Language Models. It simplifies the process of downloading, loading, and using state-of-the-art transformer-based models.

Setting up the Environment:

  1. Python Environment: Ensure you have Python installed. Conda environments are recommended for managing dependencies and avoiding conflicts.

  2. Install Transformers Library: Install the Hugging Face Transformers library along with necessary dependencies like PyTorch or TensorFlow. Use pip in your terminal:

    pip install transformers
    pip install torch  # or pip install tensorflow

    PyTorch and TensorFlow: Popular open-source machine learning frameworks used for building and training neural networks. Hugging Face Transformers supports both frameworks.

  3. Hugging Face API Token (Optional but Recommended): For accessing models from the Hugging Face Hub, you may need a Hugging Face API token. Create an account on huggingface.co and generate an API token under your profile settings. Set this token as an environment variable named HF_TOKEN.
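
    If a token is required, a minimal way to authenticate from Python, assuming HF_TOKEN has been exported as described above, looks like this:

    import os
    from huggingface_hub import login

    # Log in to the Hugging Face Hub with the token from the environment,
    # so gated or private model downloads are authenticated.
    login(token=os.environ["HF_TOKEN"])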

Loading DeepSeek R1 Model using Transformers:

  • Model Identifier: Identify the Hugging Face model identifier for the DeepSeek R1 model you want to use. For example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”.

  • Pipeline Approach (Simplified Inference): The Transformers library offers a pipeline function for simplified inference. This approach abstracts away some of the lower-level details.

    from transformers import pipeline
    
    # Initialize the pipeline with the DeepSeek R1 distill model identifier
    pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
    
    # Generate text using the pipeline (max_new_tokens caps the response length)
    prompt = "Write a short story about a robot learning to feel emotions."
    output = pipe(prompt, max_new_tokens=200)
    
    print(output[0]['generated_text'])
  • Direct Model and Tokenizer Loading (More Control): For more control, you can load the model and tokenizer separately.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Model identifier
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    # Prepare input for the model
    prompt = "Translate 'Hello, world!' to French."
    inputs = tokenizer(prompt, return_tensors="pt")  # "pt" returns PyTorch tensors
    
    # Generate output (cap the number of newly generated tokens)
    outputs = model.generate(**inputs, max_new_tokens=100)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(decoded_output)

    Tokenizer: A component in NLP models that converts text into numerical tokens (integers) that the model can process. It also handles the reverse process, converting model outputs (tokens) back into human-readable text.

    Local Inference: Running inference (generating outputs) using a machine learning model directly on your local machine, as opposed to sending requests to a remote server or cloud service.

Running Inference and Generating Text:

  • Prompt Engineering: Craft effective prompts to guide the model towards desired outputs. Experiment with different prompt styles and instructions.
  • Text Generation: Use the loaded pipeline or model to generate text based on your prompts.
  • Output Handling: Process and utilize the generated text output as needed for your application.

Hardware Considerations for Programmatic Inference:

  • GPU Utilization: Transformers can leverage GPUs for faster inference if PyTorch or TensorFlow is configured to use CUDA (for NVIDIA GPUs) or other GPU acceleration libraries.
  • Memory Management: Be mindful of memory usage, especially when working with larger models. Techniques like quantization and model optimization can help reduce memory footprint.
  • Performance Tuning: Experiment with different inference parameters (e.g., sampling strategies, decoding parameters) to optimize performance and output quality.
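
Putting these points together, the sketch below extends the direct-loading example with explicit GPU placement and decoding parameters. It assumes a CUDA-capable GPU, the accelerate package (for device_map support), and the distilled 8B model; the sampling values are illustrative starting points.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # half precision roughly halves weight memory
        device_map="auto",          # place layers on the available GPU(s) automatically
    )

    # The distilled R1 models are chat models, so format the prompt with the chat template.
    messages = [{"role": "user", "content": "Explain overfitting in two sentences."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(
        inputs,
        max_new_tokens=256,  # cap the response length
        do_sample=True,      # sample instead of greedy decoding
        temperature=0.6,
        top_p=0.95,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))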

Conclusion:

Using Hugging Face Transformers provides a powerful and flexible way to programmatically interact with DeepSeek R1 models. It allows for fine-grained control over model loading, inference, and integration into custom applications. While requiring some coding knowledge, this approach unlocks the full potential of local LLMs for diverse AI-driven tasks.

Final Thoughts and Reflections

This crash course has explored the landscape of DeepSeek AI and local LLMs, covering practical aspects from using online assistants to programmatic model interaction. Here are some key takeaways and reflections:

  • DeepSeek’s Potential: DeepSeek R1 and its family of models represent a significant advancement in open-weight LLMs, offering impressive capabilities and cost-efficiency.
  • Local LLM Accessibility: Tools like OLLAMA and LM Studio are making local LLM deployment increasingly accessible to developers and enthusiasts, even on consumer-grade hardware.
  • Hardware Limitations Remain: While progress is being made, hardware limitations are still a significant factor in running large, complex LLMs locally. Resource management and optimization are crucial.
  • Optimization is Key: Optimized model formats (like GGUF) and hardware acceleration (GPUs, NPUs when fully supported) are essential for efficient local LLM inference.
  • Experimentation and Exploration: The field of local LLMs is rapidly evolving. Continuous experimentation, exploring different models, tools, and hardware setups, is key to staying at the forefront of this technology.
  • Trade-offs and Considerations: Local LLM deployment involves trade-offs between performance, resource requirements, ease of use, and desired features. Choosing the right approach depends on specific needs and hardware capabilities.
  • Future of Local AI: The trend towards more powerful and efficient local AI processing is likely to continue. As hardware improves and models become more optimized, local LLMs will play an increasingly important role in various applications, offering privacy, control, and potentially cost-effectiveness.

This crash course serves as a starting point for your journey into DeepSeek AI and local LLMs. Further exploration, experimentation, and community engagement will deepen your understanding and unlock the exciting possibilities of this rapidly developing field.