With this crazy weather lately, it is easy to catch a cold if I’m not careful.
What’s even worse is losing my voice and still having to grit my teeth and record a tutorial video, which really adds insult to injury.
Just when I was about to despair, a friend recommended a secret weapon to me: Hailuo AI’s voice cloning.

I had heard that Hailuo AI’s voice cloning models were well made, but I never had the opportunity to try them.
This time, I tried it out with an experimental mindset and found the results genuinely amazing. I couldn’t help swearing out loud in surprise after using it.
I have to say, this AI voice cloning technology is a real lifesaver for cold sufferers.
Not only is the voice very realistic, but it’s also currently free to use, which is generous.
I have now built it into my AI video recording workflow.
From now on, even if I have a cold or the recording environment is noisy, it can bail me out. And it is not just for tutorial videos; it works in any situation where spoken audio is required.
For example, making short videos or editing clips. Just don’t use it for anything malicious.
If you struggle to express yourself, are afraid to speak in front of the camera, or worry that your accent is too heavy or your voice is not professional enough, AI voice cloning can help.
Today, I will share Hailuo AI’s voice cloning method and my workflow. It is easy to learn, and you will be able to do it after one read-through!
Hailuo AI voice cloning tutorial
Register and log in: step 1 of AI voice cloning
First, we need to open the official website of the overseas version of Hailuo AI:
Note that it must be the overseas version. Hailuo AI has both a Chinese version and an overseas version.

Getting straight to the point: find the voice cloning function
After logging in, you will see the homepage of Hailuo AI, which has quite a few functions, but today’s topic is voice cloning.
Find the “audio” option in the navigation bar.

Record/upload a sound sample: Let AI learn your voice
The next step is the key one. On the voice page, you will see an area for uploading a voice sample. There are two ways to provide one:
Record on the spot:
Click the “record audio” button. The system will show a recording prompt. Follow it and read the given text aloud clearly. Record in a quiet environment to ensure good quality.
Usually, 10 to 60 seconds of audio is sufficient.
Upload a file: If you already have recorded audio, you can upload it directly instead.

Submit training and wait for the model to be generated
After recording or uploading is complete, please check the recording quality to ensure that your voice is clear and there are no noticeable noises.
After confirming that there are no errors, click the submit button to upload the audio sample.
The system will then begin training a dedicated AI voice model. Training is quick, usually just a few tens of seconds, so please be patient.
Test and fine-tune: optimize the voice
After the model training is complete, you can enter text in the text input box at the bottom of the page and click the “Generate Speech” button. The system will generate an audio file of the text being read aloud in your voice.
You can then listen to the generated audio to check the similarity of the voice.
After trying it, I found Hailuo AI’s voice cloning really impressive; the result sounds very natural.
If you feel there is room for improvement, Hailuo AI also provides a fine-tuning function.
However, I want to highlight an important point here! When fine-tuning, it is highly recommended to only adjust the first option!
I have personally tested this and found that the other options either have no effect or make the voice sound strange, worse than the default.

Also, don’t adjust this setting outside the fine-tuning options either; keeping the default sounds the most natural and the closest to your own voice.

Generate and use: a convenient and efficient sound tool
After completing the above steps, you can start using the AI voice cloning function.
Paste your text into the text box, click Generate in the lower right corner, and the AI will read it aloud in the cloned voice.
Note that each generation is limited to 5,000 characters. Longer text has to be generated in batches.
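If your script is longer than the limit, the batching can be scripted. Here is a minimal Python sketch, assuming the 5,000-character limit counts plain characters (I have not verified exactly how Hailuo AI counts them). The function name `split_into_batches` and the list of sentence-ending marks are my own choices for illustration. It cuts at the last sentence end inside each window so that a sentence is not split across two audio files.

```python
MAX_CHARS = 5000  # per-generation limit mentioned in this tutorial
SENTENCE_ENDS = (".", "!", "?", "。", "！", "？")  # illustrative, not exhaustive


def split_into_batches(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into batches of at most max_chars characters.

    Each batch ends at the last sentence-ending mark that fits in the
    window. If a window contains no such mark, it is cut at max_chars.
    """
    batches: list[str] = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        window = remaining[:max_chars]
        # Position just after the last sentence-ending mark (0 if none found).
        cut = max(window.rfind(mark) for mark in SENTENCE_ENDS) + 1
        if cut <= 0:
            cut = max_chars  # no sentence end in this window: hard cut
        batches.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        batches.append(remaining)
    return batches


if __name__ == "__main__":
    script = "This is one sentence of my tutorial script. " * 300
    for number, batch in enumerate(split_into_batches(script), start=1):
        print(number, len(batch))
```

Each batch can then be pasted into the text box one at a time, and the resulting audio files joined in your video editor.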

If you are satisfied with the generated audio, you can download the file and save it locally.
This way, you can use the cloned voice wherever audio is needed.
That completes the tutorial on Hailuo AI’s voice cloning. If you only wanted to learn how to use Hailuo AI, you can stop here.
But the real fun is actually yet to come.
AI video recording workflow
Let’s go back to the scenario of recording a tutorial video. The first step in using voice cloning is to prepare the input text, right?
But here’s the key question: how do you get the exact text?
One way is to write the script in advance and then record the video, but the audio and video may end up out of sync.
The pace of your on-screen operations is unlikely to match a pre-written script, and editing becomes a headache.
So I’ve adopted a more practical workflow: even if my voice is in bad shape, I record the screen and narrate while operating.
At this stage, it doesn’t matter if the ambient sound is bad, or if you stutter or have an accent. Hailuo AI voice cloning will take care of it later.
The specific steps are:
- Finish recording the video
- Convert the mp4 video to mp3 audio
- Upload the audio to Tongyi Tingwu to convert speech to text (Tongyi Tingwu is not the only option here; Feishu Miaoji and Xunfei Tingjian are also fine, as long as the tool can easily export the text)
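The mp4-to-mp3 step above can be done with any converter. As one possible approach, here is a small Python sketch that calls the `ffmpeg` command-line tool, assuming it is installed and on your PATH (the tutorial itself does not depend on ffmpeg). The function names `build_ffmpeg_command` and `extract_audio` are hypothetical helpers, and the output file is simply the input path with an `.mp3` extension.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: str) -> list[str]:
    """Build an ffmpeg command that extracts the audio track as MP3."""
    source = Path(video_path)
    target = source.with_suffix(".mp3")  # same name, .mp3 extension
    return [
        "ffmpeg",
        "-i", str(source),          # input video file
        "-vn",                      # ignore the video stream
        "-codec:a", "libmp3lame",   # encode the audio as MP3
        "-q:a", "2",                # variable bitrate quality level
        str(target),
    ]


def extract_audio(video_path: str) -> Path:
    """Run ffmpeg and return the path of the MP3 file it wrote."""
    command = build_ffmpeg_command(video_path)
    subprocess.run(command, check=True)  # raises if ffmpeg reports an error
    return Path(command[-1])


if __name__ == "__main__":
    # Only prints the command; call extract_audio(...) to actually convert.
    print(" ".join(build_ffmpeg_command("tutorial.mp4")))
```

The resulting mp3 is what gets uploaded to the speech-to-text tool.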
But then you run into the second problem: the accuracy of AI speech recognition.
Homophones like these can easily be transcribed incorrectly.
If you rely solely on manual proofreading, the workload is huge.
Since this is an AI workflow, this step should of course be handled by AI.
At this point, you may think of tools such as ChatGPT, KimiChat, Doubao, and DeepSeek.
But here I would like to share two important lessons:
“Choose an AI that can solve this problem” – not all AI is suitable for this task
Why? Because in the AI conversation that follows, we have to provide not only a transcript of more than 9,000 words, but also a lot of context to help the AI understand the video.
If you don’t give the AI the correct information, it cannot make accurate corrections.
At this point, you may be thinking: providing so much information is troublesome too, so why not just correct it yourself?
Actually, it isn’t. Let me explain how to make the AI understand the content efficiently:
This video is teaching the “workflow for quickly writing an AI tool article”. The entire video contains three parts:
- The initial state before the operation
- The specific details of the workflow
- The final completed article
So, to make the AI fully understand the content of the video, I will provide three things:
- The first draft (1k words)
- The final draft (2k words)
- Workflow documentation (several thousand words)
This adds up to more than 10,000 words and about 20,000 tokens.
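To get a feel for these numbers before picking a model, a rough estimate is enough. The sketch below assumes about 2 tokens per word, which is simply the ratio implied by the figures above (10,000 words, about 20,000 tokens); real tokenizers vary by model and language. The function names and the example context window are placeholders of mine, not values from any particular AI tool.

```python
def estimate_tokens(word_counts: dict[str, int], tokens_per_word: float = 2.0) -> int:
    """Rough token estimate: total words times an assumed tokens-per-word ratio."""
    return round(sum(word_counts.values()) * tokens_per_word)


def fits_in_context(
    word_counts: dict[str, int],
    context_window_tokens: int,
    tokens_per_word: float = 2.0,
) -> bool:
    """Check whether the estimated prompt size fits a given context window."""
    return estimate_tokens(word_counts, tokens_per_word) <= context_window_tokens


if __name__ == "__main__":
    # Total taken from the rough figure in the text; use your own counts.
    materials = {"all materials combined": 10_000}
    print(estimate_tokens(materials))
    # 32_000 is a placeholder window size, not a real model's limit.
    print(fits_in_context(materials, context_window_tokens=32_000))
```

Fitting inside the window is only the minimum requirement; it says nothing about how well the model handles that much text.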
This leads to the key to choosing an AI tool: we need an AI that is smart enough.
Although KimiChat and Doubao can both accept this much text, the quality of their answers may not keep up.
That is because an AI model’s performance drops as the number of tokens grows. In plain language: “there is too much content, and the AI has become stupid”.
So I chose Gemini 2.0 Pro, which supports a context window of 2 million tokens and performs very well.
Now you have a good AI model, the first and final drafts, the detailed workflow documentation, and the long transcript that needs to be corrected. The next step is to figure out how to communicate with the AI. The key is to make it understand your needs accurately.
You can then feed all of this content to AI and explain clearly what you are doing and what you need AI to help you with.
You can refer to my prompt:
While carrying out this method, I also recorded a video. The transcript above is the speech-to-text output of that video. It contains typos, stutters, and pauses; for example, DeepSeek was recognized as something else. I need you to revise and clean it up, and output a version with the stutters, pauses, filler words, typos, and repetitions removed:
Why did I repeat the important things four times in the picture?
Because, at the beginning, the AI misunderstood what I meant and thought I wanted it to use this workflow to generate the final article.
So, I repeated the background and purpose of my task four times in the prompt.
(This is a prompt technique: “say important things three times”. When we send large passages of content to the AI, the attention mechanism can let our instructions get “swamped” by the other text, so some of them are ignored. Repeating them makes them harder to miss.)
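If you assemble the prompt with a script, the repetition can be built in. The following is a minimal sketch of that idea, not the exact prompt I used: the function name `build_correction_prompt`, the section markers, and the placement of the reminders are all assumptions for illustration. It states the task at the top, again right before the transcript, and once more at the very end.

```python
def build_correction_prompt(
    task: str,
    context_docs: dict[str, str],
    transcript: str,
) -> str:
    """Assemble a long prompt that repeats the task instruction three times."""
    parts = [f"TASK: {task}"]
    # Background material that helps the AI understand the video.
    for title, body in context_docs.items():
        parts.append(f"=== {title} ===\n{body}")
    # Repeat the task so it is not lost among the long documents.
    parts.append(f"REMINDER: {task}")
    parts.append(f"=== Transcript to correct ===\n{transcript}")
    parts.append(f"REMINDER: {task}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_correction_prompt(
        task="Only correct the transcript. Do not write a new article.",
        context_docs={
            "First draft": "(paste the first draft here)",
            "Final draft": "(paste the final draft here)",
            "Workflow documentation": "(paste the workflow notes here)",
        },
        transcript="(paste the speech-to-text transcript here)",
    )
    print(prompt)
```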
But I found that no matter how much I emphasized this, it was useless.
In that case, you need to emphasize it again in a new turn of the conversation. When the AI starts trying to write the article, click “Stop” to interrupt its spell casting and point out its mistake.
Then we can re-emphasize our task:
I have to say, Gemini 2.0 Pro is still awesome. From the overall meaning of a sentence, it can work out that a word transcribed as “cough” should actually be “Cursor”.
But the prerequisite for this is that we provide the AI with enough information.
After Gemini 2.0 Pro has finished generating the text, we can manually check it once more to complete the text correction.
Next, we can feed this text to Hailuo AI voice cloning to generate audio, export the audio file, and then use a video editing tool to cut the footage so that it matches the audio.