| Set | Doc | Words | Class |
|---|---|---|---|
| Training | 1 | Chinese Beijing Chinese | c |
| Training | 2 | Chinese Chinese Shanghai | c |
| Training | 3 | Chinese Macao | c |
| Training | 4 | Tokyo Japan Chinese | j |
| Test | 5 | Chinese Chinese Chinese Tokyo Japan | ? |


Vocabulary: beijing, chinese, japan, macao, shanghai, tokyo

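
The six vocabulary terms above are the columns of a term-count (bag-of-words) matrix. The following is a minimal sketch of how such a matrix could be built for documents 1 to 5 using only the standard library. The helper names `tokenize` and `count_vector` are illustrative, and lowercasing plus whitespace splitting is an assumption about the tokenization.

```python
from collections import Counter

# Documents 1-5 from the table above (4 training documents, 1 test document).
docs = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
    "Chinese Chinese Chinese Tokyo Japan",
]


def tokenize(text):
    # Assumption: lowercase the text and split on whitespace.
    return text.lower().split()


# Sorted vocabulary, built from the four training documents only.
vocab = sorted({tok for doc in docs[:4] for tok in tokenize(doc)})


def count_vector(text):
    # One count per vocabulary term, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]


matrix = [count_vector(doc) for doc in docs]

print(vocab)
for doc_id, row in enumerate(matrix, start=1):
    print(doc_id, row)
```

Each row has one entry per vocabulary term, so document 5 becomes `[0, 3, 1, 0, 0, 1]` (three occurrences of "chinese", one each of "japan" and "tokyo").
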
| Category | Document |
|---|---|
| Spam | send us your password |
| Spam | review us |
| Spam | send us your account |
| Spam | send your password |
| Non-spam | password review |
| Non-spam | send us your review |
| ? | review us now |
| ? | review account |
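
The two unlabeled messages in the table above can be scored by hand with multinomial Naive Bayes. The sketch below assumes add-one (Laplace) smoothing and skips words that never occur in training, such as "now". Neither choice is stated in the table, and the helper name `score` is illustrative. Exact fractions are used so the intermediate values can be compared with a manual calculation.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Spam", "send us your password"),
    ("Spam", "review us"),
    ("Spam", "send us your account"),
    ("Spam", "send your password"),
    ("Non-spam", "password review"),
    ("Non-spam", "send us your review"),
]
tests = ["review us now", "review account"]

# Per-class term counts and per-class document counts.
term_counts = {}
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    term_counts.setdefault(label, Counter()).update(text.split())

vocab = {term for counts in term_counts.values() for term in counts}


def score(text, label):
    """Unnormalized posterior: P(label) times the product of P(term | label)."""
    counts = term_counts[label]
    denom = sum(counts.values()) + len(vocab)  # assumption: add-one smoothing
    result = Fraction(doc_counts[label], len(train))
    for term in text.split():
        if term in vocab:  # assumption: out-of-vocabulary words are skipped
            result *= Fraction(counts[term] + 1, denom)
    return result


predictions = {}
for text in tests:
    scores = {label: score(text, label) for label in term_counts}
    predictions[text] = max(scores, key=scores.get)
    print(text, {k: float(v) for k, v in scores.items()}, "->", predictions[text])
```

Under these assumptions both messages come out as Spam, but only narrowly: 16/1083 against 1/72 for "review us now", and 8/1083 against 1/144 for "review account". A different smoothing constant could change the result.
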

| Set | Document ID | Keywords in the document | Class h |
|---|---|---|---|
| Training Set | 1 | Love Happy Joy Joy Happy | Yes |
| Training Set | 2 | Happy Love Kick Joy Happy | Yes |
| Training Set | 3 | Love Move Joy Good | Yes |
| Training Set | 4 | Love Happy Joy Love Pain | Yes |
| Training Set | 5 | Joy Love Pain Kick Pain | No |
| Training Set | 6 | Pain Pain Love kick | No |
| Testing Set | 7 | Love Pain Joy Love Kick | ? |

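
For longer documents the product of many small probabilities can underflow, so the scores are often summed in log space instead. The sketch below applies that variant to document 7. It assumes add-one smoothing and case-insensitive keywords (so "kick" and "Kick" are the same term), and the helper names are illustrative.

```python
import math
from collections import Counter

train = [
    ("Yes", "Love Happy Joy Joy Happy"),
    ("Yes", "Happy Love Kick Joy Happy"),
    ("Yes", "Love Move Joy Good"),
    ("Yes", "Love Happy Joy Love Pain"),
    ("No", "Joy Love Pain Kick Pain"),
    ("No", "Pain Pain Love kick"),
]
test_doc = "Love Pain Joy Love Kick"


def tokenize(text):
    # Assumption: keywords are case-insensitive.
    return text.lower().split()


term_counts = {"Yes": Counter(), "No": Counter()}
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    term_counts[label].update(tokenize(text))

vocab = set(term_counts["Yes"]) | set(term_counts["No"])


def log_score(text, label):
    """log P(label) plus the sum of log P(term | label), add-one smoothed."""
    counts = term_counts[label]
    denom = sum(counts.values()) + len(vocab)
    total = math.log(doc_counts[label] / len(train))
    for term in tokenize(text):
        if term in vocab:  # assumption: out-of-vocabulary words are skipped
            total += math.log((counts[term] + 1) / denom)
    return total


log_scores = {label: log_score(test_doc, label) for label in term_counts}
prediction = max(log_scores, key=log_scores.get)
print(log_scores, "->", prediction)
```

With these assumptions the larger log score belongs to class No: "pain" and "kick" are relatively frequent in the two No documents, which outweighs the higher prior of Yes.
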

| No. | Sentence | Category |
|---|---|---|
| 1 | AI is transforming industries. | Tech |
| 2 | Quantum computing is the future. | Tech |
| 3 | New smartphone released today. | Tech |
| 4 | Football match was exciting. | Non Tech |
| 5 | New movie breaking records. | Non Tech |
| 6 | Cooking shows are popular. | Non Tech |

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training documents 1-4 from the first table
df = pd.DataFrame([
    [1, 'Chinese Beijing Chinese', 'c'],
    [2, 'Chinese Chinese Shanghai', 'c'],
    [3, 'Chinese Macao', 'c'],
    [4, 'Tokyo Japan Chinese', 'j'],
], columns=['Doc', 'Text', 'Class'])

# Bag-of-words term counts (CountVectorizer lowercases tokens by default)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Text'])
y = df['Class']
print('Vocabulary', vectorizer.vocabulary_)

# Multinomial Naive Bayes; the default alpha=1.0 is add-one (Laplace) smoothing
nb = MultinomialNB()
nb.fit(X, y)

# Test document 5: transform with the fitted vocabulary, then classify
text = ['Chinese Chinese Chinese Tokyo Japan']
X_test = vectorizer.transform(text)
print('Predict', nb.predict(X_test), 'Probability', nb.predict_proba(X_test))
```
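
The prediction for document 5 can be checked by hand. The worked calculation below assumes add-one smoothing, which matches the default `alpha=1.0` of `MultinomialNB`. Class c has 8 training tokens, class j has 3, and the vocabulary has 6 terms; $d_5$ denotes the test document.

```latex
\begin{aligned}
\hat{P}(c) &= \tfrac{3}{4}, \qquad \hat{P}(j) = \tfrac{1}{4} \\
\hat{P}(\text{chinese} \mid c) &= \frac{5 + 1}{8 + 6} = \frac{3}{7}, \qquad
\hat{P}(\text{tokyo} \mid c) = \hat{P}(\text{japan} \mid c) = \frac{0 + 1}{8 + 6} = \frac{1}{14} \\
\hat{P}(\text{chinese} \mid j) &= \frac{1 + 1}{3 + 6} = \frac{2}{9}, \qquad
\hat{P}(\text{tokyo} \mid j) = \hat{P}(\text{japan} \mid j) = \frac{1 + 1}{3 + 6} = \frac{2}{9} \\
\hat{P}(c \mid d_5) &\propto \frac{3}{4} \cdot \left(\frac{3}{7}\right)^{3} \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.00030 \\
\hat{P}(j \mid d_5) &\propto \frac{1}{4} \cdot \left(\frac{2}{9}\right)^{3} \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.00014
\end{aligned}
```

Normalizing the two scores gives roughly 0.69 for class c and 0.31 for class j, so the predicted class is c. These are the values `predict_proba` should report, with the columns ordered as in `nb.classes_`.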