
Removing rows from a dataframe based on duplicate 'Title'

   Vnay

Hi,
I have a dataframe with 27949 rows and 7 columns. The first few rows look like this: https://i.stack.imgur.com/1Pipf.png

Task:
The dataframe has a 'title' column containing many near-duplicate titles that I want to remove (near-duplicate: the titles are the same except for 1 or 2 words).
Pseudo code:
I want to compare the 1st row with every row after it, and if any of those rows is a duplicate, remove it.
Then I want to compare the 2nd row with every row after it and remove any duplicates, and so on for all rows, i.e. i = 1st row to last row, j = i+1 to last row.

My code (which doesn’t work):

for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1

I am having a hard time writing code for this task. Can you please help me out?


Answers
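
A few things in the posted loop would stop it from working: `range(0,27950)` runs one position past the end of a 27949-row frame, `data_sorted.drop(b)` passes the list of words as row labels and does not modify the frame in place anyway, `len(a)-len(b)` compares only word counts rather than the words themselves, and reassigning `i` or `j` inside a `for` loop has no effect on the iteration. Below is a minimal sketch of the pseudo code from the question, not a definitive solution. The helper names `word_difference` and `drop_near_duplicate_titles`, the `max_diff` parameter, the lowercasing, and the sample titles are all assumptions made for illustration; "near-duplicate" is taken to mean that two titles differ in at most 2 word positions.

```python
from itertools import zip_longest

import pandas as pd


def word_difference(a, b):
    """Count the positions at which two word lists differ.

    Words missing from the shorter list count as differences.
    """
    return sum(x != y for x, y in zip_longest(a, b))


def drop_near_duplicate_titles(df, column="title", max_diff=2):
    """Keep the first row of each group of near-duplicate titles.

    Hypothetical helper: row j is dropped when its title differs from an
    earlier kept row i in at most `max_diff` word positions.
    """
    # Split every title once up front instead of inside the loops.
    words = [str(t).lower().split() for t in df[column]]
    keep = [True] * len(words)

    for i in range(len(words)):
        if not keep[i]:
            continue  # row i was already marked as a duplicate
        for j in range(i + 1, len(words)):
            if not keep[j]:
                continue
            # Cheap length check first, then the word-by-word comparison.
            if abs(len(words[i]) - len(words[j])) > max_diff:
                continue
            if word_difference(words[i], words[j]) <= max_diff:
                keep[j] = False

    # Filter with a boolean mask rather than dropping rows while looping.
    mask = pd.Series(keep, index=df.index)
    return df.loc[mask].reset_index(drop=True)


# Made-up sample data standing in for data_sorted.
data = pd.DataFrame(
    {
        "title": [
            "mens blue cotton shirt size small",
            "mens blue cotton shirt size large",
            "womens leather wallet with zip",
            "womens leather wallet with chain",
            "stainless steel water bottle",
        ]
    }
)

deduped = drop_near_duplicate_titles(data, column="title", max_diff=2)
print(deduped)
```

With the real data this would be called as `drop_near_duplicate_titles(data_sorted)`. Two caveats: the double loop compares up to roughly n*(n-1)/2 pairs, which for 27949 rows is about 390 million comparisons and may be slow in pure Python, and very short titles (2 words or fewer) will all match each other when `max_diff=2`, so a stricter rule may be needed for those. If only exact duplicates matter, `data_sorted.drop_duplicates(subset='title')` is enough.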
