Yes, you should generally avoid using regular expressions to parse structured data. But this is a pretty simple case if you are 100% sure that all occurrences of + followed by 11 digits are valid targets. You can tell sed to only remove + if it is followed by 11 digits (I assume you meant 11, not 10, since the numbers in your data have 11):
sed -E 's/\+([0-9]{11}[^0-9]*)\b/\1/' file.xml
The -E enables extended regular expressions, which give a simpler syntax and the ability to use {N} to mean "match N times". So here, we are matching a + (this needs to be escaped as \+ since otherwise it means "match 1 or more") that is followed by exactly 11 digits, then 0 or more non-digits, ending at a word boundary (\b).
The entire match except the + is captured in parentheses, so \1, the replacement, is everything except the +.
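To see how that behaves, here is a quick check on some made-up input. The sample lines (tag names and numbers) are hypothetical and only there for illustration, and this assumes GNU sed, since \b is a GNU extension:

```shell
# Hypothetical sample data, not taken from the question.
sample='<address>+12345678901</address>
<note>1+1=2</note>
<address>+123456789012</address>'

# Same substitution as above, reading from stdin instead of file.xml.
result=$(printf '%s\n' "$sample" | sed -E 's/\+([0-9]{11}[^0-9]*)\b/\1/')
printf '%s\n' "$result"
# <address>12345678901</address>    (+ removed: exactly 11 digits follow)
# <note>1+1=2</note>                (untouched: no 11 digits after the +)
# <address>+123456789012</address>  (untouched: 12 digits, so \b fails)
```

Note that without the g flag only the first match on each line is changed, which is fine if there is one number per line.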
A slightly safer approach, since all of your target numbers seem to be in address tags, would be:
sed -E 's|<address>\+([0-9]{11})<\/address>|<address>\1</address>|' file.xml
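As a rough check of why this is safer, the sketch below runs the same substitution on two made-up lines, one with an address tag and one with a hypothetical phone tag that does not appear in the question:

```shell
# Hypothetical sample data; the <phone> tag is invented for the comparison.
sample='<address>+12345678901</address>
<phone>+12345678901</phone>'

# Same substitution as above, reading from stdin instead of file.xml.
result=$(printf '%s\n' "$sample" | sed -E 's|<address>\+([0-9]{11})<\/address>|<address>\1</address>|')
printf '%s\n' "$result"
# <address>12345678901</address>  (+ removed inside the address tag)
# <phone>+12345678901</phone>     (left alone: not an address tag)
```

Because the pattern spells out both the opening and closing tag, a + followed by 11 digits anywhere else in the file is not touched.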
Or even, if your problem can be restated as "remove all + from lines where the first non-space string is <address>", you could do:
sed -E '/^[[:space:]]*<address>/s/\+//g' file.xml
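Here is a small sketch of that restatement on made-up input. It assumes the line selector should anchor on optional leading whitespace followed by <address>, and it uses the g flag so that every + on a selected line is removed. The sample lines are hypothetical:

```shell
# Hypothetical sample data, not taken from the question.
sample='  <address>+12345678901</address>
<note>see <address> for 1+1</note>
<phone>+12345678901</phone>'

# Only lines whose first non-space text is <address> are edited.
result=$(printf '%s\n' "$sample" | sed -E '/^[[:space:]]*<address>/s/\+//g')
printf '%s\n' "$result"
#   <address>12345678901</address>    (selected: + removed)
# <note>see <address> for 1+1</note>  (not selected: starts with <note>)
# <phone>+12345678901</phone>         (not selected)
```

This version does not check how many digits follow the +, so it only makes sense if no address line contains a + you want to keep.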